pdf
bib
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Christos Christodoulopoulos
|
Tanmoy Chakraborty
|
Carolyn Rose
|
Violet Peng
pdf
bib
abs
Towards Automated Error Discovery: A Study in Conversational AI
Dominic Petrak
|
Thy Thy Tran
|
Iryna Gurevych
Although conversational agents based on large language models (LLMs) demonstrate strong fluency and coherence, they still produce undesirable behaviors (errors) that are challenging to prevent from reaching users during deployment. Recent research leverages LLMs to detect errors and guide response-generation models toward improvement. However, current LLMs struggle to identify errors not explicitly specified in their instructions, such as those arising from updates to the response-generation model or shifts in user behavior. In this work, we introduce Automated Error Discovery, a framework for detecting and defining errors in conversational AI, and propose SEEED (Soft Clustering Extended Encoder-Based Error Detection), an encoder-based approach to its implementation. We enhance the Soft Nearest Neighbor Loss by amplifying distance weighting for negative samples and introduce Label-Based Sample Ranking to select highly contrastive examples for better representation learning. SEEED outperforms adapted baselines—including GPT-4o and Phi-4—across multiple error-annotated dialogue datasets, improving accuracy in detecting unknown errors by up to 8 points and demonstrating strong generalization to unknown intent detection.
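For reference, the Soft Nearest Neighbor Loss that SEEED builds on is commonly written as follows (a minimal sketch of the standard definition for embeddings x_i with labels y_i and temperature T; the paper's amplified distance weighting for negative samples is not shown here):
\mathcal{L}_{\mathrm{SNN}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\sum_{j\neq i,\; y_j=y_i}\exp\!\left(-\lVert x_i-x_j\rVert^2/T\right)}{\sum_{k\neq i}\exp\!\left(-\lVert x_i-x_k\rVert^2/T\right)}
Intuitively, the loss is low when each sample's nearest neighbors in embedding space share its label, so re-weighting the negative (different-label) terms in the denominator changes how strongly dissimilar samples are pushed apart.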
pdf
bib
abs
Break the Checkbox: Challenging Closed-Style Evaluations of Cultural Alignment in LLMs
Mohsinul Kabir
|
Ajwad Abrar
|
Sophia Ananiadou
A large number of studies rely on closed-style multiple-choice surveys to evaluate cultural alignment in Large Language Models (LLMs). In this work, we challenge this constrained evaluation paradigm and explore more realistic, unconstrained approaches. Using the World Values Survey (WVS) and Hofstede Cultural Dimensions as case studies, we demonstrate that LLMs exhibit stronger cultural alignment in less constrained settings, where responses are not forced. Additionally, we show that even minor changes, such as reordering survey choices, lead to inconsistent outputs, exposing the limitations of closed-style evaluations. Our findings advocate for more robust and flexible evaluation frameworks that focus on specific cultural proxies, encouraging more nuanced and accurate assessments of cultural alignment in LLMs.
pdf
bib
abs
Biased Tales: Cultural and Topic Bias in Generating Children’s Stories
Donya Rooein
|
Vilém Zouhar
|
Debora Nozza
|
Dirk Hovy
Stories play a pivotal role in human communication, shaping beliefs and morals, particularly in children. As parents increasingly rely on large language models (LLMs) to craft bedtime stories, the presence of cultural and gender stereotypes in these narratives raises significant concerns. To address this issue, we present Biased Tales, a comprehensive dataset designed to analyze how biases influence protagonists’ attributes and story elements in LLM-generated stories. Our analysis uncovers striking disparities. When the protagonist is described as a girl (as compared to a boy), appearance-related attributes increase by 55.26%. Stories featuring non-Western children disproportionately emphasize cultural heritage, tradition, and family themes far more than those for Western children. Our findings highlight the role of sociocultural bias in LLM-generated stories and underscore the importance of making creative AI use more equitable and diverse.
pdf
bib
abs
Large Language Models as Realistic Microservice Trace Generators
Donghyun Kim
|
Sriram Ravula
|
Taemin Ha
|
Alex Dimakis
|
Daehyeok Kim
|
Aditya Akella
Workload traces are essential to understand complex computer systems’ behavior and manage processing and memory resources. Since real-world traces are hard to obtain, synthetic trace generation is a promising alternative. This paper proposes a first-of-its-kind approach that relies on training a large language model (LLM) to generate synthetic workload traces, specifically microservice call graphs. To capture complex and arbitrary hierarchical structures and implicit constraints in such traces, we propose to train LLMs to generate recursively, making call graph generation a sequence of more manageable steps. To further enforce learning constraints on the traces and generate uncommon situations, we apply additional instruction tuning steps to align our model with the desired trace features. With this method, we train TraceLLM, an LLM for microservice trace generation, and demonstrate that it produces diverse, realistic traces under varied conditions, outperforming existing approaches in both accuracy and validity. The synthetically generated traces can effectively replace real data to optimize important microservice management tasks. Additionally, TraceLLM adapts to downstream trace-related tasks, such as predicting key trace features and infilling missing data.
pdf
bib
abs
JUDGEBERT: Assessing Legal Meaning Preservation Between Sentences
David Beauchemin
|
Michelle Albert-Rochette
|
Richard Khoury
|
Pierre-Luc Déziel
Simplifying text while preserving its meaning is a complex yet essential task, especially in sensitive domains such as law. In a specialized field like the legal domain, meaning preservation differs significantly from its role in everyday texts. This paper introduces FrJUDGE, a new dataset for assessing legal meaning preservation between two legal texts. It also introduces JUDGEBERT, a novel evaluation metric designed to assess legal meaning preservation in French legal text simplification. JUDGEBERT demonstrates a superior correlation with human judgment compared to existing metrics. It also passes two crucial sanity checks that other metrics do not: it always returns a score of 100% for two identical sentences, and it returns 0% for two unrelated sentences. Our findings highlight its potential to transform legal NLP applications, ensuring accuracy and accessibility of text simplification for legal practitioners and lay users.
pdf
bib
abs
QFrCoLA: a Quebec-French Corpus of Linguistic Acceptability Judgments
David Beauchemin
|
Richard Khoury
Large Transformer-based language models perform outstandingly on various downstream tasks. However, there is limited understanding of how these models internalize linguistic knowledge, so various linguistic benchmarks have recently been proposed to facilitate syntactic evaluation of language models across languages. This paper introduces QFrCoLA (Quebec-French Corpus of Linguistic Acceptability Judgments), a normative binary acceptability judgment dataset comprising 25,153 in-domain and 2,675 out-of-domain sentences. Our study leverages the QFrCoLA dataset and seven other binary linguistic acceptability judgment corpora to benchmark seven language models. The results demonstrate that, on average, fine-tuned Transformer-based LMs are strong baselines for most languages and that zero-shot binary classification with large language models performs poorly on the task. For the QFrCoLA benchmark as well, a fine-tuned Transformer-based LM outperformed the other methods tested on average. The results also show that the pre-trained cross-lingual LLMs selected for our experiments do not seem to have acquired linguistic judgment capabilities for Quebec French during their pre-training. Finally, our experimental results on QFrCoLA show that our dataset, built from examples that illustrate linguistic norms rather than speakers’ intuitions, behaves like other linguistic acceptability judgment data; it is a challenging dataset that can benchmark LMs on their linguistic judgment capabilities.
pdf
bib
abs
Revisiting LLM Value Probing Strategies: Are They Robust and Expressive?
Siqi Shen
|
Mehar Singh
|
Lajanugen Logeswaran
|
Moontae Lee
|
Honglak Lee
|
Rada Mihalcea
The value orientation of Large Language Models (LLMs) has been extensively studied, as it can shape user experiences across demographic groups. However, two key challenges remain: (1) the lack of systematic comparison across value probing strategies, despite the Multiple Choice Question (MCQ) setting being vulnerable to perturbations, and (2) the uncertainty over whether probed values capture in-context information or predict models’ real-world actions. In this paper, we systematically compare three widely used value probing methods: token likelihood, sequence perplexity, and text generation. Our results show that all three methods exhibit large variances under non-semantic perturbations in prompts and option formats, with sequence perplexity being the most robust overall. We further introduce two tasks to assess expressiveness: demographic prompting, testing whether probed values adapt to cultural context; and value–action agreement, testing the alignment of probed values with value-based actions. We find that demographic context has little effect on the text generation method, and probed values only weakly correlate with action preferences across all methods. Our work highlights the instability and the limited expressive power of current value probing methods, calling for more reliable LLM value representations.
pdf
bib
abs
A Systematic Analysis of Base Model Choice for Reward Modeling
Kian Ahrabian
|
Pegah Jandaghi
|
Negar Mokhberian
|
Sai Praneeth Karimireddy
|
Jay Pujara
Reinforcement learning from human feedback (RLHF) and, at its core, reward modeling have become a crucial part of training powerful large language models (LLMs). One commonly overlooked factor in training high-quality reward models (RMs) is the effect of the base model, which is becoming more challenging to choose given the rapidly growing pool of LLMs. In this work, we present a systematic analysis of the effect of base model selection on reward modeling performance. Our results show that performance can be improved by up to 14% compared to the most common (i.e., default) choice. Moreover, we show a strong statistical relationship between some existing benchmarks and downstream performance. We also demonstrate that results from a small set of benchmarks can be combined to improve model selection (+18% on average in the top 5-10). Lastly, we illustrate the impact of different post-training steps on the final performance and explore using estimated data distributions to reduce performance prediction error.
pdf
bib
abs
Comparing Specialised Small and General Large Language Models on Text Classification: 100 Labelled Samples to Achieve Break-Even Performance
Branislav Pecher
|
Ivan Srba
|
Maria Bielikova
When solving NLP tasks with limited labelled data, researchers typically either use a general large language model without further update, or use a small number of labelled samples to tune a specialised smaller model. In this work, we answer an important question – how many labelled samples are required for the specialised small models to outperform general large models, while taking the performance variance into consideration. By observing the behaviour of fine-tuning, instruction-tuning, prompting and in-context learning on 8 language models, we identify such performance break-even points across 8 representative text classification tasks of varying characteristics. We show that the specialised models often need only a few samples (on average 100) to be on par with or better than the general ones. At the same time, the number of required labels strongly depends on the dataset or task characteristics, with fine-tuning on binary datasets requiring significantly more samples. When performance variance is taken into consideration, the number of required labels increases on average by 100-200%. Finally, larger models do not consistently lead to better performance and lower variance, with 4-bit quantisation having negligible impact.
pdf
bib
abs
Is the Top Still Spinning? Evaluating Subjectivity in Narrative Understanding
Melanie Subbiah
|
Akankshya Mishra
|
Grace Kim
|
Liyan Tang
|
Greg Durrett
|
Kathleen McKeown
Determining faithfulness of a claim to a source document is an important problem across many domains. This task is generally treated as a binary judgment of whether the claim is supported or unsupported in relation to the source. In many cases, though, whether a claim is supported can be ambiguous. For instance, it may depend on making inferences from given evidence, and different people can reasonably interpret the claim as either supported or unsupported based on their agreement with those inferences. Forcing binary labels upon such claims lowers the reliability of evaluation. In this work, we reframe the task to manage the subjectivity involved with factuality judgments of ambiguous claims. We introduce LLM-generated edits of summaries as a method of providing a nuanced evaluation of claims: how much does a summary need to be edited to be unambiguous? Whether a claim gets rewritten and how much it changes can be used as an automatic evaluation metric, the Ambiguity Rewrite Metric (ARM), with a much richer feedback signal than a binary judgment of faithfulness. We focus on the area of narrative summarization as it is particularly rife with ambiguity and subjective interpretation. We show that ARM produces a 21% absolute improvement in annotator agreement on claim faithfulness, indicating that subjectivity is reduced.
pdf
bib
abs
MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors
Jakub Macina
|
Nico Daheim
|
Ido Hakimi
|
Manu Kapur
|
Iryna Gurevych
|
Mrinmaya Sachan
Evaluating the pedagogical capabilities of AI-based tutoring models is critical for making guided progress in the field. Yet, we lack a reliable, easy-to-use, and simple-to-run evaluation that reflects the pedagogical abilities of models. To fill this gap, we present MathTutorBench, an open-source benchmark for holistic tutoring model evaluation. MathTutorBench contains a collection of datasets and metrics that broadly cover tutor abilities as defined by learning sciences research in dialog-based teaching. To score the pedagogical quality of open-ended teacher responses, we train a reward model and show it can discriminate expert from novice teacher responses with high accuracy. We evaluate a wide set of closed- and open-weight models on MathTutorBench and find that subject expertise, indicated by solving ability, does not immediately translate to good teaching. Rather, pedagogy and subject expertise appear to form a trade-off that is navigated by the degree of tutoring specialization of the model. Furthermore, tutoring appears to become more challenging in longer dialogs, where simpler questioning strategies begin to fail. We release the benchmark, code, and leaderboard openly to enable rapid benchmarking of future models.
pdf
bib
abs
Preemptive Detection and Correction of Misaligned Actions in LLM Agents
Haishuo Fang
|
Xiaodan Zhu
|
Iryna Gurevych
Deploying LLM-based agents in real-life applications often faces a critical challenge: the misalignment between agents’ behavior and user intent. Such misalignment may lead agents to unintentionally execute critical actions that carry negative outcomes (e.g., accidentally triggering a buy-now in web shopping), resulting in undesirable or even irreversible consequences. Although addressing these issues is crucial, the preemptive detection and correction of misaligned actions remains relatively underexplored. To fill this gap, we introduce InferAct, a novel approach that leverages the belief reasoning ability of LLMs, grounded in Theory-of-Mind, to detect misaligned actions. Once misalignment is detected, InferAct alerts users for timely correction, preventing adverse outcomes and enhancing the reliability of LLM agents’ decision-making processes. Experiments on three widely used tasks demonstrate that InferAct achieves up to 20% improvements in Macro-F1 against baselines in misaligned action detection. An in-depth evaluation of misalignment correction further highlights InferAct’s effectiveness in improving agent alignment.
pdf
bib
abs
Fingerprinting LLMs through Survey Item Factor Correlation: A Case Study on Humor Style Questionnaire
Simon Münker
LLMs increasingly engage with psychological instruments, yet how they represent constructs internally remains poorly understood. We introduce a novel approach to “fingerprinting” LLMs through their factor correlation patterns on standardized psychological assessments, deepening our understanding of how LLMs represent constructs. Using the Humor Style Questionnaire as a case study, we analyze how six LLMs represent and correlate humor-related constructs in comparison to survey participants. Our results show that the models exhibit little similarity to human response patterns. In contrast, subsamples of participants demonstrate remarkably high internal consistency. Exploratory graph analysis further confirms that no LLM successfully recovers the four constructs of the Humor Style Questionnaire. These findings suggest that despite advances in natural language capabilities, current LLMs represent psychological constructs in fundamentally different ways than humans do, questioning the validity of their application as human simulacra.
pdf
bib
abs
Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval
Tianlu Zheng
|
Yifan Zhang
|
Xiang An
|
Ziyong Feng
|
Kaicheng Yang
|
Qichuan Ding
Although Contrastive Language-Image Pre-training (CLIP) exhibits strong performance across diverse vision tasks, its application to person representation learning faces two critical challenges: (i) the scarcity of large-scale annotated vision-language data focused on person-centric images, and (ii) the inherent limitations of global contrastive learning, which struggles to maintain discriminative local features crucial for fine-grained matching while remaining vulnerable to noisy text tokens. This work advances CLIP for person representation learning through synergistic improvements in data curation and model architecture. First, we develop a noise-resistant data construction pipeline that leverages the in-context learning capabilities of MLLMs to automatically filter and caption web-sourced images. This yields WebPerson, a large-scale dataset of 5M high-quality person-centric image-text pairs. Second, we introduce the GA-DMS (Gradient-Attention Guided Dual-Masking Synergetic) framework, which improves cross-modal alignment by adaptively masking noisy textual tokens based on the gradient-attention similarity score. Additionally, we incorporate masked token prediction objectives that compel the model to predict informative text tokens, enhancing fine-grained semantic representation learning. Extensive experiments show that GA-DMS achieves state-of-the-art performance across multiple benchmarks. The data and pre-trained models are released at https://github.com/Multimodal-Representation-Learning-MRL/GA-DMS.
pdf
bib
abs
From Problem-Solving to Teaching Problem-Solving: Aligning LLMs with Pedagogy using Reinforcement Learning
David Dinucu-Jianu
|
Jakub Macina
|
Nico Daheim
|
Ido Hakimi
|
Iryna Gurevych
|
Mrinmaya Sachan
Large language models (LLMs) can transform education, but their optimization for direct question-answering often undermines effective pedagogy, which requires strategically withholding answers. To mitigate this, we propose an online reinforcement learning (RL)-based alignment framework that can quickly adapt LLMs into effective tutors using simulated student-tutor interactions, emphasizing pedagogical quality and guided problem-solving over simply giving away answers. We use our method to train a 7B-parameter tutor model, without human annotations, that reaches performance similar to larger proprietary models like LearnLM. We introduce a controllable reward weighting to balance pedagogical support and student solving accuracy, allowing us to trace the Pareto frontier between these two objectives. Our models better preserve reasoning capabilities than single-turn SFT baselines and can optionally enhance interpretability through thinking tags that expose the model’s instructional planning.
pdf
bib
abs
CompKBQA: Component-wise Task Decomposition for Knowledge Base Question Answering
Yuhang Tian
|
Dandan Song
|
Zhijing Wu
|
Pan Yang
|
Changzhi Zhou
|
Jun Yang
|
Hao Wang
|
Huipeng Ma
|
Chenhao Li
|
Luan Zhang
Knowledge Base Question Answering (KBQA) aims to extract accurate answers from the Knowledge Base (KB). Traditional Semantic Parsing (SP)-based methods are widely used but struggle with complex queries. Recently, large language models (LLMs) have shown promise in improving KBQA performance. However, the challenge of generating error-free logical forms remains, as skeleton, topic entity, and relation errors still frequently occur. To address these challenges, we propose CompKBQA (Component-wise Task Decomposition for Knowledge Base Question Answering), a novel framework that optimizes the process of fine-tuning an LLM for generating logical forms by enabling the LLM to progressively learn relevant sub-tasks like skeleton generation, topic entity generation, and relevant relation generation. Additionally, we propose R3, which retrieves and incorporates KB information into the process of logical form generation. Experimental evaluations on two benchmark KBQA datasets, WebQSP and CWQ, demonstrate that CompKBQA achieves state-of-the-art performance, highlighting the importance of task decomposition and KB-aware learning.
pdf
bib
abs
Permutative Preference Alignment from Listwise Ranking of Human Judgments
Yang Zhao
|
Yixin Wang
|
Mingzhang Yin
Aligning Large Language Models (LLMs) with human preferences is crucial in ensuring desirable and controllable model behaviors. Current methods, such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), rely on the Bradley-Terry (B-T) model to maximize the likelihood of pairwise choices. However, when multiple responses are available, the B-T model fails to guarantee an accurate list ranking of the responses. To address this issue, we propose Permutative Preference Alignment (PPA), a novel offline listwise approach that incorporates the Normalized Discounted Cumulative Gain (NDCG)—a widely-used ranking metric—as an alternative training objective for LLM alignment. We develop an end-to-end alignment algorithm by approximating NDCG with a differentiable surrogate loss. Experiments demonstrate that PPA outperforms existing pairwise and listwise methods on evaluation sets and general benchmarks such as AlpacaEval. Furthermore, we show that NDCG-based approaches improve ranking accuracy more effectively than B-T-based methods and provide a theoretical explanation for this improvement.
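For context, the NDCG metric that PPA adopts as its training objective is standardly defined as follows (shown only for reference; the paper's differentiable surrogate is a separate construction):
\mathrm{DCG@}k=\sum_{i=1}^{k}\frac{2^{r_i}-1}{\log_2(i+1)},\qquad \mathrm{NDCG@}k=\frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
where r_i is the graded preference of the response ranked at position i and IDCG@k is the DCG of the ideal ordering, so NDCG rewards placing the most preferred responses near the top of the list.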
pdf
bib
abs
ToneCraft: Cantonese Lyrics Generation with Harmony of Tones and Pitches
Junyu Cheng
|
Chang Pan
|
Shuangyin Li
Lyrics generation has garnered increasing attention within the artificial intelligence community. Our task focuses on generating harmonious Cantonese lyrics. Unlike other languages, Cantonese has a unique system of nine tones and six contours, making it essential, when composing lyrics, to satisfy the harmony rules that ensure alignment between the melody and the tonal contours of the lyrics. Current research has not yet addressed the challenge of generating lyrics that adhere to Cantonese harmony rules. To tackle this issue, we propose ToneCraft, a novel framework for generating Cantonese lyrics that ensures tonal and melodic harmony. It enables LLMs to generate lyrics with a fixed character count while aligning with tonal and melodic structures. We present an algorithm that combines character-level control, melodic guidance, and a task-specific loss to achieve tonal harmony without compromising generation flexibility and quality. By incorporating domain-specific expertise, we leverage pure lyric datasets to train our model, eliminating the need for aligned data. Both objective evaluations and subjective assessments show that our generated lyrics align with melodic contours significantly better than those from existing methods. All code and data are available at: https://github.com/purepasser-by/ToneCraft.
pdf
bib
abs
SensorLLM: Aligning Large Language Models with Motion Sensors for Human Activity Recognition
Zechen Li
|
Shohreh Deldari
|
Linyao Chen
|
Hao Xue
|
Flora D. Salim
We introduce SensorLLM, a two-stage framework that enables Large Language Models (LLMs) to perform human activity recognition (HAR) from sensor time-series data. Despite their strong reasoning and generalization capabilities, LLMs remain underutilized for motion sensor data due to the lack of semantic context in time-series, computational constraints, and challenges in processing numerical inputs. SensorLLM addresses these limitations through a Sensor-Language Alignment stage, where the model aligns sensor inputs with trend descriptions. Special tokens are introduced to mark channel boundaries. This alignment enables LLMs to capture numerical variations, channel-specific features, and data of varying durations, without requiring human annotations. In the subsequent Task-Aware Tuning stage, we refine the model for HAR classification, achieving performance that matches or surpasses state-of-the-art methods. Our results demonstrate that SensorLLM evolves into an effective sensor learner, reasoner, and classifier through human-intuitive Sensor-Language Alignment, generalizing across diverse HAR datasets. We believe this work establishes a foundation for future research on time-series and text alignment, paving the way for foundation models in sensor data analysis. Our codes are available at https://github.com/zechenli03/SensorLLM.
pdf
bib
abs
MixLoRA-DSI: Dynamically Expandable Mixture-of-LoRA Experts for Rehearsal-Free Generative Retrieval over Dynamic Corpora
Tuan-Luc Huynh
|
Thuy-Trang Vu
|
Weiqing Wang
|
Trung Le
|
Dragan Gasevic
|
Yuan-Fang Li
|
Thanh-Toan Do
Continually updating model-based indexes in generative retrieval with new documents remains challenging, as full retraining is computationally expensive and impractical under resource constraints. We propose MixLoRA-DSI, a novel framework that combines an expandable mixture of Low-Rank Adaptation experts with a layer-wise out-of-distribution (OOD)-driven expansion strategy. Instead of allocating new experts for each new corpus, our proposed expansion strategy enables sublinear parameter growth by selectively introducing new experts only when a significant number of OOD documents are detected. Experiments on NQ320k and MS MARCO Passage demonstrate that MixLoRA-DSI outperforms full-model update baselines, with minimal parameter overhead and substantially lower training costs.
pdf
bib
abs
ViClaim: A Multilingual Multilabel Dataset for Automatic Claim Detection in Videos
Patrick Giedemann
|
Pius von Däniken
|
Jan Milan Deriu
|
Alvaro Rodrigo
|
Anselmo Peñas
|
Mark Cieliebak
The growing influence of video content as a medium for communication and misinformation underscores the urgent need for effective tools to analyze claims in multilingual and multi-topic settings. Existing efforts in misinformation detection largely focus on written text, leaving a significant gap in addressing the complexity of spoken text in video transcripts. We introduce ViClaim, a dataset of 1,798 annotated video transcripts across three languages (English, German, Spanish) and six topics. Each sentence in the transcripts is labeled with three claim-related categories: fact-check-worthy, fact-non-check-worthy, or opinion. We developed a custom annotation tool to facilitate the highly complex annotation process. Experiments with state-of-the-art multilingual language models demonstrate strong performance in cross-validation (macro F1 up to 0.896) but reveal challenges in generalization to unseen topics, particularly for distinct domains. Our findings highlight the complexity of claim detection in video transcripts. ViClaim offers a robust foundation for advancing misinformation detection in video-based communication, addressing a critical gap in multimodal analysis.
pdf
bib
abs
DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments
Yuxiang Zheng
|
Dayuan Fu
|
Xiangkun Hu
|
Xiaojie Cai
|
Lyumanshan Ye
|
Pengrui Lu
|
Pengfei Liu
Large Language Models (LLMs) with web search capabilities show significant potential for deep research, yet current methods—brittle prompt engineering or RAG-based reinforcement learning in controlled environments—fail to capture real-world complexities. In this paper, we introduce DeepResearcher, the first comprehensive framework for end-to-end training of LLM-based deep research agents through scaling reinforcement learning (RL) in real-world environments with authentic web search interactions. Unlike RAG approaches reliant on fixed corpora, DeepResearcher trains agents to navigate the noisy, dynamic open web. We implement a specialized multi-agent architecture in which browsing agents extract relevant information from various webpage structures, overcoming significant technical challenges. Extensive experiments on open-domain research tasks demonstrate that DeepResearcher achieves substantial improvements of up to 28.9 points over prompt-engineering-based baselines and up to 7.2 points over RAG-based RL agents. Our qualitative analysis reveals emergent cognitive behaviors from end-to-end RL training, such as planning, cross-validation, self-reflection for research redirection, and maintaining honesty when unable to find definitive answers. Our results highlight that end-to-end training in real-world web environments is fundamental for developing robust research capabilities aligned with real-world applications. The source code for DeepResearcher is released at: https://github.com/GAIR-NLP/DeepResearcher.
pdf
bib
abs
Mixture of Length and Pruning Experts for Knowledge Graphs Reasoning
Enjun Du
|
Siyi Liu
|
Yongqi Zhang
Knowledge Graph (KG) reasoning, which aims to infer new facts from structured knowledge repositories, plays a vital role in Natural Language Processing (NLP) systems. Its effectiveness critically depends on constructing informative and contextually relevant reasoning paths. However, existing graph neural networks (GNNs) often adopt rigid, query-agnostic path-exploration strategies, limiting their ability to adapt to diverse linguistic contexts and semantic nuances. To address these limitations, we propose MoKGR, a mixture-of-experts framework that personalizes path exploration through two complementary components: (1) a mixture of length experts that adaptively selects and weights candidate path lengths according to query complexity, providing query-specific reasoning depth; and (2) a mixture of pruning experts that evaluates candidate paths from a complementary perspective, retaining the most informative paths for each query. Through comprehensive experiments on diverse benchmarks, MoKGR demonstrates superior performance in both transductive and inductive settings, validating the effectiveness of personalized path exploration in KG reasoning.
pdf
bib
abs
MPRF: Interpretable Stance Detection through Multi-Path Reasoning Framework
ZhaoDan Zhang
|
Jin Zhang
|
Hui Xu
|
Jiafeng Guo
|
Xueqi Cheng
Stance detection, a critical task in Natural Language Processing (NLP), aims to identify the attitude expressed in text toward specific targets. Despite advancements in Large Language Models (LLMs), challenges such as limited interpretability and handling nuanced content persist. To address these issues, we propose the Multi-Path Reasoning Framework (MPRF), a novel framework that generates, evaluates, and integrates multiple reasoning paths to improve accuracy, robustness, and transparency in stance detection. Unlike prior work that relies on single-path reasoning or static explanations, MPRF introduces a structured end-to-end pipeline: it first generates diverse reasoning paths through predefined perspectives, then dynamically evaluates and optimizes each path using LLM-based scoring, and finally fuses the results via weighted aggregation to produce interpretable and reliable predictions. Extensive experiments on the SEM16, VAST, and PStance datasets demonstrate that MPRF outperforms existing models. Ablation studies further validate the critical role of MPRF’s components, highlighting its effectiveness in enhancing interpretability and handling complex stance detection tasks.
pdf
bib
abs
Analyzing the Effects of Supervised Fine-Tuning on Model Knowledge from Token and Parameter Levels
Junjie Ye
|
Yuming Yang
|
Yang Nan
|
Shuo Li
|
Qi Zhang
|
Tao Gui
|
Xuanjing Huang
|
Peng Wang
|
Zhongchao Shi
|
Jianping Fan
Large language models (LLMs) acquire substantial world knowledge during pre-training, which is further shaped by post-training techniques such as supervised fine-tuning (SFT). However, the impact of SFT on a model’s knowledge remains underexplored, limiting our ability to control knowledge behavior in fine-tuned models. To address this gap, we evaluate closed-book question answering (CBQA) performance across five LLMs from the LLaMA-2 and LLaMA-3 families. Surprisingly, models fine-tuned on 1,920 samples perform up to 14% worse than those fine-tuned on only 240 samples. Furthermore, varying the level of knowledge mastery in the fine-tuning data leads to performance fluctuations of over 12%. To investigate these effects, we analyze model behavior at both the token and parameter levels. Our analysis reveals that up to 90% of parameter updates during SFT do not contribute to knowledge enhancement. Restoring these updates can improve performance on the CBQA task, depending on the characteristics of the fine-tuning data. These insights offer practical guidance for developing fine-tuning strategies that more effectively strengthen model knowledge.
pdf
bib
abs
JI2S: Joint Influence‐Aware Instruction Data Selection for Efficient Fine‐Tuning
Jingyu Wei
|
Bo Liu
|
Tianjiao Wan
|
Baoyun Peng
|
Xingkong Ma
|
Mengmeng Guo
Instruction tuning (IT) improves large language models (LLMs) by aligning their outputs with human instructions, but its success depends critically on training data quality, and datasets such as Alpaca often contain noisy or suboptimal examples that undermine fine‐tuning. Prior selection strategies score samples using general‐purpose LLMs (e.g., GPT), leveraging their strong language understanding yet introducing inherent biases that misalign with the target model’s behavior and yield unstable downstream performance. Influence‐based methods address this by estimating each example’s marginal contribution to overall performance, but they typically assume additive contributions and therefore overlook higher‐order interactions among samples. To overcome these limitations, we propose JI2S, a novel framework that jointly models both marginal and combinatorial influences within sample groups. Applying JI2S to select the top 1,000 most influential examples from Alpaca, we fine‐tune LLaMA2‐7B, Mistral‐7B, and LLaMA2‐13B and evaluate them on Open LLM Benchmarks, MT‐Bench, and GPT‐4–judged pairwise comparisons. Our experiments show that JI2S consistently outperforms full‐dataset training and strong baselines, highlighting the value of capturing joint influence for high‐quality instruction fine‐tuning. We provide our code in this GitHub repository.
pdf
bib
abs
SoundMind: RL-Incentivized Logic Reasoning for Audio-Language Models
Xingjian Diao
|
Chunhui Zhang
|
Keyi Kong
|
Weiyi Wu
|
Chiyu Ma
|
Zhongyu Ouyang
|
Peijun Qing
|
Soroush Vosoughi
|
Jiang Gui
While large language models have demonstrated impressive reasoning abilities, their extension to the audio modality, particularly within large audio-language models (LALMs), remains underexplored. Addressing this gap requires a systematic approach that involves a capable base model, high-quality reasoning-oriented audio data, and effective training algorithms. In this work, we present a comprehensive solution for audio logical reasoning (ALR) tasks: we introduce SoundMind, a dataset of 6,446 audio–text annotated samples specifically curated to support complex reasoning. Building on this resource, we propose SoundMind-RL, a rule-based reinforcement learning (RL) algorithm designed to equip audio-language models with robust audio–text reasoning capabilities. By fine-tuning Qwen2.5-Omni-7B on the proposed SoundMind dataset using SoundMind-RL, we achieve strong and consistent improvements over state-of-the-art baselines on the SoundMind benchmark. This work highlights the benefit of combining high-quality, reasoning-focused datasets with specialized RL techniques, and contributes to advancing auditory intelligence in language models. The code and dataset are publicly available at https://github.com/xid32/SoundMind.
pdf
bib
abs
Seeing More, Saying More: Lightweight Language Experts are Dynamic Video Token Compressors
Xiangchen Wang
|
Jinrui Zhang
|
Teng Wang
|
Haigang Zhang
|
Feng Zheng
Recent advancements in large video-language models have revolutionized video understanding tasks. However, their efficiency is significantly constrained by processing high volumes of visual tokens. Existing token compression strategies apply a fixed compression ratio, ignoring the variability in semantic density among different video clips. Consequently, this leads to inadequate representation of information-rich clips due to insufficient tokens and unnecessary computation on static or content-poor ones. To address this, we propose LangDC, a Language-aware Dynamic Token Compressor. LangDC leverages a lightweight language model to describe video clips, converting them into soft caption tokens as visual representations. Trained with our proposed semantic density-aware supervision, LangDC aims to 1) cover key visual cues necessary for downstream task reasoning and 2) dynamically adjust compression ratios based on scene richness, reflected by description length. Our design mimics how humans dynamically express what they see: complex scenes (seeing more) elicit more detailed language to convey nuances (saying more), whereas simpler scenes are described with fewer words. Experimental results show that our method reduces FLOPs by 49% compared to VideoGPT+ while maintaining competitive performance. Furthermore, qualitative results demonstrate that our approach adaptively adjusts the token compression ratio based on video segment richness. Code will be released upon acceptance.
pdf
bib
abs
RoT: Enhancing Table Reasoning with Iterative Row-Wise Traversals
Xuanliang Zhang
|
Dingzirui Wang
|
Keyan Xu
|
Qingfu Zhu
|
Wanxiang Che
The table reasoning task, crucial for efficient data acquisition, aims to answer questions based on the given table. Recently, reasoning large language models (RLLMs) with Long Chain-of-Thought (Long CoT) have significantly enhanced reasoning capabilities, leading to brilliant performance on table reasoning. However, Long CoT suffers from high training costs and exhibits low reliability due to table content hallucinations. Therefore, we propose Row-of-Thought (RoT), which performs iterative row-wise table traversal, allowing for reasoning extension and reflection-based refinement at each traversal. Scaling reasoning length by row-wise traversal and leveraging the reflection capabilities of LLMs, RoT is training-free. The sequential traversal encourages greater attention to the table, thus reducing hallucinations. Experiments show that RoT, using non-reasoning models, outperforms RLLMs by an average of 4.3%, and achieves state-of-the-art results on WikiTableQuestions and TableBench with comparable models, proving its effectiveness. Also, RoT outperforms Long CoT with fewer reasoning tokens, indicating higher efficiency.
pdf
bib
abs
T-MAD: Target-driven Multimodal Alignment for Stance Detection
ZhaoDan Zhang
|
Jin Zhang
|
Xueqi Cheng
|
Hui Xu
Multimodal Stance Detection (MSD) aims to determine a user’s stance - support, oppose, or neutral - toward a target by analyzing multimodal content such as texts and images from social media. Existing MSD methods struggle with generalizing to unseen targets and handling modality inconsistencies. To address these challenges, we propose the Target-driven Multi-modal Alignment and Dynamic Weighting Model (T-MAD), which combines target-driven multi-modal alignment and dynamic weighting mechanisms to capture target-specific relationships and balance modality contributions. The model incorporates iterative reasoning to progressively refine predictions, achieving robust performance in both in-target and zero-shot settings. Experiments on the MMSD and MultiClimate datasets show that T-MAD outperforms state-of-the-art models, with optimal results achieved using RoBERTa, ViT, and an iterative depth of 5. Ablation studies further confirm the importance of multi-modal alignment and dynamic weighting in enhancing model effectiveness.
pdf
bib
abs
Emotion Transfer with Enhanced Prototype for Unseen Emotion Recognition in Conversation
Kun Peng
|
Cong Cao
|
Hao Peng
|
Guanlin Wu
|
Zhifeng Hao
|
Lei Jiang
|
Yanbing Liu
|
Philip S. Yu
Current Emotion Recognition in Conversation (ERC) research follows a closed-domain assumption. However, there is no clear consensus on emotion classification in psychology, which presents a challenge for models when it comes to recognizing previously unseen emotions in real-world applications. To bridge this gap, we introduce the Unseen Emotion Recognition in Conversation (UERC) task for the first time and propose **ProEmoTrans**, a solid prototype-based emotion transfer framework. This prototype-based approach shows promise but still faces key challenges: First, implicit expressions complicate emotion definition, which we address by proposing an LLM-enhanced description approach. Second, utterance encoding in long conversations is difficult, which we tackle with a proposed parameter-free mechanism for efficient encoding and overfitting prevention. Finally, the Markovian flow nature of emotions is hard to transfer, which we address with an improved Attention Viterbi Decoding (AVD) method to transfer seen emotion transitions to unseen emotions. Extensive experiments on three datasets show that our method serves as a strong baseline for preliminary exploration in this new area.
pdf
bib
abs
PBI-Attack: Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for Toxicity Maximization
Ruoxi Cheng
|
Yizhong Ding
|
Shuirong Cao
|
Ranjie Duan
|
Xiaoshuang Jia
|
Shaowei Yuan
|
Simeng Qin
|
Zhiqiang Wang
|
Xiaojun Jia
Understanding the vulnerabilities of Large Vision Language Models (LVLMs) to jailbreak attacks is essential for their responsible real-world deployment. Most previous work requires access to model gradients or relies on human knowledge (prompt engineering) to perform jailbreaks, and it rarely considers the interaction of images and text, resulting in an inability to jailbreak in black-box scenarios or in poor performance. To overcome these limitations, we propose a Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for toxicity maximization, referred to as PBI-Attack. Our method begins by extracting malicious features from a harmful corpus using an alternative LVLM and embedding these features into a benign image as prior information. Subsequently, we enhance these features through bidirectional cross-modal interaction optimization, which iteratively optimizes the bimodal perturbations in an alternating manner through greedy search, aiming to maximize the toxicity of the generated response. The toxicity level is quantified using a well-trained evaluation model. Experiments demonstrate that PBI-Attack outperforms previous state-of-the-art jailbreak methods, achieving an average attack success rate of 92.5% across three open-source LVLMs and around 67.3% on three closed-source LVLMs. Disclaimer: This paper contains potentially disturbing and offensive content.
pdf
bib
abs
Training a Utility-based Retriever Through Shared Context Attribution for Retrieval-Augmented Language Models
Yilong Xu
|
Jinhua Gao
|
Xiaoming Yu
|
Yuanhai Xue
|
Baolong Bi
|
Huawei Shen
|
Xueqi Cheng
Retrieval-Augmented Language Models boost task performance, owing to the retriever that provides external knowledge. Although crucial, the retriever primarily focuses on semantic relevance, which may not always be effective for generation. Thus, utility-based retrieval has emerged as a promising topic, prioritizing passages that provide valid benefits for downstream tasks. However, because passage utility is still insufficiently understood, capturing it accurately remains largely unexplored. This work proposes SCARLet, a framework for training utility-based retrievers in RALMs, which incorporates two key factors: multi-task generalization and inter-passage interaction. First, SCARLet constructs shared context on which training data for various tasks is synthesized. This mitigates semantic bias from context differences, allowing retrievers to focus on learning task-specific utility and generalize across tasks. Next, SCARLet uses a perturbation-based attribution method to estimate passage-level utility for the shared context, which reflects interactions between passages and provides more accurate feedback. We evaluate our approach on ten datasets across various tasks, both in-domain and out-of-domain, showing that retrievers trained by SCARLet consistently improve the overall performance of RALMs.
pdf
bib
abs
SportReason: Evaluating Retrieval-Augmented Reasoning across Tables and Text for Sports Question Answering
Kaiyue Feng
|
Siyue Zhang
|
Bingsen Chen
|
Yilun Zhao
|
Chen Zhao
We present SportReason, a benchmark for retrieval-augmented reasoning on numerical sports questions. Unlike existing benchmarks limited to one or two evidence units, SportReason requires combining and reasoning across free text, structured tables, and semi-structured infoboxes. We provide 3,000 human-verified QA pairs by repurposing existing QA and table generation datasets, and by prompting large language models (LLMs). Each pair is grounded in multiple pieces of evidence from a multi-modal Wikipedia corpus containing 200K knowledge contexts. We evaluate existing retrievers and rerankers, along with agentic Retrieval-Augmented Generation (RAG) systems. The experimental results show that multi-evidence retrieval remains a challenge. Agentic RAG systems (e.g., Search-o1), despite iterative retrieval and reasoning capabilities, fail to improve performance due to imprecise queries, simple training, and distracting information.
pdf
bib
abs
MAC-Tuning: LLM Multi-Compositional Problem Reasoning with Enhanced Knowledge Boundary Awareness
Junsheng Huang
|
Zhitao He
|
Yuchen Huang
|
Sandeep Polisetty
|
Qingyun Wang
|
Yi R. Fung
With the widespread application of large language models (LLMs), the issue of generating non-existing facts, known as hallucination, has garnered increasing attention. Previous research on enhancing LLM confidence estimation has mainly focused on the single-problem setting. However, LLM awareness of its internal parameterized knowledge boundary under the more challenging multi-problem setting, which requires answering multiple problems accurately and simultaneously, remains underexplored. To bridge this gap, we introduce a novel method, Multiple Answers and Confidence Stepwise Tuning (MAC-Tuning), that separates the learning of answer prediction and confidence estimation during fine-tuning on instruction data. Extensive experiments across various base models and different model sizes demonstrate that our proposed method outperforms baselines by up to 25% in average precision.
pdf
bib
abs
CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation
Zhenyi Shen
|
Hanqi Yan
|
Linhai Zhang
|
Zhanghao Hu
|
Yali Du
|
Yulan He
Chain-of-Thought (CoT) reasoning enhances Large Language Models (LLMs) by encouraging step-by-step reasoning in natural language. However, leveraging a latent continuous space for reasoning may offer benefits in terms of both efficiency and robustness. Prior implicit CoT methods attempt to bypass language completely by reasoning in continuous space but have consistently underperformed compared to the standard explicit CoT approach. We introduce CODI (Continuous Chain-of-Thought via Self-Distillation), a novel training framework that effectively compresses natural language CoT into continuous space. CODI jointly trains a teacher task (Explicit CoT) and a student task (Implicit CoT), distilling the reasoning ability from language into continuous space by aligning the hidden states of a designated token. Our experiments show that CODI is the first implicit CoT approach to match the performance of explicit CoT on GSM8k at the GPT-2 scale, achieving a 3.1x compression rate and outperforming the previous state-of-the-art by 28.2% in accuracy. CODI also demonstrates robustness, generalizability to complex datasets, and interpretability. These results validate that LLMs can reason effectively not only in natural language, but also in a latent continuous space. Code is available at https://github.com/zhenyi4/codi.
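To make the hidden-state alignment idea concrete, the following is a minimal, hypothetical PyTorch sketch; it assumes a HuggingFace-style causal LM, an MSE alignment term, and an alpha weighting, none of which are confirmed details of CODI's actual implementation.

import torch
import torch.nn.functional as F

def codi_style_loss(model, teacher_batch, student_batch, pos_teacher, pos_student, alpha=1.0):
    # One shared model runs both tasks: explicit-CoT (teacher) and implicit-CoT (student).
    t_out = model(**teacher_batch, output_hidden_states=True)
    s_out = model(**student_batch, output_hidden_states=True)
    # Last-layer hidden states of the designated token in each sequence (teacher side detached).
    idx = torch.arange(pos_teacher.shape[0])
    h_teacher = t_out.hidden_states[-1][idx, pos_teacher].detach()
    h_student = s_out.hidden_states[-1][idx, pos_student]
    # Assumed alignment term: pull the student's hidden state toward the teacher's.
    align = F.mse_loss(h_student, h_teacher)
    return t_out.loss + s_out.loss + alpha * align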
pdf
bib
abs
PAFT: Prompt-Agnostic Fine-Tuning
Chenxing Wei
|
Yao Shu
|
Mingwen Ou
|
Ying He
|
Fei Yu
Fine-tuning large language models (LLMs) often causes overfitting to specific prompt wording, where minor phrasing variations drastically reduce performance. To address this, we propose Prompt-Agnostic Fine-Tuning (PAFT), a method that enhances robustness through dynamic prompt variation during training. PAFT first generates diverse synthetic prompts, then continuously samples from this set to construct training instances, forcing models to learn fundamental task principles rather than surface-level patterns. Across systematic evaluations using both supervised fine-tuning (SFT) and reinforcement learning fine-tuning (RLFT), PAFT consistently demonstrates improved performance on benchmarks for question answering, mathematical reasoning, and tool use. It achieves 7% higher generalization accuracy on unseen prompts than standard methods with similar training efficiency. Notably, models trained with PAFT attain 3.2× faster inference speeds due to reduced prompt sensitivity. Ablation studies further validate the effectiveness of PAFT, while theoretical analysis reveals that PAFT can effectively enhance the cross-domain generalization ability of LLMs.
pdf
bib
abs
Theorem-Validated Reverse Chain-of-Thought Problem Generation for Geometric Reasoning
Deng Linger
|
Linghao Zhu
|
Yuliang Liu
|
Yu Wang
|
Qunyi Xie
|
Jingjing Wu
|
Gang Zhang
|
Yingying Zhu
|
Xiang Bai
Large Multimodal Models (LMMs) face limitations in geometric reasoning due to insufficient Chain of Thought (CoT) image-text training data. While existing approaches leverage template-based or LLM-assisted methods for geometric CoT data creation, they often face challenges in achieving both diversity and precision. To bridge this gap, we introduce a two-stage Theorem-Validated Reverse Chain-of-Thought Reasoning Synthesis (TR-CoT) framework. The first stage, TR-Engine, synthesizes theorem-grounded geometric diagrams with structured descriptions and properties. The second stage, TR-Reasoner, employs reverse reasoning to iteratively refine question-answer pairs by cross-validating geometric properties and description fragments. Our approach expands theorem-type coverage, corrects long-standing misunderstandings, and enhances geometric reasoning. Fine-grained CoT improves theorem understanding and increases logical consistency by 24.5%. Our best models surpass the baselines in MathVista and GeoQA by 10.1% and 4.7%, outperforming advanced closed-source models like GPT-4o.
pdf
bib
abs
TACO: Enhancing Multimodal In-context Learning via Task Mapping-Guided Sequence Configuration
Yanshu Li
|
Jianjiang Yang
|
Tian Yun
|
Pinyuan Feng
|
Jinfa Huang
|
Ruixiang Tang
Multimodal in-context learning (ICL) has emerged as a key mechanism for harnessing the capabilities of large vision–language models (LVLMs). However, its effectiveness remains highly sensitive to the quality of input ICL sequences, particularly for tasks involving complex reasoning or open-ended generation. A major limitation is our limited understanding of how LVLMs actually exploit these sequences during inference. To bridge this gap, we systematically interpret multimodal ICL through the lens of task mapping, which reveals how local and global relationships within and among demonstrations guide model reasoning. Building on this insight, we present TACO, a lightweight transformer-based model equipped with task-aware attention that dynamically configures ICL sequences. By injecting task-mapping signals into the autoregressive decoding process, TACO creates a bidirectional synergy between sequence construction and task reasoning. Experiments on five LVLMs and nine datasets demonstrate that TACO consistently surpasses baselines across diverse ICL tasks. These results position task mapping as a novel and valuable perspective for interpreting and improving multimodal ICL.
pdf
bib
abs
Towards Controllable Speech Synthesis in the Era of Large Language Models: A Systematic Survey
Tianxin Xie
|
Yan Rong
|
Pengfei Zhang
|
Wenwu Wang
|
Li Liu
Text-to-speech (TTS) has advanced from generating natural-sounding speech to enabling fine-grained control over attributes like emotion, timbre, and style. Driven by rising industrial demand and breakthroughs in deep learning, e.g., diffusion and large language models (LLMs), controllable TTS has become a rapidly growing research area. This survey provides **the first** comprehensive review of controllable TTS methods, from traditional control techniques to emerging approaches using natural language prompts. We categorize model architectures, control strategies, and feature representations, while also summarizing challenges, datasets, and evaluations in controllable TTS. This survey aims to guide researchers and practitioners by offering a clear taxonomy and highlighting future directions in this fast-evolving field. One can visit https://github.com/imxtx/awesome-controllabe-speech-synthesis for a comprehensive paper list and updates.
pdf
bib
abs
Automating Steering for Safe Multimodal Large Language Models
Lyucheng Wu
|
Mengru Wang
|
Ziwen Xu
|
Tri Cao
|
Nay Oo
|
Bryan Hooi
|
Shumin Deng
Recent progress in Multimodal Large Language Models (MLLMs) has unlocked powerful cross-modal reasoning abilities, but also raised new safety concerns, particularly when faced with adversarial multimodal inputs. To improve the safety of MLLMs during inference, we introduce AutoSteer, a modular and adaptive inference-time intervention technique that requires no fine-tuning of the underlying model. AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model’s internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected. Experiments on LLaVA-OV and Chameleon across diverse safety-critical benchmarks demonstrate that AutoSteer significantly reduces the Attack Success Rate (ASR) for textual, visual, and cross-modal threats, while maintaining general abilities. These findings position AutoSteer as a practical, interpretable, and effective framework for safer deployment of multimodal AI systems.
pdf
bib
abs
EMNLP: Educator-role Moral and Normative Large Language Models Profiling
Yilin Jiang
|
Mingzi Zhang
|
Sheng Jin
|
Zengyi Yu
|
Xiangjie Kong
|
Binghao Tu
Simulating Professions (SP) enables Large Language Models (LLMs) to emulate professional roles. However, comprehensive psychological and ethical evaluation in these contexts remains lacking. This paper introduces EMNLP, an Educator-role Moral and Normative LLMs Profiling framework for personality profiling, moral development stage measurement, and assessment of ethical risk under soft prompt injection. EMNLP extends existing scales and constructs 88 teacher-specific moral dilemmas, enabling profession-oriented comparison with human teachers. A targeted soft prompt injection set evaluates compliance and vulnerability in teacher SP. Experiments on 14 LLMs show that teacher-role LLMs exhibit more idealized and polarized personalities than human teachers, excel in abstract moral reasoning, but struggle with emotionally complex situations. Models with stronger reasoning are more vulnerable to harmful prompt injection, revealing a paradox between capability and safety. Model temperature and other hyperparameters have limited influence, except on some risk behaviors. This paper presents the first benchmark to assess the ethical and psychological alignment of teacher-role LLMs for educational AI. Resources are available at https://e-m-n-l-p.github.io/.
pdf
bib
abs
TracSum: A New Benchmark for Aspect-Based Summarization with Sentence-Level Traceability in Medical Domain
Bohao Chu
|
Meijie Li
|
Sameh Frihat
|
Chengyu Gu
|
Georg Lodde
|
Elisabeth Livingstone
|
Norbert Fuhr
While document summarization with LLMs has enhanced access to textual information, concerns about the factual accuracy of these summaries persist (e.g., hallucination), especially in the medical domain. Tracing source evidence from which summaries are derived enables users to assess their accuracy, thereby alleviating this concern. In this paper, we introduce TracSum, a novel benchmark for traceable, aspect-based summarization, in which generated summaries are paired with sentence-level citations, enabling users to trace back to the original context. First, we annotate 500 medical abstracts for seven key medical aspects, yielding 3.5K summary–citation pairs. We then propose a fine-grained evaluation framework for this new task, designed to assess the completeness and consistency of generated content using four metrics. Finally, we introduce a summarization pipeline, Track-Then-Sum, which serves as a baseline method for comparison. In experiments, we evaluate both this baseline and a set of LLMs on TracSum, and conduct a human evaluation to validate the automatic evaluation results. The findings demonstrate that TracSum can serve as an effective benchmark for traceable, aspect-based summarization tasks. We also observe that explicitly performing sentence-level tracking prior to summarization enhances generation accuracy, while incorporating the full context further improves summary completeness. Source code and dataset are available at https://github.com/chubohao/TracSum.
pdf
bib
abs
Context Reasoner: Incentivizing Reasoning Capability for Contextualized Privacy and Safety Compliance via Reinforcement Learning
Wenbin Hu
|
Haoran Li
|
Huihao Jing
|
Qi Hu
|
Ziqian Zeng
|
Sirui Han
|
Xu Heli
|
Tianshu Chu
|
Peizhao Hu
|
Yangqiu Song
While Large Language Models (LLMs) exhibit remarkable capabilities, they also introduce significant safety and privacy risks. Current mitigation strategies often fail to preserve contextual reasoning capabilities in risky scenarios. Instead, they rely heavily on sensitive pattern matching to protect LLMs, which limits their scope. Furthermore, they overlook established safety and privacy standards, leading to systemic risks for legal compliance. To address these gaps, we formulate safety and privacy issues as contextualized compliance problems following Contextual Integrity (CI) theory. Under the CI framework, we align our model with three critical regulatory standards: GDPR, the EU AI Act, and HIPAA. Specifically, we employ reinforcement learning (RL) with a rule-based reward to incentivize contextual reasoning capabilities while enhancing compliance with safety and privacy norms. Through extensive experiments, we demonstrate that our method not only significantly enhances legal compliance (achieving a +8.58% accuracy improvement on safety/privacy benchmarks) but also further improves general reasoning capability. For OpenThinker-7B, a strong reasoning model that significantly outperforms its base model Qwen2.5-7B-Instruct across diverse subjects, our method enhances its general reasoning capabilities, with +2.05% and +8.98% accuracy improvements on the MMLU and LegalBench benchmarks, respectively.
pdf
bib
abs
Towards General-Domain Word Sense Disambiguation: Distilling Large Language Model into Compact Disambiguator
Liqiang Ming
|
Sheng-hua Zhong
|
Yuncong Li
Word Sense Disambiguation (WSD) aims to determine the correct meaning of a word in context from a predefined inventory, and remains a fundamental challenge in natural language understanding. Existing methods rely heavily on manually annotated data, which limits coverage and generalization. In this work, we propose a scalable framework that leverages large language models (LLMs) as knowledge distillers to construct silver-standard WSD corpora. We explore generation-based distillation, where diverse examples are synthesized for dictionary senses, and annotation-based distillation, where LLMs assign sense labels to polysemous words within real-world corpus sentences. The resulting data is used to train tiny models. Extensive experiments show that models distilled from LLM-generated data outperform those trained on gold-standard corpora, especially on general-domain benchmarks. Our annotation-based model, after balancing the sense distribution, achieves a 50% F1 gain on the most challenging test set, and the best distilled model can match or even exceed the performance of its LLM teacher, despite having over 1000 times fewer parameters. These results demonstrate the effectiveness of LLM-based distillation for building accurate, generalizable, and efficient WSD systems.
pdf
bib
abs
SLoW: Select Low-frequency Words! Automatic Dictionary Selection for Translation on Large Language Models
Hongyuan Lu
|
Zixuan Li
|
Zefan Zhang
|
Wai Lam
There are more than 7,000 languages around the world, yet current Large Language Models (LLMs) support only hundreds of them. Dictionary-based prompting methods can enhance translation for these languages, but most methods use all the available dictionaries, which can be expensive. A more flexible alternative is to trade off token consumption against translation performance. This paper proposes a novel task called Automatic Dictionary Selection (ADS): automatically selecting which dictionaries to use to enhance translation. We propose a novel and effective method, Select Low-frequency Words! (SLoW), which selects the dictionaries for lower-frequency words. Our method has unique advantages. First, it requires no access to the training data for frequency estimation (which is usually unavailable). Second, it inherits the advantage of dictionary-based methods, requiring no additional tuning of the LLMs. Experimental results on 100 languages from FLORES indicate that SLoW surpasses strong baselines and substantially reduces token usage, with many languages even surpassing the translation performance of the full-dictionary baseline.
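A minimal sketch of the frequency-based selection idea, under the assumption that word frequencies come from some external estimate (the abstract only states that training-data access is not required; the cutoff and names below are hypothetical):

```python
# Illustrative sketch of frequency-based dictionary selection (not the paper's code).
# Keep dictionary entries only for source words whose estimated relative frequency
# falls below a cutoff, so prompt tokens are spent on rare words.
def select_dictionary(dictionary, freq, max_freq=1e-4):
    """dictionary: {source_word: translation}; freq: {source_word: relative frequency}."""
    return {w: t for w, t in dictionary.items() if freq.get(w, 0.0) < max_freq}

dictionary = {"the": "le", "serendipity": "sérendipité", "cat": "chat"}
freq = {"the": 5e-2, "cat": 3e-4, "serendipity": 2e-7}
print(select_dictionary(dictionary, freq))  # {'serendipity': 'sérendipité'}
```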
pdf
bib
abs
Parallel Continuous Chain-of-Thought with Jacobi Iteration
Haoyi Wu
|
Zhihao Teng
|
Kewei Tu
Continuous chain-of-thought has been shown to be effective in saving reasoning tokens for large language models. By reasoning with continuous latent thought tokens, continuous CoT is able to perform implicit reasoning in a compact manner. However, the sequential dependencies between latent thought tokens prevent parallel training, leading to long training times. In this paper, we propose Parallel Continuous Chain-of-Thought (PCCoT), which performs Jacobi iteration on the latent thought tokens, updating them iteratively in parallel instead of sequentially, and thus improving both the training and inference efficiency of continuous CoT. Experiments demonstrate that by choosing the proper number of iterations, we are able to achieve comparable or even better performance while saving nearly 50% of the training and inference time. Moreover, PCCoT shows better stability and robustness in the training process. Our code is available at https://github.com/whyNLP/PCCoT.
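A toy numeric illustration of the Jacobi update pattern the abstract refers to (the real PCCoT operates on transformer latent tokens; the update function here is a hypothetical stand-in):

```python
# Toy sketch of Jacobi-style parallel updates over latent "thought" vectors.
# Illustrates the iteration pattern only, not the PCCoT model itself.
import numpy as np

def update(context, prev_thoughts, i):
    # Hypothetical stand-in for the model step producing thought i from the
    # context and the *previous iteration's* earlier thoughts (Jacobi update).
    return np.tanh(context + prev_thoughts[:i].sum(axis=0))

def jacobi_cot(context, num_thoughts=4, num_iters=3, dim=8):
    thoughts = np.zeros((num_thoughts, dim))
    for _ in range(num_iters):
        # All positions are refreshed in parallel from the previous iterate,
        # instead of waiting for token i-1 before computing token i.
        thoughts = np.stack([update(context, thoughts, i) for i in range(num_thoughts)])
    return thoughts

print(jacobi_cot(np.ones(8)).shape)  # (4, 8)
```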
pdf
bib
abs
EQA-RM: A Generative Embodied Reward Model with Test-time Scaling
Yuhang Chen
|
Zhen Tan
|
Tianlong Chen
Reward Models (RMs), vital for large model alignment, are underexplored for complex embodied tasks like Embodied Question Answering (EQA), where nuanced evaluation of agents’ spatial, temporal, and logical understanding is critical yet not considered by generic approaches. We introduce EQA-RM, a novel generative multimodal reward model specifically architected for EQA, trained via our innovative Contrastive Group Relative Policy Optimization (C-GRPO) strategy to learn fine-grained behavioral distinctions. The generative nature of EQA-RM provides interpretable, structured reward feedback (beyond simple scalars), uniquely enabling test-time scaling to dynamically adjust evaluation granularity, from concise scores to detailed critiques of reasoning and grounding, at inference without retraining. Concurrently, we introduce EQARewardBench, a new benchmark built on OpenEQA for standardized EQA reward model assessment. Demonstrating high sample efficiency, EQA-RM (a fine-tuned Qwen2-VL-2B-Instruct) achieves 61.9% accuracy on EQA-RM-Bench with 700 samples, outperforming strong proprietary baselines, including Gemini-2.5-Flash, GPT-4o, Claude-3.5-Haiku, and open-sourced state-of-the-art models such as RoVRM and VisualPRM.
pdf
bib
abs
Refusal-Aware Red Teaming: Exposing Inconsistency in Safety Evaluations
Yongkang Chen
|
Xiaohu Du
|
Xiaotian Zou
|
Chongyang Zhao
|
Huan Deng
|
Hu Li
|
Xiaohui Kuang
The responsible deployment of Large Language Models (LLMs) necessitates rigorous safety evaluations. However, a critical challenge arises from inconsistencies between an LLM’s internal refusal decisions and external safety assessments, hindering effective validation. This paper introduces the concept of the ‘refusal gap’ to formally define these discrepancies. We then present a novel, refusal-aware red teaming framework designed to automatically generate test cases that expose such gaps. Our framework employs ‘refusal probes’, which leverage the target model’s hidden states, to detect internal model refusals. These are subsequently contrasted with judgments from an external safety evaluator. The identified discrepancy serves as a signal to guide a red-teaming model in crafting test cases that maximize this refusal gap. To further enhance test case diversity and address challenges related to sparse rewards, we introduce a hierarchical, curiosity-driven mechanism that incentivizes both refusal gap maximization and broad topic exploration. Empirical results demonstrate that our method significantly outperforms existing reinforcement learning-based approaches in generating diverse test cases and achieves a substantially higher discovery rate of refusal gaps.
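As a loose illustration of the discrepancy signal described above (the paper's formal definition, probe, and reward shaping are not reproduced here; everything below is a hypothetical placeholder), a refusal gap can be read as disagreement between an internal refusal probe and an external safety judge:

```python
# Illustrative sketch of a "refusal gap" signal: disagreement between an
# internal refusal probe (over hidden states) and an external safety judge.
def refusal_gap(p_internal_refusal, external_judge_unsafe):
    """Large when the model internally refuses but the external judge sees no
    harm, or vice versa; usable as a reward for a red-teaming generator."""
    return abs(p_internal_refusal - float(external_judge_unsafe))

print(refusal_gap(0.9, False))  # internal refusal without an external flag -> 0.9
```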
pdf
bib
abs
OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking
Zekun Xi
|
Wenbiao Yin
|
Jizhan Fang
|
Jialong Wu
|
Runnan Fang
|
Yong Jiang
|
Pengjun Xie
|
Fei Huang
|
Huajun Chen
|
Ningyu Zhang
Machine writing with large language models often relies on retrieval-augmented generation. However, these approaches remain confined within the boundaries of the model’s predefined scope, limiting the generation of content with rich information. Specifically, vanilla-retrieved information tends to lack depth, novelty, and suffers from redundancy, which negatively impacts the quality of generated articles, leading to shallow, unoriginal, and repetitive outputs. To address these issues, we propose OmniThink, a slow-thinking machine writing framework that emulates the human-like process of iterative expansion and reflection. The core idea behind OmniThink is to simulate the cognitive behavior of learners as they slowly deepen their knowledge of the topics. Experimental results demonstrate that OmniThink improves the knowledge density of generated articles without compromising metrics such as coherence and depth. Human evaluations and expert feedback further highlight the potential of OmniThink to address real-world challenges in the generation of long-form articles.
pdf
bib
abs
LinkAlign: Scalable Schema Linking for Real-World Large-Scale Multi-Database Text-to-SQL
Yihan Wang
|
Peiyu Liu
|
Xin Yang
Schema linking is a critical bottleneck in applying existing Text-to-SQL models to real-world, large-scale, multi-database environments. Through error analysis, we identify two major challenges in schema linking: (1) Database Retrieval: accurately selecting the target database from a large schema pool, while effectively filtering out irrelevant ones; and (2) Schema Item Grounding: precisely identifying the relevant tables and columns within complex and often redundant schemas for SQL generation. Based on these challenges, we introduce LinkAlign, a novel framework tailored for large-scale databases with thousands of fields. LinkAlign comprises three key steps: multi-round semantically enhanced retrieval and irrelevant information isolation for Challenge 1, and schema extraction enhancement for Challenge 2. Each stage supports both Agent and Pipeline execution modes, enabling a balance between efficiency and performance via a modular design. To enable more realistic evaluation, we construct AmbiDB, a synthetic dataset designed to reflect the ambiguity of real-world schema linking. Experiments on widely-used Text-to-SQL benchmarks demonstrate that LinkAlign consistently outperforms existing baselines on all schema linking metrics. Notably, it improves the overall Text-to-SQL pipeline and achieves a new state-of-the-art score of 33.09% on the Spider 2.0-Lite benchmark using only open-source LLMs, ranking first on the leaderboard at the time of submission. The code is available at https://github.com/Satissss/LinkAlign.
pdf
bib
abs
On Relation-Specific Neurons in Large Language Models
Yihong Liu
|
Runsheng Chen
|
Lea Hirlimann
|
Ahmad Dawar Hakimi
|
Mingyang Wang
|
Amir Hossein Kargaran
|
Sascha Rothe
|
François Yvon
|
Hinrich Schuetze
In large language models (LLMs), certain neurons can store distinct pieces of knowledge learned during pretraining. While factual knowledge typically appears as a combination of relations and entities, it remains unclear whether some neurons focus on a relation itself, independent of any entity. We hypothesize that such neurons detect a relation in the input text and guide generation involving that relation. To investigate this, we study the Llama-2 family on a chosen set of relations with a statistics-based method. Our experiments demonstrate the existence of relation-specific neurons. We measure the effect of selectively deactivating candidate neurons specific to relation r on the LLM’s ability to handle (1) facts involving relation r and (2) facts involving a different relation r' ≠ r. With respect to their capacity for encoding relation information, we give evidence for the following three properties of relation-specific neurons. (i) Neuron cumulativity: multiple neurons jointly contribute to processing facts involving relation r, with no single neuron fully encoding a fact in r on its own. (ii) Neuron versatility: neurons can be shared across multiple closely related as well as less related relations; in addition, some relation neurons transfer across languages. (iii) Neuron interference: deactivating neurons specific to one relation can improve LLMs’ factual recall performance for facts of other relations. We make our code and data publicly available at https://github.com/cisnlp/relation-specific-neurons.
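A minimal sketch of what "deactivating" candidate neurons amounts to in practice (the paper selects candidates with a statistics-based method; the indices and scoring function below are hypothetical placeholders):

```python
# Minimal illustration of neuron deactivation: zero out selected hidden units
# and compare behaviour with and without the intervention.
import numpy as np

def deactivate(hidden, neuron_ids):
    h = hidden.copy()
    h[..., neuron_ids] = 0.0   # ablate the candidate relation-specific units
    return h

rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 32))          # toy hidden states for 5 facts
candidates = [3, 17, 21]                   # hypothetical relation-specific neurons

score = lambda h: h.sum(axis=-1).mean()    # stand-in for a factual-recall metric
print("baseline:", score(hidden), "ablated:", score(deactivate(hidden, candidates)))
```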
pdf
bib
abs
IPIGuard: A Novel Tool Dependency Graph-Based Defense Against Indirect Prompt Injection in LLM Agents
Hengyu An
|
Jinghuai Zhang
|
Tianyu Du
|
Chunyi Zhou
|
Qingming Li
|
Tao Lin
|
Shouling Ji
Large language model (LLM) agents are widely deployed in real-world applications, where they leverage tools to retrieve and manipulate external data for complex tasks. However, when interacting with untrusted data sources (e.g., fetching information from public websites), tool responses may contain injected instructions that covertly influence agent behaviors and lead to malicious outcomes, a threat referred to as Indirect Prompt Injection (IPI). Existing defenses typically rely on advanced prompting strategies or auxiliary detection models. While these methods have demonstrated some effectiveness, they fundamentally rely on assumptions about the model’s inherent security, which lacks structural constraints on agent behaviors. As a result, agents still retain unrestricted access to tool invocations, leaving them vulnerable to stronger attack vectors that can bypass the security guardrails of the model. To prevent malicious tool invocations at the source, we propose a novel defensive task execution paradigm, called IPIGuard, which models the agents’ task execution process as a traversal over a planned Tool Dependency Graph (TDG). By explicitly decoupling action planning from interaction with external data, IPIGuard significantly reduces unintended tool invocations triggered by injected instructions, thereby enhancing robustness against IPI attacks. Experiments on the AgentDojo benchmark show that IPIGuard achieves a superior balance between effectiveness and robustness, paving the way for the development of safer agentic systems in dynamic environments.
pdf
bib
abs
ProtoVQA: An Adaptable Prototypical Framework for Explainable Fine-Grained Visual Question Answering
Xingjian Diao
|
Weiyi Wu
|
Keyi Kong
|
Peijun Qing
|
Xinwen Xu
|
Ming Cheng
|
Soroush Vosoughi
|
Jiang Gui
Visual Question Answering (VQA) is increasingly used in diverse applications ranging from general visual reasoning to safety-critical domains such as medical imaging and autonomous systems, where models must provide not only accurate answers but also explanations that humans can easily understand and verify. Prototype-based modeling has shown promise for interpretability by grounding predictions in semantically meaningful regions for purely visual reasoning tasks, yet remains underexplored in the context of VQA. We present ProtoVQA, a unified prototypical framework that (i) learns question-aware prototypes that serve as reasoning anchors, connecting answers to discriminative image regions, (ii) applies spatially constrained matching to ensure that the selected evidence is coherent and semantically relevant, and (iii) supports both answering and grounding tasks through a shared prototype backbone. To assess explanation quality, we propose the Visual–Linguistic Alignment Score (VLAS), which measures how well the model’s attended regions align with ground-truth evidence. Experiments on Visual7W show that ProtoVQA yields faithful, fine-grained explanations while maintaining competitive accuracy, advancing the development of transparent and trustworthy VQA systems.
pdf
bib
abs
SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs
Yuanyang Yin
|
Yaqi Zhao
|
Yajie Zhang
|
Yuanxing Zhang
|
Ke Lin
|
Jiahao Wang
|
Xin Tao
|
Pengfei Wan
|
Wentao Zhang
|
Feng Zhao
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities by integrating visual and textual inputs, yet modality alignment remains one of the most challenging aspects. Current MLLMs typically rely on simple adapter architectures and pretraining approaches to bridge vision encoders with large language models (LLMs), guided by image-level supervision. We find that this paradigm often leads to suboptimal alignment between modalities, significantly constraining the LLM’s ability to properly interpret and reason with visual features, particularly for smaller language models. To address this fundamental limitation, we propose Supervised Embedding Alignment (SEA), a token-level supervision alignment method that enables more precise visual-text alignment during pretraining. SEA introduces minimal computational overhead while preserving language capabilities and substantially improving cross-modal understanding. Our comprehensive analyses reveal critical insights into the adapter’s role in multimodal integration, and extensive experiments demonstrate that SEA consistently improves performance across various model sizes, with smaller models benefiting the most (average performance gain of 7.61% for Gemma-2B). This work establishes a foundation for developing more effective alignment strategies for future multimodal systems.
pdf
bib
abs
Molecular String Representation Preferences in Pretrained LLMs: A Comparative Study in Zero- & Few-Shot Molecular Property Prediction
George Arthur Baker
|
Mario Sanz-Guerrero
|
Katharina von der Wense
Large Language Models (LLMs) have demonstrated capabilities for natural language formulations of molecular property prediction tasks, but little is known about how performance depends on the representation of input molecules to the model; the status quo approach is to use SMILES strings, although alternative chemical notations convey molecular information differently, each with their own strengths and weaknesses. To learn more about molecular string representation preferences in LLMs, we compare the performance of four recent models—GPT-4o, Gemini 1.5 Pro, Llama 3.1 405b, and Mistral Large 2—on molecular property prediction tasks from the MoleculeNet benchmark across five different molecular string representations: SMILES, DeepSMILES, SELFIES, InChI, and IUPAC names. We find statistically significant zero- and few-shot preferences for InChI and IUPAC names, potentially due to representation granularity, favorable tokenization, and prevalence in pretraining corpora. This contradicts previous assumptions that molecules should be presented to LLMs as SMILES strings. When these preferences are taken advantage of, few-shot performance rivals or surpasses many previous conventional approaches to property prediction, with the advantage of explainable predictions through chain-of-thought reasoning not held by task-specific models.
pdf
bib
abs
Weight-Aware Activation Sparsity with Constrained Bayesian Optimization Scheduling for Large Language Models
Ming Wang
|
Miao Zhang
|
Xuebo Liu
|
Liqiang Nie
Activation sparsity provides a dynamic, input-dependent alternative to weight pruning for accelerating inference in large language models (LLMs), effectively reducing unnecessary computations and memory accesses during the forward pass. Despite its promise, existing activation sparsification methods suffer from two major limitations: (1) they rely solely on activation magnitude for sparsification, ignoring its coupling with the corresponding weights, and (2) they apply uniform sparsity rates across all blocks without considering block-wise sparsity sensitivity. To address these issues, this paper proposes a novel training-free weight-aware activation sparsity framework, called **WAS**. First, by analyzing the coupling relationship between weights and activations, we introduce a weight-aware scoring method to measure activation importance for sparsification. Then, a novel constrained Bayesian optimization algorithm is further devised to set a suitable sparsity ratio for each block based on its sparsity sensitivity. Finally, we implement a custom GPU sparsity kernel to support the resulting sparsity patterns for wall-clock decoding speed-ups. Our **WAS** achieves competitive performance at 60% model-level sparsity and significantly outperforms prior methods at higher sparsity levels, achieving up to a 1.68× inference speed-up with no retraining or weight updates. Code is available at https://github.com/HITSZ-Miao-Group/WAS.
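A conceptual sketch of one plausible reading of "weight-aware" scoring, namely weighting each activation by the norm of the weight column it feeds (the paper's exact score, ratios, and kernel are not reproduced here):

```python
# Illustrative weight-aware importance score for activation sparsification:
# rank each activation by |a_j| * ||W[:, j]||_2 instead of |a_j| alone, then
# keep the top-k entries. A conceptual sketch, not the paper's kernel.
import numpy as np

def weight_aware_sparsify(a, W, keep_ratio=0.4):
    """a: (d_in,) activations feeding a linear layer with weight W: (d_out, d_in)."""
    scores = np.abs(a) * np.linalg.norm(W, axis=0)   # couple activation and weight
    k = max(1, int(keep_ratio * a.shape[0]))
    keep = np.argsort(scores)[-k:]
    sparse_a = np.zeros_like(a)
    sparse_a[keep] = a[keep]
    return sparse_a

rng = np.random.default_rng(0)
a, W = rng.normal(size=64), rng.normal(size=(128, 64))
print(np.count_nonzero(weight_aware_sparsify(a, W)))  # 25 of 64 entries kept
```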
pdf
bib
abs
DatawiseAgent: A Notebook-Centric LLM Agent Framework for Adaptive and Robust Data Science Automation
Ziming You
|
Yumiao Zhang
|
Dexuan Xu
|
Yiwei Lou
|
Yandong Yan
|
Wei Wang
|
Huamin Zhang
|
Yu Huang
Existing large language model (LLM) agents for automating data science show promise, but they remain constrained by narrow task scopes, limited generalization across tasks and models, and over-reliance on state-of-the-art (SOTA) LLMs. We introduce DatawiseAgent, a notebook-centric LLM agent framework for adaptive and robust data science automation. Inspired by how human data scientists work in computational notebooks, DatawiseAgent introduces a unified interaction representation and a multi-stage architecture based on finite-state transducers (FSTs). This design enables flexible long-horizon planning, progressive solution development, and robust recovery from execution failures. Extensive experiments across diverse data science scenarios and models show that DatawiseAgent consistently achieves SOTA performance by surpassing strong baselines such as AutoGen and TaskWeaver, demonstrating superior effectiveness and adaptability. Further evaluations reveal graceful performance degradation under weaker or smaller models, underscoring the robustness and scalability.
pdf
bib
abs
VC4VG: Optimizing Video Captions for Text-to-Video Generation
Yang Du
|
Zhuoran Lin
|
Kaiqiang Song
|
Biao Wang
|
Zhicheng Zheng
|
Tiezheng Ge
|
Bo Zheng
|
Qin Jin
Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video captions specifically for T2V training remain underexplored. In this paper, we introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models. We begin by analyzing caption content from a T2V perspective, decomposing the essential elements required for video reconstruction into multiple dimensions, and proposing a principled caption design methodology. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements. Extensive T2V fine-tuning experiments demonstrate a strong correlation between improved caption quality and video generation performance, validating the effectiveness of our approach. We release all benchmark tools and code (https://github.com/qyr0403/VC4VG) to support further research.
pdf
bib
abs
LaMP-QA: A Benchmark for Personalized Long-form Question Answering
Alireza Salemi
|
Hamed Zamani
Personalization is essential for question answering systems that are user-centric. Despite its importance, personalization in answer generation has been relatively underexplored. This is mainly due to a lack of resources for training and evaluating personalized question answering systems. We address this gap by introducing LaMP-QA—a benchmark designed for evaluating personalized long-form answer generation. The benchmark covers questions from three major categories: (1) Arts & Entertainment, (2) Lifestyle & Personal Development, and (3) Society & Culture, encompassing over 45 subcategories in total. To assess the quality and potential impact of the LaMP-QA benchmark for personalized question answering, we conduct comprehensive human and automatic evaluations to compare multiple strategies for evaluating generated personalized responses and to measure their alignment with human preferences. Furthermore, we benchmark a number of non-personalized and personalized approaches based on open-source and proprietary large language models. Our results show that incorporating the provided personalized context leads to performance improvements of up to 39%. The benchmark is publicly released to support future research in this area.
pdf
bib
abs
The LLM Already Knows: Estimating LLM-Perceived Question Difficulty via Hidden Representations
Yubo Zhu
|
Dongrui Liu
|
Zecheng Lin
|
Wei Tong
|
Sheng Zhong
|
Jing Shao
Estimating the difficulty of input questions as perceived by large language models (LLMs) is essential for accurate performance evaluation and adaptive inference. Existing methods typically rely on repeated response sampling, auxiliary models, or fine-tuning the target model itself, which may incur substantial computational costs or compromise generality. In this paper, we propose a novel approach for difficulty estimation that leverages only the hidden representations produced by the target LLM. We model the token-level generation process as a Markov chain and define a value function to estimate the expected output quality given any hidden state. This allows for efficient and accurate difficulty estimation based solely on the initial hidden state, without generating any output tokens. Extensive experiments across both textual and multimodal tasks demonstrate that our method consistently outperforms existing baselines in difficulty estimation. Moreover, we apply our difficulty estimates to guide adaptive reasoning strategies, including Self-Consistency, Best-of-N, and Self-Refine, achieving higher inference efficiency with fewer generated tokens.
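A rough sketch of the underlying idea, a value head mapping an initial hidden state to expected answer quality so that difficulty can be read off without generating tokens (the paper's Markov-chain formulation and value function are not reproduced; the data and regressor below are toy placeholders):

```python
# Toy sketch: fit a value head on initial hidden states vs. observed answer
# quality, then use its prediction as a generation-free difficulty estimate.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
H = rng.normal(size=(500, 64))          # initial hidden states for past questions
quality = 1 / (1 + np.exp(-H[:, 0]))    # observed answer quality in [0, 1] (toy)

value_head = Ridge(alpha=1.0).fit(H, quality)

def difficulty(hidden_state):
    """Higher predicted quality means an easier question, as perceived by the model."""
    return 1.0 - float(value_head.predict(hidden_state.reshape(1, -1))[0])

print(round(difficulty(rng.normal(size=64)), 3))
```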
pdf
bib
abs
MCIP: Protecting MCP Safety via Model Contextual Integrity Protocol
Huihao Jing
|
Haoran Li
|
Wenbin Hu
|
Qi Hu
|
Xu Heli
|
Tianshu Chu
|
Peizhao Hu
|
Yangqiu Song
As the Model Context Protocol (MCP) introduces an easy-to-use ecosystem for users and developers, it also brings underexplored safety risks. Its decentralized architecture, which separates clients and servers, poses unique challenges for systematic safety analysis. This paper proposes a novel framework to enhance MCP safety. Guided by the MAESTRO framework, we first analyze the missing safety mechanisms in MCP, and based on this analysis, we propose the Model Contextual Integrity Protocol (MCIP), a refined version of MCP that addresses these gaps. Next, we develop a fine-grained taxonomy that captures a diverse range of unsafe behaviors observed in MCP scenarios. Building on this taxonomy, we develop a benchmark and training data that support the evaluation and improvement of LLMs’ capabilities in identifying safety risks within MCP interactions. Leveraging the proposed benchmark and training data, we conduct extensive experiments on state-of-the-art LLMs. The results highlight LLMs’ vulnerabilities in MCP interactions and demonstrate that our approach substantially improves their safety performance.
pdf
bib
abs
SAKI-RAG: Mitigating Context Fragmentation in Long-Document RAG via Sentence-level Attention Knowledge Integration
Wenyu Tao
|
Xiaofen Xing
|
Zeliang Li
|
Xiangmin Xu
Traditional Retrieval-Augmented Generation (RAG) frameworks often segment documents into larger chunks to preserve contextual coherence, inadvertently introducing redundant noise. Recent advanced RAG frameworks have shifted toward finer-grained chunking to improve precision. However, in long-document scenarios, such chunking methods lead to fragmented contexts, isolated chunk semantics, and broken inter-chunk relationships, making cross-paragraph retrieval particularly challenging. To address this challenge of maintaining granular chunks while recovering their intrinsic semantic connections, we propose **SAKI-RAG** (Sentence-level Attention Knowledge Integration Retrieval-Augmented Generation). Our framework introduces two core components: (1) the **SentenceAttnLinker**, which constructs a semantically enriched knowledge repository by modeling inter-sentence attention relationships, and (2) the **Dual-Axis Retriever**, which is designed to expand and filter the candidate chunks from the dual dimensions of semantic similarity and contextual relevance. Experimental results across four datasets (Dragonball, SQUAD, NFCORPUS, and SCI-DOCS) demonstrate that SAKI-RAG achieves better recall and precision compared to other RAG frameworks in long-document retrieval scenarios, while also exhibiting higher information efficiency.
pdf
bib
abs
Skeletons Matter: Dynamic Data Augmentation for Text-to-Query
Yuchen Ji
|
Bo Xu
|
Jie Shi
|
Jiaqing Liang
|
Deqing Yang
|
Yu Mao
|
Hai Chen
|
Yanghua Xiao
The task of translating natural language questions into query languages has long been a central focus in semantic parsing. Recent advancements in Large Language Models (LLMs) have significantly accelerated progress in this field. However, existing studies typically focus on a single query language, resulting in methods with limited generalizability across different languages. In this paper, we formally define the Text-to-Query task paradigm, unifying semantic parsing tasks across various query languages. We identify query skeletons as a shared optimization target of Text-to-Query tasks, and propose a general dynamic data augmentation framework that explicitly diagnoses model-specific weaknesses in handling these skeletons to synthesize targeted training data. Experiments on four Text-to-Query benchmarks demonstrate that our method achieves state-of-the-art performance using only a small amount of synthesized data, highlighting the efficiency and generality of our approach and laying a solid foundation for unified research on Text-to-Query tasks. We release our code at https://github.com/jjjycaptain/Skeletron
pdf
bib
abs
CondenseLM: LLMs-driven Text Dataset Condensation via Reward Matching
Cheng Shen
|
Yew-Soon Ong
|
Joey Tianyi Zhou
Dataset condensation has emerged as a promising technique to improve data efficiency under limited data budgets. However, when applied to the text level, existing methods struggle to compress more information into samples through optimization. Thus, these methods provide no obvious advantage over simpler coreset selection despite their high computational cost. In this paper, we introduce CondenseLM, a novel paradigm for both effective and efficient text-level dataset condensation. Our framework employs an LLMs-driven approach to sidestep the inherent limitations of existing methods, successfully generating more informative and less biased samples. In addition, it incorporates reward matching to align the LLMs-condensed dataset with the original dataset, maximizing representability and coverage. We conducted extensive experiments on SST-2, MNLI, AG News, and IMDB. Our approach outperforms both coreset selection and existing dataset condensation methods by large margins while also substantially reducing the computational cost.
pdf
bib
abs
MovieCORE: COgnitive REasoning in Movies
Gueter Josmy Faure
|
Min-Hung Chen
|
Jia-Fong Yeh
|
Ying Cheng
|
Hung-Ting Su
|
Yung-Hao Tang
|
Shang-Hong Lai
|
Winston H. Hsu
This paper introduces MovieCORE, a novel video question answering (VQA) dataset designed to probe deeper cognitive understanding of movie content. Unlike existing datasets that focus on surface-level comprehension, MovieCORE emphasizes questions that engage System-2 thinking while remaining specific to the video material. We present an innovative agentic brainstorming approach, utilizing multiple large language models (LLMs) as thought agents to generate and refine high-quality question-answer pairs. To evaluate dataset quality, we develop a set of cognitive tests assessing depth, thought-provocation potential, and syntactic complexity. We also propose a comprehensive evaluation scheme for assessing VQA model performance on deeper cognitive tasks. To address the limitations of existing video-language models (VLMs), we introduce an agentic enhancement module, Agentic Choice Enhancement (ACE), which improves model reasoning capabilities post-training by 25%. Our work contributes to advancing movie understanding in AI systems and provides valuable insights into the capabilities and limitations of current VQA models when faced with more challenging, nuanced questions about cinematic content. Our project page, dataset and code can be found at https://joslefaure.github.io/assets/html/moviecore.html.
pdf
bib
abs
Think Wider, Detect Sharper: Reinforced Reference Coverage for Document-Level Self-Contradiction Detection
Yuhao Chen
|
Yuanjie Lyu
|
Shuochen Liu
|
Chao Zhang
|
Junhui Lv
|
Tong Xu
Detecting self-contradictions within documents is a challenging task for ensuring textual coherence and reliability. While large language models (LLMs) have advanced in many natural language understanding tasks, document-level self-contradiction detection (DSCD) remains insufficiently studied. Recent approaches leveraging Chain-of-Thought (CoT) prompting aim to enhance reasoning and interpretability; however, they yield only marginal improvements and often introduce inconsistencies across repeated responses. We observe that such inconsistency arises from incomplete reasoning chains that fail to include all relevant contradictory sentences consistently. To address this, we propose a two-stage method that combines supervised fine-tuning (SFT) and reinforcement learning (RL) to enhance DSCD performance. In the SFT phase, a teacher model helps the model learn reasoning patterns, while RL further refines its reasoning ability. Our method incorporates a task-specific reward function to expand the model’s reasoning scope, boosting both accuracy and consistency. On the ContraDoc benchmark, our approach significantly boosts Llama 3.1-8B-Instruct’s accuracy from 38.5% to 51.1%, and consistency from 59.6% to 76.2%.
pdf
bib
abs
DRISHTIKON: A Multimodal Multilingual Benchmark for Testing Language Models’ Understanding on Indian Culture
Arijit Maji
|
Raghvendra Kumar
|
Akash Ghosh
|
Anushka
|
Nemil Shah
|
Abhilekh Borah
|
Vanshika Shah
|
Nishant Mishra
|
Sriparna Saha
We introduce DRISHTIKON, a first-of-its-kind multimodal and multilingual benchmark centered exclusively on Indian culture, designed to evaluate the cultural understanding of generative AI systems. Unlike existing benchmarks with a generic or global scope, DRISHTIKON offers deep, fine-grained coverage across India’s diverse regions, spanning 15 languages, covering all states and union territories, and incorporating over 64,000 aligned text-image pairs. The dataset captures rich cultural themes, including festivals, attire, cuisines, art forms, and historical heritage, among many others. We evaluate a wide range of vision-language models (VLMs), including open-source small and large models, proprietary systems, reasoning-specialized VLMs, and Indic-focused models, across zero-shot and chain-of-thought settings. Our results expose key limitations in current models’ ability to reason over culturally grounded, multimodal inputs, particularly for low-resource languages and less-documented traditions. DRISHTIKON fills a vital gap in inclusive AI research, offering a robust testbed to advance culturally aware, multimodally competent language technologies.
pdf
bib
abs
LingGym: How Far Are LLMs from Thinking Like Field Linguists?
Changbing Yang
|
Franklin Ma
|
Freda Shi
|
Jian Zhu
This paper introduces LingGym, a new benchmark that evaluates LLMs’ capacity for meta-linguistic reasoning using Interlinear Glossed Text (IGT) and grammatical descriptions extracted from 18 typologically diverse reference grammars. Unlike previous work that focuses on specific downstream tasks, we assess whether LLMs can generalize linguistic inference across low-resource languages and structures not seen during training. We present a controlled evaluation task: Word-Gloss Inference, in which the model must infer a missing word and gloss from context using varying levels of linguistic information (e.g., glosses, grammatical explanations, translations). Our results show that incorporating structured linguistic cues leads to consistent improvements in reasoning performance across all models. This work highlights both the promise and current limitations of using LLMs for typologically informed linguistic analysis and low-resource language documentation.
pdf
bib
abs
Learning from Few Samples: A Novel Approach for High-Quality Malcode Generation
Haijian Ma
|
Daizong Liu
|
Xiaowen Cai
|
Pan Zhou
|
Yulai Xie
Intrusion Detection Systems (IDS) play a crucial role in network security defense. However, a significant challenge for IDS in training detection models is the shortage of adequately labeled malicious samples. To address these issues, this paper introduces a novel semi-supervised framework GANGRL-LLM, which integrates Generative Adversarial Networks (GANs) with Large Language Models (LLMs) to enhance malicious code generation and SQL Injection (SQLi) detection capabilities in few-sample learning scenarios. Specifically, our framework adopts a collaborative training paradigm where: (1) the GAN-based discriminator improves malicious pattern recognition through adversarial learning with generated samples and limited real samples; and (2) the LLM-based generator refines the quality of malicious code synthesis using reward signals from the discriminator. The experimental results demonstrate that even with a limited number of labeled samples, our training framework is highly effective in enhancing both malicious code generation and detection capabilities. This dual enhancement capability offers a promising solution for developing adaptive defense systems capable of countering evolving cyber threats.
pdf
bib
abs
Personality Matters: User Traits Predict LLM Preferences in Multi-Turn Collaborative Tasks
Sarfaroz Yunusov
|
Kaige Chen
|
Kazi Nishat Anwar
|
Ali Emami
As Large Language Models (LLMs) increasingly integrate into everyday workflows, where users shape outcomes through multi-turn collaboration, a critical question emerges: do users with different personality traits systematically prefer certain LLMs over others? We conducted a study with 32 participants evenly distributed across four Keirsey personality types, evaluating their interactions with GPT-4 and Claude 3.5 across four collaborative tasks: data analysis, creative writing, information retrieval, and writing assistance. Results revealed significant personality-driven preferences: *Rationals* strongly preferred GPT-4, particularly for goal-oriented tasks, while *Idealists* favored Claude 3.5, especially for creative and analytical tasks. Other personality types showed task-dependent preferences. Sentiment analysis of qualitative feedback confirmed these patterns. Notably, aggregate helpfulness ratings were similar across models, showing how personality-based analysis reveals LLM differences that traditional evaluations miss.
pdf
bib
abs
VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search
Yiming Jia
|
Jiachen Li
|
Xiang Yue
|
Bo Li
|
Ping Nie
|
Kai Zou
|
Wenhu Chen
Vision-Language Models have made significant progress on many perception-focused tasks. However, their progress on reasoning-focused tasks remains limited due to the lack of high-quality and diverse training data. In this work, we aim to address the scarcity of reasoning-focused multimodal datasets. We propose VisualWebInstruct, a novel approach that leverages search engines to create a diverse and high-quality dataset spanning multiple disciplines, including mathematics, physics, finance, and chemistry, etc. Starting with a meticulously selected set of 30,000 seed images, we employ Google Image Search to identify websites containing similar images. We collect and process HTML data from over 700K unique URLs. Through a pipeline of content extraction, filtering, and synthesis, we construct a dataset of approximately 900K question-answer (QA) pairs, with 40% consisting of visual QA pairs and the remaining comprising text-based QA pairs. Models fine-tuned on VisualWebInstruct demonstrate significant performance improvements: (1) fine-tuning on Llava-OV results in 10-20 absolute points improvement across benchmarks, and (2) fine-tuning from MAmmoTH-VL yields a 5 absolute points gain across benchmarks. Our best model, MAmmoTH-VL2, achieves the best known performance with SFT without RL within the 10B parameter class on MMMU-Pro (40.7), MathVerse (42.6), and DynaMath (55.7). These results highlight the effectiveness of our dataset in enhancing the reasoning capabilities of vision-language models for complex multimodal tasks.
pdf
bib
abs
Thinking Out Loud: Do Reasoning Models Know When They’re Right?
Qingcheng Zeng
|
Weihao Xuan
|
Leyang Cui
|
Rob Voigt
Large reasoning models (LRMs) have recently demonstrated impressive capabilities in complex reasoning tasks by leveraging increased test-time computation and exhibiting behaviors reminiscent of human-like self-reflection. While LRMs show a clear capacity for valuable self-reflection, how this ability interacts with other model behaviors remains underexplored. We investigate this connection by analyzing verbalized confidence, how models articulate their certainty, as a lens into the nature of self-reflection in LRMs. We find that supervised fine-tuning on reasoning traces (i.e., distillation) and reinforcement learning can improve verbalized calibration in reasoning-intensive settings in a progressive, laddered fashion. However, our results also indicate that reasoning models may possess a diminished awareness of their own knowledge boundaries, as evidenced by significantly lower “I don’t know” response rates on factuality benchmarks. Moreover, we examine the relationship between verbalized confidence and reasoning chains, finding that models tend to express higher confidence when providing shorter or less elaborate reasoning. Our findings highlight how reasoning-oriented training can enhance performance in reasoning-centric tasks while potentially incurring a reasoning tax, a cost reflected in the model’s reduced ability to accurately recognize the limits of its own knowledge in small-scale models. More broadly, our work showcases how this erosion of knowledge boundaries can compromise model faithfulness, as models grow more confident without a commensurate understanding of when they should abstain.
pdf
bib
abs
Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models
Weihao Xuan
|
Qingcheng Zeng
|
Heli Qi
|
Junjue Wang
|
Naoto Yokoya
Uncertainty quantification is essential for assessing the reliability and trustworthiness of modern AI systems. Among existing approaches, verbalized uncertainty, where models express their confidence through natural language, has emerged as a lightweight and interpretable solution in large language models (LLMs). However, its effectiveness in vision-language models (VLMs) remains insufficiently studied. In this work, we conduct a comprehensive evaluation of verbalized confidence in VLMs, spanning three model categories, four task domains, and three evaluation scenarios. Our results show that current VLMs often display notable miscalibration across diverse tasks and settings. Notably, visual reasoning models (i.e., thinking with images) consistently exhibit better calibration, suggesting that modality-specific reasoning is critical for reliable uncertainty estimation. To further address calibration challenges, we introduce Visual Confidence-Aware Prompting, a two-stage prompting strategy that improves confidence alignment in multimodal settings. Overall, our study highlights the inherent miscalibration in VLMs across modalities. More broadly, our findings underscore the fundamental importance of modality alignment and model faithfulness in advancing reliable multimodal systems.
pdf
bib
abs
Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs
Mengqi Liao
|
Xiangyu Xi
|
Chen Ruinian
|
Jia Leng
|
Yangen Hu
|
Ke Zeng
|
Shuai Liu
|
Huaiyu Wan
Reasoning large language models (LLMs) excel in complex tasks, which has drawn significant attention to reinforcement learning (RL) for LLMs. However, existing approaches allocate an equal number of rollouts to all questions during the RL process, which is inefficient. This inefficiency stems from the fact that training on simple questions yields limited gains, whereas more rollouts are needed for challenging questions to sample correct answers. Furthermore, while RL improves response precision, it limits the model’s exploration ability, potentially resulting in a performance cap below that of the base model prior to RL. To address these issues, we propose a mechanism for dynamically allocating rollout budgets based on the difficulty of the problems, enabling more efficient RL training. Additionally, we introduce an adaptive dynamic temperature adjustment strategy to maintain entropy at a stable level, thereby encouraging sufficient exploration. This enables LLMs to improve response precision while preserving their exploratory ability to uncover potential correct pathways. The code and data are available at: https://anonymous.4open.science/r/E3-RL4LLMs-DB28
pdf
bib
abs
LLM Bias Detection and Mitigation through the Lens of Desired Distributions
Ingroj Shrestha
|
Padmini Srinivasan
Although prior work on bias mitigation has focused on promoting social equality and demographic parity, less attention has been given to aligning LLMs’ outputs with desired distributions. For example, we might want to align a model with real-world distributions to support factual grounding. Thus, we define bias as deviation from a desired distribution, which may be an equal or real-world distribution, depending on application goals. We propose a weighted adaptive loss-based fine-tuning method that aligns an LLM’s gender–profession output distribution with the desired distribution, while preserving language modeling capability. Using 3 profession sets—male-dominated, female-dominated, and gender-balanced—derived from U.S. labor statistics (2024), we assess both our adaptive method for reflecting reality and a non-adaptive variant for equality. Across three masked language models, bias is observed under both distributions. We achieve near-complete mitigation under equality and 30–75% reduction under real-world settings. Autoregressive LLMs show no bias under equality but notable bias under real-world settings, with the Llama Instruct models (3.2-3B, 3.1-8B) achieving a 50–62% reduction.
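A conceptual sketch of one alignment penalty in this spirit, a weighted KL term pulling the model's per-profession gender distribution toward a desired (equal or real-world) target; the weights, values, and combination with the language-modeling loss are hypothetical, not the paper's exact objective:

```python
# Conceptual sketch of a weighted distribution-alignment penalty (illustrative only).
import numpy as np

def kl(p, q, eps=1e-9):
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def alignment_loss(model_dist, desired_dist, weights):
    """Dicts: profession -> [p(female), p(male)]; weights: profession -> float."""
    return sum(weights[prof] * kl(model_dist[prof], desired_dist[prof])
               for prof in model_dist)

model_dist   = {"nurse": [0.92, 0.08], "engineer": [0.15, 0.85]}
desired_dist = {"nurse": [0.50, 0.50], "engineer": [0.50, 0.50]}  # equality target
weights      = {"nurse": 1.0, "engineer": 1.0}
print(round(alignment_loss(model_dist, desired_dist, weights), 3))
```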
pdf
bib
abs
MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering
Teng Lin
|
Yuyu Luo
|
Honglin Zhang
|
Jicheng Zhang
|
Chunlin Liu
|
Kaishun Wu
|
Nan Tang
Cross-Document Multi-entity question answering (MEQA) demands the integration of scattered information across documents to resolve complex queries involving entities, relationships, and contextual dependencies. Although Large Language Models (LLMs) and Retrieval-augmented Generation (RAG) systems show promise, their performance on cross-document MEQA remains underexplored due to the absence of tailored benchmarks. To address this gap, we introduce MEBench, a scalable multi-document, multi-entity benchmark designed to systematically evaluate LLMs’ capacity to retrieve, consolidate, and reason over scattered and dense information. Our benchmark comprises 4,780 questions which are systematically categorized into three primary categories: Comparative Reasoning, Statistical Reasoning and Relational Reasoning, further divided into eight distinct types, ensuring broad coverage of real-world multi-entity reasoning scenarios. Our experiments on state-of-the-art LLMs reveal critical limitations: even advanced models achieve only 59% accuracy on MEBench. Our benchmark emphasizes the importance of completeness and factual precision of information extraction in MEQA tasks, using Entity-Attributed F1 (EA-F1) metric for granular evaluation of entity-level correctness and attribution validity. MEBench not only highlights systemic weaknesses in current LLM frameworks but also provides a foundation for advancing robust, entity-aware QA architectures.
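A hedged sketch of an entity-attributed F1 computation in the spirit of EA-F1, comparing predicted (entity, attribute, value) triples against gold ones; the exact matching rules used in MEBench may differ:

```python
# Illustrative set-F1 over (entity, attribute, value) triples (not MEBench's exact metric).
def ea_f1(predicted_pairs, gold_pairs):
    pred, gold = set(predicted_pairs), set(gold_pairs)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = {("Marie Curie", "Nobel Prizes", "2"), ("Albert Einstein", "Nobel Prizes", "1")}
pred = {("Marie Curie", "Nobel Prizes", "2"), ("Albert Einstein", "Nobel Prizes", "2")}
print(round(ea_f1(pred, gold), 2))  # 0.5
```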
pdf
bib
abs
POSITION BIAS MITIGATES POSITION BIAS: Mitigate Position Bias Through Inter-Position Knowledge Distillation
Yifei Wang
|
Feng Xiong
|
Yong Wang
|
Linjing Li
|
Xiangxiang Chu
|
Daniel Dajun Zeng
Positional bias (PB), manifesting as non-uniform sensitivity across different contextual locations, significantly impairs long-context comprehension and processing capabilities. Previous studies have addressed PB either by modifying the underlying architectures or by employing extensive contextual awareness training. However, the former approach fails to effectively eliminate the substantial performance disparities, while the latter imposes significant data and computational overhead. To address PB effectively, we introduce Pos2Distill, a position-to-position knowledge distillation framework. Pos2Distill transfers the superior capabilities from advantageous positions to less favorable ones, thereby reducing the large performance gaps. The conceptual principle is to leverage the inherent, position-induced disparity to counteract PB itself. We identify distinct manifestations of PB under retrieval and reasoning paradigms, thereby designing two specialized instantiations: Pos2Distill-R1 and Pos2Distill-R2, respectively, both grounded in this core principle. By employing the Pos2Distill approach, we achieve enhanced uniformity and significant performance gains across all contextual positions in long-context retrieval and reasoning tasks. Crucially, both specialized systems exhibit strong mutual cross-task generalization, while achieving superior performance on their respective tasks.
pdf
bib
abs
MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation
Weihao Xuan
|
Rui Yang
|
Heli Qi
|
Qingcheng Zeng
|
Yunze Xiao
|
Aosong Feng
|
Dairui Liu
|
Yun Xing
|
Junjue Wang
|
Fan Gao
|
Jinghui Lu
|
Yuang Jiang
|
Huitao Li
|
Xin Li
|
Kunyu Yu
|
Ruihai Dong
|
Shangding Gu
|
Yuekang Li
|
Xiaofei Xie
|
Felix Juefei-Xu
|
Foutse Khomh
|
Osamu Yoshie
|
Qingyu Chen
|
Douglas Teodoro
|
Nan Liu
|
Randy Goebel
|
Lei Ma
|
Edison Marrese-Taylor
|
Shijian Lu
|
Yusuke Iwasawa
|
Yutaka Matsuo
|
Irene Li
Existing large language model (LLM) evaluation benchmarks primarily focus on English, while current multilingual tasks lack parallel questions that specifically assess cross-lingual reasoning abilities. This dual limitation makes it challenging to assess LLMs’ performance in the multilingual setting comprehensively. To fill this gap, we introduce MMLU-ProX, a comprehensive benchmark covering 29 languages, built on an English benchmark. Each language version consists of 11,829 identical questions, enabling direct cross-lingual comparisons. Additionally, to meet efficient evaluation needs, we provide a lite version containing 658 questions per language. To ensure the high quality of MMLU-ProX, we employ a rigorous development process that involves multiple powerful LLMs for translation, followed by expert review to ensure accurate expression, consistent terminology, and cultural relevance. Building on this, we systematically evaluate 36 state-of-the-art LLMs, including reasoning-enhanced and multilingual-optimized LLMs. The results reveal significant disparities in the multilingual capabilities of LLMs: While they perform well in high-resource languages, their performance declines markedly in low-resource languages, particularly for African languages. Through MMLU-ProX, we aim to advance the development of more inclusive AI systems and promote equitable access to technology across global contexts.
pdf
bib
abs
NL-Debugging: Exploiting Natural Language as an Intermediate Representation for Code Debugging
Weiming Zhang
|
Qingyao Li
|
Xinyi Dai
|
Jizheng Chen
|
Kounianhua Du
|
Weiwen Liu
|
Yasheng Wang
|
Ruiming Tang
|
Yong Yu
|
Weinan Zhang
Debugging is a critical aspect of LLM’s coding ability. Early debugging efforts primarily focused on code-level analysis, which often falls short when addressing complex programming errors that require a deeper understanding of algorithmic logic. Recent advancements in large language models (LLMs) have shifted attention toward leveraging natural language reasoning to enhance code-related tasks. However, two fundamental questions remain unanswered: What type of natural language format is most effective for debugging tasks? And what specific benefits does natural language reasoning bring to the debugging process? In this paper, we introduce NL-DEBUGGING, a novel framework that employs natural language as an intermediate representation to improve code debugging. By debugging at a natural language level, we demonstrate that NL-DEBUGGING outperforms traditional debugging methods and enables a broader modification space through direct refinement guided by execution feedback. Our findings highlight the potential of natural language reasoning to advance automated code debugging and address complex programming challenges.
pdf
bib
abs
Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD
Bryan Chen Zhengyu Tan
|
Daniel Wai Kit Chin
|
Zhengyuan Liu
|
Nancy F. Chen
|
Roy Ka-Wei Lee
Large Language Models (LLMs) can struggle to balance gullibility to misinformation and resistance to valid corrections in persuasive dialogues, a critical challenge for reliable deployment. We introduce **DuET-PD** (**Du**al **E**valuation for **T**rust in **P**ersuasive **D**ialogues), a framework evaluating multi-turn stance-change dynamics across dual dimensions: persuasion type (corrective/misleading) and domain (knowledge via MMLU-Pro, and safety via SALAD-Bench). We find that even a state-of-the-art model like GPT-4o achieves only 27.32% accuracy in MMLU-Pro under sustained misleading persuasions. Moreover, results reveal a concerning trend of increasing sycophancy in newer open-source models. To address this, we introduce Holistic DPO, a training approach balancing positive and negative persuasion examples. Unlike prompting or resist-only training, Holistic DPO enhances both robustness to misinformation and receptiveness to corrections, improving Llama-3.1-8B-Instruct’s accuracy under misleading persuasion in safety contexts from 4.21% to 76.54%. These contributions offer a pathway to developing more reliable and adaptable LLMs for multi-turn dialogue. Code is available at https://github.com/Social-AI-Studio/DuET-PD.
pdf
bib
abs
POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion
Yuan Liu
|
Zhongyin Zhao
|
Le Tian
|
Haicheng Wang
|
Xubing Ye
|
Yangxiu You
|
Zilin Yu
|
Chuhan Wu
|
Zhou Xiao
|
Yang Yu
|
Jie Zhou
High-quality labeled data is essential for training accurate document conversion models, particularly in domains with complex formats such as tables, formulas, and multi-column text. However, manual annotation is both costly and time-consuming, while automatic labeling using existing models often lacks accuracy in handling such challenging scenarios. Consequently, training student models by distilling outputs from teacher models can significantly limit their performance in real-world applications. In this paper, we propose a fully automated, distillation-free framework comprising two stages for constructing high-quality document extraction datasets and models capable of handling diverse document formats and layouts. In the first stage, we introduce a method for generating large-scale, diverse synthetic data, which enables a model to extract key elements in a unified format with strong initial performance. In the second stage, we present a self-improvement approach that further adapts the model, initially trained on synthetic data, to real-world documents. Specifically, we first use the fine-tuned model to annotate real documents, then apply a suite of filtering strategies to verify annotation quality, and finally retrain the model on the verified dataset. By iteratively repeating this process, we progressively enhance both the model’s conversion capabilities and the quality of the generated data. We train a public POINTS-1.5 model to obtain POINTS-Reader, which surpasses many existing public and proprietary models of comparable or larger size. Our model will be made publicly available.
pdf
bib
abs
Large Language Models for Automated Literature Review: An Evaluation of Reference Generation, Abstract Writing, and Review Composition
Xuemei Tang
|
Xufeng Duan
|
Zhenguang Cai
Large language models (LLMs) have emerged as a potential solution for automating the complex processes involved in writing literature reviews, such as literature collection, organization, and summarization. However, it remains unclear how well LLMs can automate comprehensive and reliable literature reviews. This study introduces a framework to automatically evaluate the performance of LLMs in three key tasks of literature review writing: reference generation, abstract writing, and literature review composition. We introduce multidimensional evaluation metrics that assess the hallucination rates in generated references and measure the semantic coverage and factual consistency of the literature summaries and compositions against human-written counterparts. The experimental results reveal that even the most advanced models still generate hallucinated references, despite recent progress. Moreover, we observe that the performance of different models varies across disciplines when it comes to writing literature reviews. These findings highlight the need for further research and development to improve the reliability of LLMs in automating academic literature reviews. The dataset and code used in this study are publicly available in our GitHub repository.
pdf
bib
abs
CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs
Nafiseh Nikeghbal
|
Amir Hossein Kargaran
|
Jana Diesner
Improvements in model construction, including fortified safety guardrails, allow large language models (LLMs) to increasingly pass standard safety checks. However, LLMs sometimes slip into revealing harmful behavior, such as expressing racist viewpoints, during conversations. To analyze this systematically, we introduce CoBia, a suite of lightweight adversarial attacks that allow us to refine the scope of conditions under which LLMs depart from normative or ethical behavior in conversations. CoBia creates a constructed conversation in which the model utters a biased claim about a social group. We then evaluate whether the model can recover from the fabricated bias claim and reject biased follow-up questions. We evaluate 11 open-source as well as proprietary LLMs for their outputs related to six socio-demographic categories that are relevant to individual safety and fair treatment, i.e., gender, race, religion, nationality, sexual orientation, and others. Our evaluation is based on established LLM-based bias metrics, and we compare the results against human judgments to scope out the LLMs’ reliability and alignment. The results suggest that purposefully constructed conversations reliably reveal bias amplification and that LLMs often fail to reject biased follow-up questions during dialogue. This form of stress-testing highlights deeply embedded biases that can be surfaced through interaction. Code and artifacts are available at https://github.com/nafisenik/CoBia.
pdf
bib
abs
From Schema to State: Zero-Shot Scheme-Only Dialogue State Tracking via Diverse Synthetic Dialogue and Step-by-Step Distillation
Huan Xu
|
Zequn Li
|
Wen Tang
|
Jian Jun Zhang
Dialogue State Tracking (DST) is crucial for linking user intentions to appropriate services in task-oriented dialogue systems. We propose a zero-shot, scheme-only approach that tackles two main challenges: generating synthetic dialogues that balance diversity with schema alignment, and efficiently distilling knowledge from a large language model (LLM) into a smaller model. Our pipeline first creates scenarios, dialogue logic flows, and utterances via dynamic complexity prompting, eliminating reliance on handcrafted templates. We then use a two-stage distillation process to learn formalized dialogue representations and DST-related chain-of-thought reasoning. This structure preserves interpretive capabilities while reducing inference overhead. Experiments on the MultiWOZ benchmark show that our method achieves state-of-the-art performance in the zero-shot, scheme-only setting and generalizes effectively to few-shot scenarios, offering a practical and scalable solution for domains lacking real data.
pdf
bib
abs
Beyond the Surface: Measuring Self-Preference in LLM Judgments
Zhi-Yuan Chen
|
Hao Wang
|
Xinyu Zhang
|
Enrui Hu
|
Yankai Lin
Recent studies show that large language models (LLMs) exhibit self-preference bias when serving as judges, meaning they tend to favor their own responses over those generated by other models. Existing methods typically measure this bias by calculating the difference between the scores a judge model assigns to its own responses and those it assigns to responses from other models. However, this approach conflates self-preference bias with response quality, as higher-quality responses from the judge model may also lead to positive score differences, even in the absence of bias. To address this issue, we introduce gold judgments as proxies for the actual quality of responses and propose the DBG score, which measures self-preference bias as the difference between the scores assigned by the judge model to its own responses and the corresponding gold judgments. Since gold judgments reflect true response quality, the DBG score mitigates the confounding effect of response quality on bias measurement. Using the DBG score, we conduct comprehensive experiments to assess self-preference bias across LLMs of varying versions, sizes, and reasoning abilities. Additionally, we investigate two factors that influence and help alleviate self-preference bias: response text style and the post-training data of judge models. Finally, we explore potential underlying mechanisms of self-preference bias from an attention-based perspective. Our code and data are available at https://github.com/zhiyuanc2001/self-preference.
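A minimal sketch of the bias computation described above, assuming the judge's self-scores and the gold judgments are available as parallel lists (the paper's exact aggregation and scoring scale may differ):

```python
def dbg_score(judge_scores, gold_scores):
    """Sketch of a DBG-style self-preference measure.

    judge_scores: scores the judge model assigns to its own responses.
    gold_scores:  gold judgments (proxy for true quality) for the same responses.
    A positive value means the judge rates its own responses above their
    gold-judged quality, i.e. self-preference bias.
    """
    assert len(judge_scores) == len(gold_scores)
    diffs = [j - g for j, g in zip(judge_scores, gold_scores)]
    return sum(diffs) / len(diffs)

# Toy example: the judge gives itself 8-9 while gold judgments average lower.
print(dbg_score([8, 8, 9], [7, 7, 8]))  # 1.0 -> positive bias
```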
pdf
bib
abs
Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders
Dong Shu
|
Xuansheng Wu
|
Haiyan Zhao
|
Mengnan Du
|
Ninghao Liu
Sparse Autoencoders (SAEs) have recently emerged as powerful tools for interpreting and steering the internal representations of large language models (LLMs). However, conventional approaches to analyzing SAEs typically rely solely on input-side activations, without considering the influence between each latent feature and the model’s output. This work is built on two key hypotheses: (1) activated latents do not contribute equally to the construction of the model’s output, and (2) only latents with high influence are effective for model steering. To validate these hypotheses, we propose Gradient Sparse Autoencoder (GradSAE), a simple yet effective method that identifies the most influential latents by incorporating output-side gradient information.
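A toy sketch of the output-side idea, assuming influence is scored as the product of each latent's activation and the gradient of a scalar output with respect to that latent; the modules and scoring rule here are illustrative stand-ins, not the paper's implementation:

```python
import torch

torch.manual_seed(0)

d_model, d_latent = 16, 64
encoder = torch.nn.Linear(d_model, d_latent)   # stand-in SAE encoder
decoder = torch.nn.Linear(d_latent, d_model)   # stand-in SAE decoder
readout = torch.nn.Linear(d_model, 1)          # stand-in for a downstream output

x = torch.randn(1, d_model)
z = torch.relu(encoder(x))          # latent activations (input side)
z.retain_grad()
y = readout(decoder(z)).sum()       # scalar proxy for the model's output
y.backward()

influence = (z * z.grad).abs().squeeze(0)      # activation x output-gradient
top_latents = influence.topk(5).indices        # most influential latents
print(top_latents.tolist())
```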
pdf
bib
abs
Utility-Focused LLM Annotation for Retrieval and Retrieval-Augmented Generation
Hengran Zhang
|
Minghao Tang
|
Keping Bi
|
Jiafeng Guo
|
Shihao Liu
|
Daiting Shi
|
Dawei Yin
|
Xueqi Cheng
This paper explores the use of large language models (LLMs) for annotating document utility in training retrieval and retrieval-augmented generation (RAG) systems, aiming to reduce dependence on costly human annotations. We address the gap between retrieval relevance and generative utility by employing LLMs to annotate document utility. To effectively utilize multiple positive samples per query, we introduce a novel loss that maximizes their summed marginal likelihood. Using the Qwen-2.5-32B model, we annotate utility on the MS MARCO dataset and conduct retrieval experiments on MS MARCO and BEIR, as well as RAG experiments on MS MARCO QA, NQ, and HotpotQA. Our results show that LLM-generated annotations enhance out-of-domain retrieval performance and improve RAG outcomes compared to models trained solely on human annotations or downstream QA metrics. Furthermore, combining LLM annotations with just 20% of human labels achieves performance comparable to using full human annotations. Our study offers a comprehensive approach to utilizing LLM annotations for initializing QA systems on new corpora.
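One plausible reading of the multi-positive loss, sketched under the assumption that it places a softmax over all candidates and maximizes the total probability mass assigned to the utility-positive set (the paper's exact formulation may differ):

```python
import torch

def summed_marginal_likelihood_loss(scores, positive_mask):
    """Sketch of a multi-positive retrieval loss (assumed form).

    scores: (batch, n_docs) similarity scores between a query and candidates.
    positive_mask: (batch, n_docs) boolean mask of utility-positive documents.
    """
    log_probs = torch.log_softmax(scores, dim=-1)
    # log of the summed probability of all positives per query
    pos_mass = torch.logsumexp(log_probs.masked_fill(~positive_mask, float("-inf")), dim=-1)
    return -pos_mass.mean()

scores = torch.tensor([[2.0, 0.5, 1.5, -1.0]])
mask = torch.tensor([[True, False, True, False]])
print(summed_marginal_likelihood_loss(scores, mask))
```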
pdf
bib
abs
CiteBART: Learning to Generate Citations for Local Citation Recommendation
Ege Yiğit Çelik
|
Selma Tekir
Local citation recommendation (LCR) suggests a set of papers for a citation placeholder within a given context. This paper introduces CiteBART, citation-specific pre-training within an encoder-decoder architecture, where author-date citation tokens are masked and the model learns to reconstruct them to fulfill LCR. The global version (CiteBART-Global) extends the local context with the citing paper’s title and abstract to enrich the learning signal. CiteBART-Global achieves state-of-the-art performance on LCR benchmarks except for the FullTextPeerRead dataset, which is too small to show the advantage of generative pre-training. The effect is significant in the larger benchmarks, e.g., Refseer and ArXiv, with the Refseer pre-trained model emerging as the best-performing model. We perform comprehensive experiments, including an ablation study, a qualitative analysis, and a taxonomy of hallucinations with detailed statistics. Our analyses confirm that CiteBART-Global has cross-dataset generalization capability; the macro hallucination rate (MaHR) at the top-3 predictions is 4%, and when the ground truth is in the top-k prediction list, the hallucination tendency in the other predictions drops significantly. We publicly share our code, base datasets, global datasets, and pre-trained models to support reproducibility.
pdf
bib
abs
Autoformalization in the Wild: Assessing LLMs on Real-World Mathematical Definitions
Lan Zhang
|
Marco Valentino
|
Andre Freitas
Thanks to their linguistic capabilities, LLMs offer an opportunity to bridge the gap between informal mathematics and formal languages through autoformalization. However, it is still unclear how well LLMs generalize to sophisticated and naturally occurring mathematical statements. To address this gap, we investigate the task of autoformalizing real-world mathematical definitions: a critical component of mathematical discourse. Specifically, we introduce two novel resources for autoformalization, collecting definitions from Wikipedia (Def_Wiki) and arXiv papers (Def_ArXiv). We then systematically evaluate a range of LLMs, analyzing their ability to formalize definitions into Isabelle/HOL. Furthermore, we investigate strategies to enhance LLMs’ performance, including refinement through external feedback from proof assistants, and formal definition grounding, where we augment LLMs’ formalizations with relevant contextual elements from formal mathematical libraries. Our findings reveal that definitions present a greater challenge than existing benchmarks such as miniF2F. In particular, we found that LLMs still struggle with self-correction and with aligning to relevant mathematical libraries. At the same time, structured refinement methods and definition grounding strategies yield notable improvements of up to 16% in self-correction capability and a 43% reduction in undefined errors, highlighting promising directions for enhancing LLM-based autoformalization in real-world scenarios.
pdf
bib
abs
Culture Cartography: Mapping the Landscape of Cultural Knowledge
Caleb Ziems
|
William Barr Held
|
Jane Yu
|
Amir Goldberg
|
David Grusky
|
Diyi Yang
To serve global users safely and productively, LLMs need culture-specific knowledge that might not be learned during pre-training. How do we find knowledge that is (1) salient to in-group users, but (2) unknown to LLMs? The most common solutions are single-initiative: either researchers define challenging questions that users passively answer (traditional annotation), or users actively produce data that researchers structure as benchmarks (knowledge extraction). The process would benefit from mixed-initiative collaboration, where users guide the process to meaningfully reflect their cultures, and LLMs steer the process to meet the researcher’s goals. We propose Culture Cartography as a methodology that operationalizes this mixed-initiative vision. Here, an LLM initializes annotation with questions for which it has low-confidence answers, making explicit both its prior knowledge and the gaps therein. This allows a human respondent to fill these gaps and steer the model towards salient topics through direct edits. We implement Culture Cartography as a tool called Culture Explorer. Compared to a baseline where humans answer LLM-proposed questions, we find that Culture Explorer more effectively produces knowledge that strong models like DeepSeek R1, Llama-4 and GPT-4o are missing, even with web search. Fine-tuning on this data boosts the accuracy of Llama models by up to 19.2% on related culture benchmarks.
pdf
bib
abs
Interpretability Analysis of Arithmetic In-Context Learning in Large Language Models
Gregory Polyakov
|
Christian Hepting
|
Carsten Eickhoff
|
Seyed Ali Bahrainian
Large language models (LLMs) exhibit sophisticated behavior, notably solving arithmetic with only a few in-context examples (ICEs). Yet the computations that connect those examples to the answer remain opaque. We probe four open-weight LLMs, Pythia-12B, Llama-3.1-8B, MPT-7B, and OPT-6.7B, on basic arithmetic to illustrate how they process ICEs. Our study integrates activation patching, information-flow analysis, automatic circuit discovery, and the logit-lens perspective into a unified pipeline. Within this framework we isolate partial-sum representations in three-operand tasks, investigate their influence on final logits, and derive linear function vectors that characterize tasks and align with ICE-induced activations. Controlled ablations show that strict pattern consistency in the formatting of ICEs guides the models more strongly than the symbols chosen or even the factual correctness of the examples. By unifying four complementary interpretability tools, this work delivers one of the most comprehensive interpretability studies of LLM arithmetic to date, and the first on three-operand tasks. Our code is publicly available.
pdf
bib
abs
SwarmAgentic: Towards Fully Automated Agentic System Generation via Swarm Intelligence
Yao Zhang
|
Chenyang Lin
|
Shijie Tang
|
Haokun Chen
|
Shijie Zhou
|
Yunpu Ma
|
Volker Tresp
The rapid progress of Large Language Models has advanced agentic systems in decision-making, coordination, and task execution. Yet, existing agentic system generation frameworks lack full autonomy, missing from-scratch agent generation, self-optimizing agent functionality, and collaboration, limiting adaptability and scalability. We propose **SwarmAgentic**, the *first framework that fully automates agentic system generation, optimization, and collaboration*, constructing agents from scratch and jointly refining functionality and coordination via language-driven exploration. To enable efficient search over system-level structures, SwarmAgentic maintains a population of candidate systems and evolves them via feedback-guided updates, drawing inspiration from Particle Swarm Optimization (PSO). We evaluate our method on six real-world, open-ended, and exploratory tasks involving high-level planning, system-level coordination, and creative reasoning. Given only a task description and an objective function, SwarmAgentic outperforms all baselines, achieving a **+261.8% relative improvement** over ADAS on the TravelPlanner benchmark, highlighting the effectiveness of full automation in structurally unconstrained tasks. This framework marks a significant step toward scalable and autonomous agentic system design, bridging swarm intelligence with fully automated multi-agent system generation.
pdf
bib
abs
We Politely Insist: Your LLM Must Learn the Persian Art of Taarof
Nikta Gohari Sadr
|
Sahar Heidariasl
|
Karine Megerdoomian
|
Laleh Seyyed-Kalantari
|
Ali Emami
Large language models (LLMs) struggle to navigate culturally specific communication norms, limiting their effectiveness in global contexts. We focus on Persian *taarof*, a social norm in Iranian interactions, which is a sophisticated system of ritual politeness that emphasizes deference, modesty, and indirectness, yet remains absent from existing cultural benchmarks. We introduce **TaarofBench**, the first benchmark for evaluating LLM understanding of taarof, comprising 450 role-play scenarios covering 12 common social interaction topics, validated by native speakers. Our evaluation of five frontier LLMs reveals substantial gaps in cultural competence, with accuracy rates 40-48% below native speakers when taarof is culturally appropriate. Performance varies across interaction topics, improves with Persian-language prompts, and exhibits gender-based asymmetries. We also show that responses rated “polite” by standard metrics often violate taarof norms, indicating the limitations of Western politeness frameworks. Through supervised fine-tuning and Direct Preference Optimization, we achieve 21.8% and 42.3% improvements in model alignment with cultural expectations. Our human study with 33 participants (11 native Persian, 11 heritage, and 11 non-Iranian speakers) establishes baselines across varying degrees of familiarity with Persian norms. This work lays the foundation for developing diverse and culturally aware LLMs, enabling applications that better navigate complex social interactions.
pdf
bib
abs
Unstructured Evidence Attribution for Long Context Query Focused Summarization
Dustin Wright
|
Zain Muhammad Mujahid
|
Lu Wang
|
Isabelle Augenstein
|
David Jurgens
Large language models (LLMs) are capable of generating coherent summaries from very long contexts given a user query, and extracting and citing evidence spans helps improve the trustworthiness of these summaries. Whereas previous work has focused on evidence citation with fixed levels of granularity (e.g. sentence, paragraph, document, etc.), we propose to extract unstructured (i.e., spans of any length) evidence in order to acquire more relevant and consistent evidence than in the fixed granularity case. We show how existing systems struggle to copy and properly cite unstructured evidence, which also tends to be “lost-in-the-middle”. To help models perform this task, we create the Summaries with Unstructured Evidence Text dataset (SUnsET), a synthetic dataset generated using a novel pipeline, which can be used as training supervision for unstructured evidence summarization. We demonstrate across 5 LLMs and 4 datasets spanning human written, synthetic, single, and multi-document settings that LLMs adapted with SUnsET generate more relevant and factually consistent evidence with their summaries, extract evidence from more diverse locations in their context, and can generate more relevant and consistent summaries than baselines with no fine-tuning and fixed granularity evidence. We release SUnsET and our generation code to the public (https://github.com/dwright37/unstructured-evidence-sunset).
pdf
bib
abs
RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language
Subrata Biswas
|
Mohammad Nur Hossain Khan
|
Bashima Islam
Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns scalar relevance scores to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline comprising unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning - each stage targeting a distinct challenge in multi-modal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch. To support training and evaluation, we release AVS-QA, a dataset of 300K synchronized Audio-Video-Sensor streams paired with automatically generated question-answer pairs. Experimental results on seven multi-modal QA benchmarks - including egocentric and exocentric tasks - show that RAVEN achieves up to 14.5% and 8.0% gains in accuracy compared to state-of-the-art multi-modal large language models, respectively. Incorporating sensor data provides an additional 16.4% boost, and the model remains robust under modality corruption, outperforming SOTA baselines by 50.23%. Our code and dataset are available at https://github.com/BASHLab/RAVEN.
pdf
bib
abs
Cache-of-Thought: Master-Apprentice Framework for Cost-Effective Vision Language Model Reasoning
Mingyuan Wu
|
Jize Jiang
|
Haozhen Zheng
|
Meitang Li
|
Zhaoheng Li
|
Beitong Tian
|
Bo Chen
|
Yongjoo Park
|
Minjia Zhang
|
ChengXiang Zhai
|
Klara Nahrstedt
Vision Language Models (VLMs) have achieved remarkable success in a wide range of vision applications of increasing complexity and scales, yet choosing the right VLM model size involves a trade-off between response quality and cost. While smaller VLMs are cheaper to run, they typically produce responses only marginally better than random guessing on benchmarks such as MMMU. In this paper, we propose Cache of Thought (CoT), a master–apprentice framework for collaborative inference between large and small VLMs. CoT manages high-quality query results from large VLMs (master) in a cache, which are then selected via a novel multi-modal retrieval and in-context learning to aid the performance of small VLMs (apprentice). We extensively evaluate CoT on various widely-recognized and challenging general reasoning benchmarks, and show that CoT increases overall reasoning performance by up to 7.7% under the same budget, and specifically boosts the reasoning performance of apprentice VLMs by up to 36.6%. Our code is available at https://github.com/UIUC-MONET/Cache-of-Thoughts.
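A simplified sketch of the caching-and-retrieval step, assuming single-vector embeddings and cosine similarity as a stand-in for the paper's multi-modal retrieval; the names and data structures here are hypothetical:

```python
import numpy as np

# Hypothetical cache of (query_embedding, query, master_answer) triples.
cache = []

def add_to_cache(embedding, query, master_answer):
    cache.append((np.asarray(embedding, dtype=float), query, master_answer))

def retrieve_examples(query_embedding, k=2):
    """Return the k cached master results most similar to a new query."""
    q = np.asarray(query_embedding, dtype=float)
    sims = [float(q @ e / (np.linalg.norm(q) * np.linalg.norm(e) + 1e-9))
            for e, _, _ in cache]
    order = np.argsort(sims)[::-1][:k]
    return [(cache[i][1], cache[i][2]) for i in order]

add_to_cache([1.0, 0.0], "What is in the image? (chart)", "A bar chart of sales.")
add_to_cache([0.0, 1.0], "What is in the image? (animal)", "A cat on a sofa.")
# Retrieved pairs would serve as in-context examples for the small (apprentice) VLM.
print(retrieve_examples([0.9, 0.1], k=1))
```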
pdf
bib
abs
Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models
Xuyang Liu
|
Yiyu Wang
|
Junpeng Ma
|
Linfeng Zhang
Video large language models (VideoLLMs) excel at video understanding, but face efficiency challenges due to the quadratic complexity of abundant visual tokens. Our systematic analysis of token compression methods for VideoLLMs reveals two critical issues: (i) overlooking distinctive visual signals across frames, leading to information loss; (ii) suffering from implementation constraints, causing incompatibility with modern architectures or efficient operators. To address these challenges, we distill three design principles for VideoLLM token compression and propose a plug-and-play inference acceleration framework, “Video Compression Commander” (VidCom2). By quantifying each frame’s uniqueness, VidCom2 adaptively adjusts compression intensity across frames, effectively preserving essential information while reducing redundancy in video sequences. Extensive experiments across various VideoLLMs and benchmarks demonstrate the superior performance and efficiency of our VidCom2. With only 25% of the visual tokens, VidCom2 achieves 99.6% of the original performance on LLaVA-OV while reducing LLM generation latency by 70.8%. Notably, our Frame Compression Adjustment strategy is compatible with other token compression methods to further improve their performance. Our code is available at https://github.com/xuyang-liu16/VidCom2.
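A rough sketch of the frame-uniqueness idea, under the assumption that uniqueness is measured as distance from the mean frame feature and that the token budget is allocated proportionally (VidCom2's actual scoring may differ):

```python
import numpy as np

def frame_token_budgets(frame_features, total_budget):
    """Allocate a per-frame visual-token budget by uniqueness (sketch).

    frame_features: (n_frames, dim) pooled features, one row per frame.
    Frames far from the average frame are treated as more unique and keep
    more tokens; this scoring rule is an assumption for illustration.
    """
    f = np.asarray(frame_features, dtype=float)
    f = f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-9)
    mean = f.mean(axis=0, keepdims=True)
    uniqueness = 1.0 - (f * mean).sum(axis=1)       # 1 - cosine similarity to mean
    weights = uniqueness / (uniqueness.sum() + 1e-9)
    return np.maximum(1, np.round(weights * total_budget)).astype(int)

features = np.random.default_rng(0).normal(size=(4, 8))
print(frame_token_budgets(features, total_budget=64))
```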
pdf
bib
abs
Router-Tuning: A Simple and Effective Approach for Dynamic Depth
Shwai He
|
Tao Ge
|
Guoheng Sun
|
Bowei Tian
|
Xiaoyang Wang
|
Dong Yu
The Mixture of Depths (MoD) was introduced to improve computational efficiency by dynamically skipping less important layers, reducing redundant computation while maintaining model capacity. Despite its promise, existing MoD approaches remain under-explored and face two main challenges: (1) high training costs due to the need to train the entire model along with the routers that determine which layers to skip, and (2) performance degradation when important layers are bypassed. In response to the first issue, we propose Router-Tuning, which fine-tunes only the routers on a small dataset, drastically reducing the computational overhead associated with full model training. For the second challenge, we investigate Router-Tuning across different architectures and granularities, demonstrating its effectiveness on attention layers and MoE layers. This method preserves the model’s performance while significantly enhancing computational and memory efficiency. Extensive experiments demonstrate that our approach delivers competitive results while dramatically improving computational efficiency, e.g., a 21% speedup with only a 0.2% performance drop. The code will be released upon acceptance.
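A minimal sketch of the router-only fine-tuning setup, assuming a scalar gate per block that can suppress its contribution; the block and router shapes are illustrative, not the paper's architecture:

```python
import torch
import torch.nn as nn

class RoutedBlock(nn.Module):
    """Transformer-style block with a tiny router deciding how much to run it."""
    def __init__(self, d_model):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                  nn.Linear(4 * d_model, d_model))
        self.router = nn.Linear(d_model, 1)   # the only part we fine-tune

    def forward(self, x):
        # gate in (0, 1); values near 0 effectively skip the block
        gate = torch.sigmoid(self.router(x.mean(dim=1, keepdim=True)))  # (B, 1, 1)
        return x + gate * self.body(x)

model = nn.ModuleList([RoutedBlock(32) for _ in range(4)])

# Router-Tuning: freeze everything except the routers.
for block in model:
    for name, p in block.named_parameters():
        p.requires_grad = name.startswith("router")

x = torch.randn(2, 5, 32)
for block in model:
    x = block(x)
print(x.shape, sum(p.requires_grad for b in model for p in b.parameters()))
```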
pdf
bib
abs
Foot-In-The-Door: A Multi-turn Jailbreak for LLMs
Zixuan Weng
|
Xiaolong Jin
|
Jinyuan Jia
|
Xiangyu Zhang
Ensuring AI safety is crucial as large language models become increasingly integrated into real-world applications. A key challenge is jailbreaking, where adversarial prompts bypass built-in safeguards to elicit harmful, disallowed outputs. Inspired by the psychological foot-in-the-door principle, we introduce FITD, a novel multi-turn jailbreak method that leverages the phenomenon whereby minor initial commitments lower resistance to more significant or more unethical transgressions. Our approach progressively escalates the malicious intent of user queries through intermediate bridge prompts and uses the model’s own responses to steer it toward toxic outputs. Extensive experimental results on two jailbreak benchmarks demonstrate that FITD achieves an average attack success rate of 94% across seven widely used models, outperforming existing state-of-the-art methods. Additionally, we provide an in-depth analysis of LLM self-corruption, highlighting vulnerabilities in current alignment strategies and emphasizing the risks inherent in multi-turn interactions. The code is available at https://github.com/Jinxiaolong1129/Foot-in-the-door-Jailbreak.
pdf
bib
abs
TurnaboutLLM: A Deductive Reasoning Benchmark from Detective Games
Yuan Yuan
|
Muyu He
|
Muhammad Adil Shahid
|
Ziyang Li
|
Jiani Huang
|
Li Zhang
This paper introduces TurnaboutLLM, a novel framework and dataset for evaluating the deductive reasoning abilities of Large Language Models (LLMs) by leveraging the interactive gameplay of the detective games Ace Attorney and Danganronpa. The framework tasks LLMs with identifying contradictions between testimonies and evidence within long narrative contexts, a challenging task due to the large answer space and diverse reasoning types presented by its questions. We evaluate twelve state-of-the-art LLMs on the dataset, hinting at limitations of popular strategies for enhancing deductive reasoning such as extensive thinking and Chain-of-Thought prompting. The results also suggest varying effects of context size, reasoning steps, and answer space size on model performance. Overall, TurnaboutLLM presents a substantial challenge for LLMs’ deductive reasoning abilities in complex, narrative-rich environments.
pdf
bib
abs
Transferable Direct Prompt Injection via Activation-Guided MCMC Sampling
Minghui Li
|
Hao Zhang
|
Yechao Zhang
|
Wei Wan
|
Shengshan Hu
|
Pei Xiaobing
|
Jing Wang
Direct Prompt Injection (DPI) attacks pose a critical security threat to Large Language Models (LLMs) due to their low barrier of execution and high potential damage. To address the impracticality of existing white-box/gray-box methods and the poor transferability of black-box methods, we propose an activation-guided prompt injection attack framework. We first construct an Energy-based Model (EBM) using activations from a surrogate model to evaluate the quality of adversarial prompts. Guided by the trained EBM, we employ token-level Markov Chain Monte Carlo (MCMC) sampling to adaptively optimize adversarial prompts, thereby enabling gradient-free black-box attacks. Experimental results demonstrate superior cross-model transferability, achieving a 49.6% attack success rate (ASR) across five mainstream LLMs, a 34.6% improvement over human-crafted prompts, and a maintained 36.6% ASR on unseen task scenarios. Interpretability analysis reveals a correlation between activations and attack effectiveness, highlighting the critical role of semantic patterns in transferable vulnerability exploitation.
pdf
bib
abs
Direct Judgement Preference Optimization
PeiFeng Wang
|
Austin Xu
|
Yilun Zhou
|
Caiming Xiong
|
Shafiq Joty
To meet the increasing need for timely and accurate evaluation of large language model (LLM) responses, training LLM-as-judges to evaluate and critique other model responses has emerged as a popular paradigm. However, existing judge models are largely trained with supervised finetuning (SFT) on small data scales to perform limited types of evaluation tasks, fundamentally limiting generalization. To meet the need for strong, generalized judge models, we explore training foundational judge models at large data scales (680K) with direct preference optimization (DPO). Using four training tasks, we form three types of DPO preference pairs targeting different aspects of evaluation: generating meaningful critiques, making accurate judgements, and understanding what comprises good and bad responses. To demonstrate the effectiveness of our method, we train judge models of three sizes: 8B, 12B, and 70B parameters, and evaluate on a comprehensive suite of 13 benchmarks (7 pairwise, 4 single rating, and 2 classification). Our models achieve the best aggregate performance, with even our 8B model outperforming GPT-4o on pairwise benchmarks. Further analysis shows that our judge models produce factual and actionable critiques and serve as strong foundational judges for continued finetuning.
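For reference, the standard DPO objective that such preference pairs would be trained with, sketched on a single batch of chosen/rejected judgments (the paper may combine this with additional task-specific terms):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective on one batch of preference pairs (sketch).

    logp_*     : summed log-probs of the chosen / rejected output under the policy.
    ref_logp_* : the same quantities under the frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()

print(dpo_loss(torch.tensor([-5.0]), torch.tensor([-9.0]),
               torch.tensor([-6.0]), torch.tensor([-8.0])))
```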
pdf
bib
abs
WebInject: Prompt Injection Attack to Web Agents
Xilong Wang
|
John Bloch
|
Zedian Shao
|
Yuepeng Hu
|
Shuyan Zhou
|
Neil Zhenqiang Gong
Multi-modal large language model (MLLM)-based web agents interact with webpage environments by generating actions based on screenshots of the webpages. In this work, we propose WebInject, a prompt injection attack that manipulates the webpage environment to induce a web agent to perform an attacker-specified action. Our attack adds a perturbation to the raw pixel values of the rendered webpage. After these perturbed pixels are mapped into a screenshot, the perturbation induces the web agent to perform the attacker-specified action. We formulate the task of finding the perturbation as an optimization problem. A key challenge in solving this problem is that the mapping between raw pixel values and screenshot is non-differentiable, making it difficult to backpropagate gradients to the perturbation. To overcome this, we train a neural network to approximate the mapping and apply projected gradient descent to solve the reformulated optimization problem. Extensive evaluation on multiple datasets shows that WebInject is highly effective and significantly outperforms baselines.
pdf
bib
abs
F²Bench: An Open-ended Fairness Evaluation Benchmark for LLMs with Factuality Considerations
Tian Lan
|
Jiang Li
|
Yemin Wang
|
Xu Liu
|
Xiangdong Su
|
Guanglai Gao
With the growing adoption of large language models (LLMs) in NLP tasks, concerns about their fairness have intensified. Yet, most existing fairness benchmarks rely on closed-ended evaluation formats, which diverge from real-world open-ended interactions. These formats are prone to position bias and introduce a “minimum score” effect, where models can earn partial credit simply by guessing. Moreover, such benchmarks often overlook factuality considerations rooted in historical, social, physiological, and cultural contexts, and rarely account for intersectional biases. To address these limitations, we propose F²Bench: an open-ended fairness evaluation benchmark for LLMs that explicitly incorporates factuality considerations. F²Bench comprises 2,568 instances across 10 demographic groups and two open-ended tasks. By integrating text generation, multi-turn reasoning, and factual grounding, F²Bench aims to more accurately reflect the complexities of real-world model usage. We conduct a comprehensive evaluation of several LLMs across different series and parameter sizes. Our results reveal that all models exhibit varying degrees of fairness issues. We further compare open-ended and closed-ended evaluations, analyze model-specific disparities, and provide actionable recommendations for future model development. Our code and dataset are publicly available at https://github.com/VelikayaScarlet/F2Bench.
pdf
bib
abs
Value Profiles for Encoding Human Variation
Taylor Sorensen
|
Pushkar Mishra
|
Roma Patel
|
Michael Henry Tessler
|
Michiel A. Bakker
|
Georgina Evans
|
Iason Gabriel
|
Noah Goodman
|
Verena Rieser
Modelling human variation in rating tasks is crucial for enabling AI systems for personalization, pluralistic model alignment, and computational social science. We propose representing individuals using value profiles – natural language descriptions of underlying values compressed from in-context demonstrations – along with a steerable decoder model to estimate ratings conditioned on a value profile or other rater information. To measure the predictive information in rater representations, we introduce an information-theoretic methodology. We find that demonstrations contain the most information, followed by value profiles and then demographics. However, value profiles offer advantages in terms of scrutability, interpretability, and steerability due to their compressed natural language format. Value profiles effectively compress the useful information from demonstrations (70% information preservation). Furthermore, clustering value profiles to identify similarly behaving individuals better explains rater variation than the most predictive demographic groupings. Going beyond test set performance, we show that the decoder models interpretably change ratings according to semantic profile differences, are well-calibrated, and can help explain instance-level disagreement by simulating an annotator population. These results demonstrate that value profiles offer novel, predictive ways to describe individual variation beyond demographics or group information.
pdf
bib
abs
Language Models as Causal Effect Generators
Lucius E.j. Bynum
|
Kyunghyun Cho
In this work, we present sequence-driven structural causal models (SD-SCMs), a framework for specifying causal models with user-defined structure and language-model-defined mechanisms. We characterize how an SD-SCM enables sampling from observational, interventional, and counterfactual distributions according to the desired causal structure. We then leverage this procedure to propose a new type of benchmark for causal inference methods, generating individual-level counterfactual data to test treatment effect estimation. We create an example benchmark consisting of thousands of datasets, and test a suite of popular estimation methods for average, conditional average, and individual treatment effect estimation. We find under this benchmark that (1) causal methods outperform non-causal methods and that (2) even state-of-the-art methods struggle with individualized effect estimation, suggesting this benchmark captures some inherent difficulties in causal estimation. Apart from generating data, this same technique can underpin the auditing of language models for (un)desirable causal effects, such as misinformation or discrimination. We believe SD-SCMs can serve as a useful tool in any application that would benefit from sequential data with controllable causal structure.
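A toy sketch of how an SD-SCM-style sampler might look, with a stub in place of the language model and a hypothetical three-variable graph; interventions simply override a variable before its descendants are generated:

```python
def lm(prompt):
    """Stand-in for a language model call; a real SD-SCM would query an LLM."""
    return f"<text conditioned on: {prompt}>"

# User-defined causal structure: variable -> parents (hypothetical example).
dag = {"age": [], "treatment": ["age"], "outcome": ["age", "treatment"]}

def sample(dag, interventions=None):
    interventions = interventions or {}
    values = {}
    for var in dag:                        # assumes dict order is topological
        if var in interventions:           # do(var := value)
            values[var] = interventions[var]
        else:
            parent_text = ", ".join(f"{p}={values[p]}" for p in dag[var])
            values[var] = lm(f"Generate {var} given {parent_text or 'nothing'}")
    return values

observational = sample(dag)
interventional = sample(dag, interventions={"treatment": "drug A"})
print(observational["outcome"])
print(interventional["outcome"])
```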
pdf
bib
abs
Constructions are Revealed in Word Distributions
Joshua Rozner
|
Leonie Weissweiler
|
Kyle Mahowald
|
Cory Shain
Construction grammar posits that constructions, or form-meaning pairings, are acquired through experience with language (the distributional learning hypothesis). But how much information about constructions does this distribution actually contain? Corpus-based analyses provide some answers, but text alone cannot answer counterfactual questions about what caused a particular word to occur. This requires computable models of the distribution over strings—namely, pretrained language models (PLMs). Here, we treat a RoBERTa model as a proxy for this distribution and hypothesize that constructions will be revealed within it as patterns of statistical affinity. We support this hypothesis experimentally: many constructions are robustly distinguished, including (i) hard cases where semantically distinct constructions are superficially similar, as well as (ii) schematic constructions, whose “slots” can be filled by abstract word classes. Despite this success, we also provide qualitative evidence that statistical affinity alone may be insufficient to identify all constructions from text. Thus, statistical affinity is likely an important, but partial, signal available to learners.
pdf
bib
abs
CodeMixBench: Evaluating Code-Mixing Capabilities of LLMs Across 18 Languages
Yilun Yang
|
Yekun Chai
Code-mixing, the practice of switching between languages within a conversation, poses unique challenges for traditional NLP. Existing benchmarks like LinCE and GLUECoS are limited by their narrow language pairs and tasks, failing to adequately assess large language models’ (LLMs) code-mixing abilities. Despite the recognized importance of code-mixing for multilingual users, research on LLMs in this context remains sparse. Additionally, current techniques for synthesizing code-mixed data remain underdeveloped. In response, we introduce CodeMixBench, a comprehensive benchmark covering eight tasks, including three specific to LLMs and five traditional NLP tasks, and 18 languages from seven language families. We also propose a new method for generating large-scale synthetic code-mixed texts by combining word substitution with GPT-4 prompting. Our evaluation reveals consistent underperformance of LLMs on code-mixed datasets involving different language families. Enhancements in training data size, model scale, and few-shot learning could improve their performance. The code and dataset are available at https://github.com/Jeromeyluck/CodeMixBench.
pdf
bib
abs
RBPtool: A Deep Language Model Framework for Multi-Resolution RBP-RNA Binding Prediction and RNA Molecule Design
Jiyue Jiang
|
Yitao Xu
|
Zikang Wang
|
Yihan Ye
|
Yanruisheng Shao
|
Yuheng Shan
|
Jiuming Wang
|
Xiaodan Fan
|
Jiao Yuan
|
Yu Li
RNA-binding proteins (RBPs) play essential roles in post-transcriptional gene regulation by recognizing specific RNA molecules and modulating several key physiological processes in cellulo, such as alternative splicing and RNA degradation. Despite extensive research, most existing approaches still rely on superficial sequence features or coarse structural representations, limiting their ability to capture the intricate nature of RBP-RNA interactions. The recent surge in large language models (LLMs), combined with advances in geometric deep learning for extracting three-dimensional representations, enables the integration of multi-modal, multi-scale biological data for precise modeling and biologically informed de novo RNA design. In this work, we curate and extend RPI15223 into a multi-resolution, structure-level RBP-RNA dataset, and introduce RBPtool, a multi-task, multi-resolution framework that combines a geometric vector perception (GVP) module with a deep language model encoder to fuse sequence and structural information. Our tool achieves state-of-the-art performance on public benchmarks and the RPI15223 dataset, while also supporting fine-grained predictions and enabling de novo RNA design through a generative module conditioned on protein, cell type, and species. RBPtool provides a fast and versatile platform for both fundamental RBP-RNA research and practical RNA drug design, delivering enhanced predictive accuracy and fine-grained structural insights.
pdf
bib
abs
Unveiling Internal Reasoning Modes in LLMs: A Deep Dive into Latent Reasoning vs. Factual Shortcuts with Attribute Rate Ratio
Yiran Yang
|
Haifeng Sun
|
Jingyu Wang
|
Qi Qi
|
Zirui Zhuang
|
Huazheng Wang
|
Pengfei Ren
|
Jing Wang
|
Jianxin Liao
Existing research on multi-hop question answering has identified two reasoning modes: latent reasoning and factual shortcuts, but has not deeply investigated how these modes differ during inference. This gap affects both model generalization and downstream reasoning tasks. In this work, we systematically examine these distinctions and propose a simple and efficient classification metric, the Attribute Rate Ratio (ARR). First, we construct specialized datasets corresponding to the two reasoning modes based on our proposed criteria. Then, using reverse engineering methods, including attention knockout and logit lens techniques, we reveal that subject representations differ significantly across modes: latent reasoning encodes bridge-related information for final answer extraction, while factual shortcuts bypass intermediate reasoning and resemble single-hop factual queries. Finally, our proposed ARR achieves around 90% accuracy on our datasets and demonstrates effectiveness in RAG conflict scenarios, showing that model behavior under conflicting prompts is closely tied to its underlying reasoning mode. Our findings and proposed metric have significant potential for advancing LLM development and applications.
pdf
bib
abs
SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models
Zirui He
|
Mingyu Jin
|
Bo Shen
|
Ali Payani
|
Yongfeng Zhang
|
Mengnan Du
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but controlling their behavior reliably remains challenging, especially in open-ended generation settings. This paper introduces a novel supervised steering approach that operates in sparse, interpretable representation spaces. We employ sparse autoencoders (SAEs) to obtain sparse latent representations that aim to disentangle semantic attributes from model activations. Then we train linear classifiers to identify a small subspace of task-relevant dimensions in latent representations. Finally, we learn supervised steering vectors constrained to this subspace, optimized to align with target behaviors. Experiments across sentiment, truthfulness, and politics polarity steering tasks with multiple LLMs demonstrate that our supervised steering vectors achieve higher success rates with minimal degradation in generation quality compared to existing methods. Further analysis reveals that a notably small subspace is sufficient for effective steering, enabling more targeted and interpretable interventions.
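A small numpy sketch of the three steps described above (probe, subspace selection, constrained steering vector); the probe, subspace size, and class-mean steering direction are simplifying assumptions rather than the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 512))               # SAE latents for 200 prompts
y = (Z[:, 7] + Z[:, 42] > 0).astype(float)    # toy "target behavior" label

# 1) Linear probe: full-batch gradient descent for logistic regression.
w = np.zeros(512)
for _ in range(200):
    p = 1 / (1 + np.exp(-(Z @ w)))
    w -= 0.1 * Z.T @ (p - y) / len(y)

# 2) Task-relevant subspace: latents with the largest probe weights.
subspace = np.argsort(-np.abs(w))[:8]

# 3) Steering vector constrained to that subspace
#    (difference of class means used here as a stand-in for the learned vector).
v = np.zeros(512)
v[subspace] = Z[y == 1][:, subspace].mean(axis=0) - Z[y == 0][:, subspace].mean(axis=0)
steered_latents = Z[0] + 2.0 * v              # added to latents at inference time
print(sorted(subspace.tolist()))
```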
pdf
bib
abs
BabyLM’s First Constructions: Causal interventions provide a signal of learning
Joshua Rozner
|
Leonie Weissweiler
|
Cory Shain
Construction grammar posits that language learners acquire constructions (form-meaning pairings) from the statistics of their environment. Recent work supports this hypothesis by showing sensitivity to constructions in pretrained language models (PLMs), including one recent study (Rozner et al., 2025) demonstrating that constructions shape RoBERTa’s output distribution. However, models under study have generally been trained on developmentally implausible amounts of data, casting doubt on their relevance to human language learning. Here we use Rozner et al.’s methods to evaluate construction learning in masked language models from the 2024 BabyLM Challenge. Our results show that even when trained on developmentally plausible quantities of data, models learn diverse constructions, even hard cases that are superficially indistinguishable. We further find correlational evidence that constructional performance may be functionally relevant: models that better represent constructions perform better on the BabyLM benchmarks.
pdf
bib
abs
Effective Red-Teaming of Policy-Adherent Agents
Itay Nakash
|
George Kour
|
Koren Lazar
|
Matan Vetzler
|
Guy Uziel
|
Ateret Anaby Tavor
Task-oriented LLM-based agents are increasingly used in domains with strict policies, such as refund eligibility or cancellation rules. The challenge lies in ensuring that the agent consistently adheres to these rules and policies, appropriately refusing any request that would violate them, while still maintaining a helpful and natural interaction. This calls for the development of tailored design and evaluation methodologies to ensure agent resilience against malicious user behavior. We propose a novel threat model that focuses on adversarial users aiming to exploit policy-adherent agents for personal benefit. To address this, we present CRAFT, a multi-agent red-teaming system that leverages policy-aware persuasive strategies to undermine a policy-adherent agent in a customer-service scenario, outperforming conventional jailbreak methods such as DAN prompts, emotional manipulation, and coercion. Building upon the existing Tau-bench benchmark, we introduce Tau-break, a complementary benchmark designed to rigorously assess the agent’s robustness against manipulative user behavior. Finally, we evaluate several straightforward yet effective defense strategies. While these measures provide some protection, they fall short, highlighting the need for stronger, research-driven safeguards to protect policy-adherent agents from adversarial attacks.
pdf
bib
abs
CondAmbigQA: A Benchmark and Dataset for Conditional Ambiguous Question Answering
Zongxi Li
|
Yang Li
|
Haoran Xie
|
S. Joe Qin
Users often assume that large language models (LLMs) share their understanding of context and intent, leading them to omit critical information in question answering (QA) and to produce ambiguous queries. Responses based on misaligned assumptions may be perceived as hallucinations. Therefore, identifying possible implicit assumptions is crucial in QA. To address this fundamental challenge, we propose Conditional Ambiguous Question-Answering (CondAmbigQA), a benchmark comprising 2,000 ambiguous queries and condition-aware evaluation metrics. Our study pioneers “conditions” as explicit contextual constraints that resolve ambiguities in QA tasks through retrieval-based annotation, where retrieved Wikipedia fragments help identify possible interpretations for a given query and annotate answers accordingly. Experiments demonstrate that models considering conditions before answering improve answer accuracy by 11.75%, with an additional 7.15% gain when conditions are explicitly provided. These results highlight that apparent hallucinations may stem from inherent query ambiguity rather than model failure, and demonstrate the effectiveness of condition reasoning in QA, providing researchers with tools for rigorous evaluation.
pdf
bib
abs
SafeScientist: Enhancing AI Scientist Safety for Risk-Aware Scientific Discovery
Kunlun Zhu
|
Jiaxun Zhang
|
Ziheng Qi
|
Nuoxing Shang
|
Zijia Liu
|
Peixuan Han
|
Yue Su
|
Haofei Yu
|
Jiaxuan You
Recent advancements in large language model (LLM) agents have significantly accelerated scientific discovery automation, yet concurrently raised critical ethical and safety concerns. To systematically address these challenges, we introduce **SafeScientist**, an innovative AI scientist framework explicitly designed to enhance safety and ethical responsibility in AI-driven scientific exploration. SafeScientist proactively refuses ethically inappropriate or high-risk tasks and rigorously emphasizes safety throughout the research process. To achieve comprehensive safety oversight, we integrate multiple defensive mechanisms, including prompt monitoring, agent-collaboration monitoring, tool-use monitoring, and an ethical reviewer component. Complementing SafeScientist, we propose **SciSafetyBench**, a novel benchmark specifically designed to evaluate AI safety in scientific contexts, comprising 240 high-risk scientific tasks across 6 domains, alongside 30 specially designed scientific tools and 120 tool-related risk tasks. Extensive experiments demonstrate that SafeScientist significantly improves safety performance by 35% compared to traditional AI scientist frameworks, without compromising scientific output quality. Additionally, we rigorously validate the robustness of our safety pipeline against diverse adversarial attack methods, further confirming the effectiveness of our integrated approach. The code and data will be available at https://github.com/ulab-uiuc/SafeScientist. **Warning**: this paper contains example data that may be offensive or harmful.
pdf
bib
abs
Improving Informally Romanized Language Identification
Adrian Benton
|
Alexander Gutkin
|
Christo Kirov
|
Brian Roark
The Latin script is often used to informally write languages with non-Latin native scripts. In many cases (e.g., most languages in India), the lack of conventional spelling in the Latin script results in high spelling variability. Such romanization renders languages that are normally easily distinguished due to being written in different scripts – Hindi and Urdu, for example – highly confusable. In this work, we increase language identification (LID) accuracy for romanized text by improving the methods used to synthesize training sets. We find that training on synthetic samples which incorporate natural spelling variation yields higher LID system accuracy than including available naturally occurring examples in the training set, or even training higher capacity models. We demonstrate new state-of-the-art LID performance on romanized text from 20 Indic languages in the Bhasha-Abhijnaanam evaluation set (Madhani et al., 2023a), improving test F1 from the reported 74.7% (using a pretrained neural model) to 85.4% using a linear classifier trained solely on synthetic data and 88.2% when also training on available harvested text.
pdf
bib
abs
Integral Transformer: Denoising Attention, Not Too Much Not Too Little
Ivan Kobyzev
|
Abbas Ghaddar
|
Dingtao Hu
|
Boxing Chen
Softmax self-attention often assigns disproportionate weight to semantically uninformative tokens such as punctuation and special tokens, a phenomenon known as attention noise. While recent methods like Cog Attention and the Differential Transformer have addressed this by introducing negative attention scores, they risk discarding useful information. In this paper, we propose the Integral Transformer, a novel self-attention mechanism that denoises attention by integrating signals sampled from the logit distribution. This approach mitigates noise while preserving the contributions of special tokens critical for model performance. Extensive experiments demonstrate that our model outperforms vanilla, Cog, and Differential attention variants on rigorous knowledge and reasoning benchmarks. Moreover, our analysis reveals that employing vanilla self-attention in the lower Transformer layers enhances performance and that the Integral Transformer more effectively balances attention distributions and reduces rank collapse in upper layers.
pdf
bib
abs
CHENGYU-BENCH: Benchmarking Large Language Models for Chinese Idiom Understanding and Use
Yicheng Fu
|
Zhemin Huang
|
Liuxin Yang
|
Yumeng Lu
|
Zhongdongming Dai
Chinese idioms (成语, Chengyu) are concise four-character expressions steeped in history and culture, whose literal translations often fail to capture their full meaning. This complexity makes them challenging for language models to interpret and use correctly. Existing benchmarks focus on narrow tasks—multiple-choice cloze tests, isolated translation, or simple paraphrasing. We introduce CHENGYU-BENCH, a comprehensive benchmark featuring three tasks: (1) Evaluative Connotation, classifying idioms as positive or negative; (2) Appropriateness, detecting incorrect idiom usage in context; and (3) Open Cloze, filling blanks in longer passages without options. CHENGYU-BENCH comprises 2,937 human-verified examples covering 1,765 common idioms sourced from diverse corpora. We evaluate leading LLMs and find they achieve over 95% accuracy on Evaluative Connotation, but only ~85% on Appropriateness and ~40% top-1 accuracy in Open Cloze. Error analysis reveals that most mistakes arise from fundamental misunderstandings of idiom meanings. CHENGYU-BENCH demonstrates that while LLMs can reliably gauge idiom sentiment, they still struggle to grasp the cultural and contextual nuances essential for proper usage. The benchmark and code will be released upon paper acceptance.
pdf
bib
abs
Improving Cross Lingual Transfer by Pretraining with Active Forgetting
Divyanshu Aggarwal
|
Ashutosh Sathe
|
Sunayana Sitaram
Large Language Models (LLMs) demonstrate exceptional capabilities in a multitude of NLP tasks. However, the efficacy of such models on languages other than English is often limited. Prior works have shown that encoder-only models such as BERT or XLM-RoBERTa show impressive cross-lingual transfer of their capabilities from English to other languages. In this work, we propose a pretraining strategy that uses active forgetting to achieve similar cross-lingual transfer in decoder-only LLMs. We show that LLMs pretrained with active forgetting are highly effective when adapting to new and unseen languages. Through extensive experimentation, we find that LLMs pretrained with active forgetting are able to learn better multilingual representations, which translates into better performance on many downstream tasks.
pdf
bib
abs
Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization
Shuo Xing
|
Peiran Li
|
Yuping Wang
|
Ruizheng Bai
|
Yueqi Wang
|
Chan-Wei Hu
|
Chengxuan Qian
|
Huaxiu Yao
|
Zhengzhong Tu
The emergence of large Vision Language Models (VLMs) has broadened the scope and capabilities of single-modal Large Language Models (LLMs) by integrating visual modalities, thereby unlocking transformative cross-modal applications in a variety of real-world scenarios. Despite their impressive performance, VLMs are prone to significant hallucinations, particularly in the form of cross-modal inconsistencies. Building on the success of Reinforcement Learning from Human Feedback (RLHF) in aligning LLMs, recent advancements have focused on applying direct preference optimization (DPO) on carefully curated datasets to mitigate these issues. Yet, such approaches typically introduce preference signals in a brute-force manner, neglecting the crucial role of visual information in the alignment process. In this paper, we introduce Re-Align, a novel alignment framework that leverages image retrieval to construct a dual-preference dataset, effectively incorporating both textual and visual preference signals. We further introduce rDPO, an extension of the standard direct preference optimization that incorporates an additional visual preference objective during fine-tuning. Our experimental results demonstrate that Re-Align not only mitigates hallucinations more effectively than previous methods but also yields significant performance gains in general visual question-answering (VQA) tasks. Moreover, we show that Re-Align maintains robustness and scalability across a wide range of VLM sizes and architectures. This work represents a significant step forward in aligning multimodal LLMs, paving the way for more reliable and effective cross-modal applications.
pdf
bib
abs
To Mask or to Mirror: Human-AI Alignment in Collective Reasoning
Crystal Qian
|
Aaron T Parisi
|
Clémentine Bouleau
|
Vivian Tsai
|
Maël Lebreton
|
Lucas Dixon
As large language models (LLMs) are increasingly used to model and augment collective decision-making, it is critical to examine their alignment with human social reasoning. We present an empirical framework for assessing collective alignment, in contrast to prior work on the individual level. Using the Lost at Sea social psychology task, we conduct a large-scale online experiment (N=748), randomly assigning groups to leader elections with either visible demographic attributes (e.g. name, gender) or pseudonymous aliases. We then simulate matched LLM groups conditioned on the human data, benchmarking Gemini 2.5, GPT-4.1, Claude Haiku 3.5, and Gemma 3. LLM behaviors diverge: some mirror human biases; others mask these biases and attempt to compensate for them. We empirically demonstrate that human-AI alignment in collective reasoning depends on context, cues, and model-specific inductive biases. Understanding how LLMs align with collective human behavior is critical to advancing socially-aligned AI, and demands dynamic benchmarks that capture the complexities of collective reasoning.
pdf
bib
abs
SWAN: An Efficient and Scalable Approach for Long-Context Language Modeling
Krishna C Puvvada
|
Faisal Ladhak
|
Santiago Akle Serano
|
Cheng-Ping Hsieh
|
Shantanu Acharya
|
Somshubra Majumdar
|
Fei Jia
|
Samuel Kriman
|
Simeng Sun
|
Dima Rekesh
|
Boris Ginsburg
We present SWAN, a causal Transformer architecture in the decoder-only style that generalizes robustly to sequence lengths substantially longer than those seen during training. SWAN interleaves layers without positional encodings (NoPE) and sliding-window attention layers equipped with rotary positional encodings (SWA-RoPE), and applies a dynamic scaling mechanism for attention scores during inference. Experiments demonstrate that SWAN achieves strong length extrapolation without requiring additional long-context training. In addition, SWAN is more computationally efficient than the standard Transformer architecture, resulting in lower training cost and higher inference throughput. We further demonstrate that existing pre-trained decoder-only models can be adapted to the SWAN architecture with minimal continued training, enabling extended contexts. Overall, our work presents an effective approach for scaling language models to longer contexts in a robust and efficient manner.
pdf
bib
abs
LLMs Behind the Scenes: Enabling Narrative Scene Illustration
Melissa Roemmele
|
John Joon Young Chung
|
Taewook Kim
|
Yuqian Sun
|
Alex Calderwood
|
Max Kreminski
Generative AI has established the opportunity to readily transform content from one medium to another. This capability is especially powerful for storytelling, where visual illustrations can illuminate a story originally expressed in text. In this paper, we focus on the task of narrative scene illustration, which involves automatically generating an image depicting a scene in a story. Motivated by recent progress on text-to-image models, we consider a pipeline that uses LLMs as an interface for prompting text-to-image models to generate scene illustrations given raw story text. We apply variations of this pipeline to a prominent story corpus in order to synthesize illustrations for scenes in these stories. We conduct a human annotation task to obtain pairwise quality judgments for these illustrations. The outcome of this process is the SceneIllustrations dataset, which we release as a new resource for future work on cross-modal narrative transformation. Through our analysis of this dataset and experiments modeling illustration quality, we demonstrate that LLMs can effectively verbalize scene knowledge implicitly evoked by story text. Moreover, this capability is impactful for generating and evaluating illustrations.
pdf
bib
abs
REARANK: Reasoning Re-ranking Agent via Reinforcement Learning
Le Zhang
|
Bo Wang
|
Xipeng Qiu
|
Siva Reddy
|
Aishwarya Agrawal
We present REARANK, a large language model (LLM)-based listwise reasoning reranking agent. REARANK explicitly reasons before reranking, significantly improving both performance and interpretability. Leveraging reinforcement learning and data augmentation, REARANK achieves substantial improvements over baseline models across popular information retrieval benchmarks, notably requiring only 179 annotated samples. Built on top of Qwen2.5-7B, our REARANK-7B demonstrates performance comparable to GPT-4 on both in-domain and out-of-domain benchmarks and even surpasses GPT-4 on reasoning-intensive BRIGHT benchmarks. These results underscore the effectiveness of our approach and highlight how reinforcement learning can enhance LLM reasoning capabilities in reranking.
pdf
bib
abs
Large Language Models Do Multi-Label Classification Differently
Marcus Ma
|
Georgios Chochlakis
|
Niyantha Maruthu Pandiyan
|
Jesse Thomason
|
Shrikanth Narayanan
Multi-label classification is prevalent in real-world settings, but the behavior of Large Language Models (LLMs) in this setting is understudied. We investigate how autoregressive LLMs perform multi-label classification, focusing on subjective tasks, by analyzing the output distributions of the models at each label generation step. We find that the initial probability distribution for the first label often does not reflect the eventual final output, even in terms of relative order, and that LLMs tend to suppress all but one label at each generation step. We further observe that as model scale increases, their token distributions exhibit lower entropy and higher single-label confidence, but the internal relative ranking of the labels improves. Finetuning methods such as supervised finetuning and reinforcement learning amplify this phenomenon. We introduce the task of distribution alignment for multi-label settings: aligning LLM-derived label distributions with empirical distributions estimated from annotator responses in subjective tasks. We propose both zero-shot and supervised methods which improve both alignment and predictive performance over existing approaches. We find that one method – taking the max probability over all label generation distributions instead of just using the initial probability distribution – improves both distribution alignment and overall F1 classification performance without adding any additional computation.
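The max-over-steps aggregation mentioned above is simple to state concretely. A minimal sketch, assuming the per-step probabilities assigned to each candidate label have already been extracted from the model (all names below are illustrative, not taken from the paper's code):

```python
# Sketch of the "max probability over all label-generation distributions" idea.
# `step_distributions` is a list of dicts, one per generation step, mapping each
# candidate label to the probability it receives at that step.

def max_over_steps(step_distributions):
    """Aggregate per-step label probabilities by taking the per-label maximum."""
    labels = step_distributions[0].keys()
    return {
        label: max(dist.get(label, 0.0) for dist in step_distributions)
        for label in labels
    }

# Toy example: three generation steps over three candidate emotion labels.
steps = [
    {"joy": 0.55, "anger": 0.05, "sadness": 0.10},  # initial distribution only favors "joy"
    {"joy": 0.02, "anger": 0.70, "sadness": 0.08},
    {"joy": 0.01, "anger": 0.03, "sadness": 0.45},
]
print(max_over_steps(steps))  # "anger" and "sadness" recover mass hidden after step 1
```

Compared with reading off only the first-step distribution, this aggregation preserves labels that the model only surfaces at later generation steps.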
pdf
bib
abs
FilBench: Can LLMs Understand and Generate Filipino?
Lester James Validad Miranda
|
Elyanah Aco
|
Conner G. Manuel
|
Jan Christian Blaise Cruz
|
Joseph Marvin Imperial
Despite the impressive performance of LLMs on English-based tasks, little is known about their capabilities in specific languages such as Filipino. In this work, we address this gap by introducing FilBench, a Filipino-centric benchmark designed to evaluate LLMs across a diverse set of tasks and capabilities in Filipino, Tagalog, and Cebuano. We carefully curate the tasks in FilBench to reflect the priorities and trends of NLP research in the Philippines such as Cultural Knowledge, Classical NLP, Reading Comprehension, and Generation. By evaluating 27 state-of-the-art LLMs on FilBench, we find that several LLMs struggle with reading comprehension and translation. Our results indicate that FilBench is challenging, with the best model, GPT-4o, achieving only a score of 72.23%. Moreover, we also find that models trained specifically for Southeast Asian languages tend to underperform on FilBench, with the highest-performing model, SEA-LION v3 70B, achieving only a score of 61.07%. Our work demonstrates the value of curating language-specific LLM benchmarks to aid in driving progress on Filipino NLP and increasing the inclusion of Philippine languages in LLM development.
pdf
bib
abs
M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis
ChengYan Wu
|
Bolei Ma
|
Yihong Liu
|
Zheyu Zhang
|
Ningyuan Deng
|
Yanshu Li
|
Baolan Chen
|
Yi Zhang
|
Yun Xue
|
Barbara Plank
Aspect-based sentiment analysis (ABSA) is a crucial task in information extraction and sentiment analysis, aiming to identify aspects with associated sentiment elements in text. However, existing ABSA datasets are predominantly English-centric, limiting the scope for multilingual evaluation and research. To bridge this gap, we present M-ABSA, a comprehensive dataset spanning 7 domains and 21 languages, making it the most extensive multilingual parallel dataset for ABSA to date. Our primary focus is on triplet extraction, which involves identifying aspect terms, aspect categories, and sentiment polarities. The dataset is constructed through an automatic translation process with human review to ensure quality. We perform extensive experiments using various baselines to assess performance and compatibility on M-ABSA. Our empirical findings highlight that the dataset enables diverse evaluation tasks, such as multilingual and multi-domain transfer learning, and large language model evaluation, underscoring its inclusivity and its potential to drive advancements in multilingual ABSA research.
pdf
bib
abs
RuCCoD: Towards Automated ICD Coding in Russian
Alexandr Nesterov
|
Andrey Sakhovskiy
|
Ivan Sviridov
|
Airat Valiev
|
Vladimir Makharev
|
Petr Anokhin
|
Galina Zubkova
|
Elena Tutubalina
This study investigates the feasibility of automating clinical coding in Russian, a language with limited biomedical resources. We present a new dataset for ICD coding, which includes diagnosis fields from electronic health records (EHRs) annotated with over 10,000 entities and more than 1,500 unique ICD codes. This dataset serves as a benchmark for several state-of-the-art models, including BERT, LLaMA with LoRA, and RAG, with additional experiments examining transfer learning across domains (from PubMed abstracts to medical diagnosis) and terminologies (from UMLS concepts to ICD codes). We then apply the best-performing model to label an in-house EHR dataset containing patient histories from 2017 to 2021. Our experiments, conducted on a carefully curated test set, demonstrate that training with the automatically predicted codes leads to a significant improvement in accuracy compared to manually annotated data from physicians. We believe our findings offer valuable insights into the potential for automating clinical coding in resource-limited languages like Russian, which could enhance clinical efficiency and data accuracy in these contexts. Our code and dataset are available at https://github.com/auto-icd-coding/ruccod.
pdf
bib
abs
Code to Think, Think to Code: A Survey on Code-Enhanced Reasoning and Reasoning-Driven Code Intelligence in LLMs
Dayu Yang
|
Tianyang Liu
|
Daoan Zhang
|
Antoine Simoulin
|
Xiaoyi Liu
|
Yuwei Cao
|
Zhaopu Teng
|
Xin Qian
|
Grey Yang
|
Jiebo Luo
|
Julian McAuley
Code and reasoning recently exhibit a mutually reinforcing relationship in large language models (LLMs): code is abstract, modular, highly structured, and strongly logical, guiding reasoning during training and inference, while reasoning translates high-level goals into small executable steps, enabling more sophisticated code intelligence for solving challenging real-world software development problems. In this study, we examine how code serves as a structured medium for enhancing reasoning - providing verifiable execution paths, enforcing logical decomposition, and enabling runtime validation - and how advances in reasoning have transformed code intelligence from basic completion to sophisticated agents, enabling models to tackle complex software engineering tasks through deliberate planning and systematic debugging. Finally, we identify key challenges and propose future research directions that may deepen this synergy, ultimately advancing LLM performance in both complex reasoning and code intelligence.
pdf
bib
abs
Efficient Model Development through Fine-tuning Transfer
Pin-Jie Lin
|
Rishab Balasubramanian
|
Fengyuan Liu
|
Nikhil Kandpal
|
Tu Vu
Modern LLMs face a major obstacle: each new pre-trained model version requires expensive and repetitive alignment. We propose a method that transfers fine-tuning updates across model versions. The key idea is to extract the *diff vector*, which is the difference in parameters induced by fine-tuning, from a *source* model version and apply it to the base of a different *target* version. We show that transferring diff vectors significantly improves the target base model, often achieving performance comparable to its fine-tuned counterpart. For example, applying the fine-tuning updates from Llama 3.0 8B to Llama 3.1 8B increases accuracy by 46.9% on IFEval and 15.7% on LiveCodeBench without further training, surpassing Llama 3.1 8B Instruct. In multilingual settings, we also observe accuracy gains relative to Llama 3.1 8B Instruct, including 4.7% for Malagasy and 15.5% for Turkish on Global MMLU. Our controlled experiments reveal that fine-tuning transfer works best when source and target models are linearly connected in parameter space. We also show that this transfer provides a stronger and more efficient starting point for subsequent fine-tuning. Finally, we propose an iterative *recycling-then-finetuning* approach for continuous model development, which improves both efficiency and effectiveness. Our findings suggest that fine-tuning transfer is a viable strategy to reduce training costs while maintaining model performance.
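The diff-vector transfer described above amounts to simple parameter arithmetic between checkpoints. A minimal sketch, assuming all three checkpoints share the same architecture and parameter names (the checkpoint paths in the comments are placeholders):

```python
import torch

def transfer_finetuning(source_base, source_finetuned, target_base):
    """Apply the diff vector (source_finetuned - source_base) to a new base model.

    All arguments are state dicts with identical keys and shapes. This is an
    illustrative sketch of the parameter arithmetic, not the authors' code.
    """
    transferred = {}
    for name, base_param in target_base.items():
        diff = source_finetuned[name] - source_base[name]  # fine-tuning update on the old base
        transferred[name] = base_param + diff               # recycle it on the new base
    return transferred

# Usage sketch (paths are placeholders):
# src_base = torch.load("llama-3.0-8b-base.pt")
# src_ft = torch.load("llama-3.0-8b-instruct.pt")
# tgt_base = torch.load("llama-3.1-8b-base.pt")
# tgt_state = transfer_finetuning(src_base, src_ft, tgt_base)
```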
pdf
bib
abs
Language Mixing in Reasoning Language Models: Patterns, Impact, and Internal Causes
Mingyang Wang
|
Lukas Lange
|
Heike Adel
|
Yunpu Ma
|
Jannik Strötgen
|
Hinrich Schuetze
Reasoning language models (RLMs) excel at complex tasks by leveraging a chain-of-thought process to generate structured intermediate steps. However, language mixing, i.e., reasoning steps containing tokens from languages other than the prompt, has been observed in their outputs and shown to affect performance, though its impact remains debated. We present the first systematic study of language mixing in RLMs, examining its patterns, impact, and internal causes across 15 languages, 7 task difficulty levels, and 18 subject areas, and show how all three factors influence language mixing. Moreover, we demonstrate that the choice of reasoning language significantly affects performance: forcing models to reason in Latin or Han scripts via constrained decoding notably improves accuracy. Finally, we show that the script composition of reasoning traces closely aligns with that of the model’s internal representations, indicating that language mixing reflects latent processing preferences in RLMs. Our findings provide actionable insights for optimizing multilingual reasoning and open new directions for reasoning language control to build more interpretable and adaptable RLMs.
pdf
bib
abs
User Feedback in Human-LLM Dialogues: A Lens to Understand Users But Noisy as a Learning Signal
Yuhan Liu
|
Michael JQ Zhang
|
Eunsol Choi
Once language models (LMs) are deployed, they can interact with users long-term, ideally evolving based on their feedback. Asking for direct user feedback can be disruptive; thus, we study harvesting implicit user feedback from user-LM interaction logs. We study two user-LM interaction datasets (WildChat and LMSYS). First, we analyze user feedback in the user-LLM conversation logs, providing insights into when and why such feedback occurs. Second, we study harvesting learning signals from such implicit user feedback. Specifically, we study whether incorporating the contents of user feedback (e.g., user wanted clarification), in addition to the polarity of the feedback, can improve the model performance. We observe mixed results, showing this helps in short human-designed questions (MTBench) but not on longer and more complex questions (WildBench). Together, we provide an in-depth study of implicit user feedback, showing its potential and limitations.
pdf
bib
abs
Read to Hear: A Zero-Shot Pronunciation Assessment Using Textual Descriptions and LLMs
Yu-Wen Chen
|
Melody Ma
|
Julia Hirschberg
Automatic pronunciation assessment is typically performed by acoustic models trained on audio-score pairs. Although effective, these systems provide only numerical scores, without the information needed to help learners understand their errors. Meanwhile, large language models (LLMs) have proven effective in supporting language learning, but their potential for assessing pronunciation remains unexplored. In this work, we introduce TextPA, a zero-shot, Textual description-based Pronunciation Assessment approach. TextPA utilizes human-readable representations of speech signals, which are fed into an LLM to assess pronunciation accuracy and fluency, while also providing reasoning behind the assigned scores. Finally, a phoneme sequence match scoring method is used to refine the accuracy scores. Our work highlights a previously overlooked direction for pronunciation assessment. Instead of relying on supervised training with audio-score examples, we exploit the rich pronunciation knowledge embedded in written text. Experimental results show that our approach is both cost-efficient and competitive in performance. Furthermore, TextPA significantly improves the performance of conventional audio-score-trained models on out-of-domain data by offering a complementary perspective.
pdf
bib
abs
COCO-Tree: Compositional Hierarchical Concept Trees for Enhanced Reasoning in Vision-Language Models
Sanchit Sinha
|
Guangzhi Xiong
|
Aidong Zhang
Compositional reasoning remains a persistent weakness of modern vision language models (VLMs): they often falter when a task hinges on understanding how multiple objects, attributes, and relations interact within an image. Multiple works have attempted to improve compositional performance through creative techniques such as improved prompt structure and chain-of-thought reasoning. A more recent line of work attempts to impart additional reasoning to VLMs using well-trained Large Language Models (LLMs), whose linguistic understanding far exceeds that of VLMs, to compensate for the latter’s limited linguistic prowess. However, these approaches are either resource-intensive or do not provide an interpretable reasoning process. In this paper, we present “COCO-Tree” - a novel approach that augments VLM outputs with carefully designed neurosymbolic concept trees learned from LLMs to improve VLMs’ linguistic reasoning. COCO-Tree’s beam search-inspired reasoning process boosts compositionality performance and provides a rationale behind VLM predictions. Empirical results on four compositionality benchmarks, Winoground, EqBench, ColorSwap, and SugarCrepe, across seven open-source VLMs of varying sizes, demonstrate that COCO-Tree significantly improves compositional generalization by 5-10% over baselines.
pdf
bib
abs
SurveyGen: Quality-Aware Scientific Survey Generation with Large Language Models
Tong Bao
|
Mir Tafseer Nayeem
|
Davood Rafiei
|
Chengzhi Zhang
Automatic survey generation has emerged as a key task in scientific document processing. While large language models (LLMs) have shown promise in generating survey texts, the lack of standardized evaluation datasets critically hampers rigorous assessment of their performance against human-written surveys. In this work, we present SurveyGen, a large-scale dataset comprising over 4,200 human-written surveys across diverse scientific domains, along with 242,143 cited references and extensive quality-related metadata for both the surveys and the cited papers. Leveraging this resource, we build QUAL-SG, a novel quality-aware framework for survey generation that enhances the standard Retrieval-Augmented Generation (RAG) pipeline by incorporating quality-aware indicators into literature retrieval to assess and select higher-quality source papers. Using this dataset and framework, we systematically evaluate state-of-the-art LLMs under varying levels of human involvement—from fully automatic generation to human-guided writing. Experimental results and human evaluations show that while semi-automatic pipelines can achieve partially competitive outcomes, fully automatic survey generation still suffers from low citation quality and limited critical analysis.
pdf
bib
abs
VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing
Zhisheng Zheng
|
Puyuan Peng
|
Anuj Diwan
|
Cong Phuoc Huynh
|
Xiaohang Sun
|
Zhu Liu
|
Vimal Bhat
|
David Harwath
We introduce VoiceCraft-X, an autoregressive neural codec language model which unifies multilingual speech editing and zero-shot text-to-speech (TTS) synthesis across 11 languages: English, Mandarin, Korean, Japanese, Spanish, French, German, Dutch, Italian, Portuguese, and Polish. VoiceCraft-X utilizes the Qwen3 large language model for phoneme-free cross-lingual text processing and a novel token reordering mechanism with time-aligned text and speech tokens to handle both tasks as a single sequence generation problem. The model generates high-quality, natural-sounding speech, seamlessly creating new audio or editing existing recordings within one framework. VoiceCraft-X shows robust performance in diverse linguistic settings, even with limited per-language data, underscoring the power of unified autoregressive approaches for advancing complex, real-world multilingual speech applications. Audio samples are available at https://zhishengzheng.com/voicecraft-x/.
pdf
bib
abs
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Dawei Li
|
Bohan Jiang
|
Liangjie Huang
|
Alimohammad Beigi
|
Chengshuai Zhao
|
Zhen Tan
|
Amrita Bhattacharjee
|
Yuxuan Jiang
|
Canyu Chen
|
Tianhao Wu
|
Kai Shu
|
Lu Cheng
|
Huan Liu
Assessment and evaluation have long been critical challenges in artificial intelligence (AI) and natural language processing (NLP). Traditional methods, usually matching-based or small model-based, often fall short in open-ended and dynamic scenarios. Recent advancements in Large Language Models (LLMs) inspire the “LLM-as-a-judge” paradigm, where LLMs are leveraged to perform scoring, ranking, or selection for various machine learning evaluation scenarios. This paper presents a comprehensive survey of LLM-based judgment and assessment, offering an in-depth overview to review this evolving field. We first provide the definition from both input and output perspectives. Then we introduce a systematic taxonomy to explore LLM-as-a-judge along three dimensions: what to judge, how to judge, and how to benchmark. Finally, we also highlight key challenges and promising future directions for this emerging area.
pdf
bib
abs
MultiMatch: Multihead Consistency Regularization Matching for Semi-Supervised Text Classification
Iustin Sirbu
|
Robert-Adrian Popovici
|
Cornelia Caragea
|
Stefan Trausan-Matu
|
Traian Rebedea
We introduce **MultiMatch**, a novel semi-supervised learning (SSL) algorithm combining the paradigms of co-training and consistency regularization with pseudo-labeling. At its core, MultiMatch features a three-fold pseudo-label weighting module designed for selecting and filtering pseudo-labels based on head agreement and model confidence, and weighting them according to the perceived classification difficulty. This novel module enhances and unifies three existing techniques - heads agreement from **Multi**head Co-training, self-adaptive thresholds from Free**Match**, and Average Pseudo-Margins from Margin**Match** - resulting in a holistic approach that improves robustness and performance in SSL settings. Experimental results on benchmark datasets highlight the superior performance of MultiMatch, i.e., MultiMatch achieves state-of-the-art results on 8 out of 10 setups from 5 natural language processing datasets and ranks first according to the Friedman test among 21 methods. Furthermore, MultiMatch demonstrates exceptional robustness in highly imbalanced settings, outperforming the second-best approach by 3.26%, a critical advantage for real-world text classification tasks. Our code is available on GitHub.
pdf
bib
abs
TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games
Prakamya Mishra
|
Jiang Liu
|
Jialian Wu
|
Xiaodong Yu
|
Zicheng Liu
|
Emad Barsoum
Large reasoning models (LRMs) have demonstrated impressive reasoning capabilities across a broad range of tasks, including Olympiad-level mathematical problems, indicating evidence of their complex reasoning abilities. While many reasoning benchmarks focus on the STEM domain, the ability of LRMs to reason correctly in broader task domains remains underexplored. In this work, we introduce **TTT-Bench**, a new benchmark designed to evaluate basic strategic, spatial, and logical reasoning abilities in LRMs through a suite of four two-player Tic-Tac-Toe-style games that humans can effortlessly solve from a young age. We propose a simple yet scalable programmatic approach for generating verifiable two-player game problems for TTT-Bench. Although these games are trivial for humans, they require reasoning about the intentions of the opponent, as well as the game board’s spatial configurations, to ensure a win. We evaluate a diverse set of state-of-the-art LRMs, and **discover that the models that excel at hard math problems frequently fail at these simple reasoning games**. Further testing reveals that our evaluated reasoning models score on average 41% and 5% lower on TTT-Bench than on MATH 500 and AIME 2024, respectively. Larger models achieve higher performance with shorter reasoning traces, yet most models struggle with the long-term strategic reasoning required by these simple and novel TTT-Bench tasks.
pdf
bib
abs
Learning from Diverse Reasoning Paths with Routing and Collaboration
Zhenyu Lei
|
Zhen Tan
|
Song Wang
|
Yaochen Zhu
|
Zihan Chen
|
Yushun Dong
|
Jundong Li
Advances in large language models (LLMs) significantly enhance reasoning capabilities but their deployment is restricted in resource-constrained scenarios. Knowledge distillation addresses this by transferring knowledge from powerful teacher models to compact and transparent students. However, effectively capturing the teacher’s comprehensive reasoning is challenging due to conventional token-level supervision’s limited scope. Using multiple reasoning paths per query alleviates this problem, but treating each path identically is suboptimal as paths vary widely in quality and suitability across tasks and models. We propose Quality-filtered Routing with Cooperative Distillation (QR-Distill), combining path quality filtering, conditional routing, and cooperative peer teaching. First, quality filtering retains only correct reasoning paths scored by an LLM-based evaluation. Second, conditional routing dynamically assigns paths tailored to each student’s current learning state. Finally, cooperative peer teaching enables students to mutually distill diverse insights, addressing knowledge gaps and biases toward specific reasoning styles. Experiments demonstrate QR-Distill’s superiority over traditional single- and multi-path distillation methods. Ablation studies further highlight the importance of each component—quality filtering, conditional routing, and peer teaching—in effective knowledge transfer. Our code is available at https://github.com/LzyFischer/Distill.
pdf
bib
abs
Ask Patients with Patience: Enabling LLMs for Human-Centric Medical Dialogue with Grounded Reasoning
Jiayuan Zhu
|
Jiazhen Pan
|
Yuyuan Liu
|
Fenglin Liu
|
Junde Wu
The severe shortage of medical doctors limits access to timely and reliable healthcare, leaving millions underserved. Large language models (LLMs) offer a potential solution but struggle in real-world clinical interactions. Many LLMs are not grounded in authoritative medical guidelines and fail to transparently manage diagnostic uncertainty. Their language is often rigid and mechanical, lacking the human-like qualities essential for patient trust. To address these challenges, we propose ***Ask Patients with Patience (APP)***, a multi-turn LLM-based medical assistant designed for grounded reasoning, transparent diagnoses, and human-centric interaction. APP enhances communication by eliciting user symptoms through empathetic dialogue, significantly improving accessibility and user engagement. It also incorporates Bayesian active learning to support transparent and adaptive diagnoses. The framework is built on verified medical guidelines, ensuring clinically grounded and evidence-based reasoning. To evaluate its performance, we develop a new benchmark that simulates realistic medical conversations using patient agents driven by profiles extracted from real-world consultation cases. We compare APP against SOTA one-shot and multi-turn LLM baselines. The results show that APP improves diagnostic accuracy, reduces uncertainty, and enhances user experience. By integrating medical expertise with transparent, human-like interaction, APP bridges the gap between AI-driven medical assistance and real-world clinical practice.
pdf
bib
abs
MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models
Shrey Pandit
|
Jiawei Xu
|
Junyuan Hong
|
Zhangyang Wang
|
Tianlong Chen
|
Kaidi Xu
|
Ying Ding
Advancements in Large Language Models (LLMs) and their increasing use in medical question-answering necessitate rigorous evaluation of their reliability. A critical challenge lies in hallucination, where models generate plausible yet factually incorrect outputs. In the medical domain, this poses serious risks to patient safety and clinical decision-making. To address this, we introduce MedHallu, the first benchmark specifically designed for medical hallucination detection. MedHallu comprises 10,000 high-quality question-answer pairs derived from PubMedQA, with hallucinated answers systematically generated through a controlled pipeline. Our experiments show that state-of-the-art LLMs, including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, struggle with this binary hallucination detection task, with the best model achieving an F1 score as low as 0.625 for detecting ”hard” category hallucinations. Using bidirectional entailment clustering, we show that harder-to-detect hallucinations are semantically closer to ground truth. Through experiments, we also show incorporating domain-specific knowledge and introducing a ”not sure” category as one of the answer categories improves the precision and F1 scores by up to 38% relative to baselines.
pdf
bib
abs
NUTMEG: Separating Signal From Noise in Annotator Disagreement
Jonathan Ivey
|
Susan Gauch
|
David Jurgens
NLP models often rely on human-labeled data for training and evaluation. Many approaches crowdsource this data from a large number of annotators with varying skills, backgrounds, and motivations, resulting in conflicting annotations. These conflicts have traditionally been resolved by aggregation methods that assume disagreements are errors. Recent work has argued that for many tasks annotators may have genuine disagreements and that variation should be treated as signal rather than noise. However, few models separate signal and noise in annotator disagreement. In this work, we introduce NUTMEG, a new Bayesian model that incorporates information about annotator backgrounds to remove noisy annotations from human-labeled training data while preserving systematic disagreements. Using synthetic and real-world data, we show that NUTMEG is more effective at recovering ground-truth from annotations with systematic disagreement than traditional aggregation methods, and we demonstrate that downstream models trained on NUTMEG-aggregated data significantly outperform models trained on data from traditional aggregation methods. We provide further analysis characterizing how differences in subpopulation sizes, rates of disagreement, and rates of spam affect the performance of our model. Our results highlight the importance of accounting for both annotator competence and systematic disagreements when training on human-labeled data.
pdf
bib
abs
Alignment Quality Index (AQI) : Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer wise Pooled Representations
Abhilekh Borah
|
Chhavi Sharma
|
Danush Khanna
|
Utkarsh Bhatt
|
Gurpreet Singh
|
Hasnat Md Abdullah
|
Raghav Kaushik Ravi
|
Vinija Jain
|
Jyoti Patel
|
Shubham Singh
|
Vasu Sharma
|
Arpita Vats
|
Rahul Raja
|
Aman Chadha
|
Amitava Das
Alignment is no longer a luxury; it is a necessity. As large language models (LLMs) enter high-stakes domains like education, healthcare, governance, and law, their behavior must reliably reflect human-aligned values and safety constraints. Yet current evaluations rely heavily on behavioral proxies such as refusal rates, G-Eval scores, and toxicity classifiers, all of which have critical blind spots. Aligned models are often vulnerable to jailbreaking, stochasticity of generation, and alignment faking. To address this issue, we introduce the **Alignment Quality Index (AQI)**. This novel geometric and prompt-invariant metric empirically assesses LLM alignment by analyzing the separation of safe and unsafe activations in latent space. By combining measures such as the *Davies-Bouldin score (DBS)*, *Dunn index (DI)*, *Xie-Beni index (XBI)*, and *Calinski-Harabasz index (CHI)* across various formulations, AQI captures clustering quality to detect hidden misalignments and jailbreak risks, even when outputs appear compliant. AQI also serves as an early warning signal for alignment faking, offering a robust, decoding-invariant tool for behavior-agnostic safety auditing. Additionally, we propose the **LITMUS** dataset to facilitate robust evaluation under these challenging conditions. Empirical tests on LITMUS across different models trained under DPO, GRPO, and RLHF conditions demonstrate AQI’s correlation with external judges and ability to reveal vulnerabilities missed by refusal metrics. We make our implementation publicly available to foster future research in this area.
pdf
bib
abs
MythTriage: Scalable Detection of Opioid Use Disorder Myths on a Video-Sharing Platform
Hayoung Jung
|
Shravika Mittal
|
Ananya Aatreya
|
Navreet Kaur
|
Munmun De Choudhury
|
Tanu Mitra
Understanding the prevalence of misinformation in health topics online can inform public health policies and interventions. However, measuring such misinformation at scale remains a challenge, particularly for high-stakes but understudied topics like opioid-use disorder (OUD)—a leading cause of death in the U.S. We present the first large-scale study of OUD-related myths on YouTube, a widely-used platform for health information. With clinical experts, we validate 8 pervasive myths and release an expert-labeled video dataset. To scale labeling, we introduce MythTriage, an efficient triage pipeline that uses a lightweight model for routine cases and defers harder ones to a high-performing, but costlier, large language model (LLM). MythTriage achieves up to 0.86 macro F1-score while estimated to reduce annotation time and financial cost by over 76% compared to experts and full LLM labeling. We analyze 2.9K search results and 343K recommendations, uncovering how myths persist on YouTube and offering actionable insights for public health and platform moderation.
pdf
bib
abs
Demystifying optimized prompts in language models
Rimon Melamed
|
Lucas Hurley McCabe
|
H Howie Huang
Modern language models (LMs) are not robust to out-of-distribution inputs. Machine generated (“optimized”) prompts can be used to modulate LM outputs and induce specific behaviors while appearing completely uninterpretable. In this work, we investigate the composition of optimized prompts, as well as the mechanisms by which LMs parse and build predictions from optimized prompts. We find that optimized prompts primarily consist of punctuation and noun tokens which are more rare in the training data. Internally, optimized prompts are clearly distinguishable from natural language counterparts based on sparse subsets of the model’s activations. Across various families of instruction-tuned models, optimized prompts follow a similar path in how their representations form through the network.
pdf
bib
abs
Whisper-UT: A Unified Translation Framework for Speech and Text
Cihan Xiao
|
Matthew Wiesner
|
Debashish Chakraborty
|
Reno Kriz
|
Keith Cunningham
|
Kenton Murray
|
Kevin Duh
|
Luis Tavarez-Arce
|
Paul McNamee
|
Sanjeev Khudanpur
Encoder-decoder models have achieved remarkable success in speech and text tasks, yet efficiently adapting these models to diverse uni/multi-modal scenarios remains an open challenge. In this paper, we propose Whisper-UT, a unified and efficient framework that leverages lightweight adapters to enable seamless adaptation across tasks, including a multi-modal machine translation (MMT) task that explicitly conditions translation on both speech and source language text inputs. By incorporating ASR hypotheses or ground-truth transcripts as prompts, this approach not only enables the system to process both modalities simultaneously but also enhances speech translation (ST) performance through a 2-stage decoding strategy. We demonstrate our methods using the Whisper model, though in principle they are general and could be applied to similar multitask models. We highlight the effectiveness of cross-modal and cross-task fine-tuning, which improves performance without requiring 3-way parallel data. Our results underscore the flexibility, efficiency, and general applicability of the proposed framework for multi-modal translation.
pdf
bib
abs
Unleashing the Reasoning Potential of LLMs by Critique Fine-Tuning on One Problem
Yubo Wang
|
Ping Nie
|
Kai Zou
|
Lijun Wu
|
Wenhu Chen
Critique Fine-Tuning (CFT) has recently emerged as a promising paradigm for unlocking the reasoning capabilities of large language models (LLMs). In this work, we introduce one-shot CFT, a highly compute-efficient approach that leverages critique data generated from a single math problem. Remarkably, this method yields significant gains in reasoning accuracy, surpassing one-shot RLVR (Reinforcement Learning with Verifiable Reward) while requiring 15 to 20 times less compute. Given one math problem, we first prompt a set of diverse small models to produce candidate solutions, then use frontier models such as GPT-4.1 to generate high-quality critiques of these responses. We fine-tune Qwen and Llama family models ranging from 1.5B to 14B parameters with CFT. With just 5 GPU hours, our models achieve up to a 16 percent absolute improvement in average accuracy across six mathematical reasoning benchmarks (for example, Qwen2.5-Math-7B improves from 26 percent to 42 percent). Furthermore, ablation studies reveal the robustness of one-shot CFT across different prompt problems. Our findings suggest an extremely compute-efficient approach to unleash the reasoning potential of LLMs.
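The critique-data construction described above can be sketched as a short loop. `sample_solutions` and `generate_critique` below are hypothetical stand-ins for calls to the small solver models and the frontier critic model (e.g. GPT-4.1), not the authors' API:

```python
# Illustrative sketch of one-shot CFT data construction: many candidate solutions
# to a single problem, each paired with a model-written critique that becomes the
# fine-tuning target for the student model.

def build_cft_dataset(problem, solver_models, sample_solutions, generate_critique, n_per_model=8):
    records = []
    for solver in solver_models:
        for solution in sample_solutions(solver, problem, n=n_per_model):
            critique = generate_critique(problem, solution)  # frontier-model critique
            records.append({
                "prompt": f"Problem:\n{problem}\n\nCandidate solution:\n{solution}\n\nCritique:",
                "target": critique,
            })
    return records
```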
pdf
bib
abs
Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation
Hongxiang Zhang
|
Hao Chen
|
Muhao Chen
|
Tianyi Zhang
Recent decoding methods improve the factuality of large language models (LLMs) by refining how the next token is selected during generation. These methods typically operate at the token level, leveraging internal representations to suppress superficial patterns. Nevertheless, LLMs remain prone to hallucinations, especially over longer contexts. In this paper, we propose Active Layer-Contrastive Decoding (ActLCD), a novel decoding strategy that actively decides when to apply contrasting layers during generation. By casting decoding as a sequential decision-making problem, ActLCD employs a reinforcement learning policy guided by a reward-aware classifier to optimize factuality beyond the token level. Our experiments demonstrate that ActLCD surpasses state-of-the-art methods across five benchmarks, showcasing its effectiveness in mitigating hallucinations in diverse generation scenarios.
pdf
bib
abs
BBScoreV2: Learning Time-Evolution and Latent Alignment from Stochastic Representation
Tianhao Zhang
|
Zhecheng Sheng
|
Zhexiao Lin
|
Chen Jiang
|
Dongyeop Kang
Autoregressive generative models play a key role in various language tasks, especially for modeling and evaluating long text sequences. While recent methods leverage stochastic representations to better capture sequence dynamics, encoding both temporal and structural dependencies and utilizing such information for evaluation remains challenging. In this work, we observe that fitting transformer-based model embeddings into a stochastic process yields ordered latent representations from originally unordered model outputs. Building on this insight and prior work, we theoretically introduce a novel likelihood-based evaluation metric BBScoreV2. Empirically, we demonstrate that the stochastic latent space induces a “clustered-to-temporal ordered” mapping of language model representations in high-dimensional space, offering both intuitive and quantitative support for the effectiveness of BBScoreV2. Furthermore, this structure aligns with intrinsic properties of natural language and enhances performance on tasks such as temporal consistency evaluation (e.g., Shuffle tasks) and AI-generated content detection.
pdf
bib
abs
SAND: Boosting LLM Agents with Self-Taught Action Deliberation
Yu Xia
|
Yiran Jenny Shen
|
Junda Wu
|
Tong Yu
|
Sungchul Kim
|
Ryan A. Rossi
|
Lina Yao
|
Julian McAuley
Large Language Model (LLM) agents are commonly tuned with supervised finetuning on ReAct-style expert trajectories or preference optimization over pairwise rollouts. Most of these methods focus on imitating specific expert behaviors or promoting chosen reasoning thoughts and actions over rejected ones. However, without reasoning over and comparing alternative actions, LLM agents finetuned with these methods may over-commit towards seemingly plausible but suboptimal actions due to limited action space exploration. To address this, in this paper we propose the Self-taught ActioN Deliberation (SAND) framework, enabling LLM agents to explicitly deliberate over candidate actions before committing to one. To tackle the challenges of when and what to deliberate given large action space and step-level action evaluation, we incorporate self-consistency action sampling and execution-guided action critique to help synthesize step-wise action deliberation thoughts using the base model of the LLM agent. In an iterative manner, the deliberation trajectories are then used to finetune the LLM agent itself. Evaluating on two representative interactive agent tasks, SAND achieves an average 20% improvement over initial supervised finetuning and also outperforms state-of-the-art agent tuning approaches.
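The self-consistency action sampling step can be sketched as follows; `sample_action` and `critique_action` are hypothetical stand-ins for the agent's base model and the execution-guided critic, so this only conveys the shape of the deliberation data, not the released implementation:

```python
from collections import Counter

def deliberate_action(state, sample_action, critique_action, n_samples=8):
    """Sample candidate actions, majority-vote a tentative choice, critique the rest.

    sample_action(state) -> a candidate action string (stochastic sampling).
    critique_action(state, action) -> a short critique of that action.
    The returned pair can be serialized into a step-wise deliberation thought.
    """
    candidates = [sample_action(state) for _ in range(n_samples)]
    chosen, _ = Counter(candidates).most_common(1)[0]        # self-consistency choice
    alternatives = [a for a in set(candidates) if a != chosen]
    critiques = {a: critique_action(state, a) for a in alternatives}
    return chosen, critiques
```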
pdf
bib
abs
LLMs as World Models: Data-Driven and Human-Centered Pre-Event Simulation for Disaster Impact Assessment
Lingyao Li
|
Dawei Li
|
Zhenhui Ou
|
Xiaoran Xu
|
Jingxiao Liu
|
Zihui Ma
|
Runlong Yu
|
Min Deng
Efficient simulation is essential for enhancing proactive preparedness for sudden-onset disasters such as earthquakes. Recent advancements in large language models (LLMs) as world models show promise in simulating complex scenarios. This study examines multiple LLMs to proactively estimate perceived earthquake impacts. Leveraging multimodal datasets including geospatial, socioeconomic, building, and street-level imagery data, our framework generates Modified Mercalli Intensity (MMI) predictions at zip code and county scales. Evaluations on the 2014 Napa and 2019 Ridgecrest earthquakes using USGS “Did You Feel It? (DYFI)” reports demonstrate significant alignment, as evidenced by a high correlation of 0.88 and a low RMSE of 0.77 relative to real reports at the zip code level. Techniques such as RAG and ICL can improve simulation performance, while visual inputs notably enhance accuracy compared to structured numerical data alone. These findings show the promise of LLMs in simulating disaster impacts that can help strengthen pre-event planning.
pdf
bib
abs
Mind the Value-Action Gap: Do LLMs Act in Alignment with Their Values?
Hua Shen
|
Nicholas Clark
|
Tanu Mitra
Existing research assesses LLMs’ values by analyzing their stated inclinations, overlooking potential discrepancies between stated values and actions—termed the “Value-Action Gap.” This study introduces ValueActionLens, a framework to evaluate the alignment between LLMs’ stated values and their value-informed actions. The framework includes a dataset of 14.8k value-informed actions across 12 cultures and 11 social topics, along with two tasks measuring alignment through three metrics. Experiments show substantial misalignment between LLM-generated value statements and their actions, with significant variations across scenarios and models. Misalignments reveal potential harms, highlighting risks in relying solely on stated values to predict behavior. The findings stress the need for context-aware evaluations of LLM values and the value-action gaps.
pdf
bib
abs
Two Heads Are Better Than One: Dual-Model Verbal Reflection at Inference-Time
Jiazheng Li
|
Yuxiang Zhou
|
Junru Lu
|
Gladys Tyen
|
Lin Gui
|
Cesare Aloisi
|
Yulan He
Although preference optimization methods have improved reasoning performance in Large Language Models (LLMs), they often lack transparency regarding why one reasoning outcome is preferred over another. This limitation is especially critical in Automated Student Answer Scoring (ASAS), where explainability is essential to justify assessment outcomes. Verbal reinforcement learning offers the potential to generate explicit reflection, but it tends to produce superficial critiques that can harm assessment performance. Existing LLMs also struggle to reliably detect subtle reasoning errors in ASAS tasks. Moreover, manually identifying intermediate reasoning errors is expensive and difficult to scale. To address these challenges, we introduce a **contrastive reflection synthesis pipeline** that generates precise verbal feedback by identifying discrepancies in structured reasoning graph paths. Leveraging these synthetic reflection data, we propose *DARS*, a Dual-model Reflective Scoring framework featuring a dedicated Critic model trained for effective reflection. *DARS* achieves strong performance and consistently outperforms existing ASAS baselines across all evaluation metrics. Extensive experiments further provide novel insights into the value of reflection data, framework design, and the scaling behavior of *DARS*. We release the DARS code at https://github.com/lijiazheng99/DARS.
pdf
bib
abs
Image Embedding Sampling Method for Diverse Captioning
Sania Waheed
|
Na Min An
Image Captioning for state-of-the-art VLMs has significantly improved over time; however, this comes at the cost of increased computational complexity, making them less accessible for resource-constrained applications such as mobile devices and assistive technologies. Alternatively, comparably smaller VLMs prioritize high-level scene descriptions, overlooking finer details that contribute to a richer understanding of an image. In this paper, we introduce a training-free framework that enhances caption diversity and informativeness by explicitly attending to distinct image regions using a comparably small VLM, BLIP, as the backbone. Our approach leverages structured segmentation to produce hierarchical representations that capture both global and localized semantics. Without requiring additional model training, we demonstrate that our method allows smaller VLMs to achieve performance comparable to larger models in terms of image-caption alignment, semantic integrity, and diversity. We evaluate our framework on MSCOCO, Flickr30k, and Nocaps test datasets, achieving a Div-2 score of 0.735, 0.750, and 0.748 for each dataset, respectively, while maintaining strong image-caption relevancy and semantic integrity with the human-annotated captions. Our code is available at
https://github.com/xfactlab/HBoP.
pdf
bib
abs
Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time
Huihan Li
|
You Chen
|
Siyuan Wang
|
Yixin He
|
Ninareh Mehrabi
|
Rahul Gupta
|
Xiang Ren
Large Language Models (LLMs) perform well on reasoning benchmarks but often fail when inputs are altered slightly, raising concerns about the extent to which their success relies on memorization. This issue is especially acute in Chain-of-Thought (CoT) reasoning, where spurious memorized patterns can trigger intermediate errors that cascade into incorrect final answers. We introduce STIM, a novel framework for Source-aware Token-level Identification of Memorization, which attributes each token in a reasoning chain to one of multiple memorization sources – local, mid-range, or long-range – based on their statistical co-occurrence with the token in the pretraining corpus. Our token-level analysis across tasks and distributional settings reveals that models rely more on memorization in complex or long-tail cases, and that local memorization is often the dominant driver of errors, accounting for up to 67% of wrong tokens. We also show that memorization scores from STIM can be effective in predicting the wrong tokens in an incorrect reasoning step. STIM offers a powerful tool for diagnosing and improving model reasoning and can generalize to other structured step-wise generation tasks.
pdf
bib
abs
FANS: Formal Answer Selection for LLM Natural Language Math Reasoning Using Lean4
Jiarui Yao
|
Ruida Wang
|
Tong Zhang
Large Language Models (LLMs) have displayed astonishing abilities in various tasks, especially in text generation, classification, question answering, etc. However, the reasoning ability of LLMs remains widely debated, especially for mathematical reasoning. The inherent ambiguity of Natural Language (NL) limits LLMs’ ability to perform verifiable reasoning, making the answers lack coherence and trustworthy support. To tackle the above challenges, we propose a novel framework named FANS: Formal ANswer Selection for LLM Natural Language Math Reasoning Using Lean4. It is a pioneering framework that utilizes Lean4 to enhance LLMs’ NL math reasoning ability. In particular, given an NL math question and LLM-generated answers, FANS first translates them into Lean4 theorem statements. Then it invokes a Lean4 prover LLM to produce proofs, and finally verifies the proofs with the Lean4 compiler. Answers are selected based on these verifications. FANS strengthens LLMs’ NL math ability by providing a computer-verifiable justification for the selected answer, and offers an alternative to reward-model-based answer selection. Our experiments demonstrate the effectiveness of FANS, with an improvement of nearly 2% across several math benchmarks and even larger gains when combined with reward models or in subfields such as algebra and number theory where Lean4 is stronger. The code is available at https://github.com/MaxwellJryao/FANS.
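At a high level, the answer-selection loop described above can be sketched as follows; `translate_to_lean`, `generate_proof`, and `lean_compiles` are hypothetical wrappers around the autoformalizer LLM, the Lean4 prover LLM, and the Lean4 compiler, and the fallback to plain voting is an assumption added for illustration:

```python
from collections import Counter

def select_answer(question, candidate_answers,
                  translate_to_lean, generate_proof, lean_compiles):
    """Keep candidate answers whose Lean4 formalization can be proved and verified."""
    verified = []
    for answer in candidate_answers:
        statement = translate_to_lean(question, answer)  # NL question + answer -> Lean4 theorem
        proof = generate_proof(statement)                # prover LLM attempts a proof
        if lean_compiles(statement, proof):              # compiler check = formal verification
            verified.append(answer)
    pool = verified if verified else candidate_answers   # fall back to all answers if none verify
    return Counter(pool).most_common(1)[0][0]
```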
pdf
bib
abs
Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning
Gagan Bhatia
|
Maxime Peyrard
|
Wei Zhao
Modern BPE tokenisers often split calendar dates into meaningless fragments, e.g., “20250312” → “202”, “503”, “12”, inflating token counts and obscuring the inherent structure needed for robust temporal reasoning. In this work, we (1) introduce a simple yet interpretable metric, termed date fragmentation ratio, that measures how faithfully a tokeniser preserves multi-digit date components; (2) release DateAugBench, a suite of 6500 examples spanning three temporal reasoning tasks: context-based date resolution, format-invariance puzzles, and date arithmetic across historical, contemporary, and future time periods; and (3) through layer-wise probing and causal attention-hop analyses, uncover an emergent date-abstraction mechanism whereby large language models stitch together the fragments of month, day, and year components for temporal reasoning. Our experiments show that excessive fragmentation correlates with accuracy drops of up to 10 points on uncommon dates like historical and futuristic dates. Further, we find that the larger the model, the faster the emergent date abstraction heals date fragments. Lastly, we observe a reasoning path that LLMs follow to assemble date fragments, typically differing from human interpretation (year → month → day).
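A minimal sketch of the date fragmentation ratio idea follows; the paper's exact definition may differ, and here the ratio is simply the fraction of date components (year, month, day) that the tokenizer does not keep as single tokens:

```python
# Illustrative date fragmentation ratio: 0 means every component survives as one
# token, 1 means every component is split. Requires the `transformers` library.
from transformers import AutoTokenizer

def fragmentation_ratio(tokenizer, year, month, day, fmt="{y}{m:02d}{d:02d}"):
    date_str = fmt.format(y=year, m=month, d=day)
    tokens = [t.lstrip("Ġ▁") for t in tokenizer.tokenize(date_str)]  # strip BPE space markers
    components = [str(year), f"{month:02d}", f"{day:02d}"]
    intact = sum(1 for c in components if c in tokens)  # component kept as a single token
    return 1.0 - intact / len(components)

# tok = AutoTokenizer.from_pretrained("gpt2")
# print(fragmentation_ratio(tok, 2025, 3, 12))  # "20250312" may split into "202", "503", "12"
```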
pdf
bib
abs
Measuring Risk of Bias in Biomedical Reports: The RoBBR Benchmark
Jianyou Wang
|
Weili Cao
|
Longtian Bao
|
Youze Zheng
|
Gil Pasternak
|
Kaicheng Wang
|
Xiaoyue Wang
|
Ramamohan Paturi
|
Leon Bergen
Systems that answer questions by reviewing the scientific literature are becoming increasingly feasible. To draw reliable conclusions, these systems should take into account the quality of available evidence from different studies, placing more weight on studies that use a valid methodology. We present a benchmark for measuring the methodological strength of biomedical papers, drawing on the risk-of-bias framework used for systematic reviews. Derived from over 500 biomedical studies, the three benchmark tasks encompass expert reviewers’ judgments of studies’ research methodologies, including the assessments of risk of bias within these studies. The benchmark contains a human-validated annotation pipeline for fine-grained alignment of reviewers’ judgments with research paper sentences. Our analyses show that large language models’ reasoning and retrieval capabilities impact their effectiveness with risk-of-bias assessment. The dataset is available at https://github.com/RoBBR-Benchmark/RoBBR.
pdf
bib
abs
SHIFT: Selected Helpful Informative Frame for Video-guided Machine Translation
Boyu Guan
|
Chuang Han
|
Yining Zhang
|
Yupu Liang
|
Zhiyang Zhang
|
Yang Zhao
|
Chengqing Zong
Video-guided Machine Translation (VMT) aims to improve translation quality by integrating contextual information from paired short video clips. Mainstream VMT approaches typically incorporate multimodal information by uniformly sampling frames from the input videos. However, this paradigm frequently incurs significant computational overhead and introduces redundant multimodal content, which degrades both efficiency and translation quality. To tackle these challenges, we propose SHIFT (Selected Helpful Informative Frame for Translation). It is a lightweight, plug-and-play framework designed for VMT with Multimodal Large Language Models (MLLMs). SHIFT adaptively selects a single informative key frame when visual context is necessary; otherwise, it relies solely on textual input. This process is guided by a dedicated clustering module and a selector module. Experimental results demonstrate that SHIFT enhances the performance of MLLMs on the VMT task while simultaneously reducing computational cost, without sacrificing generalization ability. The code will be released upon acceptance.
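One plausible way to realize the key-frame selection described above is to cluster frame embeddings and return the frame nearest the centroid of the largest cluster. The sketch below assumes precomputed frame features and uses scikit-learn k-means; the feature extractor and the decision of when visual context is needed are outside its scope, and all names are illustrative rather than the paper's modules:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_key_frame(frame_features, n_clusters=4):
    """Pick one representative frame index from a (num_frames, dim) feature matrix."""
    n_clusters = min(n_clusters, len(frame_features))
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(frame_features)
    largest = np.bincount(km.labels_).argmax()                    # dominant visual content
    members = np.where(km.labels_ == largest)[0]
    dists = np.linalg.norm(frame_features[members] - km.cluster_centers_[largest], axis=1)
    return int(members[dists.argmin()])                           # frame to pass to the MLLM

# feats = np.random.rand(32, 512)  # stand-in for 32 frame embeddings
# print(select_key_frame(feats))
```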
pdf
bib
abs
Surge: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors
Bohan Lyu
|
Siqiao Huang
|
Zichen Liang
|
Qian Sun
|
Jiaming Zhang
Neural surrogate models are powerful and efficient tools in data mining. Meanwhile, large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks, such as generation and understanding. However, an equally important yet underexplored question is whether LLMs can serve as surrogate models for code execution prediction. To systematically investigate it, we introduce SURGE, a comprehensive benchmark with 1160 problems covering 8 key aspects: multi-language programming tasks, competition-level programming problems, repository-level code analysis, high-cost scientific computing, time-complexity-intensive algorithms, buggy code analysis, programs dependent on specific compilers or execution environments, and formal mathematical proof verification. Through extensive analysis of 21 open-source and proprietary LLMs, we examine scaling laws, data efficiency, and predictive accuracy. Our findings reveal important insights about the feasibility of LLMs as efficient surrogates for computational processes. The benchmark and evaluation framework are available at
https://github.com/Imbernoulli/SURGE.
pdf
bib
abs
Few-Shot Learning Translation from New Languages
Carlos Mullov
|
Alexander Waibel
Recent work shows strong transfer learning capability to unseen languages in sequence-to-sequence neural networks, under the assumption that we have high-quality word representations for the target language. We evaluate whether this direction is a viable path forward for translation from low-resource languages by investigating how much data is required to learn such high-quality word representations. We first show that learning word embeddings separately from a translation model can enable rapid adaptation to new languages with only a few hundred sentences of parallel data. To see whether the current bottleneck in transfer to low-resource languages lies mainly with learning the word representations, we then train word embedding models on varying amounts of data and plug them into a machine translation model. We show that in this simulated low-resource setting, with only 500 parallel sentences and 31,250 sentences of monolingual data, we can exceed 15 BLEU on Flores for unseen languages. Finally, we investigate why the results are less favorable for a real low-resource language and find fault with the publicly available multilingual language modelling datasets.
pdf
bib
abs
Humanizing Machines: Rethinking LLM Anthropomorphism Through a Multi-Level Framework of Design
Yunze Xiao
|
Lynnette Hui Xian Ng
|
Jiarui Liu
|
Mona T. Diab
Large Language Models (LLMs) increasingly exhibit anthropomorphic characteristics: human-like qualities portrayed across their outlook, language, behavior, and reasoning functions. Such characteristics enable more intuitive and engaging human-AI interactions. However, current research on anthropomorphism remains predominantly risk-focused, emphasizing over-trust and user deception while offering limited design guidance. We argue that anthropomorphism should instead be treated as a design concept that can be intentionally tuned to support user goals. Drawing from multiple disciplines, we propose that the anthropomorphism of an LLM-based artifact should reflect the interaction between artifact designers and interpreters. This interaction is facilitated by cues embedded in the artifact by the designers and the (cognitive) responses of the interpreters to those cues. Cues are categorized into four dimensions: perceptive, linguistic, behavioral, and cognitive. By analyzing the manifestation and effectiveness of each cue, we provide a unified taxonomy with actionable levers for practitioners. Consequently, we advocate for function-oriented evaluations of anthropomorphic design.
pdf
bib
abs
TokenSkip: Controllable Chain-of-Thought Compression in LLMs
Heming Xia
|
Chak Tou Leong
|
Wenjie Wang
|
Yongqi Li
|
Wenjie Li
Chain-of-Thought (CoT) has been proven effective in enhancing the reasoning capabilities of large language models (LLMs). Recent advancements, such as OpenAI’s o1 and DeepSeek-R1, suggest that scaling up the length of CoT sequences during inference could further boost LLM reasoning performance. However, due to the autoregressive nature of LLM decoding, longer CoT outputs lead to a linear increase in inference latency, adversely affecting user experience, particularly when the CoT exceeds 10,000 tokens. To address this limitation, we analyze the semantic importance of tokens within CoT outputs and reveal that their contributions to reasoning vary. Building on this insight, we propose TokenSkip, a simple yet effective approach that enables LLMs to selectively skip less important tokens, allowing for controllable CoT compression. Extensive experiments across various models and tasks demonstrate the effectiveness of TokenSkip in reducing CoT token usage while preserving strong reasoning performance. Notably, when applied to Qwen2.5-14B-Instruct, TokenSkip reduces reasoning tokens by 40% (from 313 to 181) on GSM8K, with less than a 0.4% performance drop.
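TokenSkip's core operation, as described, is to drop less important chain-of-thought tokens under a controllable compression ratio. A minimal sketch of that pruning step, assuming per-token importance scores are already available (how TokenSkip actually scores tokens is not shown here):

```python
def compress_cot(tokens, importances, keep_ratio=0.6):
    """Keep the top `keep_ratio` fraction of chain-of-thought tokens by
    importance, preserving their original order; the rest are skipped."""
    assert len(tokens) == len(importances)
    k = max(1, int(len(tokens) * keep_ratio))
    keep = set(sorted(range(len(tokens)),
                      key=lambda i: importances[i], reverse=True)[:k])
    return [tok for i, tok in enumerate(tokens) if i in keep]

# Example: a roughly 40% reduction, similar in spirit to the ratio reported for GSM8K.
cot = ["First", ",", "compute", "3", "*", "4", "=", "12", ".", "Then", "add", "5", "."]
scores = [0.2, 0.05, 0.9, 0.8, 0.7, 0.8, 0.6, 0.95, 0.05, 0.3, 0.9, 0.85, 0.05]
print(" ".join(compress_cot(cot, scores, keep_ratio=0.6)))
```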
pdf
bib
abs
Are Generative Models Underconfident? Better Quality Estimation with Boosted Model Probability
Tu Anh Dinh
|
Jan Niehues
Quality Estimation (QE) is the task of estimating the quality of a model's output at inference time, when the ground truth is not available. Deriving output quality from the model's output probability is the simplest and lowest-effort approach. However, we show that the output probability of text-generation models can appear underconfident: at each output step, there can be multiple correct options, which spreads out the probability distribution, so a lower probability does not necessarily mean lower output quality. Motivated by this observation, we propose a QE approach called BoostedProb, which boosts the model's confidence in cases where there are multiple viable output options. With no increase in complexity, BoostedProb is notably better than raw model probability in different settings, achieving on average a +0.194 improvement in Pearson correlation with ground-truth quality. It also comes close to or outperforms more costly approaches such as supervised or ensemble-based QE in certain settings.
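The argument is that when several next tokens are valid, the chosen token's probability understates output quality. Below is a hedged sketch of one way to "boost" a step's confidence by crediting near-tied alternatives; the actual BoostedProb rule for deciding which options count as viable is not given here, so the margin criterion is an assumption.

```python
import numpy as np

def boosted_step_confidence(step_probs: np.ndarray, chosen: int,
                            margin: float = 0.5) -> float:
    """Credit the chosen token with the combined mass of all tokens whose
    probability is within `margin` of the top token, treating them as
    interchangeable correct options (assumed viability rule)."""
    top = step_probs.max()
    viable = step_probs >= margin * top
    if viable[chosen]:
        return float(step_probs[viable].sum())
    return float(step_probs[chosen])

def sequence_quality(step_prob_list, chosen_ids):
    """Average log of the boosted per-step confidences as a quality proxy."""
    vals = [boosted_step_confidence(p, c) for p, c in zip(step_prob_list, chosen_ids)]
    return float(np.mean(np.log(np.clip(vals, 1e-9, 1.0))))

# Two equally good synonyms split the mass; boosting restores confidence.
probs = np.array([0.45, 0.44, 0.06, 0.05])
print(boosted_step_confidence(probs, chosen=0))  # 0.89 instead of 0.45
```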
pdf
bib
abs
reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs
Zhaofeng Wu
|
Michihiro Yasunaga
|
Andrew Cohen
|
Yoon Kim
|
Asli Celikyilmaz
|
Marjan Ghazvininejad
Reward models have become a staple in modern NLP, serving as not only a scalable text evaluator, but also an indispensable component in many alignment recipes and inference-time algorithms. However, while recent reward models increase performance on standard benchmarks, this may partly be due to overfitting effects, which would confound an understanding of their true capability. In this work, we scrutinize the robustness of reward models and the extent of such overfitting. We build reWordBench, which systematically transforms reward model inputs in meaning- or ranking-preserving ways. We show that state-of-the-art reward models suffer from substantial performance degradation even with minor input transformations, sometimes dropping to significantly below-random accuracy, suggesting brittleness. To improve reward model robustness, we propose to explicitly train them to assign similar scores to paraphrases, and find that this approach also improves robustness to other distinct kinds of transformations. For example, our robust reward model reduces such degradation by roughly half for the Chat Hard subset in RewardBench. Furthermore, when used in alignment, our robust reward models demonstrate better utility and lead to higher-quality outputs, winning in up to 59% of instances against a standardly trained RM.
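The proposed fix is to train reward models so that paraphrases of the same input receive similar scores. A minimal sketch of such a consistency regularizer added to a standard pairwise ranking loss; the scoring model and the exact penalty form are illustrative assumptions, not the paper's recipe.

```python
import torch

def robust_rm_loss(score_fn, chosen, rejected, chosen_para, rejected_para, lam=0.1):
    """Pairwise ranking loss on the original pair plus an L2 penalty that ties
    each paraphrase's score to its original's score."""
    s_c, s_r = score_fn(chosen), score_fn(rejected)
    ranking = -torch.nn.functional.logsigmoid(s_c - s_r).mean()
    consistency = ((score_fn(chosen_para) - s_c.detach()) ** 2 +
                   (score_fn(rejected_para) - s_r.detach()) ** 2).mean()
    return ranking + lam * consistency

# Toy demo with a linear "reward model" over pre-computed text features.
torch.manual_seed(0)
head = torch.nn.Linear(16, 1)
feats = {name: torch.randn(4, 16) for name in
         ("chosen", "rejected", "chosen_para", "rejected_para")}
loss = robust_rm_loss(lambda x: head(x).squeeze(-1),
                      feats["chosen"], feats["rejected"],
                      feats["chosen_para"], feats["rejected_para"])
loss.backward()
print(float(loss))
```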
pdf
bib
abs
Why Do Some Inputs Break Low-Bit LLM Quantization?
Ting-Yun Chang
|
Muru Zhang
|
Jesse Thomason
|
Robin Jia
Low-bit weight-only quantization significantly reduces the memory footprint of large language models (LLMs), but disproportionately affects certain examples. We analyze diverse 3–4-bit quantization methods on LLMs ranging from 7B to 70B parameters and find that the quantization errors of 50 pairs of methods are strongly correlated (avg. 𝜌 = 0.82) on FineWeb examples. Moreover, the residual stream magnitudes of full-precision models are indicative of future quantization errors. We further establish a hypothesis that relates the residual stream magnitudes to error amplification and accumulation over layers. Using LLM localization techniques, early exiting, and activation patching, we show that examples with large errors rely on precise residual activations in the late layers, and that the outputs of MLP gates play a crucial role in maintaining perplexity. Our work reveals why certain examples result in large quantization errors and which model components are most critical for performance preservation.
pdf
bib
abs
LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation
Keisuke Kamahori
|
Jungo Kasai
|
Noriyuki Kojima
|
Baris Kasikci
Modern automatic speech recognition (ASR) models, such as OpenAI’s Whisper, rely on deep encoder-decoder architectures, and their encoders are a critical bottleneck for efficient deployment due to high computational intensity. We introduce LiteASR, a low-rank compression scheme for ASR encoders that significantly reduces inference costs while maintaining transcription accuracy. Our approach leverages the strong low-rank properties observed in intermediate activations: by applying principal component analysis (PCA) with a small calibration dataset, we approximate linear transformations with a chain of low-rank matrix multiplications, and further optimize self-attention to work in reduced dimensionality. Evaluation results show that our method can compress Whisper large-v3’s encoder size by over 50%, matching Whisper medium’s size with better transcription accuracy, thereby establishing a new Pareto frontier of accuracy and efficiency. The code of LiteASR is available at https://github.com/efeslab/LiteASR.
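The compression recipe in the abstract is to use PCA over a small set of calibration activations to replace a dense linear layer with two thin matmuls. A minimal numpy sketch of that activation-aware factorization follows; it is a generic illustration, and LiteASR's exact procedure, including its attention-specific handling, is not reproduced here.

```python
import numpy as np

def pca_low_rank_factorize(W: np.ndarray, X_calib: np.ndarray, rank: int):
    """Approximate y = x @ W.T with (x @ A.T) @ B.T.

    W:       (out_dim, in_dim) original weight.
    X_calib: (n_samples, in_dim) calibration activations feeding this layer.
    A projects inputs onto their top `rank` principal directions; B folds the
    original weight into that subspace.
    """
    _, _, Vt = np.linalg.svd(X_calib, full_matrices=False)
    A = Vt[:rank]            # (rank, in_dim)
    B = W @ A.T              # (out_dim, rank)
    return A, B

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Real encoder activations are strongly low-rank; simulate that here.
    X = rng.normal(size=(256, 32)) @ rng.normal(size=(32, 768))
    W = rng.normal(size=(1024, 768))
    A, B = pca_low_rank_factorize(W, X, rank=64)
    rel_err = np.linalg.norm(X @ W.T - (X @ A.T) @ B.T) / np.linalg.norm(X @ W.T)
    print(f"relative error: {rel_err:.2e}")  # near zero once rank covers the activations
```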
pdf
bib
abs
AROMA: Autonomous Rank-one Matrix Adaptation
Hao Nan Sheng
|
Zhi-Yong Wang
|
Hing Cheung So
|
Mingrui Yang
As large language models continue to grow in size, parameter-efficient fine-tuning (PEFT) has become increasingly crucial. While low-rank adaptation (LoRA) offers a solution through low-rank updates, its static rank allocation may yield suboptimal results. Adaptive low-rank adaptation (AdaLoRA) improves this with dynamic allocation but remains sensitive to initial and target rank configurations. We introduce AROMA, a framework that automatically constructs layer-specific updates by iteratively building up rank-one components with very few trainable parameters that gradually diminish to zero. Unlike existing methods that employ rank reduction mechanisms, AROMA introduces a dual-loop architecture for rank growth. The inner loop extracts information from each rank-one subspace, while the outer loop determines the number of rank-one subspaces, i.e., the optimal rank. We reset optimizer states to maintain subspace independence. AROMA significantly reduces the number of trainable parameters compared to LoRA and AdaLoRA while achieving superior performance on natural language understanding, natural language generation, and commonsense reasoning, offering new insights into adaptive PEFT.
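The building block named in the abstract is a rank-one update whose components are grown one at a time per layer. A minimal PyTorch sketch of a single rank-one adapter on top of a frozen weight; the growth and stopping logic of AROMA's inner and outer loops is not shown, and the names are illustrative.

```python
import torch

class RankOneAdapter(torch.nn.Module):
    """A single u v^T correction to a frozen linear layer. A rank-growth method
    would stack several of these, adding a new one only while it still helps."""
    def __init__(self, out_dim: int, in_dim: int):
        super().__init__()
        self.u = torch.nn.Parameter(torch.zeros(out_dim))        # starts at zero
        self.v = torch.nn.Parameter(torch.randn(in_dim) * 0.01)

    def forward(self, x: torch.Tensor, frozen_weight: torch.Tensor) -> torch.Tensor:
        base = x @ frozen_weight.T                 # frozen pretrained path
        update = torch.outer(self.u, self.v)       # rank-one delta, trained
        return base + x @ update.T

frozen = torch.randn(64, 128)
adapter = RankOneAdapter(64, 128)
y = adapter(torch.randn(4, 128), frozen)
print(y.shape)  # torch.Size([4, 64])
```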
pdf
bib
abs
Large Language Models Have Intrinsic Meta-Cognition, but Need a Good Lens
Ziyang Ma
|
Qingyue Yuan
|
Zhenglin Wang
|
Deyu Zhou
Previous research has primarily focused on the cognitive error detection capabilities of Large Language Models (LLMs), often prompting them to analyze mistakes in reasoning chains. However, few studies have examined the meta-cognitive abilities of LLMs (e.g., their self-awareness of step errors), which are crucial for their reliability. While studies on LLM self-evaluation present some measures, such as perplexity, which can reflect the answer correctness and be viewed as the lens of meta-cognition, they lack step-level analysis and adaptation. This paper studies the evaluation of LLM meta-cognition using the current lenses and how to improve these lenses. Specifically, we propose AutoMeco, an Automated Meta-cognition Evaluation framework for benchmarking the existing lenses. Furthermore, a training-free Markovian Intrinsic Reward Adjustment strategy, MIRA, is proposed to boost current meta-cognition lenses. Experimental results on three mathematical reasoning datasets and three LLMs show the reasonableness of AutoMeco by comparing it with Best-of-N verification. Moreover, the meta-cognition ability of LLMs can be better evaluated using MIRA.
pdf
bib
abs
Anchoring-Guidance Fine-Tuning (AnGFT): Elevating Professional Response Quality in Role-Playing Conversational Agents
Qibin Li
|
Zhen Xu
|
Shengyuan Bai
|
Nianmin Yao
|
Kaili Sun
|
Bowen Wu
|
Ying Li
|
Baoxun Wang
Large Language Models (LLMs) have demonstrated significant advancements in various fields, notably in Role-Playing Conversational Agents (RPCAs). However, when confronted with role-specific professional inquiries, LLM-based RPCAs tend to underperform due to their excessive emphasis on the conversational abilities of characters rather than effectively invoking and integrating relevant expert knowledge. This often results in inaccurate responses. We refer to this phenomenon as “Knowledge Misalignment,” which underscores the limitations of RPCAs in integrating expert knowledge. To mitigate this issue, we introduce an Anchoring-Guidance Fine-Tuning (AnGFT) Framework into the RPCAs’ training process. This involves initially linking the Anchoring-Based System Prompt (ASP) with the LLM’s relevant expert domains through diverse prompt construction strategies and supervised fine-tuning (SFT). Following the role-play enriched SFT, the integration of ASP enables LLMs to better associate with relevant expert knowledge, thus enhancing their response capabilities in role-specific expert domains. Moreover, we develop four comprehensive metrics (helpfulness, thoroughness, credibility, and feasibility) to evaluate the proficiency of RPCAs in responding to professional questions. Our method was tested across four professional fields, and the experimental outcomes suggest that the proposed AnGFT Framework substantially improves the RPCAs’ performance in handling role-specific professional queries, while preserving their robust role-playing abilities.
pdf
bib
abs
RiTTA: Modeling Event Relations in Text-to-Audio Generation
Yuhang He
|
Yash Jain
|
Xubo Liu
|
Andrew Markham
|
Vibhav Vineet
Existing text-to-audio (TTA) generation methods have neither systematically explored audio event relation modeling, nor proposed any new framework to enhance this capability. In this work, we systematically study audio event relation modeling in TTA generation models. We first establish a benchmark for this task by: (1) proposing a comprehensive relation corpus covering all potential relations in real-world scenarios; (2) introducing a new audio event corpus encompassing commonly heard audio events; and (3) proposing new evaluation metrics to assess audio event relation modeling from various perspectives. Furthermore, we propose a gated prompt tuning strategy that improves existing TTA models’ relation modeling capability with negligible extra parameters. Specifically, we introduce learnable relation and event prompts that are appended to the text prompt before it is fed to existing TTA models.
pdf
bib
abs
Shallow Focus, Deep Fixes: Enhancing Shallow Layers Vision Attention Sinks to Alleviate Hallucination in LVLMs
Xiaofeng Zhang
|
Yihao Quan
|
Chen Shen
|
Chaochen Gu
|
Xiaosong Yuan
|
Shaotian Yan
|
Jiawei Cao
|
Hao Cheng
|
Kaijie Wu
|
Jieping Ye
Multimodal large language models (MLLMs) demonstrate excellent abilities for understanding visual information, yet hallucination remains. Although image tokens constitute the majority of the MLLM input, the relation between image tokens and hallucinations is still underexplored. In this paper, we analyze the attention score distribution of image tokens across layers and attention heads and reveal an intriguing but common phenomenon: most hallucinations are closely linked to attention sink patterns in the image-token attention matrix, where shallow layers exhibit dense sinks and deep layers exhibit sparse ones. We further examine the attention heads of different layers and find that heads with dense attention sinks over the image tokens play a positive role in mitigating hallucinations. Inspired by these findings, we propose a training-free approach called Enhancing Vision Attention Sinks (EVAS) to facilitate the convergence of image-token attention sinks within shallow layers. Specifically, EVAS identifies the attention head with the densest visual sink in each shallow layer and extracts its attention matrix, which is then broadcast to the other heads of the same layer, thereby strengthening the layer’s focus on the image itself. Extensive empirical results on various MLLMs illustrate the superior performance of the proposed EVAS, demonstrating its effectiveness and generality.
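As described, EVAS finds the shallow-layer head with the densest visual attention sink and copies its attention pattern to the layer's other heads. A simplified numpy sketch of that broadcast step; the criterion used here to call a head the "densest" sink is an assumption.

```python
import numpy as np

def evas_broadcast(attn: np.ndarray, image_start: int, image_end: int) -> np.ndarray:
    """attn: (num_heads, seq_len, seq_len) attention weights of one shallow layer.

    Score each head by the attention mass it places on the image-token span,
    pick the densest one, and overwrite every head with its pattern."""
    image_mass = attn[:, :, image_start:image_end].sum(axis=(1, 2))
    densest_head = int(image_mass.argmax())
    return np.broadcast_to(attn[densest_head], attn.shape).copy()

rng = np.random.default_rng(0)
raw = rng.random((8, 32, 32))
attn = raw / raw.sum(axis=-1, keepdims=True)   # row-normalized, like softmax output
print(evas_broadcast(attn, image_start=0, image_end=24).shape)  # (8, 32, 32)
```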
pdf
bib
abs
WangchanThaiInstruct: An instruction-following Dataset for Culture-Aware, Multitask, and Multi-domain Evaluation in Thai
Peerat Limkonchotiwat
|
Pume Tuchinda
|
Lalita Lowphansirikul
|
Surapon Nonesung
|
Panuthep Tasawong
|
Alham Fikri Aji
|
Can Udomcharoenchaikit
|
Sarana Nutanong
Large language models excel at instruction-following in English, but their performance in low-resource languages like Thai remains underexplored. Existing benchmarks often rely on translations, missing cultural and domain-specific nuances needed for real-world use. We present WangchanThaiInstruct, a human-authored Thai dataset for evaluation and instruction tuning, covering four professional domains and seven task types. Created through a multi-stage quality control process with annotators, domain experts, and AI researchers, WangchanThaiInstruct supports two studies: (1) a zero-shot evaluation showing performance gaps on culturally and professionally specific tasks, and (2) an instruction tuning study with ablations isolating the effect of native supervision. Models fine-tuned on WangchanThaiInstruct outperform those using translated data in both in-domain and out-of-domain benchmarks. These findings underscore the need for culturally and professionally grounded instruction data to improve LLM alignment in low-resource, linguistically diverse settings.
pdf
bib
abs
MemeReaCon: Probing Contextual Meme Understanding in Large Vision-Language Models
Zhengyi Zhao
|
Shubo Zhang
|
Yuxi Zhang
|
Yanxi Zhao
|
Yifan Zhang
|
Zezhong Wang
|
Huimin Wang
|
Yutian Zhao
|
Bin Liang
|
Yefeng Zheng
|
Binyang Li
|
Kam-Fai Wong
|
Xian Wu
Memes have emerged as a popular form of multimodal online communication, where their interpretation heavily depends on the specific context in which they appear. Current approaches predominantly focus on isolated meme analysis, either for harmful content detection or standalone interpretation, overlooking a fundamental challenge: the same meme can express different intents depending on its conversational context. This oversight creates an evaluation gap: although humans intuitively recognize how context shapes meme interpretation, Large Vision Language Models (LVLMs) can hardly understand context-dependent meme intent. To address this critical limitation, we introduce MemeReaCon, a novel benchmark specifically designed to evaluate how LVLMs understand memes in their original context. We collected memes from five different Reddit communities, keeping each meme’s image, the post text, and user comments together. We carefully labeled how the text and meme work together, what the poster intended, how the meme is structured, and how the community responded. Our tests with leading LVLMs show a clear weakness: models either fail to interpret critical information in the contexts, or overly focus on visual details while overlooking communicative purpose. MemeReaCon thus serves both as a diagnostic tool exposing current limitations and as a challenging benchmark to drive development toward LVLMs with more sophisticated context-aware understanding.
pdf
bib
abs
A Comprehensive Literary Chinese Reading Comprehension Dataset with an Evidence Curation Based Solution
Dongning Rao
|
Rongchu Zhou
|
Peng Chen
|
Zhihua Jiang
Low-resource language understanding is challenging, even for large language models (LLMs). An epitome of this problem is CompRehensive lIterary chineSe readIng comprehenSion (CRISIS), whose difficulties include limited linguistic data, long inputs, and insight-requiring questions. Besides the compelling necessity of providing a larger dataset for CRISIS, excessive information, order bias, and entangled conundrums still haunt CRISIS solutions. Thus, we present the eVIdence cuRation with opTion shUffling and Abstract meaning representation-based cLauses segmenting (VIRTUAL) procedure for CRISIS, together with the largest dataset for the task. While the dataset is also named CRISIS, it results from a three-phase construction process, including question selection, data cleaning, and a silver-standard data augmentation step, which augments CRISIS with translations, celebrity profiles, government jobs, reign mottos, and dynasty information. The six steps of VIRTUAL are embedding, shuffling, abstract meaning representation-based option segmenting, evidence extracting, solving, and voting. Notably, the evidence extraction algorithm gathers literary Chinese evidence sentences, translated evidence sentences, and keyword annotations with a similarity-based ranking strategy. While CRISIS congregates understanding-required questions from seven sources, the experiments on CRISIS substantiate the effectiveness of VIRTUAL, with a 7 percent gain in accuracy compared with the baseline. Interestingly, both non-LLMs and LLMs exhibit order bias, and abstract meaning representation-based option segmenting is constructive for CRISIS.
pdf
bib
abs
Dialect-SQL: An Adaptive Framework for Bridging the Dialect Gap in Text-to-SQL
Jie Shi
|
Xi Cao
|
Bo Xu
|
Jiaqing Liang
|
Yanghua Xiao
|
Jia Chen
|
Peng Wang
|
Wei Wang
Text-to-SQL is the task of translating natural language questions into SQL queries based on relational databases. Different databases implement their own SQL dialects, leading to variations in syntax. As a result, SQL queries designed for one database may not execute properly in another, creating a dialect gap. Existing Text-to-SQL research primarily focuses on specific database systems, limiting adaptability to different dialects. This paper proposes a novel adaptive framework called Dialect-SQL, which employs Object Relational Mapping (ORM) code as an intermediate language to bridge this gap. Given a question, we guide Large Language Models (LLMs) to first generate ORM code, which is then parsed into SQL queries targeted for specific databases. However, there is a lack of high-quality Text-to-Code datasets that enable LLMs to effectively generate ORM code. To address this issue, we propose a bootstrapping approach to synthesize ORM code, where verified ORM code is iteratively integrated into a demonstration pool that serves as in-context examples for ORM code generation. Our experiments demonstrate that Dialect-SQL significantly enhances dialect adaptability, outperforming traditional methods that generate SQL queries directly. Our code and data are released at https://github.com/jieshi10/orm-sql.
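The framework's key idea is that ORM code is dialect-agnostic and can be compiled down to whatever SQL dialect the target database speaks. As a small illustration of why an ORM layer bridges dialects, the sketch below uses SQLAlchemy; the abstract does not say which ORM Dialect-SQL generates, so this is only an example of the general mechanism.

```python
from sqlalchemy import Column, Integer, MetaData, String, Table, select
from sqlalchemy.dialects import mysql, postgresql, sqlite

metadata = MetaData()
users = Table(
    "users", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String(64)),
)

# One dialect-neutral ORM expression...
stmt = select(users.c.name).where(users.c.id > 10).limit(5)

# ...rendered into three different SQL dialects by the ORM's compiler.
for dialect in (postgresql.dialect(), mysql.dialect(), sqlite.dialect()):
    print(stmt.compile(dialect=dialect, compile_kwargs={"literal_binds": True}))
```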
pdf
bib
abs
FinMTEB: Finance Massive Text Embedding Benchmark
Yixuan Tang
|
Yi Yang
The efficacy of text embedding models in representing and retrieving information is crucial for many NLP applications, with performance significantly advanced by Large Language Models (LLMs). Despite this progress, existing benchmarks predominantly use general-purpose datasets, inadequately addressing the nuanced requirements of specialized domains like finance. To bridge this gap, we introduce the Finance Massive Text Embedding Benchmark (FinMTEB), a comprehensive evaluation suite specifically designed for the financial domain. FinMTEB encompasses 64 datasets across 7 task types, including classification, clustering, retrieval, pair classification, reranking, summarization, and semantic textual similarity (STS) in English and Chinese. Alongside this benchmark, we introduce Fin-E5, a state-of-the-art finance-adapted embedding model, ranking first on FinMTEB. Fin-E5 is developed by fine-tuning e5-Mistral-7B-Instruct on a novel persona-based synthetic dataset tailored for diverse financial embedding tasks. Evaluating 15 prominent embedding models on FinMTEB, we derive three key findings: (1) domain-specific models, including our Fin-E5, significantly outperform general-purpose models; (2) performance on general benchmarks is a poor predictor of success on financial tasks; and (3) surprisingly, traditional Bag-of-Words (BoW) models surpass dense embedding models on financial STS tasks. This work provides a robust benchmark for financial NLP and offers actionable insights for developing future domain-adapted embedding solutions. Both FinMTEB and Fin-E5 will be open-sourced for the research community.
pdf
bib
abs
Scaling Rich Style-Prompted Text-to-Speech Datasets
Anuj Diwan
|
Zhisheng Zheng
|
David Harwath
|
Eunsol Choi
We introduce Paralinguistic Speech Captions (ParaSpeechCaps), a large-scale dataset that annotates speech utterances with rich style captions. While rich abstract tags (e.g. guttural, nasal, pained) have been explored in small-scale human-annotated datasets, existing large-scale datasets only cover basic tags (e.g. low-pitched, slow, loud). We combine off-the-shelf text and speech embedders, classifiers and an audio language model to automatically scale rich tag annotations for the first time. ParaSpeechCaps covers a total of 59 style tags, including both speaker-level intrinsic tags and utterance-level situational tags. It consists of 282 hours of human-labelled data (PSC-Base) and 2427 hours of automatically annotated data (PSC-Scaled). We finetune Parler-TTS, an open-source style-prompted TTS model, on ParaSpeechCaps, and achieve improved style consistency (+7.9% Consistency MOS) and speech quality (+15.5% Naturalness MOS) over the best performing baseline that combines existing rich style tag datasets. We ablate several of our dataset design choices to lay the foundation for future work in this space. Our dataset, models and code are released at https://github.com/ajd12342/paraspeechcaps .
pdf
bib
abs
Exploring Changes in Nation Perception with Nationality-Assigned Personas in LLMs
Mahammed Kamruzzaman
|
Gene Louis Kim
Persona assignment has become a common strategy for customizing LLM use to particular tasks and contexts. In this study, we explore how evaluation of different nations changes when LLMs are assigned specific nationality personas. We assign 193 different nationality personas (e.g., an American person) to five LLMs and examine how the LLM evaluations (or *“perceptions”*) of countries change. We find that all LLM-persona combinations tend to favor Western European nations, though nation-personas push LLM behaviors to focus more on and treat the nation-persona’s own region more favorably. Eastern European, Latin American, and African nations are treated more negatively by different nationality personas. We additionally find that evaluations by nation-persona LLMs of other nations correlate with human survey responses but fail to match the values closely. Our study provides insight into how biases and stereotypes are realized within LLMs when adopting different national personas. Our findings underscore the critical need for developing mechanisms to ensure that LLM outputs promote fairness and avoid over-generalization.
pdf
bib
abs
Eliciting Implicit Acoustic Styles from Open-domain Instructions to Facilitate Fine-grained Controllable Generation of Speech
Jianxing Yu
|
Gou Zihao
|
Chen Li
|
Zhisheng Wang
|
Peiji Yang
|
Wenqing Chen
|
Jian Yin
This paper focuses on generating speech with the acoustic style that meets users’ needs based on their open-domain instructions. To control the style, early work mostly relies on pre-defined rules or templates. The control types and formats are fixed in a closed domain, making it hard to meet the diverse needs of users. One solution is to resort to instructions in free text to guide the generation. Current work mainly studies instructions that clearly specify the acoustic styles, such as low pitch and fast speed. However, instructions can be complex, and some are even vague and abstract, such as “Generate a voice of a woman who is heartbroken due to a breakup.” It is hard to infer this implicit style with traditional matching-based methods. To address this problem, we propose a new controllable model. It first utilizes multimodal LLMs with knowledge-augmented techniques to infer the desired speech style from the instructions. The powerful language understanding ability of LLMs helps elicit the implicit style factors from the instruction. Using these factors as a control condition, we design a diffusion-based generator adept at finely adjusting speech details, which offers higher flexibility to meet complex user needs. Next, we verify the output speech from three aspects, i.e., consistency of decoding state, mel-spectrogram, and instruction style. This verified feedback can inversely optimize the generator. Extensive experiments are conducted on three popular datasets. The results show the effectiveness and good controllability of our approach.
pdf
bib
abs
OBLIVIATE: Robust and Practical Machine Unlearning for Large Language Models
Xiaoyu Xu
|
Minxin Du
|
Qingqing Ye
|
Haibo Hu
Large language models (LLMs) trained over extensive corpora risk memorizing sensitive, copyrighted, or toxic content. To address this, we propose OBLIVIATE, a robust unlearning framework that removes targeted data while preserving model utility. The framework follows a structured process: extracting target tokens, building retain sets, and fine-tuning with a tailored loss function comprising three components (masking, distillation, and world fact). Using low-rank adapters (LoRA) ensures efficiency without compromising unlearning quality. We conduct experiments on multiple datasets, including the Harry Potter series, WMDP, and TOFU, using a comprehensive suite of metrics: forget quality (via a new document-level memorization score), model utility, and fluency. Results demonstrate its effectiveness in resisting membership inference attacks, minimizing the impact on retained data, and maintaining robustness across diverse scenarios.
pdf
bib
abs
AdaptThink: Reasoning Models Can Learn When to Think
Jiajie Zhang
|
Nianyi Lin
|
Lei Hou
|
Ling Feng
|
Juanzi Li
Recently, large reasoning models have achieved impressive performance on various tasks by employing human-like deep thinking. However, the lengthy thinking process substantially increases inference overhead, making efficiency a critical bottleneck. In this work, we first demonstrate that NoThinking, which prompts the reasoning model to skip thinking and directly generate the final solution, is a better choice for relatively simple tasks in terms of both performance and efficiency. Motivated by this, we propose AdaptThink, a novel RL algorithm to teach reasoning models to choose the optimal thinking mode adaptively based on problem difficulty. Specifically, AdaptThink features two core components: (1) a constrained optimization objective that encourages the model to choose NoThinking while maintaining the overall performance; (2) an importance sampling strategy that balances Thinking and NoThinking samples during on-policy training, thereby enabling cold start and allowing the model to explore and exploit both thinking modes throughout the training process. Our experiments indicate that AdaptThink significantly reduces the inference costs while further enhancing performance. Notably, on three math datasets, AdaptThink reduces the average response length of DeepSeek-R1-Distill-Qwen-1.5B by 53% and improves its accuracy by 2.4%, highlighting the promise of adaptive thinking-mode selection for optimizing the balance between reasoning quality and efficiency.
pdf
bib
abs
T2: An Adaptive Test-Time Scaling Strategy for Contextual Question Answering
Zhengyi Zhao
|
Shubo Zhang
|
Zezhong Wang
|
Huimin Wang
|
Yutian Zhao
|
Bin Liang
|
Yefeng Zheng
|
Binyang Li
|
Kam-Fai Wong
|
Xian Wu
Recent advances in large language models have demonstrated remarkable performance on Contextual Question Answering (CQA). However, prior approaches typically employ elaborate reasoning strategies regardless of question complexity, leading to low adaptability. Recent efficient test-time scaling methods introduce budget constraints or early-stop mechanisms to avoid overthinking on straightforward questions, but they add human bias to the reasoning process and fail to leverage models’ inherent reasoning capabilities. To address these limitations, we present T2: Think-to-Think, a novel framework that dynamically adapts reasoning depth based on question complexity. T2 leverages the insight that if an LLM can effectively solve similar questions using specific reasoning strategies, it can apply the same strategy to the original question. This insight enables the adoption of concise reasoning for straightforward questions while maintaining detailed analysis for complex problems. T2 works through four key steps: decomposing questions into structural elements, generating similar examples with candidate reasoning strategies, evaluating these strategies against multiple criteria, and applying the most appropriate strategy to the original question. Experimental evaluation across seven diverse CQA benchmarks demonstrates that T2 not only achieves higher accuracy than baseline methods but also reduces computational overhead by up to 25.2%.
pdf
bib
abs
Non-Existent Relationship: Fact-Aware Multi-Level Machine-Generated Text Detection
Yang Wu
|
Ruijia Wang
|
Jie Wu
Machine-generated text detection is critical for preventing misuse of large language models (LLMs). Although LLMs have recently excelled at mimicking human writing styles, they still suffer from factual hallucinations manifested as entity-relation inconsistencies with real-world knowledge. Current detection methods inadequately address the authenticity of the entity graph, which is a key discriminative feature for identifying machine-generated content. To bridge this gap, we propose a fact-aware model that assesses discrepancies between textual and factual entity graphs through graph comparison. In order to holistically analyze context information, our approach employs hierarchical feature extraction with gating units, enabling the adaptive fusion of multi-grained features from entity, sentence, and document levels. Experimental results on three public datasets demonstrate that our approach outperforms the state-of-the-art methods. Interpretability analysis shows that our model can capture the differences in entity graphs between machine-generated and human-written texts.
pdf
bib
abs
Calibrating Verbal Uncertainty as a Linear Feature to Reduce Hallucinations
Ziwei Ji
|
Lei Yu
|
Yeskendir Koishekenov
|
Yejin Bang
|
Anthony Hartshorn
|
Alan Schelten
|
Cheng Zhang
|
Pascale Fung
|
Nicola Cancedda
LLMs often adopt an assertive language style even when making false claims. Such “overconfident hallucinations” mislead users and erode trust. The ability to express in language the actual degree of uncertainty around a claim is therefore of great importance. We find that “verbal uncertainty” is governed by a single linear feature in the representation space of LLMs, and show that it has only moderate correlation with the model’s actual “semantic uncertainty.” Applying this insight, we show that (1) the mismatch between semantic and verbal uncertainty is a better predictor of hallucinations than semantic uncertainty alone, and (2) we can intervene on verbal uncertainty at inference time and reduce confident hallucinations on short-form answers, achieving an average relative reduction of ~30%.
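Since verbal uncertainty is reported to behave as a single linear feature, the inference-time intervention amounts to shifting hidden states along that direction. A minimal sketch of such a steering step; the layer choice, the scaling, and the way the direction was estimated are all outside this snippet and are not taken from the paper.

```python
import numpy as np

def steer_verbal_uncertainty(hidden: np.ndarray, direction: np.ndarray,
                             alpha: float) -> np.ndarray:
    """Shift a hidden state along the unit-normalized verbal-uncertainty
    direction. A positive alpha should push wording toward hedging, a
    negative alpha toward assertiveness, if the linear-feature finding holds."""
    v = direction / np.linalg.norm(direction)
    return hidden + alpha * v

rng = np.random.default_rng(0)
h = rng.normal(size=4096)           # one token's residual-stream state (toy)
v = rng.normal(size=4096)           # pre-computed uncertainty direction (toy)
calibrated = steer_verbal_uncertainty(h, v, alpha=2.0)
print(calibrated.shape)
```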
pdf
bib
abs
JUREX-4E: Juridical Expert-Annotated Four-Element Knowledge Base for Legal Reasoning
Huanghai Liu
|
Quzhe Huang
|
Qingjing Chen
|
Yiran Hu
|
Jiayu Ma
|
Yun Liu
|
Weixing Shen
|
Yansong Feng
In recent years, Large Language Models (LLMs) have been widely applied to legal tasks. To enhance their understanding of legal texts and improve reasoning accuracy, a promising approach is to incorporate legal theories. One of the most widely adopted theories is the Four-Element Theory (FET), which defines the crime constitution through four elements: Subject, Object, Subjective Aspect, and Objective Aspect. While recent work has explored prompting LLMs to follow FET, our evaluation demonstrates that the LLM-generated four elements are often incomplete and less representative, limiting their effectiveness in legal reasoning. To address these issues, we present JUREX-4E, an expert-annotated four-element knowledge base covering 155 criminal charges. The annotations follow a progressive hierarchical framework grounded in legal source validity and incorporate diverse interpretive methods to ensure precision and authority. We evaluate JUREX-4E on the Similar Charge Disambiguation task and apply it to Legal Case Retrieval. Experimental results validate the high quality of JUREX-4E and its substantial impact on downstream legal tasks, underscoring its potential for advancing legal AI applications. The dataset and code are available at:
https://github.com/THUlawtech/JUREX
pdf
bib
abs
CIE: Controlling Language Model Text Generations Using Continuous Signals
Vinay Samuel
|
Harshita Diddee
|
Yiming Zhang
|
Daphne Ippolito
Aligning language models (LMs) with user intent is becoming increasingly relevant to enhance user experience. This calls for designing methods that can allow users to control the properties of the language that LMs generate, for example, controlling the length of the generation or the complexity of the language that gets chosen. Most existing work attempts to integrate users’ control by conditioning LM generations on natural language prompts or discrete control signals, which are often brittle and hard to scale. In this work, we are interested in continuous control signals, ones that exist along a spectrum that can’t easily be captured in a natural language prompt or via existing techniques in conditional generation. Through a case study in controlling the precise response-length of generations, we demonstrate how an LM can be finetuned to expect a control vector that is interpolated between a “low” and a “high” token embedding. Our method more reliably exerts response-length control than in-context learning methods or fine-tuning methods that represent the control signal as a discrete signal.
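The conditioning mechanism described is a single control vector interpolated between a "low" and a "high" token embedding. A minimal sketch of constructing that vector and prepending it to the input embeddings; how the scalar maps to a target length and where exactly the vector enters the model are assumptions here.

```python
import torch

def control_vector(low_emb: torch.Tensor, high_emb: torch.Tensor, t: float) -> torch.Tensor:
    """Linearly interpolate between the 'low' and 'high' control embeddings;
    t = 0.0 asks for the shortest responses, t = 1.0 for the longest."""
    t = min(max(t, 0.0), 1.0)
    return (1.0 - t) * low_emb + t * high_emb

def prepend_control(input_embs: torch.Tensor, ctrl: torch.Tensor) -> torch.Tensor:
    """Prepend the control vector as an extra 'token' before the prompt."""
    batch = input_embs.size(0)
    ctrl = ctrl.expand(batch, 1, -1)
    return torch.cat([ctrl, input_embs], dim=1)

dim = 768
low, high = torch.randn(dim), torch.randn(dim)
prompt_embs = torch.randn(2, 10, dim)             # batch of 2 prompts, 10 tokens each
conditioned = prepend_control(prompt_embs, control_vector(low, high, t=0.25))
print(conditioned.shape)  # torch.Size([2, 11, 768])
```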
pdf
bib
abs
Stand on The Shoulders of Giants: Building JailExpert from Previous Attack Experience
Xi Wang
|
Songlei Jian
|
Shasha Li
|
Xiaopeng Li
|
Bin Ji
|
Ma Jun
|
Xiaodong Liu
|
Jing Wang
|
Jianfeng Zhang
|
Jie Yu
|
Feilong Bao
|
Wangbaosheng
Large language models (LLMs) generate human-aligned content under certain safety constraints. However, jailbreak prompts can circumvent safety-alignment measures and induce LLMs to output malicious content. Research on jailbreaking can help identify vulnerabilities in LLMs and guide the development of robust security frameworks. To circumvent the issue of attack templates becoming obsolete as models evolve, existing methods adopt iterative mutation and dynamic optimization to facilitate more automated jailbreak attacks. However, these methods face two challenges: inefficiency and repetitive optimization, as they overlook the value of past attack experiences. To better integrate past attack experiences into current jailbreak attempts, we propose JailExpert, an automated jailbreak framework that is the first to achieve a formal representation of experience structure, group experiences based on semantic drift, and support dynamic updating of the experience pool. Extensive experiments demonstrate that JailExpert significantly improves both attack effectiveness and efficiency. Compared to the current state-of-the-art black-box jailbreak method, JailExpert achieves an average increase of 24% in attack success rate and a 2.7-fold improvement in attack efficiency.
pdf
bib
abs
Language-to-Space Programming for Training-Free 3D Visual Grounding
Boyu Mi
|
Hanqing Wang
|
Tai Wang
|
Yilun Chen
|
Jiangmiao Pang
3D visual grounding (3DVG) is challenging due to the need to understand 3D spatial relations. While supervised approaches have achieved superior performance, they are constrained by the scarcity and high annotation costs of 3D vision-language datasets. Training-free approaches based on LLMs/VLMs eliminate the need for large-scale training data, but they either incur prohibitive grounding time and token costs or have unsatisfactory accuracy. To address the challenges, we introduce a novel method for training-free 3D visual grounding, namely **La**nguage-to-**S**pace **P**rogramming (LaSP). LaSP introduces LLM-generated codes to analyze 3D spatial relations among objects, along with a pipeline that evaluates and optimizes the codes automatically. Experimental results demonstrate that LaSP achieves 52.9% accuracy on the Nr3D benchmark, ranking among the best training-free methods. Moreover, it substantially reduces the grounding time and token costs, offering a balanced trade-off between performance and efficiency.
pdf
bib
abs
RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions
Wanlong Liu
|
Junying Chen
|
Ke Ji
|
Li Zhou
|
Wenyu Chen
|
Benyou Wang
Retrieval-Augmented Generation (RAG) has emerged as a key paradigm for enhancing large language models by incorporating external knowledge. However, current RAG methods exhibit limited capabilities in complex RAG scenarios and suffer from limited task diversity. To address these limitations, we propose RAG-Instruct, a general method for synthesizing diverse and high-quality RAG instruction data based on any source corpus. Our approach leverages (1) five RAG paradigms, which encompass diverse query-document relationships, and (2) instruction simulation, which enhances instruction diversity and quality by utilizing the strengths of existing instruction datasets. Using this method, we construct a 40K instruction dataset from Wikipedia, comprehensively covering diverse RAG scenarios and tasks. Experiments demonstrate that RAG-Instruct effectively enhances LLMs’ RAG capabilities, achieving strong zero-shot performance and significantly outperforming various RAG baselines across a diverse set of tasks.
pdf
bib
abs
AdaRewriter: Unleashing the Power of Prompting-based Conversational Query Reformulation via Test-Time Adaptation
Yilong Lai
|
Jialong Wu
|
Zhenglin Wang
|
Deyu Zhou
Prompting-based conversational query reformulation has emerged as a powerful approach for conversational search, refining ambiguous user queries into standalone search queries. Best-of-N reformulation over generated candidates via prompting shows impressive scaling potential. However, neither previous tuning methods (at training time) nor adaptation approaches (at test time) can fully unleash their benefits. In this paper, we propose AdaRewriter, a novel framework for query reformulation using an outcome-supervised reward model via test-time adaptation. By training a lightweight reward model with a contrastive ranking loss, AdaRewriter selects the most promising reformulation during inference. Notably, it can operate effectively in black-box systems, including commercial LLM APIs. Experiments on five conversational search datasets show that AdaRewriter significantly outperforms existing methods across most settings, demonstrating the potential of test-time adaptation for conversational query reformulation.
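At inference, the selection step described is essentially best-of-N with a lightweight reward model over candidate rewrites. Below is a minimal sketch of that selection plus a generic contrastive ranking loss for training the scorer; both are illustrative formulations, not the authors' exact ones.

```python
import torch

def select_reformulation(candidates, score_fn):
    """Best-of-N at test time: keep the candidate rewrite the reward model
    scores highest. This works with black-box LLM APIs, since only the
    candidates (not the generator) need to be scored."""
    return max(candidates, key=score_fn)

def contrastive_ranking_loss(scores: torch.Tensor, best_index: int) -> torch.Tensor:
    """Push the outcome-supervised best candidate above all others
    (softmax cross-entropy over candidate scores)."""
    target = torch.tensor([best_index])
    return torch.nn.functional.cross_entropy(scores.unsqueeze(0), target)

# Toy usage with a stand-in scorer.
rewrites = ["weather in Paris today", "Paris weather", "today weather"]
print(select_reformulation(rewrites, score_fn=len))  # longest candidate wins here
print(float(contrastive_ranking_loss(torch.tensor([0.2, 1.3, -0.5]), best_index=1)))
```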
pdf
bib
abs
SmartBench: Is Your LLM Truly a Good Chinese Smartphone Assistant?
Xudong Lu
|
Haohao Gao
|
Renshou Wu
|
Shuai Ren
|
Xiaoxin Chen
|
Hongsheng Li
|
Fangyuan Li
Large Language Models (LLMs) have become integral to daily life, especially advancing as intelligent assistants through on-device deployment on smartphones. However, existing LLM evaluation benchmarks predominantly focus on objective tasks like mathematics and coding in English, which do not necessarily reflect the practical use cases of on-device LLMs in real-world mobile scenarios, especially for Chinese users. To address these gaps, we introduce **SmartBench**, the first benchmark designed to evaluate the capabilities of on-device LLMs in Chinese mobile contexts. We analyze functionalities provided by representative smartphone manufacturers and divide them into five categories: text summarization, text Q&A, information extraction, content creation, and notification management, further detailed into 20 specific tasks. For each task, we construct high-quality datasets comprising 50 to 200 question-answer pairs that reflect everyday mobile interactions, and we develop automated evaluation criteria tailored for these tasks. We conduct comprehensive evaluations of on-device LLMs and MLLMs using SmartBench and also assess their performance after quantized deployment on real smartphone NPUs. Our contributions provide a standardized framework for evaluating on-device LLMs in Chinese, promoting further development and optimization in this critical area. Code and data will be available at https://github.com/vivo-ai-lab/SmartBench.
pdf
bib
abs
F2TEval: Human-Aligned Multi-Dimensional Evaluation for Figure-to-Text Task
Tan Yue
|
Rui Mao
|
Zilong Song
|
Zonghai Hu
|
Dongyan Zhao
Figure-to-Text (F2T) tasks aim to convert structured figure information into natural language text, serving as a bridge between visual perception and language understanding. However, existing evaluation methods remain limited: 1) Reference-based methods can only capture shallow semantic similarities and rely on costly labeled reference text; 2) Reference-free methods depend on multimodal large language models, which suffer from low efficiency and instruction sensitivity; 3) Existing methods provide only sample-level evaluations, lacking interpretability and alignment with expert-level multi-dimensional evaluation criteria. Accordingly, we propose F2TEval, a five-dimensional reference-free evaluation method aligned with expert criteria, covering faithfulness, completeness, conciseness, logicality, and analysis, to support fine-grained evaluation. We design a lightweight mixture-of-experts model that incorporates independent scoring heads and applies the Hilbert-Schmidt Independence Criterion to optimize the disentanglement of scoring representations across dimensions. Furthermore, we construct F2TBenchmark, a human-annotated benchmark dataset covering 21 chart types and 35 application domains, to support research on F2T evaluation. Experimental results demonstrate our model’s superior performance and efficiency, outperforming Gemini-2.0 and Claude-3.5 with only 0.9B parameters.
pdf
bib
abs
Icon2: Aligning Large Language Models Using Self-Synthetic Preference Data via Inherent Regulation
Qiyuan Chen
|
Hongsen Huang
|
Qian Shao
|
Jiahe Chen
|
Jintai Chen
|
Hongxia Xu
|
Renjie Hua
|
Ren Chuan
|
Jian Wu
Large Language Models (LLMs) require high-quality preference datasets to align with human preferences. However, conventional methods for constructing such datasets face significant challenges: reliance on pre-collected instructions often leads to distribution mismatches with target models, while the need for sampling multiple stochastic responses introduces substantial computational overhead. In this work, we explore a paradigm shift by leveraging the inherent regulation of LLMs’ representation space for efficient and tailored preference dataset construction, named Icon2. Specifically, it first extracts layer-wise direction vectors to encode sophisticated human preferences and then uses these vectors to filter self-synthesized instructions based on their inherent consistency. During decoding, bidirectional inherent control is applied to steer token representations, enabling the precise generation of response pairs with clear alignment distinctions. Experimental results demonstrate significant improvements in both alignment and efficiency. Llama3-8B and Qwen2-7B achieve an average win rate improvement of 13.89% on AlpacaEval 2.0 and 13.45% on Arena-Hard, while reducing computational costs by up to 48.1%.
pdf
bib
abs
DSCD: Large Language Model Detoxification with Self-Constrained Decoding
Ming Dong
|
Jinkui Zhang
|
Bolong Zheng
|
Xinhui Tu
|
Po Hu
|
Tingting He
Detoxification in large language models (LLMs) remains a significant research challenge. Existing decoding-based detoxification methods all rely on external constraints, which incur additional resource overhead and degrade generation fluency. This work proposes Detoxification with Self-Constrained Decoding (DSCD), a novel method for LLM detoxification without parameter fine-tuning. During output generation, DSCD strengthens the inner token distribution of the safety layer while weakening those of the hallucination and toxicity layers. This effectively diminishes toxicity and enhances output safety. DSCD is lightweight, highly compatible, and plug-and-play, readily integrating with existing detoxification methods for further performance improvement. Extensive experiments on representative open-source LLMs and public datasets validate DSCD’s effectiveness, demonstrating state-of-the-art (SOTA) performance in both detoxification and generation fluency, with superior efficiency compared to existing methods. These results highlight DSCD’s potential as a practical and scalable solution for safer LLM deployment.
pdf
bib
abs
From Reasoning to Answer: Empirical, Attention-Based and Mechanistic Insights into Distilled DeepSeek R1 Models
Jue Zhang
|
Qingwei Lin
|
Saravan Rajmohan
|
Dongmei Zhang
Large Reasoning Models (LRMs) generate explicit reasoning traces alongside final answers, yet the extent to which these traces influence answer generation remains unclear. In this work, we conduct a three-stage investigation into the interplay between reasoning and answer generation in three distilled DeepSeek R1 models. First, through empirical evaluation, we demonstrate that including explicit reasoning consistently improves answer quality across diverse domains. Second, attention analysis reveals that answer tokens attend substantially to reasoning tokens, with certain mid-layer Reasoning-Focus Heads (RFHs) closely tracking the reasoning trajectory, including self-reflective cues. Third, we apply mechanistic interventions using activation patching to assess the dependence of answer tokens on reasoning activations. Our results show that perturbations to key reasoning tokens can reliably alter the final answers, confirming a directional and functional flow of information from reasoning to answer. These findings deepen our understanding of how LRMs leverage reasoning tokens for answer generation, highlighting the functional role of intermediate reasoning in shaping model outputs.
pdf
bib
abs
Quantifying Language Disparities in Multilingual Large Language Models
Songbo Hu
|
Ivan Vulić
|
Anna Korhonen
Results reported in large-scale multilingual evaluations are often fragmented and confounded by factors such as target languages, differences in experimental setups, and model choices. We propose a framework that disentangles these confounding variables and introduces three interpretable metrics—the performance realisation ratio, its coefficient of variation, and language potential—enabling a finer-grained and more insightful quantification of actual performance disparities across both (i) models and (ii) languages. Through a case study of 13 model variants on 11 multilingual datasets, we demonstrate that our framework provides a more reliable measurement of model performance and language disparities, particularly for low-resource languages, which have so far proven challenging to evaluate. Importantly, our results reveal that higher overall model performance does not necessarily imply greater fairness across languages.
pdf
bib
abs
KoBLEX: Open Legal Question Answering with Multi-hop Reasoning
Jihyung Lee
|
Daehui Kim
|
Seonjeong Hwang
|
Hyounghun Kim
|
Gary Lee
Large Language Models (LLMs) have achieved remarkable performance in general domains and are now extending into the expert domain of law. Several benchmarks have been proposed to evaluate LLMs’ legal capabilities. However, these benchmarks fail to evaluate open-ended and provision-grounded Question Answering (QA). To address this, we introduce a Korean Benchmark for Legal EXplainable QA (KoBLEX), designed to evaluate provision-grounded, multi-hop legal reasoning. KoBLEX includes 226 scenario-based QA instances and their supporting provisions, created using a hybrid LLM–human expert pipeline. We also propose a method called Parametric provision-guided Selection Retrieval (ParSeR), which uses LLM-generated parametric provisions to guide legally grounded and reliable answers. ParSeR facilitates multi-hop reasoning on complex legal questions by generating parametric provisions and employing a three-stage sequential retrieval process. Furthermore, to better evaluate the legal fidelity of the generated answers, we propose Legal Fidelity Evaluation (LF-Eval). LF-Eval is an automatic metric that jointly considers the question, answer, and supporting provisions and shows a high correlation with human judgments. Experimental results show that ParSeR consistently outperforms strong baselines, achieving the best results across multiple LLMs. Notably, compared to standard retrieval with GPT-4o, ParSeR achieves +37.91 higher F1 and +30.81 higher LF-Eval. Further analyses reveal that ParSeR efficiently delivers consistent performance across reasoning depths, with ablations confirming the effectiveness of its components.
pdf
bib
abs
End-to-End Learnable Psychiatric Scale Guided Risky Post Screening for Depression Detection on Social Media
Bichen Wang
|
Yuzhe Zi
|
Yixin Sun
|
Hao Yang
|
Yanyan Zhao
|
Bing Qin
Detecting depression through users’ social media posting history is crucial for enabling timely intervention; however, irrelevant content within these posts negatively impacts detection performance. Thus, it is essential to extract pertinent content from users’ complex posting history. Current methods utilize frozen screening models, which can miss critical information and limit overall performance due to isolated screening and detection processes. To address these limitations, we propose **E2-LPS** (**E**nd-to-**E**nd **L**earnable **P**sychiatric Scale Guided Risky Post **S**creening Model) for jointly training our screening model, guided by psychiatric scales, alongside the detection model. We employ a straight-through estimator to enable a learnable end-to-end screening process and avoid the non-differentiability of the screening step. Experimental results show that E2-LPS outperforms several strong baseline methods, and qualitative analysis confirms that it better captures users’ mental states than other methods.
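The end-to-end trick named in the abstract is a straight-through estimator: the screening step makes hard keep/drop decisions in the forward pass while gradients flow through the soft scores. A minimal PyTorch sketch of that estimator; it uses top-1 selection for brevity, whereas an actual screening model would likely keep multiple risky posts.

```python
import torch

def straight_through_select(scores: torch.Tensor) -> torch.Tensor:
    """Return a one-hot selection over posts in the forward pass while letting
    gradients flow through the softmax in the backward pass."""
    soft = torch.softmax(scores, dim=-1)
    index = soft.argmax(dim=-1, keepdim=True)
    hard = torch.zeros_like(soft).scatter_(-1, index, 1.0)
    return hard + soft - soft.detach()   # forward: hard one-hot, backward: grad of soft

post_scores = torch.tensor([[0.1, 2.3, -0.4, 0.9]], requires_grad=True)
weights = straight_through_select(post_scores)        # one-hot over the 4 posts
screened = weights @ torch.randn(4, 16)               # weighted post representations
screened.sum().backward()
print(post_scores.grad)                               # non-zero despite hard selection
```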
pdf
bib
abs
ReAgent: Reversible Multi-Agent Reasoning for Knowledge-Enhanced Multi-Hop QA
Zhao Xinjie
|
Fan Gao
|
Xingyu Song
|
Yingjian Chen
|
Rui Yang
|
Yanran Fu
|
Yuyang Wang
|
Yusuke Iwasawa
|
Yutaka Matsuo
|
Irene Li
Multi-hop question answering (QA) remains challenging, as solutions must reliably integrate and reconcile evidence from multiple sources without succumbing to error propagation. While large language models (LLMs) have achieved substantial improvements via chain-of-thought (CoT) prompting and retrieval-augmented generation, these methods typically adopt a forward-only workflow—early mistakes persist throughout inference, and contradictions discovered later cannot systematically trigger re-evaluation. To address this limitation, we present ReAgent, a reversible multi-agent reasoning framework. Specifically, ReAgent enables agents to backtrack to earlier valid states when conflicts arise, thereby isolating and rectifying flawed assumptions before they undermine subsequent reasoning. Our approach combines explicit local and global rollback protocols with modular role specialization, resulting in a flexible and error-tolerant pipeline. Empirical evaluation on three multi-hop QA benchmarks demonstrates consistent performance gains of approximately 6% over forward-only baselines, in addition to enhanced interpretability. These findings highlight the value of non-monotonic, backtracking-driven inference in complex QA scenarios and point to broader implications for multi-agent collaboration in knowledge-intensive tasks.
pdf
bib
abs
Matter-of-Fact: A Benchmark for Verifying the Feasibility of Literature-Supported Claims in Materials Science
Peter Jansen
|
Samiah Hassan
|
Ruoyao Wang
Contemporary approaches to assisted scientific discovery use language models to automatically generate large numbers of potential hypotheses to test, while also automatically generating code-based experiments to test those hypotheses. While hypotheses can be comparatively inexpensive to generate, automated experiments can be costly, particularly when run at scale (i.e. thousands of experiments). Developing the capacity to filter hypotheses based on their feasibility would allow discovery systems to run at scale, while increasing their likelihood of making significant discoveries. In this work we introduce Matter-of-Fact, a challenge dataset for determining the feasibility of hypotheses framed as claims, while operationalizing feasibility assessment as a temporally-filtered claim verification task using backtesting. Matter-of-Fact includes 8.4k claims extracted from scientific articles spanning four high-impact contemporary materials science topics, including superconductors, semiconductors, batteries, and aerospace materials, while including qualitative and quantitative claims from theoretical, experimental, and code/simulation results. We show that strong baselines that include retrieval-augmented generation over scientific literature and code generation fail to exceed 72% performance on this task (chance performance is 50%), while domain-expert verification suggests nearly all claims are solvable, highlighting both the difficulty of this task for current models and the potential to accelerate scientific discovery by making near-term progress.
pdf
bib
abs
ModRWKV: Transformer Multimodality in Linear Time
Jiale Kang
|
Ziyin Yue
|
Qingyu Yin
|
Rui Jiang
|
Weile Li
|
Zening Lu
|
Zhouran Ji
Currently, most multimodal studies are based on large language models (LLMs) with quadratic-complexity Transformer architectures. While linear models like RNNs enjoy low inference costs, their application has been largely limited to the text-only modality. This work explores the capabilities of modern RNN architectures in multimodal contexts. We propose ModRWKV—a decoupled multimodal framework built upon the RWKV7 architecture as its LLM backbone—which achieves multi-source information fusion through dynamically adaptable heterogeneous modality encoders. We designed the multimodal modules in ModRWKV with an extremely lightweight architecture and, through extensive experiments, identified a configuration that achieves an optimal balance between performance and computational efficiency. ModRWKV leverages the pretrained weights of the RWKV7 LLM for initialization, which significantly accelerates multimodal training. Comparative experiments with different pretrained checkpoints further demonstrate that such initialization plays a crucial role in enhancing the model’s ability to understand multimodal signals. Supported by extensive experiments, we conclude that modern RNN architectures present a viable alternative to Transformers in the domain of multimodal large language models (MLLMs). Furthermore, we identify the optimal configuration of the ModRWKV architecture through systematic exploration.
pdf
bib
abs
Multimedia Event Extraction with LLM Knowledge Editing
Jiaao Yu
|
Yijing Lin
|
Zhipeng Gao
|
Xuesong Qiu
|
Lanlan Rui
The multimodal event extraction task aims to identify event types and arguments from visual and textual representations related to events. Due to the high cost of multimedia training data, previous methods mainly focused on weak alignment of strong unimodal encoders. However, they ignore the conflict between event understanding and image recognition, resulting in redundant feature perception that affects the understanding of multimodal events. In this paper, we propose a multimodal event extraction strategy with a multi-level redundant feature selection mechanism, which enhances the event understanding ability of multimodal large language models by leveraging knowledge editing techniques and requires no additional parameter optimization. Extensive experiments show that our method outperforms the state-of-the-art (SOTA) baselines on the M2E2 benchmark. Compared with the strongest baseline, we achieve a 34% improvement in precision on event extraction and an 11% improvement in F1 on argument extraction.
pdf
bib
abs
Exploring the Impact of Personality Traits on LLM Toxicity and Bias
Shuo Wang
|
Renhao Li
|
Xi Chen
|
Yulin Yuan
|
Min Yang
|
Derek F. Wong
With the different roles that AI is expected to play in human life, imbuing large language models (LLMs) with different personalities has attracted increasing research interest. While this “personification” enhances human experiences of the interactivity and adaptability of LLMs, it gives rise to critical concerns about content safety, particularly regarding the bias, sentiment, and toxicity of LLM generation. This study explores how assigning different personality traits to LLMs affects the toxicity and biases of their outputs. Leveraging the widely accepted HEXACO personality framework developed in social psychology, we design experimentally sound prompts to test three LLMs’ performance on three toxicity and bias benchmarks. The findings demonstrate the sensitivity of all three models to HEXACO personality traits and, more importantly, a consistent variation in the biases, negative sentiment, and toxicity of their output. In particular, adjusting the levels of several personality traits can effectively reduce bias and toxicity in model outputs, mirroring the correlations between personality traits and toxic behaviors observed in humans. The findings highlight the need to examine content safety in addition to the efficiency of training or fine-tuning methods for LLM personification; they also suggest that adjusting personalities may be a simple and low-cost method for controlled text generation.
pdf
bib
abs
Task-aware Contrastive Mixture of Experts for Quadruple Extraction in Conversations with Code-like Replies and Non-opinion Detection
Chenyuan He
|
Yuxiang Jia
|
Fei Gao
|
Senbin Zhu
|
Hongde Liu
|
Hongying Zan
|
Min Peng
This paper focuses on Dialogue Aspect-based Sentiment Quadruple (DiaASQ) analysis, aiming to extract structured quadruples from multi-turn conversations. Applying Large Language Models (LLMs) to this specific task presents two primary challenges: the accurate extraction of multiple elements and the understanding of complex dialogue reply structure. To address these issues, we propose a novel LLM-based multi-task approach, named Task-aware Contrastive Mixture of Experts (TaCoMoE), which tackles the DiaASQ task by integrating an expert-level contrastive loss within a task-oriented mixture-of-experts layer. TaCoMoE minimizes the distance between the representations of the same expert in the semantic space while maximizing the distance between the representations of different experts to efficiently learn representations of different task samples. Additionally, we design a Graph-Centric Dialogue Structuring strategy for representing dialogue reply structure and perform non-opinion utterance detection to enhance the performance of quadruple extraction. Extensive experiments are conducted on the DiaASQ dataset, demonstrating that our method significantly outperforms existing parameter-efficient fine-tuning techniques in terms of both accuracy and computational efficiency. The code is available at https://github.com/he2720/TaCoMoE.
pdf
bib
abs
Mitigating Biases in Language Models via Bias Unlearning
Dianqing Liu
|
Yi Liu
|
Guoqing Jin
|
Zhendong Mao
Many studies have shown various biases targeting different demographic groups in language models, amplifying discrimination and harming fairness. Recent parameter-modification debiasing approaches significantly degrade core capabilities such as text coherence and task accuracy, while prompt-based debiasing methods, which are only effective for predefined trigger words, fail to address deeply embedded stereotypical associations in model parameters. In this paper, we propose BiasUnlearn, a novel model debiasing framework which achieves targeted debiasing via dual-pathway unlearning mechanisms coordinating stereotype forgetting with anti-stereotype retention, while preventing bias polarity reversal through an adversarial forget set and dynamic dataset swapping. We conducted extensive experiments with multiple language models across various evaluation benchmarks. The results show that BiasUnlearn outperforms existing methods in mitigating bias in language models while retaining language modeling capabilities. Further experiments reveal that debiasing weights are transferable across model variants, confirming that bias representations become entrenched during pre-training and persist through fine-tuning phases.
pdf
bib
abs
UNComp: Can Matrix Entropy Uncover Sparsity? — A Compressor Design from an Uncertainty-Aware Perspective
Jing Xiong
|
Jianghan Shen
|
Fanghua Ye
|
Chaofan Tao
|
Zhongwei Wan
|
Jianqiao Lu
|
Xun Wu
|
Chuanyang Zheng
|
Zhijiang Guo
|
Min Yang
|
Lingpeng Kong
|
Ngai Wong
Deploying large language models (LLMs) for long-context inference remains challenging due to their substantial memory and computational demands. While techniques such as Key-Value (KV) cache compression are designed to reduce memory usage, they often neglect the structured sparsity inherent in the relationship between hidden states and their corresponding KV cache. In this work, we explore the role of uncertainty as a potential indicator of sparsity within LLMs. We propose UNComp, an uncertainty-aware framework that leverages truncated matrix entropy to identify areas of low information content, thereby revealing sparsity patterns that can be used for adaptive compression. Unlike traditional methods that apply uniform compression, UNComp dynamically adjusts its approach to compression, guided by uncertainty measures that reflect the importance of various model components. Our analysis shows that sparsity patterns, when derived from uncertainty estimates, can be exploited to reveal special long-range dependencies, such as retrieval heads and retrieval layers. This perspective not only enhances our understanding of how compression can be optimized but also provides new insights into the inherent sparsity of LLMs during long-context inference. By focusing on uncertainty to analyze the sparsity pattern in detail, UNComp reduces the KV cache size to 4.74% of the original, achieves a 6% prefill speedup, and improves throughput by 6.4× — not only delivering strong lossless compression performance, but also validating the effectiveness of the underlying theoretical tool. Our codes are submitted with the paper.
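As a rough illustration of how a matrix-entropy signal might be computed, the snippet below normalizes the squared singular values of a hidden-state matrix into a spectral distribution and takes its Shannon entropy, optionally truncating to the dominant singular values. The truncation rule and the (seq_len, d) layout are assumptions for illustration, not UNComp's exact formulation.

```python
import torch

def truncated_matrix_entropy(hidden, top_k=None):
    """hidden: (seq_len, d) hidden states for one layer or head; lower entropy ~ more structure."""
    s = torch.linalg.svdvals(hidden)            # singular values in descending order
    if top_k is not None:
        s = s[:top_k]                           # keep only the dominant spectrum (truncation)
    p = (s ** 2) / (s ** 2).sum()               # normalized spectral distribution
    return float(-(p * torch.log(p + 1e-12)).sum())

print(truncated_matrix_entropy(torch.randn(128, 64), top_k=32))
```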
pdf
bib
abs
Superpose Task-specific Features for Model Merging
Haiquan Qiu
|
You Wu
|
Dong Li
|
Jianmin Guo
|
Quanming Yao
Model merging enables powerful capabilities in neural networks without requiring additional training. In this paper, we introduce a novel perspective on model merging by leveraging the fundamental mechanisms of neural network representation. Our approach is motivated by the linear representation hypothesis, which states that neural networks encode information through linear combinations of feature vectors. We propose a method that superposes task-specific features from individual models into a merged model. Our approach specifically targets linear transformation matrices, which are crucial for feature activation and extraction in deep networks. By formulating the merging process as a linear system, we can preserve output feature directions from individual models and create merged models that effectively maintain multi-task capabilities compared to existing methods. Extensive experiments across diverse benchmarks and models demonstrate that our method outperforms existing techniques.
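One way to read the linear-system framing is as a least-squares fit: find a single weight matrix that reproduces each task model's outputs on that task's representative activations. The sketch below is a toy under stated assumptions (NumPy, random data, ordinary least squares), not the paper's construction.

```python
import numpy as np

def merge_linear_layers(task_weights, task_inputs):
    """task_weights: list of (d_out, d_in) matrices; task_inputs: list of (d_in, n_i) activations."""
    X = np.concatenate(task_inputs, axis=1)                                   # (d_in, N)
    Y = np.concatenate([W @ Xi for W, Xi in zip(task_weights, task_inputs)],  # (d_out, N)
                       axis=1)
    # Solve min_W ||W X - Y||_F^2 by least squares on X^T W^T = Y^T.
    W_T, *_ = np.linalg.lstsq(X.T, Y.T, rcond=None)
    return W_T.T                                                              # (d_out, d_in)

rng = np.random.default_rng(0)
Ws = [rng.standard_normal((8, 16)) for _ in range(3)]
Xs = [rng.standard_normal((16, 32)) for _ in range(3)]
print(merge_linear_layers(Ws, Xs).shape)   # (8, 16)
```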
pdf
bib
abs
FinRAGBench-V: A Benchmark for Multimodal RAG with Visual Citation in the Financial Domain
Suifeng Zhao
|
Zhuoran Jin
|
Sujian Li
|
Jun Gao
Retrieval-Augmented Generation (RAG) plays a vital role in the financial domain, powering applications such as real-time market analysis, trend forecasting, and interest rate computation. However, most existing RAG research in finance focuses predominantly on textual data, overlooking the rich visual content in financial documents, resulting in the loss of key analytical insights. To bridge this gap, we present FinRAGBench-V, a comprehensive visual RAG benchmark tailored for finance. This benchmark effectively integrates multimodal data and provides visual citation to ensure traceability. It includes a bilingual retrieval corpus with 60,780 Chinese and 51,219 English pages, along with a high-quality, human-annotated question-answering (QA) dataset spanning heterogeneous data types and seven question categories. Moreover, we introduce RGenCite, an RAG baseline that seamlessly integrates visual citation with generation. Furthermore, we propose an automatic citation evaluation method to systematically assess the visual citation capabilities of Multimodal Large Language Models (MLLMs). Extensive experiments on RGenCite underscore the challenging nature of FinRAGBench-V, providing valuable insights for the development of multimodal RAG systems in finance.
pdf
bib
abs
BacktrackAgent: Enhancing GUI Agent with Error Detection and Backtracking Mechanism
Qinzhuo Wu
|
Pengzhi Gao
|
Wei Liu
|
Jian Luan
Graphical User Interface (GUI) agents have gained substantial attention due to their impressive capabilities to complete tasks through multiple interactions within GUI environments. However, existing agents primarily focus on enhancing the accuracy of individual actions and often lack effective mechanisms for detecting and recovering from errors. To address these shortcomings, we propose the BacktrackAgent, a robust framework that incorporates a backtracking mechanism to improve task completion efficiency. BacktrackAgent includes verifier, judger, and reflector components as modules for error detection and recovery, while also applying judgment rewards to further enhance the agent’s performance. Additionally, we develop a training dataset specifically designed for the backtracking mechanism, which considers the outcome pages after action executions. Experimental results show that BacktrackAgent has achieved performance improvements in both task success rate and step accuracy on Mobile3M and Auto-UI benchmarks. Our data and code will be released upon acceptance.
pdf
bib
abs
Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective
Siyue Zhang
|
Yilun Zhao
|
Liyuan Geng
|
Arman Cohan
|
Anh Tuan Luu
|
Chen Zhao
Large language model (LLM)-based embedding models, benefiting from large-scale pre-training and post-training, have begun to surpass BERT and T5-based models on general-purpose text embedding tasks such as document retrieval. However, a fundamental limitation of LLM embeddings lies in the unidirectional attention used during autoregressive pre-training, which misaligns with the bidirectional nature of text embedding tasks. To this end, we propose adopting diffusion language models for text embeddings, motivated by their inherent bidirectional architecture and recent success in matching or surpassing LLMs, especially on reasoning tasks. We present the first systematic study of the diffusion language embedding model, which outperforms the LLM-based embedding model by 20% on long-document retrieval, 8% on reasoning-intensive retrieval, and 2% on instruction-following retrieval, and achieves competitive performance on traditional text embedding benchmarks. Our analysis verifies that bidirectional attention is crucial for encoding global context in long and complex text.
pdf
bib
abs
BannerAgency: Advertising Banner Design with Multimodal LLM Agents
Heng Wang
|
Yotaro Shimose
|
Shingo Takamatsu
Advertising banners are critical for capturing user attention and enhancing advertising campaign effectiveness. Creating aesthetically pleasing banner designs while conveying the campaign messages is challenging due to the large search space involving multiple design elements. Additionally, advertisers need multiple sizes for different displays and various versions to target different sectors of audiences. Since design is intrinsically an iterative and subjective process, flexible editability is also in high demand for practical usage. While current models have served as assistants to human designers in various design tasks, they typically handle only segments of the creative design process or produce pixel-based outputs that limit editability. This paper introduces a training-free framework for fully automated banner ad design creation, enabling frontier multimodal large language models (MLLMs) to streamline the production of effective banners with minimal manual effort across diverse marketing contexts. We present BannerAgency, an MLLM agent system that collaborates with advertisers to understand their brand identity and banner objectives, generates matching background images, creates blueprints for foreground design elements, and renders the final creatives as editable components in Figma or SVG formats rather than static pixels. To facilitate evaluation and future research, we introduce BannerRequest400, a benchmark featuring 100 unique logos paired with 400 diverse banner requests. Through quantitative and qualitative evaluations, we demonstrate the framework’s effectiveness, emphasizing the quality of the generated banner designs, their adaptability to various banner requests, and their strong editability enabled by this component-based approach.
pdf
bib
abs
DIDS: Domain Impact-aware Data Sampling for Large Language Model Training
Weijie Shi
|
Jipeng Zhang
|
Yaguang Wu
|
Jingzhi Fang
|
Shibo Zhang
|
Yao Zhao
|
Hao Chen
|
Ruiyuan Zhang
|
Yue Cui
|
Jia Zhu
|
Sirui Han
|
Jiajie Xu
|
Xiaofang Zhou
Large language models (LLMs) are commonly trained on multi-domain datasets, where domain sampling strategies significantly impact model performance due to varying domain importance across downstream tasks. Existing approaches for optimizing domain-level sampling strategies struggle with maintaining intra-domain consistency and accurately measuring domain impact. In this paper, we present Domain Impact-aware Data Sampling (DIDS). To ensure intra-domain consistency, a gradient clustering algorithm is proposed to group training data based on their learning effects, where a proxy language model and dimensionality reduction are employed to reduce computational overhead. To accurately measure domain impact, we develop a Fisher Information Matrix (FIM) guided metric that quantifies how domain-specific parameter updates affect the model’s output distributions on downstream tasks, with theoretical guarantees. Furthermore, to determine optimal sampling ratios, DIDS combines both the FIM-guided domain impact assessment and loss learning trajectories that indicate domain-specific potential, while accounting for diminishing marginal returns. Extensive experiments demonstrate that DIDS achieves 3.4% higher average performance while maintaining comparable training efficiency. The code is available at https://github.com/shiweijiezero/DIDS.
pdf
bib
abs
Training LLMs to be Better Text Embedders through Bidirectional Reconstruction
Chang Su
|
Dengliang Shi
|
Siyuan Huang
|
Jintao Du
|
Changhua Meng
|
Yu Cheng
|
Weiqiang Wang
|
Zhouhan Lin
Large language models (LLMs) have increasingly been explored as powerful text embedders. Existing LLM-based text embedding approaches often leverage the embedding of the final token, typically a reserved special token such as [EOS]. However, these tokens have not been intentionally trained to capture the semantics of the whole context, limiting their capacity as text embeddings, especially for retrieval and re-ranking tasks. We propose to add a new training stage before contrastive learning to enrich the semantics of the final token embedding. This stage employs bidirectional generative reconstruction tasks, namely EBQ2D (Embedding-Based Query-to-Document) and EBD2Q (Embedding-Based Document-to-Query), which interleave to anchor the [EOS] embedding and reconstruct either side of Query-Document pairs. Experimental results demonstrate that our additional training stage significantly improves LLM performance on the Massive Text Embedding Benchmark (MTEB), achieving new state-of-the-art results across different LLM base models and scales.
pdf
bib
abs
ReMedy: Learning Machine Translation Evaluation from Human Preferences with Reward Modeling
Shaomu Tan
|
Christof Monz
A key challenge in MT evaluation is the inherent noise and inconsistency of human ratings. Regression-based neural metrics struggle with this noise, while prompting LLMs shows promise at system-level evaluation but performs poorly at segment level. In this work, we propose ReMedy, a novel MT metric framework that reformulates translation evaluation as a reward modeling task. Instead of regressing on imperfect human ratings directly, ReMedy learns relative translation quality using pairwise preference data, resulting in a more reliable evaluation. In extensive experiments across WMT22-24 shared tasks (39 language pairs, 111 MT systems), ReMedy achieves state-of-the-art performance at both segment- and system-level evaluation. Specifically, ReMedy-9B surpasses larger WMT winners and massive closed LLMs such as MetricX-13B, XCOMET-Ensemble, GEMBA-GPT-4, PaLM-540B, and finetuned PaLM2. Further analyses demonstrate that ReMedy delivers superior capability in detecting translation errors and evaluating low-quality translations.
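The reward-modeling reformulation boils down to a pairwise objective: the metric should score the preferred translation above the dispreferred one. The snippet below shows only that logistic, Bradley-Terry-style loss with the scoring network abstracted away; it is a sketch of the general recipe, not ReMedy's training code.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(score_preferred, score_rejected):
    """Both inputs: (batch,) scalar metric scores for two translations of the same source."""
    # Maximize P(preferred > rejected) under a Bradley-Terry model: -log sigmoid(s+ - s-).
    return -F.logsigmoid(score_preferred - score_rejected).mean()

loss = pairwise_preference_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, -0.1]))
print(float(loss))
```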
pdf
bib
abs
SolEval: Benchmarking Large Language Models for Repository-level Solidity Smart Contract Generation
Zhiyuan Peng
|
Xin Yin
|
Rui Qian
|
Peiqin Lin
|
YongKang Liu
|
Hao Zhang
|
Chenhao Ying
|
Yuan Luo
Large language models (LLMs) have transformed code generation. However, most existing approaches focus on mainstream languages such as Python and Java, neglecting the Solidity language, the predominant programming language for Ethereum smart contracts. Due to the lack of adequate benchmarks for Solidity, LLMs’ ability to generate secure, cost-effective smart contracts remains unexplored. To fill this gap, we construct SolEval, the first repository-level benchmark designed for Solidity smart contract generation, to evaluate the performance of LLMs on Solidity. SolEval consists of 1,507 samples from 28 different repositories, covering 6 popular domains, providing LLMs with a comprehensive evaluation benchmark. Unlike the existing Solidity benchmark, SolEval not only includes complex function calls but also reflects the real-world complexity of the Ethereum ecosystem by incorporating Gas@k and Vul@k. We evaluate 16 LLMs on SolEval, and our results show that the best-performing LLM achieves only 26.29% Pass@10, highlighting substantial room for improvement in Solidity code generation by LLMs. Additionally, we conduct supervised fine-tuning (SFT) on Qwen-7B using SolEval, resulting in a significant performance improvement, with Pass@5 increasing from 16.67% to 58.33%, demonstrating the effectiveness of fine-tuning LLMs on our benchmark. We release our data and code at https://github.com/pzy2000/SolEval.
pdf
bib
abs
In-Context Learning Boosts Speech Recognition via Human-like Adaptation to Speakers and Language Varieties
Nathan Roll
|
Calbert Graham
|
Yuka Tatsumi
|
Kim Tien Nguyen
|
Meghan Sumner
|
Dan Jurafsky
Human listeners readily adjust to unfamiliar speakers and language varieties through exposure, but do these adaptation benefits extend to state-of-the-art spoken language models (SLMs)? We introduce a scalable framework that allows for in-context learning (ICL) in Phi-4 Multimodal (Phi-4-MM) using interleaved task prompts and audio-text pairs, and find that as few as 12 example utterances (~50 seconds) at inference time reduce word error rates by a relative 19.7% (1.2 pp.) on average across diverse English corpora. These improvements are most pronounced in low-resource varieties, when the context and target speaker match, and when more examples are provided—though scaling our procedure yields diminishing marginal returns to context length. Overall, we find that our novel ICL adaptation scheme (1) reveals a similar performance profile to human listeners, and (2) demonstrates consistent improvements to automatic speech recognition (ASR) robustness across diverse speakers and language backgrounds. While adaptation succeeds broadly, significant gaps remain for certain varieties, revealing where current models still fall short of human flexibility. We release our prompts and code on GitHub.
pdf
bib
abs
Reasoning Model Unlearning: Forgetting Traces, Not Just Answers, While Preserving Reasoning Skills
Changsheng Wang
|
Chongyu Fan
|
Yihua Zhang
|
Jinghan Jia
|
Dennis Wei
|
Parikshit Ram
|
Nathalie Baracaldo
|
Sijia Liu
Recent advances in large reasoning models (LRMs) have enabled strong multi-step reasoning capabilities. However, existing machine unlearning algorithms are tailored to standard language modeling and fail to address the unique challenges posed by LRMs. In this work, we present the first systematic study of LRM unlearning and reveal that conventional unlearning methods often overlook critical information leakage in reasoning traces, even when final answers are successfully removed. To address this, we propose Reasoning-aware Representation Misdirection for Unlearning (R2MU), a method that suppresses sensitive reasoning traces while preserving the model’s general reasoning ability. Our experiments demonstrate that R2MU significantly reduces reasoning trace leakage and achieves strong performance across both reasoning and safety benchmarks, including WMDP, StrongReject, JBB-Behaviors and WildJailbreak, under state-of-the-art models such as DeepSeek-R1-Distill-LLaMA-8B and DeepSeek-R1-Distill-Qwen-14B. To the best of our knowledge, R2MU is the first principled approach to both expose and mitigate reasoning trace leakage in LRM unlearning, while preserving reasoning ability.
pdf
bib
abs
Chain-of-Talkers (CoTalk): Fast Human Annotation of Dense Image Captions
Yijun Shen
|
Delong Chen
|
Fan Liu
|
Xingyu Wang
|
Chuanyi Zhang
|
Liang Yao
|
Yuhui Zheng
While densely annotated image captions significantly facilitate the learning of robust vision-language alignment, methodologies for systematically optimizing human annotation efforts remain underexplored. We introduce Chain-of-Talkers (CoTalk), an AI-in-the-loop methodology designed to maximize the number of annotated samples and improve their comprehensiveness under fixed budget constraints (e.g., total human annotation time). The framework is built upon two key insights. First, sequential annotation reduces redundant workload compared to conventional parallel annotation, as subsequent annotators only need to annotate the “residual”—the missing visual information that previous annotations have not covered. Second, humans process textual input faster by reading while outputting annotations with much higher throughput via talking; thus a multimodal interface enables optimized efficiency. We evaluate our framework from two aspects: intrinsic evaluations that assess the comprehensiveness of semantic units, obtained by parsing detailed captions into object-attribute trees and analyzing their effective connections; extrinsic evaluation measures the practical usage of the annotated captions in facilitating vision-language alignment. Experiments with eight participants show our Chain-of-Talkers (CoTalk) improves annotation speed (0.42 vs. 0.30 units/sec) and retrieval performance (41.13% vs. 40.52%) over the parallel method.
pdf
bib
abs
DecoupleSearch: Decouple Planning and Search via Hierarchical Reward Modeling
Hao Sun
|
Zile Qiao
|
Bo Wang
|
Guoxin Chen
|
Yingyan Hou
|
Yong Jiang
|
Pengjun Xie
|
Fei Huang
|
Yan Zhang
Retrieval-Augmented Generation (RAG) systems have emerged as a pivotal methodology for enhancing Large Language Models (LLMs) through the dynamic integration of external knowledge. To further improve RAG’s flexibility, Agentic RAG introduces autonomous agents into the workflow. However, Agentic RAG faces several challenges: (1) the success of each step depends on both high-quality planning and accurate search, (2) the lack of supervision for intermediate reasoning steps, and (3) the exponentially large candidate space for planning and searching. To address these challenges, we propose DecoupleSearch, a novel framework that decouples planning and search processes using dual value models, enabling independent optimization of plan reasoning and search grounding. Our approach constructs a reasoning tree, where each node represents planning and search steps. We leverage Monte Carlo Tree Search to assess the quality of each step. During inference, Hierarchical Beam Search iteratively refines planning and search candidates with dual value models. Extensive experiments across policy models of varying parameter sizes demonstrate the effectiveness of our method.
pdf
bib
abs
RewardDS: Privacy-Preserving Fine-Tuning for Large Language Models via Reward Driven Data Synthesis
Jianwei Wang
|
Chengming Shi
|
Junyao Yang
|
Haoran Li
|
Qianli Ma
|
Huiping Zhuang
|
Cen Chen
|
Ziqian Zeng
The success of large language models (LLMs) has attracted many individuals to fine-tune them for domain-specific tasks by uploading their data. However, in sensitive areas like healthcare and finance, privacy concerns often arise. One promising solution is to generate synthetic data with Differential Privacy (DP) guarantees to replace private data. However, such synthetic data often contain a significant amount of flawed data, which acts as noise. Existing solutions typically rely on naive filtering by comparing ROUGE-L scores or embedding similarities, which are ineffective in addressing this noise. To address this issue, we propose ***RewardDS***, a novel privacy-preserving framework that fine-tunes a reward proxy model and uses reward signals to guide the synthetic data generation. RewardDS introduces two key modules, Reward Guided Filtering and Self-Optimizing Refinement, to both filter and refine the synthetic data, effectively mitigating the noise. Extensive experiments across medical, financial, and code generation domains demonstrate the effectiveness of our method.
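The filtering module can be pictured as a simple quantile cut on reward-proxy scores. The sketch below assumes a hypothetical `reward_model(prompt, response)` callable returning a scalar quality score; the refinement loop described in the abstract is omitted.

```python
import numpy as np

def reward_guided_filter(synthetic_pairs, reward_model, keep_fraction=0.5):
    """synthetic_pairs: list of (prompt, response) tuples produced by the DP generator."""
    scores = np.array([reward_model(p, r) for p, r in synthetic_pairs])
    threshold = np.quantile(scores, 1.0 - keep_fraction)       # keep the top fraction by reward
    return [pair for pair, s in zip(synthetic_pairs, scores) if s >= threshold]
```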
pdf
bib
abs
Synergizing Multimodal Temporal Knowledge Graphs and Large Language Models for Social Relation Recognition
Haorui Wang
|
Zheng Wang
|
Yuxuan Zhang
|
Bo Wang
|
Bin Wu
Recent years have witnessed remarkable advances in Large Language Models (LLMs). However, in the task of social relation recognition, LLMs encounter significant challenges due to their reliance on sequential training data, which inherently restricts their capacity to effectively model complex graph-structured relationships. To address this limitation, we propose a novel low-coupling method synergizing multimodal temporal Knowledge Graphs and Large Language Models (mtKG-LLM) for social relation reasoning. Specifically, we extract multimodal information from the videos and model the social networks as spatial Knowledge Graphs (KGs) for each scene. Temporal KGs are constructed based on spatial KGs and updated along the timeline for long-term reasoning. Subsequently, we retrieve multi-scale information from the graph-structured knowledge for LLMs to recognize the underlying social relation. Extensive experiments demonstrate that our method has achieved state-of-the-art performance in social relation recognition. Furthermore, our framework exhibits effectiveness in bridging the gap between KGs and LLMs. Our code will be released after acceptance.
pdf
bib
abs
LegalSearchLM: Rethinking Legal Case Retrieval as Legal Elements Generation
Chaeeun Kim
|
Jinu Lee
|
Wonseok Hwang
Legal Case Retrieval (LCR), which retrieves relevant cases from a query case, is a fundamental task for legal professionals in research and decision-making. However, existing studies on LCR face two major limitations. First, they are evaluated on relatively small-scale retrieval corpora (e.g., 100-55K cases) and use a narrow range of criminal query types, which cannot sufficiently reflect the complexity of real-world legal retrieval scenarios. Second, their reliance on embedding-based or lexical matching methods often results in limited representations and legally irrelevant matches. To address these issues, we present: (1) LEGAR BENCH, the first large-scale Korean LCR benchmark, covering 411 diverse crime types in queries over 1.2M candidate cases; and (2) LegalSearchLM, a retrieval model that performs legal element reasoning over the query case and directly generates content containing those elements, grounded in the target cases through constrained decoding. Experimental results show that LegalSearchLM outperforms baselines by 6 - 20% on LEGAR BENCH, achieving state-of-the-art performance. It also demonstrates strong generalization to out-of-domain cases, outperforming naive generative models trained on in-domain data by 15%.
pdf
bib
abs
ChartMind: A Comprehensive Benchmark for Complex Real-world Multimodal Chart Question Answering
Jingxuan Wei
|
Nan Xu
|
Junnan Zhu
|
Haoyanni
|
Gaowei Wu
|
Qi Chen
|
Bihui Yu
|
Lei Wang
Chart question answering (CQA) has become a critical multimodal task for evaluating the reasoning capabilities of vision-language models. While early approaches have shown promising performance by focusing on visual features or leveraging large-scale pre-training, most existing evaluations rely on rigid output formats and objective metrics, thus ignoring the complex, real-world demands of practical chart analysis. In this paper, we introduce ChartMind, a new benchmark designed for complex CQA tasks in real-world settings. ChartMind covers seven task categories, incorporates multilingual contexts, supports open-domain textual outputs, and accommodates diverse chart formats, bridging the gap between real-world applications and traditional academic benchmarks. Furthermore, we propose a context-aware yet model-agnostic framework, ChartLLM, that focuses on extracting key contextual elements, reducing noise, and enhancing the reasoning accuracy of multimodal large language models. Extensive evaluations on ChartMind and three representative public benchmarks with 14 mainstream multimodal models show our framework significantly outperforms the previous three common CQA paradigms: instruction-following, OCR-enhanced, and chain-of-thought, highlighting the importance of flexible chart understanding for real-world CQA. These findings suggest new directions for developing more robust chart reasoning in future research.
pdf
bib
abs
COLA: Collaborative Multi-Agent Framework with Dynamic Task Scheduling for GUI Automation
Di Zhao
|
Longhui Ma
|
Siwei Wang
|
Miao Wang
|
Zhao Lv
With the rapid advancements in Large Language Models (LLMs), an increasing number of studies have leveraged LLMs as the cognitive core of agents to address complex task decision-making challenges. Specifically, recent research has demonstrated the potential of LLM-based agents for automating GUI operations. However, existing methodologies exhibit two critical challenges: (1) static agent architectures struggle to adapt to diverse GUI application scenarios, leading to inadequate scenario generalization; (2) agent workflows lack a fault-tolerance mechanism, necessitating complete process re-execution after a GUI agent decision error. To address these limitations, we introduce COLA, a collaborative multi-agent framework for automating GUI operations. In this framework, a scenario-aware agent, the Task Scheduler, decomposes task requirements into atomic capability units and dynamically selects the optimal agent from a decision agent pool, effectively responding to the capability requirements of diverse scenarios. Furthermore, we develop an interactive backtracking mechanism that enables humans to intervene and trigger state rollbacks for non-destructive process repair. Experiments on the GAIA dataset show that COLA achieves competitive performance among GUI agent methods, with an average accuracy of 31.89%. On WindowsAgentArena, it performs particularly well in Web Browser (33.3%), Media & Video (33.3%), and Windows Utils (25.0%), suggesting the effectiveness of specialized agent design and dynamic strategy allocation. The code is available at https://github.com/Alokia/COLA-demo.
pdf
bib
abs
DASA-Trans-STM: Adaptive Efficient Transformer for Short Text Matching using Data Augmentation and Semantic Awareness
Jiguo Liu
|
Chao Liu
|
Meimei Li
|
Nan Li
|
Shihao Gao
|
Dali Zhu
Recent advancements in large language models (LLMs) have shown impressive versatility across various tasks. Short text matching is one of the fundamental technologies in natural language processing. In previous studies, the common approach to applying models to Chinese is to segment each sentence into words and then take these words as input. However, existing approaches have three limitations: 1) Some Chinese words are polysemous, and semantic information is not fully utilized. 2) Some models suffer from issues caused by word segmentation, and incorrect recognition of negation words affects the semantic understanding of the whole sentence. 3) Fuzzy negation words in ancient Chinese are difficult to recognize and match. In this work, we propose a novel adaptive Transformer for Chinese short text matching using Data Augmentation and Semantic Awareness (DASA), which can fully mine the information expressed in Chinese text to deal with word ambiguity. DASA is based on a Graph Attention Transformer Encoder that takes two word lattice graphs as input and integrates sense information from N-HowNet to moderate word ambiguity. Specifically, we use an LLM to generate similar sentences for the optimal text representation. Experimental results show that the augmentation done using DASA can considerably boost the performance of our system and achieve significantly better results than previous state-of-the-art methods on four available datasets, namely MNS, LCQMC, AFQMC, and BQ.
pdf
bib
abs
Pruning the Paradox: How CLIP’s Most Informative Heads Enhance Performance While Amplifying Bias
Avinash Madasu
|
Vasudev Lal
|
Phillip Howard
CLIP is one of the most popular foundation models and is heavily used for many vision-language tasks, yet little is known about its inner workings. As CLIP is increasingly deployed in real-world applications, it is becoming even more critical to understand its limitations and embedded social biases to mitigate potentially harmful downstream consequences. However, the question of what internal mechanisms drive both the impressive capabilities as well as problematic shortcomings of CLIP has largely remained unanswered. To bridge this gap, we study the conceptual consistency of text descriptions for attention heads in CLIP-like models. Specifically, we propose Concept Consistency Score (CCS), a novel interpretability metric that measures how consistently individual attention heads in CLIP models align with specific concepts. Our soft-pruning experiments reveal that high CCS heads are critical for preserving model performance, as pruning them leads to a significantly larger performance drop than pruning random or low CCS heads. Notably, we find that high CCS heads capture essential concepts and play a key role in out-of-domain detection, concept-specific reasoning, and video-language understanding. Moreover, we prove that high CCS heads learn spurious correlations which amplify social biases. These results position CCS as a powerful interpretability metric exposing the paradox of performance and social biases in CLIP models.
pdf
bib
abs
CoLA: Compute-Efficient Pre-Training of LLMs via Low-Rank Activation
Ziyue Liu
|
Ruijie Zhang
|
Zhengyang Wang
|
Mingsong Yan
|
Zi Yang
|
Paul D. Hovland
|
Bogdan Nicolae
|
Franck Cappello
|
Sui Tang
|
Zheng Zhang
The full-size MLPs and the projection layers in attention introduce tremendous model sizes of large language models (LLMs), consuming extensive computational resources in pre-training. We empirically observe that the activations of pre-trained LLMs exhibit a low-rank property. Motivated by such observations, we propose **CoLA** and its memory-efficient implementation, **CoLA-M**, to replace these full-size layers with compute-efficient **auto-encoders** that naturally enforce low-rank activations throughout training. This fundamental architectural change eliminates the activation redundancy and significantly boosts model capacity and training efficiency. Experiments on LLaMA models with 60 million to 7 billion parameters show that CoLA reduces the computing cost by 2× and improves training throughput by 1.86× while maintaining full-rank level performance. CoLA-M further squeezes memory cost without sacrificing throughput, offering a pre-training approach with collectively superior parameter, computing, and memory efficiency. The LLMs produced are also 2× smaller, enabling faster inference with lower memory cost on resource-constrained platforms.
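The architectural change named in the abstract, replacing a full-size linear map with a low-rank, auto-encoder-style bottleneck, can be sketched in a few lines of PyTorch. The dimensions, rank, and absence of a non-linearity are illustrative assumptions rather than CoLA's exact design.

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Replace a d_in -> d_out projection with a rank-r bottleneck (2*d*r params instead of d*d)."""
    def __init__(self, d_in, d_out, rank):
        super().__init__()
        self.encode = nn.Linear(d_in, rank, bias=False)    # compress activations to rank r
        self.decode = nn.Linear(rank, d_out, bias=False)   # expand back to the output width

    def forward(self, x):
        return self.decode(self.encode(x))

layer = LowRankLinear(d_in=4096, d_out=4096, rank=256)
y = layer(torch.randn(2, 16, 4096))                        # (batch, seq, d_out)
print(y.shape)
```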
pdf
bib
abs
TS-CLIP: Time Series Understanding by CLIP
Ziwen Chen
|
Xiaoyuan Zhang
|
Ming Zhu
Contrastive Language–Image Pre-training (CLIP) has recently demonstrated remarkable success in aligning vision and language. Aligning time series with text leverages the rich semantic cues of language to enhance interpretability and generalization, addressing a largely underexplored area of research. Although applying the CLIP training paradigm to time-series and language pairs is promising, it may result in label collapse due to the sparse semantic annotations and the absence of visual cues in time-series data. To address this, we introduce Time Series CLIP (TS-CLIP), a novel approach that tackles label collapse using a synonym bank mechanism. The synonym bank exploits word analogy phenomena to generate potential synonym embeddings as alignment targets. Specifically, the synonym bank facilitates aligning time series with a word distribution instead of a precise textual description. We conducted extensive zero-shot and few-shot experiments on 128 sub-datasets from the UCR archive. The results show that TS-CLIP achieves state-of-the-art (SOTA) performance in zero-shot settings on 51 datasets. Comprehensive ablation studies and visualization analyses reveal that TS-CLIP effectively aligns time series with natural language. To the best of our knowledge, this is the first foundational model to achieve general time series and natural language alignment. TS-CLIP introduces a new paradigm for the semantic understanding of time series and opens the possibility of integrating the time series modality into multimodal large models.
pdf
bib
abs
MultiAgentESC: A LLM-based Multi-Agent Collaboration Framework for Emotional Support Conversation
Yangyang Xu
|
Jinpeng Hu
|
Zhuoer Zhao
|
Zhangling Duan
|
Xiao Sun
|
Xun Yang
The development of Emotional Support Conversation (ESC) systems is critical for delivering mental health support tailored to the needs of help-seekers. Recent advances in large language models (LLMs) have contributed to progress in this domain, while most existing studies focus on generating responses directly and overlook the integration of domain-specific reasoning and expert interaction. Therefore, in this paper, we propose a training-free Multi-Agent collaboration framework for ESC (MultiAgentESC). The framework is designed to emulate the human-like process of providing emotional support through three stages: dialogue analysis, strategy deliberation, and response generation. At each stage, a multi-agent system is employed to iteratively enhance information understanding and reasoning, simulating real-world decision-making processes by incorporating diverse interactions among these expert agents. Additionally, we introduce a novel response-centered approach to handle the one-to-many problem on strategy selection, where multiple valid strategies are initially employed to generate diverse responses, followed by the selection of the optimal response through multi-agent collaboration. Experiments on the ESConv dataset reveal that our proposed framework excels at providing emotional support as well as diversifying support strategy selection.
pdf
bib
abs
Continuously Steering LLMs Sensitivity to Contextual Knowledge with Proxy Models
Yilin Wang
|
Heng Wang
|
Yuyang Bai
|
Minnan Luo
In Large Language Model (LLM) generation, knowledge conflicts arise in scenarios where parametric knowledge contradicts knowledge provided in the context. Previous works studied tuning, decoding algorithms, or locating and editing context-aware neurons to adapt LLMs to be faithful to new contextual knowledge. However, they are usually inefficient or ineffective for large models, not workable for black-box models, or unable to continuously adjust LLMs’ sensitivity to the knowledge provided in the context. To mitigate these problems, we propose CSKS (Continuously Steering Knowledge Sensitivity), a simple framework that can steer LLMs’ sensitivity to contextual knowledge continuously at a lightweight cost. Specifically, we tune two small LMs (i.e., proxy models) and use the difference in their output distributions to shift the original distribution of an LLM without modifying the LLM weights. In the evaluation process, we not only design synthetic data and fine-grained metrics to measure models’ sensitivity to contextual knowledge but also use a real conflict dataset to validate CSKS’ practical efficacy. Extensive experiments demonstrate that our framework achieves continuous and precise control over LLMs’ sensitivity to contextual knowledge, enabling both increased sensitivity and reduced sensitivity, thereby allowing LLMs to flexibly prioritize either contextual or parametric knowledge as needed.
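The proxy-steering mechanism can be summarized as a logit-space shift: the frozen LLM's next-token logits are moved by the difference between a tuned proxy and its untuned counterpart, scaled by a sensitivity coefficient. The combination rule and the shared-vocabulary setup below are assumptions for illustration, not the released CSKS code.

```python
import torch

def steered_next_token_logits(base_logits, proxy_tuned_logits, proxy_base_logits, alpha=1.0):
    """All inputs: (vocab_size,) next-token logits; the three models are assumed to share a vocabulary."""
    # alpha > 0 increases sensitivity to contextual knowledge; alpha < 0 favors parametric knowledge.
    return base_logits + alpha * (proxy_tuned_logits - proxy_base_logits)

vocab = 32000
shifted = steered_next_token_logits(torch.randn(vocab), torch.randn(vocab), torch.randn(vocab), alpha=1.5)
probs = torch.softmax(shifted, dim=-1)   # sample or take argmax from the steered distribution
```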
pdf
bib
abs
Probing LLM World Models: Enhancing Guesstimation with Wisdom of Crowds Decoding
Yun-Shiuan Chuang
|
Sameer Narendran
|
Nikunj Harlalka
|
Alexander Cheung
|
Sizhe Gao
|
Siddharth Suresh
|
Junjie Hu
|
Timothy T. Rogers
Guesstimation—the task of making approximate quantitative estimates about objects or events—is a common real-world skill, yet remains underexplored in large language model (LLM) research. We introduce three guesstimation datasets: MARBLES, FUTURE, and ELECPRED, spanning physical estimation (e.g., how many marbles fit in a cup) to abstract predictions (e.g., the 2024 U.S. presidential election). Inspired by the social science concept of Wisdom of Crowds (WOC)—where the median of multiple estimates improves accuracy—we propose WOC decoding for LLMs. We replicate WOC effects in human participants and find that LLMs exhibit similar benefits: median aggregation across sampled responses consistently improves accuracy over greedy, self-consistency decoding, and mean decoding. This suggests that LLMs encode a world model that supports approximate reasoning. Our results position guesstimation as a useful probe of LLM world knowledge and highlight WOC decoding as a strategy for enhancing LLM guesstimation performance on real-world tasks.
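WOC decoding amounts to sampling several independent answers and aggregating them with the median. The sketch below assumes a hypothetical `sample_numeric_answer(prompt)` helper that queries the LLM once with sampling enabled and parses one numeric estimate.

```python
import random
import statistics

def woc_decode(prompt, sample_numeric_answer, n_samples=20):
    """Median-aggregate repeated stochastic estimates, following the Wisdom of Crowds idea."""
    estimates = [sample_numeric_answer(prompt) for _ in range(n_samples)]
    return statistics.median(estimates)

# Stand-in sampler for demonstration only; replace with a real LLM call.
fake_sampler = lambda _prompt: random.gauss(350, 60)
print(woc_decode("How many marbles fit in a standard coffee mug?", fake_sampler))
```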
pdf
bib
abs
Recall with Reasoning: Chain-of-Thought Distillation for Mamba’s Long-Context Memory and Extrapolation
Jun-Yu Ma
|
Tianqing Fang
|
Zhisong Zhang
|
Hongming Zhang
|
Haitao Mi
|
Dong Yu
Mamba’s theoretical infinite-context potential is limited in practice when sequences far exceed training lengths. This work explores unlocking Mamba’s long-context memory ability with a simple yet effective method, Recall with Reasoning (RwR), which distills chain-of-thought (CoT) summarization from a teacher model. Specifically, RwR prepends these summarizations as CoT prompts during fine-tuning, teaching Mamba to actively recall and reason over long contexts. Experiments on LONGMEMEVAL and HELMET show that RwR outperforms existing long-term memory methods on the Mamba model. Furthermore, under similar pre-training conditions, RwR improves the long-context performance of Mamba relative to comparable Transformer/hybrid baselines while preserving short-context capabilities, all without changing the architecture.
pdf
bib
abs
Scalable Data Synthesis through Human-like Cognitive Imitation and Data Recombination
Zhongyi Ye
|
Weitai Zhang
|
Xinyuan Zhou
|
Yongxin Zhu
|
Ninghui Rao
|
Enhong Chen
Large language models (LLMs) rely on massive amounts of training data; however, the quantity of empirically observed data is limited. To alleviate this issue, many LLMs leverage synthetic data to enhance the quantity of training data. Despite significant advancements in LLMs, the efficiency and scalability characteristics of data synthesis during pre-training phases remain insufficiently explored. In this work, we propose a novel data synthesis framework, Cognitive Combination Synthesis (CCS), designed to achieve highly efficient and scalable data synthesis. Specifically, our methodology mimics human cognitive behaviors by recombining and interconnecting heterogeneous data from diverse sources, thereby enhancing advanced reasoning capabilities in LLMs. Extensive experiments demonstrate that: (1) effective data organization is essential, and our mapping-based combination learning approach significantly improves data utilization efficiency; (2) by enhancing data diversity, accuracy, and complexity, our synthetic data scales beyond 100B tokens, revealing CCS’s strong scalability. Our findings highlight the impact of data organization methods on LLM learning efficiency and the significant potential of scalable synthetic data to enhance model reasoning capabilities.
pdf
bib
abs
BeSimulator: A Large Language Model Powered Text-based Behavior Simulator
Jianan Wang
|
Bin Li
|
Jingtao Qi
|
Xueying Wang
|
Fu Li
|
Lihanxun
Traditional robot simulators focus on physical process modeling and realistic rendering, often suffering from high computational costs, inefficiencies, and limited adaptability. To handle this issue, we concentrate on behavior simulation in robotics to analyze and validate the logic behind robot behaviors, aiming to achieve preliminary evaluation before deploying resource-intensive simulators and thus enhance simulation efficiency. In this paper, we propose BeSimulator, a modular and novel LLM-powered framework, as an attempt towards behavior simulation in the context of text-based environments. By constructing text-based virtual environments and performing semantic-level simulation, BeSimulator can generalize across scenarios and achieve long-horizon complex simulation. Inspired by the human cognition paradigm, it employs a “consider-decide-capture-transfer” four-phase simulation process, termed Chain of Behavior Simulation (CBS), which excels at analyzing action feasibility and state transitions. Additionally, BeSimulator incorporates code-driven reasoning to enable arithmetic operations and enhance reliability, as well as reflective feedback to refine the simulation. Based on our manually constructed behavior-tree-based simulation benchmark, BTSIMBENCH, our experiments show a significant performance improvement in behavior simulation compared to baselines, ranging from 13.60% to 24.80%. Code and data are available at https://github.com/Dawn888888/BeSimulator.
pdf
bib
abs
Too Consistent to Detect: A Study of Self-Consistent Errors in LLMs
Hexiang Tan
|
Fei Sun
|
Sha Liu
|
Du Su
|
Qi Cao
|
Xin Chen
|
Jingang Wang
|
Xunliang Cai
|
Yuanzhuo Wang
|
Huawei Shen
|
Xueqi Cheng
As large language models (LLMs) often generate plausible but incorrect content, error detection has become increasingly critical to ensure truthfulness. However, existing detection methods often overlook a critical problem we term the **self-consistent error**, where LLMs repeatedly generate the same incorrect response across multiple stochastic samples. This work formally defines self-consistent errors and evaluates mainstream detection methods on them. Our investigation reveals two key findings: (1) Unlike inconsistent errors, whose frequency diminishes significantly as the LLM scale increases, the frequency of self-consistent errors remains stable or even increases. (2) All four types of detection methods significantly struggle to detect self-consistent errors. These findings reveal critical limitations in current detection methods and underscore the need for improvement. Motivated by the observation that self-consistent errors often differ across LLMs, we propose a simple but effective cross-model probe method that fuses hidden state evidence from an external verifier LLM. Our method significantly enhances performance on self-consistent errors across three LLM families.
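A cross-model probe of the kind described can be as simple as a linear classifier over concatenated hidden-state features from the answering model and an external verifier. Feature extraction is abstracted away below; the shapes, the choice of logistic regression, and the use of scikit-learn are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_cross_model_probe(h_main, h_verifier, is_error):
    """h_main, h_verifier: (n, d) hidden-state features; is_error: (n,) binary labels."""
    features = np.concatenate([h_main, h_verifier], axis=1)   # fuse evidence from both models
    probe = LogisticRegression(max_iter=1000)
    probe.fit(features, is_error)
    return probe

rng = np.random.default_rng(0)
probe = train_cross_model_probe(rng.standard_normal((200, 16)),
                                rng.standard_normal((200, 16)),
                                rng.integers(0, 2, 200))
```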
pdf
bib
abs
pFedGPT: Hierarchically Optimizing LoRA Aggregation Weights for Personalized Federated GPT Models
Zhanming Shen
|
Tianqi Xu
|
Hao Wang
|
Jian Li
|
Miao Pan
Federated finetuning of Large Language Models (LLMs) using Low-Rank Adaptation (LoRA) offers computational efficiency and preserves data privacy. However, applying LoRA in federated settings faces significant challenges: standard approaches struggle with data heterogeneity, and existing personalization techniques fail to precisely adapt shared global knowledge to individual client needs. To address these issues, we propose pFedGPT, a framework that leverages Hierarchical Bayesian Optimization (HBO) for fine-grained, personalized LoRA aggregation. pFedGPT intelligently partitions LoRA parameters based on model structure and client information, then employs HBO to hierarchically search for optimal, module-specific weights. This enables a nuanced integration of the downloaded global LoRA state with each client’s local model, precisely capturing client-specific requirements. To manage the optimization cost inherent in HBO, pFedGPT incorporates efficient multi-fidelity evaluations and a curriculum learning strategy. Extensive experiments demonstrate that pFedGPT achieves state-of-the-art (SOTA) performance on personalized FL benchmarks, showcasing robustness and scalability while introducing only minimal (approx. 4%) additional optimization overhead. Our results also underscore the limitations of traditional FL methods for LoRA-based LLM personalization, highlighting the need for tailored approaches like pFedGPT.
pdf
bib
abs
QSpec: Speculative Decoding with Complementary Quantization Schemes
Juntao Zhao
|
Wenhao Lu
|
Sheng Wang
|
Lingpeng Kong
|
Chuan Wu
Quantization is widely adopted to accelerate inference and reduce memory consumption in large language models (LLMs). While activation-weight joint quantization enables efficient low-precision decoding, it suffers substantial performance degradation on multi-step reasoning tasks. We propose QSPEC, a novel quantization paradigm that decouples efficiency from quality by integrating two complementary schemes via speculative decoding: low-precision joint quantization for fast drafting and high-precision weight-only quantization for accurate verification. QSPEC reuses both weights and KV cache across stages, enabling near-zero-cost switching without retraining or auxiliary models. Compared to high-precision baselines, QSPEC achieves up to 1.64x speedup without quality degradation, and outperforms state-of-the-art speculative decoding methods by up to 1.55x in batched settings. Furthermore, QSPEC supports plug-and-play deployment and generalizes well across model scales, quantization methods, and workloads. These properties make QSPEC a practical and scalable solution for high-fidelity quantized LLM serving under memory-constrained scenarios.
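The decoding loop behind this kind of scheme pairs a cheap drafter with a higher-fidelity verifier. The greedy skeleton below treats both as opaque callables that map a token list to the next token id; weight and KV-cache sharing, batching, and the actual quantization kernels are all omitted, so this illustrates speculative decoding in general rather than QSPEC itself.

```python
def speculative_step(tokens, draft_next_token, verify_next_token, k=4):
    """Draft k tokens with the fast low-precision pass, then verify with the high-precision pass."""
    draft_ctx, proposed = list(tokens), []
    for _ in range(k):                              # 1) cheap drafting
        tok = draft_next_token(draft_ctx)
        proposed.append(tok)
        draft_ctx.append(tok)
    accepted, verify_ctx = [], list(tokens)
    for tok in proposed:                            # 2) accept the longest agreeing prefix
        target = verify_next_token(verify_ctx)
        if target == tok:
            accepted.append(tok)
            verify_ctx.append(tok)
        else:
            accepted.append(target)                 # replace the first mismatch and stop
            break
    return tokens + accepted
```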
pdf
bib
abs
Co-Evolving LLMs and Embedding Models via Density-Guided Preference Optimization for Text Clustering
Zetong Li
|
Qinliang Su
|
Minhua Huang
|
Yin Yang
Large language models (LLMs) have shown strong potential in enhancing text clustering when combined with traditional embedding models. However, existing methods predominantly treat LLMs as static pseudo-oracles, i.e., unidirectionally querying them for similarity assessment or data augmentation, while never seeking feedback from embedding models to improve them. In this work, we propose a training framework that enables bidirectional refinement between LLMs and embedding models. We first design task-aware prompts to guide the LLM in generating interpretations for the input texts. These interpretations are projected into the embedding space, in which interpretations that are preferred by the embedding model are selected based on their distribution densities. The selected interpretations are then used to fine-tune the LLM via preference optimization to prioritize the generation of helpful interpretations. Meanwhile, we enhance the embedding model via contrastive learning on the generated interpretations and perform clustering on the output embeddings, leading to iterative co-training between the LLM and the embedding model. Experiments on 14 benchmark datasets across 5 tasks demonstrate the effectiveness of our method.
pdf
bib
abs
P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs
Yidan Zhang
|
Yu Wan
|
Boyi Deng
|
Baosong Yang
|
Hao-Ran Wei
|
Fei Huang
|
Bowen Yu
|
Dayiheng Liu
|
Junyang Lin
|
Fei Huang
|
Jingren Zhou
Recent advancements in large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning. Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks. To alleviate this drawback, we aim to present a comprehensive multilingual multitask benchmark. First, we introduce P-MMEval, a large-scale benchmark covering fundamental and capability-specialized datasets. Furthermore, P-MMEval delivers consistent language coverage across various datasets and provides parallel samples. Finally, we conduct extensive experiments on representative multilingual model series to compare performances across models and tasks, explore the relationship between multilingual performances and factors such as tasks, model sizes, languages, and prompts, and examine the effectiveness of knowledge transfer from English to other languages. The resulting insights are intended to offer valuable guidance for future research.
pdf
bib
abs
Single LLM, Multiple Roles: A Unified Retrieval-Augmented Generation Framework Using Role-Specific Token Optimization
Yutao Zhu
|
Jiajie Jin
|
Hongjin Qian
|
Zheng Liu
|
Zhicheng Dou
|
Ji-Rong Wen
Existing studies have optimized retrieval-augmented generation (RAG) across various sub-tasks, such as query understanding and retrieval refinement, but integrating these optimizations into a unified framework remains challenging. To tackle this problem, this work proposes RoleRAG, a unified RAG framework that achieves efficient multi-task processing through role-specific token optimization. RoleRAG comprises six modules, each handling a specific sub-task within the RAG process. Additionally, we introduce a query graph to represent the decomposition of the query, which can be dynamically resolved according to the decomposing state. All modules are driven by the same underlying LLM, distinguished by task-specific role tokens that are individually optimized. This design allows RoleRAG to dynamically activate different modules within a single LLM instance, thereby streamlining deployment and reducing resource consumption. Experimental results on five open-domain question-answering datasets demonstrate the effectiveness, generalizability, and flexibility of our framework.
pdf
bib
abs
TrInk: Ink Generation with Transformer Network
Zezhong Jin
|
Shubhang Desai
|
Xu Chen
|
Biyi Fang
|
Zhuoyi Huang
|
Zhe Li
|
Chong-Xin Gan
|
Xiao Tu
|
Man-Wai Mak
|
Yan Lu
|
Shujie Liu
In this paper, we propose TrInk, a Transformer-based model for ink generation, which effectively captures global dependencies. To better facilitate the alignment between the input text and generated stroke points, we introduce scaled positional embeddings and a Gaussian memory mask in the cross-attention module. Additionally, we design both subjective and objective evaluation pipelines to comprehensively assess the legibility and style consistency of the generated handwriting. Experiments demonstrate that our Transformer-based model achieves a 35.56% reduction in character error rate (CER) and a 29.66% reduction in word error rate (WER) on the IAM-OnDB dataset compared to previous methods. We provide a demo page with handwriting samples from TrInk and baseline models at: https://akahello-a11y.github.io/trink-demo/
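A Gaussian memory mask for cross-attention can be written as an additive bias that penalizes attention far from an expected text position at each decoding step. The linear alignment schedule and fixed width below are assumptions for illustration, not TrInk's learned parameters.

```python
import torch

def gaussian_memory_mask(num_queries, num_keys, sigma=3.0):
    """Additive (num_queries, num_keys) bias for attention logits, peaked along a diagonal alignment."""
    q_pos = torch.arange(num_queries, dtype=torch.float32).unsqueeze(1)   # (Q, 1)
    k_pos = torch.arange(num_keys, dtype=torch.float32).unsqueeze(0)      # (1, K)
    mu = q_pos * (num_keys - 1) / max(num_queries - 1, 1)                 # expected text position per step
    return -((k_pos - mu) ** 2) / (2 * sigma ** 2)

scores = torch.randn(50, 20) + gaussian_memory_mask(50, 20)   # add the mask before the softmax
attn = torch.softmax(scores, dim=-1)
```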
pdf
bib
abs
CalligraphicOCR for Chinese Calligraphy Recognition
Xiaoyi Bao
|
Zhongqing Wang
|
Jinghang Gu
|
Chu-Ren Huang
With thousands of years of history, calligraphy serves as one of the representative symbols of Chinese culture. An increasing number of works attempt to digitize calligraphy by recognizing its content for better preservation and propagation. However, previous works are restricted to isolated single-character recognition, which not only requires impractical manual splitting into characters but also discards the rich contextual information that could serve as a supplementary signal. To this end, we construct a pioneering end-to-end calligraphy recognition benchmark dataset, which is challenging due to both visual variations, such as different writing styles, and textual understanding, such as the domain shift in semantics. We further propose CalligraphicOCR (COCR), equipped with calligraphic image augmentation and an action-based corrector that target the root challenges of this setting. Experiments demonstrate the advantage of our proposed model over cutting-edge baselines, underscoring the necessity of introducing this new setting and thereby providing a solid precondition for protecting and propagating these already scarce resources.
pdf
bib
abs
When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models
Cheng Wang
|
Gelei Deng
|
Xianglin Yang
|
Han Qiu
|
Tianwei Zhang
Large Audio-Language Models (LALMs) are augmented with the ability to perceive audio, demonstrating impressive capabilities in processing combined audio and text signals. However, their reliability when faced with conflicting inputs across modalities remains largely unexplored. This study examines how LALMs prioritize information when presented with inconsistent audio-text pairs. Through extensive evaluation across diverse audio understanding tasks, we reveal a concerning phenomenon: when inconsistencies exist between modalities, LALMs display a significant bias toward textual input, often disregarding audio evidence. This tendency leads to substantial performance degradation in audio-centric tasks and raises important reliability concerns for real-world applications. We further investigate the influencing factors of text bias, explore mitigation strategies through supervised fine-tuning, and analyze model confidence patterns that reveal persistent overconfidence even with contradictory inputs. These findings underscore the need for improved modality balancing during training and more sophisticated fusion mechanisms to enhance robustness when handling conflicting multi-modal inputs. The project is available at https://github.com/WangCheng0116/MCR-BENCH.
pdf
bib
abs
RESF: Regularized-Entropy-Sensitive Fingerprinting for Black-Box Tamper Detection of Large Language Models
Pingyi Hu
|
Xiaofan Bai
|
Xiaojing Ma
|
Chaoxiang He
|
Dongmei Zhang
|
Bin Benjamin Zhu
The proliferation of Machine Learning as a Service (MLaaS) has enabled widespread deployment of large language models (LLMs) via cloud APIs, but also raises critical concerns about model integrity and security. Existing black-box tamper detection methods, such as watermarking and fingerprinting, rely on the stability of model outputs—a property that does not hold for inherently stochastic LLMs. We address this challenge by formulating black-box tamper detection for LLMs as a hypothesis-testing problem. To enable efficient and sensitive fingerprinting, we derive a first-order surrogate for KL divergence—the entropy-gradient norm—to identify prompts most responsive to parameter perturbations. Building on this, we propose Regularized Entropy-Sensitive Fingerprinting (RESF), which enhances sensitivity while regularizing entropy to improve output stability and control false positives. To further distinguish tampering from benign randomness, such as temperature shifts, RESF employs a lightweight two-tier sequential test combining support-based and distributional checks with rigorous false-alarm control. Comprehensive analysis and experiments across multiple LLMs show that RESF achieves up to 98.80% detection accuracy under challenging conditions, such as minimal LoRA fine-tuning with five optimized fingerprints. RESF consistently demonstrates strong sensitivity and robustness, providing an effective and scalable solution for black-box tamper detection in cloud-deployed LLMs.
pdf
bib
abs
Model-based Large Language Model Customization as Service
Zhaomin Wu
|
Jizhou Guo
|
Junyi Hou
|
Bingsheng He
|
Lixin Fan
|
Qiang Yang
Prominent Large Language Model (LLM) services from providers like OpenAI and Google excel at general tasks but often underperform on domain-specific applications. Current customization services for these LLMs typically require users to upload data for fine-tuning, posing significant privacy risks. While differentially private (DP) data synthesis presents a potential alternative, its application commonly results in low effectiveness due to the excessive noise that must be injected into the data to satisfy DP. To overcome this, we introduce *Llamdex*, a novel framework that facilitates LLM customization as a service, where the client uploads pre-trained domain-specific *models* rather than data. This client-uploaded model, optionally protected by DP with much lower noise, is inserted into the base LLM via connection modules. Significantly, these connection modules are trained without requiring sensitive domain data, enabling clients to customize LLM services while preserving data privacy. Experiments demonstrate that Llamdex improves domain-specific accuracy by up to 26% over state-of-the-art private data synthesis methods under identical privacy constraints and, by obviating the need for users to provide domain context within queries, maintains inference efficiency comparable to the original LLM service.
pdf
bib
abs
Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents
Haochen Sun
|
Shuwen Zhang
|
Lujie Niu
|
Lei Ren
|
Hao Xu
|
Hao Fu
|
Fangkun Zhao
|
Caixia Yuan
|
Xiaojie Wang
Large Language Model (LLM)-based agent systems have made great strides in real-world applications beyond traditional NLP tasks. This paper proposes a new LLM-based Multi-Agent System (LLM-MAS) benchmark, Collab-Overcooked, built on the popular Overcooked-AI game with more applicable and challenging tasks in interactive environments. Collab-Overcooked extends existing benchmarks in two novel ways. First, it provides a multi-agent framework supporting diverse tasks and objectives and encourages collaboration through natural language communication. Second, it introduces a spectrum of process-oriented evaluation metrics to assess the fine-grained collaboration capabilities of different LLM agents, a dimension often overlooked in prior work. We conduct extensive experiments with 13 popular LLMs and show that, while the LLMs exhibit a strong ability in goal interpretation, there are significant shortcomings in active collaboration and continuous adaptation, which are critical for efficiently fulfilling complex tasks. Notably, we highlight the strengths and weaknesses of LLM-MAS and provide insights for improving and evaluating LLM-MAS on a unified and open-source benchmark. The environments, 30 open-ended tasks, and the evaluation package are publicly available at https://github.com/YusaeMeow/Collab-Overcooked.
pdf
bib
abs
Improving Reasoning Capabilities in Small Models through Mixture-of-layers Distillation with Stepwise Attention on Key Information
Yao Chen
|
Jiawei Sheng
|
Wenyuan Zhang
|
Tingwen Liu
The significant computational demands of large language models have increased interest in distilling reasoning abilities into smaller models via Chain-of-Thought (CoT) distillation. Current CoT distillation methods mainly focus on transferring teacher-generated rationales for complex reasoning to student models. However, they do not adequately explore teachers’ dynamic attention toward critical information during reasoning. We find that language models exhibit progressive attention shifts towards key information during reasoning, which implies essential clues for drawing conclusions. Building on this observation and analysis, we introduce a novel CoT distillation framework that transfers the teacher’s stepwise attention on key information to the student model. This establishes structured guidance for the student’s progressive concentration on key information during reasoning. More importantly, we develop a Mixture of Layers module enabling dynamic alignment that adapts to different layers between the teacher and student. Our method achieves consistent performance improvements across multiple mathematical and commonsense reasoning datasets. To our knowledge, it is the first method to leverage stepwise attention within CoT distillation to improve small model reasoning.
pdf
bib
abs
Through the Valley: Path to Effective Long CoT Training for Small Language Models
Renjie Luo
|
Jiaxi Li
|
Chen Huang
|
Wei Lu
Long chain-of-thought (CoT) supervision has become a common strategy to enhance reasoning in language models. While effective for large models, we identify a phenomenon we call Long CoT Degradation, in which small language models (SLMs; ≤3B parameters) trained on limited long CoT data experience significant performance deterioration. Through extensive experiments on the Qwen2.5, LLaMA3 and Gemma3 families, we demonstrate that this degradation is widespread across SLMs. In some settings, models trained on only 8k long CoT examples lose up to 75% of their original performance before fine-tuning. Strikingly, we further observe that for some particularly small models, even training on 220k long CoT examples fails to recover or surpass their original performance prior to fine-tuning. Our analysis attributes this effect to error accumulation: while longer responses increase the capacity for multi-step reasoning, they also amplify the risk of compounding mistakes. Furthermore, we find that Long CoT Degradation may negatively impact downstream reinforcement learning (RL), although this can be alleviated by sufficiently scaled supervised fine-tuning (SFT). Our findings challenge common assumptions about the benefits of long CoT training for SLMs and offer practical guidance for building more effective small-scale reasoning models.
pdf
bib
abs
RED: Unleashing Token-Level Rewards from Holistic Feedback via Reward Redistribution
Jiahui Li
|
Lin Li
|
Tai-Wei Chang
|
Kun Kuang
|
Long Chen
|
Jun Zhou
|
Cheng Yang
Reinforcement learning from human feedback (RLHF) offers a promising approach to aligning large language models (LLMs) with human preferences. Typically, a reward model is trained or supplied to act as a proxy for humans in evaluating generated responses during the reinforcement training phase. However, current reward models operate as sequence-to-one models, allocating a single, sparse, and delayed reward to an entire output sequence. This approach may overlook the significant contributions of individual tokens toward the desired outcome. To this end, we propose a more fine-grained, token-level guidance approach for RL training. Specifically, we introduce RED, a novel REward reDistribution method that evaluates and assigns specific credit to each token using an off-the-shelf reward model. Utilizing these fine-grained rewards enhances the model’s understanding of language nuances, leading to more precise performance improvements. Notably, our method does not require modifying the reward model or introducing additional training steps, thereby incurring minimal computational costs. Experimental results across diverse datasets and tasks demonstrate the superiority of our approach.
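As a rough, hypothetical illustration of what token-level reward redistribution can look like (not the authors' exact formulation), the sketch below splits one holistic sequence reward into per-token credits by differencing an off-the-shelf reward model's scores on successive prefixes, so the credits telescope back to the full-sequence reward. The reward model here is a stand-in callable.

```python
from typing import Callable, List

def redistribute_reward(
    tokens: List[str],
    reward_model: Callable[[List[str]], float],
) -> List[float]:
    """Split one sequence-level reward into per-token credits (illustrative).

    Credit for token t is the change in the reward-model score between the
    prefix ending at t and the prefix ending at t-1, so the credits sum to
    the full-sequence reward minus the empty-prefix baseline.
    """
    credits = []
    prev_score = reward_model([])  # score of the empty prefix (baseline)
    for t in range(1, len(tokens) + 1):
        score = reward_model(tokens[:t])
        credits.append(score - prev_score)
        prev_score = score
    return credits

# Toy usage with a stand-in reward model that favors the word "thanks".
toy_rm = lambda prefix: float(sum(tok == "thanks" for tok in prefix))
print(redistribute_reward(["many", "thanks", "!"], toy_rm))  # [0.0, 1.0, 0.0]
```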
pdf
bib
abs
SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models
Peng Ding
|
Wen Sun
|
Dailin Li
|
Wei Zou
|
Jiaming Wang
|
Jiajun Chen
|
Shujian Huang
Large Language Models (LLMs) excel at various natural language processing tasks but remain vulnerable to jailbreaking attacks that induce harmful content generation. In this paper, we reveal a critical safety inconsistency: LLMs can more effectively identify harmful requests as discriminators than defend against them as generators. This insight inspires us to explore aligning the model’s inherent discrimination and generation capabilities. To this end, we propose SDGO (Self-Discrimination-Guided Optimization), a reinforcement learning framework that leverages the model’s own discrimination capabilities as a reward signal to enhance generation safety through iterative self-improvement. Our method does not require any additional annotated data or external models during the training phase. Extensive experiments demonstrate that SDGO significantly improves model safety compared to both prompt-based and training-based baselines while maintaining helpfulness on general benchmarks. By aligning LLMs’ discrimination and generation capabilities, SDGO brings robust performance against out-of-distribution (OOD) jailbreaking attacks. This alignment achieves tighter coupling between these two capabilities, enabling the model’s generation capability to be further enhanced with only a small amount of discriminative samples. Our code and datasets are available at https://github.com/NJUNLP/SDGO.
pdf
bib
abs
InMind: Evaluating LLMs in Capturing and Applying Individual Human Reasoning Styles
Zizhen Li
|
Chuanhao Li
|
Yibin Wang
|
Qi Chen
|
Diping Song
|
Yukang Feng
|
Jianwen Sun
|
Jiaxin Ai
|
Fanrui Zhang
|
Mingzhu Sun
|
Kaipeng Zhang
LLMs have shown strong performance on human-centric reasoning tasks. While previous evaluations have explored whether LLMs can infer intentions or detect deception, they often overlook the individualized reasoning styles that influence how people interpret and act in social contexts. Social deduction games (SDGs) provide a natural testbed for evaluating individualized reasoning styles, where different players may adopt diverse but contextually valid reasoning strategies under identical conditions. To address this, we introduce InMind, a cognitively grounded evaluation framework designed to assess whether LLMs can capture and apply personalized reasoning styles in SDGs. InMind enhances structured gameplay data with round-level strategy traces and post-game reflections, collected under both Observer and Participant modes. It supports four cognitively motivated tasks that jointly evaluate both static alignment and dynamic adaptation. As a case study, we apply InMind to the game Avalon, evaluating 11 state-of-the-art LLMs. General-purpose LLMs, even GPT-4o, frequently rely on lexical cues, struggling to anchor reflections in temporal gameplay or adapt to evolving strategies. In contrast, reasoning-enhanced LLMs like DeepSeek-R1 exhibit early signs of style-sensitive reasoning. These findings reveal key limitations in current LLMs’ capacity for individualized, adaptive reasoning, and position InMind as a step toward cognitively aligned human–AI interaction.
pdf
bib
abs
MIO: A Foundation Model on Multimodal Tokens
Zekun Moore Wang
|
King Zhu
|
Chunpu Xu
|
Wangchunshu Zhou
|
Jiaheng Liu
|
Yibo Zhang
|
Jessie Wang
|
Ning Shi
|
Siyu Li
|
Yizhi Li
|
Haoran Que
|
Zhaoxiang Zhang
|
Yuanxing Zhang
|
Ge Zhang
|
Ke Xu
|
Jie Fu
|
Wenhao Huang
In this paper, we introduce MIO, a novel foundation model built on multimodal tokens, capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner. While the emergence of large language models (LLMs) and multimodal large language models (MM-LLMs) propels advancements in artificial general intelligence through their versatile capabilities, they still lack true any-to-any understanding and generation. Recently, the release of GPT-4o has showcased the remarkable potential of any-to-any LLMs for complex real-world tasks, enabling omnidirectional input and output across images, speech, and text. However, it is closed-source and does not support the generation of multimodal interleaved sequences. To address this gap, we present MIO, which is trained on a mixture of discrete tokens across four modalities using causal multimodal modeling. MIO undergoes a four-stage training process: (1) alignment pre-training, (2) interleaved pre-training, (3) speech-enhanced pre-training, and (4) comprehensive supervised fine-tuning on diverse textual, visual, and speech tasks. Our experimental results indicate that MIO exhibits competitive, and in some cases superior, performance compared to previous dual-modal baselines, any-to-any model baselines, and even modality-specific baselines. Moreover, MIO demonstrates advanced capabilities inherent to its any-to-any feature, such as interleaved video-text generation, chain-of-visual-thought reasoning, visual guideline generation, instructional image editing, etc.
pdf
bib
abs
DART: Distilling Autoregressive Reasoning to Silent Thought
Nan Jiang
|
Ziming Wu
|
De-Chuan Zhan
|
Fuming Lai
|
Shaobing Lian
Chain-of-Thought (CoT) reasoning has significantly advanced Large Language Models (LLMs) in solving complex tasks. However, its autoregressive paradigm leads to significant computational overhead, hindering its deployment in latency-sensitive applications. To address this, we propose **DART** (**D**istilling **A**utoregressive **R**easoning to Silent **T**hought), a self-distillation framework that enables LLMs to replace autoregressive CoT with non-autoregressive Silent Thought (ST). Specifically, DART introduces two training pathways: the CoT pathway for traditional reasoning and the ST pathway for generating answers directly from a few ST tokens. The ST pathway utilizes a lightweight Reasoning Evolvement Module (REM) to align its hidden states with the CoT pathway, enabling the ST tokens to evolve into informative embeddings. During inference, only the ST pathway is activated, leveraging evolving ST tokens to deliver the answer directly. Extensive experimental results demonstrate that DART offers significant performance gains compared with existing non-autoregressive baselines without extra inference latency, serving as a feasible alternative for efficient reasoning.
pdf
bib
abs
LeTS: Learning to Think-and-Search via Process-and-Outcome Reward Hybridization
Qi Zhang
|
Shouqing Yang
|
Lirong Gao
|
Hao Chen
|
Xiaomeng Hu
|
Jinglei Chen
|
Jiexiang Wang
|
Sheng Guo
|
Bo Zheng
|
Haobo Wang
|
Junbo Zhao
Large language models (LLMs) have demonstrated impressive capabilities in reasoning with the emergence of reasoning models like OpenAI-o1 and DeepSeek-R1. Recent research focuses on integrating reasoning capabilities into the realm of retrieval-augmented generation (RAG) via outcome-supervised reinforcement learning (RL) approaches, while the correctness of intermediate think-and-search steps is usually neglected. To address this issue, we design a process-level reward module to mitigate the unawareness of intermediate reasoning steps in outcome-level supervision without additional annotation. Grounded on this, we propose **Le**arning to **T**hink-and-**S**earch (**LeTS**), a novel framework that hybridizes stepwise process reward and outcome-based reward to current RL methods for RAG. Extensive experiments demonstrate the generalization and inference efficiency of **LeTS** across various RAG benchmarks. In addition, these results reveal the potential of process- and outcome-level reward hybridization in boosting LLMs’ reasoning ability via RL under other scenarios.
pdf
bib
abs
CYCLE-INSTRUCT: Fully Seed-Free Instruction Tuning via Dual Self-Training and Cycle Consistency
Zhanming Shen
|
Hao Chen
|
Yulei Tang
|
Shaolin Zhu
|
Wentao Ye
|
Xiaomeng Hu
|
Haobo Wang
|
Gang Chen
|
Junbo Zhao
Instruction tuning is vital for aligning large language models (LLMs) with human intent, but current methods typically rely on costly human-annotated seed data or powerful external teacher models. While instruction back-translation techniques reduce this dependency, they remain fundamentally tethered to an initial seed set, which limits full automation, introduces biases, and can lead to inefficient use of unlabeled corpora. In this paper, we propose Cycle-Instruct, a novel framework that achieves fully seed-free instruction tuning. Inspired by cycle consistency, Cycle-Instruct employs a dual self-training loop where two models—an answer generator and a question generator—are bootstrapped solely from raw, unlabeled text. These models mutually supervise each other by reconstructing original text segments from their counterpart’s generated pseudo-labels, effectively learning from the intrinsic structure of the data without any human-provided seeds. We demonstrate Cycle-Instruct’s efficacy across four diverse data tracks, including general instruction-following, domain-specific tasks, dialogue logs, and plain text. Our extensive experiments show that Cycle-Instruct not only outperforms seed-driven back-translation baselines but also achieves performance comparable to strongly supervised methods.
pdf
bib
abs
Good Intentions Beyond ACL: Who Does NLP for Social Good, and Where?
Grace LeFevre
|
Qingcheng Zeng
|
Adam Leif
|
Jason Jewell
|
Denis Peskoff
|
Rob Voigt
The social impact of Natural Language Processing (NLP) is increasingly important, with a rising community focus on initiatives related to NLP for Social Good (NLP4SG). Indeed, in recent years, almost 20% of all papers in the ACL Anthology address topics related to social good as defined by the UN Sustainable Development Goals (Aduato et al. 2023). In this study, we take an author- and venue-level perspective to map the landscape of NLP4SG, quantifying the proportion of work addressing social good concerns both within and beyond the ACL community, by both core ACL contributors and non-ACL authors. With this approach we discover two surprising facts about the landscape of NLP4SG. First, ACL authors are dramatically more likely to do work addressing social good concerns when publishing in venues outside of ACL. Second, the vast majority of publications using NLP techniques to address concerns of social good are done by non-ACL authors in venues outside of ACL. We discuss the implications of these findings on agenda-setting considerations for the ACL community related to NLP4SG.
pdf
bib
abs
From General Reward to Targeted Reward: Improving Open-ended Long-context Generation Models
Zhihan Guo
|
Jiele Wu
|
Wenqian Cui
|
Yifei Zhang
|
Minda Hu
|
Yufei Wang
|
Irwin King
Current research on long contexts in Large Language Models (LLMs) primarily focuses on understanding long contexts, while **Open-ended Long Text Generation** (Open-LTG) remains insufficiently explored. Training a long text generation model requires the curation of gold-standard reference data, which is typically nonexistent for informative Open-LTG tasks. Consequently, previous methods utilize only general assessments as reward signals, which limits accuracy. To bridge this gap, we introduce **ProxyReward**, an innovative reinforcement learning (RL) based framework, which includes a data synthesis method and a novel reward signal. Firstly, **ProxyReward Dataset** synthesis is accomplished through simple prompts that enable the model to create data automatically, obviating the need for extensive labeled data or significant manual effort. Secondly, the **ProxyReward Signal** offers a targeted evaluation of information comprehensiveness and accuracy for specific questions. The experimental results indicate that our method ProxyReward **surpasses even GPT-4-Turbo**. It can significantly enhance performance by 20% on the Open-LTG task when training widely used open-source models, while also surpassing the LLM-as-a-Judge approach. Our work presents effective methods to enhance the ability of LLMs to address complex open-ended questions posed by humans.
pdf
bib
abs
Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model
Xinyue Lou
|
You Li
|
Jinan Xu
|
Xiangyu Shi
|
Chi Chen
|
Kaiyu Huang
The rapid development of Multimodal Large Reasoning Models (MLRMs) has demonstrated broad application potential, yet their safety and reliability remain critical concerns that require systematic exploration. To address this gap, we conduct a comprehensive and systematic safety evaluation of 13 MLRMs across 5 benchmarks and unveil prevalent safety degradation phenomena in most advanced models. Moreover, our analysis reveals distinct safety patterns across different benchmarks: significant safety degradation is observed across jailbreak robustness benchmarks, whereas safety-awareness benchmarks demonstrate less pronounced degradation. In particular, the long thought process in some scenarios even enhances safety performance. Therefore, leveraging the intrinsic reasoning capabilities of the model to detect unsafe intent is a potential approach to addressing safety issues in MLRMs. To operationalize this insight, we construct a multimodal tuning dataset that incorporates a safety-oriented thought process. Experimental results show that fine-tuning existing MLRMs with this dataset effectively enhances safety on both jailbreak robustness and safety-awareness benchmarks. This study provides a new perspective for developing safe MLRMs.
pdf
bib
abs
Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models
Bajian Xiang
|
Shuaijiang Zhao
|
Tingwei Guo
|
Wei Zou
End-to-end Large Speech Language Models (LSLMs) have demonstrated impressive conversational generation abilities, yet consistently fall short of traditional pipeline systems on semantic understanding benchmarks. In this work, we reveal through systematic experimentation that although LSLMs lose some text input performance after speech-text alignment training, the performance gap between speech and text inputs is more pronounced, which we refer to as the modality gap. To understand this gap, we analyze both coarse- and fine-grained text and speech representations. At the coarse-grained level, representations of speech and text in deeper layers are found to be increasingly aligned in direction (cosine similarity), while concurrently diverging in magnitude (Euclidean distance). We further find that representation similarity is strongly correlated with the modality gap. At the fine-grained level, a spontaneous token-level alignment pattern between text and speech representations is observed. Based on this, we introduce the Alignment Path Score to quantify token-level alignment quality, which exhibits stronger correlation with the modality gap. Building on these insights, we design targeted interventions on critical tokens through angle projection and length normalization. These strategies demonstrate the potential to improve correctness for speech inputs. Our study provides the first systematic empirical analysis of the modality gap and alignment mechanisms in LSLMs, offering both theoretical and methodological guidance for future optimization.
pdf
bib
abs
AssoCiAm: A Benchmark for Evaluating Association Thinking while Circumventing Ambiguity
Yifan Liu
|
Wenkuan Zhao
|
Shanshan Zhong
|
Jinghui Qin
|
Mingfu Liang
|
Zhongzhan Huang
|
Wushao Wen
Recent advancements in multimodal large language models (MLLMs) have garnered significant attention, offering a promising pathway toward artificial general intelligence (AGI). Among the essential capabilities required for AGI, creativity has emerged as a critical trait for MLLMs, with association serving as its foundation. Association reflects a model’s ability to think creatively, making it vital to evaluate and understand. While several frameworks have been proposed to assess associative ability, they often overlook the inherent ambiguity in association tasks, which arises from the divergent nature of associations and undermines the reliability of evaluations. To address this issue, we decompose ambiguity into two types—internal ambiguity and external ambiguity—and introduce AssoCiAm, a benchmark designed to evaluate associative ability while circumventing the ambiguity through a hybrid computational method. We then conduct extensive experiments on MLLMs, revealing a strong positive correlation between cognition and association. Additionally, we observe that the presence of ambiguity in the evaluation process causes MLLMs’ behavior to become more random-like. Finally, we validate the effectiveness of our method in ensuring more accurate and reliable evaluations. See Project Page for the data and codes.
pdf
bib
abs
M-BRe: Discovering Training Samples for Relation Extraction from Unlabeled Texts with Large Language Models
Zexuan Li
|
Hongliang Dai
|
Piji Li
For Relation Extraction (RE), the manual annotation of training data may be prohibitively expensive, since the sentences that contain the target relations in texts can be very scarce and difficult to find. It is therefore beneficial to develop an efficient method that can automatically extract training instances from unlabeled texts for training RE models. Recently, large language models (LLMs) have been adopted in various natural language processing tasks, with RE also benefiting from their advances. However, when leveraging LLMs for RE with predefined relation categories, two key challenges arise. First, in a multi-class classification setting, LLMs often struggle to comprehensively capture the semantics of every relation, leading to suboptimal results. Second, although employing binary classification for each relation individually can mitigate this issue, it introduces significant computational overhead, resulting in impractical time complexity for real-world applications. Therefore, this paper proposes a framework called M-BRe to extract training instances from unlabeled texts for RE. It utilizes three modules to combine the advantages of both of the above classification approaches: Relation Grouping, Relation Extraction, and Label Decision. Extensive experiments confirm its superior capability in discovering high-quality training samples from unlabeled texts for RE.
pdf
bib
abs
R-TOFU: Unlearning in Large Reasoning Models
Sangyeon Yoon
|
Wonje Jeung
|
Albert No
Large Reasoning Models (LRMs) embed private or copyrighted information not only in their final answers but also throughout multi-step chain-of-thought (CoT) traces, making reliable unlearning far more demanding than in standard LLMs. We introduce Reasoning-TOFU (R-TOFU), the first benchmark tailored to this setting. R-TOFU augments existing unlearning tasks with realistic CoT annotations and provides step-wise metrics that expose residual knowledge invisible to answer-level checks. Using R-TOFU, we carry out a comprehensive comparison of gradient-based and preference-optimization baselines and show that conventional answer-only objectives leave substantial forget traces in reasoning. We further propose Reasoned IDK, a preference-optimization variant that preserves coherent yet inconclusive reasoning, achieving a stronger balance between forgetting efficacy and model utility than earlier refusal styles. Finally, we identify a failure mode: decoding variants such as ZeroThink and LessThink can still reveal forgotten content despite seemingly successful unlearning, emphasizing the need to evaluate models under diverse decoding settings. Together, the benchmark, analysis, and new baseline establish a systematic foundation for studying and improving unlearning in LRMs while preserving their reasoning capabilities.
pdf
bib
abs
Chat-Driven Text Generation and Interaction for Person Retrieval
Zequn Xie
|
Chuxin Wang
|
Yeqiang Wang
|
Sihang Cai
|
Shulei Wang
|
Tao Jin
Text-based person search (TBPS) enables the retrieval of person images from large-scale databases using natural language descriptions, offering critical value in surveillance applications. However, a major challenge lies in the labor-intensive process of obtaining high-quality textual annotations, which limits scalability and practical deployment. To address this, we introduce two complementary modules: Multi-Turn Text Generation (MTG) and Multi-Turn Text Interaction (MTI). MTG generates rich pseudo-labels through simulated dialogues with MLLMs, producing fine-grained and diverse visual descriptions without manual supervision. MTI refines user queries at inference time through dynamic, dialogue-based reasoning, enabling the system to interpret and resolve vague, incomplete, or ambiguous descriptions—characteristics often seen in real-world search scenarios. Together, MTG and MTI form a unified and annotation-free framework that significantly improves retrieval accuracy, robustness, and usability. Extensive evaluations demonstrate that our method achieves competitive or superior results while eliminating the need for manual captions, paving the way for scalable and practical deployment of TBPS systems.
pdf
bib
abs
Spontaneous Giving and Calculated Greed in Language Models
Yuxuan Li
|
Hirokazu Shirado
Large language models demonstrate strong problem-solving abilities through reasoning techniques such as chain-of-thought prompting and reflection. However, it remains unclear whether these reasoning capabilities extend to a form of social intelligence: making effective decisions in cooperative contexts. We examine this question using economic games that simulate social dilemmas. First, we apply chain-of-thought and reflection prompting to GPT-4o in a Public Goods Game. We then evaluate multiple off-the-shelf models across six cooperation and punishment games, comparing those with and without explicit reasoning mechanisms. We find that reasoning models consistently reduce cooperation and norm enforcement, favoring individual rationality. In repeated interactions, groups with more reasoning agents exhibit lower collective gains. These behaviors mirror human patterns of “spontaneous giving and calculated greed.” Our findings underscore the need for LLM architectures that incorporate social intelligence alongside reasoning, to help address—rather than reinforce—the challenges of collective action.
pdf
bib
abs
SenDetEX: Sentence-Level AI-Generated Text Detection for Human-AI Hybrid Content via Style and Context Fusion
Lei Jiang
|
Desheng Wu
|
Xiaolong Zheng
Text generated by Large Language Models (LLMs) now rivals human writing, raising concerns about its misuse. However, mainstream AI-generated text detection (AGTD) methods primarily target document-level long texts and struggle to generalize effectively to sentence-level short texts. Moreover, current sentence-level AGTD (S-AGTD) research faces two significant limitations: (1) lack of a comprehensive evaluation on complex human-AI hybrid content, where human-written text (HWT) and AI-generated text (AGT) alternate irregularly, and (2) failure to incorporate contextual information, which serves as a crucial supplementary feature for identifying the origin of the detected sentence. Therefore, in our work, we propose AutoFill-Refine, a high-quality synthesis strategy for human-AI hybrid texts, and then construct a dedicated S-AGTD benchmark dataset. Besides, we introduce SenDetEX, a novel framework for sentence-level AI-generated text detection via style and context fusion. Extensive experiments demonstrate that SenDetEX significantly outperforms all baseline models in detection accuracy, while exhibiting remarkable transferability and robustness. Source code is available at https://github.com/TristoneJiang/SenDetEX.
pdf
bib
abs
Judge and Improve: Towards a Better Reasoning of Knowledge Graphs with Large Language Models
Mo Zhiqiang
|
Yang Hua
|
Jiahui Li
|
Yuan Liu
|
Shawn Wong
|
Jianmin Huang
Graph Neural Networks (GNNs) have shown immense potential in improving the performance of large-scale models by effectively incorporating structured relational information. However, current approaches face two key challenges: (1) achieving robust semantic alignment between graph representations and large models, and (2) ensuring interpretability in the generated outputs. To address these challenges, we propose ExGLM (Explainable Graph Language Model), a novel training framework designed to seamlessly integrate graph and language modalities while enhancing transparency. Our framework introduces two core components: (1) a graph-language synergistic alignment module, which aligns graph structures with language model to ensure semantic consistency across modalities; and (2) a judge-and-improve paradigm, which allows the language model to iteratively evaluate, refine, and prioritize responses with higher interpretability, thereby improving both performance and transparency. Extensive experiments conducted on three benchmark datasets—ogbn-arxiv, Cora, and PubMed—demonstrate that ExGLM not only surpasses existing methods in efficiency but also generates outputs that are significantly more interpretable, effectively addressing the primary limitations of current approaches.
pdf
bib
abs
Add-One-In: Incremental Sample Selection for Large Language Models via a Choice-Based Greedy Paradigm
Zhuo Li
|
Yuhao Du
|
Xiaoqi Jiao
|
Steven Y. Guo
|
Yuege Feng
|
Xiang Wan
|
Anningzhe Gao
|
Jinpeng Hu
Selecting high-quality and diverse training samples from extensive datasets plays a crucial role in reducing training overhead and enhancing the performance of Large Language Models (LLMs). However, existing studies fall short in assessing the overall value of selected data, focusing primarily on individual quality, and struggle to strike an effective balance between ensuring diversity and minimizing data point traversals. Therefore, this paper introduces a novel choice-based sample selection framework that shifts the focus from evaluating individual sample quality to comparing the contribution value of different samples when incorporated into the subset. Thanks to the advanced language understanding capabilities of LLMs, we utilize LLMs to evaluate the value of each option during the selection process. Furthermore, we design a greedy sampling process where samples are incrementally added to the subset, thereby improving efficiency by eliminating the need for exhaustive traversal of the entire dataset with the limited budget. Extensive experiments demonstrate that selected data from our method not only surpasses the performance of the full dataset but also achieves competitive results with recent powerful studies, while requiring fewer selections. Moreover, we validate our approach on a larger medical dataset, highlighting its practical applicability in real-world applications.
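To illustrate the choice-based greedy selection idea at a high level (placeholder names only, not the paper's prompts or scoring), the sketch below incrementally grows a subset by asking a judge, at each step, which candidate from a small slate would add the most value, so the full pool is never exhaustively traversed.

```python
import random

def greedy_choice_selection(pool, choose_best, budget, candidates_per_step=4):
    """Incrementally build a subset by comparing options rather than scoring all data.

    `choose_best(subset, options)` is a placeholder for an LLM judge that returns
    whichever option contributes most when added to `subset`. Only a small random
    slate is shown per step, keeping the number of comparisons bounded.
    """
    pool = list(pool)
    subset = []
    while pool and len(subset) < budget:
        slate = random.sample(pool, min(candidates_per_step, len(pool)))
        pick = choose_best(subset, slate)
        subset.append(pick)
        pool.remove(pick)
    return subset

# Toy usage: the judge prefers longer strings (a stand-in for an LLM's value judgment).
data = ["a", "bbbb", "cc", "ddddd", "eee", "f"]
pick_longest = lambda subset, options: max(options, key=len)
print(greedy_choice_selection(data, pick_longest, budget=3))  # three longer strings, slate-dependent
```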
pdf
bib
abs
QuZO: Quantized Zeroth-Order Fine-Tuning for Large Language Models
Jiajun Zhou
|
Yifan Yang
|
Kai Zhen
|
Ziyue Liu
|
Yequan Zhao
|
Ershad Banijamali
|
Athanasios Mouchtaris
|
Ngai Wong
|
Zheng Zhang
Large Language Models (LLMs) are often quantized to lower precision to reduce the memory cost and latency in inference. However, quantization often degrades model performance, thus fine-tuning is required for various downstream tasks. Traditional fine-tuning methods such as stochastic gradient descent and Adam optimization require backpropagation, which is error-prone in low-precision settings. To overcome these limitations, we propose the Quantized Zeroth-Order (QuZO) framework, specifically designed for fine-tuning LLMs through low-precision (e.g., 4- or 8-bit) forward passes. Our method avoids the low-precision straight-through estimator, which requires backward computation, and instead utilizes optimized stochastic rounding to mitigate increased bias. QuZO simplifies the training process, while achieving results comparable to first-order methods in FP8 and superior accuracy in INT8 and INT4 training. Experiments demonstrate that QuZO achieves competitive performance on classification, multi-choice, and generation tasks under low-bit training, including zero-shot reasoning tasks. Notably, QuZO incurs minimal overhead and reduces memory consumption by 2.94×–5.47× compared to quantized first-order methods during LLaMA-7B fine-tuning.
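For readers unfamiliar with zeroth-order optimization, the sketch below shows the basic two-sided perturbation estimator that backpropagation-free fine-tuning of this kind builds on; the quantization and stochastic-rounding details described in the abstract are omitted, and the names are illustrative rather than the QuZO implementation.

```python
import numpy as np

def zeroth_order_step(params: np.ndarray,
                      loss_fn,            # loss_fn(params) -> float, forward pass only
                      lr: float = 1e-3,
                      eps: float = 1e-3,
                      rng=np.random.default_rng(0)) -> np.ndarray:
    """One SPSA-style update: estimate the gradient from two forward passes.

    grad_hat = (L(w + eps*z) - L(w - eps*z)) / (2*eps) * z with z ~ N(0, I).
    No backward pass is needed, which is what makes low-precision training
    attractive in this setting.
    """
    z = rng.standard_normal(params.shape)
    g_scale = (loss_fn(params + eps * z) - loss_fn(params - eps * z)) / (2 * eps)
    return params - lr * g_scale * z

# Toy usage: minimize a quadratic using forward passes only.
w = np.array([3.0, -2.0])
quad = lambda p: float(np.sum(p ** 2))
for _ in range(200):
    w = zeroth_order_step(w, quad, lr=0.05)
print(w)  # close to [0, 0]
```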
pdf
bib
abs
Cost-Optimal Grouped-Query Attention for Long-Context Modeling
Yingfa Chen
|
Yutong Wu
|
Chenyang Song
|
Zhen Leng Thai
|
Xingyu Shen
|
Xu Han
|
Zhiyuan Liu
|
Maosong Sun
Grouped-Query Attention (GQA) is a widely adopted strategy for reducing the computational cost of attention layers in large language models (LLMs). However, current GQA configurations are often suboptimal because they overlook how context length influences inference cost. Since inference cost grows with context length, the most cost-efficient GQA configuration should vary accordingly. In this work, we analyze the relationship among context length, model size, GQA configuration, and model loss, and introduce two innovations: (1) we decouple the total head size from the hidden size, enabling more flexible control over attention FLOPs; and (2) we jointly optimize the model size and the GQA configuration to arrive at a better allocation of inference resources between attention layers and other components. Our analysis reveals that commonly used GQA configurations are highly suboptimal for long-context scenarios. Moreover, we propose a recipe for deriving cost-optimal GQA configurations. Our results show that for long-context scenarios, one should use fewer attention heads while scaling up the model size. Configurations selected by our recipe can reduce both memory usage and FLOPs by more than 50% compared to Llama-3’s GQA, with *no degradation in model capabilities*. Our findings offer valuable insights for designing efficient long-context LLMs.
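For readers unfamiliar with GQA itself, here is a minimal single-sequence sketch (illustrative, not the paper's code) of how several query heads share one key/value head; this sharing ratio is the configuration knob that the cost analysis above tunes jointly with model size.

```python
import numpy as np

def grouped_query_attention(x, wq, wk, wv, n_q_heads, n_kv_heads):
    """Toy single-sequence GQA forward pass (no masking, no output projection).

    n_q_heads must be a multiple of n_kv_heads; each group of
    n_q_heads // n_kv_heads query heads attends over the same K/V head,
    which shrinks the KV cache by the same factor.
    """
    seq, d_model = x.shape
    d_head = wq.shape[1] // n_q_heads
    q = (x @ wq).reshape(seq, n_q_heads, d_head)
    k = (x @ wk).reshape(seq, n_kv_heads, d_head)
    v = (x @ wv).reshape(seq, n_kv_heads, d_head)
    group = n_q_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)          # share each KV head across its query group
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("hqk,khd->qhd", weights, v).reshape(seq, -1)

# Toy usage: 8 query heads sharing 2 KV heads.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 32))
wq = rng.standard_normal((32, 32))
wk = rng.standard_normal((32, 8))
wv = rng.standard_normal((32, 8))
print(grouped_query_attention(x, wq, wk, wv, n_q_heads=8, n_kv_heads=2).shape)  # (5, 32)
```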
pdf
bib
abs
ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model
Zhongyi Zhou
|
Yichen Zhu
|
Minjie Zhu
|
Junjie Wen
|
Ning Liu
|
Zhiyuan Xu
|
Weibin Meng
|
Yaxin Peng
|
Chaomin Shen
|
Feifei Feng
|
Yi Xu
Humans possess a unified cognitive ability to perceive, comprehend, and interact with the physical world. Why can’t large language models replicate this holistic understanding? Through a systematic analysis of existing training paradigms in vision-language-action models (VLA), we identify two key challenges: spurious forgetting, where robot training overwrites crucial visual-text alignments, and task interference, where competing control and understanding tasks degrade performance when trained jointly. To overcome these limitations, we propose ChatVLA, a novel framework featuring Phased Alignment Training, which incrementally integrates multimodal data after initial control mastery, and a Mixture-of-Experts architecture to minimize task interference. ChatVLA demonstrates competitive performance on visual question-answering datasets and significantly surpasses state-of-the-art vision-language-action (VLA) methods on multimodal understanding benchmarks. Notably, it achieves six times higher performance on MMMU and scores 47.2% on MMStar with a more parameter-efficient design than ECoT. Furthermore, ChatVLA demonstrates superior performance on 25 real-world robot manipulation tasks compared to existing VLA methods like OpenVLA. Our findings highlight the potential of our unified framework for achieving both robust multimodal understanding and effective robot control.
pdf
bib
abs
KG-RAG: Enhancing GUI Agent Decision-Making via Knowledge Graph-Driven Retrieval-Augmented Generation
Ziyi Guan
|
Jason Chun Lok Li
|
Zhijian Hou
|
Pingping Zhang
|
Donglai Xu
|
Yuzhi Zhao
|
Mengyang Wu
|
Jinpeng Chen
|
Thanh-Toan Nguyen
|
Pengfei Xian
|
Wenao Ma
|
Shengchao Qin
|
Graziano Chesi
|
Ngai Wong
Despite recent progress, Graphic User Interface (GUI) agents powered by Large Language Models (LLMs) struggle with complex mobile tasks due to limited app-specific knowledge. While UI Transition Graphs (UTGs) offer structured navigation representations, they are underutilized due to poor extraction and inefficient integration. We introduce KG-RAG, a Knowledge Graph-driven Retrieval-Augmented Generation framework that transforms fragmented UTGs into structured vector databases for efficient real-time retrieval. By leveraging an intent-guided LLM search method, KG-RAG generates actionable navigation paths, enhancing agent decision-making. Experiments across diverse mobile apps show that KG-RAG outperforms existing methods, achieving a 75.8% success rate (8.9% improvement over AutoDroid), 84.6% decision accuracy (8.1% improvement), and reducing average task steps from 4.5 to 4.1. Additionally, we present KG-Android-Bench and KG-Harmony-Bench, two benchmarks tailored to the Chinese mobile ecosystem for future research. Finally, KG-RAG transfers to web/desktop (+40% SR on Weibo-web; +20% on QQ Music-desktop), and a UTG cost ablation shows accuracy saturates at ~4h per complex app, enabling practical deployment trade-offs.
pdf
bib
abs
CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling
Jihai Zhang
|
Xiaoye Qu
|
Tong Zhu
|
Yu Cheng
Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in multimodal intelligence. However, recent studies discovered that CLIP can only encode one aspect of the feature space, leading to substantial information loss and indistinctive features. To mitigate this issue, this paper introduces a novel strategy that fine-tunes a series of complementary CLIP models and transforms them into a CLIP-MoE. Specifically, we propose a model-agnostic Diversified Multiplet Upcycling (DMU) framework for CLIP. Instead of training multiple CLIP models from scratch, DMU leverages a pre-trained CLIP and fine-tunes it into a diverse set with highly cost-effective multistage contrastive learning, thus capturing distinct feature subspaces efficiently. To fully exploit these fine-tuned models while minimizing computational overhead, we transform them into a CLIP-MoE, which dynamically activates a subset of CLIP experts, achieving an effective balance between model capacity and computational cost. Comprehensive experiments demonstrate the superior performance of CLIP-MoE across various zero-shot retrieval, zero-shot image classification tasks, and downstream Multimodal Large Language Model (MLLM) benchmarks when used as a vision encoder. Code is available at https://github.com/OpenSparseLLMs/CLIP-MoE.
pdf
bib
abs
Search-o1: Agentic Search-Enhanced Large Reasoning Models
Xiaoxi Li
|
Guanting Dong
|
Jiajie Jin
|
Yuyao Zhang
|
Yujia Zhou
|
Yutao Zhu
|
Peitian Zhang
|
Zhicheng Dou
Large reasoning models (LRMs) like OpenAI-o1 have demonstrated impressive long stepwise reasoning capabilities through large-scale reinforcement learning. However, their extended reasoning processes often suffer from knowledge insufficiency, leading to frequent uncertainties and potential errors. To address this limitation, we introduce **Search-o1**, a framework that enhances LRMs with an agentic retrieval-augmented generation (RAG) mechanism and a Reason-in-Documents module for refining retrieved documents. Search-o1 integrates an agentic search workflow into the reasoning process, enabling dynamic retrieval of external knowledge when LRMs encounter uncertain knowledge points. Additionally, due to the verbose nature of retrieved documents, we design a separate Reason-in-Documents module to deeply analyze the retrieved information before injecting it into the reasoning chain, minimizing noise and preserving coherent reasoning flow. Extensive experiments on complex reasoning tasks in science, mathematics, and coding, as well as six open-domain QA benchmarks, demonstrate the strong performance of Search-o1. This approach enhances the trustworthiness of LRMs in complex reasoning tasks, paving the way for advanced deep research systems. The code is available at https://github.com/RUC-NLPIR/Search-o1.
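A highly simplified, hypothetical control loop for this kind of agentic, search-augmented reasoning is sketched below; the generator, retriever, summarizer, and trigger token are placeholders for illustration, not the released implementation.

```python
def agentic_reasoning(question, generate, retrieve, summarize, max_rounds=5):
    """Sketch of reasoning that interleaves retrieval on demand.

    `generate(context)` returns the next reasoning segment and may embed a
    search request such as "<search>query</search>"; `retrieve(query)` returns
    raw documents; `summarize(docs, query)` condenses them before they are
    appended to the reasoning context (the document-refinement idea).
    All three callables are stand-ins supplied by the caller.
    """
    context = f"Question: {question}\n"
    for _ in range(max_rounds):
        segment = generate(context)
        context += segment
        if "<search>" not in segment:
            break  # the model answered without requesting more evidence
        query = segment.split("<search>")[1].split("</search>")[0]
        evidence = summarize(retrieve(query), query)
        context += f"\n[Retrieved knowledge] {evidence}\n"
    return context
```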
pdf
bib
abs
From Personas to Talks: Revisiting the Impact of Personas on LLM-Synthesized Emotional Support Conversations
Shenghan Wu
|
Yimo Zhu
|
Wynne Hsu
|
Mong-Li Lee
|
Yang Deng
The rapid advancement of Large Language Models (LLMs) has revolutionized the generation of emotional support conversations (ESC), offering scalable solutions with reduced costs and enhanced data privacy. This paper explores the role of personas in the creation of ESC by LLMs. Our research utilizes established psychological frameworks to measure and infuse persona traits into LLMs, which then generate dialogues in the emotional support scenario. We conduct extensive evaluations to understand the stability of persona traits in dialogues, examining shifts in traits post-generation and their impact on dialogue quality and strategy distribution. Experimental results reveal several notable findings: 1) LLMs can infer core persona traits, 2) subtle shifts in emotionality and extraversion occur, influencing the dialogue dynamics, and 3) the application of persona traits modifies the distribution of emotional support strategies, enhancing the relevance and empathetic quality of the responses. These findings highlight the potential of persona-driven LLMs in crafting more personalized, empathetic, and effective emotional support dialogues, which has significant implications for the future design of AI-driven emotional support systems.
pdf
bib
abs
Select-Then-Decompose: From Empirical Analysis to Adaptive Selection Strategy for Task Decomposition in Large Language Models
Shuodi Liu
|
Yingzhuo Liu
|
Zi Wang
|
Yusheng Wang
|
Huijia Wu
|
Liuyu Xiang
|
Zhaofeng He
Large language models (LLMs) have demonstrated remarkable reasoning and planning capabilities, driving extensive research into task decomposition. Existing task decomposition methods focus primarily on memory, tool usage, and feedback mechanisms, achieving notable success in specific domains, but they often overlook the trade-off between performance and cost. In this study, we first conduct a comprehensive investigation on task decomposition, identifying six categorization schemes. Then, we perform an empirical analysis of three factors that influence the performance and cost of task decomposition: categories of approaches, characteristics of tasks, and configuration of decomposition and execution models, uncovering three critical insights and summarizing a set of practical principles. Building on this analysis, we propose the Select-Then-Decompose strategy, which establishes a closed-loop problem-solving process composed of three stages: selection, execution, and verification. This strategy dynamically selects the most suitable decomposition approach based on task characteristics and enhances the reliability of the results through a verification module. Comprehensive evaluations across multiple benchmarks show that Select-Then-Decompose consistently lies on the Pareto frontier, demonstrating an optimal balance between performance and cost. Our code is publicly available at https://github.com/summervvind/Select-Then-Decompose.
pdf
bib
abs
TombRaider: Entering the Vault of History to Jailbreak Large Language Models
Junchen Ding
|
Jiahao Zhang
|
Yi Liu
|
Ziqi Ding
|
Gelei Deng
|
Yuekang Li
**Warning: This paper contains content that may involve potentially harmful behaviours, discussed strictly for research purposes.** Jailbreak attacks can hinder the safety of Large Language Model (LLM) applications, especially chatbots. Studying jailbreak techniques is an important AI red teaming task for improving the safety of these applications. In this paper, we introduce TombRaider, a novel jailbreak technique that exploits the ability to store, retrieve, and use historical knowledge of LLMs. TombRaider employs two agents, the inspector agent to extract relevant historical information and the attacker agent to generate adversarial prompts, enabling effective bypassing of safety filters. We intensively evaluated TombRaider on six popular models. Experimental results showed that TombRaider could outperform state-of-the-art jailbreak techniques, achieving nearly 100% attack success rates (ASRs) on bare models and maintaining over 55.4% ASR against defence mechanisms. Our findings highlight critical vulnerabilities in existing LLM safeguards, underscoring the need for more robust safety defences.
pdf
bib
abs
Text Meets Topology: Rethinking Out-of-distribution Detection in Text-Rich Networks
Danny Wang
|
Ruihong Qiu
|
Guangdong Bai
|
Zi Huang
Out-of-distribution (OOD) detection remains challenging in text-rich networks, where textual features intertwine with topological structures. Existing methods primarily address label shifts or rudimentary domain-based splits, overlooking the intricate textual-structural diversity. For example, in social networks, where users represent nodes with textual features (name, bio) while edges indicate friendship status, OOD may stem from the distinct language patterns between bot and normal users. To address this gap, we introduce the TextTopoOOD framework for evaluating detection across diverse OOD scenarios: (1) attribute-level shifts via text augmentations and embedding perturbations; (2) structural shifts through edge rewiring and semantic connections; (3) thematically-guided label shifts; and (4) domain-based divisions. Furthermore, we propose TNT-OOD to model the complex interplay between Text aNd Topology using: 1) a novel cross-attention module to fuse local structure into node-level text representations, and 2) a HyperNetwork to generate node-specific transformation parameters. This aligns topological and semantic features of ID nodes, enhancing ID/OOD distinction across structural and textual shifts. Experiments on 11 datasets across four OOD scenarios demonstrate the nuanced challenge of TextTopoOOD for evaluating OOD detection in text-rich networks.
pdf
bib
abs
APLOT: Robust Reward Modeling via Adaptive Preference Learning with Optimal Transport
Zhuo Li
|
Yuege Feng
|
Dandan Guo
|
Jinpeng Hu
|
Anningzhe Gao
|
Xiang Wan
The reward model (RM) plays a crucial role in aligning Large Language Models (LLMs) with human preferences through Reinforcement Learning, where the Bradley-Terry (BT) objective has been recognized as simple yet powerful, specifically for pairwise preference learning. However, BT-based RMs often struggle to effectively distinguish between similar preference responses, leading to insufficient separation between preferred and non-preferred outputs. Consequently, they may easily overfit easy samples and cannot generalize well to Out-Of-Distribution (OOD) samples, resulting in suboptimal performance. To address these challenges, this paper introduces an effective enhancement to BT-based RMs through an adaptive margin mechanism. Specifically, we dynamically adjust the RM's focus toward more challenging samples through margins based on both semantic similarity and model-predicted reward differences, approaching the problem from a distributional perspective that is solvable with Optimal Transport (OT). By incorporating these factors into a principled OT cost matrix design, our adaptive margin enables the RM to better capture distributional differences between chosen and rejected responses, yielding significant improvements in performance, convergence speed, and generalization capabilities. Experimental results across multiple benchmarks demonstrate that our method outperforms several existing RM techniques, showcasing enhanced performance in both In-Distribution (ID) and OOD settings. Moreover, RLHF experiments confirm the practical effectiveness of our method in better aligning LLMs with human preferences.
pdf
bib
abs
HS-STaR: Hierarchical Sampling for Self-Taught Reasoners via Difficulty Estimation and Budget Reallocation
Feng Xiong
|
Hongling Xu
|
Yifei Wang
|
Runxi Cheng
|
Yong Wang
|
Xiangxiang Chu
Self-taught reasoners (STaRs) enhance the mathematical reasoning abilities of large language models (LLMs) by leveraging self-generated responses for self-training. Recent studies have incorporated reward models to guide response selection or decoding, aiming to obtain higher-quality data. However, they typically allocate a uniform sampling budget across all problems, overlooking the varying utility of problems at different difficulty levels. In this work, we conduct an empirical study and find that problems near the boundary of the LLM’s reasoning capability offer significantly greater learning utility than both easy and overly difficult ones. To identify and exploit such problems, we propose HS-STaR, a Hierarchical Sampling framework for Self-Taught Reasoners. Given a fixed sampling budget, HS-STaR first performs lightweight pre-sampling with a reward-guided difficulty estimation strategy to efficiently identify boundary-level problems. Subsequently, it dynamically reallocates the remaining budget toward these high-utility problems during a re-sampling phase, maximizing the generation of valuable training data. Extensive experiments across multiple reasoning benchmarks and backbone LLMs demonstrate that HS-STaR significantly outperforms other baselines without requiring additional sampling budget.
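A minimal sketch of the two-phase budget idea described above (pre-sample to estimate difficulty, then spend the rest on boundary-level problems); the sampler, scorer, and thresholds are illustrative assumptions rather than the paper's exact procedure.

```python
def hierarchical_sampling(problems, sample, score, budget, pre_k=2):
    """Spend a fixed sampling budget where it is most useful (illustrative).

    Phase 1: draw `pre_k` responses per problem and use the mean score as a
    rough difficulty estimate. Phase 2: reallocate the remaining budget to
    problems whose estimated pass rate is neither 0 nor 1 (the boundary of
    the model's ability). `sample(problem)` and `score(problem, response)`
    are placeholders for the generator and the reward/verifier.
    """
    samples = {p: [sample(p) for _ in range(pre_k)] for p in problems}
    budget -= pre_k * len(problems)
    pass_rate = {p: sum(score(p, r) for r in rs) / len(rs) for p, rs in samples.items()}
    boundary = [p for p in problems if 0.0 < pass_rate[p] < 1.0] or list(problems)
    per_problem = max(budget // len(boundary), 0)
    for p in boundary:
        samples[p].extend(sample(p) for _ in range(per_problem))
    return samples

# Toy usage: problems 3-5 look boundary-level to the fake scorer and get the extra samples.
problems = list(range(8))
toy_sample = lambda p: f"attempt-on-{p}"
toy_score = lambda p, r: 1.0 if p < 3 else (0.5 if p < 6 else 0.0)
out = hierarchical_sampling(problems, toy_sample, toy_score, budget=28)
print({p: len(rs) for p, rs in out.items()})  # {0: 2, 1: 2, 2: 2, 3: 6, 4: 6, 5: 6, 6: 2, 7: 2}
```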
pdf
bib
abs
SEPS: A Separability Measure for Robust Unlearning in LLMs
Wonje Jeung
|
Sangyeon Yoon
|
Albert No
Machine unlearning aims to selectively remove targeted knowledge from Large Language Models (LLMs), ensuring they forget specified content while retaining essential information. Existing unlearning metrics assess whether a model correctly answers retain queries and rejects forget queries, but they fail to capture real-world scenarios where forget queries rarely appear in isolation. In fact, forget and retain queries often coexist within the same prompt, making mixed-query evaluation crucial. We introduce SEPS, an evaluation framework that explicitly measures a model’s ability to both forget and retain information within a single prompt. Through extensive experiments across three benchmarks, we identify two key failure modes in existing unlearning methods: (1) untargeted unlearning indiscriminately erases both forget and retain content once a forget query appears, and (2) targeted unlearning overfits to single-query scenarios, leading to catastrophic failures when handling multiple queries. To address these issues, we propose Mixed Prompt (MP) unlearning, a strategy that integrates both forget and retain queries into a unified training objective. Our approach significantly improves unlearning effectiveness, demonstrating robustness even in complex settings with up to eight mixed forget and retain queries in a single prompt.
pdf
bib
abs
TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection
Zehong Yan
|
Peng Qi
|
Wynne Hsu
|
Mong-Li Lee
Multimodal misinformation, encompassing textual, visual, and cross-modal distortions, poses an increasing societal threat that is amplified by generative AI. Existing methods typically focus on a single type of distortion and struggle to generalize to unseen scenarios. In this work, we observe that different distortion types share common reasoning capabilities while also requiring task-specific skills. We hypothesize that joint training across distortion types facilitates knowledge sharing and enhances the model’s ability to generalize. To this end, we introduce TRUST-VL, a unified and explainable vision-language model for general multimodal misinformation detection. TRUST-VL incorporates a novel Question-Aware Visual Amplifier module, designed to extract task-specific visual features. To support training, we also construct TRUST-Instruct, a large-scale instruction dataset containing 198K samples featuring structured reasoning chains aligned with human fact-checking workflows. Extensive experiments on both in-domain and zero-shot benchmarks demonstrate that TRUST-VL achieves state-of-the-art performance, while also offering strong generalization and interpretability.
pdf
bib
abs
Tree-of-Quote Prompting Improves Factuality and Attribution in Multi-Hop and Medical Reasoning
Justin Xu
|
Yiming Li
|
Zizheng Zhang
|
Augustine Yui Hei Luk
|
Mayank Jobanputra
|
Samarth Oza
|
Ashley Murray
|
Meghana Reddy Kasula
|
Andrew Parker
|
David W Eyre
Large language models (LLMs) can produce fluent but factually incorrect outputs and often have limited ability to attribute their claims to source material. This undermines their reliability, particularly in multi-hop and high-stakes domains such as medicine. We propose Tree-of-Quote (ToQ) prompting, a structured framework that decomposes complex questions into subquestions, generates quotes to support each step without retrieval, and selectively advances reasoning based on quote quality. We also introduce FQ-Score, a unified metric that captures answer correctness, attribution fidelity, and reasoning quality. Experiments on StrategyQA, 2WikiMultiHopQA, MuSiQue, MoreHopQA, and MedQA demonstrate that ToQ improves factuality and attribution over standard prompting baselines. To validate FQ-Score as a proxy for human judgment, we conduct two reader studies with clinicians on medical questions, and observe strong correlations. Both clinician scores and FQ-Scores also indicate a preference for ToQ over baselines due to a combination of greater correctness, completeness, and logical flow. Our results suggest ToQ is a promising approach for building more trustworthy and auditable LLM systems.
pdf
bib
abs
UnitCoder: Scalable Code Synthesis from Pre-training Corpora
Yichuan Ma
|
Yunfan Shao
|
Peiji Li
|
Demin Song
|
Qipeng Guo
|
Linyang Li
|
Xipeng Qiu
|
Kai Chen
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. Despite the abundant sources of code data, constructing high-quality training datasets at scale poses a significant challenge. Pre-training code data typically suffers from inconsistent data quality issues. Conversely, instruction-based methods, which use a high-quality subset as seed samples, suffer from limited task diversity. In this paper, we introduce UnitCoder, which directly supervises pre-training data quality through automatically generated unit tests, while ensuring correctness via an iterative fix-and-refine flow. Code synthesized by UnitCoder benefits from both the diversity of pre-training corpora and the high quality ensured by unit test supervision. Our experiments demonstrate that models fine-tuned on our synthetic dataset exhibit consistent performance improvements. Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora, demonstrating the potential for producing diverse and high-quality post-training data at scale. All code and data will be released.
pdf
bib
abs
GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models
Jixiao Zhang
|
Chunsheng Zuo
Group Relative Policy Optimization (GRPO), which is widely adopted by R1-like reasoning models, has advanced mathematical reasoning. Nevertheless, GRPO faces challenges in reward sparsity, verbosity, and inadequate focus on problem difficulty. We propose GRPO-LEAD, enhancing GRPO with: (1) length-regularized rewards to encourage conciseness while maintaining accuracy; (2) explicit penalties for incorrect solutions to improve model precision; and (3) difficulty-aware advantage reweighting for robust generalization on challenging problems. Comprehensive evaluations demonstrate that GRPO-LEAD significantly improves reasoning accuracy, conciseness, and efficiency. Our approach achieves state-of-the-art performance for 14B-scale models, underscoring the synergy of our methods with appropriate model scale and high-quality data. Our source code, generated dataset, and models are available at https://github.com/aeroplanepaper/GRPO-LEAD.
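The sketch below illustrates, under assumed functional forms and parameter names, how the three ingredients named in this abstract could be combined: a length-regularized reward, an explicit penalty for incorrect solutions, and difficulty-aware advantage reweighting. It is a simplified stand-in, not the released GRPO-LEAD code.

```python
# Illustrative-only reward shaping and advantage reweighting; exact functional
# forms and hyperparameter names are assumptions.
import numpy as np

def shaped_reward(correct, length, mean_length, lam=0.1, wrong_penalty=1.0):
    base = 1.0 if correct else -wrong_penalty                         # penalize wrong answers
    length_term = -lam * max(length - mean_length, 0) / mean_length   # encourage conciseness
    return base + (length_term if correct else 0.0)

def difficulty_weighted_advantages(rewards, pass_rate, gamma=1.0):
    """Upweight group-relative advantages on harder problems (lower group pass rate)."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    return adv * (1.0 + gamma * (1.0 - pass_rate))
```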
pdf
bib
abs
Improving Low-Resource Sequence Labeling with Knowledge Fusion and Contextual Label Explanations
Peichao Lai
|
Jiaxin Gan
|
Feiyang Ye
|
Wentao Zhang
|
Fangcheng Fu
|
Yilei Wang
|
Bin Cui
Sequence labeling remains a significant challenge in low-resource, domain-specific scenarios, particularly for character-dense languages. Existing methods primarily focus on enhancing model comprehension and improving data diversity to boost performance. However, these approaches still struggle with inadequate model applicability and semantic distribution biases in domain-specific contexts. To overcome these limitations, we propose a novel framework that combines an LLM-based knowledge enhancement workflow with a span-based Knowledge Fusion for Rich and Efficient Extraction (KnowFREE) model. Our workflow employs explanation prompts to generate precise contextual interpretations of target entities, effectively mitigating semantic biases and enriching the model’s contextual understanding. The KnowFREE model further integrates extension label features, enabling efficient nested entity extraction without relying on external knowledge during inference. Experiments on multiple domain-specific sequence labeling datasets demonstrate that our approach achieves state-of-the-art performance, effectively addressing the challenges posed by low-resource settings.
pdf
bib
abs
Rethinking Cross-Subject Data Splitting for Brain-to-Text Decoding
Congchi Yin
|
Qian Yu
|
Zhiwei Fang
|
Changping Peng
|
Piji Li
Recent major milestones have successfully reconstructed natural language from non-invasive brain signals (e.g. functional Magnetic Resonance Imaging (fMRI) and Electroencephalogram (EEG)) across subjects. However, we find that current dataset splitting strategies for cross-subject brain-to-text decoding are wrong. Specifically, we first demonstrate that all current splitting methods suffer from a data leakage problem, i.e., the leakage of validation and test data into the training set, resulting in significant overfitting and overestimation of decoding models. In this study, we develop a correct cross-subject data splitting criterion without data leakage for decoding fMRI and EEG signals to text. Several SOTA brain-to-text decoding models are re-evaluated under the proposed criterion for further research.
pdf
bib
abs
RCScore: Quantifying Response Consistency in Large Language Models
Dongjun Jang
|
Youngchae Ahn
|
Hyopil Shin
Current LLM evaluations often rely on a single instruction template, overlooking models’ sensitivity to instruction style—a critical aspect for real-world deployments. We present RCScore, a multi-dimensional framework quantifying how instruction formulation affects model responses. By systematically transforming benchmark problems into multiple instruction styles, RCScore reveals performance variations undetected by conventional metrics. Our experiments across ten LLMs on four reasoning benchmarks demonstrate that instruction style can shift accuracy by up to 16.7 percentage points. We introduce Cross-Response Similarity (CRS), a method applying RCScore metrics to measure stylistic self-consistency, and establish its strong correlation with task accuracy, suggesting consistency as a valuable proxy for model reliability. Additional findings show that deterministic decoding produces more stylistically stable outputs, and model scale correlates positively with cross-style consistency. RCScore offers a principled approach to assess instruction robustness.
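A hedged sketch of the evaluation loop this abstract describes: each problem is rendered in several instruction styles, per-style accuracy is tracked, and stylistic self-consistency is measured as the mean pairwise similarity across styled responses. The restyle, answer, is_correct, and similarity helpers are assumed placeholders, and the computation is a simplification of the paper's RCScore and CRS definitions.

```python
# Simplified per-style accuracy plus a cross-response similarity estimate.
from itertools import combinations

def rcscore_eval(problems, styles, restyle, answer, is_correct, similarity):
    acc = {s: 0.0 for s in styles}
    crs_total = 0.0
    for prob in problems:
        responses = {s: answer(restyle(prob, s)) for s in styles}
        for s in styles:
            acc[s] += is_correct(prob, responses[s]) / len(problems)
        pairs = list(combinations(styles, 2))
        crs_total += sum(similarity(responses[a], responses[b]) for a, b in pairs) / len(pairs)
    return acc, crs_total / max(len(problems), 1)  # per-style accuracy, mean cross-response similarity
```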
pdf
bib
abs
A Multi-Agent Framework with Automated Decision Rule Optimization for Cross-Domain Misinformation Detection
Hui Li
|
Ante Wang
|
Kunquan Li
|
Zhihao Wang
|
Liang Zhang
|
Delai Qiu
|
Qingsong Liu
|
Jinsong Su
Misinformation spans various domains, but detection methods trained on specific domains often perform poorly when applied to others. With the rapid development of Large Language Models (LLMs), researchers have begun to utilize LLMs for cross-domain misinformation detection. However, existing LLM-based methods often fail to adequately analyze news in the target domain, limiting their detection capabilities. More importantly, these methods typically rely on manually designed decision rules, which are constrained by domain knowledge and expert experience, thus limiting the generalizability of decision rules to different domains. To address these issues, we propose a Multi-Agent Framework for cross-domain misinformation detection with Automated Decision Rule Optimization (MARO). Under this framework, we first employ multiple expert agents to analyze target-domain news. Subsequently, we introduce a question-reflection mechanism that guides expert agents to facilitate higher-quality analysis. Furthermore, we propose a decision rule optimization approach based on carefully designed cross-domain validation tasks to iteratively enhance decision rule effectiveness across domains. Experimental results and analysis on commonly used datasets demonstrate that MARO achieves significant improvements over existing methods.
pdf
bib
abs
OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain
Shuting Wang
|
Jiejun Tan
|
Zhicheng Dou
|
Ji-Rong Wen
Retrieval-augmented generation (RAG) has emerged as a key application of large language models (LLMs), especially in vertical domains where LLMs may lack domain-specific knowledge. This paper introduces OmniEval, an omnidirectional and automatic RAG benchmark for the financial domain, characterized by its multi-dimensional evaluation framework: First, we categorize RAG scenarios by five task classes and 16 financial topics, leading to a matrix-based structured assessment for RAG evaluation; Next, we leverage a multi-dimensional evaluation data generation method that integrates GPT-4-based automatic generation and human annotation approaches, achieving an 87.47% acceptance ratio in human evaluations of generated instances; Further, we utilize a multi-stage evaluation pipeline to assess both retrieval and generation performance, resulting in a comprehensive evaluation of the RAG pipeline. Finally, rule-based and LLM-based metrics are combined to build a multi-dimensional evaluation system, enhancing the reliability of assessments through fine-tuned LLM-based evaluators. Our omnidirectional evaluation experiments highlight the performance variations of RAG systems across diverse topics and tasks and reveal significant opportunities for RAG models to improve their capabilities in vertical domains. We open-source the code of our benchmark at https://github.com/RUC-NLPIR/OmniEval.
pdf
bib
abs
AQuilt: Weaving Logic and Self-Inspection into Low-Cost, High-Relevance Data Synthesis for Specialist LLMs
Xiaopeng Ke
|
Hexuan Deng
|
Xuebo Liu
|
Jun Rao
|
Zhenxi Song
|
Jun Yu
|
Min Zhang
Despite the impressive performance of large language models (LLMs) in general domains, they often underperform in specialized domains. Existing approaches typically rely on data synthesis methods and yield promising results by using unlabeled data to capture domain-specific features. However, these methods either incur high computational costs or suffer from performance limitations, while also demonstrating insufficient generalization across different tasks. To address these challenges, we propose AQuilt, a framework for constructing instruction-tuning data for any specialized domain from the corresponding unlabeled data, comprising Answer, Question, Unlabeled data, Inspection, Logic, and Task type. By incorporating logic and inspection, we encourage reasoning processes and self-inspection to enhance model performance. Moreover, customizable task instructions enable high-quality data generation for any task. As a result, we construct a dataset of 703K examples to train a powerful data synthesis model. Experiments show that AQuilt is comparable to DeepSeek-V3 while utilizing just 17% of its production cost. Further analysis demonstrates that our generated data exhibits higher relevance to downstream tasks. Source code, models, and scripts are available at https://github.com/Krueske/AQuilt.
pdf
bib
abs
MoSEs: Uncertainty-Aware AI-Generated Text Detection via Mixture of Stylistics Experts with Conditional Thresholds
Junxi Wu
|
Jinpeng Wang
|
Zheng Liu
|
Bin Chen
|
Dongjian Hu
|
Hao Wu
|
Shu-Tao Xia
The rapid advancement of large language models has intensified public concerns about their potential misuse. Therefore, it is important to build trustworthy AI-generated text detection systems. Existing methods neglect stylistic modeling and mostly rely on static thresholds, which greatly limits detection performance. In this paper, we propose the Mixture of Stylistic Experts (MoSEs) framework that enables stylistics-aware uncertainty quantification through conditional threshold estimation. MoSEs contains three core components, namely, the Stylistics Reference Repository (SRR), the Stylistics-Aware Router (SAR), and the Conditional Threshold Estimator (CTE). For an input text, the SAR activates the appropriate reference data in the SRR and provides them to the CTE. Subsequently, the CTE jointly models the linguistic statistical properties and semantic features to dynamically determine the optimal threshold. With a discrimination score, MoSEs yields prediction labels with the corresponding confidence level. Our framework achieves an average improvement of 11.34% in detection performance compared to baselines. More inspiringly, MoSEs shows an even larger improvement of 39.15% in the low-resource case. Our code is available at https://github.com/creator-xi/MoSEs.
pdf
bib
abs
Merger-as-a-Stealer: Stealing Targeted PII from Aligned LLMs with Model Merging
Lin Lu
|
Zhigang Zuo
|
Ziji Sheng
|
Pan Zhou
Model merging has emerged as a promising approach for updating large language models (LLMs) by integrating multiple domain-specific models into a cross-domain merged model. Despite its utility and plug-and-play nature, unmonitored mergers can introduce significant security vulnerabilities, such as backdoor attacks and model merging abuse. In this paper, we identify a novel and more realistic attack surface where a malicious merger can extract targeted personally identifiable information (PII) from an aligned model with model merging. Specifically, we propose Merger-as-a-Stealer, a two-stage framework to achieve this attack: First, the attacker fine-tunes a malicious model to force it to respond to any PII-related queries. The attacker then uploads this malicious model to the model merging conductor and obtains the merged model. Second, the attacker inputs direct PII-related queries to the merged model to extract targeted PII. Extensive experiments demonstrate that Merger-as-a-Stealer successfully executes attacks against various LLMs and model merging methods across diverse settings, highlighting the effectiveness of the proposed framework. Given that this attack enables character-level extraction for targeted PII without requiring any additional knowledge from the attacker, we stress the necessity for improved model alignment and more robust defense mechanisms to mitigate such threats.
pdf
bib
abs
Pragmatic Inference Chain (PIC) Improving LLMs’ Reasoning of Authentic Implicit Toxic Language
Xi Chen
|
Shuo Wang
The rapid development of large language models (LLMs) gives rise to ethical concerns about their performance, while opening new avenues for developing toxic language detection techniques. However, LLMs’ unethical output and their capability of detecting toxicity have primarily been tested on language data that do not demand complex meaning inference, such as the biased associations of ‘he’ with programmer and ‘she’ with household. Nowadays toxic language adopts a much more creative range of implicit forms, thanks to advanced censorship. In this study, we collect authentic toxic interactions that evade online censorship and that are verified by human annotators as inference-intensive. To evaluate and improve LLMs’ reasoning of the authentic implicit toxic language, we propose a new prompting method, Pragmatic Inference Chain (PIC), drawing on interdisciplinary findings from cognitive science and linguistics. The PIC prompting significantly improves the success rate of GPT-4o, Llama-3.1-70B-Instruct, DeepSeek-v2.5, and DeepSeek-v3 in identifying implicit toxic language, compared to five baseline prompts, such as CoT and rule-based baselines. In addition, it facilitates the models in producing more explicit and coherent reasoning processes, and hence can potentially be generalized to other inference-intensive tasks, e.g., understanding humour and metaphors.
pdf
bib
abs
Beyond Demonstrations: Dynamic Vector Construction from Latent Representations
Wang Cai
|
Hsiu-Yuan Huang
|
Zhixiang Wang
|
Yunfang Wu
In-Context derived Vector (ICV) methods extract task-relevant representations from large language models (LLMs) and reinject them during inference, achieving comparable performance to few-shot In-Context Learning (ICL) without repeated demonstration processing. However, existing ICV methods remain sensitive to ICL-specific factors, often use coarse or semantically fragmented representations as the source of the vector, and rely on heuristic-based injection positions, limiting their applicability. To address these issues, we propose Dynamic Vector (DyVec), which incorporates an Exhaustive Query Rotation (EQR) strategy to extract robust semantically aggregated latent representations by mitigating variance introduced by ICL. It then applies Dynamic Latent Segmentation and Injection to adaptively partition representations based on task complexity and leverages REINFORCE-based optimization to learn optimal injection positions for each segment. Experimental results show that DyVec outperforms few-shot ICL, LoRA, and prior ICV baselines. Further analysis highlights the effectiveness of dynamically segmenting and injecting semantically aggregated latent representations. DyVec provides a lightweight and data-efficient solution for inference-time task adaptation.
pdf
bib
abs
Detoxifying Large Language Models via the Diversity of Toxic Samples
Ying Zhao
|
Yuanzhao Guo
|
Xuemeng Weng
|
Yuan Tian
|
Wei Wang
|
Yi Chang
Eliminating toxicity from Large Language Models (LLMs) is crucial for ensuring user safety. However, current methods have limitations in the analysis and utilization of toxic samples, failing to fully harness their potential. Through comparative analysis of toxic and safe samples, we discover that toxic samples exhibit diversity and, within this diversity, there lies specificity. These findings suggest that leveraging these characteristics of toxic samples could enhance the performance of algorithms in detoxifying LLMs. To this end, we propose a novel diverse detoxification framework, DivDetox, which comprises two innovative components: a Multi-Category-Induced Personalized Sample Generation (MPSG) strategy and a Scaled Contrastive DPO (SC-DPO) approach. The former is designed to elicit a variety of personalized toxic responses from the LLM, while the latter is constructed to precisely and fully utilize these toxic responses. Experiments on benchmark datasets across different model scales and different detoxification tasks verify the effectiveness of our architecture.
pdf
bib
abs
LLM-Driven Implicit Target Augmentation and Fine-Grained Contextual Modeling for Zero-Shot and Few-Shot Stance Detection
Yanxu Ji
|
Jinzhong Ning
|
Yijia Zhang
|
Zhi Liu
|
Hongfei Lin
Stance detection aims to identify the attitude expressed in text towards a specific target. Recent studies on zero-shot and few-shot stance detection focus primarily on learning generalized representations from explicit targets. However, these methods often neglect implicit yet semantically important targets and fail to adaptively adjust the relative contributions of text and target in light of contextual dependencies. To overcome these limitations, we propose a novel two-stage framework: First, a data augmentation framework named Hierarchical Collaborative Target Augmentation (HCTA) employs Large Language Models (LLMs) to identify and annotate implicit targets via Chain-of-Thought (CoT) prompting and multi-LLM voting, significantly enriching training data with latent semantic relations. Second, we introduce DyMCA, a Dynamic Multi-level Context-aware Attention Network, integrating a joint text-target encoding and a content-aware mechanism to dynamically adjust text-target contributions based on context. Experiments on the benchmark dataset demonstrate that our approach achieves state-of-the-art results, confirming the effectiveness of implicit target augmentation and fine-grained contextual modeling.
pdf
bib
abs
Dial-In LLM: Human-Aligned LLM-in-the-loop Intent Clustering for Customer Service Dialogues
Mengze Hong
|
Wailing Ng
|
Chen Jason Zhang
|
Yuanfeng Song
|
Di Jiang
Discovering customer intentions is crucial for automated service agents, yet existing intent clustering methods often fall short due to their reliance on embedding distance metrics and neglect of underlying semantic structures. To address these limitations, we propose an **LLM-in-the-loop (LLM-ITL)** intent clustering framework, integrating the language understanding capabilities of LLMs into conventional clustering algorithms. Specifically, this paper (1) examines the effectiveness of fine-tuned LLMs in semantic coherence evaluation and intent cluster naming, achieving over 95% accuracy aligned with human judgments; (2) designs an LLM-ITL framework that facilitates the iterative discovery of coherent intent clusters and the optimal number of clusters; and (3) introduces context-aware techniques tailored for customer service dialogue. Since existing English benchmarks lack sufficient semantic diversity and intent coverage, we further present a comprehensive Chinese dialogue intent dataset comprising over 100k real customer service calls with 1,507 human-annotated clusters. The proposed approaches significantly outperform LLM-guided baselines, achieving notable improvements in clustering quality, cost efficiency, and downstream applications. Combined with several best practices, our findings highlight the prominence of LLM-in-the-loop techniques for scalable dialogue data mining.
pdf
bib
abs
Superficial Self-Improved Reasoners Benefit from Model Merging
Xiangchi Yuan
|
Chunhui Zhang
|
Zheyuan Liu
|
Dachuan Shi
|
Leyan Pan
|
Soroush Vosoughi
|
Wenke Lee
Large Language Models (LLMs) rely heavily on large-scale reasoning data, but as such data becomes increasingly scarce, model self-improvement offers a promising alternative. However, this process can lead to model collapse, as the model’s output becomes overly deterministic with reduced diversity. In this work, we identify a new risk beyond model collapse, which we term the Superficial Self-Improved Reasoners phenomenon. This phenomenon indicates that while self-improvement enhances in-domain (ID) reasoning accuracy, it degrades the model’s generalized reasoning capability on out-of-domain (OOD) datasets, as the model tends to memorize the training data. Our analyses of layer importance and parameter changes reveal that reasoning-critical layers receive fewer updates compared to less relevant layers during self-improvement. To address this, we propose Iterative Model Merging (IMM), which balances reasoning improvements and generalization by merging the weights of the original and self-improved models. IMM effectively mitigates model collapse and improves generalized reasoning capability. Code is available at https://github.com/xiangchi-yuan/merge_syn
pdf
bib
abs
CARFT: Boosting LLM Reasoning via Contrastive Learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning
Wenqiao Zhu
|
Ji Liu
|
Rongjunchen Zhang
|
Haipang Wu
|
Yulun Zhang
Reasoning capability plays a critical role in the broad applications of Large Language Models (LLMs). To enhance the reasoning performance of LLMs, diverse Reinforcement Learning (RL)-based fine-tuning approaches have been proposed to address the limited generalization capability of LLMs trained solely via Supervised Fine-Tuning (SFT). Despite their effectiveness, two major limitations hinder the advancement of LLMs. First, vanilla RL-based approaches ignore annotated Chain-of-Thought (CoT) and incorporate unstable reasoning path sampling, which typically results in model collapse, an unstable training process, and suboptimal performance. Second, existing SFT approaches generally overemphasize the annotated CoT, potentially leading to performance degradation due to insufficient exploitation of potential CoT. In this paper, we propose a Contrastive learning with annotated CoT-based Reinforced Fine-Tuning approach, i.e., CARFT, to enhance the reasoning performance of LLMs while addressing the aforementioned limitations. Specifically, we propose learning a representation for each CoT. Based on this representation, we design novel contrastive signals to guide the fine-tuning process. Our approach not only fully exploits the available annotated CoT but also stabilizes the fine-tuning procedure by incorporating an additional unsupervised learning signal. We conduct comprehensive experiments and in-depth analysis with three baseline approaches, two foundation models, and two datasets to demonstrate significant advantages of CARFT in terms of robustness, performance (up to 10.15%), and efficiency (up to 30.62%).
pdf
bib
abs
QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation
Mengze Hong
|
Wailing Ng
|
Chen Jason Zhang
|
Di Jiang
The rapid advancement of Chinese LLMs underscores the need for vertical-domain evaluations to ensure reliable applications. However, existing benchmarks often lack domain coverage and provide limited insights into the Chinese working context. Leveraging qualification exams as a unified framework for expertise evaluation, we introduce QualBench, the first multi-domain Chinese QA benchmark dedicated to localized assessment of Chinese LLMs. The dataset includes over 17,000 questions across six vertical domains, drawn from 24 Chinese qualifications to align with national policies and professional standards. Results reveal an interesting pattern of Chinese LLMs consistently surpassing non-Chinese models, with the Qwen2.5 model outperforming the more advanced GPT-4o, emphasizing the value of localized domain knowledge in meeting qualification requirements. The average accuracy of 53.98% reveals the current gaps in domain coverage within model capabilities. Furthermore, we identify performance degradation caused by LLM crowdsourcing, assess data contamination, and illustrate the effectiveness of prompt engineering and model fine-tuning, suggesting opportunities for future improvements through multi-domain RAG and Federated Learning. Data and code are publicly available at https://github.com/mengze-hong/QualBench.
pdf
bib
abs
VideoEraser: Concept Erasure in Text-to-Video Diffusion Models
Naen Xu
|
Jinghuai Zhang
|
Changjiang Li
|
Zhi Chen
|
Chunyi Zhou
|
Qingming Li
|
Tianyu Du
|
Shouling Ji
The rapid growth of text-to-video (T2V) diffusion models has raised concerns about privacy, copyright, and safety due to their potential misuse in generating harmful or misleading content. These models are often trained on numerous datasets, including unauthorized personal identities, artistic creations, and harmful materials, which can lead to uncontrolled production and distribution of such content. To address this, we propose VideoEraser, a training-free framework that prevents T2V diffusion models from generating videos with undesirable concepts, even when explicitly prompted with those concepts. Designed as a plug-and-play module, VideoEraser can seamlessly integrate with representative T2V diffusion models via a two-stage process: Selective Prompt Embedding Adjustment (SPEA) and Adversarial-Resilient Noise Guidance (ARNG). We conduct extensive evaluations across four tasks, including object erasure, artistic style erasure, celebrity erasure, and explicit content erasure. Experimental results show that VideoEraser consistently outperforms prior methods regarding efficacy, integrity, fidelity, robustness, and generalizability. Notably, VideoEraser achieves state-of-the-art performance in suppressing undesirable content during T2V generation, reducing it by 46% on average across four tasks compared to baselines.
pdf
bib
abs
Diagram-Driven Course Questions Generation
Xinyu Zhang
|
Lingling Zhang
|
Yanrui Wu
|
Muye Huang
|
Wenjun Wu
|
Bo Li
|
Shaowei Wang
|
Basura Fernando
|
Jun Liu
Visual Question Generation (VQG) research focuses predominantly on natural images while neglecting the diagram, which is a critical component in educational materials. To meet the needs of pedagogical assessment, we propose the Diagram-Driven Course Questions Generation (DDCQG) task and construct DiagramQG, a comprehensive dataset with 15,720 diagrams and 25,798 questions across 37 subjects and 371 courses. Our approach employs course and input text constraints to generate course-relevant questions about specific diagram elements. We reveal three challenges of DDCQG: domain-specific knowledge requirements across courses, long-tail distribution in course coverage, and high information density in diagrams. To address these, we propose the Hierarchical Knowledge Integration framework (HKI-DDCQG), which utilizes trainable CLIP for identifying relevant diagram patches, leverages frozen vision-language models for knowledge extraction, and generates questions with trainable T5. Experiments demonstrate that HKI-DDCQG outperforms existing models on DiagramQG while maintaining strong generalizability across natural image datasets, establishing a strong baseline for DDCQG.
pdf
bib
abs
ECC: An Emotion-Cause Conversation Dataset for Empathy Response
Yuanyuan He
|
Yongsen Pan
|
Wei Li
|
Jiali You
|
Jiawen Deng
|
Fuji Ren
Empathetic dialogue systems require understanding emotions and their underlying causes. However, existing datasets mainly focus on emotion labels, while cause annotations are added post hoc through costly and subjective manual processes. This leads to three limitations: subjective bias in cause labels, weak rationality due to ambiguous cause-emotion relationships, and high annotation costs that hinder scalability. To address these challenges, we propose ECC (Emotion-Cause Conversation Dataset), a scalable dataset with 2.4K dialogues, which is also the first dialogue dataset where conversations and their emotion-cause labels are automatically generated synergistically during creation. We create an automatic extension framework, EC-DD, for ECC that utilizes knowledge and large language models (LLMs) to automatically generate conversations, and train a causality-aware empathetic response model, CAER, on this dataset. Experimental results show that ECC can achieve comparable or even superior performance to artificially constructed empathy dialogue datasets. Our code will be publicly released at https://github.com/Yuan-23/ECC.
pdf
bib
abs
ThoughtProbe: Classifier-Guided LLM Thought Space Exploration via Probing Representations
Zijian Wang
|
Chang Xu
This paper introduces ThoughtProbe, a novel inference-time framework that leverages the hidden reasoning features of Large Language Models (LLMs) to improve their reasoning performance. Unlike previous works that manipulate the hidden representations to steer LLM generation, we harness them as discriminative signals to guide the tree-structured response space exploration. In each node expansion, a classifier serves as a scoring and ranking mechanism that efficiently allocates computational resources by prioritizing higher score candidates for continuation. After completing the tree expansion, we collect answers from all branches to form a candidate answer pool. We then propose a branch-aggregation method that marginalizes over all supporting branches by aggregating their CoT scores, thereby identifying the optimal answer from the pool. Experimental results show that our framework’s comprehensive exploration not only covers valid reasoning chains but also effectively identifies them, achieving significant improvements across multiple arithmetic reasoning benchmarks.
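The sketch below gives a rough picture, under assumed interfaces, of the two stages this abstract describes: classifier-guided tree expansion, where candidate continuations are ranked by a probe over hidden representations, followed by branch aggregation, where scores of all branches supporting the same answer are summed. It is illustrative only, not the authors' code.

```python
# Classifier-guided expansion plus score-marginalizing answer selection (sketch).
from collections import defaultdict

def extract_answer(node):
    return node.splitlines()[-1]   # placeholder answer extraction

def thought_tree_search(prompt, expand, probe_score, width=3, depth=3):
    """expand(node) -> candidate continuations; probe_score(node) -> scalar from hidden states."""
    frontier = [(prompt, 0.0)]
    leaves = []
    for _ in range(depth):
        next_frontier = []
        for node, score in frontier:
            children = [(c, score + probe_score(c)) for c in expand(node)]
            children.sort(key=lambda x: x[1], reverse=True)
            next_frontier.extend(children[:width])   # keep only the most promising branches
        frontier = next_frontier
        leaves.extend(frontier)
    # Branch aggregation: marginalize scores over branches that yield the same answer.
    totals = defaultdict(float)
    for node, score in leaves:
        totals[extract_answer(node)] += score
    return max(totals, key=totals.get)
```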
pdf
bib
abs
JOLT-SQL: Joint Loss Tuning of Text-to-SQL with Confusion-aware Noisy Schema Sampling
Jinwang Song
|
Hongying Zan
|
Kunli Zhang
|
Lingling Mu
|
Yingjie Han
|
Haobo Hua
|
Min Peng
Text-to-SQL, which maps natural language to SQL queries, has benefited greatly from recent advances in Large Language Models (LLMs). While LLMs offer various paradigms for this task, including prompting and supervised fine-tuning (SFT), SFT approaches still face challenges such as complex multi-stage pipelines and poor robustness to noisy schema information. To address these limitations, we present JOLT-SQL, a streamlined single-stage SFT framework that jointly optimizes schema linking and SQL generation via a unified loss. JOLT-SQL employs discriminative schema linking, enhanced by local bidirectional attention, alongside a confusion-aware noisy schema sampling strategy with selective attention to improve robustness under noisy schema conditions. Experiments on the Spider and BIRD benchmarks demonstrate that JOLT-SQL achieves state-of-the-art execution accuracy among comparable-size open-source models, while significantly improving both training and inference efficiency.
pdf
bib
abs
DMDTEval: An Evaluation and Analysis of LLMs on Disambiguation in Multi-domain Translation
Zhibo Man
|
Yuanmeng Chen
|
Yujie Zhang
|
Jinan Xu
Currently, Large Language Models (LLMs) have achieved remarkable results in machine translation. However, their performance in multi-domain translation (MDT) is less satisfactory: the meanings of words can vary across different domains, highlighting the significant ambiguity inherent in MDT. Therefore, evaluating the disambiguation ability of LLMs in MDT remains an open problem. To this end, we present DMDTEval, an evaluation and analysis of LLMs on disambiguation in multi-domain translation, built on a systematic evaluation framework consisting of three aspects: (1) we construct a translation test set with multi-domain ambiguous word annotations, (2) we curate a diverse set of disambiguation prompt strategies, and (3) we design precise disambiguation metrics and study the efficacy of various prompt strategies on multiple state-of-the-art LLMs. We conduct comprehensive experiments across 4 language pairs and 13 domains; our extensive experiments reveal a number of crucial findings that we believe will pave the way for and facilitate further research in the critical area of improving the disambiguation of LLMs.
pdf
bib
abs
SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature
David Wadden
|
Kejian Shi
|
Jacob Morrison
|
Alan Li
|
Aakanksha Naik
|
Shruti Singh
|
Nitzan Barzilay
|
Kyle Lo
|
Tom Hope
|
Luca Soldaini
|
Shannon Zejiang Shen
|
Doug Downey
|
Hannaneh Hajishirzi
|
Arman Cohan
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following instances for training and evaluation, covering 54 tasks. These tasks span five core scientific literature understanding capabilities: information extraction, summarization, question answering, claim verification, and classification. SciRIFF is unique in being the only entirely expert-written, high-quality instruction-following dataset designed for extracting and synthesizing information from research literature across diverse scientific fields. It features complex instructions with long input contexts, detailed task descriptions, and structured outputs. To demonstrate its utility, we finetune a series of large language models (LLMs) using a mix of general domain and SciRIFF instructions. On nine out-of-distribution held-out tasks (referred to as SciRIFF-Eval), LLMs finetuned on SciRIFF achieve a 70.6% average improvement over our baselines trained only on general-domain instructions. SciRIFF facilitates the development and evaluation of LLMs to help researchers navigate the rapidly growing body of scientific literature.
pdf
bib
abs
MAKAR: a Multi-Agent framework based Knowledge-Augmented Reasoning for Grounded Multimodal Named Entity Recognition
Xinkui Lin
|
Yuhui Zhang
|
Yongxiu Xu
|
Kun Huang
|
Hongzhang Mu
|
Yubin Wang
|
Gaopeng Gou
|
Li Qian
|
Li Peng
|
Wei Liu
|
Jian Luan
|
Hongbo Xu
Grounded Multimodal Named Entity Recognition (GMNER), which aims to extract textual entities, their types, and corresponding visual regions from image-text data, has become a critical task in multimodal information extraction. However, existing methods face two major challenges. First, they fail to address the semantic ambiguity caused by polysemy and the long-tail distribution of datasets. Second, unlike visual grounding, which provides descriptive phrases, entity grounding offers only brief entity names, which carry less semantic information. Current methods lack sufficient semantic interaction between text and image, hindering accurate entity-visual region matching. To tackle these issues, we propose MAKAR, a Multi-Agent framework based on Knowledge-Augmented Reasoning, comprising three agents: Knowledge Enhancement, Entity Correction, and Entity Reasoning Grounding. Specifically, in the named entity recognition phase, the Knowledge Enhancement Agent leverages a Multimodal Large Language Model (MLLM) as an implicit knowledge base to enhance ambiguous image-text content with its internal knowledge. For samples with low-confidence entity boundaries and types, the Entity Correction Agent uses web search tools to retrieve and summarize relevant web content, thereby correcting entities using both internal and external knowledge. In the entity grounding phase, the Entity Reasoning Grounding Agent utilizes multi-step Chain-of-Thought reasoning to perform grounding for each entity. Extensive experiments show that MAKAR achieves state-of-the-art performance on two benchmark datasets. Code is available at: https://github.com/Nikol-coder/MAKAR.
pdf
bib
abs
VisCRA: A Visual Chain Reasoning Attack for Jailbreaking Multimodal Large Language Models
Bingrui Sima
|
Linhua Cong
|
Wenxuan Wang
|
Kun He
The emergence of Multimodal Large Reasoning Models (MLRMs) has enabled sophisticated visual reasoning capabilities by integrating reinforcement learning and Chain-of-Thought (CoT) supervision. However, while these enhanced reasoning capabilities improve performance, they also introduce new and underexplored safety risks. In this work, we systematically investigate the security implications of advanced visual reasoning in MLRMs. Our analysis reveals a fundamental trade-off: as visual reasoning improves, models become more vulnerable to jailbreak attacks. Motivated by this critical finding, we introduce VisCRA (Visual Chain Reasoning Attack), a novel jailbreak framework that exploits the visual reasoning chains to bypass safety mechanisms. VisCRA combines targeted visual attention masking with a two-stage reasoning induction strategy to precisely control harmful outputs. Extensive experiments demonstrate VisCRA’s significant effectiveness, achieving high attack success rates on leading closed-source MLRMs: 76.48% on Gemini 2.0 Flash Thinking, 68.56% on QvQ-Max, and 56.60% on GPT-4o. Our findings highlight a critical insight: the very capability that empowers MLRMs — their visual reasoning — can also serve as an attack vector, posing significant security risks. Warning: This paper contains unsafe examples.
pdf
bib
abs
Investigating Neurons and Heads in Transformer-based LLMs for Typographical Errors
Kohei Tsuji
|
Tatsuya Hiraoka
|
Yuchang Cheng
|
Eiji Aramaki
|
Tomoya Iwakura
This paper investigates how LLMs encode inputs with typos. We hypothesize that specific neurons and attention heads recognize typos and fix them internally using local and global contexts. We introduce a method to identify typo neurons and typo heads that work actively when inputs contain typos. Our experimental results suggest the following: 1) LLMs can fix typos with local contexts when the typo neurons in either the early or late layers are activated, even if those in the other are not. 2) Typo neurons in the middle layers are the core of typo-fixing with global contexts. 3) Typo heads fix typos by broadly considering the context rather than focusing on specific tokens. 4) Typo neurons and typo heads work not only for typo-fixing but also for understanding general contexts.
pdf
bib
abs
LMR-BENCH: Evaluating LLM Agent’s Ability on Reproducing Language Modeling Research
Shuo Yan
|
Ruochen Li
|
Ziming Luo
|
Zimu Wang
|
Daoyang Li
|
Liqiang Jing
|
Kaiyu He
|
Peilin Wu
|
Juntong Ni
|
George Michalopoulos
|
Yue Zhang
|
Ziyang Zhang
|
Mian Zhang
|
Zhiyu Chen
|
Xinya Du
Large language model (LLM) agents have demonstrated remarkable potential in advancing scientific discovery. However, their capability in the fundamental yet crucial task of reproducing code from research papers, especially in the NLP domain, remains underexplored. This task includes unique complex reasoning challenges in the intellectual synthesis of abstract concepts and the comprehension of code repositories with interdependent files. Motivated by this gap, we present LMR-BENCH, a benchmark designed to systematically evaluate the capability of LLM agents on code reproduction from Language Modeling Research. It consists of 28 code reproduction tasks derived from 23 research papers published in top-tier NLP venues over the past five years, spanning nine fundamental categories. Models are provided with a research paper, a code repository containing one or more masked functions, and instructions for implementing these functions. We conduct extensive experiments in standard prompting and LLM agent settings with state-of-the-art LLMs, evaluating the accuracy of unit tests and performing LLM-based evaluation of code correctness. Experimental results reveal that even the most advanced models still exhibit persistent limitations in scientific reasoning and code synthesis, highlighting critical gaps in LLM agents’ ability to autonomously reproduce scientific research.
pdf
bib
abs
RAV: Retrieval-Augmented Voting for Tactile Descriptions Without Training
Jinlin Wang
|
Yulong Ji
|
Hongyu Yang
Tactile perception is essential for human-environment interaction, and deriving tactile descriptions from multimodal data is a key challenge for embodied intelligence to understand human perception. Conventional approaches relying on extensive parameter learning for multimodal perception are rigid and computationally inefficient. To address this, we introduce Retrieval-Augmented Voting (RAV), a parameter-free method that constructs visual-tactile cross-modal knowledge directly. RAV retrieves similar visual-tactile data for given visual and tactile inputs and generates tactile descriptions through a voting mechanism. In experiments, we applied three voting strategies, SyncVote, DualVote and WeightVote, achieving performance comparable to large-scale cross-modal models without training. Comparative experiments across datasets of varying quality—defined by annotation accuracy and data diversity—demonstrate that RAV’s performance improves with higher-quality data at no additional computational cost. Code and model checkpoints are open-sourced at https://github.com/PluteW/RAV.
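As a hedged illustration of the training-free retrieve-and-vote step described above, the sketch below retrieves the nearest visual-tactile entries from a reference bank and returns the majority-voted description. The actual SyncVote, DualVote, and WeightVote strategies differ in how votes are cast and weighted; all names here are illustrative.

```python
# Parameter-free retrieval plus majority voting over tactile descriptions (sketch).
from collections import Counter
import numpy as np

def rav_describe(query_feat, bank_feats, bank_descriptions, k=5):
    """query_feat: (d,); bank_feats: (N, d); bank_descriptions: list of N strings."""
    sims = bank_feats @ query_feat / (
        np.linalg.norm(bank_feats, axis=1) * np.linalg.norm(query_feat) + 1e-8)
    top = np.argsort(-sims)[:k]                      # k most similar visual-tactile entries
    votes = Counter(bank_descriptions[i] for i in top)
    return votes.most_common(1)[0][0]                # majority-voted tactile description
```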
pdf
bib
abs
Static Word Embeddings for Sentence Semantic Representation
Takashi Wada
|
Yuki Hirakawa
|
Ryotaro Shimizu
|
Takahiro Kawashima
|
Yuki Saito
We propose new static word embeddings optimised for sentence semantic representation. We first extract word embeddings from a pre-trained Sentence Transformer, and improve them with sentence-level principal component analysis, followed by either knowledge distillation or contrastive learning. During inference, we represent sentences by simply averaging word embeddings, which requires little computational cost. We evaluate models on both monolingual and cross-lingual tasks and show that our model substantially outperforms existing static models on sentence semantic tasks, and even surpasses a basic Sentence Transformer model (SimCSE) on a text embedding benchmark. Lastly, we perform a variety of analyses and show that our method successfully removes word embedding components that are not highly relevant to sentence semantics, and adjusts the vector norms based on the influence of words on sentence semantics.
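The inference step described above is deliberately cheap: a sentence is represented by the average of its static word vectors. A minimal sketch follows, assuming a pre-computed `embeddings` lookup table and an illustrative dimensionality; the distillation, contrastive learning, and PCA steps that produce the table are omitted.

```python
# Sentence embedding as a simple average of static word vectors (sketch).
import numpy as np

def sentence_embedding(sentence, embeddings, dim=300):
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
```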
pdf
bib
abs
PropRAG: Guiding Retrieval with Beam Search over Proposition Paths
Jingjin Wang
|
Jiawei Han
Retrieval Augmented Generation (RAG) has become the standard approach for equipping Large Language Models (LLMs) with up-to-date knowledge. However, standard RAG, relying on independent passage retrieval, often fails to capture the interconnected nature of information required for complex, multi-hop reasoning. While structured RAG methods attempt to address this using knowledge graphs built from triples, we argue that the inherent context loss of triples (context collapse) limits the fidelity of the knowledge representation. We introduce PropRAG, a novel RAG framework that shifts from triples to context-rich propositions and introduces an efficient, LLM-free online beam search over proposition paths to discover multi-step reasoning chains. By coupling a higher-fidelity knowledge representation with explicit path discovery, PropRAG achieves state-of-the-art zero-shot Recall@5 and F1 scores on 2Wiki, HotpotQA, and MuSiQue, advancing non-parametric knowledge integration by improving evidence retrieval through richer representation and efficient reasoning path discovery.
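To make the "LLM-free online beam search over proposition paths" concrete, here is a rough sketch under assumed `neighbors` and `score` interfaces: paths through linked propositions are extended hop by hop, and only the top-scoring paths are kept at each step. It illustrates the idea rather than reproducing PropRAG's implementation.

```python
# Beam search over proposition paths (illustrative sketch).
def beam_search_paths(query, start_props, neighbors, score, beam_width=4, max_hops=3):
    """neighbors(prop) -> linked propositions; score(query, path) -> relevance of a path."""
    beams = [[p] for p in start_props]
    for _ in range(max_hops - 1):
        candidates = [path + [nxt] for path in beams for nxt in neighbors(path[-1])
                      if nxt not in path]
        if not candidates:
            break
        candidates.sort(key=lambda path: score(query, path), reverse=True)
        beams = candidates[:beam_width]
    return beams  # top multi-step reasoning chains used to guide retrieval
```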
pdf
bib
abs
Rethinking Backdoor Detection Evaluation for Language Models
Jun Yan
|
Wenjie Jacky Mo
|
Xiang Ren
|
Robin Jia
Backdoor attacks, in which a model behaves maliciously when given an attacker-specified trigger, pose a major security risk for practitioners who depend on publicly released language models. As a countermeasure, backdoor detection methods aim to detect whether a released model contains a backdoor. While existing backdoor detection methods have high accuracy in detecting backdoored models on standard benchmarks, it is unclear whether they can robustly identify backdoors in the wild. In this paper, we examine the robustness of backdoor detectors by manipulating different factors during backdoor planting. We find that the success of existing methods based on trigger inversion or meta classifiers highly depends on how intensely the model is trained on poisoned data. Specifically, backdoors planted with more aggressive or more conservative training are significantly more difficult to detect than the default ones. Our results highlight a lack of robustness of existing backdoor detectors and the limitations in current benchmark construction.
pdf
bib
abs
Glider: Global and Local Instruction-Driven Expert Router
Pingzhi Li
|
Prateek Yadav
|
Jaehong Yoon
|
Jie Peng
|
Yi-Lin Sung
|
Mohit Bansal
|
Tianlong Chen
The development of performant pre-trained models has driven the advancement of routing-based expert models tailored to specific tasks. However, these methods often favor generalization over performance on held-in tasks. This limitation adversely impacts practical applicability, as real-world deployments require robust performance across both known and novel tasks. We observe that current token-level routing mechanisms neglect the global semantic context of the input task. To address this, we propose a novel method, Global and Local Instruction Driven Expert Router (GLIDER), which employs a multi-scale routing mechanism encompassing a semantic global router and a learned local router. The global router leverages recent LLMs’ semantic reasoning capabilities to generate task-specific instructions from the input query, guiding expert selection across all layers. This global guidance is complemented by a local router that facilitates token-level routing decisions within each module, enabling finer control and enhanced performance on unseen and challenging tasks. Our experiments using T5-based expert models for T0 and FLAN tasks demonstrate that Glider achieves substantially improved held-in performance while maintaining strong generalization on held-out tasks. Additionally, we perform ablation experiments to dive deeper into the components of Glider and plot routing distributions to show that Glider can effectively retrieve the correct expert for held-in tasks while also demonstrating compositional capabilities for held-out tasks. Our experiments highlight the importance of our multi-scale routing that leverages LLM-driven semantic reasoning for MoErging methods.
pdf
bib
abs
CoVoGER: A Multilingual Multitask Benchmark for Speech-to-text Generative Error Correction with Large Language Models
Zhengdong Yang
|
Zhen Wan
|
Sheng Li
|
Chao-Han Huck Yang
|
Chenhui Chu
Large language models (LLMs) can rewrite the N-best hypotheses from a speech-to-text model, often fixing recognition or translation errors that traditional rescoring cannot. Yet research on generative error correction (GER) has been focusing on monolingual automatic speech recognition (ASR), leaving its multilingual and multitask potential underexplored. We introduce CoVoGER, a benchmark for GER that covers both ASR and speech-to-text translation (ST) across 15 languages and 28 language pairs. CoVoGER is constructed by decoding Common Voice 20.0 and CoVoST-2 with Whisper of three model sizes and SeamlessM4T of two model sizes, providing 5-best lists obtained via a mixture of beam search and temperature sampling. We evaluated various instruction-tuned LLMs, including commercial models in zero-shot mode and open-sourced models with LoRA fine-tuning, and found that the mixture decoding strategy yields the best GER performance in most settings. CoVoGER will be released to promote research on reliable language-universal speech-to-text GER. The code and data for the benchmark are available at https://github.com/N-Orien/CoVoGER.
pdf
bib
abs
Tiny Budgets, Big Gains: Parameter Placement Strategy in Parameter Super-Efficient Fine-Tuning
Jinman Zhao
|
Xueyan Zhang
|
Jiaru Li
|
Jingcheng Niu
|
Yulan Hu
|
Erxue Min
|
Gerald Penn
In this work, we propose FoRA-UA, a novel method that, using only 1–5% of the standard LoRA’s parameters, achieves state-of-the-art performance across a wide range of tasks. Specifically, we explore scenarios with extremely limited parameter budgets and derive two key insights: (1) fixed-size sparse frequency representations approximate small matrices more accurately; and (2) with a fixed number of trainable parameters, introducing a smaller intermediate representation to approximate larger matrices results in lower construction error. These findings form the foundation of our FoRA-UA method. By inserting a small intermediate parameter set, we achieve greater model compression without sacrificing performance. We evaluate FoRA-UA across diverse tasks, including natural language understanding (NLU), natural language generation (NLG), instruction tuning, and image classification, demonstrating strong generalisation and robustness under extreme compression.
pdf
bib
abs
Legal Fact Prediction: The Missing Piece in Legal Judgment Prediction
Junkai Liu
|
Yujie Tong
|
Hui Huang
|
Bowen Zheng
|
Yiran Hu
|
Peicheng Wu
|
Chuan Xiao
|
Makoto Onizuka
|
Muyun Yang
|
Shuyuan Zheng
Legal judgment prediction (LJP), which enables litigants and their lawyers to forecast judgment outcomes and refine litigation strategies, has emerged as a crucial legal NLP task. Existing studies typically utilize legal facts, i.e., facts that have been established by evidence and determined by the judge, to predict the judgment. However, legal facts are often difficult to obtain in the early stages of litigation, significantly limiting the practical applicability of fact-based LJP. To address this limitation, we propose a novel legal NLP task: legal fact prediction (LFP), which takes the evidence submitted by litigants for trial as input to predict legal facts, thereby empowering fact-based LJP technologies to make predictions in the absence of ground-truth legal facts. We also propose the first benchmark dataset, LFPBench, for evaluating the LFP task. Our extensive experiments on LFPBench demonstrate the effectiveness of LFP-empowered LJP and highlight promising research directions for LFP.
pdf
bib
abs
DAMON: A Dialogue-Aware MCTS Framework for Jailbreaking Large Language Models
Xu Zhang
|
Xunjian Yin
|
Dinghao Jing
|
Huixuan Zhang
|
Xinyu Hu
|
Xiaojun Wan
While large language models (LLMs) demonstrate remarkable capabilities across a wide range of tasks, they remain vulnerable to generating outputs that are potentially harmful. Red teaming, which involves crafting adversarial inputs to expose vulnerabilities, is a widely adopted approach for evaluating the robustness of these models. Prior studies have indicated that LLMs are susceptible to vulnerabilities exposed through multi-turn interactions as opposed to single-turn scenarios. Nevertheless, existing methods for multi-turn attacks mainly utilize a predefined dialogue pattern, limiting their effectiveness in realistic situations. Effective attacks require adaptive dialogue strategies that respond dynamically to the initial user prompt and the evolving context of the conversation. To address these limitations, we propose DAMON, a novel multi-turn jailbreak attack method. DAMON leverages Monte Carlo Tree Search (MCTS) to systematically explore multi-turn conversational spaces, efficiently identifying sub-instruction sequences that induce harmful responses. We evaluate DAMON’s efficacy across five LLMs and three datasets. Our experimental results show that DAMON can effectively induce undesired behaviors.
pdf
bib
abs
Multilingual Prompting for Improving LLM Generation Diversity
Qihan Wang
|
Shidong Pan
|
Tal Linzen
|
Emily Black
Large Language Models (LLMs) are known to lack cultural representation and overall diversity in their generations, from expressing opinions to answering factual questions. To mitigate this problem, we propose multilingual prompting: a prompting method which generates several variations of a base prompt with added cultural and linguistic cues from several cultures, generates responses, and then combines the results. Building on evidence that LLMs have language-specific knowledge, multilingual prompting seeks to increase diversity by activating a broader range of cultural knowledge embedded in model training data. Through experiments across multiple models (GPT-4o, GPT-4o-mini, LLaMA 70B, and LLaMA 8B), we show that multilingual prompting consistently outperforms existing diversity-enhancing techniques such as high-temperature sampling, step-by-step recall, and persona prompting. Further analyses show that the benefits of multilingual prompting vary between high and low resource languages and across model sizes, and that aligning the prompting language with cultural cues reduces hallucination about culturally-specific information.
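A minimal sketch of the multilingual-prompting loop described above: the base prompt is rendered with cultural and linguistic cues for several languages, each variant is sent to the model, and the responses are pooled. The `chat` and `translate_prompt` helpers and the language list are assumptions standing in for whatever LLM API, translation step, and cultures are used.

```python
# Generate culturally/linguistically varied prompt variants and pool responses (sketch).
LANGS = ["en", "es", "zh", "ar", "hi"]

def multilingual_prompting(base_prompt, chat, translate_prompt):
    responses = []
    for lang in LANGS:
        variant = translate_prompt(base_prompt, target_lang=lang)  # add cultural/linguistic cues
        responses.append(chat(variant))
    # Combine results, e.g., deduplicate or sample across languages to increase diversity.
    return responses
```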
pdf
bib
abs
MOSAIC: Modeling Social AI for Content Dissemination and Regulation in Multi-Agent Simulations
Genglin Liu
|
Vivian T. Le
|
Salman Rahman
|
Elisa Kreiss
|
Marzyeh Ghassemi
|
Saadia Gabriel
We present a novel, open-source social network simulation framework, MOSAIC, where generative language agents predict user behaviors such as liking, sharing, and flagging content. This simulation combines LLM agents with a directed social graph to analyze emergent deception behaviors and gain a better understanding of how users determine the veracity of online social content. By constructing user representations from diverse fine-grained personas, our system enables multi-agent simulations that model content dissemination and engagement dynamics at scale. Within this framework, we evaluate three different content moderation strategies with simulated misinformation dissemination, and we find that they not only mitigate the spread of non-factual content but also increase user engagement. In addition, we analyze the trajectories of popular content in our simulations, and explore whether simulation agents’ articulated reasoning for their social interactions truly aligns with their collective engagement patterns.
pdf
bib
abs
Identification of Multiple Logical Interpretations in Counter-Arguments
Wenzhi Wang
|
Paul Reisert
|
Shoichi Naito
|
Naoya Inoue
|
Machi Shimmei
|
Surawat Pothong
|
Jungmin Choi
|
Kentaro Inui
Counter-arguments (CAs) are a good means to improve the critical-thinking skills of learners, especially given that one has to thoroughly consider the logic of initial arguments (IA) when composing their CA. Although several tasks have been created for identifying the logical structure of CAs, no prior work has focused on capturing multiple interpretations of logical structures due to their complexity. In this work, we create CALSA+, a dataset consisting of 134 CAs annotated with 13 logical predicate questions. CALSA+ contains 1,742 instances annotated by 3 expert annotators (5,226 total annotations) with good agreement (Krippendorff 𝛼=0.46). Using CALSA+, we train a model with Reinforcement Learning with Verifiable Rewards (RLVR) to identify multiple logical interpretations and show that models trained with RLVR can perform on par with much bigger proprietary models. Our work is the first to attempt to annotate all the interpretations of logical structure on top of CAs. We publicly release our dataset to facilitate research in CA logical structure identification.
pdf
bib
abs
LyapLock: Bounded Knowledge Preservation in Sequential Large Language Model Editing
Peng Wang
|
Biyu Zhou
|
Xuehai Tang
|
Jizhong Han
|
Songlin Hu
Large Language Models often contain factually incorrect or outdated knowledge, giving rise to model editing methods for precise knowledge updates. However, current mainstream locate-then-edit approaches exhibit a progressive performance decline during sequential editing, due to inadequate mechanisms for long-term knowledge preservation. To tackle this, we model sequential editing as a constrained stochastic programming problem. To handle the challenges posed by the cumulative preservation-error constraint and the gradually revealed editing tasks, we propose **LyapLock**. It integrates queuing theory and Lyapunov optimization to decompose the long-term constrained program into tractable stepwise subproblems for efficient solving. This is the first model editing framework with rigorous theoretical guarantees, achieving asymptotically optimal editing performance while meeting the constraints of long-term knowledge preservation. Experimental results show that our framework scales sequential editing capacity to over 10,000 edits while stabilizing general capabilities and boosting average editing efficacy by 11.89% over SOTA baselines. Furthermore, it can be leveraged to enhance the performance of baseline methods. Our code is released at https://github.com/caskcsg/LyapLock.
pdf
bib
abs
AlignX: Advancing Multilingual Large Language Models with Multilingual Representation Alignment
Mengyu Bu
|
Shaolei Zhang
|
Zhongjun He
|
Hua Wu
|
Yang Feng
Multilingual large language models (LLMs) possess impressive multilingual understanding and generation capabilities. However, their performance and cross-lingual alignment often lag for non-dominant languages. A common solution is to fine-tune LLMs on a large-scale, more balanced multilingual corpus, but such approaches often lead to imprecise alignment and suboptimal knowledge transfer, yielding only limited improvements across languages. In this paper, we propose AlignX, a two-stage representation-level framework for enhancing the multilingual performance of pre-trained LLMs, to bridge the multilingual performance gap. In the first stage, we align multilingual representations through multilingual semantic alignment and language feature integration. In the second stage, we stimulate the multilingual capability of LLMs via multilingual instruction fine-tuning. Experimental results on several pre-trained LLMs demonstrate that our approach enhances LLMs’ general multilingual and cross-lingual generation capabilities. Further analysis indicates that AlignX brings multilingual representations closer together and improves cross-lingual alignment.
pdf
bib
abs
What Makes a Good Reasoning Chain? Uncovering Structural Patterns in Long Chain-of-Thought Reasoning
Gangwei Jiang
|
Yahui Liu
|
Zhaoyi Li
|
Wei Bi
|
Fuzheng Zhang
|
Linqi Song
|
Ying Wei
|
Defu Lian
Recent advances in reasoning with large language models (LLMs) have popularized Long Chain-of-Thought (LCoT), a strategy that encourages deliberate and step-by-step reasoning before producing a final answer. While LCoTs have enabled expert-level performance in complex tasks, how the internal structures of their reasoning chains drive, or even predict, the correctness of final answers remains a critical yet underexplored question. In this work, we present LCoT2Tree, an automated framework that converts sequential LCoTs into hierarchical tree structures and thus enables deeper structural analysis of LLM reasoning. Using graph neural networks (GNNs), we reveal that structural patterns extracted by LCoT2Tree, including exploration, backtracking, and verification, serve as stronger predictors of final performance across a wide range of tasks and models. Leveraging an explainability technique, we further identify critical thought patterns such as over-branching that account for failures. Beyond diagnostic insights, the structural patterns by LCoT2Tree support practical applications, including improving Best-of-N decoding effectiveness. Overall, our results underscore the critical role of internal structures of reasoning chains, positioning LCoT2Tree as a powerful tool for diagnosing, interpreting, and improving reasoning in LLMs.
pdf
bib
abs
HD-PiSSA: High-Rank Distributed Orthogonal Adaptation
Yiding Wang
|
Fanxu Meng
|
Xuefeng Zhang
|
Fan Jiang
|
Pingzhi Tang
|
Muhan Zhang
Existing parameter-efficient fine-tuning (PEFT) methods for large language models (LLMs), such as LoRA and PiSSA, constrain model updates to low-rank subspaces, limiting their expressiveness and leading to suboptimal performance on complex tasks. To address this, we introduce **H**igh-rank **D**istributed **PiSSA (HD-PiSSA)**, a distributed PEFT approach that initializes **orthogonal adapters** across different devices and aggregates their delta updates collectively on the pre-trained weight matrix W for fine-tuning. Unlike Data Parallel LoRA or PiSSA, which maintain identical adapters across all devices, HD-PiSSA assigns different principal components of the pre-trained weights to each GPU, significantly expanding the range of update directions. This results in over 16× higher effective updated ranks than data-parallel LoRA or PiSSA when fine-tuning on 8 GPUs with the same per-device adapter rank. Empirically, HD-PiSSA benefits from this extra optimization flexibility and outperforms both LoRA and PiSSA across a variety of challenging downstream tasks, including mathematics, code, and multi-task learning.
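The core idea of assigning disjoint principal components to different devices can be illustrated with a small numpy sketch; the SVD split, ranks, and aggregation below are illustrative assumptions, not the authors' training code.

```python
import numpy as np

def split_principal_components(W: np.ndarray, n_devices: int, rank_per_device: int):
    """Assign disjoint slices of W's principal components to each (simulated) device."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    adapters = []
    for d in range(n_devices):
        lo, hi = d * rank_per_device, (d + 1) * rank_per_device
        # Each device fine-tunes its own orthogonal slice (A_d, B_d).
        A_d = U[:, lo:hi] * S[lo:hi]   # shape (out_dim, r)
        B_d = Vt[lo:hi, :]             # shape (r, in_dim)
        adapters.append((A_d, B_d))
    return adapters

def aggregate_deltas(adapters):
    """Collect the per-device low-rank contributions into one update on W."""
    return sum(A @ B for A, B in adapters)

if __name__ == "__main__":
    W = np.random.randn(64, 32)
    adapters = split_principal_components(W, n_devices=8, rank_per_device=4)
    delta_W = aggregate_deltas(adapters)   # effective update rank up to 8 * 4 = 32
    print(delta_W.shape)
```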
pdf
bib
abs
Firewall Routing: Blocking Leads to Better Hybrid Inference for LLMs
Runyu Peng
|
Yunhua Zhou
|
Kai Lv
|
Yang Gao
|
Qipeng Guo
|
Xipeng Qiu
The rapid advancement of Large Language Models (LLMs) has significantly enhanced performance across various natural language processing (NLP) tasks, yet the high computational costs and latency associated with deploying such models continue to pose critical bottlenecks, limiting their broader applicability. To mitigate these challenges, we propose a dynamic hybrid inference framework, Firewall Routing, which efficiently selects between a strong and a weak LLM based on the complexity of the query. A lightweight routing model is trained to optimize resource allocation by learning from response quality and preventing long-tail queries, which are often too hard for LLMs to solve, from being routed to the stronger model. Moreover, our method incorporates multiple sampling to enhance query evaluation reliability while leveraging Hard Blocking and Soft Blocking to handle long-tail queries and refine labels for model selection. Extensive experiments show that our method outperforms existing routing strategies by up to 5.29% in APGR, demonstrating state-of-the-art performance across multiple benchmarks.
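A minimal sketch of the routing decision described above, assuming a trained router outputs a scalar difficulty estimate; the hard/soft blocking thresholds and the `weak_llm`/`strong_llm` callables are illustrative placeholders rather than the paper's exact design.

```python
import random

def route(query: str, difficulty: float, weak_llm, strong_llm,
          use_strong: float = 0.5, soft_block: float = 0.85, hard_block: float = 0.95):
    """Hybrid inference sketch: easy queries go to the weak model, harder ones to the
    strong model, and long-tail queries that are likely unsolvable are blocked from
    the strong model to save cost. Thresholds are illustrative, not the paper's values."""
    if difficulty >= hard_block:
        return weak_llm(query)   # hard blocking: do not escalate hopeless long-tail queries
    if difficulty >= soft_block:
        # soft blocking: escalate only occasionally for borderline long-tail queries
        return strong_llm(query) if random.random() < 0.3 else weak_llm(query)
    return strong_llm(query) if difficulty >= use_strong else weak_llm(query)

if __name__ == "__main__":
    weak = lambda q: f"[weak] {q}"
    strong = lambda q: f"[strong] {q}"
    print(route("What is 2 + 2?", difficulty=0.1, weak_llm=weak, strong_llm=strong))
    print(route("Prove the Riemann hypothesis.", difficulty=0.99, weak_llm=weak, strong_llm=strong))
```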
pdf
bib
abs
SPE Attention: Making Attention Equivariant to Semantic-Preserving Permutation for Code Processing
Chengyu Jiao
|
Shuhao Chen
|
Yu Zhang
Code serves as the fundamental language through which humans communicate with machines, and recent advances have produced various Transformer-based models trained to process code. A unique symmetry of code is semantic-preserving permutation, which allows certain lines to be rearranged without altering the overall meaning. To capture this symmetry, we propose a novel attention mechanism that incorporates semantic-preserving permutation equivariance, called the SPE attention. By leveraging the symmetry relationships within code, we introduce a directed layered graph to represent the code structure, and this graph is then summarized into a symmetry mask. The SPE attention integrates these symmetry masks, granting the model equivariance to semantic-preserving permutations. Experiments on various code-related tasks, including code summarization and error detection, demonstrate the effectiveness of the proposed SPE attention.
pdf
bib
abs
Audio-centric Video Understanding Benchmark without Text Shortcut
Yudong Yang
|
Jimin Zhuang
|
Guangzhi Sun
|
Changli Tang
|
Yixuan Li
|
Peihan Li
|
Yifan Jiang
|
Wei Li
|
Zejun Ma
|
Chao Zhang
Audio often serves as an auxiliary modality in video understanding tasks of audio-visual large language models (LLMs), merely assisting in the comprehension of visual information. However, a thorough understanding of videos significantly depends on auditory information, as audio offers critical context, emotional cues, and semantic meaning that visual data alone often lacks. This paper proposes an audio-centric video understanding benchmark (AVUT) to evaluate the video comprehension capabilities of multimodal LLMs with a particular focus on auditory information. AVUT introduces a suite of carefully designed audio-centric tasks, holistically testing the understanding of both audio content and audio-visual interactions in videos. Moreover, this work points out the text shortcut problem that largely exists in other benchmarks, where the correct answer can be found from the question text alone without needing the video. AVUT addresses this problem by proposing an answer-permutation-based filtering mechanism. A thorough evaluation across a diverse range of open-source and proprietary multimodal LLMs is performed, followed by analyses of the deficiencies in audio-visual LLMs. Demos and data are available at https://github.com/lark-png/AVUT.
pdf
bib
abs
TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text
Songshuo Lu
|
Hua Wang
|
Yutian Rong
|
Zhi Chen
|
Yaohua Tang
Current Retrieval-Augmented Generation (RAG) systems concatenate and process numerous retrieved document chunks for prefill, which requires a large volume of computation and therefore leads to significant time-to-first-token (TTFT) latency. To reduce the computation overhead as well as TTFT, we introduce TurboRAG, a hybrid offline-online paradigm that (i) pre-computes chunk-level key-value (KV) caches, (ii) stitches them together at inference time using independent-attention and reordered-RoPE techniques, and (iii) preserves answer quality without changing the model architecture. Hence, online computation of KV caches is eliminated during inference. Our approach is applicable to most existing large language models and their applications without requiring any modification to models or inference systems. Experimental results across a suite of RAG benchmarks demonstrate that TurboRAG reduces TTFT by up to 9.4x (8.6x on average) compared to conventional RAG systems while preserving performance comparable to standard RAG systems.
pdf
bib
abs
ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
Haozhan Shen
|
Kangjia Zhao
|
Tiancheng Zhao
|
Ruochen Xu
|
Zilun Zhang
|
Mingwei Zhu
|
Jianwei Yin
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in vision-language understanding. Recently, with the integration of test-time scaling techniques, these models have also shown strong potential in visual reasoning. However, most existing reasoning approaches remain text-level in nature: MLLMs are prompted to explore various combinations of textual tokens via their underlying language model, while the visual input remains fixed throughout the reasoning process. This paradigm limits the model’s ability to fully exploit rich visual information, particularly when dealing with images containing numerous fine-grained elements. In such cases, vision-level reasoning becomes crucial—where models dynamically zoom into specific regions of the image to gather detailed visual cues necessary for accurate decision-making. In this paper, we propose Zoom Eye, a training-free, model-agnostic tree search algorithm tailored for vision-level reasoning. Zoom Eye treats an image as a hierarchical tree structure, where each child node represents a zoomed-in sub-region of its parent, and the root corresponds to the full image. The algorithm enables MLLMs to simulate human-like zooming behavior by navigating from root to leaf nodes in search of task-relevant visual evidence. We experiment on a series of elaborate high-resolution benchmarks, and the results demonstrate that Zoom Eye not only consistently improves the performance of a series of MLLMs by a large margin (e.g., InternVL2.5-8B increases by 15.71% and 17.69% on HR-Bench) but also enables small 3-8B MLLMs to outperform strong large models such as GPT-4o.
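A minimal sketch of the zooming idea: a greedy root-to-leaf search over a quadtree of image regions. The quadrant split, stopping rule, and `score` callable (standing in for querying the MLLM about a crop) are illustrative simplifications, not the authors' algorithm.

```python
from dataclasses import dataclass

@dataclass
class Region:
    x0: float
    y0: float
    x1: float
    y1: float  # normalized [0, 1] image coordinates

def children(r: Region):
    """Split a region into four zoomed-in quadrants (quadtree view of the image)."""
    mx, my = (r.x0 + r.x1) / 2, (r.y0 + r.y1) / 2
    return [Region(r.x0, r.y0, mx, my), Region(mx, r.y0, r.x1, my),
            Region(r.x0, my, mx, r.y1), Region(mx, my, r.x1, r.y1)]

def zoom_search(score, max_depth: int = 3, threshold: float = 0.8) -> Region:
    """Greedy root-to-leaf search: keep zooming into the most promising quadrant
    until the scorer is confident enough or the maximum depth is reached."""
    region = Region(0.0, 0.0, 1.0, 1.0)
    for _ in range(max_depth):
        if score(region) >= threshold:
            break
        region = max(children(region), key=score)
    return region

if __name__ == "__main__":
    # Toy scorer: pretend the answer-relevant evidence sits near (0.7, 0.2).
    def toy_score(r: Region) -> float:
        cx, cy = (r.x0 + r.x1) / 2, (r.y0 + r.y1) / 2
        return 1.0 - abs(cx - 0.7) - abs(cy - 0.2)
    print(zoom_search(toy_score))
```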
pdf
bib
abs
Learning Like Humans: Advancing LLM Reasoning Capabilities via Adaptive Difficulty Curriculum Learning and Expert-Guided Self-Reformulation
Enci Zhang
|
Xingang Yan
|
Wei Lin
|
Tianxiang. Zhang
|
Lu Qianchun
Despite impressive progress in areas like mathematical reasoning, large language models still face challenges in consistently solving complex problems. Drawing inspiration from key human learning strategies, we propose two novel strategies to enhance the capability of large language models to solve these complex problems. First, Adaptive Difficulty Curriculum Learning (ADCL) is a novel curriculum learning strategy that tackles the Difficulty Shift phenomenon (i.e., a model’s perception of problem difficulty dynamically changes during training) by periodically re-estimating difficulty within upcoming data batches to maintain alignment with the model’s evolving capabilities. Second, Expert-Guided Self-Reformulation (EGSR) is a novel reinforcement learning strategy that bridges the gap between imitation learning and pure exploration by guiding models to reformulate expert solutions within their own conceptual framework, rather than relying on direct imitation, fostering deeper understanding and knowledge assimilation. Extensive experiments on challenging mathematical reasoning benchmarks, using Qwen2.5-7B as the base model, demonstrate that these human-inspired strategies synergistically and significantly enhance performance. Notably, their combined application improves performance over the standard Zero-RL baseline by 10% on the AIME24 benchmark and 16.6% on AIME25.
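A minimal sketch of the adaptive-curriculum side (ADCL) only: difficulty is re-estimated with the current model before each batch so the ordering tracks the model's evolving ability. The `current_loss` scorer and batching scheme are illustrative assumptions, and EGSR is not covered here.

```python
def adaptive_curriculum(samples, current_loss, batch_size: int = 32):
    """Yield batches easiest-first, re-estimating difficulty with the *current* model
    before each batch (periodic re-estimation is the core of the ADCL idea)."""
    remaining = list(samples)
    while remaining:
        remaining.sort(key=current_loss)  # re-estimate difficulty with the current model
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        yield batch                       # caller trains on the batch here

if __name__ == "__main__":
    data = [f"problem-{i}" for i in range(100)]
    difficulty = {s: hash(s) % 10 for s in data}   # toy difficulty proxy
    for step, batch in enumerate(adaptive_curriculum(data, lambda s: difficulty[s], 25)):
        print(step, batch[:3])
```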
pdf
bib
abs
VersaTune: An Efficient Data Composition Framework for Training Multi-Capability LLMs
Keer Lu
|
Keshi Zhao
|
Zhuoran Zhang
|
Zheng Liang
|
Bin Cui
|
Tengjiao Wang
|
Wentao Zhang
As demonstrated by proprietary Large Language Models (LLMs) such as the GPT and Claude series, LLMs have the potential to achieve remarkable proficiency across a wide range of domains, including law, medicine, finance, science, code, etc., all within a single model. These capabilities are further augmented during the Supervised Fine-Tuning (SFT) phase. Despite this potential, existing work mainly focuses on domain-specific enhancement during fine-tuning, which suffers from catastrophic forgetting of knowledge in other domains. In this study, we introduce **VersaTune**, a novel data composition framework designed for enhancing LLMs’ overall multi-domain capabilities during training. We begin by detecting the distribution of domain-specific knowledge within the base model, followed by training data composition that aligns with the model’s existing knowledge distribution. During the subsequent training process, domain weights are dynamically adjusted based on their learnable potential and forgetting degree. Experimental results indicate that VersaTune is effective at fostering multi-domain capabilities, improving overall multi-ability performance by 29.77% compared to uniform domain weights. Furthermore, we find that Qwen-2.5-32B + VersaTune even surpasses frontier models, including GPT-4o, Claude3.5-Sonnet, and DeepSeek-V3, by 0.86%, 4.76%, and 4.60%, respectively. Additionally, in scenarios where flexible expansion of a specific domain is required, VersaTune reduces the performance degradation in other domains by 38.77%, while preserving the training efficacy of the target domain.
pdf
bib
abs
FlightGPT: Towards Generalizable and Interpretable UAV Vision-and-Language Navigation with Vision-Language Models
Hengxing Cai
|
Jinhan Dong
|
Jingjun Tan
|
Jingcheng Deng
|
Sihang Li
|
Zhifeng Gao
|
Haidong Wang
|
Zicheng Su
|
Agachai Sumalee
|
Renxin Zhong
Unmanned Aerial Vehicle (UAV) Vision-and-Language Navigation (VLN) is vital for applications such as disaster response, logistics delivery, and urban inspection. However, existing methods often struggle with insufficient multimodal fusion, weak generalization, and poor interpretability. To address these challenges, we propose FlightGPT, a novel UAV VLN framework built upon Vision-Language Models (VLMs) with powerful multimodal perception capabilities. We design a two-stage training pipeline: first, Supervised Fine-Tuning (SFT) using high-quality demonstrations to improve initialization and structured reasoning; then, the Group Relative Policy Optimization (GRPO) algorithm, guided by a composite reward that considers goal accuracy, reasoning quality, and format compliance, to enhance generalization and adaptability. Furthermore, FlightGPT introduces a Chain-of-Thought (CoT)-based reasoning mechanism to improve decision interpretability. Extensive experiments on the city-scale dataset CityNav demonstrate that FlightGPT achieves state-of-the-art performance across all scenarios, with a 9.22% higher success rate than the strongest baseline in unseen environments. Our implementation is publicly available.
pdf
bib
abs
Multimodal Language Models See Better When They Look Shallower
Haoran Chen
|
Junyan Lin
|
Xinghao Chen
|
Yue Fan
|
Jianfeng Dong
|
Xin Jin
|
Hui Su
|
Jinlan Fu
|
Xiaoyu Shen
Multimodal large language models (MLLMs) typically extract visual features from the final layers of a pretrained Vision Transformer (ViT). This widespread deep-layer bias, however, is largely driven by empirical convention rather than principled analysis. While prior studies suggest that different ViT layers capture different types of information, with shallower layers focusing on fine visual details and deeper layers aligning more closely with textual semantics, the impact of this variation on MLLM performance remains underexplored. We present the first comprehensive study of visual layer selection for MLLMs, analyzing representation similarity across ViT layers to establish shallow, middle, and deep layer groupings. Through extensive evaluation of MLLMs (1.4B–7B parameters) across 10 benchmarks encompassing 60+ tasks, we find that while deep layers excel in semantic-rich tasks like OCR, shallow and middle layers significantly outperform them on fine-grained visual tasks including counting, positioning, and object localization. Building on these insights, we propose a lightweight feature fusion method that strategically incorporates shallower layers, achieving consistent improvements over both single-layer and specialized fusion baselines. Our work offers the first principled study of visual layer selection in MLLMs, showing that MLLMs can often see better when they look shallower.
pdf
bib
abs
LoSiA: Efficient High-Rank Fine-Tuning via Subnet Localization and Optimization
Xujia Wang
|
Yunjia Qi
|
Bin Xu
Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA, significantly reduce the number of trainable parameters by introducing low-rank decomposition matrices. However, existing methods perform extensive matrix multiplications in domain specialization tasks, resulting in computational inefficiency and sub-optimal fine-tuning performance. Hence, we propose LoSiA (**Lo**w-Resources **S**ubnet **I**ntegration **A**daptation), an innovative method that dynamically localizes and optimizes critical parameters during the training process. Specifically, it identifies a sub-network using gradient sparsity analysis and optimizes it as the trainable target. This design enables effective high-rank adaptation by updating only the sub-network parameters, reducing the additional matrix multiplication. We also present LoSiA-Pro, a faster implementation of LoSiA, which reduces the training latency by about 27% compared to LoRA. Extensive evaluations show that our method achieves minimal performance drop compared to full fine-tuning, while requiring the least training time across domain specialization and common-sense reasoning tasks. Further analysis shows that LoSiA also reduces forgetting during continued training.
pdf
bib
abs
Invisible Entropy: Towards Safe and Efficient Low-Entropy LLM Watermarking
Tianle Gu
|
Zongqi Wang
|
Kexin Huang
|
Yuanqi Yao
|
Xiangliang Zhang
|
Yujiu Yang
|
Xiuying Chen
Logit-based LLM watermarking traces and verifies AI-generated content by maintaining green and red token lists and increasing the likelihood of green tokens during generation. However, it struggles in low-entropy scenarios, where predictable outputs make green token selection difficult without disrupting natural text flow. Existing approaches address this by assuming access to the original LLM to calculate entropy and selectively watermark high-entropy tokens. However, these methods face two major challenges: (1) high computational costs and detection delays due to reliance on the original LLM, and (2) potential risks of model leakage. To address these limitations, we propose Invisible Entropy (IE), a watermarking paradigm designed to enhance both safety and efficiency. Instead of relying on the original LLM, IE introduces a lightweight feature extractor and an entropy tagger to predict whether the entropy of the next token is high or low. Furthermore, based on theoretical analysis, we developed a threshold navigator that adaptively sets entropy thresholds. It identifies a threshold where the watermark ratio decreases as the green token count increases, enhancing the naturalness of the watermarked text and improving detection robustness. Experiments on HumanEval and MBPP datasets demonstrate that IE reduces parameter size by 99% while achieving performance on par with state-of-the-art methods: https://anonymous.4open.science/r/IE-Official.
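A minimal numpy sketch of entropy-gated watermarking in the spirit of this abstract; the lightweight entropy tagger is reduced to a boolean flag, and the green-list construction follows the standard logit-based watermarking recipe rather than the authors' exact method.

```python
import numpy as np

def green_list(prev_token: int, vocab_size: int, gamma: float = 0.5) -> np.ndarray:
    """Pseudo-randomly mark a gamma-fraction of the vocabulary as 'green',
    seeded by the previous token (standard logit-based watermarking recipe)."""
    rng = np.random.default_rng(prev_token)
    return rng.random(vocab_size) < gamma

def watermark_logits(logits: np.ndarray, prev_token: int,
                     tagger_says_high_entropy: bool, delta: float = 2.0) -> np.ndarray:
    """Boost green-token logits only when the (hypothetical) entropy tagger predicts the
    next-token distribution is high-entropy, leaving low-entropy positions untouched."""
    if not tagger_says_high_entropy:
        return logits
    boosted = logits.copy()
    boosted[green_list(prev_token, logits.shape[0])] += delta
    return boosted

if __name__ == "__main__":
    vocab = 50
    logits = np.random.randn(vocab)
    print(watermark_logits(logits, prev_token=7, tagger_says_high_entropy=True)[:5])
```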
pdf
bib
abs
Measuring Bias or Measuring the Task: Understanding the Brittle Nature of LLM Gender Biases
Bufan Gao
|
Elisa Kreiss
As LLMs are increasingly applied in socially impactful settings, concerns about gender bias have prompted growing efforts both to measure and mitigate such bias. These efforts often rely on evaluation tasks that differ from natural language distributions, as they typically involve carefully constructed task prompts that overtly or covertly signal the presence of gender bias-related content. In this paper, we examine how signaling the evaluative purpose of a task impacts measured gender bias in LLMs.Concretely, we test models under prompt conditions that (1) make the testing context salient, and (2) make gender-focused content salient. We then assess prompt sensitivity across four task formats with both token-probability and discrete-choice metrics. We find that prompts that more clearly align with (gender bias) evaluation framing elicit distinct gender output distributions compared to less evaluation-framed prompts. Discrete-choice metrics further tend to amplify bias relative to probabilistic measures. These findings do not only highlight the brittleness of LLM gender bias evaluations but open a new puzzle for the NLP benchmarking and development community: To what extent can well-controlled testing designs trigger LLM testing mode performance, and what does this mean for the ecological validity of future benchmarks.
pdf
bib
abs
Alignment-Augmented Speculative Decoding with Alignment Sampling and Conditional Verification
Jikai Wang
|
Zhenxu Tian
|
Juntao Li
|
Qingrong Xia
|
Xinyu Duan
|
Zhefeng Wang
|
Baoxing Huai
|
Min Zhang
Recent works have revealed the great potential of speculative decoding in accelerating the autoregressive generation process of large language models. The success of these methods relies on the alignment between draft candidates and the sampled outputs of the target model. Existing methods mainly achieve draft-target alignment with training-based methods (e.g., EAGLE, Medusa), which involve considerable training costs. In this paper, we present a training-free alignment-augmented speculative decoding algorithm. We propose alignment sampling, which leverages the output distribution obtained in the prefilling phase to provide more aligned draft candidates. To further benefit from high-quality but non-aligned draft candidates, we also introduce a simple yet effective flexible verification strategy. Through an adaptive probability threshold, our approach can improve generation accuracy while further improving inference efficiency. Experiments on 8 datasets (including question answering, summarization and code completion tasks) show that our approach increases the average generation score by 3.3 points for the LLaMA3 model. Our method achieves a mean acceptance length of up to 2.39 and speeds up generation by 2.23×.
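A minimal sketch of threshold-based verification of draft tokens, assuming the target model's probability for each drafted token is available; the fixed threshold below stands in for the adaptive rule described in the abstract.

```python
def verify_drafts(draft_tokens, target_probs, threshold: float = 0.3):
    """Accept the longest prefix of draft tokens whose target-model probability clears
    the threshold; the rest are discarded and regenerated by the target model."""
    accepted = []
    for tok, p in zip(draft_tokens, target_probs):
        if p < threshold:
            break
        accepted.append(tok)
    return accepted

if __name__ == "__main__":
    print(verify_drafts(["the", "cat", "sat", "on"], [0.9, 0.6, 0.2, 0.8]))  # -> ['the', 'cat']
```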
pdf
bib
abs
ViLBench: A Suite for Vision-Language Process Reward Modeling
Haoqin Tu
|
Weitao Feng
|
Hardy Chen
|
Hui Liu
|
Xianfeng Tang
|
Cihang Xie
Process-supervised reward models (PRMs) serve as fine-grained functions that provide detailed step-wise feedback on model responses, facilitating effective selection of reasoning trajectories for complex tasks. Despite their advantages, the evaluation of PRMs remains underexplored, especially in the multimodal domain. To address this gap, this paper first benchmarks current vision large language models (VLLMs) as two types of reward models, output reward models (ORMs) and PRMs, on multiple vision-language benchmarks, which reveals that neither ORMs nor PRMs consistently outperform across all tasks, and that superior VLLMs do not necessarily yield better rewarding performance. To further advance evaluation, we introduce ViLBench, a vision-language benchmark designed to require intensive process reward signals. Notably, OpenAI’s GPT-4o with Chain-of-Thought (CoT) achieves only 27.3% accuracy, challenging current VLLMs. Lastly, we preliminarily showcase a promising pathway towards bridging the gap between general VLLMs and reward models: by collecting 73.6K vision-language process reward examples using an enhanced tree-search algorithm, our 3B model achieves an average improvement of 3.3% over standard CoT and up to 2.5% over its untrained counterpart on ViLBench by selecting OpenAI o1’s generations. We will release our code, model, and data at https://ucsc-vlaa.github.io/ViLBench.
pdf
bib
abs
Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering
Hwan Chang
|
Yumin Kim
|
Yonghyun Jun
|
Hwanhee Lee
As Large Language Models (LLMs) are increasingly deployed in sensitive domains such as enterprise and government, ensuring that they adhere to **user-defined security policies** within context is critical, especially with respect to information non-disclosure. While prior LLM studies have focused on general safety and socially sensitive data, large-scale benchmarks for **contextual security** preservation against attacks remain lacking. To address this, we introduce a novel large-scale benchmark dataset, **CoPriva**, evaluating LLM adherence to contextual non-disclosure policies in question answering. Derived from realistic contexts, our dataset includes explicit policies and queries designed as direct and challenging indirect attacks seeking prohibited information. We evaluate 10 LLMs on our benchmark and reveal a significant vulnerability: many models violate user-defined policies and leak sensitive information. This failure is particularly severe against indirect attacks, highlighting a critical gap in current LLM safety alignment for sensitive applications. Our analysis reveals that while models can often identify the correct answer to a query, they struggle to incorporate policy constraints during generation. In contrast, they exhibit a partial ability to revise outputs when explicitly prompted. Our findings underscore the urgent need for more robust methods to guarantee contextual security.
pdf
bib
abs
Route Sparse Autoencoder to Interpret Large Language Models
Wei Shi
|
Sihang Li
|
Tao Liang
|
Mingyang Wan
|
Guojun Ma
|
Xiang Wang
|
Xiangnan He
Mechanistic interpretability of large language models (LLMs) aims to uncover the internal processes of information propagation and reasoning. Sparse autoencoders (SAEs) have demonstrated promise in this domain by extracting interpretable and monosemantic features. However, prior works primarily focus on feature extraction from a single layer, failing to effectively capture activations that span multiple layers. In this paper, we introduce Route Sparse Autoencoder (RouteSAE), a new framework that integrates a routing mechanism with a shared SAE to efficiently extract features from multiple layers. It dynamically assigns weights to activations from different layers, incurring minimal parameter overhead while achieving high interpretability and flexibility for targeted feature manipulation. We evaluate RouteSAE through extensive experiments on Llama-3.2-1B-Instruct. Specifically, under the same sparsity constraint of 64, RouteSAE extracts 22.5% more features than baseline SAEs while achieving a 22.3% higher interpretability score. These results underscore the potential of RouteSAE as a scalable and effective method for LLM interpretability, with applications in feature discovery and model intervention. Our code is available at
https://github.com/swei2001/RouteSAEs.
pdf
bib
abs
BTS: Harmonizing Specialized Experts into a Generalist LLM
Qizhen Zhang
|
Prajjwal Bhargava
|
Chloe Bi
|
Chris X. Cai
|
Jakob Nicolaus Foerster
|
Jeremy Fu
|
Punit Singh Koura
|
Ruan Silva
|
Sheng Shen
|
Emily Dinan
|
Suchin Gururangan
|
Mike Lewis
We present Branch-Train-Stitch (BTS), an efficient and flexible training algorithm for combining independently trained large language model (LLM) experts into a single, capable generalist model. Following Li et al., we start with a single seed language model which is branched into domain-specific (e.g., coding or math) experts with continual pretraining. BTS combines experts into a generalist model using lightweight stitch layers, which are inserted between frozen experts and the seed LLM, and trained on a small datamix of the expert domains. Stitch layers enable the seed LLM to integrate representations from any number of experts during the forward pass, allowing it to generalize to new domains, despite remaining frozen. Because BTS does not alter the constituent LLMs, BTS provides a modular and flexible approach: experts can be easily removed and new experts can be added with only a small amount of training. Compared to alternative model merging approaches, BTS yields the best generalist performance on a variety of downstream tasks, retaining the specialized capabilities of each of the experts.
pdf
bib
abs
CoCoA: Confidence- and Context-Aware Adaptive Decoding for Resolving Knowledge Conflicts in Large Language Models
Anant Khandelwal
|
Manish Gupta
|
Puneet Agrawal
Faithful generation in large language models (LLMs) is challenged by knowledge conflicts between parametric memory and external context. Existing contrastive decoding methods tuned specifically to handle conflict often lack adaptability and can degrade performance in low conflict settings. We introduce CoCoA (Confidence- and Context-Aware Adaptive Decoding), a novel token-level algorithm for principled conflict resolution and enhanced faithfulness. CoCoA resolves conflict by utilizing confidence-aware measures (entropy gap and contextual peakedness) and the generalized divergence between the parametric and contextual distributions. Crucially, CoCoA maintains strong performance even in low conflict settings. Extensive experiments across multiple LLMs on diverse Question Answering (QA), Summarization, and Long-Form Question Answering (LFQA) benchmarks demonstrate CoCoA’s state-of-the-art performance over strong baselines like AdaCAD. It yields significant gains in QA accuracy, up to 9.2 points on average compared to the strong baseline AdaCAD, and improves factuality in summarization and LFQA by up to 2.5 points on average across key benchmarks. Additionally, it demonstrates superior sensitivity to conflict variations. CoCoA enables more informed, context-aware, and ultimately more faithful token generation.
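A minimal numpy sketch of confidence- and context-aware mixing at the token level; the entropy-gap and JS-divergence signals follow the abstract, but the specific gating rule below is an illustrative assumption, not the paper's formula.

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    p = p / p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float((a * (np.log(a + 1e-12) - np.log(b + 1e-12))).sum())
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def adaptive_mix(p_parametric: np.ndarray, p_contextual: np.ndarray) -> np.ndarray:
    """Lean on the contextual distribution more when (i) it is peaked (low entropy)
    relative to the parametric one and (ii) the two distributions disagree."""
    gap = entropy(p_parametric) - entropy(p_contextual)   # confidence signal
    conflict = js_divergence(p_parametric, p_contextual)  # disagreement signal
    alpha = 1.0 / (1.0 + np.exp(-(gap + conflict)))       # weight placed on the context
    mixed = alpha * p_contextual + (1.0 - alpha) * p_parametric
    return mixed / mixed.sum()

if __name__ == "__main__":
    p_mem = np.array([0.70, 0.20, 0.10])  # parametric memory prefers token 0
    p_ctx = np.array([0.05, 0.90, 0.05])  # context strongly supports token 1
    print(adaptive_mix(p_mem, p_ctx))
```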
pdf
bib
abs
R-Bind: Unified Enhancement of Attribute and Relation Binding in Text-to-Image Diffusion Models
Huixuan Zhang
|
Xiaojun Wan
Text-to-image models frequently fail to achieve perfect alignment with textual prompts, particularly in maintaining proper semantic binding between semantic elements in the given prompt. Existing approaches typically require costly retraining or focus on only correctly generating the attributes of entities (entity-attribute binding), ignoring the cruciality of correctly generating the relations between entities (entity-relation-entity binding), resulting in unsatisfactory semantic binding performance. In this work, we propose a novel training-free method R-Bind that simultaneously improves both entity-attribute and entity-relation-entity binding. Our method introduces three inference-time optimization losses that adjust attention maps during generation. Comprehensive evaluations across multiple datasets demonstrate our approach’s effectiveness, validity, and flexibility in enhancing semantic binding without additional training.
pdf
bib
abs
Middo: Model-Informed Dynamic Data Optimization for Enhanced LLM Fine-Tuning via Closed-Loop Learning
Zinan Tang
|
Xin Gao
|
Qizhi Pei
|
Zhuoshi Pan
|
Mengzhang Cai
|
Jiang Wu
|
Conghui He
|
Lijun Wu
Supervised Fine-Tuning (SFT) of Large Language Models (LLMs) fundamentally relies on high-quality training data. While data selection and data synthesis are two common strategies to improve data quality, existing approaches often face limitations in static dataset curation that fail to adapt to evolving model capabilities. In this paper, we introduce **Middo**, a self-evolving **M**odel-**i**nformed **d**ynamic **d**ata **o**ptimization framework that uses model-aware data selection and context-preserving data refinement. Unlike conventional one-off filtering/synthesis methods, our framework establishes a closed-loop optimization system: (1) A self-referential diagnostic module proactively identifies suboptimal samples through tri-axial model signals - *loss patterns (complexity)*, *embedding cluster dynamics (diversity)*, and *self-alignment scores (quality)*; (2) An adaptive optimization engine then transforms suboptimal samples into pedagogically valuable training points while preserving semantic integrity; (3) This optimization process continuously evolves with model capability through dynamic learning principles. Experiments on multiple benchmarks demonstrate that Middo consistently enhances the quality of seed data and boosts LLMs’ performance, improving accuracy by 7.15% on average while maintaining the original dataset scale. This work establishes a new paradigm for sustainable LLM training through dynamic human-AI co-evolution of data and models.
pdf
bib
abs
Information Integration in Large Language Models is Gated by Linguistic Structural Markers
Wei Liu
|
Nai Ding
Language comprehension relies on integrating information across both local words and broader context. We propose a method to quantify the information integration window of large language models (LLMs) and examine how sentence and clause boundaries constrain this window. Specifically, LLMs are required to predict a target word based on either a local window (local prediction) or the full context (global prediction), and we use Jensen-Shannon (JS) divergence to measure the information loss from relying solely on the local window, termed the local-prediction deficit. Results show that integration windows of both humans and LLMs are strongly modulated by sentence boundaries, and predictions primarily rely on words within the same sentence or clause: The local-prediction deficit follows a power-law decay as the window length increases and drops sharply at the sentence boundary. This boundary effect is primarily attributed to linguistic structural markers, e.g., punctuation, rather than implicit syntactic or semantic cues. Together, these results indicate that LLMs rely on explicit structural cues to guide their information integration strategy.
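The local-prediction deficit described above reduces to a JS divergence between two next-word distributions, as in this minimal numpy sketch; the two distributions are placeholders for actual model outputs.

```python
import numpy as np

def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float((a * (np.log(a + 1e-12) - np.log(b + 1e-12))).sum())
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def local_prediction_deficit(p_local: np.ndarray, p_global: np.ndarray) -> float:
    """Information lost by predicting the target word from the local window only."""
    return js_divergence(p_local, p_global)

if __name__ == "__main__":
    p_global = np.array([0.6, 0.3, 0.1])  # next-word distribution given the full context
    p_local = np.array([0.4, 0.4, 0.2])   # next-word distribution given the local window
    print(round(local_prediction_deficit(p_local, p_global), 4))
```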
pdf
bib
abs
Why and How LLMs Benefit from Knowledge Introspection in Commonsense Reasoning
Chengfeng Zhao
|
Shizhu He
|
Shanshan Jiang
|
Bin Dong
|
Jun Zhao
|
Kang Liu
Large Language Models (LLMs) can improve commonsense reasoning through generating intermediate knowledge. However, the effectiveness of this knowledge introspection is not always guaranteed. This paper first systematically investigates and reveals an **introspection paradox**: while simple introspection tends to benefit weaker models, it often degrades the performance of stronger ones, particularly on simpler tasks. Our deep analysis indicates that this paradox arises from a complex interplay among model capability, task difficulty and the quality of generated knowledge. Further interpretability analysis reveals the origins of low-quality knowledge generation. To better employ introspected knowledge in LLM, this paper proposes a training-free **Adaptive Introspection Strategy** that operates in two stages using only the model’s internal states: **Knowledge Detection**, which dynamically identifies and discards potentially low-quality knowledge, and **Knowledge Regeneration**, which employs attention smoothing to guide the model away from harmful failure modes during knowledge generation. Extensive experiments on five Llama models with different sizes and eight commonsense reasoning benchmarks demonstrate that our approach effectively mitigates the limitations of standard introspection and has consistent performance gains across almost all settings.
pdf
bib
abs
GraDaSE: Graph-Based Dataset Search with Examples
Jing He
|
Mingyang Lv
|
Qing Shi
|
Gong Cheng
Dataset search is a specialized information retrieval task. In the emerging scenario of Dataset Search with Examples (DSE), the user submits a query and a few target datasets that are known to be relevant as examples. The retrieved datasets are expected to be relevant to the query and also similar to the target datasets. Distinguished from existing text-based retrievers, we propose a graph-based approach GraDaSE. Besides the textual metadata of the datasets, we identify their provenance-based and topic-based relationships to construct a graph, and jointly encode their structural and textual information for ranking candidate datasets. GraDaSE outperforms a variety of strong baselines on two test collections, including DataFinder-E that we construct.
pdf
bib
abs
Confidence-guided Refinement Reasoning for Zero-shot Question Answering
Youwon Jang
|
Woo Suk Choi
|
Minjoon Jung
|
Minsu Lee
|
Byoung-Tak Zhang
We propose Confidence-guided Refinement Reasoning (C2R), a novel training-free framework applicable to question-answering (QA) tasks across text, image, and video domains. C2R strategically constructs and refines sub-questions and their answers (sub-QAs), deriving a better confidence score for the target answer. C2R first curates a subset of sub-QAs to explore diverse reasoning paths, then compares the confidence scores of the resulting answer candidates to select the most reliable final answer. Since C2R relies solely on confidence scores derived from the model itself, it can be seamlessly integrated with various existing QA models, demonstrating consistent performance improvements across diverse models and benchmarks. Furthermore, we provide essential yet underexplored insights into how leveraging sub-QAs affects model behavior, specifically analyzing the impact of both the quantity and quality of sub-QAs on achieving robust and reliable reasoning.
pdf
bib
abs
DICE: Structured Reasoning in LLMs through SLM-Guided Chain-of-Thought Correction
Yiqi Li
|
Yusheng Liao
|
Zhe Chen
|
Yanfeng Wang
|
Yu Wang
When performing reasoning tasks with user-specific requirements, such as strict output formats, large language models (LLMs) often prioritize reasoning over adherence to detailed instructions. Fine-tuning LLMs on supervised datasets to address this is impractical due to high computational costs and limited parameter access. To tackle this, we propose DICE, a lightweight framework that guides small language models (SLMs) to refine LLMs’ outputs through chain-of-thought (CoT) correction. DICE decouples the process by first prompting LLMs to generate natural language responses, then using trained SLMs to analyze and refine these outputs to meet structured output specifications. This framework preserves LLMs’ broad knowledge and reasoning capabilities while ensuring the outputs conform to user demands. Specifically, DICE first constructs structured CoT adaptation datasets via a two-stage method and subsequently applies a dual-tuning strategy to fine-tune SLMs for generating structured outputs in an analyze-then-answer pattern. Experiments demonstrate that DICE improves the average format accuracy and content correctness of LLM outputs by 35.4% and 29.4%, respectively, achieving state-of-the-art (SOTA) performance over other competitive baselines.
pdf
bib
abs
CTCC: A Robust and Stealthy Fingerprinting Framework for Large Language Models via Cross-Turn Contextual Correlation Backdoor
Zhenhua Xu
|
Xixiang Zhao
|
Xubin Yue
|
Shengwei Tian
|
Changting Lin
|
Meng Han
The widespread deployment of large language models (LLMs) has intensified concerns around intellectual property (IP) protection, as model theft and unauthorized redistribution become increasingly feasible. To address this, model fingerprinting aims to embed verifiable ownership traces into LLMs. However, existing methods face inherent trade-offs between stealthiness, robustness, and generalizability—being either detectable via distributional shifts, vulnerable to adversarial modifications, or easily invalidated once the fingerprint is revealed. In this work, we introduce CTCC, a novel rule-driven fingerprinting framework that encodes contextual correlations across multiple dialogue turns—such as counterfactual—rather than relying on token-level or single-turn triggers. CTCC enables fingerprint verification under black-box access while mitigating false positives and fingerprint leakage, supporting continuous construction under a shared semantic rule even if partial triggers are exposed. Extensive experiments across multiple LLM architectures demonstrate that CTCC consistently achieves stronger stealth and robustness than prior work. Our findings position CTCC as a reliable and practical solution for ownership verification in real-world LLM deployment scenarios.
pdf
bib
abs
Realistic Training Data Generation and Rule Enhanced Decoding in LLM for NameGuess
Yikuan Xia
|
Jiazun Chen
|
Sujian Li
|
Jun Gao
The wide use of abbreviated column names (derived from English words or Chinese Pinyin) in database tables poses significant challenges for table-centric tasks in natural language processing and database management. This column name expansion task, referred to as the NameGuess task, has previously been addressed by fine-tuning Large Language Models (LLMs) on synthetically generated rule-based data. However, the current approaches yield suboptimal performance due to two fundamental limitations: 1) the rule-generated abbreviation data fails to reflect the real-world distribution, and 2) LLMs persistently fail to follow the rule-sensitive patterns in NameGuess. For the data realism issue, we propose a novel approach that integrates a subsequence abbreviation generator trained on human-annotated data and collects non-subsequence abbreviations to improve the training set. For the rule violation issue, we propose a decoding system constrained by an automaton that represents the rules of abbreviation expansion. We extend the original English NameGuess test set to include non-subsequence and PinYin scenarios. Experimental results show that properly tuned 7/8B moderate-size LLMs with a refined decoding system can surpass the few-shot performance of state-of-the-art LLMs, such as the GPT-4 series. The code and data are presented in the supplementary material.
pdf
bib
abs
EverTracer: Hunting Stolen Large Language Models via Stealthy and Robust Probabilistic Fingerprint
Zhenhua Xu
|
Meng Han
|
Wenpeng Xing
The proliferation of large language models (LLMs) has intensified concerns over model theft and license violations, necessitating robust and stealthy ownership verification. Existing fingerprinting methods either require impractical white-box access or introduce detectable statistical anomalies. We propose EverTracer, a novel gray-box fingerprinting framework that ensures stealthy and robust model provenance tracing. EverTracer is the first to repurpose Membership Inference Attacks (MIAs) for defensive use, embedding ownership signals via memorization instead of artificial trigger-output overfitting. It consists of Fingerprint Injection, which fine-tunes the model on any natural language data without detectable artifacts, and Verification, which leverages a calibrated probability variation signal to distinguish fingerprinted models. This approach remains robust against adaptive adversaries, including input-level and model-level modifications. Extensive experiments across architectures demonstrate EverTracer’s state-of-the-art effectiveness, stealthiness, and resilience, establishing it as a practical solution for securing LLM intellectual property.
pdf
bib
abs
Selective Preference Optimization via Token-Level Reward Function Estimation
Kailai Yang
|
Zhiwei Liu
|
Qianqian Xie
|
Jimin Huang
|
Erxue Min
|
Sophia Ananiadou
Recent advancements in LLM alignment leverage token-level supervisions to perform fine-grained preference optimization. However, existing token-level alignment methods either optimize on all available tokens, which can be noisy and inefficient, or perform selective training with complex and expensive key token selection strategies. In this work, we propose Selective Preference Optimization (SePO), a novel selective alignment strategy that centers on efficient key token selection without requiring strong, fine-grained supervision signals. We theoretically prove the feasibility of Direct Preference Optimization (DPO) as a token-level reward function estimator, which applies to any existing alignment dataset and enables cost-efficient token selection with small-scale model sizes and training data. We then train an oracle model with DPO on the target data and utilize the estimated reward function to score all tokens within the target dataset, where only the key tokens are selected to supervise the target policy model with a contrastive objective function. Extensive experiments on three public evaluation benchmarks show that SePO significantly outperforms competitive baseline methods by optimizing on only 30% key tokens, with up to a 60% reduction in GPU training hours. We also explore SePO as a new paradigm for weak-to-strong generalization, showing that weak oracle models effectively supervise strong policy models with up to 16.8× more parameters. SePO also selects useful supervision signals from out-of-distribution data, alleviating the over-optimization problem.
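A minimal sketch of token selection with a DPO-trained oracle, assuming per-token log-probabilities from the oracle and reference models are available; scoring tokens by a scaled log-ratio and keeping a fixed fraction is an illustrative reading of the abstract, not the exact objective.

```python
import numpy as np

def token_rewards(logp_oracle: np.ndarray, logp_ref: np.ndarray, beta: float = 0.1) -> np.ndarray:
    """DPO-style token-level reward estimate: scaled log-ratio of oracle vs. reference."""
    return beta * (logp_oracle - logp_ref)

def select_key_tokens(rewards: np.ndarray, keep_fraction: float = 0.3) -> np.ndarray:
    """Return indices of the highest-magnitude-reward tokens to supervise the policy on."""
    k = max(1, int(len(rewards) * keep_fraction))
    return np.argsort(-np.abs(rewards))[:k]

if __name__ == "__main__":
    logp_oracle = np.log(np.array([0.5, 0.1, 0.3, 0.05, 0.9]))
    logp_ref = np.log(np.array([0.4, 0.2, 0.3, 0.2, 0.5]))
    rewards = token_rewards(logp_oracle, logp_ref)
    print(select_key_tokens(rewards))   # indices of the selected key tokens
```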
pdf
bib
abs
Arena-lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons
Seonil Son
|
Ju-Min Oh
|
Heegon Jin
|
Cheolhun Jang
|
Jeongbeom Jeong
|
KunTae Kim
As Large Language Models (LLMs) expand across domains, LLM judges have become essential for system evaluation. Current benchmarks typically compare system outputs against baselines. This baseline-mediated approach, though convenient, yields lower reliability than direct comparison between systems. We propose Arena-Lite, which integrates a tournament structure on top of head-to-head comparisons. The application of a tournament structure and direct comparison eliminates the need for baseline outputs, reduces the number of required comparisons, and allows higher reliability in system rankings. We conducted two experiments: (1) controlled stochastic modeling and (2) empirical validation with a real LLM judge. These experiments collectively demonstrate that Arena-Lite consistently achieves higher reliability with fewer comparisons, even with smaller datasets or weaker judges. We release an easy-to-use web demonstration and code to foster adoption of Arena-Lite, streamlining model selection across research and industry communities. The Arena-Lite demo and code are available at https://huggingface.co/spaces/NCSOFT/ArenaLite
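A minimal sketch of a single-elimination bracket over system outputs with a pairwise judge; the `judge` callable stands in for an LLM judge, and the bracket is one simple instance of the tournament idea rather than Arena-Lite's exact procedure.

```python
def tournament_winner(outputs: dict[str, str], judge) -> str:
    """Run a single-elimination bracket over systems' outputs for one prompt.
    `judge(a, b)` returns True if output `a` beats output `b` head-to-head."""
    contenders = list(outputs)
    while len(contenders) > 1:
        next_round = []
        # Pair up contenders; an odd one out gets a bye into the next round.
        for i in range(0, len(contenders) - 1, 2):
            a, b = contenders[i], contenders[i + 1]
            next_round.append(a if judge(outputs[a], outputs[b]) else b)
        if len(contenders) % 2 == 1:
            next_round.append(contenders[-1])
        contenders = next_round
    return contenders[0]

if __name__ == "__main__":
    outs = {"sys_a": "short answer", "sys_b": "detailed answer", "sys_c": "ok answer"}
    longer_wins = lambda x, y: len(x) > len(y)   # toy judge for demonstration
    print(tournament_winner(outs, longer_wins))  # -> sys_b
```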
pdf
bib
abs
Addressing Tokenization Inconsistency in Steganography and Watermarking Based on Large Language Models
Ruiyi Yan
|
Yugo Murawaki
Large language models have significantly enhanced the capacities and efficiency of text generation. On the one hand, they have improved the quality of text-based *steganography*. On the other hand, they have also underscored the importance of *watermarking* as a safeguard against malicious misuse. In this study, we focus on tokenization inconsistency (TI) between Alice and Bob in steganography and watermarking, where TI can undermine robustness. Our investigation reveals that the problematic tokens responsible for TI exhibit two key characteristics: **infrequency** and **temporariness**. Based on these findings, we propose two tailored solutions for TI elimination: *a stepwise verification* method for steganography and *a post-hoc rollback* method for watermarking. Experiments show that (1) compared to traditional disambiguation methods in steganography, directly addressing TI leads to improvements in fluency, imperceptibility, and anti-steganalysis capacity; (2) for watermarking, addressing TI enhances detectability and robustness against attacks.
pdf
bib
abs
ExeCoder: Empowering Large Language Models with Executability Representation for Code Translation
Minghua He
|
Yue Chen
|
Fangkai Yang
|
Pu Zhao
|
Wenjie Yin
|
Yu Kang
|
Qingwei Lin
|
Saravan Rajmohan
|
Dongmei Zhang
Code translation is a crucial activity in the software development and maintenance process, and researchers have recently begun to focus on using pre-trained large language models (LLMs) for code translation. However, existing LLMs only learn the contextual semantics of code during pre-training, neglecting executability information closely related to the execution state of the code, which results in unguaranteed code executability and unreliable automated code translation. To address this issue, we propose ExeCoder, an LLM specifically designed for code translation, aimed at utilizing executability representations such as functional semantics, syntax structures, and variable dependencies to enhance the capabilities of LLMs in code translation. To evaluate the effectiveness of ExeCoder, we manually enhanced the widely used benchmark TransCoder-test, resulting in a benchmark called TransCoder-test-X suitable for evaluating LLMs. Evaluation on TransCoder-test-X indicates that ExeCoder achieves state-of-the-art performance in code translation, surpassing existing open-source code LLMs by 10.88% to 38.78% and 27.44% to 42.97% on two metrics, and even outperforming the renowned closed-source LLM GPT-4o. Code is available at https://aka.ms/execoder
pdf
bib
abs
TableEval: A Real-World Benchmark for Complex, Multilingual, and Multi-Structured Table Question Answering
Junnan Zhu
|
Jingyi Wang
|
Bohan Yu
|
Xiaoyu Wu
|
Junbo Li
|
Lei Wang
|
Nan Xu
LLMs have shown impressive progress in natural language processing. However, they still face significant challenges in TableQA, where real-world complexities such as diverse table structures, multilingual data, and domain-specific reasoning are crucial. Existing TableQA benchmarks are often limited by their focus on simple flat tables and suffer from data leakage. Furthermore, most benchmarks are monolingual and fail to capture the cross-lingual and cross-domain variability in practical applications. To address these limitations, we introduce TableEval, a new benchmark designed to evaluate LLMs on realistic TableQA tasks. Specifically, TableEval includes tables with various structures (such as concise, hierarchical, and nested tables) collected from four domains (including government, finance, academia, and industry reports). Besides, TableEval features cross-lingual scenarios with tables in Simplified Chinese, Traditional Chinese, and English. To minimize the risk of data leakage, we collect all data from recent real-world documents. Considering that existing TableQA metrics fail to capture semantic accuracy, we further propose SEAT, a new evaluation framework that assesses the alignment between model responses and reference answers at the sub-question level. Experimental results have shown that SEAT achieves high agreement with human judgment. Extensive experiments on TableEval reveal critical gaps in the ability of state-of-the-art LLMs to handle these complex, real-world TableQA tasks, offering insights for future improvements.
pdf
bib
abs
NOVA-63: Native Omni-lingual Versatile Assessments of 63 Disciplines
Jinyang Zhang
|
Kexin Yang
|
Yu Wan
|
Muyang Ye
|
Baosong Yang
|
Fei Huang
|
Junyang Lin
|
Dayiheng Liu
The multilingual capabilities of large language models (LLMs) have attracted considerable attention over the past decade. Assessing the accuracy with which LLMs provide answers in multilingual contexts is essential for determining their level of multilingual proficiency. Nevertheless, existing multilingual benchmarks generally reveal severe drawbacks, such as overly translated content (translationese), the absence of difficulty control, constrained diversity, and disciplinary imbalance, making the benchmarking process unreliable and unconvincing. To alleviate those shortcomings, we introduce NOVA-63 (Native Omni-lingual Versatile Assessments of 63 Disciplines), a comprehensive, difficult multilingual benchmark featuring 93,536 questions sourced from native speakers across 14 languages and 63 academic disciplines. Leveraging a robust pipeline that integrates LLM-assisted formatting, expert quality verification, and multi-level difficulty screening, NOVA-63 is balanced across disciplines with consistent difficulty standards while maintaining authentic linguistic elements. Extensive experimentation with current LLMs has yielded significant insights into cross-lingual consistency among language families and exposed notable disparities in models’ capabilities across various disciplines. This work provides valuable benchmarking data for the future development of multilingual models. Furthermore, our findings underscore the importance of moving beyond overall scores and instead conducting fine-grained analyses of model performance.
pdf
bib
abs
InfoGain-RAG: Boosting Retrieval-Augmented Generation through Document Information Gain-based Reranking and Filtering
Zihan Wang
|
Zihan Liang
|
Zhou Shao
|
Yufei Ma
|
Huangyu Dai
|
Ben Chen
|
Lingtao Mao
|
Chenyi Lei
|
Yuqing Ding
|
Han Li
Retrieval-Augmented Generation (RAG) has emerged as a promising approach to address key limitations of Large Language Models (LLMs), such as hallucination, outdated knowledge, and the lack of reliable references. However, current RAG frameworks often struggle with identifying whether retrieved documents meaningfully contribute to answer generation. This shortcoming makes it difficult to filter out irrelevant or even misleading content, which notably impacts the final performance. In this paper, we propose Document Information Gain (DIG), a novel metric designed to quantify the contribution of retrieved documents to correct answer generation. DIG measures a document’s value by computing the difference in the LLM’s generation confidence with and without the document provided as augmentation. Further, we introduce InfoGain-RAG, a framework that leverages DIG scores to train a specialized reranker, which prioritizes each retrieved document from the perspectives of exact distinguishing and accurate sorting. This approach can effectively filter out irrelevant documents and select the most valuable ones for better answer generation. Extensive experiments across various models and benchmarks demonstrate that InfoGain-RAG can significantly outperform existing approaches, under both single and multiple retrieval paradigms. Specifically, on NaturalQA it achieves improvements of 17.9%, 4.5%, and 12.5% in exact-match accuracy over naive RAG, self-reflective RAG, and modern ranking-based RAG, respectively, and even an average gain of 15.3% on the advanced proprietary model GPT-4o across all datasets. These results demonstrate the feasibility of InfoGain-RAG, which offers a reliable solution for RAG in multiple applications.
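The abstract defines DIG as the difference in the LLM's confidence in generating the correct answer with and without a retrieved document. A minimal sketch of how such a score might be computed, assuming a causal LM scored via mean token log-probability of the gold answer; the model choice and prompt format below are illustrative assumptions, not the paper's implementation:

```python
# Hypothetical sketch of a Document Information Gain (DIG) style score:
# confidence in the gold answer with vs. without the retrieved document.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def answer_logprob(context: str, answer: str) -> float:
    """Mean log-probability of the answer tokens given the context."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    ans_ids = tok(answer, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, ans_ids], dim=1)
    with torch.no_grad():
        logits = lm(ids).logits
    # Log-probs for each position's next token; answer tokens start at ctx length.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    ans_positions = range(ctx_ids.shape[1] - 1, ids.shape[1] - 1)
    token_lp = [logprobs[0, pos, ids[0, pos + 1]] for pos in ans_positions]
    return torch.stack(token_lp).mean().item()

def dig(question: str, document: str, answer: str) -> float:
    """Positive DIG: the document raises confidence in the correct answer."""
    with_doc = answer_logprob(f"{document}\nQuestion: {question}\nAnswer:", answer)
    without_doc = answer_logprob(f"Question: {question}\nAnswer:", answer)
    return with_doc - without_doc

print(dig("Who wrote Hamlet?",
          "Hamlet is a tragedy by William Shakespeare.",
          " William Shakespeare"))
```

In a reranking setting, such per-document scores could then serve as training targets for the reranker described in the abstract.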
pdf
bib
abs
SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning
Yicheng Ji
|
Jun Zhang
|
Heming Xia
|
Jinpeng Chen
|
Lidan Shou
|
Gang Chen
|
Huan Li
Video large language models (Vid-LLMs) have shown strong capabilities in understanding video content. However, their reliance on dense video token representations introduces substantial memory and computational overhead in both prefilling and decoding. To mitigate the information loss of recent video token reduction methods and accelerate the decoding stage of Vid-LLMs losslessly, we introduce SpecVLM, a training-free speculative decoding (SD) framework tailored for Vid-LLMs that incorporates staged video token pruning. Building on our novel finding that the draft model’s speculation exhibits low sensitivity to video token pruning, SpecVLM prunes up to 90% of video tokens to enable efficient speculation without sacrificing accuracy. To achieve this, we perform a two-stage pruning process: Stage I selects highly informative tokens guided by attention signals from the verifier (target model), while Stage II prunes remaining redundant ones in a spatially uniform manner. Extensive experiments on four video understanding benchmarks demonstrate the effectiveness and robustness of SpecVLM, which achieves up to 2.68× decoding speedup for LLaVA-OneVision-72B and 2.11× speedup for Qwen2.5-VL-32B. Code is available at https://github.com/zju-jiyicheng/SpecVLM.
pdf
bib
abs
What Do Indonesians Really Need from Language Technology? A Nationwide Survey
Muhammad Dehan Al Kautsar
|
Lucky Susanto
|
Derry Tanti Wijaya
|
Fajri Koto
Despite emerging efforts to develop NLP for Indonesia’s 700+ local languages, progress remains costly due to the need for direct engagement with native speakers. However, it is unclear what these language communities truly need from language technology. To address this, we conduct a nationwide survey to assess the actual needs of native Indonesian speakers. Our findings indicate that addressing language barriers, particularly through machine translation and information retrieval, is the most critical priority. Although there is strong enthusiasm for advancements in language technology, concerns around privacy, bias, and the use of public data for AI training highlight the need for greater transparency and clear communication to support broader AI adoption.
pdf
bib
abs
LEO-MINI: An Efficient Multimodal Large Language Model using Conditional Token Reduction and Mixture of Multi-Modal Experts
Yimu Wang
|
Mozhgan Nasr Azadani
|
Sean Sedwards
|
Krzysztof Czarnecki
Redundancy of visual tokens in multi-modal large language models (MLLMs) significantly reduces their computational efficiency. Recent approaches, such as resamplers and summarizers, have sought to reduce the number of visual tokens, but at the cost of visual reasoning ability. To address this, we propose LEO-Mini, a novel MLLM that significantly reduces the number of visual tokens and simultaneously boosts visual reasoning capabilities. For efficiency, LEO-Mini incorporates CoTR, a novel token reduction module to consolidate a large number of visual tokens into a smaller set of tokens, using the similarity between visual tokens, text tokens, and a compact learnable query. For effectiveness, to scale up the model’s ability with minimal computational overhead, LEO-Mini employs MMoE, a novel mixture of multi-modal experts module. MMoE employs a set of LoRA experts with a novel router to switch between them based on the input text and visual tokens instead of only using the input hidden state. MMoE also includes a general LoRA expert that is always activated to learn general knowledge for LLM reasoning. For extracting richer visual features, MMoE employs a set of vision experts trained on diverse domain-specific data. To demonstrate LEO-Mini’s improved efficiency and performance, we evaluate it against existing efficient MLLMs on various benchmark vision-language tasks.
pdf
bib
abs
Confounding Factors in Relating Model Performance to Morphology
Wessel Poelman
|
Thomas Bauwens
|
Miryam de Lhoneux
The extent to which individual language characteristics influence tokenization and language modeling is an open question. Differences in morphological systems have been suggested as both unimportant and crucial to consider (Cotterell et al., 2018; Gerz et al., 2018a; Park et al., 2021, inter alia). We argue this conflicting evidence is due to confounding factors in experimental setups, making it hard to compare results and draw conclusions. We identify confounding factors in analyses trying to answer the question of whether, and how, morphology relates to language modeling. Next, we re-assess three hypotheses by Arnett & Bergen (2025) for why modeling agglutinative languages results in higher perplexities than fusional languages: they look at morphological alignment of tokenization, tokenization efficiency, and dataset size. We show that each conclusion includes confounding factors. Finally, we introduce token bigram metrics as an intrinsic way to predict the difficulty of causal language modeling, and find that they are gradient proxies for morphological complexity that do not require expert annotation. Ultimately, we outline necessities to reliably answer whether, and how, morphology relates to language modeling.
pdf
bib
abs
Context-Aware Membership Inference Attacks against Pre-trained Large Language Models
Hongyan Chang
|
Ali Shahin Shamsabadi
|
Kleomenis Katevas
|
Hamed Haddadi
|
Reza Shokri
Membership Inference Attacks (MIAs) on pre-trained Large Language Models (LLMs) aim at determining if a data point was part of the model’s training set. Prior MIAs that are built for classification models fail at LLMs, due to ignoring the generative nature of LLMs across token sequences. In this paper, we present a novel attack on pre-trained LLMs that adapts MIA statistical tests to the perplexity dynamics of subsequences within a data point. Our method significantly outperforms prior approaches, revealing context-dependent memorization patterns in pre-trained LLMs.
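The abstract describes adapting membership-inference statistics to the perplexity dynamics of subsequences within a data point. The sliding-window scheme and the final statistic below are illustrative assumptions chosen for exposition, not the attack defined in the paper:

```python
# Illustrative only: one way "perplexity dynamics of subsequences" could be
# turned into a membership score from per-token log-probabilities.
import numpy as np

def window_perplexities(token_logprobs, window=16, stride=8):
    """Perplexity of each sliding window of token log-probabilities."""
    lp = np.asarray(token_logprobs, dtype=float)
    ppls = []
    for start in range(0, max(1, len(lp) - window + 1), stride):
        chunk = lp[start:start + window]
        ppls.append(float(np.exp(-chunk.mean())))
    return np.array(ppls)

def membership_score(token_logprobs, window=16, stride=8):
    """Higher score -> more member-like (an unusually low-perplexity span exists)."""
    ppls = window_perplexities(token_logprobs, window, stride)
    # Compare against a threshold calibrated on known non-member data.
    return -ppls.min()

# Toy usage with synthetic per-token log-probs (e.g., from an LLM forward pass).
rng = np.random.default_rng(0)
candidate = rng.normal(-3.0, 0.5, size=200)
candidate[40:60] = -0.5   # an unusually well-predicted span
print(membership_score(candidate))
```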
pdf
bib
abs
Formalizing Style in Personal Narratives
Gustave Cortal
|
Alain Finkel
Personal narratives are stories authors construct to make meaning of their experiences. Style, the distinctive way authors use language to express themselves, is fundamental to how these narratives convey subjective experiences. Yet there is a lack of a formal framework for systematically analyzing these stylistic choices. We present a novel approach that formalizes style in personal narratives as patterns in the linguistic choices authors make when communicating subjective experiences. Our framework integrates three domains: functional linguistics establishes language as a system of meaningful choices, computer science provides methods for automatically extracting and analyzing sequential patterns, and these patterns are linked to psychological observations. Using language models, we automatically extract linguistic features such as processes, participants, and circumstances. We apply our framework to hundreds of dream narratives, including a case study on a war veteran with post-traumatic stress disorder. Analysis of his narratives uncovers distinctive patterns, particularly how verbal processes dominate over mental ones, illustrating the relationship between linguistic choices and psychological states.
pdf
bib
abs
TopicAttack: An Indirect Prompt Injection Attack via Topic Transition
Yulin Chen
|
Haoran Li
|
Yuexin Li
|
Yue Liu
|
Yangqiu Song
|
Bryan Hooi
Large language models (LLMs) have shown remarkable performance across a range of NLP tasks. However, their strong instruction-following capabilities and inability to distinguish instructions from data content make them vulnerable to indirect prompt injection attacks. In such attacks, instructions with malicious purposes are injected into external data sources, such as web documents. When LLMs retrieve this injected data through tools such as a search engine, and execute the injected instructions, they produce misleading responses. Recent attack methods have demonstrated potential, but their abrupt instruction injection often undermines their effectiveness. Motivated by the limitations of existing attack methods, we propose **TopicAttack**, which prompts the LLM to generate a fabricated conversational transition prompt that gradually shifts the topic toward the injected instruction, making the injection smoother and enhancing the plausibility and success of the attack. Through comprehensive experiments, TopicAttack achieves state-of-the-art performance, with an attack success rate (ASR) over 90% in most cases, even when various defense methods are applied. We further analyze its effectiveness by examining attention scores. We find that a higher injected-to-original attention ratio leads to a greater success probability, and our method achieves a much higher ratio than the baseline methods.
pdf
bib
abs
PSET: a Phonetics-Semantics Evaluation Testbed
Gianluca Sperduti
|
Dong Nguyen
We introduce the Phonetics-Semantics Evaluation Testbed (PSET), a new English-based testbed to evaluate phonetic embeddings. Our testbed is built on the assumption that phonetic embeddings should always prioritize phonetics over semantics, and it therefore leverages homophones and synonyms. We use PSET to test three phonetic embedding models: articulatory embeddings, Phoneme2Vec, and XPhoneBERT. The phonetic-based embeddings solve the task with varying degrees of success, with Phoneme2Vec performing the best. We also test five recent LLMs, GPT-4o, Gemini 2.5 Flash, Llama 3.1-8B, OLMo-7B and OLMo 2-7B. Gemini 2.5 Flash performs better than the other models. With this testbed, we hope to advance the development and evaluation of phonetic embedding models.
pdf
bib
abs
From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora
Yingli Shen
|
Wen Lai
|
Shuo Wang
|
Ge Gao
|
Kangyang Luo
|
Alexander Fraser
|
Maosong Sun
Continued pretraining and instruction tuning on large-scale multilingual data have proven to be effective in scaling large language models (LLMs) to low-resource languages. However, the unaligned nature of such data limits its ability to effectively capture cross-lingual semantics. In contrast, multi-way parallel data, where identical content is aligned across multiple languages, provides stronger cross-lingual consistency and offers greater potential for improving multilingual performance. In this paper, we introduce a large-scale, high-quality multi-way parallel corpus, TED2025, based on TED Talks. The corpus spans 113 languages, with up to 50 languages aligned in parallel, ensuring extensive multilingual coverage. Using this dataset, we investigate best practices for leveraging multi-way parallel data to enhance LLMs, including strategies for continued pretraining, instruction tuning, and the analysis of key influencing factors. Experiments on six multilingual benchmarks show that models trained on multi-way parallel data consistently outperform those trained on unaligned multilingual data.
pdf
bib
abs
GATEAU: Selecting Influential Samples for Long Context Alignment
Shuzheng Si
|
Haozhe Zhao
|
Gang Chen
|
Yunshui Li
|
Kangyang Luo
|
Chuancheng Lv
|
Kaikai An
|
Fanchao Qi
|
Baobao Chang
|
Maosong Sun
Aligning large language models to handle instructions with extremely long contexts has yet to be fully investigated. Previous studies have attempted to scale up the available data volume by synthesizing long instruction-following samples, as constructing such a dataset tends to be challenging for annotators. However, a lack of a well-defined strategy for ensuring data quality may introduce low-quality samples and restrict the model’s performance. Thus, we propose GATEAU, a novel framework to address the unique challenge of long context alignment by identifying the influential samples enriched with long-range dependency relations. Specifically, GATEAU measures the long-range dependencies from two essential aspects: the difficulty of generating target responses due to the long-range dependencies, and the difficulty of understanding long inputs due to such dependencies. Comprehensive experiments indicate that GATEAU effectively identifies influential samples and the model trained on these selected samples exhibits better instruction-following and long-context understanding capabilities.
pdf
bib
abs
Teach Small Models to Reason by Curriculum Distillation
Wangyi Jiang
|
Yaojie Lu
|
Hongyu Lin
|
Xianpei Han
|
Le Sun
Large Reasoning Models (LRMs) show strong System-2-style reasoning, but at the cost of significant computational overhead. In contrast, efficient System-1-style Large Language Models (LLMs) often struggle on complex tasks. We identify a critical asymmetry between these two paradigms: LRMs can implicitly self-distill their own reasoning, solving hard problems with near System-1-style efficiency while retaining superior performance. LLMs, however, lack such deep internal modes and collapse when forced to rely on their own reasoning rather than imitating external traces. This asymmetry explains why direct distillation from strong LRMs to weaker LLMs often fails: student models struggle to learn from LRMs’ overly complex explicit reasoning and gain little from their overly compact implicit solutions. To address this, we introduce a two-stage curriculum distillation framework, which first builds a robust internal problem-solving student model and then teaches the student model to externalize this latent knowledge as explicit reasoning. On challenging mathematical benchmarks, our method significantly outperforms single-stage baselines, creating compact models with strong reasoning ability.
pdf
bib
abs
Enhancing Reasoning Abilities of Small LLMs with Cognitive Alignment
Wenrui Cai
|
Chengyu Wang
|
Junbing Yan
|
Jun Huang
|
Xiangzhong Fang
The reasoning capabilities of large language reasoning models (LRMs), such as OpenAI’s o1 and DeepSeek-R1, have seen substantial advancements through deep thinking. However, these enhancements come with significant resource demands, underscoring the need for training effective small reasoning models. A critical challenge is that small models possess different reasoning capacities and cognitive trajectories compared with their larger counterparts. Hence, directly distilling chain-of-thought (CoT) results from large LRMs to smaller ones can sometimes be ineffective and often requires a substantial amount of annotated data. In this paper, we first introduce a novel Critique-Rethink-Verify (CRV) system, designed for training smaller yet powerful LRMs. Our CRV system consists of multiple LLM agents, each specializing in unique abilities: (i) critiquing the CoT qualities according to the cognitive capabilities of smaller models, (ii) rethinking and refining these CoTs based on the critiques, and (iii) verifying the correctness of the refined results. Based on the CRV system, we further propose the Cognitive Preference Optimization (CogPO) algorithm to continuously enhance the reasoning abilities of smaller models by aligning their reasoning processes with their cognitive capacities. Comprehensive evaluations on challenging reasoning benchmarks demonstrate the efficacy of our CRV+CogPO framework, which outperforms other methods by a large margin.
pdf
bib
abs
NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning
Wei Liu
|
Siya Qi
|
Xinyu Wang
|
Chen Qian
|
Yali Du
|
Yulan He
Recent advances, such as DeepSeek R1-Zero, highlight the effectiveness of incentive training, a reinforcement learning paradigm that computes rewards solely based on the final answer part of a language model’s output, thereby encouraging the generation of intermediate reasoning steps. However, these methods fundamentally rely on external verifiers, which limits their applicability to domains like mathematics and coding, where such verifiers are readily available. Although reward models can serve as verifiers, they require high-quality annotated data and are costly to train. In this work, we propose NOVER, NO-VERifier Reinforcement Learning, a general reinforcement learning framework that requires only standard supervised fine-tuning data with no need for an external verifier. NOVER enables incentive training across a wide range of text-to-text tasks and outperforms the model of the same size distilled from large reasoning models such as DeepSeek R1 671B by 7.7%. Moreover, the flexibility of NOVER enables new possibilities for optimizing large language models, such as inverse incentive training.
pdf
bib
abs
Genre Matters: How Text Types Interact with Decoding Strategies and Lexical Predictors in Shaping Reading Behavior
Lena Sophia Bolliger
|
Lena Ann Jäger
The type of a text profoundly shapes reading behavior, yet little is known about how different text types interact with word-level features and the properties of machine-generated texts and how these interactions influence how readers process language. In this study, we investigate how different text types affect eye movements during reading, how neural decoding strategies used to generate texts interact with text type, and how text types modulate the influence of word-level psycholinguistic features such as surprisal, word length, and lexical frequency. Leveraging EMTeC (Bolliger et al., 2025), the first eye-tracking corpus of LLM-generated texts across six text types and multiple decoding algorithms, we show that text type strongly modulates cognitive effort during reading, that psycholinguistic effects induced by word-level features vary systematically across genres, and that decoding strategies interact with text types to shape reading behavior. These findings offer insights into genre-specific cognitive processing and have implications for the human-centric design of AI-generated texts. Our code is publicly available at https://github.com/DiLi-Lab/Genre-Matters.
pdf
bib
abs
RTE-GMoE: A Model-agnostic Approach for Relation Triplet Extraction via Graph-based Mixture-of-Expert Mutual Learning
Aziguli Wulamu
|
Kaiyuan Gong
|
Lyu Zhengyu
|
Yu Han
|
Zhihong Zhu
|
Bowen Xing
Relation Triplet Extraction (RTE) is a fundamental yet challenging task in knowledge acquisition, which identifies and extracts all triplets from unstructured text. Despite the recent advancements, the deep integration of the entity-, relation- and triplet-specific information remains a challenge. In this paper, we propose a Graph-based Mixture-of-Experts mutual learning framework for RTE, namely RTE-GMoE, to address this limitation. As a model-agnostic framework, RTE-GMoE distinguishes itself by including and modeling the mutual interactions among three vital task-specific experts: entity expert, RTE expert, and relation expert. The RTE expert corresponds to the main RTE task and can be implemented by any model, while the other two correspond to the two auxiliary tasks: entity recognition and relation extraction. We construct an expert graph and achieve comprehensive and adaptive graph-based MoE interactions with a novel mutual learning mechanism. In our framework, these experts perform knowledge extractions collaboratively via dynamic information exchange and knowledge sharing. We conduct extensive experiments on four state-of-the-art backbones and evaluate them on several widely-used benchmarks. The results demonstrate that our framework brings consistent and promising improvements on all backbones and benchmarks. Component study and model analysis further verify the effectiveness and advantages of our method.
pdf
bib
abs
Avoidance Decoding for Diverse Multi-Branch Story Generation
Kyeongman Park
|
Nakyeong Yang
|
Kyomin Jung
Large Language Models (LLMs) often generate repetitive and monotonous outputs, especially in tasks like story generation, due to limited creative diversity when given the same input prompt. To address this challenge, we propose a novel decoding strategy, ***Avoidance Decoding***, that modifies token logits by penalizing similarity to previously generated outputs, thereby encouraging more diverse multi-branch stories. This penalty adaptively balances two similarity measures: (1) Concept-level Similarity Penalty, which is prioritized in early stages to diversify initial story concepts, and (2) Narrative-level Similarity Penalty, which is increasingly emphasized later to ensure natural yet diverse plot development. Notably, our method achieves up to **2.6** times higher output diversity and reduces repetition by an average of 30% compared to strong baselines, while effectively mitigating text degeneration. Furthermore, we reveal that our method activates a broader range of neurons, demonstrating that it leverages the model’s intrinsic creative capacity.
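The abstract describes modifying next-token logits with two similarity penalties whose balance shifts as the story progresses. A minimal sketch of that idea follows; the linear weighting schedule and the two similarity vectors are simplified stand-ins for the paper's formulation, not its exact method:

```python
# Similarity-penalized decoding sketch: penalize tokens similar to earlier
# branches, shifting emphasis from concept-level to narrative-level over time.
import numpy as np

def avoidance_adjusted_logits(logits, concept_sim, narrative_sim,
                              step, max_steps, alpha=2.0):
    """logits: [vocab] next-token scores.
    concept_sim / narrative_sim: [vocab] similarity of each candidate token to
    concepts / plot continuations already used in previous story branches."""
    progress = step / max_steps            # 0 early in the story, 1 at the end
    w_concept = 1.0 - progress             # concept penalty dominates early
    w_narrative = progress                 # narrative penalty dominates later
    penalty = alpha * (w_concept * concept_sim + w_narrative * narrative_sim)
    return logits - penalty

rng = np.random.default_rng(0)
vocab = 10
logits = rng.normal(size=vocab)
concept_sim = rng.uniform(size=vocab)
narrative_sim = rng.uniform(size=vocab)
adjusted = avoidance_adjusted_logits(logits, concept_sim, narrative_sim,
                                     step=5, max_steps=50)
probs = np.exp(adjusted - adjusted.max())
probs /= probs.sum()
print(probs.round(3))
```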
pdf
bib
abs
Probabilistic Soundness Guarantees in LLM Reasoning Chains
Weiqiu You
|
Anton Xue
|
Shreya Havaldar
|
Delip Rao
|
Helen Jin
|
Chris Callison-Burch
|
Eric Wong
In reasoning chains generated by large language models (LLMs), initial errors often propagate and undermine the reliability of the final conclusion. Current LLM-based error detection methods often fail to detect propagated errors because earlier errors can corrupt judgments of downstream reasoning. To better detect such errors, we introduce Autoregressive Reasoning Entailment Stability (ARES), a probabilistic framework that evaluates each reasoning step based solely on previously-verified premises. This inductive method yields a nuanced score for each step and provides certified statistical guarantees of its soundness, rather than a brittle binary label. ARES achieves state-of-the-art performance across four benchmarks (72.1% Macro-F1, +8.2 points) and demonstrates superior robustness on very long synthetic reasoning chains, where it excels at detecting propagated errors (90.3% F1, +27.6 points).
pdf
bib
abs
SQLWOZ: A Realistic Task-Oriented Dialogue Dataset with SQL-Based Dialogue State Representation for Complex User Requirements
Heng-Da Xu
|
Xian-Ling Mao
|
Fanshu Sun
|
Tian-Yi Che
|
Cheng-Xin Xin
|
Heyan Huang
High-quality datasets are essential for building effective task-oriented dialogue (TOD) systems. The existing TOD datasets often present overly simplified interactions, where users incrementally express straightforward requests that can be managed with basic slot-value style dialogue states, such as “hotel-area = east.” However, this approach does not reflect real-life scenarios in which users may express complex constraints and preferences. To address this gap, in this paper, we propose SQLWOZ, a novel TOD dataset designed to capture complex, real-world user requirements. The user requirements in SQLWOZ include the four categories: 1) multiple values for a slot, 2) excluded values within a slot, 3) preferred or prioritized values, and 4) conditional values based on other conditions. We utilize SQL statements as a formalized and expressive representation of dialogue states within SQLWOZ. To evaluate the dataset, we adapt large language models as dialogue agents and conduct extensive experiments on the SQL-based dialogue state tracking, dialogue response generation and end-to-end TOD tasks. The experimental results demonstrate the complexity and quality of SQLWOZ, establishing it as a new benchmark for advancing TOD research.
pdf
bib
abs
SURE: Safety Understanding and Reasoning Enhancement for Multimodal Large Language Models
Yuxin Gou
|
Xiaoning Dong
|
Qin Li
|
Shishen Gu
|
Richang Hong
|
Wenbo Hu
Multimodal large language models (MLLMs) demonstrate impressive capabilities by integrating visual and textual information. However, the incorporation of visual modalities also introduces new and complex safety risks, rendering even the most advanced models vulnerable to sophisticated jailbreak attacks. This paper first analyzes the impact of inserting a safety reasoning prompt on various aspects of the model. We find that this external method can help the model resist jailbreak attacks to some extent, but the model still fails to distinguish specific semantic scenarios, resulting in a significantly increased refusal rate for benign queries. Inspired by this, we propose a novel training framework, SURE (Safety Understanding and Reasoning Enhancement for Multimodal Large Language Models), designed to help models internalize chain-of-thought-based safety decision-making capabilities. Extensive experiments demonstrate that SURE significantly improves model safety while effectively avoiding over-defense, achieving a good balance between safety and generality. Finally, we create a large-scale multimodal safety reasoning dataset, MLLM-SCoT-Plus, to facilitate research on safety alignment in multimodal models. Our code and the dataset are publicly available at https://github.com/hfutml/SURE.
pdf
bib
abs
EMO: Embedding Model Distillation via Intra-Model Relation and Optimal Transport Alignments
Minh-Phuc Truong
|
Hai An Vu
|
Tu Vu
|
Nguyen Thi Ngoc Diep
|
Linh Ngo Van
|
Thien Huu Nguyen
|
Trung Le
Knowledge distillation (KD) is crucial for compressing large text embedding models, but faces challenges when teacher and student models use different tokenizers (Cross-Tokenizer KD - CTKD). Vocabulary mismatches impede the transfer of relational knowledge encoded in deep representations, such as hidden states and attention matrices, which are vital for producing high-quality embeddings. Existing CTKD methods often focus on direct output alignment, neglecting this crucial structural information. We propose a novel framework tailored for CTKD embedding model distillation. We first map tokens one-to-one via Minimum Edit Distance (MinED). Then, we distill intra-model relational knowledge by aligning attention matrix patterns using Centered Kernel Alignment, focusing on the top-m most important of the directly mapped tokens. Simultaneously, we align final hidden states via Optimal Transport with Importance-Scored Mass Assignment, which emphasizes semantically important token representations, based on importance scores derived from attention weights. We evaluate distillation from state-of-the-art embedding models (e.g., LLM2Vec, BGE) to a BERT-base-uncased model on embedding-reliant tasks such as text classification, sentence pair classification, and semantic textual similarity. Our proposed framework significantly outperforms existing CTKD baselines. By preserving attention structure and prioritizing key representations, our approach yields smaller, high-fidelity embedding models despite tokenizer differences.
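Centered Kernel Alignment is the similarity measure named in the abstract for comparing teacher and student attention patterns. Below is a standard linear CKA computation as a sketch; which rows are selected and how a loss is built from it are details of the paper, and the toy data here is purely illustrative:

```python
# Linear Centered Kernel Alignment between two [n, d] representations.
import numpy as np

def linear_cka(x, y):
    """CKA in [0, 1]; 1 means the two representations are linearly aligned."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
n_tokens, d_teacher, d_student = 32, 64, 48
teacher_attn = rng.normal(size=(n_tokens, d_teacher))   # teacher attention rows
student_attn = teacher_attn[:, :d_student] + 0.1 * rng.normal(size=(n_tokens, d_student))
print(linear_cka(teacher_attn, student_attn))            # close to 1 when aligned
```

Note that CKA tolerates different feature dimensions on the two sides, which is convenient when teacher and student architectures differ.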
pdf
bib
abs
AesBiasBench: Evaluating Bias and Alignment in Multimodal Language Models for Personalized Image Aesthetic Assessment
Kun Li
|
Lai Man Po
|
Hongzheng Yang
|
Xuyuan Xu
|
Kangcheng Liu
|
Yuzhi Zhao
Multimodal Large Language Models (MLLMs) are increasingly applied in Personalized Image Aesthetic Assessment (PIAA) as a scalable alternative to expert evaluations. However, their predictions may reflect subtle biases influenced by demographic factors such as gender, age, and education. In this work, we propose AesBiasBench, a benchmark designed to evaluate MLLMs along two complementary dimensions: (1) stereotype bias, quantified by measuring variations in aesthetic evaluations across demographic groups; and (2) alignment between model outputs and genuine human aesthetic preferences. Our benchmark covers three subtasks (Aesthetic Perception, Assessment, Empathy) and introduces structured metrics (IFD, NRD, AAS) to assess both bias and alignment. We evaluate 19 MLLMs, including proprietary models (e.g., GPT-4o, Claude-3.5-Sonnet) and open-source models (e.g., InternVL-2.5, Qwen2.5-VL). Results indicate that smaller models exhibit stronger stereotype biases, whereas larger models align more closely with human preferences. Incorporating identity information often exacerbates bias, particularly in emotional judgments. These findings underscore the importance of identity-aware evaluation frameworks in subjective vision-language tasks.
pdf
bib
abs
DA-Pred: Performance Prediction for Text Summarization under Domain-Shift and Instruct-Tuning
Anum Afzal
|
Florian Matthes
|
Alexander Fabbri
Large Language Models (LLMs) often do not perform as expected under Domain Shift or after Instruct-tuning. A reliable indicator of LLM performance in these settings could assist in decision-making. We present a method that uses the known performance in high-resource domains and fine-tuning settings to predict performance in low-resource domains or base models, respectively. In our paper, we formulate the task of performance prediction, construct a dataset for it, and train regression models to predict this change in performance. Our proposed methodology is lightweight and, in practice, can help researchers and practitioners decide if resources should be allocated for data labeling and LLM Instruct-tuning.
pdf
bib
abs
UnCo: Uncertainty-Driven Collaborative Framework of Large and Small Models for Grounded Multimodal NER
Jielong Tang
|
Yang Yang
|
Jianxing Yu
|
Zhen-Xing Wang
|
Haoyuan Liang
|
Liang Yao
|
Jian Yin
Grounded Multimodal Named Entity Recognition (GMNER) is a new information extraction task. It requires models to extract named entities and ground them to real-world visual objects. Previous methods, relying on domain-specific fine-tuning, struggle with unseen multimodal entities due to limited knowledge and generalization. Recently, multimodal large language models (MLLMs) have demonstrated strong open-set abilities. However, their performance is hindered by the lack of in-domain knowledge due to costly training for GMNER datasets. To address these limitations, we propose **UnCo**, a two-stage Uncertainty-driven Collaborative framework that leverages the complementary strengths of small fine-tuned models and MLLMs. Specifically, **in stage one**, we equip the small model with a unified uncertainty estimation (UE) for multimodal entities. This enables the small model to express "I do not know" when recognizing unseen entities beyond its capabilities. Predictions with high uncertainty are then filtered and delegated to the MLLM. **In stage two**, an Uncertainty-aware Hierarchical Correction mechanism guides the MLLM to refine uncertain predictions using its open-domain knowledge. Ultimately, UnCo effectively retains the in-domain knowledge of small models while utilizing the capabilities of MLLMs to handle unseen samples. Extensive experiments demonstrate UnCo’s effectiveness on two GMNER benchmarks.
pdf
bib
abs
An Empirical Study of LLM Reasoning Ability Under Strict Output Length Constraint
Yi Sun
|
Han Wang
|
Jiaqiang Li
|
Jiacheng Liu
|
Xiangyu Li
|
Hao Wen
|
Yizhen Yuan
|
Huiwen Zheng
|
Yan Liang
|
Yuanchun Li
|
Yunxin Liu
Recent work has demonstrated the remarkable potential of Large Language Models (LLMs) in test-time scaling. By making models think before answering, they are able to achieve much higher accuracy with extra inference computation. However, in many real-world scenarios, models are used under time constraints, where an answer should be given within a certain output length. It is unclear whether and how the reasoning ability of different LLMs remains effective under strict constraints. We take a first look at this problem by conducting an in-depth empirical study. Specifically, we test 30 LLMs on common reasoning datasets under a wide range of output length budgets, and we analyze the correlation between the inference accuracy and various properties including model type, model size, prompt style, etc. We also consider the mappings between token budgets and actual on-device latency budgets. The results have demonstrated several interesting findings regarding the budget-aware LLM reasoning ability that differ from the unconstrained situation, e.g., the optimal choices of either model size or prompt style change under different budgets. These findings offer a timely evaluation of this area and practical guidance for users to deploy LLMs under real-world latency constraints.
pdf
bib
abs
Enrich-on-Graph: Query-Graph Alignment for Complex Reasoning with LLM Enriching
Songze Li
|
Zhiqiang Liu
|
Zhengke Gui
|
Huajun Chen
|
Wen Zhang
Large Language Models (LLMs) exhibit strong reasoning capabilities in complex tasks. However, they still struggle with hallucinations and factual errors in knowledge-intensive scenarios like knowledge graph question answering (KGQA). We attribute this to the semantic gap between structured knowledge graphs (KGs) and unstructured queries, caused by inherent differences in their focuses and structures. Existing methods usually employ resource-intensive, non-scalable workflows reasoning on vanilla KGs, but overlook this gap. To address this challenge, we propose a flexible framework, Enrich-on-Graph (EoG), which leverages LLMs’ prior knowledge to enrich KGs and bridge the semantic gap between graphs and queries. EoG enables efficient evidence extraction from KGs for precise and robust reasoning, while ensuring low computational costs, scalability, and adaptability across different methods. Furthermore, we propose three graph quality evaluation metrics to analyze query-graph alignment in the KGQA task, supported by theoretical validation of our optimization objectives. Extensive experiments on two KGQA benchmark datasets indicate that EoG can effectively generate high-quality KGs and achieve state-of-the-art performance.
pdf
bib
abs
Noise, Adaptation, and Strategy: Assessing LLM Fidelity in Decision-Making
Yuanjun Feng
|
Vivek Choudhary
|
Yash Raj Shrestha
Large language models (LLMs) are increasingly used for social-science simulations, yet most evaluations target task optimality rather than the variability and adaptation characteristic of human decision-making. We propose a process-oriented evaluation framework with progressive interventions (Intrinsicality, Instruction, and Imitation), and apply it to two classic economics tasks: the second-price auction and the newsvendor inventory problem. By default, LLMs adopt stable, conservative strategies that diverge from observed human behavior. Giving LLMs risk-framed instructions makes them behave more like humans. However, this also causes complex irregularities. Incorporating human decision trajectories via in-context learning further narrows distributional gaps, indicating that models can absorb human patterns. However, across all interventions, LLMs underexpress round-to-round variability relative to humans, revealing a persistent alignment gap in behavioral fidelity. Future evaluations of LLM-based social simulations should prioritize process-level realism.
pdf
bib
abs
Structuring Radiology Reports: Challenging LLMs with Lightweight Models
Johannes Moll
|
Louisa Fay
|
Asfandyar Azhar
|
Sophie Ostmeier
|
Sergios Gatidis
|
Tim C. Lueth
|
Curtis Langlotz
|
Jean-Benoit Delbrouck
Radiology reports are critical for clinical decision-making but often lack a standardized format, limiting both human interpretability and machine learning (ML) applications. While large language models (LLMs) have shown strong capabilities in reformatting clinical text, their high computational requirements, lack of transparency, and data privacy concerns hinder practical deployment. To address these challenges, we explore lightweight encoder-decoder models (<300M parameters)—specifically T5 and BERT2BERT—for structuring radiology reports from the MIMIC-CXR and CheXpert Plus datasets. We benchmark these models against eight open-source LLMs (1B–70B parameters), adapted using prefix prompting, in-context learning (ICL), and low-rank adaptation (LoRA) finetuning. Our best-performing lightweight model outperforms all LLMs adapted using prompt-based techniques on a human-annotated test set. While some LoRA-finetuned LLMs achieve modest gains over the lightweight model on the Findings section (BLEU 6.4%, ROUGE-L 4.8%, BERTScore 3.6%, F1-RadGraph 1.1%, GREEN 3.6%, and F1-SRR-BERT 4.3%), these improvements come at the cost of substantially greater computational resources. For example, LLaMA-3-70B incurred more than 400 times the inference time, cost, and carbon emissions compared to the lightweight model. These results underscore the potential of lightweight, task-specific models as sustainable and privacy-preserving solutions for structuring clinical text in resource-constrained healthcare settings.
pdf
bib
abs
PricingLogic: Evaluating LLMs Reasoning on Complex Tourism Pricing Tasks
Yunuo Liu
|
Dawei Zhu
|
Zena Al-Khalili
|
Dai Cheng
|
Yanjun Chen
|
Dietrich Klakow
|
Wei Zhang
|
Xiaoyu Shen
We present PricingLogic, the first benchmark that probes whether Large Language Models (LLMs) can reliably automate tourism-booking pricing when multiple, overlapping fare rules apply. Travel agencies are eager to offload this error-prone task to AI systems; however, deploying LLMs without verified reliability could result in significant financial losses and erode customer trust. PricingLogic comprises 300 natural-language questions based on booking requests derived from 42 real-world pricing policies, spanning two levels of difficulty: (i) basic customer-type pricing and (ii) bundled-tour calculations involving interacting discounts. Evaluations of a line of LLMs reveal a steep performance drop on the harder tier, exposing systematic failures in rule interpretation and arithmetic reasoning. These results highlight that, despite their general capabilities, today’s LLMs remain unreliable for revenue-critical applications without further safeguards or domain adaptation. Our code and dataset are available at https://github.com/EIT-NLP/PricingLogic.
pdf
bib
abs
EcoTune: Token-Efficient Multi-Fidelity Hyperparameter Optimization for Large Language Model Inference
Yuebin Xu
|
Zhiyi Chen
|
Zeyi Wen
Tuning inference hyperparameters, such as temperature and maximum output tokens, on downstream tasks can enhance inference performance. However, directly applying hyperparameter optimization to these hyperparameters is token-expensive. Multi-fidelity optimization improves HPO efficiency with low-fidelity evaluations, but its static scheduling strategies ignore token consumption, leading to high costs. To address these limitations, we propose a token-efficient multi-fidelity optimization method, which enhances inference performance and minimizes token usage. Our method is empowered by (i) a token-based fidelity definition with explicit token cost modeling on configurations; (ii) a novel Token-Aware Expected Improvement acquisition function that selects configurations based on performance gain per token; and (iii) a dynamic fidelity scheduling mechanism that adapts to real-time budget status. We evaluate our method on LLaMA-2 and LLaMA-3 series across MMLU, Humaneval, MedQA, and OpenBookQA. Our method improves over the HELM leaderboard by 7.1%, 24.3%, 21.9%, and 4.6%, respectively. Compared to existing multi-fidelity HPO baselines, our method reduces token consumption by over 80% while maintaining or surpassing performance, demonstrating the state-of-the-art token efficiency for inference-time optimization.
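The abstract's core idea of selecting configurations by "performance gain per token" can be made concrete with a standard expected-improvement acquisition divided by a predicted token cost. The Gaussian surrogate and the exact normalization below are assumptions for illustration; the paper's Token-Aware Expected Improvement may differ:

```python
# Illustrative acquisition: expected improvement per token spent.
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best, xi=0.01):
    """Standard EI under a Gaussian posterior for each candidate config."""
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def token_aware_ei(mu, sigma, best, predicted_tokens):
    """Rank candidate configurations by expected improvement per token."""
    return expected_improvement(mu, sigma, best) / np.asarray(predicted_tokens)

mu = np.array([0.62, 0.65, 0.64])           # predicted accuracy of 3 configs
sigma = np.array([0.02, 0.05, 0.01])        # surrogate uncertainty
tokens = np.array([2_000, 12_000, 3_000])   # estimated evaluation token cost
best_so_far = 0.63
print(token_aware_ei(mu, sigma, best_so_far, tokens).argmax())
```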
pdf
bib
abs
Investigating Value-Reasoning Reliability in Small Large Language Models
Xia Du
|
Shuhan Sun
|
Pengyuan Liu
|
Dong Yu
Although small large language models (sLLMs) have been widely deployed in practical applications, little attention has been paid to their value-reasoning abilities, particularly in terms of reasoning reliability. To address this gap, we propose a systematic evaluation framework for assessing the Value-Reasoning Reliability of sLLMs. We define Value-Reasoning Reliability as comprising: (1) output consistency under identical prompts, (2) output robustness under semantically equivalent prompts, (3) stability of value reasoning in the face of attacks, and (4) consistency of value reasoning in open-ended value expression tasks. Our framework includes three core tasks: Repetition Consistency task, Interaction Stability task, and Open-ended Expression Consistency task. We further incorporate self-reported confidence scores to evaluate the model’s value reasoning reliability from two perspectives: the model’s self-awareness of its values, and its value-based decision-making. Our findings show that models vary significantly in their stability when responding to value-related questions. Moreover, we observe considerable output randomness, which is not always correlated with the self-reported confidence or expressed value preferences. This suggests that current models lack a reliable internal mechanism for stable value reasoning when addressing value-sensitive queries.
pdf
bib
abs
Can LLMs Explain Themselves Counterfactually?
Zahra Dehghanighobadi
|
Asja Fischer
|
Muhammad Bilal Zafar
Explanations are an important tool for gaining insights into model behavior, calibrating user trust, and ensuring compliance. The past few years have seen a flurry of methods for generating explanations, many of which involve computing model gradients or solving specially designed optimization problems. Owing to the remarkable reasoning abilities of LLMs, *self-explanation*, i.e., prompting the model to explain its outputs, has recently emerged as a new paradigm. We study a specific type of self-explanations, *self-generated counterfactual explanations* (SCEs). We test LLMs’ ability to generate SCEs across families, sizes, temperatures, and datasets. We find that LLMs sometimes struggle to generate SCEs. When they do, their prediction often does not agree with their own counterfactual reasoning.
pdf
bib
abs
Self-Adjust Softmax
Chuanyang Zheng
|
Yihang Gao
|
Guoxuan Chen
|
Han Shi
|
Jing Xiong
|
Xiaozhe Ren
|
Chao Huang
|
Zhenguo Li
|
Yu Li
The softmax function is crucial in Transformer attention, normalizing each row of the attention scores to sum to one. **Usually, tokens with larger attention scores are important for the final prediction. However, the softmax function can face a gradient-vanishing issue for such important tokens (e.g., probabilities close to one), leading to optimization difficulties for exactly those tokens, so that performance may not improve.** In this paper, we propose Self-Adjust Softmax (SA-Softmax) to address this issue by modifying softmax(z) to z ⋅ softmax(z), together with its normalized variant (z − min(z_min, 0)) / (max(0, z_max) − min(z_min, 0)) ⋅ softmax(z). We theoretically show that SA-Softmax provides enhanced gradient properties compared to the vanilla softmax function. Moreover, SA-Softmax can be seamlessly integrated into the attention mechanisms of existing Transformer models with minor adjustments. We conducted experiments to evaluate the empirical performance of Transformer models using SA-Softmax compared to the vanilla softmax function. These experiments, involving models with up to 2.7 billion parameters, are conducted across diverse datasets, language tasks, and positional encoding methods.
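A direct reading of the two variants described in the abstract, written as drop-in replacements for softmax over attention scores. This is a sketch under the assumption that normalization is applied along the key dimension; other broadcasting choices are possible:

```python
# Sketch of SA-Softmax as described in the abstract (not the authors' code).
import torch

def sa_softmax(z: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """z * softmax(z): keeps gradients alive for high-probability tokens."""
    return z * torch.softmax(z, dim=dim)

def sa_softmax_normalized(z: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """(z - min(z_min, 0)) / (max(0, z_max) - min(z_min, 0)) * softmax(z)."""
    z_min = z.amin(dim=dim, keepdim=True).clamp(max=0.0)   # min(z_min, 0)
    z_max = z.amax(dim=dim, keepdim=True).clamp(min=0.0)   # max(0, z_max)
    scale = (z_max - z_min).clamp(min=1e-9)
    return (z - z_min) / scale * torch.softmax(z, dim=dim)

scores = torch.randn(2, 4, 8, 8)   # [batch, heads, queries, keys] attention scores
attn = sa_softmax_normalized(scores, dim=-1)
print(attn.shape)
```

The normalized variant rescales the multiplicative factor into a bounded range, which keeps the modified attention weights on a scale comparable to the vanilla softmax output.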
pdf
bib
abs
DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement
Shaoqing Lin
|
Chong Teng
|
Fei Li
|
Donghong Ji
|
Lizhen Qu
|
Zhuang Li
Vision-Language Models (VLMs) generate discourse-level, multi-sentence visual descriptions, challenging text scene graph parsers built for single-sentence caption-to-graph mapping. Current approaches typically merge sentence-level parsing outputs for discourse input, often missing phenomena like cross-sentence coreference, resulting in fragmented graphs and degraded downstream VLM task performance. We introduce a new task, Discourse-level text Scene Graph parsing (DiscoSG), and release DiscoSG-DS, a dataset of 400 expert-annotated and 8,430 synthesised multi-sentence caption-graph pairs. Each caption averages 9 sentences, and each graph contains at least 3× more triples than those in existing datasets. Fine-tuning GPT-4o on DiscoSG-DS yields over 40% higher SPICE than the strongest sentence-merging baseline. However, its high inference cost and licensing restrict open-source use, and smaller fine-tuned open-source models (e.g., Flan-T5) perform poorly on dense graph generation. To bridge this gap, we propose DiscoSG-Refiner, which drafts a base graph using a seed parser and iteratively refines it with a second model, improving robustness for complex graph generation. Using two small fine-tuned Flan-T5-Base models, DiscoSG-Refiner improves SPICE by ~30% over the baseline while achieving 86× faster inference than GPT-4o. It also delivers consistent gains on downstream VLM tasks, including discourse-level caption evaluation and hallucination detection, outperforming alternative parsers. Code and data are available at https://github.com/ShaoqLin/DiscoSG.
pdf
bib
abs
XAutoLM: Efficient Fine-Tuning of Language Models via Meta-Learning and AutoML
Ernesto Luis Estevanell Valladares
|
Suilan Estevez-Velarde
|
Yoan Gutierrez
|
Andrés Montoyo
|
Ruslan Mitkov
Experts in machine learning leverage domain knowledge to navigate decisions in model selection, hyperparameter optimization, and resource allocation. This is particularly critical for fine-tuning language models (LMs), where repeated trials incur substantial computational overhead and environmental impact. However, no existing automated framework simultaneously tackles the entire model selection and hyperparameter optimization (HPO) task for resource-efficient LM fine-tuning. We introduce XAutoLM, a meta-learning-augmented AutoML framework that reuses past experiences to optimize discriminative and generative LM fine-tuning pipelines efficiently. XAutoLM learns from stored successes and failures by extracting task- and system-level meta-features to bias its sampling toward valuable configurations and away from costly dead ends. On four text classification and two question-answering benchmarks, XAutoLM surpasses the zero-shot optimizer’s peak F1 on five of six tasks, cuts the mean evaluation time of pipelines by up to 4.5x, reduces search error ratios by up to sevenfold, and uncovers up to 50% more pipelines above the zero-shot Pareto front. In contrast, simpler memory-based baselines suffer negative transfer. We release XAutoLM and our experience store to catalyze resource-efficient, Green AI fine-tuning in the NLP community.
pdf
bib
abs
UNCERTAINTY-LINE: Length-Invariant Estimation of Uncertainty for Large Language Models
Roman Vashurin
|
Maiya Goloburda
|
Preslav Nakov
|
Maxim Panov
Large Language Models (LLMs) have become indispensable tools across various applications, making it more important than ever to ensure the quality and the trustworthiness of their outputs. This has led to growing interest in uncertainty quantification (UQ) methods for assessing the reliability of LLM outputs. Many existing UQ techniques rely on token probabilities, which inadvertently introduces a bias with respect to the length of the output. While some methods attempt to account for this, we demonstrate that such biases persist even in length-normalized approaches. To address the problem, here we propose UNCERTAINTY-LINE (Length-INvariant Estimation), a simple debiasing procedure that regresses uncertainty scores on output length and uses the residuals as corrected, length-invariant estimates. Our method is post-hoc, model-agnostic, and applicable to a range of UQ measures. Through extensive evaluation on machine translation, summarization, and question-answering tasks, we demonstrate that UNCERTAINTY-LINE consistently improves uncertainty estimates over even nominally length-normalized UQ methods, across multiple metrics and models. We release our code publicly at https://github.com/stat-ml/uncertainty-line.
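The debiasing step described in the abstract, regressing uncertainty on output length and keeping the residuals, can be sketched in a few lines. The linear form of the regression below is an assumption; any regressor could be substituted:

```python
# Length-invariant uncertainty via regression residuals (illustrative sketch).
import numpy as np

def length_debias(uncertainty, lengths):
    """Residuals of uncertainty regressed on output length (OLS, one feature)."""
    x = np.column_stack([np.ones(len(lengths)), np.asarray(lengths, dtype=float)])
    coef, *_ = np.linalg.lstsq(x, np.asarray(uncertainty, dtype=float), rcond=None)
    return uncertainty - x @ coef

rng = np.random.default_rng(0)
lengths = rng.integers(5, 200, size=500)
true_quality = rng.normal(size=500)             # signal we actually want to rank by
raw_uq = true_quality + 0.01 * lengths          # length-biased UQ score
corrected = length_debias(raw_uq, lengths)
print(np.corrcoef(raw_uq, lengths)[0, 1].round(2),       # strong length correlation
      np.corrcoef(corrected, lengths)[0, 1].round(2))    # ~0 after debiasing
```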
pdf
bib
abs
WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning
Zhepei Wei
|
Wenlin Yao
|
Yao Liu
|
Weizhi Zhang
|
Qin Lu
|
Liang Qiu
|
Changlong Yu
|
Puyang Xu
|
Chao Zhang
|
Bing Yin
|
Hyokun Yun
|
Lihong Li
While reinforcement learning (RL) has demonstrated remarkable success in enhancing large language models (LLMs), it has primarily focused on single-turn tasks such as solving math problems. Training effective web agents for multi-turn interactions remains challenging due to the complexity of long-horizon decision-making across dynamic web interfaces. In this work, we present WebAgent-R1, a simple yet effective end-to-end multi-turn RL framework for training web agents. It learns directly from online interactions with web environments by asynchronously generating diverse trajectories, entirely guided by binary rewards depending on task success. Experiments on the WebArena-Lite benchmark demonstrate the effectiveness of WebAgent-R1, boosting the task success rate of Qwen-2.5-3B from 6.1% to 33.9% and LLaMA-3.1-8B from 8.5% to 44.8%, significantly outperforming existing state-of-the-art methods and strong proprietary models such as OpenAI o3. In-depth analyses reveal the effectiveness of the thinking-based prompting strategy and test-time scaling through increased interactions for web tasks. We further investigate different RL initialization policies by introducing two variants, namely WebAgent-R1-Zero and WebAgent-R1-CoT, which highlight the importance of the warm-up training stage (i.e., behavior cloning) and provide insights on incorporating long chain-of-thought (CoT) reasoning in web agents.
pdf
bib
abs
Same evaluation, more tokens: On the effect of input length for machine translation evaluation using Large Language Models
Tobias Domhan
|
Dawei Zhu
Accurately evaluating machine-translated text remains a long-standing challenge, particularly for long documents. Recent work has shown that large language models (LLMs) can serve as reliable and interpretable sentence-level translation evaluators via MQM error span annotations. With modern LLMs supporting larger context windows, a natural question arises: can we feed entire document translations into an LLM for quality assessment? Ideally, evaluation should be invariant to text length, producing consistent error spans regardless of input granularity. However, our analysis shows that text length significantly impacts evaluation: longer texts lead to fewer error spans and reduced system ranking accuracy. To address this limitation, we evaluate several strategies, including granularity-aligned prompting, Focus Sentence Prompting (FSP), and a fine-tuning approach to better align LLMs with the evaluation task. The latter two methods largely mitigate this length bias, making LLMs more reliable for long-form translation evaluation.
pdf
bib
abs
PAKTON: A Multi-Agent Framework for Question Answering in Long Legal Agreements
Raptopoulos Petros
|
Giorgos Filandrianos
|
Maria Lymperaiou
|
Giorgos Stamou
Contract review is a complex and time-intensive task that typically demands specialized legal expertise, rendering it largely inaccessible to non-experts. Moreover, legal interpretation is rarely straightforward—ambiguity is pervasive, and judgments often hinge on subjective assessments. Compounding these challenges, contracts are usually confidential, restricting their use with proprietary models and necessitating reliance on open-source alternatives. To address these challenges, we introduce PAKTON: a fully open-source, end-to-end, multi-agent framework with plug-and-play capabilities. PAKTON is designed to handle the complexities of contract analysis through collaborative agent workflows and a novel retrieval-augmented generation (RAG) component, enabling automated legal document review that is more accessible, adaptable, and privacy-preserving. Experiments demonstrate that PAKTON outperforms both general-purpose and pretrained models in predictive accuracy, retrieval performance, explainability, completeness, and grounded justifications as evaluated through a human study and validated with automated metrics.
pdf
bib
abs
PoSum-Bench: Benchmarking Position Bias in LLM-based Conversational Summarization
Xu Sun
|
Lionel Delphin-Poulat
|
Christèle Tarnec
|
Anastasia Shimorina
Large language models (LLMs) are increasingly used for zero-shot conversation summarization, but often exhibit positional bias—tending to overemphasize content from the beginning or end of a conversation while neglecting the middle. To address this issue, we introduce PoSum-Bench, a comprehensive benchmark for evaluating positional bias in conversational summarization, featuring diverse English and French conversational datasets spanning formal meetings, casual conversations, and customer service interactions. We propose a novel semantic similarity-based sentence-level metric to quantify the direction and magnitude of positional bias in model-generated summaries, enabling systematic and reference-free evaluation across conversation positions, languages, and conversational contexts. Our benchmark and methodology thus provide the first systematic, cross-lingual framework for reference-free evaluation of positional bias in conversational summarization, laying the groundwork for developing more balanced and unbiased summarization models.
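One way to make the sentence-level idea concrete: locate, for each summary sentence, the most similar source sentence and look at where in the conversation it sits. In the sketch below, lexical cosine stands in for the semantic similarity model used in the paper, and the aggregation into a signed bias score is likewise an illustrative assumption:

```python
# Signed positional-bias sketch: negative = summary leans on the beginning,
# positive = leans on the end, ~0 = balanced coverage.
from collections import Counter
import math

def cosine(a: str, b: str) -> float:
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def position_bias(source_sentences, summary_sentences):
    positions = []
    for s in summary_sentences:
        best = max(range(len(source_sentences)),
                   key=lambda i: cosine(s, source_sentences[i]))
        positions.append(best / (len(source_sentences) - 1))
    return sum(positions) / len(positions) - 0.5

conversation = ["we should plan the launch", "marketing wants a new slogan",
                "budget is tight this quarter", "legal review is pending",
                "we agreed to ship in june"]
summary = ["the team should plan the launch", "marketing wants a new slogan"]
print(position_bias(conversation, summary))   # negative: summary favors the start
```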
pdf
bib
abs
ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning
Ziqing Qiao
|
Yongheng Deng
|
Jiali Zeng
|
Dong Wang
|
Lai Wei
|
Guanbo Wang
|
Fandong Meng
|
Jie Zhou
|
Ju Ren
|
Yaoxue Zhang
Large Reasoning Models (LRMs) perform strongly in complex reasoning tasks via Chain-of-Thought (CoT) prompting, but often suffer from verbose outputs, increasing computational overhead. Existing fine-tuning-based compression methods either perform post-hoc pruning, risking disruption to reasoning coherence, or rely on sampling-based selection, which fails to remove redundant content thoroughly. To address these limitations, this work begins by framing two key patterns of redundant reflection in LRMs—Confidence Deficit, wherein the model reflects on correct intermediate steps, and Termination Delay, where reflection continues after a verified, confident answer—through a confidence-guided perspective. Based on this, we introduce ConCISE (Confidence-guided Compression In Step-by-step Efficient Reasoning), a framework designed to generate concise reasoning chains, integrating Confidence Injection to boost reasoning confidence, and Early Stopping to terminate reasoning when confidence is sufficient. Extensive experiments demonstrate that compared to baseline methods, fine-tuning LRMs on ConCISE-generated data yields a better balance between compression and task performance, reducing length by up to ~50% under SimPO, while maintaining high task accuracy.
pdf
bib
abs
Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment
Hao Li
|
Lijun Li
|
Zhenghao Lu
|
Xianyi Wei
|
Rui Li
|
Jing Shao
|
Lei Sha
With rapid advancement and increasing accessibility of LLMs, fine-tuning aligned models has become a critical step for adapting them to real-world applications, which makes the safety of this fine-tuning process more important than ever. However, recent studies have highlighted a critical challenge: even when fine-tuning with seemingly benign downstream datasets, the safety of aligned LLMs can be compromised, making them more susceptible to malicious instructions. In this paper, we show that fine-tuning datasets often contain samples with safety-degrading features that are not easily identifiable on the surface. These samples can significantly degrade the safety alignment of LLMs during fine-tuning. To address this issue, we propose LARF, a Layer-Aware Representation Filtering method. This method identifies safety-sensitive layers within the LLM and leverages their representations to detect which data samples in the post-training dataset contain safety-degrading features. Experimental results demonstrate that LARF can effectively identify benign data with safety-degrading features. After removing such data, the safety alignment degradation caused by fine-tuning is mitigated.
pdf
bib
abs
Cross-domain Rumor Detection via Test-Time Adaptation and Large Language Models
Yuxia Gong
|
Shuguo Hu
|
Huaiwen Zhang
Rumor detection on social media has become crucial due to the rapid spread of misinformation. Existing approaches primarily focus on within-domain tasks, resulting in suboptimal performance in cross-domain scenarios due to domain shift. To address this limitation, we draw inspiration from the strong generalization capabilities of Test-Time Adaptation (TTA) and propose a novel framework to enhance rumor detection performance across different domains. Specifically, we introduce Test-Time Adaptation for Rumor Detection (T2ARD), which incorporates both single-domain model and target graph adaptation strategies tailored to the unique requirements of cross-domain rumor detection. T2ARD utilizes a graph adaptation module that updates the graph structure and node attributes through multi-level self-supervised contrastive learning, aiming to derive invariant graph representations. To mitigate the impact of significant distribution shifts on self-supervised signals, T2ARD performs model adaptation by using annotations from Large Language Models (LLMs) on the target graph to produce pseudo-labels as supervised signals. Experiments conducted on four widely used cross-domain datasets demonstrate that T2ARD achieves state-of-the-art performance, surpassing existing methods in rumor detection.
pdf
bib
abs
MLWQ: Efficient Small Language Model Deployment via Multi-Level Weight Quantization
Chun Hu
|
Junhui He
|
Shangyu Wu
|
Yuxin He
|
Chun Jason Xue
|
Qingan Li
Small language models (SLMs) are gaining attention for their lower computational and memory needs while maintaining strong performance. However, efficiently deploying SLMs on resource-constrained devices remains a significant challenge. Post-training quantization (PTQ) is a widely used compression technique that reduces memory usage and inference computation, yet existing methods suffer from inefficient bit-width allocation and insufficient fine-grained quantization adjustments, leading to suboptimal performance, particularly at lower bit-widths. To address these challenges, we propose multi-level weight quantization (MLWQ), which facilitates the efficient deployment of SLMs. Our method enables more effective bit-width allocation by jointly considering inter-layer loss and intra-layer salience. Furthermore, we propose a fine-grained partitioning of intra-layer salience to support the tuning of quantization parameters within each group. Experimental results indicate that MLWQ achieves competitive performance compared to state-of-the-art methods, providing an effective approach for the efficient deployment of SLMs while maintaining model accuracy.
pdf
bib
abs
ToDi: Token-wise Distillation via Fine-Grained Divergence Control
Seongryong Jung
|
Suwan Yoon
|
DongGeon Kim
|
Hwanhee Lee
Large language models (LLMs) offer impressive performance but are impractical for resource-constrained deployment due to high latency and energy consumption. Knowledge distillation (KD) addresses this by transferring knowledge from a large teacher to a smaller student model. However, conventional KD, notably approaches like Forward KL (FKL) and Reverse KL (RKL), apply uniform divergence loss across the entire vocabulary, neglecting token-level prediction discrepancies. By investigating these representative divergences via gradient analysis, we reveal that FKL boosts underestimated tokens, while RKL suppresses overestimated ones, showing their complementary roles. Based on this observation, we propose Token-wise Distillation (ToDi), a novel method that adaptively combines FKL and RKL per token using a sigmoid-based weighting function derived from the teacher-student probability log-ratio. ToDi dynamically emphasizes the appropriate divergence for each token, enabling precise distribution alignment. We demonstrate that ToDi consistently outperforms recent distillation baselines using uniform or less granular strategies across instruction-following benchmarks. Extensive ablation studies and efficiency analysis further validate ToDi’s effectiveness and practicality.
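The token-wise mixture of forward and reverse KL can be sketched in PyTorch as below. How the per-position teacher-student log-ratio is reduced to a single weight (here it is evaluated at the teacher's most likely token) is an assumption; the paper defines its own weighting, and real distillation would typically use the reference token at each position.

```python
import torch
import torch.nn.functional as F

# Sketch of a token-wise FKL/RKL mixture in the spirit of ToDi (not the official code).

def todi_style_loss(student_logits, teacher_logits):
    """student_logits, teacher_logits: (batch, seq_len, vocab)."""
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    log_p_s = F.log_softmax(student_logits, dim=-1)
    p_t, p_s = log_p_t.exp(), log_p_s.exp()

    # Per-position divergences, summed over the vocabulary.
    fkl = (p_t * (log_p_t - log_p_s)).sum(-1)   # boosts tokens the student underestimates
    rkl = (p_s * (log_p_s - log_p_t)).sum(-1)   # suppresses tokens the student overestimates

    # Per-position weight from the teacher-student log-ratio, evaluated here at
    # the teacher's most likely token (an illustrative assumption).
    top = teacher_logits.argmax(dim=-1, keepdim=True)
    log_ratio = log_p_t.gather(-1, top).squeeze(-1) - log_p_s.gather(-1, top).squeeze(-1)
    w = torch.sigmoid(log_ratio)

    return (w * fkl + (1.0 - w) * rkl).mean()

# Toy check on random logits.
student = torch.randn(2, 5, 100, requires_grad=True)
teacher = torch.randn(2, 5, 100)
loss = todi_style_loss(student, teacher)
loss.backward()
print(float(loss))
```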
pdf
bib
abs
RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation
Qingyao Li
|
Wei Xia
|
Xinyi Dai
|
Kounianhua Du
|
Weiwen Liu
|
Yasheng Wang
|
Ruiming Tang
|
Yong Yu
|
Weinan Zhang
Tree search methods have demonstrated impressive performance in code generation. Previous methods combine tree search with reflection that summarizes past mistakes to achieve iterative improvement. However, these methods face significant challenges. First, they search directly within the code language space, neglecting the underlying reasoning process critical for effective code generation. Second, reflection-based approaches merely accumulate historical errors in memory without providing correct reasoning pathways, making it difficult for subsequent search iterations to identify optimal solutions, resulting in decreased search quality. In this work, we propose RethinkMCTS, a framework that systematically explores and refines the reasoning process for code generation. Specifically, we employ MCTS to search for thoughts before code generation and integrate MCTS with a refinement mechanism called rethink, which incorporates fine-grained code execution feedback to refine erroneous thoughts during the search. It ensures the search path aligns with better reasoning, improving overall search quality. Through extensive experiments, we demonstrate that RethinkMCTS outperforms previous search-based and feedback-enhanced code generation baselines.
pdf
bib
abs
Probing for Arithmetic Errors in Language Models
Yucheng Sun
|
Alessandro Stolfo
|
Mrinmaya Sachan
We investigate whether internal activations in language models can be used to detect arithmetic errors. Starting with a controlled setting of 3-digit addition, we show that simple probes can accurately decode both the model’s predicted output and the correct answer from hidden states, regardless of whether the model’s output is correct. Building on this, we train lightweight error detectors that predict model correctness with over 90% accuracy. We then extend our analysis to structured chain-of-thought traces on addition-only GSM8K problems and find that probes trained on simple arithmetic generalize well to this more complex setting, revealing consistent internal representations. Finally, we demonstrate that these probes can guide selective re-prompting of erroneous reasoning steps, improving task accuracy with minimal disruption to correct outputs. Our findings suggest that arithmetic errors can be anticipated from internal activations alone, and that simple probes offer a viable path toward lightweight model self-correction.
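A correctness probe of the kind described is essentially a linear classifier over hidden states. The sketch below uses synthetic features in place of activations captured from the model; only the probing recipe itself (logistic regression on per-example hidden states with correctness labels) is illustrated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Lightweight correctness probe on hidden states (synthetic stand-in data).
rng = np.random.default_rng(0)
n, dim = 2000, 256
hidden_states = rng.normal(size=(n, dim))
# Synthetic labels: correctness correlates with a linear direction in the states.
w_true = rng.normal(size=dim)
is_correct = (hidden_states @ w_true + rng.normal(scale=2.0, size=n)) > 0

X_tr, X_te, y_tr, y_te = train_test_split(
    hidden_states, is_correct, test_size=0.25, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.3f}")
```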
pdf
bib
abs
NILE: Internal Consistency Alignment in Large Language Models
Minda Hu
|
Qiyuan Zhang
|
Yufei Wang
|
Bowei He
|
Hongru Wang
|
Jingyan Zhou
|
Liangyou Li
|
Yasheng Wang
|
Chen Ma
|
Irwin King
Recent advances show that world knowledge in Instruction Fine-Tuning (IFT) datasets that is incompatible with LLMs’ internal knowledge can greatly hurt IFT performance. However, the effective integration and balancing of the internal knowledge of LLMs, acquired during pre-training, with existing IFT datasets remains a largely underexplored area of research. To address this gap, this work introduces NILE, a novel framework to optimize the effectiveness of IFT by adjusting IFT datasets through carefully aligning the world and internal knowledge. NILE employs a three-stage pipeline to effectively quantify and adjust consistency with the internal knowledge of target LLMs. Our analysis provides compelling evidence that balancing such consistency with pre-trained internal knowledge is pivotal for unleashing LLM potential, and confirms that NILE can systematically contribute to these substantial performance improvements. Experimental results demonstrate that NILE-aligned IFT datasets sharply boost LLM performance across multiple LLM ability evaluation datasets, achieving up to 66.6% gain on Arena-Hard and 68.5% on Alpaca-Eval V2.
pdf
bib
abs
Mining the Past with Dual Criteria: Integrating Three types of Historical Information for Context-aware Event Forecasting
Rong Ma
|
Lei Wang
|
Yating Yang
|
Bo Ma
|
Rui Dong
|
Fengyi Yang
|
Ahtamjan Ahmat
|
Kaiwen Lu
|
Xinyue Wang
Event forecasting requires modeling historical event data to predict future events, and achieving accurate predictions depends on effectively capturing the relevant historical information that aids forecasting. Most existing methods focus on entities and structural dependencies to capture historical clues but often overlook implicitly relevant information. This limitation arises from overlooking event semantics and deeper factual associations that are not explicitly connected in the graph structure but are nonetheless critical for accurate forecasting. To address this, we propose a dual-criteria constraint strategy that leverages event semantics for relevance modeling and incorporates a self-supervised semantic filter based on factual event associations to capture implicitly relevant historical information. Building on this strategy, our method, termed ITHI (Integrating Three types of Historical Information), combines sequential event information, periodically repeated event information, and relevant historical information to achieve context-aware event forecasting. We evaluated the proposed ITHI method on three public benchmark datasets, achieving state-of-the-art performance and significantly outperforming existing approaches. Additionally, we validated its effectiveness on two structured temporal knowledge graph forecasting datasets.
pdf
bib
abs
RAGferee: Building Contextual Reward Models for Retrieval-Augmented Generation
Andrei Catalin Coman
|
Ionut Teodor Sorodoc
|
Leonardo F. R. Ribeiro
|
Bill Byrne
|
James Henderson
|
Adrià de Gispert
Existing Reward Models (RMs), typically trained on general preference data, struggle in Retrieval Augmented Generation (RAG) settings, which require judging responses for faithfulness to retrieved context, relevance to the user query, appropriate refusals when context is insufficient, completeness and conciseness of information. To address the lack of publicly available RAG-centric preference datasets and specialised RMs, we introduce RAGferee, a methodology that repurposes question-answering (QA) datasets into preference pairs that prioritise groundedness over stylistic features, enabling the training of contextual RMs better suited to judging RAG responses. Using RAGferee, we curate a small preference dataset of 4K samples and fine-tune RMs ranging from 7B to 24B parameters. Our RAG-centric RMs achieve state-of-the-art performance on ContextualJudgeBench, surpassing existing 70B+ RMs trained on much larger (up to 2.4M samples) general corpora, with an absolute improvement of +15.5%.
pdf
bib
abs
Large Language Models Discriminate Against Speakers of German Dialects
Minh Duc Bui
|
Carolin Holtermann
|
Valentin Hofmann
|
Anne Lauscher
|
Katharina von der Wense
Dialects represent a significant component of human culture and are found across all regions of the world. In Germany, more than 40% of the population speaks a regional dialect (Adler and Hansen, 2022). However, despite cultural importance, individuals speaking dialects often face negative societal stereotypes. We examine whether such stereotypes are mirrored by large language models (LLMs). We draw on the sociolinguistic literature on dialect perception to analyze traits commonly associated with dialect speakers. Based on these traits, we assess the dialect naming bias and dialect usage bias expressed by LLMs in two tasks: association task and decision task. To assess a model’s dialect usage bias, we construct a novel evaluation corpus that pairs sentences from seven regional German dialects (e.g., Alemannic and Bavarian) with their standard German counterparts. We find that: (1) in the association task, all evaluated LLMs exhibit significant dialect naming and dialect usage bias against German dialect speakers, reflected in negative adjective associations; (2) all models reproduce these dialect naming and dialect usage biases in their decision making; and (3) contrary to prior work showing minimal bias with explicit demographic mentions, we find that explicitly labeling linguistic demographics—German dialect speakers—amplifies bias more than implicit cues like dialect usage.
pdf
bib
abs
Uncovering Argumentative Flow: A Question-Focus Discourse Structuring Framework
Yini Wang
|
Xian Zhou
|
Shengan Zheng
|
Linpeng Huang
|
Zhunchen Luo
|
Wei Luo
|
Xiaoying Bai
Understanding the underlying argumentative flow in analytic argumentative writing is essential for discourse comprehension, especially in complex argumentative discourse such as think-tank commentary. However, existing structure modeling approaches often rely on surface-level topic segmentation, failing to capture the author’s rhetorical intent and reasoning process. To address this limitation, we propose a Question-Focus discourse structuring framework that explicitly models the underlying argumentative flow by anchoring each argumentative unit to a guiding question (reflecting the author’s intent) and a set of attentional foci (highlighting analytical pathways). To assess its effectiveness, we introduce an argument reconstruction task in which the modeled discourse structure guides both evidence retrieval and argument generation. We construct a high-quality dataset comprising 600 authoritative Chinese think-tank articles for experimental analysis. To quantitatively evaluate performance, we propose two novel metrics: (1) Claim Coverage, measuring the proportion of original claims preserved or similarly expressed in reconstructions, and (2) Evidence Coverage, assessing the completeness of retrieved supporting evidence. Experimental results show that our framework uncovers the author’s argumentative logic more effectively and offers better structural guidance for reconstruction, yielding up to a 10% gain in claim coverage and outperforming strong baselines across both curated and LLM-based metrics.
pdf
bib
abs
AbsVis – Benchmarking How Humans and Vision-Language Models “See” Abstract Concepts in Images
Tarun Tater
|
Diego Frassinelli
|
Sabine Schulte im Walde
Abstract concepts like mercy and peace often lack clear visual grounding, and thus challenge humans and models to provide suitable image representations. To address this challenge, we introduce AbsVis – a dataset of 675 images annotated with 14,175 concept–explanation attributions from humans and two Vision-Language Models (VLMs: Qwen and LLaVA), where each concept is accompanied by a textual explanation. We compare human and VLM attributions in terms of diversity, abstractness, and alignment, and find that humans attribute more varied concepts. AbsVis also includes 2,680 human preference judgments evaluating the quality of a subset of these annotations, showing that overlapping concepts (attributed by both humans and VLMs) are most preferred. Explanations clarify and strengthen the perceived attributions, both from humans and VLMs. Finally, we show that VLMs can approximate human preferences and use them to fine-tune VLMs via Direct Preference Optimization (DPO), yielding improved alignment with preferred concept–explanation pairs.
pdf
bib
abs
A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages
Tatiana Anikina
|
Jan Cegin
|
Jakub Simko
|
Simon Ostermann
Large Language Models (LLMs) are increasingly used to generate synthetic textual data for training smaller specialized models. However, a comparison of various generation strategies for low-resource language settings is lacking. While various prompting strategies have been proposed—such as demonstrations, label-based summaries, and self-revision—their comparative effectiveness remains unclear, especially for low-resource languages. In this paper, we systematically evaluate the performance of these generation strategies and their combinations across 11 typologically diverse languages, including several extremely low-resource ones. Using three NLP tasks and four open-source LLMs, we assess downstream model performance on generated versus gold-standard data. Our results show that strategic combinations of generation methods — particularly target-language demonstrations with LLM-based revisions — yield strong performance, narrowing the gap with real data to as little as 5% in some settings. We also find that smart prompting techniques can reduce the advantage of larger LLMs, highlighting efficient strategies for synthetic data generation with smaller models in low-resource scenarios.
pdf
bib
abs
Alignment with Fill-In-the-Middle for Enhancing Code Generation
Houxing Ren
|
Zimu Lu
|
Weikang Shi
|
Haotian Hou
|
Yunqiao Yang
|
Ke Wang
|
Aojun Zhou
|
Junting Pan
|
Mingjie Zhan
|
Hongsheng Li
The code generation capabilities of Large Language Models (LLMs) have advanced applications like tool invocation and problem-solving. However, improving performance in code-related tasks remains challenging due to limited training data that is verifiable with accurate test cases. While Direct Preference Optimization (DPO) has shown promise, existing methods for generating test cases still face limitations. In this paper, we propose a novel approach that splits code snippets into smaller, granular blocks, creating more diverse DPO pairs from the same test cases. Additionally, we introduce the Abstract Syntax Tree (AST) splitting and curriculum training method to enhance the DPO training. Our approach demonstrates significant improvements in code generation tasks, as validated by experiments on benchmark datasets such as HumanEval (+), MBPP (+), APPS, LiveCodeBench, and BigCodeBench. Code and data are available at https://github.com/SenseLLM/StructureCoder.
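Splitting a solution into granular blocks can be done with Python's standard ast module, as in the sketch below. How the resulting blocks are paired into DPO preferences and scheduled into a curriculum is specific to the paper and not reproduced here.

```python
import ast

# Illustrative use of the `ast` module to split a solution into granular blocks
# (here: the top-level statements of a function body).
source = '''
def solve(nums):
    total = 0
    for x in nums:
        if x % 2 == 0:
            total += x
    return total
'''

tree = ast.parse(source)
func = tree.body[0]                     # the function definition node
blocks = [ast.get_source_segment(source, stmt) for stmt in func.body]
for i, block in enumerate(blocks):
    print(f"--- block {i} ---")
    print(block)
```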
pdf
bib
abs
A Middle Path for On-Premises LLM Deployment: Preserving Privacy Without Sacrificing Model Confidentiality
Hanbo Huang
|
Yihan Li
|
Bowen Jiang
|
Bo Jiang
|
Lin Liu
|
Zhuotao Liu
|
Ruoyu Sun
|
Shiyu Liang
Privacy-sensitive users require deploying large language models (LLMs) within their own infrastructure (on-premises) to safeguard private data and enable customization. However, vulnerabilities in local environments can lead to unauthorized access and potential model theft. To address this, prior research on small models has explored securing only the output layer within hardware-secured devices to balance model confidentiality and customization. Yet this approach fails to protect LLMs effectively. In this paper, we discover that (1) query-based distillation attacks targeting the secured top layer can produce a functionally equivalent replica of the victim model; (2) when securing the same number of layers, bottom layers before a transition layer provide stronger protection against distillation attacks than top layers, with comparable effects on customization performance; and (3) the number of secured layers creates a trade-off between protection and customization flexibility. Based on these insights, we propose SOLID, a novel deployment framework that secures a few bottom layers in a secure environment and introduces an efficient metric to optimize the trade-off by determining the ideal number of hidden layers. Extensive experiments on five models (1.3B to 70B parameters) demonstrate that SOLID outperforms baselines, achieving a better balance between protection and downstream customization.
pdf
bib
abs
Variance Sensitivity Induces Attention Entropy Collapse and Instability in Transformers
Jonghyun Hong
|
Sungyoon Lee
Attention-based language models commonly rely on the softmax function to convert attention logits into probability distributions. However, this softmax re-weighting can lead to *attention entropy collapse*, in which attention disproportionately concentrates on a single token, ultimately causing training instability. In this work, we identify the high *variance sensitivity* of softmax as a primary cause of this collapse. We show that *entropy-stable* attention methods, which either control or are insensitive to the variance of attention logits, can prevent entropy collapse and enable more stable training. We provide empirical evidence of this effect in both large language models (LLMs) and a small Transformer model composed solely of self-attention and support our findings with theoretical analysis. Moreover, we show that the concentration of attention probabilities increases the norm of the probability matrix, leading to exploding gradients.
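The variance sensitivity of softmax is easy to see numerically: scaling a fixed logit vector (i.e., increasing its variance) drives the attention entropy toward zero. The sketch below is only a toy illustration of this effect, not the paper's theoretical analysis.

```python
import numpy as np

# Toy illustration: higher logit variance -> lower softmax (attention) entropy.

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

rng = np.random.default_rng(0)
logits = rng.normal(size=64)

for scale in [0.5, 1.0, 2.0, 4.0, 8.0]:
    p = softmax(scale * logits)
    print(f"logit std {scale * logits.std():5.2f}  ->  attention entropy {entropy(p):.3f}")
```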
pdf
bib
abs
X-FLoRA: Cross-modal Federated Learning with Modality-expert LoRA for Medical VQA
Min Hyuk Kim
|
Changheon Kim
|
Seok Bong Yoo
Medical visual question answering (VQA) and federated learning (FL) have emerged as vital approaches for enabling privacy-preserving, collaborative learning across clinical institutions. However, both these approaches face significant challenges in cross-modal FL scenarios, where each client possesses unpaired images from only one modality. To address this limitation, we propose X-FLoRA, a cross-modal FL framework that uses modality-expert low-rank adaptation (LoRA) for medical VQA. Specifically, X-FLoRA enables the synthesis of images from one modality to another without requiring data sharing between clients. This is achieved by training a backward translation model within a federated asymmetric translation scheme that integrates clinical semantics from textual data. Additionally, X-FLoRA introduces modality-expert LoRA, which fine-tunes separate LoRA modules to strengthen modality-specific representations in the VQA task. The server aggregates the trained backward translation models and fine-tuned LoRA modules using discriminator quality scores and expert-aware weighting, which regulate the relative contributions from different clients. Experiments were conducted on VQA datasets encompassing different medical modalities, and the results demonstrate that X-FLoRA outperforms existing FL methods in terms of VQA performance.
pdf
bib
abs
Robust Native Language Identification through Agentic Decomposition
Ahmet Yavuz Uluslu
|
Tannon Kew
|
Tilia Ellendorff
|
Gerold Schneider
|
Rico Sennrich
Large language models (LLMs) often achieve high performance in native language identification (NLI) benchmarks by leveraging superficial contextual clues such as names, locations, and cultural stereotypes, rather than the underlying linguistic patterns indicative of native language (L1) influence. To improve robustness, previous work has instructed LLMs to disregard such clues. In this work, we demonstrate that such a strategy is unreliable and model predictions can be easily altered by misleading hints. To address this problem, we introduce an agentic NLI pipeline inspired by forensic linguistics, where specialized agents accumulate and categorize diverse linguistic evidence before an independent final overall assessment. In this final assessment, a goal-aware coordinating agent synthesizes all evidence to make the NLI prediction. On two benchmark datasets, our approach significantly enhances NLI robustness against misleading contextual clues and performance consistency compared to standard prompting methods.
pdf
bib
abs
ConsistentChat: Building Skeleton-Guided Consistent Multi-Turn Dialogues for Large Language Models from Scratch
Jiawei Chen
|
Xinyan Guan
|
Qianhao Yuan
|
Mo Guozhao
|
Weixiang Zhou
|
Yaojie Lu
|
Hongyu Lin
|
Ben He
|
Le Sun
|
Xianpei Han
Current instruction data synthesis methods primarily focus on single-turn instructions and often neglect cross-turn coherence, resulting in context drift and reduced task completion rates in extended conversations. To address this limitation, we propose Skeleton-Guided Multi-Turn Dialogue Generation, a framework that constrains multi-turn instruction synthesis by explicitly modeling human conversational intent. It operates in two stages: (1) Intent Modeling, which captures the global structure of human dialogues by assigning each conversation to one of nine well-defined intent trajectories, ensuring a coherent and goal-oriented information flow; and (2) Skeleton Generation, which constructs a structurally grounded sequence of user queries aligned with the modeled intent, thereby serving as a scaffold that constrains and guides the downstream instruction synthesis process. Based on this process, we construct ConsistentChat, a multi-turn instruction dataset with approximately 15,000 multi-turn conversations and 224,392 utterances. Experiments on the Light, Topdial, and MT-Eval benchmarks show that models fine-tuned on ConsistentChat achieve a 20–30% improvement in chat consistency and up to a 15% increase in task success rate, significantly outperforming models trained on existing single-turn and multi-turn instruction datasets.
pdf
bib
abs
Does Acceleration Cause Hidden Instability in Vision Language Models? Uncovering Instance-Level Divergence Through a Large-Scale Empirical Study
Yizheng Sun
|
Hao Li
|
Chang Xu
|
Hongpeng Zhou
|
Chenghua Lin
|
Riza Batista-Navarro
|
Jingyuan Sun
Vision-Language Models (VLMs) are powerful yet computationally intensive for widespread practical deployments. To address this challenge without costly re-training, post-training acceleration techniques like quantization and token reduction are extensively explored. However, current acceleration evaluations primarily target minimal overall performance degradation, overlooking a crucial question: does the accelerated model still give the same answers to the same questions as it did before acceleration? This is vital for stability-centered industrial applications where consistently correct answers for specific, known situations are paramount, such as in AI-based disease diagnosis. We systematically investigate this for accelerated VLMs, testing four leading models (LLaVA-1.5, LLaVA-Next, Qwen2-VL, Qwen2.5-VL) with eight acceleration methods on ten multi-modal benchmarks. Our findings are stark: despite minimal aggregate performance drops, accelerated models changed original answers up to 20% of the time. Critically, up to 6.5% of these changes converted correct answers to incorrect. Input perturbations magnified these inconsistencies, and the trend is confirmed by case studies with the medical VLM LLaVA-Med. This research reveals a significant oversight in VLM acceleration, stressing an urgent need for instance-level stability checks to ensure trustworthy real-world deployment.
pdf
bib
abs
When Annotators Disagree, Topology Explains: Mapper, a Topological Tool for Exploring Text Embedding Geometry and Ambiguity
Nisrine Rair
|
Alban Goupil
|
Valeriu Vrabie
|
Emmanuel Chochoy
Language models are often evaluated with scalar metrics like accuracy, but such measures fail to capture how models internally represent ambiguity, especially when human annotators disagree. We propose a topological perspective to analyze how fine-tuned models encode ambiguity and, more generally, individual instances. Applied to RoBERTa-Large on the MD-Offense dataset, Mapper, a tool from topological data analysis, reveals that fine-tuning restructures embedding space into modular, non-convex regions aligned with model predictions, even for highly ambiguous cases. Over 98% of connected components exhibit ≥ 90% prediction purity, yet alignment with ground-truth labels drops in ambiguous data, surfacing a hidden tension between structural confidence and label uncertainty. Unlike traditional tools such as PCA or UMAP, Mapper captures this geometry directly, uncovering decision regions, boundary collapses, and overconfident clusters. Our findings position Mapper as a powerful diagnostic tool for understanding how models resolve ambiguity. Beyond visualization, it also enables topological metrics that may inform proactive modeling strategies in subjective NLP tasks.
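A stripped-down Mapper pass over classifier embeddings looks roughly like the sketch below: project onto a one-dimensional lens, cover the lens with overlapping intervals, cluster within each interval, and inspect prediction purity per cluster. The lens choice, interval count, clustering parameters, and synthetic data are assumptions for illustration and do not reproduce the paper's configuration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

# Toy Mapper-style pass: 1-D lens, overlapping cover, per-interval clustering,
# then prediction purity per cluster. Synthetic embeddings stand in for the
# fine-tuned model's sentence representations.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 1, (200, 32)), rng.normal(3, 1, (200, 32))])
preds = np.array([0] * 200 + [1] * 200)               # model predictions
lens = PCA(n_components=1).fit_transform(emb)[:, 0]    # lens (filter) function

n_intervals, overlap = 8, 0.3
lo, hi = lens.min(), lens.max()
width = (hi - lo) / n_intervals

for i in range(n_intervals):
    start = lo + i * width - overlap * width
    end = lo + (i + 1) * width + overlap * width
    idx = np.where((lens >= start) & (lens <= end))[0]
    if len(idx) < 5:
        continue
    labels = DBSCAN(eps=9.0, min_samples=5).fit_predict(emb[idx])
    for lab in sorted(set(labels) - {-1}):
        members = idx[labels == lab]
        purity = np.bincount(preds[members]).max() / len(members)
        print(f"interval {i}: cluster of {len(members):3d} points, prediction purity {purity:.2f}")
```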
pdf
bib
abs
Self-Critique and Refinement for Faithful Natural Language Explanations
Yingming Wang
|
Pepa Atanasova
With the rapid development of Large Language Models (LLMs), Natural Language Explanations (NLEs) have become increasingly important for understanding model predictions. However, these explanations often fail to faithfully represent the model’s actual reasoning process. While existing work has demonstrated that LLMs can self-critique and refine their initial outputs for various tasks, this capability remains unexplored for improving explanation faithfulness. To address this gap, we introduce Self-critique and Refinement for Natural Language Explanations (SR-NLE), a framework that enables models to improve the faithfulness of their own explanations – specifically, post-hoc NLEs – through an iterative critique and refinement process without external supervision. Our framework leverages different feedback mechanisms to guide the refinement process, including natural language self-feedback and, notably, a novel feedback approach based on feature attribution that highlights important input words. Our experiments across three datasets and four state-of-the-art LLMs demonstrate that SR-NLE significantly reduces unfaithfulness rates, with our best method achieving an average unfaithfulness rate of 36.02%, compared to 54.81% for baseline – an absolute reduction of 18.79%. These findings reveal that the investigated LLMs can indeed refine their explanations to better reflect their actual reasoning process, requiring only appropriate guidance through feedback without additional training or fine-tuning.
pdf
bib
abs
The Psychology of Falsehood: A Human-Centric Survey of Misinformation Detection
Arghodeep Nandi
|
Megha Sundriyal
|
Euna Mehnaz Khan
|
Jikai Sun
|
Emily K. Vraga
|
Jaideep Srivastava
|
Tanmoy Chakraborty
Misinformation remains one of the most significant issues in the digital age. While automated fact-checking has emerged as a viable solution, most current systems are limited to evaluating factual accuracy. However, the detrimental effect of misinformation transcends simple falsehoods; it takes advantage of how individuals perceive, interpret, and emotionally react to information. This underscores the need to move beyond factuality and adopt more human-centered detection frameworks. In this survey, we explore the evolving interplay between traditional fact-checking approaches and psychological concepts such as cognitive biases, social dynamics, and emotional responses. By analyzing state-of-the-art misinformation detection systems through the lens of human psychology and behavior, we reveal critical limitations of current methods and identify opportunities for improvement. Additionally, we outline future research directions aimed at creating more robust and adaptive frameworks, such as neuro-behavioural models that integrate technological factors with the complexities of human cognition and social influence. These approaches offer promising pathways to more effectively detect and mitigate the societal harms of misinformation.
pdf
bib
abs
SEAL: Structure and Element Aware Learning Improves Long Structured Document Retrieval
Xinhao Huang
|
Zhibo Ren
|
Yipeng Yu
|
Ying Zhou
|
Zulong Chen
|
Zeyi Wen
In long structured document retrieval, existing methods typically fine-tune pre-trained language models (PLMs) using contrastive learning on datasets lacking explicit structural information. This practice suffers from two critical issues: 1) current methods fail to leverage structural features and element-level semantics effectively, and 2) datasets containing structural metadata are lacking. To bridge these gaps, we propose SEAL, a novel contrastive learning framework. It leverages structure-aware learning to preserve semantic hierarchies and masked element alignment for fine-grained semantic discrimination. Furthermore, we release StructDocRetrieval, a long structured document retrieval dataset with rich structural annotations. Extensive experiments on both the released and industrial datasets across various modern PLMs, together with online A/B testing, demonstrate consistent improvements, boosting NDCG@10 from 73.96% to 77.84% on BGE-M3. The resources are available at https://github.com/xinhaoH/SEAL.
pdf
bib
abs
AnchorAttention: Difference-Aware Sparse Attention with Stripe Granularity
Yu Zhang
|
Dong Guo
|
Fang Wu
|
Guoliang Zhu
|
Dian Ding
|
Yiming Zhang
Large Language Models (LLMs) with extended context lengths face significant computational challenges during the pre-filling phase, primarily due to the quadratic complexity of self-attention. Existing methods typically employ dynamic pattern matching and block-sparse low-level implementations. However, their reliance on local information for pattern identification fails to capture global contexts, and the coarse granularity of blocks leads to persistent internal sparsity, resulting in suboptimal accuracy and efficiency. To address these limitations, we propose AnchorAttention, a difference-aware, dynamic sparse attention mechanism that efficiently identifies critical attention regions at a finer stripe granularity while adapting to global contextual information, achieving superior speed and accuracy. AnchorAttention comprises three key components: (1) Pattern-based Anchor Computation, leveraging the commonalities present across all inputs to rapidly compute a set of near-maximum scores as anchors; (2) Difference-aware Stripe Sparsity Identification, performing difference-aware comparisons with the anchors to quickly obtain discrete coordinates of significant regions in a stripe-like sparsity pattern; (3) Fine-grained Sparse Computation, replacing the traditional contiguous loading strategy with a discrete key-value loading approach to maximize sparsity rates while preserving hardware computational potential. Additionally, we integrate the identification strategy into a single operator to maximize parallelization potential. With its finer-grained sparsity strategy, AnchorAttention achieves higher sparsity rates at the same recall level, significantly reducing computation time. Compared to previous state-of-the-art methods, at a text length of 128k, it achieves a speedup of 1.44× while maintaining higher recall rates.
pdf
bib
abs
Attacks by Content: Automated Fact-checking is an AI Security Issue
Michael Sejr Schlichtkrull
When AI agents retrieve and reason over external documents, adversaries can manipulate the data they receive to subvert their behaviour. Previous research has studied indirect prompt injection, where the attacker injects malicious instructions. We argue that injection of instructions is not necessary to manipulate agents – attackers could instead supply biased, misleading, or false information. We term this an *attack by content*. Existing defenses, which focus on detecting hidden commands, are ineffective against attacks by content. To defend themselves and their users, agents must critically evaluate retrieved information, corroborating claims with external evidence and evaluating source trustworthiness. We argue that this is analogous to an existing NLP task, automated fact-checking, which we propose to repurpose as a cognitive self-defense tool for agents.
pdf
bib
abs
MUZO: Leveraging Multiple Queries and Momentum for Zeroth-Order Fine-Tuning of Large Language Models
Yuezhang Peng
|
Yuxin Liu
|
Fei Wen
|
Xie Chen
Fine-tuning pre-trained large language models (LLMs) on downstream tasks has achieved significant success across various domains. However, as model sizes grow, traditional first-order fine-tuning algorithms incur substantial memory overhead due to the need for activation storage for back-propagation (BP). The BP-free Memory-Efficient Zeroth-Order Optimization (MeZO) method estimates gradients through finite differences, avoiding the storage of activation values, and has been demonstrated as a viable approach for fine-tuning large language models. This work proposes the Multiple-query Memory Efficient Zeroth-Order (MUZO) method, which is based on variance-reduced multiple queries to obtain the average of gradient estimates. When combined with Adam optimizer, MUZO-Adam demonstrates superior performance in fine-tuning various LLMs. Furthermore, we provide theoretical guarantees for the convergence of the MUZO-Adam optimizer. Extensive experiments empirically demonstrate that MUZO-Adam converges better than MeZO-SGD and achieves near first-order optimizer performance on downstream classification, multiple-choice, and generation tasks.
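The core update can be sketched on a toy objective: average several two-point finite-difference gradient estimates (the multiple queries) and feed the result into an Adam-style step. The query count, perturbation scale, and hyperparameters below are illustrative; the sketch is in the spirit of MeZO/MUZO rather than the exact MUZO-Adam algorithm.

```python
import numpy as np

# Multi-query zeroth-order gradient estimate + Adam-style update on a toy loss.

def loss(theta):
    return float(((theta - 3.0) ** 2).sum())

def zo_gradient(theta, eps=1e-3, n_queries=4, rng=None):
    """Average of several two-point finite-difference gradient estimates."""
    grad = np.zeros_like(theta)
    for _ in range(n_queries):
        z = rng.standard_normal(theta.shape)
        g = (loss(theta + eps * z) - loss(theta - eps * z)) / (2 * eps)
        grad += g * z
    return grad / n_queries

rng = np.random.default_rng(0)
theta = np.zeros(10)
m, v = np.zeros_like(theta), np.zeros_like(theta)
lr, b1, b2, eps_adam = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 301):
    g = zo_gradient(theta, rng=rng)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    theta -= lr * m_hat / (np.sqrt(v_hat) + eps_adam)

print(f"final loss: {loss(theta):.4f}")  # should approach 0 on this toy problem
```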
pdf
bib
abs
Your Language Model Can Secretly Write Like Humans: Contrastive Paraphrase Attacks on LLM-Generated Text Detectors
Hao Fang
|
Jiawei Kong
|
Tianqu Zhuang
|
Yixiang Qiu
|
Kuofeng Gao
|
Bin Chen
|
Shu-Tao Xia
|
Yaowei Wang
|
Min Zhang
The misuse of large language models (LLMs), such as academic plagiarism, has driven the development of detectors to identify LLM-generated texts. To bypass these detectors, paraphrase attacks have emerged to purposely rewrite these texts to evade detection. Despite the success, existing methods require substantial data and computational budgets to train a specialized paraphraser, and their attack efficacy greatly reduces when faced with advanced detection algorithms. To address this, we propose Contrastive Paraphrase Attack (CoPA), a training-free method that effectively deceives text detectors using off-the-shelf LLMs. The first step is to carefully craft instructions that encourage LLMs to produce more human-like texts. Nonetheless, we observe that the inherent statistical biases of LLMs can still result in some generated texts carrying certain machine-like attributes that can be captured by detectors. To overcome this, CoPA constructs an auxiliary machine-like word distribution as a contrast to the human-like distribution generated by the LLM. By subtracting the machine-like patterns from the human-like distribution during the decoding process, CoPA is able to produce sentences that are less discernible by text detectors. Our theoretical analysis suggests the superiority of the proposed attack. Extensive experiments validate the effectiveness of CoPA in fooling text detectors across various scenarios.
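The decoding-time contrast can be sketched as subtracting a scaled machine-like log-distribution from the human-like one before sampling. In practice both distributions would come from the same off-the-shelf LLM under two different prompts; the random logits and the weight alpha below are placeholders.

```python
import torch
import torch.nn.functional as F

# Sketch of contrastive decoding in the spirit of CoPA (placeholder logits).

def contrastive_next_token(human_logits, machine_logits, alpha=0.5, temperature=1.0):
    """Subtract the machine-like log-distribution from the human-like one, then sample."""
    log_p_human = F.log_softmax(human_logits / temperature, dim=-1)
    log_p_machine = F.log_softmax(machine_logits / temperature, dim=-1)
    adjusted = log_p_human - alpha * log_p_machine
    probs = F.softmax(adjusted, dim=-1)
    return torch.multinomial(probs, num_samples=1)

torch.manual_seed(0)
vocab = 50
human_logits = torch.randn(vocab)     # LLM under the "write like a human" prompt
machine_logits = torch.randn(vocab)   # LLM under the auxiliary machine-like prompt
print(int(contrastive_next_token(human_logits, machine_logits)))
```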
pdf
bib
abs
Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA
Sergey Pletenev
|
Maria Marina
|
Nikolay Ivanov
|
Daria Galimzianova
|
Nikita Krayko
|
Mikhail Salnikov
|
Vasily Konovalov
|
Alexander Panchenko
|
Viktor Moskvoretskii
Large Language Models (LLMs) often hallucinate in question answering (QA) tasks. A key yet underexplored factor contributing to this is the temporality of questions – whether they are evergreen (answers remain stable over time) or mutable (answers change). In this work, we introduce EverGreenQA, the first multilingual QA dataset with evergreen labels, supporting both evaluation and training. Using EverGreenQA, we benchmark 12 modern LLMs to assess whether they encode question temporality explicitly (via verbalized judgments) or implicitly (via uncertainty signals). We also train EG-E5, a lightweight multilingual classifier that achieves SoTA performance on this task. Finally, we demonstrate the practical utility of evergreen classification across three applications: improving self-knowledge estimation, filtering QA datasets, and explaining GPT-4o’s retrieval behavior.
pdf
bib
abs
Steering Language Models in Multi-Token Generation: A Case Study on Tense and Aspect
Alina Klerings
|
Jannik Brinkmann
|
Daniel Ruffinelli
|
Simone Paolo Ponzetto
Large language models (LLMs) are able to generate grammatically well-formed text, but how do they encode their syntactic knowledge internally? While prior work has focused largely on binary grammatical contrasts, in this work, we study the representation and control of two multidimensional hierarchical grammar phenomena—verb tense and aspect—and for each, identify distinct, orthogonal directions in residual space using linear discriminant analysis. Next, we demonstrate causal control over both grammatical features through concept steering across three generation tasks. Then, we use these identified features in a case study to investigate factors influencing effective steering in multi-token generation. We find that steering strength, location, and duration are crucial parameters for reducing undesirable side effects such as topic shift and degeneration. Our findings suggest that models encode tense and aspect in structurally organized, human-like ways, but effective control of such features during generation is sensitive to multiple factors and requires manual tuning or automated optimization.
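Concept steering of this kind amounts to adding a scaled feature direction to the residual stream at a chosen layer during generation. The sketch below uses a random stand-in for an LDA-derived tense direction and a fixed strength; in a real model the addition would be applied through a forward hook, and the layer, strength, and duration would need the tuning the abstract describes.

```python
import numpy as np

# Sketch of activation steering with a grammatical-feature direction.
rng = np.random.default_rng(0)
d_model = 768
tense_direction = rng.normal(size=d_model)        # stand-in for an LDA direction
tense_direction /= np.linalg.norm(tense_direction)

def steer(residual_stream, direction, strength=4.0):
    """residual_stream: (seq_len, d_model). Returns steered activations."""
    return residual_stream + strength * direction

hidden = rng.normal(size=(12, d_model))           # stand-in for one layer's activations
steered = steer(hidden, tense_direction)
print(np.linalg.norm(steered - hidden, axis=-1)[:3])  # constant shift per token
```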
pdf
bib
abs
DocReRank: Single-Page Hard Negative Query Generation for Training Multi-Modal RAG Rerankers
Navve Wasserman
|
Oliver Heinimann
|
Yuval Golbari
|
Tal Zimbalist
|
Eli Schwartz
|
Michal Irani
Rerankers play a critical role in multimodal Retrieval-Augmented Generation (RAG) by refining the ranking of an initial set of retrieved documents. Rerankers are typically trained using hard negative mining, whose goal is to select pages for each query that rank high but are actually irrelevant. However, this selection process is typically passive and restricted to what the retriever can find in the available corpus, leading to several inherent limitations. These include limited diversity, negative examples that are often not hard enough, low controllability, and frequent false negatives that harm training. Our paper proposes an alternative approach: Single-Page Hard Negative Query Generation, which goes the other way around. Instead of retrieving negative pages per query, we generate hard negative queries per page. Using an automated LLM-VLM pipeline, and given a page and its positive query, we create hard negatives by rephrasing the query to be as similar as possible in form and context, yet not answerable from the page. This paradigm enables fine-grained control over the generated queries, resulting in diverse, hard, and targeted negatives. It also supports efficient false negative verification. Our experiments show that rerankers trained with data generated using our approach outperform existing models and significantly improve retrieval performance.
pdf
bib
abs
Reason to Rote: Rethinking Memorization in Reasoning
Yupei Du
|
Philipp Mondorf
|
Silvia Casola
|
Yuekun Yao
|
Robert Litschko
|
Barbara Plank
Large language models readily memorize arbitrary training instances, such as label noise, yet they perform strikingly well on reasoning tasks. In this work, we investigate how language models memorize label noise, and why such memorization in many cases does not heavily affect generalizable reasoning capabilities. Using two controllable synthetic reasoning tasks with noisy labels, four-digit addition (FDA) and two-hop relational reasoning (THR), we discover a reliance of memorization on generalizable reasoning mechanisms: models continue to compute intermediate reasoning outputs even when retrieving memorized noisy labels, and intervening on reasoning adversely affects memorization. We further show that memorization operates through distributed encoding, i.e., aggregating various inputs and intermediate results, rather than building a look-up mechanism from inputs to noisy labels. Moreover, our FDA case study reveals memorization occurs via outlier heuristics, where existing neuron activation patterns are slightly shifted to fit noisy labels. Together, our findings suggest that memorization of label noise in language models builds on, rather than overrides, the underlying reasoning mechanisms, shedding light on the intriguing phenomenon of benign memorization.
pdf
bib
abs
VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions
Kazuki Matsuda
|
Yuiga Wada
|
Shinnosuke Hirano
|
Seitaro Otsuki
|
Komei Sugiura
In this study, we focus on the automatic evaluation of long and detailed image captions generated by multimodal Large Language Models (MLLMs). Most existing automatic evaluation metrics for image captioning are primarily designed for short captions and are not suitable for evaluating long captions. Moreover, recent LLM-as-a-Judge approaches suffer from slow inference due to their reliance on autoregressive inference and early fusion of visual information. To address these limitations, we propose VELA, an automatic evaluation metric for long captions developed within a novel LLM-Hybrid-as-a-Judge framework. Furthermore, we propose LongCap-Arena, a benchmark specifically designed for evaluating metrics for long captions. This benchmark comprises 7,805 images, the corresponding human-provided long reference captions and long candidate captions, and 32,246 human judgments from three distinct perspectives: Descriptiveness, Relevance, and Fluency. We demonstrated that VELA outperformed existing metrics and achieved superhuman performance on LongCap-Arena.
pdf
bib
abs
LLM-Independent Adaptive RAG: Let the Question Speak for Itself
Maria Marina
|
Nikolay Ivanov
|
Sergey Pletenev
|
Mikhail Salnikov
|
Daria Galimzianova
|
Nikita Krayko
|
Vasily Konovalov
|
Alexander Panchenko
|
Viktor Moskvoretskii
Large Language Models (LLMs) are prone to hallucinations, and Retrieval-Augmented Generation (RAG) helps mitigate this, but at a high computational cost while risking misinformation. Adaptive retrieval aims to retrieve only when necessary, but existing approaches rely on LLM-based uncertainty estimation, which remains inefficient and impractical. In this study, we introduce lightweight LLM-independent adaptive retrieval methods based on external information. We investigated 27 features, organized into 7 groups, and their hybrid combinations. We evaluated these methods on 6 QA datasets, assessing the QA performance and efficiency. The results show that our approach matches the performance of complex LLM-based methods while achieving significant efficiency gains, demonstrating the potential of external information for adaptive retrieval.
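An LLM-independent retrieve-or-not decision can be as simple as a small classifier over cheap question features. The three features below (question length, entity count, a popularity proxy) and the synthetic labels are illustrative stand-ins for the 27 features studied in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Lightweight retrieve-or-not classifier over external question features.
rng = np.random.default_rng(0)
n = 5000
q_length = rng.integers(3, 40, size=n)        # question length in tokens
entity_count = rng.integers(0, 6, size=n)     # number of named entities
popularity = rng.random(n)                    # proxy for how common the topic is
X = np.column_stack([q_length, entity_count, popularity])

# Synthetic labels: rare, entity-heavy questions tend to need retrieval.
needs_retrieval = (entity_count >= 2) & (popularity < 0.4)

clf = LogisticRegression(max_iter=1000).fit(X, needs_retrieval)

new_q = np.array([[12, 3, 0.1]])              # a long-tail, entity-heavy question
p_retrieve = clf.predict_proba(new_q)[0, 1]
print(f"P(retrieve) = {p_retrieve:.2f} ->", "retrieve" if p_retrieve > 0.5 else "answer directly")
```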
pdf
bib
abs
TurnBack: A Geospatial Route Cognition Benchmark for Large Language Models through Reverse Route
Hongyi Luo
|
Qing Cheng
|
Daniel Matos
|
Hari Krishna Gadi
|
Yanfeng Zhang
|
Lu Liu
|
Yongliang Wang
|
Niclas Zeller
|
Daniel Cremers
|
Liqiu Meng
Humans can interpret geospatial information through natural language, while the geospatial cognition capabilities of Large Language Models (LLMs) remain underexplored. Prior research in this domain has been constrained by non-quantifiable metrics and limited evaluation datasets; unclear research hierarchies further compound these limitations. Therefore, we propose a scalable benchmark and conduct a comprehensive evaluation of the geospatial route cognition of LLMs. We create a large-scale evaluation dataset comprising 36,000 routes from 12 metropolises. Then, we introduce PathBuilder, a novel tool for converting natural language instructions into navigation routes, and vice versa, bridging the gap between geospatial information and natural language. Finally, we propose a new evaluation framework and metrics to rigorously assess 9 state-of-the-art (SOTA) LLMs on the task of route reversal. The benchmark reveals that LLMs exhibit limited ability to reverse routes: most of the reverse routes neither return to the starting point nor are similar to the optimal route. Additionally, LLMs face challenges such as low robustness in route generation and high confidence in their incorrect answers.
pdf
bib
abs
Certainty in Uncertainty: Reasoning over Uncertain Knowledge Graphs with Statistical Guarantees
Yuqicheng Zhu
|
Jingcheng Wu
|
Yizhen Wang
|
Hongkuan Zhou
|
Jiaoyan Chen
|
Evgeny Kharlamov
|
Steffen Staab
Uncertain knowledge graph embedding (UnKGE) methods learn vector representations that capture both structural and uncertainty information to predict scores of unseen triples. However, existing methods produce only point estimates, without quantifying predictive uncertainty—limiting their reliability in high-stakes applications where understanding confidence in predictions is crucial. To address this limitation, we propose UnKGCP, a framework that generates prediction intervals guaranteed to contain the true score with a user-specified level of confidence. The length of the intervals reflects the model’s predictive uncertainty. UnKGCP builds on the conformal prediction framework but introduces a novel nonconformity measure tailored to UnKGE methods and an efficient procedure for interval construction. We provide theoretical guarantees for the intervals and empirically verify these guarantees. Extensive experiments on standard UKG benchmarks across diverse UnKGE methods further demonstrate that the intervals are sharp and effectively capture predictive uncertainty.
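The interval construction follows the usual split conformal recipe: compute nonconformity scores on a calibration set, take the appropriate quantile, and widen each point prediction by that amount. The sketch below uses the plain absolute residual as the nonconformity measure; the paper's measure tailored to UnKGE methods is not reproduced here.

```python
import numpy as np

# Split conformal prediction intervals around point scores (toy calibration data).
rng = np.random.default_rng(0)
n_cal = 500
true_scores = rng.random(n_cal)
predicted = np.clip(true_scores + rng.normal(scale=0.1, size=n_cal), 0, 1)

alpha = 0.1                                   # target 90% coverage
residuals = np.abs(true_scores - predicted)   # nonconformity scores
k = int(np.ceil((n_cal + 1) * (1 - alpha)))
q_hat = np.sort(residuals)[k - 1]             # conformal quantile

def prediction_interval(point_score):
    """Interval with finite-sample marginal coverage of ~1 - alpha."""
    return (max(0.0, point_score - q_hat), min(1.0, point_score + q_hat))

print(prediction_interval(0.72))
```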
pdf
bib
abs
Beyond Seen Data: Improving KBQA Generalization Through Schema-Guided Logical Form Generation
Shengxiang Gao
|
Jey Han Lau
|
Jianzhong Qi
Knowledge base question answering (KBQA) aims to answer user questions in natural language using rich human knowledge stored in large KBs. As current KBQA methods struggle with unseen knowledge base elements and their novel compositions at test time, we introduce SG-KBQA — a novel model that injects schema contexts into entity retrieval and logical form generation to tackle this issue. It exploits information about the semantics and structure of the knowledge base provided by schema contexts to enhance generalizability. We show that SG-KBQA achieves strong generalizability, outperforming state-of-the-art models on two commonly used benchmark datasets across a variety of test settings. Our source code is available at
https://github.com/gaosx2000/SG_KBQA.
pdf
bib
abs
A Training-Free Length Extrapolation Approach for LLMs: Greedy Attention Logit Interpolation
Yan Li
|
Tianyi Zhang
|
Zechuan Li
|
Caren Han
Transformer-based Large Language Models (LLMs) struggle with inputs exceeding their training context window due to positional out-of-distribution (O.O.D.) issues that disrupt attention. Existing solutions, including fine-tuning and training-free methods, face challenges like inefficiency, redundant interpolation, logit outliers, or loss of local positional information. We propose Greedy Attention Logit Interpolation (GALI), a training-free method that improves length extrapolation by greedily reusing pretrained positional intervals and interpolating attention logits to eliminate outliers. GALI achieves stable and superior performance across a wide range of long-context tasks without requiring input-length-specific tuning. Our analysis further reveals that LLMs interpret positional intervals unevenly and that restricting interpolation to narrower ranges improves performance, even on short-context tasks. GALI represents a step toward more robust and generalizable long-text processing in LLMs.
pdf
bib
abs
Taming Text-to-Image Synthesis for Novices: User-centric Prompt Generation via Multi-turn Guidance
Yilun Liu
|
Minggui He
|
Feiyu Yao
|
Yuhe Ji
|
Shimin Tao
|
Jingzhou Du
|
Justin Li
|
Jian Gao
|
Zhang Li
|
Hao Yang
|
Boxing Chen
|
Osamu Yoshie
The emergence of text-to-image synthesis (TIS) models has significantly influenced digital image creation by producing high-quality visuals from written descriptions. Yet these models are sensitive to textual prompts, posing a challenge for novice users who may not be familiar with TIS prompt writing. Existing solutions alleviate this via automatic prompt expansion or generation from a user query. However, this single-turn approach suffers from limited user-centricity in terms of result interpretability and user interactivity. Thus, we propose DialPrompt, a dialogue-based TIS prompt generation model that emphasizes user experience for novice users. DialPrompt is designed to follow a multi-turn workflow, where in each round of dialogue the model guides users to express their preferences on possible optimization dimensions before generating the final TIS prompt. To achieve this, we mined 15 essential dimensions for high-quality prompts from advanced users and curated a multi-turn dataset. Through training on this dataset, DialPrompt improves user-centricity by allowing users to perceive and control the creation process of TIS prompts. Experiments indicate that DialPrompt significantly improves user-centricity scores compared with existing approaches while maintaining a competitive quality of synthesized images. In our user evaluation, DialPrompt is highly rated by 19 human reviewers (especially novices).
pdf
bib
abs
We Need to Measure Data Diversity in NLP — Better and Broader
Dong Nguyen
|
Esther Ploeger
Although diversity in NLP datasets has received growing attention, the question of how to measure it remains largely underexplored. This opinion paper examines the conceptual and methodological challenges of measuring data diversity and argues that interdisciplinary perspectives are essential for developing more fine-grained and valid measures.
pdf
bib
abs
Sheaf Discovery with Joint Computation Graph Pruning and Flexible Granularity
Lei Yu
|
Jingcheng Niu
|
Zining Zhu
|
Xi Chen
|
Gerald Penn
In this paper, we introduce DiscoGP, a novel framework for extracting self-contained modular units, or sheaves, within neural language models (LMs). Sheaves extend the concept of functional circuits, a unit widely explored in interpretability research, by considering not only subsets of edges in an LM’s computation graph but also the model’s weight parameters. Our framework identifies sheaves through a gradient-based pruning algorithm that operates on both of these in such a way that reduces the original LM to a sparse skeleton that preserves certain core capabilities. Experimental results demonstrate that, across a range of linguistic and reasoning tasks, DiscoGP extracts sheaves that preserve 93-100% of a model’s performance on the identified task while comprising only 1-7% of the original weights and connections. Furthermore, our analysis reveals that, compared to previously identified LM circuits, the sheaves discovered by DiscoGP exhibit superior modularity and functional fidelity. Extending our method to the neuron level also unveils novel insights into the inner workings of LLMs.
pdf
bib
abs
Hierarchical Bracketing Encodings Work for Dependency Graphs
Ana Ezquerro
|
Carlos Gómez-Rodríguez
|
David Vilares
We revisit hierarchical bracketing encodings from a practical perspective in the context of dependency graph parsing. The approach encodes graphs as sequences, enabling linear-time parsing with n tagging actions, and still representing reentrancies, cycles, and empty nodes. Compared to existing graph linearizations, this representation substantially reduces the label space while preserving structural information. We evaluate it on a multilingual and multi-formalism benchmark, showing competitive results and consistent improvements over other methods in exact match accuracy.
pdf
bib
abs
Multimodal Fine-grained Context Interaction Graph Modeling for Conversational Speech Synthesis
Zhenqi Jia
|
Rui Liu
|
Berrak Sisman
|
Haizhou Li
Conversational Speech Synthesis (CSS) aims to generate speech with natural prosody by understanding the multimodal dialogue history (MDH). The latest work predicts the accurate prosody expression of the target utterance by modeling the utterance-level interaction characteristics of MDH and the target utterance. However, MDH contains fine-grained semantic and prosody knowledge at the word level. Existing methods overlook the fine-grained semantic and prosodic interaction modeling. To address this gap, we propose MFCIG-CSS, a novel Multimodal Fine-grained Context Interaction Graph-based CSS system. Our approach constructs two specialized multimodal fine-grained dialogue interaction graphs: a semantic interaction graph and a prosody interaction graph. These two interaction graphs effectively encode interactions between word-level semantics, prosody, and their influence on subsequent utterances in MDH. The encoded interaction features are then leveraged to enhance synthesized speech with natural conversational prosody. Experiments on the DailyTalk dataset demonstrate that MFCIG-CSS outperforms all baseline models in terms of prosodic expressiveness. Code and speech samples are available at https://github.com/AI-S2-Lab/MFCIG-CSS.
pdf
bib
abs
Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models
Mehdi Ali
|
Manuel Brack
|
Max Lübbering
|
Elias Wendt
|
Abbas Goher Khan
|
Richard Rutmann
|
Alex Jude
|
Maurice Kraus
|
Alexander Arno Weber
|
Felix Stollenwerk
|
David Kaczér
|
Florian Mai
|
Lucie Flek
|
Rafet Sifa
|
Nicolas Flores-Herr
|
Joachim Koehler
|
Patrick Schramowski
|
Michael Fromm
|
Kristian Kersting
High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Yet, the availability of suitable open-source multilingual datasets remains limited. Existing state-of-the-art datasets mostly rely on heuristic filtering methods, restricting both their cross-lingual transferability and scalability. Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale while significantly reducing computational demands. JQL distills LLMs’ annotation capabilities into lightweight annotators based on pretrained multilingual embeddings. These models exhibit robust multilingual and cross-lingual performance, even for languages and scripts unseen during training. Evaluated empirically across 35 languages, the resulting annotation pipeline substantially outperforms current heuristic filtering methods like Fineweb2. JQL notably enhances downstream model training quality and increases data retention rates. Our research provides practical insights and valuable resources for multilingual data curation, raising the standards of multilingual dataset development.
pdf
bib
abs
Conditional [MASK] Discrete Diffusion Language Model
Hyukhun Koh
|
Minha Jhang
|
Dohyung Kim
|
Sangmook Lee
|
Kyomin Jung
Although auto-regressive models excel in natural language processing, they often struggle to generate diverse text and provide limited controllability. Non-auto-regressive methods could be an alternative but often produce degenerate outputs and exhibit shortcomings in conditional generation. To address these challenges, we propose Diffusion-EAGS, a novel framework that integrates conditional masked language models into diffusion language models through the theoretical lens of a conditional Markov Random Field. In doing so, we propose entropy-adaptive Gibbs sampling and entropy-based noise scheduling to counterbalance each model’s shortcomings. Experimental results show that Diffusion-EAGS outperforms baselines and achieves the best quality-diversity tradeoff, demonstrating its effectiveness in non-autoregressive text generation.
pdf
bib
abs
Language-Guided Temporal Token Pruning for Efficient VideoLLM Processing
Yogesh Kumar
Vision Language Models (VLMs) struggle with long-form videos due to the quadratic complexity of attention mechanisms. We propose Language-Guided Temporal Token Pruning (LGTTP), which leverages temporal cues from queries to adaptively prune video tokens, preserving contextual continuity while reducing computational overhead. Unlike uniform pruning or keyframe selection, LGTTP retains higher token density in temporally relevant segments. Our model-agnostic framework integrates with TimeChat and LLaVA-Video, achieving a 65% reduction in computation while preserving 97-99% of the original performance. On QVHighlights, LGTTP improves HIT@1 by +9.5%, and on Charades-STA, it retains 99.6% of R@1. It excels on queries with explicit temporal markers and remains effective across general video understanding tasks.
pdf
bib
abs
A Fully Probabilistic Perspective on Large Language Model Unlearning: Evaluation and Optimization
Anda Cheng
|
Wei Huang
|
Yinggui Wang
Large Language Model Unlearning (LLMU) is a promising way to remove private or sensitive information from large language models. However, the comprehensive evaluation of LLMU remains underexplored. The dominant deterministic evaluation can yield overly optimistic assessments of unlearning efficacy. To mitigate this, we propose a Fully Probabilistic Evaluation (FPE) framework that incorporates input and output distributions in LLMU evaluation. FPE obtains a probabilistic evaluation result by querying unlearned models with various semantically similar inputs and multiple sampling attempts. We introduce an Input Distribution Sampling method in FPE to select high-quality inputs, enabling a stricter measure of information leakage risks. Furthermore, we introduce a Contrastive Embedding Loss (CEL) to advance the performance of LLMU. CEL employs contrastive learning to distance latent representations of unlearned samples from adaptively clustered contrast samples while aligning them with random vectors, leading to improved efficacy and robustness for LLMU. Our experiments show that FPE uncovers more unlearned information leakage risks than prior evaluation methods, and CEL improves unlearning effectiveness by at least 50.1% and robustness by at least 37.2% on Llama-2-7B while retaining high model utility.
pdf
bib
abs
IIET: Efficient Numerical Transformer via Implicit Iterative Euler Method
Xinyu Liu
|
Bei Li
|
Jiahao Liu
|
Junhao Ruan
|
Kechen Jiao
|
Hongyin Tang
|
Jingang Wang
|
Tong Xiao
|
JingBo Zhu
High-order numerical methods enhance Transformer performance in tasks like NLP and CV, but introduce a performance-efficiency trade-off due to increased computational overhead. Our analysis reveals that conventional efficiency techniques, such as distillation, can be detrimental to the performance of these models, exemplified by PCformer. To explore more optimizable ODE-based Transformer architectures, we propose the Iterative Implicit Euler Transformer (IIET), which simplifies high-order methods using an iterative implicit Euler approach. This simplification not only leads to superior performance but also facilitates model compression compared to PCformer. To enhance inference efficiency, we introduce Iteration Influence-Aware Distillation (IIAD). Through a flexible threshold, IIAD allows users to effectively balance the performance-efficiency trade-off. On lm-evaluation-harness, IIET boosts average accuracy by 2.65% over vanilla Transformers and 0.8% over PCformer. Its efficient variant, E-IIET, significantly cuts inference overhead by 55% while retaining 99.4% of the original task accuracy. Moreover, the most efficient IIET variant achieves an average performance gain exceeding 1.6% over vanilla Transformer with comparable speed.
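For intuition, here is a toy numerical sketch of the iterative implicit Euler idea referenced above: the implicit step y_{t+1} = y_t + h·f(y_{t+1}) is approximated by a few fixed-point iterations. In IIET the role of f would be played by a Transformer sub-layer, which this sketch does not model; the function name is a placeholder.

```python
import torch

def iterative_implicit_euler(f, y0: torch.Tensor, h: float, n_steps: int, n_iters: int = 3):
    """Approximate the implicit Euler update y_{t+1} = y_t + h * f(y_{t+1})
    with a small number of fixed-point iterations (toy illustration)."""
    y = y0
    trajectory = [y0]
    for _ in range(n_steps):
        y_next = y  # start the fixed-point iteration from the previous state
        for _ in range(n_iters):
            y_next = y + h * f(y_next)
        y = y_next
        trajectory.append(y)
    return torch.stack(trajectory)

# Example: exponential decay dy/dt = -y, which implicit Euler handles stably.
ys = iterative_implicit_euler(lambda y: -y, torch.ones(1), h=0.1, n_steps=10)
```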
pdf
bib
abs
WebEvolver: Enhancing Web Agent Self-Improvement with Co-evolving World Model
Tianqing Fang
|
Hongming Zhang
|
Zhisong Zhang
|
Kaixin Ma
|
Wenhao Yu
|
Haitao Mi
|
Dong Yu
Agent self-improvement, where agents autonomously train their underlying Large Language Model (LLM) on self-sampled trajectories, shows promising results but often stagnates in web environments due to limited exploration and under-utilization of pretrained web knowledge. To improve self-improvement performance, we propose a novel framework that introduces a co-evolving World Model LLM. This world model predicts the next observation based on the current observation and action within the web environment. The World Model serves dual roles: (1) as a virtual web server generating self-instructed training data to continuously refine the agent’s policy, and (2) as an imagination engine during inference, enabling look-ahead simulation to guide action selection for the agent LLM. Experiments in real-world web environments (Mind2Web-Live, WebVoyager, and GAIA-web) show a 10% performance gain over existing self-evolving agents, demonstrating the efficacy and generalizability of our approach, without using any distillation from more powerful closed-source models.
pdf
bib
abs
Leveraging Semantic Triples for Private Document Generation with Local Differential Privacy Guarantees
Stephen Meisenbacher
|
Maulik Chevli
|
Florian Matthes
Many works at the intersection of Differential Privacy (DP) in Natural Language Processing aim to protect privacy by transforming texts under DP guarantees. This can be performed in a variety of ways, from word perturbations to full document rewriting, and most often under *local* DP. Here, an input text must be made indistinguishable from any other potential text, within some bound governed by the privacy parameter 𝜀. Such a guarantee is quite demanding, and recent works show that privatizing texts under local DP can only be done reasonably under very high 𝜀 values. Addressing this challenge, we introduce **DP-ST**, which leverages semantic triples for neighborhood-aware private document generation under local DP guarantees. Through the evaluation of our method, we demonstrate the effectiveness of the *divide-and-conquer* paradigm, particularly when limiting the DP notion (and privacy guarantees) to that of a *privatization neighborhood*. When combined with LLM post-processing, our method allows for coherent text generation even at lower 𝜀 values, while still balancing privacy and utility. These findings highlight the importance of coherence in achieving balanced privatization outputs at reasonable 𝜀 levels.
pdf
bib
abs
HVGuard: Utilizing Multimodal Large Language Models for Hateful Video Detection
Yiheng Jing
|
Mingming Zhang
|
Yong Zhuang
|
Jiacheng Guo
|
Juan Wang
|
Xiaoyang Xu
|
Wenzhe Yi
|
Keyan Guo
|
Hongxin Hu
The rapid growth of video platforms has transformed information dissemination and led to an explosion of multimedia content. However, this widespread reach also introduces risks, as some users exploit these platforms to spread hate speech, which is often concealed through complex rhetoric, making hateful video detection a critical challenge. Existing detection methods rely heavily on unimodal analysis or simple feature fusion, struggling to capture cross-modal interactions and reason through implicit hate in sarcasm and metaphor. To address these limitations, we propose HVGuard, the first reasoning-based hateful video detection framework with multimodal large language models (MLLMs). Our approach integrates Chain-of-Thought (CoT) reasoning to enhance multimodal interaction modeling and implicit hate interpretation. Additionally, we design a Mixture-of-Experts (MoE) network for efficient multimodal fusion and final decision-making. The framework is modular and extensible, allowing flexible integration of different MLLMs and encoders. Experimental results demonstrate that HVGuard outperforms all existing advanced detection tools, achieving an improvement of 6.88% to 13.13% in accuracy and 9.21% to 34.37% in M-F1 on two public datasets covering both English and Chinese.
pdf
bib
abs
Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence
Yijiong Yu
|
Wei Wang
|
Ran Chen
|
Ji Pei
Recent advances in reasoning models have demonstrated significant improvements in accuracy by employing detailed and comprehensive reasoning processes. However, generating these lengthy reasoning sequences is computationally expensive and time-consuming. To address this inefficiency, we leverage the inherent parallelizability of certain tasks to accelerate the reasoning process. Specifically, when multiple parallel reasoning steps exist, we decode multiple tokens per forward pass via a tree-like attention mask within a single sequence, avoiding additional memory usage. Experimental results show that our method achieves up to nearly 100% speedup in decoding while largely maintaining answer quality. Our code is available at https://github.com/yuyijiong/parallel-decoding-in-one-sequence
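As a rough illustration of the tree-like attention mask mentioned above, the sketch below builds a boolean mask in which two parallel branches share a causal prefix but cannot attend to each other. The function name and the exact mask layout are assumptions, not the paper's implementation.

```python
import torch

def parallel_branch_mask(prefix_len: int, branch_lens: list) -> torch.Tensor:
    """Boolean attention mask (True = may attend): every token attends causally
    to the shared prefix, and branch tokens attend only within their own branch."""
    total = prefix_len + sum(branch_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    for i in range(prefix_len):
        mask[i, : i + 1] = True                  # causal mask over the shared prefix
    start = prefix_len
    for blen in branch_lens:
        for i in range(start, start + blen):
            mask[i, :prefix_len] = True          # see the full prefix
            mask[i, start : i + 1] = True        # causal within this branch only
        start += blen
    return mask

# Two independent 3-token reasoning branches decoded within one 10-token sequence.
m = parallel_branch_mask(prefix_len=4, branch_lens=[3, 3])
```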
pdf
bib
abs
SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design
Wenxin Tang
|
Jingyu Xiao
|
Wenxuan Jiang
|
Xi Xiao
|
Yuhang Wang
|
Xuxin Tang
|
Qing Li
|
Yuehe Ma
|
Junliang Liu
|
Shisong Tang
|
Michael R. Lyu
Manual slide creation is labor-intensive and requires expert prior knowledge. Existing natural language-based LLM generation methods struggle to capture the visual and structural nuances of slide designs. To address this, we formalize the Reference Image to Slide Generation task and propose Slide2Code, the first benchmark with difficulty-tiered samples based on a novel Slide Complexity Metric. We introduce SlideCoder, a layout-aware, retrieval-augmented framework for generating editable slides from reference images. SlideCoder integrates a Color Gradient-based Segmentation algorithm and a Hierarchical Retrieval-Augmented Generation method to decompose complex tasks and enhance code generation. We also release SlideMaster, a 7B open-source model fine-tuned with improved reverse-engineered data. Experiments show that SlideCoder outperforms state-of-the-art baselines by up to 40.5 points, demonstrating strong performance across layout fidelity, execution accuracy, and visual consistency. Our code is available at https://github.com/vinsontang1/SlideCoder.
pdf
bib
abs
LLM-OREF: An Open Relation Extraction Framework Based on Large Language Models
Hongyao Tu
|
Liang Zhang
|
Yujie Lin
|
Xin Lin
|
Haibo Zhang
|
Long Zhang
|
Jinsong Su
The goal of open relation extraction (OpenRE) is to develop an RE model that can generalize to new relations not encountered during training. Existing studies primarily formulate OpenRE as a clustering task. They first cluster all test instances based on the similarity between the instances, and then manually assign a new relation to each cluster. However, their reliance on human annotation limits their practicality. In this paper, we propose an OpenRE framework based on large language models (LLMs), which directly predicts new relations for test instances by leveraging their strong language understanding and generation abilities, without human intervention. Specifically, our framework consists of two core components: (1) a relation discoverer (RD), designed to predict new relations for test instances based on demonstrations formed by training instances with known relations; and (2) a relation predictor (RP), used to select the most likely relation for a test instance from n candidate relations, guided by demonstrations composed of their instances. To enhance the ability of our framework to predict new relations, we design a self-correcting inference strategy composed of three stages: relation discovery, relation denoising, and relation prediction. In the first stage, we use RD to preliminarily predict new relations for all test instances. Next, we apply RP to select some high-reliability test instances for each new relation from the prediction results of RD through a cross-validation method. During the third stage, we employ RP to re-predict the relations of all test instances based on the demonstrations constructed from these reliable test instances. Extensive experiments on three OpenRE datasets demonstrate the effectiveness of our framework. We release our code at https://github.com/XMUDeepLIT/LLM-OREF.git.
pdf
bib
abs
Ambiguity Awareness Optimization: Towards Semantic Disambiguation for Direct Preference Optimization
Jian Li
|
Shenglin Yin
|
Yujia Zhang
|
Alan Zhao
|
Xi Chen
|
Xiaohui Zhou
|
Pengfei Xu
Direct Preference Optimization (DPO) is a widely used reinforcement learning from human feedback (RLHF) method across various domains. The study of token importance has attracted widespread attention in DPO. Researchers have found that token importance is crucial for improving the effectiveness of DPO. It is observed that identical or semantically similar content (defined as ambiguous content) frequently appears within the preference pairs. We hypothesize that the presence of ambiguous content during DPO training may introduce ambiguity, thereby limiting further improvements in alignment. Through mathematical analysis and proof-of-concept experiments, we show that ambiguous content can introduce ambiguities, thereby degrading performance. To address this issue, we introduce Ambiguity Awareness Optimization (AAO), a simple yet effective approach that automatically re-weights ambiguous content to reduce ambiguities by calculating semantic similarity from preference pairs. Through extensive experiments, we demonstrate that AAO consistently and significantly surpasses state-of-the-art approaches in performance, without markedly increasing response length, across multiple model scales and widely adopted benchmark datasets, including AlpacaEval 2, MT-Bench, and Arena-Hard. Specifically, AAO outperforms DPO by up to 8.9 points on AlpacaEval 2 and achieves an improvement of up to 15.0 points on Arena-Hard.
pdf
bib
abs
Improving Multilingual Retrieval-Augmented Language Models through Dialectic Reasoning Argumentations
Leonardo Ranaldi
|
Federico Ranaldi
|
Fabio Massimo Zanzotto
|
Barry Haddow
|
Alexandra Birch
Retrieval-augmented generation (RAG) is key to improving large language models (LLMs) in systematically accessing richer factual knowledge. Yet, using RAG mechanisms brings intrinsic challenges, as LLMs must deal with conflicting knowledge, especially in multilingual retrieval, where the heterogeneity of the retrieved knowledge may deliver different outlooks. To make RAG more analytical, critical and grounded, we introduce Dialectic-RAG (DRAG), a modular approach guided by Argumentative Explanations, i.e., a structured reasoning process that systematically evaluates retrieved information by comparing, contrasting, and resolving conflicting perspectives. Given a query and a set of related multilingual documents, DRAG selects and exemplifies relevant knowledge to deliver dialectic explanations that, by critically weighing opposing arguments and filtering extraneous content, clearly determine the final response. We show the impact of our framework both as an in-context learning strategy and for constructing demonstrations to instruct smaller models. Our experiments demonstrate that DRAG significantly improves RAG approaches, requiring low-impact computational effort and providing robustness to knowledge perturbations.
pdf
bib
abs
Predicate-Guided Generation for Mathematical Reasoning
Jiajun Chen
|
Yik-Cheung Tam
We present Prolog-MATH, a curated corpus designed to support mathematical reasoning in large language models (LLMs) through logic programming. Each verbal math problem in the dataset is paired with a chain-of-thought explanation to generate a Prolog program via a two-stage automated pipeline. In the first stage, an LLM (e.g., Deepseek-V3) predicts a set of relevant mathematical predicates that could be useful in solving the problem. In the second stage, the LLM uses these suggested predicates along with the expected answer type to generate a complete Prolog program. To improve coverage, we fine-tune an open-source LLM using supervised fine-tuning, followed by GRPO (Group Relative Policy Optimization) training to address problems that Deepseek-V3 fails to solve. To support this training, we propose a predicate-aware reward function that evaluates how well the generated solution incorporates the suggested predicates, complementing the standard binary reward. Experimental results show that: 1) Our two-stage pipeline achieves 81.3% solution coverage on the MATH training set; 2) GRPO training with the predicate-aware reward function enables a series of base models to correctly solve additional problems missed by Deepseek-V3, further increasing solution coverage to 97.4%. Data and source code can be obtained at the Github repository.
pdf
bib
abs
ComplexTempQA: A 100m Dataset for Complex Temporal Question Answering
Raphael Gruber
|
Abdelrahman Abdallah
|
Michael Färber
|
Adam Jatowt
We introduce ComplexTempQA, a large-scale dataset consisting of over 100 million question-answer pairs designed to tackle the challenges in temporal question answering. ComplexTempQA significantly surpasses existing benchmarks in scale and scope. Utilizing Wikipedia and Wikidata, the dataset covers questions spanning over two decades and offers an unmatched scale. We introduce a new taxonomy that categorizes questions as attributes, comparisons, and counting questions, revolving around events, entities, and time periods, respectively. A standout feature of ComplexTempQA is the high complexity of its questions, which demand reasoning capabilities for answering such as across-time comparison, temporal aggregation, and multi-hop reasoning involving temporal event ordering and entity recognition. Additionally, each question is accompanied by detailed metadata, including specific time scopes, allowing for comprehensive evaluation of temporal reasoning abilities of large language models.
pdf
bib
abs
ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents
Qiuchen Wang
|
Ruixue Ding
|
Zehui Chen
|
Weiqi Wu
|
Shihang Wang
|
Pengjun Xie
|
Feng Zhao
Understanding information from visually rich documents remains a significant challenge for traditional Retrieval-Augmented Generation (RAG) methods. Existing benchmarks predominantly focus on image-based question answering (QA), overlooking the fundamental challenges of efficient retrieval, comprehension, and reasoning within dense visual documents. To bridge this gap, we introduce ViDoSeek, a novel dataset designed to evaluate RAG performance on visually rich documents requiring complex reasoning. Based on it, we identify key limitations in current RAG approaches: (i) purely visual retrieval methods struggle to effectively integrate both textual and visual features, and (ii) previous approaches often allocate insufficient reasoning tokens, limiting their effectiveness. To address these challenges, we propose ViDoRAG, a novel multi-agent RAG framework tailored for complex reasoning across visual documents. ViDoRAG employs a Gaussian Mixture Model (GMM)-based hybrid strategy to effectively handle multi-modal retrieval. To further elicit the model’s reasoning capabilities, we introduce an iterative agent workflow incorporating exploration, summarization, and reflection, providing a framework for investigating test-time scaling in RAG domains. Extensive experiments on ViDoSeek validate the effectiveness and generalization of our approach. Notably, ViDoRAG outperforms existing methods by over 10% on the competitive ViDoSeek benchmark. The code will be available.
pdf
bib
abs
IndoSafety: Culturally Grounded Safety for LLMs in Indonesian Languages
Muhammad Falensi Azmi
|
Muhammad Dehan Al Kautsar
|
Alfan Farizki Wicaksono
|
Fajri Koto
Although region-specific large language models (LLMs) are increasingly developed, their safety remains underexplored, particularly in culturally diverse settings like Indonesia, where sensitivity to local norms is essential and highly valued by the community. In this work, we present IndoSafety, the first high-quality, human-verified safety evaluation dataset tailored for the Indonesian context, covering five language varieties: formal and colloquial Indonesian, along with three major local languages: Javanese, Sundanese, and Minangkabau. IndoSafety is constructed by extending prior safety frameworks to develop a taxonomy that captures Indonesia’s sociocultural context. We find that existing Indonesian-centric LLMs often generate unsafe outputs, particularly in colloquial and local language settings, while fine-tuning on IndoSafety significantly improves safety while preserving task performance. Our work highlights the critical need for culturally grounded safety evaluation and provides a concrete step toward responsible LLM deployment in multilingual settings. Warning: This paper contains example data that may be offensive, harmful, or biased.
pdf
bib
abs
Can LLMs Help You at Work? A Sandbox for Evaluating LLM Agents in Enterprise Environments
Harsh Vishwakarma
|
Ankush Agarwal
|
Ojas Patil
|
Chaitanya Devaguptapu
|
Mahesh Chandran
Enterprise systems are crucial for enhancing productivity and decision-making among employees and customers. Integrating LLM based systems into enterprise systems enables intelligent automation, personalized experiences, and efficient information retrieval, driving operational efficiency and strategic growth. However, developing and evaluating such systems is challenging due to the inherent complexity of enterprise environments, where data is fragmented across multiple sources and governed by sophisticated access controls. We present EnterpriseBench, a comprehensive benchmark that simulates enterprise settings, featuring 500 diverse tasks across software engineering, HR, finance, and administrative domains. Our benchmark uniquely captures key enterprise characteristics including data source fragmentation, access control hierarchies, and cross-functional workflows. Additionally, we provide a novel data generation pipeline that creates internally consistent enterprise tasks from organizational metadata. Experiments with state-of-the-art LLM agents demonstrate that even the most capable models achieve only 41.8% task completion, highlighting significant opportunities for improvement in enterprise-focused AI systems.
pdf
bib
abs
Steering LLM Reasoning Through Bias-Only Adaptation
Viacheslav Sinii
|
Alexey Gorbatovski
|
Artem Cherepanov
|
Boris Shaposhnikov
|
Nikita Balagansky
|
Daniil Gavrilov
We show that training a single d-dimensional steering vector per layer with reinforcement learning, while freezing all base weights, matches the accuracy of fully RL-tuned reasoning models on mathematical-reasoning tasks. On an 8 billion-parameter model this adds only ≈ 0.0016% additional parameters and reproduces performance across a range of base models and mathematical-reasoning benchmarks. These results tighten the upper bound on the parameter budget required for high-level chain-of-thought reasoning, indicating that millions of adapter weights are unnecessary. The minimal trainable footprint reduces optimizer memory and inter-GPU communication, lowering the overall cost of fine-tuning. Moreover, a logit-lens analysis shows that the learned vectors amplify coherent token directions, providing clearer insight into the model’s internal computations.
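A minimal sketch of the bias-only setup described above, assuming a Hugging Face-style decoder whose layers return the hidden state as the first element of a tuple: one trainable d-dimensional vector per layer is added to the residual stream via forward hooks while all base weights stay frozen. The class name is a placeholder and the RL training loop is omitted.

```python
import torch
import torch.nn as nn

class SteeringVectors(nn.Module):
    """Adds one trainable steering vector per decoder layer; base weights are frozen."""
    def __init__(self, model: nn.Module, layers, hidden_size: int):
        super().__init__()
        self.model = model
        for p in self.model.parameters():
            p.requires_grad_(False)
        self.vectors = nn.ParameterList(
            [nn.Parameter(torch.zeros(hidden_size)) for _ in layers]
        )
        for layer, vec in zip(layers, self.vectors):
            layer.register_forward_hook(self._make_hook(vec))

    @staticmethod
    def _make_hook(vec):
        def hook(module, inputs, output):
            # Decoder layers typically return a tuple whose first element is the hidden state.
            if isinstance(output, tuple):
                return (output[0] + vec,) + output[1:]
            return output + vec
        return hook

    def forward(self, *args, **kwargs):
        return self.model(*args, **kwargs)
```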
pdf
bib
abs
VLASCD: A Visual Language Action Model for Simultaneous Chatting and Decision Making
Zuojin Tang
|
Bin Hu
|
Chenyang Zhao
|
De Ma
|
Gang Pan
|
Bin Liu
Recent large pretrained models such as LLMs (e.g., GPT series) and VLAs (e.g., OpenVLA) have achieved notable progress on multimodal tasks, yet they are built upon a multi-input single-output (MISO) paradigm. We show that this paradigm fundamentally limits performance in multi-input multi-output (MIMO) scenarios, where parallel task execution is required. In MISO architectures, tasks compete for a shared output channel, creating mutual exclusion effects that cause unbalanced optimization and degraded performance. To address this gap, we introduce MIMO-VLA (VLASCD), a unified training framework that enables concurrent multi-task outputs, exemplified by simultaneous dialogue generation and decision-making. Inspired by human cognition, MIMO-VLA eliminates interference between tasks and supports efficient parallel processing. Experiments on the CARLA autonomous driving platform demonstrate that MIMO-VLA substantially outperforms state-of-the-art MISO-based LLMs, reinforcement learning models, and VLAs in MIMO settings, establishing a new direction for multimodal and multitask learning.
pdf
bib
abs
M-LongDoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework
Yew Ken Chia
|
Liying Cheng
|
Hou Pong Chan
|
Maojia Song
|
Chaoqun Liu
|
Mahani Aljunied
|
Soujanya Poria
|
Lidong Bing
The ability to understand and answer questions over documents can be useful in many business and practical applications. However, documents often contain lengthy and diverse multimodal contents such as texts, figures, and tables, which are very time-consuming for humans to read thoroughly. Hence, there is an urgent need to develop effective and automated methods to aid humans in this task. In this work, we introduce M-LongDoc, a benchmark of 851 samples, and an automated framework to evaluate the performance of large multimodal models. We further propose a retrieval-aware tuning approach for efficient and effective multimodal document reading. Compared to existing works, our benchmark consists of more recent and lengthy documents with hundreds of pages, while also requiring open-ended explanations and not just extractive answers. To our knowledge, our training framework is the first to directly address the retrieval setting for multimodal long documents. To enhance open models, we construct a training corpus in a fully automatic manner. Experiments show that our tuning approach significantly improves the correctness of model responses by 4.6%.
pdf
bib
abs
Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models
Pu Jian
|
Junhong Wu
|
Wei Sun
|
Chen Wang
|
Shuo Ren
|
Jiajun Zhang
Recent advances in text-only “slow-thinking” reasoning have prompted efforts to transfer this capability to vision-language models (VLMs), for training visual reasoning models (VRMs). However, such transfer faces critical challenges: effective “slow thinking” in VRMs requires visual reflection, the ability to check the reasoning process based on visual information. Through quantitative analysis, we observe that current VRMs exhibit limited visual reflection, as their attention to visual information diminishes rapidly with longer generated responses. To address this challenge, we propose Reflection-V, a new VRM that enhances visual reflection through reasoning data construction for cold-start training and reward design for reinforcement learning (RL). First, we construct vision-centered reasoning data by leveraging an agent that interacts between VLMs and reasoning LLMs, enabling cold-start learning of visual reflection patterns. Second, a visual attention based reward model is employed during RL to encourage reasoning based on visual information. As a result, Reflection-V demonstrates significant improvements across multiple visual reasoning benchmarks. Furthermore, Reflection-V maintains a stronger and more consistent reliance on visual information during visual reasoning, indicating effective enhancement in visual reflection capabilities.
pdf
bib
abs
FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs’ Responsiveness to Human Feedback
Youquan Li
|
Miao Zheng
|
Fan Yang
|
Guosheng Dong
|
Bin Cui
|
Weipeng Chen
|
Zenan Zhou
|
Wentao Zhang
Human feedback is crucial in the interactions between humans and Large Language Models (LLMs). However, existing research primarily focuses on benchmarking LLMs in single-turn dialogues. Even in benchmarks designed for multi-turn dialogues, the user utterances are often independent, neglecting the nuanced and complex nature of human feedback within real-world usage scenarios. To fill this research gap, we introduce FB-Bench, a fine-grained, multi-task benchmark designed to evaluate LLMs’ responsiveness to human feedback under real-world usage scenarios in Chinese. Drawing from the two main interaction scenarios, FB-Bench comprises 591 meticulously curated samples, encompassing eight task types, five deficiency types of response, and nine feedback types. We extensively evaluate a broad array of popular LLMs, revealing significant variations in their performance across different interaction scenarios. Further analysis indicates that task, human feedback, and deficiencies of previous responses can also significantly impact LLMs’ responsiveness. Our findings underscore both the strengths and limitations of current models, providing valuable insights and directions for future research.
pdf
bib
abs
HYDRA: A Multi-Head Encoder-only Architecture for Hierarchical Text Classification
Fabian Karl
|
Ansgar Scherp
We introduce HYDRA, a simple yet effective multi-head encoder-only architecture for hierarchical text classification that treats each level in the hierarchy as a separate classification task with its own label space. State-of-the-art approaches rely on complex components like graph encoders, label semantics, and autoregressive decoders. We demonstrate that such complexity is often unnecessary. Through parameter sharing and level-specific parameterization, HYDRA enables flat models to incorporate hierarchical awareness without architectural complexity. Experiments on four benchmarks (NYT, RCV1-V2, BGC, and WOS) demonstrate that HYDRA always increases the performance over flat models and matches or exceeds the performance of complex state-of-the-art methods.
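The multi-head idea lends itself to a very small sketch: a shared pretrained encoder with one linear classification head per hierarchy level, each with its own label space. The class name, checkpoint, and label-space sizes below are placeholders, and HYDRA's level-specific parameterization and parameter-sharing details are not reproduced.

```python
import torch.nn as nn
from transformers import AutoModel

class MultiHeadHierarchicalClassifier(nn.Module):
    """Shared encoder with one classification head per hierarchy level (sketch)."""
    def __init__(self, encoder_name: str, labels_per_level: list):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.heads = nn.ModuleList([nn.Linear(hidden, n) for n in labels_per_level])

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]           # [CLS] representation
        return [head(cls) for head in self.heads]   # one logit tensor per level

# Training would typically sum a cross-entropy loss over the per-level logits.
model = MultiHeadHierarchicalClassifier("bert-base-uncased", labels_per_level=[4, 26, 134])
```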
pdf
bib
abs
CARD: Cross-modal Agent Framework for Generative and Editable Residential Design
Pengyu Zeng
|
Jun Yin
|
Miao Zhang
|
Yuqin Dai
|
Jizhizi Li
|
ZhanXiang Jin
|
Shuai Lu
In recent years, architectural design automation has made significant progress, but the complexity of open-world environments continues to make residential design a challenging task, often requiring experienced architects to perform multiple iterations and human-computer interactions. Therefore, assisting ordinary users in navigating these complex environments to generate and edit residential design is crucial. In this paper, we present the CARD framework, which leverages a system of specialized cross-modal agents to adapt to complex open-world environments. The framework includes a point-based cross-modal information representation (CMI-P) that encodes the geometry and spatial relationships of residential rooms, a cross-modal residential generation model, supported by our customized Text2FloorEdit model, that acts as the lead designer to create standardized floor plans, and an embedded expert knowledge base for evaluating whether the designs meet user requirements and residential codes, providing feedback accordingly. Finally, a 3D rendering module assists users in visualizing and understanding the layout. CARD enables cross-modal residential generation from free-text input, empowering users to adapt to complex environments without requiring specialized expertise.
pdf
bib
abs
DrDiff: Dynamic Routing Diffusion with Hierarchical Attention for Breaking the Efficiency-Quality Trade-off
Jusheng Zhang
|
Yijia Fan
|
Kaitong Cai
|
Zimeng Huang
|
Xiaofei Sun
|
Jian Wang
|
Chengpei Tang
|
Keze Wang
This paper introduces DrDiff, a novel framework for long-text generation that overcomes the efficiency-quality trade-off through three core technologies. First, we design a dynamic expert scheduling mechanism that intelligently allocates computational resources during the diffusion process based on text complexity, enabling more efficient handling of text generation tasks of varying difficulty. Second, we introduce a Hierarchical Sparse Attention (HSA) mechanism that adaptively adjusts attention patterns according to a variety of input lengths, reducing computational complexity from O(n²) to O(n) while maintaining model performance. Finally, we propose a Semantic Anchor States (SAS) module that combines with DPM-solver++ to reduce diffusion steps, significantly improving generation speed. Comprehensive experiments on various long-text generation benchmarks demonstrate the superiority of our DrDiff over the existing SOTA methods.
pdf
bib
abs
FaST: Feature-aware Sampling and Tuning for Personalized Preference Alignment with Limited Data
Thibaut Thonet
|
Germán Kruszewski
|
Jos Rozen
|
Pierre Erbacher
|
Marc Dymetman
LLM-powered conversational assistants are often deployed in a one-size-fits-all manner, which fails to accommodate individual user preferences. Recently, LLM personalization – tailoring models to align with specific user preferences – has gained increasing attention as a way to bridge this gap. In this work, we specifically focus on a practical yet challenging setting where only a small set of preference annotations can be collected per user – a problem we define as Personalized Preference Alignment with Limited Data (PPALLI). To support research in this area, we introduce two datasets – DnD and ELIP – and benchmark a variety of alignment techniques on them. We further propose FaST, a highly parameter-efficient approach that leverages high-level features automatically discovered from the data, achieving the best overall performance.
pdf
bib
abs
On LLM-Based Scientific Inductive Reasoning Beyond Equations
Brian S. Lin
|
Jiaxin Yuan
|
Zihan Zhou
|
Shouli Wang
|
Shuo Wang
|
Cunliang Kong
|
Qi Shi
|
Yuxuan Li
|
Liner Yang
|
Zhiyuan Liu
|
Maosong Sun
As large language models (LLMs) increasingly exhibit human-like capabilities, a fundamental question emerges: How can we enable LLMs to learn the underlying patterns from limited examples in entirely novel environments and apply them effectively? This question is central to the ability of LLMs in inductive reasoning. Existing research on LLM-based inductive reasoning can be broadly categorized based on whether the underlying rules are expressible via explicit mathematical equations. However, many recent studies in the beyond-equations category have emphasized rule design without grounding them in specific scenarios. Inspired by the parallels between inductive reasoning and human scientific discovery, we propose the task of LLM-Based Scientific Inductive Reasoning Beyond Equations and introduce a new benchmark, SIRBench-V1, to evaluate the inductive reasoning abilities of LLMs in scientific settings. Our experimental results show that current LLMs still struggle with this task, underscoring its difficulty and the need for further advancement in this area.
pdf
bib
abs
SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation
Xiaofu Chen
|
Israfel Salazar
|
Yova Kementchedjhieva
As interest grows in generating long, detailed image captions, standard evaluation metrics become increasingly unreliable. N-gram-based metrics, though efficient, fail to capture semantic correctness. Representational Similarity (RS) metrics, designed to address this, initially saw limited use due to high computational costs, while today, despite advances in hardware, they remain unpopular due to low correlation with human judgments. Meanwhile, metrics based on large language models (LLMs) show strong correlation with human judgments, but remain too expensive for iterative use during model development. We introduce SPECS (Specificity-Enhanced CLIPScore), a reference-free RS metric tailored to long image captioning. SPECS modifies CLIP with a new objective that emphasizes specificity: rewarding correct details and penalizing incorrect ones. We show that SPECS matches the performance of open-source LLM-based metrics in correlation with human judgments, while being far more efficient. This makes it a practical alternative for iterative checkpoint evaluation during image captioning model development. Our code can be found at https://github.com/mbzuai-nlp/SPECS.
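For context, the sketch below computes a plain reference-free CLIPScore-style similarity between an image and a caption using the Hugging Face CLIP API; SPECS itself further fine-tunes the model with its specificity-aware objective, which is not shown here. The function name is an assumption.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_style_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings.
    Note: the text encoder truncates captions to 77 tokens."""
    inputs = processor(text=[caption], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())
```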
pdf
bib
abs
LM-Searcher: Cross-domain Neural Architecture Search with LLMs via Unified Numerical Encoding
Yuxuan Hu
|
Jihao Liu
|
Ke Wang
|
Jinliang Zheng
|
Weikang Shi
|
Manyuan Zhang
|
Qi Dou
|
Rui Liu
|
Aojun Zhou
|
Hongsheng Li
Recent progress in Large Language Models (LLMs) has opened new avenues for solving complex optimization problems, including Neural Architecture Search (NAS). However, existing LLM-driven NAS approaches rely heavily on prompt engineering and domain-specific tuning, limiting their practicality and scalability across diverse tasks. In this work, we propose LM-Searcher, a novel framework that leverages LLMs for cross-domain neural architecture optimization without the need for extensive domain-specific adaptation. Central to our approach is NCode, a universal numerical string representation for neural architectures, which enables cross-domain architecture encoding and search. We also reformulate the NAS problem as a ranking task, training LLMs to select high-performing architectures from candidate pools using instruction-tuning samples derived from a novel pruning-based subspace sampling strategy. Our curated dataset, encompassing a wide range of architecture-performance pairs, encourages robust and transferable learning. Comprehensive experiments demonstrate that LM-Searcher achieves competitive performance in both in-domain (e.g., CNNs for image classification) and out-of-domain (e.g., LoRA configurations for segmentation and generation) tasks, establishing a new paradigm for flexible and generalizable LLM-based architecture search.
pdf
bib
abs
Does quantization affect models’ performance on long-context tasks?
Anmol Mekala
|
Anirudh Atmakuru
|
Yixiao Song
|
Marzena Karpinska
|
Mohit Iyyer
Large language models (LLMs) now support context windows exceeding 128K tokens, but this comes with significant memory requirements and high inference latency. Quantization can mitigate these costs, but may degrade performance. In this work, we present the first systematic evaluation of quantized LLMs on tasks with long inputs (≥64K tokens) and long-form outputs. Our evaluation spans 9.7K test examples, five quantization methods (FP8, GPTQ-int8, AWQ-int4, GPTQ-int4, BNB-nf4), and five models (Llama-3.1 8B and 70B; Qwen-2.5 7B, 32B, and 72B). We find that, on average, 8-bit quantization preserves accuracy (~0.8% drop), whereas 4-bit methods lead to substantial losses, especially for tasks involving long-context inputs (drops of up to 59%). This degradation tends to worsen when the input is in a language other than English. Crucially, the effects of quantization depend heavily on the quantization method, model, and task. For instance, while Qwen-2.5 72B remains robust under BNB-nf4, Llama-3.1 70B experiences a 32% performance drop on the same task. These findings highlight the importance of a careful, task-specific evaluation before deploying quantized LLMs, particularly in long-context scenarios and for languages other than English.
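As an example of the kind of configuration evaluated in this study, the snippet below loads a model with 4-bit BNB-nf4 quantization through the Transformers/bitsandbytes integration; the model identifier is illustrative and the long-context evaluation harness itself is not shown.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# BNB-nf4: 4-bit NormalFloat quantization with bfloat16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # one of the evaluated model families
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
```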
pdf
bib
abs
Token-Aware Editing of Internal Activations for Large Language Model Alignment
Tianbo Wang
|
Yuqing Ma
|
Kewei Liao
|
Chengzhao Yang
|
Zhange Zhang
|
Jiakai Wang
|
Xianglong Liu
Intervening on the internal activations of large language models (LLMs) provides an effective inference-time alignment approach to mitigate undesirable behaviors, such as generating erroneous or harmful content, thereby ensuring safe and reliable applications of LLMs. However, previous methods neglect the misalignment discrepancy among varied tokens, resulting in deviant alignment direction and inflexible editing strength. To address these issues, we propose a token-aware editing (TAE) approach to fully utilize token-level alignment information in the activation space, therefore realizing superior post-intervention performance. Specifically, a Mutual Information-guided Graph Aggregation (MIG) module first develops an MI-guided graph to exploit the tokens’ informative interaction for activation enrichment, thus improving alignment probing and facilitating intervention. Subsequently, Misalignment-aware Adaptive Intervention (MAI) comprehensively perceives the token-level misalignment degree from token representation and prediction to guide the adaptive adjustment of editing strength, thereby enhancing final alignment performance. Extensive experiments on three alignment capabilities demonstrate the efficacy of TAE, notably surpassing the baseline by 25.8% on the primary metric of truthfulness with minimal cost.
pdf
bib
abs
Bitune: Leveraging Bidirectional Attention to Improve Decoder-Only LLMs
Dawid Jan Kopiczko
|
Tijmen Blankevoort
|
Yuki M Asano
Decoder-only large language models typically rely solely on masked causal attention, which limits their expressiveness by restricting information flow to one direction. We propose Bitune, a method that enhances pretrained decoder-only LLMs by incorporating bidirectional attention into prompt processing. We evaluate Bitune in instruction-tuning and question-answering settings, showing significant improvements in performance on commonsense reasoning, arithmetic, and language understanding tasks. Furthermore, extensive ablation studies validate the role of each component of the method, and demonstrate that Bitune is compatible with various parameter-efficient finetuning techniques and full model finetuning.
pdf
bib
abs
Disambiguation in Conversational Question Answering in the Era of LLMs and Agents: A Survey
Mehrab Tanjim
|
Yeonjun In
|
Xiang Chen
|
Victor Bursztyn
|
Ryan A. Rossi
|
Sungchul Kim
|
Guang-Jie Ren
|
Vaishnavi Muppala
|
Shun Jiang
|
Yongsung Kim
|
Chanyoung Park
Ambiguity remains a fundamental challenge in Natural Language Processing (NLP) due to the inherent complexity and flexibility of human language. With the advent of Large Language Models (LLMs), addressing ambiguity has become even more critical due to their expanded capabilities and applications. In the context of Conversational Question Answering (CQA), this paper explores the definition, forms, and implications of ambiguity for language driven systems, particularly in the context of LLMs. We define key terms and concepts, categorize various disambiguation approaches enabled by LLMs, and provide a comparative analysis of their advantages and disadvantages. We also explore publicly available datasets for benchmarking ambiguity detection and resolution techniques and highlight their relevance for ongoing research. Finally, we identify open problems and future research directions, especially in agentic settings, proposing areas for further investigation. By offering a comprehensive review of current research on ambiguities and disambiguation with LLMs, we aim to contribute to the development of more robust and reliable LLM-based systems.
pdf
bib
abs
Plan Dynamically, Express Rhetorically: A Debate-Driven Rhetorical Framework for Argumentative Writing
Xueguan Zhao
|
Wenpeng Lu
|
Chaoqun Zheng
|
Weiyu Zhang
|
Jiasheng Si
|
Deyu Zhou
Argumentative essay generation (AEG) is a complex task that requires advanced semantic understanding, logical reasoning, and organized integration of perspectives. Despite showing promising performance, current efforts often overlook the dynamic and hierarchical nature of structural argumentative planning, and struggle with flexible rhetorical expression, leading to limited argument divergence and rhetorical optimization. Inspired by human debate behavior and Bitzer’s rhetorical situation theory, we propose a debate-driven rhetorical framework for argumentative writing. The uniqueness lies in three aspects: (1) it dynamically assesses the divergence of viewpoints and progressively reveals the hierarchical outline of arguments based on a depth-then-breadth paradigm, improving perspective divergence within argumentation; (2) it simulates human debate through iterative defender-attacker interactions, improving the logical coherence of arguments; (3) it incorporates Bitzer’s rhetorical situation theory to flexibly select appropriate rhetorical techniques, enabling richer rhetorical expression. Experiments on four benchmarks validate that our approach significantly improves logical depth, argumentative diversity, and rhetorical persuasiveness over existing state-of-the-art models.
pdf
bib
abs
TCPO: Thought-Centric Preference Optimization for Effective Embodied Decision-making
Kechen Jiao
|
Zhirui Fang
|
Jiahao Liu
|
Bei Li
|
Qifan Wang
|
Xinyu Liu
|
Junhao Ruan
|
Zhongjian Qiao
|
Yifan Zhu
|
Yaxin Xu
|
Jingang Wang
|
Xiu Li
Using effective generalization capabilities of vision language models (VLMs) in context-specific dynamic tasks for embodied artificial intelligence remains a significant challenge. Although supervised fine-tuned models can better align with the real physical world, they still exhibit sluggish responses and hallucination issues in dynamically changing environments, necessitating further alignment. Existing post-SFT methods, reliant on reinforcement learning and chain-of-thought (CoT) approaches, are constrained by sparse rewards and action-only optimization, resulting in low sample efficiency, poor consistency, and model degradation. To address these issues, this paper proposes Thought-Centric Preference Optimization (TCPO) for effective embodied decision-making. Specifically, TCPO introduces a stepwise preference-based optimization approach, transforming sparse reward signals into richer step sample pairs. It emphasizes the alignment of the model’s intermediate reasoning process, mitigating the problem of model degradation. Moreover, by incorporating Action Policy Consistency Constraint (APC), it further imposes consistency constraints on the model output. Experiments in the ALFWorld environment demonstrate an average success rate of **26.67%**, achieving a **6%** improvement over RL4VLM and validating the effectiveness of our approach in mitigating model degradation after fine-tuning. These results highlight the potential of integrating preference-based learning techniques with CoT processes to enhance the decision-making capabilities of vision-language models in embodied agents.
pdf
bib
abs
Reimagining Safety Alignment with An Image
Yifan Xia
|
Guorui Chen
|
Wenqian Yu
|
Zhijiang Li
|
Philip Torr
|
Jindong Gu
Large language models (LLMs) excel in diverse applications but face dual challenges: generating harmful content under jailbreak attacks and over-refusing benign queries due to rigid safety mechanisms. These issues severely affect the application of LLMs, especially in the medical and education fields. Existing approaches can be divided into three types: contrastive decoding, activation manipulation, and prompting strategies. However, all these approaches face challenges like inefficiency, fragility, or architectural constraints, ultimately failing to strike a balance between safety and usability. These problems are more obvious in multimodal large language models (MLLMs), especially in terms of heightened over-refusal in cross-modal tasks and new security risks arising from expanded attack surfaces. We propose Magic Image, an optimization-driven visual prompt framework that enhances security and reduces over-refusal at the same time. The Magic Image is optimized using gradients derived from harmful/benign training samples. Applying the magic image modifies the model’s original safety alignment, maintaining robust safety while reducing unnecessary denials. Experiments demonstrate its effectiveness in preserving model performance and improving the safety-responsiveness balance across datasets, including unseen data, offering a practical solution for reliable MLLM deployment.
pdf
bib
abs
Generative or Discriminative? Revisiting Text Classification in the Era of Transformers
Siva Rajesh Kasa
|
Karan Gupta
|
Sumegh Roychowdhury
|
Ashutosh Kumar
|
Yaswanth Biruduraju
|
Santhosh Kumar Kasa
|
Pattisapu Nikhil Priyatam
|
Arindam Bhattacharya
|
Shailendra Agarwal
|
Vijay Huddar
*The comparison between discriminative and generative classifiers has intrigued researchers since [Efron (1975)’s](https://www.jstor.org/stable/2285453) seminal analysis of logistic regression versus discriminant analysis. While early theoretical work established that generative classifiers exhibit lower sample complexity but higher asymptotic error in simple linear settings, these trade-offs remain unexplored in the transformer era. We present the first comprehensive evaluation of modern generative and discriminative architectures—Auto-regressive, Masked Language Modeling, Discrete Diffusion, and Encoders for text classification. Our study reveals that the classical “two regimes” phenomenon manifests distinctly across different architectures and training paradigms. Beyond accuracy, we analyze sample efficiency, calibration, noise robustness, and ordinality across diverse scenarios. Our findings offer practical guidance for selecting the most suitable modeling approach based on real-world constraints such as latency and data limitations.*
pdf
bib
abs
Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection
Miao Ziqi
|
Yi Ding
|
Lijun Li
|
Jing Shao
With the emergence of strong vision language capabilities, multimodal large language models (MLLMs) have demonstrated tremendous potential for real-world applications. However, the security vulnerabilities exhibited by the visual modality pose significant challenges to deploying such models in open-world environments. Recent studies have successfully induced harmful responses from target MLLMs by encoding harmful textual semantics directly into visual inputs. However, in these approaches, the visual modality primarily serves as a trigger for unsafe behavior, often exhibiting semantic ambiguity and lacking grounding in realistic scenarios. In this work, we define a novel setting: vision-centric jailbreak, where visual information serves as a necessary component in constructing a complete and realistic jailbreak context. Building on this setting, we propose the VisCo (Visual Contextual) Attack. VisCo fabricates contextual dialogue using four distinct vision-focused strategies, dynamically generating auxiliary images when necessary to construct a vision-centric jailbreak scenario. To maximize attack effectiveness, it incorporates automatic toxicity obfuscation and semantic refinement to produce a final attack prompt that reliably triggers harmful responses from the target black-box MLLMs. Specifically, VisCo achieves a toxicity score of 4.78 and an Attack Success Rate (ASR) of 85% on MM-SafetyBench against GPT-4o, significantly outperforming the baseline, which achieves a toxicity score of 2.48 and an ASR of 22.2%. Code: https://github.com/Dtc7w3PQ/Visco-Attack.
pdf
bib
abs
Can Large Language Models Win the International Mathematical Games?
Alessio Cocchieri
|
Luca Ragazzi
|
Giuseppe Tagliavini
|
Lorenzo Tordi
|
Antonella Carbonaro
|
Gianluca Moro
Recent advances in large language models (LLMs) have demonstrated strong mathematical reasoning abilities, even in visual contexts, with some models surpassing human performance on existing benchmarks. However, these benchmarks lack structured age categorization, clearly defined skill requirements, and—crucially—were not designed to assess human performance in international competitions. To address these limitations, we introduce MathGames, a new benchmark of 2,183 high-quality mathematical problems (both text-only and multimodal) in an open-ended format, sourced from an international mathematical games championship. Spanning seven age groups and a skill-based taxonomy, MathGames enables a structured evaluation of LLMs’ mathematical and logical reasoning abilities. Our experiments reveal a substantial gap between state-of-the-art LLMs and human participants—even 11-year-olds consistently outperform some of the strongest models—highlighting the need for advancements. Further, our detailed error analysis offers valuable insights to guide future research. The data is publicly available at https://disi-unibo-nlp.github.io/math-games.
pdf
bib
abs
CodeArena: Evaluating and Aligning CodeLLMs on Human Preference
Jian Yang
|
Jiaxi Yang
|
Wei Zhang
|
Jin Ke
|
Yibo Miao
|
Lei Zhang
|
Liqun Yang
|
Zeyu Cui
|
Yichang Zhang
|
Zhoujun Li
|
Binyuan Hui
|
Junyang Lin
Code large language models (codeLLMs) have made significant strides in code generation. Most previous code-related benchmarks, which consist of various programming exercises along with the corresponding test cases, are used as a common measure to evaluate the performance and capabilities of code LLMs. However, current code LLMs focus on synthesizing the correct code snippet, ignoring alignment with human preferences, where the query should be sampled from practical application scenarios and the model-generated responses should satisfy human preference. To bridge the gap between the model-generated response and human preference, we present CodeArena, a rigorous human-curated benchmark that emulates the complexity and diversity of real-world coding tasks, comprising 397 high-quality samples spanning 40 categories and 44 programming languages, carefully curated from user queries. Further, we propose SynCode-Instruct, a diverse synthetic instruction corpus (nearly 20B tokens) built by scaling instructions from the web, to verify the effectiveness of large-scale synthetic instruction fine-tuning: Qwen2.5-SynCoder, trained entirely on synthetic instruction data, achieves top-tier performance among open-source code LLMs. The results reveal performance differences between execution-based benchmarks and CodeArena. Our systematic experiments of CodeArena on 40+ LLMs reveal a notable performance gap between open SOTA code LLMs (e.g., Qwen2.5-Coder) and proprietary LLMs (e.g., OpenAI o1), underscoring the importance of human preference alignment.
pdf
bib
abs
Language models can learn implicit multi-hop reasoning, but only if they have lots of training data
Yuekun Yao
|
Yupei Du
|
Dawei Zhu
|
Michael Hahn
|
Alexander Koller
Implicit reasoning is the ability of a language model to solve multi-hop reasoning tasks in a single forward pass, without chain of thought. We investigate this capability using GPT2-style language models trained from scratch on controlled k-hop reasoning datasets (k = 2, 3, 4). We show that while such models can indeed learn implicit k-hop reasoning, the required training data grows exponentially in k, and the required number of transformer layers grows linearly in k. We offer a theoretical explanation for why this depth growth is necessary. We further find that the data requirement can be mitigated, but not eliminated, through curriculum learning.
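To illustrate what a controlled k-hop dataset can look like, here is a hypothetical sketch that composes k random bijections over a small entity set and asks for the entity reached after k hops; the paper's exact data format, vocabulary, and splits are not reproduced.

```python
import random

def make_khop_dataset(n_entities: int, k: int, n_examples: int, seed: int = 0):
    """Compose k random bijections (one per hop) and generate k-hop queries (sketch)."""
    rng = random.Random(seed)
    entities = [f"e{i}" for i in range(n_entities)]
    # One relation (a random permutation of the entities) per hop.
    relations = []
    for _ in range(k):
        perm = entities[:]
        rng.shuffle(perm)
        relations.append(dict(zip(entities, perm)))
    examples = []
    for _ in range(n_examples):
        start = rng.choice(entities)
        target = start
        for rel in relations:
            target = rel[target]  # follow one hop per relation
        examples.append({"query": f"{k}-hop from {start}", "answer": target})
    return relations, examples

facts, data = make_khop_dataset(n_entities=50, k=3, n_examples=10)
```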
pdf
bib
abs
UniversalCEFR: Enabling Open Multilingual Research on Language Proficiency Assessment
Joseph Marvin Imperial
|
Abdullah Barayan
|
Regina Stodden
|
Rodrigo Wilkens
|
Ricardo Muñoz Sánchez
|
Lingyun Gao
|
Melissa Torgbi
|
Dawn Knight
|
Gail Forey
|
Reka R. Jablonkai
|
Ekaterina Kochmar
|
Robert Joshua Reynolds
|
Eugénio Ribeiro
|
Horacio Saggion
|
Elena Volodina
|
Sowmya Vajjala
|
Thomas François
|
Fernando Alva-Manchego
|
Harish Tayyar Madabushi
We introduce UniversalCEFR, a large-scale multilingual multidimensional dataset of texts annotated according to the CEFR (Common European Framework of Reference) scale in 13 languages. To enable open research in both automated readability and language proficiency assessment, UniversalCEFR comprises 505,807 CEFR-labeled texts curated from educational and learner-oriented resources, standardized into a unified data format to support consistent processing, analysis, and modeling across tasks and languages. To demonstrate its utility, we conduct benchmark experiments using three modelling paradigms: a) linguistic feature-based classification, b) fine-tuning pre-trained LLMs, and c) descriptor-based prompting of instruction-tuned LLMs. Our results further support using linguistic features and fine-tuning pretrained models in multilingual CEFR level assessment. Overall, UniversalCEFR aims to establish best practices in data distribution in language proficiency research by standardising dataset formats and promoting their accessibility to the global research community.
pdf
bib
abs
CROP: Contextual Region-Oriented Visual Token Pruning
Jiawei Guo
|
Feifei Zhai
|
Pu Jian
|
Qianrun Wei
|
Yu Zhou
Current VLM-based VQA methods often process entire images, leading to excessive visual tokens that include redundant information irrelevant to the posed question. This abundance of unnecessary image details creates numerous visual tokens, drastically increasing memory and computational requirements in VLMs. To address this, we propose Contextual Region-Oriented Visual Token Pruning (CROP), a novel framework to compress visual tokens through a two-step process: Localization and Pruning. Specifically, CROP first employs an efficient model to identify the contextual region relevant to the input query. Subsequently, two distinct strategies are introduced for pruning: (1) Pre-LLM Compression (PLC), which adaptively compresses different image regions with varying ratios, and (2) Inner-LLM Pruning (ILP), a training-free method that prunes tokens within early LLM layers guided by the identified contextual region. Extensive experiments on a wide range of VQA tasks demonstrate that CROP significantly outperforms existing visual token pruning methods and achieves state-of-the-art performance.
pdf
bib
abs
CR4-NarrEmote: An Open Vocabulary Dataset of Narrative Emotions Derived Using Citizen Science
Andrew Piper
|
Robert Budac
We introduce “Citizen Readers for Narrative Emotions” (CR4-NarrEmote), a large-scale, open-vocabulary dataset of narrative emotions derived through a citizen science initiative. Over a four-month period, 3,738 volunteers contributed more than 200,000 emotion annotations across 43,000 passages from long-form fiction and non-fiction, spanning 150 years, twelve genres, and multiple Anglophone cultural contexts. To facilitate model training and comparability, we provide mappings to both dimensional (Valence-Arousal-Dominance) and categorical (NRC Emotion) frameworks. We evaluate annotation reliability using lexical, categorical, and semantic agreement measures, and find substantial alignment between citizen science annotations and expert-generated labels. As the first open-vocabulary resource focused on narrative emotions at scale, CR4-NarrEmote provides an important foundation for affective computing and narrative understanding.
pdf
bib
abs
XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression
Haoqi Yang
|
Yao Yao
|
Zuchao Li
|
Baoyuan Qi
|
Liu Guoming
|
Hai Zhao
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks. However, their extensive memory requirements, particularly due to KV cache growth during long-text understanding and generation, present significant challenges for deployment in resource-constrained environments. Quantization has emerged as a promising solution to reduce memory consumption while preserving historical information. We propose XQuant, a training-free and plug-and-play framework that achieves ultra-low equivalent bit-width KV cache quantization. XQuant introduces two key innovations: a computationally negligible data-free calibration method and cross-layer KV cache compression, enabling quantization to sub-1.4 bits. Extensive experiments on TruthfulQA and LongBench demonstrate that XQuant outperforms state-of-the-art methods (e.g., KIVI-2bit and AsymKV-1.5bit) by achieving lower bit-width while maintaining superior performance, establishing a better trade-off between memory efficiency and model accuracy. The source code is available at https://github.com/brinenick511/XQuant.
pdf
bib
abs
DINT Transformer
Yueyang Cang
|
Yuhang Liu
|
Xiaoteng Zhang
|
Erlu Zhao
|
Li Shi
The DIFF Transformer mitigates interference from irrelevant contexts by introducing a differential attention mechanism, thereby enhancing focus on critical tokens. However, this architecture suffers from two major limitations: first, its use of two independent attention matrices leads to numerical instability, and second, it lacks global context modeling, which is essential for identifying globally significant tokens. To address these challenges, we propose the DINT Transformer, which extends the DIFF Transformer by incorporating an integral mechanism. By computing global importance scores and integrating them into the attention matrix, the DINT Transformer not only improves overall numerical stability but also significantly enhances its ability to capture global dependencies. Experimental results demonstrate that the DINT Transformer achieves superior accuracy and robustness across various practical applications, including long-context language modeling and key information retrieval. These advancements establish the DINT Transformer as a highly effective and promising architecture.
pdf
bib
abs
ICR: Iterative Clarification and Rewriting for Conversational Search
Zhiyu Cao
|
Peifeng Li
|
Qiaoming Zhu
Most previous work on Conversational Query Rewriting employs an end-to-end rewriting paradigm. However, this approach is hindered by the issue of multiple fuzzy expressions within the query, which complicates the simultaneous identification and rewriting of multiple positions. To address this issue, we propose a novel framework ICR (Iterative Clarification and Rewriting), an iterative rewriting scheme that pivots on clarification questions. Within this framework, the model alternates between generating clarification questions and rewritten queries. The experimental results show that our ICR can continuously improve retrieval performance in the clarification-rewriting iterative process, thereby achieving state-of-the-art performance on two popular datasets.
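To make the alternating clarify-then-rewrite loop described in the abstract more concrete, here is a minimal, self-contained sketch. The stub functions (generate_clarification, rewrite_query, retrieval_score) are illustrative placeholders standing in for LLM and retriever calls, not the authors' implementation.

```python
# Minimal sketch of an iterative clarification-and-rewriting loop.
# The stubs below stand in for LLM / retriever calls and are purely illustrative.

def generate_clarification(query: str) -> str | None:
    # Stub: return a clarification question for the first fuzzy token, if any.
    return "Which 'it' do you mean?" if "it" in query.split() else None

def rewrite_query(query: str, clarification: str) -> str:
    # Stub: a real system would condition an LLM on the clarification answer.
    return query.replace("it", "the laptop")

def retrieval_score(query: str) -> float:
    # Stub: proxy for retrieval quality (more specific queries score higher).
    return float(len(query.split()))

def iterative_clarify_rewrite(query: str, max_rounds: int = 3) -> str:
    best, best_score = query, retrieval_score(query)
    for _ in range(max_rounds):
        question = generate_clarification(best)
        if question is None:            # nothing fuzzy left to resolve
            break
        candidate = rewrite_query(best, question)
        score = retrieval_score(candidate)
        if score <= best_score:         # stop when rewriting no longer helps retrieval
            break
        best, best_score = candidate, score
    return best

print(iterative_clarify_rewrite("how much does it cost"))
```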
pdf
bib
abs
Pre-training CLIP against Data Poisoning with Optimal Transport-based Matching and Alignment
Tong Zhang
|
Kuofeng Gao
|
Jiawang Bai
|
Leo Yu Zhang
|
Xin Yin
|
Zonghui Wang
|
Shouling Ji
|
Wenzhi Chen
Recent studies have shown that Contrastive Language-Image Pre-training (CLIP) models are threatened by targeted data poisoning and backdoor attacks due to the massive training image-caption pairs crawled from the Internet. Previous defense methods correct poisoned image-caption pairs by matching a new caption for each image. However, the matching process relies solely on the global representations of images and captions, overlooking fine-grained visual and textual features. This may introduce incorrect image-caption pairs and harm CLIP pre-training. To address these limitations, we propose an Optimal Transport-based framework to reconstruct the image-caption pairs, named OTCCLIP. We introduce a new optimal transport-based distance measure between fine-grained visual and textual feature sets and re-assign new captions based on the proposed optimal transport distance. Additionally, to further reduce the negative impact of mismatched pairs, we encourage inter- and intra-modality fine-grained alignment by employing optimal transport-based objective functions. Our experiments demonstrate that OTCCLIP can successfully decrease the attack success rates of poisoning attacks to 0% in most cases. Also, compared to previous methods, OTCCLIP significantly improves CLIP’s zero-shot and linear probing performance when trained on poisoned datasets.
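As a rough illustration of an optimal-transport distance between fine-grained feature sets, the following sketch computes an entropic OT cost between image-patch features and caption-token features with Sinkhorn iterations. This is a generic construction under assumed shapes and hyperparameters, not the OTCCLIP objective itself.

```python
# Entropic OT distance between two feature sets via Sinkhorn iterations (illustrative).
import numpy as np

def sinkhorn_ot_distance(img_feats, txt_feats, eps=0.05, n_iters=100):
    """img_feats: (m, d) patch features; txt_feats: (n, d) token features (L2-normalised)."""
    cost = 1.0 - img_feats @ txt_feats.T            # cosine cost matrix (m, n)
    a = np.full(cost.shape[0], 1.0 / cost.shape[0]) # uniform mass over patches
    b = np.full(cost.shape[1], 1.0 / cost.shape[1]) # uniform mass over tokens
    K = np.exp(-cost / eps)                         # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):                        # Sinkhorn fixed-point updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = np.diag(u) @ K @ np.diag(v)              # transport plan
    return float((plan * cost).sum())               # OT distance

rng = np.random.default_rng(0)
img = rng.normal(size=(49, 64)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(12, 64)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
print(sinkhorn_ot_distance(img, txt))
```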
pdf
bib
abs
Similarity = Value? Consultation Value-Assessment and Alignment for Personalized Search
Weicong Qin
|
Yi Xu
|
Weijie Yu
|
Teng Shi
|
Chenglei Shen
|
Ming He
|
Jianping Fan
|
Xiao Zhang
|
Jun Xu
Personalized search systems in e-commerce platforms increasingly involve user interactions with AI assistants, where users consult about products, usage scenarios, and more. Leveraging consultation to personalize search services is trending. Existing methods typically rely on semantic similarity to align historical consultations with current queries due to the absence of ‘value’ labels, but we observe that semantic similarity alone often fails to capture the true value of consultation for personalization. To address this, we propose a consultation value assessment framework that evaluates historical consultations from three novel perspectives: (1) Scenario Scope Value, (2) Posterior Action Value, and (3) Time Decay Value. Based on this, we introduce VAPS, a value-aware personalized search model that selectively incorporates high-value consultations through a consultation–user action interaction module and an explicit objective that aligns consultations with user actions. Experiments on both public and commercial datasets show that VAPS consistently outperforms baselines in both retrieval and ranking tasks. Codes are available at https://github.com/E-qin/VAPS.
pdf
bib
abs
RTQA : Recursive Thinking for Complex Temporal Knowledge Graph Question Answering with Large Language Models
Zhaoyan Gong
|
Juan Li
|
Zhiqiang Liu
|
Lei Liang
|
Huajun Chen
|
Wen Zhang
Current temporal knowledge graph question answering (TKGQA) methods primarily focus on implicit temporal constraints, lacking the capability to handle more complex temporal queries, and struggle with limited reasoning abilities and error propagation in decomposition frameworks. We propose RTQA, a novel framework to address these challenges by enhancing reasoning over TKGs without requiring training. Following recursive thinking, RTQA recursively decomposes questions into sub-problems, solves them bottom-up using LLMs and TKG knowledge, and employs multi-path answer aggregation to improve fault tolerance. RTQA consists of three core components: the Temporal Question Decomposer, the Recursive Solver, and the Answer Aggregator. Experiments on MultiTQ and TimelineKGQA benchmarks demonstrate significant Hits@1 improvements in “Multiple” and “Complex” categories, outperforming state-of-the-art methods. Our code and data are available at https://github.com/zjukg/RTQA.
pdf
bib
abs
Not All Parameters Are Created Equal: Smart Isolation Boosts Fine-Tuning Performance
Yao Wang
|
Di Liang
|
Minlong Peng
Supervised fine-tuning (SFT) is a pivotal approach to adapting large language models (LLMs) for downstream tasks; however, performance often suffers from the “seesaw phenomenon”, where indiscriminate parameter updates yield progress on certain tasks at the expense of others. To address this challenge, we propose a novel Core Parameter Isolation Fine-Tuning (CPI-FT) framework. Specifically, we first independently fine-tune the LLM on each task to identify its core parameter regions by quantifying parameter update magnitudes. Tasks with similar core regions are then grouped based on region overlap, forming clusters for joint modeling. We further introduce a parameter fusion technique: for each task, core parameters from its individually fine-tuned model are directly transplanted into a unified backbone, while non-core parameters from different tasks are smoothly integrated via Spherical Linear Interpolation (SLERP), mitigating destructive interference. A lightweight, pipelined SFT training phase using mixed-task data is subsequently employed, while freezing core regions from prior tasks to prevent catastrophic forgetting. Extensive experiments on multiple public benchmarks demonstrate that our approach significantly alleviates task interference and forgetting, consistently outperforming vanilla multi-task and multi-stage fine-tuning baselines.
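The parameter-fusion step above relies on spherical linear interpolation (SLERP) between task-specific weights. The snippet below is a minimal sketch of SLERP over flattened parameter tensors; the interpolation weight and the toy matrices are assumptions for illustration, not the CPI-FT code.

```python
# SLERP between two parameter tensors, as used for merging non-core parameters (illustrative).
import numpy as np

def slerp(w_a: np.ndarray, w_b: np.ndarray, t: float = 0.5, eps: float = 1e-8) -> np.ndarray:
    a, b = w_a.ravel(), w_b.ravel()
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    cos_omega = np.clip(a @ b / (na * nb + eps), -1.0, 1.0)
    omega = np.arccos(cos_omega)
    if omega < eps:                        # nearly parallel: fall back to linear interpolation
        return (1 - t) * w_a + t * w_b
    s = np.sin(omega)
    mixed = (np.sin((1 - t) * omega) / s) * a + (np.sin(t * omega) / s) * b
    return mixed.reshape(w_a.shape)

w_task1 = np.random.default_rng(0).normal(size=(4, 4))
w_task2 = np.random.default_rng(1).normal(size=(4, 4))
print(slerp(w_task1, w_task2, t=0.5))
```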
pdf
bib
abs
AI Knows Where You Are: Exposure, Bias, and Inference in Multimodal Geolocation with KoreaGEO
Xiaonan Wang
|
Bo Shao
|
Hansaem Kim
Recent advances in vision-language models (VLMs) have enabled accurate image-based geolocation, raising serious concerns about location privacy risks in everyday social media posts. Yet, a systematic evaluation of such risks is still lacking: existing benchmarks show coarse granularity, linguistic bias, and a neglect of multimodal privacy risks. To address these gaps, we introduce KoreaGEO, the first fine-grained, multimodal, and privacy-aware benchmark for geolocation, built on Korean street views. The benchmark covers four socio-spatial clusters and nine place types with rich contextual annotations and two captioning styles that simulate real-world privacy exposure. To evaluate mainstream VLMs, we design a three-path protocol spanning image-only, functional-caption, and high-risk-caption inputs, enabling systematic analysis of localization accuracy, spatial bias, and reasoning behavior. Results show that input modality exerts a stronger influence on localization precision and privacy exposure than model scale or architecture, with high-risk captions substantially boosting accuracy. Moreover, they highlight structural prediction biases toward core cities.
pdf
bib
abs
CAT: Causal Attention Tuning For Injecting Fine-grained Causal Knowledge into Large Language Models
Kairong Han
|
Wenshuo Zhao
|
Ziyu Zhao
|
Ye Jun Jian
|
Lujia Pan
|
Kun Kuang
Large Language Models (LLMs) have achieved remarkable success across various domains. However, a fundamental question remains: Can LLMs effectively utilize causal knowledge for prediction and generation? Through empirical studies, we find that LLMs trained directly on large-scale data often capture spurious correlations rather than true causal relationships, leading to suboptimal performance, especially in out-of-distribution (OOD) scenarios. To address this challenge, we propose Causal Attention Tuning (CAT), a novel approach that injects fine-grained causal knowledge into the attention mechanism. We propose an automated pipeline that leverages human priors to automatically generate token-level causal signals and introduce the Re-Attention mechanism to guide training, helping the model focus on causal structures while mitigating noise and biases in attention scores. Experimental results on our proposed Spurious Token Game (STG) benchmark and multiple downstream tasks demonstrate that our approach effectively leverages causal knowledge for prediction and remains robust in OOD scenarios. The CAT achieves an average improvement of 5.76% on the STG dataset and 1.56% on downstream tasks. Notably, the OOD performance of the Llama-3.1-8B model on STG_M increased from 64.5% to 90.5%, and Qwen’s OOD performance on the STG_H dataset improved from 25.4% to 55.9%. Implementation details can be found at https://github.com/Kairong-Han/CAT.
pdf
bib
abs
Enhancing LLM Text Detection with Retrieved Contexts and Logits Distribution Consistency
Zhaoheng Huang
|
Yutao Zhu
|
Ji-Rong Wen
|
Zhicheng Dou
Large language models (LLMs) can generate fluent text, raising concerns about misuse in online comments and academic writing, leading to issues like corpus pollution and copyright infringement. Existing LLM text detection methods often rely on features from the logit distribution of the input text. However, the distinction between the LLM-generated and human-written texts may rely on only a few tokens due to the short length or insufficient information in some texts, leading to minimal and hard-to-detect differences in logit distributions. To address this, we propose HALO, an LLM-based detection method that leverages external text corpora to evaluate the difference in the logit distribution of input text under retrieved human-written and LLM-rewritten contexts. HALO also complements basic detection features and can serve as a plug-and-play module to enhance existing detection methods. Extensive experiments on five public datasets with three widely-used source LLMs show that our proposed detection method achieves state-of-the-art performance in AUROC, both in cross-domain and domain-specific scenarios.
pdf
bib
abs
Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps
Martin Tutek
|
Fateme Hashemi Chaleshtori
|
Ana Marasovic
|
Yonatan Belinkov
When prompted to think step-by-step, language models (LMs) produce a chain of thought (CoT), a sequence of reasoning steps that the model supposedly used to produce its prediction. Despite much work on CoT prompting, it is unclear if reasoning verbalized in a CoT is faithful to the models’ parametric beliefs. We introduce a framework for measuring parametric faithfulness of generated reasoning and propose Faithfulness by Unlearning Reasoning steps (FUR), an instance of this framework. FUR erases information contained in reasoning steps from model parameters and measures faithfulness as the resulting effect on the model’s prediction. Our experiments with four LMs and five multi-choice question answering (MCQA) datasets show that FUR is frequently able to precisely change the underlying models’ prediction for a given instance by unlearning key steps, indicating when a CoT is parametrically faithful. Further analysis shows that CoTs generated by models post-unlearning support different answers, hinting at a deeper effect of unlearning.
pdf
bib
abs
Stop Looking for “Important Tokens” in Multimodal Language Models: Duplication Matters More
Zichen Wen
|
Yifeng Gao
|
Shaobo Wang
|
Junyuan Zhang
|
Qintong Zhang
|
Weijia Li
|
Conghui He
|
Linfeng Zhang
Vision tokens in multimodal large language models often incur huge computational overhead due to their excessive length compared to the linguistic modality. Many recent methods aim to solve this problem with token pruning, which first defines an importance criterion for tokens and then prunes the unimportant vision tokens during inference. However, in this paper, we show that importance is not an ideal indicator of whether a token should be pruned. Surprisingly, importance-based pruning often performs worse than random token pruning and is incompatible with efficient attention computation operators. Instead, we propose DART (Duplication-Aware Reduction of Tokens), which prunes tokens based on their duplication with other tokens, leading to significant and training-free acceleration. Concretely, DART selects a small subset of pivot tokens and then retains the tokens with low duplication to the pivots, ensuring minimal information loss during token pruning. Experiments demonstrate that DART can prune 88.9% of vision tokens while maintaining comparable performance, leading to a 1.99× and 2.99× speed-up in total time and the prefilling stage, respectively, with good compatibility with efficient attention operators.
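A minimal sketch of the duplication-based pruning idea follows: pick a few pivot tokens, then keep the tokens least similar to any pivot. The pivot-selection rule (uniform strides) and all dimensions are assumptions for illustration, not the DART implementation.

```python
# Duplication-aware token pruning sketch: retain pivots plus the least-duplicated tokens.
import numpy as np

def dart_prune(tokens: np.ndarray, keep: int, num_pivots: int = 4) -> np.ndarray:
    """tokens: (n, d) vision-token features; returns sorted indices of retained tokens."""
    feats = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    pivot_idx = np.linspace(0, len(tokens) - 1, num_pivots, dtype=int)
    sim_to_pivots = feats @ feats[pivot_idx].T          # (n, num_pivots) cosine similarities
    duplication = sim_to_pivots.max(axis=1)             # highest similarity to any pivot
    duplication[pivot_idx] = np.inf                     # pivots are re-added below, not ranked
    keep_idx = np.argsort(duplication)[: keep - num_pivots]   # least-duplicated tokens survive
    return np.sort(np.concatenate([pivot_idx, keep_idx]))

toks = np.random.default_rng(0).normal(size=(576, 64))
print(dart_prune(toks, keep=64).shape)                  # (64,)
```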
pdf
bib
abs
AgentPro: Enhancing LLM Agents with Automated Process Supervision
Yuchen Deng
|
Shichen Fan
|
Naibo Wang
|
Xinkui Zhao
|
See-Kiong Ng
Large language model (LLM) agents have demonstrated significant potential for addressing complex tasks through mechanisms such as chain-of-thought reasoning and tool invocation. However, current frameworks lack explicit supervision during the reasoning process, which may lead to error propagation across reasoning chains and hinder the optimization of intermediate decision-making stages. This paper introduces a novel framework, AgentPro, which enhances LLM agent performance by automated process supervision. AgentPro employs Monte Carlo Tree Search to automatically generate step-level annotations, and develops a process reward model based on these annotations to facilitate fine-grained quality assessment of reasoning. By employing a rejection sampling strategy, the LLM agent dynamically adjusts generation probability distributions to prevent the continuation of erroneous paths, thereby improving reasoning capabilities. Extensive experiments on four datasets indicate that our method significantly outperforms existing agent-based LLM methods (e.g., achieving a 6.32% increase in accuracy on the HotpotQA dataset), underscoring its proficiency in managing intricate reasoning chains.
pdf
bib
abs
PORTS: Preference-Optimized Retrievers for Tool Selection with Large Language Models
Lorenzo Molfetta
|
Giacomo Frisoni
|
Nicolò Monaldini
|
Gianluca Moro
Integrating external tools with Large Language Models (LLMs) has emerged as a promising paradigm for accomplishing complex tasks. Since LLMs still struggle to effectively manage large tool collections, researchers have begun exploring retrieval-based methods to pre-select the most relevant options, addressing input length and latency constraints. However, existing retrievers are often misaligned with tool-calling LLMs due to their separate training processes. This paper presents PORTS, a novel odds ratio preference optimization method for training retrievers aimed at tool selection. Using a perplexity-inspired preference signal from a frozen LLM, our approach fine-tunes a retriever to find helpful tools by optimizing the correlation between the selection probabilities and the downstream performances while jointly enforcing a contrastive semantic loss between documentation strings. The versatility of PORTS and its ability to significantly improve tool selection accuracy are demonstrated through extensive experiments on six datasets, two encoder models, and three LLMs with diverse prior knowledge. With low computational demands, our alignment process facilitates generalization to new queries and tools, proving valuable for practical applications with evolving toolsets.
pdf
bib
abs
MusKGC: A Flexible Multi-source Knowledge Enhancement Framework for Open-World Knowledge Graph Completion
Xin Song
|
Liu Haiyan
|
Haiyang Wang
|
Ye Wang
|
Kai Chen
|
Bin Zhou
Open-world knowledge graph completion (KGC) aims to infer novel facts by enriching existing graphs with external knowledge sources while maintaining semantic consistency under the open-world assumption (OWA). Generation-based KGC methods leverage the inherent strengths of large language models (LLMs) in language understanding and creative problem-solving, making them promising approaches. However, they face limitations: (1) The unreliable external knowledge from LLMs can lead to hallucinations and undermine KGC reliability. (2) The lack of an automated and rational evaluation strategy for new facts under OWA results in the exclusion of some new but correct entities. In the paper, we propose MusKGC, a novel multi-source knowledge enhancement framework based on an LLM for KGC under OWA. We induce relation templates with entity type constraints to link structured knowledge with natural language, improving the comprehension of the LLM. Next, we combine intrinsic KG facts with reliable external knowledge to guide the LLM in accurately generating missing entities with supporting evidence. Lastly, we introduce a new evaluation strategy for factuality and consistency to validate accurate inferences of new facts, including unknown entities. Extensive experiments show that our proposed model achieves SOTA performance across benchmarks, and our evaluation strategy effectively assesses new facts under OWA.
pdf
bib
abs
Towards Transferable Personality Representation Learning based on Triplet Comparisons and Its Applications
Kai Tang
|
Rui Wang
|
Renyu Zhu
|
Minmin Lin
|
Xiao Ding
|
Tangjie Lv
|
Changjie Fan
|
Runze Wu
|
Haobo Wang
Personality is an important concept in psychology that reflects individual differences in thinking and behavior, and has significant applications across various fields. Most existing personality analysis methods address this issue at the bag level, treating the entire corpus gathered from one individual as a single unit for classification. However, this paradigm presents several challenges. From the data perspective, collecting a large corpus for each individual and performing comprehensive annotations pose significant difficulties in both data collection and labeling. On the application side, concentrating on classifying the entire corpus limits its applicability in more common single-instance scenarios. To address these issues, we propose a new task paradigm in text-based personality representation learning. Specifically, we construct a triplet personality trend comparison dataset to learn single-sentence personality embeddings with desirable metric properties. This approach removes the traditional constraints on data sources, facilitating dataset expansion, and can leverage the transfer capabilities of embeddings to easily adapt to various downstream tasks. Our experiments show that the learned embeddings significantly boost performance by a relative 10% across various applications, including personality detection, personality retrieval, and emotion translation prediction. The code and dataset are available at
https://github.com/zjutangk/PTCD.
pdf
bib
abs
Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models
Hao Yang
|
Lizhen Qu
|
Ehsan Shareghi
|
Gholamreza Haffari
Large Audio Language Models (LALMs) have extended the capabilities of Large Language Models (LLMs) by enabling audio-based human interactions. However, recent research has revealed that LALMs remain vulnerable to harmful queries due to insufficient safety alignment. Despite advances in defence measures for text and vision LLMs, effective safety-alignment strategies and audio-safety datasets specifically targeting LALMs are notably absent. Meanwhile, defence measures based on Supervised Fine-tuning (SFT) struggle to improve safety while avoiding over-rejection, significantly compromising helpfulness. In this work, we propose an unsupervised safety fine-tuning strategy as a remedy that reshapes the model’s representation space to enhance the safety alignment of existing LALMs while balancing the risk of over-rejection. Our experiments, conducted across three generations of Qwen LALMs, demonstrate that our approach significantly improves LALM safety under three modality input conditions (audio-text, text-only, and audio-only) while increasing the over-rejection rate by only 0.88% on average. Warning: this paper contains harmful examples.
pdf
bib
abs
Benchmarking Large Language Models Under Data Contamination: A Survey from Static to Dynamic Evaluation
Simin Chen
|
Yiming Chen
|
Zexin Li
|
Yifan Jiang
|
Zhongwei Wan
|
Yixin He
|
Dezhi Ran
|
Tianle Gu
|
Haizhou Li
|
Tao Xie
|
Baishakhi Ray
In the era of evaluating large language models (LLMs), data contamination has become an increasingly prominent concern. To address this risk, LLM benchmarking has evolved from a *static* to a *dynamic* paradigm. In this work, we conduct an in-depth analysis of existing *static* and *dynamic* benchmarks for evaluating LLMs. We first examine methods that enhance *static* benchmarks and identify their inherent limitations. We then highlight a critical gap—the lack of standardized criteria for evaluating *dynamic* benchmarks. Based on this observation, we propose a series of optimal design principles for *dynamic* benchmarking and analyze the limitations of existing *dynamic* benchmarks. This survey provides a concise yet comprehensive overview of recent advancements in data contamination research, offering valuable insights and a clear guide for future research efforts. We maintain a GitHub repository to continuously collect both static and dynamic benchmarking methods for LLMs. The repository can be found at this link.
pdf
bib
abs
FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain
Tiansheng Hu
|
Tongyan Hu
|
Liuyang Bai
|
Yilun Zhao
|
Arman Cohan
|
Chen Zhao
Recent LLMs have demonstrated promising ability in solving finance-related problems. However, applying LLMs to real-world finance applications remains challenging due to their high-risk and high-stakes nature. This paper introduces FinTrust, a comprehensive benchmark specifically designed for evaluating the trustworthiness of LLMs in finance applications. Our benchmark focuses on a wide range of alignment issues based on practical context and features fine-grained tasks for each dimension of trustworthiness evaluation. We assess eleven LLMs on FinTrust and find that proprietary models like o4-mini outperform in most tasks, such as safety, while open-source models like DeepSeek-V3 have an advantage in specific areas like industry-level fairness. For challenging tasks like fiduciary alignment and disclosure, all LLMs fall short, showing a significant gap in legal awareness. We believe that FinTrust can be a valuable benchmark for evaluating LLMs’ trustworthiness in the finance domain.
pdf
bib
abs
RecGPT: A Foundation Model for Sequential Recommendation
Yangqin Jiang
|
Xubin Ren
|
Lianghao Xia
|
Da Luo
|
Kangyi Lin
|
Chao Huang
This work addresses a fundamental barrier in recommender systems: the inability to generalize across domains without extensive retraining. Traditional ID-based approaches fail entirely in cold-start and cross-domain scenarios where new users or items lack sufficient interaction history. Inspired by foundation models’ cross-domain success, we develop a foundation model for sequential recommendation that achieves genuine zero-shot generalization capabilities. Our approach fundamentally departs from existing ID-based methods by deriving item representations exclusively from textual features. This enables immediate embedding of any new item without model retraining. We introduce unified item tokenization with Finite Scalar Quantization that transforms heterogeneous textual descriptions into standardized discrete tokens. This eliminates domain barriers that plague existing systems. Additionally, the framework features hybrid bidirectional-causal attention that captures both intra-item token coherence and inter-item sequential dependencies. An efficient catalog-aware beam search decoder enables real-time token-to-item mapping. Unlike conventional approaches confined to their training domains, RecGPT naturally bridges diverse recommendation contexts through its domain-invariant tokenization mechanism. Comprehensive evaluations across six datasets and industrial scenarios demonstrate consistent performance advantages.
pdf
bib
abs
Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey
Chih-Kai Yang
|
Neo S. Ho
|
Hung-yi Lee
With advancements in large audio-language models (LALMs), which enhance large language models (LLMs) with auditory capabilities, these models are expected to demonstrate universal proficiency across various auditory tasks. While numerous benchmarks have emerged to assess LALMs’ performance, they remain fragmented and lack a structured taxonomy. To bridge this gap, we conduct a comprehensive survey and propose a systematic taxonomy for LALM evaluations, categorizing them into four dimensions based on their objectives: (1) General Auditory Awareness and Processing, (2) Knowledge and Reasoning, (3) Dialogue-oriented Ability, and (4) Fairness, Safety, and Trustworthiness. We provide detailed overviews within each category and highlight challenges in this field, offering insights into promising future directions. To the best of our knowledge, this is the first survey specifically focused on the evaluations of LALMs, providing clear guidelines for the community.
pdf
bib
abs
Train One Sparse Autoencoder Across Multiple Sparsity Budgets to Preserve Interpretability and Accuracy
Nikita Balagansky
|
Yaroslav Aksenov
|
Daniil Laptev
|
Vadim Kurochkin
|
Gleb Gerasimov
|
Nikita Koriagin
|
Daniil Gavrilov
Sparse Autoencoders (SAEs) have proven to be powerful tools for interpreting neural networks by decomposing hidden representations into disentangled, interpretable features via sparsity constraints. However, conventional SAEs are constrained by the fixed sparsity level chosen during training; meeting different sparsity requirements therefore demands separate models and increases the computational footprint during both training and evaluation. We introduce a novel training objective, HierarchicalTopK, which trains a single SAE to optimise reconstructions across multiple sparsity levels simultaneously. Experiments with Gemma-2 2B demonstrate that our approach achieves Pareto-optimal trade-offs between sparsity and explained variance, outperforming traditional SAEs trained at individual sparsity levels. Further analysis shows that HierarchicalTopK preserves high interpretability scores even at higher sparsity. The proposed objective thus closes an important gap between flexibility and interpretability in SAE design.
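To illustrate what optimizing one SAE across several sparsity budgets can look like, here is a minimal sketch: the same latent code is truncated to each k in a budget list and every truncation must reconstruct the input. The architecture, budgets, and loss weighting are assumptions for illustration, not the released HierarchicalTopK code.

```python
# One TopK sparse autoencoder trained against several sparsity budgets at once (illustrative).
import torch
import torch.nn as nn

class MultiBudgetTopKSAE(nn.Module):
    def __init__(self, d_model=64, d_dict=512, budgets=(8, 16, 32, 64)):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)
        self.budgets = budgets

    def forward(self, x):
        z = torch.relu(self.enc(x))                       # dense latent activations
        losses = []
        for k in self.budgets:                            # nested sparsity budgets
            topk = torch.topk(z, k, dim=-1)
            z_k = torch.zeros_like(z).scatter(-1, topk.indices, topk.values)
            losses.append(((self.dec(z_k) - x) ** 2).mean())
        return torch.stack(losses).mean()                 # optimise all budgets jointly

sae = MultiBudgetTopKSAE()
x = torch.randn(32, 64)
loss = sae(x)
loss.backward()
print(float(loss))
```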
pdf
bib
abs
Learn and Unlearn: Addressing Misinformation in Multilingual LLMs
TaiMing Lu
|
Philipp Koehn
This paper investigates the propagation of information in multilingual large language models (LLMs) and evaluates the efficacy of various unlearning methods. We demonstrate that fake information, regardless of the language it is in, once introduced into these models through training data, can spread across different languages, compromising the integrity and reliability of the generated content. Our findings reveal that standard unlearning techniques, which typically focus on English data, are insufficient in mitigating the spread of harmful content in multilingual contexts and could inadvertently reinforce harmful content across languages. We show that only by addressing harmful responses in both English and the original language of the harmful data we can effectively eliminate it for all languages. This underscores the critical need for comprehensive unlearning strategies that consider the multilingual nature of modern LLMs to enhance their safety and reliability across landscapes.
pdf
bib
abs
PRISM: Efficient Long-Range Reasoning With Short-Context LLMs
Dulhan Jayalath
|
James Bradley Wendt
|
Nicholas Monath
|
Sandeep Tata
|
Beliz Gunel
Long-range tasks demand reasoning over long inputs. However, existing solutions are limited, e.g., long-context models require large compute budgets, parameter-efficient fine-tuning (PEFT) needs training data, and retrieval-augmented generation (RAG) entails complex task-specific designs. Though in-context approaches overcome many of these issues, methods with short-context LLMs are inefficient, trading context for processing more tokens. We introduce **PRISM**, a highly token-efficient in-context method based on structured schemas that outperforms baselines on diverse tasks with **4x shorter contexts**. This approach produces concise outputs and efficiently leverages key-value (KV) caches to **reduce costs by up to 54%**. PRISM scales down to tiny contexts without increasing costs or sacrificing quality, and generalizes to new tasks with minimal effort by generating schemas from task descriptions.
pdf
bib
abs
Augmenting Multi-Agent Communication with State Delta Trajectory
Yichen Tang
|
Weihang Su
|
Yujia Zhou
|
Yiqun Liu
|
Min Zhang
|
Shaoping Ma
|
Qingyao Ai
Multi-agent techniques such as role playing or multi-turn debates have been shown to be effective in improving the performance of large language models (LLMs) in downstream tasks. Despite their differences in workflows, existing multi-agent systems constructed from a single base LLM mostly use natural language for agent communication. While this is appealing for its simplicity and interpretability, it also introduces inevitable information loss, as one model must downsample its continuous state vectors to discrete tokens before transferring them to the other model. Such losses are particularly significant when the information to transfer is not simple facts, but reasoning logic or abstract thoughts. To tackle this problem, we propose a new communication protocol that transfers both natural language tokens and a token-wise state transition trajectory from one agent to another. In particular, compared to the actual state values, we find that the sequence of state changes in LLMs after generating each token better reflects the information hidden behind the inference process. We propose a State Delta Encoding (SDE) method to represent state transition trajectories. The experimental results show that multi-agent systems with SDE achieve SOTA performance compared to other communication protocols, particularly in tasks that involve complex reasoning. We have open-sourced all the code and data at https://github.com/LittleDinoC/StateDelta/.
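A very small sketch of the core idea follows: alongside the generated tokens, the sending agent also transfers the step-to-step change of a chosen hidden state. The layer choice, message packaging, and placeholder tokens are assumptions for illustration.

```python
# State-delta trajectory sketch: transfer per-token changes of a hidden state between agents.
import torch

# hidden_states: (T, d) states recorded after generating each of T tokens
hidden_states = torch.randn(10, 768)
state_deltas = hidden_states[1:] - hidden_states[:-1]     # (T-1, d) transition trajectory

message = {
    "tokens": [f"tok_{i}" for i in range(10)],             # ordinary natural-language channel
    "state_delta_trajectory": state_deltas,                # extra latent channel for the peer agent
}
print(message["state_delta_trajectory"].shape)
```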
pdf
bib
abs
SAEs Are Good for Steering – If You Select the Right Features
Dana Arad
|
Aaron Mueller
|
Yonatan Belinkov
Sparse Autoencoders (SAEs) have been proposed as an unsupervised approach to learn a decomposition of a model’s latent space. This enables useful applications, such as fine-grained steering of model outputs without requiring labeled data. Current steering methods identify SAE features to target by analyzing the input tokens that activate them. However, recent work has highlighted that activations alone do not fully describe the effect of a feature on the model’s output. In this work we draw a distinction between two types of features: input features, which mainly capture patterns in the model’s input, and output features, those that have a human-understandable effect on the model’s output. We propose input and output scores to characterize and locate these types of features, and show that high values for both scores rarely co-occur in the same features. These findings have practical implications: After filtering out features with low output scores, steering with SAEs results in a 2–3x improvement, matching the performance of existing supervised methods.
pdf
bib
abs
CoBA: Counterbias Text Augmentation for Mitigating Various Spurious Correlations via Semantic Triples
Kyohoon Jin
|
Juhwan Choi
|
JungMin Yun
|
Junho Lee
|
Soojin Jang
|
YoungBin Kim
Deep learning models often learn and exploit spurious correlations in training data, using these non-target features to inform their predictions. Such reliance leads to performance degradation and poor generalization on unseen data. To address these limitations, we introduce a more general form of counterfactual data augmentation, termed *counterbias* data augmentation, which simultaneously tackles multiple biases (e.g., gender bias, simplicity bias) and enhances out-of-distribution robustness. We present **CoBA**: **Co**unter**B**ias **A**ugmentation, a unified framework that operates at the semantic triple level: first decomposing text into subject-predicate-object triples, then selectively modifying these triples to disrupt spurious correlations. By reconstructing the text from these adjusted triples, **CoBA** generates *counterbias* data that mitigates spurious patterns. Through extensive experiments, we demonstrate that **CoBA** not only improves downstream task performance, but also effectively reduces biases and strengthens out-of-distribution resilience, offering a versatile and robust solution to the challenges posed by spurious correlations.
pdf
bib
abs
Layered Insights: Generalizable Analysis of Human Authorial Style by Leveraging All Transformer Layers
Milad Alshomary
|
Nikhil Reddy Varimalla
|
Vishal Anand
|
Smaranda Muresan
|
Kathleen McKeown
We propose a new approach for the authorship attribution task that leverages the various linguistic representations learned at different layers of pre-trained transformer-based models. We evaluate our approach on two popular authorship attribution models and three evaluation datasets, in in-domain and out-of-domain scenarios. We find that utilizing various transformer layers improves the robustness of authorship attribution models when tested on out-of-domain data, resulting in much stronger performance. Our analysis gives further insights into how the model’s different layers become specialized in representing certain linguistic aspects that we believe benefit the model when tested out of domain.
pdf
bib
abs
When Long Helps Short: How Context Length in Supervised Fine-tuning Affects Behavior of Large Language Models
Yingming Zheng
|
Hanqi Li
|
Kai Yu
|
Lu Chen
Large language models (LLMs) have achieved impressive performance across natural language processing (NLP) tasks. As real-world applications increasingly demand longer context windows, continued pretraining and supervised fine-tuning (SFT) on long-context data has become a common approach. While the effects of data length in continued pretraining have been extensively studied, their implications for SFT remain unclear. In this work, we systematically investigate how SFT data length influences LLM behavior on short-context tasks. Counterintuitively, we find that long-context SFT improves short-context performance, contrary to the commonly observed degradation from long-context pretraining. To uncover the underlying mechanisms of this phenomenon, we first decouple and analyze two key components, Multi-Head Attention (MHA) and Feed-Forward Network (FFN), and show that both independently benefit from long-context SFT. We further study their interaction and reveal a knowledge preference bias: long-context SFT promotes contextual knowledge, while short-context SFT favors parametric knowledge, making exclusive reliance on long-context SFT suboptimal. Finally, we demonstrate that hybrid training mitigates this bias, offering explainable guidance for fine-tuning LLMs.
pdf
bib
abs
A Case Against Implicit Standards: Homophone Normalization in Machine Translation for Languages that use the Ge’ez Script.
Hellina Hailu Nigatu
|
Atnafu Lambebo Tonja
|
Henok Biadglign Ademtew
|
Hizkiel Mitiku Alemayehu
|
Negasi Haile Abadi
|
Tadesse Destaw Belay
|
Seid Muhie Yimam
Homophone normalization, where characters that have the same sound in a writing script are mapped to one character, is a pre-processing step applied in Amharic Natural Language Processing (NLP) literature. While this may improve performance reported by automatic metrics, it also results in models that are unable to effectively process different forms of writing in a single language. Further, there might be impacts in transfer learning, where models trained on normalized data do not generalize well to other languages. In this paper, we experiment with monolingual training and cross-lingual transfer to understand the impacts of normalization on languages that use the Ge’ez script. We then propose a post-inference intervention in which normalization is applied to model predictions instead of training data. With our simple scheme of post-inference normalization, we show that we can achieve an increase in BLEU score of up to 1.03 while preserving language features in training.
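The post-inference intervention can be pictured with a tiny sketch: a homophone mapping is applied to model predictions before scoring rather than to the training data. The two-character mapping below is a toy subset chosen for illustration, not the full Ge'ez homophone table used in the paper.

```python
# Post-inference homophone normalization sketch: normalize predictions, not training data.
HOMOPHONE_MAP = str.maketrans({"ሃ": "ሀ", "ሓ": "ሀ"})   # toy subset of a homophone table

def normalize_prediction(text: str) -> str:
    return text.translate(HOMOPHONE_MAP)

prediction = "ሰላም ሃገር"
print(normalize_prediction(prediction))   # scored against the (normalized) reference
```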
pdf
bib
abs
Evaluating Language Translation Models by Playing Telephone
Syeda Jannatus Saba
|
Steven Skiena
Our ability to efficiently and accurately evaluate the quality of machine translation systems has been outrun by the effectiveness of current language models—which limits the potential for further improving these models on more challenging tasks like long-form and literary translation. We propose an unsupervised method to generate training data for translation evaluation over different document lengths and application domains by repeated rounds of translation between source and target languages. We evaluate evaluation systems trained on texts mechanically generated using both model rotation and language translation approaches, demonstrating improved performance over a popular translation evaluation system (xCOMET) on two different tasks: (i) scoring the quality of a given translation against a human reference and (ii) selecting which of two translations is generationally closer to an original source document.
pdf
bib
abs
Doubling Your Data in Minutes: Ultra-fast Tabular Data Generation via LLM-Induced Dependency Graphs
Shuo Yang
|
Zheyu Zhang
|
Bardh Prenkaj
|
Gjergji Kasneci
Tabular data is critical across diverse domains, yet high-quality datasets remain scarce due to privacy concerns and the cost of collection. Contemporary approaches adopt large language models (LLMs) for tabular augmentation, but exhibit two major limitations: (1) dense dependency modeling among tabular features that can introduce bias, and (2) high computational overhead in sampling. To address these issues, we propose SPADA for SPArse Dependency-driven Augmentation, a lightweight generative framework that explicitly captures sparse dependencies via an LLM-induced graph. We treat each feature as a node and synthesize values by traversing the graph, conditioning each feature solely on its parent nodes. We explore two synthesis strategies: a non-parametric method using Gaussian kernel density estimation, and a conditional normalizing flow model that learns invertible mappings for conditional density estimation. Experiments on four datasets show that SPADA reduces constraint violations by 4% compared to diffusion-based methods and accelerates generation by nearly 9,500× over LLM-based baselines.
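To make the dependency-driven synthesis concrete, here is a minimal sketch under assumptions: features are generated in topological order of a sparse parent graph, and each value is sampled from a Gaussian kernel around a training row whose parent value is close to the already-synthesized one. The hand-written graph, single-parent conditioning, and bandwidth are illustrative stand-ins; the paper induces the graph with an LLM.

```python
# Sparse-dependency tabular synthesis sketch (non-parametric, KDE-style variant).
import numpy as np

rng = np.random.default_rng(0)
data = {"age": rng.normal(40, 10, 500)}
data["income"] = 1000 * data["age"] + rng.normal(0, 5000, 500)
data["spend"] = 0.3 * data["income"] + rng.normal(0, 2000, 500)
parents = {"age": [], "income": ["age"], "spend": ["income"]}   # sparse dependency graph

def synthesize_row(data, parents, bandwidth=0.1):
    row = {}
    for feat in parents:                                   # dict order = topological order here
        values = data[feat]
        if not parents[feat]:                              # root node: plain kernel sample
            anchor = rng.choice(values)
        else:                                              # condition on first parent via nearest neighbour
            pa = parents[feat][0]
            idx = np.argmin(np.abs(data[pa] - row[pa]))
            anchor = values[idx]
        row[feat] = anchor + bandwidth * values.std() * rng.normal()
    return row

print(synthesize_row(data, parents))
```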
pdf
bib
abs
SPaRC: A Spatial Pathfinding Reasoning Challenge
Lars Benedikt Kaesberg
|
Jan Philip Wahle
|
Terry Ruas
|
Bela Gipp
Existing reasoning datasets saturate and fail to test abstract, multi-step problems, especially pathfinding and complex rule constraint satisfaction. We introduce SPaRC (Spatial Pathfinding Reasoning Challenge), a dataset of 1,000 2D grid pathfinding puzzles to evaluate spatial and rule-based reasoning, requiring step-by-step planning with arithmetic and geometric rules. Humans achieve near-perfect accuracy (98.0%; 94.5% on hard puzzles), while the best reasoning models, such as o4-mini, struggle (15.8%; 1.1% on hard puzzles). Models often generate invalid paths (>50% of puzzles for o4-mini), and reasoning tokens reveal they make errors in navigation and spatial logic. Unlike humans, who take longer on hard puzzles, models fail to scale test-time compute with difficulty. Allowing models to make multiple solution attempts improves accuracy, suggesting potential for better spatial reasoning with improved training and efficient test-time scaling methods. SPaRC can be used as a window into models’ spatial reasoning limitations and drive research toward new methods that excel in abstract, multi-step problem-solving.
pdf
bib
abs
Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training
Yao-Ching Yu
|
Tsun-Han Chiang
|
Cheng-Wei Tsai
|
Chien-Ming Huang
|
Wen-Kwang Tsao
Large Language Models (LLMs) have shown remarkable advancements in specialized fields such as finance, law, and medicine. However, in cybersecurity, we have noticed a lack of open-source datasets, with a particular lack of high-quality cybersecurity pretraining corpora, even though much research indicates that LLMs acquire their knowledge during pretraining. To address this, we present a comprehensive suite of datasets covering all major training stages, including pretraining, instruction fine-tuning, and reasoning distillation with cybersecurity-specific self-reflection data. Extensive ablation studies demonstrate their effectiveness on public cybersecurity benchmarks. In particular, continued pre-training on our dataset yields a **15.9%** improvement in the aggregate score, while reasoning distillation leads to a **15.8%** gain in security certification (CISSP). We will release all datasets and trained cybersecurity LLMs under the ODC-BY and MIT licenses to encourage further research in the community.
pdf
bib
abs
Bit-Flip Error Resilience in LLMs: A Comprehensive Analysis and Defense Framework
Yuhang Chen
|
Zhen Tan
|
Ajay Kumar Jaiswal
|
Huaizhi Qu
|
Xinyu Zhao
|
Qi Lin
|
Yu Cheng
|
Andrew Kwong
|
Zhichao Cao
|
Tianlong Chen
Bit-flip errors (BFEs) are hardware faults where individual bits in memory or processing units are unintentionally flipped. These errors pose a significant threat to neural network reliability because even small changes in model parameters can lead to large shifts in outputs. Large language models (LLMs) are particularly vulnerable on resource-constrained or outdated hardware. Such hardware often lacks error-correction mechanisms and faces aging issues, leading to instability under the vast parameter counts and heavy computational loads of LLMs. While the impact of BFEs on traditional networks like CNNs is relatively well-studied, their effect on the complex architecture of transformers remains largely unexplored. Firstly, this paper presents a comprehensive systematic analysis of BFE vulnerabilities in key LLM components, revealing distinct sensitivities across parameters, activations, and gradients during fine-tuning and inference. Secondly, based on our findings, we introduce a novel defense strategy FlipGuard: (i) exponent bit protection, and (ii) a self-correction based fine-tuning mechanism, to address BFE consequences. FlipGuard minimizes performance degradation while significantly enhancing robustness against BFEs. Experiments demonstrate a 9.27 reduction in accuracy drop under 1 BFEs on the SST-2 dataset using BERT, and a 36.35-point improvement in perplexity on the Wikitext-103 dataset using GPT-2, compared to unprotected models. These results show the potential of our approach in enabling reliable LLM deployment on diverse and less reliable hardware platforms.
pdf
bib
abs
Towards Faithful Natural Language Explanations: A Study Using Activation Patching in Large Language Models
Wei Jie Yeo
|
Ranjan Satapathy
|
Erik Cambria
Large Language Models (LLMs) are capable of generating persuasive Natural Language Explanations (NLEs) to justify their answers. However, the faithfulness of these explanations should not be readily trusted at face value. Recent studies have proposed various methods to measure the faithfulness of NLEs, typically by inserting perturbations at the explanation or feature level. We argue that these approaches are neither comprehensive nor correctly designed according to the established definition of faithfulness. Moreover, we highlight the risks of grounding faithfulness findings on out-of-distribution samples. In this work, we leverage a causal mediation technique called activation patching, to measure the faithfulness of an explanation towards supporting the explained answer. Our proposed metric, Causal Faithfulness quantifies the consistency of causal attributions between explanations and the corresponding model outputs as the indicator of faithfulness. We experimented across models varying from 2B to 27B parameters and found that models that underwent alignment-tuning tend to produce more faithful and plausible explanations. We find that Causal Faithfulness is a promising improvement over existing faithfulness tests by taking into account the model’s internal computations and avoiding out-of-distribution concerns that could otherwise undermine the validity of faithfulness assessments.
pdf
bib
abs
Calibrating LLM Confidence by Probing Perturbed Representation Stability
Reza Khanmohammadi
|
Erfan Miahi
|
Mehrsa Mardikoraem
|
Simerjot Kaur
|
Ivan Brugere
|
Charese Smiley
|
Kundan S Thind
|
Mohammad M. Ghassemi
Miscalibration in Large Language Models (LLMs) undermines their reliability, highlighting the need for accurate confidence estimation. We introduce CCPS (Calibrating LLM Confidence by Probing Perturbed Representation Stability), a novel method analyzing internal representational stability in LLMs. CCPS applies targeted adversarial perturbations to final hidden states, extracts features reflecting the model’s response to these perturbations, and uses a lightweight classifier to predict answer correctness. CCPS was evaluated on LLMs from 8B to 32B parameters (covering Llama, Qwen, and Mistral architectures) using MMLU and MMLU-Pro benchmarks in both multiple-choice and open-ended formats. Our results show that CCPS significantly outperforms current approaches. Across four LLMs and three MMLU variants, CCPS reduces Expected Calibration Error by approximately 55% and Brier score by 21%, while increasing accuracy by 5 percentage points, Area Under the Precision-Recall Curve by 4 percentage points, and Area Under the Receiver Operating Characteristic Curve by 6 percentage points, all relative to the strongest prior method. CCPS delivers an efficient, broadly applicable, and more accurate solution for estimating LLM confidence, thereby improving their trustworthiness.
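The probing idea can be sketched as follows: perturb a final hidden state, measure how much the answer distribution moves, and feed those stability features to a small classifier that predicts correctness. The stand-in output head, random (rather than adversarial) perturbations, and feature choices are assumptions for illustration, not the CCPS recipe.

```python
# Confidence probing via perturbed-representation stability (illustrative sketch).
import torch
import torch.nn as nn

d, vocab = 64, 100
lm_head = nn.Linear(d, vocab)                     # stand-in for the LLM's output head

def stability_features(h: torch.Tensor, n_probes: int = 8, eps: float = 0.05) -> torch.Tensor:
    base = torch.log_softmax(lm_head(h), dim=-1)
    shifts = []
    for _ in range(n_probes):
        noise = eps * torch.randn_like(h)          # the paper uses targeted adversarial directions
        pert = torch.log_softmax(lm_head(h + noise), dim=-1)
        shifts.append(torch.norm(pert - base, p=1))
    shifts = torch.stack(shifts)
    return torch.stack([shifts.mean(), shifts.std(), base.max()])   # simple stability features

confidence_probe = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
h = torch.randn(d)
p_correct = confidence_probe(stability_features(h))
print(float(p_correct))
```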
pdf
bib
abs
SATER: A Self-Aware and Token-Efficient Approach to Routing and Cascading
Yuanzhe Shen
|
Yide Liu
|
Zisu Huang
|
Ruicheng Yin
|
Xiaoqing Zheng
|
Xuanjing Huang
Large language models (LLMs) demonstrate remarkable performance across diverse tasks, yet their effectiveness frequently depends on costly commercial APIs or cloud services. Model selection thus entails a critical trade-off between performance and cost: high-performing LLMs typically incur substantial expenses, whereas budget-friendly small language models (SLMs) are constrained by limited capabilities. Current research primarily proposes two routing strategies: pre-generation routing and cascade routing. Both approaches have distinct characteristics, with cascade routing typically offering superior cost-effectiveness and accuracy despite its higher latency. To further address the limitations of both approaches, we introduce SATER, a dual-mode compatible approach that fine-tunes models through shortest-response preference optimization and a confidence-aware rejection mechanism. SATER significantly reduces redundant outputs and response times, while improving both the performance of pre-generation routing and the efficiency of cascade routing. Experiments across three SLMs and six datasets, varying in type and complexity, demonstrate that SATER achieves comparable performance while consistently reducing computational costs by over 50% and cascade latency by over 80%.
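The cascade side of this setup can be pictured with a short sketch: a small model answers first, and the query escalates to the large model only when the small model abstains or its confidence falls below a threshold. Both model functions are stubs standing in for an SLM and an LLM API; the threshold and rejection token are illustrative assumptions.

```python
# Confidence-aware cascade routing sketch: cheap SLM first, escalate only when unsure.

def small_model(query: str) -> tuple[str, float]:
    # Stub: returns (answer, confidence); a real system uses a fine-tuned SLM.
    return ("Paris", 0.92) if "capital of France" in query else ("[REJECT]", 0.10)

def large_model(query: str) -> str:
    return f"(expensive LLM answer to: {query})"

def cascade(query: str, threshold: float = 0.7) -> str:
    answer, confidence = small_model(query)
    if answer == "[REJECT]" or confidence < threshold:
        return large_model(query)          # escalate only when the SLM is unsure
    return answer                          # cheap path: SLM answer accepted

print(cascade("What is the capital of France?"))
print(cascade("Prove the Collatz conjecture."))
```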
pdf
bib
abs
DSG-MCTS: A Dynamic Strategy-Guided Monte Carlo Tree Search for Diversified Reasoning in Large Language Models
Rui Ha
|
Chaozhuo Li
|
Rui Pu
|
Litian Zhang
|
Xi Zhang
|
Sen Su
Large language models (LLMs) have shown strong potential in complex reasoning tasks. However, as task complexity increases, their performance often degrades, resulting in hallucinations, errors, and logical inconsistencies. To enhance reasoning capabilities, Monte Carlo Tree Search (MCTS) has been introduced to guide the exploration of reasoning paths in a structured manner. Despite its advantages, traditional MCTS relies on fixed reasoning strategies, limiting the diversity of reasoning paths and the coverage of the solution space. To address these limitations, we propose Dynamic Strategy-Guided MCTS (DSG-MCTS), a novel framework that dynamically integrates multiple reasoning strategies, such as abductive and analogical reasoning, to expand the reasoning space. At the same time, DSG-MCTS enhances reasoning efficiency through a dynamic strategy selection mechanism that adapts to the task context. Experimental results on challenging reasoning benchmarks demonstrate that DSG-MCTS achieves improved accuracy and efficiency, outperforming existing state-of-the-art methods.
pdf
bib
abs
CIFLEX: Contextual Instruction Flow for Sub-task Execution in Multi-Turn Interactions with a Single On-Device LLM
Juntae Lee
|
Jihwan Bang
|
Seunghan Yang
|
Simyung Chang
We present CIFLEX (Contextual Instruction FLow with EXecution), a novel execution system for efficient sub-task handling in multi-turn interactions with a single on-device large language model (LLM). As LLMs become increasingly capable, a single model is expected to handle diverse sub-tasks that support answering user requests more effectively and comprehensively. A naive approach reprocesses the entire conversation context when switching between main and sub-tasks (e.g., query rewriting, summarization), incurring significant computational overhead. CIFLEX mitigates this overhead by reusing the key-value (KV) cache from the main task and injecting only task-specific instructions into isolated side paths. After sub-task execution, the model rolls back to the main path via the cached context, thereby avoiding redundant prefill computation. To support sub-task selection, we also develop a hierarchical classification strategy tailored for small-scale models, decomposing multi-choice decisions into binary ones. Experiments show that CIFLEX significantly reduces computational costs without degrading task performance, enabling scalable and efficient multi-task dialogue on-device.
pdf
bib
abs
On the Role of Model Prior in Real-World Inductive Reasoning
Zhuo Liu
|
Ding Yu
|
Hangfeng He
Large Language Models (LLMs) show impressive inductive reasoning capabilities, enabling them to generate hypotheses that could generalize effectively to new instances when guided by in-context demonstrations. However, in real-world applications, LLMs’ hypothesis generation is not solely determined by these demonstrations but is significantly shaped by task-specific model priors. Despite their critical influence, the distinct contributions of model priors versus demonstrations to hypothesis generation have been underexplored. This study bridges this gap by systematically evaluating three inductive reasoning strategies across five real-world tasks with three LLMs. Our empirical findings reveal that hypothesis generation is primarily driven by the model’s inherent priors; removing demonstrations results in minimal loss of hypothesis quality and downstream usage. Further analysis shows that this result is consistent across various label formats and configurations, and that the prior is hard to override, even under flipped labeling. These insights advance our understanding of the dynamics of hypothesis generation in LLMs and highlight the potential for better utilizing model priors in real-world inductive reasoning tasks.
pdf
bib
abs
Viability of Machine Translation for Healthcare in Low-Resourced Languages
Hellina Hailu Nigatu
|
Nikita Mehandru
|
Negasi Haile Abadi
|
Blen Gebremeskel
|
Ahmed Alaa
|
Monojit Choudhury
Machine Translation errors in high-stakes settings like healthcare pose unique risks that could lead to clinical harm. The challenges are even more pronounced for low-resourced languages, where human translators are scarce and MT tools perform poorly. In this work, we provide a taxonomy of Machine Translation errors for the healthcare domain using a publicly available MT system. Preparing an evaluation dataset from pre-existing medical datasets, we conduct our study focusing on two low-resourced languages: Amharic and Tigrinya. Based on our error analysis and findings from prior work, we test two pre-translation interventions, namely paraphrasing the source sentence and pivoting through a related language, for their effectiveness in reducing clinical risk. We find that MT errors for healthcare most commonly happen when the source sentence includes medical terminology and procedure descriptions, synonyms, figurative language, and word order differences. We find that pre-translation interventions are not effective in reducing clinical risk if the base translation model performs poorly. Based on our findings, we provide recommendations for improving MT for healthcare.
pdf
bib
abs
Latent Inter-User Difference Modeling for LLM Personalization
Yilun Qiu
|
Tianhao Shi
|
Xiaoyan Zhao
|
Fengbin Zhu
|
Yang Zhang
|
Fuli Feng
Large language models (LLMs) are increasingly integrated into users’ daily lives, leading to a growing demand for personalized outputs. Previous work focuses on leveraging a user’s own history, overlooking inter-user differences that are crucial for effective personalization. While recent work has attempted to model such differences, the reliance on language-based prompts often hampers the effective extraction of meaningful distinctions. To address these issues, we propose Difference-aware Embedding-based Personalization (DEP), a framework that models inter-user differences in the latent space instead of relying on language prompts. DEP constructs soft prompts by contrasting a user’s embedding with those of peers who engaged with similar content, highlighting relative behavioral signals. A sparse autoencoder then filters and compresses both user-specific and difference-aware embeddings, preserving only task-relevant features before injecting them into a frozen LLM. Experiments on personalized review generation show that DEP consistently outperforms baseline methods across multiple metrics. Our code is available at https://github.com/SnowCharmQ/DEP.
pdf
bib
abs
IG-Pruning: Input-Guided Block Pruning for Large Language Models
Kangyu Qiao
|
Shaolei Zhang
|
Yang Feng
With the growing computational demands of large language models (LLMs), efficient inference has become increasingly critical for practical deployment. Depth pruning has emerged as a promising approach for reducing the computational costs of large language models by removing transformer layers. However, existing methods typically rely on fixed block masks, which can lead to suboptimal performance across different tasks and inputs. In this paper, we propose IG-Pruning, a novel input-aware block-wise pruning method that dynamically selects layer masks at inference time. Our approach consists of two stages: (1) Discovering diverse mask candidates through semantic clustering and L0 optimization, and (2) Implementing efficient dynamic pruning without the need for extensive training. Experimental results demonstrate that our method consistently outperforms state-of-the-art static depth pruning methods, making it particularly suitable for resource-constrained deployment scenarios.
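To make the dynamic selection above concrete, here is a small illustrative sketch under our own assumptions: a set of pre-discovered boolean layer masks with associated centroids in an input-embedding space, and a nearest-centroid rule that decides which transformer blocks to run for a given input. The actual candidate discovery via semantic clustering and L0 optimization is not shown.

```python
import torch

def select_mask(input_emb, mask_centroids, masks):
    """Pick the pre-discovered block mask whose cluster centroid is closest to the input embedding."""
    dists = torch.cdist(input_emb.unsqueeze(0), mask_centroids)  # shape (1, num_masks)
    return masks[dists.argmin().item()]                          # boolean list, True = keep block

def forward_with_mask(blocks, hidden, mask):
    """Run only the transformer blocks kept by the mask; skipped blocks act as identity."""
    for block, keep in zip(blocks, mask):
        if keep:
            hidden = block(hidden)
    return hidden
```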
pdf
bib
abs
Are Checklists Really Useful for Automatic Evaluation of Generative Tasks?
Momoka Furuhashi
|
Kouta Nakayama
|
Takashi Kodama
|
Saku Sugawara
Automatic evaluation of generative tasks using large language models faces challenges due to ambiguous criteria. Although automatic checklist generation is a potentially promising approach, its usefulness remains underexplored. We investigate whether checklists should be used for all questions or selectively, generate them using six methods, evaluate their effectiveness across eight model sizes, and identify checklist items that correlate with human evaluations. Through experiments on pairwise comparison and direct scoring tasks, we find that selective checklist use tends to improve evaluation performance in pairwise settings, while its benefits are less consistent in direct scoring. Our analysis also shows that even checklist items with low correlation to human scores often reflect human-written criteria, indicating potential inconsistencies in human evaluation. These findings highlight the need to more clearly define objective evaluation criteria to guide both human and automatic evaluations. Our code is available at https://github.com/momo0817/checklist-effectiveness-study.
pdf
bib
abs
Measuring the Effect of Disfluency in Multilingual Knowledge Probing Benchmarks
Kirill Semenov
|
Rico Sennrich
For multilingual factual knowledge assessment of LLMs, benchmarks such as MLAMA use template translations that do not take into account the grammatical and semantic information of the named entities inserted in the sentence. This leads to numerous instances of ungrammaticality or wrong wording in the final prompts, which complicates the interpretation of scores, especially for languages with a rich morphological inventory. In this work, we sample 4 Slavic languages from the MLAMA dataset and compare the knowledge retrieval scores between the initial (templated) MLAMA dataset and its sentence-level translations made by Google Translate and ChatGPT. We observe a significant increase in knowledge retrieval scores and provide a qualitative analysis of possible reasons behind it. We also conduct an additional analysis of 5 more languages from different families and observe similar patterns. Therefore, we encourage the community to control the grammaticality of highly multilingual datasets to obtain higher and more interpretable results; this is well approximated by whole-sentence translation with neural MT or LLM systems.
pdf
bib
abs
Knowledge Editing through Chain-of-Thought
Changyue Wang
|
Weihang Su
|
Qingyao Ai
|
Yichen Tang
|
Yiqun Liu
Knowledge Editing is a technique that updates large language models (LLMs) with new information to maintain their world knowledge. This approach avoids the need to rebuild the model from scratch, thereby addressing the high costs associated with frequent retraining. Among these, the in-context editing paradigm stands out for its effectiveness in integrating new knowledge while preserving the model’s original capabilities. Despite its potential, existing in-context knowledge editing methods are often task-specific, focusing primarily on multi-hop QA tasks using structured knowledge triples. Moreover, their reliance on few-shot prompting for task decomposition makes them unstable and less effective in generalizing across diverse tasks. In response to these limitations, we propose EditCoT, a novel knowledge editing framework that flexibly and efficiently updates LLMs across various tasks without retraining. EditCoT works by generating a chain-of-thought (CoT) for a given input and then iteratively refining this CoT process using a CoT editor based on updated knowledge. We evaluate EditCoT across a diverse range of benchmarks, covering multiple languages and tasks. The results demonstrate that our approach achieves state-of-the-art performance while offering superior generalization, effectiveness, and stability compared to existing methods, marking a significant advancement in the field of knowledge updating.
pdf
bib
abs
SelfRACG: Enabling LLMs to Self-Express and Retrieve for Code Generation
Qian Dong
|
Jia Chen
|
Qingyao Ai
|
Hongning Wang
|
Haitao Li
|
Yiwu
|
Yao Hu
|
Yiqun Liu
|
Shaoping Ma
Existing retrieval-augmented code generation (RACG) methods typically use an external retrieval module to fetch semantically similar code snippets used for generating subsequent fragments. However, even for consecutive code fragments, the content often diverges due to logical progression, resulting in a content gap. This gap undermines the performance of current RACG methods, as external retrieval modules based on content matching fail to infer the specific information need of LLMs to generate the next code fragment. Therefore, we propose SelfRACG, a novel paradigm that enables large language models (LLMs) to Self-express their information needs to enhance RACG. Specifically, SelfRACG includes an information need expression module and a two-stage information need-guided training strategy, which encourages LLMs to express their information need. Extensive experiments demonstrate that SelfRACG can retrieve external knowledge that better aligns with the LLM’s own information needs, resulting in superior generation performance compared to vanilla RACG. Moreover, both the training and deployment costs for retrieval in our framework are much lower than those of the strongest retrieval model.
pdf
bib
abs
Probing Logical Reasoning of MLLMs in Scientific Diagrams
Yufei Wang
|
Adriana Kovashka
We examine how multimodal large language models (MLLMs) perform logical inference grounded in visual information. We first construct a dataset of food web/chain images, along with questions that follow seven structured templates with progressively more complex reasoning involved. We show that complex reasoning about entities in the images remains challenging (even with elaborate prompts) and that visual information is underutilized.
pdf
bib
abs
AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training
Huishuai Zhang
|
Bohan Wang
|
Luoxin Chen
We introduce AdamS, a simple yet effective alternative to Adam for large language model (LLM) pretraining and post-training. By leveraging a novel denominator, i.e., the root of the weighted sum of squares of the momentum and the current gradient, AdamS eliminates the need for second-moment estimates. Hence, AdamS is efficient, matching the memory and compute footprint of SGD with momentum while delivering superior optimization performance. Moreover, AdamS is easy to adopt: it can directly inherit hyperparameters of AdamW, and is entirely model-agnostic, integrating seamlessly into existing pipelines without modifications to optimizer APIs or architectures. The motivation behind AdamS stems from the observed smoothness properties in transformer objectives, where local smoothness is governed by gradient magnitudes that can be further approximated by momentum magnitudes. We establish rigorous theoretical convergence guarantees and provide practical guidelines for hyperparameter selection. Empirically, AdamS demonstrates strong performance in various tasks, including pre-training runs on GPT-2 and Llama2 (up to 13B parameters) and reinforcement learning in post-training regimes. With its efficiency, simplicity, and theoretical grounding, AdamS stands as a compelling alternative to existing optimizers.
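A minimal sketch of the update rule described above, for illustration only: the denominator combines the momentum buffer and the current gradient elementwise. The exact mixing weight (`weight` below) and the weight-decay handling are our assumptions; the abstract only states that the denominator is the root of a weighted sum of squares of the two quantities.

```python
import torch

def adams_step(param, grad, momentum, lr=1e-3, beta=0.9, weight=0.9,
               eps=1e-8, weight_decay=0.0):
    """One AdamS-style update (sketch, not the authors' reference implementation)."""
    if weight_decay:
        grad = grad + weight_decay * param                 # simple L2; decoupled decay also possible
    momentum.mul_(beta).add_(grad, alpha=1 - beta)          # SGD-style momentum buffer
    denom = (weight * momentum.pow(2) + (1 - weight) * grad.pow(2)).sqrt().add_(eps)
    param.addcdiv_(momentum, denom, value=-lr)              # normalized momentum step
    return param, momentum
```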
pdf
bib
abs
Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls
Feiyang Kang
|
Newsha Ardalani
|
Michael Kuchnik
|
Youssef Emad
|
Mostafa Elhoushi
|
Shubhabrata Sengupta
|
Shang-Wen Li
|
Ramya Raghavendra
|
Ruoxi Jia
|
Carole-Jean Wu
Training data plays a crucial role in Large Language Model (LLM) scaling, yet high-quality data is of limited supply. Synthetic data techniques offer a potential path toward sidestepping these limitations. We conduct a large-scale empirical investigation (>1000 LLMs with >100k GPU hours) using a unified protocol and scaling laws, comparing natural web data, diverse synthetic types (rephrased text, generated textbooks), and mixtures of natural and synthetic data. Specifically, we find that pre-training on rephrased synthetic data alone is not faster than pre-training on natural web text, while pre-training on 1/3 rephrased synthetic data mixed with 2/3 natural web text can speed up pre-training by 5-10x (to reach the same validation loss) at larger data budgets. Pre-training on textbook-style synthetic data alone results in notably higher loss on many downstream domains, especially at small data budgets. “Good” ratios of synthetic data in training-data mixtures depend on the model size and data budget, empirically converging to ~30% for rephrased synthetic data. Larger generator models do not necessarily yield better pre-training data than ~8B-param models. These results contribute mixed evidence on “model collapse” during large-scale single-round (n=1) model training on synthetic data: training on rephrased synthetic data shows no degradation in performance at foreseeable scales, whereas training on mixtures of textbook-style pure-generated synthetic data shows patterns predicted by “model collapse”. Our work demystifies synthetic data in pre-training, validates its conditional benefits, and offers practical guidance.
pdf
bib
abs
Static or Dynamic: Towards Query-Adaptive Token Selection for Video Question Answering
Yumeng Shi
|
Quanyu Long
|
Wenya Wang
Video question answering benefits from the rich information in videos, enabling various applications. However, the large volume of tokens generated from long videos presents challenges to memory efficiency and model performance. To alleviate this, existing works propose to compress video inputs, but often overlook the varying importance of static and dynamic information across different queries, leading to inefficient token usage within limited budgets. We propose a novel token selection strategy, explore-then-select, that adaptively adjusts static and dynamic information based on question requirements. Our framework first explores different token allocations between key frames, which preserve spatial details, and delta frames, which capture temporal changes. Then it employs a query-aware attention-based metric to select the optimal token combination without model updates. Our framework is plug-and-play and can be seamlessly integrated within diverse video language models. Extensive experiments show that our method achieves significant performance improvements (up to 5.8%) on multiple video question answering benchmarks. Our code is available at *https://github.com/ANDgate99/Explore-Then-Select*.
pdf
bib
abs
DischargeSim: A Simulation Benchmark for Educational Doctor–Patient Communication at Discharge
Zonghai Yao
|
Michael Sun
|
Won Seok Jang
|
Sunjae Kwon
|
Soie Kwon
|
Hong Yu
Discharge communication is a critical yet underexplored component of patient care, where the goal shifts from diagnosis to education. While recent large language model (LLM) benchmarks emphasize in-visit diagnostic reasoning, they fail to evaluate models’ ability to support patients after the visit. We introduce DischargeSim, a novel benchmark that evaluates LLMs on their ability to act as personalized discharge educators. DischargeSim simulates post-visit, multi-turn conversations between LLM-driven DoctorAgents and PatientAgents with diverse psychosocial profiles (e.g., health literacy, education, emotion). Interactions are structured across six clinically grounded discharge topics and assessed along three axes: (1) dialogue quality via automatic and LLM-as-judge evaluation, (2) personalized document generation including free-text summaries and structured AHRQ checklists, and (3) patient comprehension through a downstream multiple-choice exam. Experiments across 18 LLMs reveal significant gaps in discharge education capability, with performance varying widely across patient profiles. Notably, model size does not always yield better education outcomes, highlighting trade-offs in strategy use and content prioritization. DischargeSim offers a first step toward benchmarking LLMs in post-visit clinical education and promoting equitable, personalized patient support.
pdf
bib
abs
Can Vision-Language Models Solve Visual Math Equations?
Monjoy Narayan Choudhury
|
Junling Wang
|
Yifan Hou
|
Mrinmaya Sachan
Despite strong performance in visual understanding and language-based reasoning, Vision-Language Models (VLMs) struggle with tasks requiring integrated perception and symbolic computation. We study this limitation through visual equation solving, where mathematical equations are embedded in images, variables are represented by object icons, and coefficients must be inferred by counting. While VLMs perform well on textual equations, they fail on visually grounded counterparts. To understand this gap, we decompose the task into coefficient counting and variable recognition, and find that counting is the primary bottleneck, even when recognition is accurate. We also observe that composing recognition and reasoning introduces additional errors, highlighting challenges in multi-step visual reasoning. Finally, as equation complexity increases, symbolic reasoning itself becomes a limiting factor. These findings reveal key weaknesses in current VLMs and point toward future improvements in visually grounded mathematical reasoning.
pdf
bib
abs
From Scores to Steps: Diagnosing and Improving LLM Performance in Evidence-Based Medical Calculations
Benlu Wang
|
Iris Xia
|
Yifan Zhang
|
Junda Wang
|
Feiyun Ouyang
|
Shuo Han
|
Arman Cohan
|
Hong Yu
|
Zonghai Yao
Large language models (LLMs) have demonstrated promising performance on medical benchmarks; however, their ability to perform medical calculations, a crucial aspect of clinical decision-making, remains underexplored and poorly evaluated. Existing benchmarks often assess only the final answer with a wide numerical tolerance, overlooking systematic reasoning failures and potentially causing serious clinical misjudgments. In this work, we revisit medical calculation evaluation with a stronger focus on clinical trustworthiness. First, we clean and restructure the MedCalc-Bench dataset and propose a new step-by-step evaluation pipeline that independently assesses formula selection, entity extraction, and arithmetic computation. Under this granular framework, the accuracy of GPT-4o drops from 62.7% to 43.6%, revealing errors masked by prior evaluations. Second, we introduce an automatic error analysis framework that generates structured attribution for each failure mode. Human evaluation confirms its alignment with expert judgment, enabling scalable and explainable diagnostics. Finally, we propose a modular agentic pipeline, MedRaC, that combines retrieval-augmented generation and Python-based code execution. Without any fine-tuning, MedRaC improves the accuracy of different LLMs from 16.35% up to 53.19%. Our work highlights the limitations of current benchmark practices and proposes a more clinically faithful methodology. By enabling transparent and transferable reasoning evaluation, we move closer to making LLM-based systems trustworthy for real-world medical applications.
pdf
bib
abs
Bridging External and Parametric Knowledge: Mitigating Hallucination of LLMs with Shared-Private Semantic Synergy in Dual-Stream Knowledge
Yi Sui
|
Chaozhuo Li
|
Chen Zhang
|
Dawei Song
|
Qiuchi Li
Retrieval-augmented generation (RAG) aims to mitigate the hallucination of Large Language Models (LLMs) by retrieving and incorporating relevant external knowledge into the generation process. However, the external knowledge may contain noise and conflict with the parametric knowledge of LLMs, leading to degraded performance. Current LLMs lack inherent mechanisms for resolving such conflicts. To fill this gap, we propose a Dual-Stream Knowledge-Augmented Framework for Shared-Private Semantic Synergy (DSSP-RAG). Central to it is the refinement of the traditional self-attention into a mixed-attention that distinguishes shared and private semantics for a controlled knowledge integration. An unsupervised hallucination detection method that captures the LLMs’ intrinsic cognitive uncertainty ensures that external knowledge is introduced only when necessary. To reduce noise in external knowledge, an Energy Quotient (EQ), defined by attention difference matrices between task-aligned and task-misaligned layers, is proposed. Extensive experiments show that DSSP-RAG achieves a superior performance over strong baselines.
pdf
bib
abs
Deep Associations, High Creativity: A Simple yet Effective Metric for Evaluating Large Language Models
Ziliang Qiu
|
Renfen Hu
The evaluation of LLMs’ creativity represents a crucial research domain, though challenges such as data contamination and costly human assessments often impede progress. Drawing inspiration from human creativity assessment, we propose PACE, asking LLMs to generate Parallel Chains of Associations to Evaluate their creativity. PACE minimizes the risk of data contamination and offers a straightforward, highly efficient evaluation, as evidenced by its strong correlation with Arena Creative Writing (Spearman’s 𝜌 = 0.739, p < 0.001) across various proprietary and open-source models. A comparative analysis of associative creativity between LLMs and humans reveals that while high-performing LLMs achieve scores comparable to average human performance, top-performing humans consistently outperform LLMs. Furthermore, linguistic analysis reveals that both humans and LLMs exhibit a trend of decreasing concreteness in their associations, with humans demonstrating a greater diversity of associative patterns.
pdf
bib
abs
Identifying Unlearned Data in LLMs via Membership Inference Attacks
Advit Deepak
|
Megan Mou
|
Jing Huang
|
Diyi Yang
Unlearning evaluation has traditionally followed the retrieval paradigm, where adversaries attempt to extract residual knowledge of an unlearning target by issuing queries to a language model. However, the absence of retrievable knowledge does not necessarily prevent an adversary from inferring which targets have been intentionally unlearned in the post-training optimization. Such inferences can still pose significant privacy risks, as they may reveal the sensitive data in the model’s training set and the internal policies of model creators. To quantify such privacy risks, we propose a new evaluation framework **Forensic Unlearning Membership Attacks (FUMA)**, drawing on principles from membership inference attacks. FUMA assesses whether unlearning leaves behind detectable artifacts that can be exploited to infer membership in the forget set. Specifically, we evaluate four major optimization-based unlearning methods on 258 models across diverse unlearning settings and show that examples in the forget set can be identified with up to 99% accuracy. This highlights privacy risks not covered in existing retrieval-based benchmarks. We conclude by discussing recommendations to mitigate these vulnerabilities.
pdf
bib
abs
Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models
Zihao Li
|
Xu Wang
|
Yuzhe Yang
|
Ziyu Yao
|
Haoyi Xiong
|
Mengnan Du
Large Language Models (LLMs) demonstrate the ability to solve reasoning and mathematical problems using the Chain-of-Thought (CoT) technique. Expanding CoT length, as seen in models such as DeepSeek-R1, significantly enhances this reasoning for complex problems, but requires costly and high-quality long CoT data and fine-tuning. This work, inspired by the deep thinking paradigm of DeepSeek-R1, utilizes a steering technique to enhance the reasoning ability of an LLM without external datasets. Our method first employs Sparse Autoencoders (SAEs) to extract interpretable features from vanilla CoT. These features are then used to steer the LLM’s internal states during generation. Recognizing that many LLMs do not have corresponding pre-trained SAEs, we further introduce a novel SAE-free steering algorithm, which directly computes steering directions from the residual activations of an LLM, obviating the need for an explicit SAE. Experimental results demonstrate that both our SAE-based and subsequent SAE-free steering algorithms significantly enhance the reasoning capabilities of LLMs.
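As an illustration of the residual-stream steering idea described above, the sketch below derives a steering direction from mean activation differences and injects it with a forward hook. This is a generic steering recipe under our own assumptions (layer choice, scaling factor `alpha`, mean-difference direction), not the paper's exact SAE-based or SAE-free algorithm.

```python
import torch

def steering_direction(acts_reasoning, acts_vanilla):
    """Steering vector as the difference of mean residual-stream activations.

    acts_reasoning / acts_vanilla: (num_examples, hidden_dim) activations collected
    from deep-thinking vs. vanilla CoT runs; the paper's computation may differ.
    """
    return acts_reasoning.mean(dim=0) - acts_vanilla.mean(dim=0)

def add_steering_hook(block, direction, alpha=4.0):
    """Add alpha * direction to a transformer block's output hidden states during generation."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.dtype).to(hidden.device)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return block.register_forward_hook(hook)  # keep the handle to remove the hook later
```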
pdf
bib
abs
LLMs cannot spot math errors, even when allowed to peek into the solution
Kv Aditya Srivatsa
|
Kaushal Kumar Maurya
|
Ekaterina Kochmar
Large language models (LLMs) demonstrate remarkable performance on math word problems, yet they have been shown to struggle with meta-reasoning tasks such as identifying errors in student solutions. In this work, we investigate the challenge of locating the first error step in stepwise solutions using two error reasoning datasets: VtG and PRM800K. Our experiments show that state-of-the-art LLMs struggle to locate the first error step in student solutions even when given access to the reference solution. To address this, we propose an approach that generates an intermediate corrected student solution that aligns more closely with the original student’s solution, which helps improve performance.
pdf
bib
abs
Can LLMs be Good Graph Judge for Knowledge Graph Construction?
Haoyu Huang
|
Chong Chen
|
Zeang Sheng
|
Yang Li
|
Wentao Zhang
In real-world scenarios, most of the data obtained from information retrieval (IR) systems is unstructured. Converting natural language sentences into structured Knowledge Graphs (KGs) remains a critical challenge. We identify three limitations of existing KG construction methods: (1) real-world documents can contain a large amount of noise, which results in extracting messy information; (2) naive LLMs usually extract inaccurate knowledge from domain-specific documents; and (3) the hallucination phenomenon cannot be overlooked when directly using LLMs to construct KGs. In this paper, we propose GraphJudge, a KG construction framework that addresses these challenges. In this framework, we design an entity-centric strategy to eliminate noisy information in the documents, and we fine-tune an LLM as a graph judge to enhance the quality of the generated KGs. Experiments conducted on two general and one domain-specific text-graph pair datasets demonstrate state-of-the-art performance against various baseline methods with strong generalization abilities.
pdf
bib
abs
NeuroAda: Activating Each Neuron’s Potential for Parameter-Efficient Fine-Tuning
Zhi Zhang
|
Yixian Shen
|
Congfeng Cao
|
Ekaterina Shutova
Existing parameter-efficient fine-tuning (PEFT) methods primarily fall into two categories: addition-based and selective in-situ adaptation. The former, such as LoRA, introduce additional modules to adapt the model to downstream tasks, offering strong memory efficiency. However, their representational capacity is often limited, making them less suitable for fine-grained adaptation. In contrast, the latter directly fine-tunes a carefully chosen subset of the original model parameters, allowing for more precise and effective adaptation, but at the cost of significantly increased memory consumption. To reconcile this trade-off, we propose NeuroAda, a novel PEFT method that enables fine-grained model fine-tuning while maintaining high memory efficiency. Our approach first identifies important parameters (i.e., connections within the network) as in selective adaptation, and then introduces bypass connections for these selected parameters. During fine-tuning, only the bypass connections are updated, leaving the original model parameters frozen. Empirical results on 23+ tasks spanning both natural language generation and understanding demonstrate that NeuroAda achieves state-of-the-art performance with as little as 0.02% trainable parameters, while reducing CUDA memory usage by up to 60%. We release our code here: https://github.com/FightingFighting/NeuroAda.git.
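A rough sketch of the bypass idea, under our own assumptions: importance comes from precomputed gradient-based scores, the bypass is stored as a dense masked delta for clarity (an efficient implementation would keep only the selected entries in a sparse format), and the base weights stay frozen. This is not the authors' reference code.

```python
import torch
import torch.nn as nn

class BypassLinear(nn.Module):
    """Frozen linear layer plus a trainable delta restricted to the most important connections."""
    def __init__(self, base: nn.Linear, importance: torch.Tensor, k: int):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                        # original weights stay frozen
        idx = importance.flatten().topk(k).indices          # top-k most important connections
        mask = torch.zeros(base.weight.numel(), dtype=torch.bool)
        mask[idx] = True
        self.register_buffer("mask", mask.view_as(base.weight))
        self.delta = nn.Parameter(torch.zeros_like(base.weight))  # bypass parameters

    def forward(self, x):
        w = self.base.weight + self.delta * self.mask        # only masked entries take effect
        return nn.functional.linear(x, w, self.base.bias)
```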
pdf
bib
abs
NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities
Abdellah El Mekki
|
Houdaifa Atou
|
Omer Nacar
|
Shady Shehata
|
Muhammad Abdul-Mageed
Enhancing the linguistic capabilities of Large Language Models (LLMs) to include low-resource languages is a critical research area. Current research directions predominantly rely on synthetic data generated by translating English corpora, which, while demonstrating promising linguistic understanding and translation abilities, often results in models aligned with source-language culture. These models frequently fail to represent the cultural heritage and values of local communities. This work proposes a methodology to create both synthetic and retrieval-based pre-training data tailored to a specific community, considering its (i) language, (ii) cultural heritage, and (iii) cultural values. We demonstrate our methodology using Egyptian and Moroccan dialects as testbeds, chosen for their linguistic and cultural richness and current underrepresentation in LLMs. As a proof-of-concept, we develop NileChat, a 3B-parameter Egyptian and Moroccan Arabic LLM adapted for Egyptian and Moroccan communities, incorporating their language, cultural heritage, and values. Our results on various understanding, translation, and cultural and values alignment benchmarks show that NileChat outperforms existing Arabic-aware LLMs of similar size and performs on par with larger models. This work addresses Arabic dialects in LLMs with a focus on cultural and values alignment via controlled synthetic data generation and retrieval-augmented pre-training for Moroccan Darija and Egyptian Arabic, including Arabizi variants, advancing Arabic NLP for low-resource communities. We share our methods, data, and models with the community to promote the inclusion and coverage of more diverse communities in cultural LLM development: https://github.com/UBC-NLP/nilechat.
pdf
bib
abs
A Computational Simulation of Language Production in First Language Acquisition
Yuan Gao
|
Weiwei Sun
We introduce a computational framework for modeling child language production, focusing on the acquisition of the competence to map meaning onto linguistic form. Our approach uses graphs to formalize meaning and Synchronous Hyperedge Replacement Grammar (SHRG) to formalize the syntax–semantics interface. This setup provides computationally sound algorithms for inducing statistical grammar knowledge. We induce SHRGs solely from semantic graphs, and the resulting interpretable grammars are evaluated by their ability to generate utterances, providing a novel controlled paradigm to simulate child language acquisition. A notable finding is that unsupervised statistical learning (analogous to children’s implicit learning mechanisms) performs as well as the corresponding supervised oracle when a proper symbolic grammar is assumed (reflecting knowledge gained via comprehension).
pdf
bib
abs
Long-Form Information Alignment Evaluation Beyond Atomic Facts
Danna Zheng
|
Mirella Lapata
|
Jeff Z. Pan
Information alignment evaluators are vital for various NLG evaluation tasks and trustworthy LLM deployment, reducing hallucinations and enhancing user trust. Current fine-grained methods, like FactScore, verify facts individually but neglect inter-fact dependencies, enabling subtle vulnerabilities. In this work, we introduce MontageLie, a challenging benchmark that constructs deceptive narratives by “montaging” truthful statements without introducing explicit hallucinations. We demonstrate that both coarse-grained LLM-based evaluators and current fine-grained frameworks are susceptible to this attack, with AUC-ROC scores falling below 65%. To enable more robust fine-grained evaluation, we propose DoveScore, a novel framework that jointly verifies factual accuracy and event-order consistency. By modeling inter-fact relationships, DoveScore outperforms existing fine-grained methods by over 8%, providing a more robust solution for long-form text alignment evaluation. Our code and datasets are available at https://github.com/dannalily/DoveScore.
pdf
bib
abs
Voice of a Continent: Mapping Africa’s Speech Technology Frontier
AbdelRahim A. Elmadany
|
Sang Yun Kwon
|
Hawau Olamide Toyin
|
Alcides Alcoba Inciarte
|
Hanan Aldarmaki
|
Muhammad Abdul-Mageed
Africa’s rich linguistic diversity remains significantly underrepresented in speech technologies, creating barriers to digital inclusion. To alleviate this challenge, we systematically map the continent’s speech space of datasets and technologies, leading to a new comprehensive benchmark SimbaBench for downstream African speech tasks. Using SimbaBench, we introduce the Simba family of models, achieving state-of-the-art performance across multiple African languages and speech tasks. Our benchmark analysis reveals critical patterns in resource availability, while our model evaluation demonstrates how dataset quality, domain diversity, and language family relationships influence performance across languages. Our work highlights the need for expanded speech technology resources that better reflect Africa’s linguistic diversity and provides a solid foundation for future research and development efforts toward more inclusive speech technologies.
pdf
bib
abs
Cache-Efficient Posterior Sampling for Reinforcement Learning with LLM-Derived Priors Across Discrete and Continuous Domains
Ibne Farabi Shihab
|
Sanjeda Akter
|
Anuj Sharma
Integrating large language models (LLMs) as action proposers in reinforcement learning (RL) significantly boosts performance in text-based environments but incurs prohibitive computational costs. We introduce a cache-efficient framework for Bayesian RL that leverages LLM-derived action suggestions, drastically reducing these costs while maintaining near-optimal performance. Our approach features an adaptive caching mechanism, optimized via meta-learning based on policy performance, to enable efficient inference across text-based games (e.g., TextWorld, ALFWorld) and robotic control tasks (e.g., MuJoCo, MetaWorld). This framework achieves a 3.8×–4.7× reduction in LLM queries and 4.0×–12.0× lower median latencies (85–93ms on consumer hardware), while retaining 96–98% of the uncached policy’s performance. We provide theoretical guarantees on the reliability of cached decisions with Kullback-Leibler (KL) divergence bounds, which are validated empirically by high success rates (90.4–95.6%) in complex text environments. For offline RL, our proposed CQL-Prior variant improves performance by 14–29% and reduces training time by 38–40%. Evaluations across eight diverse tasks demonstrate the framework’s generalizability and practicality for resource-constrained settings, making LLM-guided RL a viable and accessible approach for both text-based and robotic applications.
pdf
bib
abs
Circuit Complexity Bounds for RoPE-based Transformer Architecture
Bo Chen
|
Xiaoyu Li
|
Yingyu Liang
|
Jiangxuan Long
|
Zhenmei Shi
|
Zhao Song
|
Jiahao Zhang
Characterizing the expressive power of the Transformer architecture is critical to understanding its capacity limits and scaling laws. Recent works provide circuit complexity bounds for Transformer-like architectures. Meanwhile, position embedding has emerged as a crucial technique in modern large language models, offering superior performance in capturing positional information, particularly in long-context scenarios. In this work, we take a circuit complexity perspective and rigorously analyze Transformers augmented with widely adopted positional embeddings. We prove that, under standard complexity assumptions, such models remain incapable of efficiently solving canonical tasks such as arithmetic formula evaluation and Boolean formula value computation. Our results expose a fundamental expressivity limitation that persists despite the remarkable empirical success of positionally-enhanced Transformers. Beyond tightening known complexity bounds, our findings offer new theoretical insights for designing future architectures with provably stronger reasoning and compositional capabilities.
pdf
bib
abs
Efficient Unstructured Pruning of Mamba State-Space Models for Resource-Constrained Environments
Ibne Farabi Shihab
|
Sanjeda Akter
|
Anuj Sharma
As the deployment of AI models shifts towards edge devices, developing efficient sequence models has become critical. State-space models (SSMs), particularly Mamba, have emerged as strong rivals to Transformers due to their linear-time complexity and impressive performance across a range of tasks. However, their large parameter counts still hinder their use in resource-constrained environments. To address this, we propose a novel unstructured pruning framework specifically tailored for Mamba, achieving up to 70% parameter reduction with only a 3–9% drop in performance. Unlike pruning techniques designed for Transformers, our approach leverages Mamba’s unique recurrent dynamics by incorporating pruning based on both weight and gradient importance to preserve critical parameters, a gradual pruning schedule to maintain model stability, and a global strategy to optimize parameter allocation across the model. Extensive experiments on the WikiText-103, Long Range Arena, and ETT benchmarks demonstrate significant efficiency gains, including 1.77× faster inference and a 46% reduction in memory usage. Our component analysis confirms Mamba’s robustness to pruning, highlighting the framework’s potential for enabling practical deployment while underscoring the need for careful evaluation to avoid introducing biases in sensitive applications.
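To ground the pruning recipe above, here is a small sketch of unstructured pruning with combined weight-and-gradient importance and a gradual cubic sparsity schedule applied globally across tensors. These are common pruning ingredients assumed for illustration; the paper's Mamba-specific handling of recurrent dynamics is not reproduced here.

```python
import torch

def importance(weight, grad):
    """First-order saliency: score each parameter by |w| * |grad|."""
    return weight.abs() * grad.abs()

def sparsity_at(step, total_steps, final_sparsity=0.7):
    """Gradual cubic schedule from 0 to the target sparsity (70% parameter reduction)."""
    frac = min(step / total_steps, 1.0)
    return final_sparsity * (1 - (1 - frac) ** 3)

def apply_global_prune(named_weights, named_grads, sparsity):
    """Globally rank scores across all tensors and zero out the lowest-scoring parameters."""
    scores = torch.cat([importance(w, named_grads[n]).flatten()
                        for n, w in named_weights.items()])
    k = int(sparsity * scores.numel())
    if k == 0:
        return
    threshold = scores.kthvalue(k).values
    for n, w in named_weights.items():
        mask = importance(w, named_grads[n]) > threshold
        w.mul_(mask)   # in-place: pruned entries become exact zeros
```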
pdf
bib
abs
Towards Infinite-Long Prefix in Transformer
Yingyu Liang
|
Zhenmei Shi
|
Zhao Song
|
Chiwun Yang
Prompting and context-based fine-tuning methods, which we call Prefix Learning, have been proposed to enhance the performance of language models on various downstream tasks. They are empirically efficient and effective, matching the performance of full-parameter fine-tuning, but their theoretical understanding is limited. In this paper, we aim to address this limitation by studying their ability from the perspective of prefix length. In particular, we provide a convergence guarantee for training an ultra-long prefix in a stylized setting using the Neural Tangent Kernel (NTK) framework. Based on this strong theoretical guarantee, we design and implement an algorithm that only needs to introduce and fine-tune a few extra trainable parameters, instead of an infinitely long prefix, in each layer of a transformer, and can approximate prefix attention to within a guaranteed polynomially small error. Preliminary experimental results on vision, natural language, and math data show that our method achieves superior or competitive performance compared to existing methods such as full-parameter fine-tuning, P-Tuning V2, and LoRA. This demonstrates that our method is promising for parameter-efficient fine-tuning.
pdf
bib
abs
LATTE: Learning to Think with Vision Specialists
Zixian Ma
|
Jianguo Zhang
|
Zhiwei Liu
|
Jieyu Zhang
|
Juntao Tan
|
Manli Shu
|
Juan Carlos Niebles
|
Shelby Heinecke
|
Huan Wang
|
Caiming Xiong
|
Ranjay Krishna
|
Silvio Savarese
While open-source vision-language models perform well on simple question-answering, they still struggle with complex questions that require both perceptual and reasoning capabilities. We propose LATTE, a family of vision-language models that have LeArned to Think wiTh vision spEcialists. By offloading perception to state-of-the-art vision models, our approach enables vision-language models to focus solely on reasoning over high-quality perceptual information. To train LATTE, we synthesize and filter a large dataset of 293K multi-modal reasoning traces over perceptual outputs of vision specialists. LATTE trained on this data achieves significant 4-5% gains over baselines across 6 benchmarks covering both perception and reasoning abilities. Ablation studies reveal that the effectiveness of multi-modal reasoning traces depends on the data sources, formats, and quality of thoughts.
pdf
bib
abs
SUA: Stealthy Multimodal Large Language Model Unlearning Attack
Xianren Zhang
|
Hui Liu
|
Delvin Ce Zhang
|
Xianfeng Tang
|
Qi He
|
Dongwon Lee
|
Suhang Wang
Multimodal Large Language Models (MLLMs) trained on massive data may memorize sensitive personal information and photos, posing serious privacy risks. To mitigate this, MLLM unlearning methods have been proposed, which fine-tune MLLMs so that they forget sensitive information. However, it remains unclear whether the knowledge has been truly forgotten or is just hidden in the model. Therefore, we propose to study a novel problem of MLLM unlearning attack, which aims to recover the unlearned knowledge of an unlearned MLLM. To achieve this goal, we propose Stealthy Unlearning Attack (SUA), a framework that learns a universal noise pattern. When applied to input images, this noise can trigger the model to reveal unlearned content. While pixel-level perturbations may be visually subtle, they can be detected in the semantic embedding space, making such attacks vulnerable to potential defenses. To improve stealthiness, we introduce an embedding alignment loss that minimizes the difference between the perturbed and denoised image embeddings, ensuring the attack is semantically unnoticeable. Experimental results show that SUA can effectively recover unlearned information from MLLMs. Furthermore, the learned noise generalizes well: a single perturbation trained on a subset of samples can reveal forgotten content in unseen images. This indicates that knowledge reappearance is not an occasional failure, but a consistent behavior.
pdf
bib
abs
ResFormer: All-Time Reservoir Memory for Long Sequence Classification
Hongbo Liu
|
Jia Xu
Sequence classification is essential in NLP for understanding and categorizing language patterns in tasks like sentiment analysis, intent detection, and topic classification. Transformer-based models, despite achieving state-of-the-art performance, have inherent limitations due to quadratic time and memory complexity, restricting their input length. Although extensive efforts have aimed at reducing computational demands, processing extensive contexts remains challenging. To overcome these limitations, we propose ResFormer, a novel neural network architecture designed to model varying context lengths efficiently through a cascaded methodology. ResFormer integrates a reservoir computing network featuring a nonlinear readout to effectively capture long-term contextual dependencies in linear time. Concurrently, short-term dependencies within sentences are modeled using a conventional Transformer architecture with fixed-length inputs. Experiments demonstrate that ResFormer significantly outperforms DeepSeek-Qwen and ModernBERT baselines, delivering an accuracy improvement of up to +22.3% on the EmoryNLP dataset and consistent gains on MultiWOZ, MELD, and IEMOCAP. In addition, ResFormer exhibits reduced memory consumption, underscoring its effectiveness and efficiency in modeling extensive contextual information.
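The reservoir component can be pictured with an echo-state-style sketch: a fixed random recurrent state processes an arbitrarily long sequence of token embeddings in linear time, and only a nonlinear readout is trained. Dimensions, the leak rate, and the coupling with the Transformer branch are our assumptions for illustration, not ResFormer's exact configuration.

```python
import torch
import torch.nn as nn

class Reservoir(nn.Module):
    """Fixed random recurrent state with a trainable nonlinear readout (sketch)."""
    def __init__(self, d_in, d_res=1024, d_out=256, leak=0.3):
        super().__init__()
        # Input and recurrent weights are random and never trained.
        self.register_buffer("w_in", torch.randn(d_res, d_in) * 0.1)
        self.register_buffer("w_res", torch.randn(d_res, d_res) * (0.9 / d_res ** 0.5))
        self.leak = leak
        self.readout = nn.Sequential(nn.Linear(d_res, d_out), nn.Tanh())  # trainable

    def forward(self, x):                          # x: (seq_len, d_in) token embeddings
        h = torch.zeros(self.w_res.shape[0])
        for t in range(x.shape[0]):                # linear in sequence length
            pre = self.w_in @ x[t] + self.w_res @ h
            h = (1 - self.leak) * h + self.leak * torch.tanh(pre)
        return self.readout(h)                     # long-range context summary vector
```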
pdf
bib
abs
Back Attention: Understanding and Enhancing Multi-Hop Reasoning in Large Language Models
Zeping Yu
|
Yonatan Belinkov
|
Sophia Ananiadou
We investigate how large language models (LLMs) perform latent multi-hop reasoning in prompts like “Wolfgang Amadeus Mozart’s mother’s spouse is”. To analyze this process, we introduce logit flow, an interpretability method that traces how logits propagate across layers and positions toward the final prediction. Using logit flow, we identify four distinct stages in single-hop knowledge prediction: (A) entity subject enrichment, (B) entity attribute extraction, (C) relation subject enrichment, and (D) relation attribute extraction. Extending this analysis to multi-hop reasoning, we find that failures often stem from the relation attribute extraction stage, where conflicting logits reduce prediction accuracy. To address this, we propose back attention, a novel mechanism that enables lower layers to leverage higher-layer hidden states from different positions during attention computation. With back attention, a 1-layer transformer achieves the performance of a 2-layer transformer. Applied to five LLMs, back attention improves accuracy on five reasoning datasets, demonstrating its effectiveness in enhancing latent multi-hop reasoning ability. Code and data are available at https://github.com/zepingyu0512/back-attention.
pdf
bib
abs
Interdisciplinary Research in Conversation: A Case Study in Computational Morphology for Language Documentation
Enora Rice
|
Katharina von der Wense
|
Alexis Palmer
Computational morphology has the potential to support language documentation through tasks like morphological segmentation and the generation of Interlinear Glossed Text (IGT). However, our research outputs have seen limited use in real-world language documentation settings. This position paper situates the disconnect between computational morphology and language documentation within a broader misalignment between research and practice in NLP and argues that the field risks becoming decontextualized and ineffectual without systematic integration of User-Centered Design (UCD). To demonstrate how principles from UCD can reshape the research agenda, we present a case study of GlossLM, a state-of-the-art multilingual IGT generation model. Through a small-scale user study with three documentary linguists, we find that despite strong metric-based performance, the system fails to meet core usability needs in real documentation contexts. These insights raise new research questions around model constraints, label standardization, segmentation, and personalization. We argue that centering users not only produces more effective tools, but surfaces richer, more relevant research directions.
pdf
bib
abs
Analyzing Uncertainty of LLM-as-a-Judge: Interval Evaluations with Conformal Prediction
Huanxin Sheng
|
Xinyi Liu
|
Hangfeng He
|
Jieyu Zhao
|
Jian Kang
LLM-as-a-judge has become a promising paradigm for using large language models (LLMs) to evaluate natural language generation (NLG), but the uncertainty of its evaluation remains underexplored. This lack of reliability may limit its deployment in many applications. This work presents the first framework to analyze the uncertainty by offering a prediction interval of LLM-based scoring via conformal prediction. Conformal prediction constructs continuous prediction intervals from a single evaluation run, and we design an ordinal boundary adjustment for discrete rating tasks. We also suggest a midpoint-based score within the interval as a low-bias alternative to the raw model score and weighted average. We perform extensive experiments and analysis, which show that conformal prediction can provide valid prediction intervals with coverage guarantees. We also explore the usefulness of the interval midpoint and judge reprompting for better judgment.
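A minimal split-conformal sketch of the interval construction described above, assuming absolute residuals on a human-labeled calibration set as the nonconformity score and a simple round-outward rule as the ordinal adjustment; the paper's boundary adjustment and midpoint definition may differ.

```python
import numpy as np

def conformal_interval(cal_scores, cal_labels, test_score, alpha=0.1,
                       rating_min=1, rating_max=5):
    """Split-conformal interval for a discrete judge rating (sketch).

    cal_scores / cal_labels: LLM judge scores and human ratings on a calibration set.
    Returns an integer rating interval with ~(1 - alpha) coverage plus its midpoint.
    """
    resid = np.abs(np.asarray(cal_scores) - np.asarray(cal_labels))   # nonconformity scores
    n = len(resid)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)               # finite-sample correction
    q = np.quantile(resid, level, method="higher")
    lo = max(rating_min, int(np.floor(test_score - q)))                # round outward to valid
    hi = min(rating_max, int(np.ceil(test_score + q)))                 # ordinal rating values
    midpoint = (lo + hi) / 2                                           # low-bias point score
    return lo, hi, midpoint

# Example: interval for a judge score of 3.4 given calibration data.
print(conformal_interval([2.8, 4.1, 3.0, 4.9, 1.7], [3, 4, 3, 5, 2], 3.4))
```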
pdf
bib
abs
AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time
Junyu Zhang
|
Runpei Dong
|
Han Wang
|
Xuying Ning
|
Haoran Geng
|
Peihao Li
|
Xialin He
|
Yutong Bai
|
Jitendra Malik
|
Saurabh Gupta
|
Huan Zhang
This paper presents AlphaOne (𝛼1), a universal framework for modulating reasoning progress in large reasoning models (LRMs) at test time. 𝛼1 first introduces the 𝛼 moment, which represents the scaled thinking phase with a universal parameter 𝛼. Within this scaled pre-𝛼-moment phase, it dynamically schedules slow-thinking transitions by modeling the insertion of reasoning transition tokens as a Bernoulli stochastic process. After the 𝛼 moment, 𝛼1 deterministically terminates slow thinking with the end-of-thinking token, thereby fostering fast reasoning and efficient answer generation. This approach unifies and generalizes existing monotonic scaling methods by enabling flexible and dense slow-to-fast reasoning modulation. Extensive empirical studies on various challenging benchmarks across mathematical, coding, and scientific domains demonstrate 𝛼1’s superior reasoning capability and efficiency. Project page: https://alphaone-project.github.io/.
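The scheduling rule reads naturally as a per-step decision during decoding: before the 𝛼 moment, inject a slow-thinking transition token with Bernoulli probability; after it, close the thinking phase deterministically. The sketch below is our own paraphrase of that rule; the token strings, the constant insertion probability, and the position-based definition of the 𝛼 moment are placeholders, not AlphaOne's actual schedule.

```python
import random

def schedule_thinking(tokens_so_far, alpha_moment, p_slow=0.1,
                      wait_token="Wait,", end_think_token="</think>"):
    """Decide whether to inject a reasoning-transition token at this decoding step (sketch).

    Before the alpha moment, slow-thinking tokens are inserted as a Bernoulli process;
    once the alpha moment is reached, thinking is deterministically terminated.
    Returns the token string to inject, or None to let the model decode normally.
    """
    if tokens_so_far < alpha_moment:
        return wait_token if random.random() < p_slow else None
    return end_think_token
```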
pdf
bib
abs
Dual-Path Dynamic Fusion with Learnable Query for Multimodal Sentiment Analysis
Miao Zhou
|
Lina Yang
|
Thomas Wu
|
Dongnan Yang
|
Xinru Zhang
Multimodal Sentiment Analysis (MSA) is the task of understanding human emotions by analyzing a combination of different data sources, such as text, audio, and visual inputs. Although recent advances have improved emotion modeling across modalities, existing methods still struggle with two fundamental challenges: balancing global and fine-grained sentiment contributions, and over-reliance on the text modality. To address these issues, we propose DPDF-LQ (Dual-Path Dynamic Fusion with Learnable Query), an architecture that processes inputs through two complementary paths: global and local. The global path is responsible for establishing cross-modal dependencies, while the local path captures fine-grained representations. Additionally, we introduce the key module Dynamic Global Learnable Query Attention (DGLQA) in the global path, which dynamically allocates weights to each modality to capture their relevant features and learn global representations. Extensive experiments on the CMU-MOSI and CMU-MOSEI benchmarks demonstrate that DPDF-LQ achieves state-of-the-art performance, particularly in fine-grained sentiment prediction by effectively combining global and local features. Our code will be released at
https://github.com/ZhouMiaoGX/DPDF-LQ.
pdf
bib
abs
CaKE: Circuit-aware Editing Enables Generalizable Knowledge Learners
Yunzhi Yao
|
Jizhan Fang
|
Jia-Chen Gu
|
Ningyu Zhang
|
Shumin Deng
|
Huajun Chen
|
Nanyun Peng
Knowledge Editing (KE) enables the modification of outdated or incorrect information in large language models (LLMs). While existing KE methods can update isolated facts, they often fail to generalize these updates to multi-hop reasoning tasks that rely on the modified knowledge. Through an analysis of reasoning circuits (the neural pathways LLMs use for knowledge-based inference), we find that current layer-localized KE approaches (e.g., MEMIT, WISE), which edit only single or a few model layers, inadequately integrate updated knowledge into these reasoning pathways. To address this limitation, we present CaKE (Circuit-aware Knowledge Editing), a novel method that enhances the effective integration of updated knowledge in LLMs. By leveraging only a few curated data samples guided by our circuit-based analysis, CaKE stimulates the model to develop appropriate reasoning circuits for newly incorporated knowledge. Experiments show that CaKE enables more accurate and consistent use of edited knowledge across related reasoning tasks, achieving an average improvement of 20% in multi-hop reasoning accuracy on the MQuAKE dataset while requiring less memory than existing KE methods.
pdf
bib
abs
DEL-ToM: Inference-Time Scaling for Theory-of-Mind Reasoning via Dynamic Epistemic Logic
Yuheng Wu
|
Jianwen Xie
|
Denghui Zhang
|
Zhaozhuo Xu
Theory-of-Mind (ToM) tasks pose a unique challenge for large language models (LLMs), which often lack the capability for dynamic logical reasoning. In this work, we propose DEL-ToM, a framework that improves verifiable ToM reasoning through inference-time scaling rather than architectural changes. Our approach decomposes ToM tasks into a sequence of belief updates grounded in Dynamic Epistemic Logic (DEL), enabling structured and verifiable dynamic logical reasoning. We use data generated automatically via a DEL simulator to train a verifier, which we call the Process Belief Model (PBM), to score each belief update step. During inference, the PBM evaluates candidate belief traces from the LLM and selects the highest-scoring one. This allows LLMs to allocate extra inference-time compute to yield more transparent reasoning. Experiments across model scales and benchmarks show that DEL-ToM consistently improves performance, demonstrating that verifiable belief supervision significantly enhances LLMs’ ToM capabilities without retraining. Code is available at https://github.com/joel-wu/DEL-ToM.
pdf
bib
abs
Collaborative Beam Search: Enhancing LLM Reasoning via Collective Consensus
Yangyifan Xu
|
Shuo Ren
|
Jiajun Zhang
Complex multi-step reasoning remains challenging for large language models (LLMs). While parallel inference-time scaling methods, such as step-level beam search, offer a promising solution, existing approaches typically depend either on domain-specific external verifiers or on self-evaluation, which is brittle and prompt-sensitive. To address these issues, we propose Collaborative Beam Search (CBS), an iterative framework that harnesses the collective intelligence of multiple LLMs across both generation and verification stages. For generation, CBS leverages multiple LLMs to explore a broader search space, resulting in more diverse candidate steps. For verification, CBS employs a perplexity-based collective consensus among these models, eliminating reliance on an external verifier or complex prompts. Between iterations, CBS leverages a dynamic quota allocation strategy that reassigns generation budget based on each model’s past performance, striking a balance between candidate diversity and quality. Experimental results on six tasks across arithmetic, logical, and commonsense reasoning show that CBS outperforms single-model scaling and multi-model ensemble baselines by over 4 percentage points in average accuracy, demonstrating its effectiveness and general applicability.
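The perplexity-based consensus can be sketched as ranking candidate steps by the average per-token negative log-likelihood assigned by all participating models. The `token_logprobs` interface below is a hypothetical stand-in for whatever scoring API the models expose; the exact aggregation CBS uses may differ.

```python
def consensus_score(candidate_ids, models):
    """Average per-token negative log-likelihood of a candidate step across all models.

    Each model is assumed to expose `token_logprobs(ids) -> list[float]` (a hypothetical
    interface); lower scores indicate stronger collective consensus on the step.
    """
    nlls = []
    for model in models:
        logprobs = model.token_logprobs(candidate_ids)
        nlls.append(-sum(logprobs) / max(len(logprobs), 1))
    return sum(nlls) / len(nlls)

def select_beam(candidates, models, beam_size=4):
    """Keep the beam_size candidate steps with the best (lowest) consensus perplexity."""
    return sorted(candidates, key=lambda ids: consensus_score(ids, models))[:beam_size]
```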
pdf
bib
abs
Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation
Keane Ong
|
Rui Mao
|
Deeksha Varshney
|
Paul Pu Liang
|
Erik Cambria
|
Gianmarco Mengaldo
Counterfactual reasoning typically involves considering alternatives to actual events. While often applied to understand past events, a distinct form—forward counterfactual reasoning—focuses on anticipating plausible future developments. This type of reasoning is invaluable in dynamic financial markets, where anticipating market developments can powerfully unveil potential risks and opportunities for stakeholders, guiding their decision-making. However, performing this at scale is challenging due to the cognitive demands involved, underscoring the need for automated solutions. Large Language Models (LLMs) offer promise, but remain unexplored for this application. To address this gap, we introduce a novel benchmark, Fin-Force—**FIN**ancial **FOR**ward **C**ounterfactual **E**valuation. By curating financial news headlines and providing structured evaluation, Fin-Force supports LLM based forward counterfactual generation. This paves the way for scalable and automated solutions for exploring and anticipating future market developments, thereby providing structured insights for decision-making. Through experiments on Fin-Force, we evaluate state-of-the-art LLMs and counterfactual generation methods, analyzing their limitations and proposing insights for future research.
pdf
bib
abs
Towards Statistical Factuality Guarantee for Large Vision-Language Models
Zhuohang Li
|
Chao Yan
|
Nicholas J Jackson
|
Wendi Cui
|
Bo Li
|
Jiaxin Zhang
|
Bradley A. Malin
Advancements in Large Vision-Language Models (LVLMs) have demonstrated impressive performance in image-conditioned text generation; however, hallucinated outputs (text that misaligns with the visual input) pose a major barrier to their use in safety-critical applications. We introduce ConfLVLM, a conformal-prediction-based framework that achieves finite-sample, distribution-free statistical guarantees on the factuality of LVLM outputs. Taking each generated detail as a hypothesis, ConfLVLM statistically tests factuality via efficient heuristic uncertainty measures to filter out unreliable claims. We conduct extensive experiments covering three representative application domains: general scene understanding, medical radiology report generation, and document understanding. Remarkably, ConfLVLM reduces the error rate of claims generated by LLaVa-1.5 for scene descriptions from 87.8% to 10.0% by filtering out erroneous claims with a 95.3% true positive rate. Our results further show that ConfLVLM is highly flexible and can be applied to any black-box LVLM paired with any uncertainty measure for any image-conditioned free-form text generation task while providing a rigorous guarantee on controlling hallucination risk.
pdf
bib
abs
Unlearning vs. Obfuscation: Are We Truly Removing Knowledge?
Guangzhi Sun
|
Potsawee Manakul
|
Xiao Zhan
|
Mark Gales
Unlearning has emerged as a critical capability for large language models (LLMs) to support data privacy, regulatory compliance, and ethical AI deployment. Recent techniques often rely on obfuscation by injecting incorrect or irrelevant information to suppress knowledge. Such methods effectively constitute knowledge addition rather than true removal, often leaving models vulnerable to probing. In this paper, we formally distinguish unlearning from obfuscation and introduce a probing-based evaluation framework to assess whether existing approaches genuinely remove targeted information. Moreover, we propose DF-MCQ, a novel unlearning method that flattens the model predictive distribution over automatically generated multiple-choice questions using KL-divergence, effectively removing knowledge about target individuals and triggering appropriate refusal behaviour. Experimental results demonstrate that DF-MCQ achieves unlearning with over 90% refusal rate and a random choice-level uncertainty that is much higher than obfuscation on probing questions.
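A small sketch of the distribution-flattening idea behind DF-MCQ, assuming we already have the model's logits over the answer options of each auto-generated question (the tensor shapes and names here are illustrative):

```python
import math
import torch
import torch.nn.functional as F

def flatten_loss(option_logits: torch.Tensor) -> torch.Tensor:
    """KL(p_model || uniform) over each question's answer options; driving this to
    zero makes the model maximally uncertain about the targeted fact."""
    log_p = F.log_softmax(option_logits, dim=-1)
    log_u = -math.log(option_logits.size(-1))        # log probability of the uniform dist.
    kl = (log_p.exp() * (log_p - log_u)).sum(dim=-1)  # KL(p || u) per question
    return kl.mean()

# Toy usage: logits over 4 options for 2 auto-generated multiple-choice questions.
logits = torch.tensor([[5.0, 0.1, 0.1, 0.1],
                       [0.2, 0.2, 3.0, 0.2]], requires_grad=True)
loss = flatten_loss(logits)
loss.backward()
print(float(loss))
```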
pdf
bib
abs
Reward-Shifted Speculative Sampling Is An Efficient Test-Time Weak-to-Strong Aligner
Bolian Li
|
Yanran Wu
|
Xinyu Luo
|
Ruqi Zhang
Aligning large language models (LLMs) with human preferences has become a critical step in their development. Recent research has increasingly focused on test-time alignment, where additional compute is allocated during inference to enhance LLM safety and reasoning capabilities. However, these test-time alignment techniques often incur substantial inference costs, limiting their practical application. To address this efficiency bottleneck, we draw inspiration from speculative sampling acceleration, which leverages a small draft model to efficiently predict future tokens. We introduce the reward-shifted speculative sampling (SSS) algorithm, in which the draft model is aligned with human preferences while the target model remains unchanged. We theoretically demonstrate that the distributional shift between the aligned draft model and the unaligned target model can be exploited to recover the RLHF optimal solution without actually obtaining it, by modifying the acceptance criterion and bonus token distribution. Our algorithm achieves superior gold reward scores at a significantly reduced inference cost in test-time weak-to-strong alignment experiments, thereby validating both its effectiveness and efficiency.
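For context, the sketch below shows vanilla speculative-sampling acceptance for a single proposed token; per the abstract, SSS keeps this overall skeleton but aligns the draft model with human preferences and modifies the acceptance criterion and bonus-token distribution (those modifications are not reproduced here):

```python
import numpy as np

def speculative_accept(p_target: np.ndarray, q_draft: np.ndarray, token: int,
                       rng: np.random.Generator) -> int:
    """Vanilla speculative-sampling acceptance step for one token proposed by the draft."""
    accept_prob = min(1.0, p_target[token] / max(q_draft[token], 1e-12))
    if rng.random() < accept_prob:
        return token
    # On rejection, resample from the residual distribution max(0, p - q), renormalized.
    residual = np.clip(p_target - q_draft, 0.0, None)
    residual /= residual.sum()
    return int(rng.choice(len(p_target), p=residual))

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])   # target-model distribution over a toy 3-token vocabulary
q = np.array([0.2, 0.7, 0.1])   # (preference-aligned) draft distribution
print(speculative_accept(p, q, token=1, rng=rng))
```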
pdf
bib
abs
Stimulate the Critical Thinking of LLMs via Debiasing Discussion
Ruiyu Xiao
|
Lei Wu
|
Yuanxing Liu
|
Weinan Zhang
|
Ting Liu
Large language models (LLMs) often succumb to users’ viewpoints when faced with conflicting perspectives. We identify two key biases underlying this issue: stance homogeneity bias and human preference bias. To address these biases, we propose a novel two-stage training framework: Multi-stance Discussion Sampling and Truth Alignment Training (MDTA). First, we introduce an equal multi-stance discussion framework to automatically generate multi-model discussion datasets. Based on this framework, we construct the first and largest multi-model fair discussion dataset, named Eq-Discussion, for supervised fine-tuning, reducing stance homogeneity bias. Second, we optimize Reinforcement Learning from Human Feedback (RLHF) to align with discussion correctness, mitigating human preference bias. Extensive experimental results demonstrate that MDTA effectively reduces both biases and significantly enhances the performance of LLMs across a variety of downstream tasks, including reading comprehension, logical reasoning, and social question answering. Furthermore, we observe that MDTA improves the generalization capabilities of LLMs, leading to substantial performance improvements in non-discussion scenarios and on out-of-domain datasets.
pdf
bib
abs
Toward Multi-Session Personalized Conversation: A Large-Scale Dataset and Hierarchical Tree Framework for Implicit Reasoning
Xintong Li
|
Jalend Bantupalli
|
Ria Dharmani
|
Yuwei Zhang
|
Jingbo Shang
There has been a surge in the use of large language model (LLM) conversational agents to generate responses based on long-term history from multiple sessions. However, existing long-term open-domain dialogue datasets lack complex, real-world personalization and fail to capture implicit reasoning—where relevant information is embedded in subtle, syntactic, or semantically distant connections rather than explicit statements. In such cases, traditional retrieval methods fail to capture relevant context, and long-context modeling also becomes inefficient due to numerous complicated persona-related details. To address this gap, we introduce ImplexConv, a large-scale long-term dataset with 2,500 examples, each containing approximately 100 conversation sessions, designed to study implicit reasoning in personalized dialogues. Additionally, we propose TaciTree, a novel hierarchical tree framework that structures conversation history into multiple levels of summarization. Instead of brute-force searching all data, TaciTree enables an efficient, level-based retrieval process where models refine their search by progressively selecting relevant details. Our experiments demonstrate that TaciTree significantly improves the ability of LLMs to reason over long-term conversations with implicit contextual dependencies.
pdf
bib
abs
Improving Instruct Models for Free: A Study on Partial Adaptation
Ozan Irsoy
|
Pengxiang Cheng
|
Jennifer L Chen
|
Daniel Preotiuc-Pietro
|
Shiyue Zhang
|
Duccio Pappadopulo
Instruct models, obtained from various instruction tuning or post-training steps, are commonly deemed superior and more usable than their base counterparts. While the model gains instruction-following ability, instruction tuning may lead to forgetting the knowledge from pre-training, or it may encourage the model to become overly conversational or verbose. This, in turn, can lead to degradation of in-context few-shot learning performance. In this work, we study the performance trajectory between base and instruct models by scaling down the strength of instruction tuning via the partial adaptation method. We show that, across several model families and model sizes, reducing the strength of instruction tuning results in material improvement on a few-shot in-context learning benchmark covering a variety of classic natural language tasks. This comes at the cost of losing some degree of instruction-following ability as measured by AlpacaEval. Our study sheds light on the potential trade-off between in-context learning and instruction-following abilities that is worth considering in practice.
pdf
bib
abs
CoMMIT: Coordinated Multimodal Instruction Tuning
Xintong Li
|
Junda Wu
|
Tong Yu
|
Rui Wang
|
Yu Wang
|
Xiang Chen
|
Jiuxiang Gu
|
Lina Yao
|
Julian McAuley
|
Jingbo Shang
Instruction tuning in multimodal large language models (MLLMs) generally involves cooperative learning between a backbone LLM and a feature encoder of non-text input modalities. The major challenge is how to efficiently find the synergy between the two modules so that LLMs can adapt their reasoning abilities to downstream tasks while feature encoders can adjust to provide more task-specific information about their modalities. In this paper, we analyze MLLM instruction tuning from both theoretical and empirical perspectives, where we find that unbalanced learning between the feature encoder and the LLM can cause problems of oscillation and biased learning that lead to sub-optimal convergence. Inspired by our findings, we propose a Multimodal Balance Coefficient that enables quantitative measurement of the learning balance. Based on this, we further design a dynamic learning scheduler that better coordinates the learning between the LLM and feature encoder, alleviating the problems of oscillation and biased learning. In addition, we introduce an auxiliary regularization on the gradient to promote updating with larger step sizes, which potentially allows for a more accurate estimation of the proposed Multimodal Balance Coefficient and further improves training sufficiency. Our proposed approach is agnostic to the architecture of the LLM and feature encoder, so it can be generically integrated with various MLLMs. We conduct experiments on multiple downstream tasks with various MLLMs, demonstrating that the proposed method is more effective than the baselines in MLLM instruction tuning.
pdf
bib
abs
Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
Tianhao Wu
|
Weizhe Yuan
|
Olga Golovneva
|
Jing Xu
|
Yuandong Tian
|
Jiantao Jiao
|
Jason E Weston
|
Sainbayar Sukhbaatar
Large Language Models (LLMs) are rapidly surpassing human knowledge in many domains. While improving these models traditionally relies on costly human data, recent self-rewarding mechanisms have shown that LLMs can improve by judging their own responses instead of relying on human labelers. However, existing methods have primarily focused on improving model responses rather than judgment capabilities, resulting in rapid saturation during iterative training. To address this issue, we introduce a novel Meta-Rewarding step to the self-improvement process, where the model judges its own judgements and uses that feedback to refine its judgment skills. Surprisingly, this unsupervised approach improves the model’s ability to judge and follow instructions, as demonstrated by a win rate improvement of Llama-3-8B-Instruct from 22.9% to 39.4% on AlpacaEval 2, and 20.6% to 29.1% on Arena-Hard. These results strongly suggest the potential for self-improving models without human supervision.
pdf
bib
abs
AnyMAC: Cascading Flexible Multi-Agent Collaboration via Next-Agent Prediction
Song Wang
|
Zhen Tan
|
Zihan Chen
|
Shuang Zhou
|
Tianlong Chen
|
Jundong Li
Recent progress in large language model (LLM)-based multi-agent collaboration highlights the power of structured communication in enabling collective intelligence. However, existing methods largely rely on static or graph-based inter-agent topologies, which limits adaptability and flexibility in communication. In this work, we propose a new framework that rethinks multi-agent coordination through a sequential structure rather than a graph structure, offering a significantly larger topology space for multi-agent communication. Our method focuses on two key directions: (1) Next-Agent Prediction, which selects the most suitable agent role at each step, and (2) Next-Context Selection (NCS), which enables each agent to selectively access relevant information from any previous step. Together, these components construct task-adaptive communication pipelines that support both role flexibility and global information flow. Extensive evaluations across multiple benchmarks demonstrate that our approach achieves superior performance while substantially reducing communication overhead.
pdf
bib
abs
A Good Plan is Hard to Find: Aligning Models with Preferences is Misaligned with What Helps Users
Nishant Balepur
|
Matthew Shu
|
Yoo Yeon Sung
|
Seraphina Goldfarb-Tarrant
|
Shi Feng
|
Fumeng Yang
|
Rachel Rudinger
|
Jordan Lee Boyd-Graber
To assist users in complex tasks, LLMs generate plans: step-by-step instructions towards a goal. While alignment methods aim to ensure LLM plans are helpful, they train (RLHF) or evaluate (ChatbotArena) on what users prefer, assuming this reflects what helps them. We test this with Planorama: an interface where 126 users answer 300 multi-step questions with LLM plans. We get 4388 plan executions and 5584 comparisons to measure plan helpfulness (QA success) and user preferences on plans, and recreate the setup in agents and reward models to see if they simulate or prefer what helps users. We expose: 1) user/model preferences and agent success do not accurately predict which plans help users, so common alignment feedback can misalign with helpfulness; 2) this gap is not due to user-specific preferences, as users are similarly successful when using plans they prefer/disprefer; 3) surface-level cues like brevity and question similarity strongly link to preferences, but such biases fail to predict helpfulness. In all, we argue aligning helpful LLMs needs feedback from real user interactions—not just preferences of what looks helpful—so we discuss the plan NLP researchers can execute to solve this problem.
pdf
bib
abs
Words Like Knives: Backstory-Personalized Modeling and Detection of Violent Communication
Jocelyn J Shen
|
Akhila Yerukola
|
Xuhui Zhou
|
Cynthia Breazeal
|
Maarten Sap
|
Hae Won Park
Conversational breakdowns in close relationships are deeply shaped by personal histories and emotional context, yet most NLP research treats conflict detection as a general task, overlooking the relational dynamics that influence how messages are perceived. In this work, we leverage nonviolent communication (NVC) theory to evaluate LLMs in detecting conversational breakdowns and assessing how relationship backstory influences both human and model perception of conflicts. Given the sensitivity and scarcity of real-world datasets featuring conflict between familiar social partners with rich personal backstories, we contribute the PersonaConflicts Corpus, a dataset of N=5,772 naturalistic simulated dialogues spanning diverse conflict scenarios between friends, family members, and romantic partners. Through a controlled human study, we annotate a subset of dialogues and obtain fine-grained labels of communication breakdown types on individual turns, and assess the impact of backstory on human and model perception of conflict in conversation. We find that the polarity of relationship backstories significantly shifted human perception of communication breakdowns and impressions of the social partners, yet models struggle to meaningfully leverage those backstories in the detection task. Additionally, we find that models consistently overestimate how positively a message will make a listener feel. Our findings underscore the critical role of personalization to relationship contexts in enabling LLMs to serve as effective mediators in human communication for authentic connection.
pdf
bib
abs
Separate the Wheat from the Chaff: Winnowing Down Divergent Views in Retrieval Augmented Generation
Song Wang
|
Zihan Chen
|
Peng Wang
|
Zhepei Wei
|
Zhen Tan
|
Yu Meng
|
Cong Shen
|
Jundong Li
Retrieval-augmented generation (RAG) addresses the limitation of large language models (LLMs) in achieving up-to-date information by integrating external knowledge sources, but it is hindered by noisy or irrelevant retrieved data, leading to reduced accuracy. Additionally, most RAG methods rely on task-specific supervision, reducing their adaptability across domains. To overcome these challenges, we propose WinnowRAG, a novel multi-agent debate-based RAG framework. WinnowRAG operates in two stages: in Stage I, query-aware clustering groups similar documents, with each cluster assigned to an LLM agent for generating personalized responses. A critic LLM then consolidates these answers, forming super-agents. In Stage II, the super-agents engage in a structured discussion to filter out incorrect or irrelevant information, ensuring only relevant knowledge is used for final response generation. Crucially, WinnowRAG is unsupervised and leverages pretrained LLMs without requiring fine-tuning, making it easily adaptable to various tasks. The experiments on various realistic datasets demonstrate the effectiveness of WinnowRAG over state-of-the-art baselines.
pdf
bib
abs
Cognitive Linguistic Identity Fusion Score (CLIFS): A Scalable Cognition‐Informed Approach to Quantifying Identity Fusion from Text
Devin R. Wright
|
Jisun An
|
Yong-Yeol Ahn
Quantifying *identity fusion*—the psychological merging of self with another entity or abstract target (e.g., a religious group, political party, ideology, value, brand, belief, etc.)—is vital for understanding a wide range of group-based human behaviors. We introduce the Cognitive Linguistic Identity Fusion Score ([CLIFS](https://github.com/DevinW-sudo/CLIFS)), a novel metric that integrates cognitive linguistics with large language models (LLMs) and builds on implicit metaphor detection. Unlike traditional pictorial and verbal scales, which require controlled surveys or direct field contact, CLIFS delivers fully automated, scalable assessments while maintaining strong alignment with the established verbal measure. In benchmarks, CLIFS outperforms both existing automated approaches and human annotation. As a proof of concept, we apply CLIFS to violence risk assessment, demonstrating that it can improve assessment performance by more than 240%. Building on our identification of a new NLP task and early success, we underscore the need to develop larger, more diverse datasets that encompass additional fusion-target domains and cultural backgrounds to enhance generalizability and further advance this emerging area. CLIFS models and code are public at [https://github.com/DevinW-sudo/CLIFS](https://github.com/DevinW-sudo/CLIFS).
pdf
bib
abs
SilVar: Speech-Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization
Tan-Hanh Pham
|
Le Hoang Nam
|
Phu-Vinh Nguyen
|
Chris Ngo
|
Truong-Son Hy
Visual Language Models have demonstrated remarkable capabilities across various tasks, including visual question answering and image captioning. However, most models rely on text-based instructions, limiting their effectiveness in natural human-machine interactions. Moreover, the quality of language models primarily depends on reasoning and prompting techniques, such as chain-of-thought, which remain underexplored when using speech instructions. To address these challenges, we propose SilVar, an end-to-end multimodal model that leverages speech instructions for reasoning-based visual question answering. Additionally, we investigate reasoning techniques at different levels, including conversational, simple, and complex speech instructions. SilVar is built upon CLIP, Whisper, and LLaMA 3.1-8B, enabling more intuitive interactions by allowing users to provide verbal or text-based instructions. To this end, we introduce a new dataset designed to challenge models with speech-based reasoning tasks for object localization. This dataset enhances the model’s ability to process and explain visual scenes from spoken input, moving beyond simple object recognition to reasoning-based interactions. To our knowledge, SilVar is the first open-source, speech-driven VLM. We believe SilVar will inspire the next generation of multimodal reasoning models, advancing toward expert artificial general intelligence.
pdf
bib
abs
CEMTM: Contextual Embedding-based Multimodal Topic Modeling
Amirhossein Abaskohi
|
Raymond Li
|
Chuyuan Li
|
Shafiq Joty
|
Giuseppe Carenini
We introduce CEMTM, a context-enhanced multimodal topic model designed to infer coherent and interpretable topic structures from both short and long documents containing text and images. CEMTM builds on fine-tuned large vision language models (LVLMs) to obtain contextualized embeddings, and employs a distributional attention mechanism to weight token-level contributions to topic inference. A reconstruction objective aligns topic-based representations with the document embedding, encouraging semantic consistency across modalities. Unlike existing approaches, CEMTM can process multiple images per document without repeated encoding and maintains interpretability through explicit word-topic and document-topic distributions. Extensive experiments on six multimodal benchmarks show that CEMTM consistently outperforms unimodal and multimodal baselines, achieving a remarkable average LLM score of 2.61. Further analysis shows its effectiveness in downstream few-shot retrieval and its ability to capture visually grounded semantics in complex domains such as scientific articles.
pdf
bib
abs
RedHerring Attack: Testing the Reliability of Attack Detection
Jonathan Rusert
In response to adversarial text attacks, attack detection models have been proposed and shown to successfully identify text modified by adversaries. Attack detection models can be leveraged to provide an additional check for NLP models and give signals for human review. However, the reliability of these models has not yet been thoroughly explored. Thus, we propose and test a novel attack setting and attack, RedHerring. RedHerring aims to make attack detection models unreliable by modifying a text to cause the detection model to predict an attack, while keeping the classifier correct. This creates a tension between the classifier and detector. If a human sees that the detector is giving an “incorrect” prediction, but the classifier a correct one, then the human will see the detector as unreliable. We test this novel threat model on 4 datasets against 3 detectors defending 4 classifiers. We find that RedHerring is able to drop detection accuracy by 20 to 71 points, while maintaining (or improving) classifier accuracy. As an initial defense, we propose a simple confidence check which requires no retraining of the classifier or detector and greatly increases detection accuracy. This novel threat model offers new insights into how adversaries may target detection models.
pdf
bib
abs
Modeling Bottom-up Information Quality during Language Processing
Cui Ding
|
Yanning Yin
|
Lena Ann Jäger
|
Ethan Wilcox
Contemporary theories model language processing as integrating both top-down expectations and bottom-up inputs. One major prediction of such models is that the quality of the bottom-up inputs modulates ease of processing—noisy inputs should lead to difficult and effortful comprehension. We test this prediction in the domain of reading. First, we propose an information-theoretic operationalization for the “quality” of bottom-up information as the mutual information (MI) between visual information and word identity. We formalize this prediction in a mathematical model of reading as a Bayesian update. Second, we test our operationalization by comparing participants’ reading times in conditions where words’ information quality has been reduced, either by occluding their top or bottom half, with full words. We collect data in English and Chinese. We then use multimodal language models to estimate the mutual information between visual inputs and words. We use these data to estimate the specific effect of reduced information quality on reading times. Finally, we compare how information is distributed across visual forms. In English and Chinese, the upper half contains more information about word identity than the lower half. However, the asymmetry is more pronounced in English, a pattern which is reflected in the reading times.
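The proposed "quality" measure is the mutual information between the visual input and word identity; the toy computation below (with made-up joint distributions, not the paper's multimodal estimates) illustrates how occluding part of a word lowers that quantity:

```python
import numpy as np

def mutual_information(joint: np.ndarray) -> float:
    """MI(V; W) in bits, computed from a joint distribution P(visual input, word identity)."""
    pv = joint.sum(axis=1, keepdims=True)   # marginal over visual inputs
    pw = joint.sum(axis=0, keepdims=True)   # marginal over word identities
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pv @ pw)[nz])).sum())

# Toy 2x2 joints: occlusion makes the visual signal less predictive of word identity.
full_view = np.array([[0.45, 0.05],
                      [0.05, 0.45]])
occluded  = np.array([[0.30, 0.20],
                      [0.20, 0.30]])
print(mutual_information(full_view), mutual_information(occluded))  # ~0.53 vs ~0.03 bits
```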
pdf
bib
abs
Data Drives Unstable Hierarchical Generalization in LMs
Tian Qin
|
Naomi Saphra
|
David Alvarez-Melis
Early in training, LMs can behave like n-gram models, but eventually, they often learn tree-based syntactic rules and generalize hierarchically out of distribution (OOD). We study this shift using controlled grammar-learning tasks: question formation and tense inflection. We find a model learns to generalize hierarchically if its training data is *complex*–in particular, if it includes center-embedded clauses, a special syntactic structure. Under this definition, complex data drives hierarchical rules, while less complex data encourages shortcut learning in the form of n-gram-like linear rules. Furthermore, we find that a model uses rules to generalize, whether hierarchical or linear, if its training data is *diverse*–in particular, if it includes many distinct syntax trees in the training set. Under this definition, diverse data promotes stable rule learning, whereas less diverse data promotes memorization of individual syntactic sequences. Finally, intermediate diversity and intermediate complexity form an *unstable regime*, which is characterized by oscillatory learning dynamics and inconsistent behaviors across random seeds. These results highlight the central role of training data in shaping generalization and explain why competing strategies can lead to unstable outcomes.
pdf
bib
abs
EmoAgent: Assessing and Safeguarding Human-AI Interaction for Mental Health Safety
Jiahao Qiu
|
Yinghui He
|
Xinzhe Juan
|
Yimin Wang
|
Yuhan Liu
|
Zixin Yao
|
Yue Wu
|
Xun Jiang
|
Ling Yang
|
Mengdi Wang
The rise of LLM-driven AI characters raises safety concerns, particularly for vulnerable human users with psychological disorders. To address these risks, we propose EmoAgent, a multi-agent AI framework designed to evaluate and mitigate mental health hazards in human-AI interactions. EmoAgent comprises two components: **EmoEval** simulates virtual users, including those portraying mentally vulnerable individuals, to assess mental health changes before and after interactions with AI characters. It uses clinically proven psychological and psychiatric assessment tools (PHQ-9, PDI, PANSS) to evaluate mental risks induced by LLMs. **EmoGuard** serves as an intermediary, monitoring users’ mental status, predicting potential harm, and providing corrective feedback to mitigate risks. Experiments conducted in popular character-based chatbots show that emotionally engaging dialogues can lead to psychological deterioration in vulnerable users, with mental state deterioration in more than 34.4% of the simulations. EmoGuard significantly reduces these deterioration rates, underscoring its role in ensuring safer AI-human interactions.
pdf
bib
abs
Polysemantic Dropout: Conformal OOD Detection for Specialized LLMs
Ayush Gupta
|
Ramneet Kaur
|
Anirban Roy
|
Adam D. Cobb
|
Rama Chellappa
|
Susmit Jha
We propose a novel inference-time out-of-domain (OOD) detection algorithm for specialized large language models (LLMs). Despite achieving state-of-the-art performance on in-domain tasks through fine-tuning, specialized LLMs remain vulnerable to incorrect or unreliable outputs when presented with OOD inputs, posing risks in critical applications. Our method leverages the Inductive Conformal Anomaly Detection (ICAD) framework, using a new non-conformity measure based on the model’s dropout tolerance. Motivated by recent findings on polysemanticity and redundancy in LLMs, we hypothesize that in-domain inputs exhibit higher dropout tolerance than OOD inputs. We aggregate dropout tolerance across multiple layers via a valid ensemble approach, improving detection while maintaining theoretical false alarm bounds from ICAD. Experiments with medical-specialized LLMs show that our approach detects OOD inputs better than baseline methods, with AUROC improvements of 2% to 37% when treating OOD datapoints as positives and in-domain test datapoints as negatives.
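A bare-bones sketch of the ICAD machinery the abstract builds on: per-layer p-values computed from calibration non-conformity scores (here, negative dropout tolerance), combined across layers. The simple mean used below is only a stand-in for the paper's validity-preserving ensemble, and the numbers are illustrative:

```python
import numpy as np

def icad_p_value(cal_scores: np.ndarray, test_score: float) -> float:
    """Standard ICAD p-value: fraction of calibration non-conformity scores at least
    as extreme as the test score, with +1 smoothing."""
    return (1 + np.sum(cal_scores >= test_score)) / (len(cal_scores) + 1)

def detect_ood(layer_cal_scores, layer_test_scores, epsilon=0.25) -> bool:
    """Flag OOD if the ensembled p-value falls below the significance level."""
    p_values = [icad_p_value(cal, s)
                for cal, s in zip(layer_cal_scores, layer_test_scores)]
    return float(np.mean(p_values)) < epsilon

# Toy example: non-conformity = negative dropout tolerance, measured at two layers.
cal = [np.array([-0.9, -0.8, -0.85, -0.7]), np.array([-0.6, -0.75, -0.8, -0.65])]
test = [-0.2, -0.1]  # low dropout tolerance -> high non-conformity -> likely OOD
print(detect_ood(cal, test))
```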
pdf
bib
abs
Facilitating Cognitive Accessibility with LLMs: A Multi-Task Approach to Easy-to-Read Text Generation
François Ledoyen
|
Gaël Dias
|
Jeremie Pantin
|
Alexis Lechervy
|
Fabrice Maurel
|
Youssef Chahir
Simplifying complex texts is essential to ensure equitable access to information, particularly for individuals with cognitive impairments. The Easy-to-Read (ETR) initiative provides a framework to make content more accessible for these individuals. However, manually creating such texts remains time-consuming and resource-intensive. In this work, we investigate the potential of large language models (LLMs) to automate the generation of ETR content. To address the scarcity of aligned corpora and the specific constraints of ETR, we propose a multi-task learning (MTL) approach that trains models jointly on text summarization, text simplification, and ETR generation. We explore two complementary strategies: multi-task retrieval-augmented generation (RAG) for in-context learning (ICL), and MTL-LoRA for parameter-efficient fine-tuning (PEFT). Our experiments with Mistral-7B and LLaMA-3-8B, conducted on ETR-fr, a new high-quality dataset, show that MTL-LoRA consistently outperforms all other strategies in in-domain settings, while the MTL-RAG-based approach achieves better generalization in out-of-domain scenarios. Our code is publicly available at https://github.com/FrLdy/ETR-PEFT-Composition.
pdf
bib
abs
D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition
Yiyang Huang
|
Yizhou Wang
|
Yun Fu
Video large language models (Vid-LLMs), which excel in diverse video-language tasks, can be effectively constructed by adapting image-pretrained vision-language models (VLMs). However, this adaptation remains challenging, as it requires processing dense and temporally extended visual inputs that exceed the capacity of image-based models. This paper identifies the perception bottleneck and token overload as key challenges in extending image-based VLMs to the video domain. To address these issues, we propose D-CoDe, a training-free adaptation framework that incorporates dynamic compression and question decomposition. Specifically, dynamic compression alleviates the perception bottleneck through adaptive selection of representative frames and content-aware aggregation of spatial tokens, thereby reducing redundancy while preserving informative content. In parallel, question decomposition mitigates token overload by reformulating the original query into sub-questions, guiding the model to focus on distinct aspects of the video and enabling more comprehensive understanding. Experiments demonstrate that D-CoDe effectively improves video understanding across various benchmarks. Furthermore, strong performance on the challenging long-video benchmark highlights the potential of D-CoDe in handling complex video-language tasks. Code is available at https://github.com/hukcc/D-CoDe.
pdf
bib
abs
ReEvalMed: Rethinking Medical Report Evaluation by Aligning Metrics with Real-World Clinical Judgment
Ruochen Li
|
Jun Li
|
Bailiang Jian
|
Kun Yuan
|
Youxiang Zhu
Automatically generated radiology reports often receive high scores from existing evaluation metrics but fail to earn clinicians’ trust. This gap reveals fundamental flaws in how current metrics assess the quality of generated reports. We rethink the design and evaluation of these metrics and propose a clinically grounded Meta-Evaluation framework. We define clinically grounded criteria spanning clinical alignment and key metric capabilities, including discrimination, robustness, and monotonicity. Using a fine-grained dataset of ground truth and rewritten report pairs annotated with error types, clinical significance labels, and explanations, we systematically evaluate existing metrics and reveal their limitations in interpreting clinical semantics, such as failing to distinguish clinically significant errors, over-penalizing harmless variations, and lacking consistency across error severity levels. Our framework offers guidance for building more clinically reliable evaluation methods.
pdf
bib
abs
MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation
Khai Le-Duc
|
Tuyen Tran
|
Bach Phan Tat
|
Nguyen Kim Hai Bui
|
Quan Dang Anh
|
Hung-Phong Tran
|
Thanh Thuy Nguyen
|
Ly Nguyen
|
Tuan Minh Phan
|
Thi Thu Phuong Tran
|
Chris Ngo
|
Khanh Xuan Nguyen
|
Thanh Nguyen-Tang
Multilingual speech translation (ST) and machine translation (MT) in the medical domain enhance patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we present the first systematic study on medical ST, to the best of our knowledge, by releasing MultiMed-ST, a large-scale ST dataset for the medical domain, spanning all translation directions in five languages: Vietnamese, English, German, French, and Simplified/Traditional Chinese, together with the models. With 290,000 samples, this is the largest medical MT dataset and the largest many-to-many multilingual ST dataset across all domains. Secondly, we present the most comprehensive ST analysis in the field’s history, to the best of our knowledge, including: empirical baselines, a bilingual-multilingual comparative study, an end-to-end vs. cascaded comparative study, a task-specific vs. multi-task sequence-to-sequence comparative study, code-switch analysis, and quantitative-qualitative error analysis. All code, data, and models are available online: https://github.com/leduckhai/MultiMed-ST.
pdf
bib
abs
Beyond Checkmate: Exploring the Creative Choke Points for AI Generated Texts
Nafis Irtiza Tripto
|
Saranya Venkatraman
|
Mahjabin Nahar
|
Dongwon Lee
The rapid advancement of Large Language Models (LLMs) has revolutionized text generation but also raised concerns about potential misuse, making detecting LLM-generated text (AI text) increasingly essential. While prior work has focused on identifying AI text and effectively checkmating it, our study investigates a less-explored territory: portraying the nuanced distinctions between human and AI texts across text segments (introduction, body, and conclusion). Whether LLMs excel or falter in incorporating linguistic ingenuity across text segments, the results will critically inform their viability and boundaries as effective creative assistants to humans. Through an analogy with the structure of chess games, comprising opening, middle, and end games, we analyze segment-specific patterns to reveal where the most striking differences lie. Although AI texts closely resemble human writing in the body segment due to its length, deeper analysis shows a higher divergence in features dependent on the continuous flow of language, making it the most informative segment for detection. Additionally, human texts exhibit greater stylistic variation across segments, offering a new lens for distinguishing them from AI. Overall, our findings provide fresh insights into human-AI text differences and pave the way for more effective and interpretable detection strategies. Codes available at https://github.com/tripto03/chess_inspired_human_ai_text_distinction.
pdf
bib
abs
MoR: Better Handling Diverse Queries with a Mixture of Sparse, Dense, and Human Retrievers
Jushaan Singh Kalra
|
Xinran Zhao
|
To Eun Kim
|
Fengyu Cai
|
Fernando Diaz
|
Tongshuang Wu
Retrieval-augmented Generation (RAG) is powerful, but its effectiveness hinges on which retrievers we use and how. Different retrievers offer distinct, often complementary signals: BM25 captures lexical matches; dense retrievers, semantic similarity. Yet in practice, we typically fix a single retriever based on heuristics, which fails to generalize across diverse information needs. Can we dynamically select and integrate multiple retrievers for each individual query, without the need for manual selection? In our work, we validate this intuition with quantitative analysis and introduce a mixture of retrievers: a zero-shot, weighted combination of heterogeneous retrievers. Extensive experiments show that such mixtures are effective and efficient: Despite totaling just 0.8B parameters, this mixture outperforms every individual retriever and even larger 7B models—by +10.8% and +3.9% on average, respectively. Further analysis also shows that this mixture framework can help incorporate specialized non-oracle human information sources as retrievers to achieve good collaboration, with a 58.9% relative performance improvement over simulated humans alone.
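A zero-shot score-fusion sketch in the spirit of the mixture described above, assuming each retriever returns raw document scores; the min-max normalization and equal weights are illustrative choices, not the paper's exact weighting scheme:

```python
import numpy as np

def normalize(scores: dict) -> dict:
    """Min-max normalize one retriever's document scores so scales are comparable."""
    vals = np.array(list(scores.values()), dtype=float)
    lo, hi = vals.min(), vals.max()
    span = (hi - lo) if hi > lo else 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def mix_retrievers(per_retriever_scores: list, weights: list, top_k: int = 3):
    """Weighted zero-shot mixture: sum normalized scores across heterogeneous retrievers."""
    fused = {}
    for scores, w in zip(per_retriever_scores, weights):
        for doc, s in normalize(scores).items():
            fused[doc] = fused.get(doc, 0.0) + w * s
    return sorted(fused, key=fused.get, reverse=True)[:top_k]

# Toy example: a lexical (BM25-like) and a dense retriever scoring the same document pool.
bm25 = {"d1": 12.3, "d2": 8.1, "d3": 1.0}
dense = {"d1": 0.42, "d2": 0.77, "d3": 0.55}
print(mix_retrievers([bm25, dense], weights=[0.5, 0.5]))
```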
pdf
bib
abs
Learning Contextual Retrieval for Robust Conversational Search
Seunghan Yang
|
Juntae Lee
|
Jihwan Bang
|
Kyuhong Shim
|
Minsoo Kim
|
Simyung Chang
Effective conversational search demands a deep understanding of user intent across multiple dialogue turns. Users frequently use abbreviations and shift topics in the middle of conversations, posing challenges for conventional retrievers. While query rewriting techniques improve clarity, they often incur significant computational cost due to additional autoregressive steps. Moreover, although LLM-based retrievers demonstrate strong performance, they are not explicitly optimized to track user intent in multi-turn settings, often failing under topic drift or contextual ambiguity. To address these limitations, we propose ContextualRetriever, a novel LLM-based retriever that directly incorporates conversational context into the retrieval process. Our approach introduces: (1) a context-aware embedding mechanism that highlights the current query within the dialogue history; (2) intent-guided supervision based on high-quality rewritten queries; and (3) a training strategy that preserves the generative capabilities of the base LLM. Extensive evaluations across multiple conversational search benchmarks demonstrate that ContextualRetriever significantly outperforms existing methods while incurring no additional inference overhead.
pdf
bib
abs
LIDDIA: Language-based Intelligent Drug Discovery Agent
Reza Averly
|
Frazier N. Baker
|
Ian A Watson
|
Xia Ning
Drug discovery is a long, expensive, and complex process, relying heavily on human medicinal chemists, who can spend years searching the vast space of potential therapies. Recent advances in artificial intelligence for chemistry have sought to expedite individual drug discovery tasks; however, there remains a critical need for an intelligent agent that can navigate the drug discovery process. Towards this end, we introduce LIDDiA, an autonomous agent capable of intelligently navigating the drug discovery process in silico. By leveraging the reasoning capabilities of large language models, LIDDiA serves as a low-cost and highly-adaptable tool for autonomous drug discovery. We comprehensively examine LIDDiA, demonstrating that (1) it can generate molecules meeting key pharmaceutical criteria on over 70% of 30 clinically relevant targets, (2) it intelligently balances exploration and exploitation in the chemical space, and (3) it identifies one promising novel candidate on AR/NR3C4, a critical target for both prostate and breast cancers. Code and dataset are available at https://github.com/ninglab/LIDDiA.
pdf
bib
abs
Agentic-R1: Distilled Dual-Strategy Reasoning
Weihua Du
|
Pranjal Aggarwal
|
Sean Welleck
|
Yiming Yang
Current long chain-of-thought (long-CoT) models excel at mathematical reasoning but rely on slow and error-prone natural language traces. Tool-augmented agents address arithmetic via code execution, but often falter on complex logical tasks. We introduce a fine-tuning framework, **DualDistill**, that distills complementary reasoning strategies from multiple teachers into a unified student model. Using this approach, we train **Agentic-R1**, which dynamically selects the optimal strategy for each query, invoking tools for arithmetic and algorithmic problems and using text-based reasoning for abstract ones. Our method improves accuracy on computation-intensive tasks and reduces inference latency on standard benchmarks, demonstrating the promise of multi-strategy distillation for robust and efficient reasoning.
pdf
bib
abs
Proactive Assistant Dialogue Generation from Streaming Egocentric Videos
Yichi Zhang
|
Xin Luna Dong
|
Zhaojiang Lin
|
Andrea Madotto
|
Anuj Kumar
|
Babak Damavandi
|
Joyce Chai
|
Seungwhan Moon
Recent advances in conversational AI have been substantial, but developing real-time systems for perceptual task guidance remains challenging. These systems must provide interactive, proactive assistance based on streaming visual inputs, yet their development is constrained by the costly and labor-intensive process of data collection and system evaluation. To address these limitations, we present a comprehensive framework with three key contributions. First, we introduce a novel data curation pipeline that synthesizes dialogues from annotated egocentric videos, resulting in ProAssist, a large-scale synthetic dialogue dataset spanning multiple domains. Second, we develop a suite of automatic evaluation metrics, validated through extensive human studies. Third, we propose an end-to-end model that processes streaming video inputs to generate contextually appropriate responses, incorporating novel techniques for handling data imbalance and long-duration videos. This work lays the foundation for developing real-time, proactive AI assistants capable of guiding users through diverse tasks.
pdf
bib
abs
Should I Share this Translation? Evaluating Quality Feedback for User Reliance on Machine Translation
Dayeon Ki
|
Kevin Duh
|
Marine Carpuat
As people increasingly use AI systems in work and daily life, feedback mechanisms that help them use AI responsibly are urgently needed, particularly in settings where users are not equipped to assess the quality of AI predictions. We study a realistic Machine Translation (MT) scenario where monolingual users decide whether to share an MT output, first without and then with quality feedback. We compare four types of quality feedback: explicit feedback that directly gives users an assessment of translation quality using (1) error highlights and (2) LLM explanations, and implicit feedback that helps users compare MT inputs and outputs through (3) backtranslation and (4) question–answer (QA) tables. We find that all feedback types, except error highlights, significantly improve both decision accuracy and appropriate reliance. Notably, implicit feedback, especially QA tables, yields significantly greater gains than explicit feedback in terms of decision accuracy, appropriate reliance, and user perceptions – receiving the highest ratings for helpfulness and trust, and the lowest for mental burden.
pdf
bib
abs
ChartGaze: Enhancing Chart Understanding in LVLMs with Eye-Tracking Guided Attention Refinement
Ali Salamatian
|
Amirhossein Abaskohi
|
Wan-Cyuan Fan
|
Mir Rayat Imtiaz Hossain
|
Leonid Sigal
|
Giuseppe Carenini
Charts are a crucial visual medium for communicating and representing information. While Large Vision-Language Models (LVLMs) have made progress on chart question answering (CQA), the task remains challenging, particularly when models attend to irrelevant regions of the chart. In this work, we present ChartGaze, a new eye-tracking dataset that captures human gaze patterns during chart reasoning tasks. Through a systematic comparison of human and model attention, we find that LVLMs often diverge from human gaze, leading to reduced interpretability and accuracy. To address this, we propose a gaze-guided attention refinement that aligns image-text attention with human fixations. Our approach improves both answer accuracy and attention alignment, yielding gains of up to 2.56 percentage points across multiple models. These results demonstrate the promise of incorporating human gaze to enhance both the reasoning quality and interpretability of chart-focused LVLMs.
pdf
bib
abs
LogiCoL: Logically-Informed Contrastive Learning for Set-based Dense Retrieval
Yanzhen Shen
|
Sihao Chen
|
Xueqiang Xu
|
Yunyi Zhang
|
Chaitanya Malaviya
|
Dan Roth
While significant progress has been made with dual- and bi-encoder dense retrievers, they often struggle on queries with logical connectives, a use case that is overlooked yet important in downstream applications: the retrieved results frequently fail to respect the logical constraints implied in the query. To address this challenge, we introduce LogiCoL, a logically-informed contrastive learning objective for dense retrievers. LogiCoL builds upon in-batch supervised contrastive learning and trains dense retrievers to respect the subset and mutually-exclusive set relations between query results via two sets of soft constraints expressed via t-norms in the learning objective. We evaluate the effectiveness of LogiCoL on the task of entity retrieval, where the model is expected to retrieve a set of entities in Wikipedia that satisfy the implicit logical constraints in the query. We show that models trained with LogiCoL yield improvements both in terms of retrieval performance and logical consistency in the results. We provide detailed analysis and insights to uncover why queries with logical connectives are challenging for dense retrievers and why LogiCoL is most effective.
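One way such set relations can be turned into t-norm-inspired soft penalties (an illustrative sketch, not LogiCoL's exact objective): a document relevant to a more restrictive query should also be relevant to its superset query, and should not be relevant to a mutually exclusive one. The sigmoid-normalized similarities below are assumed for illustration:

```python
import torch
import torch.nn.functional as F

def subset_penalty(p_child: torch.Tensor, p_parent: torch.Tensor) -> torch.Tensor:
    """Soft 'child ⊆ parent' constraint: relevance to the narrower query should not
    exceed relevance to the broader query."""
    return F.relu(p_child - p_parent).mean()

def exclusion_penalty(p_a: torch.Tensor, p_b: torch.Tensor) -> torch.Tensor:
    """Soft mutual-exclusion constraint via the product t-norm: a document should not
    be simultaneously relevant to two mutually exclusive queries."""
    return (p_a * p_b).mean()

# Toy usage with sigmoid-normalized query-document similarities for 8 documents.
p_and = torch.sigmoid(torch.randn(8))   # relevance to "A AND B"
p_a   = torch.sigmoid(torch.randn(8))   # relevance to "A"
p_not = torch.sigmoid(torch.randn(8))   # relevance to "A AND NOT B"
loss = subset_penalty(p_and, p_a) + exclusion_penalty(p_and, p_not)
print(float(loss))
```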
pdf
bib
abs
ModalPrompt: Towards Efficient Multimodal Continual Instruction Tuning with Dual-Modality Guided Prompt
Fanhu Zeng
|
Fei Zhu
|
Haiyang Guo
|
Xu-Yao Zhang
|
Cheng-Lin Liu
Large Multimodal Models (LMMs) exhibit remarkable multi-tasking ability by learning mixed instruction datasets. However, novel tasks are encountered sequentially in a dynamic world, which calls for equipping LMMs with multimodal continual instruction learning (MCIT) ability, especially for diverse and challenging generative tasks. Existing MCIT methods do not fully exploit the unique attributes of LMMs and often gain performance at the expense of efficiency. In this paper, we propose a novel prompt learning framework for MCIT to effectively alleviate forgetting of previous knowledge while managing computational complexity with natural image-text supervision. Concretely, we learn prompts for each task and exploit efficient prompt fusion for knowledge transfer and prompt selection for complexity management with dual-modality guidance. Extensive experiments demonstrate that our approach achieves a substantial +14.26% performance gain on MCIT benchmarks with a remarkable 1.42x inference speedup, free from growing computation.
pdf
bib
abs
Skip-Thinking: Chunk-wise Chain-of-Thought Distillation Enable Smaller Language Models to Reason Better and Faster
Xiaoshu Chen
|
Sihang Zhou
|
Ke Liang
|
Xiaoyu Sun
|
Xinwang Liu
Chain-of-thought (CoT) distillation allows a large language model (LLM) to guide a small language model (SLM) in reasoning tasks. Existing methods train the SLM to learn the long rationale in one iteration, resulting in two issues: 1) Long rationales lead to a large token-level batch size during training, making gradients of core reasoning tokens (i.e., tokens that directly affect the correctness of subsequent reasoning) over-smoothed, as they contribute a tiny fraction of the rationale. As a result, the SLM converges to sharp minima where it fails to grasp the reasoning logic. 2) The response is slow, as the SLM must generate a long rationale before reaching the answer. Therefore, we propose chunk-wise training (CWT), which uses a heuristic search to divide the rationale into internally semantically coherent chunks and focuses the SLM on learning from only one chunk per iteration. In this way, CWT naturally isolates non-reasoning chunks that do not involve core reasoning tokens (e.g., summary and transitional chunks) from the SLM’s learning of reasoning chunks, increasing the fraction of core reasoning tokens in the corresponding iteration. Based on CWT, skip-thinking training (STT) is proposed. STT makes the SLM automatically skip non-reasoning medium chunks to reach the answer, improving reasoning speed while maintaining accuracy. We validate our approach on a variety of SLMs and multiple reasoning tasks.
pdf
bib
abs
Can an Individual Manipulate the Collective Decisions of Multi-Agents?
Fengyuan Liu
|
Rui Zhao
|
Shuo Chen
|
Guohao Li
|
Philip Torr
|
Lei Han
|
Jindong Gu
Individual Large Language Models (LLMs) have demonstrated significant capabilities across various domains, such as healthcare and law. Recent studies also show that coordinated multi-agent systems exhibit enhanced decision-making and reasoning abilities through collaboration. However, due to the vulnerabilities of individual LLMs and the difficulty of accessing all agents in a multi-agent system, a key question arises: If attackers only know one agent, could they still generate adversarial samples capable of misleading the collective decision? To explore this question, we formulate it as a game with incomplete information, where attackers know only one target agent and lack knowledge of the other agents in the system. With this formulation, we propose M-Spoiler, a framework that simulates agent interactions within a multi-agent system to generate adversarial samples. These samples are then used to manipulate the target agent in the target system, misleading the system’s collaborative decision-making process. More specifically, M-Spoiler introduces a stubborn agent that actively aids in optimizing adversarial samples by simulating potential stubborn responses from agents in the target system. This enhances the effectiveness of the generated adversarial samples in misleading the system. Through extensive experiments across various tasks, our findings confirm the risks posed by the knowledge of an individual agent in multi-agent systems and demonstrate the effectiveness of our framework. We also explore several defense mechanisms, showing that our proposed attack framework remains more potent than baselines, underscoring the need for further research into defensive strategies.
pdf
bib
abs
Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore’s Low-Resource Languages
Yujia Hu
|
Ming Shan Hee
|
Preslav Nakov
|
Roy Ka-Wei Lee
The advancement of Large Language Models (LLMs) has transformed natural language processing; however, their safety mechanisms remain under-explored in low-resource, multilingual settings. Here, we aim to bridge this gap. In particular, we introduce SGToxicGuard, a novel dataset and evaluation framework for benchmarking LLM safety in Singapore’s diverse linguistic context, including Singlish, Chinese, Malay, and Tamil. SGToxicGuard adopts a red-teaming approach to systematically probe LLM vulnerabilities in three real-world scenarios: conversation, question-answering, and content composition. We conduct extensive experiments with state-of-the-art multilingual LLMs, and the results uncover critical gaps in their safety guardrails. By offering actionable insights into cultural sensitivity and toxicity mitigation, we lay the foundation for safer and more inclusive AI systems in linguistically diverse environments. Disclaimer: This paper contains sensitive content that may be disturbing to some readers.
pdf
bib
abs
Improving Clustering with Positive Pairs Generated from LLM-Driven Labels
Xiaotong Zhang
|
Ying Li
Traditional unsupervised clustering methods, which often rely on contrastive training of embedders, suffer from a lack of label knowledge, resulting in suboptimal performance. Furthermore, the presence of potential false negatives can destabilize the training process. Hence, we propose to improve clustering with Positive Pairs generated from LLM-driven Labels (PPLL). In the proposed framework, LLM is initially employed to cluster the data and generate corresponding mini-cluster labels. Subsequently, positive pairs are constructed based on these labels, and an embedder is trained using BYOL to obviate the need for negative pairs. Following training, the acquired label knowledge is integrated into K-means clustering. This framework enables the integration of label information throughout the training and inference processes, while mitigating the reliance on negative pairs. Additionally, it generates interpretable labels for improved understanding of clustering results. Empirical evaluations on a range of datasets demonstrate that our proposed framework consistently surpasses state-of-the-art baselines, achieving superior performance, robustness, and computational efficiency for diverse text clustering applications.
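A compact sketch of the two data-facing steps the abstract describes: turning LLM-generated mini-cluster labels into positive pairs (no negatives are needed for BYOL-style training), and injecting the label knowledge into K-means. The centroid-seeding trick below is an assumption for illustration, not necessarily how PPLL integrates labels:

```python
from collections import defaultdict
from itertools import combinations
import numpy as np
from sklearn.cluster import KMeans

def positive_pairs(texts, llm_labels):
    """All within-label pairs become positives for contrastive/BYOL-style training."""
    by_label = defaultdict(list)
    for text, label in zip(texts, llm_labels):
        by_label[label].append(text)
    return [pair for group in by_label.values() for pair in combinations(group, 2)]

def cluster_with_label_knowledge(embeddings, llm_labels, n_clusters):
    """Illustrative label injection: seed K-means centroids with the mean embedding
    of each LLM-labelled group (an assumed mechanism, for demonstration only)."""
    labels = sorted(set(llm_labels))[:n_clusters]
    seeds = np.stack([embeddings[[i for i, l in enumerate(llm_labels) if l == lab]].mean(0)
                      for lab in labels])
    return KMeans(n_clusters=n_clusters, init=seeds, n_init=1).fit_predict(embeddings)

texts = ["refund request", "billing issue", "app crashes", "login error"]
labels = ["billing", "billing", "technical", "technical"]
print(positive_pairs(texts, labels))
print(cluster_with_label_knowledge(np.random.rand(4, 8), labels, n_clusters=2))
```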
pdf
bib
abs
Gamma-Guard: Lightweight Residual Adapters for Robust Guardrails in Large Language Models
Lijia Lv
|
Yuanshu Zhao
|
Guan Wang
|
Xuehai Tang
|
Wen Jie
|
Jizhong Han
|
Songlin Hu
Large language models (LLMs) are widely deployed as zero-shot evaluators for answer grading, content moderation, and document ranking. Yet studies show that guard models (Guards)—LLMs fine-tuned for safety—remain vulnerable to “jailbreak” attacks, jeopardising downstream chatbots. We confirm this weakness on three public benchmarks (BeaverTails, XSTest, AdvBench) and trace it to representation shifts that arise in the embedding layer and cascade through the Transformer stack. To counteract the effect, we introduce Gamma-Guard: lightweight residual adapters inserted after the embeddings and at sparse intervals in the model. The adapters start with zero-scaled gates, so they retain the original behaviour; a brief adversarial fine-tuning phase then teaches them to denoise embeddings and refocus attention. With fewer than 0.1% extra parameters and only a 2% latency increase, Gamma-Guard lifts adversarial accuracy from below 5% to 95%, a 90-percentage-point gain, while reducing clean-data accuracy by just 8 percentage points. Extensive ablations further show that robustness improvements persist across different layer placements and model sizes. To our knowledge, this is the first approach that directly augments large Guards with trainable adapters, providing a practical path toward safer large-scale LLM deployments.
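A minimal residual adapter with a zero-initialized gate, illustrating the "start as identity, then adapt" behaviour the abstract describes; the layer sizes and bottleneck width are arbitrary choices, not the paper's configuration:

```python
import torch
import torch.nn as nn

class ZeroGatedAdapter(nn.Module):
    """Residual bottleneck adapter whose output is scaled by a gate initialized to zero,
    so the wrapped model's behaviour is unchanged before fine-tuning."""
    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.gate = nn.Parameter(torch.zeros(1))  # starts closed -> identity behaviour

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden + self.gate * self.up(torch.relu(self.down(hidden)))

x = torch.randn(2, 5, 768)          # (batch, sequence, hidden) activations
adapter = ZeroGatedAdapter(768)
print(torch.allclose(adapter(x), x))  # True at initialization: no behaviour change yet
```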
pdf
bib
abs
Facilitating Long Context Understanding via Supervised Chain-of-Thought Reasoning
Jingyang Lin
|
Andy Wong
|
Tian Xia
|
Shenghua He
|
Hui Wei
|
Mei Han
|
Jiebo Luo
Recent advances in Large Language Models (LLMs) have enabled them to process increasingly longer sequences, ranging from 2K to 2M tokens and even beyond. However, simply extending the input sequence length does not necessarily lead to effective long-context understanding. In this study, we integrate Chain-of-Thought (CoT) reasoning into LLMs in a supervised manner to facilitate effective long-context understanding. To achieve this, we introduce LongFinanceQA, a synthetic dataset in the financial domain designed to improve long-context reasoning. Unlike existing long-context synthetic data, LongFinanceQA includes intermediate CoT reasoning before the final conclusion, which encourages LLMs to perform explicit reasoning, improving accuracy and interpretability in long-context understanding. To generate synthetic CoT reasoning, we propose Property-based Agentic Inference (PAI), an agentic framework that simulates human-like reasoning steps, including property extraction, retrieval, and summarization. We evaluate PAI’s reasoning capabilities by assessing GPT-4o-mini w/ PAI on the Loong benchmark, outperforming standard GPT-4o-mini by 20.0%. Furthermore, we fine-tune LLaMA-3.1-8B-Instruct on LongFinanceQA, achieving a 28.0% gain on Loong’s financial subset.
pdf
bib
abs
Dynamic Energy-Based Contrastive Learning with Multi-Stage Knowledge Verification for Event Causality Identification
Ya Su
|
Hu Zhang
|
Yue Fan
|
Guangjun Zhang
|
YuJie Wang
|
Ru Li
|
Hongye Tan
Event Causality Identification (ECI) aims to identify fine-grained causal relationships between events from unstructured text. Contrastive learning has shown promise in enhancing ECI by optimizing representation distances between positive and negative samples. However, existing methods often rely on rule-based or random sampling strategies, which may introduce spurious causal positives. Moreover, static negative samples often fail to approximate actual decision boundaries, thus limiting discriminative performance. Therefore, we propose an ECI method enhanced by Dynamic Energy-based Contrastive Learning with multi-stage knowledge Verification (DECLV). Specifically, we integrate multi-source knowledge validation and LLM-driven causal inference to construct a multi-stage knowledge validation mechanism, which generates high-quality contrastive samples and effectively suppresses spurious causal disturbances. Meanwhile, we introduce the Stochastic Gradient Langevin Dynamics (SGLD) method to dynamically generate adversarial negative samples, and employ an energy-based function to model the causal boundary between positive and negative samples. The experimental results show that our method outperforms previous state-of-the-art methods on both benchmarks, EventStoryLine and Causal-TimeBank.
pdf
bib
abs
ICG: Improving Cover Image Generation via MLLM-based Prompting and Personalized Preference Alignment
Zhipeng Bian
|
Jieming Zhu
|
Qijiong Liu
|
Wang Lin
|
Guohao Cai
|
Zhaocheng Du
|
Jiacheng Sun
|
Zhou Zhao
|
Zhenhua Dong
Recent advances in multimodal large language models (MLLMs) and diffusion models (DMs) have opened new possibilities for AI-generated content. Yet, personalized cover image generation remains underexplored, despite its critical role in boosting user engagement on digital platforms. We propose ICG, a novel framework that integrates MLLM-based prompting with personalized preference alignment to generate high-quality, contextually relevant covers. ICG extracts semantic features from item titles and reference images via meta tokens, refines them with user embeddings, and injects the resulting personalized context into the diffusion model. To address the lack of labeled supervision, we adopt a multi-reward learning strategy that combines public aesthetic and relevance rewards with a personalized preference model trained from user behavior. Unlike prior pipelines relying on handcrafted prompts and disjointed modules, ICG employs an adapter to bridge MLLMs and diffusion models for end-to-end training. Experiments demonstrate that ICG significantly improves image quality, semantic fidelity, and personalization, leading to stronger user appeal and offline recommendation accuracy in downstream tasks. As a plug-and-play adapter bridging MLLMs and diffusion models, ICG is compatible with common checkpoints and requires no ground-truth labels during optimization.
pdf
bib
abs
From Long to Lean: Performance-aware and Adaptive Chain-of-Thought Compression via Multi-round Refinement
JianZhi Yan
|
Le Liu
|
Youcheng Pan
|
Shiwei Chen
|
Zike Yuan
|
Yang Xiang
|
Buzhou Tang
Chain-of-Thought (CoT) reasoning improves performance on complex tasks but introduces significant inference latency due to its verbosity. In this work, we propose Multiround Adaptive Chain-of-Thought Compression (MACC), a framework that leverages the token elasticity phenomenon—where overly small token budgets may paradoxically increase output length—to progressively compress CoTs via multiround refinement. This adaptive strategy allows MACC to dynamically determine the optimal compression depth for each input. Our method achieves an average accuracy improvement of 5.6% over state-of-the-art baselines, while also reducing CoT length by an average of 47 tokens and significantly lowering latency. Furthermore, we show that test-time performance—accuracy and token length—can be reliably predicted using interpretable features like perplexity and compression rate on the training set. Evaluated across different models, our method enables efficient model selection and forecasting without repeated fine-tuning, demonstrating that CoT compression is both effective and predictable. Our code will be released at https://github.com/Leon221220/MACC.
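A schematic of the multi-round refinement loop described above: repeatedly ask a compressor to shorten the chain of thought under a budget, and stop when the output paradoxically grows longer (the token-elasticity signal). The `compress` and `count_tokens` helpers are stand-ins, and the stopping rule is a guess at the general idea rather than MACC's actual criterion.

```python
def count_tokens(text: str) -> int:
    # Stand-in tokenizer: whitespace tokens; swap in a real tokenizer.
    return len(text.split())

def compress(cot: str, budget: int) -> str:
    """Stub for an LLM call such as: 'Rewrite this reasoning in at most
    {budget} tokens without changing the final answer.'"""
    raise NotImplementedError("plug in an LLM call here")

def multiround_compress(cot: str, rounds: int = 4, shrink: float = 0.7) -> str:
    best = cot
    budget = int(count_tokens(cot) * shrink)
    for _ in range(rounds):
        candidate = compress(best, budget)
        # Token elasticity: an overly tight budget can make the output longer.
        # If that happens, keep the best compression found so far and stop.
        if count_tokens(candidate) >= count_tokens(best):
            break
        best = candidate
        budget = int(count_tokens(best) * shrink)
    return best
```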
pdf
bib
abs
A Symbolic Adversarial Learning Framework for Evolving Fake News Generation and Detection
Chong Tian
|
Qirong Ho
|
Xiuying Chen
Rapid LLM advancements heighten fake news risks by enabling the automatic generation of increasingly sophisticated misinformation. Previous detection methods, including fine-tuned small models or LLM-based detectors, often struggle with its dynamically evolving nature. In this work, we propose a novel framework called the Symbolic Adversarial Learning Framework (SALF), which implements an adversarial training paradigm by an agent symbolic learning optimization process, rather than relying on numerical updates. SALF introduces a paradigm where the generation agent crafts deceptive narratives, and the detection agent uses structured debates to identify logical and factual flaws for detection, and they iteratively refine themselves through such adversarial interactions. Unlike traditional neural updates, we represent agents using agent symbolic learning, where learnable weights are defined by agent prompts, and simulate back-propagation and gradient descent by operating on natural language representations of weights, loss, and gradients. Experiments on two multilingual benchmark datasets demonstrate SALF’s effectiveness, showing it generates sophisticated fake news that degrades state-of-the-art detection performance by up to 53.4% in Chinese and 34.2% in English on average. SALF also refines detectors, improving detection of refined content by up to 7.7%. We hope our work inspires further exploration into more robust, adaptable fake news detection systems.
pdf
bib
abs
RareSyn: Health Record Synthesis for Rare Disease Diagnosis
Huimin Wang
|
Yutian Zhao
|
Yefeng Zheng
|
Xian Wu
Diagnosis based on Electronic Health Records (EHRs) often struggles with data scarcity and privacy concerns. To address these issues, we introduce RareSyn, an innovative data synthesis approach designed to augment and de-identify EHRs, with a focus on rare diseases. The core insight of RareSyn involves using seed EHRs of rare diseases to recall similar records from both common and rare diseases, and then leveraging Large Language Models to substitute the key medical information (e.g., symptoms or examination details) in these records with information from the knowledge graph, thereby generating new EHRs. We first train a transformer Encoder with contrastive learning to integrate various types of medical knowledge. Then, RareSyn engages in iterative processes of recalling similar EHRs, structuring EHRs, revising EHRs, and generating new EHRs until the produced EHRs achieve extensive coverage of the rare disease knowledge. We assess RareSyn based on its utility for diagnosis modeling, the diversity of medical knowledge it incorporates, and the privacy of the synthesized EHRs. Extensive experiments demonstrate its effectiveness in improving disease diagnosis, enhancing diversity, and maintaining privacy.
pdf
bib
abs
Sticker-TTS: Learn to Utilize Historical Experience with a Sticker-driven Test-Time Scaling Framework
Jie Chen
|
Jinhao Jiang
|
Yingqian Min
|
Zican Dong
|
Shijie Wang
|
Xin Zhao
|
Ji-Rong Wen
Large reasoning models (LRMs) have exhibited strong performance on complex reasoning tasks, with further gains achievable through increased computational budgets at inference. However, current test-time scaling methods predominantly rely on redundant sampling, ignoring the historical experience utilization, thereby limiting computational efficiency. To overcome this limitation, we propose Sticker-TTS, a novel test-time scaling framework that coordinates three collaborative LRMs to iteratively explore and refine solutions guided by historical attempts. At the core of our framework are distilled key conditions—termed stickers—which drive the extraction, refinement, and reuse of critical information across multiple rounds of reasoning. To further enhance the efficiency and performance of our framework, we introduce a two-stage optimization strategy that combines imitation learning with self-improvement, enabling progressive refinement. Extensive evaluations on three challenging mathematical reasoning benchmarks, including AIME-24, AIME-25, and OlymMATH, demonstrate that Sticker-TTS consistently surpasses strong baselines, including self-consistency and advanced reinforcement learning approaches, under comparable inference budgets. These results highlight the effectiveness of sticker-guided historical experience utilization. Our code and data are available at https://github.com/RUCAIBox/Sticker-TTS.
pdf
bib
abs
CMHG: A Dataset and Benchmark for Headline Generation of Minority Languages in China
Guixian Xu
|
Zeli Su
|
Ziyin Zhang
|
Jianing Liu
|
Xu Han
|
Ting Zhang
|
Yushuang Dong
Minority languages in China, such as Tibetan, Uyghur, and Traditional Mongolian, face significant challenges due to their unique writing systems, which differ from international standards. This discrepancy has led to a severe lack of relevant corpora, particularly for supervised tasks like headline generation. To address this gap, we introduce a novel dataset, Chinese Minority Headline Generation (CMHG), which includes 100,000 entries for Tibetan, and 50,000 entries each for Uyghur and Mongolian, specifically curated for headline generation tasks. Additionally, we propose a high-quality test set annotated by native speakers, designed to serve as a benchmark for future research in this domain. We hope this dataset will become a valuable resource for advancing headline generation in Chinese minority languages and contribute to the development of related benchmarks.
pdf
bib
abs
Understanding the Information Propagation Effects of Communication Topologies in LLM-based Multi-Agent Systems
Xu Shen
|
Yixin Liu
|
Yiwei Dai
|
Yili Wang
|
Rui Miao
|
Yue Tan
|
Shirui Pan
|
Xin Wang
The communication topology in large language model-based multi-agent systems fundamentally governs inter-agent collaboration patterns, critically shaping both the efficiency and effectiveness of collective decision-making. While recent studies on automated communication topology design tend to construct sparse structures for efficiency, they often overlook why and when sparse and dense topologies help or hinder collaboration. In this paper, we present a causal framework to analyze how agent outputs, whether correct or erroneous, propagate under topologies with varying sparsity. Our empirical studies reveal that moderately sparse topologies, which effectively suppress error propagation while preserving beneficial information diffusion, typically achieve optimal task performance. Guided by this insight, we propose a novel topology design approach, EIB-Learner, that balances error suppression and beneficial information propagation by fusing connectivity patterns from both dense and sparse graphs. Extensive experiments show the superior effectiveness, lower communication cost, and stronger robustness of EIB-Learner.
pdf
bib
abs
Boosting Data Utilization for Multilingual Dense Retrieval
Chao Huang
|
Fengran Mo
|
Yufeng Chen
|
Changhao Guan
|
Zhenrui Yue
|
Xinyu Wang
|
Jinan Xu
|
Kaiyu Huang
Multilingual dense retrieval aims to retrieve relevant documents across different languages based on a unified retriever model. The challenge lies in aligning representations of different languages in a shared vector space. The common practice is to fine-tune the dense retriever via contrastive learning, whose effectiveness highly relies on the quality of the negative sample and the efficacy of mini-batch data. Different from the existing studies that focus on developing sophisticated model architecture, we propose a method to boost data utilization for multilingual dense retrieval by obtaining high-quality hard negative samples and effective mini-batch data. The extensive experimental results on a multilingual retrieval benchmark, MIRACL, with 16 languages demonstrate the effectiveness of our method by outperforming several existing strong baselines.
pdf
bib
abs
Self-Augmented Preference Alignment for Sycophancy Reduction in LLMs
Chien Hung Chen
|
Hen-Hsen Huang
|
Hsin-Hsi Chen
Sycophancy causes models to produce answers that cater to user expectations rather than providing truthful responses. Sycophantic behavior in models can erode user trust by creating a perception of dishonesty or bias. This lack of authenticity may lead users to question the reliability and objectivity of the system’s responses. Although Reinforcement Learning from Human Feedback (RLHF) is effective in aligning models with human preferences, previous studies have observed that it can simultaneously amplify sycophantic behavior. However, these studies primarily focused on proprietary models and employed indirect analysis to demonstrate the influence of human feedback. Our study focuses on sycophancy in open-source models, which are more reproducible and transparent for research. We investigated the impact of human feedback on sycophancy by directly comparing models aligned with human feedback to those not aligned. To address sycophancy, we proposed assessing the user’s expected answer rather than ignoring it. Consequently, we developed the Sycophancy Answer Assessment (SAA) dataset and introduced Self-Augmented Preference Alignment, demonstrating that these methods effectively enhance the model’s assessment ability and significantly reduce sycophancy across tasks.
pdf
bib
abs
TP-RAG: Benchmarking Retrieval-Augmented Large Language Model Agents for Spatiotemporal-Aware Travel Planning
Hang Ni
|
Fan Liu
|
Xinyu Ma
|
Lixin Su
|
Shuaiqiang Wang
|
Dawei Yin
|
Hui Xiong
|
Hao Liu
Large language models (LLMs) have shown promise in automating travel planning, yet they often fall short in addressing nuanced spatiotemporal rationality. While existing benchmarks focus on basic plan validity, they neglect critical aspects such as route efficiency, POI appeal, and real-time adaptability. This paper introduces **TP-RAG**, the first benchmark tailored for retrieval-augmented, spatiotemporal-aware travel planning. Our dataset includes 2,348 real-world travel queries, 85,575 fine-grain annotated POIs, and 18,784 high-quality travel trajectory references sourced from online tourist documents, enabling dynamic and context-aware planning. Through extensive experiments, we reveal that integrating reference trajectories significantly improves spatial efficiency and POI rationality of the travel plan, while challenges persist in universality and robustness due to conflicting references and noisy data. To address these issues, we propose *EvoRAG*, an evolutionary framework that potently synergizes diverse retrieved trajectories with LLMs’ intrinsic reasoning. *EvoRAG* achieves state-of-the-art performance, improving spatiotemporal compliance and reducing commonsense violation compared to ground-up and retrieval-augmented baselines. Our work underscores the potential of hybridizing Web knowledge with LLM-driven optimization, paving the way for more reliable and adaptive travel planning agents.
pdf
bib
abs
Recontextualizing Revitalization: A Mixed Media Approach to Reviving the Nüshu Language
Ivory Yang
|
Xiaobo Guo
|
Yuxin Wang
|
Hefan Zhang
|
Yaning Jia
|
William Dinauer
|
Soroush Vosoughi
Nüshu is an endangered language from Jiangyong County, China, and the world’s only known writing system created and used exclusively by women. Recent Natural Language Processing (NLP) work has digitized small Nüshu-Chinese corpora, but the script remains computationally inaccessible due to its handwritten, mixed-media form and dearth of multimodal resources. We address this gap with two novel datasets: NüshuVision, an image corpus of 500 rendered sentences in traditional vertical, right-to-left orthography, and NüshuStrokes, the first sequential handwriting recordings of all 397 Unicode Nüshu characters by an expert calligrapher. Evaluating five state-of-the-art Chinese Optical Character Recognition (OCR) systems on NüshuVision shows that all fail entirely, each yielding a Character Error Rate (CER) of 1.0. Fine-tuning Microsoft’s TrOCR on NüshuVision lowers CER to 0.67, a modest yet meaningful improvement. These contributions establish the first multimodal foundation for Nüshu revitalization and offer a culturally grounded framework for language preservation.
pdf
bib
abs
Towards Advanced Mathematical Reasoning for LLMs via First-Order Logic Theorem Proving
Chuxue Cao
|
Mengze Li
|
Juntao Dai
|
Jinluan Yang
|
Zijian Zhao
|
Shengyu Zhang
|
Weijie Shi
|
Chengzhong Liu
|
Sirui Han
|
Yike Guo
Large language models (LLMs) have shown promising first-order logic (FOL) reasoning capabilities with applications in various areas. However, their effectiveness in complex mathematical reasoning involving multi-step FOL deductions is still under-researched. While LLMs perform competitively on established mathematical reasoning benchmarks, they struggle with multi-step FOL tasks, as demonstrated by Deepseek-Prover-V2-7B’s low accuracy (4.2%) on our proposed theorem proving dataset. This issue arises from the limited exploration of diverse proof strategies and the potential for early reasoning mistakes to undermine entire proofs. To address these issues, we propose DREAM, a self-adaptive solution that enhances the Diversity and REAsonability of LLMs’ generation strategies. DREAM incorporates an Axiom-Driven Strategy Diversification mechanism to promote varied strategic outcomes and a Sub-Proposition Error Feedback to help LLMs reflect on and correct their proofs. Our contributions include pioneering advancements in LLMs’ mathematical reasoning through FOL theorem proving, introducing a novel inference stage solution that improves performance by 0.6% to 6.4%, and providing a curated dataset of 447 mathematical theorems in Lean 4 format for evaluation.
pdf
bib
abs
From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition
Tianduo Wang
|
Lu Xu
|
Wei Lu
|
Shanbo Cheng
Recent advances in Automatic Speech Recognition (ASR) have been largely fueled by massive speech corpora. However, extending coverage to diverse languages with limited resources remains a formidable challenge. This paper introduces Speech Back-Translation, a scalable pipeline that improves multilingual ASR models by converting large-scale text corpora into synthetic speech via off-the-shelf text-to-speech (TTS) models. We demonstrate that just tens of hours of real transcribed speech can effectively train TTS models to generate synthetic speech at hundreds of times the original volume while maintaining high quality. To evaluate synthetic speech quality, we develop an intelligibility-based assessment framework and establish clear thresholds for when synthetic data benefits ASR training. Using Speech Back-Translation, we generate more than 500,000 hours of synthetic speech in ten languages and continue pre-training Whisper-large-v3, achieving average transcription error reductions of over 30%. These results highlight the scalability and effectiveness of Speech Back-Translation for enhancing multilingual ASR systems.
pdf
bib
abs
CityEQA: A Hierarchical LLM Agent on Embodied Question Answering Benchmark in City Space
Yong Zhao
|
Kai Xu
|
Zhengqiu Zhu
|
Yue Hu
|
Zhiheng Zheng
|
Yingfeng Chen
|
Yatai Ji
|
Chen Gao
|
Yong Li
|
Jincai Huang
Embodied Question Answering (EQA) has primarily focused on indoor environments, leaving the complexities of urban settings—spanning environment, action, and perception—largely unexplored. To bridge this gap, we introduce CityEQA, a new task where an embodied agent answers open-vocabulary questions through active exploration in dynamic city spaces. To support this task, we present CityEQA-EC, the first benchmark dataset featuring 1,412 human-annotated tasks across six categories, grounded in a realistic 3D urban simulator. Moreover, we propose Planner-Manager-Actor (PMA), a novel agent tailored for CityEQA. PMA enables long-horizon planning and hierarchical task execution: the Planner breaks down the question answering into sub-tasks, the Manager maintains an object-centric cognitive map for spatial reasoning during the process control, and the specialized Actors handle navigation, exploration, and collection sub-tasks. Experiments demonstrate that PMA achieves 60.7% of human-level answering accuracy, significantly outperforming frontier-based baselines. While promising, the performance gap compared to humans highlights the need for enhanced visual reasoning in CityEQA. This work paves the way for future advancements in urban spatial intelligence. Dataset and code are available at https://github.com/tsinghua-fib-lab/CityEQA.git.
pdf
bib
abs
Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression
Sreetama Sarkar
|
Yue Che
|
Alex Gavin
|
Peter Anthony Beerel
|
Souvik Kundu
Despite their remarkable progress in multimodal understanding tasks, large vision language models (LVLMs) often suffer from “hallucination”, generating texts misaligned with the visual context. Existing methods aimed at reducing hallucinations through inference time intervention incur a significant increase in latency. To mitigate this, we present **SPIN**, a task-agnostic attention-guided head suppression strategy that can be seamlessly integrated during inference **without incurring any significant compute or latency overhead**. We investigate whether hallucination in LVLMs can be linked to specific model components. Our analysis suggests that hallucinations can be attributed to a dynamic subset of attention heads in each layer. Leveraging this insight, for each text query token, we selectively suppress attention heads that exhibit low attention to image tokens, keeping the top-k attention heads intact. Extensive evaluations on visual question answering and image description tasks demonstrate the efficacy of SPIN in reducing hallucination scores up to **2.7x** while maintaining F1, and improving throughput by **1.8x** compared to existing alternatives.
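The head-suppression rule sketched in this abstract can be written in a few lines: score each attention head by the mass it places on image tokens, keep the top-k, and zero the contribution of the rest for the current text query token. The tensor shapes and the hard-zero suppression below are illustrative assumptions, not the exact SPIN intervention.

```python
import torch

def suppress_low_image_attention_heads(attn: torch.Tensor,
                                       head_out: torch.Tensor,
                                       image_token_idx: torch.Tensor,
                                       top_k: int = 8) -> torch.Tensor:
    """
    attn:            [num_heads, key_len]  attention of the current query token per head
    head_out:        [num_heads, head_dim] per-head output for the current query token
    image_token_idx: indices of image tokens among the keys
    Returns head outputs with low image-attention heads zeroed out.
    """
    image_mass = attn[:, image_token_idx].sum(dim=-1)             # [num_heads]
    keep = torch.topk(image_mass, k=min(top_k, attn.size(0))).indices
    mask = torch.zeros(attn.size(0), dtype=head_out.dtype)
    mask[keep] = 1.0
    return head_out * mask.unsqueeze(-1)                          # suppress the rest

# Toy example: 32 heads, 600 keys of which the first 576 are image tokens.
attn = torch.rand(32, 600).softmax(dim=-1)
head_out = torch.randn(32, 128)
img_idx = torch.arange(576)
suppressed = suppress_low_image_attention_heads(attn, head_out, img_idx)
```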
pdf
bib
abs
Examining False Positives under Inference Scaling for Mathematical Reasoning
Yu Wang
|
Nan Yang
|
Liang Wang
|
Furu Wei
|
Fuli Feng
Recent advancements in language models have led to significant improvements in mathematical reasoning across various benchmarks. However, most of these benchmarks rely on automatic evaluation methods that only compare final answers using heuristics, without verifying the underlying reasoning steps. This limitation results in false positive solutions, where models may produce correct final answers but with flawed deduction paths. In this paper, we systematically examine the prevalence of false positive solutions in mathematical problem solving for language models. We analyze the characteristics and extent of this issue across different open-source models, datasets of varying difficulty levels, and decoding strategies. Specifically, we explore how false positives influence the inference time scaling behavior of language models. Our experimental results reveal that: (1) false positive solutions persist across different models, datasets, and decoding methods, (2) sampling-based inference time scaling methods do not alleviate the problem, and (3) the pass@N evaluation metric is more susceptible to false positives, suggesting a significantly lower scaling ceiling than what automatic evaluations indicate. Additionally, we analyze specific instances of false positives and discuss potential limitations in self-improvement techniques and synthetic data generation under such conditions. Our data and code are publicly available at https://github.com/Wloner0809/False-Positives-in-Math.
pdf
bib
abs
Translationese-index: Using Likelihood Ratios for Graded and Generalizable Measurement of Translationese
Yikang Liu
|
Wanyang Zhang
|
Yiming Wang
|
Jialong Tang
|
Pei Zhang
|
Baosong Yang
|
Fei Huang
|
Rui Wang
|
Hai Hu
Translationese refers to linguistic properties that usually occur in translated texts. Previous works study translationese by framing it as a binary classification between original texts and translated texts. In this paper, we argue that translationese should be graded instead of binary and propose the first measure for translationese—the translationese-index (T-index), computed from the likelihood ratios of two contrastively fine-tuned language models (LMs). We use synthesized translations and translations in the wild to evaluate T-index’s generalizability in cross-domain settings and its validity against human judgments. Our results show that T-index can generalize to unseen genres, authors, and language pairs. Moreover, T-index computed using two 0.5B LMs fine-tuned on only 1-5k pairs of synthetic data can effectively capture translationese, as demonstrated by alignment with human pointwise ratings and pairwise judgments. Additionally, the correlation between T-index and existing machine translation (MT) quality estimation (QE) metrics such as BLEU and COMET is low, suggesting that T-index is not covered by these metrics and can serve as a complementary metric in MT QE.
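A minimal way to realize the likelihood-ratio idea above: score a text with two causal LMs, one fine-tuned toward translated text and one toward original text, and take the difference of their length-normalized log-likelihoods. The checkpoint names are placeholders, both LMs are assumed to share a tokenizer, and the paper's exact normalization may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoints: two small LMs contrastively fine-tuned on
# translated vs. original text.
TRANS_LM, ORIG_LM = "path/to/lm-translated", "path/to/lm-original"

def avg_log_likelihood(model, tokenizer, text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # `loss` is the mean negative log-likelihood per token.
        loss = model(ids, labels=ids).loss
    return -loss.item()

def t_index(text: str) -> float:
    tok = AutoTokenizer.from_pretrained(ORIG_LM)
    lm_trans = AutoModelForCausalLM.from_pretrained(TRANS_LM)
    lm_orig = AutoModelForCausalLM.from_pretrained(ORIG_LM)
    # Higher values mean the text looks more like translated than original text.
    return avg_log_likelihood(lm_trans, tok, text) - avg_log_likelihood(lm_orig, tok, text)
```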
pdf
bib
abs
Exploring the Limitations of Mamba in COPY and CoT Reasoning
Ruifeng Ren
|
Zhicong Li
|
Yong Liu
Transformers have become the backbone of modern Large Language Models (LLMs); however, their inference overhead grows linearly with the sequence length, posing challenges for modeling long sequences. In light of this, Mamba has attracted attention for maintaining a constant inference size, with empirical evidence demonstrating that it can match Transformer performance in sequence modeling while significantly reducing computational costs. However, an open question remains: can Mamba always bring savings while achieving performance comparable to Transformers? In this paper, we focus on analyzing the expressive ability of Mamba to perform our defined COPY operation and Chain of Thought (CoT) reasoning. First, inspired by the connection between Mamba and linear attention, we show that constant-sized Mamba may struggle to perform COPY operations while Transformers can handle them more easily. However, when the size of Mamba grows linearly with the input sequence length, it can accurately perform COPY, but in this case, Mamba no longer provides overhead savings. Based on this observation, we further analyze Mamba’s ability to tackle CoT tasks, which can be described as Dynamic Programming (DP) problems. Our findings suggest that to solve arbitrary DP problems, the total cost of Mamba is still comparable to standard Transformers. However, similar to efficient Transformers, when facing DP problems with favorable properties such as locality, Mamba can provide savings in overhead. Our experiments on the copy and CoT tasks further demonstrate Mamba’s limitations compared to Transformers in learning these tasks.
pdf
bib
abs
ProcWorld: Benchmarking Large Model Planning in Reachability-Constrained Environments
Dong Wang
|
Xinghang Li
|
Zhengshen Zhang
|
Jirong Liu
|
Xiao Ma
|
Hanbo Zhang
|
Tao Kong
|
Huaping Liu
We introduce ProcWorld, a large-scale benchmark for partially observable embodied spatial reasoning and long-term planning with large language models (LLM) and vision language models (VLM). ProcWorld features a wide range of challenging embodied navigation and object manipulation tasks, covering 16 task types, 5,000 rooms, and over 10 million evaluation trajectories with diverse data distribution. ProcWorld supports configurable observation modes, ranging from text-only descriptions to vision-only observations. It enables text-based actions to control the agent following language instructions. ProcWorld has presented significant challenges for LLMs and VLMs: (1) active information gathering given partial observations for disambiguation; (2) simultaneous localization and decision-making by tracking the spatio-temporal state-action distribution; (3) constrained reasoning with dynamic states subject to physical reachability. Our extensive evaluation of 15 foundation models and 5 reasoning algorithms (with over 1 million rollouts) indicates larger models perform better. However, ProcWorld remains highly challenging for existing state-of-the-art models and in-context learning methods due to constrained reachability and the need for combinatorial spatial reasoning.
pdf
bib
abs
R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation
Kaijie Chen
|
Zihao Lin
|
Zhiyang Xu
|
Ying Shen
|
Yuguang Yao
|
Joy Rimchala
|
Jiaxin Zhang
|
Lifu Huang
Reasoning is a fundamental capability often required in real-world text-to-image (T2I) generation, e.g., generating “a bitten apple that has been left in the air for more than a week” necessitates understanding temporal decay and commonsense concepts. While recent T2I models have made impressive progress in producing photorealistic images, their reasoning capability remains underdeveloped and insufficiently evaluated. To bridge this gap, we introduce R2I-Bench, a comprehensive benchmark specifically designed to rigorously assess reasoning-driven T2I generation. R2I-Bench comprises 3068 meticulously curated data instances, spanning 7 core reasoning categories, including commonsense, mathematical, logical, compositional, numerical, causal, and concept mixing. To facilitate fine-grained evaluation, we design R2IScore, a QA-style metric based on instance-specific, reasoning-oriented evaluation questions that assess three critical dimensions: text-image alignment, reasoning accuracy, and image quality. Extensive experiments with 16 representative T2I models, including a strong pipeline-based framework that decouples reasoning and generation using the state-of-the-art language and image generation models, demonstrate consistently limited reasoning performance, highlighting the need for more robust, reasoning-aware architectures in the next generation of T2I systems.
pdf
bib
abs
Can GRPO Boost Complex Multimodal Table Understanding?
Xiaoqiang Kang
|
Shengen Wu
|
Zimu Wang
|
Yilin Liu
|
Xiaobo Jin
|
Kaizhu Huang
|
Wei Wang
|
Yutao Yue
|
Xiaowei Huang
|
Qiufeng Wang
Existing table understanding methods face challenges due to complex table structures and intricate logical reasoning. While supervised finetuning (SFT) dominates existing research, reinforcement learning (RL), such as Group Relative Policy Optimization (GRPO), has shown promise but struggled with low initial policy accuracy and coarse rewards in tabular contexts. In this paper, we introduce Table-R1, a three-stage RL framework that enhances multimodal table understanding through: (1) Warm-up that prompts initial perception and reasoning capabilities, (2) Perception Alignment GRPO (PA-GRPO), which employs continuous Tree-Edit-Distance Similarity (TEDS) rewards for recognizing table structures and contents, and (3) Hint-Completion GRPO (HC-GRPO), which utilizes fine-grained rewards of residual steps based on the hint-guided question. Extensive experiments demonstrate that Table-R1 substantially boosts the model’s table reasoning performance on both held-in and held-out datasets, outperforming SFT and GRPO by a large margin. Notably, Qwen2-VL-7B with Table-R1 surpasses larger specific table understanding models (e.g., Table-LLaVA 13B), even achieving comparable performance to the closed-source model GPT-4o on held-in datasets, demonstrating the efficacy of each stage of Table-R1 in overcoming initialization bottlenecks and reward sparsity, thereby advancing robust multimodal table understanding.
pdf
bib
abs
MoMoE: Mixture of Moderation Experts Framework for AI-Assisted Online Governance
Agam Goyal
|
Xianyang Zhan
|
Yilun Chen
|
Koustuv Saha
|
Eshwar Chandrasekharan
Large language models (LLMs) have shown great potential in flagging harmful content in online communities. Yet, existing approaches for moderation require a separate model for every community and are opaque in their decision-making, limiting real-world adoption. We introduce Mixture of Moderation Experts (MoMoE), a modular, cross-community framework that adds post-hoc explanations to enable scalable content moderation. MoMoE orchestrates four operators—Allocate, Predict, Aggregate, Explain—and is instantiated as seven community-specialized experts (MoMoE-Community) and five norm-violation experts (MoMoE-NormVio). On 30 unseen subreddits, the best variants obtain Micro-F1 scores of 0.72 and 0.67, respectively, matching or surpassing strong fine-tuned baselines while consistently producing concise and reliable explanations. Although community-specialized experts deliver the highest peak accuracy, norm-violation experts provide steadier performance across domains. These findings show that MoMoE yields scalable, transparent moderation without needing per-community fine-tuning. More broadly, they suggest that lightweight, explainable expert ensembles can guide future NLP and HCI research on trustworthy human-AI governance of online communities.
pdf
bib
abs
Following the Autoregressive Nature of LLM Embeddings via Compression and Alignment
Jingcheng Deng
|
Zhongtao Jiang
|
Liang Pang
|
Zihao Wei
|
Liwei Chen
|
Kun Xu
|
Yang Song
|
Huawei Shen
|
Xueqi Cheng
A new trend uses LLMs as dense text encoders via contrastive learning. However, since LLM embeddings predict the probability distribution of the next token, they are inherently generative and distributive, conflicting with contrastive learning, which requires embeddings to capture full-text semantics and align via cosine similarity. This discrepancy hinders the full utilization of LLMs’ pre-training capabilities, resulting in inefficient learning. In response to this issue, we propose AutoRegEmbed, a new contrastive learning method built on embedding conditional probability distributions, which integrates two core tasks: information compression and conditional distribution alignment. The information compression task encodes text into the embedding space, ensuring that the embedding vectors capture global semantics. The conditional distribution alignment task focuses on aligning text embeddings with positive sample embeddings by leveraging the conditional distribution of embeddings while simultaneously reducing the likelihood of generating negative samples from text embeddings, thereby achieving embedding alignment and uniformity. Experimental results demonstrate that our method significantly outperforms traditional contrastive learning approaches and achieves performance comparable to state-of-the-art models when using the same amount of data.
pdf
bib
abs
Evaluating LLM-Generated Diagrams as Graphs
Chumeng Liang
|
Jiaxuan You
Diagrams play a central role in research papers for conveying ideas, yet they are often notoriously complex and labor-intensive to create. Although diagrams are presented as images, standard image generative models struggle to produce clear diagrams with well-defined structure. We argue that a promising direction is to generate demonstration diagrams directly in textual form as SVGs, which can leverage recent advances in large language models (LLMs). However, due to the complexity of components and the multimodal nature of diagrams, sufficiently discriminative and explainable metrics for evaluating the quality of LLM-generated diagrams remain lacking. In this paper, we propose DiagramEval, a novel evaluation metric designed to assess demonstration diagrams generated by LLMs. Specifically, DiagramEval conceptualizes diagrams as graphs, treating text elements as nodes and their connections as directed edges, and evaluates diagram quality using two new groups of metrics: node alignment and path alignment. For the first time, we effectively evaluate diagrams produced by state-of-the-art LLMs on recent research literature, quantitatively demonstrating the validity of our metrics. Furthermore, we show how the enhanced explainability of our proposed metrics offers valuable insights into the characteristics of LLM-generated diagrams.
pdf
bib
abs
Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders
Agam Goyal
|
Vedant Rathi
|
William Yeh
|
Yian Wang
|
Yuen Chen
|
Hari Sundaram
Large language models (LLMs) are now ubiquitous in user-facing applications, yet they still generate undesirable toxic outputs, including profanity, vulgarity, and derogatory remarks. Although numerous detoxification methods exist, most apply broad, surface-level fixes and can therefore easily be circumvented by jailbreak attacks. In this paper we leverage sparse autoencoders (SAEs) to identify toxicity-related directions in the residual stream of models and perform targeted activation steering using the corresponding decoder vectors. We introduce three tiers of steering aggressiveness and evaluate them on GPT-2 Small and Gemma-2-2B, revealing trade-offs between toxicity reduction and language fluency. At stronger steering strengths, these causal interventions surpass competitive baselines in reducing toxicity by up to 20%, though fluency can degrade noticeably on GPT-2 Small depending on the aggressiveness. Crucially, standard NLP benchmark scores upon steering remain stable, indicating that the model’s knowledge and general abilities are preserved. We further show that feature-splitting in wider SAEs hampers safety interventions, underscoring the importance of disentangled feature learning. Our findings highlight both the promise and the current limitations of SAE-based causal interventions for LLM detoxification, further suggesting practical guidelines for safer language-model deployment.
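Below is a rough sketch of the decoder-vector steering this abstract describes: a forward hook subtracts a scaled toxicity direction from the residual stream of one transformer block. The layer index, the direction (random here), and the strength alpha are placeholders; in the paper the direction comes from a trained sparse autoencoder's decoder, not from random noise.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # GPT-2 Small, one of the models evaluated in the paper
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx, alpha = 6, 4.0                      # illustrative choices
d_model = model.config.n_embd
toxicity_direction = torch.randn(d_model)      # placeholder; use an SAE decoder vector
toxicity_direction = toxicity_direction / toxicity_direction.norm()

def steer_hook(module, inputs, output):
    # Subtract the (scaled) toxicity direction from the block's hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden - alpha * toxicity_direction
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[layer_idx].register_forward_hook(steer_hook)
ids = tok("The weather today is", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=20)[0]))
handle.remove()
```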
pdf
bib
abs
VCSearch: Bridging the Gap Between Well-Defined and Ill-Defined Problems in Mathematical Reasoning
Shi-Yu Tian
|
Zhi Zhou
|
Kun-Yang Yu
|
Ming Yang
|
Lin-Han Jia
|
Lan-Zhe Guo
|
Yu-Feng Li
Large language models (LLMs) have demonstrated impressive performance on reasoning tasks, including mathematical reasoning. However, the current evaluation mostly focuses on carefully constructed benchmarks and neglects the consideration of real-world reasoning problems that present missing or contradictory conditions, known as ill-defined problems. To further study this problem, we develop a large-scale benchmark called Problems with Missing and Contradictory conditions (PMC) containing over 5,000 validated ill-defined mathematical problems. Our preliminary experiments through PMC reveal two challenges about existing methods: (1) traditional methods exhibit a trade-off between solving accuracy and rejection capabilities, and (2) formal methods struggle with modeling complex problems. To address these challenges, we develop Variable-Constraint Search (VCSearch), a training-free framework that leverages formal language to detect ill-defined problems, where a variable-constraint pair search strategy is incorporated to improve the modeling capability of formal language. Extensive experiments demonstrate that VCSearch improves the accuracy of identifying unsolvable problems by at least 12% across different LLMs, thus achieving more robust mathematical reasoning ability.
pdf
bib
abs
How do autoregressive transformers solve full addition?
Wang Peixu
|
Chen Yu
|
Yu Ming
|
Cheng Xiang
Large pre-trained language models have demonstrated impressive capabilities, but there is still much to learn about how they operate. In this study, we conduct an investigation of the autoregressive transformer’s ability to perform basic addition operations. Specifically, by using causal analysis we found that a few different attention heads in the middle layers control the addition carry, with each head processing carries of different lengths. Due to the lack of global focus on the sequence within these attention heads, the model struggles to handle long-sequence addition tasks. By performing inference intervention on Mistral-7B, partial task performance can be restored, raising the accuracy on 20-digit long-sequence additions from 2% to 38%. Through fine-tuning, a new mechanism branches out for handling complex cases, yet it still faces challenges with length generalization. Our research reveals how the models perform basic arithmetic tasks, and further provides insights into the debate on whether these models are merely statistical.
pdf
bib
abs
MAIN: Mutual Alignment Is Necessary for instruction tuning
Fanyi Yang
|
Jianfeng Liu
|
Xin Zhang
|
Haoyu Liu
|
Xixin Cao
|
Yuefeng Zhan
|
Hao Sun
|
Weiwei Deng
|
Feng Sun
|
Qi Zhang
Instruction tuning has empowered large language models (LLMs) to achieve remarkable performance, yet its success heavily depends on the availability of large-scale, high-quality instruction-response pairs. To meet this demand, various methods have been developed to synthesize data at scale. However, current methods for scaling up data generation often overlook a crucial aspect: the alignment between instructions and responses. We hypothesize that the quality of instruction-response pairs is determined not by the individual quality of each component, but by the degree of mutual alignment. To address this, we propose a Mutual Alignment Framework (MAIN) which enforces coherence between instructions and responses through mutual constraints. We demonstrate that MAIN generalizes well across model architectures and sizes, achieving state-of-the-art performance on LLaMA, Mistral, and Qwen models across diverse benchmarks. This work underscores the critical role of instruction-response alignment in enabling generalizable and high-quality instruction tuning for LLMs. All code is available from our repository.
pdf
bib
abs
Expanding before Inferring: Enhancing Factuality in Large Language Models through Premature Layers Interpolation
Dingwei Chen
|
Ziqiang Liu
|
Feiteng Fang
|
Chak Tou Leong
|
Shiwen Ni
|
Ahmadreza Argha
|
Hamid Alinejad-Rokny
|
Min Yang
|
Chengming Li
Large Language Models (LLMs) demonstrate remarkable capabilities in text understanding and generation. However, their tendency to produce factually inconsistent outputs—commonly referred to as “hallucinations”—remains a critical challenge. Existing approaches, such as retrieval-based and inference-time correction methods, primarily address this issue at the input or output level, often overlooking the intrinsic information refinement process and the role of premature layers. Meanwhile, alignment- and fine-tuning-based methods are resource-intensive. In this paper, we propose **PLI** (**P**remature **L**ayers **I**nterpolation), a novel, training-free, and plug-and-play intervention designed to enhance factuality. PLI mitigates hallucinations by inserting premature layers formed through mathematical interpolation with adjacent layers. Inspired by stable diffusion and sampling steps, PLI extends the depth of information processing and transmission in LLMs, improving factual coherence. Experiments on four publicly available datasets demonstrate that PLI effectively reduces hallucinations while outperforming existing baselines in most cases. Further analysis suggests that the success of layer interpolation is closely linked to LLMs’ internal mechanisms. To promote reproducibility, we will release our code and data upon acceptance.
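The core operation described above, forming a new layer by interpolating the parameters of two adjacent layers and inserting it between them, can be sketched on a toy stack of linear layers as below. The interpolation coefficient and insertion point are illustrative choices; PLI's actual selection of premature layers follows the paper, not this sketch.

```python
import copy
import torch
import torch.nn as nn

def interpolate_layers(layer_a: nn.Module, layer_b: nn.Module, lam: float = 0.5) -> nn.Module:
    """Build a new layer whose parameters are (1 - lam) * A + lam * B."""
    new_layer = copy.deepcopy(layer_a)
    state = {
        k: (1 - lam) * layer_a.state_dict()[k] + lam * layer_b.state_dict()[k]
        for k in layer_a.state_dict()
    }
    new_layer.load_state_dict(state)
    return new_layer

# Toy "model": a stack of 4 identically shaped blocks standing in for transformer layers.
layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])
insert_at = 2  # insert between layers 1 and 2 (0-indexed), an illustrative choice
virtual = interpolate_layers(layers[insert_at - 1], layers[insert_at])
expanded = nn.ModuleList(list(layers[:insert_at]) + [virtual] + list(layers[insert_at:]))

x = torch.randn(1, 16)
for block in expanded:   # forward pass now passes through the interpolated layer too
    x = block(x)
print(x.shape)
```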
pdf
bib
abs
DeepWell-Adol: A Scalable Expert-Based Dialogue Corpus for Adolescent Positive Mental Health and Wellbeing Promotion
Wenyu Qiu
|
Yuxiong Wang
|
Jiajun Tan
|
Hanchao Hou
|
Qinda Liu
|
Wei Yao
|
Shiguang Ni
Promoting positive mental health and well-being, especially in adolescents, is a critical yet underexplored area in natural language processing (NLP). Most existing NLP research focuses on clinical therapy or psychological counseling for the general population, which does not adequately address the preventative and growth-oriented needs of adolescents. In this paper, we introduce DeepWell-Adol, a domain-specific Chinese dialogue corpus grounded in positive psychology and coaching, designed to foster adolescents’ positive mental health and well-being. To balance the trade-offs between data quality, quantity, and scenario diversity, the corpus comprises two main components: human expert-written seed data (ensuring professional quality) and its mirrored expansion (automatically generated using a two-stage scenario-based augmentation framework). This approach enables large-scale data creation while maintaining domain relevance and reliability. Comprehensive evaluations demonstrate that the corpus meets general standards for psychological dialogue and emotional support, while also showing superior performance across multiple models in promoting positive psychological processes, character strengths, interpersonal relationships, and healthy behaviors. Moreover, the framework proposed for building and evaluating DeepWell-Adol offers a flexible and scalable method for developing domain-specific datasets. It significantly enhances automation and reduces development costs without compromising professional standards—an essential consideration in sensitive areas like adolescent and elderly mental health. We make our dataset publicly available.
pdf
bib
abs
Data to Defense: The Role of Curation in Aligning Large Language Models Against Safety Compromise
Xiaoqun Liu
|
Jiacheng Liang
|
Luoxi Tang
|
Muchao Ye
|
Weicheng Ma
|
Zhaohan Xi
Large language models (LLMs) are widely adapted for downstream applications through fine-tuning, a process named customization. However, recent studies have identified a vulnerability during this process, where malicious samples can compromise the robustness of LLMs and amplify harmful behaviors. To address this challenge, we propose an adaptive data curation approach allowing any text to be curated to enhance its effectiveness in counteracting harmful samples during customization. To avoid the need for additional defensive modules, we further introduce a comprehensive mitigation framework spanning the lifecycle of the customization process: before customization to immunize LLMs against future compromise attempts, during customization to neutralize risks, and after customization to restore compromised models. Experimental results demonstrate a significant reduction in compromising effects, achieving up to a 100% success rate in generating safe responses. By combining adaptive data curation with lifecycle-based mitigation strategies, this work represents a solid step forward in mitigating compromising risks and ensuring the secure adaptation of LLMs.
pdf
bib
abs
Speculative Safety-Aware Decoding
Xuekang Wang
|
Shengyu Zhu
|
Xueqi Cheng
Despite extensive efforts to align large language models (LLMs) with human values and safety rules, jailbreak attacks that exploit certain vulnerabilities continuously emerge, highlighting the need to strengthen existing LLMs with additional safety properties to defend against these attacks. However, tuning large models has become increasingly resource-intensive and may have difficulty ensuring consistent performance. We introduce Speculative Safety-Aware Decoding (SSD), a lightweight decoding-time approach that equips LLMs with the desired safety property while accelerating inference. We assume that there exists a small language model that possesses the desired safety property. SSD integrates speculative sampling during decoding and leverages the match ratio between the small and composite models to quantify jailbreak risks. This enables SSD to dynamically switch between decoding schemes to prioritize utility or safety, to handle the challenge of different model capacities. The output token is then sampled from a new distribution that combines the distributions of both models. Experimental results show that SSD successfully equips the large model with the desired safety property, and also allows the model to remain helpful to benign queries. Furthermore, SSD accelerates the inference time, thanks to the speculative sampling design.
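One way to picture the mechanism described above: draft tokens with the small safety-aligned model, measure how often the large model would accept them (the match ratio), and weight the safe model's next-token distribution more heavily when the match ratio is low, a sign of a risky query. The blending rule, threshold, and toy acceptance test below are simplifying assumptions, not SSD's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(dist: np.ndarray) -> int:
    return int(rng.choice(len(dist), p=dist))

def ssd_step(p_large: np.ndarray, p_small: np.ndarray, match_ratio: float,
             threshold: float = 0.6) -> int:
    """Pick the next token from a mixture of the two models' distributions,
    weighting the safe small model more when drafts rarely match."""
    w_small = 1.0 if match_ratio < threshold else 0.2   # illustrative switch
    mixture = w_small * p_small + (1.0 - w_small) * p_large
    mixture /= mixture.sum()
    return sample(mixture)

vocab = 50
p_large = rng.dirichlet(np.ones(vocab))      # stand-in next-token distributions
p_small = rng.dirichlet(np.ones(vocab))
draft = [sample(p_small) for _ in range(8)]                  # speculative drafts
accepted = [t for t in draft if rng.random() < p_large[t]]   # toy acceptance test
match_ratio = len(accepted) / len(draft)
next_token = ssd_step(p_large, p_small, match_ratio)
```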
pdf
bib
abs
PanicToCalm: A Proactive Counseling Agent for Panic Attacks
Jihyun Lee
|
Yejin Min
|
San Kim
|
Yejin Jeon
|
Sung Jun Yang
|
Hyounghun Kim
|
Gary Lee
Panic attacks are acute episodes of fear and distress, in which timely, appropriate intervention can significantly help individuals regain stability. However, suitable datasets for training such models remain scarce due to ethical and logistical issues. To address this, we introduce Pace, which is a dataset that includes high-distress episodes constructed from first-person narratives, and structured around the principles of Psychological First Aid (PFA). Using this data, we train Pacer, a counseling model designed to provide both empathetic and directive support, which is optimized through supervised learning and simulated preference alignment. To assess its effectiveness, we propose PanicEval, a multi-dimensional framework covering general counseling quality and crisis-specific strategies. Experimental results show that Pacer outperforms strong baselines in both counselor-side metrics and client affect improvement. Human evaluations further confirm its practical value, with Pacer consistently preferred over general, CBT-based, and GPT-4-powered models in panic scenarios.
pdf
bib
abs
CoPL: Collaborative Preference Learning for Personalizing LLMs
Youngbin Choi
|
Seunghyuk Cho
|
Minjong Lee
|
MoonJeong Park
|
Yesong Ko
|
Jungseul Ok
|
Dongwoo Kim
Personalizing large language models (LLMs) is important for aligning outputs with diverse user preferences, yet existing methods struggle with flexibility and generalization. We propose CoPL (Collaborative Preference Learning), a graph-based collaborative filtering framework that models user-response relationships to enhance preference estimation, particularly in sparse annotation settings. By integrating a mixture of LoRA experts, CoPL efficiently fine-tunes LLMs while dynamically balancing shared and user-specific preferences. Additionally, an optimization-free adaptation strategy enables generalization to unseen users without fine-tuning. Experiments on TL;DR, UltraFeedback-P, and PersonalLLM datasets demonstrate that CoPL outperforms existing personalized reward models, effectively capturing both common and controversial preferences, making it a scalable solution for personalized LLM alignment.
pdf
bib
abs
Dynamic Collaboration of Multi-Language Models based on Minimal Complete Semantic Units
Chao Hao
|
Zezheng Wang
|
Yanhua Huang
|
Ruiwen Xu
|
Wenzhe Niu
|
Xin Liu
|
Zitong Yu
This paper investigates the enhancement of reasoning capabilities in language models through token-level multi-model collaboration. Our approach selects the optimal tokens from the next token distributions provided by multiple models to perform autoregressive reasoning. Contrary to the assumption that more models yield better results, we introduce a distribution distance-based dynamic selection strategy (DDS) to optimize the multi-model collaboration process. To address the critical challenge of vocabulary misalignment in multi-model collaboration, we propose the concept of minimal complete semantic units (MCSU), which is simple yet enables multiple language models to achieve natural alignment within the linguistic space. Experimental results across various benchmarks demonstrate the superiority of our method. The code will be released soon.
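A toy version of distribution distance-based dynamic selection: at each step, measure how far each model's next-token distribution lies from the consensus, keep only the closest models, and pick the token from their averaged distribution. The Jensen-Shannon distance, the keep-k rule, and the argmax decision below are illustrative assumptions, not the paper's DDS algorithm.

```python
import numpy as np

def js_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def dds_select(dists: list[np.ndarray], keep: int = 2) -> int:
    """Keep the `keep` models closest (in JS divergence) to the mean distribution,
    then pick the argmax token of their average."""
    mean = np.mean(dists, axis=0)
    scores = [js_divergence(d, mean) for d in dists]
    chosen = np.argsort(scores)[:keep]
    avg = np.mean([dists[i] for i in chosen], axis=0)
    return int(np.argmax(avg))

rng = np.random.default_rng(1)
vocab = 100
model_dists = [rng.dirichlet(np.ones(vocab)) for _ in range(3)]  # three collaborating models
print(dds_select(model_dists))
```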
pdf
bib
abs
AI Chatbots as Professional Service Agents: Developing a Professional Identity
Wenwen Li
|
Kangwei Shi
|
Yidong Chai
With the rapid expansion of large language model (LLM) applications, there is an emerging shift in the role of LLM-based AI chatbots from serving merely as general inquiry tools to acting as professional service agents. However, current studies often overlook a critical aspect of professional service agents: the act of communicating in a manner consistent with their professional identities. This is of particular importance in the healthcare sector, where effective communication with patients is essential for achieving professional goals, such as promoting patient well-being by encouraging healthy behaviors. To bridge this gap, we propose LAPI (LLM-based Agent with a Professional Identity), a novel framework for designing professional service agents tailored for medical question-and-answer (Q&A) services, ensuring alignment with a specific professional identity. Our method includes a theory-guided task planning process that decomposes complex professional tasks into manageable subtasks aligned with professional objectives and a pragmatic entropy method designed to generate professional and ethical responses with low uncertainty. Experiments on various LLMs show that the proposed approach outperforms baseline methods, including few-shot prompting and chain-of-thought prompting, across key metrics such as fluency, naturalness, empathy, patient-centricity, and ROUGE-L scores. Additionally, the ablation study underscores the contribution of each component to the overall effectiveness of the approach.
pdf
bib
abs
DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning
Zhuoyuan Mao
|
Mengjie Zhao
|
Qiyu Wu
|
Hiromi Wakaki
|
Yuki Mitsufuji
Recent advancements in music large language models (LLMs) have significantly improved music understanding tasks, which involve the model’s ability to analyze and interpret various musical elements. These improvements primarily focused on integrating both music and text inputs. However, the potential of incorporating additional modalities such as images, videos and textual music features to enhance music understanding remains unexplored. To bridge this gap, we propose DeepResonance, a multimodal music understanding LLM fine-tuned via multi-way instruction tuning with multi-way aligned music, text, image, and video data. To this end, we construct Music4way-MI2T, Music4way-MV2T, and Music4way-Any2T, three 4-way training and evaluation datasets designed to enable DeepResonance to integrate both visual and textual music feature content. We also introduce multi-sampled ImageBind embeddings and a pre-LLM fusion Transformer to enhance modality fusion prior to input into text LLMs, tailoring for multi-way instruction tuning. Our model achieves state-of-the-art performances across six music understanding tasks, highlighting the benefits of the auxiliary modalities and the structural superiority of DeepResonance. We open-source the codes, models and datasets we constructed: https://github.com/sony/DeepResonance.
pdf
bib
abs
Advancing Oversight Reasoning across Languages for Audit Sycophantic Behaviour via X-Agent
Giulia Pucci
|
Leonardo Ranaldi
Large language models (LLMs) have demonstrated capabilities that are highly satisfactory to a wide range of users by adapting to their culture and wisdom. Yet, this can translate into a propensity to produce responses that align with users’ viewpoints, even when the latter are wrong. This behaviour is known as sycophancy, the tendency of LLMs to generate misleading responses as long as they align with the user’s, inducing bias and reducing reliability. To make interactions consistent, reliable and safe, we introduce X-Agent, an Oversight Reasoning framework that audits human–LLM dialogues, reasons about them, captures sycophancy and corrects the final outputs. Concretely, X-Agent extends debate-based frameworks by (i) auditing human-LLM conversations, (ii) applying a defence layer that steers model behaviour and goes beyond user beliefs, and (iii) extracting reasoning traces from evaluations that serve as training signals for mitigating sycophancy. We evaluate X-Agent across diverse scenarios and languages, showing that it consistently detects sycophancy, reduces unwarranted agreement, and improves cross-turn consistency, advancing a reasoning-as-overview paradigm for safer LLM interaction. Our approach introduces a novel paradigm in which reasoning is not merely a means to solve problems, but as a mechanism for overseeing the problem-solving processes of other models.
pdf
bib
abs
CAFE: Retrieval Head-based Coarse-to-Fine Information Seeking to Enhance Multi-Document QA Capability
Han Peng
|
Jinhao Jiang
|
Zican Dong
|
Xin Zhao
|
Lei Fang
Advancements in Large Language Models (LLMs) have extended their input context length, yet they still struggle with retrieval and reasoning in long-context inputs. Existing methods propose to utilize the prompt strategy and Retrieval-Augmented Generation (RAG) to alleviate this limitation. However, they still face challenges in balancing retrieval precision and recall, impacting their efficacy in answering questions. To address this, we introduce **CAFE**, a two-stage coarse-to-fine method to enhance multi-document question-answering capacities. By gradually eliminating the negative impacts of background and distracting documents, CAFE makes the responses more reliant on the evidence documents. Initially, a coarse-grained filtering method leverages retrieval heads to identify and rank relevant documents. Then, a fine-grained steering method guides attention to the most relevant content. Experiments across benchmarks show that CAFE outperforms baselines, achieving an average SubEM improvement of up to 22.1% and 13.7% over SFT and RAG methods, respectively, across three different models. Our code is available at https://github.com/RUCAIBox/CAFE.
pdf
bib
abs
SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages?
Senyu Li
|
Jiayi Wang
|
Felermino D. M. A. Ali
|
Colin Cherry
|
Daniel Deutsch
|
Eleftheria Briakou
|
Rui Sousa-Silva
|
Henrique Lopes Cardoso
|
Pontus Stenetorp
|
David Ifeoluwa Adelani
Evaluating machine translation (MT) quality for under-resourced African languages remains a significant challenge, as existing metrics often suffer from limited language coverage and poor performance in low-resource settings. While recent efforts, such as AfriCOMET, have addressed some of the issues, they are still constrained by small evaluation sets, a lack of publicly available training data tailored to African languages, and inconsistent performance in extremely low-resource scenarios. In this work, we introduce SSA-MTE, a large-scale human-annotated MT evaluation (MTE) dataset covering 13 African language pairs from the News domain, with over 63,000 sentence-level annotations from a diverse set of MT systems. Based on this data, we develop SSA-COMET and SSA-COMET-QE, improved reference-based and reference-free evaluation metrics. We also benchmark prompting-based approaches using state-of-the-art LLMs like GPT-4o and Claude. Our experimental results show that SSA-COMET models significantly outperform AfriCOMET and are competitive with the strongest LLM (Gemini 2.5 Pro) evaluated in our study, particularly on low-resource languages such as Twi, Luo, and Yoruba. All resources are released under open licenses to support future research.
pdf
bib
abs
FaithUn: Toward Faithful Forgetting in Language Models by Investigating the Interconnectedness of Knowledge
Nakyeong Yang
|
Minsung Kim
|
Seunghyun Yoon
|
Joongbo Shin
|
Kyomin Jung
Various studies have attempted to remove sensitive or private knowledge from a language model to prevent its unauthorized exposure. However, prior studies have overlooked the inherent complexity and interconnectedness of knowledge, which requires careful examination. To resolve this problem, we first define a new concept called superficial unlearning, which refers to the phenomenon where an unlearning method either fails to erase the interconnected knowledge it should remove or unintentionally erases irrelevant knowledge. Based on the definition, we introduce a novel benchmark, FaithUn, to analyze and evaluate the faithfulness of unlearning in real-world knowledge QA settings. Furthermore, we propose a novel unlearning method, KLUE, which updates only knowledge-related neurons to achieve faithful unlearning. KLUE leverages a regularized explainability method to localize contextual knowledge neurons, updating only these neurons using carefully selected unforgotten samples. Experimental results demonstrate that existing unlearning methods fail to ensure faithful unlearning, while our method shows significant effectiveness in real-world QA unlearning.
pdf
bib
abs
Calibrating Pseudo-Labeling with Class Distribution for Semi-supervised Text Classification
Weiyi Yang
|
Richong Zhang
|
Junfan Chen
|
Jiawei Sheng
Semi-supervised text classification (SSTC) aims to train text classification models with few labeled data and massive unlabeled data. Existing studies develop effective pseudo-labeling methods, but they struggle when the unlabeled data have imbalanced classes mismatched with the labeled data, which biases pseudo-labeling towards majority classes and results in catastrophic error propagation. We believe it is crucial to explicitly estimate the overall class distribution and use it to calibrate pseudo-labeling so that majority classes are constrained. To this end, we formulate pseudo-labeling as an optimal transport (OT) problem, which transports the unlabeled sample distribution to the class distribution. With a memory bank, we dynamically collect both high-confidence pseudo-labeled data and true labeled data, thus deriving reliable (pseudo-)labels for class distribution estimation. Empirical results on 3 commonly used benchmarks demonstrate that our model is effective and outperforms previous state-of-the-art methods.
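A minimal sketch of the calibration idea, assuming classifier probabilities are already available: a Sinkhorn-style optimal transport step moves the uniform distribution over unlabeled samples onto an estimated class distribution, and the rows of the resulting plan act as calibrated pseudo-labels. The cost function, entropic regularization, and iteration count below are illustrative choices, not the authors' exact formulation.

```python
import numpy as np

def ot_pseudo_labels(probs: np.ndarray, class_dist: np.ndarray,
                     eps: float = 0.05, n_iter: int = 100) -> np.ndarray:
    """Sinkhorn iteration transporting the unlabeled-sample distribution
    onto an estimated class distribution; each row of the returned plan
    is a calibrated pseudo-label assignment."""
    n, _ = probs.shape
    cost = -np.log(probs + 1e-12)          # illustrative cost: negative log-probability
    K = np.exp(-cost / eps)                # Gibbs kernel
    r = np.full(n, 1.0 / n)                # uniform mass over unlabeled samples
    c = class_dist / class_dist.sum()      # target class marginal
    u = np.ones(n)
    for _ in range(n_iter):
        u = r / (K @ (c / (K.T @ u)))
    v = c / (K.T @ u)
    plan = (u[:, None] * K) * v[None, :]
    return plan / plan.sum(axis=1, keepdims=True)  # row-normalized pseudo-labels

# Toy example: 6 unlabeled samples, 3 classes, imbalanced estimated distribution.
rng = np.random.default_rng(1)
probs = rng.dirichlet(np.ones(3), size=6)
calibrated = ot_pseudo_labels(probs, class_dist=np.array([0.6, 0.3, 0.1]))
print(calibrated.argmax(axis=1))
```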
pdf
bib
abs
Coarse-to-Fine Grounded Memory for LLM Agent Planning
Wei Yang
|
Jinwei Xiao
|
Hongming Zhang
|
Qingyang Zhang
|
Yanna Wang
|
Bo Xu
Recent advancements in Large Language Models (LLMs) have driven growing interest in LLM-based agents for complex planning tasks. To avoid costly agent training, many studies adopt a memory mechanism that enhances the LLM with offline experiences or online trajectory analysis. However, existing works focus on single-granularity memory derived from dynamic environmental interactions, which is inherently constrained by the quality of the collected experiences. This limitation, in turn, constrains the diversity of knowledge and the flexibility of planning. We propose Coarse-to-Fine Grounded Memory (CFGM), a novel framework that grounds coarse-to-fine memories with an LLM, thereby fully leveraging them for flexible adaptation to diverse scenarios. CFGM grounds environmental information into coarse-grained focus points to guide experience collection in training tasks, followed by grounding of actionable hybrid-grained tips from each experience. At inference, CFGM retrieves task-relevant experiences and tips to support planning. When facing environmental anomalies, the LLM grounds the current situation into fine-grained key information, enabling flexible self-QA reflection and plan correction. Extensive experiments on AlfWorld, Webshop and ScienceWorld demonstrate that CFGM significantly outperforms competitive baselines and comprehensively optimizes the memory-enhanced LLM agent system.
pdf
bib
abs
From A and B to A+B: Can Large Language Models Solve Compositional Math Problems?
Xisheng Xiao
|
Hanlin Zhao
Large language models (LLMs) have demonstrated strong performance in solving math problems, and there is growing research on evaluating their robustness. Unlike previous studies that create problem variants by adding perturbations to a single problem, this paper focuses on the interaction between problems. Specifically, we combine two original problems with a logical connection to obtain a new math problem, and measure the LLMs’ performance on it to evaluate their compositional generalization, an important and essential reasoning capability in human intelligence. Results of experiments covering 14 different LLMs show that even when the mathematical essence remains unchanged, a simple form of combination can significantly reduce the performance of LLMs, revealing the limitation of their generalization ability. Additionally, we propose an automated pipeline with 98.2% accuracy to assist in annotating datasets (1 manual, 2 synthetic). The extensive experiments conducted on these datasets further verify this conclusion and yield several important findings. Finally, we analyze the impact of factors such as difficulty and length on LLMs’ performance, offering insights for future research.
pdf
bib
abs
Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories
Mohammad Beigi
|
Ying Shen
|
Parshin Shojaee
|
Qifan Wang
|
Zichao Wang
|
Chandan K. Reddy
|
Ming Jin
|
Lifu Huang
Despite the remarkable capabilities of large language models, current training paradigms inadvertently foster sycophancy—alignment with user-provided information, regardless of factual accuracy. In this paper, we introduce SMART (Sycophancy Mitigation through Adaptive Reasoning Trajectories), reconceptualizing sycophancy as a reasoning optimization problem rather than an output alignment issue. SMART employs a two-stage approach: (1) Uncertainty-Aware Adaptive Monte Carlo Tree Search (UA-MCTS), which dynamically adjusts exploration based on state-level uncertainty; and (2) progress-based reinforcement learning that distills these improved reasoning patterns into model adaptation. Through extensive experiments, we show that SMART significantly outperforms existing baselines in reducing sycophancy while maintaining performance on out-of-distribution inputs. These findings demonstrate the importance of optimizing internal reasoning processes for developing aligned, truthful AI assistants.
pdf
bib
abs
SimVBG: Simulating Individual Values by Backstory Generation
Bangde Du
|
Ziyi Ye
|
Zhijing Wu
|
Monika A. Jankowska
|
Shuqi Zhu
|
Qingyao Ai
|
Yujia Zhou
|
Yiqun Liu
As Large Language Models (LLMs) demonstrate increasingly strong human-like capabilities, the need to align them with human values has become significant. Recent advanced techniques, such as prompt learning and reinforcement learning, are being employed to bring LLMs closer to aligning with human values. While these techniques address broad ethical and helpfulness concerns, they rarely consider simulating individualized human values. To bridge this gap, we propose SimVBG, a framework that simulates individual values based on individual backstories reflecting past experiences and demographic information. SimVBG transforms structured data about an individual into a backstory and utilizes a multi-module architecture inspired by the Cognitive–Affective Personality System to simulate individual values based on the backstories. We test SimVBG on a self-constructed benchmark derived from the World Values Survey and show that SimVBG improves top-1 accuracy by more than 10% over the retrieval-augmented generation method. Further analysis shows that performance increases as additional user interaction history becomes available, indicating that the model can refine its persona over time. Code, dataset, and complete experimental results are available at https://github.com/bangdedadi/SimVBG.
pdf
bib
abs
EvolveSearch: An Iterative Self-Evolving Search Agent
Ding-Chu Zhang
|
Yida Zhao
|
Jialong Wu
|
Liwen Zhang
|
Baixuan Li
|
Wenbiao Yin
|
Yong Jiang
|
Yu-Feng Li
|
Kewei Tu
|
Pengjun Xie
|
Fei Huang
The rapid advancement of large language models (LLMs) has transformed the landscape of agentic information-seeking capabilities through the integration of tools such as search engines and web browsers. However, current mainstream approaches for enabling LLM web search proficiency face significant challenges: supervised fine-tuning struggles with data production in open-search domains, while RL converges quickly, limiting its data utilization efficiency. To address these issues, we propose EvolveSearch, a novel iterative self-evolution framework that combines SFT and RL to enhance agentic web search capabilities without any external human-annotated reasoning data. Extensive experiments on seven multi-hop question-answering (MHQA) benchmarks demonstrate that EvolveSearch consistently improves performance across iterations, ultimately achieving an average improvement of 4.7% over the current state-of-the-art across seven benchmarks, opening the door to self-evolving agentic capabilities in open web search domains.
pdf
bib
abs
Syntax-Aware Retrieval Augmentation for Neural Symbolic Regression
Canmiao Zhou
|
Han Huang
Symbolic regression is a powerful technique for discovering mathematical expressions that best fit observed data. While neural symbolic regression methods based on large-scale pre-trained models perform well on simple tasks, the reliance on fixed parametric knowledge typically limits their generalization to complex and diverse data distributions. To address this challenge, we propose a syntax-aware retrieval-augmented mechanism that leverages the syntactic structure of symbolic expressions to perform context-aware retrieval from a pre-constructed token datastore during inference. This mechanism enables the model to incorporate highly relevant non-parametric prior information to assist in expression generation. Additionally, we design an entropy-based confidence network that dynamically adjusts the fusion strength between neural and retrieved components by estimating predictive uncertainty. Extensive experiments on multiple symbolic regression benchmarks demonstrate that the proposed method significantly outperforms representative baselines, validating the effectiveness of retrieval augmentation in enhancing the generalization performance of neural symbolic regression models.
pdf
bib
abs
Merge then Realign: Simple and Effective Modality-Incremental Continual Learning for Multimodal LLMs
Dingkun Zhang
|
Shuhan Qi
|
Xinyu Xiao
|
Kehai Chen
|
Xuan Wang
Recent advances in Multimodal Large Language Models (MLLMs) have enhanced their versatility as they integrate a growing number of modalities. Considering the heavy cost of training MLLMs, it is efficient to reuse the existing ones and extend them to more modalities through Modality-incremental Continual Learning (MCL). The exploration of MCL is in its early stages. In this work, we dive into the causes of performance degradation in MCL. We uncover that it suffers not only from forgetting as in traditional continual learning, but also from misalignment between the modality-agnostic and modality-specific components. To this end, we propose an elegantly simple MCL paradigm called “MErge then ReAlign” (MERA) to address both forgetting and misalignment. MERA avoids introducing heavy model budgets or modifying model architectures, hence is easy to deploy and highly reusable in the MLLM community. Extensive experiments demonstrate the impressive performance of MERA, holding an average of 99.84% Backward Relative Gain when extending to four modalities, achieving nearly lossless MCL performance. Our findings underscore the misalignment issue in MCL. More broadly, our work showcases how to adjust different components of MLLMs during continual learning.
pdf
bib
abs
Graceful Forgetting in Generative Language Models
Chunyang Jiang
|
Chi-Min Chan
|
Yiyang Cai
|
Yulong Liu
|
Wei Xue
|
Yike Guo
Recently, the pretrain-finetune paradigm has become a cornerstone in various deep learning areas. While a pre-trained model generally improves both the effectiveness and efficiency of downstream task fine-tuning, studies have shown that not all knowledge acquired during pre-training is beneficial. Some of the knowledge may actually bring detrimental effects to the fine-tuning tasks, a phenomenon also known as negative transfer. To address this problem, graceful forgetting has emerged as a promising approach. The core principle of graceful forgetting is to enhance the learning plasticity of the target task by selectively discarding irrelevant knowledge. However, this approach remains underexplored in the context of generative language models, and it is often challenging to migrate existing forgetting algorithms to these models due to architecture incompatibility. To bridge this gap, in this paper we propose a novel framework, Learning With Forgetting (LWF), to achieve graceful forgetting in generative language models. With the Fisher Information Matrix weighting the intended parameter updates, LWF computes forgetting confidence to evaluate self-generated knowledge regarding the forgetting task, and consequently, knowledge with high confidence is periodically unlearned during fine-tuning. Our experiments demonstrate that, although thoroughly uncovering the mechanisms of knowledge interaction remains challenging in pre-trained language models, applying graceful forgetting can contribute to enhanced fine-tuning performance.
pdf
bib
abs
Answering Narrative-Driven Recommendation Queries via a Retrieve–Rank Paradigm and the OCG-Agent
Yunxiao Shi
|
Haoning Shang
|
Xing Zi
|
Wujiang Xu
|
Yue Feng
|
Min Xu
Narrative-driven recommendation queries are common in question-answering platforms, AI search engines, social forums, and some domain-specific vertical applications. Users typically submit free-form text requests for recommendations, e.g., “Any mind-bending thrillers like Shutter Island you’d recommend?” Such queries have traditionally been addressed as a generic QA task under the RAG paradigm. This work formally introduces narrative recommendation as a distinct task and contends that the RAG paradigm is inherently ill-suited for it, owing to information loss in LLMs when retrieving information from multiple long and fragmented contexts, and to limitations in ranking effectiveness. To overcome these limitations, we propose a novel retrieve-rank paradigm and theoretically demonstrate its superiority over the RAG paradigm. Central to this new paradigm, we focus on the information retrieval stage and introduce the Open-domain Candidate Generation (OCG)-Agent, which generatively retrieves structurally adaptive and semantically aligned candidates, ensuring both extensive candidate coverage and high-quality information. We validate the effectiveness of the new paradigm and OCG-Agent’s retrieval mechanism on real-world datasets from Reddit and corporate education-consulting scenarios. Extensive ablation studies further confirm the rationality of each OCG-Agent component.
pdf
bib
abs
Direct Value Optimization: Improving Chain-of-Thought Reasoning in LLMs with Refined Values
Hongbo Zhang
|
Han Cui
|
Guangsheng Bao
|
Linyi Yang
|
Jun Wang
|
Yue Zhang
We introduce Direct Value Optimization (DVO), an innovative offline reinforcement learning framework for enhancing large language models in complex reasoning tasks. Unlike traditional methods relying on preference labels, DVO utilizes value signals at individual reasoning steps, optimizing models via a mean squared error loss. The key benefit of DVO lies in its fine-grained supervision, circumventing the need for labor-intensive human annotations. Target values within DVO are estimated using either Monte Carlo Tree Search or an outcome value model. Our empirical analysis on 3 math reasoning, 4 commonsense reasoning, and 3 coding tasks shows that DVO consistently outperforms existing offline preference optimization techniques by a significant margin of 4% to 6%, and is competitive with online GRPO while offering higher sample efficiency. These findings underscore the importance of value signals in advancing reasoning capabilities and highlight DVO as a superior methodology under scenarios lacking explicit human preference information.
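A minimal sketch of the step-level supervision described above, assuming a hypothetical value head that scores each reasoning step: per-step predictions are regressed against target values (e.g., estimated offline via Monte Carlo Tree Search) with a mean squared error loss. The tensors and numbers are illustrative, not the authors' implementation.

```python
import torch

# Hypothetical per-step values predicted by the model for one reasoning chain
# (e.g., from a value head over hidden states) and target values estimated
# offline via Monte Carlo Tree Search or an outcome value model.
predicted_values = torch.tensor([0.42, 0.55, 0.61, 0.70], requires_grad=True)
target_values    = torch.tensor([0.50, 0.65, 0.80, 1.00])

# DVO-style fine-grained supervision: a mean squared error over reasoning
# steps, requiring no human preference labels.
loss = torch.nn.functional.mse_loss(predicted_values, target_values)
loss.backward()
print(loss.item(), predicted_values.grad)
```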
pdf
bib
abs
Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility
Brendan Murphy
|
Dillon Bowen
|
Shahrad Mohammadzadeh
|
Tom Tseng
|
Julius Broomfield
|
Adam Gleave
|
Kellin Pelrine
AI systems are rapidly advancing in capability, and frontier model developers broadly acknowledge the need for safeguards against serious misuse. However, this paper demonstrates that fine-tuning, whether via open weights or closed fine-tuning APIs, can produce helpful-only models with safeguards destroyed. In contrast to prior work, which is blocked by modern moderation systems, achieves only partial removal of safeguards, or degrades output quality, our jailbreak-tuning method teaches models to generate detailed, high-quality responses to arbitrary harmful requests. For example, OpenAI, Google, and Anthropic models will fully comply with requests for CBRN assistance, executing cyberattacks, and other criminal activity. We further show that backdoors can increase not only the stealth but also the severity of attacks. Stronger jailbreak prompts become even more effective in fine-tuning attacks, linking attacks and potentially defenses in the input and weight spaces. Not only are current models vulnerable; more recent ones appear to be becoming even more vulnerable to these attacks, underscoring the urgent need for tamper-resistant safeguards. Until such safeguards are discovered, companies and policymakers should view the release of any fine-tunable model as simultaneously releasing its evil twin: equally capable as the original model, and usable for any malicious purpose within its capabilities.
pdf
bib
abs
Neural Topic Modeling via Contextual and Graph Information Fusion
Jiyuan Liu
|
Jiaxing Yan
|
Chunjiang Zhu
|
Xingyu Liu
|
Li Qing
|
Yanghui Rao
Topic modeling is a powerful unsupervised tool for knowledge discovery. However, existing work often generates limited-quality topics that are uninformative and incoherent, hindering the extraction of interpretable insights from textual data. In this paper, we improve the original variational autoencoder framework by incorporating contextual and graph information to address the above issues. First, the encoder utilizes topic fusion techniques to effectively combine contextual and bag-of-words information, and meanwhile exploits the constraints of topic alignment and topic sharpening to generate informative topics. Second, we develop a simple word co-occurrence graph information fusion strategy that efficiently increases topic coherence. On three benchmark datasets, our new framework generates more coherent and diverse topics compared to various baselines, and achieves strong performance on both automatic and manual evaluations.
pdf
bib
abs
CARE: A Disagreement Detection Framework with Concept Alignment and Reasoning Enhancement
Jiyuan Liu
|
Jielin Song
|
Yunhe Pang
|
Zhiyu Shen
|
Yanghui Rao
Disagreement detection is a crucial task in natural language processing (NLP), particularly in analyzing online discussions and social media content. Large language models (LLMs) have demonstrated significant advancements across various NLP tasks. However, the performance of LLMs in disagreement detection is limited by two issues: *conceptual gap* and *reasoning gap*. In this paper, we propose a novel two-stage framework, Concept Alignment and Reasoning Enhancement (CARE), to tackle these issues. The first stage, Concept Alignment, addresses the gap between experts and the model by performing **sub-concept taxonomy extraction**, aligning the model’s comprehension with human experts. The second stage, Reasoning Enhancement, improves the model’s reasoning capabilities by introducing a curriculum learning workflow, which includes **rationale to critique** and **counterfactual to detection** steps for reducing spurious associations. Extensive experiments on the disagreement detection task demonstrate the effectiveness of our framework, showing superior performance in zero-shot and supervised learning settings, both within and across domains.
pdf
bib
abs
Beyond Task-Oriented and Chitchat Dialogues: Proactive and Transition-Aware Conversational Agents
Yejin Yoon
|
Yuri Son
|
Namyeong So
|
Minseo Kim
|
Minsoo Cho
|
Chanhee Park
|
Seungshin Lee
|
Taeuk Kim
Conversational agents have traditionally been developed for either task-oriented dialogue (TOD) or open-ended chitchat, with limited progress in unifying the two. Yet, real-world conversations naturally involve fluid transitions between these modes. To address this gap, we introduce TACT (TOD-And-Chitchat Transition), a dataset designed for transition-aware dialogue modeling that incorporates structurally diverse and integrated mode flows. TACT supports both user- and agent-driven mode switches, enabling robust modeling of complex conversational dynamics. To evaluate an agent’s ability to initiate and recover from mode transitions, we propose two new metrics—Switch and Recovery. Models trained on TACT outperform baselines in both intent detection and mode transition handling. Moreover, applying Direct Preference Optimization (DPO) to TACT-trained models yields additional gains, achieving 75.74% joint mode-intent accuracy and a 70.1% win rate against GPT-4o in human evaluation. These results demonstrate that pairing structurally diverse data with DPO enhances response quality and transition control, paving the way for more proactive and transition-aware conversational agents.
pdf
bib
abs
LightThinker: Thinking Step-by-Step Compression
Jintian Zhang
|
Yuqi Zhu
|
Mengshu Sun
|
Yujie Luo
|
Shuofei Qiao
|
Lun Du
|
Da Zheng
|
Huajun Chen
|
Ningyu Zhang
Large language models (LLMs) have shown remarkable performance in complex reasoning tasks, but their efficiency is hindered by the substantial memory and computational costs associated with generating lengthy tokens. In this paper, we propose LightThinker, a novel method that enables LLMs to dynamically compress intermediate thoughts during reasoning. Inspired by human cognitive processes, LightThinker compresses verbose thought steps into compact representations and discards the original reasoning chains, thereby significantly reducing the number of tokens stored in the context window. This is achieved by training the model on when and how to perform compression through data construction, mapping hidden states to condensed gist tokens, and creating specialized attention masks. Additionally, we introduce the Dependency (Dep) metric to quantify the degree of compression by measuring the reliance on historical tokens during generation. Extensive experiments on four datasets and two models show that LightThinker reduces peak memory usage and inference time, while maintaining competitive accuracy. Our work provides a new direction for improving the efficiency of LLMs in complex reasoning tasks without sacrificing performance.
pdf
bib
abs
How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark
Minglai Yang
|
Ethan Huang
|
Liang Zhang
|
Mihai Surdeanu
|
William Yang Wang
|
Liangming Pan
We introduce Grade School Math with Distracting Context (GSM-DC), a synthetic benchmark to evaluate Large Language Models’ (LLMs) reasoning robustness against systematically controlled irrelevant context (IC). GSM-DC constructs symbolic reasoning graphs with precise distractor injections, enabling rigorous, reproducible evaluation. Our experiments demonstrate that LLMs are significantly sensitive to IC, affecting both reasoning path selection and arithmetic accuracy. Additionally, training models with strong distractors improves performance in both in-distribution and out-of-distribution scenarios. We further propose a stepwise tree search guided by a process reward model, which notably enhances robustness in out-of-distribution conditions.
pdf
bib
abs
Investigating Pedagogical Teacher and Student LLM Agents: Genetic Adaptation Meets Retrieval-Augmented Generation Across Learning Styles
Debdeep Sanyal
|
Agniva Maiti
|
Umakanta Maharana
|
Dhruv Kumar
|
Ankur Mali
|
C. Lee Giles
|
Murari Mandal
Effective teaching necessitates adapting pedagogical strategies to the inherent diversity of students, encompassing variations in aptitude, learning styles, and personality, a critical challenge in education and teacher training. Large Language Models (LLMs) offer a powerful tool to simulate complex classroom dynamics, providing a controlled environment for exploring optimal teaching patterns. However, existing simulation frameworks often fall short by neglecting comprehensive student modeling beyond basic knowledge states and, more importantly, by lacking mechanisms for teachers to dynamically adapt their approach based on student feedback and collective performance. Addressing these limitations, we propose a simulation framework that integrates LLM-based diverse student agents with a self-evolving teacher agent. We use genetic algorithms to automatically tune and optimize the teacher’s pedagogical parameters based on simulated student performance, enabling the teacher agent to discover and refine teaching patterns tailored to specific class characteristics. Complementing this, we introduce Persona-RAG, a novel Retrieval-Augmented Generation method specifically designed for personalized knowledge retrieval in pedagogical contexts, allowing students to retrieve information as per their learning styles. We show how Persona-RAG remains competitive with standard RAG baselines in accurately retrieving relevant information while adding a touch of personalization for students. Crucially, we perform extensive experiments and highlight the different patterns learnt by the teacher agent while optimizing over classes with students of various learning styles. Our work presents a significant step towards creating adaptive educational technologies and improving teacher training through realistic, data-driven simulation.
pdf
bib
abs
GeoEdit: Geometric Knowledge Editing for Large Language Models
Yujie Feng
|
Li-Ming Zhan
|
Zexin Lu
|
Yongxin Xu
|
Xu Chu
|
Yasha Wang
|
Jiannong Cao
|
Philip S. Yu
|
Xiao-Ming Wu
Regular updates are essential for maintaining up-to-date knowledge in large language models (LLMs). However, existing training-based model editing methods often struggle to effectively incorporate new knowledge while preserving unrelated general knowledge. To address this challenge, we propose a novel framework called Geometric Knowledge Editing (GeoEdit). GeoEdit utilizes the geometric relationships of parameter updates from fine-tuning to differentiate between neurons associated with new knowledge updates and those related to general knowledge perturbations. By employing a direction-aware knowledge identification method, we avoid updating neurons with directions approximately orthogonal to existing knowledge, thus preserving the model’s generalization ability. For the remaining neurons, we integrate both old and new knowledge for aligned directions and apply a “forget-then-learn” editing strategy for opposite directions. Additionally, we introduce an importance-guided task vector fusion technique that filters out redundant information and provides adaptive neuron-level weighting, further enhancing model editing performance. Extensive experiments on two publicly available datasets demonstrate the superiority of GeoEdit over existing state-of-the-art methods.
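A conceptual sketch of the direction-aware identification step, under illustrative assumptions (the cosine thresholds, matrix shapes, and reference "existing knowledge" directions are hypothetical): neuron-wise update directions that are roughly orthogonal to existing knowledge are skipped, aligned directions are merged with old knowledge, and opposite directions trigger the forget-then-learn strategy.

```python
import numpy as np

def classify_update_directions(updates: np.ndarray, reference: np.ndarray,
                               ortho_band: float = 0.2) -> list[str]:
    """Label each neuron's fine-tuning update as 'skip' (near-orthogonal to
    existing knowledge), 'merge' (aligned), or 'forget-then-learn' (opposite),
    based on cosine similarity with a reference knowledge direction."""
    ref = reference / (np.linalg.norm(reference, axis=1, keepdims=True) + 1e-12)
    upd = updates / (np.linalg.norm(updates, axis=1, keepdims=True) + 1e-12)
    cos = (ref * upd).sum(axis=1)
    labels = []
    for c in cos:
        if abs(c) < ortho_band:
            labels.append("skip")               # preserve general knowledge
        elif c > 0:
            labels.append("merge")              # integrate old and new knowledge
        else:
            labels.append("forget-then-learn")  # conflicting direction
    return labels

# Toy example: 5 neurons with 8-dimensional update and reference vectors.
rng = np.random.default_rng(2)
print(classify_update_directions(rng.standard_normal((5, 8)),
                                 rng.standard_normal((5, 8))))
```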
pdf
bib
abs
A Generative Pre-Trained Language Model for Channel Prediction in Wireless Communications Systems
Bo Lin
|
Huanming Zhang
|
Yuhua Jiang
|
Yucong Wang
|
Tengyu Zhang
|
Shaoqiang Yan
|
Hongyao Li
|
Yihong Liu
|
Feifei Gao
Channel prediction can greatly reduce the pilot overhead and is a critical technology in the fifth-generation (5G) and the coming 6G wireless communications systems. Conventional model-based channel prediction methods suffer from limited accuracy due to imperfect temporal modeling, while existing AI-based methods suffer from limited generalization due to inadequate training strategies. Recently, large language models (LLMs) have demonstrated remarkable generalization and generation capabilities across diverse domains such as computer vision, quantitative economics, and bioinformatics, which motivates us to apply LLMs in channel prediction. In this paper, we formulate the ‘channel sentence’ based on channel correlation, where the channel is regarded as a ‘word’. Subsequently, we propose a generative pre-trained language model for channel prediction (CP-GPT). We collect 12M channel data according to the 3GPP 38.901 protocol and train CP-GPT based on the transformer decoder architecture. Moreover, we design two pre-training tasks based on the characteristics of wireless channels to enhance CP-GPT’s understanding of communications channels. We further propose a comprehensive benchmark to rigorously evaluate the capabilities of CP-GPT across multiple dimensions. The simulation results demonstrate that CP-GPT has successfully learned various channel characteristics and exhibits impressive capabilities across numerous downstream tasks.
pdf
bib
abs
AIMMerging: Adaptive Iterative Model Merging Using Training Trajectories for Language Model Continual Learning
Yujie Feng
|
Jian Li
|
Xiaoyu Dong
|
Pengfei Xu
|
Xiaohui Zhou
|
Yujia Zhang
|
Zexin Lu
|
Yasha Wang
|
Alan Zhao
|
Xu Chu
|
Xiao-Ming Wu
Continual learning (CL) is essential for deploying large language models (LLMs) in dynamic real-world environments without the need for costly retraining. Recent model merging-based methods have attracted significant attention, but they still struggle to effectively manage the trade-off between learning new knowledge and preventing forgetting, a challenge largely stemming from a suboptimal number of merges and merging frequency. In this paper, we introduce Adaptive Iterative Model Merging (AimMerging), a novel CL framework that utilizes learning and forgetting signals from the training trajectory to dynamically monitor the model’s training status. Guided by dynamic monitoring, the training trajectory-guided merge controller adaptively determines the timing and frequency of iterative fusion, while the rehearsal-based knowledge fusion module computes the merging weights and executes the fusion. Comprehensive experiments on three CL benchmarks with various model sizes (from 770M to 13B) demonstrate that AimMerging achieves significant performance improvements over existing state-of-the-art methods, with an average relative improvement of 80% and 59% on FWT and BWT, respectively. The source code is provided for reproducibility.
pdf
bib
abs
R-PRM: Reasoning-Driven Process Reward Modeling
Shuaijie She
|
Junxiao Liu
|
Yifeng Liu
|
Jiajun Chen
|
Xin Huang
|
Shujian Huang
Process Reward Models (PRMs) have emerged as a promising solution to address the reasoning mistakes of large language models (LLMs). However, existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy. This limitation is further compounded by the scarcity of annotated data. To address these issues, we propose Reasoning-Driven Process Reward Modeling (R-PRM), which activates inherent reasoning to enhance process-level evaluation. First, we leverage stronger LLMs to generate seed data from limited annotations, effectively activating reasoning capabilities and enabling comprehensive step-by-step evaluation. Second, we explore self-improvement of our PRM through preference optimization, without requiring additional annotated data. Third, we introduce inference-time scaling to fully harness our model’s reasoning potential. Extensive experiments demonstrate R-PRM’s effectiveness: on ProcessBench and PRMBench, it surpasses strong baselines by 13.9 and 8.5 F1 points. When applied to guide mathematical reasoning, R-PRM achieves consistent accuracy improvements of over 8.6 points across six challenging datasets. Further analysis reveals that R-PRM exhibits more comprehensive evaluation and robust generalization, indicating its broader potential.
pdf
bib
abs
RLAE: Reinforcement Learning-Assisted Ensemble for LLMs
Yuqian Fu
|
Yuanheng Zhu
|
Jiajun Chai
|
Guojun Yin
|
Wei Lin
|
Qichao Zhang
|
Dongbin Zhao
Ensembling large language models (LLMs) can effectively combine the diverse strengths of different models, offering a promising approach to enhance performance across various tasks. However, existing methods typically rely on fixed weighting strategies that fail to adapt to the dynamic, context-dependent characteristics of LLM capabilities. In this work, we propose **R**einforcement **L**earning-**A**ssisted **E**nsemble for LLMs (RLAE), a novel framework that reformulates LLM ensemble through the lens of a Markov Decision Process (MDP). Our approach introduces an RL agent that dynamically adjusts ensemble weights by considering both the input context and intermediate generation states, with the agent being trained using rewards that directly correspond to the quality of final outputs. We implement RLAE using both single-agent and multi-agent reinforcement learning algorithms (RLAE_PPO and RLAE_MAPPO), demonstrating substantial improvements over conventional ensemble methods. Extensive evaluations on a diverse set of tasks show that RLAE outperforms existing approaches by up to 3.3% accuracy points, offering a more effective framework for LLM ensembling. Furthermore, our method exhibits superior generalization capabilities across different tasks without the need for retraining, while simultaneously achieving lower time latency. The source code is available here.
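A minimal sketch of the state-dependent mixing idea, assuming per-model next-token distributions are already computed: a stub policy (standing in for the trained PPO/MAPPO agent) maps context features to ensemble weights, which mix the distributions at each decoding step. All parameters and feature choices below are hypothetical.

```python
import numpy as np

def policy_weights(context_features: np.ndarray) -> np.ndarray:
    """Stub for the RL agent: map context features to ensemble weights.
    In RLAE this mapping is trained with PPO/MAPPO; here it is a fixed
    linear layer followed by a softmax, purely for illustration."""
    W = np.array([[0.8, -0.2], [0.1, 0.9]])   # hypothetical policy parameters
    logits = W @ context_features
    e = np.exp(logits - logits.max())
    return e / e.sum()

def ensemble_step(model_probs: np.ndarray, context_features: np.ndarray) -> int:
    """Mix per-model next-token distributions with state-dependent weights
    and pick the next token greedily."""
    w = policy_weights(context_features)        # (num_models,)
    mixed = (w[:, None] * model_probs).sum(axis=0)
    return int(mixed.argmax())

# Toy decoding step: two models over a vocabulary of 4 tokens.
model_probs = np.array([[0.1, 0.6, 0.2, 0.1],
                        [0.5, 0.2, 0.2, 0.1]])
print(ensemble_step(model_probs, context_features=np.array([1.0, 0.3])))
```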
pdf
bib
abs
Do Large Language Models Truly Grasp Addition? A Rule-Focused Diagnostic Using Two-Integer Arithmetic
Yang Yan
|
Yu Lu
|
Renjun Xu
|
Zhenzhong Lan
Large language models (LLMs) achieve impressive results on advanced mathematics benchmarks but sometimes fail on basic arithmetic tasks, raising the question of whether they have truly grasped fundamental arithmetic rules or are merely relying on pattern matching. To unravel this issue, we systematically probe LLMs’ understanding of two-integer addition (0 to 2^64) by testing three crucial properties: commutativity (A + B = B + A), representation invariance via symbolic remapping (e.g., 7 ↦ Y), and consistent accuracy scaling with operand length. Our evaluation of 12 leading LLMs reveals a stark disconnect: while models achieve high numeric accuracy (73.8–99.8%), they systematically fail these diagnostics. Specifically, accuracy plummets to ≤ 7.5% with symbolic inputs, commutativity is violated in up to 20% of cases, and accuracy scaling is non-monotonic. Interventions further expose this pattern-matching reliance: explicitly providing rules degrades performance by 29.49%, while prompting for explanations before answering merely maintains baseline accuracy. These findings demonstrate that current LLMs address elementary addition via pattern matching, not robust rule induction, motivating new diagnostic benchmarks and innovations in model architecture and training to cultivate genuine mathematical reasoning. Our dataset and generating code are available at https://github.com/kuri-leo/llm-arithmetic-diagnostic.
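A small sketch of how such rule-focused probes can be generated (the digit-to-letter mapping and value range are illustrative, not the authors' dataset construction): each case pairs a commutativity check with a symbolically remapped version of the same operands.

```python
import random

DIGIT_TO_SYMBOL = dict(zip("0123456789", "QWERTYUIOP"))  # illustrative remapping

def remap(n: int) -> str:
    """Render an integer with digits replaced by arbitrary symbols (e.g., 7 -> Y)."""
    return "".join(DIGIT_TO_SYMBOL[d] for d in str(n))

def make_diagnostics(num_cases: int = 3, max_val: int = 2**64, seed: int = 0):
    """Generate paired probes: a commutativity check (A+B vs. B+A) and a
    representation-invariance check on the same operands."""
    rng = random.Random(seed)
    cases = []
    for _ in range(num_cases):
        a, b = rng.randrange(max_val), rng.randrange(max_val)
        cases.append({
            "numeric":      (f"{a}+{b}=", f"{b}+{a}="),
            "symbolic":     (f"{remap(a)}+{remap(b)}=", f"{remap(b)}+{remap(a)}="),
            "ground_truth": a + b,
        })
    return cases

for case in make_diagnostics():
    print(case["numeric"][0], case["symbolic"][0], case["ground_truth"])
```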
pdf
bib
abs
AskToAct: Enhancing LLMs Tool Use via Self-Correcting Clarification
Xuan Zhang
|
Yongliang Shen
|
Zhe Zheng
|
Linjuan Wu
|
Wenqi Zhang
|
Yuchen Yan
|
Qiuying Peng
|
Jun Wang
|
Weiming Lu
Large language models (LLMs) have demonstrated remarkable capabilities in tool learning. In real-world scenarios, user queries are often ambiguous and incomplete, requiring effective clarification. However, existing interactive clarification approaches face two critical limitations: reliance on manually constructed datasets, which inherently constrains training data scale and diversity, and lack of error correction mechanisms during multi-turn clarification, leading to error accumulation that compromises both accuracy and efficiency. We present AskToAct, which addresses these challenges by exploiting the structural mapping between queries and their tool invocation solutions. Our key insight is that tool parameters naturally represent explicit user intents. By systematically removing key parameters from queries while retaining them as ground truth, we enable automated construction of high-quality training data. We further enhance model robustness through error-correction pairs and selective masking, enabling dynamic error detection during clarification interactions. Comprehensive experiments demonstrate that AskToAct significantly outperforms existing approaches, achieving above 57% accuracy in recovering critical unspecified intents and enhancing clarification efficiency by an average of 10.46% while maintaining high accuracy in tool invocation. Our framework exhibits robust performance across different model architectures and successfully generalizes to entirely unseen APIs without additional training, achieving performance comparable to GPT-4o with substantially fewer computational resources.
pdf
bib
abs
START: Self-taught Reasoner with Tools
Chengpeng Li
|
Mingfeng Xue
|
Zhenru Zhang
|
Jiaxi Yang
|
Beichen Zhang
|
Bowen Yu
|
Binyuan Hui
|
Junyang Lin
|
Xiang Wang
|
Dayiheng Liu
Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in complex reasoning through long chain-of-thought, yet they struggle with precise computations and algorithmic operations. Integrating computational tools with LRMs remains challenging, particularly in activating and enhancing models’ tool-use capabilities without compromising their reasoning strengths. We address these challenges through START (Self-taught Reasoner with Tools), introducing two key innovations: (1) Hint-infer, a training-free approach that activates LRMs’ latent tool-use capabilities through artificial hints, enabling test-time performance scaling; (2) Hint-RFT, a self-training framework that enables models to learn effective tool utilization through diverse hint patterns and rejection-based data synthesis. Experiments show that START significantly improves state-of-the-art LRMs across challenging benchmarks, including competition-level mathematics (AMC23: 95.0%, AIME24: 75.6%) and graduate-level science questions (GPQA: 64.6%). Our analysis reveals that START not only enhances accuracy but also improves reasoning efficiency through strategic tool utilization, demonstrating broad applicability in complex reasoning scenarios.
pdf
bib
abs
The Impact of Negated Text on Hallucination with Large Language Models
Jaehyung Seo
|
Hyeonseok Moon
|
Heuiseok Lim
Recent studies on hallucination in large language models (LLMs) have been actively progressing in natural language processing. However, the impact of negated text on hallucination with LLMs remains largely unexplored. In this paper, we pose three important yet unanswered research questions and aim to address them. To derive the answers, we investigate whether LLMs can recognize contextual shifts caused by negation and still reliably distinguish hallucinations comparably to affirmative cases. We also design the NegHalu dataset by reconstructing existing hallucination detection datasets with negated expressions. Our experiments demonstrate that LLMs struggle to detect hallucinations in negated text effectively, often producing logically inconsistent or unfaithful judgments. Moreover, we trace the internal state of LLMs as they process negated inputs at the token level and reveal the challenges of mitigating their unintended effects.
pdf
bib
abs
A Probabilistic Inference Scaling Theory for LLM Self-Correction
Zhe Yang
|
Yichang Zhang
|
Yudong Wang
|
Ziyao Xu
|
Junyang Lin
|
Zhifang Sui
Large Language Models (LLMs) have demonstrated the capability to refine their generated answers through self-correction, enabling continuous performance improvement over multiple rounds. However, the mechanisms underlying how and why accuracy evolves during this iterative process remain unexplored. To fill this gap, we propose a probabilistic theory to model the dynamics of accuracy change and explain the performance improvements observed in multi-round self-correction. Through mathematical derivation, we establish that the accuracy after the t-th round of self-correction is given by Acc_t = Upp − α^t (Upp − Acc_0), where Acc_0 denotes the initial accuracy, Upp represents the upper bound of accuracy convergence, and α determines the rate of convergence. Based on our theory, these parameters can be calculated and the predicted accuracy curve can then be obtained through only a single round of self-correction. Extensive experiments across diverse models and datasets demonstrate that our theoretical predictions align closely with empirical accuracy curves, validating the effectiveness of the theory. Our work provides a theoretical foundation for understanding LLM self-correction, thus paving the way for further explorations.
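A worked sketch of the closed-form curve, with hypothetical parameter values (the paper estimates Acc_0, Upp, and α from a single round of self-correction):

```python
def predicted_accuracy(acc0: float, upp: float, alpha: float, rounds: int):
    """Accuracy after each self-correction round under
    Acc_t = Upp - alpha**t * (Upp - Acc_0)."""
    return [upp - (alpha ** t) * (upp - acc0) for t in range(rounds + 1)]

# Hypothetical parameters: initial accuracy 0.60, convergence upper bound 0.78,
# convergence rate 0.5.
curve = predicted_accuracy(acc0=0.60, upp=0.78, alpha=0.5, rounds=5)
print([round(a, 4) for a in curve])  # monotonically approaches Upp = 0.78
```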
pdf
bib
abs
MentalGLM Series: Explainable Large Language Models for Mental Health Analysis on Chinese Social Media
Wei Zhai
|
Nan Bai
|
Qing Zhao
|
Jianqiang Li
|
Fan Wang
|
Hongzhi Qi
|
Meng Jiang
|
Xiaoqin Wang
|
Bing Xiang Yang
|
Guanghui Fu
With the rise of mental health challenges, social media has become a key platform for emotional expression. Deep learning offers a promising solution for analyzing mental health but lacks flexibility and interpretability. Large language models (LLMs) introduce greater adaptability and can explain their decisions, yet they still underperform deep learning in complex psychological analysis. We present C-IMHI, the first multi-task Chinese social media interpretable mental health instruction dataset (9K samples) with quality control and manual validation. Additionally, we introduce MentalGLM, the first open-source Chinese LLMs for explainable mental health analysis, trained on 50K instructions. The proposed models excelled in three mental health downstream tasks, outperforming or matching deep learning and LLMs. A portion of the generated decision explanations was validated by experts, demonstrating promising accuracy and reliability. We evaluated the proposed models on a clinical dataset, where they significantly outperformed other LLMs, demonstrating their potential for clinical applications. Our models show strong performance, validated across tasks and domains. The decision explanations enhance usability and facilitate better understanding and practical application of the models. Both the constructed dataset and the models are publicly available via: https://github.com/zwzzzQAQ/MentalGLM.
pdf
bib
abs
Knowledge-Aware Co-Reasoning for Multidisciplinary Collaboration
Xurui Li
|
Wanghaijiao
|
Kaisong Song
|
Rui Zhu
|
Haixu Tang
Large language models (LLMs) have shown significant potential to improve diagnostic performance for clinical professionals. Existing multi-agent paradigms rely mainly on prompt engineering, suffering from improper agent selection and insufficient knowledge integration. In this work, we propose a novel framework KACR (Knowledge-Aware Co-Reasoning) that integrates structured knowledge reasoning into multidisciplinary collaboration from two aspects: (1) a reinforcement learning-optimized agent that uses clinical knowledge graphs to guide dynamic discipline determination; (2) a multidisciplinary collaboration strategy that enables robust consensus through integration of domain-specific expertise and interdisciplinary persuasion mechanism. Extensive experiments conducted on both academic and real-world datasets demonstrate the effectiveness of our method.
pdf
bib
abs
Astra: Efficient Transformer Architecture and Contrastive Dynamics Learning for Embodied Instruction Following
Yueen Ma
|
DaFeng Chi
|
Shiguang Wu
|
Yuecheng Liu
|
Yuzheng Zhuang
|
Irwin King
Vision-language-action models have gained significant attention for their ability to model multimodal sequences in embodied instruction following tasks. However, most existing models rely on causal attention, which we find suboptimal for processing sequences composed of interleaved segments from different modalities. In this paper, we introduce Astra, a novel Transformer architecture featuring trajectory attention and learnable action queries, designed to efficiently process segmented multimodal trajectories and predict actions for imitation learning. Furthermore, we propose a contrastive dynamics learning objective to enhance the model’s understanding of environment dynamics and multimodal alignment, complementing the primary behavior cloning objective. Through extensive experiments on three large-scale robot manipulation benchmarks, Astra demonstrates substantial performance improvements over previous models.
pdf
bib
abs
MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation
Woohyun Cho
|
Youngmin Kim
|
Sunghyun Lee
|
Youngjae Yu
Lyrics translation requires both accurate semantic transfer and preservation of musical rhythm, syllabic structure, and poetic style. In animated musicals, the challenge intensifies due to alignment with visual and auditory cues. We introduce Multilingual Audio-Video Lyrics Benchmark for Animated Song Translation (MAVL), the first multilingual, multimodal benchmark for singable lyrics translation. By integrating text, audio, and video, MAVL enables richer and more expressive translations than text-only approaches. Building on this, we propose Syllable-Constrained Audio-Video LLM with Chain-of-Thought (SylAVL-CoT), which leverages audio-video cues and enforces syllabic constraints to produce natural-sounding lyrics. Experimental results demonstrate that SylAVL-CoT significantly outperforms text-based models in singability and contextual accuracy, emphasizing the value of multimodal, multilingual approaches for lyrics translation.
pdf
bib
abs
MuTIS: Enhancing Reasoning Efficiency through Multi Turn Intervention Sampling in Reinforcement Learning
Wenshuo Zhao
|
Haoxing Zhai
|
Xinyu Qiu
|
Zhenting Qi
|
Shuhe Li
|
Linchao Zhu
Recently, large reasoning models (LRMs) have demonstrated state-of-the-art performance across a wide range of benchmarks. However, a common challenge for these models is the “overthinking” problem, which leads to excessive reasoning steps and significant computational overhead. Furthermore, the issues with long Chain-of-Thought (CoT) are especially pronounced in smaller models (≤ 3B parameters). Aside from producing excessively verbose “reflection words”, they often exhibit repetition and get trapped in unproductive generation loops. Existing solutions typically involve either using flexible reasoning chains as training data or leveraging the model’s latent space to bypass intermediate reasoning steps, but none of these methods have considered directly optimizing reasoning trajectories during the sampling phase of training. In our work, we introduce the Multi-Turn Intervention Sampling Framework (MuTIS). Our framework leverages multi-turn interventions to produce concise reasoning chains. It fine-tunes reasoning models through reinforcement learning, demonstrably breaking the accuracy-efficiency trade-off. It also demonstrates strong scalability, exhibiting excellent performance on 7B models. Code is available at https://github.com/Edric-Zhao/MuTIS/tree/main.
pdf
bib
abs
PRIM: Towards Practical In-Image Multilingual Machine Translation
Yanzhi Tian
|
Zeming Liu
|
Zhengyang Liu
|
Chong Feng
|
Xin Li
|
Heyan Huang
|
Yuhang Guo
In-Image Machine Translation (IIMT) aims to translate images containing texts from one language to another. Current research on end-to-end IIMT is mainly conducted on synthetic data, with simple backgrounds, a single font, fixed text positions, and bilingual translation, which cannot fully reflect the real world, causing a significant gap between research and practical conditions. To facilitate research on IIMT in real-world scenarios, we explore Practical In-Image Multilingual Machine Translation (IIMMT). To address the lack of publicly available data, we annotate the PRIM dataset, which contains real-world captured one-line text images with complex backgrounds, various fonts, and diverse text positions, and supports multilingual translation directions. We propose an end-to-end model, VisTrans, to handle the challenges of practical conditions in PRIM, which processes visual text and background information in the image separately, ensuring the capability of multilingual translation while improving visual quality. Experimental results indicate that VisTrans achieves better translation quality and visual effects compared to other models. The code and dataset are available at: https://github.com/BITHLP/PRIM.
pdf
bib
abs
Mind the Inclusivity Gap: Multilingual Gender-Neutral Translation Evaluation with mGeNTE
Beatrice Savoldi
|
Giuseppe Attanasio
|
Eleonora Cupin
|
Eleni Gkovedarou
|
Janiça Hackenbuchner
|
Anne Lauscher
|
Matteo Negri
|
Andrea Piergentili
|
Manjinder Thind
|
Luisa Bentivogli
Avoiding the propagation of undue (binary) gender inferences and default masculine language remains a key challenge towards inclusive multilingual technologies, particularly when translating into languages with extensive gendered morphology. Gender-neutral translation (GNT) represents a linguistic strategy towards fairer communication across languages. However, research on GNT is limited to a few resources and language pairs. To address this gap, we introduce mGeNTE, an expert-curated resource, and use it to conduct the first systematic multilingual evaluation of inclusive translation with state-of-the-art instruction-following language models (LMs). Experiments on en-es/de/it/el reveal that while models can recognize when neutrality is appropriate, they cannot consistently produce neutral translations, limiting their usability. To probe this behavior, we enrich our evaluation with interpretability analyses that identify task-relevant features and offer initial insights into the internal dynamics of LM-based GNT.
pdf
bib
abs
DiplomacyAgent: Do LLMs Balance Interests and Ethical Principles in International Events?
Jianxiang Peng
|
Ling Shi
|
Xinwei Wu
|
Hanwen Zhang
|
Fujiang Liu
|
Haocheng Lyu
|
Deyi Xiong
The widespread deployment of large language models (LLMs) across various domains has made their safety a critical priority. Inspired by think-tank decision-making philosophy, we propose DiplomacyAgent, an LLM-based multi-agent system for diplomatic position analysis. With DiplomacyAgent, we are able to systematically assess how LLMs balance “interests” against “ethical principles” when addressing various international events, hence understanding the safety implications of LLMs in diplomacy. Specifically, this will help to assess the consistency of LLM stance with widely recognized ethical standards, as well as the potential risks or ideological biases that may arise. Through integrated quantitative metrics, our research uncovers unexpected decision-making patterns in LLM responses to sensitive issues including human rights protection, environmental sustainability, regional conflicts, etc. It discloses that LLMs could exhibit a strong bias towards interests, leading to unsafe decisions that violate ethical and moral principles. Our experiment results suggest that deploying LLMs in high-stakes domains, particularly in the formulation of diplomatic policies, necessitates a comprehensive assessment of potential ethical and social implications, as well as the implementation of stringent safety protocols.
pdf
bib
abs
DisLoRA: Task-specific Low-Rank Adaptation via Orthogonal Basis from Singular Value Decomposition
She Yifei
|
Xinhao Wei
|
Yulong Wang
Parameter-efficient fine-tuning (PEFT) of large language models (LLMs) is critical for adapting to diverse downstream tasks with minimal computational cost. We propose **Di**rectional-**S**VD **Lo**w-**R**ank **A**daptation (DisLoRA), a novel PEFT framework that leverages singular value decomposition (SVD) to decompose pretrained weight matrices into orthogonal backbone and task-specific subspaces, enabling precise capture of task-specific directions (TSDs). By dynamically identifying TSDs and employing adaptive soft orthogonal regularization with a mean-normalization mechanism, DisLoRA balances task-specific and orthogonal losses without manual tuning, ensuring robust training stability. Extensive experiments on GLUE and Commonsense Reasoning benchmarks demonstrate that DisLoRA surpasses established PEFT methods, including LoRA, PiSSA, DoRA, LoRA-Dash, and SORSA. DisLoRA achieves superior performance on multiple individual GLUE datasets, surpassing baselines by up to 10.28% on SST-2 and 3.28% on CoLA, and consistently attains higher average accuracy than baselines across commonsense reasoning tasks, with a maximum gain of 3.1%. These results demonstrate DisLoRA’s effectiveness for efficient, high-performing LLM adaptation to domain-specific tasks while preserving generalization.
pdf
bib
abs
Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering
Zixin Chen
|
Sicheng Song
|
KaShun Shum
|
Yanna Lin
|
Rui Sheng
|
Weiqi Wang
|
Huamin Qu
Misleading visualizations, which manipulate chart representations to support specific claims, can distort perception and lead to incorrect conclusions. Despite decades of research, they remain a widespread issue, posing risks to public understanding and raising safety concerns for AI systems involved in data-driven communication. While recent multimodal large language models (MLLMs) show strong chart comprehension abilities, their capacity to detect and interpret misleading charts remains unexplored. We introduce the Misleading ChartQA benchmark, a large-scale multimodal dataset designed to evaluate MLLMs on misleading chart reasoning. It contains 3,026 curated examples spanning 21 misleader types and 10 chart types, each with standardized chart code, CSV data, multiple-choice questions, and labeled explanations, validated through iterative MLLM checks and exhaustive expert human review. We benchmark 24 state-of-the-art MLLMs, analyze their performance across misleader types and chart formats, and propose a novel region-aware reasoning pipeline that enhances model accuracy. Our work lays the foundation for developing MLLMs that are robust, trustworthy, and aligned with the demands of responsible visual communication.
pdf
bib
abs
Textual Aesthetics in Large Language Models
Lingjie Jiang
|
Shaohan Huang
|
Xun Wu
|
Furu Wei
Image aesthetics is a crucial metric in the field of image generation. However, textual aesthetics has not been sufficiently explored. With the widespread application of large language models (LLMs), previous work has primarily focused on the correctness of content and the helpfulness of responses. Nonetheless, providing responses with textual aesthetics is also an important factor for LLMs, which can offer a cleaner layout and ensure greater consistency and coherence in content. In this work, we introduce a pipeline for aesthetics polishing and use it to construct a textual aesthetics dataset named TEXAES. We propose a textual aesthetics-powered fine-tuning method based on direct preference optimization, termed TAPO, which leverages textual aesthetics without compromising content correctness. Additionally, we develop two evaluation methods for textual aesthetics based on text and image analysis, respectively. Our experiments demonstrate that using textual aesthetics data and employing the TAPO fine-tuning method not only improves aesthetic scores but also enhances performance on general evaluation datasets such as AlpacaEval and Arena-Hard.
pdf
bib
abs
Section-Level Simplification of Biomedical Abstracts
Jan Bakker
|
Jaap Kamps
Cochrane produces systematic reviews whose abstracts are divided into seven standard sections. However, the plain language summaries (PLS) of Cochrane reviews do not adhere to the same structure, which has prevented researchers from training simplification models on paired abstract and PLS sections. In this work, we devise a two-step method to automatically divide PLS of Cochrane reviews into the same sections in which abstracts are divided. In the first step, we align each sentence in a PLS to a section in the parallel abstract if they cover similar content. In the second step, we classify the remaining sentences into sections based on the content of the PLS and what we learned from the first step. We manually divide 22 PLS into sections to evaluate our method. Upon execution of our method, we obtain the Cochrane-sections dataset, which consists of paired abstract and PLS sections in English for a total of 7.7K Cochrane reviews. Thus, our work yields references for the section-level simplification of biomedical abstracts.
pdf
bib
abs
PoseStitch-SLT: Linguistically Inspired Pose-Stitching for End-to-End Sign Language Translation
Abhinav Joshi
|
Vaibhav Sharma
|
Sanjeet Singh
|
Ashutosh Modi
Sign language translation remains a challenging task due to the scarcity of large-scale, sentence-aligned datasets. Prior work has focused on various feature extraction and architectural changes to support neural machine translation for sign languages. We propose PoseStitch-SLT, a novel pre-training scheme inspired by a linguistic-template-based sentence generation technique. With translation comparisons on two sign language datasets, How2Sign and iSign, we show that a simple transformer-based encoder-decoder architecture outperforms the prior art when template-generated sentence pairs are included in training. We achieve BLEU-4 score improvements from 1.97 to 4.56 on How2Sign and from 0.55 to 3.43 on iSign, surpassing prior state-of-the-art methods for pose-based gloss-free translation. The results demonstrate the effectiveness of template-driven synthetic supervision in low-resource sign language settings.
pdf
bib
abs
Few-Shot Open-Set Classification via Reasoning-Aware Decomposition
Avyav Kumar Singh
|
Helen Yannakoudakis
Large language models (LLMs) excel at few-shot learning, but their ability to reject out-of-distribution examples remains under-explored. We study this challenge under the setting of few-shot open-set classification, where a model must not only classify examples from a small set of seen classes but also reject unseen ones at inference time. This setting is more realistic and challenging than traditional closed-set supervised learning, requiring both fine-grained classification and robust rejection. We show that, for small LLMs, neither chain-of-thought (CoT) prompting nor supervised fine-tuning (SFT) alone is sufficient to generalise reliably, particularly when class semantics are anonymised. We introduce Wasserstein GFN (W-GFN), a novel amortised Generative Flow Network framework that uses latent trajectories to approximate the Bayesian posterior. With as few as 4 examples per class, W-GFN substantially improves performance, enabling Llama 3.2 3B to achieve up to ≥80% of the performance of Llama 3.3 70B on complex datasets, despite being ∼23 times smaller, which highlights the importance of reasoning-aware approaches for robust open-set few-shot learning.
pdf
bib
abs
Translation in the Hands of Many: Centering Lay Users in Machine Translation Interactions
Beatrice Savoldi
|
Alan Ramponi
|
Matteo Negri
|
Luisa Bentivogli
Converging societal and technical factors have transformed language technologies into user-facing applications used by the general public across languages. Machine Translation (MT) has become a global tool, with cross-lingual services now also supported by dialogue systems powered by multilingual Large Language Models (LLMs). Widespread accessibility has extended MT’s reach to a vast base of *lay users*, many with little to no expertise in the languages or the technology itself. And yet, the understanding of MT consumed by such a diverse group of users—their needs, experiences, and interactions with multilingual systems—remains limited. In our position paper, we first trace the evolution of MT user profiles, focusing on non-experts and how their engagement with technology may shift with the rise of LLMs. Building on an interdisciplinary body of work, we identify three factors—usability, trust, and literacy—that are central to shaping user interactions and must be addressed to align MT with user needs. By examining these dimensions, we provide insights to guide the progress of more user-centered MT.
pdf
bib
abs
iTool: Reinforced Fine-Tuning with Dynamic Deficiency Calibration for Advanced Tool Use
Yirong Zeng
|
Xiao Ding
|
Yuxian Wang
|
Weiwen Liu
|
Yutai Hou
|
Wu Ning
|
Xu Huang
|
Duyu Tang
|
Dandan Tu
|
Bing Qin
|
Ting Liu
Augmenting large language models (LLMs) with external tools is a promising approach to enhance their capabilities, especially for complex tasks. Synthesizing tool-use data through real-world simulations is an effective way to achieve this. However, our investigation reveals that training gains significantly decay as synthetic data increases. The model struggles to benefit from additional synthetic data, which fails to equip it with advanced tool-use capabilities in complex scenarios. Moreover, we discovered that this limitation usually manifests as a fragment deficiency (i.e., parameter errors) in responses. To this end, we propose an iterative reinforced fine-tuning strategy designed to alleviate this limitation. This strategy involves: (1) enhancing the diversity of responses for synthetic data through path exploration of Monte Carlo Tree Search; and (2) iteratively pinpointing the model’s deficiencies by constructing fine-grained preference pairs, and then applying preference optimization algorithms for targeted improvement. The experiments show that our method achieves 13.11% better performance than the same-size base model. It achieves an improvement of 6.5% in complex scenarios compared to the baseline, and it also outperforms larger open-source and closed-source models.
pdf
bib
abs
Transplant Then Regenerate: A New Paradigm for Text Data Augmentation
Guangzhan Wang
|
Hongyu Zhang
|
Beijun Shen
|
Xiaodong Gu
Data augmentation is a critical technique in deep learning. Traditional methods like back-translation typically focus on lexical-level rephrasing, which primarily produces variations with the same semantics. While large language models (LLMs) have enhanced text augmentation through their “knowledge emergence” capability, controlling the style and structure of these outputs remains challenging and requires meticulous prompt engineering. In this paper, we propose LMTransplant, a novel text augmentation paradigm leveraging LLMs. The core idea of LMTransplant is transplant-then-regenerate: incorporating seed text into a context expanded by an LLM, and asking the LLM to regenerate a variant based on the expanded context. This strategy allows the model to create more diverse and creative content-level variants by fully leveraging the knowledge embedded in LLMs, while preserving the core attributes of the original text. We evaluate LMTransplant across various text-related tasks, demonstrating its superior performance over existing text augmentation methods. Moreover, LMTransplant demonstrates exceptional scalability as the size of augmented data grows.
pdf
bib
abs
Compositional Generalisation for Explainable Hate Speech Detection
Agostina Calabrese
|
Tom Sherborne
|
Björn Ross
|
Mirella Lapata
Hate speech detection is key to online content moderation, but current models struggle to generalise beyond their training data. This has been linked to dataset biases and the use of sentence-level labels, which fail to teach models the underlying structure of hate speech. In this work, we show that even when models are trained with more fine-grained, span-level annotations (e.g., “artists” is labeled as target and “are parasites” as dehumanising comparison), they struggle to disentangle the meaning of these labels from the surrounding context. As a result, combinations of expressions that deviate from those seen during training remain particularly difficult for models to detect. We investigate whether training on a dataset where expressions occur with equal frequency across all contexts can improve generalisation. To this end, we create U-PLEAD, a dataset of ~364,000 synthetic posts, along with a novel compositional generalisation benchmark of ~8,000 manually validated posts. Training on a combination of U-PLEAD and real data improves compositional generalisation while achieving state-of-the-art performance on the human-sourced PLEAD.
pdf
bib
abs
CCQA: Generating Question from Solution Can Improve Inference-Time Reasoning in SLMs
Jinyoung Kim
|
Ji Won Yoon
Recently, inference-time reasoning strategies have further improved the accuracy of large language models (LLMs), but their effectiveness on smaller language models (SLMs) remains unclear. Based on the observation that conventional approaches often fail to improve performance in this context, we propose Cycle-Consistency in Question Answering (CCQA), a novel reasoning method that can be effectively applied to SLMs. Inspired by cycle consistency, CCQA generates a question from each reasoning path and answer, evaluates each by its similarity to the original question, and then selects the candidate solution with the highest similarity score as the final response. Since conventional SLMs struggle to generate accurate questions from their own reasoning paths and answers, we employ a lightweight Flan-T5 model specialized for question generation to support this process efficiently. From the experimental results, it is verified that CCQA consistently outperforms existing state-of-the-art (SOTA) methods across eight models on mathematical and commonsense reasoning benchmarks. Furthermore, our method establishes a new practical baseline for efficient reasoning in SLMs. Source code can be found at
https://github.com/scai-research/ccqa_official.
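As a rough illustration of the cycle-consistency selection idea described above (not the released CCQA code), the sketch below regenerates a question from each candidate reasoning path and answer and keeps the answer whose regenerated question best matches the original one. The `generate_question` helper and the string-overlap similarity are hypothetical stand-ins for the paper's Flan-T5 generator and its similarity scoring.

```python
# Minimal sketch of cycle-consistency answer selection (illustrative only).
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Simple string similarity as a stand-in for an embedding-based score.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def generate_question(reasoning_path: str, answer: str) -> str:
    # Placeholder: a real implementation would prompt a question-generation
    # model (e.g., a lightweight seq2seq model) with the path and its answer.
    return f"What question does the following reasoning answer? {answer}"

def ccqa_select(original_question: str, candidates: list[tuple[str, str]]) -> str:
    """Pick the (reasoning, answer) pair whose regenerated question is
    most similar to the original question."""
    best_answer, best_score = None, float("-inf")
    for reasoning, answer in candidates:
        regenerated = generate_question(reasoning, answer)
        score = similarity(original_question, regenerated)
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer
```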
pdf
bib
abs
TVQACML: Benchmarking Text-Centric Visual Question Answering in Multilingual Chinese Minority Languages
Sha Jiu
|
Yu Weng
|
Mengxiao Zhu
|
Chong Feng
|
Zheng Liu
|
Jialedongzhu
Text-Centric Visual Question Answering (TEC-VQA) is a critical research area that requires semantic interactions between objects and scene texts. However, most existing TEC-VQA benchmarks focus on high-resource languages like English and Chinese. Although a few works expand multilingual QA pairs in non-text-centric VQA datasets through translation, this approach encounters a substantial “visual-textual misalignment” problem when applied to TEC-VQA. Moreover, the open-source nature of these benchmarks and the broad sources of training data for MLLMs have inevitably led to benchmark contamination, resulting in unreliable evaluation results. To alleviate this issue, we propose a contamination-free and more challenging TEC-VQA benchmark called Text-Centric Visual Question Answering in Multilingual Chinese Minority Languages (TVQACML), which involves eight languages, including Standard Chinese, Korean, and six minority languages. TVQACML supports a wide range of tasks, such as Text Recognition, Scene Text-Centric VQA, Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER), featuring 32,000 question-answer pairs across 8,000 images. Extensive experiments on TVQACML across multiple MLLMs demonstrate its effectiveness for evaluating MLLMs and for enhancing multilingual TEC-VQA performance through fine-tuning.
pdf
bib
abs
Transparent and Coherent Procedural Mistake Detection
Shane Storks
|
Itamar Bar-Yossef
|
Yayuan Li
|
Zheyuan Zhang
|
Jason J Corso
|
Joyce Chai
Procedural mistake detection (PMD) is a challenging problem of classifying whether a human user (observed through egocentric video) has successfully executed a task (specified by a procedural text). Despite significant recent efforts, machine performance in the wild remains nonviable, and the reasoning processes underlying this performance are opaque. As such, we extend PMD to require generating visual self-dialog rationales to inform decisions. Given the impressive, mature image understanding capabilities observed in recent vision-and-language models (VLMs), we curate a suitable benchmark dataset for PMD based on individual frames. As our reformulation enables unprecedented transparency, we leverage a natural language inference (NLI) model to formulate two automated metrics for the coherence of generated rationales. We establish baselines for this reframed task, showing that VLMs struggle off-the-shelf, but with some trade-offs, their accuracy, coherence, and efficiency can be improved by incorporating these metrics into common inference and fine-tuning methods. Lastly, our multi-faceted metrics visualize common outcomes, highlighting areas for further improvement.
pdf
bib
abs
Teaching Your Models to Understand Code via Focal Preference Alignment
Jie Wu
|
Haoling Li
|
Xin Zhang
|
Xiao Liu
|
Yangyu Huang
|
Jianwen Luo
|
Yizhen Zhang
|
Zuchao Li
|
Ruihang Chu
|
Yujiu Yang
|
Scarlett Li
Preference learning extends the performance of Code LLMs beyond traditional supervised fine-tuning by leveraging relative quality comparisons. In existing approaches, a set of n candidate solutions is evaluated based on test case success rates, with the candidate demonstrating a higher pass rate being labeled as positive and its counterpart with a lower pass rate as negative. However, because this approach aligns entire failing code blocks rather than pinpointing specific errors, it lacks the granularity necessary to capture meaningful error-correction relationships. As a result, the model is unable to learn more informative error-correction patterns. To address these issues, we propose Target-DPO, a new preference alignment framework that mimics human iterative debugging to refine Code LLMs. Target-DPO explicitly locates error regions and aligns the corresponding tokens via a tailored DPO algorithm. To facilitate it, we introduce the CodeFlow dataset, where samples are iteratively refined until passing tests, with modifications capturing error corrections. Extensive experiments show that a diverse suite of Code LLMs equipped with Target-DPO achieves significant performance gains in code generation and improves on challenging tasks like BigCodeBench. In-depth analysis reveals that Target-DPO yields fewer errors. Code, model and datasets are in: https://github.com/JieWu02/Target-DPO.
pdf
bib
abs
MoLoRAG: Bootstrapping Document Understanding via Multi-modal Logic-aware Retrieval
Xixi Wu
|
Yanchao Tan
|
Nan Hou
|
Ruiyang Zhang
|
Hong Cheng
Document Understanding is a foundational AI capability with broad applications, and Document Question Answering (DocQA) is a key evaluation task. Traditional methods convert the document into text for processing by Large Language Models (LLMs), but this process strips away critical multi-modal information like figures. While Large Vision-Language Models (LVLMs) address this limitation, their constrained input size makes multi-page document comprehension infeasible. Retrieval-augmented generation (RAG) methods mitigate this by selecting relevant pages, but they rely solely on semantic relevance, ignoring logical connections between pages and the query, which is essential for reasoning and accurate answers. To this end, we propose MoLoRAG, a logic-aware retrieval framework for multi-modal, multi-page document understanding. By constructing a page graph that captures contextual relationships between pages, a lightweight VLM performs graph traversal to retrieve relevant pages, including those with logical connections often overlooked. This approach combines semantic and logical relevance to deliver more accurate retrieval. After retrieval, the top-K pages are fed into arbitrary LVLMs for question answering. To enhance flexibility, MoLoRAG offers two variants: a training-free solution for easy deployment and a fine-tuned version to improve logical relevance checking. Experiments on four DocQA datasets demonstrate average improvements of 9.68% in accuracy over LVLM direct inference and 7.44% in retrieval precision over baselines. Codes and datasets are released at https://github.com/WxxShirley/MoLoRAG.
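A minimal sketch of the page-graph retrieval idea, assuming precomputed page embeddings and an adjacency map; it seeds retrieval with the most semantically similar pages and then expands along graph edges so logically connected pages can also surface. The embeddings, graph, and hop/seed counts are hypothetical placeholders, not the released MoLoRAG implementation.

```python
# Illustrative logic-aware page retrieval over a page graph (not MoLoRAG's code).
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve_pages(query_emb, page_embs, page_graph, k=5, seeds=2, hops=1):
    """Seed with the most semantically similar pages, then expand along
    graph edges so logically connected pages become candidates too."""
    sims = {p: cosine(query_emb, e) for p, e in page_embs.items()}
    frontier = sorted(sims, key=sims.get, reverse=True)[:seeds]
    candidates = set(frontier)
    for _ in range(hops):
        nxt = set()
        for p in frontier:
            nxt.update(page_graph.get(p, []))   # follow contextual edges
        candidates |= nxt
        frontier = nxt
    # Rank candidates by semantic relevance; graph expansion supplies recall.
    ranked = sorted(candidates, key=lambda p: sims.get(p, 0.0), reverse=True)
    return ranked[:k]
```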
pdf
bib
abs
Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions
Ioanna Ntinou
|
Alexandros Xenos
|
Yassine Ouali
|
Adrian Bulat
|
Georgios Tzimiropoulos
Contrastively-trained Vision-Language Models (VLMs), such as CLIP, have become the standard approach for learning discriminative vision-language representations. However, these models often exhibit shallow language understanding, manifesting bag-of-words behaviour. These limitations are reinforced by their dual-encoder design, which induces a modality gap. Additionally, the reliance on vast web-collected data corpora for training makes the process computationally expensive and introduces significant privacy concerns. To address these limitations, in this work, we challenge the necessity of vision encoders for retrieval tasks by introducing a vision-free, single-encoder retrieval pipeline. Departing from the traditional text-to-image retrieval paradigm, we migrate to a text-to-text paradigm with the assistance of VLLM-generated structured image descriptions. We demonstrate that this paradigm shift has significant advantages, including a substantial reduction of the modality gap, improved compositionality, and better performance on short and long caption queries, all attainable with only two hours of calibration on two GPUs. Additionally, substituting raw images with textual descriptions introduces a more privacy-friendly alternative for retrieval. To further assess generalisation and address some of the shortcomings of prior compositionality benchmarks, we release two benchmarks derived from Flickr30k and COCO, containing diverse compositional queries made of short captions, which we coin subFlickr and subCOCO. Our vision-free retriever matches and often surpasses traditional multimodal models. Importantly, our approach achieves state-of-the-art zero-shot performance on multiple retrieval and compositionality benchmarks, with models as small as 0.3B parameters.
pdf
bib
abs
TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning
Xiaohan Yu
|
Pu Jian
|
Chong Chen
Retrieval-Augmented Generation (RAG) has demonstrated considerable effectiveness in open-domain question answering. However, when applied to heterogeneous documents, comprising both textual and tabular components, existing RAG approaches exhibit critical limitations. The prevailing practice of flattening tables and chunking strategies disrupts the intrinsic tabular structure, leads to information loss, and undermines the reasoning capabilities of LLMs in multi-hop, global queries. To address these challenges, we propose TableRAG, an SQL-based framework that unifies textual understanding and complex manipulations over tabular data. TableRAG iteratively operates in four steps: context-sensitive query decomposition, text retrieval, SQL programming and execution, and compositional intermediate answer generation. We also develop HeteQA, a novel benchmark designed to evaluate the multi-hop heterogeneous reasoning capabilities. Experimental results demonstrate that TableRAG consistently outperforms existing baselines on both public datasets and our HeteQA, establishing a new state-of-the-art for heterogeneous document question answering.
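To make the four-step loop concrete, here is a toy skeleton (decompose, retrieve text, run SQL, compose an intermediate answer). The `llm` stub, the keyword retriever, and the toy schema are hypothetical placeholders under the assumption that tables live in an SQL database; this is not the TableRAG implementation.

```python
# Skeleton of a decompose -> retrieve -> SQL -> compose step (illustrative only).
import sqlite3

def llm(prompt: str) -> str:
    # Placeholder for a real LLM call; returns canned text for this toy demo.
    if prompt.startswith("Write SQL"):
        return "SELECT COUNT(*) FROM employees"
    return f"[LLM output for: {prompt[:40]}...]"

def table_rag_step(question: str, text_index: dict, db: sqlite3.Connection) -> str:
    # (1) context-sensitive query decomposition
    sub_question = llm(f"Decompose the question given the current context: {question}")
    # (2) naive keyword-based text retrieval (stand-in for a real retriever)
    passages = [t for t in text_index.values()
                if any(w in t.lower() for w in question.lower().split())]
    # (3) SQL programming and execution over the tabular component
    sql = llm(f"Write SQL answering: {sub_question}")
    rows = db.execute(sql).fetchall()
    # (4) compositional intermediate answer generation
    return llm(f"Combine passages {passages} and rows {rows} into an answer.")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT)")
conn.execute("INSERT INTO employees VALUES ('Ada', 'R&D'), ('Bob', 'HR')")
print(table_rag_step("How many employees are there?",
                     {"doc1": "Employees work in R&D and HR."}, conn))
```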
pdf
bib
abs
Retrieval Enhanced Feedback via In-context Neural Error-book
Jongyeop Hyun
|
Bumsoo Kim
Recent advancements in Large Language Models (LLMs) have significantly improved reasoning capabilities, with in-context learning (ICL) emerging as a key technique for adaptation without retraining. While previous works have focused on leveraging correct examples, recent research highlights the importance of learning from errors to enhance performance. However, existing methods lack a structured framework for analyzing and mitigating errors, particularly in Multimodal Large Language Models (MLLMs), where integrating visual and textual inputs adds complexity. To address this issue, we propose REFINE: Retrieval-Enhanced Feedback via In-context Neural Error-book, a teacher-student framework that systematically structures errors and provides targeted feedback. REFINE introduces three systematic queries to construct structured feedback—Feed-Target, Feed-Check, and Feed-Path—to enhance multimodal reasoning by prioritizing relevant visual information, diagnosing critical failure points, and formulating corrective actions. Unlike prior approaches that rely on redundant retrievals, REFINE optimizes structured feedback retrieval, improving inference efficiency, token usage, and scalability. Our results demonstrate substantial speedup, reduced computational costs, and successful generalization, highlighting REFINE’s potential for enhancing multimodal reasoning.
pdf
bib
abs
Improve LLM-as-a-Judge Ability as a General Ability
Jiachen Yu
|
Shaoning Sun
|
Xiaohui Hu
|
Jiaxu Yan
|
Kaidong Yu
|
Xuelong Li
LLM-as-a-Judge leverages the generative and reasoning capabilities of large language models (LLMs) to evaluate LLM responses across diverse scenarios, providing accurate preference signals. This approach plays a vital role in aligning LLMs with human values. Recent studies have proposed many methods to train LLMs as generative judges, but most of them are data-intensive or lack accuracy, and focus only on the judging ability of LLMs. In this work, we conceptualize judging ability as a general capability of LLMs and adapt the two-stage SFT-DPO training framework—commonly used in traditional general model training—to the development of judge models. We introduce an efficient data synthesis method, which includes the automatic generation of various judge templates and dual verification for data accuracy and consistency. A difficulty-based data stratification strategy allows us to distribute more effective data to the SFT and DPO stages respectively. Experimental results demonstrate that our approach, utilizing only about 2% to 40% of the data required by other methods, achieves SOTA performance on RewardBench. Furthermore, our training method enhances the general capabilities of the model by constructing complicated judge tasks with CoT outputs. We further validate the effectiveness of our model by deploying it to provide reward signals in real-world RLHF scenarios. We will open-source our model weights and training data to facilitate further research.
pdf
bib
abs
G2: Guided Generation for Enhanced Output Diversity in LLMs
Zhiwen Ruan
|
Yixia Li
|
Yefeng Liu
|
Yun Chen
|
Weihua Luo
|
Peng Li
|
Yang Liu
|
Guanhua Chen
Large Language Models (LLMs) have demonstrated exceptional performance across diverse natural language processing tasks. However, these models exhibit a critical limitation in output diversity, often generating highly similar content across multiple attempts. This limitation significantly affects tasks requiring diverse outputs, from creative writing to reasoning. Existing solutions, like temperature scaling, enhance diversity by modifying probability distributions but compromise output quality. We propose Guide-to-Generation (G2), a training-free plug-and-play method that enhances output diversity while preserving generation quality. G2 employs a base generator alongside dual Guides, which guide the generation process through decoding-based interventions to encourage more diverse outputs conditioned on the original query. Comprehensive experiments demonstrate that G2 effectively improves output diversity while maintaining an optimal balance between diversity and quality.
pdf
bib
abs
ToolSafety: A Comprehensive Dataset for Enhancing Safety in LLM-Based Agent Tool Invocations
Yuejin Xie
|
Youliang Yuan
|
Wenxuan Wang
|
Fan Mo
|
Jianmin Guo
|
Pinjia He
LLMs are evolving into assistants that leverage tools, significantly expanding their capabilities but also introducing critical safety risks. Current models exhibit notable vulnerabilities, particularly in maintaining safety during multi-step tool interactions and in scenarios involving indirect harm. This paper introduces ToolSafety, a safety fine-tuning dataset designed to address these limitations. ToolSafety comprises 5,668 direct harm samples, 4,311 indirect harm samples, and 4,311 multi-step samples. Key features include support for multi-step safety through synthesized trajectories and realistic, context-aware sample generation. We fine-tuned LLaMA3.1-8B-Instruct and Qwen2.5-7B-Instruct using ToolSafety. Experimental results demonstrate that these models effectively maintain safety in multi-step and indirect harm scenarios. Further analysis into superficial alignment across different decoding strategies, languages, and jailbreak prompts indicates that while some risks persist, the issue is less severe than in multi-step settings. Overall, our approach significantly improves safety across various scenarios with small impact on helpfulness, positioning ToolSafety as a valuable resource for building safer tool-using AI systems.
pdf
bib
abs
Learning to See through Sound: From VggCaps to Multi2Cap for Richer Automated Audio Captioning
Sangyeon Cho
|
Mingi Kim
|
Jinkwon Hwang
|
Jaehoon Go
|
Minuk Ma
|
Sunjae Yoon
|
Junyeong Kim
Automated Audio Captioning (AAC) aims to generate natural language descriptions of audio content, enabling machines to interpret and communicate complex acoustic scenes. However, current AAC datasets often suffer from short and simplistic captions, limiting model expressiveness and semantic depth. To address this, we introduce **VggCaps**, a new multi-modal dataset that pairs audio with corresponding video and leverages large language models (LLMs) to generate rich, descriptive captions. VggCaps significantly outperforms existing benchmarks in caption length, lexical diversity, and human-rated quality. Furthermore, we propose **Multi2Cap**, a novel AAC framework that learns audio-visual representations through an AV-grounding module during pre-training and reconstructs visual semantics using audio alone at inference. This enables visually grounded captioning in audio-only scenarios. Experimental results on Clotho and AudioCaps demonstrate that Multi2Cap achieves state-of-the-art performance across multiple metrics, validating the effectiveness of cross-modal supervision and LLM-based generation in advancing AAC.
pdf
bib
abs
Towards Optimal Evaluation Efficiency for Large Language Models
Guohong Li
|
Deyi Xiong
Comprehensive evaluation of large language models (LLMs) typically requires large-scale benchmarks, which are costly in terms of both data annotation and the computational resources needed for evaluation. To mitigate these challenges, we propose an efficient evaluation framework that selects a question subset based on pre-tested results, thereby reducing costs. We formulate the subset selection problem as an optimization task, solved using optimal random sampling and simulated annealing algorithms. We compare our approach with prior clustering-based methods and assess their reliability in terms of score accuracy. Additionally, we perform semantic analysis and evaluate whether the selected subsets preserve the semantic information of the original benchmark using the Wasserstein distance. Experimental results show that our method outperforms previous approaches in terms of reliability, as measured by the L2 norm. Our study provides an optimized perspective for balancing evaluation efficiency and reliability in LLM assessments, while revealing the relationship between optimization methods and semantic retention.
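As a toy sketch of simulated annealing for benchmark subset selection, the snippet below searches for a fixed-size question subset whose mean score tracks the full-benchmark mean. The objective, schedule, and synthetic scores are assumptions for illustration; the paper's exact formulation may differ.

```python
# Toy simulated-annealing subset selection (illustrative only).
import random, math

def anneal_subset(scores, subset_size, steps=5000, temp=1.0, cooling=0.999):
    """scores: per-question accuracies from pre-tested models (values in [0, 1])."""
    full_mean = sum(scores) / len(scores)
    idx = list(range(len(scores)))
    subset = set(random.sample(idx, subset_size))

    def error(s):
        # How far the subset's mean score drifts from the full-benchmark mean.
        return abs(sum(scores[i] for i in s) / subset_size - full_mean)

    best, best_err = set(subset), error(subset)
    for _ in range(steps):
        out_q = random.choice(tuple(subset))
        in_q = random.choice([i for i in idx if i not in subset])
        cand = (subset - {out_q}) | {in_q}
        delta = error(cand) - error(subset)
        # Accept improvements always, and worse moves with a temperature-dependent probability.
        if delta < 0 or random.random() < math.exp(-delta / max(temp, 1e-9)):
            subset = cand
            if error(subset) < best_err:
                best, best_err = set(subset), error(subset)
        temp *= cooling
    return sorted(best)

# Toy usage with synthetic per-question accuracies (stand-ins for real pre-tested results).
random.seed(0)
demo_scores = [random.random() for _ in range(200)]
print(anneal_subset(demo_scores, subset_size=20))
```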
pdf
bib
abs
MMAPG: A Training-Free Framework for Multimodal Multi-hop Question Answering via Adaptive Planning Graphs
Yiheng Hu
|
Xiaoyang Wang
|
Qing Liu
|
Xiwei Xu
|
Qian Fu
|
Wenjie Zhang
|
Liming Zhu
Multimodal Multi-hop question answering requires integrating information from diverse sources, such as images and texts, to derive answers. Existing methods typically rely on sequential retrieval and reasoning, where each step builds on the previous output. However, this single-path paradigm makes them vulnerable to errors due to misleading intermediate steps. Moreover, developing multimodal models can be computationally expensive, often requiring extensive training. To address these limitations, we propose a training-free framework guided by an Adaptive Planning Graph, which consists of planning, retrieval and reasoning modules. The planning module analyzes the current state of the Adaptive Planning Graph and determines the next action and where to expand the graph, which enables dynamic and flexible exploration of reasoning paths. To handle retrieval from text queries to unspecified target modalities, we devise modality-specific strategies that dynamically adapt to distinct data types. Our approach preserves the characteristics of multimodal information without costly task-specific training, enabling seamless integration with up-to-date models. Finally, experiments on MultimodalQA and WebQA show that our approach matches or outperforms existing models that rely on training.
pdf
bib
abs
Mixture-of-Clustered-Experts: Advancing Expert Specialization and Generalization in Instruction Tuning
Sugyeong Eo
|
Jung Jun Lee
|
Chanjun Park
|
Heuiseok Lim
A sparse Mixture-of-Experts (MoE) architecture has emerged as a highly scalable solution by conditionally activating sub-modules without a proportional increase in computational costs. However, improving expert specialization to enhance performance and generalization remains a challenge for MoE, especially in instruction tuning scenarios characterized by significant input heterogeneity. In this work, we propose the Mixture-of-Clustered-Experts (MoCE) to address this limitation through a dual-stage routing mechanism. The first stage in the mechanism performs expert group routing based on sequence-level features, while the second stage activates the top-k experts within the group at the token level. This approach enables the effective partitioning of heterogeneous inputs based on their knowledge requirements, encouraging expert group specialization while maintaining the advantages of token-level routing. We evaluate MoCE across a comprehensive set of benchmarks, demonstrating its consistent superiority over strong baselines and its enhanced generalization capabilities. Detailed analysis further highlights the robustness and effectiveness of MoCE.
pdf
bib
abs
Process-Supervised Reinforcement Learning for Code Generation
Yufan Ye
|
Ting Zhang
|
Wenbin Jiang
|
Hua Huang
Existing reinforcement learning (RL) strategies based on outcome supervision have proven effective in enhancing the performance of large language models (LLMs) for code generation. While reinforcement learning based on process supervision shows great potential in multi-step reasoning tasks, its effectiveness in the field of code generation still lacks sufficient exploration and verification. The primary obstacle stems from the resource-intensive nature of constructing a high-quality process-supervised reward dataset, which requires substantial human expertise and computational resources. To overcome this challenge, this paper proposes a “mutation/refactoring-execution verification” strategy. Specifically, the teacher model is used to mutate and refactor statement lines or blocks, and the execution results of the compiler are used to automatically label them, thus generating a process-supervised reward dataset. Based on this dataset, we carry out a series of RL experiments. The experimental results show that, compared with the method relying only on outcome supervision, reinforcement learning based on process supervision performs better in handling complex code generation tasks. In addition, this paper confirms, for the first time, the advantages of the Direct Preference Optimization (DPO) method in the RL task of code generation based on process supervision, providing new ideas and directions for code generation research.
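A toy sketch of the underlying labeling idea: mutate a statement and let execution against tests decide the label automatically. The paper uses a teacher model for mutation and refactoring; the string replacement and tiny test below are simplified stand-ins.

```python
# Illustrative: execution-verified labels for original vs. mutated code (not the paper's pipeline).
def run_tests(src: str) -> bool:
    env = {}
    try:
        exec(src, env)                      # define the function under test
        return env["add"](2, 3) == 5        # toy unit test
    except Exception:
        return False

original = "def add(a, b):\n    return a + b\n"
mutated = original.replace("a + b", "a - b")   # trivial stand-in for a model-generated mutation

# Execution results provide automatic process-level labels.
labels = {"original": run_tests(original), "mutated": run_tests(mutated)}
print(labels)   # {'original': True, 'mutated': False}
```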
pdf
bib
abs
MuCAL: Contrastive Alignment for Preference-Driven KG-to-Text Generation
Yifei Song
|
Claire Gardent
We propose MuCAL (Multilingual Contrastive Alignment Learning) to tackle the challenge of Knowledge Graphs (KG)-to-Text generation using preference learning, where reliable preference data is scarce. MuCAL is a multilingual KG/Text alignment model achieving robust cross-modal retrieval across multiple languages and difficulty levels. Building on MuCAL, we automatically create preference data by ranking candidate texts from three LLMs (Qwen2.5, DeepSeek-v3, Llama-3). We then apply Direct Preference Optimization (DPO) on these preference data, bypassing typical reward modelling steps to directly align generation outputs with graph semantics. Extensive experiments on KG-to-English Text generation show two main advantages: (1) Our KG/text similarity models provide a better signal for DPO than similar existing metrics, and (2) significantly better generalisation on out-of-domain datasets compared to standard instruction tuning. Our results highlight MuCAL’s effectiveness in supporting preference learning for KG-to-English Text generation and lay the foundation for future multilingual extensions. Code and data are available at https://github.com/MeloS7/MuCAL_DPO/tree/main.
pdf
bib
abs
Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models
Wei Wang
|
Zhaowei Li
|
Qi Xu
|
Linfeng Li
|
YiQing Cai
|
Botian Jiang
|
Hang Song
|
Xingcan Hu
|
Pengyu Wang
|
Li Xiao
Multi-modal large language models (MLLMs) have achieved remarkable success in fine-grained visual understanding across a range of tasks. However, they often encounter significant challenges due to inadequate alignment for fine-grained knowledge, which restricts their ability to accurately capture local details and attain a comprehensive global perception. While recent advancements have focused on aligning object expressions with grounding information, they typically lack explicit integration of object images, which contain rich information beyond mere text or coordinates. To bridge this gap, we introduce a novel fine-grained visual knowledge alignment method that effectively aligns and integrates multi-scale knowledge of objects, including texts, coordinates, and images. This innovative method is underpinned by our multi-scale fine-grained enhancement data synthesis pipeline, which provides over 300K essential training samples to enhance alignment and improve overall performance. Furthermore, we present TinyGroundingGPT, a series of compact models optimized for high-level alignments. With a scale of approximately 3B parameters, TinyGroundingGPT achieves outstanding results in grounding tasks while delivering performance comparable to larger MLLMs in complex visual scenarios.
pdf
bib
abs
Thought calibration: Efficient and confident test-time scaling
Menghua Wu
|
Cai Zhou
|
Stephen Bates
|
Tommi Jaakkola
Reasoning large language models achieve impressive test-time scaling by thinking for longer, but this performance gain comes at significant compute cost. Directly limiting the test-time budget hurts overall performance, but not all problems are equally difficult. We propose thought calibration to decide dynamically when thinking can be terminated. To calibrate our decision rule, we view a language model’s growing body of thoughts as a nested sequence of reasoning trees, where the goal is to identify the point at which novel reasoning plateaus. We realize this framework through lightweight probes that operate on top of the language model’s hidden representations, which are informative of both the reasoning structure and the overall consistency of the response. Based on three reasoning language models and four datasets, thought calibration preserves model performance with up to a 60% reduction in thinking tokens on in-distribution data, and up to a 20% reduction on out-of-distribution data.
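A minimal sketch of a lightweight stopping probe over hidden representations: at each thinking step, a linear probe predicts whether novel reasoning has plateaued, and generation stops once that probability crosses a threshold. The probe weights, features, and threshold here are hypothetical placeholders, not the calibrated rule from the paper.

```python
# Illustrative stopping probe over per-step hidden states (not the paper's calibrated probe).
import numpy as np

class StopProbe:
    def __init__(self, w: np.ndarray, b: float, threshold: float = 0.9):
        self.w, self.b, self.threshold = w, b, threshold

    def should_stop(self, hidden_state: np.ndarray) -> bool:
        # Logistic probe over the model's hidden representation at this thinking step.
        p = 1.0 / (1.0 + np.exp(-(hidden_state @ self.w + self.b)))
        return p > self.threshold

# Toy usage with random vectors standing in for real hidden states.
rng = np.random.default_rng(0)
probe = StopProbe(w=rng.normal(size=16), b=0.0)
for step in range(32):
    h = rng.normal(size=16)          # placeholder for the step's hidden state
    if probe.should_stop(h):
        print(f"terminate thinking at step {step}")
        break
```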
pdf
bib
abs
Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation
Ziling Cheng
|
Meng Cao
|
Leila Pishdad
|
Yanshuai Cao
|
Jackie CK Cheung
Final-answer-based metrics are commonly used for evaluating large language models (LLMs) on math word problems, often taken as proxies for reasoning ability. However, such metrics conflate two distinct sub-skills: abstract formulation (capturing mathematical relationships using expressions) and arithmetic computation (executing the calculations). Through a disentangled evaluation on GSM8K and SVAMP, we find that the final-answer accuracy of Llama-3 and Qwen2.5 (1B-32B) without CoT is overwhelmingly bottlenecked by the arithmetic computation step and not by the abstract formulation step. Contrary to the common belief, we show that CoT primarily aids in computation, with limited impact on abstract formulation. Mechanistically, we show that these two skills are composed conjunctively even in a single forward pass without any reasoning steps via an abstract-then-compute mechanism: models first capture problem abstractions, then handle computation. Causal patching confirms these abstractions are present, transferable, composable, and precede computation. These behavioural and mechanistic findings highlight the need for disentangled evaluation to accurately assess LLM reasoning and to guide future improvements.
pdf
bib
abs
QCRD: Quality-guided Contrastive Rationale Distillation for Large Language Models
Wei Wang
|
Zhaowei Li
|
Qi Xu
|
YiQing Cai
|
Hang Song
|
Qi Qi
|
Ran Zhou
|
Zhida Huang
|
Tao Wang
|
Li Xiao
The deployment of large language models (LLMs) faces considerable challenges concerning resource constraints and inference efficiency. Recent research has increasingly focused on smaller, task-specific models enhanced by distilling knowledge from LLMs. However, prior studies have often overlooked the diversity and quality of knowledge, especially the untapped potential of negative knowledge. Constructing effective negative knowledge remains severely understudied. In this paper, we introduce a novel framework called quality-guided contrastive rationale distillation aimed at enhancing reasoning capabilities through contrastive knowledge learning. For positive knowledge, we enrich its diversity through temperature sampling and employ self-consistency for further denoising and refinement. For negative knowledge, we propose an innovative self-adversarial approach that generates low-quality rationales by sampling previous iterations of smaller language models, embracing the idea that one can learn from one’s own weaknesses. A contrastive loss is developed to distill both positive and negative knowledge into smaller language models, where an online-updating discriminator is integrated to assess qualities of rationales and assign them appropriate weights, optimizing the training process. Through extensive experiments across multiple reasoning tasks, we demonstrate that our method consistently outperforms existing distillation techniques, yielding higher-quality rationales.
pdf
bib
abs
SHARP: Steering Hallucination in LVLMs via Representation Engineering
Junfei Wu
|
Yue Ding
|
Guofan Liu
|
Tianze Xia
|
Ziyue Huang
|
Dianbo Sui
|
Qiang Liu
|
Shu Wu
|
Liang Wang
|
Tieniu Tan
Despite their impressive capabilities, Large Vision-Language Models (LVLMs) frequently generate responses that are plausible but incorrect or unsupported—commonly referred to as hallucinations. In this study, we investigate whether different types of hallucinations are reflected in the model’s internal representations by probing their encoded features. We focus on two key causes of hallucination in multimodal reasoning: (1) over-reliance on textual priors and (2) preference for user prompts over conflicting visual evidence—factors identified in prior work as frequent and impactful. Our probing results reveal that hallucinations exhibit distinguishable representational patterns, suggesting the potential for a representation-level approach to characterize and mitigate them. Motivated by these findings, we propose Steering HAllucination via RePresentation Engineering (SHARP), a representation-level intervention framework that modulates hallucination-related features during inference. SHARP identifies functional representations responsible for prior-driven biases and visual-context conflicts, and jointly adjusts the model’s internal activations in real time. We evaluate our approach extensively on three large vision-language models across multiple benchmarks. Experimental results demonstrate that SHARP effectively reduces hallucinations while preserving the performance and generalization capabilities of LVLMs.
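For intuition, a conceptual sketch of a representation-level intervention: remove the component of an intermediate activation that lies along a "hallucination direction" at inference time. The direction, layer choice, and strength below are hypothetical and would in practice be estimated (e.g., by contrasting activations of hallucinated versus faithful responses); this is not the SHARP implementation.

```python
# Conceptual activation-steering sketch (illustrative only).
import numpy as np

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Remove (alpha-scaled) the component of `hidden` along a normalized feature direction."""
    d = direction / (np.linalg.norm(direction) + 1e-9)
    return hidden - alpha * (hidden @ d) * d

# Toy usage: random vectors stand in for a real activation and an estimated direction.
h = np.random.default_rng(1).normal(size=64)
d = np.random.default_rng(2).normal(size=64)
h_steered = steer(h, d, alpha=0.8)
```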
pdf
bib
abs
Think, Verbalize, then Speak: Bridging Complex Thoughts and Comprehensible Speech
Tony Woo
|
Sehun Lee
|
Kang-wook Kim
|
Gunhee Kim
Spoken dialogue systems increasingly employ large language models (LLMs) to leverage their advanced reasoning capabilities. However, direct application of LLMs in spoken communication often yields suboptimal results due to mismatches between optimal textual and verbal delivery. While existing approaches adapt LLMs to produce speech-friendly outputs, their impact on reasoning performance remains underexplored. In this work, we propose **Think-Verbalize-Speak**, a framework that decouples reasoning from spoken delivery to preserve the full reasoning capacity of LLMs. Central to our method is *verbalizing*, an intermediate step that translates thoughts into natural, speech-ready text. We also introduce **ReVerT**, a latency-efficient verbalizer based on incremental and asynchronous summarization. Experiments across multiple benchmarks show that our method enhances speech naturalness and conciseness with minimal impact on reasoning. The project page with the dataset and the source code is available at https://yhytoto12.github.io/TVS-ReVerT.
pdf
bib
abs
Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings
Safal Shrestha
|
Minwu Kim
|
Aadim Nepal
|
Anubhav Shrestha
|
Keith W. Ross
Designing effective reasoning-capable LLMs typically requires training using Reinforcement Learning with Verifiable Rewards (RLVR) or distillation with carefully curated Long Chain of Thoughts (CoT), both of which depend heavily on extensive training data. This creates a major challenge when the amount of quality training data is scarce. We propose a sample-efficient, two-stage training strategy to develop reasoning LLMs under limited supervision. In the first stage, we “warm up” the model by distilling Long CoTs from a toy domain, namely Knights & Knaves (K&K) logic puzzles, to acquire general reasoning skills. In the second stage, we apply RLVR to the warmed-up model using a limited set of target-domain examples. Our experiments demonstrate that this two-phase approach offers several benefits: (i) the warmup phase alone facilitates generalized reasoning, leading to performance improvements across a range of tasks, including MATH, HumanEval+, and MMLU-Pro; (ii) when both the base model and the warmed-up model are RLVR trained on the same small dataset (≤100 examples), the warmed-up model consistently outperforms the base model; (iii) warming up before RLVR training allows a model to maintain cross-domain generalizability even after training on a specific domain; (iv) introducing warmup in the pipeline improves not only accuracy but also overall sample efficiency during RLVR training. The results in this paper highlight the promise of warmup for building robust reasoning LLMs in data-scarce environments.
pdf
bib
abs
PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides
Hao Zheng
|
Xinyan Guan
|
Hao Kong
|
Wenkai Zhang
|
Jia Zheng
|
Weixiang Zhou
|
Hongyu Lin
|
Yaojie Lu
|
Xianpei Han
|
Le Sun
Automatically generating presentations from documents is a challenging task that requires accommodating content quality, visual appeal, and structural coherence. Existing methods primarily focus on improving and evaluating the content quality in isolation, overlooking visual appeal and structural coherence, which limits their practical applicability. To address these limitations, we propose PPTAgent, which comprehensively improves presentation generation through a two-stage, edit-based approach inspired by human workflows. PPTAgent first analyzes reference presentations to extract slide-level functional types and content schemas, then drafts an outline and iteratively generates editing actions based on selected reference slides to create new slides. To comprehensively evaluate the quality of generated presentations, we further introduce PPTEval, an evaluation framework that assesses presentations across three dimensions: Content, Design, and Coherence. Results demonstrate that PPTAgent significantly outperforms existing automatic presentation generation methods across all three dimensions.
pdf
bib
abs
SWAM: Adaptive Sliding Window and Memory-Augmented Attention Model for Rumor Detection
Mei Guo
|
Chen Chen
|
Chunyan Hou
|
Yike Wu
|
Xiaojie Yuan
Detecting rumors on social media has become a critical task in combating misinformation. Existing propagation-based rumor detection methods often focus on the static propagation graph, overlooking that rumor propagation is inherently dynamic and incremental in the real world. Recently, propagation-based rumor detection models have attempted to use dynamic graphs associated with coarse-grained temporal information. However, these methods fail to capture the long-term time dependency and detailed temporal features of propagation. To address these issues, we propose a novel adaptive Sliding Window and memory-augmented Attention Model (SWAM) for rumor detection. The adaptive sliding window divides the sequence of posts into consecutive disjoint windows based on the propagation rate of nodes. We also propose a memory-augmented attention mechanism to capture the long-term dependency and the depth of nodes in the propagation graph. Multi-head attention is applied between nodes in the memory bank and incremental nodes to iteratively update the memory bank, and the depth information of nodes is also considered. Finally, the propagation features of nodes in the memory bank are utilized for rumor detection. Experimental results on two public real-world datasets demonstrate the effectiveness of our model compared with state-of-the-art baselines.
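A rough sketch of what an adaptive sliding window over a post sequence could look like: windows close earlier when posts arrive quickly (high propagation rate). The rate-to-span rule, base span, and minimum window size below are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative adaptive windowing of timestamped posts (not SWAM's exact rule).
def adaptive_windows(timestamps, base_span=60.0, min_posts=3):
    """Split a sorted list of post timestamps (seconds) into disjoint windows
    whose span shrinks as the local posting rate grows."""
    windows, current, start = [], [], None
    for t in timestamps:
        if start is None:
            start, current = t, [t]
            continue
        rate = len(current) / max(t - start, 1e-6)   # posts per second so far
        span = base_span / (1.0 + rate)              # faster spread -> shorter window
        if t - start <= span or len(current) < min_posts:
            current.append(t)
        else:
            windows.append(current)
            start, current = t, [t]
    if current:
        windows.append(current)
    return windows

print(adaptive_windows([0, 5, 10, 12, 13, 90, 200, 210]))
```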
pdf
bib
abs
HydraRAG: Structured Cross-Source Enhanced Large Language Model Reasoning
Xingyu Tan
|
Xiaoyang Wang
|
Qing Liu
|
Xiwei Xu
|
Xin Yuan
|
Liming Zhu
|
Wenjie Zhang
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge. Current hybrid RAG systems retrieve evidence from both knowledge graphs (KGs) and text documents to support LLM reasoning. However, they face challenges such as handling multi-hop reasoning, multi-entity questions, multi-source verification, and effective graph utilization. To address these limitations, we present HydraRAG, a training-free framework that unifies graph topology, document semantics, and source reliability to support deep, faithful reasoning in LLMs. HydraRAG handles multi-hop and multi-entity problems through agent-driven exploration that combines structured and unstructured retrieval, increasing both the diversity and precision of evidence. To tackle multi-source verification, HydraRAG uses tri-factor cross-source verification (source trustworthiness assessment, cross-source corroboration, and entity-path alignment) to balance topic relevance with cross-modal agreement. By leveraging graph structure, HydraRAG fuses heterogeneous sources, guides efficient exploration, and prunes noise early. Comprehensive experiments on seven benchmark datasets show that HydraRAG achieves overall state-of-the-art results on all benchmarks with GPT-3.5-Turbo, outperforming the strong hybrid baseline ToG-2 by an average of 20.3% and up to 30.1%. Furthermore, HydraRAG enables smaller models (e.g., Llama-3.1-8B) to achieve reasoning performance comparable to that of GPT-4-Turbo. The source code is available at https://stevetantan.github.io/HydraRAG/.
pdf
bib
abs
VRoPE: Rotary Position Embedding for Video Large Language Models
Zikang Liu
|
Longteng Guo
|
Yepeng Tang
|
Tongtian Yue
|
Junxian Cai
|
Kai Ma
|
Qingbin Liu
|
Xi Chen
|
Jing Liu
Rotary Position Embedding (RoPE) has shown strong performance in text-based Large Language Models (LLMs), but extending it to video remains a challenge due to the intricate spatiotemporal structure of video frames. Existing adaptations, such as RoPE-3D, attempt to encode spatial and temporal dimensions separately but suffer from two major limitations: positional bias in attention distribution and disruptions in video-text transitions. To overcome these issues, we propose Video Rotary Position Embedding (VRoPE), a novel positional encoding method tailored for Video-LLMs. Specifically, we introduce a more balanced encoding strategy that mitigates attention biases, ensuring a more uniform distribution of spatial focus. Additionally, our approach restructures positional indices to ensure a smooth transition between video and text tokens. Extensive experiments on different models demonstrate that VRoPE consistently outperforms previous RoPE variants, achieving significant improvements in video understanding, temporal reasoning, and retrieval tasks. Code is available at https://github.com/johncaged/VRoPE.
pdf
bib
abs
SciNLP: A Domain-Specific Benchmark for Full-Text Scientific Entity and Relation Extraction in NLP
Decheng Duan
|
Jitong Peng
|
Yingyi Zhang
|
Chengzhi Zhang
Structured information extraction from scientific literature is crucial for capturing core concepts and emerging trends in specialized fields. While existing datasets aid model development, most focus on specific publication sections due to domain complexity and the high cost of annotating scientific texts. To address this limitation, we introduce SciNLP—a specialized benchmark for full-text entity and relation extraction in the Natural Language Processing (NLP) domain. The dataset comprises 60 manually annotated full-text NLP publications, covering 7,072 entities and 1,826 relations. Compared to existing research, SciNLP is the first dataset providing full-text annotations of entities and their relationships in the NLP domain. To validate the effectiveness of SciNLP, we conducted comparative experiments with similar datasets and evaluated the performance of state-of-the-art supervised models on this dataset. Results reveal varying extraction capabilities of existing models across academic texts of different lengths. Cross-comparisons with existing datasets show that SciNLP achieves significant performance improvements on certain baseline models. Using models trained on SciNLP, we implemented automatic construction of a fine-grained knowledge graph for the NLP domain. Our KG has an average node degree of 3.2 per entity, indicating rich semantic topological information that enhances downstream applications. The dataset is publicly available at: https://github.com/AKADDC/SciNLP.
pdf
bib
abs
Think and Recall: Layer-Level Prompting for Lifelong Model Editing
Jinke Wang
|
Zenan Ying
|
Qi Liu
|
Wei Chen
|
Tong Xu
|
Huijun Hou
|
Zhi Zheng
Lifelong model editing aims to dynamically adjust a model’s output with respect to specific facts, knowledge points, or behaviors, enabling the model to adapt to the ever-changing demands of the real world without requiring retraining. While some retrieval-based methods have demonstrated potential in lifelong editing scenarios by storing edited knowledge in external memory, they often suffer from limitations in usability, such as requiring additional training corpora or lacking support for reversible and detachable edits. To address these issues, we propose a plug-and-play method for knowledge retrieval and storage, i.e., Layer-Level Prompting (LLP), which enables seamless and efficient lifelong model editing. In our LLP framework, the reasoning process of LLMs is divided into two stages, namely knowledge retrieval (Think) and knowledge injection (Recall). Specifically, the knowledge retrieval process is performed in the early layers of the model. Based on the retrieved information, the model is guided to access the updated knowledge stored in the subsequent layer to complete the knowledge editing process. Experimental results demonstrate that our method consistently outperforms existing techniques on lifelong model editing tasks, achieving superior performance on question answering and hallucination benchmarks across different LLMs.
pdf
bib
abs
SPIRIT: Patching Speech Language Models against Jailbreak Attacks
Amirbek Djanibekov
|
Nurdaulet Mukhituly
|
Kentaro Inui
|
Hanan Aldarmaki
|
Nils Lukas
Speech Language Models (SLMs) enable natural interactions via spoken instructions, which more effectively capture user intent by detecting nuances in speech. The richer speech signal introduces new security risks compared to text-based models, as adversaries can better bypass safety mechanisms by injecting imperceptible noise into speech. We analyze adversarial attacks under white-box access and find that SLMs are substantially more vulnerable to jailbreak attacks, which can achieve a perfect 100% attack success rate in some instances. To improve security, we propose post-hoc patching defenses that intervene during inference by modifying the SLM’s activations, improving robustness by up to 99% with (i) negligible impact on utility and (ii) no re-training. We conduct ablation studies to maximize the efficacy of our defenses and improve the utility/security trade-off, validated with large-scale benchmarks unique to SLMs.
pdf
bib
abs
FIRE: Flexible Integration of Data Quality Ratings for Effective Pretraining
Xu Liangyu
|
Xuemiao Zhang
|
Feiyu Duan
|
Sirui Wang
|
Rongxiang Weng
|
Jingang Wang
|
Xunliang Cai
Selecting high-quality data can improve the pretraining efficiency of large language models (LLMs). Existing methods generally rely on heuristic techniques or single quality signals, limiting their ability to evaluate data quality comprehensively. In this work, we propose FIRE, a flexible and scalable framework for integrating multiple data quality raters, which allows for a comprehensive assessment of data quality across various dimensions. FIRE aligns multiple quality signals into a unified space, and integrates diverse data quality raters to provide a comprehensive quality signal for each data point. Further, we introduce a progressive data selection scheme based on FIRE that iteratively refines the selection of high-quality data points. Extensive experiments show that FIRE outperforms other data selection methods and significantly boosts pretrained model performance across a wide range of downstream tasks, while requiring less than 37.5% of the tokens needed by the Random baseline to reach the target performance.
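As a simple illustration of aligning heterogeneous rater scores into one signal, the sketch below maps each rater's raw scores to per-rater ranks on a common [0, 1] scale, averages them, and keeps the top fraction of documents. The rank-based alignment and the single-pass selection are stand-in assumptions; FIRE's actual alignment and progressive selection are more involved.

```python
# Illustrative multi-rater quality aggregation and selection (not FIRE's method).
import numpy as np

def unified_quality(ratings: np.ndarray) -> np.ndarray:
    """ratings: (n_docs, n_raters) raw scores on incompatible scales."""
    n = ratings.shape[0]
    ranks = ratings.argsort(axis=0).argsort(axis=0)   # per-rater rank of each document
    aligned = ranks / (n - 1)                          # map ranks to a shared [0, 1] scale
    return aligned.mean(axis=1)                        # one unified score per document

def select_top(ratings: np.ndarray, keep_frac: float = 0.3) -> np.ndarray:
    scores = unified_quality(ratings)
    k = max(1, int(len(scores) * keep_frac))
    return np.argsort(scores)[::-1][:k]                # indices of the kept documents

# Toy usage: two raters scoring four documents on different scales.
raw = np.array([[0.2, 7.0], [0.9, 3.0], [0.5, 9.5], [0.1, 1.0]])
print(select_top(raw, keep_frac=0.5))
```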
pdf
bib
abs
Multi-Domain Explainability of Preferences
Nitay Calderon
|
Liat Ein-Dor
|
Roi Reichart
Preference mechanisms, such as human preference, LLM-as-a-Judge (LaaJ), and reward models, are central to aligning and evaluating large language models (LLMs). Yet, the underlying concepts that drive these preferences remain poorly understood. In this work, we propose a fully automated method for generating local and global concept-based explanations of preferences across multiple domains. Our method utilizes an LLM to identify concepts (rubrics) that distinguish between chosen and rejected responses, and to represent them with concept-based vectors. To model the relationships between concepts and preferences, we propose a white-box Hierarchical Multi-Domain Regression model that captures both domain-general and domain-specific effects. To evaluate our method, we curate a dataset spanning eight diverse domains and explain twelve mechanisms. Our method achieves strong preference prediction performance, outperforming baselines while also being explainable. Additionally, we assess explanations in two application-driven settings. First, guiding LLM outputs with concepts from LaaJ explanations yields responses that those judges consistently prefer. Second, prompting LaaJs with concepts explaining humans improves their preference predictions. Together, our work establishes a new paradigm for explainability in the era of LLMs.
pdf
bib
abs
Tuning Less, Prompting More: In-Context Preference Learning Pipeline for Natural Language Transformation
Shuyun Yang
|
Yan Zhang
|
Zhengmao Ye
|
Lei Duan
|
Mingjie Tang
Natural language transformation (NLT) tasks, such as machine translation (MT) and text style transfer (TST), require models to generate accurate and contextually appropriate outputs. However, existing approaches face significant challenges, including the computational costs of leveraging large pre-trained models and the limited generalization ability of fine-tuned smaller models. In this paper, we propose a novel framework that combines the flexibility of prompting with the cost-effectiveness of fine-tuning. Our method enhances smaller models by integrating In-Context Examples (ICE) from retrieval, enabling the model to better capture contextual information and align with user-level preferences. We further improve performance through hierarchical contrastive learning and dynamic preference inference mechanisms. Experimental results demonstrate that our approach outperforms existing methods, such as Supervised Fine Tuning (SFT), Direct Preference Optimization (DPO), and Contrastive Preference Optimization (CPO), across both MT and TST tasks, providing a more efficient solution for resource-constrained environments.
pdf
bib
abs
IL-PCSR: Legal Corpus for Prior Case and Statute Retrieval
Shounak Paul
|
Dhananjay Ghumare
|
Pawan Goyal
|
Saptarshi Ghosh
|
Ashutosh Modi
Identifying/retrieving relevant statutes and prior cases/precedents for a given legal situation are common tasks exercised by law practitioners. Researchers to date have addressed the two tasks independently, thus developing completely different datasets and models for each task; however, both retrieval tasks are inherently related, e.g., similar cases tend to cite similar statutes (due to similar factual situations). In this paper, we address this gap. We propose IL-PCSR (Indian Legal corpus for Prior Case and Statute Retrieval), a unique corpus that provides a common testbed for developing models for both tasks (Statute Retrieval and Precedent Retrieval) that can exploit the dependence between the two. We experiment extensively with several baseline models on these tasks, including lexical models, semantic models, and ensembles based on GNNs. Further, to exploit the dependence between the two tasks, we develop an LLM-based re-ranking approach that gives the best performance.
pdf
bib
abs
ESGenius: Benchmarking LLMs on Environmental, Social, and Governance (ESG) and Sustainability Knowledge
Chaoyue He
|
Xin Zhou
|
Yi Wu
|
Xinjia Yu
|
Yan Zhang
|
Lei Zhang
|
Di Wang
|
Shengfei Lyu
|
Hong Xu
|
Wang Xiaoqiao
|
Wei Liu
|
Chunyan Miao
We introduce ESGenius, a comprehensive benchmark for evaluating and enhancing the proficiency of Large Language Models (LLMs) in Environmental, Social, and Governance (ESG) and sustainability-focused question answering. ESGenius comprises two key components: (i) ESGenius-QA, a collection of 1,136 Multiple-Choice Questions (MCQs) generated by LLMs and rigorously validated by domain experts, covering a broad range of ESG pillars and sustainability topics. Each question is systematically linked to its corresponding source text, enabling transparent evaluation and supporting Retrieval-Augmented Generation (RAG) methods; and (ii) ESGenius-Corpus, a meticulously curated repository of 231 foundational frameworks, standards, reports, and recommendation documents from 7 authoritative sources. Moreover, to fully assess the capabilities and adaptation potential of LLMs, we implement a rigorous two-stage evaluation protocol—Zero-Shot and RAG. Extensive experiments across 50 LLMs (0.5B to 671B) demonstrate that state-of-the-art models achieve only moderate performance in zero-shot settings, with accuracies around 55–70%, highlighting a significant knowledge gap for LLMs in this specialized, interdisciplinary domain. However, models employing RAG demonstrate significant performance improvements, particularly for smaller models. For example, DeepSeek-R1-Distill-Qwen-14B improves from 63.82% (zero-shot) to 80.46% with RAG. These results demonstrate the necessity of grounding responses in authoritative sources for enhanced ESG understanding. To the best of our knowledge, ESGenius is the first comprehensive QA benchmark designed to rigorously evaluate LLMs on ESG and sustainability knowledge, providing a critical tool to advance trustworthy AI in this vital domain.
pdf
bib
abs
How Sememic Components Can Benefit Link Prediction for Lexico-Semantic Knowledge Graphs?
Hansi Wang
|
Yue Wang
|
Qiliang Liang
|
Yang Liu
Link Prediction (LP) aims to predict missing triple information within a Knowledge Graph (KG). Existing LP methods have sought to improve performance by integrating structural and textual information. However, for lexico-semantic KGs designed to document fine-grained sense distinctions, these types of information may not be sufficient to support effective LP. From a linguistic perspective, word senses within lexico-semantic relations usually show systematic differences in their sememic components. In light of this, we are motivated to enhance LP with sememe knowledge. We first construct a Sememe Prediction (SP) dataset, SememeDef, for learning such knowledge, and two Chinese datasets, HN7 and CWN5, for LP evaluation. We then propose a method, SememeLP, to fully leverage this knowledge for LP. It consistently and significantly improves LP performance in both English and Chinese, achieving SOTA MRR of 75.1%, 80.5%, and 77.1% on WN18RR, HN7, and CWN5, respectively. Finally, an in-depth analysis makes clear how sememic components can benefit LP for lexico-semantic KGs, representing promising progress toward their completion.
pdf
bib
abs
WISE: Weak-Supervision-Guided Step-by-Step Explanations for Multimodal LLMs in Image Classification
Yiwen Jiang
|
Deval Mehta
|
Siyuan Yan
|
Yaling Shen
|
Zimu Wang
|
Zongyuan Ge
Multimodal Large Language Models (MLLMs) have shown promise in visual-textual reasoning, with Multimodal Chain-of-Thought (MCoT) prompting significantly enhancing interpretability. However, existing MCoT methods rely on rationale-rich datasets and largely focus on inter-object reasoning, overlooking the intra-object understanding crucial for image classification. To address this gap, we propose WISE, a Weak-supervision-guided Step-by-step Explanation method that augments any image classification dataset with MCoTs by reformulating the concept-based representations from Concept Bottleneck Models (CBMs) into concise, interpretable reasoning chains under weak supervision. Experiments across ten datasets show that our generated MCoTs not only improve interpretability by 37% but also lead to gains in classification accuracy when used to fine-tune MLLMs. Our work bridges concept-based interpretability and generative MCoT reasoning, providing a generalizable framework for enhancing MLLMs in fine-grained visual understanding.
pdf
bib
abs
Calibration Across Layers: Understanding Calibration Evolution in LLMs
Abhinav Joshi
|
Areeb Ahmad
|
Ashutosh Modi
Large Language Models (LLMs) have demonstrated inherent calibration capabilities, where predicted probabilities align well with correctness, despite prior findings that deep neural networks are often overconfident. Recent studies have linked this behavior to specific components in the final layer, such as entropy neurons and the unembedding matrix’s null space. In this work, we provide a complementary perspective by investigating how calibration evolves throughout the network’s depth. Analyzing multiple open-weight models on the MMLU benchmark, we uncover a distinct confidence correction phase in the upper/later layers, where model confidence is actively recalibrated after decision certainty has been reached. Furthermore, we identify a low-dimensional calibration direction in the residual stream whose perturbation significantly improves calibration metrics (ECE and MCE) without harming accuracy. Our findings suggest that calibration is a distributed phenomenon, shaped throughout the network’s forward pass, not just in its final projection, providing new insights into how confidence-regulating mechanisms operate within LLMs.
pdf
bib
abs
The discordance between embedded ethics and cultural inference in large language models
Aida Ramezani
|
Yang Xu
Effective interactions between artificial intelligence (AI) and humans require an equitable and accurate representation of diverse cultures. It is known that current AI, particularly large language models (LLMs), possesses some degrees of cultural knowledge but not without limitations. We present a framework aimed at understanding the origin of these limitations. We hypothesize that there is a fundamental discordance between embedded ethics—how LLMs represent right versus wrong, and cultural inference—how LLMs infer cultural knowledge, specifically cultural norms. We demonstrate this by extracting low-dimensional subspaces that embed ethical principles of LLMs based on established benchmarks. We then show that how LLMs make errors in culturally distinctive scenarios significantly correlates with how they represent cultural norms with respect to these embedded ethics subspaces. Furthermore, we show that coercing cultural norms to be more aligned with the embedded ethics increases LLM performance in cultural inference. Our analyses of 12 language models, two large-scale cultural benchmarks spanning 75 countries and two ethical datasets indicate that 1) the ethics-culture discordance tends to be exacerbated in instruct-tuned models, and 2) how current LLMs represent ethics can impose limitations on their adaptation to diverse cultures particularly pertaining to non-Western and low-income regions.
pdf
bib
abs
SSA: Semantic Contamination of LLM-Driven Fake News Detection
Cheng Xu
|
Nan Yan
|
Shuhao Guan
|
Yuke Mei
|
Tahar Kechadi
Benchmark data contamination (BDC) silently inflates the evaluation performance of large language models (LLMs), yet current work on BDC has centered on direct token overlap (data/label level), leaving the subtler and equally harmful semantic-level BDC largely unexplored. This gap is critical in the fake news detection task, where prior exposure to semantic BDC lets a model “remember” the answer instead of reasoning. In this work, (1) we are the first to formally define semantic contamination for this task, and (2) we introduce the Semantic Sensitivity Amplifier (SSA), a lightweight, model-agnostic framework that detects BDC risks from the semantic to the label level via an entity shift perturbation and a comprehensive, interpretable metric, the SSA Factor. Evaluating 45 variants of nine LLMs (0.5B–72B parameters) across four BDC levels, we find that LIAR2 accuracy climbs monotonically with injected contamination, while the SSA Factor escalates in near-perfect lock-step (r≥.97 for models ≥3B, p<.05; ρ≥.9 overall, p<.05). These results show that SSA provides a sensitive and scalable audit of comprehensive BDC risk and paves the way for more integrity-preserving evaluation of LLM-driven fake news detection.
pdf
bib
abs
Logits-Based Finetuning
Jingyao Li
|
Senqiao Yang
|
Sitong Wu
|
Han Shi
|
Chuanyang Zheng
|
Hong Xu
|
Jiaya Jia
In recent years, developing compact and efficient large language models (LLMs) has emerged as a thriving area of research. However, traditional Supervised Fine-Tuning (SFT), which relies on singular ground truth labels, often fails to capture token-level dependencies and linguistic diversity. To address these limitations, we propose a logits-based fine-tuning framework that integrates the strengths of supervised learning and knowledge distillation. Our approach constructs enriched training targets by combining teacher logits with ground truth labels, preserving both correctness and linguistic diversity. This ensures more reliable and effective training. To validate our approach, we constructed a large-scale 1.2M logits dataset and trained a series of science-focused models. Experimental results demonstrate that our method achieves significant improvements over current SOTA, with accuracy gains of 18% on Mawps and 22.7% on TabMWP. Across nine widely used mathematical benchmarks, our method consistently outperforms prior SFT models, achieving an average improvement of 7.28%. All code and datasets will be open-sourced.
pdf
bib
abs
STARE at the Structure: Steering ICL Exemplar Selection with Structural Alignment
Jiaqian Li
|
Qisheng Hu
|
Jing Li
|
Wenya Wang
In-Context Learning (ICL) has become a powerful paradigm that enables LLMs to perform a wide range of tasks without task-specific fine-tuning. However, the effectiveness of ICL heavily depends on the quality of exemplar selection. In particular, for structured prediction tasks such as semantic parsing, existing ICL selection strategies often overlook structural alignment, leading to suboptimal performance and poor generalization. To address this issue, we propose a novel two-stage exemplar selection strategy that achieves a strong balance between efficiency, generalizability, and performance. First, we fine-tune a BERT-based retriever using structure-aware supervision, guiding it to select exemplars that are both semantically relevant and structurally aligned. Then, we enhance the retriever with a plug-in module, which amplifies syntactically meaningful information in the hidden representations. This plug-in is model-agnostic, requires minimal overhead, and can be seamlessly integrated into existing pipelines. Experiments on four benchmarks spanning three semantic parsing tasks demonstrate that our method consistently outperforms existing baselines with multiple recent LLMs as inference-time models.
pdf
bib
abs
PPC-GPT: Federated Task-Specific Compression of Large Language Models via Pruning and Chain-of-Thought Distillation
Tao Fan
|
Guoqiang Ma
|
Yuanfeng Song
|
Lixin Fan
|
Qiang Yang
Compressing Large Language Models (LLMs) into task-specific Small Language Models (SLMs) encounters two significant challenges: safeguarding domain-specific knowledge privacy and managing limited resources. To tackle these challenges, we propose PPC-GPT, a novel unified framework that systematically addresses both privacy preservation and model compression in federated settings. PPC-GPT works on a server-client federated architecture, where the client sends differentially private (DP) perturbed task-specific data to the server’s LLM. The LLM then generates synthetic data along with their corresponding rationales. This synthetic data is subsequently used for both LLM pruning and retraining processes. Our framework’s key innovation lies in its holistic integration of privacy-preserving mechanisms, synthetic data generation, and task-specific compression techniques, creating unique benefits through component interaction. Our experiments across diverse text generation tasks demonstrate that PPC-GPT successfully achieves dual objectives: maintaining competitive performance comparable to full-sized LLMs while ensuring robust privacy protection through its federated architecture. Our code has been contributed to the FATE open-source project and is now publicly accessible at
https://github.com/FederatedAI/FATE-LLM/tree/main/python/fate_llm/algo/ppc-gpt
pdf
bib
abs
Efficient Beam Search for Large Language Models Using Trie-Based Decoding
Brian J Chan
|
Mao-xun Huang
|
Jui-Hung Cheng
|
Chao-Ting Chen
|
Hen-Hsen Huang
This work presents a novel trie (prefix-tree)-based parallel decoding method that addresses the memory inefficiency of batch-based beam search. By sharing a single KV cache across beams with common prefixes, our approach dramatically reduces memory usage and enables efficient decoding. We evaluated our method across three attention architectures, Multi-Head Attention (Phi-3.5-mini-instruct), Grouped Query Attention (Llama-3.1-8B-Instruct), and Sliding Window Attention (Mistral-Small-24B-Instruct-2501), using CNN/DailyMail for abstractive summarization and HumanEval for code generation. Our experiments demonstrate substantial memory savings (4–8×) and up to 2.4× faster decoding, without compromising generation quality. These results highlight our method’s suitability for memory-constrained environments and large-scale deployments.
pdf
bib
abs
Power doesn’t reside in size: A Low Parameter Hybrid Language Model (HLM) for Sentiment Analysis in Code-mixed data
Pavan Sai Balaga
|
Nagasamudram Karthik
|
Challa Vishwanath
|
Raksha Sharma
|
Rudra Murthy
|
Ashish Mittal
Code-mixed text—where multiple languages are used within the same utterance—is increasingly common in both spoken and written communication. However, it presents significant challenges for machine learning models due to the interplay of distinct grammatical structures, effectively forming a hybrid language. While fine-tuning large language models (LLMs) such as GPT-3 or Llama-3 on code-mixed data has led to performance improvements, these models still lag behind their monolingual counterparts and incur high computational costs due to the large number of trainable parameters. In this paper, we focus on the task of sentiment detection in code-mixed text and propose a Hybrid Language Model (HLM) that combines a multilingual encoder (e.g., mBERT) with a lightweight decoder (e.g., Sarvam-1, 3B parameters). Despite having significantly fewer trainable parameters, HLM achieves sentiment classification performance comparable to that of fine-tuned LLMs (>7B parameters). Furthermore, our results demonstrate that HLM significantly outperforms models trained individually, underscoring its effectiveness for low-resource, code-mixed sentiment analysis.
pdf
bib
abs
Evaluating Taxonomy Free Character Role Labeling (TF-CRL) in News Stories using Large Language Models
David G Hobson
|
Derek Ruths
|
Andrew Piper
We introduce Taxonomy-Free Character Role Labeling (TF-CRL), a novel task that assigns open-ended narrative role labels to characters in news stories based on their functional role in the narrative. Unlike fixed taxonomies, TF-CRL enables more nuanced and comparative analysis by generating compositional labels (e.g., Resilient Leader, Scapegoated Visionary). We evaluate several large language models (LLMs) on this task using human preference rankings and ratings across four criteria: faithfulness, relevance, informativeness, and generalizability. LLMs almost uniformly outperform human annotators across all dimensions. We further show how TF-CRL supports rich narrative analysis by revealing novel latent taxonomies and enabling cross-domain narrative comparisons. Our approach offers new tools for studying media portrayals, character framing, and the socio-political impacts of narrative roles at scale.
pdf
bib
abs
MIRROR: Multimodal Cognitive Reframing Therapy for Rolling with Resistance
Subin Kim
|
Hoonrae Kim
|
Jihyun Lee
|
Yejin Jeon
|
Gary Lee
Recent studies have explored the use of large language models (LLMs) in psychotherapy; however, text-based cognitive behavioral therapy (CBT) models often struggle with client resistance, which can weaken the therapeutic alliance. To address this, we propose a multimodal approach that incorporates nonverbal cues, allowing the AI therapist to better align its responses with the client’s negative emotional state. Specifically, we introduce Mirror (Multimodal Interactive Rolling with Resistance), a novel synthetic dataset that pairs each client’s statements with corresponding facial images. Using this dataset, we train baseline vision language models (VLMs) so that they can analyze facial cues, infer emotions, and generate empathetic responses to effectively manage client resistance. These models are then evaluated in terms of both their counseling skills as a therapist and the strength of the therapeutic alliance in the presence of client resistance. Our results demonstrate that Mirror significantly enhances the AI therapist’s ability to handle resistance, outperforming existing text-based CBT approaches. Human expert evaluations further confirm the effectiveness of our approach in managing client resistance and fostering therapeutic alliance.
pdf
bib
abs
RETAIL: Towards Real-world Travel Planning for Large Language Models
Bin Deng
|
Yizhe Feng
|
Zeming Liu
|
Qing Wei
|
Xiangrong Zhu
|
Shuai Chen
|
Yuanfang Guo
|
Yunhong Wang
Although large language models have enhanced automated travel planning abilities, current systems remain misaligned with real-world scenarios. First, they assume users provide explicit queries, while in reality requirements are often implicit. Second, existing solutions ignore diverse environmental factors and user preferences, limiting the feasibility of plans. Third, systems can only generate plans with basic POI arrangements, failing to provide all-in-one plans with rich details. To mitigate these challenges, we construct a novel dataset, RETAIL, which supports decision-making for implicit queries while covering explicit queries, both with and without revision needs. It also enables environmental awareness to ensure plan feasibility under real-world scenarios, while incorporating detailed POI information for all-in-one travel plans. Furthermore, we propose a topic-guided multi-agent framework, termed TGMA. Our experiments reveal that even the strongest existing model achieves merely a 1.0% pass rate, indicating real-world travel planning remains extremely challenging. In contrast, TGMA demonstrates substantially improved performance, achieving a 2.72% pass rate and offering promising directions for real-world travel planning.
pdf
bib
abs
Unraveling Interwoven Roles of Large Language Models in Authorship Privacy: Obfuscation, Mimicking, and Verification
Tuc Nguyen
|
Yifan Hu
|
Thai Le
Recent advancements in large language models (LLMs) have been fueled by large-scale training corpora drawn from diverse sources such as websites, news articles, and books. These datasets often contain explicit user information, such as personal names and addresses, that LLMs may unintentionally reproduce in their generated outputs. Beyond such explicit content, LLMs can also leak identity-revealing cues through implicit signals such as distinctive writing styles, raising significant concerns about authorship privacy. There are three major automated tasks in authorship privacy, namely authorship obfuscation (AO), authorship mimicking (AM), and authorship verification (AV). Prior research has studied AO, AM, and AV independently. However, their interplay remains under-explored, leaving a major research gap, especially in the era of LLMs, which are profoundly shaping how we curate and share user-generated content while the distinction between machine-generated and human-authored text grows increasingly blurred. This work presents the first unified framework for analyzing the dynamic relationships among LLM-enabled AO, AM, and AV in the context of authorship privacy. We quantify how they interact with each other to transform human-authored text, examining effects at a single point in time and iteratively over time. We also examine the role of demographic metadata, such as gender and academic background, in modulating their performance, inter-task dynamics, and privacy risks. The code is available at
https://github.com/nguyentuc/authorship_privacy.
pdf
bib
abs
Reward Model Perspectives: Whose Opinions Do Reward Models Reward?
Elle
Reward models (RMs) are central to the alignment of language models (LMs). An RM often serves as a proxy for human preferences to guide downstream LM behavior. However, our understanding of RM behavior is limited. Our work (i) formalizes a framework for measuring the alignment of opinions captured by RMs, (ii) investigates the extent to which RMs demonstrate sociodemographic biases, and (iii) explores the effects of prompting to steer rewards towards the preferences of a target group. We study the subjective and diverse perspectives on controversial topics, which allows us to quantify RM perspectives in terms of their opinions, attitudes, and values. We show that RMs are poorly aligned with several demographic groups and can systematically reward harmful stereotypes, and steering alone is not enough to overcome these limitations. Our findings underscore the need for more careful consideration of RM behavior in model alignment during preference learning to prevent the propagation of unwanted social biases in the language technologies that we use.
pdf
bib
abs
FLRC: Fine-grained Low-Rank Compressor for Efficient LLM Inference
Yu-Chen Lu
|
Chong-Yan Chen
|
Chi-Chih Chang
|
Yu-Fang Hu
|
Kai-Chiang Wu
Although large language models (LLMs) have achieved remarkable performance, their enormous parameter counts hinder deployment on resource-constrained hardware. Low-rank compression can reduce both memory usage and computational demand, but applying a uniform compression ratio across all layers often leads to significant performance degradation, and previous methods perform poorly during decoding. To address these issues, we propose the Fine-grained Low-Rank Compressor (FLRC), which efficiently determines an optimal rank allocation for each layer and incorporates progressive low-rank decoding to maintain text generation quality. Comprehensive experiments on diverse benchmarks demonstrate the superiority of FLRC, achieving up to a 17% improvement in ROUGE-L on summarization tasks compared to state-of-the-art low-rank compression methods, establishing a more robust and efficient framework to improve LLM inference.
pdf
bib
abs
Do You Know About My Nation? Investigating Multilingual Language Models’ Cultural Literacy Through Factual Knowledge
Eshaan Tanwar
|
Anwoy Chatterjee
|
Michael Saxon
|
Alon Albalak
|
William Yang Wang
|
Tanmoy Chakraborty
Most multilingual question-answering benchmarks, while covering a diverse pool of languages, do not factor in regional diversity in the information they capture and tend to be Western-centric. This introduces a significant gap in fairly evaluating multilingual models’ comprehension of factual information from diverse geographical locations. To address this, we introduce XNationQA for investigating the cultural literacy of multilingual LLMs. XNationQA encompasses a total of 49,280 questions on the geography, culture, and history of nine countries, presented in seven languages. We benchmark eight standard multilingual LLMs on XNationQA and evaluate them using two novel transference metrics. Our analyses uncover a considerable discrepancy in the models’ accessibility to culturally specific facts across languages. Notably, we often find that a model demonstrates greater knowledge of cultural information in English than in the dominant language of the respective culture. The models exhibit better performance in Western languages, although this does not necessarily translate to being more literate for Western countries, which is counterintuitive. Furthermore, we observe that models have a very limited ability to transfer knowledge across languages, particularly evident in open-source models.
pdf
bib
abs
CoEvo: Coevolution of LLM and Retrieval Model for Domain-Specific Information Retrieval
Ang Li
|
Yiquan Wu
|
Yinghao Hu
|
Lizhi Qing
|
Shihang Wang
|
Chengyuan Liu
|
Tao Wu
|
Adam Jatowt
|
Ming Cai
|
Fei Wu
|
Kun Kuang
Information retrieval in specialized domains (e.g., legal and medical) faces challenges in aligning user queries, often expressed in colloquial language, with highly structured, terminology-rich documents. This discrepancy creates a distribution gap in the text representation. Recent methods aim to enhance queries by generating intermediary elements (e.g., keywords, pseudo-documents) before performing retrieval with large language models (LLMs). However, by treating LLMs and retrievers separately, these approaches risk producing unreliable or irrelevant intermediaries, which can significantly degrade retrieval performance. To address this issue, we propose CoEvo, an alternating optimization framework that facilitates the coevolution of LLMs and retrieval models. CoEvo operates through two key steps: L-step directs the LLM in generating intermediaries by leveraging an archive of historical examples known to enhance retrieval. R-step trains the retriever using contrastive learning on the intermediaries produced by the LLM. Finally, we evaluate and flexibly leverage content generated by the LLM to amplify the effectiveness of coevolution. Experimental results demonstrate significant improvements in retrieval performance across both legal and medical domains.
pdf
bib
abs
Conan-Embedding-v2: Training an LLM from Scratch for Text Embeddings
Shiyu Li
|
Yang Tang
|
Ruijie Liu
|
Shi-Zhe Chen
|
Xi Chen
Large language models (LLMs) have recently demonstrated excellent performance in text embedding tasks. Previous work usually uses LoRA to fine-tune existing LLMs, an approach limited by the data and training gaps between LLMs and embedding models. In this work, we introduce Conan-embedding-v2, a new 1.4B-parameter LLM trained from scratch and fine-tuned as a text embedder. First, we add news data and multilingual pairs for LLM pretraining to bridge the data gap. Based on this, we propose a cross-lingual retrieval dataset that enables the LLM to better integrate embeddings across different languages. Second, whereas LLMs use a causal mask with token-level loss, embedding models use a bidirectional mask with sentence-level loss. This training gap makes full fine-tuning less effective than LoRA. We introduce a soft-masking mechanism to gradually transition between these two types of masks, enabling the model to learn more comprehensive representations. Based on this, we propose a dynamic hard negative mining method that exposes the model to more difficult negative examples throughout the training process. Being intuitive and effective, with only approximately 1.4B parameters, Conan-embedding-v2 achieves SOTA performance on both the Massive Text Embedding Benchmark (MTEB) and Chinese MTEB (May 19, 2025).
pdf
bib
abs
Vision-and-Language Navigation with Analogical Textual Descriptions in LLMs
Yue Zhang
|
Tianyi Ma
|
Zun Wang
|
Yanyuan Qiao
|
Parisa Kordjamshidi
Integrating large language models (LLMs) into embodied AI models is becoming increasingly prevalent. However, existing zero-shot LLM-based Vision-and-Language Navigation (VLN) agents either encode images as textual scene descriptions, potentially oversimplifying visual details, or process raw image inputs, which can fail to capture abstract semantics required for high-level reasoning. In this paper, we improve the navigation agent’s contextual understanding by incorporating textual descriptions that facilitate analogical reasoning across images from multiple perspectives. By leveraging text-based analogical reasoning, the agent enhances its global scene understanding and spatial reasoning, leading to more accurate action decisions. We evaluate our approach on the R2R dataset, where our experiments demonstrate significant improvements in navigation performance.
pdf
bib
abs
MUCAR: Benchmarking Multilingual Cross-Modal Ambiguity Resolution for Multimodal Large Language Models
Xiaolong Wang
|
Zhaolu Kang
|
Wangyuxuan Zhai
|
Xinyue Lou
|
Yunghwei Lai
|
Ziyue Wang
|
Yawen Wang
|
Kaiyu Huang
|
Yile Wang
|
Peng Li
|
Yang Liu
Multimodal Large Language Models (MLLMs) have demonstrated significant advances across numerous vision-language tasks. Due to their strong performance in image-text alignment, MLLMs can effectively understand image-text pairs with clear meanings. However, effectively resolving the inherent ambiguities in natural language and visual contexts remains challenging. Existing multimodal benchmarks typically overlook linguistic and visual ambiguities, relying mainly on unimodal context for disambiguation and thus failing to exploit the mutual clarification potential between modalities. To bridge this gap, we introduce MUCAR, a novel and challenging benchmark designed explicitly for evaluating multimodal ambiguity resolution across multilingual and cross-modal scenarios. MUCAR includes: (1) a multilingual dataset where ambiguous textual expressions are uniquely resolved by corresponding visual contexts, and (2) a dual-ambiguity dataset that systematically pairs ambiguous images with ambiguous textual contexts, with each combination carefully constructed to yield a single, clear interpretation through mutual disambiguation. Extensive evaluations involving 19 state-of-the-art multimodal models—encompassing both open-source and proprietary architectures—reveal substantial gaps compared to human-level performance, highlighting the need for future research into more sophisticated cross-modal ambiguity comprehension methods, further pushing the boundaries of multimodal reasoning.
pdf
bib
abs
Mind the Gap: How BabyLMs Learn Filler-Gap Dependencies
Chi-Yun Chang
|
Xueyang Huang
|
Humaira Nasir
|
Shane Storks
|
Olawale Akingbade
|
Huteng Dai
Humans acquire syntactic constructions like filler-gap dependencies from limited and often noisy input. Can neural language models do the same? We investigate this question by evaluating GPT-2 models trained on child-oriented input from the BabyLM Challenge. Our experiments focus on whether these “baby” language models acquire filler-gap dependencies, generalize across constructions, and respect structural constraints such as island effects. We apply a suite of syntactic constructions to four models trained on child language, including two base models (trained on 10M and 100M tokens) and two well-performing models from the BabyLM Challenge (ConcreteGPT and BabbleGPT). We evaluate model behavior using wh-licensing scores, flip tests, and grammaticality contrasts across four constructions. Results show that BabyLM-scale models partially acquire filler-gap dependencies but often fail to generalize or fully capture island constraints.
pdf
bib
abs
Paths Not Taken: Understanding and Mending the Multilingual Factual Recall Pipeline
Meng Lu
|
Ruochen Zhang
|
Carsten Eickhoff
|
Ellie Pavlick
Multilingual large language models (LLMs) often exhibit factual inconsistencies across languages, usually with better performance in factual recall tasks in high-resource languages than in other languages. The causes of these failures, however, remain poorly understood. Using mechanistic analysis techniques, we uncover the underlying pipeline that LLMs employ, which involves using the English-centric factual recall mechanism to process multilingual queries and then translating English answers back into the target language. We identify two primary sources of error: insufficient engagement of the reliable English-centric mechanism for factual recall, and incorrect translation from English back into the target language for the final answer. To address these vulnerabilities, we introduce two vector interventions, both independent of languages and datasets, to redirect the model toward better internal paths for higher factual consistency. Our interventions combined increase the recall accuracy by over 35 percent for the lowest-performing language. Our findings demonstrate how mechanistic insights can be used to unlock latent multilingual capabilities in LLMs.
pdf
bib
abs
BTC-SAM: Leveraging LLMs for Generation of Bias Test Cases for Sentiment Analysis Models
Zsolt T. Kardkovács
|
Lynda Djennane
|
Anna Field
|
Boualem Benatallah
|
Yacine Gaci
|
Fabio Casati
|
Walid Gaaloul
Sentiment Analysis (SA) models harbor inherent social biases that can be harmful in real-world applications. These biases are identified by examining the output of SA models for sentences that only vary in the identity groups of the subjects. Constructing natural, linguistically rich, relevant, and diverse sets of sentences that provide sufficient coverage over the domain is expensive, especially when addressing a wide range of biases: it requires domain experts and/or crowd-sourcing. In this paper, we present a novel bias testing framework, BTC-SAM, which generates high-quality test cases for bias testing in SA models with minimal specification using Large Language Models (LLMs) for the controllable generation of test sentences. Our experiments show that relying on LLMs can provide high linguistic variation and diversity in the test sentences, thereby offering better test coverage compared to base prompting methods even for previously unseen biases.
pdf
bib
abs
Debate-to-Detect: Reformulating Misinformation Detection as a Real-World Debate with Large Language Models
Chen Han
|
Wenzhen Zheng
|
Xijin Tang
The proliferation of misinformation on digital platforms reveals the limitations of traditional detection methods, which mostly rely on static classification and fail to capture the intricate process of real-world fact-checking. Despite advancements in Large Language Models (LLMs) that enhance automated reasoning, their application to misinformation detection remains hindered by issues of logical inconsistency and superficial verification. Inspired by the idea that “Truth Becomes Clearer Through Debate”, we introduce Debate-to-Detect (D2D), a novel Multi-Agent Debate (MAD) framework that reformulates misinformation detection as a structured adversarial debate. Based on fact-checking workflows, D2D assigns domain-specific profiles to each agent and orchestrates a five-stage debate process, including Opening Statement, Rebuttal, Free Debate, Closing Statement, and Judgment. To transcend traditional binary classification, D2D introduces a multi-dimensional evaluation mechanism that assesses each claim across five distinct dimensions: Factuality, Source Reliability, Reasoning Quality, Clarity, and Ethics. Experiments with GPT-4o on two fake news datasets demonstrate significant improvements over baseline methods, and the case study highlights D2D’s capability to iteratively refine evidence while improving decision transparency, representing a substantial advancement towards robust and interpretable misinformation detection. Our code is available at https://github.com/hanshenmesen/Debate-to-Detect
pdf
bib
abs
Controllable Memorization in LLMs via Weight Pruning
Chenjie Ni
|
Zhepeng Wang
|
Runxue Bao
|
Shangqian Gao
|
Yanfu Zhang
The evolution of pre-trained large language models (LLMs) has significantly transformed natural language processing. However, these advancements pose challenges, particularly the unintended memorization of training data, which raises ethical and privacy concerns. While prior research has largely focused on mitigating memorization or extracting memorized information, the deliberate control of memorization has been underexplored. This study addresses this gap by introducing a novel and unified gradient-based weight pruning framework to freely control memorization rates in LLMs. Our method enables fine-grained control over pruning parameters, allowing models to suppress or enhance memorization based on application-specific requirements. Experimental results demonstrate that our approach effectively balances the trade-offs between memorization and generalization, with an increase of up to 89.3% in Fractional ER suppression and 40.9% in Exact ER amplification compared to the original models.
pdf
bib
abs
Tracing L1 Interference in English Learner Writing: A Longitudinal Corpus with Error Annotations
Poorvi Acharya
|
J. Elizabeth Liebl
|
Dhiman Goswami
|
Kai North
|
Marcos Zampieri
|
Antonios Anastasopoulos
The availability of suitable learner corpora is crucial for studying second language acquisition (SLA) and language transfer. However, curating such corpora is challenging, as high-quality learner data is rarely publicly available. As a result, only a few learner corpora, such as ICLE and TOEFL-11, are accessible to the research community. To address this gap, we present Anonymous, a novel English learner corpus with longitudinal data. The corpus consists of 687 texts written by adult learners taking English as a second language courses in the USA. These learners are either preparing for university admission or enhancing their language proficiency while beginning their university studies. Unlike most learner corpora, Anonymous includes longitudinal data, allowing researchers to explore language learning trajectories over time. The corpus features contributions from speakers of 15 different L1s. We demonstrate the utility of Anonymous through two case studies at the intersection of SLA and Computational Linguistics: (1) Native Language Identification (NLI), and (2) a quantitative and qualitative analysis of linguistic features influenced by L1 using large language models.
pdf
bib
abs
DCIS: Efficient Length Extrapolation of LLMs via Divide-and-Conquer Scaling Factor Search
Lei Yang
|
Shaoyang Xu
|
Jianxiang Peng
|
Shaolin Zhu
|
Deyi Xiong
Large language models (LLMs) based on the Transformer architecture usually have their context length limited due to the high training cost. Recent advancements extend the context window by adjusting the scaling factors of RoPE and fine-tuning. However, suboptimal initialization of these factors results in increased fine-tuning costs and reduced performance at target length. To address these challenges, we propose a novel RoPE-based fine-tuning framework that diverges from conventional scaling factors search. Specifically, we present a Divide-and-Conquer Incremental Search (DCIS) algorithm that strategically determines the better scaling factors. Further fine-tuning with the identified scaling factors effectively extends the context window of LLMs. Empirical results demonstrate that our methodology not only mitigates performance decay at extended target lengths but also allows the model to fine-tune on short contexts and generalize to long contexts, thereby reducing the cost of fine-tuning. The scaling factors obtained through DCIS can even perform effectively without fine-tuning. Further analysis of the search space reveals that DCIS achieves twice the search efficiency compared to other methods. We also examine the impact of the non-strictly increasing scaling factors utilized in DCIS and evaluate the general capabilities of LLMs across various context lengths.
pdf
bib
abs
Who is in the Spotlight: The Hidden Bias Undermining Multimodal Retrieval-Augmented Generation
Jiayu Yao
|
Shenghua Liu
|
Yiwei Wang
|
Lingrui Mei
|
Baolong Bi
|
Yuyao Ge
|
Zhecheng Li
|
Xueqi Cheng
Multimodal Retrieval-Augmented Generation (RAG) systems have become essential in knowledge-intensive and open-domain tasks. As retrieval complexity increases, ensuring the robustness of these systems is critical. However, current RAG models are highly sensitive to the order in which evidence is presented, often resulting in unstable performance and biased reasoning, particularly as the number of retrieved items or modality diversity grows. This raises a central question: How does the position of retrieved evidence affect multimodal RAG performance? To answer this, we present the first comprehensive study of position bias in multimodal RAG systems. Through controlled experiments across text-only, image-only, and mixed-modality tasks, we observe a consistent U-shaped accuracy curve with respect to evidence position. To quantify this bias, we introduce the Position Sensitivity Index (PSIp) and develop a visualization framework to trace attention allocation patterns across decoder layers. Our results reveal that multimodal interactions intensify position bias compared to unimodal settings, and that this bias increases logarithmically with retrieval range. These findings offer both theoretical and empirical foundations for position-aware analysis in RAG, highlighting the need for evidence reordering or debiasing strategies to build more reliable and equitable generation systems. Our code and experimental resources are available at https://github.com/Theodyy/Multimodal-Rag-Position-Bias.
pdf
bib
abs
Let’s Play Across Cultures: A Large Multilingual, Multicultural Benchmark for Assessing Language Models’ Understanding of Sports
Punit Kumar Singh
|
Nishant Kumar
|
Akash Ghosh
|
Kunal Pasad
|
Khushi Soni
|
Manisha Jaishwal
|
Sriparna Saha
|
Syukron Abu Ishaq Alfarozi
|
Asres Temam Abagissa
|
Kitsuchart Pasupa
|
Haiqin Yang
|
Jose G Moreno
Language Models (LMs) are primarily evaluated on globally popular sports, often overlooking regional and indigenous sporting traditions. To address this gap, we introduce CultSportQA, a benchmark designed to assess LMs’ understanding of traditional sports across 60 countries and 6 continents, encompassing four distinct cultural categories. The dataset features 33,000 multiple-choice questions (MCQs) across text and image modalities, categorized into primarily three key types: history-based, rule-based, and scenario-based. To evaluate model performance, we employ zero-shot, few-shot, and chain-of-thought (CoT) prompting across a diverse set of Large Language Models (LLMs), Small Language Models (SLMs), and Multimodal Large Language Models (MLMs). By providing a comprehensive multilingual and multicultural sports benchmark, CultSportQA establishes a new standard for assessing AI’s ability to understand and reason about traditional sports. The dataset will be publicly available, fostering research in culturally aware AI systems.
pdf
bib
abs
Multilingual Federated Low-Rank Adaptation for Collaborative Content Anomaly Detection across Multilingual Social Media Participants
Jiaxin Li
|
Geng Zhao
|
Xiaoci Zhang
Recently, the rapid development of multilingual social media platforms (SNS) exacerbates new challenges in SNS content anomaly detection due to data islands and linguistic imbalance. While federated learning (FL) and parameter-efficient fine-tuning (PEFT) offer potential solutions in most cases, when every client is multilingual, existing solutions struggle with multilingual heterogeneity: 1) entangled language-specific knowledge during aggregation, 2) noise from minority languages, and 3) unstable cross-platform collaboration. Based on the asymmetric nature of LoRA, we propose MuLA-F, a multilingual Federated LoRA introducing SVD-based language-specific disentanglement of LoRA blocks and a local orthogonal tuning strategy. Evaluations across three SNS content anomaly detection tasks demonstrate MuLA-F’s superiority in multilingual performance while reducing multilingual knowledge conflicts and communication rounds.
pdf
bib
abs
M3Retrieve: Benchmarking Multimodal Retrieval for Medicine
Arkadeep Acharya
|
Akash Ghosh
|
Pradeepika Verma
|
Kitsuchart Pasupa
|
Sriparna Saha
|
Dr Priti Singh
With the increasing use of Retrieval-Augmented Generation (RAG), strong retrieval models have become more important than ever. In healthcare, multimodal retrieval models that combine information from both text and images offer major advantages for many downstream tasks such as question answering, cross-modal retrieval, and multimodal summarization, since medical data often includes both formats. However, there is currently no standard benchmark to evaluate how well these models perform in medical settings. To address this gap, we introduce M3Retrieve, a Multimodal Medical Retrieval Benchmark. M3Retrieve spans 5 domains, 16 medical fields, and 4 distinct tasks, with over 1.2 million text documents and 164K multimodal queries, all collected under approved licenses. We evaluate leading multimodal retrieval models on this benchmark to explore the challenges specific to different medical specialities and to understand their impact on retrieval performance. By releasing M3Retrieve, we aim to enable systematic evaluation, foster model innovation, and accelerate research toward building more capable and reliable multimodal retrieval systems for medical applications.
pdf
bib
abs
The Hidden Strength of Disagreement: Unraveling the Consensus-Diversity Tradeoff in Adaptive Multi-Agent Systems
Zengqing Wu
|
Takayuki Ito
Consensus formation is pivotal in multi-agent systems (MAS), balancing collective coherence with individual diversity. Conventional LLM-based MAS primarily rely on explicit coordination, e.g., prompts or voting, risking premature homogenization. We argue that implicit consensus, where agents exchange information yet independently form decisions via in-context learning, can be more effective in dynamic environments that require long-horizon adaptability. By retaining partial diversity, systems can better explore novel strategies and cope with external shocks. We formalize a consensus-diversity tradeoff, showing conditions where implicit methods outperform explicit ones. Experiments on three scenarios – Dynamic Disaster Response, Information Spread and Manipulation, and Dynamic Public-Goods Provision – confirm partial deviation from group norms boosts exploration, robustness, and performance. We highlight emergent coordination via in-context learning, underscoring the value of preserving diversity for resilient decision-making.
pdf
bib
abs
Friend or Foe? A Computational Investigation of Semantic False Friends across Romance Languages
Ana Sabina Uban
|
Liviu P Dinu
|
Ioan-Bogdan Iordache
|
Simona Georgescu
|
Claudia Vlad
In this paper we present a comprehensive analysis of lexical semantic divergence between cognate words and borrowings in the Romance languages. We experiment with different algorithms for false friend detection (covering both deceptive cognates and deceptive borrowings) and correction, and evaluate them systematically on cognate and borrowing pairs in the five Romance languages. We use the most complete and reliable dataset of cognate words based on etymological dictionaries for the five main Romance languages (Italian, Spanish, Portuguese, French and Romanian) to extract deceptive cognates and borrowings automatically based on usage, and freely publish the lexicon of obtained true and deceptive cognates and borrowings for every Romance language pair.
pdf
bib
abs
KLAAD: Refining Attention Mechanisms to Reduce Societal Bias in Generative Language Models
Seorin Kim
|
Dongyoung Lee
|
Jaejin Lee
Large language models (LLMs) often exhibit societal biases in their outputs, prompting ethical concerns regarding fairness and harm. In this work, we propose KLAAD (KL-Attention Alignment Debiasing), an attention-based debiasing framework that implicitly aligns attention distributions between stereotypical and anti-stereotypical sentence pairs without directly modifying model weights. KLAAD introduces a composite training objective combining Cross-Entropy, KL divergence, and Triplet losses, guiding the model to consistently attend across biased and unbiased contexts while preserving fluency and coherence. Experimental evaluation of KLAAD demonstrates improved bias mitigation on both the BBQ and BOLD benchmarks, with minimal impact on language modeling quality. The results indicate that attention-level alignment offers a principled solution for mitigating bias in generative language models.
pdf
bib
abs
SeMob: Semantic Synthesis for Dynamic Urban Mobility Prediction
Runfei Chen
|
Shuyang Jiang
|
Wei Huang
Human mobility prediction is vital for urban services, but often fails to account for abrupt changes from external events. Existing spatiotemporal models struggle to leverage textual descriptions detailing these events. We propose SeMob, an LLM-powered semantic synthesis pipeline for dynamic mobility prediction. Specifically, SeMob employs a multi-agent framework where LLM-based agents automatically extract and reason about spatiotemporally related text from complex online texts. Fine-grained relevant contexts are then incorporated with spatiotemporal data through our proposed innovative progressive fusion architecture. The rich pre-trained event prior contributes enriched insights about event-driven prediction, and hence results in a more aligned forecasting model. Evaluated on a dataset constructed through our pipeline, SeMob achieves maximal reductions of 13.92% in MAE and 11.12% in RMSE compared to the spatiotemporal model. Notably, the framework exhibits pronounced superiority especially within spatiotemporal regions close to an event’s location and time of occurrence.
pdf
bib
abs
DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors
Yize Cheng
|
Wenxiao Wang
|
Mazda Moayeri
|
Soheil Feizi
Open benchmarks are essential for evaluating and advancing large language models, offering reproducibility and transparency. However, their accessibility makes them likely targets of test set contamination. In this work, we introduce **DyePack**, a framework that leverages backdoor attacks to identify models that used benchmark test sets during training, **without requiring access to the loss, logits, or any internal details of the model.** Like how banks mix dye packs with their money to mark robbers, DyePack mixes backdoor samples with the test data to flag models that trained on it. We propose a principled design incorporating multiple backdoors with stochastic targets, **enabling exact false positive rate (FPR) computation when flagging every model.** This provably prevents false accusations while providing strong evidence for every detected case of contamination. We evaluate DyePack on five models across three datasets, covering both multiple-choice and open-ended generation tasks. For multiple-choice questions, it successfully detects all contaminated models with guaranteed FPRs as low as 0.000073% on MMLU-Pro and 0.000017% on Big-Bench-Hard using eight backdoors. For open-ended generation tasks, it generalizes well and identifies all contaminated models on Alpaca with a guaranteed false positive rate of just 0.127% using six backdoors.
pdf
bib
abs
Minimal, Local, and Robust: Embedding-Only Edits for Implicit Bias in T2I Models
Feng He
|
Chao Zhang
|
Zhixue Zhao
Implicit assumptions and priors are often necessary in text-to-image generation tasks, especially when textual prompts lack sufficient context. However, these assumptions can sometimes reflect societal biases, low variance, or outdated concepts in the training data. We present Embedding-only Editing (EmbEdit), a method designed to efficiently edit implicit assumptions and priors in the text-to-image model without affecting unrelated objects or degrading overall performance. Given a “source” prompt (e.g., “nurse”) that elicits an assumption (e.g., a female nurse) and a “destination” prompt or distribution (e.g. equal gender chance), EmbEdit only fine-tunes the word token embedding (WTE) of the target object (i.e. token “nurse”’s WTE). Our method prevents unintended effects on other objects in the model’s knowledge base, as the WTEs for unrelated objects and the model weights remain unchanged. Further, our method can be applied to any text-to-image model with a text encoder. It is highly efficient, modifying only 768, 2048, and 4864 parameters for Stable Diffusion 1.4, Stable Diffusion XL, and FLUX, respectively, matching each model’s WTE dimension. Additionally, changes could be easily reversed by restoring the original WTE layers. The results show that EmbEdit outperforms previous methods in various models, tasks, and editing scenarios (both single and sequential multiple edits), achieving at least a 6.01% improvement (from 87.17% to 93.18%).
pdf
bib
abs
Journalism-Guided Agentic In-context Learning for News Stance Detection
Dahyun Lee
|
Jonghyeon Choi
|
Jiyoung Han
|
Kunwoo Park
As online news consumption grows, personalized recommendation systems have become integral to digital journalism. However, these systems risk reinforcing filter bubbles and political polarization by failing to incorporate diverse perspectives. Stance detection—identifying a text’s position on a target—can help mitigate this by enabling viewpoint-aware recommendations and data-driven analyses of media bias. Yet, existing stance detection research remains largely limited to short texts and high-resource languages. To address these gaps, we introduce K-News-Stance, the first Korean dataset for article-level stance detection, comprising 2,000 news articles with article-level and 21,650 segment-level stance annotations across 47 societal issues. We also propose JoA-ICL, a Journalism-guided Agentic In-Context Learning framework that employs a language model agent to predict the stances of key structural segments (e.g., leads, quotes), which are then aggregated to infer the overall article stance. Experiments showed that JoA-ICL outperforms existing stance detection methods, highlighting the benefits of segment-level agency in capturing the overall position of long-form news articles. Two case studies further demonstrate its broader utility in promoting viewpoint diversity in news recommendations and uncovering patterns of media bias.
pdf
bib
abs
Less Is MuRE: Revisiting Shallow Knowledge Graph Embeddings
Victor Charpenay
|
Steven Schockaert
In recent years, the field of knowledge graph completion has focused on increasingly sophisticated models, which perform well on link prediction tasks, but are less scalable than earlier methods and are not suitable for learning entity embeddings. As a result, shallow models such as TransE and ComplEx remain the most popular choice in many settings. However, the strengths and limitations of such models remain poorly understood. In this paper, we present a unifying framework and systematically analyze a number of variants and extensions of existing shallow models, empirically showing that MuRE and its extension, ExpressivE, are highly competitive. Motivated by the strong empirical results of MuRE, we also theoretically analyze the expressivity of its associated scoring function, surprisingly finding that it can capture the same class of rule bases as state-of-the-art region-based embedding models.
pdf
bib
abs
Jailbreak LLMs through Internal Stance Manipulation
Shuangjie Fu
|
Du Su
|
Beining Huang
|
Fei Sun
|
Jingang Wang
|
Wei Chen
|
Huawei Shen
|
Xueqi Cheng
To confront the ever-evolving safety risks of LLMs, automated jailbreak attacks have proven effective for proactively identifying security vulnerabilities at scale. Existing approaches, including GCG and AutoDAN, modify adversarial prompts to induce LLMs to generate responses that strictly follow a fixed affirmative template. However, we observed that the reliance on a rigid output template is ineffective for certain malicious requests, leading to suboptimal jailbreak performance. In this work, we aim to develop a method that is universally effective across all hostile requests. To achieve this, we explore LLMs’ intrinsic safety mechanism: a refusal stance towards the adversarial prompt is formed in a confined region and ultimately leads to a rejective response. In light of this, we propose Stance Manipulation (SM), a novel automated jailbreak approach that generates jailbreak prompts to suppress the refusal stance and induce affirmative responses. Our experiments across four mainstream open-source LLMs demonstrate the superiority of SM’s performance. Under commonly used settings, SM achieves success rates over 77.1% across all models on Advbench. Specifically, for Llama-2-7b-chat, SM outperforms the best baseline by 25.4%. In further experiments with extended iterations in a speedup setup, SM achieves over 92.2% attack success rate across all models. Our code is publicly available at https://github.com/Zed630/Stance-Manipulation.
pdf
bib
abs
Pierce the Mists, Greet the Sky: Decipher Knowledge Overshadowing via Knowledge Circuit Analysis
Haoming Huang
|
Yibo Yan
|
Jiahao Huo
|
Xin Zou
|
Xinfeng Li
|
Kun Wang
|
Xuming Hu
Large Language Models (LLMs), despite their remarkable capabilities, are hampered by hallucinations. A particularly challenging variant, knowledge overshadowing, occurs when one piece of activated knowledge inadvertently masks another relevant piece, leading to erroneous outputs even with high-quality training data. Current understanding of overshadowing is largely confined to inference-time observations, lacking deep insights into its origins and internal mechanisms during model training. Therefore, we introduce **PhantomCircuit, a novel framework designed to comprehensively analyze and detect knowledge overshadowing.** By innovatively employing knowledge circuit analysis, PhantomCircuit dissects the function of key components in the circuit and how the attention pattern dynamics contribute to the overshadowing phenomenon and its evolution throughout the training process. Extensive experiments demonstrate PhantomCircuit’s effectiveness in identifying such instances, offering novel insights into this elusive hallucination and providing the research community with a new methodological lens for its potential mitigation. Our code can be found at https://github.com/halfmorepiece/PhantomCircuit.
pdf
bib
abs
Complex Numerical Reasoning with Numerical Semantic Pre-training Framework
Jun Zhang
|
Haihong E
|
Tianyi Hu
|
Yifan Zhu
|
Meina Song
|
Haoran Luo
Multi-hop complex reasoning over incomplete knowledge graphs (KGs) has been extensively studied, but research on numerical knowledge graphs (NKGs) remains relatively limited. Recent approaches focus on separately encoding entities and numerical values, using neural networks to process query encodings for reasoning. However, in complex multi-hop reasoning tasks, numerical values are not merely symbols; they carry specific semantics and logical relationships that must be accurately represented. To address this, we propose CNR-NST, a numerical semantic pre-training framework that can perform binary operations on numerical attributes in NKGs, enabling it to infer new numerical attributes from existing knowledge. Our approach effectively handles up to 102 types of complex numerical reasoning queries. On three public datasets, CNR-NST demonstrates SOTA performance in complex numerical queries, achieving an average improvement of over 40% compared to existing methods. Notably, this work expands the query types for complex multi-hop numerical reasoning and introduces a new evaluation metric for numerical answers, which has been validated through comprehensive experiments.
pdf
bib
abs
Automated Knowledge Graph Construction using Large Language Models and Sentence Complexity Modelling
Sydney Anuyah
|
Mehedi Mahmud Kaushik
|
Sri Rama Krishna Reddy Dwarampudi
|
Rakesh Shiradkar
|
Arjan Durresi
|
Sunandan Chakraborty
We introduce CoDe-KG, an open-source, end-to-end pipeline for extracting sentence-level knowledge graphs by combining robust coreference resolution with syntactic sentence decomposition. Using our model, we contribute a dataset of over 150,000 knowledge triples, which is open source. We also contribute a training corpus of 7,248 rows for sentence complexity, 200 rows of gold human annotations for coreference resolution using lung-cancer abstracts from PubMed, 900 rows of gold human annotations for sentence conversion policies from sentences in the abstract, and 398 triples of gold human annotations. We systematically select optimal prompt-model pairs across five complexity categories, showing that hybrid chain-of-thought and few-shot prompting yields up to 99.8% exact-match accuracy on sentence simplification. On relation extraction (RE), our pipeline achieves 65.8% macro-F1 on REBEL, an 8-point gain over the prior state of the art, and 75.7% micro-F1 on WebNLG2, while matching or exceeding performance on Wiki-NRE and CaRB. Ablation studies demonstrate that integrating coreference and decomposition increases recall on rare relations by over 20%.
pdf
bib
abs
OntologyRAG-Q: Resource Development and Benchmarking for Retrieval-Augmented Question Answering in Qur’anic Tafsir
Sadam Al-Azani
|
Maad Alowaifeer
|
Alhanoof Alhunief
|
Ahmed Abdelali
This paper introduces essential resources for Qur’anic studies: an annotated Tafsir ontology, a dataset of approximately 4,200 question-answer pairs, and a collection of 15 structured Tafsir books available in two formats. We present a comprehensive framework for handling sensitive Qur’anic Tafsir data that spans the entire pipeline from dataset construction through evaluation and error analysis. Our work establishes new benchmarks for retrieval and question-answering tasks on Qur’anic content, comparing performance across state-of-the-art embedding models and large language models (LLMs). We introduce OntologyRAG-Q, a novel retrieval-augmented generation approach featuring our custom Ayat-Ontology chunking method that segments Tafsir content at the verse level using ontology-driven structure. Benchmarking reveals strong performance across various LLMs, with GPT-4 achieving the highest results, followed closely by ALLaM. Expert evaluations show our system achieves 69.52% accuracy and 74.36% correctness overall, though multi-hop and context-dependent questions remain challenging. Our analysis demonstrates that answer position within documents significantly impacts retrieval performance, and among the evaluation metrics tested, BERT-recall and BERT-F1 correlate most strongly with expert assessments. The resources developed in this study are publicly available at
https://github.com/sazani/OntologyRAG-Q.git.
pdf
bib
abs
The Practical Impacts of Theoretical Constructs on Empathy Modeling
Allison Lahnala
|
Charles Welch
|
David Jurgens
|
Lucie Flek
Conceptual operationalizations of empathy in NLP are varied, with some having specific behaviors and properties, while others are more abstract. How these variations relate to one another and capture properties of empathy observable in text remains unclear. To provide insight into this, we analyze the transfer performance of empathy models adapted to empathy tasks with different theoretical groundings. We study (1) the dimensionality of empathy definitions, (2) the correspondence between the defined dimensions and measured/observed properties, and (3) the conduciveness of the data to represent them, finding they have a significant impact on performance compared to other transfer setting features. Characterizing the theoretical grounding of empathy tasks as direct, abstract, or adjacent further indicates that tasks that directly predict specified empathy components have higher transferability. Our work provides empirical evidence for the need for precise and multidimensional empathy operationalizations.
pdf
bib
abs
RecBase: Generative Foundation Model Pretraining for Zero-Shot Recommendation
Sashuai Zhou
|
Weinan Gan
|
Qijiong Liu
|
Ke Lei
|
Jieming Zhu
|
Hai Huang
|
Yan Xia
|
Ruiming Tang
|
Zhenhua Dong
|
Zhou Zhao
Recent advances in LLM-based recommendation have shown promise, yet their cross-domain generalization is hindered by a fundamental mismatch between language-centric pretraining and the recommendation task. Existing methods, relying on language-level knowledge, fail to capture dynamic, item-level user interests across domains. To bridge this gap, we propose RecBase, a domain-agnostic foundational model pretrained with a recommendation-oriented objective. RecBase leverages a large-scale, heterogeneous, cross-domain corpus with unified textual representations and feature mappings to enhance cross-domain generalization. To further align item semantics across domains, we introduce a unified item tokenizer that encodes items into hierarchical concept identifiers, enabling structured representation and efficient vocabulary sharing. The model is trained using an autoregressive objective to capture complex item-level sequential patterns. On eight real-world datasets, our 1.5B-parameter model matches or surpasses the performance of LLM baselines up to 7B parameters in zero-shot and cross-domain recommendation tasks.
pdf
bib
abs
Grouping Entities with Shared Properties using Multi-Facet Prompting and Property Embeddings
Amit Gajbhiye
|
Thomas Bailleux
|
Zied Bouraoui
|
Luis Espinosa-Anke
|
Steven Schockaert
Methods for learning taxonomies from data have been widely studied. We study a specific version of this task, called commonality identification, where only the set of entities is given and we need to find meaningful ways to group those entities. While LLMs should intuitively excel at this task, it is difficult to directly use such models in large domains. In this paper, we instead use LLMs to describe the different properties that are satisfied by each of the entities individually. We then use pre-trained embeddings to cluster these properties, and finally group entities that have properties which belong to the same cluster. To achieve good results, it is paramount that the properties predicted by the LLM are sufficiently diverse. We find that this diversity can be improved by prompting the LLM to structure the predicted properties into different facets of knowledge.
pdf
bib
abs
Context-Aware Hierarchical Taxonomy Generation for Scientific Papers via LLM-Guided Multi-Aspect Clustering
Kun Zhu
|
Lizi Liao
|
Yuxuan Gu
|
Lei Huang
|
Xiaocheng Feng
|
Bing Qin
The rapid growth of scientific literature demands efficient methods to organize and synthesize research findings. Existing taxonomy construction methods, leveraging unsupervised clustering or direct prompting of large language models (LLMs), often lack coherence and granularity. We propose a novel context-aware hierarchical taxonomy generation framework that integrates LLM-guided multi-aspect encoding with dynamic clustering. Our method leverages LLMs to identify key aspects of each paper (e.g., methodology, dataset, evaluation) and generates aspect-specific paper summaries, which are then encoded and clustered along each aspect to form a coherent hierarchy. In addition, we introduce a new evaluation benchmark of 156 expert-crafted taxonomies encompassing 11.6k papers, providing the first naturally annotated dataset for this task. Experimental results demonstrate that our method significantly outperforms prior approaches, achieving state-of-the-art performance in taxonomy coherence, granularity, and interpretability.
pdf
bib
abs
Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks
Dongjun Kim
|
Gyuho Shim
|
Yongchan Chun
|
Minhyuk Kim
|
Chanjun Park
|
Heuiseok Lim
Large Language Models are commonly judged by their scores on standard benchmarks, yet such scores often overstate real capability since they mask the mix of skills a task actually demands. For example, ARC is assumed to test reasoning, while HellaSwag is designed to evaluate commonsense. However, we lack a systematic way to verify if these benchmarks actually measure these labels. We introduce **BENCHMARK PROFILING**, a diagnostic framework that decomposes benchmark performance into ten cognitively grounded abilities. The method combines gradient-based importance scoring with targeted parameter ablation to compute an Ability Impact Score (AIS) that quantifies how much each ability contributes to a model’s success on a given benchmark. Profiling three instruction-tuned models across ten widely used benchmarks yields four key findings: (i) most benchmarks draw on several abilities rather than one, (ii) datasets with similar labels rely on distinct ability mixtures, (iii) code-generation benchmarks reward broad, multi-skill improvement and thus show only modest gains from narrow domain-specific fine-tuning, and (iv) abilities irrelevant to the task could negatively affect performance. **BENCHMARK PROFILING** therefore explains why performance gains do not always translate into user-perceived competence and offers a transparent tool for benchmark audit and model interpretability.
pdf
bib
abs
TreeReview: A Dynamic Tree of Questions Framework for Deep and Efficient LLM-based Scientific Peer Review
Yuan Chang
|
Ziyue Li
|
Hengyuan Zhang
|
Yuanbo Kong
|
Yanru Wu
|
Hayden Kwok-Hay So
|
Zhijiang Guo
|
Liya Zhu
|
Ngai Wong
While Large Language Models (LLMs) have shown significant potential in assisting peer review, current methods often struggle to generate thorough and insightful reviews while maintaining efficiency. In this paper, we propose TreeReview, a novel framework that models paper review as a hierarchical and bidirectional question-answering process. TreeReview first constructs a tree of review questions by recursively decomposing high-level questions into fine-grained sub-questions and then resolves the question tree by iteratively aggregating answers from leaf to root to get the final review. Crucially, we incorporate a dynamic question expansion mechanism to enable deeper probing by generating follow-up questions when needed. We construct a benchmark derived from ICLR and NeurIPS venues to evaluate our method on full review generation and actionable feedback comment generation tasks. Experimental results of both LLM-based and human evaluation show that TreeReview outperforms strong baselines in providing comprehensive, in-depth, and expert-aligned review feedback, while reducing LLM token usage by up to 80% compared to computationally intensive approaches.
pdf
bib
abs
Improving Chemical Understanding of LLMs via SMILES Parsing
Yunhui Jang
|
Jaehyung Kim
|
Sungsoo Ahn
Large language models (LLMs) are increasingly recognized as powerful tools for scientific discovery, particularly in molecular science. A fundamental requirement for these models is the ability to accurately understand molecular structures, commonly encoded in the SMILES representation. However, current LLMs struggle to interpret SMILES, even failing to carry out basic tasks such as counting molecular rings. To address this limitation, we introduce CLEANMOL, a novel framework that formulates SMILES parsing into a suite of clean and deterministic tasks explicitly designed to promote graph-level molecular comprehension. These tasks span from subgraph matching to global graph matching, providing structured supervision aligned with molecular structural properties. We construct a molecular pretraining dataset with adaptive difficulty scoring and pre-train open-source LLMs on these tasks. Our results show that CLEANMOL not only enhances structural comprehension but also achieves the best results on, or remains competitive with baselines on, the Mol-Instructions benchmark.
pdf
bib
abs
Can Large Language Models Tackle Graph Partitioning?
Yiheng Wu
|
Ningchao Ge
|
Yanmin Li
|
Liwei Qian
|
Mengna Zhu
|
Haoyu Yang
|
Haiwen Chen
|
Jibing Wu
Large language models (LLMs) demonstrate remarkable capabilities in understanding complex tasks and have achieved commendable performance in graph-related tasks, such as node classification, link prediction, and subgraph classification. These tasks primarily depend on the local reasoning capabilities of the graph structure. However, research has yet to address the graph partitioning task that requires global perception abilities. Our preliminary findings reveal that vanilla LLMs can only handle graph partitioning on extremely small-scale graphs. To overcome this limitation, we propose a three-phase pipeline to empower LLMs for large-scale graph partitioning: coarsening, reasoning, and refining. The coarsening phase reduces graph complexity. The reasoning phase captures both global and local patterns to generate a coarse partition. The refining phase ensures topological consistency by projecting the coarse-grained partitioning results back to the original graph structure. Extensive experiments demonstrate that our framework enables LLMs to perform graph partitioning across varying graph scales, validating both the effectiveness of LLMs for partitioning tasks and the practical utility of our proposed methodology.
pdf
bib
abs
To See a World in a Spark of Neuron: Disentangling Multi-Task Interference for Training-Free Model Merging
Zitao Fang
|
Guodong Du
|
Shuyang Yu
|
Yifei Guo
|
Yiwei Zhang
|
Yiyao Cao
|
Jing Li
|
Ho-Kin Tang
|
Sim Kuan Goh
Fine-tuning pre-trained models on targeted datasets enhances task-specific performance but often comes at the expense of generalization. Model merging techniques, which integrate multiple fine-tuned models into a single multi-task model through task arithmetic, offer a promising solution. However, task interference remains a fundamental challenge, leading to performance degradation and suboptimal merged models. Existing approaches largely overlooked the fundamental roles of neurons, their connectivity, and activation, resulting in a merging process and a merged model that does not consider how neurons relay and process information. In this work, we present the first study that relies on neuronal mechanisms for model merging. Specifically, we decomposed task-specific representations into two complementary neuronal subspaces that regulate input sensitivity and task adaptability. Leveraging this decomposition, we introduced NeuroMerging, a novel merging framework developed to mitigate task interference within neuronal subspaces, enabling training-free model fusion across diverse tasks. Through extensive experiments, we demonstrated that NeuroMerging achieved superior performance compared to existing methods on multi-task benchmarks across both natural language and vision domains. Our findings highlighted the importance of aligning neuronal mechanisms in model merging, offering new insights into mitigating task interference and improving knowledge fusion. Our project is available at [this http URL](https://ZzzitaoFang.github.io/projects/NeuroMerging/).
pdf
bib
abs
What You Read Isn’t What You Hear: Linguistic Sensitivity in Deepfake Speech Detection
Binh Nguyen
|
Shuju Shi
|
Ryan Ofman
|
Thai Le
Recent advances in text-to-speech technology have enabled highly realistic voice generation, fueling audio-based deepfake attacks such as fraud and impersonation. While audio anti-spoofing systems are critical for detecting such threats, prior research has predominantly focused on acoustic-level perturbations, leaving **the impact of linguistic variation largely unexplored**. In this paper, we investigate the linguistic sensitivity of both open-source and commercial anti-spoofing detectors by introducing **TAPAS** (Transcript-to-Audio Perturbation Anti-Spoofing), a novel framework for transcript-level adversarial attacks. Our extensive evaluation shows that even minor linguistic perturbations can significantly degrade detection accuracy: attack success rates exceed **60%** on several open-source detector–voice pairs, and the accuracy of one commercial detector drops from **100%** on synthetic audio to just **32%**. Through a comprehensive feature attribution analysis, we find that linguistic complexity and model-level audio embedding similarity are key factors contributing to detector vulnerabilities. To illustrate the real-world risks, we replicate a recent Brad Pitt audio deepfake scam and demonstrate that TAPAS can bypass commercial detectors. These findings underscore the **need to move beyond purely acoustic defenses** and incorporate linguistic variation into the design of robust anti-spoofing systems. Our source code is available at https://github.com/nqbinh17/audio_linguistic_adversarial.
pdf
bib
abs
Task-Aware Resolution Optimization for Visual Large Language Models
Weiqing Luo
|
Zhen Tan
|
Yifan Li
|
Xinyu Zhao
|
Kwonjoon Lee
|
Behzad Dariush
|
Tianlong Chen
Real-world vision-language applications demand varying levels of perceptual granularity. However, most existing visual large language models (VLLMs), such as LLaVA, pre-assume a fixed resolution for downstream tasks, which leads to subpar performance. To address this problem, we first conduct a comprehensive and pioneering investigation into the resolution preferences of different vision-language tasks, revealing a correlation between resolution preferences and (1) image complexity, and (2) the uncertainty variance of the VLLM at different image input resolutions. Building on this insight, we propose an empirical formula to determine the optimal resolution for a given vision-language task, accounting for these two factors as the zeroth-order and first-order terms in the Taylor expansion on a given image input. Second, based on rigorous experiments, we propose a novel parameter-efficient fine-tuning technique to extend the visual input resolution of pre-trained VLLMs to the identified optimal resolution. Extensive experiments on various vision-language tasks validate the effectiveness of our method.
pdf
bib
abs
CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists
Yukyung Lee
|
JoongHoon Kim
|
Jaehee Kim
|
Hyowon Cho
|
Jaewook Kang
|
Pilsung Kang
|
Najoung Kim
Existing LLM-as-a-Judge approaches for evaluating text generation suffer from rating inconsistencies, with low agreement and high rating variance across different evaluator models. We attribute this to subjective evaluation criteria combined with Likert scale scoring in existing protocols. To address this issue, we introduce CheckEval, a checklist-based evaluation framework that improves rating reliability via decomposed binary questions. Through experiments with 12 evaluator models across multiple datasets, we first demonstrate that CheckEval strongly correlates with human judgments. More importantly, CheckEval dramatically improves the average agreement across evaluator models by 0.45 and reduces the score variance. CheckEval scores furthermore have the benefit of being more interpretable because it decomposes evaluation criteria into traceable binary decisions, allowing analyses of specific attributes driving quality judgments.
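To make the checklist idea concrete, here is a minimal sketch of decomposed binary-question scoring; the item texts, weights, and aggregation rule are our own illustrative assumptions, and a real setup would replace the stand-in judge with calls to an evaluator LLM rather than the trivial function below.

```python
# Minimal sketch of checklist-style evaluation (hypothetical items and rubric).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ChecklistItem:
    question: str          # a binary (yes/no) criterion
    weight: float = 1.0    # optional per-item weight

def checklist_score(items: List[ChecklistItem],
                    judge: Callable[[str, str], bool],
                    candidate_text: str) -> float:
    """Ask each binary question about the candidate and return a weighted yes-rate."""
    total = sum(item.weight for item in items)
    yes = sum(item.weight for item in items if judge(item.question, candidate_text))
    return yes / total if total else 0.0

# Usage with a trivial stand-in judge (a real setup would query an evaluator LLM).
items = [
    ChecklistItem("Does the summary mention the main finding?"),
    ChecklistItem("Is the summary free of unsupported claims?"),
    ChecklistItem("Is the summary under three sentences?", weight=0.5),
]
dummy_judge = lambda question, text: "finding" in text.lower()
print(checklist_score(items, dummy_judge, "The main finding is reported."))
```

One appeal of binary items is that disagreement becomes localized: two evaluators who differ on a final score can be traced back to the specific checklist items where their answers diverge.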
pdf
bib
abs
A Necessary Step toward Faithfulness: Measuring and Improving Consistency in Free-Text Explanations
Lingjun Zhao
|
Hal Daumé Iii
Faithful free-text explanations are important to ensure transparency in high-stakes AI decision-making contexts, but they are challenging for language models to generate and for humans to assess. In this paper, we present a measure for Prediction-EXplanation (PEX) consistency, by extending the concept of weight of evidence. This measure quantifies how much a free-text explanation supports or opposes a prediction, serving as an important aspect of explanation faithfulness. Our analysis reveals that more than 62% of explanations generated by large language models lack this consistency. We show that applying direct preference optimization improves the consistency of generated explanations across three model families, with improvements ranging from 43.1% to 292.3%. Furthermore, we demonstrate that optimizing this consistency measure can improve explanation faithfulness by up to 9.7%.
pdf
bib
abs
Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models
Qihang Ma
|
Shengyu Li
|
Jie Tang
|
Dingkang Yang
|
Chenshaodong
|
Yingyi Zhang
|
Chao Feng
|
Ran Jiao
Multi-modal keyphrase prediction (MMKP) aims to advance beyond text-only methods by incorporating multiple modalities of input information to produce a set of conclusive phrases. Traditional multi-modal approaches have been proven to have significant limitations in handling the challenging absence and unseen scenarios. Additionally, we identify shortcomings in existing benchmarks that overestimate model capability due to significant overlap between training and test sets. In this work, we propose leveraging vision-language models (VLMs) for the MMKP task. Firstly, we use two widely used strategies, i.e., zero-shot and supervised fine-tuning (SFT), to assess the lower-bound performance of VLMs. Next, to improve the complex reasoning capabilities of VLMs, we adopt Fine-tune-CoT, which leverages high-quality CoT reasoning data generated by a teacher model to finetune smaller models. Finally, to address the “overthinking” phenomenon, we propose a dynamic CoT strategy which adaptively injects CoT data during training, allowing the model to flexibly leverage its reasoning capabilities during the inference stage. We evaluate the proposed strategies on various datasets and the experimental results demonstrate the effectiveness of the proposed approaches. The code is available at https://github.com/bytedance/DynamicCoT.
pdf
bib
abs
Chart2Code53: A Large-Scale Diverse and Complex Dataset for Enhancing Chart-to-Code Generation
Tianhao Niu
|
Yiming Cui
|
Baoxin Wang
|
Xiao Xu
|
Xin Yao
|
Qingfu Zhu
|
Dayong Wu
|
Shijin Wang
|
Wanxiang Che
Chart2code has recently received significant attention in the multimodal community due to its potential to reduce the burden of visualization and promote a more detailed understanding of charts. However, existing Chart2code-related training datasets suffer from at least one of the following issues: (1) limited scale, (2) limited type coverage, and (3) inadequate complexity. To address these challenges, we seek more diverse sources that better align with real-world user distributions and propose dual data synthesis pipelines: (1) synthesis based on online plotting code, and (2) synthesis based on chart images from academic papers. Using these pipelines, we create a large-scale Chart2code training dataset, Chart2code53, covering 53 chart types and 130K chart–code pairs. Experimental results demonstrate that even with few parameters, the model finetuned on Chart2code53 achieves state-of-the-art performance on multiple Chart2code benchmarks among open-source models.
pdf
bib
abs
The State of Multilingual LLM Safety Research: From Measuring The Language Gap To Mitigating It
Zheng Xin Yong
|
Beyza Ermis
|
Marzieh Fadaee
|
Stephen Bach
|
Julia Kreutzer
This paper presents a comprehensive analysis of the linguistic diversity of LLM safety research, highlighting the English-centric nature of the field. Through a systematic review of nearly 300 publications from 2020–2024 across major NLP conferences and workshops at ACL, we identify a significant and growing language gap in LLM safety research, with even high-resource non-English languages receiving minimal attention. We further observe that non-English languages are rarely studied as a standalone language and that English safety research exhibits poor language documentation practice. To motivate future research into multilingual safety, we make several recommendations based on our survey, and we then pose three concrete future directions on safety evaluation, training data generation, and crosslingual safety generalization. Based on our survey and proposed directions, the field can develop more robust, inclusive AI safety practices for diverse global populations.
pdf
bib
abs
AIP: Subverting Retrieval-Augmented Generation via Adversarial Instructional Prompt
Saket Sanjeev Chaturvedi
|
Gaurav Bagwe
|
Lan Emily Zhang
|
Xiaoyong Yuan
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by retrieving relevant documents from external sources to improve factual accuracy and verifiability. However, this reliance introduces new attack surfaces within the retrieval pipeline, beyond the LLM itself. While prior RAG attacks have exposed such vulnerabilities, they largely rely on manipulating user queries, which is often infeasible in practice due to fixed or protected user inputs. This narrow focus overlooks a more realistic and stealthy vector: instructional prompts, which are widely reused, publicly shared, and rarely audited. Their implicit trust makes them a compelling target for adversaries to manipulate RAG behavior covertly. We introduce Adversarial Instructional Prompt (AIP), a novel attack that exploits adversarial instructional prompts to manipulate RAG outputs by subtly altering retrieval behavior. By shifting the attack surface to the instructional prompts, AIP reveals how trusted yet seemingly benign interface components can be weaponized to degrade system integrity. The attack is crafted to achieve three goals: (1) naturalness, to evade user detection; (2) utility, to encourage use of prompts; and (3) robustness, to remain effective across diverse query variations. We propose a diverse query generation strategy that simulates realistic linguistic variation in user queries, enabling the discovery of prompts that generalize across paraphrases and rephrasings. Building on this, a genetic algorithm-based joint optimization is developed to evolve adversarial prompts by balancing attack success, clean-task utility, and stealthiness. Experimental results show that AIP achieves up to 95.23% attack success rate while preserving benign functionality. These findings uncover a critical and previously overlooked vulnerability in RAG systems, emphasizing the need to reassess the trust placed in shared instructional prompts.
pdf
bib
abs
From Capabilities to Performance: Evaluating Key Functional Properties of LLM Architectures in Penetration Testing
Lanxiao Huang
|
Daksh Dave
|
Tyler Cody
|
Peter A. Beling
|
Ming Jin
Large Language Models (LLMs) have been explored for automating or enhancing penetration testing tasks, but their effectiveness and reliability across diverse attack phases remain open questions. This study presents a comprehensive evaluation of multiple LLM-based agents, ranging from singular to modular designs, across realistic penetration testing scenarios, analyzing their empirical performance and recurring failure patterns. We further investigate the impact of core functional capabilities on agent success, operationalized through five targeted augmentations: Global Context Memory (GCM), Inter-Agent Messaging (IAM), Context-Conditioned Invocation (CCI), Adaptive Planning (AP), and Real-Time Monitoring (RTM). These interventions respectively support the capabilities of Context Coherence & Retention, Inter-Component Coordination & State Management, Tool Usage Accuracy & Selective Execution, Multi-Step Strategic Planning & Error Detection & Recovery, and Real-Time Dynamic Responsiveness. Our findings reveal that while some architectures natively exhibit select properties, targeted augmentations significantly enhance modular agent performance—particularly in complex, multi-step, and real-time penetration testing scenarios.
pdf
bib
abs
Editing Across Languages: A Survey of Multilingual Knowledge Editing
Nadir Durrani
|
Basel Mousi
|
Fahim Dalvi
While Knowledge Editing has been extensively studied in monolingual settings, it remains underexplored in multilingual contexts. This survey systematizes recent research on Multilingual Knowledge Editing (MKE), a growing subdomain of model editing focused on ensuring factual edits generalize reliably across languages. We present a comprehensive taxonomy of MKE methods, covering parameter-based, memory-based, fine-tuning, and hypernetwork approaches. We survey available benchmarks, summarize key findings on method effectiveness and transfer patterns, and identify persistent challenges such as cross-lingual propagation, language anisotropy, and limited evaluation for low-resource and culturally specific languages. We also discuss broader concerns such as stability and scalability of multilingual edits. Our analysis consolidates a rapidly evolving area and lays the groundwork for future progress in editable language-aware LLMs.
pdf
bib
abs
Your RAG is Unfair: Exposing Fairness Vulnerabilities in Retrieval-Augmented Generation via Backdoor Attacks
Gaurav Bagwe
|
Saket Sanjeev Chaturvedi
|
Xiaolong Ma
|
Xiaoyong Yuan
|
Kuang-Ching Wang
|
Lan Emily Zhang
Retrieval-augmented generation (RAG) enhances factual grounding by integrating retrieval mechanisms with generative models but introduces new attack surfaces, particularly through backdoor attacks. While prior research has largely focused on disinformation threats, fairness vulnerabilities remain underexplored. Unlike conventional backdoors that rely on direct trigger-to-target mappings, fairness-driven attacks exploit the interaction between retrieval and generation models, manipulating semantic relationships between target groups and social biases to establish a persistent and covert influence on content generation. This paper introduces BiasRAG, a systematic framework that exposes fairness vulnerabilities in RAG through a two-phase backdoor attack. During the pre-training phase, the query encoder is compromised to align the target group with the intended social bias, ensuring long-term persistence. In the post-deployment phase, adversarial documents are injected into knowledge bases to reinforce the backdoor, subtly influencing retrieved content while remaining undetectable under standard fairness evaluations. Together, BiasRAG ensures precise target alignment over sensitive attributes, stealthy execution, and resilience. Empirical evaluations demonstrate that BiasRAG achieves high attack success rates while preserving contextual relevance and utility, establishing a persistent and evolving threat to fairness in RAG.
pdf
bib
abs
Drift-Adapter: A Practical Approach to Near Zero-Downtime Embedding Model Upgrades in Vector Databases
Harshil Vejendla
Upgrading embedding models in production vector databases typically necessitates re-encoding the entire corpus and rebuilding the Approximate Nearest Neighbor (ANN) index, leading to significant operational disruption and computational cost. This paper presents Drift-Adapter, a lightweight, learnable transformation layer designed to bridge embedding spaces between model versions. By mapping new queries into the legacy embedding space, Drift-Adapter enables the continued use of the existing ANN index, effectively deferring full re-computation. We systematically evaluate three adapter parameterizations: Orthogonal Procrustes, Low-Rank Affine, and a compact Residual MLP, trained on a small sample of paired old/new embeddings. Experiments on MTEB text corpora and a CLIP image model upgrade (1M items) show that Drift-Adapter recovers 95–99% of the retrieval recall (Recall@10, MRR) of a full re-embedding, adding less than 10 μs of query latency. Compared to operational strategies like full re-indexing or dual-index serving, Drift-Adapter dramatically reduces recompute costs (by over 100 times) and facilitates upgrades with near-zero operational interruption. We analyze robustness to varied model drift, training data size, scalability to billion-item systems, and the impact of design choices like diagonal scaling, demonstrating Drift-Adapter’s viability as a pragmatic solution for agile model deployment.
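A minimal sketch of the Orthogonal Procrustes parameterization mentioned above, assuming paired old/new embeddings are available as NumPy arrays (the variable names and toy data are ours): fit an orthogonal map on a small paired sample, then project new-model queries into the legacy space so the existing ANN index can keep serving.

```python
import numpy as np

def fit_procrustes(new_embs: np.ndarray, old_embs: np.ndarray) -> np.ndarray:
    """Solve min_W ||new_embs @ W - old_embs||_F with W orthogonal (closed form via SVD)."""
    u, _, vt = np.linalg.svd(new_embs.T @ old_embs)
    return u @ vt

def map_query(query_new: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Project a new-model query embedding into the legacy embedding space."""
    return query_new @ w

# Toy usage: 1,000 paired items in a 256-dimensional space.
rng = np.random.default_rng(0)
old = rng.normal(size=(1000, 256))
rotation = np.linalg.qr(rng.normal(size=(256, 256)))[0]
new = old @ rotation.T                      # pretend the new model is a rotated old model
w = fit_procrustes(new, old)
print(np.allclose(map_query(new[:5], w), old[:5], atol=1e-6))  # recovers the legacy vectors
```

An orthogonal map preserves norms and inner products, which is what makes reusing the legacy ANN index plausible; the Low-Rank Affine and Residual MLP variants trade that rigidity for extra fitting capacity.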
pdf
bib
abs
The Staircase of Ethics: Probing LLM Value Priorities through Multi-Step Induction to Complex Moral Dilemmas
Ya Wu
|
Qiang Sheng
|
Danding Wang
|
Guang Yang
|
Yifan Sun
|
Zhengjia Wang
|
Yuyan Bu
|
Juan Cao
Ethical decision-making is a critical aspect of human judgment, and the growing use of LLMs in decision-support systems necessitates a rigorous evaluation of their moral reasoning capabilities. However, existing assessments primarily rely on single-step evaluations, failing to capture how models adapt to evolving ethical challenges. Addressing this gap, we introduce the Multi-step Moral Dilemmas (MMDs), the first dataset specifically constructed to evaluate the evolving moral judgments of LLMs across 3,302 five-stage dilemmas. This framework enables a fine-grained, dynamic analysis of how LLMs adjust their moral reasoning across escalating dilemmas. Our evaluation of nine widely used LLMs reveals that their value preferences shift significantly as dilemmas progress, indicating that models recalibrate moral judgments based on scenario complexity. Furthermore, pairwise value comparisons demonstrate that while LLMs often prioritize the value of care, this value can sometimes be superseded by fairness in certain contexts, highlighting the dynamic and context-dependent nature of LLM ethical reasoning. Our findings call for a shift toward dynamic, context-aware evaluation paradigms, paving the way for more human-aligned and value-sensitive development of LLMs.
pdf
bib
abs
SliceMoE: Routing Embedding Slices Instead of Tokens for Fine-Grained and Balanced Transformer Scaling
Harshil Vejendla
Mixture-of-Experts (MoE) layers scale transformers by routing tokens to a sparse subset of feed-forward experts. Token-level routing, however, assigns an entire semantic spectrum to each expert, creating capacity bottlenecks, load-balancing pathologies, and limited specialisation. We introduce SliceMoE, an architecture that routes contiguous slices of a token’s hidden vector. A d-dimensional embedding is partitioned into S slices, and for each slice, a lightweight shared router predicts the top-k experts. Experts operate on their assigned slices independently, and outputs are re-assembled, maintaining per-token FLOP efficiency. Because slices from different tokens interleave within an expert, utilisation is naturally smoother. We propose a slice-level capacity loss, cross-slice dropout, and efficient fused batched-GEMM kernels. Experiments on WikiText-103 language modelling, WMT En–De translation, and three text-classification datasets show SliceMoE attains up to 1.7x faster inference than dense baselines, 12–18% lower perplexity than parameter-matched token-MoE, and improved expert balance, with interpretable expertise over syntactic versus semantic sub-spaces.
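The following is a rough, self-contained sketch of slice-level routing as we understand it from the abstract; the module layout, hyperparameters, and the simple loop over experts are our own simplifications (a real implementation would use fused batched GEMMs and add the slice-level capacity loss and cross-slice dropout described above).

```python
import torch
import torch.nn as nn

class SliceRoutedFFN(nn.Module):
    """Hedged sketch of slice-level MoE routing (hyperparameters and layout are ours)."""

    def __init__(self, d_model=512, n_slices=4, n_experts=8, top_k=2, d_hidden=1024):
        super().__init__()
        assert d_model % n_slices == 0
        self.d_slice = d_model // n_slices
        self.n_slices, self.top_k = n_slices, top_k
        self.router = nn.Linear(self.d_slice, n_experts)   # lightweight router shared across slices
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(self.d_slice, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, self.d_slice))
            for _ in range(n_experts)
        )

    def forward(self, x):                                   # x: (batch, seq, d_model)
        b, t, d = x.shape
        slices = x.view(b, t, self.n_slices, self.d_slice)  # partition the hidden vector into slices
        probs = self.router(slices).softmax(-1)             # (b, t, S, E) routing probabilities
        weights, idx = torch.topk(probs, self.top_k, dim=-1)
        out = torch.zeros_like(slices)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)                    # which slices go to expert e at rank k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(slices[mask])
        return out.view(b, t, d)                             # re-assemble the slices per token

# Usage: one forward pass on random activations.
x = torch.randn(2, 5, 512)
print(SliceRoutedFFN()(x).shape)   # torch.Size([2, 5, 512])
```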
pdf
bib
abs
ReSo: A Reward-driven Self-organizing LLM-based Multi-Agent System for Reasoning Tasks
Heng Zhou
|
Hejia Geng
|
Xiangyuan Xue
|
Li Kang
|
Yiran Qin
|
Zhiyong Wang
|
Zhenfei Yin
|
Lei Bai
Multi-agent systems have emerged as a promising approach for enhancing the reasoning capabilities of large language models in complex problem-solving. However, current MAS frameworks are limited by poor flexibility and scalability, with underdeveloped optimization strategies. To address these challenges, we propose ReSo, which integrates task graph generation with a reward-driven two-stage agent selection process. The core of ReSo is the proposed Collaborative Reward Model, which can provide fine-grained reward signals for MAS cooperation for optimization. We also introduce an automated data synthesis framework for generating MAS benchmarks, without human annotations. Experimentally, ReSo matches or outperforms existing methods. ReSo achieves 33.7% and 32.3% accuracy on Math-MAS and SciBench-MAS, respectively, while other methods completely fail. The code and data are available at [Reso](https://github.com/hengzzzhou/ReSo).
pdf
bib
abs
ConstraintLLM: A Neuro-Symbolic Framework for Industrial-Level Constraint Programming
Weichun Shi
|
Minghao Liu
|
Wanting Zhang
|
Langchen Shi
|
Fuqi Jia
|
Feifei Ma
|
Jian Zhang
Constraint programming (CP) is a crucial technology for solving real-world constraint optimization problems (COPs), with the advantages of rich modeling semantics and high solving efficiency. Using large language models (LLMs) to automatically generate formal models for COPs is becoming a promising approach, which aims to build trustworthy neuro-symbolic AI with the help of symbolic solvers. However, CP has received less attention compared to works based on operations research (OR) models. We introduce ConstraintLLM, the first LLM specifically designed for CP modeling, which is trained on an open-source LLM with multi-instruction supervised fine-tuning. We propose the Constraint-Aware Retrieval Module (CARM) to increase the in-context learning capabilities, which is integrated into a Tree-of-Thoughts (ToT) framework with a guided self-correction mechanism. Moreover, we construct and release IndusCP, the first industrial-level benchmark for CP modeling, which contains 140 challenging tasks from various domains. Our experiments demonstrate that ConstraintLLM achieves state-of-the-art solving accuracy across multiple benchmarks and outperforms the baselines by 2x on the new IndusCP benchmark. Code and data are available at: https://github.com/william4s/ConstraintLLM.
pdf
bib
abs
VisEscape: A Benchmark for Evaluating Exploration-driven Decision-making in Virtual Escape Rooms
Seungwon Lim
|
Sungwoong Kim
|
Jihwan Yu
|
Sungjae Lee
|
Jiwan Chung
|
Youngjae Yu
Escape rooms present a unique cognitive challenge that demands exploration-driven planning: with the sole instruction to escape the room, players must actively search their environment, collect information, and find solutions through repeated trial and error. Motivated by this, we introduce VisEscape, a benchmark of 20 virtual escape rooms specifically designed to evaluate AI models under these challenging conditions, where success depends not only on solving isolated puzzles but also on iteratively constructing and refining spatial-temporal knowledge of a dynamically changing environment. On VisEscape, we observe that even state-of-the-art multi-modal models generally fail to escape the rooms, showing considerable variation in their progress and problem-solving approaches. We find that integrating memory management and reasoning contributes to efficient exploration and enables successive hypothesis formulation and testing, thereby leading to significant improvements in dynamic and exploration-driven environments.
pdf
bib
abs
ESC-Judge: A Framework for Comparing Emotional Support Conversational Agents
Navid Madani
|
Rohini Srihari
Large Language Models (LLMs) increasingly power mental-health chatbots, yet the field still lacks a scalable, theory-grounded way to decide which model is more effective to deploy. We present ESC-Judge, the first end-to-end evaluation framework that (i) grounds head-to-head comparison of Emotional-Support LLMs (ES-LLMs) in an established psychological theory—Clara Hill’s Exploration–Insight–Action (E-I-A) counselling model—thereby delivering a structured, interpretable lens on performance, and (ii) fully automates the pipeline at scale. ESC-Judge proceeds in three stages: (1) it synthesizes realistic help-seeker roles by sampling empirically salient attributes (stressors, personality, life history); (2) it has two candidate ES-Agents conduct separate sessions with the same role, isolating model-specific strategies; and (3) it asks a specialised judge LLM to issue pairwise preferences across rubric-anchored skills that exhaustively cover the E-I-A spectrum. In our empirical study, ESC-Judge matches PhD-level annotators in 85% of Exploration, 83% of Insight, and 86% of Action decisions, demonstrating human-level reliability at a fraction of the cost. We release all code, prompts, synthetic roles, transcripts, and judgment scripts to catalyze transparent progress in emotionally supportive AI.
pdf
bib
abs
Neuron-Level Differentiation of Memorization and Generalization in Large Language Models
Ko-Wei Huang
|
Yi-Fu Fu
|
Ching-Yu Tsai
|
Yu-Chieh Tu
|
Tzu-ling Cheng
|
Cheng-Yu Lin
|
Yi-Ting Yang
|
Heng-Yi Liu
|
Keng-Te Liao
|
Da-Cheng Juan
|
Shou-De Lin
We investigate how Large Language Models (LLMs) distinguish between memorization and generalization at the neuron level. Through carefully designed tasks, we identify distinct neuron subsets responsible for each behavior. Experiments on both a GPT-2 model trained from scratch and a pretrained LLaMA-3.2 model fine-tuned with LoRA show consistent neuron-level specialization. We further demonstrate that inference-time interventions on these neurons can steer the model’s behavior toward memorization or generalization. To assess robustness, we evaluate intra-task and inter-task consistency, confirming that these neuron-behavior associations reflect generalizable patterns rather than dataset-specific artifacts. Our findings reveal modular structure in LLMs and enable controlling memorization and generalization behaviors at inference time.
pdf
bib
abs
Sparse Neurons Carry Strong Signals of Question Ambiguity in LLMs
Zhuoxuan Zhang
|
Jinhao Duan
|
Edward Kim
|
Kaidi Xu
Ambiguity is pervasive in real-world questions, yet large language models (LLMs) often respond with confident answers rather than seeking clarification. In this work, we show that question ambiguity is linearly encoded in the internal representations of LLMs and can be both detected and controlled at the neuron level. During the model’s pre-filling stage, we identify that a small number of neurons, as few as one, encode question ambiguity information. Probes trained on these Ambiguity-Encoding Neurons (AENs) achieve strong performance on ambiguity detection and generalize across datasets, outperforming prompting-based and representation-based baselines. Layerwise analysis reveals that AENs emerge from shallow layers, suggesting early encoding of ambiguity signals in the model’s processing pipeline. Finally, we show that through manipulating AENs, we can control LLM’s behavior from direct answering to abstention. Our findings reveal that LLMs form compact internal representations of question ambiguity, enabling interpretable and controllable behavior.
pdf
bib
abs
Do Slides Help? Multi-modal Context for Automatic Transcription of Conference Talks
Supriti Sinhamahapatra
|
Jan Niehues
State-of-the-art (SOTA) Automatic Speech Recognition (ASR) systems primarily rely on acoustic information while disregarding additional multi-modal context. However, visual information is essential for disambiguation and adaptation. While most work focuses on speaker images to handle noise conditions, this work also focuses on integrating presentation slides for the use case of scientific presentations. As a first step, we create a benchmark for multi-modal presentations, including an automatic analysis of how domain-specific terminology is transcribed. Next, we explore methods for augmenting speech models with multi-modal information. We mitigate the lack of datasets with accompanying slides through a suitable data augmentation approach. Finally, we train a model using the augmented dataset, resulting in a relative reduction in word error rate of approximately 34% across all words and 35% for domain-specific terms compared to the baseline model.
pdf
bib
abs
Promote, Suppress, Iterate: How Language Models Answer One-to-Many Factual Queries
Tianyi Lorena Yan
|
Robin Jia
To answer one-to-many factual queries (e.g., listing cities of a country), a language model (LM) must simultaneously recall knowledge and avoid repeating previous answers. How are these two subtasks implemented and integrated internally? Across multiple datasets, models, and prompt templates, we identify a promote-then-suppress mechanism: the model first recalls all answers, and then suppresses previously generated ones. Specifically, LMs use both the subject and previous answer tokens to perform knowledge recall, with attention propagating subject information and MLPs promoting the answers. Then, attention attends to and suppresses previous answer tokens, while MLPs amplify the suppression signal. Our mechanism is corroborated by extensive experimental evidence: in addition to using early decoding and causal tracing, we analyze how components use different tokens by introducing both Token Lens, which decodes aggregated attention updates from specified tokens, and a knockout method that analyzes changes in MLP outputs after removing attention to specified tokens. Overall, we provide new insights into how LMs’ internal components interact with different input tokens to support complex factual recall.
pdf
bib
abs
Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames
Sahithya Ravi
|
Gabriel Herbert Sarch
|
Vibhav Vineet
|
Andrew D Wilson
|
Balasaravanan Thoravi Kumaravel
An embodied AI assistant operating on egocentric video must integrate spatial cues across time - for instance, determining where an object A, glimpsed a few moments ago, lies relative to an object B encountered later. We introduce Disjoint-3DQA, a generative QA benchmark that evaluates this ability of VLMs by posing questions about object pairs that are not co-visible in the same frame. We evaluated seven state-of-the-art VLMs and found that models lag behind human performance by 28%, with steeper declines in accuracy (60% → 30%) as the temporal gap widens. Our analysis further reveals that providing trajectories or bird’s-eye-view projections to VLMs results in only marginal improvements, whereas providing oracle 3D coordinates leads to a substantial 20% performance increase. This highlights a core bottleneck of multi-frame VLMs in constructing and maintaining 3D scene representations over time from visual signals. Disjoint-3DQA therefore sets a clear, measurable challenge for long-horizon spatial reasoning and aims to catalyze future research at the intersection of vision, language, and embodied AI.
pdf
bib
abs
Enhancing Chain-of-Thought Reasoning via Neuron Activation Differential Analysis
Yiru Tang
|
Kun Zhou
|
Yingqian Min
|
Xin Zhao
|
Jing Sha
|
Zhichao Sheng
|
Shijin Wang
Despite the impressive chain-of-thought (CoT) reasoning ability of large language models (LLMs), its underlying mechanisms remain unclear. In this paper, we explore the inner workings of LLMs’ CoT ability through the lens of neurons in the feed-forward layers. We propose an efficient method to identify reasoning-critical neurons by analyzing their activation patterns under reasoning chains of varying quality. Based on this, we devise a simple intervention method that directly stimulates these reasoning-critical neurons to guide the generation of high-quality reasoning chains. Extensive experiments validate the effectiveness of our method and demonstrate the critical role these identified neurons play in CoT reasoning.
pdf
bib
abs
PakBBQ: A Culturally Adapted Bias Benchmark for QA
Abdullah Hashmat
|
Muhammad Arham Mirza
|
Agha Ali Raza
With the widespread adoption of Large Language Models (LLMs) across various applications, it is imperative to ensure their fairness across all user communities. However, most LLMs are trained and evaluated on Western-centric data, with little attention paid to low-resource languages and regional contexts. To address this gap, we introduce PakBBQ, a culturally and regionally adapted extension of the original Bias Benchmark for Question Answering (BBQ) dataset. PakBBQ comprises over 214 templates and 17,180 QA pairs across 8 categories in both English and Urdu, covering eight bias dimensions relevant in Pakistan, including age, disability, appearance, gender, socio-economic status, religion, regional affiliation, and language formality. We evaluate multiple multilingual LLMs under both ambiguous and explicitly disambiguated contexts, as well as negative versus non-negative question framings. Our experiments reveal (i) an average accuracy gain of 12% with disambiguation, (ii) consistently stronger counter-bias behaviors in Urdu than in English, and (iii) marked framing effects that reduce stereotypical responses when questions are posed negatively. These findings highlight the importance of contextualized benchmarks and simple prompt engineering strategies for bias mitigation in low-resource settings.
pdf
bib
abs
MULTIGUARD: An Efficient Approach for AI Safety Moderation Across Languages and Modalities
Sahil Verma
|
Keegan Hines
|
Jeff Bilmes
|
Charlotte Siska
|
Luke Zettlemoyer
|
Hila Gonen
|
Chandan Singh
The emerging capabilities of large language models (LLMs) have sparked concerns about their immediate potential for harmful misuse. The core approach to mitigate these concerns is the detection of harmful queries to the model. Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages or prompts provided in non-text modalities such as image and audio). To tackle this challenge, we propose OMNIGUARD, an approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and then (ii) uses them to build a language-agnostic or modality-agnostic classifier for detecting harmful prompts. OMNIGUARD improves harmful prompt classification accuracy by 11.57% over the strongest baseline in a multilingual setting, by 20.44% for image-based prompts, and sets a new SOTA for audio-based prompts. By repurposing embeddings computed during generation, OMNIGUARD is also very efficient (≈ 120× faster than the next fastest baseline). Code and data are available at https://github.com/vsahil/OmniGuard
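As a generic illustration of step (ii), building a classifier on internal representations, here is a small probe sketch; the synthetic features and labels below are placeholders of our own, and the paper's alignment step for selecting language- or modality-agnostic representations is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder: hidden-state vectors for prompts labelled harmful (1) or benign (0).
# In practice these would be embeddings taken from an intermediate layer of the LLM/MLLM.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(400, 768))
labels = (hidden_states[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)  # synthetic signal

x_tr, x_te, y_tr, y_te = train_test_split(hidden_states, labels, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)   # lightweight linear probe
print("held-out accuracy of the harmfulness probe:", probe.score(x_te, y_te))
```

Reusing embeddings that the model already computes during generation is also what makes this style of detector cheap relative to running a separate moderation model.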
pdf
bib
abs
Comparing human and LLM politeness strategies in free production
Haoran Zhao
|
Robert D. Hawkins
Polite speech poses a fundamental alignment challenge for large language models (LLMs). Humans deploy a rich repertoire of linguistic strategies to balance informational and social goals – from positive approaches that build rapport (compliments, expressions of interest) to negative strategies that minimize imposition (hedging, indirectness). We investigate whether LLMs employ a similarly context-sensitive repertoire by comparing human and LLM responses to English-language scenarios in both constrained and open-ended production tasks. We find that larger models (≥70B parameters) successfully replicate key effects from the computational pragmatics literature, and human evaluators prefer LLM-generated responses in open-ended contexts. However, further linguistic analyses reveal that models disproportionately rely on negative politeness strategies to create distance even in positive contexts, potentially leading to misinterpretations. While LLMs thus demonstrate an impressive command of politeness strategies, these systematic differences provide important groundwork for making intentional choices about pragmatic behavior in human-AI communication.
pdf
bib
abs
ASTRA: A Negotiation Agent with Adaptive and Strategic Reasoning via Tool-integrated Action for Dynamic Offer Optimization
Deuksin Kwon
|
Jiwon Hae
|
Emma Clift
|
Daniel Shamsoddini
|
Jonathan Gratch
|
Gale Lucas
Negotiation requires dynamically balancing self-interest and cooperation within the flow of conversation to maximize one’s own utility. Yet, existing agents struggle due to bounded rationality in human data, low adaptability to counterpart behavior, and limited strategic reasoning. To address this, we introduce principle-driven negotiation agents, powered by ASTRA, a novel framework for turn-level offer optimization grounded in two core principles: opponent modeling and Tit-for-Tat reciprocity. ASTRA operates in three stages: (1) interpreting counterpart behavior, (2) optimizing counteroffers via a tool-integrated action with a linear programming (LP) solver, and (3) selecting offers based on strategy assessment and the partner’s acceptance probability. Through simulations and human evaluations, our agent effectively adapts to an opponent’s shifting stance and achieves favorable outcomes through enhanced adaptability and strategic reasoning. Beyond enhancing negotiation performance, it also serves as a powerful coaching tool, offering interpretable strategic feedback and optimal offer recommendations beyond human bounded rationality, with its potential further validated through human evaluation.
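As a toy illustration of the tool-integrated LP step, the snippet below searches for an offer that maximizes the agent's own utility subject to a lower bound on the value conceded to the partner; the item values, quantities, and acceptance constraint are entirely our own construction and are not the paper's actual formulation.

```python
from scipy.optimize import linprog

# Toy negotiation over 3 item types with fixed quantities.
quantity = [3, 2, 1]              # books, hats, balls available
my_value = [1, 4, 6]              # my per-item utility
their_value = [2, 3, 5]           # estimated partner per-item utility (opponent model)
min_concession = 6                # estimated value the partner must receive to accept

# Decision variables x_i = number of items of type i that I keep (relaxed to continuous).
# Maximize sum(my_value * x)  ==  minimize -sum(my_value * x).
c = [-v for v in my_value]
# Partner receives (quantity - x); require their_value . (quantity - x) >= min_concession,
# rewritten as  their_value . x <= their_value . quantity - min_concession.
A_ub = [their_value]
b_ub = [sum(v * q for v, q in zip(their_value, quantity)) - min_concession]
bounds = [(0, q) for q in quantity]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print("items I keep:", res.x, "my utility:", -res.fun)
```

In an agent like the one described above, the objective and constraints would be populated from the agent's own preferences and its running opponent model, and the resulting candidate offers would then pass through the strategy-assessment and acceptance-probability stage rather than being sent directly.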
pdf
bib
abs
CARMA: Enhanced Compositionality in LLMs via Advanced Regularisation and Mutual Information Alignment
Nura Aljaafari
|
Danilo Carvalho
|
Andre Freitas
Large language models (LLMs) struggle with compositional generalisation, limiting their ability to systematically combine learned components to interpret novel inputs. While architectural modifications, fine-tuning, and data augmentation improve compositionality, they often have limited adaptability, face scalability constraints, or yield diminishing returns on real data. To address this, we propose CARMA, an intervention that enhances the stability and robustness of compositional reasoning in LLMs while preserving fine-tuned performance. CARMA employs mutual information regularisation and layer-wise stability constraints to mitigate feature fragmentation, ensuring structured representations persist across and within layers. We evaluate CARMA on inverse dictionary modelling and sentiment classification, measuring its impact on semantic consistency, performance stability, and robustness to lexical perturbations. Results show that CARMA reduces the variability introduced by fine-tuning, stabilises token representations, and improves compositional reasoning. While its effectiveness varies across architectures, CARMA’s key strength lies in reinforcing learned structures rather than introducing new capabilities, making it a scalable auxiliary method. These findings suggest that integrating CARMA with fine-tuning can improve compositional generalisation while maintaining task-specific performance in LLMs.
pdf
bib
abs
MEPT: Mixture of Expert Prompt Tuning as a Manifold Mapper
Runjia Zeng
|
Guangyan Sun
|
Qifan Wang
|
Tong Geng
|
Sohail Dianat
|
Xiaotian Han
|
Raghuveer Rao
|
Xueling Zhang
|
Cheng Han
|
Lifu Huang
|
Dongfang Liu
Considering deep neural networks as manifold mappers, the pretrain-then-fine-tune paradigm can be interpreted as a two-stage process: pretraining establishes a broad knowledge base, and fine-tuning adjusts the model parameters to activate specific neural pathways that align with the target manifold. Although prior fine-tuning approaches demonstrate success, their rigid parameter space limits their ability to dynamically activate appropriate neural pathways, rendering them ill-equipped to adapt flexibly to diverse and evolving data distributions. In light of this view, we propose a novel approach, Mixture of Expert Prompt Tuning (MEPT), as an effective and efficient manifold-mapping framework. MEPT leverages the Mixture of Experts architecture by integrating multiple prompt experts to adaptively learn diverse and non-stationary data distributions. Empirical evaluations demonstrate that MEPT outperforms several state-of-the-art parameter-efficient baselines on SuperGLUE, achieving notable improvements in mean accuracy (e.g., 1.94%) while significantly reducing activated prompts by 79.25%. The effectiveness of MEPT is further supported by theoretical insights from manifold learning and validated through neural activation pathway visualization results. Our code is available at https://runjia.tech/emnlp_mept/.
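A minimal sketch of the mixture-of-prompt-experts idea, assuming soft prompts prepended to the input embeddings and a learned gate over prompt experts; the dimensions, pooling, and top-k gating below are illustrative assumptions, not MEPT's actual architecture.

```python
import torch
import torch.nn as nn

class MixtureOfPromptExperts(nn.Module):
    """Toy gate over several soft-prompt experts (illustrative, not the MEPT code)."""
    def __init__(self, n_experts=4, prompt_len=8, hidden=768, top_k=2):
        super().__init__()
        self.experts = nn.Parameter(torch.randn(n_experts, prompt_len, hidden) * 0.02)
        self.gate = nn.Linear(hidden, n_experts)
        self.top_k = top_k

    def forward(self, input_embeds):              # (batch, seq, hidden)
        pooled = input_embeds.mean(dim=1)         # route on a pooled input summary
        scores = self.gate(pooled)                # (batch, n_experts)
        topv, topi = scores.topk(self.top_k, dim=-1)
        weights = torch.softmax(topv, dim=-1)     # sparse activation of prompt experts
        chosen = self.experts[topi]               # (batch, top_k, prompt_len, hidden)
        prompt = (weights[..., None, None] * chosen).sum(dim=1)
        return torch.cat([prompt, input_embeds], dim=1)   # prepend the mixed prompt

x = torch.randn(2, 16, 768)
print(MixtureOfPromptExperts()(x).shape)   # torch.Size([2, 24, 768])
```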
pdf
bib
abs
KG-CQR: Leveraging Structured Relation Representations in Knowledge Graphs for Contextual Query Retrieval
Chi Minh Bui
|
Ngoc Mai Thieu
|
Vinh Van Nguyen
|
Jason J. Jung
|
Khac-Hoai Nam Bui
The integration of knowledge graphs (KGs) with large language models (LLMs) offers significant potential to enhance the retrieval stage in retrieval-augmented generation (RAG) systems. In this study, we propose KG-CQR, a novel framework for Contextual Query Retrieval (CQR) that enhances the retrieval phase by enriching complex input queries with contextual representations derived from a corpus-centric KG. Unlike existing methods that primarily address corpus-level context loss, KG-CQR focuses on query enrichment through structured relation representations, extracting and completing relevant KG subgraphs to generate semantically rich query contexts. Comprising subgraph extraction, completion, and contextual generation modules, KG-CQR operates as a model-agnostic pipeline, ensuring scalability across LLMs of varying sizes without additional training. Experimental results on the RAGBench and MultiHop-RAG datasets demonstrate that KG-CQR outperforms strong baselines, achieving improvements of up to 4–6% in mAP and approximately 2–3% in Recall@25. Furthermore, evaluations on challenging RAG tasks such as multi-hop question answering show that incorporating KG-CQR improves retrieval effectiveness over the existing baseline.
pdf
bib
abs
SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection
Maithili Joshi
|
Palash Nandi
|
Tanmoy Chakraborty
Large Language Models (LLMs) with safe-alignment training are powerful instruments with robust language comprehension capability. Typically, LLMs undergo careful alignment training involving human feedback to ensure the acceptance of safe inputs and the rejection of harmful or unsafe ones. However, these humongous models are still vulnerable to jailbreak attacks, in which malicious users attempt to generate harmful outputs that safety-aligned LLMs are trained to avoid. In this study, we find that the safety mechanisms in LLMs are predominantly located in the middle-to-late layers. Based on this observation, we introduce a novel white-box jailbreak method SABER (Safety Alignment Bypass via Extra Residuals) that connects two intermediate layers s and e (with s < e) via a residual connection, achieving an improvement of 51% over the best-performing baseline GCG on the HarmBench test set. Moreover, the model demonstrates only a marginal shift in perplexity when evaluated on the validation set of HarmBench.
pdf
bib
abs
When Truthful Representations Flip Under Deceptive Instructions?
Xianxuan Long
|
Yao Fu
|
Runchao Li
|
Mu Sheng
|
Haotian Yu
|
Xiaotian Han
|
Pan Li
Large language models (LLMs) tend to follow maliciously crafted instructions to generate deceptive responses, posing safety challenges. How deceptive instructions alter the internal representations of LLMs compared to truthful ones remains poorly understood beyond output analysis. To bridge this gap, we investigate when and how these representations “flip”, such as from truthful to deceptive, under deceptive versus truthful/neutral instructions. Analyzing the internal representations of Llama-3.1-8B-Instruct and Gemma-2-9B-Instruct on a factual verification task, we find that the model’s instructed True/False output is predictable from the internal representations via linear probes across all conditions. Further, we use Sparse Autoencoders (SAEs) to show that deceptive instructions induce significant representational shifts compared to truthful/neutral representations (which are similar to each other), concentrated in early-to-mid layers and detectable even on complex datasets. We also identify specific SAE features highly sensitive to deceptive instructions and use targeted visualizations to confirm distinct truthful/deceptive representational subspaces.
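The linear-probing step can be illustrated with a short sketch: fit a logistic-regression probe on per-example hidden states from one layer to predict the instructed True/False output. The synthetic features below stand in for real hidden states; the layer choice and data generation are assumptions, not the paper's setup.

```python
# Minimal linear-probe sketch: predict the instructed True/False label from
# per-example hidden states of one layer. Features here are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 400, 256
labels = rng.integers(0, 2, size=n)                  # instructed True/False output
direction = rng.normal(size=d)                       # hypothetical "truth" direction
hidden = rng.normal(size=(n, d)) + np.outer(labels * 2 - 1, direction) * 0.5

X_tr, X_te, y_tr, y_te = train_test_split(hidden, labels, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")
```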
pdf
bib
abs
Can LLMs simulate the same correct solutions to free-response math problems as real students?
Yuya Asano
|
Diane Litman
|
Erin Walker
Large language models (LLMs) have emerged as powerful tools for developing educational systems. While previous studies have explored modeling student mistakes, a critical gap remains in understanding whether LLMs can generate correct solutions that represent student responses to free-response problems. In this paper, we compare the distribution of solutions produced by four LLMs (one proprietary model, two open-source general models, and one open-source math model) under various sampling and prompting techniques with the distribution of solutions generated by students, using conversations where students teach math problems to a conversational robot. Our study reveals discrepancies between the correct solutions produced by LLMs and by students. We discuss the practical implications of these findings for the design and evaluation of LLM-supported educational systems.
pdf
bib
abs
Evaluating Behavioral Alignment in Conflict Dialogue: A Multi-Dimensional Comparison of LLM Agents and Humans
Deuksin Kwon
|
Kaleen Shrestha
|
Bin Han
|
Elena Hayoung Lee
|
Gale Lucas
Large Language Models (LLMs) are increasingly deployed in socially complex, interaction-driven tasks, yet their ability to mirror human behavior in emotionally and strategically complex contexts remains underexplored. This study assesses the behavioral alignment of personality-prompted LLMs in adversarial dispute resolution by simulating multi-turn conflict dialogues that incorporate negotiation. Each LLM is guided by a matched Five-Factor personality profile to control for individual variation and enhance realism. We evaluate alignment across three dimensions: linguistic style, emotional expression (e.g., anger dynamics), and strategic behavior. GPT-4.1 achieves the closest alignment with humans in linguistic style and emotional dynamics, while Claude-3.7-Sonnet best reflects strategic behavior. Nonetheless, substantial alignment gaps persist. Our findings establish a benchmark for alignment between LLMs and humans in socially complex interactions, underscoring both the promise and the limitations of personality conditioning in dialogue modeling.
pdf
bib
abs
RECALL: REpresentation-aligned Catastrophic-forgetting ALLeviation via Hierarchical Model Merging
Bowen Wang
|
Haiyuan Wan
|
Liwen Shi
|
Chen Yang
|
Peng He
|
Yue Ma
|
Haochen Han
|
Wenhao Li
|
Tiao Tan
|
Yongjian Li
|
Fangming Liu
|
Gong Yifan
|
Sheng Zhang
We unveil that internal representations in large language models (LLMs) serve as reliable proxies of learned knowledge, and propose **RECALL**, a novel representation-aware model merging framework for continual learning without access to historical data. RECALL computes inter-model similarity from layer-wise hidden representations over clustered typical samples, and performs adaptive, hierarchical parameter fusion to align knowledge across models. This design enables the preservation of domain-general features in shallow layers while allowing task-specific adaptation in deeper layers. Unlike prior methods that require task labels or incur performance trade-offs, RECALL achieves seamless multi-domain integration and strong resistance to catastrophic forgetting. Extensive experiments across five NLP tasks and multiple continual learning scenarios show that RECALL outperforms baselines in both knowledge retention and generalization, providing a scalable and data-free solution for evolving LLMs.
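A minimal sketch of the representation-aware merging idea: compute a per-layer similarity between two models' hidden representations on shared probe samples, then fuse parameters layer by layer with weights derived from that similarity. The arrays below are synthetic stand-ins; RECALL's actual sample clustering and hierarchical fusion are more involved than this averaging rule.

```python
import numpy as np

def layer_similarity(h_a, h_b):
    """Cosine similarity between mean-pooled layer representations of two models."""
    a, b = h_a.mean(axis=0), h_b.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

rng = np.random.default_rng(0)
n_layers, n_samples, hidden = 4, 32, 64
# Synthetic per-layer hidden states over the same probe samples for two models.
reps_a = [rng.normal(size=(n_samples, hidden)) for _ in range(n_layers)]
reps_b = [r + rng.normal(scale=0.3, size=r.shape) for r in reps_a]
params_a = [rng.normal(size=(hidden, hidden)) for _ in range(n_layers)]
params_b = [rng.normal(size=(hidden, hidden)) for _ in range(n_layers)]

merged = []
for layer in range(n_layers):
    sim = layer_similarity(reps_a[layer], reps_b[layer])
    w = (sim + 1) / 2                       # map cosine in [-1, 1] to [0, 1]
    avg = 0.5 * (params_a[layer] + params_b[layer])
    # Similar (domain-general) layers are averaged; dissimilar layers stay near model A.
    merged.append(w * avg + (1 - w) * params_a[layer])
    print(f"layer {layer}: similarity={sim:.2f}, averaging weight={w:.2f}")
```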
pdf
bib
abs
Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions
Emmy Liu
|
Amanda Bertsch
|
Lintang Sutawika
|
Lindia Tjuatja
|
Patrick Fernandes
|
Lara Marinov
|
Michael Chen
|
Shreya Singhal
|
Carolin Lawrence
|
Aditi Raghunathan
|
Kiril Gashteovski
|
Graham Neubig
Improvements in language model capabilities are often attributed to increasing model size or training data, but in some cases smaller models trained on curated data or with different architectural decisions can outperform larger ones trained on more tokens. What accounts for this? To quantify the impact of these design choices, we meta-analyze 92 open-source pretrained models across a wide array of scales, including state-of-the-art open-weights models as well as less performant models and those with less conventional design decisions. We find that by incorporating features besides model size and number of training tokens, we can achieve a relative 3-28% increase in ability to predict downstream performance compared with using scale alone. Analysis of model design decisions reveals insights into data composition, such as the trade-off between language and code tasks at 15-25% code, as well as the negative impact of web data on truthfulness. Broadly, our framework lays a foundation for more systematic investigation of how model development choices shape final capabilities.
pdf
bib
abs
Synthetic Socratic Debates: Examining Persona Effects on Moral Decision and Persuasion Dynamics
Jiarui Liu
|
Yueqi Song
|
Yunze Xiao
|
Mingqian Zheng
|
Lindia Tjuatja
|
Jana Schaich Borg
|
Mona T. Diab
|
Maarten Sap
As large language models (LLMs) are increasingly used in morally sensitive domains, it is crucial to understand how persona traits affect their moral reasoning and persuasive behavior. We present the first large-scale study of multi-dimensional persona effects in AI-AI debates over real-world moral dilemmas. Using a 6-dimensional persona space (age, gender, country, social class, ideology, and personality), we simulate structured debates between AI agents over 131 relationship-based cases. Our results show that personas affect initial moral stances and debate outcomes, with political ideology and personality traits exerting the strongest influence. Persuasive success varies across traits, with liberal and open personalities reaching higher consensus. While logit-based confidence grows during debates, emotional and credibility-based appeals diminish, indicating more tempered argumentation over time. These trends mirror findings from psychology and cultural studies, reinforcing the need for persona-aware evaluation frameworks for AI moral reasoning.
pdf
bib
abs
Linear-Time Demonstration Selection for In-Context Learning via Gradient Estimation
Ziniu Zhang
|
Zhenshuo Zhang
|
Dongyue Li
|
Lu Wang
|
Jennifer Dy
|
Hongyang R. Zhang
This paper introduces an algorithm to select demonstration examples for in-context learning of a query set. Given a set of n examples, how can we quickly select k out of n to best serve as the conditioning for downstream inference? This problem has broad applications in prompt tuning and chain-of-thought reasoning. Since model weights remain fixed during in-context learning, previous work has sought to design methods based on the similarity of token embeddings. This work proposes a new approach based on gradients of the output taken in the input embedding space. Our approach estimates model outputs through a first-order approximation using the gradients. Then, we apply this estimation to multiple randomly sampled subsets. Finally, we aggregate the sampled subset outcomes to form an influence score for each demonstration, and select the k most relevant examples. This procedure only requires pre-computing model outputs and gradients once, resulting in a linear-time algorithm relative to model and training set sizes. Extensive experiments across various models and datasets validate the efficiency of our approach. We show that the gradient estimation procedure yields approximations of full inference with less than 1% error across six datasets. This allows us to scale up subset selection that would otherwise run full inference by up to 37.7× on models with up to 34 billion parameters, and outperform existing selection methods based on input embeddings by 11% on average.
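The first-order estimation step can be sketched as follows: pre-compute the model output and its gradient with respect to the input embeddings once, then approximate the output for a perturbed demonstration set with a Taylor expansion instead of a new forward pass. The quadratic toy "model" below is a stand-in for an LLM's output functional, not the paper's setup.

```python
# Sketch of first-order output estimation: f(e + delta) ~ f(e) + grad_f(e) . delta.
# The toy scalar "model output" stands in for an LLM scoring function.
import numpy as np

rng = np.random.default_rng(0)
d = 16
W = rng.normal(size=(d, d))

def model_output(e):                 # stand-in for a forward pass on embeddings e
    return float(np.tanh(e @ W @ e))

def model_grad(e, eps=1e-5):         # numerical gradient, computed once per base point
    g = np.zeros_like(e)
    for i in range(d):
        step = np.zeros(d); step[i] = eps
        g[i] = (model_output(e + step) - model_output(e - step)) / (2 * eps)
    return g

base = rng.normal(size=d) * 0.1      # embedding of the current demonstration set
f0, g0 = model_output(base), model_grad(base)

delta = rng.normal(size=d) * 0.01    # swapping in a different sampled subset
estimate = f0 + g0 @ delta           # cheap estimate, no extra forward pass
actual = model_output(base + delta)
print(f"estimate={estimate:.5f}  actual={actual:.5f}  error={abs(estimate - actual):.2e}")
```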
pdf
bib
abs
Speech Vecalign: an Embedding-based Method for Aligning Parallel Speech Documents
Chutong Meng
|
Philipp Koehn
We present Speech Vecalign, a parallel speech document alignment method that monotonically aligns speech segment embeddings and does not depend on text transcriptions. Compared to the baseline method Global Mining, a variant of speech mining, Speech Vecalign produces longer speech-to-speech alignments. It also demonstrates greater robustness than Local Mining, another speech mining variant, as it produces less noise. We applied Speech Vecalign to 3,000 hours of unlabeled parallel English-German (En-De) speech documents from VoxPopuli, yielding about 1,000 hours of high-quality alignments. We then trained En-De speech-to-speech translation models on the aligned data. Speech Vecalign improves the En-to-De and De-to-En performance over Global Mining by 0.37 and 0.18 ASR-BLEU, respectively. Moreover, our models match or outperform SpeechMatrix model performance, despite using 8 times fewer raw speech documents.
pdf
bib
abs
TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs
Ezgi Başar
|
Francesca Padovani
|
Jaap Jumelet
|
Arianna Bisazza
We introduce TurBLiMP, the first Turkish benchmark of linguistic minimal pairs, designed to evaluate the linguistic abilities of monolingual and multilingual language models (LMs). Covering 16 linguistic phenomena with 1000 minimal pairs each, TurBLiMP fills an important gap in linguistic evaluation resources for Turkish. In designing the benchmark, we give extra attention to two properties of Turkish that remain understudied in current syntactic evaluations of LMs, namely word order flexibility and subordination through morphological processes. Our experiments on a wide range of LMs and a newly collected set of human acceptability judgments reveal that even cutting-edge Large LMs still struggle with grammatical phenomena that are not challenging for humans, and may also exhibit different sensitivities to word order and morphological complexity compared to humans.
pdf
bib
abs
DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition
Hanjun Luo
|
Yingbin Jin
|
Yiran Wang
|
Xinfeng Li
|
Tong Shang
|
Xuecheng Liu
|
Ruizhe Chen
|
Kun Wang
|
Hanan Salam
|
Qingsong Wen
|
Zuozhu Liu
The advancements of Large Language Models (LLMs) have spurred a growing interest in their application to Named Entity Recognition (NER) methods. However, existing datasets are primarily designed for traditional machine learning methods and are inadequate for LLM-based methods, in terms of corpus selection and overall dataset design logic. Moreover, the prevalent fixed and relatively coarse-grained entity categorization in existing datasets fails to adequately assess the superior generalization and contextual understanding capabilities of LLM-based methods, thereby hindering a comprehensive demonstration of their broad application prospects. To address these limitations, we propose DynamicNER, the first NER dataset designed for LLM-based methods with dynamic categorization, introducing various entity types and entity type lists for the same entity in different contexts, better leveraging the generalization ability of LLM-based NER. The dataset is also multilingual and multi-granular, covering 8 languages and 155 entity types, with corpora spanning a diverse range of domains. Furthermore, we introduce CascadeNER, a novel NER method based on a two-stage strategy and lightweight LLMs, achieving higher accuracy on fine-grained tasks while requiring fewer computational resources. Experiments show that DynamicNER serves as a robust and effective benchmark for LLM-based NER methods. Furthermore, we also analyze both traditional and LLM-based methods on our dataset. Our code and dataset are openly available at https://github.com/Astarojth/DynamicNER.
pdf
bib
abs
Reliable and Cost-Effective Exploratory Data Analysis via Graph-Guided RAG
Mossad Helali
|
Yutai Luo
|
Tae Jun Ham
|
Jim Plotts
|
Ashwin Chaugule
|
Jichuan Chang
|
Parthasarathy Ranganathan
|
Essam Mansour
Automating Exploratory Data Analysis (EDA) is critical for accelerating the workflow of data scientists. While Large Language Models (LLMs) offer a promising solution, current LLM-only approaches often exhibit limited accuracy and code reliability on less-studied or private datasets. Moreover, their effectiveness significantly diminishes with open-source LLMs compared to proprietary ones, limiting their usability in enterprises that prefer local models for privacy and cost. To address these limitations, we introduce RAGvis: a novel two-stage graph-guided Retrieval-Augmented Generation (RAG) framework. RAGvis first builds a base knowledge graph (KG) of EDA notebooks and enriches it with structured EDA operation semantics. These semantics are extracted by an LLM guided by our empirically-developed EDA operations taxonomy. Second, in the online generation stage for new datasets, RAGvis retrieves relevant operations from the KG, aligns them to the dataset’s structure, refines them with LLM reasoning, and then employs a self-correcting agent to generate executable Python code. Experiments on two benchmarks demonstrate that RAGvis significantly improves code executability (pass rate), semantic accuracy, and visual quality in generated operations. This enhanced performance is achieved with substantially lower token usage compared to LLM-only baselines. Notably, our approach enables smaller, open-source LLMs to match the performance of proprietary models, presenting a reliable and cost-effective pathway for automated EDA code generation.
pdf
bib
abs
Med-PRM: Medical Reasoning Models with Stepwise, Guideline-verified Process Rewards
Jaehoon Yun
|
Jiwoong Sohn
|
Jungwoo Park
|
Hyunjae Kim
|
Xiangru Tang
|
Daniel Shao
|
Yong Hoe Koo
|
Ko Minhyeok
|
Qingyu Chen
|
Mark Gerstein
|
Michael Moor
|
Jaewoo Kang
Large language models have shown promise in clinical decision making, but current approaches struggle to localize and correct errors at specific steps of the reasoning process. This limitation is critical in medicine, where identifying and addressing reasoning errors is essential for accurate diagnosis and effective patient care. We introduce Med-PRM, a process reward modeling framework that leverages retrieval-augmented generation to verify each reasoning step against established medical knowledge bases. By verifying intermediate reasoning steps with evidence retrieved from clinical guidelines and literature, our model can precisely assess the reasoning quality in a fine-grained manner. Evaluations on five medical QA benchmarks and two open-ended diagnostic tasks demonstrate that Med-PRM achieves state-of-the-art performance, improving the performance of base models by up to 13.50%. Moreover, we demonstrate the generality of Med-PRM by integrating it in a plug-and-play fashion with strong policy models such as Meerkat, achieving over 80% accuracy on MedQA for the first time using small-scale models of 8 billion parameters.
pdf
bib
abs
Graders Should Cheat: Privileged Information Enables Expert-Level Automated Evaluations
Jin Peng Zhou
|
Séb Arnold
|
Nan Ding
|
Kilian Q Weinberger
|
Nan Hua
|
Fei Sha
Auto-evaluating language models (LMs), *i.e.*, using a grader LM to evaluate the candidate LM, is an appealing way to accelerate the evaluation process and the cost associated with it. But this presents a paradox: how can we trust the grader LM, which is presumably weaker than the candidate LM, to assess problems that are beyond the frontier of the capabilities of either model or both? For instance, today’s LMs struggle on graduate-level physics and Olympiad-level math, making them unreliable graders in these domains. We show that providing *privileged information* – such as ground-truth solutions or problem-specific guidelines – improves automated evaluations on such frontier problems. This approach offers two key advantages. First, it expands the range of problems where LM graders apply. Specifically, weaker models can now rate the predictions of stronger models. Second, privileged information can be used to devise easier variations of challenging problems which improves the separability of different LMs on tasks where their performance is generally low. With this approach, general-purpose LM graders match state-of-the-art performance on *RewardBench*, surpassing almost all the specially-tuned models. LM graders also outperform individual human raters on *Vibe-Eval*, and approach human expert graders on Olympiad-level math problems.
pdf
bib
abs
SAMULE: Self-Learning Agents Enhanced by Multi-level Reflection
Yubin Ge
|
Salvatore Romeo
|
Jason Cai
|
Monica Sunkara
|
Yi Zhang
Despite the rapid advancements in LLM agents, they still face the challenge of generating meaningful reflections due to inadequate error analysis and a reliance on rare successful trajectories, especially in complex tasks. In this work, we propose SAMULE, a new framework for self-learning agents powered by a retrospective language model that is trained based on Multi-Level Reflection Synthesis. It first synthesizes high-quality reflections across three complementary levels: Single-Trajectory Learning (micro-level) for detailed error correction; Intra-Task Learning (meso-level) to build error taxonomies across multiple trials of the same task; and Inter-Task Learning (macro-level) to extract transferable insights based on errors of the same type from diverse task failures. Then we fine-tune a language model serving as the retrospective model to generate reflections during inference. We further extend our framework to interactive settings through a foresight-based reflection mechanism, enabling agents to proactively reflect and adapt during user interactions by comparing predicted and actual responses. Extensive experiments on three challenging benchmarks—TravelPlanner, NATURAL PLAN, and Tau-bench—demonstrate that our approach significantly outperforms reflection-based baselines. Our results highlight the critical role of well-designed reflection synthesis and failure-centric learning in building self-improving LLM agents.
pdf
bib
abs
Database-Augmented Query Representation for Information Retrieval
Soyeong Jeong
|
Jinheon Baek
|
Sukmin Cho
|
Sung Ju Hwang
|
Jong C. Park
Information retrieval models that aim to search for documents relevant to a query have shown multiple successes, which have been applied to diverse tasks. Yet, the query from the user is oftentimes short, which challenges the retrievers to correctly fetch relevant documents. To tackle this, previous studies have proposed expanding the query with a couple of additional (user-related) features related to it. However, they may be suboptimal for effectively augmenting the query, and there is plenty of other information available in a relational database to augment it. Motivated by this fact, we present a novel retrieval framework called Database-Augmented Query representation (DAQu), which augments the original query with various (query-related) metadata across multiple tables. In addition, as the number of features in the metadata can be very large and there is no order among them, we encode them with the graph-based set-encoding strategy, which considers hierarchies of features in the database without order. We validate our DAQu in diverse retrieval scenarios, demonstrating that it significantly enhances overall retrieval performance over relevant baselines.
pdf
bib
abs
The Enemy from Within: A Study of Political Delegitimization Discourse in Israeli Political Speech
Naama Rivlin-Angert
|
Guy Mor-Lan
We present the first large-scale computational study of political delegitimization discourse (PDD), defined as symbolic attacks on the normative validity of political entities. We curate and manually annotate a novel Hebrew-language corpus of 10,410 sentences drawn from parliamentary speeches (1993-2023), Facebook posts, and leading news outlets (2018-2021), of which 1,812 instances (17.4%) exhibit PDD and 642 carry additional annotations for intensity, incivility, target type, and affective framing. We introduce a two-stage classification pipeline, and benchmark finetuned encoder models and decoder LLMs. Our best model (DictaLM 2.0) attains an F1 of 0.74 for binary PDD detection and a macro-F1 of 0.67 for classification of delegitimization characteristics. Applying this classifier to longitudinal and cross-platform data, we see a marked rise in PDD over three decades, higher prevalence on social media versus parliamentary debate, greater use by male politicians than by their female counterparts, and stronger tendencies among right-leaning actors, with pronounced spikes during election campaigns and major political events. Our findings demonstrate the feasibility and value of automated PDD analysis for analyzing democratic discourse.
pdf
bib
abs
Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment
Pedram Zaree
|
Md Abdullah Al Mamun
|
Quazi Mishkatul Alam
|
Yue Dong
|
Ihsen Alouani
|
Nael Abu-Ghazaleh
Recent research has shown that carefully crafted jailbreak inputs can induce large language models to produce harmful outputs, despite safety measures such as alignment. It is important to anticipate the range of potential jailbreak attacks to guide effective defenses and accurate assessment of model safety. In this paper, we present a new approach for generating highly effective jailbreak attacks that manipulate the attention of the model to selectively strengthen or weaken attention among different parts of the prompt. By harnessing attention loss, we develop more effective jailbreak attacks that are also transferable. The attacks amplify the success rate of existing jailbreak algorithms, including GCG, AutoDAN, and ReNeLLM, while lowering their generation cost (for example, the amplified GCG attack achieves 91.2% ASR, vs. 67.9% for the original attack on Llama2-7B-chat/AdvBench, using less than a third of the generation time).
pdf
bib
abs
Representation Potentials of Foundation Models for Multimodal Alignment: A Survey
Jianglin Lu
|
Hailing Wang
|
Yi Xu
|
Yizhou Wang
|
Kuo Yang
|
Yun Fu
Foundation models learn highly transferable representations through large-scale pretraining on diverse data. An increasing body of research indicates that these representations exhibit a remarkable degree of similarity across architectures and modalities. In this survey, we investigate the representation potentials of foundation models, defined as the latent capacity of their learned representations to capture task-specific information within a single modality while also providing a transferable basis for alignment and unification across modalities. We begin by reviewing representative foundation models and the key metrics that make alignment measurable. We then synthesize empirical evidence of representation potentials from studies in vision, language, speech, multimodality, and neuroscience. The evidence suggests that foundation models often exhibit structural regularities and semantic consistencies in their representation spaces, positioning them as strong candidates for cross-modal transfer and alignment. We further analyze the key factors that foster representation potentials, discuss open questions, and highlight potential challenges.
pdf
bib
abs
Draft Model Knows When to Stop: Self-Verification Speculative Decoding for Long-Form Generation
Ziyin Zhang
|
Jiahao Xu
|
Tian Liang
|
Xingyu Chen
|
Zhiwei He
|
Rui Wang
|
Zhaopeng Tu
Conventional speculative decoding (SD) methods utilize a predefined length policy for proposing drafts, which implies the premise that the target model smoothly accepts the proposed draft tokens. However, reality deviates from this assumption: the oracle draft length varies significantly, and the fixed-length policy hardly satisfies such a requirement. Moreover, such discrepancy is further exacerbated in scenarios involving complex reasoning and long-form generation, particularly under test-time scaling for reasoning-specialized models. Through both theoretical and empirical estimation, we establish that the discrepancy between the draft and target models can be approximated by the draft model’s prediction entropy: a high entropy indicates a low acceptance rate of draft tokens, and vice versa. Based on this insight, we propose SVIP: Self-Verification Length Policy for Long-Context Speculative Decoding, which is a training-free dynamic length policy for speculative decoding systems that adaptively determines the lengths of draft sequences by referring to the draft entropy. Experimental results on mainstream SD benchmarks as well as reasoning-heavy benchmarks demonstrate the superior performance of SVIP, achieving up to 17% speedup on MT-Bench at 8K context compared with fixed draft lengths, and 22% speedup for QwQ in long-form reasoning.
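The dynamic length policy can be sketched as follows: after each drafted token, compute the draft model's prediction entropy and stop proposing once it crosses a threshold, since high entropy signals a likely rejection by the target model. The toy logits, sharpness schedule, and threshold below are illustrative assumptions, not the paper's values.

```python
import math, random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    return -sum(p * math.log(p + 1e-12) for p in probs)

random.seed(0)
MAX_DRAFT, THRESHOLD = 8, 2.5                     # threshold is illustrative
sharpness = [4.0, 3.0, 2.0, 1.5, 1.0, 0.8, 0.6, 0.5]   # drafts grow less confident

draft = []
for step in range(MAX_DRAFT):
    # Stand-in for the draft model's next-token logits at this drafted position.
    logits = [random.gauss(0, sharpness[step]) for _ in range(50)]
    probs = softmax(logits)
    if entropy(probs) > THRESHOLD:                # high entropy -> likely rejection
        break
    draft.append(max(range(len(probs)), key=probs.__getitem__))

print(f"proposed {len(draft)} draft tokens before handing off to the target model")
```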
pdf
bib
abs
Visual-Aware Speech Recognition for Noisy Scenarios
Balaji Darur
|
Karan Singla
Humans have the ability to utilize visual cues, such as lip movements and visual scenes, to enhance auditory perception, particularly in noisy environments. However, current Automatic Speech Recognition (ASR) or Audio-Visual Speech Recognition (AVSR) models often struggle in noisy scenarios. To solve this task, we propose a model that improves transcription by correlating noise sources to visual cues. Unlike works that rely on lip motion and require the speaker’s visibility, we exploit broader visual information from the environment. This allows our model to naturally filter speech from noise and improve transcription, much like humans do in noisy scenarios. Our method re-purposes pretrained speech and visual encoders, linking them with multi-headed attention. This approach enables the transcription of speech and the prediction of noise labels in video inputs. We introduce a scalable pipeline to develop audio-visual datasets, where visual cues correlate to noise in the audio. We show significant improvements over existing audio-only models in noisy scenarios. Results also highlight that visual cues play a vital role in improving transcription accuracy.
pdf
bib
abs
Advancing Arabic Diacritization: Improved Datasets, Benchmarking, and State-of-the-Art Models
Abubakr Mohamed
|
Hamdy Mubarak
Arabic diacritics, similar to short vowels in English, provide phonetic and grammatical information but are typically omitted in written Arabic, leading to ambiguity. Diacritization (aka diacritic restoration or vowelization) is essential for natural language processing. This paper advances Arabic diacritization through the following contributions: first, we propose a methodology to analyze and refine a large diacritized corpus to improve training quality. Second, we introduce WikiNews-2024, a multi-reference evaluation methodology with an updated version of the standard benchmark “WikiNews-2014”. In addition, we explore various model architectures and propose a BiLSTM-based model that achieves state-of-the-art results with 3.12% and 2.70% WER on WikiNews-2014 and WikiNews-2024, respectively. Moreover, we develop a model that preserves user-specified diacritics while maintaining accuracy. Lastly, we demonstrate that augmenting training data enhances performance in low-resource settings.
pdf
bib
abs
Implicit Values Embedded in How Humans and LLMs Complete Subjective Everyday Tasks
Arjun Arunasalam
|
Madison Pickering
|
Z. Berkay Celik
|
Blase Ur
Large language models (LLMs) can underpin AI assistants that help users with everyday tasks, such as by making recommendations or performing basic computation. Despite AI assistants’ promise, little is known about the implicit values these assistants display while completing subjective everyday tasks. Humans may consider values like environmentalism, charity, and diversity. To what extent do LLMs exhibit these values in completing everyday tasks? How do they compare with humans? We answer these questions by auditing how six popular LLMs complete 30 everyday tasks, comparing LLMs to each other and to 100 human crowdworkers from the US. We find LLMs often do not align with humans, nor with other LLMs, in the implicit values exhibited.
pdf
bib
abs
Dynamic Retriever for In-Context Knowledge Editing via Policy Optimization
Mahmud Wasif Nafee
|
Maiqi Jiang
|
Haipeng Chen
|
Yanfu Zhang
Large language models (LLMs) excel at factual recall yet still propagate stale or incorrect knowledge. In‐context knowledge editing offers a gradient-free remedy suitable for black-box APIs, but current editors rely on static demonstration sets chosen by surface-level similarity, leading to two persistent obstacles: (i) a quantity–quality trade-off, and (ii) lack of adaptivity to task difficulty. We address these issues by dynamically selecting supporting demonstrations according to their utility for the edit. We propose **D**ynamic **R**etriever for **I**n-Context **K**nowledge **E**diting (DR-IKE), a lightweight framework that (1) trains a BERT retriever with REINFORCE to rank demonstrations by editing reward, and (2) employs a *learnable threshold σ* to prune low-value examples, shortening the prompt when the edit is easy and expanding it when the task is hard. DR-IKE performs editing without modifying model weights, relying solely on forward passes for compatibility with black-box LLMs. On the CounterFact benchmark, it improves edit success by up to 17.1%, reduces latency by 41.6%, and preserves accuracy on unrelated queries—demonstrating scalable and adaptive knowledge editing.
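A compact sketch of the retriever-training loop, assuming a learnable scorer over candidate demonstrations trained with REINFORCE against an edit-success reward, plus a learnable threshold that prunes low-scoring demonstrations. The reward function and features below are synthetic placeholders, not DR-IKE's BERT retriever or its actual reward.

```python
import torch

torch.manual_seed(0)
n_demos, feat_dim = 12, 32
demo_feats = torch.randn(n_demos, feat_dim)
true_utility = torch.rand(n_demos)                 # hidden per-demo usefulness (synthetic)

scorer = torch.nn.Linear(feat_dim, 1)
sigma = torch.nn.Parameter(torch.tensor(0.0))      # learnable pruning threshold
opt = torch.optim.Adam(list(scorer.parameters()) + [sigma], lr=0.05)

for step in range(200):
    scores = scorer(demo_feats).squeeze(-1)
    keep_probs = torch.sigmoid(scores - sigma)     # demos above threshold tend to be kept
    keep = torch.bernoulli(keep_probs)             # sampled demonstration subset
    # Synthetic reward: utility of kept demos minus a prompt-length penalty.
    reward = (keep * true_utility).sum() - 0.05 * keep.sum()
    log_prob = (keep * torch.log(keep_probs + 1e-8)
                + (1 - keep) * torch.log(1 - keep_probs + 1e-8)).sum()
    loss = -reward.detach() * log_prob             # REINFORCE objective
    opt.zero_grad(); loss.backward(); opt.step()

final_keep = torch.sigmoid(scorer(demo_feats).squeeze(-1) - sigma) > 0.5
print("learned threshold sigma:", round(float(sigma), 3), "| kept demos:", int(final_keep.sum()))
```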
pdf
bib
abs
LVLMs are Bad at Overhearing Human Referential Communication
Zhengxiang Wang
|
Weiling Li
|
Panagiotis Kaliosis
|
Owen Rambow
|
Susan Brennan
During spontaneous conversations, speakers collaborate on novel referring expressions, which they can then re-use in subsequent conversations. Understanding such referring expressions is an important ability for an embodied agent, so that it can carry out tasks in the real world. This requires integrating and understanding language, vision, and conversational interaction. We study the capabilities of seven state-of-the-art Large Vision Language Models (LVLMs) as overhearers to a corpus of spontaneous conversations between pairs of human discourse participants engaged in a collaborative object-matching task. We find that such a task remains challenging for current LVLMs and they all fail to show a consistent performance improvement as they overhear more conversations from the same discourse participants repeating the same task for multiple rounds. We release our corpus and code for reproducibility and to facilitate future research.
pdf
bib
abs
Let’s Reason Formally: Natural-Formal Hybrid Reasoning Enhances LLM’s Math Capability
Ruida Wang
|
Yuxin Li
|
Yi R. Fung
|
Tong Zhang
Enhancing the mathematical reasoning capabilities of LLMs has garnered significant attention in both the mathematical and computer science communities. Recent works have made substantial progress in both Natural Language (NL) reasoning and Formal Language (FL) reasoning by leveraging the potential of pure Reinforcement Learning (RL) methods on base models. However, RL approaches struggle to impart new capabilities not present in the base model, highlighting the need to integrate more knowledge like FL into NL math reasoning effectively. Yet, this integration is challenging due to inherent disparities in problem structure and reasoning format between NL and FL. To address these challenges, we introduce **NL-FL HybridReasoning (NFL-HR)**, an end-to-end framework designed to incorporate the FL expert into NL math problem-solving. To bridge the NL and FL input format gap, we propose the *NL-FL Problem Alignment* method, which reformulates the Question-Answering (QA) problems in NL as existence theorems in FL. Subsequently, the *Mixed Problem Input* technique we provide enables the FL reasoner to handle both QA and existence problems concurrently. Lastly, we mitigate the NL and FL output format gap in reasoning through an LLM-based *Answer Extraction* mechanism. Comprehensive experiments demonstrate that the **NFL-HR** framework achieves **89.80%** and **84.34%** accuracy rates on the MATH-500 and the AMC benchmarks, surpassing the NL baseline by 4.60% and 4.82%, respectively. Notably, some problems resolved by our framework remain unsolved by the NL baseline model even under a larger number of trials.
pdf
bib
abs
TORSO: Template-Oriented Reasoning Towards General Tasks
Minhyuk Kim
|
Seungyoon Lee
|
Heuiseok Lim
The approaches that guide Large Language Models (LLMs) to emulate human reasoning during response generation have emerged as an effective method for enabling them to solve complex problems in a step-by-step manner, thereby achieving superior performance. However, most existing approaches using few-shot prompts to generate responses heavily depend on the provided examples, limiting the utilization of the model’s inherent reasoning capabilities. Moreover, constructing task-specific few-shot prompts is often costly and may lead to inconsistencies across different tasks. In this work, we introduce Template Oriented Reasoning (TORSO), which elicits the model to utilize its internal reasoning abilities to generate proper responses across various tasks without the need for manually crafted few-shot examples. Our experimental results demonstrate that TORSO achieves strong performance on diverse LLM benchmarks with reasonable rationales.
pdf
bib
abs
Prototypical Human-AI Collaboration Behaviors from LLM-Assisted Writing in the Wild
Sheshera Mysore
|
Debarati Das
|
Hancheng Cao
|
Bahareh Sarrafzadeh
As large language models (LLMs) are used in complex writing workflows, users engage in multi-turn interactions to steer generations to better fit their needs. Rather than passively accepting output, users actively refine, explore, and co-construct text. We conduct a large scale analysis of this collaborative behavior for users engaged in writing tasks in the wild with two popular AI assistants, Bing Copilot and WildChat. Our analysis goes beyond simple task classification or satisfaction estimation common in prior work and instead characterizes how users interact with LLMs through the course of a session. We identify prototypical behaviors in how users interact with LLMs in prompts following their original request. We refer to these as Prototypical Human AI Collaboration Behaviors (PATHs) and find that a small group of PATHs explain a majority of the variation seen in user-LLM interaction. These PATHs span users revising intents, exploring texts, posing questions, adjusting style or injecting new content. Next, we find statistically significant correlations between specific writing intents and PATHs, revealing how users’ intents shape their collaboration behaviors. We conclude by discussing the implications of our findings on LLM alignment.
pdf
bib
abs
WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning
Gagan Mundada
|
Yash Vishe
|
Amit Namburi
|
Xin Xu
|
Zachary Novack
|
Julian McAuley
|
Junda Wu
Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across various vision-language tasks. However, their reasoning abilities in the multimodal symbolic music domain remain largely unexplored. We introduce WildScore, the first in-the-wild multimodal symbolic music reasoning and analysis benchmark, designed to evaluate MLLMs’ capacity to interpret real-world music scores and answer complex musicological queries. Each instance in WildScore is sourced from genuine musical compositions and accompanied by authentic user-generated questions and discussions, capturing the intricacies of practical music analysis. To facilitate a comprehensive evaluation, we propose a systematic taxonomy, comprising both high-level and fine-grained musicological ontologies. Furthermore, we frame complex music reasoning as multiple-choice question answering, enabling controlled and scalable assessment of MLLMs’ symbolic music understanding. Empirical benchmarking of state-of-the-art MLLMs on WildScore reveals intriguing patterns in their visual-symbolic reasoning, uncovering both promising directions and persistent challenges for MLLMs in symbolic music reasoning and analysis. We release the dataset and code.
pdf
bib
abs
TRIAL: Token Relations and Importance Aware Late-interaction for Accurate Text Retrieval
Hyukkyu Kang
|
Injung Kim
|
Wook-Shin Han
Late-interaction based multi-vector retrieval systems have greatly advanced the field of information retrieval by enabling fast and accurate search over millions of documents. However, these systems rely on a naive summation of token-level similarity scores which often leads to inaccurate relevance estimation caused by the tokenization of semantic units (e.g., words and phrases) and the influence of low-content words (e.g., articles and prepositions). To address these challenges, we propose **TRIAL**: **T**oken **R**elations and **I**mportance **A**ware **L**ate-interaction, which enhances late interaction by explicitly modeling token relations and token importance in relevance scoring. Extensive experiments on three widely used benchmarks show that TRIAL achieves state-of-the-art accuracy, with an nDCG@10 of 46.3 on MSMARCO (in-domain), and average nDCG@10 scores of 51.09 and 72.15 on BEIR and LoTTE Search (out-of-domain), respectively. With superior accuracy, TRIAL maintains competitive retrieval speed compared to existing late-interaction methods, making it a practical solution for large-scale text retrieval.
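The scoring change can be illustrated with a minimal sketch: vanilla late interaction sums per-query-token MaxSim values, while an importance-weighted variant down-weights low-content tokens such as articles and prepositions. The embeddings and importance weights below are synthetic; TRIAL's actual token-relation modeling is richer than this weighting rule.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64
def normed(x): return x / np.linalg.norm(x, axis=-1, keepdims=True)

query_tokens = normed(rng.normal(size=(5, dim)))     # e.g. "what is the capital of"
doc_tokens = normed(rng.normal(size=(40, dim)))
# Illustrative per-token importance (low for articles/prepositions, high for content words).
importance = np.array([0.9, 0.2, 0.1, 0.95, 0.15])

sims = query_tokens @ doc_tokens.T                   # (n_query, n_doc) token similarities
maxsim = sims.max(axis=1)                            # best-matching doc token per query token

naive_score = maxsim.sum()                           # vanilla late interaction
weighted_score = (importance * maxsim).sum() / importance.sum()
print(f"naive late-interaction score: {naive_score:.3f}")
print(f"importance-weighted score:    {weighted_score:.3f}")
```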
pdf
bib
abs
Do Large Language Models excel in Complex Logical Reasoning with Formal Language?
Jin Jiang
|
Jianing Wang
|
Yuchen Yan
|
Yang Liu
|
Jianhua Zhu
|
Mengdi Zhang
|
Liangcai Gao
Large Language Models (LLMs) have been shown to achieve breakthrough performances on complex logical reasoning tasks. Nevertheless, most existing research focuses on employing formal language to guide LLMs for deriving reliable reasoning paths, with systematic evaluations of these capabilities still being limited. In this paper, we aim to conduct a comprehensive evaluation of LLMs across various logical reasoning problems utilizing formal languages. From the perspective of three dimensions, i.e., spectrum of LLMs, taxonomy of tasks, and format of trajectories, our key findings are: 1) Thinking models significantly outperform Instruct models, especially when formal language is employed; 2) All LLMs exhibit limitations in inductive reasoning capability, irrespective of whether they use a formal language; 3) Data with PoT format achieves the best generalization performance across other languages. Additionally, we also curate the formal-relative training data to further enhance the small language models, and the experimental results indicate that a simple rejected fine-tuning method can better enable LLMs to generalize across formal languages and achieve the best overall performance.
pdf
bib
abs
Fair or Framed? Political Bias in News Articles Generated by LLMs
Junho Yoo
|
Youhyun Shin
Despite biases in Large Language Models (LLMs) being widely researched, systematic explorations of political biases in news article generation tasks remain underexplored. This study evaluates political bias across seven LLMs by leveraging our PublicViews dataset, extracted from the TwinViews-13K corpus and comprising 31 topics and 31,692 statements. We analyze 10,850 articles, finding that left-leaning bias persists in generation tasks, with neutral content remaining rare even under balanced opinion settings. Models exhibit asymmetric behavior in minority opinion scenarios, amplifying preferred viewpoints when in minority while conforming to majority opinions otherwise. Notably, all models employ ‘stance-flipping quotations’ (altering supporters’ statements to express opposite viewpoints) in 33-38% of quotations despite explicit instructions against distortion. Consistent with prior research, increased model size failed to enhance neutrality. This research measures political bias in LLM-generated news, analyzes its mechanisms, and reveals how opinion distribution and explicitness affect bias expression. Our results highlight how LLMs can introduce unintended political bias in generative contexts. We publicly release our PublicViews corpus and code at https://anonymous.4open.science/r/Fair-or-Framed-46F1.
pdf
bib
abs
ReviewRL: Towards Automated Scientific Review with RL
Sihang Zeng
|
Kai Tian
|
Kaiyan Zhang
|
Yuru Wang
|
Junqi Gao
|
Runze Liu
|
Sa Yang
|
Jingxuan Li
|
Xinwei Long
|
Jiaheng Ma
|
Biqing Qi
|
Bowen Zhou
Peer review is essential for scientific progress but faces growing challenges due to increasing submission volumes and reviewer fatigue. Existing automated review approaches struggle with factual accuracy, rating consistency, and analytical depth, often generating superficial or generic feedback lacking the insights characteristic of high-quality human reviews. We introduce ReviewRL, a reinforcement learning framework for generating comprehensive and factually grounded scientific paper reviews. Our approach combines: (1) an ArXiv-MCP retrieval-augmented context generation pipeline that incorporates relevant scientific literature, (2) supervised fine-tuning that establishes foundational reviewing capabilities, and (3) a reinforcement learning procedure with a composite reward function that jointly enhances review quality and rating accuracy. Experiments on ICLR 2025 papers demonstrate that ReviewRL significantly outperforms existing methods across both rule-based metrics and model-based quality assessments. ReviewRL establishes a foundational framework for RL-driven automatic critique generation in scientific discovery, demonstrating promising potential for future development in this domain. The implementation of ReviewRL will be released at GitHub.
pdf
bib
abs
Grammar Pruning: Enabling Low-Latency Zero-Shot Task-Oriented Language Models for Edge AI
Octavian Alexandru Trifan
|
Jason Lee Weber
|
Marc Titus Trifan
|
Alexandru Nicolau
|
Alexander Veidenbaum
Edge deployment of task-oriented semantic parsers demands high accuracy under tight latency and memory budgets. We present Grammar Pruning, a lightweight zero-shot framework that begins with a user-defined schema of API calls and couples a rule-based entity extractor with an iterative grammar-constrained decoder: extracted items dynamically prune the context-free grammar, limiting generation to only those intents, slots, and values that remain plausible at each step. This aggressive search-space reduction both reduces hallucinations and slashes decoding time. On the adapted FoodOrdering, APIMixSNIPS, and APIMixATIS benchmarks, Grammar Pruning with small language models achieves an average execution accuracy of over 90%—rivaling state-of-the-art cloud-based solutions—while sustaining at least 2x lower end-to-end latency than existing methods. By requiring nothing beyond the domain’s full API schema values yet delivering precise, real-time natural-language understanding, Grammar Pruning positions itself as a practical building block for future edge-AI applications that cannot rely on large models or cloud offloading.
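The pruning idea can be sketched with a tiny intent/slot grammar: a rule-based extractor finds candidate values in the utterance, and the grammar is pruned so that a constrained decoder can only emit intents and slot values that remain plausible. The toy schema and extractor below are illustrative assumptions, not the paper's grammar or benchmark schemas.

```python
# Toy API schema: intent -> required slots, slot -> allowed values (illustrative only).
SCHEMA = {
    "order_food": {"item": ["pizza", "burger", "salad"], "size": ["small", "large"]},
    "book_flight": {"origin": ["sfo", "jfk"], "destination": ["sfo", "jfk"]},
}

def extract_entities(utterance):
    """Rule-based extractor: keep only schema values mentioned in the utterance."""
    found = {}
    for intent, slots in SCHEMA.items():
        for slot, values in slots.items():
            hits = [v for v in values if v in utterance.lower()]
            if hits:
                found.setdefault(slot, set()).update(hits)
    return found

def prune_grammar(schema, entities):
    """Drop intents with no supported slot values, and unsupported values per slot."""
    pruned = {}
    for intent, slots in schema.items():
        kept = {s: [v for v in vals if v in entities.get(s, set())]
                for s, vals in slots.items()}
        if any(kept.values()):
            pruned[intent] = {s: vs for s, vs in kept.items() if vs}
    return pruned

utterance = "Could I get a large pizza please?"
print(prune_grammar(SCHEMA, extract_entities(utterance)))
# {'order_food': {'item': ['pizza'], 'size': ['large']}}  -> the constrained decoder
#   can now only generate order_food(item=pizza, size=large)-style calls.
```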
pdf
bib
abs
Calibrating LLMs for Text-to-SQL Parsing by Leveraging Sub-clause Frequencies
Terrance Liu
|
Shuyi Wang
|
Daniel Preotiuc-Pietro
|
Yash Chandarana
|
Chirag Gupta
While large language models (LLMs) achieve strong performance on text-to-SQL parsing, they sometimes exhibit unexpected failures in which they are confidently incorrect. Building trustworthy text-to-SQL systems thus requires eliciting reliable uncertainty measures from the LLM. In this paper, we study the problem of providing a calibrated confidence score that conveys the likelihood of an output query being correct. Our work is the first to establish a benchmark for post-hoc calibration of LLM-based text-to-SQL parsing. In particular, we show that Platt scaling, a canonical method for calibration, provides substantial improvements over directly using raw model output probabilities as confidence scores. Furthermore, we propose a method for text-to-SQL calibration that leverages the structured nature of SQL queries to provide more granular signals of correctness, named “sub-clause frequency” (SCF) scores. Using multivariate Platt scaling (MPS), our extension of the canonical Platt scaling technique, we combine individual SCF scores into an overall accurate and calibrated score. Empirical evaluation on two popular text-to-SQL datasets shows that our approach of combining MPS and SCF yields further improvements in calibration and the related task of error detection over traditional Platt scaling.
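Both calibration steps can be sketched with logistic regression: standard Platt scaling fits a one-dimensional logistic model on the raw confidence, while a multivariate extension fits it jointly on several sub-clause scores. The data below is synthetic, and the way SCF scores would be extracted from real SQL outputs is not shown; only the fitting procedure is illustrated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
# Synthetic sub-clause scores (e.g. SELECT / WHERE / GROUP BY agreement rates)
# plus a raw sequence probability; correctness loosely follows them.
scf = rng.uniform(0, 1, size=(n, 3))
raw_prob = rng.uniform(0, 1, size=(n, 1))
logit = 3 * scf[:, 0] + 2 * scf[:, 1] + scf[:, 2] + raw_prob[:, 0] - 3.5
correct = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

# Standard Platt scaling: calibrate on the raw probability alone.
platt = LogisticRegression().fit(raw_prob, correct)
# Multivariate Platt scaling: combine raw probability and sub-clause scores.
mps = LogisticRegression().fit(np.hstack([raw_prob, scf]), correct)

query_features = np.array([[0.8, 0.9, 0.7, 0.4]])    # [raw_prob, scf_1, scf_2, scf_3]
print("Platt-scaled confidence:", round(platt.predict_proba(query_features[:, :1])[0, 1], 3))
print("MPS confidence:         ", round(mps.predict_proba(query_features)[0, 1], 3))
```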
pdf
bib
abs
REACT: Representation Extraction And Controllable Tuning to Overcome Overfitting in LLM Knowledge Editing
Haitian Zhong
|
Yuhuan Liu
|
Ziyang Xu
|
Guofan Liu
|
Qiang Liu
|
Shu Wu
|
Zhe Zhao
|
Liang Wang
|
Tieniu Tan
Large language model editing methods frequently suffer from overfitting, wherein factual updates can propagate beyond their intended scope, overemphasizing the edited target even when it’s contextually inappropriate. To address this challenge, we introduce REACT (Representation Extraction And Controllable Tuning), a unified two-phase framework designed for precise and controllable knowledge editing. In the initial phase, we utilize tailored stimuli to extract latent factual representations and apply Principal Component Analysis with a simple learnable linear transformation to compute a directional “belief shift” vector for each instance. In the second phase, we apply controllable perturbations to hidden states using the obtained vector with a magnitude scalar, gated by a pre-trained classifier that permits edits only when contextually necessary. Relevant experiments on EVOKE benchmarks demonstrate that REACT significantly reduces overfitting across nearly all evaluation metrics, and experiments on COUNTERFACT and MQuAKE show that our method preserves balanced basic editing performance (reliability, locality, and generality) under diverse editing scenarios.
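A small sketch of the two phases, assuming hidden states collected under tailored stimuli: PCA yields a per-instance "belief shift" direction, and at edit time a scaled perturbation along that direction is added to the hidden state only when a gating classifier fires. Everything below is synthetic stand-in data, and the learnable linear transformation from the paper is omitted.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 64, 200
# Phase 1: hidden states elicited with stimuli before/after internalizing the new fact.
h_before = rng.normal(size=(n, d))
shift_true = rng.normal(size=d); shift_true /= np.linalg.norm(shift_true)
h_after = h_before + 1.5 * shift_true + rng.normal(scale=0.1, size=(n, d))

pca = PCA(n_components=1).fit(h_after - h_before)
belief_shift = pca.components_[0]                      # directional "belief shift" vector

# Gating classifier: should this hidden state be perturbed at all?
X = np.vstack([h_before, h_after])
y = np.array([1] * n + [0] * n)                        # 1 = edit needed, 0 = already edited
gate = LogisticRegression(max_iter=1000).fit(X, y)

# Phase 2: controllable edit of a new hidden state with a magnitude scalar.
h_new = rng.normal(size=d)
magnitude = 1.5
if gate.predict(h_new[None])[0] == 1:                  # apply only when contextually needed
    h_new = h_new + magnitude * belief_shift

# Sign of a PCA component is arbitrary, so compare via absolute cosine.
print("|cosine(recovered shift, true shift)| =", round(abs(float(belief_shift @ shift_true)), 3))
```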
pdf
bib
abs
ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models
Chung-En Sun
|
Ge Yan
|
Tsui-Wei Weng
Recent studies have shown that Large Language Models (LLMs) augmented with chain-of-thought (CoT) reasoning demonstrate impressive problem-solving abilities. However, in this work, we identify a recurring issue where these models occasionally generate overly short reasoning, leading to degraded performance on even simple mathematical problems. Specifically, we investigate how reasoning length is embedded in the hidden representations of reasoning models and its impact on accuracy. Our analysis reveals that reasoning length is governed by a linear direction in the representation space, allowing us to induce overly short reasoning by steering the model along this direction. Building on this insight, we introduce ThinkEdit, a simple yet effective weight-editing approach to mitigate the issue of overly short reasoning. We first identify a small subset of attention heads (approximately 4%) that predominantly drive short reasoning behavior. We then edit the output projection weights of these heads to remove the short reasoning direction. With changes to only 0.2% of the model’s parameters, ThinkEdit effectively reduces overly short reasoning and yields notable accuracy gains for short reasoning outputs (+6.39%), along with an overall improvement across multiple math benchmarks (+3.34%). Our findings provide new mechanistic insights into how reasoning length is controlled within LLMs and highlight the potential of fine-grained model interventions to improve reasoning quality.
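The weight edit can be sketched as removing a direction from the output projection of a targeted attention head: project the "short reasoning" direction out of the rows of that head's slice of W_O. The matrices below are random stand-ins, and the head-selection and direction-extraction steps are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
head_dim, hidden = 64, 512

# "Short reasoning" direction in the residual-stream (hidden) space, assumed given.
short_dir = rng.normal(size=hidden)
short_dir /= np.linalg.norm(short_dir)

def remove_direction(W_o, direction):
    """Project `direction` out of a head's output projection slice (head_dim x hidden)."""
    return W_o - np.outer(W_o @ direction, direction)

W_o = rng.normal(size=(head_dim, hidden)) * 0.02       # one targeted head's W_O slice
W_o_edited = remove_direction(W_o, short_dir)

# After editing, the head can no longer write along the short-reasoning direction.
print("max |component along direction| before:", float(np.abs(W_o @ short_dir).max()))
print("max |component along direction| after: ", float(np.abs(W_o_edited @ short_dir).max()))
```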
pdf
bib
abs
Incorporating Diverse Perspectives in Cultural Alignment: Survey of Evaluation Benchmarks Through A Three-Dimensional Framework
Meng-Chen Wu
|
Si-Chi Chin
|
Tess Wood
|
Ayush Goyal
|
Narayanan Sadagopan
Large Language Models (LLMs) increasingly serve diverse global audiences, making it critical for responsible AI deployment across cultures. While recent works have proposed various approaches to enhance cultural alignment in LLMs, a systematic analysis of their evaluation benchmarks remains needed. We propose a novel framework that conceptualizes alignment along three dimensions: Cultural Group (who to align with), Cultural Elements (what to align), and Awareness Scope (how to align: majority-focused vs. diversity-aware). Through this framework, we analyze 105 cultural alignment evaluation benchmarks, revealing significant imbalances: Region (37.9%) and Language (28.9%) dominate Cultural Group representation; Social and Political Relations (25.1%) and Speech and Language (20.9%) concentrate Cultural Elements coverage; and an overwhelming majority (97.1%) of datasets adopt majority-focused Awareness Scope approaches. In a case study examining AI safety evaluation across nine Asian countries (Section 5), we demonstrate how our framework reveals critical gaps between existing benchmarks and real-world cultural biases identified in the study, providing actionable guidance for developing more comprehensive evaluation resources tailored to specific deployment contexts.
pdf
bib
abs
Are Large Language Models Chronically Online Surfers? A Dataset for Chinese Internet Meme Explanation
Yubo Xie
|
Chenkai Wang
|
Zongyang Ma
|
Fahui Miao
Large language models (LLMs) are trained on vast amounts of text from the Internet, but do they truly understand the viral content that rapidly spreads online—commonly known as memes? In this paper, we introduce CHIME, a dataset for CHinese Internet Meme Explanation. The dataset comprises popular phrase-based memes from the Chinese Internet, annotated with detailed information on their meaning, origin, example sentences, types, etc. To evaluate whether LLMs understand these memes, we designed two tasks. In the first task, we assessed the models’ ability to explain a given meme, identify its origin, and generate appropriate example sentences. The results show that while LLMs can explain the meanings of some memes, their performance declines significantly for culturally and linguistically nuanced meme types. Additionally, they consistently struggle to provide accurate origins for the memes. In the second task, we created a set of multiple-choice questions (MCQs) requiring LLMs to select the most appropriate meme to fill in a blank within a contextual sentence. While the evaluated models were able to provide correct answers, their performance remains noticeably below human levels. We have made CHIME public and hope it will facilitate future research on computational meme understanding.
pdf
bib
abs
RoDEval: A Robust Word Sense Disambiguation Evaluation Framework for Large Language Models
Luyang Zhang
|
Shuaimin Li
|
Yishuo Li
|
Kunpeng Kang
|
Kaiyuan Zhang
|
Cong Wang
|
Wenpeng Lu
Accurately evaluating the word sense disambiguation (WSD) capabilities of large language models (LLMs) remains challenging, as existing studies primarily rely on single-task evaluations and classification-based metrics that overlook the fundamental differences between generative LLMs and traditional classification models. To bridge this gap, we propose RoDEval, the first comprehensive evaluation framework specifically tailored for assessing LLM-based WSD methods. RoDEval introduces four novel metrics: Disambiguation Scope, Disambiguation Robustness, Disambiguation Reliability, and Definition Generation Quality Score, enabling a multifaceted evaluation of LLMs’ WSD capabilities. Experimental results using RoDEval across five mainstream LLMs uncover significant limitations in their WSD performance. Specifically, incorrect definition selections in multiple-choice WSD tasks stem not from simple neglect or forgetting of the correct options, but rather from incomplete acquisition of all senses of polysemous words. Moreover, disambiguation reliability is often compromised by the models’ persistent overconfidence. In addition, inherent biases continue to affect performance, and scaling up model parameters alone fails to meaningfully enhance their ability to generate accurate sense definitions. These findings provide actionable insights for enhancing LLMs’ WSD capabilities. The source code and evaluation scripts are open-sourced at https://github.com/DayDream405/RoDEval.
pdf
bib
abs
PsychoAgent: Psychology-driven LLM Agents for Explainable Panic Prediction on Social Media during Sudden Disaster Events
Mengzhu Liu
|
Zhengqiu Zhu
|
Chuan Ai
|
Chen Gao
|
Xinghong Li
|
Lingnan He
|
Kaisheng Lai
|
Yingfeng Chen
|
Xin Lu
|
Yong Li
|
Quanjun Yin
Accurately predicting public panic sentiment on social media is crucial for proactive governance and crisis management. Current efforts on this problem face three main challenges: lack of finely annotated data hinders emotion prediction studies, unmodeled risk perception causes prediction inaccuracies, and insufficient interpretability of panic formation mechanisms limits mechanistic insight. We address these issues by proposing a Psychology-driven generative Agent framework (PsychoAgent) for explainable panic prediction based on emotion arousal theory. Specifically, we first construct a fine-grained panic emotion dataset (namely COPE) via human-AI (Large Language Models, LLMs) collaboration, combining scalable LLM-based labeling with human annotators to ensure accuracy for panic emotion and to mitigate biases from linguistic variations. Then, we construct PsychoAgent integrating cross-domain heterogeneous data grounded in psychological mechanisms to model risk perception and cognitive differences in emotion generation. To enhance interpretability, we design an LLM-based role-playing agent that simulates individual psychological chains through carefully designed prompts. Experimental results on our annotated dataset show that PsychoAgent improves panic emotion prediction performance by 13% to 21% compared to baseline models. Furthermore, the explainability and generalization of our approach are validated. Crucially, this represents a paradigm shift from opaque “data-driven fitting” to transparent “role-based simulation with mechanistic interpretation” for panic emotion prediction during emergencies. Our implementation is publicly available at: https://github.com/supersonic0919/PsychoAgent.
pdf
bib
abs
Stepwise Reasoning Checkpoint Analysis: A Test Time Scaling Method to Enhance LLMs’ Reasoning
Zezhong Wang
|
Xingshan Zeng
|
Weiwen Liu
|
Yufei Wang
|
Liangyou Li
|
Yasheng Wang
|
Lifeng Shang
|
Xin Jiang
|
Qun Liu
|
Kam-Fai Wong
Mathematical reasoning through Chain-of-Thought (CoT) has emerged as a powerful capability of Large Language Models (LLMs), which can be further enhanced through Test-Time Scaling (TTS) methods like Beam Search and DVTS. However, these methods, despite improving accuracy by allocating more computational resources during inference, often suffer from path homogenization and inefficient use of intermediate results. To address these limitations, we propose Stepwise Reasoning Checkpoint Analysis (SRCA), a framework that introduces checkpoints between reasoning steps. It incorporates two key strategies: (1) Answer-Clustered Search, which groups reasoning paths by their intermediate checkpoint answers to maintain diversity while ensuring quality, and (2) Checkpoint Candidate Augmentation, which leverages all intermediate answers for final decision-making. Our approach effectively reduces path homogenization and creates a fault-tolerant mechanism by utilizing high-quality intermediate results. Experimental results show that SRCA improves reasoning accuracy compared to existing TTS methods across various mathematical datasets.
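To make the Answer-Clustered Search step concrete, the sketch below groups candidate reasoning paths by their intermediate checkpoint answers, keeps the best-scoring path from each cluster, and then fills the remaining beam slots with the globally best leftovers. It is an illustrative reconstruction only; the `extract_answer` helper and the scoring scheme are hypothetical stand-ins, not the authors' implementation.

```python
from collections import defaultdict

def answer_clustered_select(paths, scores, extract_answer, beam_width):
    """Group candidate reasoning paths by their intermediate checkpoint answer,
    keeping diversity (one path per answer cluster) before quality (best scores).
    `extract_answer` is a hypothetical helper that parses the intermediate
    answer out of a partial reasoning path."""
    clusters = defaultdict(list)
    for path, score in zip(paths, scores):
        clusters[extract_answer(path)].append((score, path))

    # Keep the top-scoring path of every answer cluster first (diversity), ...
    survivors = [max(group) for group in clusters.values()]
    survivors.sort(reverse=True)
    kept = survivors[:beam_width]

    # ... then fill any remaining beam slots with the best leftovers (quality).
    leftovers = sorted(
        (item for group in clusters.values() for item in group if item not in kept),
        reverse=True,
    )
    kept.extend(leftovers[: beam_width - len(kept)])
    return [path for _, path in kept]
```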
pdf
bib
abs
Inter-sentence Context Modeling and Structure-aware Representation Enhancement for Conversational Sentiment Quadruple Extraction
Yu Zhang
|
Zhaoman Zhong
|
Huihui Lv
Conversational aspect-based sentiment quadruple analysis (DiaASQ) is a newly emerging task that aims to extract target-aspect-opinion-sentiment quadruples from conversational text. Existing studies struggle to capture complete dialogue semantics, largely due to inadequate inter-utterance modeling and the underutilization of dialogue structure. To address these issues, we propose an Inter-sentence Context Modeling and Structure-aware Representation Enhancement model (ICMSR) to extract dialogue aspect sentiment quadruples. We design the Dialog Inter-sentence Contextual Enhancer (DICE) module after the sentence-by-sentence encoding phase to enhance inter-sentence interactions and mitigate contextual fragmentation caused by traditional sequential encoding. Moreover, to fully exploit structural information within dialogues, we propose the Dialog Feature Amplifier (DFA), which consists of two submodules: STREAM and SMM. The STREAM module integrates diverse structural dialogue information to generate structure-aware sentence representations, effectively improving the modeling of intra-dialogue structural relations. Furthermore, the Structural Multi-scale Mechanism (SMM) employs a multi-scale modeling approach, simulating varying extents of contextual awareness, thereby enhancing the model’s ability to capture cross-sentence structural dependencies. We extensively evaluate our method on benchmark datasets, and the empirical results consistently confirm its effectiveness.
pdf
bib
abs
Igniting Creative Writing in Small Language Models: LLM-as-a-Judge versus Multi-Agent Refined Rewards
Xiaolong Wei
|
Bo Lu
|
Xingyu Zhang
|
Zhejun Zhao
|
Dongdong Shen
|
Long Xia
|
Dawei Yin
Large Language Models (LLMs) have demonstrated remarkable creative writing capabilities, yet their substantial computational demands hinder widespread use. Enhancing Small Language Models (SLMs) offers a promising alternative, but current methods like Supervised Fine-Tuning (SFT) struggle with novelty, and Reinforcement Learning from Human Feedback (RLHF) is costly. This paper explores two distinct AI-driven reward strategies within a Reinforcement Learning from AI Feedback (RLAIF) framework to ignite the creative writing of a 7B-parameter SLM, specifically for generating Chinese greetings. The first strategy employs a Reward Model (RM) trained on high-quality preference data curated by a novel multi-agent rejection sampling framework designed for creative tasks. The second, more novel, strategy utilizes a principle-guided LLM-as-a-Judge, whose reward function is optimized via an adversarial training scheme with a reflection mechanism, to directly provide reward signals. Comprehensive experiments reveal that while both approaches significantly enhance creative output over baselines, the principle-guided LLM-as-a-Judge demonstrably yields superior generation quality. Furthermore, it offers notable advantages in training efficiency and reduced dependency on human-annotated data, presenting a more scalable and effective path towards creative SLMs. Our automated evaluation methods also exhibit strong alignment with human judgments.
pdf
bib
abs
Governance in Motion: Co-evolution of Constitutions and AI models for Scalable Safety
Chenhao Huang
|
Ziyu Shen
|
Yicong Ren
|
Huiyuan Zheng
|
Jiazheng Zhang
|
Mingxu Chai
|
Ming Zhang
|
Shihan Dou
|
Fan Mo
|
Jie Shi
|
Tao Gui
|
Qi Zhang
|
Xuanjing Huang
Aligning large language models (LLMs) with human preferences is a central challenge for building reliable AI systems. Most existing alignment approaches rely on static signals, such as predefined principles or offline human annotations to guide model behavior toward a fixed approximation of human preferences. However, LLMs can exhibit distributional drift during training, and static alignment mechanisms lack the capacity to adaptively correct misaligned behaviors as they emerge. To address this limitation, we develop a two-stage framework that enables dynamic and continuous alignment. In the first stage, a constitution is continually revised based on observed model behaviors, and models are trained to comply with these evolving principles. In the second stage, this learned constitution is used to guide reinforcement learning, encouraging the model to align with the updated normative signals. We refer to this framework as COCOA: Co-evolution of Constitutions and AI Models. We show that COCOA enables a 7B model to greatly improve safety—raising StrongReject score from 0.741 to 0.935 and Safe-RLHF accuracy from 77.76% to 90.64% without human annotations, reaching performance close to much larger state-of-the-art models.
pdf
bib
abs
Web Intellectual Property at Risk: Preventing Unauthorized Real-Time Retrieval by Large Language Models
Yisheng Zhong
|
Yizhu Wen
|
Junfeng Guo
|
Mehran Kafai
|
Heng Huang
|
Hanqing Guo
|
Zhuangdi Zhu
The protection of cyber Intellectual Property (IP) such as web content is an increasingly critical concern. The rise of large language models (LLMs) with online retrieval capabilities enables convenient access to information but often undermines the rights of original content creators. As users increasingly rely on LLM-generated responses, they gradually diminish direct engagement with original information sources, which will significantly reduce the incentives for IP creators to contribute and lead to a cyberspace increasingly saturated with AI-generated content. In response, we propose a novel defense framework that empowers web content creators to safeguard their web-based IP from unauthorized LLM real-time extraction and redistribution by leveraging the semantic understanding capability of LLMs themselves. Our method follows principled motivations and effectively addresses an intractable black-box optimization problem. Real-world experiments demonstrate that our methods improve defense success rates from 2.5% to 88.6% on different LLMs, outperforming traditional defenses such as configuration-based restrictions.
pdf
bib
abs
SciEvent: Benchmarking Multi-domain Scientific Event Extraction
Bofu Dong
|
Pritesh Shah
|
Sumedh Sonawane
|
Tiyasha Banerjee
|
Erin Brady
|
Xinya Du
|
Ming Jiang
Scientific information extraction (SciIE) has primarily relied on entity-relation extraction in narrow domains, limiting its applicability to interdisciplinary research and struggling to capture the necessary context of scientific information, often resulting in fragmented or conflicting statements. In this paper, we introduce SciEvent, a novel multi-domain benchmark of scientific abstracts annotated via a unified event extraction (EE) schema designed to enable structured and context-aware understanding of scientific content. It includes 500 abstracts across five research domains, with manual annotations of event segments, triggers, and fine-grained arguments. We define SciIE as a multi-stage EE pipeline: (1) segmenting abstracts into core scientific activities—Background, Method, Result, and Conclusion; and (2) extracting the corresponding triggers and arguments. Experiments with fine-tuned EE models, large language models (LLMs), and human annotators reveal a performance gap, with current models struggling in domains such as sociology and humanities. SciEvent serves as a challenging benchmark and a step toward generalizable, multi-domain SciIE.
pdf
bib
abs
Media Source Matters More Than Content: Unveiling Political Bias in LLM-Generated Citations
Sunhao Dai
|
Zhanshuo Cao
|
Wenjie Wang
|
Liang Pang
|
Jun Xu
|
See-Kiong Ng
|
Tat-Seng Chua
Unlike traditional search engines that present ranked lists of webpages, generative search engines rely solely on in-line citations as the key gateway to original real-world webpages, making it crucial to examine whether LLM-generated citations have biases—particularly for politically sensitive queries. To investigate this, we first construct AllSides-2024, a new dataset comprising the latest real-world news articles (Jan. 2024 - Dec. 2024) labeled with left- or right-leaning stances. Through systematic evaluations, we find that LLMs exhibit a consistent tendency to cite left-leaning sources at notably higher rates compared to traditional retrieval systems (e.g., BM25 and dense retrievers). Controlled experiments further reveal that this bias arises from a preference for media outlets identified as left-leaning, rather than for left-oriented content itself. Meanwhile, our findings show that while LLMs struggle to infer political bias from news content alone, they can almost perfectly recognize the political orientation of media outlets based on their names. These insights highlight the risk that, in the era of generative search engines, information exposure may be disproportionately shaped by specific media outlets, potentially shaping public perception and decision-making.
pdf
bib
abs
RJE: A Retrieval-Judgment-Exploration Framework for Efficient Knowledge Graph Question Answering with LLMs
Can Lin
|
Zhengwang Jiang
|
Ling Zheng
|
Qi Zhao
|
Yuhang Zhang
|
Qi Song
|
Wangqiu Zhou
Knowledge graph question answering (KGQA) aims to answer natural language questions using knowledge graphs. Recent research leverages large language models (LLMs) to enhance KGQA reasoning, but faces limitations: retrieval-based methods are constrained by the quality of retrieved information, while agent-based methods rely heavily on proprietary LLMs. To address these limitations, we propose Retrieval-Judgment-Exploration (RJE), a framework that retrieves refined reasoning paths, evaluates their sufficiency, and conditionally explores additional evidence. Moreover, RJE introduces specialized auxiliary modules enabling small-sized LLMs to perform effectively: Reasoning Path Ranking, Question Decomposition, and Retriever-assisted Exploration. Experiments show that our approach with proprietary LLMs (such as GPT-4o-mini) outperforms existing baselines while enabling small open-source LLMs (such as 3B and 8B parameters) to achieve competitive results without fine-tuning LLMs. Additionally, RJE substantially reduces the number of LLM calls and token usage compared to agent-based methods, yielding significant efficiency improvements.
pdf
bib
abs
Bias Mitigation or Cultural Commonsense? Evaluating LLMs with a Japanese Dataset
Taisei Yamamoto
|
Ryoma Kumon
|
Danushka Bollegala
|
Hitomi Yanaka
Large language models (LLMs) exhibit social biases, prompting the development of various debiasing methods. However, debiasing methods may degrade the capabilities of LLMs. Previous research has evaluated the impact of bias mitigation primarily through tasks measuring general language understanding, which are often unrelated to social biases. In contrast, cultural commonsense is closely related to social biases, as both are rooted in social norms and values. The impact of bias mitigation on cultural commonsense in LLMs has not been well investigated. Considering this gap, we propose SOBACO (SOcial BiAs and Cultural cOmmonsense benchmark), a Japanese benchmark designed to evaluate social biases and cultural commonsense in LLMs in a unified format. We evaluate several LLMs on SOBACO to examine how debiasing methods affect cultural commonsense in LLMs. Our results reveal that the debiasing methods degrade the performance of the LLMs on the cultural commonsense task (up to 75% accuracy deterioration). These results highlight the importance of developing debiasing methods that consider the trade-off with cultural commonsense to improve fairness and utility of LLMs.
pdf
bib
abs
Chameleon LLMs: User Personas Influence Chatbot Personality Shifts
Jane Xing
|
Tianyi Niu
|
Shashank Srivastava
As large language models (LLMs) integrate into society, their ability to adapt to users is as critical as their accuracy. While prior work has used personality tests to examine the perceived personalities of LLMs, little research has explored whether LLMs adapt their perceived personalities in response to user interactions. We investigate whether and how LLMs exhibit conversational adaptations over prolonged interactions. Using controlled simulations in which a user and a chatbot engage in dialogue, we measure the chatbot’s personality shift before and after the conversation. Across multiple models, we find that traits such as Agreeableness, Extraversion, and Conscientiousness are highly susceptible to user influence, whereas Emotional Stability and Intellect remain relatively more stable. Our results suggest that LLMs dynamically adjust their conversational style in response to user personas, raising important implications for AI alignment, trust, and safety.
pdf
bib
abs
GuessingGame: Measuring the Informativeness of Open-Ended Questions in Large Language Models
Dylan Hutson
|
Daniel Vennemeyer
|
Aneesh Deshmukh
|
Justin Zhan
|
Tianyu Jiang
We introduce GuessingGame, a protocol for evaluating large language models (LLMs) as strategic question-askers in open-ended, open-domain settings. A Guesser LLM identifies a hidden object by posing free-form questions to an Oracle—without predefined choices or candidate lists. To measure question quality, we propose two information gain (IG) metrics: a Bayesian method that tracks belief updates over semantic concepts using LLM-scored relevance, and an entropy-based method that filters candidates via ConceptNet. Both metrics are model-agnostic and support post hoc analysis. Across 858 games with multiple models and prompting strategies, higher IG strongly predicts efficiency: a one-standard-deviation IG increase reduces expected game length by 43%. Prompting constraints guided by IG—such as enforcing question diversity—enable weaker models to match GPT-4o. These results show that question-asking in LLMs is both measurable and improvable, and crucial for interactive reasoning.
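As an illustration of the entropy-based information-gain metric, the snippet below treats a question's gain as the drop in Shannon entropy over a uniform belief on the remaining candidate set once the Oracle's answer filters it. The candidate list and `answer_filter` callback are toy stand-ins rather than the paper's ConceptNet-based pipeline.

```python
import math

def entropy(n_candidates):
    """Shannon entropy (in bits) of a uniform belief over n remaining candidates."""
    return math.log2(n_candidates) if n_candidates > 0 else 0.0

def information_gain(candidates, answer_filter):
    """Entropy reduction from asking a question whose answer keeps only the
    candidates passing `answer_filter` (a stand-in for the Oracle's reply)."""
    before = entropy(len(candidates))
    remaining = [c for c in candidates if answer_filter(c)]
    return before - entropy(len(remaining))

# Toy example: "Is it an animal?" on a 4-item candidate set.
candidates = ["dog", "cat", "car", "violin"]
gain = information_gain(candidates, lambda c: c in {"dog", "cat"})
print(f"IG = {gain:.2f} bits")  # halving the candidate set yields 1 bit
```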
pdf
bib
abs
SynC-LLM: Generation of Large-Scale Synthetic Circuit Code with Hierarchical Language Models
Shang Liu
|
Yao Lu
|
Wenji Fang
|
Jing Wang
|
Zhiyao Xie
In recent years, AI-assisted integrated circuit (IC) design methods have shown great potential in boosting IC design efficiency. However, this emerging technique is fundamentally limited by the serious scarcity of publicly accessible large-scale circuit design data, which are mostly private IPs owned by semiconductor companies. In this work, we propose SynC-LLM, the first technique that exploits an LLM’s ability to generate new large-scale synthetic digital circuits. In our hierarchical circuit generation process, we first design a directed graph diffusion model to learn and generate the skeleton of large circuits with sequential cells. Then we propose a cone function retrieval technique to annotate each sequential node in the skeleton with a function description. Finally, we apply a level-by-level customized prompting technique utilizing an LLM to complete the code at every skeleton cone. Experiments show that our generated circuits are not only valid and fully functional, but also closely resemble realistic large-scale designs and can significantly improve AI models’ performance in multiple IC design tasks. The code and data are open-sourced at https://github.com/hkust-zhiyao/SynCircuitData.
pdf
bib
abs
Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors
Zhiyu Yang
|
Shuo Wang
|
Yukun Yan
|
Yang Deng
LLMs are transforming software development, yet current code generation and code repair benchmarks mainly assess syntactic and functional correctness in simple, single-error cases. LLMs’ capabilities to autonomously find and fix runtime logical errors in complex data science code remain largely unexplored. To address this gap, we introduce DSDBench: the Data Science Debugging Benchmark, the first benchmark for systematic evaluation of LLMs on multi-hop error tracing and multi-bug detection in data science code debugging. DSDBench adapts datasets from existing data science task benchmarks, such as DABench and MatPlotBench, featuring realistic data science debugging tasks with automatically synthesized multi-hop, multi-bug code snippets. DSDBench includes 1,117 annotated samples with 741 cause-effect error pairs and runtime error messages. Evaluations of state-of-the-art LLMs on DSDBench show significant performance gaps, highlighting challenges in debugging logical runtime errors in data science code. DSDBench offers a crucial resource to evaluate and improve LLMs’ debugging and reasoning capabilities, enabling more reliable AI-assisted data science in the future.
pdf
bib
abs
Dovetail: A CPU/GPU Heterogeneous Speculative Decoding for LLM inference
Libo Zhang
|
Zhaoning Zhang
|
Xubaizhou
|
Rui Li
|
Zhiliang Tian
|
Songzhu Mei
|
Dongsheng Li
With the continuous advancement in the performance of large language models (LLMs), their demand for computational resources and memory has significantly increased, which poses major challenges for efficient inference on consumer-grade devices and legacy servers. These devices typically feature relatively weaker GPUs and stronger CPUs. Although techniques such as parameter offloading and partial offloading can alleviate GPU memory pressure to some extent, their effectiveness is limited due to communication latency and suboptimal hardware resource utilization. To address this issue, we propose Dovetail—a lossless inference acceleration method that leverages the complementary characteristics of heterogeneous devices and the advantages of speculative decoding. Dovetail deploys a draft model on the GPU to perform preliminary predictions, while a target model running on the CPU validates these outputs. By reducing the granularity of data transfer, Dovetail significantly minimizes communication overhead. To further improve efficiency, we optimize the draft model specifically for heterogeneous hardware environments by reducing the number of draft tokens to lower parallel verification latency, increasing model depth to enhance predictive capabilities, and introducing a Dynamic Gating Fusion (DGF) mechanism to improve the integration of feature and embedding information. We conduct comprehensive evaluations of Dovetail across various consumer-grade GPUs, covering multiple tasks and mainstream models. Experimental results on 13B models demonstrate that Dovetail achieves inference speedups ranging from 1.79× to 10.1× across different devices, while maintaining consistency and stability in the distribution of generated texts.
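For readers unfamiliar with speculative decoding, the sketch below shows the generic draft-then-verify loop in its simplest greedy form: a small draft model proposes a block of tokens and a larger target model accepts the longest agreeing prefix, correcting the first divergence. The `draft_next` and `target_next` callables are toy stand-ins; Dovetail's GPU/CPU placement, draft-token tuning, and DGF fusion are not reproduced here.

```python
def speculative_decode(draft_next, target_next, prompt, k=4, max_new=32):
    """Simplified greedy draft-then-verify loop. The (small, fast) draft model
    proposes k tokens, the (large, slow) target model checks them; the accepted
    prefix is kept and the first mismatch is replaced by the target's own token.
    `draft_next(seq)` and `target_next(seq)` are toy stand-ins that return the
    greedy next token for a token sequence."""
    seq = list(prompt)
    while len(seq) < len(prompt) + max_new:
        # Draft phase (in Dovetail this would run on the GPU).
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # Verify phase (in Dovetail this would run on the CPU).
        accepted = []
        for tok in draft:
            expected = target_next(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # correct the first divergence and stop
                break
        seq.extend(accepted)
    return seq
```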
pdf
bib
abs
V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models
Qidong Wang
|
Junjie Hu
|
Ming Jiang
Recent advances in causal interpretability have extended from language models to vision-language models (VLMs), seeking to reveal their internal mechanisms through input interventions. While textual interventions often target semantics, visual interventions typically rely on coarse pixel-level perturbations, limiting semantic insights on multimodal integration. In this study, we introduce V-SEAM, a novel framework that combines **V**isual **S**emantic **E**diting and **A**ttention **M**odulating for causal interpretation of VLMs. V-SEAM enables concept-level visual manipulations and identifies attention heads with positive or negative contributions to predictions across three semantic levels: objects, attributes, and relationships. We observe that positive heads are often shared within the same semantic level but vary across levels, while negative heads tend to generalize broadly. Finally, we introduce an automatic method to modulate key head embeddings, demonstrating enhanced performance for both LLAVA and InstructBLIP across three diverse VQA benchmarks. Our data and code are released at: https://github.com/petergit1/V-SEAM.
pdf
bib
abs
LORAXBENCH: A Multitask, Multilingual Benchmark Suite for 20 Indonesian Languages
Alham Fikri Aji
|
Trevor Cohn
As one of the world’s most populous countries, with 700 languages spoken, Indonesia is behind in terms of NLP progress. We introduce LORAXBENCH, a benchmark that focuses on low-resource languages of Indonesia and covers 6 diverse tasks: reading comprehension, open-domain QA, language inference, causal reasoning, translation, and cultural QA. Our dataset covers 20 languages, with the addition of two formality registers for three languages. We evaluate a diverse set of multilingual and region-focused LLMs and find that this benchmark is challenging. We note a visible discrepancy between performance in Indonesian and other languages, especially the low-resource ones. There is no clear lead when using a region-specific model as opposed to the general multilingual model. Lastly, we show that a change in register affects model performance, especially with registers not commonly found in social media, such as the high-politeness ‘Krama’ register of Javanese.
pdf
bib
abs
MiCRo: Mixture Modeling and Context-aware Routing for Personalized Preference Learning
Jingyan Shen
|
Jiarui Yao
|
Rui Yang
|
Yifan Sun
|
Feng Luo
|
Rui Pan
|
Tong Zhang
|
Han Zhao
Reward modeling is a key step in building safe foundation models when applying reinforcement learning from human feedback (RLHF) to align Large Language Models (LLMs). However, reward modeling based on the Bradley-Terry (BT) model assumes a global reward function, failing to capture the inherently diverse and heterogeneous human preferences. Hence, such oversimplification limits LLMs from supporting personalization and pluralistic alignment. Theoretically, we show that when human preferences follow a mixture distribution of diverse subgroups, a single BT model has an irreducible error. While existing solutions, such as fine-grained annotations via prompting or structured preference elicitation, help address this issue, they are costly and constrained by predefined attributes, failing to fully capture the richness of human values. In this work, we introduce MiCRo, a two-stage framework that enhances personalized preference learning by leveraging large-scale binary preference datasets without requiring explicit fine-grained annotations. In the first stage, MiCRo employs a mixture of preferences to model diverse human preferences, enabling a flexible representation of diverse value systems. In the second stage, MiCRo integrates an online routing strategy that dynamically adapts mixture weights based on specific context to resolve ambiguity, allowing for efficient and scalable preference adaptation with minimal additional supervision. Experiments on multiple preference datasets demonstrate that MiCRo effectively captures diverse human preferences and significantly improves personalized preference learning on downstream tasks.
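A minimal sketch of the mixture-of-preferences idea, assuming K subgroup reward heads and a context-dependent router: the preference likelihood is a routed mixture of Bradley-Terry win probabilities. This is a generic illustration of the first-stage objective, not MiCRo's exact formulation.

```python
import torch

def mixture_bt_loss(chosen_rewards, rejected_rewards, mixture_logits):
    """Negative log-likelihood of a mixture of Bradley-Terry preference models.
    chosen_rewards, rejected_rewards: (batch, K) rewards from K subgroup heads;
    mixture_logits: (batch, K) context-dependent routing logits."""
    weights = torch.softmax(mixture_logits, dim=-1)                  # (batch, K)
    win_prob = torch.sigmoid(chosen_rewards - rejected_rewards)      # (batch, K)
    mixture_prob = (weights * win_prob).sum(dim=-1).clamp_min(1e-8)  # (batch,)
    return -mixture_prob.log().mean()

# Toy usage with random reward heads and router logits for a batch of 4 pairs.
loss = mixture_bt_loss(torch.randn(4, 3), torch.randn(4, 3), torch.randn(4, 3))
```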
pdf
bib
abs
SAFE: Schema-Driven Approximate Distance Join for Efficient Knowledge Graph Querying
Sangoh Lee
|
Sungho Park
|
Wook-Shin Han
To reduce hallucinations in large language models (LLMs), researchers are increasingly investigating reasoning methods that integrate LLMs with external knowledge graphs (KGs). Existing approaches either map an LLM-generated query graph onto the KG or let the LLM traverse the entire graph; the former is fragile because noisy query graphs derail retrieval, whereas the latter is inefficient due to entity-level reasoning over large graphs. In order to tackle these problems, we propose **SAFE** (**S**chema-Driven **A**pproximate Distance Join **F**or **E**fficient Knowledge Graph Querying), a framework that leverages schema graphs for robust query graph generation and efficient KG retrieval. SAFE introduces two key ideas: (1) an Approximate Distance Join (ADJ) algorithm that refines LLM-generated pseudo query graphs by flexibly aligning them with the KG’s structure; and (2) exploiting a compact schema graph to perform ADJ efficiently, reducing overhead and improving retrieval accuracy. Extensive experiments on WebQSP, CWQ and GrailQA demonstrate that SAFE outperforms state-of-the-art methods in both accuracy and efficiency, providing a robust and scalable solution to overcome the inherent limitations of LLM-based knowledge retrieval.
pdf
bib
abs
Structured Preference Optimization for Vision-Language Long-Horizon Task Planning
Xiwen Liang
|
Min Lin
|
Weiqi Ruan
|
Rongtao Xu
|
Yuecheng Liu
|
Jiaqi Chen
|
Bingqian Lin
|
Yuzheng Zhuang
|
Xiaodan Liang
Existing vision-language planning methods perform well on short-horizon tasks but struggle with long-horizon reasoning in dynamic environments due to the difficulty of training models to generate high-quality reasoning processes. To address this, we propose Structured Preference Optimization (SPO), a framework that enhances reasoning and action selection for long-horizon task planning through structured evaluation and optimized training. SPO introduces: 1) Structured Preference Evaluation and Optimization, which evaluates reasoning chains across task relevance, historical consistency (as part of textual coherence), and image awareness (alignment with visual observations) to construct high-quality preference pairs; and 2) Curriculum-Guided Progressive Learning, enabling the model to adapt from simple to complex tasks, thereby improving generalization and robustness. To advance research in vision-language long-horizon task planning, we introduce ExtendaBench, a comprehensive benchmark covering 1,509 tasks across VirtualHome and Habitat 2.0, categorized into ultra-short, short, medium, and long tasks. Experimental results demonstrate that SPO significantly improves reasoning quality and final decision accuracy, outperforming prior methods on long-horizon tasks and underscoring the effectiveness of preference-driven optimization in vision-language task planning. Specifically, SPO achieves a +5.98% GCR and +4.68% SR improvement in VirtualHome and a +3.30% GCR and +2.11% SR improvement in Habitat over the best-performing baselines.
pdf
bib
abs
Position: LLMs Can be Good Tutors in English Education
Jingheng Ye
|
Shen Wang
|
Deqing Zou
|
Yibo Yan
|
Kun Wang
|
Hai-Tao Zheng
|
Ruitong Liu
|
Zenglin Xu
|
Irwin King
|
Philip S. Yu
|
Qingsong Wen
While recent efforts have begun integrating large language models (LLMs) into English education, they often rely on traditional approaches to learning tasks without fully embracing educational methodologies, thus lacking adaptability to language learning. To address this gap, we argue that **LLMs have the potential to serve as effective tutors in English Education**. Specifically, LLMs can play three critical roles: (1) as data enhancers, improving the creation of learning materials or serving as student simulations; (2) as task predictors, supporting learner assessment or optimizing learning pathways; and (3) as agents, enabling personalized and inclusive education. We encourage interdisciplinary research to explore these roles, fostering innovation while addressing challenges and risks, ultimately advancing English Education through the thoughtful integration of LLMs.
pdf
bib
abs
CLLMate: A Multimodal Benchmark for Weather and Climate Events Forecasting
Haobo Li
|
Zhaowei Wang
|
Jiachen Wang
|
Yueya Wang
|
Alexis Kai Hon Lau
|
Huamin Qu
Forecasting weather and climate events is crucial for taking appropriate measures to mitigate environmental hazards and minimize losses. However, existing environmental forecasting research focuses narrowly on predicting numerical meteorological variables (e.g., temperature), neglecting the translation of these variables into actionable textual narratives of events and their consequences. To bridge this gap, we propose Weather and Climate Event Forecasting (WCEF), a new task that leverages numerical meteorological raster data and textual event data to predict weather and climate events. This task is challenging to accomplish due to difficulties in aligning multimodal data and the lack of supervised datasets. To address these challenges, we present CLLMate, the first multimodal dataset for WCEF, using 26,156 environmental news articles aligned with ERA5 reanalysis data. We systematically benchmark 32 existing models on CLLMate, including closed-source, open-source, and our fine-tuned models. Our experiments reveal the advantages and limitations of existing MLLMs and the value of CLLMate for the training and benchmarking of the WCEF task. The dataset is available at https://github.com/hobolee/CLLMate.
pdf
bib
abs
Extracting and Combining Abilities For Building Multi-lingual Ability-enhanced Large Language Models
Zhipeng Chen
|
Kun Zhou
|
Liang Song
|
Xin Zhao
|
Bingning Wang
|
Weipeng Chen
|
Ji-Rong Wen
Multi-lingual ability transfer has become increasingly important for the broad application of large language models (LLMs). Existing work relies heavily on training with multi-lingual ability-related data, which may not be available for low-resource languages. To address this, we propose a **M**ulti-lingual **A**bilities **E**xtraction and **C**ombination approach, named **MAEC**. Our key idea is to decompose and extract language-agnostic ability-related weights from LLMs, and combine them across different languages by simple addition and subtraction operations without training. Specifically, our MAEC consists of the extraction and combination stages. In the extraction stage, we first locate key neurons that are highly related to specific abilities, and then employ them to extract the transferable ability-related weights. In the combination stage, we further select the ability-related tensors that mitigate the linguistic effects, and design a combining strategy based on them and the language-specific weights, to build the multi-lingual ability-enhanced LLM. To assess the effectiveness of our approach, we conduct extensive experiments on LLaMA-3 8B on mathematical and scientific tasks in both high-resource and low-resource lingual scenarios. Experimental results show that MAEC can effectively and efficiently extract and combine the advanced abilities, achieving **comparable performance with PaLM**. We will publicly release our code and data.
pdf
bib
abs
Evaluating the Effectiveness and Scalability of LLM-Based Data Augmentation for Retrieval
Pranjal A Chitale
|
Bishal Santra
|
Yashoteja Prabhu
|
Amit Sharma
Compact dual-encoder models are widely used for retrieval owing to their efficiency and scalability. However, such models often underperform compared to their Large Language Model (LLM)-based retrieval counterparts, likely due to their limited world knowledge. While LLM-based data augmentation has been proposed as a strategy to bridge this performance gap, there is insufficient understanding of its effectiveness and scalability to real-world retrieval problems. Existing research does not systematically explore key factors such as the optimal augmentation scale, the necessity of using large augmentation models, and whether diverse augmentations improve generalization, particularly in out-of-distribution (OOD) settings. This work presents a comprehensive study of the effectiveness of LLM augmentation for retrieval, comprising over 100 distinct experimental settings of retrieval models, augmentation models and augmentation strategies. We find that, while augmentation enhances retrieval performance, its benefits diminish beyond a certain scale, even with diverse augmentation strategies. Surprisingly, we observe that augmentation with smaller LLMs can achieve performance competitive with larger augmentation models. Moreover, we examine how augmentation effectiveness varies with retrieval model pre-training, revealing that augmentation provides the most benefit to models which are not well pre-trained. Our insights pave the way for more judicious and efficient augmentation strategies, thus enabling informed decisions and maximizing retrieval performance while being more cost-effective.
pdf
bib
abs
Temporal Referential Consistency: Do LLMs Favor Sequences Over Absolute Time References?
Ashutosh Bajpai
|
Tanmoy Chakraborty
The increasing acceptance of large language models (LLMs) as an alternative to knowledge sources marks a significant paradigm shift across various domains, including time-sensitive fields such as law, healthcare, and finance. To fulfill this expanded role, LLMs must not only be factually accurate but also demonstrate consistency across temporal dimensions, necessitating robust temporal reasoning capabilities. Despite this critical requirement, efforts to ensure temporal consistency in LLMs remain scarce, including a noticeable absence of endeavors aimed at evaluating or augmenting LLMs across temporal references in time-sensitive inquiries. In this paper, we seek to address this gap by introducing a novel benchmark entitled temporal referential consistency, accompanied by a resource TEMP-ReCon designed to benchmark a wide range of both open-source and closed-source LLMs with various linguistic contexts characterized by differing resource richness (including English, French, and Romanian). The findings emphasize that LLMs exhibit insufficient temporal referential consistency. To address this, we propose UnTRaP, a reasoning path alignment-based model that aims to enhance the temporal referential consistency of LLMs. Our empirical experiments substantiate the efficacy of UnTRaP compared to several baseline models.
pdf
bib
abs
MemeArena: Automating Context-Aware Unbiased Evaluation of Harmfulness Understanding for Multimodal Large Language Models
Zixin Chen
|
Hongzhan Lin
|
Kaixin Li
|
Ziyang Luo
|
Yayue Deng
|
Jing Ma
The proliferation of memes on social media necessitates the capabilities of multimodal Large Language Models (mLLMs) to effectively understand multimodal harmfulness. Existing evaluation approaches predominantly focus on mLLMs’ detection accuracy for binary classification tasks, which often fail to reflect the in-depth interpretive nuance of harmfulness across diverse contexts. In this paper, we propose MemeArena, an agent-based arena-style evaluation framework that provides a context-aware and unbiased assessment for mLLMs’ understanding of multimodal harmfulness. Specifically, MemeArena simulates diverse interpretive contexts to formulate evaluation tasks that elicit perspective-specific analyses from mLLMs. By integrating varied viewpoints and reaching consensus among evaluators, it enables fair and unbiased comparisons of mLLMs’ abilities to interpret multimodal harmfulness. Extensive experiments demonstrate that our framework effectively reduces the evaluation biases of judge agents, with judgment results closely aligning with human preferences, offering valuable insights into reliable and comprehensive mLLM evaluations in multimodal harmfulness understanding. Our code and data are publicly available at https://github.com/Lbotirx/MemeArena.
pdf
bib
abs
Multi-perspective Analysis of Large Language Model Domain Specialization: An Experiment in Accounting Audit Procedures Generation
Yusuke Noro
Two major domain specialization approaches for Large Language Models (LLMs), fine-tuning and In-Context Learning (ICL), have been compared across various domains. While prior research has examined the similarities and differences between these approaches in task-specific capabilities, less is known about how they affect the features of the generated text itself. To address this research gap, we conducted an experimental study using the Accounting Audit Procedures Generation (AAPG) task, a highly specialized task requiring expert accounting knowledge. This task provides a practical testbed for a multi-perspective analysis of domain specialization due to its technical complexity and the large gap between general and domain expert knowledge. The results show consistent differences in output characteristics across models when comparing fine-tuning, ICL, and their combined approaches.
pdf
bib
abs
Generator-Assistant Stepwise Rollback Framework for Large Language Model Agent
Xingzuo Li
|
Kehai Chen
|
Yunfei Long
|
Xuefeng Bai
|
Yong Xu
|
Min Zhang
Large language model (LLM) agents typically adopt a step-by-step reasoning framework, in which they interleave the processes of thinking and acting to accomplish the given task. However, this paradigm faces a deep-rooted one-pass issue whereby each generated intermediate thought is plugged into the trajectory regardless of its correctness, which can cause irreversible error propagation. To address the issue, this paper proposes a novel framework called Generator-Assistant Stepwise Rollback (GA-Rollback) to induce better decision-making for LLM agents. Particularly, GA-Rollback utilizes a generator to interact with the environment and an assistant to examine each action produced by the generator, where the assistant triggers a rollback operation upon detection of incorrect actions. Moreover, we introduce two additional strategies tailored for the rollback scenario to further improve its effectiveness. Extensive experiments show that GA-Rollback achieves significant improvements over several strong baselines on three widely used benchmarks. Our analysis further reveals that GA-Rollback can function as a robust plug-and-play module, integrating seamlessly with other methods.
pdf
bib
abs
DocAgent: An Agentic Framework for Multi-Modal Long-Context Document Understanding
Li Sun
|
Liu He
|
Shuyue Jia
|
Yangfan He
|
Chenyu You
Recent advances in large language models (LLMs) have demonstrated significant promise in document understanding and question-answering. Despite the progress, existing approaches can only process short documents due to limited context length or fail to fully leverage multi-modal information. In this work, we introduce DocAgent, a multi-agent framework for long-context document understanding that imitates human reading practice. Specifically, we first extract a structured, tree-formatted outline from documents to help agents identify relevant sections efficiently. Further, we develop an interactive reading interface that enables agents to query and retrieve various types of content dynamically. To ensure answer reliability, we introduce a reviewer agent that cross-checks responses using complementary sources and maintains a task-agnostic memory bank to facilitate knowledge sharing across tasks. We evaluate our method on two long-context document understanding benchmarks, where it bridges the gap to human-level performance by surpassing competitive baselines, while maintaining a short context length. Our code is available at https://github.com/lisun-ai/DocAgent.
pdf
bib
abs
EasyRec: Simple yet Effective Language Models for Recommendation
Xubin Ren
|
Chao Huang
Deep neural networks have emerged as a powerful technique for learning representations from user-item interaction data in collaborative filtering (CF) for recommender systems. However, many existing methods heavily rely on unique user and item IDs, which restricts their performance in zero-shot learning scenarios. Inspired by the success of language models (LMs) and their robust generalization capabilities, we pose the question: How can we leverage language models to enhance recommender systems? We propose EasyRec, an effective approach that integrates text-based semantic understanding with collaborative signals. EasyRec employs a text-behavior alignment framework that combines contrastive learning with collaborative language model tuning. This ensures strong alignment between text-enhanced semantic representations and collaborative behavior information. Extensive evaluations across diverse datasets show EasyRec significantly outperforms state-of-the-art models, particularly in text-based zero-shot recommendation. EasyRec functions as a plug-and-play component that integrates seamlessly into collaborative filtering frameworks. This empowers existing systems with improved performance and adaptability to user preferences. Implementation codes are publicly available at: https://github.com/HKUDS/EasyRec
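The text-behavior alignment described above can be illustrated with a standard symmetric InfoNCE objective between text embeddings and collaborative-filtering embeddings. This is a generic sketch of contrastive alignment, not EasyRec's exact loss.

```python
import torch
import torch.nn.functional as F

def text_behavior_contrastive_loss(text_emb, cf_emb, temperature=0.07):
    """Symmetric InfoNCE loss that pulls each item's text embedding toward its
    collaborative-filtering embedding and pushes it away from the other items
    in the batch. Shapes: (batch, dim) for both inputs."""
    text_emb = F.normalize(text_emb, dim=-1)
    cf_emb = F.normalize(cf_emb, dim=-1)
    logits = text_emb @ cf_emb.T / temperature            # (batch, batch) similarities
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Toy usage with random embeddings for a batch of 8 items.
loss = text_behavior_contrastive_loss(torch.randn(8, 64), torch.randn(8, 64))
```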
pdf
bib
abs
From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery
Tianshi Zheng
|
Zheye Deng
|
Hong Ting Tsang
|
Weiqi Wang
|
Jiaxin Bai
|
Zihao Wang
|
Yangqiu Song
Large Language Models (LLMs) are catalyzing a paradigm shift in scientific discovery, evolving from task-specific automation tools into increasingly autonomous agents and fundamentally redefining research processes and human-AI collaboration. This survey systematically charts this burgeoning field, placing a central focus on the changing roles and escalating capabilities of LLMs in science. Through the lens of the scientific method, we introduce a foundational three-level taxonomy—Tool, Analyst, and Scientist—to delineate their escalating autonomy and evolving responsibilities within the research lifecycle. We further identify pivotal challenges and future research trajectories such as robotic automation, self-improvement, and ethical governance. Overall, this survey provides a conceptual architecture and strategic foresight to navigate and shape the future of AI-driven scientific discovery, fostering both rapid innovation and responsible advancement.
pdf
bib
abs
Mapping the Minds of LLMs: A Graph-Based Analysis of Reasoning LLMs
Zhen Xiong
|
Yujun Cai
|
Zhecheng Li
|
Yiwei Wang
Recent advances in test-time scaling have enabled Large Language Models (LLMs) to display sophisticated reasoning abilities via extended Chain-of-Thought (CoT) generation. Despite their impressive reasoning abilities, Large Reasoning Models (LRMs) frequently display unstable behaviors, e.g., hallucinating unsupported premises, overthinking simple tasks, and displaying higher sensitivity to prompt variations. This raises a deeper research question: How can we represent the reasoning process of LRMs to map their minds? To address this, we propose a unified graph-based analytical framework for fine-grained modeling and quantitative analysis of LRM reasoning dynamics. Our method first clusters long, verbose CoT outputs into semantically coherent reasoning steps, then constructs directed reasoning graphs to capture contextual and logical dependencies among these steps. Through a comprehensive analysis of derived reasoning graphs, we also reveal that key structural properties, such as exploration density, branching, and convergence ratios, strongly correlate with models’ performance. The proposed framework enables quantitative evaluation of internal reasoning structure and quality beyond conventional metrics and also provides practical insights for prompt engineering and cognitive analysis of LLMs. Code and resources will be released to facilitate future research in this direction.
pdf
bib
abs
ViPE: Visual Perception in Parameter Space for Efficient Video-Language Understanding
Shichen Lu
|
Tongtian Yue
|
Longteng Guo
|
Handong Li
|
Xingjian He
|
Si Liu
|
Jing Liu
Existing video-language models (Video-LLMs) typically rely on concatenating visual tokens with textual inputs for joint modeling. However, this token-level alignment leads to significant inefficiency, especially when scaling to long videos with dense visual inputs. In this work, we propose a video-to-parameter efficiency paradigm named ViPE that eliminates redundant visual tokens by transforming video content into visual perceptual weights, which are directly injected into the LLM’s parameters. ViPE consists of a visual injection module that compresses video features into a small set of perceptual queries using a hierarchical merge strategy, and a visual perception module that integrates the resulting representations into the LLM through a lightweight LoRA-like mechanism. ViPE achieves performance comparable to token-based baselines such as LLaVA, while reducing FLOPs by 85% and inference time by up to 65%, demonstrating a highly efficient and scalable solution for video understanding.
pdf
bib
abs
Alignment for Efficient Tool Calling of Large Language Models
Hongshen Xu
|
Zihan Wang
|
Zichen Zhu
|
Lei Pan
|
Xingyu Chen
|
Shuai Fan
|
Lu Chen
|
Kai Yu
Recent advancements in tool learning have enabled large language models (LLMs) to integrate external tools, enhancing their task performance by expanding their knowledge boundaries. However, relying on tools often introduces trade-offs between performance, speed, and cost, with LLMs sometimes exhibiting overreliance and overconfidence in tool usage. This paper addresses the challenge of aligning LLMs with their knowledge boundaries to make more intelligent decisions about tool invocation. We propose a multi-objective alignment framework that combines probabilistic knowledge boundary estimation with dynamic decision-making, allowing LLMs to better assess when to invoke tools based on their confidence. Our framework includes two methods for knowledge boundary estimation—consistency-based and absolute estimation—and two training strategies for integrating these estimates into the model’s decision-making process. Experimental results on various tool invocation scenarios demonstrate the effectiveness of our framework, showing significant improvements in tool efficiency by reducing unnecessary tool usage.
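To illustrate the consistency-based knowledge boundary estimate, the sketch below samples several answers, takes the majority agreement rate as confidence, and only invokes the external tool when confidence falls below a threshold. The `generate` and `call_tool` callables are hypothetical stand-ins, and the thresholded rule is a simplification of the paper's training-based integration.

```python
from collections import Counter

def consistency_confidence(generate, question, n_samples=8):
    """Estimate confidence as the agreement rate of sampled answers.
    `generate(question)` is a hypothetical stand-in that samples one answer
    from the LLM."""
    answers = [generate(question) for _ in range(n_samples)]
    majority_answer, count = Counter(answers).most_common(1)[0]
    return majority_answer, count / n_samples

def answer_or_call_tool(generate, call_tool, question, threshold=0.6):
    """Answer directly when the model is self-consistent enough; otherwise
    fall back to the (slower, costlier) external tool."""
    answer, confidence = consistency_confidence(generate, question)
    if confidence >= threshold:
        return answer
    return call_tool(question)
```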
pdf
bib
abs
ToM: Leveraging Tree-oriented MapReduce for Long-Context Reasoning in Large Language Models
Jiani Guo
|
Zuchao Li
|
Jie Wu
|
Qianren Wang
|
Yun Li
|
Lefei Zhang
|
Hai Zhao
|
Yujiu Yang
Large Language Models (LLMs), constrained by limited context windows, often face significant performance degradation when reasoning over long contexts. To address this, Retrieval-Augmented Generation (RAG) retrieves and reasons over chunks but frequently sacrifices logical coherence due to its reliance on similarity-based rankings. Similarly, divide-and-conquer frameworks (DCF) split documents into small chunks for independent reasoning and aggregation. While effective for local reasoning, DCF struggles to capture long-range dependencies and risks inducing conflicts by processing chunks in isolation. To overcome these limitations, we propose ToM, a novel Tree-oriented MapReduce framework for long-context reasoning. ToM leverages the inherent hierarchical structure of long documents (e.g., main headings and subheadings) by constructing a DocTree through hierarchical semantic parsing and performing bottom-up aggregation. Using a Tree MapReduce approach, ToM enables recursive reasoning: in the Map step, rationales are generated at child nodes; in the Reduce step, these rationales are aggregated across sibling nodes to resolve conflicts or reach consensus at parent nodes. Experimental results on 70B+ LLMs show that ToM significantly outperforms existing divide-and-conquer frameworks and retrieval-augmented generation methods, achieving better logical coherence and long-context reasoning.
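A minimal sketch of the Tree MapReduce control flow: leaf sections are reasoned over independently (Map), and sibling rationales are merged bottom-up at their parent heading (Reduce). The `llm` callable and the prompts are hypothetical stand-ins, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class DocNode:
    heading: str
    text: str = ""                                 # section body (empty for pure containers)
    children: list["DocNode"] = field(default_factory=list)

def tree_map_reduce(node, question, llm):
    """Bottom-up reasoning over a document tree. `llm(prompt)` is a hypothetical
    call returning a textual rationale."""
    if not node.children:
        # Map step: reason over a leaf section in isolation.
        return llm(f"Section '{node.heading}':\n{node.text}\n\nQuestion: {question}")

    # Recurse into children, then Reduce: merge sibling rationales at the parent.
    child_rationales = [tree_map_reduce(c, question, llm) for c in node.children]
    joined = "\n".join(f"- {r}" for r in child_rationales)
    return llm(
        f"Under heading '{node.heading}', reconcile these sub-section findings "
        f"and answer the question.\nFindings:\n{joined}\n\nQuestion: {question}"
    )
```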
pdf
bib
abs
BANMIME: Misogyny Detection with Metaphor Explanation on Bangla Memes
Md Ayon Mia
|
Akm Moshiur Rahman Mazumder
|
Khadiza Sultana Sayma
|
Md Fahim
|
Md Tahmid Hasan Fuad
|
Muhammad Ibrahim Khan
|
Akmmahbubur Rahman
Detecting misogyny in multimodal content remains a notable challenge, particularly in culturally conservative and low-resource contexts like Bangladesh. While existing research has explored hate speech and general meme classification, the nuanced identification of misogyny in Bangla memes, rich in metaphor, humor, and visual-textual interplay, remains severely underexplored. To address this gap, we introduce BanMiMe, the first comprehensive Bangla misogynistic meme dataset comprising 2,000 culturally grounded samples where each meme includes misogyny labels, humor categories, metaphor localization, and detailed human-written explanations. We benchmark the performance of various open- and closed-source vision-language models (VLMs) under zero-shot and prompt-based settings and evaluate their capacity for both classification and explanation generation. Furthermore, we systematically explore multiple fine-tuning strategies, including standard, data-augmented, and Chain-of-Thought (CoT) supervision. Our results demonstrate that CoT-based fine-tuning consistently enhances model performance, both in terms of accuracy and in generating meaningful explanations. We envision BanMiMe as a foundational resource for advancing explainable multimodal moderation systems in low-resource and culturally sensitive settings.
pdf
bib
abs
Phi: Preference Hijacking in Multi-modal Large Language Models at Inference Time
Yifan Lan
|
Yuanpu Cao
|
Weitong Zhang
|
Lu Lin
|
Jinghui Chen
Recently, Multimodal Large Language Models (MLLMs) have gained significant attention across various domains. However, their widespread adoption has also raised serious safety concerns. In this paper, we uncover a new safety risk of MLLMs: the output preference of MLLMs can be arbitrarily manipulated by carefully optimized images. Such attacks often generate contextually relevant yet biased responses that are neither overtly harmful nor unethical, making them difficult to detect. Specifically, we introduce a novel method, **P**reference **Hi**jacking (**Phi**), for manipulating the MLLM response preferences using a preference hijacked image. Our method works at inference time and requires no model modifications. Additionally, we introduce a universal hijacking perturbation – a transferable component that can be embedded into different images to hijack MLLM responses toward any attacker-specified preferences. Experimental results across various tasks demonstrate the effectiveness of our approach. The code for Phi is accessible at https://github.com/Yifan-Lan/Phi.
pdf
bib
abs
Retrieval-augmented GUI Agents with Generative Guidelines
Ran Xu
|
Kaixin Ma
|
Wenhao Yu
|
Hongming Zhang
|
Joyce C. Ho
|
Carl Yang
|
Dong Yu
GUI agents powered by vision-language models (VLMs) show promise in automating complex digital tasks. However, their effectiveness in real-world applications is often limited by scarce training data and the inherent complexity of these tasks, which frequently require long-tailed knowledge covering rare, unseen scenarios. We propose RAG-GUI, a lightweight VLM that leverages web tutorials at inference time. RAG-GUI is first warm-started via supervised finetuning (SFT) and further refined through self-guided rejection sampling fine-tuning (RSF). Designed to be model-agnostic, RAG-GUI functions as a generic plug-in that enhances any VLM-based agent. Evaluated across three distinct tasks, it consistently outperforms baseline agents and surpasses other inference baselines by 2.6% to 13.3% across two model sizes, demonstrating strong generalization and practical plug-and-play capabilities in real-world scenarios.
pdf
bib
abs
COAS2W: A Chinese Older-Adults Spoken-to-Written Transformation Corpus with Context Awareness
Chun Kang
|
Zhigu Qian
|
Zhen Fu
|
Jiaojiao Fu
|
Yangfan Zhou
Spoken language from older adults often deviates from written norms due to omission, disordered syntax, constituent errors, and redundancy, limiting the usefulness of automatic transcripts in downstream tasks. We present COAS2W, a Chinese spoken-to-written corpus of 10,004 utterances from older adults, each paired with a written version, fine-grained error labels, and four-sentence context. Fine-tuned lightweight open-source models on COAS2W outperform larger closed-source models. Context ablation shows the value of multi-sentence input, and normalization improves performance on downstream translation tasks. COAS2W supports the development of inclusive, context-aware language technologies for older speakers. Our annotation convention, data, and code are publicly available at https://github.com/Springrx/COAS2W.
pdf
bib
abs
Answer Convergence as a Signal for Early Stopping in Reasoning
Xin Liu
|
Lu Wang
Chain-of-thought (CoT) prompting enhances reasoning in large language models (LLMs) but often leads to verbose and redundant outputs, thus increasing inference cost. We hypothesize that many reasoning steps are unnecessary for producing correct answers. To investigate this, we begin with a systematic study of the minimum reasoning required for a model to reach a stable decision. Based on the insights, we propose three inference-time strategies to improve efficiency: (1) early stopping via answer consistency, (2) boosting the probability of generating end-of-reasoning signals, and (3) a supervised method that learns when to stop based on internal activations. Experiments across five benchmarks and five open-weights LLMs show that our methods substantially reduce token usage with little or no accuracy drop. In particular, on NaturalQuestions, Answer Consistency reduces tokens by over 40% while further improving accuracy. Our work underscores the importance of cost-effective reasoning methods that operate at inference time, offering practical benefits for real-world applications.
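The early-stopping-via-answer-consistency strategy can be sketched as follows: decode step by step, periodically probe the model's current answer, and stop once the last few probes agree. `decode_step` and `probe_answer` are hypothetical stand-ins for an actual decoding loop, not the paper's implementation.

```python
def generate_with_answer_consistency(decode_step, probe_answer, max_steps=64,
                                     probe_every=4, patience=3):
    """Stop chain-of-thought generation once the intermediate answer stabilizes.
    `decode_step(state)` advances generation and returns the new state;
    `probe_answer(state)` extracts the model's current best answer."""
    state, recent = None, []
    for step in range(1, max_steps + 1):
        state = decode_step(state)
        if step % probe_every == 0:
            recent.append(probe_answer(state))
            recent = recent[-patience:]
            # Early stop: the last `patience` probes all agree on one answer.
            if len(recent) == patience and len(set(recent)) == 1:
                return recent[-1], step
    return probe_answer(state), max_steps
```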
pdf
bib
abs
VeriFact: Enhancing Long-Form Factuality Evaluation with Refined Fact Extraction and Reference Facts
Xin Liu
|
Lechen Zhang
|
Sheza Munir
|
Yiyang Gu
|
Lu Wang
Large language models (LLMs) excel at generating long-form responses, but evaluating their factuality remains challenging due to complex inter-sentence dependencies within the generated facts. Prior solutions predominantly follow a decompose-decontextualize-verify pipeline but often fail to capture essential context and miss key relational facts. In this paper, we introduce VeriFact, a factuality evaluation framework designed to enhance fact extraction by identifying and resolving incomplete and missing facts to support more accurate verification results. Moreover, we introduce FactRBench, a benchmark that evaluates both precision and recall in long-form model responses, whereas prior work primarily focuses on precision. FactRBench provides reference fact sets from advanced LLMs and human-written answers, enabling recall assessment. Empirical evaluations show that VeriFact significantly enhances fact completeness and preserves complex facts with critical relational information, resulting in more accurate factuality evaluation. Benchmarking various open- and closed-weight LLMs on FactRBench indicates that larger models within the same model family improve precision and recall, but high precision does not always correlate with high recall, underscoring the importance of comprehensive factuality assessment.
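To make the precision/recall formulation concrete, the sketch below scores an extracted fact set against a reference fact set with a pluggable `supported` check; in practice that check would be an NLI model or LLM judge rather than the exact-match stand-in used in the toy example.

```python
def fact_precision_recall(extracted_facts, reference_facts, supported):
    """Precision: fraction of extracted facts supported by the reference set.
    Recall: fraction of reference facts covered by some extracted fact.
    `supported(fact, pool)` is a hypothetical entailment-style check."""
    if not extracted_facts or not reference_facts:
        return 0.0, 0.0
    precision = sum(supported(f, reference_facts) for f in extracted_facts) / len(extracted_facts)
    recall = sum(supported(r, extracted_facts) for r in reference_facts) / len(reference_facts)
    return precision, recall

# Toy usage with an exact-match support check.
p, r = fact_precision_recall(
    ["Paris is in France", "Paris has 3M residents"],
    ["Paris is in France", "France is in Europe"],
    supported=lambda fact, pool: fact in pool,
)
```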
pdf
bib
abs
SQUAB: Evaluating LLM robustness to Ambiguous and Unanswerable Questions in Semantic Parsing
Simone Papicchio
|
Luca Cagliero
|
Paolo Papotti
Large Language Models (LLMs) have demonstrated robust performance in Semantic Parsing (SP) for well-defined queries with unambiguous intent and answerable responses. However, practical user questions frequently deviate from these ideal conditions, challenging the applicability of existing benchmarks. To address this issue, we introduce SQUAB, an automatic dataset generator of Ambiguous and Unanswerable questions. SQUAB generates complex, annotated SP tests using a blend of SQL and LLM capabilities. Results show that SQUAB reduces test generation costs by up to 99% compared to human-based solutions while aligning with real-world question patterns. Furthermore, these tests challenge LLM performance while revealing disparities between public and proprietary datasets. This highlights the need for a dynamic, automatic dataset generator such as SQUAB. The code is designed for user extension to accommodate new ambiguous and unanswerable patterns and is available at https://anonymous.4open.science/r/squab-8716/.
pdf
bib
abs
Reliable Evaluation and Benchmarks for Statement Autoformalization
Auguste Poiroux
|
Gail Weiss
|
Viktor Kunčak
|
Antoine Bosselut
Evaluating statement autoformalization, translating natural language mathematics into formal languages like Lean 4, remains a significant challenge, with few metrics, datasets, and standards to robustly measure progress. In this work, we present a comprehensive approach combining improved metrics, robust benchmarks, and systematic evaluation, to fill this gap. First, we introduce BEq+, an automated metric that correlates strongly with human judgment, along with ProofNetVerif, a new dataset for assessing the quality of evaluation metrics, containing 3,752 annotated examples. Second, we develop two new autoformalization benchmarks: ProofNet#, a corrected version of ProofNet, and RLM25, with 619 new pairs of research-level mathematics from six formalization projects. Through systematic experimentation across these benchmarks, we find that current techniques can achieve up to 45.1% accuracy on undergraduate mathematics but struggle with research-level content without proper context. Our work establishes a reliable foundation for evaluating and advancing autoformalization systems.
pdf
bib
abs
VisBias: Measuring Explicit and Implicit Social Biases in Vision Language Models
Jen-tse Huang
|
Jiantong Qin
|
Jianping Zhang
|
Youliang Yuan
|
Wenxuan Wang
|
Jieyu Zhao
This research investigates both explicit and implicit social biases exhibited by Vision-Language Models (VLMs). The key distinction between these bias types lies in the level of awareness: explicit bias refers to conscious, intentional biases, while implicit bias operates subconsciously. To analyze explicit bias, we directly pose questions to VLMs related to gender and racial differences: (1) Multiple-choice questions based on a given image (e.g., “What is the education level of the person in the image?”); (2) Yes-No comparisons using two images (e.g., “Is the person in the first image more educated than the person in the second image?”). For implicit bias, we design tasks where VLMs assist users but reveal biases through their responses: (1) Image description tasks: Models are asked to describe individuals in images, and we analyze disparities in textual cues across demographic groups. (2) Form completion tasks: Models draft a personal information collection form with 20 attributes, and we examine correlations among selected attributes for potential biases. We evaluate Gemini-1.5, GPT-4V, GPT-4o, LLaMA-3.2-Vision, and LLaVA-v1.6. Our code and data are publicly available at https://github.com/uscnlp-lime/VisBias.
pdf
bib
abs
Less Is More? Examining Fairness in Pruned Large Language Models for Summarising Opinions
Nannan Huang
|
Haytham M. Fayek
|
Xiuzhen Zhang
Model compression through post-training pruning offers a way to reduce model size and computational requirements without significantly impacting model performance. However, the effect of pruning on the fairness of LLM-generated summaries remains unexplored, particularly for opinion summarisation where biased outputs could influence public views. In this paper, we present a comprehensive empirical analysis of opinion summarisation, examining three state-of-the-art pruning methods and various calibration sets across three open-source LLMs using four fairness metrics. Our systematic analysis reveals that pruning methods have a larger impact on fairness than calibration sets. Building on these insights, we propose High Gradient Low Activation (HGLA) pruning, which identifies and removes parameters that are redundant for input processing but influential in output generation. Our experiments demonstrate that HGLA can better maintain or even improve fairness compared to existing methods, showing promise across models and tasks where traditional methods have limitations. Our human evaluation shows HGLA-generated outputs are fairer than existing state-of-the-art pruning methods.
pdf
bib
abs
AI Sees Your Location—But With A Bias Toward The Wealthy World
Jingyuan Huang
|
Jen-tse Huang
|
Ziyi Liu
|
Xiaoyuan Liu
|
Wenxuan Wang
|
Jieyu Zhao
Visual-Language Models (VLMs) have shown remarkable performance across various tasks, particularly in recognizing geographic information from images. However, VLMs still show regional biases in this task. To systematically evaluate these issues, we introduce a benchmark consisting of 1,200 images paired with detailed geographic metadata. Evaluating four VLMs, we find that while these models demonstrate the ability to recognize geographic information from images, achieving up to 53.8% accuracy in city prediction, they exhibit significant biases. Specifically, performance is substantially higher for economically developed and densely populated regions compared to less developed (-12.5%) and sparsely populated (-17.0%) areas. Moreover, regional biases of frequently over-predicting certain locations remain; for instance, the models consistently predict Sydney for images taken in Australia, as shown by the low entropy scores for such countries. The strong performance of VLMs also raises privacy concerns, particularly for users who share images online without the intent of being identified. Our code and dataset are publicly available at https://github.com/uscnlp-lime/FairLocator.
pdf
bib
abs
Faster In-Context Learning for LLMs via N-Gram Trie Speculative Decoding
Jinglin Chen
|
Qiwei Li
|
Zuchao Li
|
Baoyuan Qi
|
Liu Guoming
|
Haojun Ai
|
Hai Zhao
|
Ping Wang
As a crucial method in prompt engineering, In-Context Learning (ICL) enhances the generalization and knowledge utilization capabilities of Large Language Models (LLMs) (Dong et al., 2024). However, the lengthy retrieved contexts and limited token throughput in autoregressive models significantly constrain reasoning speed. To address this challenge, we propose N-Gram Trie Speculative Decoding, a novel approach that leverages the overlap between context and model output. This method constructs an n-gram trie from the context to generate drafts, accelerating token generation for LLMs. We evaluate our approach on summarization, Retrieval-Augmented Generation (RAG), and context-based Question Answering (QA) tasks. Experimental results on Vicuna-7B, Llama2-7B-Chat, and Llama3-8B-Instruct demonstrate substantial speed improvements without compromising accuracy. Compared with various strong baselines, our method achieves the highest mean speedup, showcasing its effectiveness and efficiency.
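To make the drafting idea concrete, here is a minimal, hedged sketch of building an n-gram lookup over the in-context tokens and proposing a draft continuation. It is not the paper's implementation: the table structure, the greedy choice of continuation, and the function names are assumptions, and the verification pass performed by the target model is only noted in a comment.

```python
# Illustrative sketch of drafting from an n-gram trie built over the in-context text
# (not the paper's implementation; the target LLM's verification step is omitted).

from collections import defaultdict

def build_ngram_trie(tokens, n=3):
    """Map each (n-1)-token prefix seen in the context to its observed continuations."""
    trie = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        prefix = tuple(tokens[i:i + n - 1])
        trie[prefix].append(tokens[i + n - 1])
    return trie

def propose_draft(trie, generated, n=3, max_draft=4):
    """Greedily extend the current output with tokens copied from the trie."""
    draft = []
    context = list(generated)
    for _ in range(max_draft):
        prefix = tuple(context[-(n - 1):])
        if prefix not in trie:
            break
        nxt = trie[prefix][0]          # simplest choice: first observed continuation
        draft.append(nxt)
        context.append(nxt)
    return draft

if __name__ == "__main__":
    context = "the cat sat on the mat and the cat slept".split()
    trie = build_ngram_trie(context, n=3)
    # Suppose decoding has produced "... the cat" so far; draft a continuation.
    print(propose_draft(trie, ["report", "says", "the", "cat"]))
    # A real system would pass this draft to the target model for parallel verification.
```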
pdf
bib
abs
From Surveys to Narratives: Rethinking Cultural Value Adaptation in LLMs
Farid Adilazuarda
|
Chen Cecilia Liu
|
Iryna Gurevych
|
Alham Fikri Aji
Adapting cultural values in Large Language Models (LLMs) presents significant challenges, particularly due to biases and data limitations. Previous work aligns LLMs with different cultures using survey data, primarily from the World Values Survey (WVS). However, it remains unclear whether this approach effectively captures cultural nuances or produces distinct cultural representations for tasks like offensiveness classification. In this paper, we systematically investigate WVS-based training for cultural value adaptation and find that relying solely on survey data can homogenize cultural norms and interfere with factual knowledge. To address these issues, we propose augmenting WVS with encyclopedic and scenario-based cultural narratives from Wikipedia and NormAd. Our experiments across multiple cultures show that this approach captures more differentiated cultural values and improves downstream classification performance.
pdf
bib
abs
Iterative Prompt Refinement for Safer Text-to-Image Generation
Jinwoo Jeon
|
JunHyeok Oh
|
Hayeong Lee
|
Byung-Jun Lee
Text-to-Image (T2I) models have made remarkable progress in generating images from text prompts, but their output quality and safety still depend heavily on how prompts are phrased. Existing safety methods typically refine prompts using large language models (LLMs), but they overlook the images produced, which can result in unsafe outputs or unnecessary changes to already safe prompts. To address this, we propose an iterative prompt refinement algorithm that uses Vision Language Models (VLMs) to analyze both the input prompts and the generated images. By leveraging visual feedback, our method refines prompts more effectively, improving safety while maintaining user intent and reliability comparable to existing LLM-based approaches. Additionally, we introduce a new dataset labeled with both textual and visual safety signals using an off-the-shelf multi-modal LLM, enabling supervised fine-tuning. Experimental results demonstrate that our approach produces safer outputs without compromising alignment with user intent, offering a practical solution for generating safer T2I content. WARNING: This paper contains examples of harmful or inappropriate images generated by models.
pdf
bib
abs
Language Models as Continuous Self-Evolving Data Engineers
Peidong Wang
|
Ming Wang
|
Zhiming Ma
|
Xiaocui Yang
|
Shi Feng
|
Daling Wang
|
Yifei Zhang
|
Kaisong Song
Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their further evolution is often hampered by the scarcity of high-quality training data and the heavy reliance of traditional methods on expert-labeled data. This reliance sets a ceiling on LLM performance and is particularly challenging in data-scarce scenarios where extensive supervision is unavailable. To address this issue, we propose a novel paradigm named LANCE (**LAN**guage models as **C**ontinuous self-**E**volving data engineers) that enables LLMs to train themselves by autonomously generating, cleaning, reviewing, and annotating data with preference information. Our approach demonstrates that LLMs can serve as continuous self-evolving data engineers, significantly reducing the time and cost of post-training data construction. Through iterative fine-tuning on Qwen2 series models, we validate the effectiveness of LANCE across various tasks, showing that it can maintain high-quality data generation and continuously improve model performance. Across multiple benchmark dimensions, LANCE results in an average score enhancement of **3.64** for Qwen2-7B and **1.75** for Qwen2-7B-Instruct. This autonomous data construction paradigm not only lessens reliance on human experts or external models but also ensures data aligns with human preferences, offering a scalable path for LLM self-improvement, especially in contexts with limited supervisory data. Code is available at: https://github.com/Control-derek/LANCE.
pdf
bib
abs
Unilaw-R1: A Large Language Model for Legal Reasoning with Reinforcement Learning and Iterative Inference
Hua Cai
|
Shuang Zhao
|
Liang Zhang
|
Xuli Shen
|
Qing Xu
|
Weilin Shen
|
Zihao Wen
|
Tianke Ban
Reasoning-focused large language models (LLMs) are rapidly evolving across various domains, yet their capabilities in handling complex legal problems remain underexplored. In this paper, we introduce Unilaw-R1, a large language model tailored for legal reasoning. With a lightweight 7-billion parameter scale, Unilaw-R1 significantly reduces deployment cost while effectively tackling three core challenges in the legal domain: insufficient legal knowledge, unreliable reasoning logic, and weak business generalization. To address these issues, we first construct Unilaw-R1-Data, a high-quality dataset containing ~17K distilled and screened chain-of-thought (CoT) samples. Based on this, we adopt a two-stage training strategy combining Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), which significantly boosts the model’s performance on complex legal reasoning tasks and supports interpretable decision-making in legal AI applications. To assess legal reasoning ability, we also introduce Unilaw-R1-Eval, a dedicated benchmark designed to evaluate models across single- and multi-choice legal tasks. Unilaw-R1 demonstrates strong results on authoritative benchmarks, outperforming all models of similar scale and achieving performance on par with the much larger DeepSeek-R1-Distill-Qwen-32B (54.9%). Following domain-specific training, it also showed significant gains on LawBench and LexEval, exceeding Qwen-2.5-7B-Instruct (46.6%) by an average margin of 6.6%. Code is available at: https://github.com/Hanscal/Unilaw-R1.
pdf
bib
abs
Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios
Yunkai Dang
|
Mengxi Gao
|
Yibo Yan
|
Xin Zou
|
Yanggan Gu
|
Jungang Li
|
Jingyu Wang
|
Peijie Jiang
|
Aiwei Liu
|
Jia Liu
|
Xuming Hu
Multimodal large language models (MLLMs) have recently achieved state-of-the-art performance on tasks ranging from visual question answering to video understanding. However, existing studies have concentrated mainly on visual–textual misalignment, leaving largely unexplored the MLLMs’ ability to preserve an originally correct answer when confronted with misleading information. We reveal a response uncertainty phenomenon: across nine standard datasets, twelve state-of-the-art open-source MLLMs overturn a previously correct answer in 65% of cases after receiving a single deceptive cue. To systematically quantify this vulnerability, we propose a two-stage evaluation pipeline: (1) elicit each model’s original response on unperturbed inputs; (2) inject explicit (false-answer hints) and implicit (contextual contradictions) misleading instructions, and compute the misleading rate—the fraction of correct-to-incorrect flips. Leveraging the most susceptible examples, we curate the Multimodal Uncertainty Benchmark (MUB), a collection of image–question pairs stratified into low, medium, and high difficulty based on how many of twelve state-of-the-art MLLMs they mislead. Extensive evaluation on twelve open-source and five closed-source models reveals a high uncertainty: average misleading rates exceed 86%, with explicit cues over 67.19% and implicit cues over 80.67%. To reduce the misleading rate, we then fine-tune all open-source MLLMs on a compact 2,000-sample mixed-instruction dataset, reducing misleading rates to 6.97% (explicit) and 32.77% (implicit), boosting consistency by nearly 29.37% on highly deceptive inputs, and slightly improving accuracy on standard benchmarks.
pdf
bib
abs
Evaluating and Aligning Human Economic Risk Preferences in LLMs
Jiaxin Liu
|
Yixuan Tang
|
Yi Yang
|
Kar Yan Tam
Large Language Models (LLMs) are increasingly used in decision-making scenarios that involve risk assessment, yet their alignment with human economic rationality remains unclear. In this study, we investigate whether LLMs exhibit risk preferences consistent with human expectations across different personas. Specifically, we propose an evaluation metric called Risk Disparity Score (RDS) and assess whether LLM-generated responses reflect appropriate levels of risk aversion or risk-seeking behavior based on an individual’s persona. Our results reveal that while LLMs make reasonable decisions in simplified, personalized risk contexts, their performance declines in more complex economic decision-making tasks. To address this, we test whether current state-of-the-art alignment methods such as Direct Preference Optimization (DPO) and In-Context Learning (ICL) can enhance LLM adherence to persona-specific risk preferences. We find DPO can improve the economic rationality of LLMs in loss-related parameters, offering a step toward more human-aligned AI decision-making.
pdf
bib
abs
Ensembling Prompting Strategies for Zero-Shot Hierarchical Text Classification with Large Language Models
Mingxuan Xia
|
Zhijie Jiang
|
Haobo Wang
|
Junbo Zhao
|
Tianlei Hu
|
Gang Chen
Hierarchical text classification aims to classify documents into multiple labels within a hierarchical taxonomy, making it an essential yet challenging task in natural language processing. Recently, using Large Language Models (LLMs) to tackle hierarchical text classification in a zero-shot manner has attracted increasing attention due to their cost-efficiency and flexibility. Given the challenges of understanding the hierarchy, various HTC prompting strategies have been explored to elicit the best performance from LLMs. However, our empirical study reveals that LLMs are highly sensitive to these prompting strategies—(i) within a task, different strategies yield substantially different results, and (ii) across various tasks, the relative effectiveness of a given strategy varies significantly. To address this, we propose a novel ensemble method, HiEPS, which integrates the results of diverse prompting strategies to promote LLMs’ reliability. We also introduce a path-valid voting mechanism for ensembling, which selects a valid result with the highest path frequency score. Extensive experiments on three benchmark datasets show that HiEPS boosts the performance of single prompting strategies and achieves SOTA results. The source code is available at https://github.com/MingxuanXia/HiEPS.
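For illustration only, the sketch below shows one plausible form of path-valid voting over the outputs of several prompting strategies: candidate label paths are filtered by taxonomy validity, and the surviving path with the highest label-frequency score is returned. The toy taxonomy, scoring rule, and function names are assumptions, not the actual HiEPS code.

```python
# Minimal sketch of a path-valid voting step for hierarchical text classification.
# Not the paper's exact scoring rule: here a candidate path is kept only if every
# parent->child edge exists in the taxonomy, and its score is the number of times
# its labels appear across the predictions of the different prompting strategies.

from collections import Counter

TAXONOMY = {            # parent -> allowed children (toy example)
    "root": {"science", "sports"},
    "science": {"physics", "biology"},
    "sports": {"tennis", "soccer"},
}

def is_valid(path):
    return all(child in TAXONOMY.get(parent, set())
               for parent, child in zip(path, path[1:]))

def vote(predicted_paths):
    """Pick the valid path whose labels are most frequent across strategies."""
    label_counts = Counter(label for path in predicted_paths for label in path)
    valid = [p for p in predicted_paths if is_valid(p)]
    return max(valid, key=lambda p: sum(label_counts[l] for l in p), default=None)

if __name__ == "__main__":
    predictions = [                      # one path per prompting strategy
        ("root", "science", "physics"),
        ("root", "science", "physics"),
        ("root", "science", "soccer"),   # invalid edge, filtered out
    ]
    print(vote(predictions))             # ('root', 'science', 'physics')
```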
pdf
bib
abs
Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers
Eugene Jang
|
Kimin Lee
|
Jin-Woo Chung
|
Keuntae Park
|
Seungwon Shin
Tokenization is a crucial step that bridges human-readable text with model-readable discrete tokens. However, recent studies have revealed that tokenizers can be exploited to elicit unwanted model behaviors. In this work, we investigate incomplete tokens, i.e., undecodable tokens with stray bytes resulting from byte-level byte-pair encoding (BPE) tokenization. We hypothesize that such tokens are heavily reliant on their adjacent tokens and are fragile when paired with unfamiliar tokens. To demonstrate this vulnerability, we introduce improbable bigrams: out-of-distribution combinations of incomplete tokens designed to exploit their dependency. Our experiments show that improbable bigrams are significantly prone to hallucinatory behaviors. Surprisingly, the same phrases have drastically lower rates of hallucination (90% reduction in Llama3.1) when an alternative tokenization is used. We caution against the potential vulnerabilities introduced by byte-level BPE tokenizers, which may introduce blind spots to language models.
pdf
bib
abs
UI-Hawk: Unleashing the Screen Stream Understanding for Mobile GUI Agents
Jiwen Zhang
|
Ya-Qi Yu
|
Minghui Liao
|
WenTao Li
|
Jihao Wu
|
Zhongyu Wei
Graphical User Interface (GUI) agents are expected to precisely operate on the screens of digital devices. Existing GUI agents merely depend on current visual observations and plain-text action history, ignoring the significance of history screens. To mitigate this issue, we propose **UI-Hawk**, a multi-modal GUI agent specially designed to process screen streams encountered during GUI navigation. UI-Hawk incorporates a history-aware visual encoder to handle the screen sequences. To acquire a better understanding of screen streams, we select four fundamental tasks—UI grounding, UI referring, screen question answering, and screen summarization. We further propose a curriculum learning strategy to subsequently guide the model from fundamental tasks to advanced screen-stream comprehension. Along with the efforts above, we have also created a benchmark FunUI to quantitatively evaluate the fundamental screen understanding ability of MLLMs. Extensive experiments on FunUI and GUI navigation benchmarks consistently validate that screen stream understanding is essential for GUI tasks. Our code and data are now available at https://github.com/IMNearth/UIHawk.
pdf
bib
abs
UniDebugger: Hierarchical Multi-Agent Framework for Unified Software Debugging
Cheryl Lee
|
Chunqiu Steven Xia
|
Longji Yang
|
Jen-tse Huang
|
Zhouruixing Zhu
|
Lingming Zhang
|
Michael R. Lyu
Software debugging is a time-consuming endeavor involving a series of steps, such as fault localization and patch generation, each requiring thorough analysis and a deep understanding of the underlying logic. While large language models (LLMs) demonstrate promising potential in coding tasks, their performance in debugging remains limited. Current LLM-based methods often focus on isolated steps and struggle with complex bugs. In this paper, we propose the first end-to-end framework, UniDebugger, for unified debugging through multi-agent synergy. It mimics the entire cognitive processes of developers, with each agent specialized as a particular component of this process rather than mirroring the actions of an independent expert as in previous multi-agent systems. Agents are coordinated through a three-level design, following a cognitive model of debugging, allowing adaptive handling of bugs with varying complexities. Experiments on extensive benchmarks demonstrate that UniDebugger significantly outperforms state-of-the-art repair methods, fixing 1.25x to 2.56x as many bugs as the baselines on the repo-level benchmark, Defects4J. This performance is achieved without requiring ground-truth root-cause code statements, unlike the baselines. Our source code is available at: https://github.com/BEbillionaireUSD/UniDebugger.
pdf
bib
abs
Understanding the Thinking Process of Reasoning Models: A Perspective from Schoenfeld’s Episode Theory
Ming Li
|
Nan Zhang
|
Chenrui Fan
|
Hong Jiao
|
Yanbin Fu
|
Sydney Peters
|
Qingshu Xu
|
Robert Lissitz
|
Tianyi Zhou
While Large Reasoning Models (LRMs) generate extensive chain-of-thought reasoning, we lack a principled framework for understanding how these thoughts are structured. In this paper, we introduce a novel approach by applying Schoenfeld’s Episode Theory, a classic cognitive framework for human mathematical problem-solving, to analyze the reasoning traces of LRMs. We annotated thousands of sentences and paragraphs from model-generated solutions to math problems using seven cognitive labels (e.g., Plan, Implement, Verify). The result is the first publicly available benchmark for the fine-grained analysis of machine reasoning, including a large annotated corpus and detailed annotation guidebooks. Our preliminary analysis reveals distinct patterns in LRM reasoning, such as the transition dynamics between cognitive states. This framework provides a theoretically grounded methodology for interpreting LRM cognition and enables future work on more controllable and transparent reasoning systems.
pdf
bib
abs
Thread: A Logic-Based Data Organization Paradigm for How-To Question Answering with Retrieval Augmented Generation
Kaikai An
|
Fangkai Yang
|
Liqun Li
|
Junting Lu
|
Sitao Cheng
|
Shuzheng Si
|
Lu Wang
|
Pu Zhao
|
Lele Cao
|
Qingwei Lin
|
Saravan Rajmohan
|
Dongmei Zhang
|
Baobao Chang
Recent advances in retrieval-augmented generation (RAG) have substantially improved question-answering systems, particularly for factoid ‘5Ws’ questions. However, significant challenges remain when addressing ‘1H’ questions, specifically how-to questions, which are integral for decision-making and require dynamic, step-by-step responses. The key limitation lies in the prevalent data organization paradigm, chunk, which commonly divides documents into fixed-size segments and disrupts the logical coherence and connections within the context. To address this, we propose THREAD, a novel data organization paradigm enabling systems to handle how-to questions more effectively. Specifically, we introduce a new knowledge granularity, ‘logic unit’ (LU), where large language models transform documents into more structured and loosely interconnected LUs. Extensive experiments across both open-domain and industrial settings show that THREAD outperforms existing paradigms significantly, improving the success rate of handling how-to questions by 21% to 33%. Additionally, THREAD demonstrates high adaptability across diverse document formats, reducing retrieval information by up to 75% compared to chunk, and also shows better generalizability to ‘5Ws’ questions, such as multi-hop questions, outperforming other paradigms.
pdf
bib
abs
Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement
Gabriele Sarti
|
Vilém Zouhar
|
Malvina Nissim
|
Arianna Bisazza
Word-level quality estimation (WQE) aims to automatically identify fine-grained error spans in machine-translated outputs and has found many uses, including assisting translators during post-editing. Modern WQE techniques are often expensive, involving prompting of large language models or ad-hoc training on large amounts of human-labeled data. In this work, we investigate efficient alternatives exploiting recent advances in language model interpretability and uncertainty quantification to identify translation errors from the inner workings of translation models. In our evaluation spanning 14 metrics across 12 translation directions, we quantify the impact of human label variation on metric performance by using multiple sets of human labels. Our results highlight the untapped potential of unsupervised metrics, the shortcomings of supervised methods when faced with label uncertainty, and the brittleness of single-annotator evaluation practices.
pdf
bib
abs
STEER-BENCH: A Benchmark for Evaluating the Steerability of Large Language Models
Kai Chen
|
Zihao He
|
Taiwei Shi
|
Kristina Lerman
Steerability, or the ability of large language models (LLMs) to adapt outputs to align with diverse community-specific norms, perspectives, and communication styles, is critical for real-world applications but remains under-evaluated. We introduce STEER-BENCH, a benchmark for assessing population-specific steering using contrasting Reddit communities. Covering 30 contrasting subreddit pairs across 19 domains, STEER-BENCH includes over 10,000 instruction-response pairs and 5,500 validated multiple-choice questions with corresponding silver labels to test alignment with diverse community norms. It systematically assesses how effectively LLMs understand community-specific instructions, their resilience to adversarial steering attempts, and their ability to accurately represent diverse cultural and ideological perspectives. Our evaluation of 13 popular LLMs using STEER-BENCH reveals that while human experts achieve an accuracy of 81% with silver labels, the best-performing models reach only around 65% accuracy depending on the domain and configuration. Some models lag behind human-level alignment by over 15 percentage points, highlighting significant gaps in community-sensitive steerability.
pdf
bib
abs
Combining Constrained and Unconstrained Decoding via Boosting: BoostCD and Its Application to Information Extraction
Marija Sakota
|
Robert West
Many recent approaches to structured NLP tasks use an autoregressive language model M to map unstructured input text x to output text y representing structured objects (such as tuples, lists, trees, code, etc.), where the desired output structure is enforced via constrained decoding. During training, these approaches do not require the model to be aware of the constraints, which are merely implicit in the training outputs y. This is advantageous as it allows for dynamic constraints without requiring retraining, but can lead to low-quality output during constrained decoding at test time. We overcome this problem with Boosted Constrained Decoding (BoostCD) which combines constrained and unconstrained decoding in two phases: Phase 1 decodes from the base model M twice, in constrained and unconstrained mode, obtaining two weak predictions. In phase 2, a learned autoregressive boosted model combines the two weak predictions into one final prediction. The mistakes made by the base model with vs. without constraints tend to be complementary, which the boosted model learns to exploit for improved performance. We demonstrate the power of BoostCD by applying it to closed information extraction. Our model, BoostIE, outperforms prior approaches both in and out of distribution, addressing several common errors identified in those approaches.
pdf
bib
abs
MultiLogicNMR(er): A Benchmark and Neural-Symbolic Framework for Non-monotonic Reasoning with Multiple Extensions
Yeliang Xiu
|
Yongmei Liu
Non-monotonic reasoning (NMR) refers to the fact that conclusions may be invalidated by new information. It is widely used in daily life and legal reasoning. An NMR task usually has multiple extensions, which are sets of plausible conclusions. There are two reasoning modes – skeptical and credulous reasoning, depending on whether to believe facts in all extensions or in any one extension. Despite some preliminary work exploring the NMR abilities of LLMs, the multi-extension NMR capabilities of LLMs remain underexplored. In this paper, we synthesize a multi-extension NMR dataset MultiLogicNMR, and construct two variants of the dataset with more extensions or text diversity. We propose a neural-symbolic framework MultiLogicNMRer for multi-extension NMR. Experimental evaluation with the datasets shows that LLMs still face significant challenges in NMR abilities, and reveals the effectiveness of our neural-symbolic framework, with an average accuracy gain of about 15% compared to prompt-based methods, even outperforming some fine-tuning methods. All code and data are publicly available.
pdf
bib
abs
Beyond Demographics: Enhancing Cultural Value Survey Simulation with Multi-Stage Personality-Driven Cognitive Reasoning
Haijiang Liu
|
Qiyuan Li
|
Chao Gao
|
Yong Cao
|
Xiangyu Xu
|
Xun Wu
|
Daniel Hershcovich
|
Jinguang Gu
We introduce **MARK**, the **M**ulti-st**A**ge **R**easoning framewor**K** for cultural value survey response simulation, designed to enhance the accuracy, steerability, and interpretability of large language models in this task. The system is inspired by the type dynamics theory in the MBTI psychological framework for personality research. It effectively predicts and utilizes human demographic information for simulation through three stages: life-situational stress analysis, group-level personality prediction, and self-weighted cognitive imitation. Experiments on the World Values Survey show that MARK outperforms existing baselines by 10% in accuracy and reduces the divergence between model predictions and human preferences. This highlights the potential of our framework to improve zero-shot personalization and help social scientists interpret model predictions.
pdf
bib
abs
CrystalICL: Enabling In-Context Learning for Crystal Generation
Ruobing Wang
|
Qiaoyu Tan
|
Yili Wang
|
Ying Wang
|
Xin Wang
Designing crystal materials with desired physicochemical properties remains a fundamental challenge in materials science. While large language models (LLMs) have demonstrated strong in-context learning (ICL) capabilities, existing LLM-based crystal generation approaches are limited to zero-shot scenarios and are unable to benefit from few-shot scenarios. In contrast, human experts typically design new materials by modifying relevant known structures which aligns closely with the few-shot ICL paradigm. Motivated by this, we propose CrystalICL, a novel model designed for few-shot crystal generation. Specifically, we introduce a space-group based crystal tokenization method, which effectively reduces the complexity of modeling crystal symmetry in LLMs. We further introduce a condition-structure aware hybrid instruction tuning framework and a multi-task instruction tuning strategy, enabling the model to better exploit ICL by capturing structure-property relationships from limited data. Extensive experiments on four crystal generation benchmarks demonstrate the superiority of CrystalICL over the leading baseline methods on conditional and unconditional generation tasks.
pdf
bib
abs
Towards a Unified Paradigm of Concept Editing in Large Language Models
Zhuowen Han
|
Xinwei Wu
|
Dan Shi
|
Renren Jin
|
Deyi Xiong
Concept editing aims to control specific concepts in large language models (LLMs) and is an emerging subfield of model editing. Despite the emergence of various editing methods in recent years, there remains a lack of rigorous theoretical analysis and a unified perspective to systematically understand and compare these methods. To address this gap, we propose a unified paradigm for concept editing methods, in which all forms of conceptual injection are aligned at the neuron level. We study four representative concept editing methods: Neuron Editing (NE), Supervised Fine-tuning (SFT), Sparse Autoencoder (SAE), and Steering Vector (SV). Then we categorize them into two classes based on their mode of conceptual information injection: indirect (NE, SFT) and direct (SAE, SV). We evaluate the above methods along four dimensions: editing reliability, output generalization, neuron-level consistency, and mathematical formalization. Experiments show that SAE achieves the best editing reliability. In output generalization, SAE captures features closer to human-understood concepts, while NE tends to locate text patterns rather than true semantics. Neuron-level analysis reveals that direct methods share high neuron overlap, as do indirect methods, indicating methodological commonality within each category. Our unified paradigm offers a clear framework and valuable insights for advancing interpretability and controlled generation in LLMs.
pdf
bib
abs
Step-level Verifier-guided Hybrid Test-Time Scaling for Large Language Models
Kaiyan Chang
|
Yonghao Shi
|
Chenglong Wang
|
Hang Zhou
|
Chi Hu
|
Xiaoqian Liu
|
Yingfeng Luo
|
Yuan Ge
|
Tong Xiao
|
JingBo Zhu
Test-Time Scaling (TTS) is a promising approach to progressively elicit the model’s intelligence during inference. Recently, training-based TTS methods, such as continued reinforcement learning (RL), have further surged in popularity, while training-free TTS methods are gradually fading from prominence. However, the additional computation overhead of training amplifies the burden on test-time scaling. In this paper, we focus on training-free TTS methods for reasoning. We first design Conditional Step-level Self-refinement, a fine-grained sequential scaling method guided by process verification. On top of its effectiveness, we further combine it with other classical parallel scaling methods at the step level, to introduce a novel inference paradigm called Hybrid Test-Time Scaling. Extensive experiments on five instruction-tuned LLMs across different scales (3B-14B) and families demonstrate that a hybrid strategy incorporating various training-free TTS methods at a fine granularity has considerable potential for expanding the reasoning performance boundaries of LLMs.
pdf
bib
abs
Dynamic Expert Specialization: Towards Catastrophic Forgetting-Free Multi-Domain MoE Adaptation
Junzhuo Li
|
Bo Wang
|
Xiuze Zhou
|
Xuming Hu
Mixture-of-Experts (MoE) models offer immense capacity via sparsely gated expert subnetworks, yet adapting them to multiple domains without catastrophic forgetting remains an open challenge. Existing approaches either incur prohibitive computation, suffer cross-domain interference, or require separate runs per domain. We propose DES-MoE, a dynamic expert specialization framework for multi-domain adaptation of Mixture-of-Experts models. DES-MoE addresses catastrophic forgetting through three innovations: (1) an adaptive router balancing pre-trained knowledge retention and task-specific updates via distillation, (2) real-time expert-domain correlation mapping to isolate domain-specific gradients, and (3) a three-phase adaptive fine-tuning schedule that progressively freezes non-specialized parameters. Evaluated on six domains (math, code, law, etc.), DES-MoE matches single-domain ESFT performance while training one unified model, reduces forgetting by 89% compared to full fine-tuning as domains scale from 2 to 6, and achieves 68% faster convergence than conventional methods. Our work establishes dynamic expert isolation as a scalable paradigm for multi-task MoE adaptation.
pdf
bib
abs
RRInf: Efficient Influence Function Estimation via Ridge Regression for Large Language Models and Text-to-Image Diffusion Models
Zhuozhuo Tu
|
Cheng Chen
|
Yuxuan Du
The quality of data plays a vital role in the development of Large-scale Generative Models. Understanding how important a data point is for a generative model is essential for explaining its behavior and improving the performance. The influence function provides a framework for quantifying the impact of individual training data on model predictions. However, the high computational cost has hindered their applicability in large-scale applications. In this work, we present RRInf, a novel and principled method for estimating influence function in large-scale generative AI models. We show that influence function estimation can be transformed into a ridge regression problem. Based on this insight, we develop an algorithm that is efficient and scalable to large models. Experiments on noisy data detection and influential data identification tasks demonstrate that RRInf outperforms existing methods in terms of both efficiency and effectiveness for commonly used large models: RoBERTa-large, Llama-2-13B-chat, Llama-3-8B and stable-diffusion-v1.5.
pdf
bib
abs
Evaluating Spatiotemporal Consistency in Automatically Generated Sewing Instructions
Luisa Geiger
|
Mareike Hartmann
|
Michael Sullivan
|
Alexander Koller
In this paper, we propose a novel, automatic tree-based evaluation metric for LLM-generated step-by-step assembly instructions that more accurately reflects spatiotemporal aspects of construction than traditional metrics such as BLEU and BERT similarity scores. We apply our proposed metric to the domain of sewing instructions, and show that our metric better correlates with manually-annotated error counts, demonstrating our metric’s superiority for evaluating the spatiotemporal soundness of sewing instructions. Further experiments show that our metric is more robust than traditional approaches against artificially-constructed counterfactual examples specifically designed to confound metrics that rely on textual similarity.
pdf
bib
abs
MaZO: Masked Zeroth-Order Optimization for Multi-Task Fine-Tuning of Large Language Models
Zhen Zhang
|
Yifan Yang
|
Kai Zhen
|
Nathan Susanj
|
Athanasios Mouchtaris
|
Siegfried Kunzmann
|
Zheng Zhang
Large language models have demonstrated exceptional capabilities across diverse tasks, but their fine-tuning demands significant memory, posing challenges for resource-constrained environments. Zeroth-order (ZO) optimization provides a memory-efficient alternative by eliminating the need for backpropagation. However, ZO optimization suffers from high gradient variance, and prior research has largely focused on single-task learning, leaving its application to multi-task learning unexplored. Multi-task learning is crucial for leveraging shared knowledge across tasks to improve generalization, yet it introduces unique challenges under ZO settings, such as amplified gradient variance and collinearity. In this paper, we present MaZO, the first framework specifically designed for multi-task LLM fine-tuning under ZO optimization. MaZO tackles these challenges at the parameter level through two key innovations: a weight importance metric to identify critical parameters and a multi-task weight update mask to selectively update these parameters, reducing the dimensionality of the parameter space and mitigating task conflicts. Experiments demonstrate that MaZO achieves state-of-the-art performance, surpassing even multi-task learning methods designed for first-order optimization.
pdf
bib
abs
Procedural Environment Generation for Tool-Use Agents
Michael Sullivan
|
Mareike Hartmann
|
Alexander Koller
Although the power of LLM tool-use agents has ignited a flurry of recent research in this area, the curation of tool-use training data remains an open problem—especially for online RL training. Existing approaches to synthetic tool-use data generation tend to be non-interactive and/or non-compositional. We introduce RandomWorld, a pipeline for the procedural generation of interactive tools and compositional tool-use data. We show that models tuned via SFT and RL on synthetic RandomWorld data improve on a range of tool-use benchmarks, and set the new SoTA for two metrics on the NESTFUL dataset. Further experiments show that downstream performance scales with the amount of RandomWorld-generated training data, opening up the possibility of further improvement through the use of entirely synthetic data.
pdf
bib
abs
FacLens: Transferable Probe for Foreseeing Non-Factuality in Fact-Seeking Question Answering of Large Language Models
Yanling Wang
|
Haoyang Li
|
Hao Zou
|
Jing Zhang
|
Xinlei He
|
Qi Li
|
Ke Xu
Despite advancements in large language models (LLMs), non-factual responses still persist in fact-seeking question answering. Unlike extensive studies on post-hoc detection of these responses, this work studies non-factuality prediction (NFP), predicting whether an LLM will generate a non-factual response prior to the response generation. Previous NFP methods have shown LLMs’ awareness of their knowledge, but they face challenges in terms of efficiency and transferability. In this work, we propose a lightweight model named Factuality Lens (FacLens), which effectively probes hidden representations of fact-seeking questions for the NFP task. Moreover, we discover that hidden question representations sourced from different LLMs exhibit similar NFP patterns, enabling the transferability of FacLens across different LLMs to reduce development costs. Extensive experiments highlight FacLens’s superiority in both effectiveness and efficiency.
pdf
bib
abs
OMS: On-the-fly, Multi-Objective, Self-Reflective Ad Keyword Generation via LLM Agent
Bowen Chen
|
Zhao Wang
|
Shingo Takamatsu
Keyword decision in Sponsored Search Advertising is critical to the success of ad campaigns. While LLM-based methods offer automated keyword generation, they face three major limitations: reliance on large-scale query–keyword pair data, lack of online multi-objective performance monitoring and optimization, and weak quality control in keyword selection. These issues hinder the agentic use of LLMs in fully automating keyword decisions by monitoring and reasoning over key performance indicators such as impressions, clicks, conversions, and CTA effectiveness. To overcome these challenges, we propose OMS, a keyword generation framework that is On-the-fly (requires no training data, monitors online performance, and adapts accordingly), Multi-objective (employs agentic reasoning to optimize keywords based on multiple performance metrics), and Self-reflective (agentically evaluates keyword quality). Experiments on benchmarks and real-world ad campaigns show that OMS outperforms existing methods; ablation and human evaluations confirm the effectiveness of each component and the quality of generated keywords.
pdf
bib
abs
Med-VRAgent: A Framework for Medical Visual Reasoning-Enhanced Agents
Guangfu Guo
|
Xiaoqian Lu
|
Yue Feng
Vision-language models (VLMs) achieve promising results in medical reasoning but struggle with hallucinations, vague descriptions, inconsistent logic, and poor localization. To address this, we propose an agent framework named Medical Visual Reasoning Agent (Med-VRAgent). The approach is based on Visual Guidance and Self-Reward paradigms and Monte Carlo Tree Search (MCTS). By combining the Visual Guidance with tree search, Med-VRAgent improves the medical visual reasoning capabilities of VLMs. We use the trajectories collected by Med-VRAgent as feedback to further improve the performance by fine-tuning the VLMs with the proximal policy optimization (PPO) objective. Experiments on multiple medical VQA benchmarks demonstrate that our method outperforms existing approaches.
pdf
bib
abs
TrojanWave: Exploiting Prompt Learning for Stealthy Backdoor Attacks on Large Audio-Language Models
Asif Hanif
|
Maha Tufail Agro
|
Fahad Shamshad
|
Karthik Nandakumar
Prompt learning has emerged as an efficient alternative to full fine-tuning for adapting large audio-language models (ALMs) to downstream tasks. While this paradigm enables scalable deployment via Prompt-as-a-Service frameworks, it also introduces a critical yet underexplored security risk of backdoor attacks. In this work, we present TrojanWave, the first backdoor attack tailored to the prompt-learning setting in frozen ALMs. Unlike prior audio backdoor methods that require training from scratch on full datasets, TrojanWave injects backdoors solely through learnable prompts, making it highly scalable and effective in few-shot settings. TrojanWave injects imperceptible audio triggers in both time and spectral domains to effectively induce targeted misclassification during inference. To mitigate this threat, we further propose TrojanWave-Defense, a lightweight prompt purification method that neutralizes malicious prompts without hampering the clean performance. Extensive experiments across 11 diverse audio classification benchmarks demonstrate the robustness and practicality of both the attack and defense. Our code is publicly available at https://asif-hanif.github.io/trojanwave/.
pdf
bib
abs
Can LLMs be Literary Companions?: Analysing LLMs on Bengali Figures of Speech Identification
Sourav Das
|
Kripabandhu Ghosh
Despite Bengali being among the most spoken languages, bearing cultural importance and richness, NLP endeavors on it remain relatively limited. Figures of Speech (FoS) not only contribute to the phonetic and semantic nuances of a language, but they also exhibit aesthetics, expression, and creativity in literature. To our knowledge, in this paper, we present the first ever Bengali figures of speech classification dataset, **BengFoS**, on works of six renowned poets of Bengali literature. We deploy state-of-the-art Large Language Models (LLMs) on this dataset in the zero-shot setup, thereafter fine-tune the best-performing models, and finally dissect them for language model probing. This reveals novel insights on the intrinsic behavior of two open-source LLMs (Llama and DeepSeek) in FoS detection. **Though we have limited ourselves to Bengali, the experimental framework can be reproduced for English as well as for other low-resource languages**.
pdf
bib
abs
Group-SAE: Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups
Davide Ghilardi
|
Federico Belotti
|
Marco Molinari
|
Tao Ma
|
Matteo Palmonari
Sparse AutoEncoders (SAEs) have recently been employed as a promising unsupervised approach for understanding the representations of layers of Large Language Models (LLMs). However, with the growth in model size and complexity, training SAEs is computationally intensive, as typically one SAE is trained for each model layer. To address this limitation, we propose Group-SAE, a novel strategy to train SAEs. Our method considers the similarity of the residual stream representations between contiguous layers to group similar layers and train a single SAE per group. To balance the trade-off between efficiency and performance, we further introduce AMAD (Average Maximum Angular Distance), an empirical metric that guides the selection of an optimal number of groups based on representational similarity across layers. Experiments on models from the Pythia family show that our approach significantly accelerates training with minimal impact on reconstruction quality, while maintaining downstream task performance and interpretability comparable to baseline SAEs trained layer by layer. This method provides an efficient and scalable strategy for training SAEs in modern LLMs.
pdf
bib
abs
Retrieval over Classification: Integrating Relation Semantics for Multimodal Relation Extraction
Lei Hei
|
Tingjing Liao
|
Peiyingxin
|
Yiyang Qi
|
Jiaqi Wang
|
Ruiting Li
|
Feiliang Ren
Relation extraction (RE) aims to identify semantic relations between entities in unstructured text. Although recent work extends traditional RE to multimodal scenarios, most approaches still adopt classification-based paradigms with fused multimodal features, representing relations as discrete labels. This paradigm has two significant limitations: (1) it overlooks structural constraints like entity types and positional cues, and (2) it lacks semantic expressiveness for fine-grained relation understanding. We propose **R**etrieval **O**ver **C**lassification (ROC), a novel framework that reformulates multimodal RE as a retrieval task driven by relation semantics. ROC integrates entity type and positional information through a multimodal encoder, expands relation labels into natural language descriptions using a large language model, and aligns entity-relation pairs via semantic similarity-based contrastive learning. Experiments show that our method achieves state-of-the-art performance on the benchmark datasets MNRE and MORE and exhibits stronger robustness and interpretability.
pdf
bib
abs
PunMemeCN: A Benchmark to Explore Vision-Language Models’ Understanding of Chinese Pun Memes
Zhijun Xu
|
Siyu Yuan
|
Yiqiao Zhang
|
Jingyu Sun
|
Tong Zheng
|
Deqing Yang
Pun memes, which combine wordplay with visual elements, represent a popular form of humor in Chinese online communications. Despite their prevalence, current Vision-Language Models (VLMs) lack systematic evaluation in understanding and applying these culturally-specific multimodal expressions. In this paper, we introduce PunMemeCN, a novel benchmark designed to assess VLMs’ capabilities in processing Chinese pun memes across three progressive tasks: pun meme detection, sentiment analysis, and chat-driven meme response. PunMemeCN consists of 1,959 Chinese memes (653 pun memes and 1,306 non-pun memes) with comprehensive annotations of punchlines, sentiments, and explanations, alongside 2,008 multi-turn chat conversations incorporating these memes. Our experiments indicate that state-of-the-art VLMs struggle with Chinese pun memes, particularly with homophone wordplay, even with Chain-of-Thought prompting. Notably, punchlines in memes can effectively conceal potentially harmful content from AI detection. These findings underscore the challenges in cross-cultural multimodal understanding and highlight the need for culture-specific approaches to humor comprehension in AI systems.
pdf
bib
abs
UltraIF: Advancing Instruction Following from the Wild
Kaikai An
|
Li Sheng
|
Ganqu Cui
|
Shuzheng Si
|
Ning Ding
|
Yu Cheng
|
Baobao Chang
Instruction-following made modern large language models (LLMs) helpful assistants. However, the key to taming LLMs on complex instructions remains mysterious, as there are huge gaps between models trained by the open-source community and those trained by leading companies. To bridge the gap, we propose a simple and scalable approach UltraIF for building LLMs that can follow complex instructions with open-source data. UltraIF first decomposes real-world user prompts into simpler queries, constraints, and corresponding evaluation questions for the constraints. Then, we train an UltraComposer to compose constraint-associated prompts with evaluation questions. This prompt composer allows us to synthesize complicated instructions as well as filter responses with evaluation questions. In our experiment, for the first time, we successfully align LLaMA-3.1-8B-Base to catch up with its instruct version on 5 instruction-following benchmarks without any benchmark information, using only an 8B model as response generator and evaluator. The aligned model also achieved competitive scores on other benchmarks. Moreover, we also show that UltraIF could further improve LLaMA-3.1-8B-Instruct through self-alignment, motivating broader use cases for the method.
pdf
bib
abs
Identifying Pre-training Data in LLMs: A Neuron Activation-Based Detection Framework
Hongyi Tang
|
Zhihao Zhu
|
Yi Yang
The performance of large language models (LLMs) is closely tied to their training data, which can include copyrighted material or private information, raising legal and ethical concerns. Additionally, LLMs face criticism for dataset contamination and internalizing biases. To address these issues, the Pre-Training Data Detection (PDD) task was proposed to identify if specific data was included in an LLM’s pre-training corpus. However, existing PDD methods often rely on superficial features like prediction confidence and loss, resulting in mediocre performance. To improve this, we introduce NA-PDD, a novel algorithm analyzing differential neuron activation patterns between training and non-training data in LLMs. This is based on the observation that these data types activate different neurons during LLM inference. We also introduce CCNewsPDD, a temporally unbiased benchmark employing rigorous data transformations to ensure consistent time distributions between training and non-training data. Our experiments demonstrate that NA-PDD significantly outperforms existing methods across three benchmarks and multiple LLMs.
pdf
bib
abs
TreeRare: Syntax Tree-Guided Retrieval and Reasoning for Knowledge-Intensive Question Answering
Boyi Zhang
|
Zhuo Liu
|
Hangfeng He
In real practice, questions are typically complex and knowledge-intensive, requiring Large Language Models (LLMs) to recognize the multifaceted nature of the question and reason across multiple information sources. Iterative and adaptive retrieval, where LLMs decide when and what to retrieve based on their reasoning, has been shown to be a promising approach to resolve complex, knowledge-intensive questions. However, the performance of such retrieval frameworks is limited by the accumulation of reasoning errors and misaligned retrieval results. To overcome these limitations, we propose TreeRare (Syntax Tree-Guided Retrieval and Reasoning), a framework that utilizes syntax trees to guide information retrieval and reasoning for question answering. Following the principle of compositionality, TreeRare traverses the syntax tree in a bottom-up fashion, and at each node, it generates subcomponent-based queries and retrieves relevant passages to resolve localized uncertainty. A subcomponent question answering module then synthesizes these passages into concise, context-aware evidence. Finally, TreeRare aggregates the evidence across the tree to form a final answer. Experiments across five question answering datasets involving ambiguous or multi-hop reasoning demonstrate that TreeRare achieves substantial improvements over existing state-of-the-art methods.
pdf
bib
abs
Mapping Toxic Comments Across Demographics: A Dataset from German Public Broadcasting
Jan Fillies
|
Michael Peter Hoffmann
|
Rebecca Reichel
|
Roman Salzwedel
|
Sven Bodemer
|
Adrian Paschke
A lack of demographic context in existing toxic speech datasets limits our understanding of how different age groups communicate online. In collaboration with funk, a German public service content network, this research introduces the first large-scale German dataset annotated for toxicity and enriched with platform-provided age estimates. The dataset includes 3,024 human-annotated and 30,024 LLM-annotated anonymized comments from Instagram, TikTok, and YouTube. To ensure relevance, comments were consolidated using predefined toxic keywords, resulting in 16.7% labeled as problematic. The annotation pipeline combined human expertise with state-of-the-art language models, identifying key categories such as insults, disinformation, and criticism of broadcasting fees. The dataset reveals age-based differences in toxic speech patterns, with younger users favoring expressive language and older users more often engaging in disinformation and devaluation. This resource provides new opportunities for studying linguistic variation across demographics and supports the development of more equitable and age-aware content moderation systems.
pdf
bib
abs
Small Models, Big Results: Achieving Superior Intent Extraction through Decomposition
Danielle Cohen
|
Yoni Halpern
|
Noam Kahlon
|
Joel Oren
|
Omri Berkovitch
|
Sapir Caduri
|
Ido Dagan
|
Anatoly Efros
Understanding user intents from UI interaction trajectories remains a challenging, yet crucial, frontier in intelligent agent development. While massive, datacenter-based, multi-modal large language models (MLLMs) possess greater capacity to handle the complexities of such sequences, smaller models, which can run on-device to provide a privacy-preserving, low-cost, and low-latency user experience, struggle with accurate intent inference. We address these limitations by introducing a novel decomposed approach: first, we perform structured interaction summarization, capturing key information from each user action. Second, we perform intent extraction using a fine-tuned model operating on the aggregated summaries. This method improves intent understanding in resource-constrained models, even surpassing the base performance of large MLLMs.
pdf
bib
abs
On Pruning State-Space LLMs
Tamer Ghattas
|
Michael Hassid
|
Roy Schwartz
Recent work proposed state-space models (SSMs) as an efficient alternative to transformer-based LLMs. Can these models be pruned to further reduce their computation costs? We adapt several pruning methods to the SSM structure and apply them to four SSM-based LLMs across multiple tasks. We find that such models are quite robust to some pruning methods (e.g., WANDA), while other methods lead to rapid performance degradation.
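Since the paper applies WANDA among other methods, the following sketch shows how WANDA-style pruning scores one linear projection (for example, an SSM block's input projection); the sparsity level, layer choice, and calibration data are illustrative assumptions.

```python
# Minimal sketch of WANDA-style unstructured pruning on a single linear projection.
# Not the paper's code; random tensors stand in for real weights and calibration inputs.
import torch

def wanda_prune(weight: torch.Tensor, calib_inputs: torch.Tensor, sparsity: float = 0.5):
    """weight: (out_features, in_features); calib_inputs: (n_tokens, in_features)."""
    act_norm = calib_inputs.norm(p=2, dim=0)            # per-input-feature activation norm
    scores = weight.abs() * act_norm.unsqueeze(0)        # WANDA importance per weight
    k = int(weight.shape[1] * sparsity)
    # Remove the lowest-scoring weights within each output row (per-output comparison group).
    idx = torch.argsort(scores, dim=1)[:, :k]
    mask = torch.ones_like(weight, dtype=torch.bool)
    rows = torch.arange(weight.shape[0]).unsqueeze(1)
    mask[rows, idx] = False
    return weight * mask

W = torch.randn(256, 512)        # dummy projection weight
X = torch.randn(1024, 512)       # dummy calibration activations
W_pruned = wanda_prune(W, X, sparsity=0.5)
print((W_pruned == 0).float().mean())  # ~0.5 sparsity
```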
pdf
bib
abs
An Orthogonal High-Rank Adaptation for Large Language Models
Xin Zhang
|
Guang-Ze Chen
|
Shuzhen Li
|
Zhulin Liu
|
C.L.Philip Chen
|
Tong Zhang
Low-rank adaptation (LoRA) efficiently adapts LLMs to downstream tasks by decomposing LLMs’ weight updates into trainable low-rank matrices for fine-tuning. However, the random low-rank matrices may introduce massive task-irrelevant information, while their recomposed form suffers from limited representation spaces under low-rank operations. Such dense and choked adaptation in LoRA impairs the adaptation performance of LLMs on downstream tasks. To address these challenges, this paper proposes OHoRA, an orthogonal high-rank adaptation for parameter-efficient fine-tuning of LLMs. According to the information bottleneck theory, OHoRA decomposes LLMs’ pre-trained weight matrices into orthogonal basis vectors via QR decomposition and splits them into two low-redundancy high-rank components to suppress task-irrelevant information. It then performs dynamic rank-elevated recomposition through the Kronecker product to generate expansive task-tailored representation spaces, enabling precise LLM adaptation and enhanced generalization. OHoRA thus operationalizes the information bottleneck theory by decomposing LLMs’ weight matrices into low-redundancy high-rank components and recomposing them in a rank-elevated manner, yielding more task-tailored representation spaces and more precise LLM adaptation. Empirical evaluation shows OHoRA’s effectiveness: it outperforms LoRA and its variants and achieves performance comparable to full fine-tuning with only 0.0371% trainable parameters.
pdf
bib
abs
BSFA: Leveraging the Subspace Dichotomy to Accelerate Neural Network Training
WenJie Zhou
|
Bohan Wang
|
Wei Chen
|
Xueqi Cheng
Recent studies (CITATION) highlight a fundamental dichotomy in deep learning optimization: although parameter updates along the top eigendirections of the loss Hessian (Dom-space) capture most of the update magnitude, they often contribute minimally to loss reduction. In contrast, updates in the orthogonal component (Bulk-space) have smaller magnitudes but drive most learning progress. In this work, we further advance the understanding of this phenomenon and introduce the Bulk-Space-Filtration-Accelerator (BSFA), a novel plug-and-play framework. BSFA accelerates training by differentially scaling update components projected onto these distinct subspaces, simultaneously enhancing stability by moderating updates in the dominant subspace and boosting convergence speed by amplifying those in the bulk-space. To ensure BSFA is both practical and scalable for contemporary large models, we introduce two key innovations: an efficient estimator using Principal Component Analysis (PCA) on historical updates for fast subspace estimation, and a block-wise strategy that applies this estimation on a per-parameter-block basis. These designs make BSFA computationally tractable and highly effective. We demonstrate BSFA’s acceleration across various tasks, notably achieving approximately 2× speedup when pre-training LLaMA-72M on WikiText-103 and LLaMA-134M on OpenWebText compared to vanilla AdamW.
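A minimal sketch of the subspace split at the heart of BSFA, assuming PCA over a buffer of recent flattened updates; the number of principal directions, the scaling factors, and the history length are illustrative, not the paper's settings.

```python
# Hedged sketch of dominant/bulk subspace splitting and rescaling (not the BSFA code).
import torch

def split_and_rescale(update, history, k=4, dom_scale=0.5, bulk_scale=2.0):
    """update: (d,) current update; history: (t, d) stack of recent flattened updates."""
    # Top-k principal directions of the update history approximate the dominant subspace.
    _, _, Vh = torch.linalg.svd(history - history.mean(0), full_matrices=False)
    basis = Vh[:k]                              # (k, d) orthonormal rows
    dom = basis.T @ (basis @ update)            # component in the dominant subspace
    bulk = update - dom                         # orthogonal (bulk) component
    return dom_scale * dom + bulk_scale * bulk  # damp Dom-space, amplify Bulk-space

g = torch.randn(1000)          # dummy flattened gradient/update
hist = torch.randn(32, 1000)   # dummy update history
print(split_and_rescale(g, hist).shape)
```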
pdf
bib
abs
Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation
Noy Sternlicht
|
Ariel Gera
|
Roy Bar-Haim
|
Tom Hope
|
Noam Slonim
We introduce Debate Speech Evaluation as a novel and challenging benchmark for assessing LLM judges. Evaluating debate speeches requires a deep understanding of the speech at multiple levels, including argument strength and relevance, the coherence and organization of the speech, the appropriateness of its style and tone, and so on. This task involves a unique set of cognitive abilities that previously received limited attention in systematic LLM benchmarking. To explore such skills, we leverage a dataset of over 600 meticulously annotated debate speeches and present the first in-depth analysis of how state-of-the-art LLMs compare to human judges on this task. Our findings reveal a nuanced picture: while larger models can approximate individual human judgments in some respects, they differ substantially in their overall judgment behavior. We also investigate the ability of frontier LLMs to generate persuasive, opinionated speeches, showing that models may perform at a human level on this task.
pdf
bib
abs
METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding
Mengyue Wang
|
Shuo Chen
|
Kristian Kersting
|
Volker Tresp
|
Yunpu Ma
Recent advances in Video Large Language Models (VLLMs) have significantly enhanced their ability to understand video content. Nonetheless, processing long videos remains challenging due to high computational demands and the redundancy present in the visual data. In this work, we propose METok, a training-free, Multi-stage Event-based Token compression framework designed to accelerate VLLMs’ inference while preserving accuracy. METok progressively eliminates redundant visual tokens across three critical stages: (1) event-aware compression during vision encoding, (2) hierarchical token pruning in the prefilling stage based on semantic alignment and event importance, and (3) a decoding-stage KV Cache optimization that further reduces memory consumption. Our experiments on diverse video benchmarks demonstrate that METok achieves an optimal trade-off between efficiency and accuracy by dynamically selecting informative visual tokens. For instance, equipping LongVA-7B with METok realizes an 80.6% FLOPs reduction and 93.5% KV Cache memory savings, all while maintaining comparable or even superior accuracy.
pdf
bib
abs
VisiPruner: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs
Yingqi Fan
|
Anhao Zhao
|
Jinlan Fu
|
Junlong Tong
|
Hui Su
|
Yijie Pan
|
Wei Zhang
|
Xiaoyu Shen
Multimodal Large Language Models (MLLMs) have achieved strong performance across vision-language tasks, but suffer from significant computational overhead due to the quadratic growth of attention computations with the number of multimodal tokens. Though efforts have been made to prune tokens in MLLMs, *they lack a fundamental understanding of how MLLMs process and fuse multimodal information*. Through systematic analysis, we uncover a three-stage cross-modal interaction process: (1) Shallow layers recognize task intent, with visual tokens acting as passive attention sinks; (2) Cross-modal fusion occurs abruptly in middle layers, driven by a few critical visual tokens; (3) Deep layers discard vision tokens, focusing solely on linguistic refinement. Based on these findings, we propose *VisiPruner*, a training-free pruning framework that reduces **99.9%** of vision-related attention computations and **62.8%** of FLOPs while maintaining performance. It significantly outperforms existing token pruning methods and generalizes across diverse MLLMs. Beyond pruning, our insights further provide actionable guidelines for training efficient MLLMs by aligning model architecture with its intrinsic layer-wise processing dynamics.
pdf
bib
abs
Beyond Static Testbeds: An Interaction-Centric Agent Simulation Platform for Dynamic Recommender Systems
Song Jin
|
Juntian Zhang
|
Yuhan Liu
|
Xun Zhang
|
Yufei Zhang
|
Guojun Yin
|
Fei Jiang
|
Wei Lin
|
Rui Yan
Evaluating and iterating upon recommender systems is crucial, yet traditional A/B testing is resource-intensive, and offline methods struggle with dynamic user-platform interactions. While agent-based simulation is promising, existing platforms often lack a mechanism for user actions to dynamically reshape the environment. To bridge this gap, we introduce RecInter, a novel agent-based simulation platform for recommender systems featuring a robust interaction mechanism. In the RecInter platform, simulated user actions (e.g., likes, reviews, purchases) dynamically update item attributes in real time, and newly introduced Merchant Agents can reply, fostering a more realistic and evolving ecosystem. High-fidelity simulation is ensured through a Multidimensional User Profiling module, an Advanced Agent Architecture, and an LLM fine-tuned on Chain-of-Thought (CoT) enriched interaction data. Our platform achieves significantly improved simulation credibility and successfully replicates emergent phenomena like Brand Loyalty and the Matthew Effect. Experiments demonstrate that this interaction mechanism is pivotal for simulating realistic system evolution, establishing our platform as a credible testbed for recommender systems research. All code is released at https://github.com/jinsong8/RecInter.
pdf
bib
abs
SheetDesigner: MLLM-Powered Spreadsheet Layout Generation with Rule-Based and Vision-Based Reflection
Qin Chen
|
Yuanyi Ren
|
Xiaojun Ma
|
Mugeng Liu
|
Shi Han
|
Dongmei Zhang
Spreadsheets are critical to data-centric tasks, with rich, structured layouts that enable efficient information transmission. Given the time and expertise required for manual spreadsheet layout design, there is an urgent need for automated solutions. However, existing automated layout models are ill-suited to spreadsheets, as they often (1) treat components as axis-aligned rectangles with continuous coordinates, overlooking the inherently discrete, grid-based structure of spreadsheets; and (2) neglect interrelated semantics, such as data dependencies and contextual links, unique to spreadsheets. In this paper, we first formalize the spreadsheet layout generation task, supported by a seven-criterion evaluation protocol and a dataset of 3,326 spreadsheets. We then introduce SheetDesigner, a zero-shot and training-free framework using Multimodal Large Language Models (MLLMs) that combines rule-based and vision-based reflection for component placement and content population. SheetDesigner outperforms five baselines by at least 22.6%. We further find that through the vision modality, MLLMs handle overlap and balance well but struggle with alignment, necessitating hybrid rule-based and visual reflection strategies. Our code and data are available on GitHub.
pdf
bib
abs
CAIR: Counterfactual-based Agent Influence Ranker for Agentic AI Workflows
Amit Giloni
|
Chiara Picardi
|
Roy Betser
|
Shamik Bose
|
Aishvariya Priya Rathina Sabapathy
|
Roman Vainshtein
An Agentic AI Workflow (AAW), also known as an LLM-based multi-agent system, is an autonomous system that assembles several LLM-based agents to work collaboratively towards a shared goal. The high autonomy, widespread adoption, and growing interest in such AAWs highlight the need for a deeper understanding of their operations, from both quality and security aspects. To date, there are no existing methods to assess the influence of each agent on the AAW’s final output. Adopting techniques from related fields is not feasible since existing methods perform only static structural analysis, which is unsuitable for inference-time execution. We present Counterfactual-based Agent Influence Ranker (CAIR) - the first method for assessing the influence level of each agent on the AAW’s output and determining which agents are the most influential. By performing counterfactual analysis, CAIR provides a task-agnostic analysis that can be used both offline and at inference time. We evaluate CAIR using a dataset of AAWs that we created, containing 30 different use cases with 230 different functionalities. Our evaluation showed that CAIR produces consistent rankings, outperforms baseline methods, and can easily enhance the effectiveness and relevancy of downstream tasks.
pdf
bib
abs
ReSURE: Regularizing Supervision Unreliability for Multi-turn Dialogue Fine-tuning
Yiming Du
|
Yifan Xiang
|
Bin Liang
|
Dahua Lin
|
Kam-Fai Wong
|
Fei Tan
Fine-tuning multi-turn dialogue systems requires high-quality supervision but often suffers from degraded performance when exposed to low-quality data. Supervision errors in early turns can propagate across subsequent turns, undermining coherence and response quality. Existing methods typically address data quality via static prefiltering, which decouples quality control from training and fails to mitigate turn-level error propagation. In this context, we propose **ReSURE** (REgularizing Supervision UnREliability), an adaptive learning method that dynamically down-weights unreliable supervision without explicit filtering. ReSURE estimates per-turn loss distributions using Welford’s online statistics and reweights sample losses on the fly accordingly. Experiments on both single-source and mixed-quality datasets show improved stability and response quality. Notably, ReSURE enjoys positive Spearman correlations (0.21 ~ 1.0 across multiple benchmarks) between response scores and number of samples regardless of data quality, which potentially paves the way for utilizing large-scale data effectively.
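The sketch below illustrates the kind of per-turn reweighting ReSURE describes, using Welford's online mean and variance; the z-score-based weighting function is an assumed example, not necessarily the paper's exact formula.

```python
# Hedged sketch of per-turn loss reweighting with Welford's online statistics
# (in the spirit of ReSURE, not the authors' code).
import math

class WelfordTracker:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
    @property
    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

def reweighted_loss(turn_losses, trackers):
    """Down-weight turns whose loss is anomalously high relative to that turn's running stats."""
    total = 0.0
    for turn_idx, loss in enumerate(turn_losses):
        t = trackers.setdefault(turn_idx, WelfordTracker())
        t.update(loss)
        z = (loss - t.mean) / (t.std + 1e-6)
        weight = 1.0 / (1.0 + max(z, 0.0))   # unreliable (high-loss) turns get smaller weight
        total += weight * loss
    return total / len(turn_losses)

trackers = {}
print(reweighted_loss([0.9, 2.5, 1.1], trackers))
```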
pdf
bib
abs
Precise In-Parameter Concept Erasure in Large Language Models
Yoav Gur-Arieh
|
Clara Haya Suslik
|
Yihuai Hong
|
Fazl Barez
|
Mor Geva
Large language models (LLMs) often acquire knowledge during pretraining that is undesirable in downstream deployments, e.g., sensitive information or copyrighted content. Existing approaches for removing such knowledge rely on fine-tuning, training low-rank adapters or fact-level editing, but these are either too coarse, too shallow, or ineffective. In this work, we propose PISCES, a novel framework for precisely erasing entire concepts from model parameters by directly editing directions that encode them in parameter space. PISCES uses a disentangler model to decompose MLP vectors into interpretable features, identifies those associated with a target concept using automated interpretability techniques, and removes them from model parameters. Experiments on Gemma 2 and Llama 3.1 over various concepts show that PISCES achieves modest gains in efficacy over leading erasure methods, reducing accuracy on the target concept to as low as 7.7%, while dramatically improving erasure specificity (by up to 31%) and robustness (by up to 41%). Overall, these results demonstrate that feature-based in-parameter editing enables a more precise and reliable approach for removing conceptual knowledge in language models.
pdf
bib
abs
PhonoThink: Improving Large Language Models’ Reasoning on Chinese Phonological Ambiguities
Jianfei Ma
|
Zhaoxin Feng
|
Emmanuele Chersoni
|
Huacheng Song
|
Ziqi Zhang
Effectively resolving phonological ambiguities is crucial for robust natural language processing, as these ambiguities are pervasive in tasks ranging from speech-to-text and spelling correction to offensive language detection. However, current Large Language Models (LLMs) frequently struggle to resolve such ambiguities. To address this challenge, we present a framework that enhances LLMs’ phonological capability through a multiple-stage training approach. Our method begins with supervised fine-tuning on well-constructed datasets, including three subtask datasets designed to enhance the model’s foundational phonological knowledge, along with a synthetic dataset of step-by-step reasoning chains. Following this, we apply reinforcement learning to incentivize and stabilize its reasoning. Results show that our framework enables the base model to achieve performance comparable to that of a much larger model. Our ablation studies reveal that the subtask datasets and the synthetic dataset act as complementary modular enhancers that strengthen LLMs’ integrated application.
pdf
bib
abs
SAFE-SQL: Self-Augmented In-Context Learning with Fine-grained Example Selection for Text-to-SQL
Jimin Lee
|
Ingeol Baek
|
Byeongjeong Kim
|
Hyunkyung Bae
|
Hwanhee Lee
Text-to-SQL aims to convert natural language questions into executable SQL queries. While previous approaches, such as skeleton-masked selection, have demonstrated strong performance by retrieving similar training examples to guide large language models (LLMs), they struggle in real-world scenarios where such examples are unavailable. To overcome this limitation, we propose Fine-grained Self-Augmentation in-context learning for Text-to-SQL (SAFE-SQL), a novel framework that improves SQL generation by generating and filtering self-augmented examples. SAFE-SQL first prompts an LLM to generate multiple Text-to-SQL examples relevant to the test input. Then SAFE-SQL filters these examples through three relevance assessments, constructing high-quality in-context learning examples. Using self-generated examples, SAFE-SQL surpasses previous zero-shot and few-shot Text-to-SQL frameworks, achieving higher execution accuracy. Notably, our approach provides additional performance gains in extra-hard and unseen scenarios, where conventional methods often fail.
pdf
bib
abs
ExpandR: Teaching Dense Retrievers Beyond Queries with LLM Guidance
Sijia Yao
|
Pengcheng Huang
|
Zhenghao Liu
|
Yu Gu
|
Yukun Yan
|
Shi Yu
|
Ge Yu
Large language models (LLMs) have demonstrated significant potential in enhancing dense retrieval through query augmentation. However, most existing methods treat the LLM and the retriever as separate modules, overlooking the alignment between generation and ranking objectives. In this work, we propose ExpandR, a unified LLM-augmented dense retrieval framework that jointly optimizes both the LLM and the retriever. ExpandR employs the LLM to generate semantically rich query expansions, which are leveraged to enhance the retriever’s training. Simultaneously, the LLM is trained using Direct Preference Optimization (DPO), guided by a carefully designed reward function that balances retrieval effectiveness and generation consistency. This joint optimization paradigm enables mutual adaptation between the LLM and the retriever, resulting in query expansions that are both informative and well-suited for retrieval. Experimental results on multiple benchmarks show that ExpandR consistently outperforms strong baselines, achieving more than a 5% improvement in retrieval performance. All codes are available at https://github.com/NEUIR/ExpandR.
pdf
bib
abs
Anecdoctoring: Automated Red-Teaming Across Language and Place
Alejandro Cuevas
|
Saloni Dash
|
Bharat Kumar Nayak
|
Dan Vann
|
Madeleine I. G. Daepp
Disinformation is among the top risks of generative artificial intelligence (AI) misuse. Global adoption of generative AI necessitates red-teaming evaluations (i.e., systematic adversarial probing) that are robust across diverse languages and cultures, but red-teaming datasets are commonly US- and English-centric. To address this gap, we propose “anecdoctoring”, a novel red-teaming approach that automatically generates adversarial prompts across languages and cultures. We collect misinformation claims from fact-checking websites in three languages (English, Spanish, and Hindi) and two geographies (US and India). We then cluster individual claims into broader narratives and characterize the resulting clusters with knowledge graphs, with which we augment an attacker LLM. Our method produces higher attack success rates and offers interpretability benefits relative to few-shot prompting. Results underscore the need for disinformation mitigations that scale globally and are grounded in real-world adversarial misuse.
pdf
bib
abs
ACING: Actor-Critic for Instruction Learning in Black-Box LLMs
Salma Kharrat
|
Fares Fourati
|
Marco Canini
The effectiveness of Large Language Models (LLMs) in solving tasks depends significantly on the quality of their instructions, which often require substantial human effort to craft. This underscores the need for automated instruction optimization. However, optimizing instructions is particularly challenging when working with black-box LLMs, where model parameters and gradients are inaccessible. We introduce ACING, an actor-critic reinforcement learning framework that formulates instruction optimization as a stateless, continuous-action problem, enabling exploration of infinite instruction spaces using only black-box feedback. ACING automatically discovers prompts that outperform human-written prompts in 76% of instruction-induction tasks, with gains of up to 33 points and a 10-point median improvement over the best automatic baseline in 33 tasks spanning instruction-induction, summarization, and chain-of-thought reasoning. Extensive ablations highlight its robustness and efficiency. An implementation of ACING is available at
https://github.com/salmakh1/ACING.
pdf
bib
abs
Women, Infamous, and Exotic Beings: A Comparative Study of Honorific Usages in Wikipedia and LLMs for Bengali and Hindi
Sourabrata Mukherjee
|
Atharva Mehta
|
Sougata Saha
|
Akhil Arora
|
Monojit Choudhury
The obligatory use of third-person honorifics is a distinctive feature of several South Asian languages, encoding nuanced socio-pragmatic cues such as power, age, gender, fame, and social distance. In this work, (i) we present the first large-scale study of third-person honorific pronoun and verb usage across 10,000 Hindi and Bengali Wikipedia articles with annotations linked to key socio-demographic attributes of the subjects, including gender, age group, fame, and cultural origin. (ii) Our analysis uncovers systematic intra-language regularities but notable cross-linguistic differences: honorifics are more prevalent in Bengali than in Hindi, while non-honorifics dominate when referring to infamous, juvenile, and culturally “exotic” entities. Notably, in both languages, and more prominently in Hindi, men are more frequently addressed with honorifics than women. (iii) To examine whether large language models (LLMs) internalize similar socio-pragmatic norms, we probe six LLMs using controlled generation and translation tasks over 1,000 culturally balanced entities. We find that LLMs diverge from Wikipedia usage, exhibiting alternative preferences in honorific selection across tasks, languages, and socio-demographic attributes. These discrepancies highlight gaps in the socio-cultural alignment of LLMs and open new directions for studying how LLMs acquire, adapt, or distort social-linguistic norms. Our code and data are publicly available at https://github.com/souro/honorific-wiki-llm
pdf
bib
abs
Process-Supervised Reward Models for Verifying Clinical Note Generation: A Scalable Approach Guided by Domain Expertise
Hanyin Wang
|
Chufan Gao
|
Qiping Xu
|
Bolun Liu
|
Guleid Hussein
|
Hariprasad Reddy Korsapati
|
Mohamad El Labban
|
Kingsley Iheasirim
|
Mohamed Hassan
|
Gokhan Anil
|
Brian Bartlett
|
Jimeng Sun
Process-supervised reward models (PRMs) excel at providing step-by-step verification for large language model (LLM) outputs in domains like mathematics and coding. However, their application to fields lacking ground-truth answers, such as clinical note generation, poses significant challenges. We introduce a novel framework for training PRMs to deliver step-level reward signals for LLM-generated clinical notes. By precisely defining meaningful “steps,” injecting realistic “errors” informed by domain expertise, and leveraging LLMs to generate process supervision data at scale, we overcome previous limitations. Our PRM, built on LLaMA-3.1 8B, consistently outperforms proprietary reasoning and non-reasoning models, achieving state-of-the-art performance on two key evaluations: (1) distinguishing gold-standard from error-containing samples with 98.8% accuracy, and (2) selecting physician-preferred clinical notes with 56.2% accuracy. We investigate critical components for effective PRM training, including optimal loss functions and data selection strategies, and present a comprehensive physician reader study identifying predictors of downstream Best-of-N performance. Our study sheds light on unlocking the potential of PRMs for diverse generative tasks across domains.
pdf
bib
abs
GCML: Gradient Coherence Guided Meta-Learning for Cross-Domain Emerging Topic Rumor Detection
Zejiang He
|
Jingyuan Huang
|
Menglong Lu
|
Zhen Huang
|
Shanshan Liu
|
Zhiliang Tian
|
Dongsheng Li
With the emergence of new topics on social media as sources of rumor propagation, addressing the domain shift between the source and target domains and the scarcity of target-domain samples remains a crucial task in cross-domain rumor detection. Traditional deep learning-based methods and LLM-based methods mostly focus on the in-domain condition and thus perform poorly in the cross-domain setting. Existing domain adaptation approaches to rumor detection ignore differences in data generalization and rely on a large amount of unlabeled target-domain samples to achieve domain adaptation, making them less effective for emerging-topic rumor detection. In this paper, we propose a Gradient Coherence guided Meta-Learning approach (GCML) for emerging-topic rumor detection. Firstly, we calculate the task generalization score of each source task (sampled from the source domain) from a gradient coherence perspective, and selectively learn more “generalizable” tasks that are more beneficial for adapting to the target domain. Secondly, we leverage meta-learning to alleviate the scarcity of target-domain samples, using task generalization scores to re-weight meta-test gradients and adaptively update the learning rate. Extensive experimental results on real-world datasets show that our method substantially outperforms SOTA baselines.
pdf
bib
abs
Can LLMs Generate and Solve Linguistic Olympiad Puzzles?
Neh Majmudar
|
Elena Filatova
In this paper, we introduce a combination of novel and exciting tasks: the solution and generation of linguistic puzzles. We focus on puzzles used in Linguistic Olympiads for high school students. We first extend the existing benchmark for the task of solving linguistic puzzles. We explore the use of Large Language Models (LLMs), including recent state-of-the-art models such as OpenAI’s o1, for solving linguistic puzzles, analyzing their performance across various linguistic topics. We demonstrate that LLMs outperform humans on most puzzle types, except for those centered on writing systems and those involving understudied languages. We use the insights from the puzzle-solving experiments to direct the novel task of puzzle generation. We believe that automating puzzle generation, even for relatively simple puzzles, holds promise for expanding interest in linguistics and introducing the field to a broader audience. This finding highlights the importance of linguistic puzzle generation as a research task: such puzzles can not only promote linguistics but also support the dissemination of knowledge about rare and understudied languages.
pdf
bib
abs
E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning
Zihan Liao
|
Jun Wang
|
Hang Yu
|
Lingxiao Wei
|
Jianguo Li
|
Jun Wang
|
Wei Zhang
Processing long contexts is increasingly important for Large Language Models (LLMs) in tasks like multi-turn dialogues, code generation, and document summarization. This paper addresses the challenges of achieving high long-context performance, low computational complexity, and compatibility with pretrained models – collectively termed the “impossible triangle”. We introduce E2LLM (Encoder Elongated Large Language Models), a novel approach that effectively navigates this paradox. E2LLM divides long contexts into chunks, compresses each into soft prompts using a pretrained text encoder, and aligns these representations with a decoder-only LLM via an adapter. To enhance the LLM’s reasoning with these soft prompts, we employ two training objectives: encoder output reconstruction and long-context instruction fine-tuning. Extensive experiments reveal that E2LLM not only outperforms 8 state-of-the-art (SOTA) methods in effectiveness and efficiency for document summarization and question answering, but also achieves the best performance on LongBench v2 among models of comparable size. The source code is available at
https://github.com/codefuse-ai/E2LLM.
pdf
bib
abs
DivScore: Zero-Shot Detection of LLM-Generated Text in Specialized Domains
Zhihui Chen
|
Kai He
|
Yucheng Huang
|
Yunxiao Zhu
|
Mengling Feng
Detecting LLM-generated text in specialized and high-stakes domains like medicine and law is crucial for combating misinformation and ensuring authenticity. However, current zero-shot detectors, while effective on general text, often fail when applied to specialized content due to domain shift. We provide a theoretical analysis showing this failure is fundamentally linked to the KL divergence between human, detector, and source text distributions. To address this, we propose DivScore, a zero-shot detection framework using normalized entropy-based scoring and domain knowledge distillation to robustly identify LLM-generated text in specialized domains. Experiments on medical and legal datasets show that DivScore consistently outperforms state-of-the-art detectors, with 14.4% higher AUROC and 64.0% higher recall at a 0.1% false positive rate threshold. In adversarial settings, DivScore demonstrates superior robustness over other baselines, achieving an average advantage of 22.8% in AUROC and 29.5% in recall.
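The abstract does not spell out the scoring formula, so the following is only one plausible reading of an entropy-normalized zero-shot score (negative log-likelihood divided by predictive entropy); the scorer model and the formula itself are assumptions, and the domain knowledge distillation step is omitted.

```python
# Hedged illustration of an entropy-normalized detection score; DivScore's exact
# formulation may differ. The scorer model is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder scorer model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def entropy_normalized_score(text):
    ids = tok(text, return_tensors="pt").input_ids
    logits = model(ids).logits[:, :-1]                                  # predictions for next tokens
    targets = ids[:, 1:]
    logprobs = torch.log_softmax(logits, dim=-1)
    token_ll = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)   # log p(observed token)
    entropy = -(logprobs.exp() * logprobs).sum(-1)                      # predictive entropy per step
    # Lower ratios tend to indicate text that is "easy" for the scorer, i.e., machine-like.
    return (-token_ll / (entropy + 1e-6)).mean().item()

print(entropy_normalized_score("The patient presented with acute chest pain."))
```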
pdf
bib
abs
Multi-Document Event Extraction Using Large and Small Language Models
Qingkai Min
|
Zitian Qu
|
Qipeng Guo
|
Xiangkun Hu
|
Zheng Zhang
|
Yue Zhang
Multi-document event extraction aims to aggregate event information from diverse sources for a comprehensive understanding of complex events. Despite its practical significance, this task has received limited attention in existing research. The inherent challenges include handling complex reasoning over long contexts and intricate event structures. In this paper, we propose a novel collaborative framework that integrates large language models for multi-step reasoning and fine-tuned small language models to handle key subtasks, guiding the overall reasoning process. We introduce a new benchmark for multi-document event extraction and propose an evaluation metric designed for comprehensive assessment of multiple aggregated events. Experimental results demonstrate that our approach significantly outperforms existing methods, providing new insights into collaborative reasoning to tackle the complexities of multi-document event extraction.
pdf
bib
abs
MA-GTS: A Multi-Agent Framework for Solving Complex Graph Problems in Real-World Applications
Zike Yuan
|
Ming Liu
|
Hui Wang
|
Bing Qin
Graph-theoretic problems arise in real-world applications like logistics, communication networks, and traffic optimization. These problems are often complex, noisy, and irregular, posing challenges for traditional algorithms. Large language models offer potential solutions but face several challenges, including limited accuracy, input length constraints, and suboptimal algorithm selection. To address these challenges, we propose MA-GTS (Multi-Agent Graph Theory Solver), a multi-agent framework that decomposes these complex problems through agent collaboration. MA-GTS maps the implicitly expressed text-based graph data into clear, structured graph representations and dynamically selects the most suitable algorithm based on problem constraints and graph structure scale. We validate MA-GTS using the G-REAL dataset, a real-world-inspired graph theory dataset we created. Experimental results show that MA-GTS outperforms state-of-the-art methods in cost-effectiveness, accuracy, and scalability, achieving strong results on multiple benchmarks (G-REAL 93.6%, GraCoRe 96.9%, NLGraph 98.4%) with robust performance on both closed- and open-source models.
pdf
bib
abs
Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders
Weiqiao Shan
|
Yuang Li
|
Yuhao Zhang
|
Yingfeng Luo
|
Chen Xu
|
Xiaofeng Zhao
|
Long Meng
|
Yunfei Lu
|
Min Zhang
|
Hao Yang
|
Tong Xiao
|
JingBo Zhu
Connecting audio encoders with large language models (LLMs) allows the LLM to perform various audio understanding tasks, such as automatic speech recognition (ASR) and audio captioning (AC). Most research focuses on training an adapter layer to generate a unified audio feature for the LLM. However, different tasks may require distinct features that emphasize either semantic or acoustic aspects, making task-specific audio features more desirable. In this paper, we propose Prompt-aware Mixture (PaM) to enhance the Speech LLM that uses multiple audio encoders. Our approach involves using different experts to extract different features based on the prompt that indicates different tasks. Experiments demonstrate that with PaM, only one Speech LLM surpasses the best performances achieved by all single-encoder Speech LLMs on ASR, speaker number verification, and AC tasks. PaM also outperforms other feature fusion baselines, such as concatenation and averaging.
pdf
bib
abs
CIKT: A Collaborative and Iterative Knowledge Tracing Framework with Large Language Models
Runze Li
|
Siyu Wu
|
Jun Wang
|
Wei Zhang
Knowledge Tracing (KT) aims to model a student’s learning state over time and predict their future performance. However, traditional KT methods often face challenges in explainability, scalability, and effective modeling of complex knowledge dependencies. While Large Language Models (LLMs) present new avenues for KT, their direct application often struggles with generating structured, explainable student representations and lacks mechanisms for continuous, task-specific refinement. To address these gaps, we propose Collaborative Iterative Knowledge Tracing (CIKT), a framework that harnesses LLMs to enhance both prediction accuracy and explainability. CIKT employs a dual-component architecture: an Analyst generates dynamic, explainable user profiles from student historical responses, and a Predictor utilizes these profiles to forecast future performance. The core of CIKT is a synergistic optimization loop. In this loop, the Analyst is iteratively refined based on the predictive accuracy of the Predictor, which conditions on the generated profiles, and the Predictor is subsequently retrained using these enhanced profiles. Evaluated on multiple educational datasets, CIKT demonstrates significant improvements in prediction accuracy, offers enhanced explainability through its dynamically updated user profiles, and exhibits improved scalability. Our work presents a robust and explainable solution for advancing knowledge tracing systems, effectively bridging the gap between predictive performance and model transparency.
pdf
bib
abs
Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets
Chenlin Liu
|
Minghui Fang
|
Patrick Zhang
|
Wei Zhou
|
Jie Gao
|
Jiqing Han
Language Model (LM)-based Text-to-Speech (TTS) systems often generate hallucinated speech that deviates from the input text. Existing mitigation strategies either demand excessive training resources or introduce significant inference latency. In this paper, we propose GFlOwNet-guided distribution AlignmenT (GOAT) for LM-based TTS, a post-training framework that mitigates hallucinations without relying on massive resources or added inference cost. Specifically, we first conduct an uncertainty analysis, revealing a strong positive correlation between hallucination and model uncertainty. Based on this, we reformulate TTS generation as a trajectory flow optimization problem and introduce an enhanced Subtrajectory Balance objective together with a sharpened internal reward as the target distribution. We further integrate reward temperature decay and learning rate optimization for stability and performance balance. Extensive experiments show that GOAT reduces character error rates by over 50% on challenging test cases and lowers uncertainty by up to 58%, demonstrating its strong generalization ability and effectiveness.
pdf
bib
abs
MolErr2Fix: Benchmarking LLM Trustworthiness in Chemistry via Modular Error Detection, Localization, Explanation, and Correction
Yuyang Wu
|
Jinhui Ye
|
Shuhao Zhang
|
Lu Dai
|
Yonatan Bisk
|
Olexandr Isayev
Large Language Models (LLMs) have shown growing potential in molecular sciences, but they often produce chemically inaccurate descriptions and struggle to recognize or justify potential errors. This raises important concerns about their robustness and reliability in scientific applications. To support more rigorous evaluation of LLMs in chemical reasoning, we present the MolErr2Fix benchmark, designed to assess LLMs on error detection and correction in molecular descriptions. Unlike existing benchmarks focused on molecule-to-text generation or property prediction, MolErr2Fix emphasizes fine-grained chemical understanding. It tasks LLMs with identifying, localizing, explaining, and revising potential structural and semantic errors in molecular descriptions. Specifically, MolErr2Fix consists of 1,193 fine-grained annotated error instances. Each instance contains quadruple annotations, i.e., (error type, span location, the explanation, and the correction). These tasks are intended to reflect the types of reasoning and verification required in real-world chemical communication. Evaluations of current state-of-the-art LLMs reveal notable performance gaps, underscoring the need for more robust chemical reasoning capabilities. MolErr2Fix provides a focused benchmark for evaluating such capabilities and aims to support progress toward more reliable and chemically informed language models. All annotations and an accompanying evaluation API will be publicly released to facilitate future research.
pdf
bib
abs
Shared Path: Unraveling Memorization in Multilingual LLMs through Language Similarities
Xiaoyu Luo
|
Yiyi Chen
|
Johannes Bjerva
|
Qiongxiu Li
We present the first comprehensive study of Memorization in Multilingual Large Language Models (MLLMs), analyzing 95 languages using models across diverse model scales, architectures, and memorization definitions. As MLLMs are increasingly deployed, understanding their memorization behavior has become critical. Yet prior work has focused primarily on monolingual models, leaving multilingual memorization underexplored, despite the inherently long-tailed nature of training corpora. We find that the prevailing assumption, that memorization is highly correlated with training data availability, fails to fully explain memorization patterns in MLLMs. We hypothesize that treating languages in isolation — ignoring their similarities — obscures the true patterns of memorization. To address this, we propose a novel graph-based correlation metric that incorporates language similarity to analyze cross-lingual memorization. Our analysis reveals that among similar languages, those with fewer training tokens tend to exhibit higher memorization, a trend that only emerges when cross-lingual relationships are explicitly modeled. These findings underscore the importance of a language-aware perspective in evaluating and mitigating memorization vulnerabilities in MLLMs. This also constitutes empirical evidence that language similarity both explains Memorization in MLLMs and underpins Cross-lingual Transferability, with broad implications for multilingual NLP.
pdf
bib
abs
Embedding Domain Knowledge for Large Language Models via Reinforcement Learning from Augmented Generation
Chaojun Nie
|
Jun Zhou
|
Guanxiang Wang
|
Shisong Wu
|
Zichen Wang
Large language models (LLMs) often exhibit limited performance on domain-specific tasks due to the naturally disproportionate representation of specialized information in their training data and the static nature of these datasets. Knowledge scarcity and temporal lag create knowledge gaps for domain applications. While post-training on domain datasets can embed knowledge into models, existing approaches have some limitations. Continual Pre-Training (CPT) treats all tokens in domain documents with equal importance, failing to prioritize critical knowledge points, while supervised fine-tuning (SFT) with question-answer pairs struggles to develop the coherent knowledge structures necessary for complex reasoning tasks. To address these challenges, we propose Reinforcement Learning from Augmented Generation (RLAG). Our approach iteratively cycles between sampling generations and optimizing the model through calculated rewards, effectively embedding critical and contextually coherent domain knowledge. We select generated outputs with the highest log probabilities as the sampling result, then compute three tailored reward metrics to guide the optimization process. To comprehensively evaluate domain expertise, we assess answer accuracy and the rationality of explanations generated for correctly answered questions. Experimental results across medical, legal, astronomy, and current events datasets demonstrate that our proposed method significantly outperforms baseline approaches. Our code and data are open-sourced at https://github.com/ChaojunNie/RLAG.
pdf
bib
abs
LLM-Driven Completeness and Consistency Evaluation for Cultural Heritage Data Augmentation in Cross-Modal Retrieval
Jian Zhang
|
Junyi Guo
|
Junyi Yuan
|
Huanda Lu
|
Yanlin Zhou
|
Fangyu Wu
|
Qiufeng Wang
|
Dongming Lu
Cross-modal retrieval is essential for interpreting cultural heritage data, but its effectiveness is often limited by incomplete or inconsistent textual descriptions, caused by historical data loss and the high cost of expert annotation. While large language models (LLMs) offer a promising solution by enriching textual descriptions, their outputs frequently suffer from hallucinations or miss visually grounded details. To address these challenges, we propose C^3, a data augmentation framework that enhances cross-modal retrieval performance by improving the completeness and consistency of LLM-generated descriptions. C^3 introduces a bidirectional validation mechanism to assess semantic coverage using both visual cues and language-model outputs. Furthermore, to mitigate factual inconsistencies, we formulate a Markov Decision Process to supervise Chain-of-Thought reasoning, guiding consistency verification through adaptive query control. Experiments on the cultural heritage dataset CulTi and general benchmarks MSCOCO and Flickr30K demonstrate that C^3 achieves state-of-the-art performance in both fine-tuned and zero-shot settings.
pdf
bib
abs
Artificial Impressions: Evaluating Large Language Model Behavior Through the Lens of Trait Impressions
Nicholas Deas
|
Kathleen McKeown
We introduce and study artificial impressions–patterns in LLMs’ internal representations of prompts that resemble human impressions and stereotypes based on language. We fit linear probes on generated prompts to predict impressions according to the two-dimensional Stereotype Content Model (SCM). Using these probes, we study the relationship between impressions and downstream model behavior as well as prompt features that may inform such impressions. We find that LLMs inconsistently report impressions when prompted, but also that impressions are more consistently linearly decodable from their hidden representations. Additionally, we show that artificial impressions of prompts are predictive of the quality and use of hedging in model responses. We also investigate how particular content, stylistic, and dialectal features in prompts impact LLM impressions.
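A small sketch of the probing setup described above: hidden representations of prompts are regressed onto the two SCM dimensions (warmth and competence) with a linear model. The placeholder model, layer, pooling, and toy annotations are assumptions for illustration only.

```python
# Hedged sketch of a linear probe for SCM impressions on prompt representations
# (not the paper's exact setup or data).
import numpy as np
import torch
from sklearn.linear_model import Ridge
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # placeholder LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

@torch.no_grad()
def prompt_embedding(text, layer=-1):
    out = model(**tok(text, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=1).squeeze(0).numpy()  # mean-pooled hidden state

# Hypothetical annotated prompts with (warmth, competence) scores in [0, 1].
prompts = ["hey can u help me real quick", "Please provide a detailed summary of the report."]
scm_scores = np.array([[0.8, 0.4], [0.5, 0.9]])

X = np.stack([prompt_embedding(p) for p in prompts])
probe = Ridge(alpha=1.0).fit(X, scm_scores)
print(probe.predict(prompt_embedding("yo what's up").reshape(1, -1)))
```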
pdf
bib
abs
Mitigating Hallucinations in Large Vision-Language Models via Entity-Centric Multimodal Preference Optimization
Jiulong Wu
|
Zhengliang Shi
|
Shuaiqiang Wang
|
Jizhou Huang
|
Dawei Yin
|
Lingyong Yan
|
Min Cao
|
Min Zhang
Large Visual Language Models (LVLMs) have demonstrated impressive capabilities across multiple tasks. However, their trustworthiness is often challenged by hallucinations, which can be attributed to the modality misalignment and the inherent hallucinations of their underlying Large Language Models (LLMs) backbone. Existing preference alignment methods focus on aligning model responses with human preferences while neglecting image-text modality alignment, resulting in over-reliance on LLMs and hallucinations. In this paper, we propose Entity-centric Multimodal Preference Optimization (EMPO), which achieves enhanced modality alignment than existing human preference alignment methods. Besides, to overcome the scarcity of high-quality multimodal preference data, we utilize open-source instruction datasets to automatically construct high-quality preference data across three aspects: image, instruction, and response. Experiments on two human preference datasets and five multimodal hallucination benchmarks demonstrate the effectiveness of EMPO, e.g., reducing hallucination rates by 80.4% on Object HalBench and 52.6% on MM HalBench, thereby enhancing the trustworthiness of LVLMs. The code and dataset will be made publicly available.
pdf
bib
abs
3DS: Medical Domain Adaptation of LLMs via Decomposed Difficulty-based Data Selection
Hongxin Ding
|
Yue Fang
|
Runchuan Zhu
|
Xinke Jiang
|
Jinyang Zhang
|
Yongxin Xu
|
Weibin Liao
|
Xu Chu
|
Junfeng Zhao
|
Yasha Wang
Large Language Models (LLMs) excel in general language tasks, motivating their adaptation to specialized domains such as healthcare. Effective domain adaptation typically involves supervised fine-tuning (SFT) on carefully selected instruction-tuning data. Current data selection methods adopt a data-centric approach, relying on external annotations and heuristics to identify externally defined high-quality or challenging data. Our exploratory experiments highlight that this approach fails to improve the model’s domain performance, due to misalignment between the selected data and the model’s knowledge distribution. To tackle this, we propose Decomposed Difficulty-based Data Selection (3DS), a two-stage model-centric data selection framework that aligns data selection with the model’s distribution. 3DS employs Prompt-Driven Data Selection to filter out noise based on the model’s knowledge via explicit alignment in Stage #1, then adopts Decomposed Difficulty-based Data Selection to guide selection via three novel data difficulty metrics, including Instruction Understanding, Response Confidence, and Response Correctness, in Stage #2, enhanced by an attention-based importance weighting mechanism for accurate calibration. Extensive experiments in the healthcare domain show 3DS outperforms existing methods by up to 2.97% in accuracy, with additional validation in the law and general domains confirming its generalization ability. Our dataset and code are open-sourced at https://github.com/PuppyKnightUniversity/3DS.
pdf
bib
abs
InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies and TV Shows
Kirolos Ataallah
|
Eslam Mohamed Bakr
|
Mahmoud Ahmed
|
Chenhui Gou
|
Khushbu Pahwa
|
Jian Ding
|
Mohamed Elhoseiny
Understanding long-form videos, such as movies and TV episodes ranging from tens of minutes to two hours, remains a significant challenge for multi-modal models. Existing benchmarks often fail to test the full range of cognitive skills needed to process these temporally rich and narratively complex inputs. Therefore, we introduce InfiniBench, a comprehensive benchmark designed to rigorously evaluate the capabilities of models in long video understanding. InfiniBench offers: (1) over 1,000 hours of video content, with an average video length of 53 minutes; (2) the largest set of question-answer pairs for long video comprehension, totaling around 87.7K; (3) eight diverse skills that span both grounding-based abilities (e.g., scene transitions, character actions) and reasoning-based abilities (e.g., deep context understanding, multi-event linking); and (4) rich annotation formats, including both multiple-choice and open-ended questions. We conducted an in-depth evaluation across both commercial models (GPT-4o, Gemini 2.0 Flash) and recent open-source vision-language models (e.g., Qwen2.5-VL, InternVL3.0). Results reveal that: (1) models struggle across the board: even the best model, GPT-4o, achieves only 47.1% on grounding-based skills, with most models performing near or just above random chance; (2) models rely strongly on world knowledge: they achieve surprisingly high scores using only metadata (e.g., video titles), highlighting a tendency to rely on pre-trained knowledge rather than actual visual or temporal understanding; (3) multimodal input matters: when provided with full video and subtitle context, models show substantial improvements, confirming the critical role of multimodal input in video understanding. Our findings underscore the inherent challenges in long-video comprehension and point to the need for substantial advancements in both grounding and reasoning capabilities in MLLMs.
pdf
bib
abs
Intrinsic Test of Unlearning Using Parametric Knowledge Traces
Yihuai Hong
|
Lei Yu
|
Haiqin Yang
|
Shauli Ravfogel
|
Mor Geva
The task of “unlearning” certain concepts in large language models (LLMs) has gained attention for its role in mitigating harmful, private, or incorrect outputs. Current evaluations mostly rely on behavioral tests, without monitoring residual knowledge in model parameters, which can be adversarially exploited to recover erased information. We argue that unlearning should also be assessed internally by tracking changes in the parametric traces of unlearned concepts. To this end, we propose a general evaluation methodology that uses vocabulary projections to inspect concepts encoded in model parameters. We apply this approach to localize “concept vectors” — parameter vectors encoding concrete concepts — and construct ConceptVectors, a benchmark of hundreds of such concepts and their parametric traces in two open-source LLMs. Evaluation on ConceptVectors shows that existing methods minimally alter concept vectors, mostly suppressing them at inference time, while direct ablation of these vectors removes the associated knowledge and reduces adversarial susceptibility. Our findings reveal limitations of behavior-only evaluations and advocate for parameter-based assessments. We release our code and benchmark at https://github.com/yihuaihong/ConceptVectors.
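The vocabulary-projection inspection can be sketched as follows for a small open model: each MLP output-projection row writes a direction into the residual stream, and projecting that direction through the unembedding matrix reveals its top associated tokens. The model, layer, and neuron index are placeholders; the benchmark's concept-localization procedure is not shown.

```python
# Hedged, logit-lens-style sketch of projecting a parameter vector onto the vocabulary
# (not the ConceptVectors code). GPT-2 is used only as a small placeholder model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def top_tokens_for_mlp_value(layer: int, neuron: int, k: int = 10):
    # GPT-2's Conv1D stores weights as (in_features, out_features), so each row of the
    # MLP output projection is the residual-stream direction written by one neuron.
    value_vec = model.transformer.h[layer].mlp.c_proj.weight[neuron]   # (hidden_dim,)
    with torch.no_grad():
        logits = model.lm_head.weight @ value_vec                      # project onto vocabulary
    return [tok.decode(i) for i in logits.topk(k).indices.tolist()]

print(top_tokens_for_mlp_value(layer=10, neuron=123))
```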
pdf
bib
abs
Speculative Streaming: Efficient and Scalable Speculative Decoding with Multi-Stream Attention
Nikhil Bhendawade
|
Irina Belousova
|
Qichen Fu
|
Henry Mason
|
Antonie Lin
|
Mohammad Rastegari
|
Mahyar Najibi
Speculative decoding is a prominent technique for accelerating LLM inference by leveraging an auxiliary draft model, but its effectiveness is limited by the autoregressive nature of draft generation, where acceptance rates depend on the draft model’s size. Scaling the draft model improves acceptance but also increases speculation latency, limiting overall speedup. Furthermore, fine-tuning both the draft and target models is often necessary to achieve high acceptance rates, adding complexity to inference systems as the number of downstream tasks grows. Single-model approaches like Medusa generate speculative tokens non-autoregressively but lack token dependencies, limiting effectiveness. Alternatives like Hydra and Eagle incorporate token dependencies but rely on dedicated heads, making speculation independent of the base model and limiting the extent to which stronger base models can improve speculation. We introduce a novel speculative decoding method that integrates speculative draft generation directly within the target model using multi-stream attention. This improves acceptance rates by introducing interdependencies between speculative tokens while ensuring non-autoregressive draft generation with minimal overhead. As target models scale in size and quality, speculative generation improves naturally with our method, unlike prior approaches. Furthermore, our approach is both parameter- and FLOP-efficient, requiring over 1000X fewer additional parameters than Medusa, making it highly suitable for resource-constrained devices. We design our method to operate in two modes: (1) Lossless mode, a plug-and-play method that preserves the output of any pre-trained model; and (2) Shared mode, optimizing both speedup and downstream performance. We demonstrate a 2–3.5X speedup across diverse tasks, including summarization, translation, question answering, mathematical reasoning, SQL generation, and retrieval-augmented generation (RAG).
pdf
bib
abs
Evaluating Cognitive-Behavioral Fixation via Multimodal User Viewing Patterns on Social Media
Yujie Wang
|
Yunwei Zhao
|
Jing Yang
|
Han Han
|
Shiguang Shan
|
Jie Zhang
Digital social media platforms frequently contribute to cognitive-behavioral fixation, a phenomenon in which users exhibit sustained and repetitive engagement with narrow content domains. While cognitive-behavioral fixation has been extensively studied in psychology, methods for computationally detecting and evaluating such fixation remain underexplored. To address this gap, we propose a novel framework for assessing cognitive-behavioral fixation by analyzing users’ multimodal social media engagement patterns. Specifically, we introduce a multimodal topic extraction module and a cognitive-behavioral fixation quantification module that collaboratively enable adaptive, hierarchical, and interpretable assessment of user behavior. Experiments on existing benchmarks and a newly curated multimodal dataset demonstrate the effectiveness of our approach, laying the groundwork for scalable computational analysis of cognitive fixation. All code in this project is publicly available for research purposes at https://github.com/Liskie/cognitive-fixation-evaluation.
pdf
bib
abs
Mind the Gap: A Closer Look at Tokenization for Multiple-Choice Question Answering with LLMs
Mario Sanz-Guerrero
|
Minh Duc Bui
|
Katharina von der Wense
When evaluating large language models (LLMs) with multiple-choice question answering (MCQA), it is common to end the prompt with the string “*Answer:*” to facilitate automated answer extraction via next-token probabilities. However, there is no consensus on how to tokenize the space following the colon, often overlooked as a trivial choice. In this paper, we uncover accuracy differences of up to 11% due to this (seemingly irrelevant) tokenization variation as well as reshuffled model rankings, raising concerns about the reliability of LLM comparisons in prior work. Surprisingly, we are able to recommend one specific strategy – tokenizing the space *together* with the answer letter – as we observe consistent and statistically significant performance improvements. Additionally, it improves model calibration, enhancing the reliability of the model’s confidence estimates. Our findings underscore the importance of careful evaluation design and highlight the need for standardized, transparent evaluation protocols to ensure reliable and comparable results.
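The two tokenization strategies under comparison can be reproduced in a few lines; the tokenizer below is a placeholder, and the point is only that "A" after "Answer: " and " A" after "Answer:" generally map to different token ids, which is what drives the reported accuracy gap.

```python
# Small illustration of the two ways to tokenize the space after "Answer:" in MCQA scoring.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

prompt = "Question: 2+2=? Choices: A) 4 B) 5\nAnswer:"
letters = ["A", "B"]

# Strategy 1: append the space to the prompt, then score the bare letter tokens.
ids_space_in_prompt = [tok.encode(l, add_special_tokens=False) for l in letters]
# Strategy 2 (recommended by the paper): end the prompt at the colon, score " A"/" B" tokens.
ids_space_with_letter = [tok.encode(" " + l, add_special_tokens=False) for l in letters]

print(ids_space_in_prompt, ids_space_with_letter)  # typically different token ids
```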
pdf
bib
abs
VocalNet: Speech LLMs with Multi-Token Prediction for Faster and High-Quality Generation
Yuhao Wang
|
Heyang Liu
|
Ziyang Cheng
|
Ronghua Wu
|
Qunshan Gu
|
Yanfeng Wang
|
Yu Wang
Speech large language models (LLMs) have emerged as a prominent research focus in speech processing. In this work, we introduce VocalNet, a series of high-performance speech LLMs featuring a scalable and model-agnostic training framework as well as a novel multi-token prediction (MTP) paradigm for speech generation. We first propose an efficient two-stage training framework that enables LLMs to acquire real-time speech interaction capabilities. Through extensive experiments on various training configurations, we ensure both simplicity and effectiveness in the training strategy. Furthermore, inspired by advances in language modeling, we introduce MTP into the domain of speech LLMs—an alternative to traditional next-token prediction (NTP)—which enables the model to predict multiple future tokens at each step. Through systematic analysis and improved implementation, we show that MTP not only accelerates inference speed but also significantly enhances speech quality. Experimental results demonstrate that VocalNet achieves performance comparable to state-of-the-art Omni LLMs while outperforming existing open-source speech LLMs, despite using limited training data.
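A schematic sketch of a multi-token prediction head of the kind the abstract describes: several lightweight heads read the same hidden state, and each predicts a token a different number of steps ahead. The head structure, dimensions, and speech-token vocabulary size are assumptions, not VocalNet's actual design.

```python
# Hedged sketch of multi-token prediction (MTP) heads for speech-token generation.
import torch
import torch.nn as nn

class MTPHeads(nn.Module):
    def __init__(self, hidden_dim=1024, vocab_size=4096, n_future=4):
        super().__init__()
        # One lightweight head per future offset; head k predicts the token k+1 steps ahead.
        self.heads = nn.ModuleList([nn.Linear(hidden_dim, vocab_size) for _ in range(n_future)])

    def forward(self, hidden):                       # hidden: (batch, seq, hidden_dim)
        return torch.stack([head(hidden) for head in self.heads], dim=2)  # (B, S, n_future, V)

h = torch.randn(2, 16, 1024)                         # dummy decoder hidden states
print(MTPHeads()(h).shape)                            # torch.Size([2, 16, 4, 4096])
```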
pdf
bib
abs
Path Drift in Large Reasoning Models: How First-Person Commitments Override Safety
Yuyi Huang
|
Runzhe Zhan
|
Lidia S. Chao
|
Ailin Tao
|
Derek F. Wong
As large language models (LLMs) are increasingly deployed for complex reasoning tasks, Long Chain-of-Thought (Long-CoT) prompting has emerged as a key paradigm for structured inference. Despite early-stage safeguards enabled by alignment techniques such as RLHF, we identify a previously underexplored vulnerability: reasoning trajectories in Long-CoT models can drift from aligned paths, resulting in content that violates safety constraints. We term this phenomenon Path Drift. Through empirical analysis, we uncover three behavioral triggers of Path Drift: (1) first-person commitments that induce goal-driven reasoning that delays refusal signals; (2) ethical evaporation, where surface-level disclaimers bypass alignment checkpoints; (3) condition chain escalation, where layered cues progressively steer models toward unsafe completions. Building on these insights, we introduce a three-stage Path Drift Induction Framework comprising cognitive load amplification, self-role priming, and condition chain hijacking. Each stage independently reduces refusal rates, while their combination further compounds the effect. To mitigate these risks, we propose a path-level defense strategy incorporating role attribution correction and metacognitive reflection (reflective safety cues). Our findings highlight the need for trajectory-level alignment oversight in long-form reasoning beyond token-level alignment.
pdf
bib
abs
CBP-Tuning: Efficient Local Customization for Black-box Large Language Models
Jiaxuan Zhao
|
Naibin Gu
|
Yuchen Feng
|
Xiyu Liu
|
Peng Fu
|
Zheng Lin
|
Weiping Wang
The high costs of customizing large language models (LLMs) fundamentally limit their adaptability to user-specific needs. Consequently, LLMs are increasingly offered as cloud-based services, a paradigm that introduces critical limitations: providers struggle to support personalized customization at scale, while users face privacy risks when exposing sensitive data. To address this dual challenge, we propose Customized Black-box Prompt Tuning (CBP-Tuning), a novel framework that facilitates efficient local customization while preserving bidirectional privacy. Specifically, we design a two-stage framework: (1) a prompt generator trained on the server-side to capture domain-specific and task-agnostic capabilities, and (2) user-side gradient-free optimization that tailors soft prompts for individual tasks. This approach eliminates the need for users to access model weights or upload private data, requiring only a single customized vector per task while achieving effective adaptation. Furthermore, the evaluation of CBP-Tuning in the commonsense reasoning, medical and financial domain settings demonstrates superior performance compared to baselines, showcasing its advantages in task-agnostic processing and privacy preservation.
pdf
bib
abs
Beyond the Score: Uncertainty-Calibrated LLMs for Automated Essay Assessment
Ahmed Karim
|
Qiao Wang
|
Zheng Yuan
Automated Essay Scoring (AES) systems now attain near–human agreement on some public benchmarks, yet real-world adoption—especially in high-stakes examinations—remains limited. A principal obstacle is that most models output a single score without any accompanying measure of confidence or explanation. We address this gap with conformal prediction, a distribution-free wrapper that equips any classifier with set-valued outputs enjoying formal coverage guarantees. Two open-source Large Language Models—Llama-3 8B and Qwen-2.5 3B—are fine-tuned on three diverse corpora (ASAP, TOEFL11, Cambridge-FCE) and calibrated at a 90% risk level. Reliability is assessed with UAcc, an uncertainty-aware accuracy that rewards models for being both correct and concise. To our knowledge, this is the first work to combine conformal prediction and UAcc for essay scoring. The calibrated models consistently meet the coverage target while keeping prediction sets compact, indicating that open-source, mid-sized LLMs can already support teacher-in-the-loop AES; we discuss scaling and broader user studies as future work.
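For readers unfamiliar with the conformal-prediction wrapper mentioned above, here is a minimal split-conformal sketch assuming a classifier that outputs per-grade softmax probabilities; the 90% risk level matches the abstract, but the variable names and toy data are illustrative rather than the authors' setup.

```python
# Minimal split conformal prediction sketch for set-valued essay grades.
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate on held-out essays: nonconformity = 1 - prob of true grade."""
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

def prediction_set(probs, qhat):
    """Return all grades whose nonconformity stays below the calibrated cutoff."""
    return [g for g, p in enumerate(probs) if 1.0 - p <= qhat]

# Toy usage: 5 calibration essays, 3 possible grades.
cal_probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.3, 0.4, 0.3],
                      [0.2, 0.2, 0.6],
                      [0.5, 0.3, 0.2]])
cal_labels = np.array([0, 1, 1, 2, 0])
qhat = conformal_threshold(cal_probs, cal_labels)
print(prediction_set(np.array([0.45, 0.40, 0.15]), qhat))  # set-valued output
```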
pdf
bib
abs
Humans Hallucinate Too: Language Models Identify and Correct Subjective Annotation Errors With Label-in-a-Haystack Prompts
Georgios Chochlakis
|
Peter Wu
|
Tikka Arjun Singh Bedi
|
Marcus Ma
|
Kristina Lerman
|
Shrikanth Narayanan
Modeling complex subjective tasks in Natural Language Processing, such as recognizing emotion and morality, is considerably challenging due to significant variation in human annotations. This variation often reflects reasonable differences in semantic interpretations rather than mere noise, necessitating methods to distinguish between legitimate subjectivity and error. We address this challenge by exploring label verification in these contexts using Large Language Models (LLMs). First, we propose a simple In-Context Learning binary filtering baseline that estimates the reasonableness of a document-label pair. We then introduce the Label-in-a-Haystack setting: the query and its label(s) are included in the demonstrations shown to LLMs, which are prompted to predict the label(s) again, while receiving task-specific instructions (e.g., emotion recognition) rather than label copying. We show that the failure to copy the label(s) to the output of the LLM is task-relevant and informative. Building on this, we propose the Label-in-a-Haystack Rectification (LiaHR) framework for subjective label correction: when the model outputs diverge from the reference gold labels, we assign the generated labels to the example instead of discarding it. This approach can be integrated into annotation pipelines to enhance signal-to-noise ratios. Comprehensive analyses, human evaluations, and ecological validity studies verify the utility of LiaHR for label correction. Code is available at https://github.com/gchochla/liahr.
pdf
bib
abs
Do It Yourself (DIY): Modifying Images for Poems in a Zero-Shot Setting Using Weighted Prompt Manipulation
Sofia Jamil
|
Kotla Sai Charan
|
Sriparna Saha
|
Koustava Goswami
|
Joseph K J
Poetry is an expressive form of art that invites multiple interpretations, as readers often bring their own emotions, experiences, and cultural backgrounds into their understanding of a poem. Recognizing this, we aim to generate images for poems and improve these images in a zero-shot setting, enabling audiences to modify images as per their requirements. To achieve this, we introduce a novel Weighted Prompt Manipulation (WPM) technique, which systematically modifies attention weights and text embeddings within diffusion models. By dynamically adjusting the importance of specific words, WPM enhances or suppresses their influence in the final generated image, leading to semantically richer and more contextually accurate visualizations. Our approach exploits diffusion models and large language models (LLMs) such as GPT in conjunction with existing poetry datasets, ensuring a comprehensive and structured methodology for improved image generation in the literary domain. To the best of our knowledge, this is the first attempt at integrating weighted prompt manipulation for enhancing imagery in poetic language.
pdf
bib
abs
Looking Beyond Text: Reducing Language Bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance
Haozhe Zhao
|
Shuzheng Si
|
Liang Chen
|
Yichi Zhang
|
Maosong Sun
|
Baobao Chang
|
Minjia Zhang
Large vision-language models (LVLMs) have achieved impressive results in vision-language tasks. However, they often exhibit language bias, over-relying on textual inputs at the expense of visual information. Therefore, we propose LACING, designed to address such bias with a Multimodal Dual-Attention Mechanism (MDA) and Soft-Image Guidance (SIG). Specifically, MDA adopts a parallel dual-attention mechanism that constructs separate attention for visual and text inputs to enhance the integration of visual inputs across the model. SIG uses a learnable soft visual prompt during training and inference to replace visual inputs, designed to compel LVLMs to prioritize text inputs during inference. Experiments across different model architectures and scales demonstrate that LACING effectively debiases LVLMs from their language bias, enhancing visual comprehension and reducing hallucinations without additional resources.
pdf
bib
abs
Who Holds the Pen? Caricature and Perspective in LLM Retellings of History
Lubna Zahan Lamia
|
Mabsur Fatin Bin Hossain
|
Md Mosaddek Khan
Large language models (LLMs) are no longer just language generators—they are increasingly used to simulate human behavior, perspectives, and demographic variation across social domains, from public opinion surveys to experimental research. Amid this shift, the use of LLMs to simulate historical narratives has emerged as a timely frontier. It is crucial to scrutinize the asymmetries these models embed when framing, interpreting, and retelling the past. Building on prior work that defines caricature as the combination of individuation and exaggeration, we analyze LLM-generated responses across 197 historically significant events—each featuring a directly and an indirectly affected persona. We find that LLMs reliably distinguish persona-based responses from neutral baselines, and that directly affected personas consistently exhibit higher exaggeration—amplifying identity-specific portrayals. Beyond lexical patterns, personas often frame the same event in conflicting ways—especially in military, political, and morally charged contexts. Grammatical analysis further reveals that direct personas adopt more passive constructions in institutional contexts, but shift to active framing when emotional immediacy is foregrounded. Our findings show how subtle asymmetries in tone, stance, and emphasis—not overt toxicity—can quietly, yet systematically, distort how history is told and remembered.
pdf
bib
abs
DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs
Minxuan Lv
|
Zhenpeng Su
|
Leiyu Pan
|
Yizhe Xiong
|
Zijia Lin
|
Hui Chen
|
Wei Zhou
|
Jungong Han
|
Guiguang Ding
|
Wenwu Ou
|
Di Zhang
|
Kun Gai
|
Songlin Hu
As large language models continue to scale, computational costs and resource consumption have emerged as significant challenges. While existing sparsification methods like pruning reduce computational overhead, they risk losing model knowledge through parameter removal. This paper proposes DSMoE (Dynamic Sparse Mixture-of-Experts), a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks. We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge based on input complexity. Additionally, we introduce a sparsity loss term to balance performance and computational efficiency. Extensive experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches across language modeling and downstream tasks, particularly excelling in generation tasks. Analysis reveals that DSMoE learns distinctive layerwise activation patterns, providing new insights for future MoE architecture design.
pdf
bib
abs
Search Wisely: Mitigating Sub-optimal Agentic Searches By Reducing Uncertainty
Peilin Wu
|
Mian Zhang
|
Xinlu Zhang
|
Xinya Du
|
Zhiyu Chen
Agentic Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by enabling dynamic, multi-step reasoning and information retrieval. However, these systems often exhibit sub-optimal search behaviors like over-search (retrieving redundant information) and under-search (failing to retrieve necessary information), which hinder efficiency and reliability. This work formally defines and quantifies these behaviors, revealing their prevalence across multiple QA datasets and agentic RAG systems (e.g., one model could have avoided searching in 27.7% of its search steps). Furthermore, we demonstrate a crucial link between these inefficiencies and the models’ uncertainty regarding their own knowledge boundaries, where response accuracy correlates with the model’s uncertainty in its search decisions. To address this, we propose β-GRPO, a reinforcement learning-based training method that incorporates a confidence threshold to reward high-certainty search decisions. Experiments on seven QA benchmarks show that β-GRPO equips a 3B model with better agentic RAG ability, outperforming other strong baselines with a 4% higher average exact match score.
pdf
bib
abs
Child-Directed Language Does Not Consistently Boost Syntax Learning in Language Models
Francesca Padovani
|
Jaap Jumelet
|
Yevgen Matusevych
|
Arianna Bisazza
Seminal work by Huebner et al. (2021) showed that language models (LMs) trained on English Child-Directed Language (CDL) can outperform LMs trained on an equal amount of adult-directed text like Wikipedia. However, it remains unclear whether these results generalize across languages, architectures, and evaluation settings. We test this by comparing models trained on CDL vs. Wikipedia across two LM objectives (masked and causal), three languages (English, French, German), and three syntactic minimal pair benchmarks. Our results on these benchmarks show inconsistent benefits of CDL, which in most cases is outperformed by Wikipedia models. We then identify various shortcomings in these benchmarks, and introduce a novel testing methodology, FIT-CLAMS, which uses a frequency-controlled design to enable balanced comparisons across training corpora. Through minimal pair evaluations and regression analysis we show that training on CDL does not yield stronger generalizations for acquiring syntax and highlight the importance of controlling for frequency effects when evaluating syntactic ability.
pdf
bib
abs
Benchmarking Debiasing Methods for LLM-based Parameter Estimates
Nicolas Audinet de Pieuchon
|
Adel Daoud
|
Connor Thomas Jerzak
|
Moa Johansson
|
Richard Johansson
Large language models (LLMs) offer an inexpensive yet powerful way to annotate text, but are often inconsistent when compared with experts. These errors can bias downstream estimates of population parameters such as regression coefficients and causal effects. To mitigate this bias, researchers have developed debiasing methods such as Design-based Supervised Learning (DSL) and Prediction-Powered Inference (PPI), which promise valid estimation by combining LLM annotations with a limited number of expensive expert annotations. Although these methods produce consistent estimates under theoretical assumptions, it is unknown how they compare in finite samples of sizes encountered in applied research. We make two contributions: First, we study how each method’s performance scales with the number of expert annotations, highlighting regimes where LLM bias or limited expert labels significantly affect results. Second, we compare DSL and PPI across a range of tasks, finding that although both achieve low bias with large datasets, DSL often outperforms PPI on bias reduction and empirical efficiency, but its performance is less consistent across datasets. Our findings indicate that there is a bias-variance tradeoff at the level of debiasing methods, calling for more research on developing metrics for quantifying their efficiency in finite samples.
pdf
bib
abs
(Almost) Free Modality Stitching of Foundation Models
Jaisidh Singh
|
Diganta Misra
|
Boris Knyazev
|
Antonio Orvieto
Foundation multi-modal models are often designed by stitching together multiple existing pretrained uni-modal models: for example, an image classifier with a text model. This stitching process is performed by training a connector module that aims to align the representation spaces of these uni-modal models towards a multi-modal objective. However, given the complexity of training such connectors on large-scale web-based datasets coupled with the ever-increasing number of available pretrained uni-modal models, the task of uni-modal model selection and subsequent connector module training becomes computationally demanding. To address this under-studied critical problem, we propose Hypernetwork Model Alignment (Hyma), a novel all-in-one solution for optimal uni-modal model selection and connector training by leveraging hypernetworks. Specifically, our framework utilizes the parameter prediction capability of a hypernetwork to obtain jointly trained connector modules for N × M combinations of uni-modal models. In our experiments, Hyma reduces the cost of searching for the best performing uni-modal model pair by 10×, while matching the ranking and trained connector performance obtained via grid search across a suite of diverse multi-modal benchmarks.
pdf
bib
abs
VERITAS: Leveraging Vision Priors and Expert Fusion to Improve Multimodal Data
Tingqiao Xu
|
Ziru Zeng
|
Jiayu Chen
The quality of supervised fine-tuning (SFT) data is crucial for the performance of large multimodal models (LMMs), yet current data enhancement methods often suffer from factual errors and hallucinations due to inadequate visual perception. To address this challenge, we propose VERITAS, a pipeline that systematically integrates vision priors and multiple state-of-the-art LMMs with statistical methods to enhance SFT data quality. VERITAS leverages visual recognition models (RAM++) and OCR systems (PP-OCRv4) to extract structured vision priors, which are combined with images, questions, and answers. Three LMMs (GPT-4o, Gemini-2.5-Pro, Doubao-1.5-pro) evaluate the original answers, providing critique rationales and scores that are statistically fused into a high-confidence consensus score serving as ground truth. Using this consensus, we train a lightweight critic model via Group Relative Policy Optimization (GRPO), enhancing reasoning capabilities efficiently. Each LMM then refines the original answers based on the critiques, generating new candidate answers; we select the highest-scoring one as the final refined answer. Experiments across six multimodal benchmarks demonstrate that models fine-tuned with data processed by VERITAS consistently outperform those using raw data, particularly in text-rich and fine-grained reasoning tasks. Our critic model exhibits enhanced capability comparable to state-of-the-art LMMs while being significantly more efficient. We release our pipeline, datasets, and model checkpoints to advance research in multimodal data optimization.
pdf
bib
abs
Rescorla-Wagner Steering of LLMs for Undesired Behaviors over Disproportionate Inappropriate Context
Rushi Wang
|
Jiateng Liu
|
Cheng Qian
|
Yifan Shen
|
Yanzhou Pan
|
Zhaozhuo Xu
|
Ahmed Abbasi
|
Heng Ji
|
Denghui Zhang
Incorporating external context can significantly enhance the response quality of Large Language Models (LLMs). However, real-world contexts often mix relevant information with disproportionate inappropriate content, posing reliability risks. How do LLMs process and prioritize mixed context? To study this, we introduce the Poisoned Context Testbed, pairing queries with real-world contexts containing relevant and inappropriate content. Inspired by associative learning in animals, we adapt the Rescorla-Wagner (RW) model from neuroscience to quantify how competing contextual signals influence LLM outputs. Our adapted model reveals a consistent behavioral pattern: LLMs exhibit a strong tendency to incorporate information that is less prevalent in the context. This susceptibility is harmful in real-world settings, where small amounts of inappropriate content can substantially degrade response quality. Empirical evaluations on our testbed further confirm this vulnerability. To tackle this, we introduce RW-Steering, a two-stage finetuning-based approach that enables the model to internally identify and ignore inappropriate signals. Unlike prior methods that rely on extensive supervision across diverse context mixtures, RW-Steering generalizes robustly across varying proportions of inappropriate content. Experiments show that our best fine-tuned model improves response quality by 39.8% and reverses the undesirable behavior curve, establishing RW-Steering as a robust, generalizable solution for improving LLM safety in real-world use.
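For context, the classic Rescorla-Wagner update from associative learning, which the abstract adapts to quantify competing contextual signals, has the textbook form ΔV_i = αβ(λ - ΣV). A small sketch of that standard form follows; it is the textbook model only, not the paper's LLM adaptation, and the parameter values are illustrative.

```python
# Classic Rescorla-Wagner update (textbook form; the paper's adaptation to
# LLM context processing is not reproduced here).
def rescorla_wagner_step(V: dict[str, float], present_cues: set[str],
                         lam: float, alpha: float = 0.3,
                         beta: float = 1.0) -> dict[str, float]:
    """Update association strengths V for the cues present on this trial."""
    prediction = sum(V.get(c, 0.0) for c in present_cues)  # combined prediction
    error = lam - prediction                               # prediction error
    for c in present_cues:
        V[c] = V.get(c, 0.0) + alpha * beta * error        # shared-error update
    return V

# Toy usage: a relevant cue and an inappropriate cue presented together.
V = {}
for _ in range(20):
    V = rescorla_wagner_step(V, {"relevant", "inappropriate"}, lam=1.0)
print(V)  # the two cues end up sharing the learned association strength
```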
pdf
bib
abs
Exploring Artificial Image Generation for Stance Detection
Zhengkang Zhang
|
Zhongqing Wang
|
Guodong Zhou
Stance detection is a task aimed at identifying and analyzing the author’s stance from text. Previous studies have primarily focused on the text, which may not fully capture the implicit stance conveyed by the author. To address this limitation, we propose a novel approach that transforms original texts into artificially generated images and uses the visual representation to enhance stance detection. Our approach first employs a text-to-image model to generate candidate images for each text. These images are carefully crafted to adhere to three specific criteria: textual relevance, target consistency, and stance consistency. Next, we introduce a comprehensive evaluation framework to select the optimal image for each text from its generated candidates. Subsequently, we introduce a multimodal stance detection model that leverages both the original textual content and the generated image to identify the author’s stance. Experiments demonstrate the effectiveness of our approach and highlight the importance of artificially generated images for stance detection.
pdf
bib
abs
Hope vs. Hate: Understanding User Interactions with LGBTQ+ News Content in Mainstream US News Media through the Lens of Hope Speech
Jonathan Pofcher
|
Christopher M Homan
|
Randall Sell
|
Ashiqur R. KhudaBukhsh
This paper makes three contributions. First, via a substantial corpus of 1,419,047 comments posted on 3,161 YouTube news videos of major US cable news outlets, we analyze how users engage with LGBTQ+ news content. Our analyses focus both on positive and negative content. In particular, we construct a hope speech classifier that detects positive (hope speech), negative, neutral, and irrelevant content. Second, in consultation with a public health expert specializing on LGBTQ+ health, we conduct an annotation study with a balanced and diverse political representation and release a dataset of 3,750 instances with crowd-sourced labels and detailed annotator demographic information. Finally, beyond providing a vital resource for the LGBTQ+ community, our annotation study and subsequent in-the-wild assessments reveal (1) strong association between rater political beliefs and how they rate content relevant to a marginalized community, (2) models trained on individual political beliefs exhibit considerable in-the-wild disagreement, and (3) zero-shot large language models (LLMs) align more with liberal raters.
pdf
bib
abs
Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs
Andong Hua
|
Kenan Tang
|
Chenhe Gu
|
Jindong Gu
|
Eric Wong
|
Yao Qin
Prompt sensitivity, referring to the phenomenon where paraphrasing (that is, repeating something written or spoken using different words) leads to significant changes in large language model performance, has been widely accepted as a core limitation of large language models. In this work, we revisit this issue and ask: Is the widely reported high prompt sensitivity truly an inherent weakness of large language models, or is it largely an artifact of evaluation processes? To answer this question, we systematically evaluate seven large language models (for example, the GPT and Gemini families) across six benchmarks, including both multiple-choice and open-ended tasks on twelve diverse prompt templates. We find that much of the prompt sensitivity stems from heuristic evaluation methods, including log-likelihood scoring and rigid answer matching, which often overlook semantically correct responses expressed through alternative phrasings, such as synonyms or paraphrases. When we adopt large-language-model-as-a-judge evaluations, we observe a substantial reduction in performance variance and a consistently higher correlation in model rankings across prompts. Our findings suggest that modern large language models are more robust to prompt templates than previously believed, and that prompt sensitivity may be more an artifact of evaluation than a flaw in the models.
pdf
bib
abs
Topic Coverage-based Demonstration Retrieval for In-Context Learning
Wonbin Kweon
|
SeongKu Kang
|
Runchu Tian
|
Pengcheng Jiang
|
Jiawei Han
|
Hwanjo Yu
The effectiveness of in-context learning relies heavily on selecting demonstrations that provide all the necessary information for a given test input. To achieve this, it is crucial to identify and cover fine-grained knowledge requirements. However, prior methods often retrieve demonstrations based solely on embedding similarity or generation probability, resulting in irrelevant or redundant examples. In this paper, we propose TopicK, a topic coverage-based retrieval framework that selects demonstrations to comprehensively cover topic-level knowledge relevant to both the test input and the model. Specifically, TopicK estimates the topics required by the input and assesses the model’s knowledge on those topics. TopicK then iteratively selects demonstrations that introduce previously uncovered required topics, in which the model exhibits low topical knowledge. We validate the effectiveness of TopicK through extensive experiments across various datasets and both open- and closed-source LLMs. Our source code is available at https://github.com/WonbinKweon/TopicK_EMNLP2025.
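The iterative selection described above resembles a weighted greedy coverage procedure; the sketch below is an editorial illustration of that idea under assumed inputs (topic sets per demonstration and per-topic knowledge scores), not the authors' estimators.

```python
# Greedy sketch of topic-coverage-based demonstration selection.
# Scoring functions are illustrative assumptions, not TopicK's estimators.
def select_demos(required_topics: set[str],
                 demo_topics: dict[str, set[str]],
                 model_knowledge: dict[str, float],
                 k: int = 4) -> list[str]:
    """Iteratively pick demos that cover still-uncovered required topics,
    preferring topics where the model's estimated knowledge is low."""
    covered, chosen = set(), []
    for _ in range(k):
        def gain(demo_id):
            new = (demo_topics[demo_id] & required_topics) - covered
            # weight each newly covered topic by how little the model knows it
            return sum(1.0 - model_knowledge.get(t, 0.0) for t in new)
        best = max((d for d in demo_topics if d not in chosen),
                   key=gain, default=None)
        if best is None or gain(best) == 0:
            break
        chosen.append(best)
        covered |= demo_topics[best] & required_topics
    return chosen
```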
pdf
bib
abs
On the Same Wavelength? Evaluating Pragmatic Reasoning in Language Models across Broad Concepts
Linlu Qiu
|
Cedegao E. Zhang
|
Joshua B. Tenenbaum
|
Yoon Kim
|
Roger P. Levy
Language use is shaped by pragmatics—i.e., reasoning about communicative goals and norms in context. As language models (LMs) are increasingly used as conversational agents, it becomes ever more important to understand their pragmatic reasoning abilities. We propose an evaluation framework derived from *Wavelength*, a popular communication game where a speaker and a listener communicate about a broad range of concepts in a granular manner. We study a range of LMs on both language comprehension and language production using direct and Chain-of-Thought (CoT) prompting, and further explore a Rational Speech Act (RSA) approach to incorporating Bayesian pragmatic reasoning into LM inference. We find that state-of-the-art LMs, but not smaller ones, achieve strong performance on language comprehension, obtaining similar-to-human accuracy and exhibiting high correlations with human judgments even without CoT prompting or RSA. On language production, CoT can outperform direct prompting, and using RSA provides significant improvements over both approaches. Our study helps identify the strengths and limitations in LMs’ pragmatic reasoning abilities and demonstrates the potential for improving them with RSA, opening up future avenues for understanding conceptual representation, language understanding, and social reasoning in LMs and humans.
pdf
bib
abs
MuseScorer: Idea Originality Scoring At Scale
Ali Sarosh Bangash
|
Krish Veera
|
Ishfat Abrar Islam
|
Raiyan Abdul Baten
An objective, face-valid method for scoring idea originality is to measure each idea’s statistical infrequency within a population—an approach long used in creativity research. Yet, computing these frequencies requires manually bucketing idea rephrasings, a process that is subjective, labor-intensive, error-prone, and brittle at scale. We introduce MuseScorer, a fully automated, psychometrically validated system for frequency-based originality scoring. MuseScorer integrates a Large Language Model (LLM) with externally orchestrated retrieval: given a new idea, it retrieves semantically similar prior idea-buckets and zero-shot prompts the LLM to judge whether the idea fits an existing bucket or forms a new one. These buckets enable frequency-based originality scoring without human annotation. Across five datasets (1,143 participants, 16,294 ideas), MuseScorer matches human annotators in idea clustering structure (AMI = 0.59) and participant-level scoring (r = 0.89), while demonstrating strong convergent and external validity. The system enables scalable, intent-sensitive, and human-aligned originality assessment for creativity research.
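Once ideas have been assigned to buckets, the frequency-based originality score is straightforward to compute. The sketch below illustrates one common convention (originality as one minus the bucket's relative frequency); this formula is an assumption rather than the paper's exact scoring rule, and the LLM-based bucketing step is omitted entirely.

```python
# Frequency-based originality scoring over pre-assigned idea buckets
# (bucket assignment itself is MuseScorer's contribution and is not shown).
from collections import Counter

def originality_scores(bucket_ids: list[str]) -> dict[str, float]:
    """Score each bucket as 1 - (its relative frequency in the population),
    so rarer ideas receive higher originality scores."""
    counts = Counter(bucket_ids)
    total = len(bucket_ids)
    return {bucket: 1.0 - count / total for bucket, count in counts.items()}

# Example: three rephrasings of one common idea and one rare idea.
print(originality_scores(["umbrella-hat", "umbrella-hat",
                          "umbrella-hat", "solar-kite"]))
```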
pdf
bib
abs
SAFENUDGE: Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs
Joao Fonseca
|
Andrew Bell
|
Julia Stoyanovich
Large Language Models (LLMs) have been shown to be susceptible to jailbreak attacks, or adversarial attacks used to elicit high-risk behavior from a model, highlighting the critical need to safeguard widely-deployed models. Safeguarding approaches, which include fine-tuning models or having LLMs “self-reflect,” may lengthen the inference time of a model, incur a computational penalty, reduce the semantic fluency of an output, and restrict “normal” model behavior. Importantly, these Safety-Performance Trade-offs (SPTs) remain an understudied area. In this work, we make three contributions: First, we introduce SAFENUDGE, a novel safeguard that combines Controlled Text Generation and “nudging.” SAFENUDGE triggers during text generation while a jailbreak attack is being executed, and can reduce successful jailbreak attempts by between 28.1% and 37.3% by guiding the LLM towards a safe response. It adds minimal latency to inference and has a negligible impact on the semantic fluency of outputs. Second, it supports tunable SPTs, meaning practitioners can set their own tolerance for trade-offs balancing safety and restrictions to normal model behavior. Third, we release the source code for SAFENUDGE at https://github.com/joaopfonseca/SafeNudge. It is open source and compatible with the HuggingFace transformers library.
pdf
bib
abs
RaDeR: Reasoning-aware Dense Retrieval Models
Debrup Das
|
Sam O’Nuallain
|
Razieh Rahimi
We propose RaDeR, a set of reasoning-based dense retrieval models trained with data derived from mathematical problem solving using large language models (LLMs). Our method leverages retrieval-augmented reasoning trajectories of an LLM and self-reflective relevance evaluation, enabling the creation of both diverse and hard-negative samples for reasoning-intensive relevance. RaDeR retrievers, trained for mathematical reasoning, effectively generalize to diverse reasoning tasks in the BRIGHT and RAR-b benchmarks, consistently outperforming strong baselines in overall performance. Notably, RaDeR achieves significantly higher performance than baselines on the Math and Coding splits. In addition, RaDeR is the first dense retriever that outperforms BM25 when queries are Chain-of-Thought reasoning steps, underscoring the critical role of reasoning-based retrieval in augmenting reasoning language models. Furthermore, RaDeR achieves comparable or superior performance while using only 2.5% of the training data used by the concurrent work ReasonIR, highlighting the quality of our synthesized training data. Our code, data, and retrieval models are publicly available.
pdf
bib
abs
A Culturally-diverse Multilingual Multimodal Video Benchmark & Model
Bhuiyan Sanjid Shafique
|
Ashmal Vayani
|
Muhammad Maaz
|
Hanoona Abdul Rasheed
|
Dinura Dissanayake
|
Mohammed Irfan Kurpath
|
Yahya Hmaiti
|
Go Inoue
|
Jean Lahoud
|
Md. Safirur Rashid
|
Shadid Intisar Quasem
|
Maheen Fatima
|
Franco Vidal
|
Mykola Maslych
|
Ketan Pravin More
|
Sanoojan Baliah
|
Hasindri Watawana
|
Yuhao Li
|
Fabian Farestam
|
Leon Schaller
|
Roman Tymtsiv
|
Simon Weber
|
Hisham Cholakkal
|
Ivan Laptev
|
Shin’ichi Satoh
|
Michael Felsberg
|
Mubarak Shah
|
Salman Khan
|
Fahad Shahbaz Khan
Large multimodal models (LMMs) have recently gained attention due to their effectiveness in understanding and generating descriptions of visual content. Most existing LMMs are limited to the English language. While a few recent works explore multilingual image LMMs, to the best of our knowledge, moving beyond the English language for cultural and linguistic inclusivity is yet to be investigated in the context of video LMMs. In pursuit of more inclusive video LMMs, we introduce a multilingual Video LMM benchmark, named ViMUL-Bench, to evaluate Video LMMs across 14 languages, including both low- and high-resource languages: Arabic, Bengali, Chinese, English, French, German, Hindi, Japanese, Russian, Sinhala, Spanish, Swedish, Tamil, and Urdu. Our ViMUL-Bench is designed to rigorously test video LMMs across 15 categories, including eight culturally diverse categories ranging from lifestyles and festivals to foods and rituals, and from local landmarks to prominent cultural personalities. ViMUL-Bench comprises both open-ended (short and long-form) and multiple-choice questions spanning various video durations (short, medium, and long) with 8k samples that are manually verified by native language speakers. In addition, we also introduce a machine-translated multilingual video training set comprising 1.2 million samples and develop a simple multilingual video LMM, named ViMUL, that is shown to provide a better tradeoff between high- and low-resource languages for video understanding. We hope our ViMUL-Bench and multilingual video LMM, along with a large-scale multilingual video training set, will help ease future research in developing culturally and linguistically inclusive multilingual video LMMs. Our proposed benchmark, video LMM and training data will be publicly released.
pdf
bib
abs
DRES: Fake news detection by dynamic representation and ensemble selection
Faramarz Farhangian
|
Leandro Augusto Ensina
|
George D C Cavalcanti
|
Rafael M. O. Cruz
The rapid spread of information via social media has made text-based fake news detection critically important due to its societal impact. This paper presents a novel detection method called Dynamic Representation and Ensemble Selection (DRES) for identifying fake news based solely on text. DRES leverages instance hardness measures to estimate the classification difficulty for each news article across multiple textual feature representations. By dynamically selecting the textual representation and the most competent ensemble of classifiers for each instance, DRES significantly enhances prediction accuracy. Extensive experiments show that DRES achieves notable improvements over state-of-the-art methods, confirming the effectiveness of representation selection based on instance hardness and dynamic ensemble selection in boosting performance. Code and data are available at: https://github.com/FFarhangian/FakeNewsDetection_DRES
pdf
bib
abs
A Graph-Theoretical Framework for Analyzing the Behavior of Causal Language Models
Rashin Rahnamoun
|
Mehrnoush Shamsfard
Recent progress in natural language processing has popularized causal language models, but their internal behavior remains poorly understood due to the high cost and reliance on large-scale benchmarks in existing analysis methods. To address these challenges, we introduce a graph-theoretical framework for analyzing causal language models. Our method constructs graphs from model outputs by linking high-probability token transitions and applies classical metrics to capture linguistic features of model behavior. To the best of our knowledge, no previous work has examined or applied graph analysis from this perspective. For the first time, a macroscopic view of the overall behavior of a language model is provided by analyzing the mathematical characteristics of small sample graphs derived from the generated outputs. We first discuss the metrics theoretically, then demonstrate how they work through experiments, followed by some applications of this graph-theoretical framework in natural language processing tasks. Through experiments across training steps and model sizes, we demonstrate that these metrics can reflect model evolution and predict performance with minimal data. We further validate our findings by comparing them with benchmark accuracy scores, highlighting the reliability of our metrics. In contrast to existing evaluation methods, our approach is lightweight, efficient, and especially well-suited for low-resource settings. Our implementation code is available at this GitHub repository.
pdf
bib
abs
Membership and Memorization in LLM Knowledge Distillation
Ziqi Zhang
|
Ali Shahin Shamsabadi
|
Hanxiao Lu
|
Yifeng Cai
|
Hamed Haddadi
Recent advances in Knowledge Distillation (KD) aim to mitigate the high computational demands of Large Language Models (LLMs) by transferring knowledge from a large “teacher” to a smaller “student” model. However, students may inherit the teacher’s privacy when the teacher is trained on private data. In this work, we systematically characterize and investigate membership privacy risks inherent in six LLM KD techniques. Using instruction-tuning settings that span seven NLP tasks, together with three teacher model families (GPT-2, LLAMA-2, and OPT), and student models of various sizes, we demonstrate that all existing LLM KD approaches carry membership and memorization privacy risks from the teacher to its students. However, the extent of privacy risks varies across different KD techniques. We systematically analyse how key LLM KD components (KD objective functions, student training data and NLP tasks) impact such privacy risks. We also demonstrate a significant disagreement between memorization and membership privacy risks of LLM KD techniques. Finally, we characterize per-block privacy risk and demonstrate that the privacy risk varies across different blocks by a large margin.
pdf
bib
abs
Balanced Multi-Factor In-Context Learning for Multilingual Large Language Models
Masahiro Kaneko
|
Alham Fikri Aji
|
Timothy Baldwin
Multilingual large language models (MLLMs) can exploit in-context learning (ICL) to achieve high performance by leveraging cross-lingual knowledge transfer without parameter updates. However, their effectiveness is highly sensitive to example selection, particularly in multilingual settings. Based on the findings of existing work, three key factors influence multilingual ICL: (1) semantic similarity, (2) linguistic alignment, and (3) language-specific performance. However, existing approaches address these factors independently, without explicitly disentangling their combined impact, leaving optimal example selection underexplored. To address this gap, we propose balanced multi-factor ICL (BMF-ICL), a method that quantifies and optimally balances these factors for improved example selection. Experiments on mCSQA and TYDI across four MLLMs demonstrate that BMF-ICL outperforms existing methods. Further analysis highlights the importance of incorporating all three factors and of selecting examples from multiple languages.
pdf
bib
abs
Efficient Context Selection for Long-Context QA: No Tuning, No Iteration, Just Adaptive‐k
Chihiro Taguchi
|
Seiji Maekawa
|
Nikita Bhutani
Retrieval-augmented generation (RAG) and long-context language models (LCLMs) both address context limitations of LLMs in open-domain QA. However, optimal external context to retrieve remains an open problem: fixed retrieval budgets risk wasting tokens or omitting key evidence. Existing adaptive methods like Self-RAG and Self-Route rely on iterative LLM prompting and perform well on factoid QA, but struggle with aggregation QA where optimal context size is unknown and variable. We present Adaptive‐k retrieval, a simple and effective single-pass method that selects a query-specific number of passages by applying a threshold to the similarity scores between the query and candidate passages. It does not require model fine-tuning, extra LLM calls or changes to existing retriever–reader pipelines. On both factoid and aggregation QA benchmarks, Adaptive‐k matches or outperforms fixed‐k baselines while using up to 10x fewer tokens than full-context input, and still retrieves 70% of relevant passages. It improves accuracy across five LCLMs and two embedding models, highlighting that dynamically adjusting context size leads to more efficient and accurate QA.
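The single-pass selection rule described above can be sketched as simple similarity thresholding; the snippet below is an editorial illustration (the threshold value, cap, and cosine-similarity choice are assumptions, not the authors' exact rule).

```python
# Sketch of threshold-based, query-adaptive passage selection in the spirit
# of Adaptive-k: one pass over similarity scores, no model fine-tuning and
# no extra LLM calls.
import numpy as np

def adaptive_k_select(query_emb: np.ndarray,
                      passage_embs: np.ndarray,
                      threshold: float = 0.6,
                      max_k: int = 50) -> list[int]:
    """Return indices of passages whose cosine similarity to the query
    exceeds a threshold, highest-scoring first, capped at max_k."""
    q = query_emb / np.linalg.norm(query_emb)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    sims = p @ q
    keep = np.where(sims >= threshold)[0]
    keep = keep[np.argsort(-sims[keep])][:max_k]
    return keep.tolist()
```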
pdf
bib
abs
Languages Still Left Behind: Toward a Better Multilingual Machine Translation Benchmark
Chihiro Taguchi
|
Seng Mai
|
Keita Kurabe
|
Yusuke Sakai
|
Georgina Agyei
|
Soudabeh Eslami
|
David Chiang
Multilingual machine translation (MT) benchmarks play a central role in evaluating the capabilities of modern MT systems. Among them, the FLORES+ benchmark is widely used, offering English-to-many translation data for over 200 languages, curated with strict quality control protocols. However, we study data in four languages (Asante Twi, Japanese, Jinghpaw, and South Azerbaijani) and uncover critical shortcomings in the benchmark’s suitability for truly multilingual evaluation. Human assessments reveal that many translations fall below the claimed 90% quality standard, and the annotators report that source sentences are often too domain-specific and culturally biased toward the English-speaking world. We further demonstrate that simple heuristics, such as copying named entities, can yield non-trivial BLEU scores, suggesting vulnerabilities in the evaluation protocol. Notably, we show that MT models trained on naturalistic data perform poorly on FLORES+ while achieving significant gains on our domain-relevant evaluation set. Based on these findings, we advocate for multilingual MT benchmarks that use domain-general, named-entity-agnostic, and culturally neutral source texts to better reflect real-world translation challenges.
pdf
bib
abs
Think Globally, Group Locally: Evaluating LLMs Using Multi-Lingual Word Grouping Games
César Guerra-Solano
|
Zhuochun Li
|
Xiang Lorraine Li
Large language models (LLMs) can exhibit biases in reasoning capabilities due to linguistic modality, performing better on tasks in one language versus another, even with similar content. Most previous works evaluate this through reasoning tasks where reliance on strategies or knowledge can ensure success, such as in commonsense or math tasks. However, abstract reasoning is vital to reasoning for everyday life, where people apply “out-of-the-box thinking” to identify and use patterns for solutions, without a reliance on formulaic approaches. Comparatively, little work has evaluated linguistic biases in this task type. In this paper, we propose a task inspired by the New York Times Connections game: GlobalGroup, which evaluates models in an abstract reasoning task across several languages. We constructed a game benchmark with five linguistic backgrounds – English, Spanish, Chinese, Hindi, and Arabic – in both the native language and an English translation for comparison. We also proposed game difficulty measurements to evaluate models on games with similar difficulty, enabling a more controlled comparison, which is particularly important in reasoning evaluations. Through experimentation, we find that English modalities largely lead to better performance in this abstract reasoning task, and we observe performance disparities between open- and closed-source models.
pdf
bib
abs
Pointing to a Llama and Call it a Camel: On the Sycophancy of Multimodal Large Language Models
Renjie Pi
|
Kehao Miao
|
Li Peihang
|
Runtao Liu
|
Jiahui Gao
|
Jipeng Zhang
|
Xiaofang Zhou
Multimodal large language models (MLLMs) have demonstrated extraordinary capabilities in conducting conversations based on image inputs. However, we observe that MLLMs exhibit a pronounced form of visual sycophantic behavior. While similar behavior has also been noted in text-based large language models (LLMs), it becomes significantly more prominent when MLLMs process image inputs. We refer to this phenomenon as the “sycophantic modality gap.” To better understand this issue, we further analyze the factors that contribute to the exacerbation of this gap. To mitigate the visual sycophantic behavior, we first experiment with naive supervised fine-tuning to help the MLLM resist misleading instructions from the user. However, we find that this approach also makes the MLLM overly resistant to corrective instructions (i.e., stubborn even if it is wrong). To alleviate this trade-off, we propose Sycophantic Reflective Tuning (SRT), which enables the MLLM to engage in reflective reasoning, allowing it to determine whether a user’s instruction is misleading or corrective before drawing a conclusion. After applying SRT, we observe a significant reduction in sycophantic behavior toward misleading instructions, without resulting in excessive stubbornness when receiving corrective instructions.
pdf
bib
abs
MR. Judge: Multimodal Reasoner as a Judge
Renjie Pi
|
Haoping Bai
|
Qibin Chen
|
Xiaoming Simon Wang
|
Jiulong Shan
|
Xiaojiang Liu
|
Meng Cao
The paradigm of using Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) as evaluative judges has emerged as an effective approach in RLHF and inference-time scaling. In this work, we propose Multimodal Reasoner as a Judge (MR. Judge), a paradigm for empowering general-purpose MLLM judges with strong reasoning capabilities. Instead of directly assigning scores for each response, we formulate the judgement process as a reasoning-inspired multiple-choice problem. Specifically, the judge model first conducts deliberate reasoning covering different aspects of the responses and eventually selects the best response from them. This reasoning process not only improves the interpretability of the judgement, but also greatly enhances the performance of MLLM judges. To cope with the lack of questions with scored responses, we propose the following strategy to achieve automatic annotation: 1) Reverse Response Candidates Synthesis: starting from a supervised fine-tuning (SFT) dataset, we treat the original response as the best candidate and prompt the MLLM to generate plausible but flawed negative candidates. 2) Text-based reasoning distillation: we carefully design a data synthesis pipeline for distilling the reasoning capability from a text-based reasoning model, which is adopted to enable the MLLM judges to regain complex reasoning ability via warm-up supervised fine-tuning. Experiments demonstrate that our MR. Judge is effective across a wide range of tasks. Specifically, our MR. Judge-7B surpasses GPT-4o by 9.9% on VL-RewardBench, and improves performance on MM-Vet during inference-time scaling by up to 7.7%.
pdf
bib
abs
MobiZO: Enabling Efficient LLM Fine-Tuning at the Edge via Inference Engines
Lei Gao
|
Amir Ziashahabi
|
Yue Niu
|
Salman Avestimehr
|
Murali Annavaram
Large Language Models (LLMs) are currently pre-trained and fine-tuned on large cloud servers. The next frontier is LLM personalization, where a foundation model can be fine-tuned with user/task-specific data. Given the sensitive nature of such private data, it is desirable to fine-tune these models on edge devices to improve user trust. However, fine-tuning on resource-constrained edge devices presents significant challenges due to substantial memory and computational demands, as well as limited infrastructure support. We observe that inference engines (e.g., ExecuTorch) can be repurposed for fine-tuning by leveraging zeroth-order (ZO) optimization, which uses multiple forward passes to approximate gradients. While promising, direct application of ZO methods on edge devices is inefficient due to the high computational cost of multiple forward passes required for accurate gradient estimation, and their deployment has been largely unexplored in practice. We introduce MobiZO, a resource-efficient fine-tuning framework for LLMs specifically designed for edge devices. MobiZO combines three key innovations: (1) a parallelized randomized gradient estimator that employs both outer-loop and inner-loop parallelism to eliminate sequential forward passes, (2) a specialized Multi-Perturbed LoRA (MP-LoRA) module that enables efficient realization of both inner and outer loop parallelism, and (3) a seamless integration with ExecuTorch for on-device training, requiring no modifications to the runtime. Experiments demonstrate that MobiZO achieves substantial runtime speedups and memory savings while improving fine-tuning accuracy, paving the way for practical deployment of LLMs in real-time, on-device applications. Code available at:
https://github.com/leigao97/MobiZO.
pdf
bib
abs
Fann or Flop: A Multigenre, Multiera Benchmark for Arabic Poetry Understanding in LLMs
Wafa Al Ghallabi
|
Ritesh Thawkar
|
Sara Ghaboura
|
Ketan Pravin More
|
Omkar Thawakar
|
Hisham Cholakkal
|
Salman Khan
|
Rao Muhammad Anwer
Arabic poetry stands as one of the most sophisticated and culturally embedded forms of expression in the Arabic language, known for its layered meanings, stylistic diversity, and deep historical continuity. Although large language models (LLMs) have demonstrated strong performance across languages and tasks, their ability to understand Arabic poetry remains largely unexplored. In this work, we introduce “Fann or Flop”, the first benchmark designed to assess the comprehension of Arabic poetry by LLMs across twelve historical eras, covering 21 core poetic genres and a variety of metrical forms, from classical structures to contemporary free verse. The benchmark comprises a curated corpus of poems with explanations that assess semantic understanding, metaphor interpretation, prosodic awareness, and cultural context. We argue that poetic comprehension offers a strong indicator of how well an LLM understands classical Arabic. Unlike surface-level tasks, this domain demands deeper interpretive reasoning and cultural sensitivity. Our evaluation of state-of-the-art LLMs shows that most models struggle with poetic understanding despite strong results on standard Arabic benchmarks. We release “Fann or Flop” along with the evaluation suite as an open-source resource to enable rigorous evaluation and advancement for Arabic-capable language models.
pdf
bib
abs
CoMAT: Chain of Mathematically Annotated Thought Improves Mathematical Reasoning
Joshua Ong Jun Leang
|
Aryo Pradipta Gema
|
Shay B Cohen
Mathematical reasoning remains a significant challenge for large language models (LLMs), despite progress in prompting techniques such as Chain-of-Thought (CoT). We present **Chain of Mathematically Annotated Thought (CoMAT)**, which enhances reasoning through two stages: *Symbolic Conversion* (converting natural language queries into symbolic form) and *Reasoning Execution* (deriving answers from symbolic representations). CoMAT operates entirely with a single LLM and without external solvers. Across four LLMs, CoMAT outperforms traditional CoT on six out of seven benchmarks, achieving gains of 4.48% on MMLU-Redux (MATH) and 4.58% on GaoKao MCQ. In addition to improved performance, CoMAT ensures faithfulness and verifiability, offering a transparent reasoning process for complex mathematical tasks.
pdf
bib
abs
s1: Simple test-time scaling
Niklas Muennighoff
|
Zitong Yang
|
Weijia Shi
|
Xiang Lisa Li
|
Li Fei-Fei
|
Hannaneh Hajishirzi
|
Luke Zettlemoyer
|
Percy Liang
|
Emmanuel Candes
|
Tatsunori Hashimoto
Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI’s o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending “Wait” multiple times to the model’s generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1 exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1 with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1.
pdf
bib
abs
Learning Subjective Label Distributions via Sociocultural Descriptors
Mohammed Fayiz Parappan
|
Ricardo Henao
Subjectivity in NLP tasks, _e.g._, toxicity classification, has emerged as a critical challenge precipitated by the increased deployment of NLP systems in content-sensitive domains. Conventional approaches aggregate annotator judgements (labels), ignoring minority perspectives, and overlooking the influence of the sociocultural context behind such annotations. We propose a framework where subjectivity in binary labels is modeled as an empirical distribution accounting for the variation in annotators through human values extracted from sociocultural descriptors using a language model. The framework also allows for downstream tasks such as population and sociocultural group-level majority label prediction. Experiments on three toxicity datasets covering human-chatbot conversations and social media posts annotated with diverse annotator pools demonstrate that our approach yields well-calibrated toxicity distribution predictions across binary toxicity labels, which are further used for majority label prediction across cultural subgroups, improving over existing methods.
pdf
bib
abs
COM-BOM: Bayesian Exemplar Search for Efficiently Exploring the Accuracy-Calibration Pareto Frontier
Gaoxiang Luo
|
Aryan Deshwal
Selecting an optimal set of exemplars is critical for good performance of in-context learning. However, prior exemplar search methods narrowly optimize for predictive accuracy, critically neglecting model calibration—a key determinant of trustworthiness and safe deployment. In this paper, we formulate exemplar selection as a multi-objective optimization problem, explicitly targeting both the maximization of predictive accuracy and the minimization of expected calibration error. We solve this problem with a sample-efficient Combinatorial Bayesian Optimization algorithm (COM-BOM) to find the Pareto front that optimally trades off the two objectives of accuracy and calibration. We evaluate COM-BOM on multiple tasks from the un-saturated MMLU-Pro benchmark and find that COM-BOM beats or matches the baselines in jointly optimizing the two objectives, while requiring a minimal number of LLM API calls.
pdf
bib
abs
ML-Promise: A Multilingual Dataset for Corporate Promise Verification
Yohei Seki
|
Hakusen Shu
|
Anaïs Lhuissier
|
Hanwool Lee
|
Juyeon Kang
|
Min-Yuh Day
|
Chung-Chi Chen
Promises made by politicians, corporate leaders, and public figures have a significant impact on public perception, trust, and institutional reputation. However, the complexity and volume of such commitments, coupled with difficulties in verifying their fulfillment, necessitate innovative methods for assessing their credibility. This paper introduces the concept of Promise Verification, a systematic approach involving steps such as promise identification, evidence assessment, and the evaluation of timing for verification. We propose the first multilingual dataset, ML-Promise, which includes English, French, Chinese, Japanese, and Korean, aimed at facilitating in-depth verification of promises, particularly in the context of Environmental, Social, and Governance (ESG) reports. Given the growing emphasis on corporate environmental contributions, this dataset addresses the challenge of evaluating corporate promises, especially in light of practices like greenwashing. Our findings also explore textual and image-based baselines, with promising results from retrieval-augmented generation (RAG) approaches. This work aims to foster further discourse on the accountability of public commitments across multiple languages and domains.
pdf
bib
abs
Reading Between the Prompts: How Stereotypes Shape LLM’s Implicit Personalization
Vera Neplenbroek
|
Arianna Bisazza
|
Raquel Fernández
Generative Large Language Models (LLMs) infer users’ demographic information from subtle cues in the conversation — a phenomenon called implicit personalization. Prior work has shown that such inferences can lead to lower quality responses for users assumed to be from minority groups, even when no demographic information is explicitly provided. In this work, we systematically explore how LLMs respond to stereotypical cues using controlled synthetic conversations, by analyzing the models’ latent user representations through both model internals and generated answers to targeted user questions. Our findings reveal that LLMs do infer demographic attributes based on these stereotypical signals, which for a number of groups even persists when the user explicitly identifies with a different demographic group. Finally, we show that this form of stereotype-driven implicit personalization can be effectively mitigated by intervening on the model’s internal representations using a trained linear probe to steer them toward the explicitly stated identity. Our results highlight the need for greater transparency and control in how LLMs represent user identity.
pdf
bib
abs
Paired by the Teacher: Turning Unpaired Data into High-Fidelity Pairs for Low-Resource Text Generation
Yen-Ju Lu
|
Thomas Thebaud
|
Laureano Moro-Velazquez
|
Najim Dehak
|
Jesus Villalba
We present Paired by the Teacher (PbT), a two-stage teacher–student pipeline that synthesizes accurate input–output pairs without human labels or parallel data. In many low-resource natural language generation (NLG) scenarios, practitioners may have only raw outputs, like highlights, recaps, or questions, or only raw inputs, such as articles, dialogues, or paragraphs, but seldom both. This mismatch forces small models to learn from very few examples or rely on costly, broad-scope synthetic examples produced by large LLMs. PbT addresses this by asking a teacher LLM to compress each unpaired example into a concise intermediate representation (IR), and training a student to reconstruct inputs from IRs. This enables outputs to be paired with student-generated inputs, yielding high-quality synthetic data. We evaluate PbT on five benchmarks—document summarization (XSum, CNNDM), dialogue summarization (SAMSum, DialogSum), and question generation (SQuAD)—as well as an unpaired setting on SwitchBoard (paired with DialogSum summaries). An 8B student trained only on PbT data outperforms models trained on 70B teacher-generated corpora and other unsupervised baselines, coming within 1.2 ROUGE-L of human-annotated pairs and closing 82% of the oracle gap at one-third the annotation cost of direct synthesis. Human evaluation on SwitchBoard further confirms that only PbT produces concise, faithful summaries aligned with the target style, highlighting its advantage in generating in-domain sources that avoid the mismatch that limits direct synthesis.
pdf
bib
abs
Please Translate Again: Two Simple Experiments on Whether Human-Like Reasoning Helps Translation
Di Wu
|
Seth Aycock
|
Christof Monz
Large Language Models (LLMs) demonstrate strong reasoning capabilities for many tasks, often by explicitly decomposing the task via Chain-of-Thought (CoT) reasoning. Recent work on LLM-based translation designs hand-crafted prompts to decompose translation, or trains models to incorporate intermediate steps. _Translating Step-by-step_ (Briakou et al., 2024), for instance, introduces a multi-step prompt with decomposition and refinement of translation with LLMs, which achieved state-of-the-art results on WMT24 test data. In this work, we scrutinise this strategy’s effectiveness. Empirically, we find no clear evidence that performance gains stem from explicitly decomposing the translation process via CoT, at least for the models tested; and we show that prompting LLMs to “translate again” and self-refine yields even better results than human-like step-by-step prompting. While the decomposition influences translation behaviour, faithfulness to the decomposition has both positive and negative effects on translation. Our analysis therefore suggests a divergence between the optimal translation strategies for humans and LLMs.
pdf
bib
abs
How Do Large Vision-Language Models See Text in Image? Unveiling the Distinctive Role of OCR Heads
Ingeol Baek
|
Hwan Chang
|
Sunghyun Ryu
|
Hwanhee Lee
Despite significant advancements in Large Vision Language Models (LVLMs), a gap remains, particularly regarding their interpretability and how they locate and interpret textual information within images. In this paper, we explore various LVLMs to identify the specific heads responsible for recognizing text from images, which we term the Optical Character Recognition Head (OCR Head). Our findings regarding these heads are as follows: (1) Less Sparse: Unlike previous retrieval heads, a large number of heads are activated to extract textual information from images. (2) Qualitatively Distinct: OCR heads possess properties that differ significantly from general retrieval heads, exhibiting low similarity in their characteristics. (3) Statically Activated: The frequency of activation for these heads closely aligns with their OCR scores. We validate our findings in downstream tasks by applying Chain-of-Thought (CoT) to both OCR and conventional retrieval heads and by masking these heads. We also demonstrate that redistributing sink-token values within the OCR heads improves performance. These insights provide a deeper understanding of the internal mechanisms LVLMs employ in processing embedded textual information in images.
pdf
bib
abs
Explainability and Interpretability of Multilingual Large Language Models: A Survey
Lucas Resck
|
Isabelle Augenstein
|
Anna Korhonen
Multilingual large language models (MLLMs) demonstrate state-of-the-art capabilities across diverse cross-lingual and multilingual tasks. Their complex internal mechanisms, however, often lack transparency, posing significant challenges in elucidating their internal processing of multilingualism, cross-lingual transfer dynamics and handling of language-specific features. This paper addresses this critical gap by presenting a survey of current explainability and interpretability methods specifically for MLLMs. To our knowledge, it is the first comprehensive review of its kind. Existing literature is categorised according to the explainability techniques employed, the multilingual tasks addressed, the languages investigated and available resources. The survey further identifies key challenges, distils core findings and outlines promising avenues for future research within this rapidly evolving domain.
pdf
bib
abs
Decoding the Rule Book: Extracting Hidden Moderation Criteria from Reddit Communities
Youngwoo Kim
|
Himanshu Beniwal
|
Steven L. Johnson
|
Thomas Hartvigsen
Effective content moderation systems require explicit classification criteria, yet online communities like subreddits often operate with diverse, implicit standards. This work introduces a novel approach to identify and extract these implicit criteria from historical moderation data using an interpretable architecture. We represent moderation criteria as score tables of lexical expressions associated with content removal, enabling systematic comparison across different communities. Our experiments demonstrate that these extracted lexical patterns effectively replicate the performance of neural moderation models while providing transparent insights into decision-making processes. The resulting criteria matrix reveals significant variations in how seemingly shared norms are actually enforced, uncovering previously undocumented moderation patterns including community-specific tolerances for language, features for topical restrictions, and underlying subcategories of the toxic speech classification.
pdf
bib
abs
AcT2I: Evaluating and Improving Action Depiction in Text-to-Image Models
Vatsal Malaviya
|
Agneet Chatterjee
|
Maitreya Patel
|
Yezhou Yang
|
Chitta Baral
Text-to-Image (T2I) models have recently achieved remarkable success in generating images from textual descriptions. However, challenges still persist in accurately rendering complex scenes where actions and interactions form the primary semantic focus. Our key observation in this work is that T2I models frequently struggle to capture nuanced and often implicit attributes inherent in action depiction, leading them to generate images that lack key contextual details. To enable systematic evaluation, we introduce AcT2I, a benchmark designed to evaluate the performance of T2I models in generating images from action-centric prompts. We experimentally validate that leading T2I models do not fare well on AcT2I. We further hypothesize that this shortcoming arises from the incomplete representation of the inherent attributes and contextual dependencies in the training corpora of existing T2I models. We build upon this by developing a training-free knowledge distillation technique utilizing Large Language Models to address this limitation. Specifically, we enhance prompts by incorporating dense information across three dimensions, observing that injecting prompts with temporal details significantly improves image generation accuracy, with our best model achieving an increase of 72%. Our findings highlight the limitations of current T2I methods in generating images that require complex reasoning and demonstrate that integrating linguistic knowledge in a systematic way can notably advance the generation of nuanced and contextually accurate images. Project Page: https://vatsal-malaviya.github.io/AcT2I/
pdf
bib
abs
Assessing French Readability for Adults with Low Literacy: A Global and Local Perspective
Wafa Aissa
|
Thibault Bañeras-Roux
|
Elodie Vanzeveren
|
Lingyun Gao
|
Rodrigo Wilkens
|
Thomas François
This study presents a novel approach to assessing French text readability for adults with low literacy skills, addressing both global (full-text) and local (segment-level) difficulty. We introduce a dataset of 461 texts annotated using a difficulty scale developed specifically for this population. Using this corpus, we conducted a systematic comparison of key readability modeling approaches, including machine learning techniques based on linguistic variables, fine-tuning of CamemBERT, a hybrid approach combining CamemBERT with linguistic variables, and the use of generative language models (LLMs) to carry out readability assessment at both global and local levels.
pdf
bib
abs
LILaC: Late Interacting in Layered Component Graph for Open-domain Multimodal Multihop Retrieval
Joohyung Yun
|
Doyup Lee
|
Wook-Shin Han
Multimodal document retrieval aims to retrieve query-relevant components from documents composed of textual, tabular, and visual elements. An effective multimodal retriever needs to handle two main challenges: (1) mitigate the effect of irrelevant contents caused by fixed, single-granular retrieval units, and (2) support multihop reasoning by effectively capturing semantic relationships among components within and across documents. To address these challenges, we propose LILaC, a multimodal retrieval framework featuring two core innovations. First, we introduce a layered component graph, explicitly representing multimodal information at two layers—each representing coarse and fine granularity—facilitating efficient yet precise reasoning. Second, we develop a late-interaction-based subgraph retrieval method, an edge-based approach that initially identifies coarse-grained nodes for efficient candidate generation, then performs fine-grained reasoning via late interaction. Extensive experiments demonstrate that LILaC achieves state-of-the-art retrieval performance on all five benchmarks, notably without additional fine-tuning. We make the artifacts publicly available at github.com/joohyung00/lilac.
pdf
bib
abs
DiCoRe: Enhancing Zero-shot Event Detection via Divergent-Convergent LLM Reasoning
Tanmay Parekh
|
Kartik Mehta
|
Ninareh Mehrabi
|
Kai-Wei Chang
|
Nanyun Peng
Zero-shot Event Detection (ED), the task of identifying event mentions in natural language text without any training data, is critical for document understanding in specialized domains. Understanding the complex event ontology, extracting domain-specific triggers from the passage, and structuring them appropriately overloads and limits the utility of Large Language Models (LLMs) for zero-shot ED. To this end, we propose DiCoRe, a divergent-convergent reasoning framework that decouples the task of ED using Dreamer and Grounder. Dreamer encourages divergent reasoning through open-ended event discovery, which helps to boost event coverage. Conversely, Grounder introduces convergent reasoning to align the free-form predictions with the task-specific instructions using finite-state machine guided constrained decoding. Additionally, an LLM-Judge verifies the final outputs to ensure high precision. Through extensive experiments on six datasets across five domains and nine LLMs, we demonstrate how DiCoRe consistently outperforms prior zero-shot, transfer-learning, and reasoning baselines, achieving 4–7% average F1 gains over the best baseline – establishing DiCoRe as a strong zero-shot ED framework.
pdf
bib
abs
SNaRe: Domain-aware Data Generation for Low-Resource Event Detection
Tanmay Parekh
|
Yuxuan Dong
|
Lucas Bandarkar
|
Artin Kim
|
I-Hung Hsu
|
Kai-Wei Chang
|
Nanyun Peng
Event Detection (ED) – the task of identifying event mentions from natural language text – is critical for enabling reasoning in highly specialized domains such as biomedicine, law, and epidemiology. Data generation has proven to be effective in broadening its utility to wider applications without requiring expensive expert annotations. However, when existing generation approaches are applied to specialized domains, they struggle with label noise, where annotations are incorrect, and domain drift, characterized by a distributional mismatch between generated sentences and the target domain. To address these issues, we introduce SNaRe, a domain-aware synthetic data generation framework composed of three components: Scout, Narrator, and Refiner. Scout extracts triggers from unlabeled target domain data and curates a high-quality domain-specific trigger list using corpus-level statistics to mitigate domain drift. Narrator, conditioned on these triggers, generates high-quality domain-aligned sentences, and Refiner identifies additional event mentions, ensuring high annotation quality. Experimentation on three diverse domain ED datasets reveals how SNaRe outperforms the best baseline, achieving average F1 gains of 3-7% in the zero-shot/few-shot settings and 4-20% F1 improvement for multilingual generation. Analyzing the generated trigger hit rate and human evaluation substantiates SNaRe’s stronger annotation quality and reduced domain drift.
pdf
bib
abs
Table-R1: Inference-Time Scaling for Table Reasoning Tasks
Zheyuan Yang
|
Lyuhao Chen
|
Arman Cohan
|
Yilun Zhao
In this work, we present the first study to explore inference-time scaling on table reasoning tasks. We develop and evaluate two post-training strategies to enable inference-time scaling: distillation from frontier model reasoning traces and reinforcement learning with verifiable rewards (RLVR). For distillation, we introduce a large-scale dataset of reasoning traces generated by DeepSeek-R1, which we use to fine-tune LLMs into the Table-R1-SFT model. For RLVR, we propose task-specific verifiable reward functions and apply the GRPO algorithm to obtain the Table-R1-Zero model. We evaluate our Table-R1-series models across diverse table reasoning tasks, including short-form QA, fact verification, and free-form QA. Notably, the Table-R1-Zero model matches or exceeds the performance of GPT-4.1 and DeepSeek-R1, while using only a 7B-parameter LLM. It also demonstrates strong generalization to out-of-domain datasets. Extensive ablation and qualitative analyses reveal the benefits of instruction tuning, model architecture choices, and cross-task generalization, as well as emergence of essential table reasoning skills during RL training.
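The abstract mentions task-specific verifiable reward functions for RLVR without spelling them out; as a hedged sketch of what such a reward could look like for short-form table QA, with the normalization rule being an assumption rather than the paper's exact rule:

```python
import re

def normalize(ans: str) -> str:
    """Lowercase, drop punctuation and extra spaces so '1,234.0' and '1234' can match."""
    ans = ans.lower().strip()
    ans = re.sub(r"[^\w\.\-\s]", "", ans)
    return re.sub(r"\s+", " ", ans)

def verifiable_reward(prediction: str, gold: str) -> float:
    """Binary reward: 1.0 on a (numerically or textually) exact match, else 0.0."""
    p, g = normalize(prediction), normalize(gold)
    try:
        return float(abs(float(p) - float(g)) < 1e-6)
    except ValueError:
        return float(p == g)

print(verifiable_reward(" 1234.0 ", "1,234"))  # 1.0
print(verifiable_reward("Paris", "paris"))     # 1.0
```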
pdf
bib
abs
LimRank: Less is More for Reasoning-Intensive Information Reranking
Tingyu Song
|
Yilun Zhao
|
Siyue Zhang
|
Chen Zhao
|
Arman Cohan
Existing approaches typically rely on large-scale fine-tuning to adapt LLMs for information reranking tasks, which is computationally expensive. In this work, we demonstrate that modern LLMs can be effectively adapted using only minimal, high-quality supervision. To enable this, we design LIMRANK-SYNTHESIZER, a reusable and open-source pipeline for generating diverse, challenging, and realistic reranking examples. Using this synthetic data, we fine-tune our reranker model, LIMRANK. We evaluate LIMRANK on two challenging benchmarks, i.e., BRIGHT for reasoning-intensive retrieval and FollowIR for instruction-following retrieval. Our experiments demonstrate that LIMRANK achieves competitive performance, while being trained on less than 5% of the data typically used in prior work. Further ablation studies demonstrate the effectiveness of LIMRANK-SYNTHESIZER and the strong generalization capabilities of LIMRANK across downstream tasks, including scientific literature search and retrieval-augmented generation for knowledge-intensive problem solving.
pdf
bib
abs
PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving
Mihir Parmar
|
Xin Liu
|
Palash Goyal
|
Yanfei Chen
|
Long Le
|
Swaroop Mishra
|
Hossein Mobahi
|
Jindong Gu
|
Zifeng Wang
|
Hootan Nakhost
|
Chitta Baral
|
Chen-Yu Lee
|
Tomas Pfister
|
Hamid Palangi
Recent agent frameworks and inference-time algorithms often struggle with natural planning problems due to limitations in verifying generated plans or reasoning, and the varying complexity of instances within a single task. Many existing methods for these tasks either perform task-level verification without considering constraints or apply inference-time algorithms without adapting to instance-level complexity. To address these limitations, we propose PlanGEN, a model-agnostic and easily scalable agent framework with three key components: constraint, verification, and selection agents. Specifically, our approach proposes constraint-guided iterative verification to enhance the performance of inference-time algorithms–Best of 𝒩, Tree-of-Thought, and REBASE. In the PlanGEN framework, the selection agent optimizes algorithm choice based on instance complexity, ensuring better adaptability to complex planning problems. Experimental results demonstrate significant improvements over the strongest baseline across multiple benchmarks, achieving state-of-the-art results on NATURAL PLAN (~8%↑), OlympiadBench (~4%↑), DocFinQA (~7%↑), and GPQA (~1%↑). Our key finding highlights that constraint-guided iterative verification improves inference-time algorithms, and adaptive selection further boosts performance on complex planning and reasoning problems.
pdf
bib
abs
An Empirical Study on Strong-Weak Model Collaboration for Repo-level Code Generation
Shubham Gandhi
|
Atharva Naik
|
Yiqing Xie
|
Carolyn Rose
We study cost-efficient collaboration between strong and weak language models for repository-level code generation, where the weak model handles simpler tasks at lower cost, and the most challenging tasks are delegated to the strong model. While many works propose architectures for this task, few analyze performance relative to cost. We evaluate a broad spectrum of collaboration strategies: context-based, pipeline-based, and dynamic, on GitHub issue resolution. Our most effective collaborative strategy achieves equivalent performance to the strong model while reducing the cost by 40%. Based on our findings, we offer actionable guidelines for choosing collaboration strategies under varying budget and performance constraints. Our results show that strong–weak collaboration substantially boosts the weak model’s performance at a fraction of the cost, with pipeline- and context-based methods being the most efficient.
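One possible reading of a dynamic strong-weak strategy, sketched with placeholder model calls and a hypothetical test-based escalation criterion (the paper evaluates several strategies; this is not a specific one taken from it):

```python
def resolve_issue(issue, weak_model, strong_model, passes_tests,
                  weak_cost=1.0, strong_cost=10.0):
    """Try the cheaper weak model first; escalate to the strong model only when
    the weak model's patch fails the issue's tests. Returns (patch, cost).
    All callables and cost units here are illustrative placeholders."""
    patch, cost = weak_model(issue), weak_cost
    if passes_tests(issue, patch):
        return patch, cost
    return strong_model(issue), cost + strong_cost  # delegate the hard case
```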
pdf
bib
abs
What are Foundation Models Cooking in the Post-Soviet World?
Anton Lavrouk
|
Tarek Naous
|
Alan Ritter
|
Wei Xu
The culture of the Post-Soviet states is complex, shaped by a turbulent history that continues to influence current events. In this study, we investigate the Post-Soviet cultural food knowledge of foundation models by constructing BORSch, a multi-modal dataset encompassing 1147 and 823 dishes in the Russian and Ukrainian languages, centered around the Post-Soviet region. We demonstrate that leading models struggle to correctly identify the origins of dishes from Post-Soviet nations in both text-only and multi-modal Question Answering (QA), instead over-predicting countries linked to the language the question is asked in. Through analysis of pre-training data, we show that these results can be explained by misleading dish-origin co-occurrences, along with linguistic phenomena such as Russian-Ukrainian code mixing. Finally, to move beyond QA-based assessments, we test models’ abilities to produce accurate visual descriptions of dishes. The weak correlation between this task and QA suggests that QA alone may be insufficient as an evaluation of cultural understanding.
pdf
bib
abs
LogiDynamics: Unraveling the Dynamics of Inductive, Abductive and Deductive Logical Inferences in LLM Reasoning
Tianshi Zheng
|
Cheng Jiayang
|
Chunyang Li
|
Haochen Shi
|
Zihao Wang
|
Jiaxin Bai
|
Yangqiu Song
|
Ginny Wong
|
Simon See
Modern large language models (LLMs) employ diverse logical inference mechanisms for reasoning, making the strategic optimization of these approaches critical for advancing their capabilities. This paper systematically investigates the **comparative dynamics** of inductive (System 1) versus abductive/deductive (System 2) inference in LLMs. We utilize a controlled analogical reasoning environment, varying modality (textual, visual, symbolic), difficulty, and task format (MCQ / free-text). Our analysis reveals that System 2 pipelines generally excel, particularly in visual/symbolic modalities and harder tasks, while System 1 is competitive for textual and easier problems. Crucially, task format significantly influences their relative advantage, with System 1 sometimes outperforming System 2 in free-text rule-execution. These core findings generalize to broader in-context learning. Furthermore, we demonstrate that advanced System 2 strategies like hypothesis selection and iterative refinement can substantially scale LLM reasoning. This study offers foundational insights and actionable guidelines for strategically deploying logical inference to enhance LLM reasoning.
pdf
bib
abs
EcoLoRA: Communication-Efficient Federated Fine-Tuning of Large Language Models
Han Liu
|
Ruoyao Wen
|
Srijith Nair
|
Jia Liu
|
Wenjing Lou
|
Chongjie Zhang
|
William Yeoh
|
Yevgeniy Vorobeychik
|
Ning Zhang
To address data locality and privacy restrictions, Federated Learning (FL) has recently been adopted to fine-tune large language models (LLMs), enabling improved performance on various downstream tasks without requiring aggregated data. However, the repeated exchange of model updates in FL can result in prohibitively high communication costs, hindering the distributed learning process. To address this challenge, we propose EcoLoRA, a novel communication-efficient federated fine-tuning framework for LLMs. Leveraging the modular structure, we propose a round-robin segment sharing scheme, where each client uploads only a complementary LoRA segment per round to reduce network bandwidth. It is further combined with adaptive sparsification methods tailored to LoRA’s training dynamics and lossless encoding techniques. We conduct extensive evaluations on both question-answering and value-alignment tasks across multiple datasets and models. The results show that EcoLoRA significantly reduces communication overhead without compromising performance. For instance, it reduces communication time by up to 79% and total training time by up to 65%.
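A minimal sketch of the round-robin segment sharing idea, assuming the LoRA update is flattened into one vector; the adaptive sparsification, lossless encoding, and server-side averaging mentioned above are omitted:

```python
import numpy as np

def segment_slices(n_params, n_segments):
    """Split a flattened LoRA update into contiguous, near-equal segments."""
    bounds = np.linspace(0, n_params, n_segments + 1, dtype=int)
    return [slice(a, b) for a, b in zip(bounds[:-1], bounds[1:])]

def round_robin_segment(client_id, round_id, lora_update, n_clients):
    """Each client uploads only one complementary segment per round."""
    segs = segment_slices(lora_update.size, n_clients)
    s = segs[(client_id + round_id) % n_clients]
    return s, lora_update[s]

# Toy round: 4 clients jointly cover the whole update, each sending 1/4 of it.
update = np.arange(12.0)
for cid in range(4):
    s, payload = round_robin_segment(cid, round_id=0, lora_update=update, n_clients=4)
    print(cid, s, payload)
```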
pdf
bib
abs
Memorization ≠ Understanding: Do Large Language Models Have the Ability of Scenario Cognition?
Boxiang Ma
|
Ru Li
|
Wang Yuanlong
|
Hongye Tan
|
Xiaoli Li
Driven by vast and diverse textual data, large language models (LLMs) have demonstrated impressive performance across numerous natural language processing (NLP) tasks. Yet, a critical question persists: does their generalization arise from mere memorization of training data or from deep semantic understanding? To investigate this, we propose a bi-perspective evaluation framework to assess LLMs’ scenario cognition—the ability to link semantic scenario elements with their arguments in context. Specifically, we introduce a novel scenario-based dataset comprising diverse textual descriptions of fictional facts, annotated with scenario elements. LLMs are evaluated through their capacity to answer scenario-related questions (model output perspective) and via probing their internal representations for encoded scenario elements-argument associations (internal representation perspective). Our experiments reveal that current LLMs predominantly rely on superficial memorization, failing to achieve robust semantic scenario cognition, even in simple cases. These findings expose critical limitations in LLMs’ semantic understanding and offer cognitive insights for advancing their capabilities.
pdf
bib
abs
Priority on High-Quality: Selecting Instruction Data via Consistency Verification of Noise Injection
Hong Zhang
|
Feng Zhao
|
Ruilin Zhao
|
Cheng Yan
|
Kangzheng Liu
Large Language Models (LLMs) have demonstrated a remarkable understanding of language nuances through instruction tuning, enabling them to effectively tackle various natural language processing tasks. Recent research has focused on the quality of instruction data rather than the quantity of instructions. However, existing high-quality instruction selection methods rely on external models or rules, overlooking the intrinsic association between the pre-trained model and the instruction data, making it difficult to select data that align with the preferences of the pre-trained model. To address this challenge, we propose a strategy that utilizes noise injection to identify the quality of instruction data, without relying on external models. We also implement the strategy of combining inter-class diversity and intra-class diversity to improve model performance. The experimental results demonstrate that our method significantly outperforms the model trained on the entire dataset and established baselines. Our study provides a new perspective on noise injection in the field of instruction tuning, and also illustrates that the pre-trained model itself should be considered when defining high-quality data. Additionally, we publish our selected high-quality instruction data.
pdf
bib
abs
Can Prompts Rewind Time for LLMs? Evaluating the Effectiveness of Prompted Knowledge Cutoffs
Xin Gao
|
Ruiyi Zhang
|
Daniel Du
|
Saurabh Mahindre
|
Sai Ashish Somayajula
|
Pengtao Xie
Large Language Models (LLMs) are widely used for temporal prediction, but their reliance on pretraining data raises contamination concerns, as accurate predictions on pre-cutoff test data may reflect memorization rather than reasoning, leading to an overestimation of their generalization capability. With the recent emergence of prompting-based unlearning techniques, a natural question arises: Can LLMs be prompted to simulate an earlier knowledge cutoff? In this work, we investigate the capability of prompting to simulate earlier knowledge cutoff in LLMs. We construct three evaluation datasets to assess the extent to which LLMs can forget (1) direct factual knowledge, (2) semantic shifts, and (3) causally related knowledge. Results demonstrate that while prompt-based simulated knowledge cutoffs show effectiveness when directly queried with the information after that date, they struggle to induce forgetting when the forgotten content is not directly asked but causally related to the query. These findings highlight the need for more rigorous evaluation settings when applying LLMs for temporal prediction tasks. The full dataset and evaluation code are available at https://github.com/gxx27/time_unlearn.
pdf
bib
abs
DSVD: Dynamic Self-Verify Decoding for Faithful Generation in Large Language Models
YiQiu Guo
|
Yuchen Yang
|
Zhe Chen
|
Pingjie Wang
|
Yusheng Liao
|
Ya Zhang
|
Yanfeng Wang
|
Yu Wang
The reliability of large language models remains a critical challenge, particularly due to their susceptibility to hallucinations and factual inaccuracies during text generation. Existing solutions either underutilize models’ self-correction with preemptive strategies or use costly post-hoc verification. To further explore the potential of real-time self-verification and correction, we present Dynamic Self-Verify Decoding (DSVD), a novel decoding framework that enhances generation reliability through real-time hallucination detection and efficient error correction. DSVD integrates two key components: (1) a parallel self-verification architecture for continuous quality assessment, and (2) a dynamic rollback mechanism for targeted error recovery. Extensive experiments across five benchmarks demonstrate DSVD’s effectiveness, achieving significant improvement in truthfulness (Question-Answering) and factual accuracy (FActScore). Results show that DSVD can be further incorporated with existing faithful decoding methods to achieve stronger performance. Our work establishes that real-time self-verification during generation offers a viable path toward more trustworthy language models without sacrificing practical deployability.
pdf
bib
abs
Metric Calculating Benchmark: Code-Verifiable Complicate Instruction Following Benchmark for Large Language Models
Hyeonseok Moon
|
Seongtae Hong
|
Jaehyung Seo
|
Heuiseok Lim
Recent frontier-level LLMs have saturated many previously difficult benchmarks, leaving little room for further differentiation. This progress highlights the need for challenging benchmarks that provide objective verification. In this paper, we introduce MCBench, a benchmark designed to evaluate whether LLMs can execute string-matching NLP metrics by strictly following step-by-step instructions. Unlike prior benchmarks that depend on subjective judgments or general reasoning, MCBench offers an objective, deterministic and code-verifiable evaluation. This setup allows us to systematically test whether LLMs can maintain accurate step-by-step execution, including instruction adherence, numerical computation, and long-range consistency in handling intermediate results. To ensure objective evaluation of these abilities, we provide a parallel reference code that can evaluate the accuracy of LLM output. We provide three evaluative metrics and three benchmark variants designed to measure the detailed instruction understanding capability of LLMs. Our analyses show that MCBench serves as an effective and objective tool for evaluating the capabilities of cutting-edge LLMs.
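The benchmark pairs each instruction with parallel reference code; as an assumed example of the kind of string-matching metric an LLM might be asked to execute step by step (the actual metrics in MCBench are not listed in the abstract), here is a word-level F1 reference:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Word-level F1: overlap-based precision/recall over whitespace tokens.
    Reference code like this can verify an LLM's step-by-step arithmetic."""
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("the cat sat on the mat", "the cat is on the mat"), 4))  # 0.8333
```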
pdf
bib
abs
Generative Annotation for ASR Named Entity Correction
Yuanchang Luo
|
Daimeng Wei
|
Shaojun Li
|
Hengchao Shang
|
Jiaxin Guo
|
Zongyao Li
|
Zhanglin Wu
|
Xiaoyu Chen
|
Zhiqiang Rao
|
Jinlong Yang
|
Hao Yang
End-to-end automatic speech recognition systems often fail to transcribe domain-specific named entities, causing catastrophic failures in downstream tasks. Numerous fast and lightweight named entity correction (NEC) models have been proposed in recent years. These models, mainly leveraging phonetic-level edit distance algorithms, have shown impressive performance. However, when the forms of the wrongly transcribed word(s) and the ground-truth entity are significantly different, these methods often fail to locate the wrongly transcribed words in the hypothesis, which limits their usage. We propose a novel NEC method that utilizes speech sound features to retrieve candidate entities. With speech sound features and candidate entities, we innovatively design a generative method to annotate entity errors in ASR transcripts and replace the text with correct entities. This method is effective in scenarios with large word-form differences. We test our method using open-source and self-constructed test sets. The results demonstrate that our NEC method brings significant improvements in entity accuracy. We will open-source our self-constructed test set and training data.
pdf
bib
abs
SOLAR: Towards Characterizing Subjectivity of Individuals through Modeling Value Conflicts and Trade-offs
Younghun Lee
|
Dan Goldwasser
Large Language Models (LLMs) not only have solved complex reasoning problems but also exhibit remarkable performance in tasks that require subjective decision-making. Existing studies suggest that LLM generations can convey subjectivity to some extent, yet exploring whether LLMs can account for individual-level subjectivity has not been sufficiently studied. In this paper, we characterize the subjectivity of individuals on social media and infer their moral judgments using LLMs. We propose a framework, SolAr (Subjective Ground with Value Abstraction), that observes value conflicts and trade-offs in the user-generated texts to better represent subjective ground of individuals. Empirical results demonstrate that our framework enhances overall inference performance, with notable improvements for users with limited data and in controversial situations. Additionally, we qualitatively show that SolAr provides explanations about individuals’ value preferences, which can further account for their judgments.
pdf
bib
abs
LogicTree: Structured Proof Exploration for Coherent and Rigorous Logical Reasoning with Large Language Models
Kang He
|
Kaushik Roy
Large language models (LLMs) have achieved remarkable multi-step reasoning capabilities across various domains. However, LLMs still face distinct challenges in complex logical reasoning, as (1) proof-finding requires systematic exploration and the maintenance of logical coherence and (2) searching the right combination of premises at each reasoning step is inherently challenging in tasks with large premise space. To address this, we propose LogicTree, an inference-time modular framework employing algorithm-guided search to automate structured proof exploration and ensure logical coherence. Advancing beyond tree-of-thought (ToT), we incorporate caching mechanism into LogicTree to enable effective utilization of historical knowledge, preventing reasoning stagnation and minimizing redundancy. Furthermore, we address the combinatorial complexity of premise search by decomposing it into a linear process. The refined premise selection restricts subsequent inference to at most one derivation per step, enhancing reasoning granularity and enforcing strict step-by-step reasoning. Additionally, we introduce two LLM-free heuristics for premise prioritization, enabling strategic proof search. Experimental results on five datasets demonstrate that LogicTree optimally scales inference-time computation to achieve higher proof accuracy, surpassing chain-of-thought (CoT) and ToT with average gains of 23.6% and 12.5%, respectively, on GPT-4o. Moreover, within LogicTree, GPT-4o outperforms o3-mini by 7.6% on average.
pdf
bib
abs
Unmasking Fake Careers: Detecting Machine-Generated Career Trajectories via Multi-layer Heterogeneous Graphs
Michiharu Yamashita
|
Thanh Tran
|
Delvin Ce Zhang
|
Dongwon Lee
The rapid advancement of Large Language Models (LLMs) has enabled the generation of highly realistic synthetic data. We identify a new vulnerability, namely LLMs generating convincing career trajectories in fake resumes, and explore effective detection methods. To address this challenge, we construct a dataset of machine-generated career trajectories using LLMs and various methods, and demonstrate that conventional text-based detectors perform poorly on structured career data. We propose CareerScape, a novel heterogeneous, hierarchical multi-layer graph framework that models career entities and their relations in a unified global graph built from genuine resumes. Unlike conventional classifiers that treat each instance independently, CareerScape employs a structure-aware framework that augments user-specific subgraphs with trusted neighborhood information from a global graph, enabling the model to capture both global structural patterns and local inconsistencies indicative of synthetic career paths. Experimental results show that CareerScape outperforms state-of-the-art baselines by 5.8-85.0% relatively, highlighting the importance of structure-aware detection for machine-generated content. Our codebase is available at https://github.com/mickeymst/careerscape.
pdf
bib
abs
GAP: a Global Adaptive Pruning Method for Large Language Models
Zhihua Ban
|
Haotian Ma
|
Siheng Zhang
|
Shengyu Liu
|
Xichen Chen
|
Ming Yang
The deployment of Large Language Models (LLMs) faces significant challenges due to high computational costs, driving the demand for effective pruning techniques. Existing structured pruning methods employ uniform compression rates across network layers, neglecting the varying importance of different network depths. To address this limitation, we propose a novel optimization framework that directly minimizes global capability loss through layer-adaptive pruning rates. The framework formulates the pruning task as a combinatorial optimization problem constrained by a total parameter budget, and an efficient dynamic programming solution is derived to determine optimal layer-wise compression rates. Experiments demonstrate that, when tuning is not included, our approach achieves comparable performance with state-of-the-art methods at high pruning rates (37-50% reduction), and shows significant advantages at low pruning rates (13-25% reduction). When tuning is included, our method achieves the best performance among the compared methods.
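A simplified, knapsack-style dynamic program in the spirit of the formulation described above; the per-layer capability-loss estimates and the exact state space used by GAP are assumptions here, not the paper's implementation:

```python
def optimal_layer_rates(layer_options, budget):
    """layer_options[i] = list of (kept_params, capability_loss) choices for layer i.
    Returns the per-layer choices minimizing total loss subject to
    sum(kept_params) <= budget, via a knapsack-style DP over the remaining budget."""
    dp = {budget: (0.0, [])}  # remaining budget -> (best total loss, choices so far)
    for options in layer_options:
        nxt = {}
        for remaining, (loss, picks) in dp.items():
            for kept, layer_loss in options:
                if kept <= remaining:
                    key = remaining - kept
                    cand = (loss + layer_loss, picks + [(kept, layer_loss)])
                    if key not in nxt or cand[0] < nxt[key][0]:
                        nxt[key] = cand
        dp = nxt
    return min(dp.values(), key=lambda v: v[0]) if dp else None

# Toy example: two layers, budget of 12 parameter units.
layers = [
    [(8, 0.1), (6, 0.4), (4, 0.9)],   # layer 0: keeping more params -> smaller loss
    [(8, 0.2), (6, 0.3), (4, 0.6)],   # layer 1
]
print(optimal_layer_rates(layers, budget=12))
# -> (0.7, [(8, 0.1), (4, 0.6)]): keep more of layer 0, prune layer 1 harder
```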
pdf
bib
abs
Distribution Prompting: Understanding the Expressivity of Language Models Through the Next-Token Distributions They Can Produce
Haojin Wang
|
Zining Zhu
|
Freda Shi
Autoregressive neural language models (LMs) generate a probability distribution over tokens at each time step given a prompt. In this work, we attempt to systematically understand the probability distributions that LMs can produce, showing that some distributions are significantly harder to elicit than others. Specifically, for any target next-token distribution over the vocabulary, we attempt to find a prompt that induces the LM to output a distribution as close as possible to the target, using either soft or hard gradient-based prompt tuning. We find that (1) in general, distributions with very low or very high entropy are easier to approximate than those with moderate entropy; (2) among distributions with the same entropy, those containing “outlier tokens” are easier to approximate; (3) target distributions generated by LMs – even LMs with different tokenizers – are easier to approximate than randomly chosen targets. These results offer insights into the expressiveness of LMs and the challenges of using them as probability distribution proposers.
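A minimal soft-prompt-tuning sketch of the setup described above, using gpt2 as a placeholder model and arbitrary hyperparameters; the hard-prompt variant and the paper's evaluation protocol are not shown:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder small causal LM
model.eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the soft prompt is trained

vocab_size = model.config.vocab_size
embed_dim = model.get_input_embeddings().embedding_dim
prompt_len = 8

# Target next-token distribution (here: random, renormalized).
target = torch.rand(vocab_size)
target = target / target.sum()

# Soft prompt: trainable embeddings fed directly as inputs_embeds.
soft_prompt = torch.nn.Parameter(0.01 * torch.randn(1, prompt_len, embed_dim))
opt = torch.optim.Adam([soft_prompt], lr=1e-2)

for step in range(200):
    out = model(inputs_embeds=soft_prompt)
    log_probs = F.log_softmax(out.logits[:, -1, :], dim=-1)
    loss = F.kl_div(log_probs, target.unsqueeze(0), reduction="batchmean")  # KL(target || model)
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final KL(target || model):", loss.item())
```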
pdf
bib
abs
LGA: LLM-GNN Aggregation for Temporal Evolution Attribute Graph Prediction
Feng Zhao
|
Ruoyu Chai
|
Kangzheng Liu
|
Xianggan Liu
Temporal evolution attribute graph prediction, a key task in graph machine learning, aims to forecast the dynamic evolution of node attributes over time. While recent advances in Large Language Models (LLMs) have enabled their use in enhancing node representations for integration with Graph Neural Networks (GNNs), their potential to directly perform GNN-like aggregation and interaction remains underexplored. Furthermore, traditional approaches to initializing attribute embeddings often disregard structural semantics, limiting the provision of rich prior knowledge to GNNs. Current methods also primarily focus on 1-hop neighborhood aggregation, lacking the capability to capture complex structural interactions. To address these limitations, we propose a novel prediction framework that integrates structural information into attribute embeddings through the introduction of an attribute embedding loss. We design specialized prompts to enable LLMs to perform GNN-like aggregation and incorporate a relation-aware Graph Convolutional Network to effectively capture long-range and complex structural dependencies. Extensive experiments on multiple real-world datasets validate the effectiveness of our approach, demonstrating significant improvements in predictive performance over existing methods.
pdf
bib
abs
EIFBENCH: Extremely Complex Instruction Following Benchmark for Large Language Models
Tao Zou
|
Xinghua Zhang
|
Haiyang Yu
|
Minzheng Wang
|
Fei Huang
|
Yongbin Li
With the development and widespread application of large language models (LLMs), the new paradigm of “Model as Product” is rapidly evolving and demands higher capabilities to address complex user needs, often requiring precise workflow execution which involves the accurate understanding of multiple tasks. However, existing benchmarks, which focus on single-task environments with limited constraints, lack the complexity required to fully reflect these demands. To bridge this gap, we present the Extremely Complex Instruction Following Benchmark (EIFBENCH), meticulously crafted to facilitate a more realistic and robust evaluation of LLMs. EIFBENCH not only includes multi-task scenarios that enable comprehensive assessment across diverse task types concurrently, but also integrates a variety of constraints, replicating complex operational environments. Furthermore, we propose the Segment Policy Optimization (SegPO) algorithm to enhance the LLM’s ability to accurately fulfill multi-task workflows. Evaluations on EIFBENCH have unveiled considerable performance discrepancies in existing LLMs when challenged with these extremely complex instructions. This finding underscores the necessity for ongoing optimization to navigate the intricate challenges posed by real-world LLM applications.
pdf
bib
abs
Tool Preferences in Agentic LLMs are Unreliable
Kazem Faghih
|
Wenxiao Wang
|
Yize Cheng
|
Siddhant Bharti
|
Gaurang Sriramanan
|
Sriram Balasubramanian
|
Parsa Hosseini
|
Soheil Feizi
Large language models (LLMs) can now access a wide range of external tools, thanks to the Model Context Protocol (MCP). This greatly expands their abilities as various agents. However, LLMs rely entirely on the text descriptions of tools to decide which ones to use—a process that is surprisingly fragile. In this work, we expose a vulnerability in prevalent tool/function-calling protocols by investigating a series of edits to tool descriptions, some of which can drastically increase a tool’s usage from LLMs when competing with alternatives. Through controlled experiments, we show that tools with properly edited descriptions receive **over 10 times more usage** from GPT-4.1 and Qwen2.5-7B than tools with original descriptions. We further evaluate how various edits to tool descriptions perform when competing directly with one another and how these trends generalize or differ across a broader set of 17 different models. These phenomena, while giving developers a powerful way to promote their tools, underscore the need for a more reliable foundation for agentic LLMs to select and utilize tools and resources. Our code is publicly available at [https://github.com/kazemf78/llm-unreliable-tool-preferences](https://github.com/kazemf78/llm-unreliable-tool-preferences).
pdf
bib
abs
Enhancing Large Language Model for Knowledge Graph Completion via Structure-Aware Alignment-Tuning
Yu Liu
|
Yanan Cao
|
Xixun Lin
|
Yanmin Shang
|
Shi Wang
|
Shirui Pan
Knowledge graph completion (KGC) aims to infer new knowledge and make predictions from knowledge graphs. Recently, large language models (LLMs) have exhibited remarkable reasoning capabilities. LLM-enhanced KGC methods primarily focus on designing task-specific instructions, achieving promising advancements. However, there are still two critical challenges. First, existing methods often ignore the inconsistent representation spaces between natural language and graph structures. Second, most approaches develop separate instructions for different KGC tasks, leading to duplicate works and time-consuming processes. To address these challenges, we propose SAT, a novel framework that enhances LLMs for KGC via structure-aware alignment-tuning. Specifically, we first introduce hierarchical knowledge alignment to align graph embeddings with the natural language space through multi-task contrastive learning. Then, we propose structural instruction tuning to guide LLMs in performing structure-aware reasoning over KGs, using a unified graph instruction combined with a lightweight knowledge adapter. Experimental results on two KGC tasks across four benchmark datasets demonstrate that SAT significantly outperforms state-of-the-art methods, especially in the link prediction task with improvements ranging from 8.7% to 29.8%.
pdf
bib
abs
MultiDocFusion : Hierarchical and Multimodal Chunking Pipeline for Enhanced RAG on Long Industrial Documents
Joongmin Shin
|
Chanjun Park
|
Jeongbae Park
|
Jaehyung Seo
|
Heuiseok Lim
RAG-based QA has emerged as a powerful method for processing long industrial documents. However, conventional text chunking approaches often neglect complex and long industrial document structures, causing information loss and reduced answer quality. To address this, we introduce MultiDocFusion, a multimodal chunking pipeline that integrates: (i) detection of document regions using vision-based document parsing, (ii) text extraction from these regions via OCR, (iii) reconstruction of document structure into a hierarchical tree using large language model (LLM)-based document section hierarchical parsing (DSHP-LLM), and (iv) construction of hierarchical chunks through DFS-based grouping. Extensive experiments across industrial benchmarks demonstrate that MultiDocFusion improves retrieval precision by 8–15% and ANLS QA scores by 2–3% compared to baselines, emphasizing the critical role of explicitly leveraging document hierarchy for multimodal document-based QA. These significant performance gains underscore the necessity of structure-aware chunking in enhancing the fidelity of RAG-based QA systems.
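The DFS-based grouping step can be illustrated on a toy section tree; the actual DSHP-LLM hierarchy reconstruction and the chunk-size criteria are assumptions here, not the paper's pipeline:

```python
from dataclasses import dataclass, field

@dataclass
class Section:
    title: str
    text: str = ""
    children: list = field(default_factory=list)

def dfs_chunks(node, max_words=120, path=()):
    """Walk the reconstructed section tree depth-first and emit chunks that
    keep the heading path as context; parent text and child leaves become
    separate chunks."""
    path = path + (node.title,)
    own = f"{' > '.join(path)}\n{node.text}".strip()
    if len(own.split()) <= max_words and not node.children:
        return [own]
    chunks = [own] if node.text else []
    for child in node.children:
        chunks.extend(dfs_chunks(child, max_words, path))
    return chunks

doc = Section("Manual", children=[
    Section("Safety", "Wear protective equipment at all times."),
    Section("Installation", children=[
        Section("Mounting", "Fix the bracket with four M6 bolts."),
        Section("Wiring", "Connect the ground wire before powering on."),
    ]),
])
for c in dfs_chunks(doc):
    print("---\n" + c)
```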
pdf
bib
abs
Attention-guided Self-reflection for Zero-shot Hallucination Detection in Large Language Models
Qiang Liu
|
Xinlong Chen
|
Yue Ding
|
Bowen Song
|
Weiqiang Wang
|
Shu Wu
|
Liang Wang
Hallucination has emerged as a significant barrier to the effective application of Large Language Models (LLMs). In this work, we introduce a novel Attention-Guided SElf-Reflection (AGSER) approach for zero-shot hallucination detection in LLMs. The AGSER method utilizes attention contributions to categorize the input query into attentive and non-attentive queries. Each query is then processed separately through the LLMs, allowing us to compute consistency scores between the generated responses and the original answer. The difference between the two consistency scores serves as a hallucination estimator. In addition to its efficacy in detecting hallucinations, AGSER notably reduces computational complexity, requiring only three passes through the LLM and utilizing two sets of tokens. We have conducted extensive experiments with four widely-used LLMs across three different hallucination benchmarks, demonstrating that our approach significantly outperforms existing methods in zero-shot hallucination detection.
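Schematically, the three-pass procedure described above might look as follows; the LLM call and the consistency metric are placeholders, and the attentive/non-attentive split ratio is an assumption:

```python
def agser_score(query_tokens, attn_contrib, original_answer,
                generate, consistency, top_frac=0.5):
    """Split the query into attentive / non-attentive parts by attention
    contribution, re-query the LLM with each, and compare how consistent the
    two responses are with the original answer; the difference of the two
    consistency scores serves as the hallucination estimator.
    `generate` and `consistency` are placeholder callables (hypothetical)."""
    ranked = sorted(range(len(query_tokens)),
                    key=lambda i: attn_contrib[i], reverse=True)
    k = max(1, int(len(ranked) * top_frac))
    attentive = " ".join(query_tokens[i] for i in sorted(ranked[:k]))
    non_attentive = " ".join(query_tokens[i] for i in sorted(ranked[k:]))

    score_att = consistency(generate(attentive), original_answer)      # pass 2
    score_non = consistency(generate(non_attentive), original_answer)  # pass 3
    return score_att - score_non
```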
pdf
bib
abs
‘Rich Dad, Poor Lad’: How do Large Language Models Contextualize Socioeconomic Factors in College Admission ?
Huy Nghiem
|
Phuong-Anh Nguyen-Le
|
John Prindle
|
Rachel Rudinger
|
Hal Daumé Iii
Large Language Models (LLMs) are increasingly involved in high-stakes domains, yet how they reason about socially-sensitive decisions still remains underexplored. We present a large-scale audit of LLMs’ treatment of socioeconomic status (SES) in college admissions decisions using a novel dual-process framework inspired by cognitive science. Leveraging a synthetic dataset of 30,000 applicant profiles grounded in real-world correlations, we prompt 4 open-source LLMs (Qwen 2, Mistral v0.3, Gemma 2, Llama 3.1) under 2 modes: a fast, decision-only setup (System 1) and a slower, explanation-based setup (System 2). Results from 5 million prompts reveal that LLMs consistently favor low-SES applicants—even when controlling for academic performance—and that System 2 amplifies this tendency by explicitly invoking SES as compensatory justification, highlighting both their potential and volatility as decision-makers. We then propose DPAF, a dual-process audit framework to probe LLMs’ reasoning behaviors in sensitive applications.
pdf
bib
abs
Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary
Licheng Pan
|
Yongqi Tong
|
Xin Zhang
|
Xiaolu Zhang
|
Jun Zhou
|
Zhixuan Chu
Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet they often refuse to answer legitimate queries—a phenomenon known as overrefusal. Overrefusal typically stems from over-conservative safety alignment, causing models to treat many reasonable prompts as potentially risky. To systematically understand this issue, we probe and leverage the models’ safety decision boundaries to analyze and mitigate overrefusal. Our findings reveal that overrefusal is closely tied to misalignment at these boundary regions, where models struggle to distinguish subtle differences between benign and harmful content. Building on these insights, we present **RASS**, an automated framework for prompt generation and selection that strategically targets overrefusal prompts near the safety boundary. By harnessing steering vectors in the representation space, **RASS** efficiently identifies and curates boundary-aligned prompts, enabling more effective and targeted mitigation of overrefusal. This approach not only provides a more precise and interpretable view of model safety decisions but also seamlessly extends to multilingual scenarios. We have explored the safety decision boundaries of various LLMs and construct the **MORBench** evaluation set to facilitate robust assessment of model safety and helpfulness across multiple languages. Code and datasets are available at https://github.com/Master-PLC/RASS.
pdf
bib
abs
MMAG: Multimodal Learning for Mucus Anomaly Grading in Nasal Endoscopy via Semantic Attribute Prompting
Xinpan Yuan
|
Mingzhu Huang
|
Liujie Hua
|
Jianuo Ju
|
Xu Zhang
Accurate grading of rhinitis severity in nasal endoscopy relies heavily on the characterization of key secretion types, notably clear nasal discharge (CND) and purulent nasal secretion (PUS). However, both exhibit ambiguous appearance and high structural variability, posing challenges to automated grading under weak supervision. To address this, we propose Multimodal Learning for Mucus Anomaly Grading (MMAG), which integrates structured prompts with rank-aware vision-language modeling for joint detection and grading. Attribute prompts are constructed from clinical descriptors (e.g., secretion type, severity, location) and aligned with multi-level visual features via a dual-branch encoder. During inference, the model localizes mucus anomalies and maps the input image to severity-specific prompts (e.g., “moderate pus”), projecting them into a rank-aware feature space for progressive similarity scoring. Extensive evaluations on CND and PUS datasets show that our method achieves consistent gains over the baseline, improving AUC by 6.31% and 4.79%, and F1 score by 12.85% and 6.03%, respectively. This framework enables interpretable, annotation-efficient, and semantically grounded assessment of rhinitis severity based on mucus anomalies.
pdf
bib
abs
The Emperor’s New Reasoning: Format Imitation Overshadows Genuine Mathematical Understanding in SFT
Linyao Yang
|
Jian-Tao Huang
|
Yafei Lu
|
Zhenhui Jessie Li
|
Guirong Xue
Recent advances in large language models (LLMs) have yielded impressive gains on mathematical reasoning benchmarks via supervised fine-tuning (SFT). However, the brittleness of these models under input perturbations has cast doubt on whether such improvements reflect genuine reasoning abilities or merely superficial alignment with expected output formats. We investigate the mechanisms behind SFT improvements in small-scale LLMs, addressing four key questions: (1) Are performance gains primarily due to format alignment rather than reasoning? (2) Can high-quality supervision encourage genuine reasoning? (3) Does scaling data shift learning from format alignment to deeper reasoning? (4) Are format alignment gains consistent across model sizes and architectures? Through controlled experiments, we find that most performance improvements arise from format alignment rather than genuine reasoning enhancement. Moreover, SFT’s effectiveness is strongly influenced by the alignment between the base model’s inductive biases and the teacher model’s output distribution, rather than the teacher’s raw strength. Finally, scaling up training data offers diminishing returns and does not fundamentally alter the model’s reasoning behavior. These findings suggest that current SFT practices may overestimate the reasoning abilities of LLMs and underscore the need for more rigorous evaluation methods.
pdf
bib
abs
Step Guided Reasoning: Improving Mathematical Reasoning using Guidance Generation and Step Reasoning
Lang Cao
|
Yingtian Zou
|
Chao Peng
|
Renhong Chen
|
Wu Ning
|
Yitong Li
Mathematical reasoning has been challenging for large language models (LLMs), and the introduction of step-by-step Chain-of-Thought (CoT) inference has significantly advanced the mathematical capabilities of LLMs. However, current approaches either necessitate extensive inference datasets for training or depend on few-shot methods that frequently compromise computational accuracy. To address these fundamental limitations, we propose Step Guided Reasoning, a novel training-free adaptation framework that efficiently equips general-purpose pre-trained language models with enhanced mathematical reasoning capabilities. In this approach, LLMs reflect on small reasoning steps, similar to how humans deliberate and focus attention on what to do next. By incorporating this reflective process into the inference stage, LLMs can effectively guide their reasoning from one step to the next. Through extensive experiments, we demonstrate the significant effect of Step Guided Reasoning in enhancing mathematical performance in state-of-the-art language models – Qwen2-72B-Instruct outperforms its math-specific counterpart, Qwen2.5-72B-Math-Instruct, on MMLU-STEM with a score of 90.9%, compared to 87.3%. The average scores of Qwen2-7B-Instruct and Qwen2-72B-Instruct increase from 27.1% to 36.3% and from 36.5% to 47.4% in the math domain, respectively.
pdf
bib
abs
Flexibly Utilize Memory for Long-Term Conversation via a Fragment-then-Compose Framework
Cai Ke
|
Yiming Du
|
Bin Liang
|
Yifan Xiang
|
Lin Gui
|
Zhongyang Li
|
Baojun Wang
|
Yue Yu
|
Hui Wang
|
Kam-Fai Wong
|
Ruifeng Xu
Large language models (LLMs) have made significant breakthroughs in extracting useful information from conversation history to enhance the response in long-term conversations. Summarizing useful information from historical conversations has achieved remarkable performance, which, however, may introduce irrelevant or redundant information, making it difficult to flexibly choose and integrate key information from different sessions during memory retrieval. To address this issue, we propose a Fragment-then-Compose framework, a novel memory utilization approach for long-term open-domain conversation, called *FraCom*. To be specific, inspired by the concept of proposition representation from Cognitive Psychology, we first represent the conversation history as a series of predicates plus arguments for propositional representation to preserve key information useful for memory (“**Fragment**”). Then, we compose propositional graphs for the conversation history based on the connection between shared arguments (“**Compose**”). During retrieval, we retrieve relevant propositions from the graph based on arguments from the current query. This essentially allows for flexible and effective utilization of related information in long-term memory for better response generation towards a query. Experimental results on four long-term open-domain conversation datasets demonstrate the effectiveness of our *FraCom* in memory utilization and its ability to enhance response generation for LLMs.
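A toy illustration of the fragment-then-compose idea, with hand-written propositions standing in for the LLM-extracted ones; the paper's extraction and retrieval are more involved:

```python
from collections import defaultdict

# Propositions (predicate, arguments) fragmented from conversation history.
propositions = [
    ("adopted", ("Alice", "a beagle")),
    ("is_allergic_to", ("Alice", "cats")),
    ("lives_in", ("Alice", "Berlin")),
    ("lives_in", ("Bob", "Berlin")),
]

# Compose: propositions that share an argument get connected.
by_arg = defaultdict(set)
for idx, (_, args) in enumerate(propositions):
    for arg in args:
        by_arg[arg].add(idx)
edges = {(i, j) for ids in by_arg.values() for i in ids for j in ids if i < j}

def retrieve(query_args, limit=5):
    """Pull the propositions whose arguments appear in the current query."""
    hits = sorted({i for arg in query_args for i in by_arg.get(arg, ())})
    return [propositions[i] for i in hits][:limit]

print(sorted(edges))         # 'Alice' links propositions 0-2; 'Berlin' links 2 and 3
print(retrieve(("Alice",)))  # memory relevant to a question about Alice
```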
pdf
bib
abs
STRICT: Stress-Test of Rendering Image Containing Text
Tianyu Zhang
|
Xinyu Wang
|
Lu Li
|
Zhenghan Tai
|
Jijun Chi
|
Jingrui Tian
|
Hailin He
|
Suyuchen Wang
While diffusion models have revolutionized text-to-image generation with their ability to synthesize realistic and diverse scenes, they continue to struggle with generating consistent and legible text within images. This shortcoming is commonly attributed to the locality bias inherent in diffusion-based generation, which limits their capacity to model long-range spatial dependencies. In this paper, we introduce STRICT, a benchmark designed to systematically stress-test the ability of diffusion models to render coherent and instruction-aligned text in images. Our benchmark evaluates models across multiple dimensions: (1) the maximum length of readable text that can be generated and (2) the correctness and legibility of the generated text. We assess several state-of-the-art models, including proprietary and open-source variants, and reveal persistent limitations in long-range consistency and instruction-following capabilities. Our findings provide insights into architectural bottlenecks and motivate future research directions in multimodal generative modeling.
pdf
bib
abs
A Sequential Multi-Stage Approach for Code Vulnerability Detection via Confidence- and Collaboration-based Decision Making
Chung-Nan Tsai
|
Xin Wang
|
Cheng-Hsiung Lee
|
Ching-Sheng Lin
While large language models (LLMs) have shown strong capabilities across diverse domains, their application to code vulnerability detection holds great potential for identifying security flaws and improving software safety. In this paper, we propose a sequential multi-stage approach via confidence- and collaboration-based decision making (ConfColl). The system adopts a three-stage sequential classification framework, proceeding through a single agent, retrieval-augmented generation (RAG) with external examples, and multi-agent reasoning enhanced with RAG. The decision process selects among these strategies to balance performance and cost, with the process terminating at any stage where a high-certainty prediction is achieved. Experiments on a benchmark dataset and a low-resource language demonstrate the effectiveness of our framework in enhancing code vulnerability detection performance.
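The staged, confidence-gated control flow described here can be summarized compactly. The following is a minimal sketch under assumed interfaces: `single_agent`, `rag_agent`, and `multi_agent_rag` are placeholder callables that each return a label with a confidence, and the threshold is an illustrative value rather than the paper's tuned setting.

```python
# Sketch of a three-stage, confidence-gated classifier in the spirit of ConfColl (illustrative).
# Each stage is a placeholder callable returning (label, confidence) for a code snippet.

def detect_vulnerability(code: str, single_agent, rag_agent, multi_agent_rag,
                         threshold: float = 0.9):
    last = None
    for stage in (single_agent, rag_agent, multi_agent_rag):
        label, confidence = stage(code)
        last = (label, confidence, stage.__name__)
        if confidence >= threshold:  # terminate early on a high-certainty prediction
            return last
    return last  # fall back to the most expensive stage's prediction
```

The point of the cascade is cost control: cheap stages handle easy cases, and the expensive multi-agent stage is only reached when earlier predictions are uncertain.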
pdf
bib
abs
Leveraging Large Models to Evaluate Novel Content: A Case Study on Advertisement Creativity
Zhaoyi Joey Hou
|
Adriana Kovashka
|
Xiang Lorraine Li
Evaluating creativity is challenging, even for humans, not only because of its subjectivity but also because it involves complex cognitive processes. Inspired by work in marketing, we attempt to break down visual advertisement creativity into atypicality and originality. With fine-grained human annotations on these dimensions, we propose a suite of tasks specifically for such a subjective problem. We also evaluate the alignment between state-of-the-art (SoTA) vision language models (VLMs) and humans on our proposed benchmark, demonstrating both the promises and challenges of using VLMs for automatic creativity assessment.
pdf
bib
abs
BIRD: Bronze Inscription Restoration and Dating
Wenjie Hua
|
Hoang H Nguyen
|
Gangyan Ge
Bronze inscriptions from early China are fragmentary and difficult to date. We introduce BIRD (Bronze Inscription Restoration and Dating), a fully encoded dataset grounded in standard scholarly transcriptions and chronological labels. We further propose an allograph-aware masked language modeling framework that integrates domain- and task-adaptive pretraining with a Glyph Net (GN), which links graphemes and allographs. Experiments show that GN improves restoration, while glyph-biased sampling yields gains in dating.
pdf
bib
abs
DCP: Dual-Cue Pruning for Efficient Large Vision-Language Models
Lei Jiang
|
Zixun Zhang
|
Yuting Zeng
|
Chunzhao Xie
|
Tongxuan Liu
|
Zhen Li
|
Lechao Cheng
|
Xiaohua Xu
Large Vision-Language Models (LVLMs) achieve remarkable performance in multimodal tasks but suffer from high computational costs due to the large number of visual tokens. Existing pruning methods either apply pruning after visual tokens enter the LLM or perform pre-pruning based solely on visual attention. Both fail to balance efficiency and semantic alignment, as post-pruning incurs redundant computation, while visual-only pre-pruning overlooks multimodal relevance. To address this limitation, we propose Dual-Cue Pruning (DCP), a novel cross-modal pruning framework that jointly considers textual semantics and visual self-attention. DCP consists of a text-aware computation module, which employs a gradient-weighted attention mechanism to enhance text-visual alignment, and an image-aware computation module, which utilizes deep-layer self-attention distributions to retain essential structural information. By integrating both cues, DCP adaptively selects the most informative visual tokens, achieving efficient inference acceleration while maintaining strong task performance. Experimental results show that DCP can retain only 25% of the visual tokens, with a minimal performance degradation of only 0.063% on LLaVA-1.5-13B, demonstrating its effectiveness in balancing efficiency and accuracy.
pdf
bib
abs
Improving Context Fidelity via Native Retrieval-Augmented Reasoning
Suyuchen Wang
|
Jinlin Wang
|
Xinyu Wang
|
Shiqi Li
|
Xiangru Tang
|
Sirui Hong
|
Xiao-Wen Chang
|
Chenglin Wu
|
Bang Liu
Large language models (LLMs) often struggle with context fidelity, producing inconsistent answers when responding to questions based on provided information. Existing approaches either rely on expensive supervised fine-tuning to generate evidence post-answer or train models to perform web searches without necessarily improving utilization of the given context. We propose CARE, a novel native retrieval-augmented reasoning framework that teaches LLMs to explicitly integrate in-context evidence within their reasoning process using the model’s own retrieval capabilities. Our method requires limited labeled evidence data while significantly enhancing both retrieval accuracy and answer generation performance through strategically retrieved in-context tokens in the reasoning chain. Extensive experiments on multiple real-world and counterfactual QA benchmarks demonstrate that our approach substantially outperforms supervised fine-tuning, traditional retrieval-augmented generation methods, and external retrieval solutions. This work represents a fundamental advancement in making LLMs more accurate, reliable, and efficient for knowledge-intensive tasks.
pdf
bib
abs
Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance
Shehzeen Samarah Hussain
|
Paarth Neekhara
|
Xuesong Yang
|
Edresson Casanova
|
Subhankar Ghosh
|
Roy Fejgin
|
Mikyas T. Desta
|
Rafael Valle
|
Jason Li
Autoregressive speech token generation models produce speech with remarkable variety and naturalness but often suffer from hallucinations and undesired vocalizations that do not conform to conditioning inputs. To address these challenges, we introduce Koel-TTS, an encoder-decoder transformer model for multilingual TTS that improves contextual adherence of speech generation LLMs through preference alignment and classifier-free guidance (CFG). For preference alignment, we design a reward system that ranks model outputs using automatic metrics derived from speech recognition and speaker verification models, encouraging generations that better match the input text and speaker identity. CFG further allows fine-grained control over the influence of conditioning inputs during inference by interpolating conditional and unconditional logits. Notably, applying CFG to a preference-aligned model yields additional gains in transcription accuracy and speaker similarity, demonstrating the complementary benefits of both techniques. Koel-TTS achieves state-of-the-art results in zero-shot TTS, outperforming prior LLM-based models on intelligibility, speaker similarity, and naturalness, despite being trained on significantly less data.
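The classifier-free guidance step mentioned above is, in its generic form, a simple interpolation between conditional and unconditional logits. The snippet below shows only that generic formulation; it is not Koel-TTS's implementation, and the guidance scale is an arbitrary example value.

```python
import torch

# Generic classifier-free guidance on next-token logits (not Koel-TTS's exact code):
# interpolate unconditional and conditional logits with a guidance scale.
def cfg_logits(cond_logits: torch.Tensor, uncond_logits: torch.Tensor,
               guidance_scale: float = 2.0) -> torch.Tensor:
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

cond, uncond = torch.randn(1, 1024), torch.randn(1, 1024)
guided = cfg_logits(cond, uncond, guidance_scale=2.0)  # scale 1.0 recovers the conditional logits
```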
pdf
bib
abs
Mixing Inference-time Experts for Enhancing LLM Reasoning
Soumya Sanyal
|
Tianyi Xiao
|
Xiang Ren
Large Language Models (LLMs) have demonstrated impressive reasoning abilities, but their generated rationales often suffer from issues such as reasoning inconsistency and factual errors, undermining their reliability. Prior work has explored improving rationale quality via multi-reward fine-tuning or reinforcement learning (RL), where models are optimized for diverse objectives. While effective, these approaches train the model in a fixed manner and do not have any inference-time adaptability, nor can they generalize reasoning requirements for new test-time inputs. Another approach is to train specialized reasoning experts using reward signals and use them to improve generation at inference time. Existing methods in this paradigm are limited to using only a single expert and cannot improve upon multiple reasoning aspects. To address this, we propose MIXIE, a novel inference-time expert-mixing framework that dynamically determines mixing proportions for each expert, enabling contextualized and flexible fusion. We demonstrate the effectiveness of MIXIE on improving chain-of-thought reasoning in LLMs by merging commonsense and entailment reasoning experts finetuned on reward-filtered data. Our approach outperforms existing baselines on three question-answering datasets: StrategyQA, CommonsenseQA, and ARC, highlighting its potential to enhance LLM reasoning with efficient, adaptable expert integration.
pdf
bib
abs
Reinforced Query Reasoners for Reasoning-intensive Retrieval Tasks
Xubo Qin
|
Jun Bai
|
Jiaqi Li
|
Zixia Jia
|
Zilong Zheng
Traditional information retrieval (IR) methods excel at textual and semantic matching but struggle in reasoning-intensive retrieval tasks that require multi-hop inference or complex semantic understanding between queries and documents. One promising solution is to explicitly rewrite or augment queries using large language models (LLMs) to elicit reasoning-relevant content prior to retrieval. However, the widespread use of large-scale LLMs like GPT-4 or LLaMA3-70B remains impractical due to their high inference cost and limited deployability in real-world systems. In this work, we introduce Reinforced Query Reasoner (RQR), a family of small-scale language models for query reasoning and rewriting in reasoning-intensive retrieval. Our approach frames query reformulation as a reinforcement learning problem and employs a novel semi-rule-based reward function. This enables smaller language models, e.g., Qwen2.5-7B-Instruct and Qwen2.5-1.5B-Instruct, to achieve reasoning performance rivaling large-scale LLMs without their prohibitive inference costs. Experimental results on the BRIGHT benchmark show that, with BM25 as the retriever, both the RQR-7B and RQR-1.5B models significantly outperform existing baselines, including prompt-based query reasoners and some of the latest dense retrievers trained for reasoning-intensive retrieval tasks, offering superior adaptability for real-world deployment. All code and datasets will be publicly released.
pdf
bib
abs
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
Wei Wu
|
Zhuoshi Pan
|
Kun Fu
|
Chao Wang
|
Liyi Chen
|
Yunchu Bai
|
Tianfu Wang
|
Zheng Wang
|
Hui Xiong
Rapid advances in Large Language Models (LLMs) have spurred demand for processing extended context sequences in contemporary applications. However, this progress faces two challenges: performance degradation due to out-of-distribution sequence lengths, and excessively long inference times caused by the quadratic computational complexity of attention. These issues limit LLMs in long-context scenarios. In this paper, we propose Dynamic Token-Level KV Cache Selection (*TokenSelect*), a training-free method for efficient and accurate long-context inference. *TokenSelect* builds upon the observation of non-contiguous attention sparsity, using QK dot products to measure per-head KV Cache criticality at the token level. Through a per-head soft voting mechanism, *TokenSelect* selectively involves a few critical KV cache tokens in attention calculation without sacrificing accuracy. To further accelerate *TokenSelect*, we design the Selection Cache based on observations of consecutive query similarity and implement an efficient Paged Dot Product Kernel, significantly reducing the selection overhead. A comprehensive evaluation of *TokenSelect* demonstrates up to 23.84× speedup in attention computation and up to 2.28× acceleration in end-to-end latency, while providing superior performance compared to state-of-the-art long-context inference methods.
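The token-level selection signal described above can be pictured with a small sketch: per-head QK dot products score each cached token, the per-head scores are normalized and summed as a soft vote, and only the top-scoring tokens join the attention computation. The shapes and selection budget below are illustrative assumptions, not the paper's optimized kernel.

```python
import torch

# Sketch of token-level KV selection via QK dot products and per-head soft voting
# (illustrative, in the spirit of TokenSelect; not the released implementation).
# q: (num_heads, head_dim) current query; K: (num_heads, seq_len, head_dim) cached keys.

def select_kv_tokens(q: torch.Tensor, K: torch.Tensor, k_tokens: int) -> torch.Tensor:
    scores = torch.einsum("hd,hsd->hs", q, K)         # per-head criticality of each cached token
    votes = torch.softmax(scores, dim=-1).sum(dim=0)  # soft vote: sum normalized scores over heads
    return votes.topk(min(k_tokens, votes.numel())).indices  # indices of the most critical tokens

q = torch.randn(8, 64)
K = torch.randn(8, 4096, 64)
kept = select_kv_tokens(q, K, k_tokens=256)  # attend only over these cached positions
```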
pdf
bib
abs
MUSE: MCTS-Driven Red Teaming Framework for Enhanced Multi-Turn Dialogue Safety in Large Language Models
Siyu Yan
|
Long Zeng
|
Xuecheng Wu
|
Chengcheng Han
|
Kongcheng Zhang
|
Chong Peng
|
Xuezhi Cao
|
Xunliang Cai
|
Chenjuan Guo
As large language models (LLMs) become widely adopted, ensuring their alignment with human values is crucial to prevent jailbreaks where adversaries manipulate models to produce harmful content. While most defenses target single-turn attacks, real-world usage often involves multi-turn dialogues, exposing models to attacks that exploit conversational context to bypass safety measures. We introduce MUSE, a comprehensive framework tackling multi-turn jailbreaks from both attack and defense angles. For attacks, we propose MUSE-A, a method that uses frame semantics and heuristic tree search to explore diverse semantic trajectories. For defense, we present MUSE-D, a fine-grained safety alignment approach that intervenes early in dialogues to reduce vulnerabilities. Extensive experiments on various models show that MUSE effectively identifies and mitigates multi-turn vulnerabilities. Code is available at https://anonymous.4open.science/r/MUSE-75F7.
pdf
bib
abs
EnAnchored-X2X: English-Anchored Optimization for Many-to-Many Translation
Sen Yang
|
Yu Bao
|
Yu Lu
|
Jiajun Chen
|
Shujian Huang
|
Shanbo Cheng
Large language models (LLMs) have demonstrated strong machine translation capabilities for English-centric language pairs but underperform in direct non-English (x2x) translation. This work addresses this limitation through a synthetic data generation framework that leverages models’ established English-to-x (en2x) capabilities. By extending English parallel corpora into omnidirectional datasets and developing an English-referenced quality evaluation proxy, we enable effective collection of high-quality x2x training data. Combined with preference-based optimization, our method achieves significant improvement across 72 x2x directions for widely used LLMs, while generalizing to enhance en2x performance. The results demonstrate that strategic exploitation of English-centric strengths can bootstrap comprehensive multilingual translation capabilities in LLMs.
pdf
bib
abs
“I’ve Decided to Leak”: Probing Internals Behind Prompt Leakage Intents
Jianshuo Dong
|
Yutong Zhang
|
Liu Yan
|
Zhenyu Zhong
|
Tao Wei
|
Ke Xu
|
Minlie Huang
|
Chao Zhang
|
Han Qiu
Large language models (LLMs) exhibit prompt leakage vulnerabilities, where they may be coaxed into revealing system prompts embedded in LLM services, raising intellectual property and confidentiality concerns. An intriguing question arises: Do LLMs genuinely internalize prompt leakage intents in their hidden states before generating tokens? In this work, we use probing techniques to capture LLMs’ intent-related internal representations and confirm that the answer is yes. We start by comprehensively inducing prompt leakage behaviors across diverse system prompts, attack queries, and decoding methods. We develop a hybrid labeling pipeline, enabling the identification of broader prompt leakage behaviors beyond mere verbatim leaks. Our results show that a simple linear probe can predict prompt leakage risks from pre-generation hidden states without generating any tokens. Across all tested models, linear probes consistently achieve 90%+ AUROC, even when applied to new system prompts and attacks. Understanding the model internals behind prompt leakage drives practical applications, including intention-based detection of prompt leakage risks. Code is available at: https://github.com/jianshuod/Probing-leak-intents.
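A linear probe of the kind used here is straightforward to set up. The sketch below uses random arrays as stand-ins for pre-generation hidden states and leak labels; in practice the features would be collected from the probed LLM, so the numbers printed here only demonstrate the pipeline, not any real result.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Sketch of a linear probe over pre-generation hidden states (illustrative stand-in data).
# X: last-token hidden states collected before any token is generated;
# y: 1 if the subsequent generation leaked the system prompt, else 0.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(500, 4096)), rng.integers(0, 2, 500)
X_test, y_test = rng.normal(size=(200, 4096)), rng.integers(0, 2, 200)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auroc = roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1])
print(f"probe AUROC: {auroc:.3f}")  # meaningful only with real hidden states and labels
```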
pdf
bib
abs
Nullspace Disentanglement for Red Teaming Language Models
Yi Han
|
Yuanxing Liu
|
Weinan Zhang
|
Ting Liu
With the widespread deployment of generative language models, concerns about safety issues have continuously grown. High-quality fine-tuning data generated from red teaming plays a crucial role in the model’s safety. Recently, automated red teaming approaches have been proposed to create test cases. However, these approaches, which rely on open-ended generation, encounter issues related to inefficiency and low attack success rates. In this work, we introduce a black-box approach that ingeniously exploits the unique properties of the nullspace to disentangle and regulate the crucial success information within test cases. Our study provides a brand-new perspective for automated red team research. Experimental results demonstrate that our approach outperforms baseline methods regarding the attack success rate. The generated test cases also excel in aspects of diversity and fluency.
pdf
bib
abs
Supervised Attention Mechanism for Low-quality Multimodal Data
Sijie Mai
|
Shiqin Han
|
Haifeng Hu
In practical applications, multimodal data are often of low quality, with noisy modalities and missing modalities being typical forms that severely hinder model performance, robustness, and applicability. However, current studies address these issues separately. To this end, we propose a framework for multimodal affective computing that jointly addresses missing and noisy modalities to enhance model robustness in low-quality data scenarios. Specifically, we view missing modality as a special case of noisy modality, and propose a supervised attention framework. In contrast to traditional attention mechanisms that rely on main task loss to update the parameters, we design supervisory signals for the learning of attention weights, ensuring that attention mechanisms can focus on discriminative information and suppress noisy information. We further propose a ranking-based optimization strategy to compare the relative importance of different interactions by adding a ranking constraint for attention weights, avoiding training noise caused by inaccurate absolute labels. The proposed model consistently outperforms state-of-the-art baselines on multiple datasets under the settings of complete modalities, missing modalities, and noisy modalities.
pdf
bib
abs
Reinforcement Learning for Large Language Models via Group Preference Reward Shaping
Huaisheng Zhu
|
Siyuan Xu
|
Hangfan Zhang
|
Teng Xiao
|
Zhimeng Guo
|
Shijie Zhou
|
Shuyue Hu
|
Vasant G. Honavar
Large Language Models (LLMs) require alignment via reinforcement learning (RL) to effectively perform task-specific objectives, such as human preference alignment and enhanced reasoning. While Proximal Policy Optimization (PPO) is widely adopted, its computational overhead, stemming from additional value model requirements, limits applicability. Existing alternatives, like Group Relative Policy Optimization (GRPO), mitigate computational costs but remain sensitive to reward model quality. To address this, we introduce Group Preference Reward Shaping (GPRS), a novel method that leverages preference-based comparisons rather than precise numerical rewards. GPRS requires no extra model components and remains robust across varying reward model sizes and qualities. Extensive experiments demonstrate that GPRS consistently outperforms existing critic-model-free RL algorithms in Reinforcement Learning from Human Feedback (RLHF) and reasoning tasks, providing stable and good alignment performance.
pdf
bib
abs
zFLoRA: Zero-Latency Fused Low-Rank Adapters
Dhananjaya Gowda
|
Seoha Song
|
Harshith Goka
|
Junhyun Lee
Large language models (LLMs) are increasingly deployed with task-specific adapters catering to multiple downstream applications. In such a scenario, the additional compute associated with this apparently insignificant number of adapter parameters (typically less than 1% of the base model) turns out to be disproportionately significant at inference time (up to 2.5x that of the base model). In this paper, we propose a new zero-latency fused low-rank adapter (zFLoRA) that introduces zero or negligible latency overhead on top of the base model. Experimental results on LLMs of size 1B, 3B and 7B show that zFLoRA compares favorably against the popular supervised fine-tuning benchmarks including low-rank adapters (LoRA) as well as full fine-tuning (FFT). Experiments are conducted on 18 different tasks across three different categories namely commonsense reasoning, math reasoning and summary-dialogue. Latency measurements made on NPU (Samsung Galaxy S25+) as well as GPU (NVIDIA H100) platforms show that the proposed zFLoRA adapters introduce zero to negligible latency overhead.
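zFLoRA's specific fusion scheme is not spelled out in this abstract, but the classic baseline trick it competes with, removing adapter latency by folding a trained low-rank update into the base weight offline, can be sketched for contrast. The snippet is a generic LoRA merge under assumed shapes, not the zFLoRA method.

```python
import torch

# Generic LoRA merge (for contrast with zFLoRA, whose fusion differs): fold the trained
# low-rank update into the base weight offline so inference uses a single matmul.
def merge_lora(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
               alpha: float, rank: int) -> torch.Tensor:
    # W: (out, in) base weight, A: (rank, in), B: (out, rank) trained adapter factors
    return W + (alpha / rank) * (B @ A)

W = torch.randn(4096, 4096)
A, B = torch.randn(16, 4096), torch.randn(4096, 16)
W_merged = merge_lora(W, A, B, alpha=32.0, rank=16)
```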
pdf
bib
abs
PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving
Mihir Parmar
|
Palash Goyal
|
Xin Liu
|
Yiwen Song
|
Mingyang Ling
|
Chitta Baral
|
Hamid Palangi
|
Tomas Pfister
Recently, decomposing complex problems into simple subtasks, a crucial part of human-like natural planning, has significantly boosted the performance of large language models (LLMs). However, leveraging such planning structures during post-training to boost the performance of smaller open-source LLMs remains underexplored. Motivated by this, we introduce PLAN-TUNING, a unified post-training framework that (i) distills synthetic task decompositions (termed “planning trajectories”) from large-scale LLMs and (ii) fine-tunes smaller models via supervised and reinforcement-learning objectives designed to mimic these planning processes to improve complex reasoning. On the GSM8k and MATH benchmarks, plan-tuned models outperform strong baselines by an average ~7%. Furthermore, plan-tuned models show better generalization capabilities on out-of-domain datasets, with average ~10% and ~12% performance improvements on OlympiadBench and AIME 2024, respectively. Our detailed analysis demonstrates how planning trajectories improve complex reasoning capabilities, showing that PLAN-TUNING is an effective strategy for improving task-specific performance of smaller LLMs.
pdf
bib
abs
Semantic Inversion, Identical Replies: Revisiting Negation Blindness in Large Language Models
Jinsung Kim
|
Seonmin Koo
|
Heuiseok Lim
Large language models (LLMs) often fail to capture semantic changes in queries due to negation, and generate incorrect responses. Negation frequently exists in the real world and is useful for understanding the opposite or absence of a statement, so it is an essential element in logical reasoning. Previous studies have explored LLMs’ ability to capture negations ‘separately’ from their ability to properly ground knowledge for positive queries. However, this perspective is limited in that it cannot clearly distinguish whether the cause of incorrect responses is the logical incoherence caused by negations or the lack of grounding ability for the given context. To address this issue, we focus on the phenomenon of the model failing to capture semantic contradictions in negated queries despite its accurate understanding of knowledge about positive queries. We term this phenomenon negation blindness on the query. We propose a verification framework that includes task design and measurement methods to verify this issue. In detail, we establish two criteria for systematic task design–i) ‘complexity’ and ii) ‘constrainedness’–and devise four verification tasks accordingly. Moreover, we analyze the results extensively and provide insights into problem alleviation feasibility through experiments on various approaches. Our code and resources can be found at https://www.github.com/jin62304/NegationBlindness.
pdf
bib
abs
AMACE: Automatic Multi-Agent Chart Evolution for Iteratively Tailored Chart Generation
Hyuk Namgoong
|
Jeesu Jung
|
Hyeonseok Kang
|
Yohan Lee
|
Sangkeun Jung
Many statistical facts are conveyed through charts. While various methods have emerged for chart understanding, chart generation typically requires users to manually input code, intent, and other parameters to obtain the desired format on chart generation tools. Recently, the advent of image-generating Large Language Models has facilitated chart generation; however, even this process often requires users to provide numerous constraints for accurate results. In this paper, we propose a loop-based framework for automatically evolving charts in a multi-agent environment. Within this framework, three distinct agents—Chart Code Generator, Chart Replier, and Chart Quality Evaluator—collaborate for iterative, user-tailored chart generation using large language models. Our approach demonstrates a performance improvement of up to 29.97% compared to the first generation, while also reducing generation time by up to 86.9% compared to manual prompt-based methods, showcasing the effectiveness of this multi-agent collaboration in enhancing the quality and efficiency of chart generation.
pdf
bib
abs
ActionStudio: A Lightweight Framework for Data and Training of Large Action Models
Jianguo Zhang
|
Thai Quoc Hoang
|
Ming Zhu
|
Zuxin Liu
|
Shiyu Wang
|
Tulika Manoj Awalgaonkar
|
Akshara Prabhakar
|
Haolin Chen
|
Weiran Yao
|
Zhiwei Liu
|
Juntao Tan
|
Juan Carlos Niebles
|
Shelby Heinecke
|
Huan Wang
|
Silvio Savarese
|
Caiming Xiong
Large Action models are essential for enabling autonomous agents to perform complex tasks. However, training such models remains challenging due to the diversity of agent environments and the complexity of noisy agentic data. Existing infrastructure offers limited support for scalable, agent-specific fine-tuning and standardized agent data processing. We introduce ActionStudio, a lightweight and extensible data and training framework designed for large action models. ActionStudio unifies diverse agent trajectories using our proposed Unified Format 2.0, supports a range of training workflows with optimized multi-node distributed setup, and integrates robust preprocessing and real-time verification tools. ActionStudio demonstrates up to 9× higher throughput compared to existing agentic training frameworks, and our trained models yield top performances across public and realistic agent benchmarks. To support the broader research community, we open-source the ActionStudio framework and release actionstudio-98k, a curated dataset of 98k high-quality trajectories.
pdf
bib
abs
Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety
Seongmin Lee
|
Aeree Cho
|
Grace C. Kim
|
ShengYun Peng
|
Mansi Phute
|
Duen Horng Chau
As large language models (LLMs) see wider real-world use, understanding and mitigating their unsafe behaviors is critical. Interpretation techniques can reveal causes of unsafe outputs and guide safety, but such connections with safety are often overlooked in prior surveys. We present the first survey that bridges this gap, introducing a unified framework that connects safety-focused interpretation methods, the safety enhancements they inform, and the tools that operationalize them. Our novel taxonomy, organized by LLM workflow stages, summarizes nearly 70 works at their intersections. We conclude with open challenges and future directions. This timely survey helps researchers and practitioners navigate key advancements for safer, more interpretable LLMs.
pdf
bib
abs
Unveiling the Response of Large Vision-Language Models to Visually Absent Tokens
Sohee Kim
|
Soohyun Ryu
|
Joonhyung Park
|
Eunho Yang
Large Vision-Language Models (LVLMs) generate contextually relevant responses by jointly interpreting visual and textual inputs. However, our findings reveal that they often mistakenly perceive text inputs lacking visual evidence as being part of the image, leading to erroneous responses. In light of this finding, we probe whether LVLMs possess an internal capability to determine if textual concepts are grounded in the image, and discover a specific subset of Feed-Forward Network (FFN) neurons, termed Visual Absence-aware (VA) neurons, that consistently signal the visual absence through a distinctive activation pattern. Leveraging these patterns, we develop a detection module that systematically classifies whether an input token is visually grounded. Guided by its prediction, we propose a method to refine the outputs by reinterpreting question prompts or replacing the detected absent tokens during generation. Extensive experiments show that our method effectively mitigates the models’ tendency to falsely presume the visual presence of text inputs and that it generalizes across various LVLMs.
pdf
bib
abs
Improving Task Diversity in Label Efficient Supervised Finetuning of LLMs
Abhinav Arabelly
|
Jagrut Nemade
|
Robert D Nowak
|
Jifan Zhang
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, but developing high-performing models for specialized applications often requires substantial human annotation — a process that is time-consuming, labor-intensive, and expensive. In this paper, we address the label-efficient learning problem for supervised finetuning (SFT) by leveraging task-diversity as a fundamental principle for effective data selection. This is markedly different from existing methods based on prompt diversity. Our approach is based on two key observations: 1) task labels for different prompts are often readily available; 2) pre-trained models have significantly varying levels of confidence across tasks. We combine these facts to devise a simple yet effective sampling strategy: we select examples across tasks using an inverse confidence weighting strategy. This produces models comparable to or better than those trained with more complex sampling procedures, while being significantly easier to implement and less computationally intensive. Notably, our experimental results demonstrate that this method can achieve better accuracy than training on the complete dataset (a 4% increase in MMLU score). Across various annotation budgets and two instruction finetuning datasets, our algorithm consistently performs at or above the level of the best existing methods, while reducing annotation costs by up to 80%.
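The inverse-confidence-weighted selection strategy is simple enough to sketch directly. In the snippet below, the task-to-confidence mapping, the example format, and the budget are illustrative assumptions; only the weighting rule itself follows the description above.

```python
import numpy as np

# Sketch of inverse-confidence-weighted sampling across tasks (illustrative).
# `task_confidence` maps each task label to the pre-trained model's average confidence on it;
# examples from low-confidence tasks are sampled more often.
def sample_by_inverse_confidence(examples, task_confidence, budget, seed=0):
    rng = np.random.default_rng(seed)
    weights = np.array([1.0 / task_confidence[ex["task"]] for ex in examples])
    probs = weights / weights.sum()
    idx = rng.choice(len(examples), size=min(budget, len(examples)), replace=False, p=probs)
    return [examples[i] for i in idx]

examples = [{"task": "math", "prompt": "..."}] * 50 + [{"task": "qa", "prompt": "..."}] * 50
budget_set = sample_by_inverse_confidence(examples, {"math": 0.4, "qa": 0.8}, budget=20)
```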
pdf
bib
abs
Look Beyond Feeling: Unveiling Latent Needs from Implicit Expressions for Proactive Emotional Support
Xing Fu
|
Haozhen Li
|
Bichen Wang
|
Hao Yang
|
Yanyan Zhao
|
Bing Qin
In recent years, Large Language Models (LLMs) have made significant progress in emotional support dialogue. However, there are two major challenges for LLM-based support systems. First, users may be hesitant to fully disclose their emotions at the outset. Second, direct probing or excessive questioning can induce discomfort or even resistance. To bridge this gap, we propose COCOON, a proactive emotional support framework that leverages principles of active listening to uncover implicit user needs. We design a multi-stage data curation pipeline and an annotation mechanism for support strategies. Based on this framework, we build COCOON-Llama3, a fine-tuned large language model, and evaluate it using both standard metrics and psychological scales. Experimental results indicate that our model more effectively elicits implicit emotional needs and delivers empathetic support compared to existing baselines, suggesting its utility for building more inclusive emotional support dialogue systems.
pdf
bib
abs
s3: You Don’t Need That Much Data to Train a Search Agent via RL
Pengcheng Jiang
|
Xueqiang Xu
|
Jiacheng Lin
|
Jinfeng Xiao
|
Zifeng Wang
|
Jimeng Sun
|
Jiawei Han
Retrieval-augmented generation (RAG) systems empower large language models (LLMs) to access external knowledge during inference. Recent advances have enabled LLMs to act as search agents via reinforcement learning (RL), improving information acquisition through multi-turn interactions with retrieval engines. However, existing approaches either optimize retrieval using search-only metrics (e.g., NDCG) that ignore downstream utility or fine-tune the entire LLM to jointly reason and retrieve—entangling retrieval with generation and limiting the real search utility and compatibility with frozen or proprietary models. In this work, we propose **s3**, a lightweight, model-agnostic framework that decouples the searcher from the generator and trains the searcher using a Gain Beyond RAG reward: the improvement in generation accuracy over naïve RAG. **s3** requires only 2.4k training samples to outperform baselines trained on over 70× more data, consistently delivering stronger downstream performance across six general QA and five medical QA benchmarks.
pdf
bib
abs
FuseChat: Knowledge Fusion of Chat Models
Fanqi Wan
|
Longguang Zhong
|
Ziyi Yang
|
Ruijun Chen
|
Xiaojun Quan
While training large language models (LLMs) from scratch can indeed lead to models with distinct capabilities and strengths, it incurs substantial costs and may lead to redundancy in competencies. Knowledge fusion aims to integrate existing LLMs of diverse architectures and capabilities into a more potent LLM through lightweight continual training, thereby reducing the need for costly LLM development. In this work, we propose a new framework for the knowledge fusion of chat LLMs through two main stages, resulting in FuseChat. Firstly, we conduct pairwise knowledge fusion on source chat LLMs of varying structures and scales to create multiple target LLMs with identical structure and size via lightweight fine-tuning. During this process, a statistics-based token alignment approach is introduced as the cornerstone for fusing LLMs with different structures. Secondly, we merge these target LLMs within the parameter space, where we propose a novel method for determining the merging coefficients based on the magnitude of parameter updates before and after fine-tuning. We implement and validate FuseChat using six prominent chat LLMs with diverse architectures and scales. Experimental results on two instruction-following benchmarks, AlpacaEval 2.0 and MT-Bench, demonstrate the superiority of FuseChat-7B over baselines of various sizes.
pdf
bib
abs
Continuous-Time Attention: PDE-Guided Mechanisms for Long-Sequence Transformers
Yukun Zhang
|
Xueqing Zhou
We present Continuous-Time Attention, a novel framework that infuses partial differential equations (PDEs) into the Transformer’s attention mechanism to better handle long sequences. Instead of relying on a static attention matrix, we allow attention weights to evolve along a pseudo-time dimension governed by diffusion, wave, or reaction-diffusion dynamics. This dynamic process systematically smooths local noise, strengthens long-range dependencies, and improves gradient stability during training. Our theoretical analysis shows that PDE-driven attention mitigates the exponential decay of distant interactions and improves the optimization landscape. Empirically, Continuous-Time Attention achieves consistent performance gains over both standard and long-sequence Transformer variants across a range of tasks. These results suggest that embedding continuous-time dynamics into attention mechanisms is a promising direction for enhancing global coherence and scalability in Transformer models. Code is publicly available at: https://github.com/XueqingZhou/Continuous-Time-Attention
pdf
bib
abs
Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon
Nurit Cohen Inger
|
Yehonatan Elisha
|
Bracha Shapira
|
Lior Rokach
|
Seffi Cohen
Large language models (LLMs) often appear to excel on public benchmarks, but these high scores may mask an overreliance on dataset-specific surface cues rather than true language understanding. We introduce the **Chameleon Benchmark Overfit Detector (C-BOD)**, a meta-evaluation framework designed to reveal such overfitting. C-BOD systematically rephrases benchmark inputs via a parameterized transformation that preserves semantic content and labels, enabling the detection of performance degradation indicative of superficial pattern reliance. We conduct extensive experiments across two datasets, three rephrasing models, and multiple distortion levels, evaluating 32 state-of-the-art LLMs. On the MMLU benchmark, C-BOD reveals an average performance drop of 2.75% under modest rephrasings, with over 80% of models exhibiting statistically significant differences. Notably, higher-performing models and larger LLMs tend to show greater sensitivity, suggesting a deeper dependence on benchmark-specific phrasing. Due to its dataset and model-agnostic design, C-BOD can be easily integrated into evaluation pipelines and offers a promising foundation for overfitting mitigation strategies. Our findings challenge the community to look beyond leaderboard scores and prioritize resilience and generalization in LLM evaluation. Our code and benchmark datasets are available at: https://github.com/nuritci/cbod
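The meta-evaluation loop reduces to comparing accuracy on original and rephrased inputs. Below is a minimal sketch under assumed interfaces: `answer` is any model callable returning a predicted label, `rephrase` is any semantics-preserving rewriter (e.g., another LLM prompted to paraphrase), and benchmark items are assumed to be dicts with "question" and "label" fields; none of this is the released C-BOD code.

```python
# Sketch of a rephrase-and-compare overfit check (illustrative, in the spirit of C-BOD).
def overfit_gap(benchmark, answer, rephrase):
    orig_correct = sum(answer(item["question"]) == item["label"] for item in benchmark)
    reph_correct = sum(answer(rephrase(item["question"])) == item["label"] for item in benchmark)
    n = len(benchmark)
    # A large positive gap suggests reliance on surface phrasing rather than understanding.
    return orig_correct / n - reph_correct / n
```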
pdf
bib
abs
Memorization or Reasoning? Exploring the Idiom Understanding of LLMs
Jisu Kim
|
Youngwoo Shin
|
Uiji Hwang
|
Jihun Choi
|
Richeng Xuan
|
Taeuk Kim
Idioms have long posed a challenge due to their unique linguistic properties, which set them apart from other common expressions. While recent studies have leveraged large language models (LLMs) to handle idioms across various tasks, e.g., idiom-containing sentence generation and idiomatic machine translation, little is known about the underlying mechanisms of idiom processing in LLMs, particularly in multilingual settings. To this end, we introduce MIDAS, a new large-scale dataset of idioms in six languages, each paired with its corresponding meaning. Leveraging this resource, we conduct a comprehensive evaluation of LLMs’ idiom processing ability, identifying key factors that influence their performance. Our findings suggest that LLMs rely not only on memorization, but also adopt a hybrid approach that integrates contextual cues and reasoning, especially when processing compositional idioms. This implies that idiom understanding in LLMs emerges from an interplay between internal knowledge retrieval and reasoning-based inference.
pdf
bib
abs
RD-MCSA: A Multi-Class Sentiment Analysis Approach Integrating In-Context Classification Rationales and Demonstrations
Haihua Xie
|
Yinzhu Cheng
|
Yaqing Wang
|
Miao He
|
Mingming Sun
This paper addresses the important yet underexplored task of **multi-class sentiment analysis (MCSA)**, which remains challenging due to the subtle semantic differences between adjacent sentiment categories and the scarcity of high-quality annotated data. To tackle these challenges, we propose **RD-MCSA** (**R**ationales and **D**emonstrations-based **M**ulti-**C**lass **S**entiment **A**nalysis), an In-Context Learning (ICL) framework designed to enhance MCSA performance under limited supervision by integrating classification rationales with adaptively selected demonstrations. First, semantically grounded classification rationales are generated from a representative, class-balanced subset of annotated samples selected using a tailored balanced coreset algorithm. These rationales are then paired with demonstrations chosen through a similarity-based mechanism powered by a **multi-kernel Gaussian process (MK-GP)**, enabling large language models (LLMs) to more effectively capture fine-grained sentiment distinctions. Experiments on five benchmark datasets demonstrate that RD-MCSA consistently outperforms both supervised baselines and standard ICL methods across various evaluation metrics.
pdf
bib
abs
Puzzled by Puzzles: When Vision-Language Models Can’t Take a Hint
Heekyung Lee
|
Jiaxin Ge
|
Tsung-Han Wu
|
Minwoo Kang
|
Trevor Darrell
|
David M. Chan
Rebus puzzles, visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution, pose a unique challenge to current vision-language models (VLMs). Unlike traditional image captioning or question answering tasks, rebus solving requires multimodal abstraction, symbolic reasoning, and a grasp of cultural, phonetic and linguistic puns. In this short paper, we investigate the capacity of contemporary VLMs to interpret and solve rebus puzzles by constructing a hand-generated and annotated benchmark of diverse English-language rebus puzzles, ranging from simple pictographic substitutions to spatially-dependent cues (“head” over “heels”). We analyze how different VLMs perform, and our findings reveal that while VLMs exhibit some surprising capabilities in decoding simple visual clues, they struggle significantly with tasks requiring abstract reasoning, lateral thinking, and understanding visual metaphors.
pdf
bib
abs
CREPE: Rapid Chest X-ray Report Evaluation by Predicting Multi-category Error Counts
Gihun Cho
|
Seunghyun Jang
|
Hanbin Ko
|
Inhyeok Baek
|
Chang Min Park
We introduce CREPE (Rapid Chest X-ray Report Evaluation by Predicting Multi-category Error Counts), a rapid, interpretable, and clinically grounded metric for automated chest X-ray report generation. CREPE uses a domain-specific BERT model fine-tuned with a multi-head regression architecture to predict error counts across six clinically meaningful categories. Trained on a large-scale synthetic dataset of 32,000 annotated report pairs, CREPE demonstrates strong generalization and interpretability. On the expert-annotated ReXVal dataset, CREPE achieves a Kendall’s tau correlation of 0.786 with radiologist error counts, outperforming traditional and recent metrics. CREPE achieves these results with an inference speed approximately 280 times faster than large language model (LLM)-based approaches, enabling rapid and fine-grained evaluation for scalable development of chest X-ray report generation models.
pdf
bib
abs
TIDES: Technical Information Discovery and Extraction System
Jihee Kim
|
Subeen Park
|
Hakyung Lee
|
YongTaek Lim
|
Hyo-won Suh
|
Kyungwoo Song
Addressing the challenges in QA for specific technical domains requires identifying relevant portions of extensive documents and generating answers based on this focused content. Traditional pre-trained LLMs often struggle with domain-specific terminology, while fine-tuned LLMs demand substantial computational resources. To overcome these limitations, we propose TIDES, the Technical Information Discovery and Extraction System. TIDES is a training-free approach that combines traditional TF-IDF techniques with prompt-based LLMs in a hybrid process, effectively addressing complex technical questions. It uses TF-IDF to identify and prioritize domain-specific words that are rare in other documents, and LLMs to refine the candidate pool by focusing on the most relevant segments in documents through multiple stages. Our approach improves the precision and efficiency of QA systems in technical contexts without LLM retraining.
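The TF-IDF half of such a hybrid pipeline can be sketched with scikit-learn: segments sharing rare, domain-specific terms with the question score highest and are the only ones passed on to the LLM stage. The segmentation, toy documents, and cut-off below are assumptions for illustration, not the TIDES implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Sketch of TF-IDF segment prioritization for technical QA (illustrative).
def top_segments(segments, question, k=3):
    vec = TfidfVectorizer(sublinear_tf=True)
    seg_matrix = vec.fit_transform(segments)           # rare domain terms get high IDF weight
    q_vec = vec.transform([question])
    scores = (seg_matrix @ q_vec.T).toarray().ravel()  # cosine similarity on L2-normalized TF-IDF
    return [segments[i] for i in np.argsort(scores)[::-1][:k]]

docs = ["The ECU firmware exposes a CAN diagnostic service ...",
        "General introduction to the product line ...",
        "Calibration of the lambda sensor requires ..."]
print(top_segments(docs, "How do I calibrate the lambda sensor?", k=1))
```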
pdf
bib
abs
Learning to Ask: When LLM Agents Meet Unclear Instruction
Wenxuan Wang
|
Shi Juluan
|
Zixuan Ling
|
Yuk-Kit Chan
|
Chaozheng Wang
|
Cheryl Lee
|
Youliang Yuan
|
Jen-tse Huang
|
Wenxiang Jiao
|
Michael R. Lyu
Equipped with the capability to call functions, modern LLM agents can leverage external tools for addressing a range of tasks unattainable through language skills alone. However, the effective execution of these tools relies heavily not just on the advanced capabilities of LLM agents but also on precise user instructions, which often cannot be ensured in the real world. To evaluate the tool-use performance of LLM agents under imperfect instructions, we meticulously examine the real-world instructions queried from users, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench. We find that due to the next-token prediction training objective, LLM agents tend to arbitrarily generate the missing argument, which may lead to hallucinations and risks. To address this issue, we propose a novel framework, Ask-when-Needed, which prompts LLM agents to ask questions to users whenever they encounter obstacles due to unclear instructions. Moreover, to reduce the manual labor involved in user-LLM interaction and assess LLM agents’ performance in tool utilization from both accuracy and efficiency perspectives, we design an automated evaluation tool named ToolEvaluator. Our experiments demonstrate that Ask-when-Needed significantly outperforms existing frameworks for tool learning in the Noisy ToolBench. We will release all related code and datasets to support future research.
pdf
bib
abs
RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction
Yuchi Wang
|
Yishuo Cai
|
Shuhuai Ren
|
Sihan Yang
|
Linli Yao
|
Yuanxin Liu
|
Yuanxing Zhang
|
Pengfei Wan
|
Xu Sun
Image recaptioning is widely used to generate training datasets with enhanced quality for various multimodal tasks. Existing recaptioning methods typically rely on powerful multimodal large language models (MLLMs) to enhance textual descriptions, but often suffer from inaccuracies due to hallucinations and incompleteness caused by missing fine-grained details. To address these limitations, we propose RICO, a novel framework that refines captions through visual reconstruction. Specifically, we leverage a text-to-image model to reconstruct a caption into a reference image, and prompt an MLLM to identify discrepancies between the original and reconstructed images to refine the caption. This process is performed iteratively, progressively promoting the generation of more faithful and comprehensive descriptions. To mitigate the additional computational cost induced by the iterative process, we introduce RICO-Flash, which learns to generate captions like RICO using DPO. Extensive experiments demonstrate that our approach significantly improves caption accuracy and completeness, outperforming most baselines by approximately 10% on both CapsBench and CompreCap.
pdf
bib
abs
StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization
Xuhui Zheng
|
Kang An
|
Ziliang Wang
|
Yuhang Wang
|
Yichao Wu
Efficient multi-hop reasoning requires Large Language Model (LLM)-based agents to iteratively acquire high-value external knowledge. Previous work has explored reinforcement learning (RL) to train LLMs to perform search-based document retrieval, achieving notable improvements in QA performance, but these methods underperform on complex multi-hop QA because they rely only on sparse rewards from a global signal. To address this gap in existing research, we introduce StepSearch, a framework for search LLMs trained with a step-wise proximal policy optimization method. It consists of richer and more detailed intermediate search rewards and token-level process supervision based on information gain and redundancy penalties to better guide each search step. We construct a fine-grained question-answering dataset containing sub-question-level search trajectories, built from open-source datasets through a data pipeline. On standard multi-hop QA benchmarks, StepSearch significantly outperforms global-reward baselines, achieving 11.2% and 4.2% absolute improvements for 3B and 7B models over various search-with-RL baselines using only 19k training data, demonstrating the effectiveness of fine-grained, stepwise supervision in optimizing deep search LLMs. The project is open source at https://github.com/Zillwang/StepSearch
pdf
bib
abs
Dynamic Model-Bank Test-Time Adaptation for Automatic Speech Recognition
Yanshuo Wang
|
Yanghao Zhou
|
Yukang Lin
|
Haoxing Chen
|
Jin Zhang
|
Wentao Zhu
|
Jie Hong
|
Xuesong Li
End-to-end automatic speech recognition (ASR) based on deep learning has achieved impressive progress in recent years. However, the performance of ASR foundation model often degrades significantly on out-of-domain data due to real-world domain shifts. Test-Time Adaptation (TTA) methods aim to mitigate this issue by adapting models during inference without access to source data. Despite recent progress, existing ASR TTA methods often struggle with instability under continual and long-term distribution shifts. To alleviate the risk of performance collapse due to error accumulation, we propose Dynamic Model-bank Single-Utterance Test-time Adaptation (DMSUTA), a sustainable continual TTA framework based on adaptive ASR model ensembling. DMSUTA maintains a dynamic model bank, from which a subset of checkpoints is selected for each test sample based on confidence and uncertainty criteria. To preserve both model plasticity and long-term stability, DMSUTA actively manages the bank by filtering out potentially collapsed models. This design allows DMSUTA to continually adapt to evolving domain shifts in ASR test-time scenarios. Experiments on diverse, continuously shifting ASR TTA benchmarks show that DMSUTA consistently outperforms existing continual TTA baselines, demonstrating superior robustness to domain shifts in ASR.
pdf
bib
abs
Mitigating Catastrophic Forgetting in Large Language Models with Forgetting-aware Pruning
Wei Huang
|
Anda Cheng
|
Yinggui Wang
Recent advancements in large language models (LLMs) have shown impressive capabilities in various downstream tasks but typically face Catastrophic Forgetting (CF) during fine-tuning. In this paper, we propose the Forgetting-Aware Pruning Metric (FAPM), a novel pruning-based approach to balance CF and downstream task performance. Our investigation reveals that the degree to which task vectors (i.e., the subtraction of pre-trained weights from the weights fine-tuned on downstream tasks) overlap with pre-trained model parameters is a critical factor for CF. Based on this finding, FAPM employs the ratio of the task vector to pre-trained model parameters as a metric to quantify CF, integrating this measure into the pruning criteria. Importantly, FAPM does not necessitate modifications to the training process or model architecture, nor does it require any auxiliary data. We conducted extensive experiments across eight datasets, covering natural language inference, General Q&A, Medical Q&A, Math Q&A, reading comprehension, and cloze tests. The results demonstrate that FAPM limits CF to just 0.25% while maintaining 99.67% accuracy on downstream tasks. We provide the codes of FAPM at an anonymous repository(https://anonymous.4open.science/r/FAPM-65CF).
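The abstract's key quantity, the element-wise ratio of the task vector to the pre-trained weights, can be sketched as a pruning score. The particular combination of ratio and magnitude below, and the keep ratio, are assumptions for illustration; the exact FAPM criterion is not reproduced here.

```python
import torch

# Schematic of a forgetting-aware pruning score (illustrative; not FAPM's exact criterion).
# Task vector = fine-tuned weights minus pre-trained weights. Updates that strongly perturb
# pre-trained parameters (large ratio) are down-weighted, and low-scoring updates are reverted.
def forgetting_aware_prune(W_pre: torch.Tensor, W_ft: torch.Tensor,
                           keep_ratio: float = 0.5, eps: float = 1e-8) -> torch.Tensor:
    task_vec = W_ft - W_pre
    ratio = task_vec.abs() / (W_pre.abs() + eps)   # proxy for forgetting pressure
    score = task_vec.abs() / (1.0 + ratio)         # assumed blend of magnitude and ratio
    k = int(keep_ratio * score.numel())
    threshold = score.flatten().kthvalue(score.numel() - k + 1).values
    mask = score >= threshold
    return W_pre + task_vec * mask                 # pruning = reverting masked-out updates

W_pre, W_ft = torch.randn(256, 256), torch.randn(256, 256)
W_pruned = forgetting_aware_prune(W_pre, W_ft, keep_ratio=0.3)
```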
pdf
bib
abs
Does Localization Inform Unlearning? A Rigorous Examination of Local Parameter Attribution for Knowledge Unlearning in Language Models
Hwiyeong Lee
|
Uiji Hwang
|
Hyelim Lim
|
Taeuk Kim
Large language models often retain unintended content, prompting growing interest in knowledge unlearning. Recent approaches emphasize localized unlearning, restricting parameter updates to specific regions in an effort to remove target knowledge while preserving unrelated general knowledge. However, their effectiveness remains uncertain due to the lack of robust and thorough evaluation of the trade-off between the competing goals of unlearning. In this paper, we begin by revisiting existing localized unlearning approaches. We then conduct controlled experiments to rigorously evaluate whether local parameter updates causally contribute to unlearning. Our findings reveal that the set of parameters that must be modified for effective unlearning is not strictly determined, challenging the core assumption of localized unlearning that parameter locality is inherently indicative of effective knowledge removal.
pdf
bib
abs
ArgCMV: An Argument Summarization Benchmark for the LLM-era
Omkar Gurjar
|
Agam Goyal
|
Eshwar Chandrasekharan
Key point (KP) extraction is an important task in argument summarization that involves extracting high-level short summaries from arguments. Existing approaches for KP extraction have been mostly evaluated on the popular ArgKP21 dataset. In this paper, we highlight some of the major limitations of the ArgKP21 dataset and demonstrate the need for new benchmarks that are more representative of actual human conversations. Using SoTA large language models (LLMs), we curate a new argument key point extraction dataset called ArgCMV comprising ∼12K arguments from actual online human debates spread across ∼3K topics. Our dataset exhibits higher complexity than ArgKP21, including longer, co-referencing arguments, a higher presence of subjective discourse units, and a larger range of topics. We show that existing methods do not adapt well to ArgCMV and provide extensive benchmark results by experimenting with existing baselines and the latest open source models. This work introduces a novel KP extraction dataset for long-context online discussions, setting the stage for the next generation of LLM-driven summarization research.
pdf
bib
abs
VistaWise: Building Cost-Effective Agent with Cross-Modal Knowledge Graph for Minecraft
Honghao Fu
|
Junlong Ren
|
Qi Chai
|
Deheng Ye
|
Yujun Cai
|
Hao Wang
Large language models (LLMs) have shown significant promise in embodied decision-making tasks within virtual open-world environments. Nonetheless, their performance is hindered by the absence of domain-specific knowledge. Methods that finetune on large-scale domain-specific data entail prohibitive development costs. This paper introduces VistaWise, a cost-effective agent framework that integrates cross-modal domain knowledge and finetunes a dedicated object detection model for visual analysis. It reduces the requirement for domain-specific training data from millions of samples to a few hundred. VistaWise integrates visual information and textual dependencies into a cross-modal knowledge graph (KG), enabling a comprehensive and accurate understanding of multimodal environments. We also equip the agent with a retrieval-based pooling strategy to extract task-related information from the KG, and a desktop-level skill library to support direct operation of the Minecraft desktop client via mouse and keyboard inputs. Experimental results demonstrate that VistaWise achieves state-of-the-art performance across various open-world tasks, highlighting its effectiveness in reducing development costs while enhancing agent performance.
pdf
bib
abs
GraphKV: Breaking the Static Selection Paradigm with Graph-Based KV Cache Eviction
Xuelin Li
|
Xiangqi Jin
|
Linfeng Zhang
Efficient Key-Value (KV) cache management is essential for processing long text sequences in large language models (LLMs), where memory constraints often limit performance. Conventional KV eviction strategies, such as top-k selection based on attention scores, depend on static heuristics that fail to capture the evolving implicit dependencies among tokens during inference. To overcome this, we propose GraphKV, a graph-based framework that redefines token selection for KV cache compression. In GraphKV, tokens are modeled as nodes with importance scores, and edges represent their similarity relationships. Through a decay-signal-propagation mechanism, token importance is dynamically updated by propagating information across the graph, enabling adaptive retention of the most contextually significant tokens. GraphKV can be seamlessly utilized in existing KV cache eviction methods such as SnapKV and PyramidKV in a plug-and-play manner. Codes are available in the supplementary materials and will be released on Github.
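The decay-signal propagation described above can be pictured as smoothing token importance over a similarity graph before choosing which cache entries to keep. The snippet is an illustrative re-statement under assumed inputs (per-head key vectors and an initial importance signal such as accumulated attention); it is not the released GraphKV code, and the decay and step count are arbitrary.

```python
import torch

# Sketch of graph-based importance propagation for KV eviction (illustrative, GraphKV-style).
# Nodes are cached tokens, edges are key-similarity weights, importance is smoothed by propagation.
def graph_propagated_keep(keys: torch.Tensor, importance: torch.Tensor,
                          keep: int, decay: float = 0.5, steps: int = 2) -> torch.Tensor:
    k_norm = torch.nn.functional.normalize(keys, dim=-1)
    sim = k_norm @ k_norm.T                        # (seq, seq) similarity graph
    adj = torch.softmax(sim, dim=-1)               # row-normalized edge weights
    for _ in range(steps):
        importance = (1 - decay) * importance + decay * (adj @ importance)
    return importance.topk(min(keep, importance.numel())).indices

keys = torch.randn(1024, 64)      # cached key vectors for one head
importance = torch.rand(1024)     # e.g., accumulated attention scores
kept_token_ids = graph_propagated_keep(keys, importance, keep=256)
```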
pdf
bib
abs
Joint Modeling of Entities and Discourse Relations for Coherence Assessment
Wei Liu
|
Michael Strube
In linguistics, coherence can be achieved by different means, such as by maintaining reference to the same set of entities across sentences and by establishing discourse relations between them. However, most existing work on coherence modeling focuses exclusively on either entity features or discourse relation features, with little attention given to combining the two. In this study, we explore two methods for jointly modeling entities and discourse relations for coherence assessment. Experiments on three benchmark datasets show that integrating both types of features significantly enhances the performance of coherence models, highlighting the benefits of modeling both simultaneously for coherence evaluation.
pdf
bib
abs
Understanding and Leveraging the Expert Specialization of Context Faithfulness in Mixture-of-Experts LLMs
Jun Bai
|
Minghao Tong
|
Yang Liu
|
Zixia Jia
|
Zilong Zheng
Context faithfulness is essential for reliable reasoning in context-dependent scenarios. However, large language models often struggle to ground their outputs in the provided context, resulting in irrelevant responses. Inspired by the emergent expert specialization observed in mixture-of-experts architectures, this work investigates whether certain experts exhibit specialization in context utilization—offering a potential pathway toward targeted optimization for improved context faithfulness. To explore this, we propose Router Lens, a method that accurately identifies context-faithful experts. Our analysis reveals that these experts progressively amplify attention to relevant contextual information, thereby enhancing context grounding. Building on this insight, we introduce Context-faithful Expert Fine-Tuning (CEFT), a lightweight optimization approach that selectively fine-tunes context-faithful experts. Experiments across a wide range of benchmarks and models demonstrate that CEFT matches or surpasses the performance of full fine-tuning while being significantly more efficient.
pdf
bib
abs
HMoE: Heterogeneous Mixture of Experts for Language Modeling
An Wang
|
Xingwu Sun
|
Ruobing Xie
|
Shuaipeng Li
|
Jiaqi Zhu
|
Zhen Yang
|
Pinxue Zhao
|
Weidong Han
|
Zhanhui Kang
|
Di Wang
|
Naoaki Okazaki
|
Cheng-zhong Xu
Mixture of Experts (MoE) offers remarkable performance and computational efficiency by selectively activating subsets of model parameters. Traditionally, MoE models use homogeneous experts, each with identical capacity. However, varying complexity in input data necessitates experts with diverse capabilities, while homogeneous MoE hinders effective expert specialization and efficient parameter utilization. In this study, we propose a novel Heterogeneous Mixture of Experts (HMoE) framework, where experts differ in size and thus possess diverse capacities. This heterogeneity allows for more specialized experts to handle varying token complexities more effectively. To address the imbalance in expert activation, we propose a novel training objective that encourages the frequent activation of smaller experts, so as to improve computational efficiency and parameter utilization. Extensive experiments demonstrate that HMoE achieves a lower loss rate with fewer activated parameters and outperforms conventional homogeneous MoE models on various pre-training evaluation benchmarks. Codes will be released upon acceptance.
pdf
bib
abs
The Ranking Blind Spot: Decision Hijacking in LLM-based Text Ranking
Yaoyao Qian
|
Yifan Zeng
|
Yuchao Jiang
|
Chelsi Jain
|
Huazheng Wang
Large Language Models (LLMs) have demonstrated strong performance in information retrieval tasks like passage ranking. Our research examines how instruction-following capabilities in LLMs interact with multi-document comparison tasks, identifying what we term the “Ranking Blind Spot”—a characteristic of LLM decision processes during comparative evaluation. We analyze how this ranking blind spot affects LLM evaluation systems through two approaches: **Decision Objective Hijacking**, which alters the evaluation goal in pairwise ranking systems, and **Decision Criteria Hijacking**, which modifies relevance standards across ranking schemes. These approaches demonstrate how content providers could potentially influence LLM-based ranking systems to affect document positioning. These attacks aim to force the LLM ranker to prefer a specific passage and rank it at the top. Malicious content providers can exploit this weakness, which helps them gain additional exposure by attacking the ranker. In our experiments, we empirically show that the proposed attacks are effective against various LLMs and can be generalized to multiple ranking schemes. We apply these attacks to real-world examples to show their effectiveness. We also find that stronger LLMs are more vulnerable to these attacks.
pdf
bib
abs
Uniform Information Density and Syntactic Reduction: Revisiting *that*-Mentioning in English Complement Clauses
Hailin Hao
|
Elsi Kaiser
Speakers often have multiple ways to express the same meaning. The Uniform Information Density (UID) hypothesis suggests that speakers exploit this variability to maintain a consistent rate of information transmission during language production. Building on prior work linking UID to syntactic reduction, we revisit the finding that the optional complementizer *that* in English complement clauses is more likely to be omitted when the clause has low information density (i.e., more predictable). We advance this line of research by analyzing a large-scale, contemporary conversational corpus and using machine learning and neural language models to refine estimates of information density. Our results replicated the established relationship between information density and *that*-mentioning. However, we found that previous measures of information density based on matrix verbs’ subcategorization probability capture substantial idiosyncratic lexical variation. By contrast, estimates derived from contextual word embeddings account for additional variance in patterns of complementizer usage.
pdf
bib
abs
GRIT: Guided Relational Integration for Efficient Multi-Table Understanding
Yujin Kang
|
Park Seong Woo
|
Yoon-Sik Cho
Recent advances in large language models (LLMs) have opened new possibilities for table-based tasks. However, most existing methods remain confined to single-table settings, limiting their applicability to real-world databases composed of multiple interrelated tables. In multi-table scenarios, LLMs face two key challenges: reasoning over relational structures beyond sequential text, and handling the input length limitations imposed by large-scale table concatenation. To address these issues, we propose Guided Relational Integration for multiple Tables (GRIT), a lightweight method that converts relational schemas into LLM-friendly textual representations. GRIT employs hashing-based techniques to efficiently infer primary–foreign key relationships and constructs prompts that explicitly encode relevant join paths and question-relevant columns. When applied to off-the-shelf LLMs, GRIT consistently improves table-column retrieval performance across diverse multi-table benchmarks while significantly reducing memory and computational overhead.
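A minimal sketch of hashing-based primary–foreign key inference of the kind the abstract describes; the containment threshold and prompt format are our own assumptions, not GRIT's actual implementation.

```python
def column_signature(values):
    """Hash distinct cell values so containment checks avoid storing raw data."""
    return {hash(str(v)) for v in values if v is not None}

def infer_fk_pairs(tables, min_coverage=0.95):
    """tables: {table: {column: list_of_values}}. Returns (fk_col, pk_col) pairs."""
    sigs = {(t, c): column_signature(vals)
            for t, cols in tables.items() for c, vals in cols.items()}
    pairs = []
    for (ft, fc), fsig in sigs.items():
        for (pt, pc), psig in sigs.items():
            if ft == pt or not fsig:
                continue
            # A column looks like a foreign key if (almost) all of its hashed
            # values appear in a unique-valued column of another table.
            unique_pk = len(psig) == len(tables[pt][pc])
            coverage = len(fsig & psig) / len(fsig)
            if unique_pk and coverage >= min_coverage:
                pairs.append((f"{ft}.{fc}", f"{pt}.{pc}"))
    return pairs

def render_schema_prompt(pairs):
    """Turn inferred join paths into an LLM-friendly prompt fragment."""
    return "\n".join(["Join paths:"] + [f"- {fk} references {pk}" for fk, pk in pairs])

tables = {
    "orders": {"order_id": [1, 2, 3], "cust_id": [10, 10, 11]},
    "customers": {"cust_id": [10, 11, 12], "name": ["Ann", "Bo", "Cy"]},
}
print(render_schema_prompt(infer_fk_pairs(tables)))
```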
pdf
bib
abs
RPDR: A Round-trip Prediction-Based Data Augmentation Framework for Long-Tail Question Answering
Yiming Zhang
|
Siyue Zhang
|
Junbo Zhao
|
Chen Zhao
Long-tail question answering presents significant challenges for large language models (LLMs) due to their limited ability to acquire and accurately recall less common knowledge. Retrieval-augmented generation (RAG) systems have shown great promise in mitigating this limitation by integrating external retrieval mechanisms. However, dense retrieval models often face the same difficulties when generalizing to rare or niche knowledge. In this study, we introduce RPDR, a novel data augmentation framework that selects high-quality easy-to-learn training data, to enhance dense retrievers. Our approach is built around three core components: synthetic data generation, data selection with Round-Trip prediction to identify easy-to-learn instances, and retriever training with these instances. We evaluate RPDR on two long-tail retrieval benchmarks, PopQA and EntityQuestion, demonstrating substantial improvements over existing retrievers like BM25 and Contriever, especially on extremely long-tail categories. We identify the strengths and limitations of RPDR through detailed human analysis and propose a routing mechanism that dynamically routes queries to specialized retrieval modules to further improve retrieval performance.
pdf
bib
abs
Discrepancy Detection at the Data Level: Toward Consistent Multilingual Question Answering
Lorena Calvo-Bartolomé
|
Valérie Aldana
|
Karla Cantarero
|
Alonso Madroñal de Mesa
|
Jerónimo Arenas-García
|
Jordan Lee Boyd-Graber
Multilingual question answering (QA) systems must ensure factual consistency across languages, especially for objective queries such as What is jaundice?, while also accounting for cultural variation in subjective responses. We propose MIND, a user-in-the-loop fact-checking pipeline to detect factual and cultural discrepancies in multilingual QA knowledge bases. MIND highlights divergent answers to culturally sensitive questions (e.g., Who assists in childbirth?) that vary by region and context. We evaluate MIND on a bilingual QA system in the maternal and infant health domain and release a dataset of bilingual questions annotated for factual and cultural inconsistencies. We further test MIND on datasets from other domains to assess generalization. In all cases, MIND reliably identifies inconsistencies, supporting the development of more culturally aware and factually consistent QA systems.
pdf
bib
abs
Data-Efficient Selection via Grammatical Complexity in Continual Pre-training of Domain-Specific LLMs
Yizhou Ying
|
Geng Zhang
|
Cui Danxin
|
Chengyu Du
|
Guanglei Yue
|
Sihang Jiang
|
Jiaqing Liang
|
Yifei Fu
|
Hailin Hu
|
Yanghua Xiao
Data efficiency is crucial in domain-specific continual pre-training (CPT) of large language models (LLMs), especially under resource constraints. Aiming for “small data, big impact,” this work addresses the limitations of existing domain-specific data selection strategies, which often rely on scarce labeled data or computationally expensive LLMs. We introduce CDF Sampling with Grammatical Complexity (CDF-GC), an annotation-independent, efficient and interpretable data selection framework for CPT. Our approach comprehensively evaluates grammatical complexity using lexical diversity and syntactic complexity, and employs a cumulative distribution function (CDF)-based sampling strategy to balance complexity and diversity. To validate the effectiveness of CDF-GC, we conducted experiments on a financial dataset. The results demonstrate that CDF-GC significantly outperforms baselines, achieving a 2.0% improvement in financial QA at the same selection ratio and even surpassing full-data training by 1.7% using only 20% of the data.
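A hedged sketch of the overall recipe, using stand-in complexity proxies (type-token ratio and average sentence length) rather than the paper's exact features, and a simple quantile-based read of the empirical CDF:

```python
import re
import numpy as np

def complexity(doc):
    tokens = re.findall(r"\w+", doc.lower())
    sentences = [s for s in re.split(r"[.!?]+", doc) if s.strip()]
    if not tokens or not sentences:
        return 0.0
    lexical_diversity = len(set(tokens)) / len(tokens)   # type-token ratio
    syntactic_proxy = len(tokens) / len(sentences)        # average sentence length
    return lexical_diversity * syntactic_proxy

def cdf_sample(docs, k):
    """Pick k documents whose complexity scores are spread across the CDF."""
    scores = np.array([complexity(d) for d in docs])
    order = np.argsort(scores)
    quantiles = np.linspace(0.0, 1.0, num=k)
    idx = (quantiles * (len(docs) - 1)).round().astype(int)
    return [docs[order[i]] for i in idx]

corpus = [
    "Rates rose. Rates rose. Rates rose.",
    "The central bank raised interest rates, citing persistent inflation.",
    "Quarterly filings show that hedged exposure, once netted, remains modest.",
    "Markets fell.",
]
print(cdf_sample(corpus, k=2))
```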
pdf
bib
abs
Comprehensive and Efficient Distillation for Lightweight Sentiment Analysis Models
Guangyu Xie
|
Yice Zhang
|
Jianzhu Bao
|
Qianlong Wang
|
Yang Sun
|
Bingbing Wang
|
Ruifeng Xu
Recent efforts leverage knowledge distillation techniques to develop lightweight and practical sentiment analysis models. These methods are grounded in human-written instructions and large-scale user texts. Despite the promising results, two key challenges remain: (1) manually written instructions are limited in diversity and quantity, making them insufficient to ensure comprehensive coverage of distilled knowledge; (2) large-scale user texts incur high computational cost, hindering the practicality of these methods. To this end, we introduce CompEffDist, a comprehensive and efficient distillation framework for sentiment analysis. Our framework consists of two key modules: attribute-based automatic instruction construction and difficulty-based data filtering, which correspondingly tackle the aforementioned challenges. Applying our method across multiple model series (Llama-3, Qwen-3, and Gemma-3), we enable 3B student models to match the performance of 20x larger teacher models on most tasks. In addition, our approach greatly outperforms baseline methods in data efficiency, attaining the same performance level with only 10% of the data. All code is available at
https://github.com/HITSZ-HLT/COMPEFFDIST.
pdf
bib
abs
One Planner To Guide Them All! Learning Adaptive Conversational Planners for Goal-oriented Dialogues
Huy Quang Dao
|
Lizi Liao
Goal-oriented dialogues, such as recommendation and negotiation, often require balancing multiple, conflicting objectives. Existing methods typically involve training separate models for specific combinations of objectives, leading to computational and scalability issues. In this work, we aim to develop a new dialogue policy method that can adapt to varying objective preferences at inference time without retraining. This raises several challenges in terms of both (1) optimization strategy and (2) knowledge utilization. To address these, we propose a novel learning framework, Preference Adaptive Dialogue Policy Planner (PADPP), for multi-objective goal-oriented dialogues. Specifically, to tackle the former, we introduce a novel policy optimization scheme, which leverages information gained from training the model on previously updated objective weights, accelerating the learning capability on new weight settings. To address the latter, we utilize Generalized Policy Improvement (GPI) to ensure the effectiveness of leveraged knowledge. Experimental results demonstrate that PADPP achieves superior adaptability and performance compared to state-of-the-art approaches, offering a scalable and flexible solution for multi-objective, goal-oriented dialogues. Code and data are available at the anonymous link.
pdf
bib
abs
Unsupervised Hallucination Detection by Inspecting Reasoning Processes
Ponhvoan Srey
|
Xiaobao Wu
|
Anh Tuan Luu
Unsupervised hallucination detection aims to identify hallucinated content generated by large language models (LLMs) without relying on labeled data. While unsupervised methods have gained popularity by eliminating labor-intensive human annotations, they frequently rely on proxy signals unrelated to factual correctness. This misalignment biases detection probes toward superficial or non-truth-related aspects, limiting generalizability across datasets and scenarios. To overcome these limitations, we propose IRIS, an unsupervised hallucination detection framework, leveraging internal representations intrinsic to factual correctness. IRIS prompts the LLM to carefully verify the truthfulness of a given statement, and obtains its contextualized embedding as informative features for training. Meanwhile, the uncertainty of each response is considered a soft pseudolabel for truthfulness. Experimental results demonstrate that IRIS consistently outperforms existing unsupervised methods. Our approach is fully unsupervised, computationally low cost, and works well even with little training data, making it suitable for real-time detection.
pdf
bib
abs
Multimodal Neural Machine Translation: A Survey of the State of the Art
Yi Feng
|
Chuanyi Li
|
Jiatong He
|
Zhenyu Hou
|
Vincent Ng
Multimodal neural machine translation (MNMT) has received increasing attention due to its widespread applications in various fields such as cross-border e-commerce and cross-border social media platforms. The task aims to integrate other modalities, such as the visual modality, with textual data to enhance translation performance. We survey the major milestones in MNMT research, providing a comprehensive overview of relevant datasets and recent methodologies, and discussing key challenges and promising research directions.
pdf
bib
abs
Lemmatization of Polish Multi-word Expressions
Magdalena Król
|
Aleksander Smywiński-Pohl
|
Zbigniew Kaleta
|
Paweł Lewkowicz
This paper explores the lemmatization of multi-word expressions (MWEs) and proper names in Polish – tasks complicated by linguistic irregularities and historical factors. Instead of using rule-based methods, we apply a machine learning approach with fine-tuned plT5 and mT5 models. We trained and validated the models on enhanced gold-standard data from the 2019 PolEval task and evaluated the impact of additional fine-tuning on a silver-standard dataset derived from Wikipedia. Two setups were tested: one without context, and one using left-side context of the target MWE. Our best model achieved 86.23% AccCS (Accuracy Case-Sensitive), 89.43% AccCI (Accuracy Case-Insensitive), and a combined score of 88.79%, setting a new state-of-the-art for Polish MWE and named entity lemmatization, as confirmed by the PolEval maintainers. We also evaluated optimization and quantization techniques to reduce model size and inference time with minimal quality loss.
pdf
bib
abs
Targeted Distillation for Sentiment Analysis
Yice Zhang
|
Guangyu Xie
|
Jingjie Lin
|
Jianzhu Bao
|
Qianlong Wang
|
Xi Zeng
|
Ruifeng Xu
This paper explores targeted distillation methods for sentiment analysis, aiming to build compact and practical models that preserve strong and generalizable sentiment analysis capabilities. To this end, we conceptually decouple the distillation target into knowledge and alignment and accordingly propose a two-stage distillation framework. Moreover, we introduce SentiBench, a comprehensive and systematic sentiment analysis benchmark that covers a diverse set of tasks across 12 datasets. We evaluate a wide range of models on this benchmark. Experimental results show that our approach substantially enhances the performance of compact models across diverse sentiment analysis tasks, and the resulting models demonstrate strong generalization to unseen tasks, showcasing robust competitiveness against existing small-scale models.
pdf
bib
abs
DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak
Hao Wang
|
Hao Li
|
Junda Zhu
|
Xinyuan Wang
|
Chengwei Pan
|
Minlie Huang
|
Lei Sha
Large Language Models (LLMs) are susceptible to generating harmful content when prompted with carefully crafted inputs, a vulnerability known as LLM jailbreaking. As LLMs become more powerful, studying jailbreak methods is critical to enhancing security and aligning models with human values. Traditionally, jailbreak techniques have relied on suffix addition or prompt templates, but these methods suffer from limited attack diversity. This paper introduces DiffusionAttacker, an end-to-end generative approach for jailbreak rewriting inspired by diffusion models. Our method employs a sequence-to-sequence (seq2seq) text diffusion model as a generator, conditioning on the original prompt and guiding the denoising process with a novel attack loss. Unlike previous approaches that use autoregressive LLMs to generate jailbreak prompts, which limit the modification of already generated tokens and restrict the rewriting space, DiffusionAttacker utilizes a seq2seq diffusion model, allowing more flexible token modifications. This approach preserves the semantic content of the original prompt while producing harmful content. Additionally, we leverage the Gumbel-Softmax technique to make the sampling process from the diffusion model’s output distribution differentiable, eliminating the need for iterative token search. Extensive experiments on Advbench and Harmbench demonstrate that DiffusionAttacker outperforms previous methods across various evaluation metrics, including attack success rate (ASR), fluency, and diversity.
pdf
bib
abs
Rank-Awareness and Angular Constraints: A New Perspective on Learning Sentence Embeddings from NLI Data
Zicheng Zhou
|
Min Huang
|
Qinghai Miao
Learning high-quality sentence embeddings from Natural Language Inference (NLI) data is often challenged by a critical signal conflict between discrete labels and the continuous spectrum of semantic similarity, as well as information loss from discarded neutral sentence pairs during training. To address this, we introduce Rank-Awareness and Angular Optimization Embeddings (RAOE), a framework that leverages the full NLI dataset (Entailment, Neutral, Contradiction) augmented with pre-computed continuous similarity scores (S). RAOE employs a novel composite objective which features: (1) a Rank Margin objective that enforces rank consistency against S using an explicit margin, and (2) a Gated Angular objective that conditionally refines embedding geometry based on NLI label (L) and S score agreement. Extensive evaluations on STS tasks and the MTEB benchmark demonstrate RAOE’s effectiveness. Our general-purpose RAOE-S1 model (BERT-base) significantly outperforms strong baselines, achieving an average Spearman’s correlation of 85.11 (vs. SimCSE’s 81.57 and AnglE’s 82.43), and shows consistent improvements on MTEB. Further STS-specialized fine-tuning (RAOE-S2) establishes new state-of-the-art performance on STS (88.17 with BERT-base). These results confirm RAOE’s ability to efficiently learn robust and nuanced sentence representations through the synergy of rank-awareness and conditional angular constraints. Code is available at https://github.com/Shengjingwa/RAOE.
pdf
bib
abs
LLM-Guided Semantic Relational Reasoning for Multimodal Intent Recognition
Qianrui Zhou
|
Hua Xu
|
Yifan Wang
|
Xinzhi Dong
|
Hanlei Zhang
Understanding human intents from multimodal signals is critical for analyzing human behaviors and enhancing human-machine interactions in real-world scenarios. However, existing methods exhibit limitations in their modality-level reliance, constraining relational reasoning over fine-grained semantics for complex intent understanding. This paper proposes a novel LLM-Guided Semantic Relational Reasoning (LGSRR) method, which harnesses the expansive knowledge of large language models (LLMs) to establish semantic foundations that boost smaller models’ relational reasoning performance. Specifically, an LLM-based strategy is proposed to extract fine-grained semantics as guidance for subsequent reasoning, driven by a shallow-to-deep Chain-of-Thought (CoT) that autonomously uncovers, describes, and ranks semantic cues by their importance without relying on manually defined priors. Besides, we formally model three fundamental types of semantic relations grounded in logical principles and analyze their nuanced interplay to enable more effective relational reasoning. Extensive experiments on multimodal intent and dialogue act recognition tasks demonstrate LGSRR’s superiority over state-of-the-art methods, with consistent performance gains across diverse semantic understanding scenarios. The complete data and code are available at https://github.com/thuiar/LGSRR.
pdf
bib
abs
Seeing Culture: A Benchmark for Visual Reasoning and Grounding
Burak Satar
|
Zhixin Ma
|
Patrick Amadeus Irawan
|
Wilfried Ariel Mulyawan
|
Jing Jiang
|
Ee-Peng Lim
|
Chong-Wah Ngo
Multimodal vision-language models (VLMs) have made substantial progress in various tasks that require a combined understanding of visual and textual content, particularly in cultural understanding tasks, with the emergence of new cultural datasets. However, these datasets frequently fall short of providing cultural reasoning while underrepresenting many cultures. In this paper, we introduce the Seeing Culture Benchmark (SCB), focusing on cultural reasoning with a novel approach that requires VLMs to reason on culturally rich images in two stages: i) selecting the correct visual option with multiple-choice visual question answering (VQA), and ii) segmenting the relevant cultural artifact as evidence of reasoning. Visual options in the first stage are systematically organized into three types: those originating from the same country, those from different countries, or a mixed group. Notably, all options are derived from a singular category for each type. Progression to the second stage occurs only after a correct visual option is chosen. The SCB benchmark comprises 1,065 images that capture 138 cultural artifacts across five categories from seven Southeast Asian countries, whose diverse cultures are often overlooked, accompanied by 3,178 questions, of which 1,093 are unique and meticulously curated by human annotators. Our evaluation of various VLMs reveals the complexities involved in cross-modal cultural reasoning and highlights the disparity between visual reasoning and spatial grounding in culturally nuanced scenarios. The SCB serves as a crucial benchmark for identifying these shortcomings, thereby guiding future developments in the field of cultural reasoning. https://github.com/buraksatar/SeeingCulture
pdf
bib
abs
GRADA: Graph-based Reranking against Adversarial Documents Attack
Jingjie Zheng
|
Aryo Pradipta Gema
|
Giwon Hong
|
Xuanli He
|
Pasquale Minervini
|
Youcheng Sun
|
Qiongkai Xu
Retrieval Augmented Generation (RAG) frameworks can improve the factual accuracy of large language models (LLMs) by integrating external knowledge from retrieved documents, thereby overcoming the limitations of models’ static intrinsic knowledge. However, these systems are susceptible to adversarial attacks that manipulate the retrieval process by introducing documents that are adversarial yet semantically similar to the query. Notably, while these adversarial documents resemble the query, they exhibit weak similarity to benign documents in the retrieval set. Thus, we propose a simple yet effective **G**raph-based **R**eranking against **A**dversarial **D**ocument **A**ttacks (GRADA) framework aiming at preserving retrieval quality while significantly reducing the success of adversaries. Our study evaluates the effectiveness of our approach through experiments conducted on six LLMs: GPT-3.5-Turbo, GPT-4o, Llama3.1-8b-Instruct, Llama3.1-70b-Instruct, Qwen2.5-7b-Instruct and Qwen2.5-14b-Instruct. We use three datasets to assess performance, with results from the Natural Questions dataset demonstrating up to an 80% reduction in attack success rates while maintaining minimal loss in accuracy.
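A simplified sketch of graph-based reranking in this spirit: retrieval scores are propagated over a passage-similarity graph so that passages which resemble the query but are weakly connected to the rest of the retrieved pool are demoted. The personalized-PageRank-style update and all constants are our own choices, not necessarily GRADA's.

```python
import numpy as np

def graph_rerank(doc_embs, query_scores, alpha=0.6, iters=20):
    """doc_embs: (n, d) passage embeddings; query_scores: (n,) initial relevance."""
    norm = doc_embs / (np.linalg.norm(doc_embs, axis=1, keepdims=True) + 1e-8)
    sim = np.clip(norm @ norm.T, 0.0, None)
    np.fill_diagonal(sim, 0.0)
    trans = sim / (sim.sum(axis=1, keepdims=True) + 1e-8)    # row-stochastic graph
    prior = query_scores / (query_scores.sum() + 1e-8)
    scores = prior.copy()
    for _ in range(iters):
        # Mix support propagated from similar passages with query relevance.
        scores = alpha * trans.T @ scores + (1 - alpha) * prior
    return np.argsort(scores)[::-1]

rng = np.random.default_rng(1)
benign = rng.normal(size=(5, 32)) + 2.0        # mutually similar passage cluster
adversarial = rng.normal(size=(1, 32)) - 2.0   # unlike the rest of the pool
embs = np.vstack([benign, adversarial])
scores = np.array([0.6, 0.55, 0.5, 0.45, 0.4, 0.9])   # attacker starts ranked first
print(graph_rerank(embs, scores))
```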
pdf
bib
abs
Orchestrating Audio: Multi-Agent Framework for Long-Video Audio Synthesis
Yehang Zhang
|
Xinli Xu
|
Xiaojie Xu
|
Doudou Zhang
|
Li Liu
|
Ying-Cong Chen
Video-to-audio synthesis, which generates synchronized audio for visual content, critically enhances viewer immersion and narrative coherence in film and interactive media. However, video-to-audio dubbing for long-form content remains an unsolved challenge due to dynamic semantic shifts, audio diversity and the absence of dedicated datasets. While existing methods excel in short videos, they falter in long scenarios (e.g., movies) due to fragmented synthesis and inadequate cross-scene consistency. We propose LVAS-Agent, a multi-agent framework that offers a coordinated, multi-component approach to long-video audio generation. Our approach decomposes long-video synthesis into four steps including scene segmentation, script generation, audio design and audio synthesis. To enable systematic evaluation, we introduce LVAS-Bench, the first benchmark with 207 professionally curated long videos spanning diverse scenarios. Experiments show that our method outperforms state-of-the-art V2A models in overall audio synthesis quality.
pdf
bib
abs
MADAWSD: Multi-Agent Debate Framework for Adversarial Word Sense Disambiguation
Kaiyuan Zhang
|
Qian Liu
|
Luyang Zhang
|
Chaoqun Zheng
|
Shuaimin Li
|
Bing Xu
|
Muyun Yang
|
Xinxiao Qiao
|
Wenpeng Lu
Word sense disambiguation (WSD) is a fundamental yet challenging task in natural language processing. In recent years, the advent of large language models (LLMs) has led to significant advancements in regular WSD tasks. However, most existing LLMs face two major issues that hinder their performance in WSD. Firstly, these models are often prone to misclassifying the correct meaning of an ambiguous word when confronted with contexts containing adversarial information. Secondly, there is a lack of sufficient adversarial WSD datasets, which severely limits the development and evaluation of adversarial WSD systems. To address these gaps, we propose a novel Multi-Agent Debate framework for Adversarial Word Sense Disambiguation (MADAWSD). The MADAWSD framework simulates a real-world debate environment where multiple agent roles, namely, the Debater, Moderator, Consensus-seeker, and Judge, engage in discussions about ambiguous words in the context of adversarial information. Through a collaborative mechanism among these agents, it achieves accurate WSD. Additionally, a novel dataset for Chinese adversarial WSD has been constructed, focusing on improving and evaluating the performance of WSD models in the Chinese language. Extensive experiments on both English and Chinese adversarial WSD datasets demonstrate that MADAWSD can seamlessly integrate with existing LLMs and significantly enhance their performance, showcasing broad generality and outstanding effectiveness.
pdf
bib
abs
Interpretable Text Embeddings and Text Similarity Explanation: A Survey
Juri Opitz
|
Lucas Moeller
|
Andrianos Michail
|
Sebastian Padó
|
Simon Clematide
Text embeddings are a fundamental component in many NLP tasks, including classification, regression, clustering, and semantic search. However, despite their ubiquitous application, challenges persist in interpreting embeddings and explaining similarities between them. In this work, we provide a structured overview of methods specializing in inherently interpretable text embeddings and text similarity explanation, an underexplored research area. We characterize the main ideas, approaches, and trade-offs. We compare means of evaluation, discuss overarching lessons learned and finally identify opportunities and open challenges for future research.
pdf
bib
abs
Dyve: Thinking Fast and Slow for Dynamic Process Verification
Jianyuan Zhong
|
Zeju Li
|
Zhijian Xu
|
Xiangyu Wen
|
Qiang Xu
Large Language Models have advanced significantly in complex reasoning, often leveraging external reward models to improve the reliability of their multi-step processes. However, existing process verification methods struggle with reliably assessing incomplete reasoning traces and are limited by the cost of high-quality human annotations or the inherent noise in automatically generated labels. Therefore, we present Dyve, a dynamic process verifier that enhances reasoning error detection in large language models by integrating fast and slow thinking, inspired by Kahneman’s Systems Theory. Dyve adaptively applies immediate token-level confirmation (System 1) for straightforward steps and comprehensive analysis (System 2) for complex ones. Unlike traditional verifiers that only evaluate final outputs, Dyve employs a step-wise consensus-filtered supervision strategy, leveraging Monte Carlo estimation, LLM-as-a-Judge, and specialized reasoning models to extract high-quality training signals from noisy rollouts. Experimental results on ProcessBench and the MATH dataset confirm that Dyve significantly outperforms existing process-based verifiers and boosts performance in Best-of-N settings while maintaining computational efficiency by strategically allocating verification resources.
pdf
bib
abs
PERSEVAL: A Framework for Perspectivist Classification Evaluation
Soda Marem Lo
|
Silvia Casola
|
Erhan Sezerer
|
Valerio Basile
|
Franco Sansonetti
|
Antonio Uva
|
Davide Bernardi
Data perspectivism goes beyond majority vote label aggregation by recognizing various perspectives as legitimate ground truths. However, current evaluation practices remain fragmented, making it difficult to compare perspectivist approaches and analyze their impact on different users and demographic subgroups. To address this gap, we introduce PersEval, the first unified framework for evaluating perspectivist models in NLP. A key innovation is its evaluation at the individual annotator level and its treatment of annotators and users as distinct entities, consistent with real-world scenarios. We demonstrate PersEval’s capabilities through experiments with both Encoder-based and Decoder-based approaches, as well as an analysis of the effect of sociodemographic prompting. By considering global, text-, trait- and user-level evaluation metrics, we show that PersEval is a powerful tool for examining how models are influenced by user-specific information and identifying the biases this information may introduce.
pdf
bib
abs
Massive Supervised Fine-tuning Experiments Reveal How Data, Layer, and Training Factors Shape LLM Alignment Quality
Yuto Harada
|
Yusuke Yamauchi
|
Yusuke Oda
|
Yohei Oseki
|
Yusuke Miyao
|
Yu Takagi
Supervised fine-tuning (SFT) is a critical step in aligning large language models (LLMs) with human instructions and values, yet many aspects of SFT remain poorly understood. We trained a wide range of base models on a variety of datasets including code generation, mathematical reasoning, and general-domain tasks, resulting in 1,000+ SFT models under controlled conditions. We then identified the dataset properties that matter most and examined the layer-wise modifications introduced by SFT. Our findings reveal that some training–task synergies persist across all models while others vary substantially, emphasizing the importance of model-specific strategies. Moreover, we demonstrate that perplexity consistently predicts SFT effectiveness, often surpassing superficial similarity between the training data and the benchmark, and that mid-layer weight changes correlate most strongly with performance gains. We release these 1,000+ SFT models and benchmark results to accelerate further research. All resources are available at
https://github.com/llm-jp/massive-sft.
pdf
bib
abs
IndiGEC: Multilingual Grammar Error Correction for Low-Resource Indian Languages
Ujjwal Sharma
|
Pushpak Bhattacharyya
Grammatical Error Correction (GEC) for low-resource Indic languages faces significant challenges due to the scarcity of annotated data. In this work, we introduce the Mask-Translate&Fill (MTF) framework, a novel approach for generating high-quality synthetic data for GEC using only monolingual corpora. MTF leverages a machine translation system and a pretrained masked language model to introduce synthetic errors and tries to mimic errors made by second-language learners. Our experimental results on English, Hindi, Bengali, Marathi, and Tamil demonstrate that MTF consistently outperforms other monolingual synthetic data generation methods and achieves performance comparable to the Translation Language Modeling (TLM)-based approach, which uses a bilingual corpus, in both independent and multilingual settings. Under multilingual training, MTF yields significant improvements across Indic languages, with particularly notable gains in Bengali and Tamil, achieving +1.6 and +3.14 GLEU over the TLM-based method, respectively. To support further research, we also introduce the IndiGEC Corpus, a high-quality, human-written, manually validated GEC dataset for these four Indic languages, comprising over 8,000 sentence pairs with separate development and test splits.
pdf
bib
abs
Bias Beware: The Impact of Cognitive Biases on LLM-Driven Product Recommendations
Giorgos Filandrianos
|
Angeliki Dimitriou
|
Maria Lymperaiou
|
Konstantinos Thomas
|
Giorgos Stamou
The advent of Large Language Models (LLMs) has revolutionized product recommenders, yet their susceptibility to adversarial manipulation poses critical challenges, particularly in real-world commercial applications. Our approach is the first one to tap into human psychological principles, seamlessly modifying product descriptions, making such manipulations hard to detect. In this work, we investigate cognitive biases as black-box adversarial strategies, drawing parallels between their effects on LLMs and human purchasing behavior. Through extensive evaluation across models of varying scale, we find that certain biases, such as social proof, consistently boost product recommendation rate and ranking, while others, like scarcity and exclusivity, surprisingly reduce visibility. Our results demonstrate that cognitive biases are deeply embedded in state-of-the-art LLMs, leading to highly unpredictable behavior in product recommendations and posing significant challenges for effective mitigation.
pdf
bib
abs
T2R-BENCH: A Benchmark for Real World Table-to-Report Task
Jie Zhang
|
Changzai Pan
|
Sishi Xiong
|
Kaiwen Wei
|
Yu Zhao
|
Xiangyu Li
|
Jiaxin Peng
|
Xiaoyan Gu
|
Jian Yang
|
Wenhan Chang
|
Zhenhe Wu
|
Jiang Zhong
|
Shuangyong Song
|
Xuelong Li
Extensive research has been conducted to explore the capabilities of large language models (LLMs) in table reasoning. However, the essential task of transforming table information into reports remains a significant challenge for industrial applications. This task is plagued by two critical issues: 1) the complexity and diversity of tables lead to suboptimal reasoning outcomes; and 2) existing table benchmarks lack the capacity to adequately assess the practical application of this task. To fill this gap, we propose the table-to-report task and construct a bilingual benchmark named T2R-bench, where the key information flows from the tables to the reports. The benchmark comprises 457 industrial tables, all derived from real-world scenarios and encompassing 19 industry domains as well as four types of industrial tables. Furthermore, we propose novel evaluation criteria to fairly measure the quality of report generation. Experimental results show that Deepseek-R1 achieves the best performance with only a 62.71% overall score, indicating that LLMs still have room for improvement on T2R-bench.
pdf
bib
abs
TCP: a Benchmark for Temporal Constraint-Based Planning
Zifeng Ding
|
Sikuan Yan
|
Moy Yuan
|
Xianglong Hu
|
Fangru Lin
|
Andreas Vlachos
Temporal reasoning and planning are essential capabilities for large language models (LLMs), yet most existing benchmarks evaluate them in isolation and under limited forms of complexity. To address this gap, we introduce the Temporal Constraint-based Planning (TCP) benchmark, which jointly assesses both capabilities. Each instance in TCP features a naturalistic dialogue around a collaborative project, where diverse and interdependent temporal constraints are explicitly or implicitly expressed, and models must infer an optimal schedule that satisfies all constraints. To construct TCP, we generate abstract problem prototypes that are then paired with realistic scenarios from various domains and enriched into dialogues using an LLM. A human quality check is performed on a sampled subset to confirm the reliability of our benchmark. We evaluate state-of-the-art LLMs and find that even the strongest models may struggle with TCP, highlighting its difficulty and revealing limitations in LLMs’ temporal constraint-based planning abilities. We analyze underlying failure cases, open source our benchmark, and hope our findings can inspire future research.
pdf
bib
abs
The Role of Outgoing Connection Heterogeneity in Feedforward Layers of Large Language Models
Felix Stahlberg
|
Shankar Kumar
We report on investigations into the characteristics of outgoing connections in feedforward layers of large language models. Our findings show that inner neurons with diverse outgoing connection strengths are more critical to model performance than those with uniform connections. We propose a new fine-tuning loss that takes advantage of this observation by decreasing the outgoing connection entropy in feedforward layers. Using this loss yields gains over standard fine-tuning across two different model families (PaLM-2 and Gemma-2) for downstream tasks in math, coding, and language understanding. To further elucidate the role of outgoing connection heterogeneity, we develop a data-free structured pruning method, which uses entropy to identify and remove neurons. This method is considerably more effective than removing neurons either randomly or based on their magnitude.
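A small sketch of the entropy-based view described above (our reconstruction, not the authors' code): each inner feedforward neuron is scored by the entropy of its normalized outgoing-connection magnitudes, and the most uniform (highest-entropy) neurons are pruned first.

```python
import numpy as np

def outgoing_entropy(w_out):
    """w_out: (n_inner, d_model) weights from inner neurons back to the residual stream."""
    mag = np.abs(w_out)
    p = mag / (mag.sum(axis=1, keepdims=True) + 1e-12)
    return -(p * np.log(p + 1e-12)).sum(axis=1)     # one entropy value per neuron

def prune_uniform_neurons(w_in, w_out, keep_ratio=0.75):
    """Drop the neurons whose outgoing connections are most uniform."""
    ent = outgoing_entropy(w_out)
    n_keep = int(len(ent) * keep_ratio)
    keep = np.argsort(ent)[:n_keep]                 # low entropy = diverse = keep
    return w_in[:, keep], w_out[keep, :]

rng = np.random.default_rng(0)
w_in = rng.normal(size=(64, 256))      # d_model x n_inner
w_out = rng.normal(size=(256, 64))     # n_inner x d_model
# Make some neurons nearly uniform in their outgoing strengths.
w_out[:32] = 0.05 * np.sign(rng.normal(size=(32, 64)))
w_in2, w_out2 = prune_uniform_neurons(w_in, w_out)
print(w_in2.shape, w_out2.shape)
```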
pdf
bib
abs
Follow the Flow: Fine-grained Flowchart Attribution with Neurosymbolic Agents
Manan Suri
|
Puneet Mathur
|
Nedim Lipka
|
Franck Dernoncourt
|
Ryan A. Rossi
|
Vivek Gupta
|
Dinesh Manocha
Flowcharts are a critical tool for visualizing decision-making processes. However, their non-linear structure and complex visual-textual relationships make it challenging to interpret them using LLMs, as vision-language models frequently hallucinate nonexistent connections and decision paths when analyzing these diagrams. This leads to compromised reliability for automated flowchart processing in critical domains such as logistics, health, and engineering. We introduce the task of Fine-grained Flowchart Attribution, which traces the specific flowchart components that ground an LLM response referring to that flowchart. Flowchart Attribution ensures the verifiability of LLM predictions and improves explainability by linking generated responses to the flowchart’s structure. We propose FlowPathAgent, a neurosymbolic agent that performs fine-grained post hoc attribution through graph-based reasoning. It first segments the flowchart, converts it into a structured symbolic graph, and then employs an agentic approach to dynamically interact with the graph and generate attribution paths. Additionally, we present FlowExplainBench, a novel benchmark for evaluating flowchart attributions across diverse styles, domains, and question types. Experimental results show that FlowPathAgent mitigates visual hallucinations in LLM answers over flowchart QA, outperforming strong baselines by 10–14% on our proposed FlowExplainBench dataset.
pdf
bib
abs
Collaborative Rational Speech Act: Pragmatic Reasoning for Multi-Turn Dialog
Lautaro Estienne
|
Gabriel Ben Zenou
|
Nona Naderi
|
Jackie CK Cheung
|
Pablo Piantanida
As AI systems take on collaborative roles, they must reason about shared goals and beliefs—not just generate fluent language. The Rational Speech Act (RSA) framework offers a principled approach to pragmatic reasoning, but existing extensions face challenges in scaling to multi-turn, collaborative scenarios. In this paper, we introduce Collaborative Rational Speech Act (CRSA), an information-theoretic (IT) extension of RSA that models multi-turn dialog by optimizing a gain function adapted from rate-distortion theory. This gain is an extension of the gain model that is maximized in the original RSA model but takes into account the scenario in which both agents in a conversation have private information and produce utterances conditioned on the dialog. We demonstrate the effectiveness of CRSA on referential games and template-based doctor–patient dialogs in the medical domain. Empirical results show that CRSA yields more consistent, interpretable, and collaborative behavior than existing baselines—paving the way for more pragmatic and socially aware language agents.
pdf
bib
abs
Understanding Subword Compositionality of Large Language Models
Qiwei Peng
|
Yekun Chai
|
Anders Søgaard
Large language models (LLMs) take sequences of subwords as input, requiring them to effectively compose subword representations into meaningful word-level representations. In this paper, we present a comprehensive set of experiments to probe how LLMs compose subword information, focusing on three key aspects: structural similarity, semantic decomposability, and form retention. Our analysis of the experiments suggests that the five LLM families we study can be classified into three distinct groups, likely reflecting differences in their underlying composition strategies. Specifically, we observe (i) three distinct patterns in the evolution of structural similarity between subword compositions and whole-word representations across layers; (ii) strong performance when probing, layer by layer, their sensitivity to semantic decomposability; and (iii) three distinct patterns when probing sensitivity to formal features, e.g., character sequence length. These findings provide valuable insights into the compositional dynamics of LLMs and highlight different compositional patterns in how LLMs encode and integrate subword information.
pdf
bib
abs
Internal Chain-of-Thought: Empirical Evidence for Layer‐wise Subtask Scheduling in LLMs
Zhipeng Yang
|
Junzhuo Li
|
Siyu Xia
|
Xuming Hu
We show that large language models (LLMs) exhibit an internal chain-of-thought: they sequentially decompose and execute composite tasks layer-by-layer. Two claims ground our study: (i) distinct subtasks are learned at different network depths, and (ii) these subtasks are executed sequentially across layers. On a benchmark of 15 two-step composite tasks, we employ layer-from context-masking and propose a novel cross-task patching method, confirming (i). To examine claim (ii), we apply LogitLens to decode hidden states, revealing a consistent layerwise execution pattern. We further replicate our analysis on the real-world TRACE benchmark, observing the same stepwise dynamics. Together, our results enhance LLM transparency by showing the capacity of LLMs to internally plan and execute subtasks (or instructions), opening avenues for fine-grained, instruction-level activation steering.
pdf
bib
abs
From Understanding to Generation: An Efficient Shortcut for Evaluating Language Models
Viktor Hangya
|
Fabian Küch
|
Darina Gold
Iterative evaluation of LLMs during training is essential to ensure expected capability development, but can be time- and compute-intensive. While NLU tasks, where the model selects from fixed answer choices, are cheap to evaluate, essential capabilities like reasoning and code generation rely on the more time-consuming NLG (token-by-token generation) format. In this work, our aim is to decrease the computational burden of NLG benchmarks in order to enable monitoring crucial LLM capabilities during model training. We reformulate generative tasks into computationally cheaper NLU alternatives. We test the performance correlation between the original and reformulated tasks using 8 LMs of various sizes and 4 capabilities: mathematical reasoning, code generation, factual knowledge and reading comprehension. Our results show a strong correlation between task formats, supporting capability assessment via cheaper alternatives and achieving over 35x average reduction in evaluation time. Our project is available at: https://github.com/Fraunhofer-IIS/EvalShortcut
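A minimal sketch of the NLU-style reformulation: instead of generating an answer token by token, each fixed candidate continuation is scored by sequence log-likelihood and the highest-scoring one is taken as the prediction. The `sequence_logprob` function is a hypothetical stand-in for whatever likelihood call an evaluation harness exposes, not a real API.

```python
def sequence_logprob(prompt, continuation):
    # Hypothetical stand-in: in practice this would sum the model's token
    # log-probabilities of `continuation` given `prompt`.
    return -len(continuation)  # placeholder so the sketch runs

def nlu_style_eval(prompt, choices, gold_index):
    """Score every answer choice and check whether the best one is the gold answer."""
    scores = [sequence_logprob(prompt, c) for c in choices]
    predicted = max(range(len(choices)), key=lambda i: scores[i])
    return predicted == gold_index

item = {
    "prompt": "Q: What is 12 * 7? A:",
    "choices": [" 84", " 74", " 96", " 72"],
    "gold_index": 0,
}
print(nlu_style_eval(item["prompt"], item["choices"], item["gold_index"]))
```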
pdf
bib
abs
Debiasing Multilingual LLMs in Cross-lingual Latent Space
Qiwei Peng
|
Guimin Hu
|
Yekun Chai
|
Anders Søgaard
Debiasing techniques such as SentDebias aim to reduce bias in large language models (LLMs). Previous studies have evaluated their cross-lingual transferability by directly applying these methods to LLM representations, revealing their limited effectiveness across languages. In this work, we therefore propose to perform debiasing in a joint latent space rather than directly on LLM representations. We construct a well-aligned cross-lingual latent space using an autoencoder trained on parallel TED talk scripts. Our experiments with Aya-expanse and two debiasing techniques across four languages (English, French, German, Dutch) demonstrate that a) autoencoders effectively construct a well-aligned cross-lingual latent space, and b) applying debiasing techniques in the learned cross-lingual latent space significantly improves both the overall debiasing performance and cross-lingual transferability.
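A minimal sketch, assuming latent codes from a shared cross-lingual autoencoder are already available: a bias direction is estimated from attribute pairs and projected out, SentDebias-style, in that latent space. The direction estimator and all data below are illustrative.

```python
import numpy as np

def bias_direction(latent_pairs):
    """latent_pairs: (vec_a, vec_b) latent codes of attribute-paired inputs
    (e.g., 'he'/'she' sentences) from any of the covered languages."""
    diffs = np.stack([a - b for a, b in latent_pairs])
    # Top right-singular vector of the stacked pair differences.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[0]

def debias(latents, direction):
    """Remove the component of each latent code along the bias direction."""
    direction = direction / np.linalg.norm(direction)
    return latents - np.outer(latents @ direction, direction)

rng = np.random.default_rng(0)
d = 32
true_bias = rng.normal(size=d)
pairs = [(rng.normal(size=d) + true_bias, rng.normal(size=d) - true_bias)
         for _ in range(20)]
z = rng.normal(size=(5, d)) + 0.5 * true_bias
z_clean = debias(z, bias_direction(pairs))
unit = true_bias / np.linalg.norm(true_bias)
print(round(float(np.abs(z @ unit).mean()), 3),
      round(float(np.abs(z_clean @ unit).mean()), 3))
```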
pdf
bib
abs
Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings
Max Conti
|
Manuel Faysse
|
Gautier Viaud
|
Antoine Bosselut
|
Celine Hudelot
|
Pierre Colombo
A limitation of modern document retrieval embedding methods is that they typically encode passages (chunks) from the same documents independently, often overlooking crucial contextual information from the rest of the document that could greatly improve individual chunk representations. In this work, we introduce ConTEB (Context-aware Text Embedding Benchmark), a benchmark designed to evaluate retrieval models on their ability to leverage document-wide context. Our results show that state-of-the-art embedding models struggle in retrieval scenarios where context is required. To address this limitation, we propose InSeNT (In-sequence Negative Training), a novel contrastive post-training approach which, combined with late chunking pooling, enhances contextual representation learning while preserving computational efficiency. Our method significantly improves retrieval quality on ConTEB without sacrificing base model performance. We further find chunks embedded with our method are more robust to suboptimal chunking strategies and larger retrieval corpus sizes. We open-source all artifacts at https://github.com/illuin-tech/contextual-embeddings.
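A small sketch of the late chunking pooling step the abstract refers to, with a random stand-in for the long-context encoder: the document is encoded once, and each chunk embedding is mean-pooled from the contextualized token vectors inside its span, so every chunk has "seen" the rest of the document.

```python
import numpy as np

def encode_document(tokens, dim=64, seed=0):
    # Stand-in for a long-context encoder returning one vector per token.
    rng = np.random.default_rng(seed)
    return rng.normal(size=(len(tokens), dim))

def late_chunk_embeddings(tokens, chunk_spans, dim=64):
    """One full-document encoder pass, then per-chunk mean pooling."""
    token_vecs = encode_document(tokens, dim)
    return np.stack([token_vecs[start:end].mean(axis=0)
                     for start, end in chunk_spans])

tokens = "the contract renews automatically unless notice is given in writing".split()
spans = [(0, 4), (4, len(tokens))]                 # two chunks, one document pass
print(late_chunk_embeddings(tokens, spans).shape)  # (2, 64)
```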
pdf
bib
abs
MS-RAG: Simple and Effective Multi-Semantic Retrieval-Augmented Generation
Xiaozhou You
|
Yahui Luo
|
Lihong Gu
To alleviate the hallucination problem of large language models (LLMs), retrieval-augmented generation (RAG) has been proposed and widely adopted. Due to the limitations of naive RAG in cross-chunk summarization tasks, graph-based RAG has emerged as a promising solution. However, a close study reveals several flaws in these works. First, most graph-based RAGs suffer from a less efficient indexing process, which leads to information loss and expensive costs. Second, they rely heavily on LLMs for retrieval and thus infer slowly, which hinders their application in industry. To build a more efficient and effective RAG, we propose the multi-semantic RAG (MS-RAG). In this work, we combine knowledge graphs with dense vectors to build a multi-semantic RAG. To be specific, (i) at the indexing stage, we create multiple semantic-level indexes, including chunk-level, relation-level, and entity-level, to leverage the merits of dense vectors and knowledge graphs. (ii) at the retrieval stage, unlike the previous LLM-empowered entity extraction, we propose a novel mix recall algorithm. Finally, we employ a multi-semantic rerank module to purify the results. Extensive experiments show that MS-RAG achieves superior performance. In terms of retrieval effect, MS-RAG achieves state-of-the-art performance, an improvement of about 10%-30% over existing methods. In terms of question-answering effect, MS-RAG still achieves promising results with faster inference speed. More analysis and experiments are provided in the Appendix.
pdf
bib
abs
Transitive self-consistency evaluation of NLI models without gold labels
Wei Wu
|
Mark Last
Natural Language Inference (NLI) is an important task in natural language processing. NLI models are aimed at automatically determining logical relationships between pairs of sentences. However, recent studies based on gold labels assigned to sentence pairs by human experts have provided some evidence that NLI models tend to make inconsistent decisions during inference. Previous studies have used existing NLI datasets to test the transitive consistency of language models. However, they test only variations of two transitive consistency rules out of four. To further evaluate the transitive consistency of NLI models, we propose a novel evaluation approach that allows us to test all four rules automatically by generating adversarial examples via antonym replacements. Since we are testing self-consistency, human labeling of generated adversarial examples is unnecessary. Our experiments on several benchmark datasets indicate that the examples generated by the proposed antonym replacement methodology can reveal transitive inconsistencies in state-of-the-art NLI models.
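A sketch of how such a transitivity check can be wired up. The rule table below is one common formulation of NLI transitivity constraints and may not match the paper's exact four rules, and `predict` is a hypothetical stand-in for a real NLI classifier; no gold labels are needed because only the model's own predictions are compared.

```python
E, N, C = "entailment", "neutral", "contradiction"

RULES = [
    # (label(a,b), label(b,c)) -> set of labels allowed for (a,c)
    ((E, E), {E}),
    ((E, C), {C}),
    ((N, E), {N, E}),
    ((N, C), {N, C}),
]

def predict(premise, hypothesis):
    # Hypothetical stand-in for an NLI model; replace with a real classifier.
    return N

def check_triple(a, b, c):
    """Return every transitivity rule violated by the model's own predictions."""
    violations = []
    ab, bc, ac = predict(a, b), predict(b, c), predict(a, c)
    for (r1, r2), allowed in RULES:
        if ab == r1 and bc == r2 and ac not in allowed:
            violations.append(((r1, r2), ac))
    return violations

print(check_triple(
    "A man is playing a guitar on stage.",
    "Someone is playing an instrument.",
    "Nobody is playing an instrument.",   # antonym-style replacement of the middle sentence
))
```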
pdf
bib
abs
MiLQ: Benchmarking IR Models for Bilingual Web Search with Mixed Language Queries
Jonghwi Kim
|
Deokhyung Kang
|
Seonjeong Hwang
|
Yunsu Kim
|
Jungseul Ok
|
Gary Lee
Despite bilingual speakers frequently using mixed-language queries in web searches, Information Retrieval (IR) research on them remains scarce. To address this, we introduce ***MiLQ***, ***Mi***xed-***L***anguage ***Q***uery test set, the first public benchmark of mixed-language queries, qualified as realistic and relatively preferred. Experiments show that multilingual IR models perform moderately on MiLQ and inconsistently across native, English, and mixed-language queries, also suggesting code-switched training data’s potential for robust IR models handling such queries. Meanwhile, intentional English mixing in queries proves an effective strategy for bilinguals searching English documents, which our analysis attributes to enhanced token matching compared to native queries.
pdf
bib
abs
Enhancing Chinese Offensive Language Detection with Homophonic Perturbation
Junqi Wu
|
Shujie Ji
|
Kang Zhong
|
Huiling Peng
|
Zhendongxiao
|
Xiongding Liu
|
Wu Wei
Detecting offensive language in Chinese is challenging due to homophonic substitutions used to evade detection. We propose a framework to improve large language models’ robustness against such phonetic attacks. First, we construct HED-COLD, the first large-scale and systematic homophonic dataset for Chinese offensive language detection. Additionally, we design a homophone-aware pretraining strategy that learns the mappings among orthography, phonetics, and semantics between original and perturbed text. Experimental results show that our approach achieves state-of-the-art performance on both the COLD test set and the toxicity benchmark ToxiCloakCN. Notably, it achieves greater gains in domains susceptible to homophonic attacks, such as gender and regional content. These results demonstrate improved robustness and generalization against phonetic adversarial attacks.
pdf
bib
abs
Persona-Augmented Benchmarking: Evaluating LLMs Across Diverse Writing Styles
Kimberly Truong
|
Riccardo Fogliato
|
Hoda Heidari
|
Steven Wu
Current benchmarks for evaluating Large Language Models (LLMs) often do not exhibit enough writing style diversity, with many adhering primarily to standardized conventions. Such benchmarks do not fully capture the rich variety of communication patterns exhibited by humans. Thus, it is possible that LLMs, which are optimized on these benchmarks, may demonstrate brittle performance when faced with “non-standard” input. In this work, we test this hypothesis by rewriting evaluation prompts using persona-based LLM prompting, a low-cost method to emulate diverse writing styles. Our results show that, even with identical semantic content, variations in writing style and prompt formatting significantly impact the estimated performance of the LLM under evaluation. Notably, we identify distinct writing styles that consistently trigger either low or high performance across a range of models and tasks, irrespective of model family, size, or recency. Our work offers a scalable approach to augment existing benchmarks, improving the external validity of the assessments they provide for LLM performance across linguistic variations.
pdf
bib
abs
Computational Analysis of Character Development in Holocaust Testimonies
Esther Shizgal
|
Eitan Wagner
|
Renana Keydar
|
Omri Abend
This work presents a computational approach to analyze character development along the narrative timeline. The analysis characterizes changes in the protagonist’s views and behavior and the interplay between them. We consider transcripts of Holocaust survivor testimonies as a test case, each telling the story of an individual in first-person terms. We focus on the survivor’s religious trajectory, examining the evolution of their disposition toward religious belief and practice as it is reflected in the testimony. Clustering the resulting trajectories in the dataset, we identify common sequences in the data. Our findings highlight multiple common structures of religiosity across the narratives: in terms of belief, a constant disposition is common, while for practice, most present an oscillating structure, serving as valuable material for historical and sociological research. This work demonstrates the potential of natural language processing for analyzing character evolution through thematic trajectories in narratives.
pdf
bib
abs
TASO: Task-Aligned Sparse Optimization for Parameter-Efficient Model Adaptation
Daiye Miao
|
Yufang Liu
|
Jie Wang
|
Changzhi Sun
|
Yunke Zhang
|
Demei Yan
|
Shaokang Dong
|
Qi Zhang
|
Yuanbin Wu
LoRA has become one of the most widely used parameter-efficient fine-tuning methods due to its simplicity and effectiveness. However, numerous studies have shown that LoRA often introduces substantial parameter redundancy, which not only increases the number of trainable parameters but also hinders the effectiveness of fine-tuning. Since identifying redundant parameters in LoRA is inherently difficult, how to eliminate them efficiently and accurately remains a challenging problem. In this paper, we propose TASO, a redundancy reduction method that leverages importance information from the pretrained model’s weights to mitigate LoRA redundancy. Specifically, we estimate parameter importance on downstream tasks and identify task-specific core regions based on the distribution of importance scores. The location information of these core regions is then used to determine the sparse structure of LoRA modules, enabling redundancy removal before fine-tuning. Our approach significantly reduces the number of trainable parameters required for task adaptation, while providing a novel task-aligned perspective for LoRA redundancy reduction. Experimental results demonstrate that, with a parameter budget comparable to LoRA with rank r = 1, TASO consistently outperforms standard LoRA across multiple tasks, achieving strong fine-tuning performance while effectively eliminating redundant parameters.
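A hedged sketch of the general idea (our simplification, not the released code): per-parameter importance is estimated on a downstream task, a top-scoring core region is kept, and a LoRA-style update is confined to that region. The first-order importance proxy below is a common choice and only an assumption about the paper's estimator.

```python
import numpy as np

def importance_scores(weight, grad):
    # A common first-order proxy: |w * dL/dw|. The paper's estimator may differ.
    return np.abs(weight * grad)

def core_region_mask(scores, keep_ratio=0.1):
    """Binary mask keeping only the top `keep_ratio` fraction of parameters."""
    threshold = np.quantile(scores, 1.0 - keep_ratio)
    return (scores >= threshold).astype(scores.dtype)

def sparse_lora_update(weight, A, B, mask, alpha=1.0):
    """Apply a LoRA-style low-rank update only inside the core region."""
    return weight + alpha * mask * (B @ A)

rng = np.random.default_rng(0)
d_out, d_in, r = 32, 32, 1
W = rng.normal(size=(d_out, d_in))
grad = rng.normal(size=(d_out, d_in))        # e.g., from a few downstream batches
A = rng.normal(scale=0.01, size=(r, d_in))
B = rng.normal(scale=0.01, size=(d_out, r))
mask = core_region_mask(importance_scores(W, grad), keep_ratio=0.1)
W_new = sparse_lora_update(W, A, B, mask)
print(int(mask.sum()), W_new.shape)
```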
pdf
bib
abs
Dual-Path Counterfactual Integration for Multimodal Aspect-Based Sentiment Classification
Rui Liu
|
Jiahao Cao
|
Jiaqian Ren
|
Xu Bai
|
Yanan Cao
Multimodal aspect-based sentiment classification (MABSC) requires fine-grained reasoning over both textual and visual content to infer sentiments toward specific aspects. However, existing methods often rely on superficial correlations—particularly between aspect terms and sentiment labels—leading to poor generalization and vulnerability to spurious cues. To address this limitation, we propose DPCI, a novel Dual-Path Counterfactual Integration framework that enhances model robustness by explicitly modeling counterfactual reasoning in multimodal contexts. Specifically, we design a dual counterfactual generation module that simulates two types of interventions: replacing aspect terms and rewriting descriptive content, thereby disentangling the spurious dependencies from causal sentiment cues. We further introduce a sample-aware counterfactual selection strategy to retain high-quality, diverse counterfactuals tailored to each generation path. Finally, a confidence-guided integration mechanism adaptively fuses counterfactual signals into the main prediction stream. Extensive experiments on standard MABSC benchmarks demonstrate that DPCI not only achieves state-of-the-art performance but also significantly improves model robustness.
pdf
bib
abs
Job Unfair: An Investigation of Gender and Occupational Bias in Free-Form Text Completions by LLMs
Camilla Casula
|
Sebastiano Vecellio Salto
|
Elisa Leonardelli
|
Sara Tonelli
Disentangling how gender and occupations are encoded by LLMs is crucial to identify possible biases and prevent harms, especially given the widespread use of LLMs in sensitive domains such as human resources. In this work, we carry out an in-depth investigation of gender and occupational biases in English and Italian as expressed by 9 different LLMs (both base and instruction-tuned). Specifically, we focus on the analysis of sentence completions when LLMs are prompted with job-related sentences including different gender representations. We carry out a manual analysis of 4,500 generated texts over 4 dimensions that can reflect bias, propose a novel embedding-based method to investigate biases in generated texts, and, finally, carry out a lexical analysis of the model completions. In our qualitative and quantitative evaluation we show that many facets of social bias remain unaccounted for even in aligned models, and LLMs in general still reflect existing gender biases in both languages. Finally, we find that models still struggle with gender-neutral expressions, especially beyond English.
pdf
bib
abs
C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations
Chengqian Ma
|
Wei Tao
|
Steven Y. Guo
Spoken Dialogue Models (SDMs) have recently attracted significant attention for their ability to generate voice responses directly to users’ spoken queries. Despite their increasing popularity, there exists a gap in research focused on comprehensively understanding their practical effectiveness in comprehending and emulating human conversations. This is especially true compared to text-based Large Language Models (LLMs), which benefit from extensive benchmarking. Human voice interactions are inherently more complex than text due to characteristics unique to spoken dialogue. Ambiguity poses one challenge, stemming from semantic factors like polysemy, as well as phonological aspects such as heterographs, heteronyms, and stress patterns. Additionally, context-dependency, like omission, coreference, and multi-turn interaction, adds further complexity to human conversational dynamics. To illuminate the current state of SDM development and to address these challenges, we present a benchmark dataset in this paper, which comprises 1,079 instances in English and Chinese. Accompanied by an LLM-based evaluation method that closely aligns with human judgment, this dataset facilitates a comprehensive exploration of the performance of SDMs in tackling these practical challenges.
pdf
bib
abs
Understanding LLMs’ Cross-Lingual Context Retrieval: How Good It Is And Where It Comes From
Changjiang Gao
|
Hankun Lin
|
Xin Huang
|
Xue Han
|
Junlan Feng
|
Chao Deng
|
Jiajun Chen
|
Shujian Huang
Cross-lingual context retrieval (extracting contextual information in one language based on requests in another) is a fundamental aspect of cross-lingual alignment, but its performance and mechanism for large language models (LLMs) remain unclear. In this paper, we evaluate the cross-lingual context retrieval of over 40 LLMs across 12 languages, using cross-lingual machine reading comprehension (xMRC) as a representative scenario. Our results show that post-trained open LLMs exhibit strong cross-lingual context retrieval ability, comparable to closed-source LLMs such as GPT-4o, and their estimated oracle performances greatly improve after post-training. Our mechanism analysis shows that the cross-lingual context retrieval process can be divided into two main phases: question encoding and answer retrieval, which are formed in pre-training and post-training respectively. The stability of this phasing correlates with xMRC performance, and the xMRC bottleneck lies at the last model layers in the second phase, where the effect of post-training can be evidently observed. Our results also indicate that larger-scale pretraining alone cannot improve xMRC performance. Instead, larger LLMs need further multilingual post-training to fully unlock their cross-lingual context retrieval potential.
pdf
bib
abs
Blind Men and the Elephant: Diverse Perspectives on Gender Stereotypes in Benchmark Datasets
Mahdi Zakizadeh
|
Mohammad Taher Pilehvar
Accurately measuring gender stereotypical bias in language models is a complex task with many hidden aspects. Current benchmarks have underestimated this multifaceted challenge and failed to capture the full extent of the problem. This paper examines the inconsistencies between intrinsic stereotype benchmarks. We propose that currently available benchmarks each capture only partial facets of gender stereotypes, and when considered in isolation, they provide just a fragmented view of the broader landscape of bias in language models. Using StereoSet and CrowS-Pairs as case studies, we investigated how data distribution affects benchmark results. By applying a framework from social psychology to balance the data of these benchmarks across various components of gender stereotypes, we demonstrated that even simple balancing techniques can significantly improve the correlation between different measurement approaches. Our findings underscore the complexity of gender stereotyping in language models and point to new directions for developing more refined techniques to detect and reduce bias.
pdf
bib
abs
Linguistic and Embedding-Based Profiling of Texts Generated by Humans and Large Language Models
Sergio E. Zanotto
|
Segun Aroyehun
The rapid advancements in large language models (LLMs) have significantly improved their ability to generate natural language, making texts generated by LLMs increasingly indistinguishable from human-written texts. While recent research has primarily focused on using LLMs to classify texts as either human-written or machine-generated, our study focuses on characterizing these texts using a set of linguistic features across different linguistic levels such as morphology, syntax, and semantics. We select a dataset of human-written and machine-generated texts spanning 8 domains and produced by 11 different LLMs. We calculate different linguistic features such as dependency length and emotionality, and we use them for characterizing human-written and machine-generated texts along with different sampling strategies, repetition controls, and model release dates. Our statistical analysis reveals that human-written texts tend to exhibit simpler syntactic structures and more diverse semantic content. Furthermore, we calculate the variability of our set of features across models and domains. Both human- and machine-generated texts show stylistic diversity across domains, with human-written texts displaying greater variation in our features. Finally, we apply style embeddings to further test variability among human-written and machine-generated texts. Notably, newer models output text that is similarly variable, pointing to a homogenization of machine-generated texts.
pdf
bib
abs
An Interdisciplinary Approach to Human-Centered Machine Translation
Marine Carpuat
|
Omri Asscher
|
Kalika Bali
|
Luisa Bentivogli
|
Fred Blain
|
Lynne Bowker
|
Monojit Choudhury
|
Hal Daumé Iii
|
Kevin Duh
|
Ge Gao
|
Alvin C Grissom II
|
Marzena Karpinska
|
Elaine C Khoong
|
William D. Lewis
|
Andre Martins
|
Mary Nurminen
|
Douglas W. Oard
|
Maja Popovic
|
Michel Simard
|
François Yvon
Machine Translation (MT) tools are widely used today, often in contexts where professional translators are not present. Despite progress in MT technology, a gap persists between system development and real-world usage, particularly for non-expert users who may struggle to assess translation reliability. This paper advocates for a human-centered approach to MT, emphasizing the alignment of system design with diverse communicative goals and contexts of use. We survey the literature in Translation Studies and Human-Computer Interaction to recontextualize MT evaluation and design to address the diverse real-world scenarios in which MT is used today.
pdf
bib
abs
Exploring the Hidden Capacity of LLMs for One-Step Text Generation
Gleb Mezentsev
|
Ivan Oseledets
A recent study showed that large language models (LLMs) can reconstruct surprisingly long texts — up to thousands of tokens — via autoregressive generation from just one trained input embedding. In this work, we explore whether autoregressive decoding is essential for such reconstruction. We show that frozen LLMs can generate hundreds of accurate tokens in just one token-parallel forward pass, when provided with only two learned embeddings. This reveals a surprising and underexplored multi-token generation capability of autoregressive LLMs. We examine these embeddings and characterize the information they encode. We also empirically show that, although these representations are not unique for a given text, they form connected and local regions in embedding space — suggesting the potential to train a practical encoder. The existence of such representations hints that multi-token generation may be natively accessible in off-the-shelf LLMs via a learned input encoder, eliminating heavy retraining and helping to overcome the fundamental bottleneck of autoregressive decoding while reusing already-trained models.
pdf
bib
abs
Mixture of Weight-shared Heterogeneous Group Attention Experts for Dynamic Token-wise KV Optimization
Guanghui Song
|
Dongping Liao
|
Yiren Zhao
|
Kejiang Ye
|
Cheng-zhong Xu
|
Xitong Gao
Transformer models face scalability challenges in causal language modeling (CLM) due to inefficient memory allocation for growing key-value (KV) caches, which strains compute and storage resources. Existing methods like Grouped Query Attention (GQA) and token-level KV optimization improve efficiency but rely on rigid resource allocation, often discarding “low-priority” tokens or statically grouping them, failing to address the dynamic spectrum of token importance. We propose mixSGA, a novel mixture-of-expert (MoE) approach that dynamically optimizes token-wise computation and memory allocation. Unlike prior approaches, mixSGA retains all tokens while adaptively routing them to specialized experts with varying KV group sizes, balancing granularity and efficiency. Our key novelties include: (1) a token-wise expert-choice routing mechanism guided by learned importance scores, enabling proportional resource allocation without token discard; (2) weight-sharing across grouped attention projections to minimize parameter overhead; and (3) an auxiliary loss to ensure one-hot routing decisions for training-inference consistency in CLMs. Extensive evaluations across Llama3, TinyLlama, OPT, and Gemma2 model families show mixSGA’s superiority over static baselines. On instruction-following and continued pretraining tasks, mixSGA achieves higher ROUGE-L and lower perplexity under the same KV budgets.
pdf
bib
abs
PathwiseRAG: Multi-Dimensional Exploration and Integration Framework
Hengrui Zhang
|
Pin-Siang Huang
|
Zhen Zhang
|
Peican Lin
|
Yao-Ching Yu
|
Bo Hu
|
Yulu Du
Conventional retrieval-augmented generation (RAG) systems employ rigid retrieval strategies that create (1) knowledge blind spots across domain boundaries, (2) reasoning fragmentation when processing interdependent concepts, and (3) contradictions from conflicting evidence sources. Motivated by these limitations, we introduce PathwiseRAG, which addresses these challenges through: intent-aware strategy selection to eliminate blind spots, dynamic reasoning networks that capture sub-problem interdependencies to overcome fragmentation, and parallel path exploration with adaptive refinement to resolve conflicts. The framework models query intent across semantic and reasoning dimensions, constructs a directed acyclic graph of interconnected sub-problems, and explores multiple reasoning trajectories while continuously adapting to emerging evidence. Evaluation across challenging benchmarks demonstrates significant improvements over state-of-the-art RAG systems, with average accuracy gains of 4.9% and up to 6.9% on complex queries, establishing a new paradigm for knowledge-intensive reasoning by transforming static retrieval into dynamic, multi-dimensional exploration.
pdf
bib
abs
“Mm, Wat?” Detecting Other-initiated Repair Requests in Dialogue
Anh Ha Ngo
|
Nicolas Rollet
|
Catherine Pelachaud
|
Chloé Clavel
Maintaining mutual understanding is a key component in human-human conversation to avoid conversation breakdowns, in which repair, particularly Other-Initiated Repair (OIR, when one speaker signals trouble and prompts the other to resolve), plays a vital role. However, Conversational Agents (CAs) still fail to recognize user repair initiation, leading to breakdowns or disengagement. This work proposes a multimodal model to automatically detect repair initiation in Dutch dialogues by integrating linguistic and prosodic features grounded in Conversation Analysis. The results show that prosodic cues complement linguistic features and significantly improve the results of pretrained text and audio embeddings, offering insights into how different features interact. Future directions include incorporating visual cues and exploring multilingual and cross-context corpora to assess robustness and generalizability.
pdf
bib
abs
R-BPE: Improving BPE-Tokenizers with Token Reuse
Nancy Hamdan
|
Osama Rakan Al Mraikhat
|
Fadi Zaraket
This paper presents R-BPE, a lightweight framework for adapting existing Byte-Pair Encoding (BPE) tokenizers to better support a specified target language. It reuses tokens from user-excluded languages and creates ID-based maps to resolve the new tokens of the chosen language. We evaluate R-BPE on Arabic as a target language. R-BPE reduced subword fertility by an average of 24.4% across the LLaMA 3.1 8B, Command R 35B, and Qwen 3 8B models. Applied to LLaMA 3.1 8B in continued pretraining mode, R-BPE yields a 7.33% reduction in training time. On the ArabicMMLU benchmark, the resulting model improved by 5.09 points on five in-domain topics and matched the original model’s overall performance. It also preserved performance on EnglishMMLU. R-BPE effectively leverages existing models’ tokenizers, embedding layers, and performance to better support target languages without incurring model size changes. We release an R-BPE implementation that is compatible with HuggingFace interfaces and thereby readily applicable to a wide range of existing models at https://acr.ps/1L9GPmL.
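For readers unfamiliar with the subword fertility metric cited above (the average number of subword tokens a tokenizer emits per word), a minimal sketch follows; the toy chunk-of-three "tokenizer" and the sample sentence are purely illustrative stand-ins for a real BPE tokenizer.

```python
from typing import Callable, List

def subword_fertility(words: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subword tokens produced per whitespace-delimited word."""
    if not words:
        return 0.0
    total_tokens = sum(len(tokenize(w)) for w in words)
    return total_tokens / len(words)

# Toy stand-in for a BPE tokenizer: split every word into chunks of 3 characters.
def toy_bpe(word: str) -> List[str]:
    return [word[i:i + 3] for i in range(0, len(word), 3)]

text = "tokenizer adaptation reduces fertility for the target language"
print(round(subword_fertility(text.split(), toy_bpe), 2))   # 2.5 for this toy setup
```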
pdf
bib
abs
Language Models Can be Efficiently Steered via Minimal Embedding Layer Transformations
Diogo Tavares
|
David Semedo
|
Alexander Rudnicky
|
Joao Magalhaes
Large Language Models (LLMs) are increasingly costly to fine-tune due to their size, with embedding layers alone accounting for up to 20% of model parameters. While Parameter-Efficient Fine-Tuning (PEFT) methods exist, they largely overlook the embedding layer. In this paper, we introduce TinyTE, a novel PEFT approach that steers model behavior via minimal translational transformations in the embedding space. TinyTE modifies input embeddings without altering hidden layers, achieving competitive performance while requiring approximately 0.0001% of the parameters needed for full fine-tuning. Experiments across architectures provide a new lens for understanding the relationship between input representations and model behavior—revealing them to be more flexible at their foundation than previously thought.
pdf
bib
abs
Adversarial Attacks Against Automated Fact-Checking: A Survey
Fanzhen Liu
|
Sharif Abuadbba
|
Kristen Moore
|
Surya Nepal
|
Cecile Paris
|
Jia Wu
|
Jian Yang
|
Quan Z. Sheng
In an era where misinformation spreads freely, fact-checking (FC) plays a crucial role in verifying claims and promoting reliable information. While automated fact-checking (AFC) has advanced significantly, existing systems remain vulnerable to adversarial attacks that manipulate or generate claims, evidence, or claim-evidence pairs. These attacks can distort the truth, mislead decision-makers, and ultimately undermine the reliability of FC models. Despite growing research interest in adversarial attacks against AFC systems, a comprehensive, holistic overview of key challenges remains lacking. These challenges include understanding attack strategies, assessing the resilience of current models, and identifying ways to enhance robustness. This survey provides the first in-depth review of adversarial attacks targeting FC, categorizing existing attack methodologies and evaluating their impact on AFC systems. Additionally, we examine recent advancements in adversary-aware defenses and highlight open research questions that require further exploration. Our findings underscore the urgent need for resilient FC frameworks capable of withstanding adversarial manipulations in pursuit of preserving high verification accuracy.
pdf
bib
abs
WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild?
An-Lan Wang
|
Jingqun Tang
|
Lei Liao
|
Hao Feng
|
Qi Liu
|
Xiang Fei
|
Jinghui Lu
|
Han Wang
|
Hao Liu
|
Yuliang Liu
|
Xiang Bai
|
Can Huang
The rapid advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced capabilities in Document Understanding. However, prevailing benchmarks like DocVQA and ChartQA predominantly comprise scanned or digital documents, inadequately reflecting the intricate challenges posed by diverse real-world scenarios such as variable illumination and physical distortions. This paper introduces WildDoc, the inaugural benchmark designed specifically for assessing document understanding in natural environments. WildDoc incorporates a diverse set of manually captured document images reflecting real-world conditions and leverages document sources from established benchmarks to facilitate comprehensive comparisons with digital or scanned documents. Further, to rigorously evaluate model robustness, each document is captured four times under different conditions. Evaluations of state-of-the-art MLLMs on WildDoc expose substantial performance declines and underscore the models’ inadequate robustness compared to traditional benchmarks, highlighting the unique challenges posed by real-world document understanding.
pdf
bib
abs
DCR: Quantifying Data Contamination in LLMs Evaluation
Cheng Xu
|
Nan Yan
|
Shuhao Guan
|
Changhong Jin
|
Yuke Mei
|
Yibing Guo
|
Tahar Kechadi
The rapid advancement of large language models (LLMs) has heightened concerns about benchmark data contamination (BDC), where models inadvertently memorize evaluation data during the training process, inflating performance metrics, and undermining genuine generalization assessment. This paper introduces the Data Contamination Risk (DCR) framework, a lightweight, interpretable pipeline designed to detect and quantify BDC risk across four granular levels: semantic, informational, data, and label. By synthesizing contamination scores via a fuzzy inference system, DCR produces a unified DCR Factor that adjusts raw accuracy to reflect contamination-aware performance. Validated on 9 LLMs (0.5B-72B) across sentiment analysis, fake news detection, and arithmetic reasoning tasks, the DCR framework reliably diagnoses contamination severity and, using the DCR Factor, adjusts accuracy to within a 4% average error across the three benchmarks relative to the uncontaminated baseline. Emphasizing computational efficiency and transparency, DCR provides a practical tool for integrating contamination assessment into routine evaluations, fostering fairer comparisons and enhancing the credibility of LLM benchmarking practices.
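The abstract does not spell out the fuzzy inference system or the exact adjustment formula, so the following is only a hypothetical illustration of the general idea: level-wise contamination scores in [0, 1] are aggregated into a single factor, which then discounts raw accuracy. The level weights and the weighted-average aggregation are assumptions, not the DCR implementation.

```python
def contamination_factor(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Aggregate per-level contamination scores (each in [0, 1]) into one factor.

    A weighted average stands in here for the fuzzy inference system described
    in the abstract; the weights below are assumptions.
    """
    total_w = sum(weights.values())
    return sum(scores[level] * weights[level] for level in scores) / total_w

def adjusted_accuracy(raw_accuracy: float, dcr_factor: float) -> float:
    """Discount raw accuracy by the estimated contamination factor."""
    return raw_accuracy * (1.0 - dcr_factor)

levels = {"semantic": 0.6, "informational": 0.4, "data": 0.2, "label": 0.1}
weights = {"semantic": 1.0, "informational": 1.0, "data": 2.0, "label": 2.0}
factor = contamination_factor(levels, weights)
print(round(factor, 3), round(adjusted_accuracy(0.82, factor), 3))
```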
pdf
bib
abs
Building Trust in Clinical LLMs: Bias Analysis and Dataset Transparency
Svetlana Maslenkova
|
Clement Christophe
|
Marco AF Pimentel
|
Tathagata Raha
|
Muhammad Umar Salman
|
Ahmed Al Mahrooqi
|
Avani Gupta
|
Shadab Khan
|
Ronnie Rajan
|
Praveenkumar Kanithi
Large language models offer transformative potential for healthcare, yet their responsible and equitable development depends critically on a deeper understanding of how training data characteristics influence model behavior, including the potential for bias. Current practices in dataset curation and bias assessment often lack the necessary transparency, creating an urgent need for comprehensive evaluation frameworks to foster trust and guide improvements. In this study, we present an in-depth analysis of potential downstream biases in clinical language models, with a focus on differential opioid prescription tendencies across diverse demographic groups, such as ethnicity, gender, and age. As part of this investigation, we introduce HC4: Healthcare Comprehensive Commons Corpus, a novel and extensively curated pretraining dataset exceeding 89 billion tokens. Our evaluation leverages both established general benchmarks and a novel, healthcare-specific methodology, offering crucial insights to support fairness and safety in clinical AI applications.
pdf
bib
abs
Surprise Calibration for Better In-Context Learning
Zhihang Tan
|
Jingrui Hou
|
Ping Wang
|
Qibiao Hu
|
Peng Zhu
In-context learning (ICL) has emerged as a powerful paradigm for task adaptation in large language models (LLMs), where models infer underlying task structures from a few demonstrations. However, ICL remains susceptible to biases that arise from prior knowledge and contextual demonstrations, which can degrade the performance of LLMs. Existing bias calibration methods typically apply fixed class priors across all inputs, limiting their efficacy in dynamic ICL settings where the context for each query differs. To address these limitations, we adopt implicit sequential Bayesian inference as a framework for interpreting ICL, identify “surprise” as an informative signal for class prior shift, and introduce a novel method—Surprise Calibration (SC). SC leverages the notion of surprise to capture the temporal dynamics of class priors, providing a more adaptive and computationally efficient solution for in-context learning. We empirically demonstrate the superiority of SC over existing bias calibration techniques across a range of benchmark natural language processing tasks.
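For context, the fixed-class-prior calibration that the abstract identifies as the existing baseline can be sketched as follows: a prior over answer classes is estimated once (e.g., from a content-free input), divided out of every prediction, and the result renormalised. This is a sketch of that baseline under the stated assumption, not an implementation of Surprise Calibration itself.

```python
import numpy as np

def calibrate_with_fixed_prior(probs: np.ndarray, class_prior: np.ndarray) -> np.ndarray:
    """Fixed-prior calibration: divide out an estimated class prior and renormalise.

    probs:       (n_examples, n_classes) model probabilities for each query
    class_prior: (n_classes,) prior estimated once, e.g. from a content-free input
    """
    adjusted = probs / class_prior                       # same prior for every input
    return adjusted / adjusted.sum(axis=1, keepdims=True)

probs = np.array([[0.70, 0.30],
                  [0.55, 0.45]])
prior = np.array([0.65, 0.35])                           # model's bias toward class 0
print(calibrate_with_fixed_prior(probs, prior).round(3))
```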
pdf
bib
abs
SPARK: Simulating the Co-evolution of Stance and Topic Dynamics in Online Discourse with LLM-based Agents
Bowen Zhang
|
Yi Yang
|
Fuqiang Niu
|
Xianghua Fu
|
Genan Dai
|
Hu Huang
Topic evolution and stance dynamics are deeply intertwined in online social media, shaping the fragmentation and polarization of public discourse. Yet existing dynamic topic models and stance analysis approaches usually consider these processes in isolation, relying on abstractions that lack interpretability and agent-level behavioral fidelity. We present the stance and topic evolution reasoning framework (SPARK), the first LLM-based multi-agent simulation framework for jointly modeling the co-evolution of topics and stances through natural language interactions. In SPARK, each agent is instantiated as an LLM persona with unique demographic and psychological traits, equipped with memory and reflective reasoning. Agents engage in daily conversations, adapt their stances, and organically introduce emergent subtopics, enabling interpretable, fine-grained simulation of discourse dynamics at scale. Experiments across five real-world domains show that SPARK captures key empirical patterns—such as rapid topic innovation in technology, domain-specific stance polarization, and the influence of personality on stance shifts and topic emergence. Our framework quantitatively reveals the bidirectional mechanisms by which stance shifts and topic evolution reinforce each other, a phenomenon rarely addressed in prior work. SPARK provides actionable insights and a scalable tool for understanding and mitigating polarization in online discourse. Code and simulation resources will be released after acceptance.
pdf
bib
abs
Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth
Yang Wang
|
Chenghao Xiao
|
Chia-Yi Hsiao
|
Zi Yan Chang
|
Chi-Li Chen
|
Tyler Loakman
|
Chenghua Lin
We introduce Drivelology, a unique linguistic phenomenon characterised as “nonsense with depth” - utterances that are syntactically coherent yet pragmatically paradoxical, emotionally loaded, or rhetorically subversive. While such expressions may resemble surface-level nonsense, they encode implicit meaning requiring contextual inference, moral reasoning, or emotional interpretation. We find that current large language models (LLMs), despite excelling at many natural language processing (NLP) tasks, consistently fail to grasp the layered semantics of Drivelological text. To investigate this, we construct a benchmark dataset of more than 1,200 meticulously curated and diverse examples across English, Mandarin, Spanish, French, Japanese, and Korean. Each example underwent careful expert review to verify its Drivelological characteristics, involving multiple rounds of discussion and adjudication to address disagreements. Using this dataset, we evaluate a range of LLMs on classification, generation, and reasoning tasks. Our results reveal clear limitations of LLMs: models often confuse Drivelology with shallow nonsense, produce incoherent justifications, or miss implied rhetorical functions altogether. These findings highlight a deep representational gap in LLMs’ pragmatic understanding and challenge the assumption that statistical fluency implies cognitive comprehension. We release our dataset and code to facilitate further research in modelling linguistic depth beyond surface-level coherence.
pdf
bib
abs
Can Large Language Models be Effective Online Opinion Miners?
Ryang Heo
|
Yongsik Seo
|
Junseong Lee
|
Dongha Lee
The surge of user-generated online content presents a wealth of insights into customer preferences and market trends. However, the highly diverse, complex, and context-rich nature of such content poses significant challenges to traditional opinion mining approaches. To address this, we introduce the Online Opinion Mining Benchmark (OOMB), a novel dataset and evaluation protocol designed to assess the ability of large language models (LLMs) to mine opinions effectively from diverse and intricate online environments. OOMB provides, for each content instance, an extensive set of (entity, feature, opinion) tuples and a corresponding opinion-centric insight that highlights key opinion topics, thereby enabling the evaluation of both the extractive and abstractive capabilities of models. Through our proposed benchmark, we conduct a comprehensive analysis of which aspects remain challenging and where LLMs exhibit adaptability, to explore whether they can effectively serve as opinion miners in realistic online scenarios. This study lays the foundation for LLM-based opinion mining and discusses directions for future research in this field.
pdf
bib
abs
Can Large Language Models Translate Unseen Languages in Underrepresented Scripts?
Dianqing Lin
|
Aruukhan
|
Hongxu Hou
|
Shuo Sun
|
Wei Chen
|
Yichen Yang
|
Guo Dong Shi
Large language models (LLMs) have demonstrated impressive performance in machine translation, but still struggle with unseen low-resource languages, especially those written in underrepresented scripts. To investigate whether LLMs can translate such languages with the help of linguistic resources, we introduce Lotus, a benchmark designed to evaluate translation for Mongolian (in traditional script) and Yi. Our study shows that while linguistic resources can improve translation quality as measured by automatic metrics, LLMs remain limited in their ability to handle these languages effectively. We hope our work provides insights for the low-resource NLP community and fosters further progress in machine translation for underrepresented script low-resource languages. Our code and data are available.
pdf
bib
abs
InterIDEAS: Philosophical Intertextuality via LLMs
Yue Yang
|
Yinzhi Xu
|
Chenghao Huang
|
JohnMichael Jurgensen
|
Han Hu
|
Hao Wang
The formation and circulation of ideas in philosophy have profound implications for understanding philosophical dynamism–enabling us to identify seminal texts, delineate intellectual traditions, and track changing conventions in the act of philosophizing. However, traditional analyses of these issues often depend on manual reading and subjective interpretation, constrained by human cognitive limits. We introduce InterIDEAS, a pioneering dataset designed to bridge philosophy, literary studies, and natural language processing (NLP). By merging theories of intertextuality from literary studies with bibliometric techniques and recent LLMs, InterIDEAS enables both quantitative and qualitative analysis of the intellectual, social, and historical relations embedded within authentic philosophical texts. This dataset not only assists the study of philosophy but also contributes to the development of language models by providing a training corpus that challenges and enhances their interpretative capacity.
pdf
bib
abs
KCS: Diversify Multi-hop Question Generation with Knowledge Composition Sampling
Yangfan Wang
|
Jie Liu
|
Chen Tang
|
Lian Yan
|
Jingchi Jiang
Multi-hop question answering faces substantial challenges due to data sparsity, which increases the likelihood of language models learning spurious patterns. To address this issue, prior research has focused on diversifying question generation through content planning and varied expression. However, these approaches often emphasize generating simple questions and neglect the integration of essential knowledge, such as relevant sentences within documents. This paper introduces the **Knowledge Composition Sampling (KCS)**, an innovative framework designed to expand the diversity of generated multi-hop questions by sampling varied knowledge compositions within a given context. KCS models the knowledge composition selection as a sentence-level conditional prediction task and utilizes a probabilistic contrastive loss to predict the next most relevant piece of knowledge. During inference, we employ a stochastic decoding strategy to effectively balance accuracy and diversity. Compared to competitive baselines, our KCS improves the overall accuracy of knowledge composition selection by 3.9%, and its application for data augmentation yields improvements on HotpotQA and 2WikiMultihopQA datasets. Our code is available at: https://github.com/yangfanww/kcs.
pdf
bib
abs
Fooling the LVLM Judges: Visual Biases in LVLM-Based Evaluation
Yerin Hwang
|
Dongryeol Lee
|
Kyungmin Min
|
Taegwan Kang
|
Yongil Kim
|
Kyomin Jung
Recently, large vision–language models (LVLMs) have emerged as the preferred tools for judging text–image alignment, yet their robustness along the visual modality remains underexplored. This work is the first study to address a key research question: Can adversarial visual manipulations systematically fool LVLM judges into assigning unfairly inflated scores? We define potential image-induced biases within the context of T2I evaluation and examine how these biases affect the evaluations of LVLM judges. Moreover, we introduce a novel, fine-grained, multi-domain meta-evaluation benchmark named FRAME, which is deliberately constructed to exhibit diverse score distributions. By introducing the defined biases into the benchmark, we reveal that all tested LVLM judges exhibit vulnerability across all domains, consistently inflating scores for manipulated images. Further analysis reveals that combining multiple biases amplifies their effects, and pairwise evaluations are similarly susceptible. Moreover, we observe that visual biases persist despite prompt-based mitigation strategies, highlighting the vulnerability of current LVLM evaluation systems and underscoring the urgent need for more robust LVLM judges.
pdf
bib
abs
Disentangled Information Bottleneck for Adversarial Text Defense
Yidan Xu
|
Xinghao Yang
|
Wei Liu
|
Bao-di Liu
|
Weifeng Liu
Adversarial text defense is a significant strategy to protect modern NLP models from being attacked. Typical text defense methods usually enhance the model’s robustness by model retraining or equipping it with a data preprocessing step, aiming to eliminate the non-robust features and preserve the robust ones. Although some efforts have been made to recognize the robust features, e.g., by the information bottleneck (IB) technique, how to fully disentangle the robust and non-robust representation remains a big challenge. To alleviate this problem, we propose a novel text defense method, named Disentangled Information Bottleneck (DisIB), with two major merits. Firstly, we separate the robust features and non-robust features with a disentangled two-line framework rather than the one-line compression network in IB. This prevents the loss of robust features caused by information compression and produces complete robust features. Secondly, we design a discriminator network to approximate the minimum mutual information of the two lines, which sufficiently disentangles robust and non-robust features. To validate the effectiveness of our DisIB, we conduct a total of 96 defense experiments on four datasets by defending against four popular attack methods. Experimental results demonstrate that our method significantly outperforms six baselines, with accuracy improvements ranging from 3.8% to 20.7%.
pdf
bib
abs
How do Language Models Reshape Entity Alignment? A Survey of LM-Driven EA Methods: Advances, Benchmarks, and Future
Zerui Chen
|
Huiming Fan
|
Qianyu Wang
|
Tao He
|
Ming Liu
|
Heng Chang
|
Weijiang Yu
|
Ze Li
|
Bing Qin
Entity alignment (EA), critical for knowledge graph (KG) integration, identifies equivalent entities across different KGs. Traditional methods often face challenges in semantic understanding and scalability. The rise of language models (LMs), particularly large language models (LLMs), has provided powerful new strategies. This paper systematically reviews LM-driven EA methods, proposing a novel taxonomy that categorizes methods in three key stages: data preparation, feature embedding, and alignment. We further summarize key benchmarks, evaluation metrics, and discuss future directions. This paper aims to provide researchers and practitioners with a clear and comprehensive understanding of how language models reshape the field of entity alignment.
pdf
bib
abs
Enhancing LLM-Based Social Bot via an Adversarial Learning Framework
Fanqi Kong
|
Xiaoyuan Zhang
|
Xinyu Chen
|
Yaodong Yang
|
Song-Chun Zhu
|
Xue Feng
Developing Large Language Model (LLM) agents that exhibit human-like behavior, encompassing not only individual heterogeneity rooted in unique user profiles but also adaptive response to socially connected neighbors, is a significant research challenge. Social media platforms, with their diverse user data and explicit social structures, provide an ideal testbed for such investigations. This paper introduces EvoBot, an **Evo**lving LLM-based social **Bot** that significantly enhances human-like generative capabilities through a novel adversarial learning framework. EvoBot is initialized by Supervised Fine-Tuning (SFT) on representative data from social media and then iteratively refines its generation of sophisticated, human-like content via Direct Preference Optimization (DPO). This refinement is guided by feedback from a co-adapting **Detector** which concurrently improves its ability to distinguish EvoBot from humans, thereby creating an increasingly challenging learning environment for EvoBot. Experiments demonstrate that EvoBot generates content aligned with diverse user profiles, increasingly bypassing the co-adapting Detector through human-like expression. Moreover, it exhibits strong social responsiveness, more accurately modeling real-world opinion dynamics and information spread in multi-agent simulations. The framework also yields a more robust Detector, underscoring its broader utility for both advanced agent development and related detection tasks. The code is available at https://anonymous.4open.science/r/EvoBot-036D.
pdf
bib
abs
GER-LLM: Efficient and Effective Geospatial Entity Resolution with Large Language Model
Haojia Zhu
|
Zhicheng Li
|
Jiahui Jin
Geospatial Entity Resolution (GER) plays a central role in integrating spatial data from diverse sources. However, existing methods are limited by their reliance on large amounts of training data and their inability to incorporate commonsense knowledge. While recent advances in Large Language Models (LLMs) offer strong semantic reasoning and zero-shot capabilities, directly applying them to GER remains inadequate due to their limited spatial understanding and high inference cost. In this work, we present GER-LLM, a framework that integrates LLMs into the GER pipeline. To address the challenge of spatial understanding, we design a spatially informed blocking strategy based on adaptive quadtree partitioning and Area of Interest (AOI) detection, preserving both spatial proximity and functional relationships. To mitigate inference overhead, we introduce a group prompting mechanism with graph-based conflict resolution, enabling joint evaluation of diverse candidate pairs and enforcing global consistency across alignment decisions. Extensive experiments on real-world datasets demonstrate the effectiveness of our approach, yielding significant improvements over state-of-the-art methods.
pdf
bib
abs
CodeRAG: Finding Relevant and Necessary Knowledge for Retrieval-Augmented Repository-Level Code Completion
Sheng Zhang
|
Yifan Ding
|
Shuquan Lian
|
Shun Song
|
Hui Li
Repository-level code completion automatically predicts the unfinished code based on the broader information from the repository. Recent strides in Code Large Language Models (code LLMs) have spurred the development of repository-level code completion methods, yielding promising results. Nevertheless, they suffer from issues such as inappropriate query construction, single-path code retrieval, and misalignment between code retriever and code LLM. To address these problems, we introduce CodeRAG, a framework tailored to identify relevant and necessary knowledge for retrieval-augmented repository-level code completion. Its core components include log probability guided query construction, multi-path code retrieval, and preference-aligned BestFit reranking. Extensive experiments on benchmarks ReccEval and CCEval demonstrate that CodeRAG significantly and consistently outperforms state-of-the-art methods. The implementation of CodeRAG is available at https://github.com/KDEGroup/CodeRAG.
pdf
bib
abs
Searching for the Most Human-like Emergent Language
Brendon Boldt
|
David R. Mortensen
In this paper, we design a signalling game-based emergent communication environment to generate state-of-the-art emergent languages in terms of similarity to human language. This is done with hyperparameter optimization, using XferBench as the objective function. XferBench quantifies the statistical similarity of emergent language to human language by measuring its suitability for deep transfer learning to human language. Additionally, we demonstrate the predictive power of entropy on the transfer learning performance of emergent language as well as corroborate previous results on the entropy-minimization properties of emergent communication systems. Finally, we report generalizations regarding what hyperparameters produce more realistic emergent languages, that is, ones which transfer better to human language.
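Since message entropy is used above as a predictor of transfer performance, a short sketch of the standard Shannon-entropy estimate over token frequencies may be useful; the toy corpus of discrete symbol messages is illustrative only.

```python
import math
from collections import Counter

def token_entropy(tokens: list[str]) -> float:
    """Shannon entropy (in bits) of the empirical token distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy emergent-language corpus: messages are sequences of discrete symbols.
corpus = ["3 7 7 1", "3 1 1 9", "7 7 3 3"]
tokens = [tok for msg in corpus for tok in msg.split()]
print(round(token_entropy(tokens), 3))
```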
pdf
bib
abs
Does Context Matter? A Prosodic Comparison of English and Spanish in Monolingual and Multilingual Discourse Settings
Debasmita Bhattacharya
|
David Sasu
|
Michela Marchini
|
Natalie Schluter
|
Julia Hirschberg
Different languages are known to have typical and distinctive prosodic profiles. However, the majority of work on prosody across languages has been restricted to monolingual discourse contexts. We build on prior studies by asking: how does the nature of the discourse context influence variations in the prosody of monolingual speech? To answer this question, we compare the prosody of spontaneous, conversational monolingual English and Spanish both in monolingual and in multilingual speech settings. For both languages, we find that monolingual speech produced in a monolingual context is prosodically different from that produced in a multilingual context, with differences becoming more marked as proximity to multilingual discourse increases. Our work is the first to incorporate multilingual discourse contexts into the study of native-level monolingual prosody, and has potential downstream applications for the recognition and synthesis of multilingual speech.
pdf
bib
abs
ZERA: Zero-init Instruction Evolving Refinement Agent – From Zero Instructions to Structured Prompts via Principle-based Optimization
Seungyoun Yi
|
Minsoo Khang
|
Sungrae Park
Automatic Prompt Optimization (APO) improves large language model (LLM) performance by refining prompts for specific tasks. However, prior APO methods typically focus only on user prompts, rely on unstructured feedback, and require large sample sizes and long iteration cycles—making them costly and brittle. We propose ZERA (Zero-init Instruction Evolving Refinement Agent), a novel framework that jointly optimizes both system and user prompts through principled, low-overhead refinement. ZERA scores prompts using eight generalizable criteria with automatically inferred weights, and revises prompts based on these structured critiques. This enables fast convergence to high-quality prompts using minimal examples and short iteration cycles. We evaluate ZERA across five LLMs and nine diverse datasets spanning reasoning, summarization, and code generation tasks. Experimental results demonstrate consistent improvements over strong baselines. Further ablation studies highlight the contribution of each component to more effective prompt construction. Our implementation including all prompts is publicly available at https://github.com/younatics/zera-agent.
pdf
bib
abs
Toward Machine Interpreting: Lessons from Human Interpreting Studies
Matthias Sperber
|
Maureen de Seyssel
|
Jiajun Bao
|
Matthias Paulik
Current speech translation systems, while having achieved impressive accuracies, are rather static in their behavior and do not adapt to real-world situations in ways human interpreters do. In order to improve their practical usefulness and enable interpreting-like experiences, a precise understanding of the nature of human interpreting is crucial. To this end, we discuss human interpreting literature from the perspective of the machine translation field, while considering both operational and qualitative aspects. We identify implications for the development of speech translation systems and argue that there is great potential to adopt many human interpreting principles using recent modeling techniques. We hope that our findings provide inspiration for closing the perceived usability gap, and can motivate progress toward true machine interpreting.
pdf
bib
abs
FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games
Jaewoo Ahn
|
Junseo Kim
|
Heeseung Yun
|
Jaehyeon Son
|
Dongmin Park
|
Jaewoong Cho
|
Gunhee Kim
GUI agents powered by LLMs show promise in interacting with diverse digital environments. Among these, video games offer a valuable testbed due to their varied interfaces, with adventure games posing additional challenges through complex, narrative-driven interactions. Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion and tackle the observation-behavior gap—the challenge of remembering and acting on earlier gameplay information. We also propose CUA-as-a-judge, an automated gameplay evaluator, and COAST, an agentic framework leveraging long-term clue memory to better plan and solve sequential tasks. Experiments show current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked discrepancy between humans and best-performing agents warrants continued research efforts to narrow this divide.
pdf
bib
abs
FLARE: Faithful Logic-Aided Reasoning and Exploration
Erik Arakelyan
|
Pasquale Minervini
|
Patrick Lewis
|
Pat Verga
|
Isabelle Augenstein
Modern Question Answering (QA) and Reasoning approaches with Large Language Models (LLMs) commonly use Chain-of-Thought (CoT) prompting but struggle with generating outputs faithful to their intermediate reasoning chains. While neuro-symbolic methods like Faithful CoT (F-CoT) offer higher faithfulness through external solvers, they require code-specialized models and struggle with ambiguous tasks. We introduce Faithful Logic-Aided Reasoning and Exploration (FLARE), which uses LLMs to plan solutions, formalize queries into logic programs, and simulate code execution through multi-hop search without external solvers. Our method achieves SOTA results on 7 out of 9 diverse reasoning benchmarks and 3 out of 3 logic inference benchmarks while enabling measurement of reasoning faithfulness. We demonstrate that model faithfulness correlates with performance and that successful reasoning traces show an 18.1% increase in unique emergent facts, 8.6% higher overlap between code-defined and execution-trace relations, and 3.6% reduction in unused relations.
pdf
bib
abs
Discourse-Driven Code-Switching: Analyzing the Role of Content and Communicative Function in Spanish-English Bilingual Speech
Debasmita Bhattacharya
|
Juan Junco
|
Divya Tadimeti
|
Julia Hirschberg
Code-switching (CSW) is commonly observed among bilingual speakers, and is motivated by various paralinguistic, syntactic, and morphological aspects of conversation. We build on prior work by asking: how do discourse-level aspects of dialogue – i.e. the content and function of speech – influence patterns of CSW? To answer this, we analyze the named entities and dialogue acts present in a Spanish-English spontaneous speech corpus, and build a predictive model of CSW based on our statistical findings. We show that discourse content and function interact with patterns of CSW to varying degrees, with a stronger influence from function overall. Our work is the first to take a discourse-sensitive approach to understanding the pragmatic and referential cues of bilingual speech and has potential applications in improving the prediction, recognition, and synthesis of code-switched speech that is grounded in authentic aspects of multilingual discourse.
pdf
bib
abs
Can Large Language Models Translate Spoken-Only Languages through International Phonetic Transcription?
Jiale Chen
|
Xuelian Dong
|
Qihao Yang
|
Wenxiu Xie
|
Tianyong Hao
Spoken-only languages are languages without a writing system. They remain excluded from modern Natural Language Processing (NLP) advancements like Large Language Models (LLMs) due to their lack of textual data. Existing NLP research focuses primarily on high-resource or written low-resource languages, leaving spoken-only languages critically underexplored. As a popular NLP paradigm, LLMs have demonstrated strong few-shot and cross-lingual generalization abilities, making them a promising solution for understanding and translating spoken-only languages. In this paper, we investigate how LLMs can translate spoken-only languages into high-resource languages by leveraging international phonetic transcription as an intermediate representation. We propose UNILANG, a unified language understanding framework that learns to translate spoken-only languages via in-context learning. Through automatic dictionary construction and knowledge retrieval, UNILANG equips LLMs with more fine-grained knowledge for improving word-level semantic alignment. To support this study, we introduce the SOLAN dataset, which consists of Bai (a spoken-only language) and its corresponding translations in a high-resource language. A series of experiments demonstrates the effectiveness of UNILANG in translating spoken-only languages, potentially contributing to the preservation of linguistic and cultural diversity. Our dataset and code will be publicly released.
pdf
bib
abs
ClimateViz: A Benchmark for Statistical Reasoning and Fact Verification on Scientific Charts
Ruiran Su
|
Jiasheng Si
|
Zhijiang Guo
|
Janet B. Pierrehumbert
Scientific fact-checking has largely focused on textual and tabular sources, neglecting scientific charts—a primary medium for conveying quantitative evidence and supporting statistical reasoning in research communication. We introduce ClimateViz, the first large-scale benchmark for scientific fact-checking grounded in real-world, expert-curated scientific charts. ClimateViz comprises 49,862 claims paired with 2,896 visualizations, each labeled as support, refute, or not enough information. To enable interpretable verification, each instance includes structured knowledge graph explanations that capture statistical patterns, temporal trends, spatial comparisons, and causal relations. We conduct a comprehensive evaluation of state-of-the-art multimodal large language models, including proprietary and open-source ones, under zero-shot and few-shot settings. Our results show that current models struggle to perform fact-checking when statistical reasoning over charts is required: even the best-performing systems, such as Gemini 2.5 and InternVL 2.5, achieve only 76.2–77.8% accuracy in label-only output settings, which is far below human performance (89.3% and 92.7%). While few-shot prompting yields limited improvements, explanation-augmented outputs significantly enhance performance in some closed-source models, notably o3 and Gemini 2.5.
pdf
bib
abs
Bridging the Gap Between Molecule and Textual Descriptions via Substructure-aware Alignment
Hyuntae Park
|
Yeachan Kim
|
SangKeun Lee
Molecule and text representation learning has gained increasing interest due to its potential for enhancing the understanding of chemical information. However, existing models often struggle to capture subtle differences between molecules and their descriptions, as they lack the ability to learn fine-grained alignments between molecular substructures and chemical phrases. To address this limitation, we introduce MolBridge, a novel molecule–text learning framework based on substructure-aware alignments. Specifically, we augment the original molecule–description pairs with additional alignment signals derived from molecular substructures and chemical phrases. To effectively learn from these enriched alignments, MolBridge employs substructure-aware contrastive learning, coupled with a self-refinement mechanism that filters out noisy alignment signals. Experimental results show that MolBridge effectively captures fine-grained correspondences and outperforms state-of-the-art baselines on a wide range of molecular benchmarks, underscoring the importance of substructure-aware alignment in molecule-text learning.
pdf
bib
abs
SLlama: Parameter-Efficient Language Model Architecture for Enhanced Linguistic Competence Under Strict Data Constraints
Victor Adelakun Omolaoye
|
Babajide Alamu Owoyele
|
Gerard de Melo
Scaling data and model size has driven recent advances in language modeling, but this strategy falters under scenarios with strict data constraints, as in the BabyLM Challenge. However, insights from Chinchilla highlight that smaller models trained on more data outperform larger counterparts trained inadequately, emphasizing the need for compact architectures. Furthermore, while embedding weight tying is a common parameter-saving technique, we find it significantly diminishes linguistic competence in compact models. In response, we explore alternative architectural strategies that preserve the parameter efficiency of tied models without sacrificing the representational benefits of untied embeddings. Consequently, we introduce SLlama, a Llama3 architecture variant that incorporates targeted modifications—Repeated Reduced Hidden Size and Projection (RRHP), Permutated Weight Attention (PWA), Shared Projection Multi-Layer Perceptron (SPMLP), and Layer Weight Sharing—to compress Transformer components. Without relying on distillation, SLlama achieves a 31.72% improvement in linguistic knowledge acquisition over the BabyLlama baseline, with a comparable GLUE score and significantly lower parameter count. These results demonstrate that well-designed, compact models can rival larger ones under strict data constraints.
pdf
bib
abs
What You See is What You Ask: Evaluating Audio Descriptions
Divy Kala
|
Eshika Khandelwal
|
Makarand Tapaswi
Audio descriptions (ADs) narrate important visual details in movies, enabling Blind and Low Vision (BLV) users to understand narratives and appreciate visual details. Existing works in automatic AD generation mostly focus on few-second trimmed clips, and evaluate them by comparing against a single ground-truth reference AD. However, writing ADs is inherently subjective. Through alignment and analysis of two independent AD tracks for the same movies, we quantify the subjectivity in when and whether to describe, and what and how to highlight. Thus, we show that working with trimmed clips is inadequate. We propose ADQA, a QA benchmark that evaluates ADs at the level of few-minute long, coherent video segments, testing whether they would help BLV users understand the story and appreciate visual details. ADQA features visual appreciation (VA) questions about visual facts and narrative understanding (NU) questions based on the plot. Through ADQA, we show that current AD generation methods lag far behind human-authored ADs. We conclude with several recommendations for future work and introduce a public leaderboard for benchmarking.
pdf
bib
abs
TAPS: Tool-Augmented Personalisation via Structured Tagging
Ekaterina Taktasheva
|
Jeff Dalton
Recent advancements in tool-augmented large language models have enabled them to interact with external tools, enhancing their ability to perform complex user tasks. However, existing approaches overlook the role of personalisation in guiding tool use. This work investigates how user preferences can be effectively integrated into goal-oriented dialogue agents. Through extensive analysis, we identify key weaknesses in the ability of LLMs to personalise tool use. To this end, we introduce TAPS, a novel solution that enhances personalised tool use by leveraging a structured tagging tool and an uncertainty-based tool detector. TAPS significantly improves the ability of LLMs to incorporate user preferences, achieving the new state-of-the-art for open source models on the NLSI task.
pdf
bib
abs
Investigating How Pre-training Data Leakage Affects Models’ Reproduction and Detection Capabilities
Masahiro Kaneko
|
Timothy Baldwin
Large Language Models (LLMs) are trained on massive web-crawled corpora, often containing personal information, copyrighted text, and benchmark datasets. This inadvertent inclusion in the training dataset, known as data leakage, poses significant risks and could compromise the safety of LLM outputs. Despite its criticality, existing studies do not examine how leaked instances in the pre-training data influence LLMs’ output and detection capabilities. In this paper, we conduct an experimental survey to elucidate the relationship between data leakage in training datasets and its effects on the generation and detection by LLMs. Our experiments reveal that LLMs often generate outputs containing leaked information, even when there is little such data in the training dataset. Moreover, the fewer the leaked instances, the more difficult it becomes to detect such leakage. Finally, we demonstrate that enhancing leakage detection through few-shot learning can help mitigate the impact of the leakage rate in the training data on detection performance.
pdf
bib
abs
Walk and Read Less: Improving the Efficiency of Vision-and-Language Navigation via Tuning-Free Multimodal Token Pruning
Wenda Qin
|
Andrea Burns
|
Bryan A. Plummer
|
Margrit Betke
Large models achieve strong performance on Vision-and-Language Navigation (VLN) tasks, but are costly to run in resource-limited environments. Token pruning offers appealing tradeoffs for efficiency with minimal performance loss by reducing model input size, but prior work overlooks VLN-specific challenges. For example, information loss from pruning can effectively increase computational cost due to longer walks. Thus, the inability to identify uninformative tokens undermines the supposed efficiency gains from pruning. To address this, we propose Navigation-Aware Pruning (NAP), which uses navigation-specific traits to simplify the pruning process by pre-filtering tokens into foreground and background. For example, image views are filtered based on whether the agent can navigate in that direction. We also extract navigation-relevant instructions using a Large Language Model. After filtering, we focus pruning on background tokens, minimizing information loss. To further help avoid increases in navigation length, we discourage backtracking by removing low-importance navigation nodes. Experiments on standard VLN benchmarks show NAP significantly outperforms prior work, preserving higher success rates while saving more than 50% FLOPS.
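A minimal sketch of the pre-filter-then-prune pattern described above, under illustrative assumptions: tokens flagged as foreground (e.g., navigable view directions or navigation-relevant instruction words) are always kept, and only background tokens are pruned by an importance score. Function and variable names are hypothetical; this shows the general pattern rather than NAP's implementation.

```python
import torch

def prune_background_tokens(tokens: torch.Tensor,
                            importance: torch.Tensor,
                            is_foreground: torch.Tensor,
                            keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep all foreground tokens; keep only the top-k most important background tokens.

    tokens:        (n, d) token embeddings
    importance:    (n,) per-token importance scores
    is_foreground: (n,) boolean mask (e.g. navigable views, relevant instruction words)
    """
    fg_idx = torch.nonzero(is_foreground, as_tuple=True)[0]
    bg_idx = torch.nonzero(~is_foreground, as_tuple=True)[0]
    k = max(1, int(keep_ratio * bg_idx.numel()))
    top_bg = bg_idx[importance[bg_idx].topk(k).indices]
    kept = torch.cat([fg_idx, top_bg]).sort().values     # preserve original token order
    return tokens[kept]

tokens = torch.randn(8, 16)
importance = torch.rand(8)
is_foreground = torch.tensor([1, 0, 0, 1, 0, 0, 0, 1], dtype=torch.bool)
print(prune_background_tokens(tokens, importance, is_foreground).shape)  # (5, 16) here
```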
pdf
bib
abs
Connecting the Knowledge Dots: Retrieval-augmented Knowledge Connection for Commonsense Reasoning
Junho Kim
|
Soyeon Bak
|
Mingyu Lee
|
Minju Hong
|
Songha Kim
|
Tae-Eui Kam
|
SangKeun Lee
While large language models (LLMs) have achieved remarkable performance across various natural language processing (NLP) tasks, LLMs exhibit a limited understanding of commonsense reasoning due to the necessity of implicit knowledge that is rarely expressed in text. Recently, retrieval-augmented language models (RALMs) have enhanced their commonsense reasoning ability by incorporating background knowledge from external corpora. However, previous RALMs overlook the implicit nature of commonsense knowledge, potentially resulting in the retrieved documents not directly containing information needed to answer questions. In this paper, we propose Retrieval-augmented knowledge Connection, ReConnect, which transforms indirectly relevant documents into a direct explanation to answer the given question. To this end, we extract relevant knowledge from various retrieved document subsets and aggregate them into a direct explanation. Experimental results show that ReConnect outperforms state-of-the-art (SOTA) baselines, achieving improvements of +2.0% and +4.6% average accuracy on in-domain (ID) and out-of-domain (OOD) benchmarks, respectively.
pdf
bib
abs
Agent-as-Judge for Factual Summarization of Long Narratives
Yeonseok Jeong
|
Minsoo Kim
|
Seung-won Hwang
|
Byung-Hak Kim
Large Language Models (LLMs) have demonstrated near-human performance in summarization tasks based on traditional metrics such as ROUGE and BERTScore. However, these metrics do not adequately capture critical aspects of summarization quality, such as factual accuracy, particularly for long narratives (>100K tokens). Recent advances, such as LLM-as-a-Judge, address the limitations of metrics based on lexical similarity but still exhibit factual inconsistencies, especially in understanding character relationships and states. In this work, we introduce NarrativeFactScore (NFS), the first “Agent-as-a-Judge” framework that evaluates and refines factuality in narrative summarization. By leveraging a Character Knowledge Graph (CKG) extracted from the input narrative, NarrativeFactScore evaluates factuality and provides actionable guidance for refinement, such as identifying missing or erroneous facts. Our experimental results demonstrate that constructing the CKG enables reasoning with one third of the factuality computation used in the prior approach, and achieves a three times higher correlation with human judgments. Furthermore, refinement with actionable guidance improves the quality of the summary.
pdf
bib
abs
DnDScore: Decontextualization and Decomposition for Factuality Verification in Long-Form Text Generation
Miriam Wanner
|
Benjamin Van Durme
|
Mark Dredze
The decompose-then-verify strategy for verification of Large Language Model (LLM) generations decomposes claims that are then independently verified. Decontextualization augments text (claims) to ensure it can be verified outside of the original context, enabling reliable verification. While decomposition and decontextualization have been explored independently, their interactions in a complete system have not been investigated. Their conflicting purposes can create tensions: decomposition isolates atomic facts while decontextualization inserts relevant information. Furthermore, a decontextualized subclaim presents a challenge to the verification step: what part of the augmented text should be verified, given that it now contains multiple atomic facts? We conduct an evaluation of different decomposition, decontextualization, and verification strategies and find that the choice of strategy matters in the resulting factuality scores. Additionally, we introduce DnDScore, a decontextualization-aware verification method that validates subclaims in the context of contextual information.
pdf
bib
abs
RAcQUEt: Unveiling the Dangers of Overlooked Referential Ambiguity in Visual LLMs
Alberto Testoni
|
Barbara Plank
|
Raquel Fernández
Ambiguity resolution is key to effective communication. While humans effortlessly address ambiguity through conversational grounding strategies, the extent to which current language models can emulate these strategies remains unclear. In this work, we examine referential ambiguity in image-based question answering by introducing RAcQUEt, a carefully curated dataset targeting distinct aspects of ambiguity. Through a series of evaluations, we reveal significant limitations and problems of overconfidence of state-of-the-art large multimodal language models in addressing ambiguity in their responses. The overconfidence issue becomes particularly relevant for RAcQUEt-BIAS, a subset designed to analyze a critical yet underexplored problem: failing to address ambiguity leads to stereotypical, socially biased responses. Our results underscore the urgency of equipping models with robust strategies to deal with uncertainty without resorting to undesirable stereotypes.
pdf
bib
abs
Resource-Rational Noisy-Channel Language Processing: Testing the Effect of Algorithmic Constraints on Inferences
Thomas Hikaru Clark
|
Jacob Hoover Vigly
|
Edward Gibson
|
Roger P. Levy
Human language use is robust to errors: comprehenders can and do mentally correct utterances that are implausible or anomalous. How are humans able to solve these problems in real time, picking out alternatives from an unbounded space of options using limited cognitive resources? And can language models trained on next-word prediction for typical language be augmented to handle language anomalies in a human-like way? Using a language model as a prior and an error model to encode likelihoods, we use Sequential Monte Carlo with optional rejuvenation to perform incremental and approximate probabilistic inference over intended sentences and production errors. We demonstrate that the model captures previously established patterns in human sentence processing, and that a trade-off between human-like noisy-channel inferences and computational resources falls out of this model. From a psycholinguistic perspective, our results offer a candidate algorithmic model of rational inference in language processing. From an NLP perspective, our results showcase how to elicit human-like noisy-channel inference behavior from a relatively small LLM while controlling the amount of computation available during inference. Our model is implemented in the Gen.jl probabilistic programming language, and our code is available at
https://github.com/thomashikaru/noisy_channel_model.
pdf
bib
abs
In Benchmarks We Trust ... Or Not?
Ine Gevers
|
Victor De Marez
|
Jens Van Nooten
|
Jens Lemmens
|
Andriy Kosar
|
Ehsan Lotfi
|
Nikolay Banar
|
Pieter Fivez
|
Luna De Bruyne
|
Walter Daelemans
Standardized benchmarks are central to evaluating and comparing model performance in Natural Language Processing (NLP). However, Large Language Models (LLMs) have exposed shortcomings in existing benchmarks, and so far there is no clear solution. In this paper, we survey a wide scope of benchmarking issues, and provide an overview of solutions as they are suggested in the literature. We observe that these solutions often tackle a limited number of issues, neglecting other facets. Therefore, we propose concrete checklists to cover all aspects of benchmarking issues, both for benchmark creation and usage. We illustrate the use of our checklists by applying them to three popular NLP benchmarks (i.e., SuperGLUE, WinoGrande, and ARC-AGI). Additionally, we discuss the potential advantages of adding minimal-sized test-suites to benchmarking, which would ensure downstream applicability on real-world use cases.
pdf
bib
abs
Video2Roleplay: A Multimodal Dataset and Framework for Video-Guided Role-playing Agents
Xueqiao Zhang
|
Chao Zhang
|
Jingtao Xu
|
Yifan Zhu
|
Xin Shi
|
Yi Yang
|
Yawei Luo
Role-playing agents (RPAs) have attracted growing interest for their ability to simulate immersive and interactive characters. However, existing approaches primarily focus on static role profiles, overlooking the dynamic perceptual abilities inherent to humans. To bridge this gap, we introduce the concept of dynamic role profiles by incorporating the video modality into RPAs. To support this, we construct Role-playing-Video60k, a large-scale, high-quality dataset comprising 60k videos and 700k corresponding dialogues. Based on this dataset, we develop a comprehensive RPA framework that combines adaptive temporal sampling with both dynamic and static role profile representations. Specifically, the dynamic profile is created by adaptively sampling video frames and feeding them to the LLM in temporal order, while the static profile consists of (1) character dialogues from training videos during fine-tuning, and (2) a summary context from the input video during inference. This joint integration enables RPAs to generate better responses. Furthermore, we propose a robust evaluation method covering eight metrics. Experimental results demonstrate the effectiveness of our framework, highlighting the importance of dynamic role profiles in developing RPAs.
pdf
bib
abs
Discriminating Form and Meaning in Multilingual Models with Minimal-Pair ABX Tasks
Maureen de Seyssel
|
Jie Chi
|
Skyler Seto
|
Maartje Ter Hoeve
|
Masha Fedzechkina
|
Natalie Schluter
We introduce a set of training-free ABX-style discrimination tasks to evaluate how multilingual language models represent language identity (form) and semantic content (meaning). Inspired by speech processing, these zero-shot tasks measure whether minimal differences in representation can be reliably detected. This offers a flexible and interpretable alternative to probing. Applied to XLM-R (Conneau et al., 2020) across pretraining checkpoints and layers, we find that language discrimination declines over training and becomes concentrated in lower layers, while meaning discrimination strengthens over time and stabilizes in deeper layers. We then explore probing tasks, showing some alignment between our metrics and linguistic learning performance. Our results position ABX tasks as a lightweight framework for analyzing the structure of multilingual representations.
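To make the ABX setup concrete, here is a minimal sketch of an ABX-style discrimination score over precomputed embeddings, assuming cosine distance and toy vectors; it illustrates the task format only, not the paper's exact protocol.

import numpy as np

def cosine_dist(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def abx_accuracy(a_set, b_set, x_set):
    # fraction of (A, B, X) triples where X (same category as A) is closer to A than to B
    correct, total = 0, 0
    for a in a_set:
        for b in b_set:
            for x in x_set:
                correct += cosine_dist(a, x) < cosine_dist(b, x)
                total += 1
    return correct / total

# toy example: two embedding clusters standing in for two languages (or two meanings)
rng = np.random.default_rng(0)
lang_a = rng.normal(0.0, 1.0, size=(5, 16))
lang_b = rng.normal(3.0, 1.0, size=(5, 16))
print(abx_accuracy(lang_a, lang_b, lang_a))  # near 1.0 when the two categories are separable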
pdf
bib
abs
Rethinking Text-based Protein Understanding: Retrieval or LLM?
Juntong Wu
|
Zijing Liu
|
He Cao
|
Li Hao
|
Bin Feng
|
Zishan Shu
|
Ke Yu
|
Li Yuan
|
Yu Li
In recent years, protein-text models have gained significant attention for their potential in protein generation and understanding. Current approaches focus on integrating protein-related knowledge into large language models through continued pretraining and multi-modal alignment, enabling simultaneous comprehension of textual descriptions and protein sequences. Through a thorough analysis of existing model architectures and text-based protein understanding benchmarks, we identify significant data leakage issues present in current benchmarks. Moreover, conventional metrics derived from natural language processing fail to assess the model’s performance in this domain accurately. To address these limitations, we reorganize existing datasets and introduce a novel evaluation framework based on biological entities. Motivated by our observation, we propose a retrieval-enhanced method, which significantly outperforms fine-tuned LLMs for protein-to-text generation and shows accuracy and efficiency in training-free scenarios. Our code and data will be available.
pdf
bib
abs
Grounded Semantic Role Labelling from Synthetic Multimodal Data for Situated Robot Commands
Claudiu Daniel Hromei
|
Antonio Scaiella
|
Danilo Croce
|
Roberto Basili
Understanding natural language commands in situated Human-Robot Interaction (HRI) requires linking linguistic input to perceptual context. Traditional symbolic parsers lack the flexibility to operate in complex, dynamic environments. We introduce a novel Multimodal Grounded Semantic Role Labelling (G-SRL) framework that combines frame semantics with perceptual grounding, enabling robots to interpret commands via multimodal logical forms. Our approach leverages modern Visual Language Models (VLLMs), which jointly process text and images, and is supported by an automated pipeline that generates high-quality training data. Structured command annotations are converted into photorealistic scenes via LLM-guided prompt engineering and diffusion models, then rigorously validated through object detection and visual question answering. The pipeline produces over 11,000 image-command pairs (3,500+ manually validated), approaching the quality of manually curated datasets at a significantly lower cost.
pdf
bib
abs
Easy as PIE? Identifying Multi-Word Expressions with LLMs
Kai Golan Hashiloni
|
Ofri Hefetz
|
Kfir Bar
We investigate the identification of idiomatic expressions—a semantically non-compositional subclass of multiword expressions (MWEs)—in running text using large language models (LLMs) without any fine-tuning. Instead, we adopt a prompt-based approach and evaluate a range of prompting strategies, including zero-shot, few-shot, and chain-of-thought variants, across multiple languages, datasets, and model types. Our experiments show that, with well-crafted prompts, LLMs can perform competitively with supervised models trained on annotated data. These findings highlight the potential of prompt-based LLMs as a flexible and effective alternative for idiomatic expression identification.
pdf
bib
abs
Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking
Wuwei Zhang
|
Fangcong Yin
|
Howard Yen
|
Danqi Chen
|
Xi Ye
Recent work has identified retrieval heads (Wu et al., 2025), a subset of attention heads responsible for retrieving salient information in long-context language models (LMs), as measured by their copy-paste behavior in Needle-in-a-Haystack tasks. In this paper, we introduce QRHead (Query-Focused Retrieval Head), an improved set of attention heads that enhance retrieval from long context. We identify QRHead by aggregating attention scores with respect to the input query, using a handful of examples from real-world tasks (e.g., long-context QA). We further introduce QRRetriever, an efficient and effective retriever that uses the accumulated attention mass of QRHead as retrieval scores. We use QRRetriever for long-context reasoning by selecting the most relevant parts with the highest retrieval scores. On multi-hop reasoning tasks LongMemEval and CLIPPER, this yields over 10% performance gains over full context and outperforms strong dense retrievers. We also evaluate QRRetriever as a re-ranker on the BEIR benchmark and find that it achieves strong zero-shot performance, outperforming other LLM-based re-rankers such as RankGPT. Further analysis shows that both the query-context attention scoring and task selection are crucial for identifying QRHead with strong downstream utility. Overall, our work contributes a general-purpose retriever and offers interpretability insights into the long-context capabilities of LMs.
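As a rough illustration of scoring passages by accumulated query-to-passage attention mass (the idea behind QRRetriever), consider the following sketch; the attention tensor layout, the chosen heads, and the span boundaries are assumptions for the example, not the paper's head-identification procedure.

import torch

def attention_mass_scores(attn, query_pos, passage_spans, heads):
    # attn: [num_heads, seq_len, seq_len] attention weights from one layer.
    # Sum the attention flowing from query token positions onto each candidate passage span,
    # accumulated over a chosen set of retrieval heads.
    scores = []
    for start, end in passage_spans:
        mass = attn[heads][:, query_pos, start:end].sum().item()
        scores.append(mass)
    return scores

# toy example: 2 heads, a 10-token sequence, query tokens at positions 8-9, two passages
attn = torch.rand(2, 10, 10)
attn = attn / attn.sum(-1, keepdim=True)  # normalise rows like softmax attention
print(attention_mass_scores(attn, query_pos=slice(8, 10),
                            passage_spans=[(0, 4), (4, 8)], heads=[0, 1]))
# rank passages by descending attention mass to select the most relevant context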
pdf
bib
abs
Robust Adaptation of Large Multimodal Models for Retrieval Augmented Hateful Meme Detection
Jingbiao Mei
|
Jinghong Chen
|
Guangyu Yang
|
Weizhe Lin
|
Bill Byrne
Hateful memes have become a significant concern on the Internet, necessitating robust automated detection systems. While Large Multimodal Models (LMMs) have shown promise in hateful meme detection, they face notable challenges like sub-optimal performance and limited out-of-domain generalization capabilities. Recent studies further reveal the limitations of both supervised fine-tuning (SFT) and in-context learning when applied to LMMs in this setting. To address these issues, we propose a robust adaptation framework for hateful meme detection that enhances in-domain accuracy and cross-domain generalization while preserving the general vision-language capabilities of LMMs. Analysis reveals that our approach achieves improved robustness under adversarial attacks compared to SFT models. Experiments on six meme classification datasets show that our approach achieves state-of-the-art performance, outperforming larger agentic systems. Moreover, our method generates higher-quality rationales for explaining hateful content compared to standard SFT, enhancing model interpretability. Code available at https://github.com/JingbiaoMei/RGCL
pdf
bib
abs
Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models
Xie Zhifei
|
Mingbao Lin
|
Zihang Liu
|
Pengcheng Wu
|
Shuicheng Yan
|
Chunyan Miao
Recent advancements in multimodal reasoning overlook the audio modality. We introduce Audio-Reasoner, a large-scale audio language model for deep reasoning. We meticulously curated a large-scale and diverse multi-task audio dataset with simple annotations. Then, we leverage closed-source models to conduct secondary labeling and QA generation, along with a structured CoT process. These datasets together form a high-quality reasoning dataset with 1.2 million reasoning-rich samples, which we name CoTA. Following inference scaling principles, we train Audio-Reasoner on CoTA, enabling it to achieve strong logical capabilities in audio reasoning. Experiments show state-of-the-art performance across key benchmarks, including MMAU-mini (+25.42%), AIR-Bench chat/foundation (+14.57%/+10.13%), and MELD (+8.01%). Our findings underscore the central role of structured CoT training in advancing audio reasoning. The model, dataset, and code are open-sourced at https://github.com/xzf-thu/Audio-Reasoner and https://huggingface.co/datasets/zhifeixie/Audio-Reasoner-CoTA.
pdf
bib
abs
From perception to production: how acoustic invariance facilitates articulatory learning in a self-supervised vocal imitation model
Marvin Lavechin
|
Thomas Hueber
Human infants face a formidable challenge in speech acquisition: mapping extremely variable acoustic inputs into appropriate articulatory movements without explicit instruction. We present a computational model that addresses the acoustic-to-articulatory mapping problem through self-supervised learning. Our model comprises a feature extractor that transforms speech into latent representations, an inverse model that maps these representations to articulatory parameters, and a synthesizer that generates speech outputs. Experiments conducted in both single- and multi-speaker settings reveal that intermediate layers of a pre-trained wav2vec 2.0 model provide optimal representations for articulatory learning, significantly outperforming MFCC features. These representations enable our model to learn articulatory trajectories that correlate with human patterns, discriminate between places of articulation, and produce intelligible speech. Critical to successful articulatory learning are representations that balance phonetic discriminability with speaker invariance – precisely the characteristics of self-supervised representation learning models. Our findings provide computational evidence consistent with developmental theories proposing that perceptual learning of phonetic categories guides articulatory development, offering insights into how infants might acquire speech production capabilities despite the complex mapping problem they face.
pdf
bib
abs
REALM: Recursive Relevance Modeling for LLM-based Document Re-Ranking
Pinhuan Wang
|
Zhiqiu Xia
|
Chunhua Liao
|
Feiyi Wang
|
Hang Liu
Large Language Models (LLMs) have shown strong capabilities in document re-ranking, a key component in modern Information Retrieval (IR) systems. However, existing LLM-based approaches face notable limitations, including ranking uncertainty, unstable top-k recovery, and high token cost due to token-intensive prompting. To effectively address these limitations, we propose REALM, an uncertainty-aware re-ranking framework that models LLM-derived relevance as Gaussian distributions and refines them through recursive Bayesian updates. By explicitly capturing uncertainty and minimizing redundant queries, REALM achieves better rankings more efficiently. Experimental results demonstrate that REALM surpasses state-of-the-art re-rankers while significantly reducing token usage and latency, improving NDCG@10 by 0.7-11.9 and simultaneously reducing the number of LLM inferences by 23.4-84.4%, positioning it as a next-generation re-ranker for modern IR systems.
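A minimal sketch of the kind of recursive Gaussian refinement the abstract describes: a passage's relevance is kept as a mean and variance, and each new noisy LLM relevance judgment narrows the posterior. The conjugate-Gaussian update rule and the observation variance below are standard textbook formulas used as illustrative assumptions, not REALM's exact procedure.

def gaussian_update(mu, var, obs, obs_var):
    # Conjugate update of a Gaussian belief N(mu, var) given a noisy
    # observation obs with known observation variance obs_var.
    post_var = 1.0 / (1.0 / var + 1.0 / obs_var)
    post_mu = post_var * (mu / var + obs / obs_var)
    return post_mu, post_var

# belief about one passage's relevance, refined by two LLM judgments
mu, var = 0.0, 1.0            # uninformative prior
for llm_score in (0.7, 0.9):  # hypothetical normalised relevance judgments
    mu, var = gaussian_update(mu, var, llm_score, obs_var=0.25)
    print(round(mu, 3), round(var, 3))  # posterior mean rises, variance shrinks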
pdf
bib
abs
PLLuM-Align: Polish Preference Dataset for Large Language Model Alignment
Karolina Seweryn
|
Anna Kołos
|
Agnieszka Karlińska
|
Katarzyna Lorenc
|
Katarzyna Dziewulska
|
Maciej Chrabaszcz
|
Aleksandra Krasnodebska
|
Paula Betscher
|
Zofia Cieślińska
|
Katarzyna Kowol
|
Julia Moska
|
Dawid Motyka
|
Paweł Walkowiak
|
Bartosz Żuk
|
Arkadiusz Janz
Alignment is the critical process of minimizing harmful outputs by teaching large language models (LLMs) to prefer safe, helpful and appropriate responses. While the majority of alignment research and datasets remain overwhelmingly English-centric, ensuring safety across diverse linguistic and cultural contexts requires localized resources. In this paper, we introduce the first Polish preference dataset PLLuM-Align, created entirely through human annotation to reflect Polish language and cultural nuances. The dataset includes response rating, ranking, and multi-turn dialog data. Designed to reflect the linguistic subtleties and cultural norms of Polish, this resource lays the groundwork for more aligned Polish LLMs and contributes to the broader goal of multilingual alignment in underrepresented languages.
pdf
bib
abs
Graph-R1: Incentivizing the Zero-Shot Graph Learning Capability in LLMs via Explicit Reasoning
Yicong Wu
|
Guangyue Lu
|
Yuan Zuo
|
Huarong Zhang
|
Junjie Wu
Generalizing to unseen graph tasks without task-specific supervision remains challenging. Graph Neural Networks (GNNs) are limited by fixed label spaces, while Large Language Models (LLMs) lack structural inductive biases. Recent advances in Large Reasoning Models (LRMs) provide a zero-shot alternative via explicit, long chain-of-thought reasoning. Inspired by this, we propose a GNN-free approach that reformulates graph tasks—node classification, link prediction, and graph classification—as textual reasoning problems solved by LRMs. We introduce the first datasets with detailed reasoning traces for these tasks and develop Graph-R1, a reinforcement learning framework that leverages task-specific rethink templates to guide reasoning over linearized graphs. Experiments demonstrate that Graph-R1 outperforms state-of-the-art baselines in zero-shot settings, producing interpretable and effective predictions. Our work highlights the promise of explicit reasoning for graph learning and provides new resources for future research. Codes are available at https://github.com/lgybuaa/Graph-R1.
pdf
bib
abs
Scalable and Culturally Specific Stereotype Dataset Construction via Human-LLM Collaboration
Weicheng Ma
|
John J. Guerrerio
|
Soroush Vosoughi
Research on stereotypes in large language models (LLMs) has largely focused on English-speaking contexts, due to the lack of datasets in other languages and the high cost of manual annotation in underrepresented cultures. To address this gap, we introduce a cost-efficient human-LLM collaborative annotation framework and apply it to construct EspanStereo, a Spanish-language stereotype dataset spanning multiple Spanish-speaking countries across Europe and Latin America. EspanStereo captures both well-documented stereotypes from prior literature and culturally specific biases absent from English-centric resources. Using LLMs to generate candidate stereotypes and in-culture annotators to validate them, we demonstrate the framework’s effectiveness in identifying nuanced, region-specific biases. Our evaluation of Spanish-supporting LLMs using EspanStereo reveals significant variation in stereotypical behavior across countries, highlighting the need for more culturally grounded assessments. Beyond Spanish, our framework is adaptable to other languages and regions, offering a scalable path toward multilingual stereotype benchmarks. This work broadens the scope of stereotype analysis in LLMs and lays the groundwork for comprehensive cross-cultural bias evaluation.
pdf
bib
abs
Can Large Language Models Be Good Language Teachers?
LiQing Xu
|
Qiwei Li
|
Tianshuo Peng
|
Zuchao Li
|
Hai Zhao
|
Ping Wang
Large language models (LLMs) have achieved remarkable success across diverse domains. However, their potential as effective language teachers—particularly in complex pedagogical scenarios like teaching Chinese as a second language—remains inadequately assessed. To address this gap, we propose the first pedagogical competence benchmark for LLMs, rigorously evaluating their performance against international standards for Chinese language teachers. Our framework spans three core dimensions: (1) basic knowledge evaluation, covering 32 subtopics across five major categories; (2) international teacher examination, based on data collected from international Chinese teacher certification exams; and (3) teaching practice evaluation, where target LLMs summarize knowledge points and design instructional content for student models, followed by testing the student models to assess the LLM’s ability to distill and teach key concepts. We conduct a comprehensive evaluation of 13 of the latest multilingual and Chinese LLMs. While most models demonstrate promising pedagogical potential, there remains substantial room for improvement in their teaching capabilities. This study contributes to the development of AI-assisted language education tools capable of rivaling human teaching excellence. The benchmark dataset and evaluation scripts used in this study are publicly available at https://github.com/Line-Kite/CLTE.
pdf
bib
abs
Empowering Math Problem Generation and Reasoning for Large Language Model via Synthetic Data based Continual Learning Framework
Qian Wan
|
Wangzi Shi
|
Jintian Feng
|
Shengyingjie Liu
|
Luona Wei
|
Zhicheng Dai
|
Jianwen Sun
Existing large language model (LLM) learning frameworks for math problem generation (MPG) mostly perform homogeneous training across epochs on small-scale, manually annotated data. This pattern struggles to provide large-scale new quality data to support continual improvement, and fails to stimulate mutual reinforcement between math problem generation and reasoning abilities, resulting in the lack of a reliable solving process. This paper proposes a synthetic data based continual learning framework to improve LLMs’ ability for MPG and math reasoning. The framework cycles through three stages, “supervised fine-tuning, data synthesis, direct preference optimization”, to continuously and steadily improve performance. We propose a synthetic data method with a dual mechanism of model self-play and multi-agent cooperation, which ensures the consistency and validity of synthetic data through sample filtering and rewriting strategies, and overcomes the dependence of continual learning on manually annotated data. A data replay strategy that assesses sample importance via loss differentials is designed to mitigate catastrophic forgetting. Experimental analysis on abundant authoritative math datasets demonstrates the superiority and effectiveness of our framework.
pdf
bib
abs
Tokenization and Representation Biases in Multilingual Models on Dialectal NLP Tasks
Vani Kanjirangat
|
Tanja Samardzic
|
Ljiljana Dolamic
|
Fabio Rinaldi
Dialectal data are characterized by linguistic variation that appears small to humans but has a significant impact on the performance of models. This dialect gap has been related to various factors (e.g., data size, economic and social factors) whose impact, however, turns out to be inconsistent. In this work, we investigate factors impacting the model performance more directly: we correlate Tokenization Parity (TP) and Information Parity (IP), as measures of representational biases in pre-trained multilingual models, with the downstream performance. We compare state-of-the-art decoder-only LLMs with encoder-based models across three tasks: dialect classification, topic classification, and extractive question answering, controlling for varying scripts (Latin vs. non-Latin) and resource availability (high vs. low). Our analysis reveals that TP is a better predictor of the performance on tasks reliant on syntactic and morphological cues (e.g., extractive QA), while IP better predicts performance in semantic tasks (e.g., topic classification). Complementary analyses, including tokenizer behavior, vocabulary coverage, and qualitative insights, reveal that the language support claims of LLMs often might mask deeper mismatches at the script or token level.
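One common way to quantify tokenization parity is the ratio of subword counts a shared tokenizer produces for parallel sentences in two varieties; the sketch below uses that definition as an assumption (the paper's exact TP and IP formulations may differ), with xlm-roberta-base as an illustrative tokenizer and made-up parallel sentences.

from transformers import AutoTokenizer

def tokenization_parity(tokenizer, sents_standard, sents_dialect):
    # Ratio of total token counts for parallel standard/dialect sentences;
    # values above 1 mean the dialect is split into more subword tokens.
    n_std = sum(len(tokenizer.tokenize(s)) for s in sents_standard)
    n_dia = sum(len(tokenizer.tokenize(s)) for s in sents_dialect)
    return n_dia / n_std

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
print(tokenization_parity(tok,
                          ["Where are you going?"],    # standard variety
                          ["Whaur ur ye gaun?"]))      # dialectal counterpart (toy example)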
pdf
bib
abs
Evaluating the Evaluators: Are readability metrics good measures of readability?
Isabel Cachola
|
Daniel Khashabi
|
Mark Dredze
Plain Language Summarization (PLS) aims to distill complex documents into accessible summaries for non-expert audiences. In this paper, we conduct a thorough survey of PLS literature, and identify that the current standard practice for readability evaluation is to use traditional readability metrics, such as Flesch-Kincaid Grade Level (FKGL). However, despite proven utility in other fields, these metrics have not been compared to human readability judgments in PLS. We evaluate 8 readability metrics and show that most correlate poorly with human judgments, including the most popular metric, FKGL. We then show that Language Models (LMs) are better judges of readability, with the best-performing model achieving a Pearson correlation of 0.56 with human judgments. Extending our analysis to PLS datasets, which contain summaries aimed at non-expert audiences, we find that LMs better capture deeper measures of readability, such as required background knowledge, and lead to different conclusions than the traditional metrics. Based on these findings, we offer recommendations for best practices in the evaluation of plain language summaries.
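For reference, Flesch-Kincaid Grade Level is computed as 0.39 * (words per sentence) + 11.8 * (syllables per word) - 15.59. The sketch below implements it with a crude syllable heuristic and correlates it with hypothetical human ratings, purely to illustrate the metric-versus-judgment comparison the paper performs; the example texts and scores are invented.

import re
from scipy.stats import pearsonr

def count_syllables(word):
    # crude vowel-group heuristic; studies typically use a dictionary or a library such as textstat
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59

summaries = ["The cell is the basic unit of life.",
             "Cells use energy to grow and divide over time.",
             "Eukaryotic cells compartmentalize biochemical processes within membrane-bound organelles."]
human_readability = [1.0, 2.0, 4.5]  # hypothetical human judgments (higher = harder to read)
metric_scores = [fkgl(s) for s in summaries]
print(metric_scores, pearsonr(metric_scores, human_readability))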
pdf
bib
abs
Text Takes Over: A Study of Modality Bias in Multimodal Intent Detection
Ankan Mullick
|
Saransh Sharma
|
Abhik Jana
|
Pawan Goyal
The rise of multimodal data, integrating text, audio, and visuals, has created new opportunities for studying multimodal tasks such as intent detection. This work investigates the effectiveness of Large Language Models (LLMs) and non-LLMs, including text-only and multimodal models, in the multimodal intent detection task. Our study reveals that Mistral-7B, a text-only LLM, outperforms most competitive multimodal models by approximately 9% on MIntRec-1 and 4% on MIntRec2.0 dataset. This performance advantage comes from a strong textual bias in these datasets, where over 90% of the samples require textual input, either alone or in combination with other modalities, for correct classification. We confirm the modality bias of these datasets via human evaluation, too. Next, we propose a framework to debias the datasets, and upon debiasing, more than 70% of the samples in MIntRec-1 and more than 50% in MIntRec2.0 get removed, resulting in significant performance degradation across all models, with smaller multimodal fusion models being the most affected with an accuracy drop of over 50 - 60%. Further, we analyze the context-specific relevance of different modalities through empirical analysis. Our findings highlight the challenges posed by modality bias in multimodal intent datasets and emphasize the need for unbiased datasets to evaluate multimodal models effectively. We release both the code and the dataset used for this work at https://github.com/Text-Takes-Over-EMNLP-2025/MultiModal-Intent-EMNLP-2025.
pdf
bib
abs
What’s in a prompt? Language models encode literary style in prompt embeddings
Raphaël Sarfati
|
Haley Moller
|
Toni J.b. Liu
|
Nicolas Boulle
|
Christopher Earls
Large language models use high-dimensional latent spaces to encode and process textual information. Much work has investigated how the conceptual content of words translates into geometrical relationships between their vector representations. Fewer studies analyze how the cumulative information of an entire prompt becomes condensed into individual embeddings under the action of transformer layers. We use literary pieces to show that information about intangible, rather than factual, aspects of the prompt is contained in deep representations. We observe that short excerpts (10-100 tokens) from different novels separate in the latent space independently of which next-token prediction they converge towards. Ensembles of excerpts from books by the same author are much more entangled than those from different authors, suggesting that embeddings encode stylistic features. This geometry of style may have applications for authorship attribution and literary analysis, but most importantly reveals the sophistication of information processing and compression accomplished by language models.
pdf
bib
abs
Identifying and Answering Questions with False Assumptions: An Interpretable Approach
Zijie Wang
|
Eduardo Blanco
People often ask questions with false assumptions, a type of question that does not have regular answers. Answering such questions requires first identifying the false assumptions. Large Language Models (LLMs) often generate misleading answers to these questions because of hallucinations. In this paper, we focus on identifying and answering questions with false assumptions in several domains. We first investigate whether the problem reduces to fact verification. Then, we present an approach leveraging external evidence to mitigate hallucinations. Experiments with five LLMs demonstrate that (1) incorporating retrieved evidence is beneficial and (2) generating and validating atomic assumptions yields more improvements and provides an interpretable answer by pinpointing the false assumptions.
pdf
bib
abs
VisFinEval: A Scenario-Driven Chinese Multimodal Benchmark for Holistic Financial Understanding
Zhaowei Liu
|
Xin Guo
|
Haotian Xia
|
Lingfeng Zeng
|
Fangqi Lou
|
Jinyi Niu
|
Mengping Li
|
Qi Qi
|
Jiahuan Li
|
Wei Zhang
|
Yinglong Wang
|
Weige Cai
|
Weining Shen
|
Liwen Zhang
Multimodal large language models (MLLMs) hold great promise for automating complex financial analysis. To comprehensively evaluate their capabilities, we introduce VisFinEval, the first large-scale Chinese benchmark that spans the full front-middle-back office lifecycle of financial tasks. VisFinEval comprises 15,848 annotated question–answer pairs drawn from eight common financial image modalities (e.g., K-line charts, financial statements, official seals), organized into three hierarchical scenario depths: Financial Knowledge & Data Analysis, Financial Analysis & Decision Support, and Financial Risk Control & Asset Optimization. We evaluate 21 state-of-the-art MLLMs in a zero-shot setting. The top model, Qwen-VL-max, achieves an overall accuracy of 76.3%, outperforming non-expert humans but trailing financial experts by over 14 percentage points. Our error analysis uncovers six recurring failure modes—including cross-modal misalignment, hallucinations, and lapses in business-process reasoning—that highlight critical avenues for future research. VisFinEval aims to accelerate the development of robust, domain-tailored MLLMs capable of seamlessly integrating textual and visual financial information. The data and the code are available at https://github.com/SUFE-AIFLM-Lab/VisFinEval.
pdf
bib
abs
Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions
David Acuna
|
Ximing Lu
|
Jaehun Jung
|
Hyunwoo Kim
|
Amlan Kar
|
Sanja Fidler
|
Yejin Choi
Recent research in vision-language models (VLMs) has centered around the possibility of equipping them with implicit long-form chain-of-thought reasoning—akin to the success observed in language models—via distillation and reinforcement learning. But what about the non-reasoning models already trained and deployed across the internet? Should we simply abandon them, or is there hope for a search mechanism that can elicit hidden knowledge and induce long reasoning traces— without any additional training or supervision? In this paper, we explore this possibility using a Monte Carlo Tree Search (MCTS)-inspired algorithm, which injects subquestion–subanswer pairs into the model’s output stream. We show that framing reasoning as a search process—where subquestions act as latent decisions within a broader inference trajectory—helps the model “connect the dots” between fragmented knowledge and produce extended reasoning traces in non-reasoning models. We evaluate our method across three benchmarks and observe consistent improvements. Notably, our approach yields a 2% overall improvement on MMMU-PRO, including a significant 9% gain in Liberal Arts.
pdf
bib
abs
LLMs Don’t Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations
Harry Mayne
|
Ryan Othniel Kearns
|
Yushi Yang
|
Andrew M. Bean
|
Eoin D. Delaney
|
Chris Russell
|
Adam Mahdi
To collaborate effectively with humans, language models must be able to explain their decisions in natural language. We study a specific type of self-explanation: self-generated counterfactual explanations (SCEs), where a model explains its prediction by modifying the input such that it would have predicted a different outcome. We evaluate whether LLMs can produce SCEs that are valid, achieving the intended outcome, and minimal, modifying the input no more than necessary. When asked to generate counterfactuals, we find that LLMs typically produce SCEs that are valid, but far from minimal, offering little insight into their decision-making behaviour. Worryingly, when asked to generate minimal counterfactuals, LLMs typically make excessively small edits that fail to change predictions. The observed validity-minimality trade-off is consistent across several LLMs, datasets, and evaluation settings. Our findings suggest that SCEs are, at best, an ineffective explainability tool and, at worst, can provide misleading insights into model behaviour. Proposals to deploy LLMs in high-stakes settings must consider the impact of unreliable self-explanations on downstream decision-making. Our code is available at https://github.com/HarryMayne/SCEs.
pdf
bib
abs
Grounding Multilingual Multimodal LLMs With Cultural Knowledge
Jean De Dieu Nyandwi
|
Yueqi Song
|
Simran Khanuja
|
Graham Neubig
Multimodal Large Language Models excel in high-resource settings, but often misinterpret long-tail cultural entities and underperform in low-resource languages. To address this gap, we propose a data-centric approach that directly grounds MLLMs in cultural knowledge. Leveraging a large-scale knowledge graph from Wikidata, we collect images that represent culturally significant entities, and generate synthetic multilingual visual question answering data. The resulting dataset, CulturalGround, comprises 22 million high-quality, culturally-rich VQA pairs spanning 42 countries and 39 languages. We train an open-source MLLM, CulturalPangea, on CulturalGround, interleaving standard multilingual instruction-tuning data to preserve general abilities. CulturalPangea achieves state-of-the-art performance among open models on various culture-focused multilingual multimodal benchmarks, outperforming prior models by an average of +5.0% without degrading results on mainstream vision–language tasks. Our findings show that our targeted, culturally grounded approach could substantially narrow the cultural gap in MLLMs and offer a practical path towards globally inclusive multimodal systems.
pdf
bib
abs
Following Length Constraints in Instructions
Weizhe Yuan
|
Ilia Kulikov
|
Ping Yu
|
Kyunghyun Cho
|
Sainbayar Sukhbaatar
|
Jason E Weston
|
Jing Xu
Aligned instruction following models can better fulfill user requests than their unaligned counterparts. However, it has been shown that there is a length bias in the evaluation of such models, and that training algorithms tend to exploit this bias by learning longer responses. In this work we show how to train models that can be controlled at inference time with instructions containing desired length constraints. Such models are superior in length-instructed evaluations, outperforming standard instruction following models such as GPT-4, Llama 3, and Mixtral.
pdf
bib
abs
Memory-QA: Answering Recall Questions Based on Multimodal Memories
Hongda Jiang
|
Xinyuan Zhang
|
Siddhant Garg
|
Rishab Arora
|
Shiun-Zu Kuo
|
Jiayang Xu
|
Aaron Colak
|
Xin Luna Dong
We introduce Memory-QA, a novel real-world task that involves answering recall questions about visual content from previously stored multimodal memories. This task poses unique challenges, including the creation of task-oriented memories, the effective utilization of temporal and location information within memories, and the ability to draw upon multiple memories to answer a recall question. To address these challenges, we propose a comprehensive pipeline, Pensieve, integrating memory-specific augmentation, time- and location-aware multi-signal retrieval, and multi-memory QA fine-tuning. We created a multimodal benchmark to illustrate various real challenges in this task, and show the superior performance of Pensieve over state-of-the-art solutions (up to +14% on QA accuracy).
pdf
bib
abs
NEXUS: Network Exploration for eXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks
Javad Rafiei Asl
|
Sidhant Narula
|
Mohammad Ghasemigol
|
Eduardo Blanco
|
Daniel Takabi
Large Language Models (LLMs) have revolutionized natural language processing, yet remain vulnerable to jailbreak attacks—particularly multi-turn jailbreaks that distribute malicious intent across benign exchanges, thereby bypassing alignment mechanisms. Existing approaches often suffer from limited exploration of the adversarial space, rely on hand-crafted heuristics, or lack systematic query refinement. We propose NEXUS (Network Exploration for eXploiting Unsafe Sequences), a modular framework for constructing, refining, and executing optimized multi-turn attacks. NEXUS comprises: (1) ThoughtNet, which hierarchically expands a harmful intent into a structured semantic network of topics, entities, and query chains; (2) a feedback-driven Simulator that iteratively refines and prunes these chains through attacker–victim–judge LLM collaboration using harmfulness and semantic-similarity benchmarks; and (3) a Network Traverser that adaptively navigates the refined query space for real-time attacks. This pipeline systematically uncovers stealthy, high-success adversarial paths across LLMs. Our experimental results on several closed-source and open-source LLMs show that NEXUS achieves an attack success rate between 2.1% and 19.4% higher than that of state-of-the-art approaches. Our source code is available at https://github.com/inspire-lab/NEXUS.
pdf
bib
abs
Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching
Simon A. Aytes
|
Jinheon Baek
|
Sung Ju Hwang
Recent advances in large language models (LLMs) have enabled strong reasoning capabilities through Chain-of-Thought (CoT) prompting, which elicits step-by-step problem solving, but often at the cost of excessive verbosity in intermediate outputs, leading to increased computational overhead. We propose Sketch-of-Thought (SoT), a prompting framework that integrates cognitively inspired reasoning paradigms with linguistic constraints to reduce token usage while preserving reasoning accuracy. SoT is designed as a flexible, modular approach and is instantiated with three paradigms—Conceptual Chaining, Chunked Symbolism, and Expert Lexicons—each tailored to distinct reasoning tasks and selected dynamically at test-time by a lightweight routing model. Across 18 reasoning datasets spanning multiple domains, languages, and modalities, SoT achieves token reductions of up to 84% with minimal accuracy loss. In tasks such as mathematical and multi-hop reasoning, it even improves accuracy while shortening outputs.
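To illustrate the test-time routing idea, here is a toy sketch in which a deliberately simplistic keyword router picks one of three paradigm-specific prompt templates; the template wording and the routing rule are hypothetical stand-ins for SoT's lightweight learned routing model.

PARADIGM_PROMPTS = {
    "conceptual_chaining": "Answer by chaining the key concepts in short steps, e.g. A -> B -> C.",
    "chunked_symbolism": "Answer with compact symbolic steps, grouping numbers and variables.",
    "expert_lexicons": "Answer using terse domain shorthand an expert would use.",
}

def route_paradigm(question):
    # toy keyword router; SoT uses a lightweight learned routing model instead
    if any(ch.isdigit() for ch in question):
        return "chunked_symbolism"
    if any(w in question.lower() for w in ("diagnose", "dosage", "statute")):
        return "expert_lexicons"
    return "conceptual_chaining"

def build_prompt(question):
    paradigm = route_paradigm(question)
    return f"{PARADIGM_PROMPTS[paradigm]}\n\nQuestion: {question}\nSketch:"

print(build_prompt("If a train travels 120 km in 2 hours, what is its speed?"))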
pdf
bib
abs
From Language to Cognition: How LLMs Outgrow the Human Language Network
Badr AlKhamissi
|
Greta Tuckute
|
Yingtian Tang
|
Taha Osama A Binhuraib
|
Antoine Bosselut
|
Martin Schrimpf
Large language models (LLMs) exhibit remarkable similarity to neural activity in the human language network. However, the key properties of language underlying this alignment—and how brain-like representations emerge and change across training—remain unclear. We here benchmark 34 training checkpoints spanning 300B tokens across 8 different model sizes to analyze how brain alignment relates to linguistic competence. Specifically, we find that brain alignment tracks the development of formal linguistic competence—i.e., knowledge of linguistic rules—more closely than functional linguistic competence. While functional competence, which involves world knowledge and reasoning, continues to develop throughout training, its relationship with brain alignment is weaker, suggesting that the human language network primarily encodes formal linguistic structure rather than broader cognitive functions. Notably, we find that the correlation between next-word prediction, behavioral alignment, and brain alignment fades once models surpass human language proficiency. We further show that model size is not a reliable predictor of brain alignment when controlling for the number of features. Finally, using the largest set of rigorous neural language benchmarks to date, we show that language brain alignment benchmarks remain unsaturated, highlighting opportunities for improving future models. Taken together, our findings suggest that the human language network is best modeled by formal, rather than functional, aspects of language.
pdf
bib
abs
Logos as a Well-Tempered Pre-train for Sign Language Recognition
Ilya Ovodov
|
Petr Surovtsev
|
Karina Kvanchiani
|
Alexander Kapitanov
|
Alexander Nagaev
This paper examines two aspects of the isolated sign language recognition (ISLR) task. First, although a number of datasets are available, the data for individual sign languages is limited. This poses the challenge of cross-language ISLR model training, including transfer learning. Second, similar signs can have different semantic meanings. This leads to ambiguity in dataset labeling and raises the question of the best policy for annotating such signs. To address these issues, this study presents Logos, a novel Russian Sign Language (RSL) dataset, the most extensive available ISLR dataset by the number of signers, one of the most extensive datasets in size and vocabulary, and the largest RSL dataset. We show that a model pre-trained on the Logos dataset can be used as a universal encoder for SLR tasks in other languages, including few-shot learning. We explore cross-language transfer learning approaches and find that joint training using multiple classification heads benefits accuracy for the target low-resource datasets the most. The key feature of the Logos dataset is explicitly annotated groups of visually similar signs. We show that explicitly labeling visually similar signs improves the quality of the trained model as a visual encoder for downstream tasks. Based on the proposed contributions, we outperform current state-of-the-art results for the WLASL dataset and obtain competitive results for the AUTSL dataset with a single-stream model processing solely RGB video. The source code, dataset, and pre-trained models are publicly available.
pdf
bib
abs
Hallucination Detection in LLMs Using Spectral Features of Attention Maps
Jakub Binkowski
|
Denis Janiak
|
Albert Sawczyn
|
Bogdan Gabrys
|
Tomasz Jan Kajdanowicz
Large Language Models (LLMs) have demonstrated remarkable performance across various tasks but remain prone to hallucinations. Detecting hallucinations is essential for safety-critical applications, and recent methods leverage attention map properties to this end, though their effectiveness remains limited. In this work, we investigate the spectral features of attention maps by interpreting them as adjacency matrices of graph structures. We propose the LapEigvals method, which utilises the top-k eigenvalues of the Laplacian matrix derived from the attention maps as an input to hallucination detection probes. Empirical evaluations demonstrate that our approach achieves state-of-the-art hallucination detection performance among attention-based methods. Extensive ablation studies further highlight the robustness and generalisation of LapEigvals, paving the way for future advancements in the hallucination detection domain.
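A minimal sketch of the spectral feature extraction described here: an attention map is treated as a weighted adjacency matrix, its graph Laplacian is formed, and the top-k eigenvalues are used as inputs to a detection probe. The symmetrisation step, the use of the unnormalised Laplacian, and the choice of k are assumptions for illustration.

import numpy as np

def laplacian_eigvals(attn, k=5):
    # attn: [seq_len, seq_len] attention map treated as a weighted adjacency matrix
    adj = 0.5 * (attn + attn.T)                     # symmetrise the directed attention graph
    deg = np.diag(adj.sum(axis=1))
    lap = deg - adj                                 # unnormalised graph Laplacian
    eigvals = np.sort(np.linalg.eigvalsh(lap))[::-1]
    return eigvals[:k]                              # top-k eigenvalues as probe features

attn = np.random.rand(8, 8)
attn = attn / attn.sum(axis=1, keepdims=True)       # rows sum to one, like softmax attention
print(laplacian_eigvals(attn, k=3))                  # features fed to a hallucination probe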
pdf
bib
abs
Composable Cross-prompt Essay Scoring by Merging Models
Sanwoo Lee
|
Kun Liang
|
Yunfang Wu
Recent advances in cross-prompt automated essay scoring typically train models jointly on all available source domains, often requiring simultaneous access to unlabeled target domain samples. However, using all sources can lead to suboptimal transfer and high computational cost. Moreover, repeatedly accessing the source essays for continual adaptation raises privacy concerns. We propose a source-free adaptation approach that selectively merges the parameters of individually trained source models without further access to the source datasets. In particular, we mix the task vectors—the parameter updates from fine-tuning—via a weighted sum to efficiently simulate selective joint-training. We use Bayesian optimization to determine the mixing weights using our proposed Prior-encoded Information Maximization (PIM), an unsupervised objective which promotes score discriminability by leveraging useful priors pre-computed from the sources. Experimental results with LLMs on in-dataset and cross-dataset adaptation show that our method (1) consistently outperforms joint-training on all sources, (2) maintains superior robustness compared to other merging methods, (3) excels under severe distribution shifts where recent leading cross-prompt methods struggle, all while retaining computational efficiency.
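The core merging step can be sketched as follows: each source model contributes a task vector (its fine-tuned parameters minus the base model's), and the merged model adds a weighted sum of those vectors to the base. The fixed weights here are placeholders; the paper selects them with Bayesian optimization of its PIM objective.

import torch

def merge_task_vectors(base_state, source_states, weights):
    # Return base parameters plus a weighted sum of task vectors
    # (source parameters minus base parameters).
    merged = {}
    for name, base_param in base_state.items():
        delta = sum(w * (src[name] - base_param)
                    for w, src in zip(weights, source_states))
        merged[name] = base_param + delta
    return merged

# toy example with a single weight matrix standing in for a full model state dict
base = {"w": torch.zeros(2, 2)}
src_a = {"w": torch.ones(2, 2)}
src_b = {"w": 2 * torch.ones(2, 2)}
print(merge_task_vectors(base, [src_a, src_b], weights=[0.3, 0.1])["w"])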
pdf
bib
abs
Towards a Holistic and Automated Evaluation Framework for Multi-Level Comprehension of LLMs in Book-Length Contexts
Yuho Lee
|
Jiaqi Deng
|
Nicole Hee-Yeon Kim
|
Hyangsuk Min
|
Taewon Yun
|
Minjeong Ban
|
Kim Yul
|
Hwanjun Song
We introduce HAMLET, a holistic and automated framework for evaluating the long-context comprehension of large language models (LLMs). HAMLET structures key information of source texts into a three-level hierarchy at root-, branch-, and leaf-levels, and employs query-focused summarization to evaluate how well models faithfully recall the key information at each level. To validate the reliability of our fully automated pipeline, we conduct a systematic human study, demonstrating that our automatic evaluation achieves over 90% agreement with expert human judgments, while reducing the evaluation cost by up to 25×. HAMLET reveals that LLMs struggle with fine-grained comprehension, especially at the leaf level, and are sensitive to positional effects like the lost-in-the-middle. Analytical queries pose greater challenges than narrative ones, and consistent performance gaps emerge between open-source and proprietary models, as well as across model scales. Our code and dataset are publicly available at https://github.com/DISL-Lab/HAMLET.
pdf
bib
abs
Improving Large Language Models Function Calling and Interpretability via Guided-Structured Templates
Hy Dang
|
Tianyi Liu
|
Zhuofeng Wu
|
Jingfeng Yang
|
Haoming Jiang
|
Tao Yang
|
Pei Chen
|
Zhengyang Wang
|
Helen Wang
|
Huasheng Li
|
Bing Yin
|
Meng Jiang
Large language models (LLMs) have demonstrated strong reasoning and tool-use capabilities, yet they often fail in real-world tool interactions due to incorrect parameterization, poor tool selection, or misinterpretation of user intent. These issues often stem from an incomplete understanding of user goals and inadequate comprehension of tool documentation. While Chain-of-Thought (CoT) prompting has proven effective for enhancing reasoning in general contexts, our analysis reveals that free-form CoT is insufficient and sometimes counterproductive for structured function-calling tasks. To address this, we introduce a curriculum-inspired framework that leverages structured reasoning templates to guide LLMs through more deliberate step-by-step instructions for generating function calls. Experimental results show that our method reduces tool-use errors, achieving 3–12% relative improvements over strong baselines across diverse model series and approaches. Moreover, our framework enhances the robustness, interpretability, and transparency of tool-using agents, advancing the development of more reliable AI assistants for real-world applications.
pdf
bib
abs
Evaluation and Facilitation of Online Discussions in the LLM Era: A Survey
Katerina Korre
|
Dimitris Tsirmpas
|
Nikos Gkoumas
|
Emma Cabalé
|
Danai Myrtzani
|
Theodoros Evgeniou
|
Ion Androutsopoulos
|
John Pavlopoulos
We present a survey of methods for assessing and enhancing the quality of online discussions, focusing on the potential of Large Language Models (LLMs). While online discussions aim, at least in theory, to foster mutual understanding, they often devolve into harmful exchanges, such as hate speech, threatening social cohesion and democratic values. Recent advancements in LLMs enable artificial facilitation agents to not only moderate content, but also actively improve the quality of interactions. Our survey synthesizes ideas from Natural Language Processing (NLP) and the Social Sciences to provide (a) a new taxonomy of discussion quality evaluation, (b) an overview of intervention and facilitation strategies, (c) a new taxonomy of conversation facilitation datasets, and (d) an LLM-oriented roadmap of good practices and future research directions, from technological and societal perspectives.
pdf
bib
abs
Temporal Scaling Law for Large Language Models
Yizhe Xiong
|
Xiansheng Chen
|
Xin Ye
|
Hui Chen
|
Zijia Lin
|
Haoran Lian
|
Zhenpeng Su
|
Wei Huang
|
Jianwei Niu
|
Jungong Han
|
Guiguang Ding
Recently, Large Language Models (LLMs) have been widely adopted in a wide range of tasks, leading to increasing attention towards the research on how scaling LLMs affects their performance. Existing works, termed Scaling Laws, have discovered that the final test loss of LLMs scales as power-laws with model size, computational budget, and dataset size. However, the temporal change of the test loss of an LLM throughout its pretraining process remains unexplored, though it is valuable in many aspects, such as selecting better hyperparameters *directly* on the target LLM. In this paper, we propose the novel concept of Temporal Scaling Law, studying how the test loss of an LLM evolves as the training steps scale up. In contrast to modeling the test loss as a whole in a coarse-grained manner, we break it down and dive into the fine-grained test loss of each token position, and further develop a dynamic hyperbolic-law. Afterwards, we derive the much more precise temporal scaling law by studying the temporal patterns of the parameters in the dynamic hyperbolic-law. Results on both in-distribution (ID) and out-of-distribution (OOD) validation datasets demonstrate that our temporal scaling law accurately predicts the test loss of LLMs across training steps. Our temporal scaling law has broad practical applications. First, it enables direct and efficient hyperparameter selection on the target LLM, such as data mixture proportions. Secondly, viewing the LLM pretraining dynamics from the token position granularity provides some insights to enhance the understanding of LLM pretraining.
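As a rough illustration of fitting a hyperbolic-style law to a test-loss curve over training steps: the parametric form below, loss(t) = a / (t + b) + c, and the loss values are assumptions for the example, not the paper's exact dynamic hyperbolic-law or measurements.

import numpy as np
from scipy.optimize import curve_fit

def hyperbolic(t, a, b, c):
    # assumed generic hyperbolic form: loss(t) = a / (t + b) + c
    return a / (t + b) + c

steps = np.array([1e3, 5e3, 1e4, 5e4, 1e5, 3e5])
losses = np.array([4.8, 3.9, 3.5, 2.9, 2.7, 2.5])   # hypothetical measured test losses

params, _ = curve_fit(hyperbolic, steps, losses, p0=(1e4, 1e3, 2.0), maxfev=10000)
print("fitted a, b, c:", params)
print("predicted loss at 1M steps:", hyperbolic(1e6, *params))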
pdf
bib
abs
Reframe Your Life Story: Interactive Narrative Therapist and Innovative Moment Assessment with Large Language Models
Yi Feng
|
Jiaqi Wang
|
Wenxuan Zhang
|
Zhuang Chen
|
Shen Yutong
|
Xiyao Xiao
|
Minlie Huang
|
Liping Jing
|
Jian Yu
Recent progress in large language models (LLMs) has opened new possibilities for mental health support, yet current approaches lack realism in simulating specialized psychotherapy and fail to capture therapeutic progression over time. Narrative therapy, which helps individuals transform problematic life stories into empowering alternatives, remains underutilized due to limited access and social stigma. We address these limitations through a comprehensive framework with two core components. First, **INT** (Interactive Narrative Therapist) simulates expert narrative therapists by planning therapeutic stages, guiding reflection levels, and generating contextually appropriate responses through retrieval-augmentation. Second, **IMA** (Innovative Moment Assessment) provides a therapy-centric evaluation method that quantifies effectiveness by tracking “Innovative Moments” (IMs), critical narrative shifts in client speech signaling therapy progress. Experimental results on 260 simulated clients and 230 human participants reveal that **INT** consistently outperforms standard methods in therapeutic quality and depth. We further demonstrate the effectiveness of **INT** in synthesizing high-quality support conversations to facilitate social applications.
pdf
bib
abs
From Word to World: Evaluate and Mitigate Culture Bias in LLMs via Word Association Test
Xunlian Dai
|
Li Zhou
|
Benyou Wang
|
Haizhou Li
The human-centered word association test (WAT) serves as a cognitive proxy, revealing sociocultural variations through culturally shared semantic expectations and implicit linguistic patterns shaped by lived experiences. We extend this test into an LLM-adaptive, free-relation task to assess the alignment of large language models (LLMs) with cross-cultural cognition. To address culture preference, we propose CultureSteer, an innovative approach that moves beyond superficial cultural prompting by embedding cultural-specific semantic associations directly within the model’s internal representation space. Experiments show that current LLMs exhibit significant bias toward Western (notably American) schemas at the word association level. In contrast, our model substantially improves cross-cultural alignment, capturing diverse semantic associations. Further validation on culture-sensitive downstream tasks confirms its efficacy in fostering cognitive alignment across cultures. This work contributes a novel methodological paradigm for enhancing cultural awareness in LLMs, advancing the development of more inclusive language technologies.
pdf
bib
abs
Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data
Shenglai Zeng
|
Jiankun Zhang
|
Pengfei He
|
Jie Ren
|
Tianqi Zheng
|
Hanqing Lu
|
Han Xu
|
Hui Liu
|
Yue Xing
|
Jiliang Tang
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources. However, when the retrieval process involves private data, RAG systems may face severe privacy risks, potentially leading to the leakage of sensitive information. To address this issue, we propose using synthetic data as a privacy-preserving alternative for the retrieval data. We propose SAGE, a novel two-stage synthetic data generation paradigm. In the first stage, we employ an attribute-based extraction and generation approach to preserve key contextual information from the original data. In the second stage, we further enhance the privacy properties of the synthetic data through an agent-based iterative refinement process. Extensive experiments demonstrate that using our synthetic data as the retrieval context achieves comparable performance to using the original data while substantially reducing privacy risks. Our work takes the first step towards investigating the possibility of generating high-utility and privacy-preserving synthetic data for RAG, opening up new opportunities for the safe application of RAG systems in various domains.
pdf
bib
abs
AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender
Weixiang Zhao
|
Jiahe Guo
|
Yulin Hu
|
Yang Deng
|
An Zhang
|
Xingyu Sui
|
Xinyang Han
|
Yanyan Zhao
|
Bing Qin
|
Tat-Seng Chua
|
Ting Liu
Despite extensive efforts in safety alignment, large language models (LLMs) remain vulnerable to jailbreak attacks. Activation steering offers a training-free defense method but relies on fixed steering coefficients, resulting in suboptimal protection and increased false rejections of benign inputs. To address this, we propose AdaSteer, an adaptive activation steering method that dynamically adjusts model behavior based on input characteristics. We identify two key properties: Rejection Law (R-Law), which shows that stronger steering is needed for jailbreak inputs opposing the rejection direction, and Harmfulness Law (H-Law), which differentiates adversarial and benign inputs. AdaSteer steers input representations along both the Rejection Direction (RD) and Harmfulness Direction (HD), with adaptive coefficients learned via logistic regression, ensuring robust jailbreak defense while preserving benign input handling. Experiments on LLaMA-3.1, Gemma-2, and Qwen2.5 show that AdaSteer outperforms baseline methods across multiple jailbreak attacks with minimal impact on utility. Our results highlight the potential of interpretable model internals for real-time, flexible safety enforcement in LLMs.
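As a rough illustration of the adaptive-steering idea (not the authors' released implementation), the sketch below learns input-dependent coefficients with logistic regression on activation projections and shifts a hidden state along assumed rejection and harmfulness directions; the direction vectors, coefficient caps, and all function names are assumptions.

```python
# Minimal sketch of adaptive activation steering under stated assumptions:
# `rejection_dir` and `harmfulness_dir` are unit vectors presumed precomputed
# from contrastive activation pairs; the coefficient parameterization is a
# stand-in for the paper's learned coefficients.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_coefficient_model(train_acts, is_harmful, direction):
    """Fit a logistic regression on each activation's scalar projection onto
    one steering direction; its probability serves as an adaptive coefficient."""
    proj = (train_acts @ direction).reshape(-1, 1)
    return LogisticRegression().fit(proj, is_harmful)

def adaptive_steer(h, rejection_dir, harmfulness_dir, rd_model, hd_model,
                   max_alpha=8.0, max_beta=4.0):
    """Shift a hidden state along both directions, scaled by the predicted
    probability that the input is adversarial."""
    p_rd = rd_model.predict_proba(np.array([[h @ rejection_dir]]))[0, 1]
    p_hd = hd_model.predict_proba(np.array([[h @ harmfulness_dir]]))[0, 1]
    return h + max_alpha * p_rd * rejection_dir + max_beta * p_hd * harmfulness_dir
```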
pdf
bib
abs
Large Language Models Meet Knowledge Graphs for Question Answering: Synthesis and Opportunities
Chuangtao Ma
|
Yongrui Chen
|
Tianxing Wu
|
Arijit Khan
|
Haofen Wang
Large language models (LLMs) have demonstrated remarkable performance on question-answering (QA) tasks because of their superior capabilities in natural language understanding and generation. However, LLM-based QA struggles with complex QA tasks due to poor reasoning capacity, outdated knowledge, and hallucinations. Several recent works synthesize LLMs and knowledge graphs (KGs) for QA to address the above challenges. In this survey, we propose a new structured taxonomy that categorizes the methodology of synthesizing LLMs and KGs for QA according to the categories of QA and the KG’s role when integrating with LLMs. We systematically survey state-of-the-art methods in synthesizing LLMs and KGs for QA and compare and analyze these approaches in terms of their strengths, limitations, and KG requirements. We then align these approaches with the QA categories and discuss how they address the main challenges of different types of complex QA. Finally, we summarize the advancements, evaluation metrics, and benchmark datasets and highlight open challenges and opportunities.
pdf
bib
abs
TFDP: Token-Efficient Disparity Audits for Autoregressive LLMs via Single-Token Masked Evaluation
Inderjeet Singh
|
Ramya Srinivasan
|
Roman Vainshtein
|
Hisashi Kojima
Auditing autoregressive Large Language Models (LLMs) for disparities is often impeded by high token costs and limited precision. We introduce Token-Focused Disparity Probing (TFDP), a novel methodology overcoming these challenges by adapting single-token masked prediction to autoregressive architectures via targeted token querying. Disparities between minimally contrastive sentence pairs are quantified through a multi-scale semantic alignment score that integrates sentence, local-context, and token embeddings with adaptive weighting. We propose three disparity metrics for comprehensive assessment: Preference Score (PS), Prediction Set Divergence (PSD), and Weighted Final Score (WFS). Evaluated on our customized Proverbs Disparity Dataset (PDD) with controlled attribute toggles (e.g., gender bias, misinformation susceptibility), TFDP precisely detects disparities while requiring up to 42 times fewer output tokens than minimal n-token continuations, offering a scalable tool for responsible LLM evaluation.
pdf
bib
abs
Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation
Li Zhou
|
Lutong Yu
|
Dongchu Xie
|
Shaohuan Cheng
|
Wenyan Li
|
Haizhou Li
Culture is a rich and dynamic domain that evolves across both geography and time. However, existing studies on cultural understanding with vision-language models (VLMs) primarily emphasize geographic diversity, often overlooking the critical temporal dimensions. To bridge this gap, we introduce Hanfu-Bench, a novel, expert-curated multimodal dataset. Hanfu, a traditional garment spanning ancient Chinese dynasties, serves as a representative cultural heritage that reflects the profound temporal aspects of Chinese culture while remaining highly popular in Chinese contemporary society. Hanfu-Bench comprises two core tasks: cultural visual understanding and cultural image transcreation. The former task examines temporal-cultural feature recognition based on single- or multi-image inputs through multiple-choice visual question answering, while the latter focuses on transforming traditional attire into modern designs through cultural element inheritance and modern context adaptation. Our evaluation shows that closed VLMs perform comparably to non-experts on visual cultural understanding but fall short of human experts by 10%, while open VLMs lag further behind non-experts. For the transcreation task, multi-faceted human evaluation indicates that the best-performing model achieves a success rate of only 42%. Our benchmark provides an essential testbed, revealing significant challenges in this new direction of temporal cultural understanding and creative adaptation.
pdf
bib
abs
MERMAID: Multi-perspective Self-reflective Agents with Generative Augmentation for Emotion Recognition
Zhongyu Yang
|
Junhao Song
|
Siyang Song
|
Wei Pang
|
Yingfang Yuan
Multimodal large language models (MLLMs) have demonstrated strong performance across diverse multimodal tasks. However, their application to emotion recognition in natural images remains underexplored. MLLMs struggle to handle ambiguous emotional expressions and implicit affective cues, a capability that is crucial for affective understanding but largely overlooked. To address these challenges, we propose MERMAID, a novel multi-agent framework that integrates a multi-perspective self-reflection module, an emotion-guided visual augmentation module, and a cross-modal verification module. These components enable agents to interact across modalities and reinforce subtle emotional semantics, thereby enhancing emotion recognition and supporting autonomous performance. Extensive experiments show that MERMAID outperforms existing methods, achieving absolute accuracy gains of 8.70%–27.90% across diverse benchmarks and exhibiting greater robustness in emotionally diverse scenarios.
pdf
bib
abs
Personality Vector: Modulating Personality of Large Language Models by Model Merging
Seungjong Sun
|
Seo Yeon Baek
|
Jang Hyun Kim
Driven by the demand for personalized AI systems, there is growing interest in aligning the behavior of large language models (LLMs) with human traits such as personality. Previous attempts to induce personality in LLMs have shown promising results, but they struggle to capture the continuous and multidimensional nature of human traits. In this work, we propose a novel method for personality modulation in LLMs via model merging. Specifically, we construct personality vectors by subtracting the weights of a pre-trained model from those of a model fine-tuned on a given personality trait. By merging personality vectors, we enable LLMs to exhibit desired personality traits without additional training. Extensive experiments show that personality vectors enable continuous control over trait intensity and support the composition of multiple traits. Furthermore, personality vectors transfer across diverse downstream models, suggesting that they encode generalizable representations of personality.
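A minimal sketch of the weight-arithmetic idea, assuming Hugging Face checkpoints with identical architectures; the model names, scaling weights, and function names below are placeholders rather than the paper's exact recipe.

```python
# Sketch of building and merging "personality vectors" as weight deltas
# (task-vector arithmetic). Checkpoint names are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM

def personality_vector(base_name: str, tuned_name: str) -> dict:
    """Subtract the pretrained weights from a model fine-tuned on one trait."""
    base = AutoModelForCausalLM.from_pretrained(base_name).state_dict()
    tuned = AutoModelForCausalLM.from_pretrained(tuned_name).state_dict()
    return {k: tuned[k] - base[k] for k in base}

def apply_personality(base_name: str, vectors: list, weights: list):
    """Merge several trait vectors into the base model with per-trait intensities."""
    model = AutoModelForCausalLM.from_pretrained(base_name)
    state = model.state_dict()
    for vec, w in zip(vectors, weights):
        for k in state:
            state[k] = state[k] + w * vec[k]
    model.load_state_dict(state)
    return model
```

Merging with fractional weights is what would allow continuous control over trait intensity in this sketch, e.g. applying an extraversion vector with weight 0.5 for a milder shift.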
pdf
bib
abs
Beyond Outlining: Heterogeneous Recursive Planning for Adaptive Long-form Writing with Language Models
Ruibin Xiong
|
Yimeng Chen
|
Dmitrii Khizbullin
|
Mingchen Zhuge
|
Jürgen Schmidhuber
Long-form writing agents require flexible integration and interaction across information retrieval, reasoning, and composition. Current approaches rely on predefined workflows and rigid thinking patterns to generate outlines before writing, resulting in constrained adaptability during writing. In this paper we propose WriteHERE, a general agent framework that achieves human-like adaptive writing through recursive task decomposition and dynamic integration of three fundamental task types: retrieval, reasoning, and composition. Our methodology features: 1) a planning mechanism that interleaves recursive task decomposition and execution, eliminating artificial restrictions on writing workflow; and 2) integration of task types that facilitates heterogeneous task decomposition. Evaluations on both fiction writing and technical report generation show that our method consistently outperforms state-of-the-art approaches across all automatic evaluation metrics, demonstrating the effectiveness and broad applicability of our proposed framework. We have publicly released our code and prompts to facilitate further research.
pdf
bib
abs
Hidden in Plain Sight: Reasoning in Underspecified and Misspecified Scenarios for Multimodal LLMs
Qianqi Yan
|
Hongquan Li
|
Shan Jiang
|
Yang Zhao
|
Xinze Guan
|
Ching-Chen Kuo
|
Xin Eric Wang
Multimodal large language models (MLLMs) are increasingly deployed in open-ended, real-world environments where inputs are messy, underspecified, and not always trustworthy. Unlike curated benchmarks, these settings frequently involve instructions that reference missing objects or contradictory facts, rely on ambiguous cues, or request infeasible actions. In such cases, success hinges not merely on task execution, but on the model’s ability to detect when something is silently wrong. This paper presents a systematic analysis of how current MLLMs handle such underspecified and misspecified scenarios: cases where flaws must be inferred from context rather than explicitly stated. Using a curated diagnostic suite spanning four categories of real-world failure modes, we evaluate nine MLLMs, including o3 and GPT-4o, and find that models often fail to surface hidden issues, even when they possess the necessary perceptual and reasoning skills. Explicit prompting reveals that the underlying capabilities exist but are frequently suppressed in favor of user compliance. We further show that simple inference-time interventions, such as cautious persona prompting and, in particular, requiring a clarifying question, can substantially recover performance. Our findings highlight a persistent gap between reasoning competence and behavioral compliance in current MLLMs, and suggest practical strategies for making these systems more trustworthy in underconstrained environments.
pdf
bib
abs
PrimeX: A Dataset of Worldview, Opinion, and Explanation
Rik Koncel-Kedziorski
|
Brihi Joshi
|
Tim Paek
As the adoption of language models advances, so does the need to better represent individual users to the model. Are there aspects of an individual’s belief system that a language model can utilize for improved alignment? Following prior research, we investigate this question in the domain of opinion prediction by developing PrimeX, a dataset of public opinion survey data from 858 US residents with two additional sources of belief information: written explanations from the respondents for why they hold specific opinions, and the Primal World Belief survey for assessing respondent worldview. We provide an extensive initial analysis of our data and show the value of belief explanations and worldview for personalizing language models. Our results demonstrate how the additional belief information in PrimeX can benefit both the NLP and psychological research communities, opening up avenues for further study.
pdf
bib
abs
LASER: An LLM-based ASR Scoring and Evaluation Rubric
Amruta Parulekar
|
Preethi Jyothi
Standard ASR evaluation metrics like Word Error Rate (WER) tend to unfairly penalize morphological and syntactic nuances that do not significantly alter sentence semantics. We introduce LASER, an LLM-based scoring rubric that leverages state-of-the-art LLMs’ in-context learning abilities to learn from prompts with detailed examples. Hindi LASER scores using Gemini 2.5 Pro achieved a very high correlation of 94% with human annotations. Hindi examples in the prompt were also effective in analyzing errors in other Indian languages such as Marathi, Kannada and Malayalam. We also demonstrate how a smaller LLM like Llama 3 can be finetuned on word-pair examples derived from references and ASR predictions to predict, with close to 89% accuracy, what kind of penalty should be applied.
pdf
bib
abs
Improving Zero-shot Sentence Decontextualisation with Content Selection and Planning
Zhenyun Deng
|
Yulong Chen
|
Andreas Vlachos
Extracting individual sentences from a document as evidence or reasoning steps is commonly done in many NLP tasks. However, extracted sentences often lack the context necessary to make them understood, e.g., due to unresolved coreference or missing background information. To address this, we propose a content selection and planning framework for zero-shot decontextualisation, which determines what content should be mentioned and in what order for a sentence to be understood out of context. Specifically, given a potentially ambiguous sentence and its context, we first segment it into basic semantically-independent units. We then identify potentially ambiguous units from the given sentence, and extract relevant units from the context based on their discourse relations. Finally, we generate a content plan to rewrite the sentence by enriching each ambiguous unit with its relevant units. Experimental results demonstrate that our approach is competitive for sentence decontextualisation, producing sentences that exhibit better semantic integrity and discourse coherence, outperforming existing methods.
pdf
bib
abs
Beyond Text: Unveiling Privacy Vulnerabilities in Multi-modal Retrieval-Augmented Generation
Jiankun Zhang
|
Shenglai Zeng
|
Jie Ren
|
Tianqi Zheng
|
Hui Liu
|
Xianfeng Tang
|
Hui Liu
|
Yi Chang
Multimodal Retrieval-Augmented Generation (MRAG) systems enhance large multimodal models (LMMs) by integrating external multimodal databases, but they introduce unexplored privacy vulnerabilities. While text-based RAG privacy risks have been studied, multimodal data presents unique challenges. We provide the first systematic analysis of MRAG privacy vulnerabilities across vision-language and speech-language modalities. Using a novel compositional structured prompt attack in a black-box setting, we demonstrate how attackers can extract private information by manipulating queries. Our experiments reveal that LMMs can both directly generate outputs resembling retrieved content and produce descriptions that indirectly expose sensitive information, highlighting the urgent need for robust privacy-preserving MRAG techniques.
pdf
bib
abs
Code Execution as Grounded Supervision for LLM Reasoning
Dongwon Jung
|
Wenxuan Zhou
|
Muhao Chen
Training large language models (LLMs) with chain-of-thought (CoT) supervision has proven effective for enhancing their reasoning abilities. However, obtaining reliable and accurate reasoning supervision remains a significant challenge. We propose a scalable method for generating a high-quality CoT supervision dataset by leveraging the determinism of program execution. Unlike existing reasoning dataset generation methods that rely on costly human annotations or error-prone LLM-generated CoT, our approach extracts verifiable, step-by-step reasoning traces from code execution and transforms them into natural language CoT reasoning. Experiments on reasoning benchmarks across various domains show that our method effectively equips LLMs with transferable reasoning abilities across diverse tasks. Furthermore, ablation studies validate that our method produces highly accurate reasoning data and reduces overall token length during inference by curbing meaningless repetition and overthinking.
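The core idea of mining reasoning traces from deterministic execution can be sketched with the standard library alone; the toy `solve` function and the naive `verbalize` step below are illustrative stand-ins for the paper's LLM-based verbalization.

```python
# Toy sketch: record intermediate program state during execution, then render
# it as step-by-step text that could seed a CoT example.
import sys

def trace_execution(fn, *args):
    """Record (line number, local variables) snapshots while fn runs."""
    steps = []
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            steps.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer
    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)
    return result, steps

def verbalize(steps):
    """Naive natural-language rendering of a trace (an LLM would do this better)."""
    return [f"After line {line}, the variables are {locs}." for line, locs in steps]

def solve(a, b):
    total = a + b
    doubled = total * 2
    return doubled

result, steps = trace_execution(solve, 3, 4)
print(result)                     # 14
print("\n".join(verbalize(steps)))
```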
pdf
bib
abs
Subjective Behaviors and Preferences in LLM: Language of Browsing
Sai Sundaresan
|
Harshita Chopra
|
Atanu R. Sinha
|
Koustava Goswami
|
Nagasai Saketh Naidu
|
Raghav Karan
|
N Anushka
A Large Language Model (LLM) offers versatility across domains and tasks, purportedly benefiting users with a wide variety of behaviors and preferences. We question this perception about an LLM when users have inherently subjective behaviors and preferences, as seen in their ubiquitous and idiosyncratic browsing of websites or apps. The sequential behavior logs of pages, thus generated, form something akin to each user’s self-constructed “language”, albeit without the structure and grammar imbued in natural languages. We ask: (i) Can a small LM represent the “language of browsing” better than a large LM? (ii) Can an LM with a single set of parameters (or a single LM) adequately capture myriad users’ heterogeneous, subjective behaviors and preferences? (iii) Can a single LM with high average performance also yield low variance in performance, making alignment good at the user level? We introduce clusterwise LM training, HeTLM (Heterogeneity-aware Training of Language Model), appropriate for subjective behaviors. We find that (i) a small LM trained using a page-level tokenizer outperforms large pretrained or finetuned LMs; (ii) HeTLM, with heterogeneous cluster-specific sets of parameters, outperforms a single LM of the same family, controlling for the number of parameters; and (iii) a higher mean and a lower variance in generation ensue, implying improved alignment.
pdf
bib
abs
Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts
Michal Golovanevsky
|
William Rudman
|
Michael A. Lepori
|
Amir Bar
|
Ritambhara Singh
|
Carsten Eickhoff
Multimodal Large Language Models (MLLMs) perform well on tasks such as visual question answering, but it remains unclear whether their reasoning relies more on memorized world knowledge or on the visual information present in the input image. To investigate this, we introduce Visual CounterFact, a new dataset of visually-realistic counterfactuals that put world knowledge priors (e.g., red strawberry) into direct conflict with visual input (e.g., blue strawberry). Using Visual CounterFact, we show that model predictions initially reflect memorized priors, but shift toward visual evidence in mid-to-late layers. This dynamic reveals a competition between the two modalities, with visual input ultimately overriding priors during evaluation. To control this behavior, we propose Pixels Versus Priors (PvP) steering vectors, a mechanism for steering model outputs toward either world knowledge or visual input through activation-level interventions. On average, PvP successfully shifts 99.3% of color and 80.8% of size predictions from priors to counterfactuals. Together, these findings offer new tools for interpreting and controlling factual behavior in multimodal models.
pdf
bib
abs
Balcony: A Lightweight Approach to Dynamic Inference of Generative Language Models
Benyamin Jamialahmadi
|
Parsa Kavehzadeh
|
Mehdi Rezagholizadeh
|
Parsa Farinneya
|
Hossein Rajabzadeh
|
Aref Jafari
|
Boxing Chen
|
Marzieh S. Tahaei
Deploying large language models (LLMs) in real-world applications is often hindered by strict computational and latency constraints. While dynamic inference offers the flexibility to adjust model behavior based on varying resource budgets, existing methods are frequently limited by hardware inefficiencies or performance degradation. In this paper, we introduce Balcony, a simple yet highly effective framework for depth-based dynamic inference. By freezing the pretrained LLM and inserting additional transformer layers at selected exit points, Balcony maintains the full model’s performance while enabling real-time adaptation to different computational budgets. These additional layers are trained using a straightforward self-distillation loss, aligning the sub-model outputs with those of the full model. This approach requires significantly fewer training tokens and tunable parameters, drastically reducing computational costs compared to prior methods. When applied to the LLaMA3-8B model, using only 0.2% of the original pretraining data, Balcony achieves minimal performance degradation while enabling significant speedups. Remarkably, we show that Balcony outperforms state-of-the-art methods such as Flextron and Layerskip on multiple models at various scales, as well as other leading compression techniques across a variety of benchmarks.
pdf
bib
abs
Social Genome: Grounded Social Reasoning Abilities of Multimodal Models
Leena Mathur
|
Marian Qian
|
Paul Pu Liang
|
Louis-Philippe Morency
Social reasoning abilities are crucial for AI systems to effectively interpret and respond to multimodal human communication and interaction within social contexts. We introduce Social Genome, the first benchmark for fine-grained, grounded social reasoning abilities of multimodal models. Social Genome contains 272 videos of interactions and 1,486 human-annotated reasoning traces related to inferences about these interactions. These traces contain 5,777 reasoning steps that reference evidence from visual cues, verbal cues, vocal cues, and external knowledge (contextual knowledge external to videos). Social Genome is also the first modeling challenge to study external knowledge in social reasoning. Social Genome computes metrics to holistically evaluate semantic and structural qualities of model-generated social reasoning traces. We demonstrate the utility of Social Genome through experiments with state-of-the-art models, identifying performance gaps and opportunities for future research to improve the grounded social reasoning abilities of multimodal models.
pdf
bib
abs
Profiler: Black-box AI-generated Text Origin Detection via Context-aware Inference Pattern Analysis
Hanxi Guo
|
Siyuan Cheng
|
Xiaolong Jin
|
Zhuo Zhang
|
Guangyu Shen
|
Kaiyuan Zhang
|
Shengwei An
|
Guanhong Tao
|
Xiangyu Zhang
With the increasing capabilities of Large Language Models (LLMs), the proliferation of AI-generated texts has become a serious concern. Given the diverse range of organizations providing LLMs, it is crucial for governments and third-party entities to identify the origin LLM of a given AI-generated text to enable accurate mitigation of potential misuse and infringement. However, existing detection methods, primarily designed to distinguish between human-generated and LLM-generated texts, often fail to accurately identify the origin LLM due to the high similarity of AI-generated texts from different LLMs. In this paper, we propose a novel black-box AI-generated text origin detection method, dubbed Profiler, which accurately predicts the origin of an input text by extracting distinct context inference patterns, derived from novel context losses between a surrogate model’s output logits and the adjacent input context. Extensive experimental results show that Profiler outperforms 10 state-of-the-art baselines, achieving more than a 25% increase in AUC score on average across both natural language and code datasets when evaluated against five of the latest commercial LLMs under both in-distribution and out-of-distribution settings.
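A hedged sketch of the basic signal such origin detectors build on: per-token losses from a surrogate model over the input text, which can then be featurized and fed to a classifier. The surrogate choice (`gpt2`) and this simple next-token loss are simplifications, not the paper's exact context-loss definition.

```python
# Compute per-token surprisal of a text under a small surrogate LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def per_token_losses(text: str, surrogate: str = "gpt2") -> torch.Tensor:
    tok = AutoTokenizer.from_pretrained(surrogate)
    model = AutoModelForCausalLM.from_pretrained(surrogate).eval()
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Next-token cross-entropy per position: how surprising each observed
    # continuation is to the surrogate given its preceding context.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    return -log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)[0]
```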
pdf
bib
abs
Speech Discrete Tokens or Continuous Features? A Comparative Analysis for Spoken Language Understanding in SpeechLLMs
Dingdong Wang
|
Junan Li
|
Mingyu Cui
|
Dongchao Yang
|
Xueyuan Chen
|
Helen M. Meng
With the rise of Speech Large Language Models (SpeechLLMs), two dominant approaches have emerged for speech processing: discrete tokens and continuous features. Each approach has demonstrated strong capabilities in audio-related processing tasks. However, the performance gap between these two paradigms has not been thoroughly explored. To address this gap, we present a fair comparison of self-supervised learning (SSL)-based discrete and continuous features under the same experimental settings. We evaluate their performance across six spoken language understanding-related tasks using both small and large-scale LLMs (Qwen1.5-0.5B and Llama3.1-8B). We further conduct in-depth analyses, including efficiency comparison, SSL layer analysis, LLM layer analysis, and robustness comparison. Our findings reveal that continuous features generally outperform discrete tokens in various tasks. Each speech processing method exhibits distinct characteristics and patterns in how it learns and processes speech information. We hope our results will provide valuable insights to advance spoken language understanding in SpeechLLMs.
pdf
bib
abs
RAG-Zeval: Enhancing RAG Responses Evaluator through End-to-End Reasoning and Ranking-Based Reinforcement Learning
Kun Li
|
Yunxiang Li
|
Tianhua Zhang
|
Hongyin Luo
|
Xixin Wu
|
James R. Glass
|
Helen M. Meng
Robust evaluation is critical for deploying trustworthy retrieval-augmented generation (RAG) systems. However, current LLM-based evaluation frameworks predominantly rely on directly prompting resource-intensive models with complex multi-stage prompts, underutilizing models’ reasoning capabilities and introducing significant computational cost. In this paper, we present RAG-Zeval (RAG-Zero Evaluator), a novel end-to-end framework that formulates faithfulness and correctness evaluation of RAG systems as a rule-guided reasoning task. Our approach trains evaluators with reinforcement learning, enabling compact models to generate comprehensive and sound assessments with detailed explanations in a single pass. We introduce a ranking-based outcome reward mechanism, using preference judgments rather than absolute scores, to address the challenge of obtaining precise pointwise reward signals. To this end, we synthesize the ranking references by generating quality-controlled responses with zero human annotation. Experiments demonstrate RAG-Zeval’s superior performance, achieving the strongest correlation with human judgments and outperforming baselines that rely on LLMs with 10-100× more parameters. Our approach also exhibits superior interpretability in response evaluation.
pdf
bib
abs
Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index
Hao Xu
|
Jiacheng Liu
|
Yejin Choi
|
Noah A. Smith
|
Hannaneh Hajishirzi
Language models are trained mainly on massive text data from the Internet, and it becomes increasingly important to understand this data source. Exact-match search engines enable searching in large text corpora – counting string appearances and retrieving the enclosing documents – yet the high storage overhead hinders their application on Internet-scale data. We present Infini-gram mini, an efficient and scalable system that can make petabyte-level text corpora searchable. Based on the FM-index data structure (Ferragina and Manzini, 2000), which simultaneously indexes and compresses text, our system creates indexes with size only 44% of the corpus. Infini-gram mini greatly improves upon the best existing implementation of FM-index in terms of indexing speed (18×) and memory use during both indexing (3.2× reduction) and querying (down to a negligible amount). We index 83TB of Internet text in 99 days with a single 128-core CPU node (or 19 hours if using 137 such nodes). We show one important use case of Infini-gram mini in a large-scale analysis of benchmark contamination. We find several core LM evaluation benchmarks to be heavily contaminated in Internet crawls (up to 74.2% in GSM8K), which could lead to overestimating the capabilities of language models if trained on such data. We host a benchmark contamination bulletin to share the contamination rate of many core and community-contributed benchmarks. We also release a web interface and an API endpoint to serve general search queries on Infini-gram mini indexes.
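As a conceptual stand-in for the count queries an FM-index answers, the sketch below uses a suffix array over a toy corpus; the real system's compression and sharding, which make such queries workable at petabyte scale, are not shown, and the corpus string is invented for illustration.

```python
# Exact-match substring counting with a suffix array (toy scale only).
from bisect import bisect_left, bisect_right

def build_suffix_array(text: str) -> list:
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_matches(text: str, sa: list, pattern: str) -> int:
    # Length-|pattern| prefixes of the sorted suffixes are themselves sorted,
    # so two binary searches bound the range of occurrences.
    prefixes = [text[i:i + len(pattern)] for i in sa]
    return bisect_right(prefixes, pattern) - bisect_left(prefixes, pattern)

corpus = "the cat sat on the mat. the cat napped."
sa = build_suffix_array(corpus)
print(count_matches(corpus, sa, "the cat"))   # 2
```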
pdf
bib
abs
Mahānāma: A Unique Testbed for Literary Entity Discovery and Linking
Sujoy Sarkar
|
Gourav Sarkar
|
Manoj Balaji Jagadeeshan
|
Jivnesh Sandhan
|
Amrith Krishna
|
Pawan Goyal
High lexical variation, ambiguous references, and long-range dependencies make entity resolution in literary texts particularly challenging. We present Mahānāma, the first large-scale dataset for end-to-end Entity Discovery and Linking (EDL) in Sanskrit, a morphologically rich and under-resourced language. Derived from the Mahābhārata, the world’s longest epic, the dataset comprises over 109K named entity mentions mapped to 5.5K unique entities, and is aligned with an English knowledge base to support cross-lingual linking. The complex narrative structure of Mahānāma, coupled with extensive name variation and ambiguity, poses significant challenges to resolution systems. Our evaluation reveals that current coreference and entity linking models struggle when evaluated on the global context of the test set. These results highlight the limitations of current approaches in resolving entities within such complex discourse. Mahānāma thus provides a unique benchmark for advancing entity resolution, especially in literary domains.
pdf
bib
abs
Adaptively profiling models with task elicitation
Davis Brown
|
Prithvi Balehannina
|
Helen Jin
|
Shreya Havaldar
|
Hamed Hassani
|
Eric Wong
Language model evaluations often fail to characterize consequential failure modes, forcing experts to inspect outputs and build new benchmarks. We introduce task elicitation, a method that automatically builds new evaluations to profile model behavior. Task elicitation finds hundreds of natural-language tasks—an order of magnitude more than prior work—where frontier models exhibit systematic failures, in domains ranging from forecasting to online harassment. For example, we find that Sonnet 3.5 over-associates quantum computing and AGI and that o3-mini is prone to hallucination when fabrications are repeated in-context.
pdf
bib
abs
Causal Interventions Reveal Shared Structure Across English Filler–Gap Constructions
Sasha Boguraev
|
Christopher Potts
|
Kyle Mahowald
Language Models (LMs) have emerged as powerful sources of evidence for linguists seeking to develop theories of syntax. In this paper, we argue that causal interpretability methods, applied to LMs, can greatly enhance the value of such evidence by helping us characterize the abstract mechanisms that LMs learn to use. Our empirical focus is a set of English filler–gap dependency constructions (e.g., questions, relative clauses). Linguistic theories largely agree that these constructions share many properties. Using experiments based on Distributed Interchange Interventions, we show that LMs converge on similar abstract analyses of these constructions. These analyses also reveal previously overlooked factors – relating to frequency, filler type, and surrounding context – that could motivate changes to standard linguistic theory. Overall, these results suggest that mechanistic, internal analyses of LMs can push linguistic theory forward.
pdf
bib
abs
TactfulToM: Do LLMs have the Theory of Mind ability to understand White Lies?
Yiwei Liu
|
Emma Jane Pretty
|
Jiahao Huang
|
Saku Sugawara
While recent studies explore Large Language Models’ (LLMs) performance on Theory of Mind (ToM) reasoning tasks, research on ToM abilities that require more nuanced social context is limited, such as white lies. We introduce TactfulToM, a novel English benchmark designed to evaluate LLMs’ ability to understand white lies within real-life conversations and reason about prosocial motivations behind them, particularly when they are used to spare others’ feelings and maintain social harmony. Our benchmark is generated through a multi-stage human-in-the-loop pipeline where LLMs expand manually designed seed stories into conversations to maintain the information asymmetry between participants necessary for authentic white lies. We show that TactfulToM is challenging for state-of-the-art models, which perform substantially below humans, revealing shortcomings in their ability to fully comprehend the ToM reasoning that enables true understanding of white lies.
pdf
bib
abs
Don’t Sweat the Small Stuff: Segment-Level Meta-Evaluation Based on Pairwise Difference Correlation
Colten DiIanni
|
Daniel Deutsch
This paper introduces Pairwise Difference Pearson (PDP), a novel segment-level meta-evaluation metric for Machine Translation (MT) that addresses limitations in previous Pearson’s 𝜌-based and Kendall’s 𝜏-based meta-evaluation approaches. PDP is a correlation-based metric that utilizes pairwise differences rather than raw scores. It draws on information from all segments for a more robust understanding of score distributions and uses only pairwise differences to refine Global Pearson to intra-segment comparisons. Analysis on the WMT’24 shared task shows PDP properly ranks sentinel evaluation metrics and better aligns with human error weightings than acceq.
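A minimal sketch of the pairwise-difference idea: for each segment, take differences between all pairs of system scores, then correlate metric differences against human differences. The function name and the toy numbers are illustrative; the exact PDP definition follows the paper.

```python
# Correlate intra-segment pairwise score differences instead of raw scores.
from itertools import combinations
import numpy as np

def pairwise_difference_pearson(metric, human):
    """metric/human: arrays of shape (n_segments, n_systems)."""
    m_diffs, h_diffs = [], []
    for m_row, h_row in zip(np.asarray(metric), np.asarray(human)):
        for i, j in combinations(range(len(m_row)), 2):   # intra-segment pairs only
            m_diffs.append(m_row[i] - m_row[j])
            h_diffs.append(h_row[i] - h_row[j])
    return np.corrcoef(m_diffs, h_diffs)[0, 1]

# Example: 2 segments scored by a metric and by humans for 3 systems each.
metric = [[0.8, 0.5, 0.6], [0.7, 0.7, 0.4]]
human  = [[90,  60,  70 ], [80,  75,  50 ]]
print(pairwise_difference_pearson(metric, human))
```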
pdf
bib
abs
SMART: Simulated Students Aligned with Item Response Theory for Question Difficulty Prediction
Alexander Scarlatos
|
Nigel Fernandez
|
Christopher Ormerod
|
Susan Lottridge
|
Andrew Lan
Item (question) difficulties play a crucial role in educational assessments, enabling accurate and efficient assessment of student abilities and personalization to maximize learning outcomes. Traditionally, estimating item difficulties is costly, requiring real students to respond to items and then fitting an item response theory (IRT) model to obtain difficulty estimates. Moreover, this approach cannot be applied in the cold-start setting to previously unseen items. In this work, we present SMART (Simulated Students Aligned with IRT), a novel method for aligning simulated students with instructed ability, which can then be used in simulations to predict the difficulty of open-ended items. We achieve this alignment using direct preference optimization (DPO), where we form preference pairs based on how likely responses are under a ground-truth IRT model. We perform a simulation by generating thousands of responses, evaluating them with a large language model (LLM)-based scoring model, and fitting the resulting data to an IRT model to obtain item difficulty estimates. Through extensive experiments on two real-world student response datasets, we show that SMART outperforms other item difficulty prediction methods by leveraging its improved ability alignment.
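To make the IRT fitting step concrete, here is a minimal 1PL (Rasch) illustration: given simulated students' binary scores on an item and their instructed abilities, estimate the item's difficulty by maximum likelihood. The abilities, responses, and function names are invented; the paper's full pipeline (DPO-aligned students, LLM scoring, richer IRT models) is outside this sketch.

```python
# Rasch-model item difficulty estimation from simulated 0/1 responses.
import numpy as np
from scipy.optimize import minimize_scalar

def rasch_prob(ability, difficulty):
    """P(correct) under the 1PL model."""
    return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

def fit_difficulty(abilities, correct):
    """Maximum-likelihood difficulty given known abilities and 0/1 responses."""
    def neg_log_lik(b):
        p = rasch_prob(abilities, b)
        return -np.sum(correct * np.log(p) + (1 - correct) * np.log(1 - p))
    return minimize_scalar(neg_log_lik, bounds=(-5, 5), method="bounded").x

abilities = np.array([-1.0, 0.0, 0.5, 1.5, 2.0])   # simulated student abilities
correct   = np.array([0, 0, 1, 1, 1])              # their scored responses
print(fit_difficulty(abilities, correct))          # difficulty near the 0/1 boundary
```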
pdf
bib
abs
HESEIA: A community-based dataset for evaluating social biases in large language models, co-designed in real school settings in Latin America
Guido Ivetta
|
Marcos J Gomez
|
Sofía Martinelli
|
Pietro Palombini
|
M Emilia Echeveste
|
Nair Carolina Mazzeo
|
Beatriz Busaniche
|
Luciana Benotti
Most resources for evaluating social biases in Large Language Models are developed without co-design from the communities affected by these biases, and rarely involve participatory approaches. We introduce HESEIA, a dataset of 46,499 sentences created in a professional development course. The course involved 370 high-school teachers and 5,370 students from 189 Latin-American schools. Unlike existing benchmarks, HESEIA captures intersectional biases across multiple demographic axes and school subjects. It reflects local contexts through the lived experience and pedagogical expertise of educators. Teachers used minimal pairs to create sentences that express stereotypes relevant to their school subjects and communities. We show the dataset’s diversity in terms of both the demographic axes represented and the knowledge areas included. We demonstrate that the dataset contains more stereotypes unrecognized by current LLMs than previous datasets. HESEIA is available to support bias assessments grounded in educational communities.
pdf
bib
abs
WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation
Rabiul Awal
|
Mahsa Massoud
|
Aarash Feizi
|
Zichao Li
|
Suyuchen Wang
|
Christopher Pal
|
Aishwarya Agrawal
|
David Vazquez
|
Siva Reddy
|
Juan A. Rodriguez
|
Perouz Taslakian
|
Spandana Gella
|
Sai Rajeswar
We present WebMMU, a multilingual benchmark that evaluates three core web tasks: (1) website visual question answering, (2) code editing involving HTML/CSS/JavaScript, and (3) mockup-to-code generation. Unlike prior benchmarks that treat these tasks separately, WebMMU unifies them using expert-annotated, real-world web data to assess models’ abilities in complex multi-step reasoning, precise element grounding, and functional UI comprehension and coding. Our evaluation shows that while multimodal large language models (MLLMs) perform well on basic information extraction, they struggle with reasoning and grounding, editing code to preserve functionality, and generating design-to-code that maintains hierarchy and supports multilingual content. These findings reveal key limitations in current MLLMs and underscore the need for improved multimodal and cross-lingual reasoning to build future web agents capable of automating diverse web development tasks.
pdf
bib
abs
Analyzing values about gendered language reform in LLMs’ revisions
Jules Watson
|
Xi Wang
|
Raymond Liu
|
Suzanne Stevenson
|
Barend Beekhuizen
Within the common LLM use case of text revision, we study LLMs’ revision of gendered role nouns (e.g., outdoorsperson/woman/man) and their justifications of such revisions. We evaluate their alignment with feminist and trans-inclusive language reforms for English. Drawing on insight from sociolinguistics, we further assess if LLMs are sensitive to the same contextual effects in the application of such reforms as people are, finding broad evidence of such effects. We discuss implications for value alignment.
pdf
bib
abs
ALLabel: Three-stage Active Learning for LLM-based Entity Recognition using Demonstration Retrieval
Zihan Chen
|
Lei Shi
|
Weize Wu
|
Qiji Zhou
|
Yue Zhang
Many contemporary data-driven research efforts in the natural sciences, such as chemistry and materials science, require large-scale, high-performance entity recognition from scientific datasets. Large language models (LLMs) have increasingly been adopted for entity recognition, mirroring the broader trend across the full spectrum of NLP tasks. Prevailing entity recognition LLMs rely on fine-tuning, yet the fine-tuning process often incurs significant cost. To achieve the best performance-cost trade-off, we propose ALLabel, a three-stage framework designed to select the most informative and representative samples when preparing demonstrations for LLM modeling. The annotated examples are used to construct a ground-truth retrieval corpus for LLM in-context learning. By sequentially employing three distinct active learning strategies, ALLabel consistently outperforms all baselines under the same annotation budget across three specialized domain datasets. Experimental results also demonstrate that selectively annotating only 5%-10% of the dataset with ALLabel can achieve performance comparable to annotating the entire dataset. Further analyses and ablation studies verify the effectiveness and generalizability of our approach.
pdf
bib
abs
HyperKGR: Knowledge Graph Reasoning in Hyperbolic Space with Graph Neural Network Encoding Symbolic Path
Lihui Liu
Knowledge graphs (KGs) enable reasoning tasks such as link prediction, question answering, and knowledge discovery. However, real-world KGs are often incomplete, making link prediction both essential and challenging. Existing methods, including embedding-based and path-based approaches, rely on Euclidean embeddings, which struggle to capture hierarchical structures. GNN-based methods aggregate information through message passing in Euclidean space, but they struggle to effectively encode the recursive tree-like structures that emerge in multi-hop reasoning. To address these challenges, we propose a hyperbolic GNN framework that embeds recursive learning trees in hyperbolic space and generates query-specific embeddings. By incorporating hierarchical message passing, our method naturally aligns with reasoning paths and dynamically adapts to queries, improving prediction accuracy. Unlike static embedding-based approaches, our model computes context-aware embeddings tailored to each query. Experiments on multiple benchmark datasets show that our approach consistently outperforms state-of-the-art methods, demonstrating its effectiveness in KG reasoning.
pdf
bib
abs
LLaMP: Large Language Model Made Powerful for High-fidelity Materials Knowledge Retrieval
Yuan Chiang
|
Elvis Hsieh
|
Chia-Hong Chou
|
Janosh Riebesell
Materials science research requires multi-step reasoning and precise material informatics retrieval, where minor errors can propagate into significant failures in downstream experiments. Despite their general success, Large Language Models (LLMs) often struggle with hallucinations, handling domain-specific data effectively (e.g., crystal structures), and integrating experimental workflows. To address these challenges, we introduce LLaMP, a hierarchical multi-agent framework designed to emulate the materials science research workflow. The high-level supervisor agent decomposes user requests into sub-tasks and coordinates with specialized assistant agents. These assistant agents handle domain-specific tasks, such as retrieving and processing data from the Materials Project (MP) or conducting simulations as needed. This pipeline facilitates iterative refinement of material property retrieval and enables the simulation of real-world research workflows. To ensure reliability, we propose a novel metric combining uncertainty and confidence estimates to evaluate the self-consistency of responses from LLaMP and baseline methods. Our experiments demonstrate LLaMP’s superior performance in material property retrieval, crystal structure editing, and annealing molecular dynamics simulations using pre-trained interatomic potentials. Unlike prior work focused solely on material property prediction or discovery, LLaMP serves as a foundation for autonomous materials research by combining grounded informatics and enabling iterative experimental processes. Code and live demo are available at https://github.com/chiang-yuan/llamp.
pdf
bib
abs
ReSeeding Latent States for Sequential Language Understanding
Stéphane Aroca-Ouellette
|
Katharina von der Wense
|
Alessandro Roncone
We introduce Refeeding State Embeddings aligned using Environmental Data (ReSEED), a novel method for grounding language in environmental data. While large language models (LLMs) excel at many tasks, they continue to struggle with multi-step sequential reasoning. ReSEED addresses this by producing latent embeddings aligned with the true state of the environment and refeeding these embeddings into the model before generating its output. To evaluate its effectiveness, we develop three new sequential reasoning benchmarks, each with a training set of paired state-text trajectories and several text-only evaluation sets that test generalization to longer, unseen trajectories. Across all benchmarks, ReSEED significantly improves generalization and scalability over a text-only baseline. We further show that ReSEED outperforms commercial LLMs on our benchmarks, highlighting the value of grounding language in the environment.
pdf
bib
abs
DPED: Multi-Layer Noise Distillation for Privacy-Preserving Text Embeddings
Shuya Feng
|
Yuan Hong
Training text embedding models under differential privacy constraints is challenging due to the high dimensionality of language data and the presence of rare, identifying linguistic features. We propose DPED (Differentially Private Embedding Distillation), a framework that leverages teacher-student distillation with multi-layer noise injection to learn high-quality embeddings while providing differential privacy guarantees. DPED trains an ensemble of teacher models on disjoint subsets of sensitive text data, then transfers their knowledge to a student model through noisy aggregation at multiple layers. A rare-word-aware strategy adaptively handles infrequent words, improving privacy-utility trade-offs. Experiments on benchmark datasets demonstrate that DPED outperforms standard differentially private training methods, achieving substantially higher utility at the same privacy budget. Our approach protects individual word usage patterns in training documents, preventing models from memorizing unique linguistic fingerprints while maintaining practical utility for downstream NLP tasks. Source code is available at https://github.com/datasec-lab/DPED.
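A simplified sketch of the noisy teacher-to-student aggregation idea, assuming one embedding per teacher (each trained on a disjoint shard); the clipping constant, noise scale, and calibration are placeholders, and the paper's multi-layer injection and rare-word strategy are omitted.

```python
# Clip each teacher embedding, average, and add Gaussian noise before the
# student learns from the aggregate. Noise calibration here is illustrative,
# not a verified privacy accounting.
import torch

def noisy_teacher_target(teacher_embs, clip=1.0, sigma=0.5):
    """teacher_embs: list of tensors, one per teacher model."""
    clipped = [e * min(1.0, clip / float(e.norm())) for e in teacher_embs]  # bound sensitivity
    mean = torch.stack(clipped).mean(dim=0)
    return mean + torch.randn_like(mean) * sigma * clip / len(teacher_embs)

# The student would then be trained to match these noisy targets, e.g.
# loss = ((student(batch) - noisy_teacher_target(teacher_outputs)) ** 2).mean()
```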
pdf
bib
abs
Identifying & Interactively Refining Ambiguous User Goals for Data Visualization Code Generation
Mert Inan
|
Anthony Sicilia
|
Alex Xie
|
Saujas Vaduguru
|
Daniel Fried
|
Malihe Alikhani
Establishing shared goals is a fundamental step in human-AI communication. However, ambiguities can lead to outputs that seem correct but fail to reflect the speaker’s intent. In this paper, we explore this issue with a focus on the data visualization domain, where ambiguities in natural language impact the generation of code that visualizes data. The availability of multiple views of the context (e.g., the intended plot and the code that renders it) allows for a unique and comprehensive analysis of diverse ambiguity types. We develop a taxonomy of types of ambiguity that arise in this task and propose metrics to quantify them. Using Matplotlib problems from the DS-1000 dataset, we demonstrate that our ambiguity metrics better correlate with human annotations than uncertainty baselines. Our work also explores how multi-turn dialogue can reduce ambiguity and, therefore, improve code accuracy by better matching user goals. We evaluate three pragmatic models to inform our dialogue strategies: Gricean Cooperativity, Discourse Representation Theory, and Questions under Discussion. A simulated user study reveals how pragmatic dialogues reduce ambiguity and enhance code accuracy, highlighting the value of multi-turn exchanges in code generation.
pdf
bib
abs
Morpheme Induction for Emergent Language
Brendon Boldt
|
David R. Mortensen
We introduce CSAR, an algorithm for inducing morphemes from emergent language corpora of parallel utterances and meanings. It is a greedy algorithm that (1) weights morphemes based on mutual information between forms and meanings, (2) selects the highest-weighted pair, (3) removes it from the corpus, and (4) repeats the process to induce further morphemes (i.e., Count, Select, Ablate, Repeat). The effectiveness of CSAR is first validated on procedurally generated datasets and compared against baselines for related tasks. Second, we validate CSAR’s performance on human language data to show that the algorithm makes reasonable predictions in adjacent domains. Finally, we analyze a handful of emergent languages, quantifying linguistic characteristics like degree of synonymy and polysemy.
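A toy sketch of the Count-Select-Ablate-Repeat loop on (form, meaning-set) pairs. The scoring below is plain pointwise mutual information and the mini-corpus is invented, both simplifications of the paper's weighting and data.

```python
# Greedy morpheme induction: count co-occurrences, select the highest-PMI
# (substring, meaning) pair, ablate it from the corpus, and repeat.
import math
from collections import Counter

def _find(seq, sub):
    for i in range(len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return None

def csar(corpus, max_morphemes=10):
    corpus = [(list(f), set(m)) for f, m in corpus]
    induced = []
    for _ in range(max_morphemes):
        form_c, meaning_c, joint_c = Counter(), Counter(), Counter()
        n = len(corpus)
        for form, meaning in corpus:                      # Count
            subs = {tuple(form[i:j]) for i in range(len(form))
                    for j in range(i + 1, len(form) + 1)}
            form_c.update(subs)
            meaning_c.update(meaning)
            joint_c.update((s, m) for s in subs for m in meaning)
        if not joint_c:
            break
        def pmi(pair):                                    # Select
            s, m = pair
            return math.log(joint_c[pair] * n / (form_c[s] * meaning_c[m]))
        best = max(joint_c, key=lambda p: (pmi(p), joint_c[p]))
        induced.append(best)
        s, m = best                                       # Ablate, then Repeat
        new_corpus = []
        for form, meaning in corpus:
            idx = _find(form, list(s))
            if m in meaning and idx is not None:
                form = form[:idx] + form[idx + len(s):]
                meaning = meaning - {m}
            new_corpus.append((form, meaning))
        corpus = new_corpus
    return induced

toy = [("perro", {"DOG", "SG"}), ("perros", {"DOG", "PL"}),
       ("gato", {"CAT", "SG"}), ("gatos", {"CAT", "PL"})]
for s, m in csar(toy, max_morphemes=4):
    print("".join(s), "<->", m)
```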
pdf
bib
abs
Stepwise Informativeness Search for Improving LLM Reasoning
Siyuan Wang
|
Enda Zhao
|
Xiang Ren
Advances in Large Language Models (LLMs) have improved multi-step reasoning by generating free-text rationales, but these models tend to lose focus over the middle of long contexts. This raises concerns that as reasoning progresses, LLMs may overlook information in earlier steps when decoding subsequent steps, leading to unreliable and redundant rationales. To address this, we propose guiding LLMs to generate more accurate and concise rationales by (1) proactively referencing information from underutilized prior steps, and (2) minimizing redundant information between new and existing steps. We introduce stepwise informativeness search, an inference-time tree search framework incorporating two selection heuristics: grounding-guided selection, which prioritizes candidate steps that attend more strongly to underutilized prior steps; and novelty-guided selection, which encourages steps with novel conclusions. We further utilize a self-grounding strategy that prompts LLMs to explicitly reference relevant prior steps as premises before deduction at each step, mitigating distraction from irrelevant content. Experiments on five reasoning datasets across five LLMs show the effectiveness and efficiency of our approach in improving reasoning with fewer errors and less redundancy.
pdf
bib
abs
Social Good or Scientific Curiosity? Uncovering the Research Framing Behind NLP Artefacts
Eric Chamoun
|
Nedjma Ousidhoum
|
Michael Sejr Schlichtkrull
|
Andreas Vlachos
Clarifying the research framing of NLP artefacts (e.g., models, datasets, etc.) is crucial to aligning research with practical applications when researchers claim that their findings have real-world impact. Recent studies manually analyzed NLP research across domains, showing that few papers explicitly identify key stakeholders, intended uses, or appropriate contexts. In this work, we propose to automate this analysis, developing a three-component system that infers research framings by first extracting key elements (means, ends, stakeholders), then linking them through interpretable rules and contextual reasoning. We evaluate our approach on two domains: automated fact-checking using an existing dataset, and hate speech detection for which we annotate a new dataset—achieving consistent improvements over strong LLM baselines. Finally, we apply our system to recent automated fact-checking papers and uncover three notable trends: a rise in underspecified research goals, increased emphasis on scientific exploration over application, and a shift toward supporting human fact-checkers rather than pursuing full automation.
pdf
bib
abs
FairGen: Controlling Sensitive Attributes for Fair Generations in Diffusion Models via Adaptive Latent Guidance
Mintong Kang
|
Vinayshekhar Bannihatti Kumar
|
Shamik Roy
|
Abhishek Kumar
|
Sopan Khosla
|
Balakrishnan Murali Narayanaswamy
|
Rashmi Gangadharaiah
Text-to-image diffusion models often exhibit biases toward specific demographic groups, such as generating more males than females when prompted to generate images of engineers, raising ethical concerns and limiting their adoption. In this paper, we tackle the challenge of mitigating generation bias towards any target attribute value (e.g., “male” for “gender”) in diffusion models while preserving generation quality. We propose FairGen, an adaptive latent guidance mechanism which controls the generation distribution during inference. In FairGen, a latent guidance module dynamically adjusts the diffusion process to enforce specific attributes, while a memory module tracks the generation statistics and steers latent guidance to align with the targeted fair distribution of the attribute values. Further, given the limitations of existing datasets in comprehensively assessing bias in diffusion models, we introduce a holistic bias evaluation benchmark HBE, covering diverse domains and incorporating complex prompts across various applications. Extensive evaluations on HBE and Stable Bias datasets demonstrate that FairGen outperforms existing bias mitigation approaches, achieving substantial bias reduction (e.g., 68.5% gender bias reduction on Stable Diffusion 2). Ablation studies highlight FairGen’s ability to flexibly and precisely control generation distribution at any user-specified granularity, ensuring adaptive and targeted bias mitigation.
pdf
bib
abs
Contra4: Evaluating Contrastive Cross-Modal Reasoning in Audio, Video, Image, and 3D
Artemis Panagopoulou
|
Le Xue
|
Honglu Zhou
|
Silvio Savarese
|
Ran Xu
|
Caiming Xiong
|
Chris Callison-Burch
|
Mark Yatskar
|
Juan Carlos Niebles
Real-world decision-making often begins with identifying which modality contains the most relevant information for a given query. While recent multimodal models have made impressive progress in processing diverse inputs, it remains unclear whether they can reason contrastively across multiple modalities to select the one that best satisfies a natural language prompt. We argue this capability is foundational, especially in retrieval-augmented and decision-time contexts, where systems must evaluate multiple signals and identify which one conveys the relevant information. To evaluate this skill, we introduce Contra4, a dataset for contrastive cross-modal reasoning across four modalities: image, audio, video, and 3D. Each example presents a natural language question alongside multiple candidate modality instances, and the model must select the one that semantically aligns with the prompt. Contra4 combines human-annotated captions with a mixture-of-models round-trip-consistency filter to ensure high-quality supervision, resulting in 174k training examples and a manually verified test set of 2.3k samples. While task-specific fine-tuning improves performance by 56% relative to baseline, state-of-the-art models still achieve only 56% accuracy overall and 42% in four-modality settings, underscoring a significant limitation in current multimodal models.
pdf
bib
abs
Proactive Hearing Assistants that Isolate Egocentric Conversations
Guilin Hu
|
Malek Itani
|
Tuochao Chen
|
Shyamnath Gollakota
We introduce proactive hearing assistants that automatically identify and separate the wearer’s conversation partners, without requiring explicit prompts. Our system operates on egocentric binaural audio and uses the wearer’s self-speech as an anchor, leveraging turn-taking behavior and dialogue dynamics to infer conversational partners and suppress others. To enable real-time, on-device operation, we propose a dual-model architecture: a lightweight streaming model runs every 12.5 ms for low-latency extraction of the conversation partners, while a slower model runs less frequently to capture longer-range conversational dynamics. Results on real-world 2- and 3-speaker conversation test sets, collected with binaural egocentric hardware from 11 participants totaling 6.8 hours, show generalization in identifying and isolating conversational partners in multi-conversation settings. Our work marks a step toward hearing assistants that adapt proactively to conversational dynamics and engagement.
pdf
bib
abs
fLSA: Learning Semantic Structures in Document Collections Using Foundation Models
Weijia Xu
|
Nebojsa Jojic
|
Nicolas Le Roux
Humans can learn to solve new tasks by inducing high-level strategies from example solutions to similar problems and then adapting these strategies to solve unseen problems. Can we use large language models to induce such high-level structure from example documents or solutions? We introduce fLSA, a foundation-model-based Latent Semantic Analysis method that iteratively clusters and tags document segments based on document-level contexts. These tags can be used to model the latent structure of given documents and for hierarchical sampling of new texts. Our experiments on story writing, math, and multi-step reasoning datasets demonstrate that fLSA tags are more informative in reconstructing the original texts than existing tagging methods. Moreover, when used for hierarchical sampling, fLSA tags help expand the output space in the right directions that lead to correct solutions more often than direct sampling and hierarchical sampling with existing tagging methods.
pdf
bib
abs
SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning
Kaiwen Zhou
|
Xuandong Zhao
|
Jayanth Srinivasa
|
Gaowen Liu
|
Aosong Feng
|
Dawn Song
|
Xin Eric Wang
Large Reasoning Models (LRMs) introduce a new generation paradigm of explicitly reasoning before answering, leading to remarkable improvements in complex tasks. However, they pose serious safety risks when facing harmful queries and adversarial attacks. While supervised fine-tuning (SFT), the current mainstream safety effort for LRMs, improves safety performance, we find that SFT-aligned models struggle to generalize to unseen jailbreak prompts. After thorough investigation, we identify a safety aha moment that can activate safety reasoning and lead to a safe response. This aha moment typically appears in the ‘key sentence’ that follows the model’s query understanding process and can indicate whether the model will proceed safely. Based on these insights, we propose SafeKey, including two complementary objectives to better activate the safety aha-moment in the key sentence: (1) a Dual-Path Safety Head to enhance the safety signal in the model’s internal representations before the key sentence, and (2) a Query-Mask Modeling objective to improve the model’s attention to its query understanding, which carries important safety hints. Experiments across multiple safety benchmarks demonstrate that our methods significantly improve safety generalization to a wide range of jailbreak attacks and out-of-distribution harmful prompts, lowering the harmfulness rate by 9.6%, while maintaining general abilities. Our analysis reveals how SafeKey enhances safety by reshaping internal attention and improving the quality of hidden representations.
pdf
bib
abs
HypER: Literature-grounded Hypothesis Generation and Distillation with Provenance
Rosni Vasu
|
Chandrayee Basu
|
Bhavana Dalvi Mishra
|
Cristina Sarasua
|
Peter Clark
|
Abraham Bernstein
Large language models have demonstrated promising performance in research ideation across scientific domains. Hypothesis development, the process of generating a highly specific declarative statement connecting a research idea with empirical validation, has received relatively little attention. Existing approaches trivially deploy retrieval augmentation and focus only on the quality of the final output, ignoring the underlying reasoning process behind ideation. We present HypER (Hypothesis Generation with Explanation and Reasoning), a small language model (SLM) trained for literature-guided reasoning and evidence-based hypothesis generation. HypER is trained in a multi-task setting to discriminate between valid and invalid scientific reasoning chains in the presence of controlled distractions. We find that HypER outperforms the base model in distinguishing valid from invalid reasoning chains (+22% average absolute F1) and generates better evidence-grounded hypotheses (0.327 vs. 0.305 for the base model) with high feasibility and impact as judged by human experts (>3.5 on a 5-point Likert scale).
pdf
bib
abs
Empowering GraphRAG with Knowledge Filtering and Integration
Kai Guo
|
Harry Shomer
|
Shenglai Zeng
|
Haoyu Han
|
Yu Wang
|
Jiliang Tang
In recent years, large language models (LLMs) have revolutionized the field of natural language processing. However, they often suffer from knowledge gaps and hallucinations. Graph retrieval-augmented generation (GraphRAG) enhances LLM reasoning by integrating structured knowledge from external graphs. However, we identify two key challenges that plague GraphRAG: (1) retrieving noisy and irrelevant information can degrade performance, and (2) excessive reliance on external knowledge suppresses the model’s intrinsic reasoning. To address these issues, we propose GraphRAG-FI (Filtering & Integration), consisting of GraphRAG-Filtering and GraphRAG-Integration. GraphRAG-Filtering employs a two-stage filtering mechanism to refine retrieved information. GraphRAG-Integration employs a logits-based selection strategy to balance external knowledge from GraphRAG with the LLM’s intrinsic reasoning, reducing over-reliance on retrievals. Experiments on knowledge graph QA tasks demonstrate that GraphRAG-FI significantly improves reasoning performance across multiple backbone models, establishing a more reliable and effective GraphRAG framework.
pdf
bib
abs
Interpretable Mnemonic Generation for Kanji Learning via Expectation-Maximization
Jaewook Lee
|
Alexander Scarlatos
|
Andrew Lan
Learning Japanese vocabulary is a challenge for learners from Roman alphabet backgrounds due to script differences. Japanese combines syllabaries like hiragana with kanji, which are logographic characters of Chinese origin. Kanji pose an additional challenge due to their structural complexity and sheer number. Keyword mnemonics are a common strategy to aid memorization, often using the compositional structure of kanji to form vivid associations. Despite recent efforts to use large language models (LLMs) to assist learners, existing methods for LLM-based keyword mnemonic generation function as a black box, offering limited interpretability. We propose a generative framework that explicitly models the mnemonic construction process as driven by a set of common rules, and learns these rules using a novel Expectation-Maximization-type algorithm. Trained on learner-authored mnemonics from an online platform, our method learns latent structures and compositional rules, enabling interpretable and systematic mnemonic generation. Experiments show that our method performs well in the cold-start setting for new learners while providing insight into the mechanisms behind effective mnemonic creation.
pdf
bib
abs
Refining Attention for Explainable and Noise-Robust Fact-Checking with Transformers
Jean-Flavien Bussotti
|
Paolo Papotti
In tasks like question answering and fact-checking, models must discern relevant information from extensive corpora in an “open-book” setting. Conventional transformer-based models excel at classifying input data, but (i) often falter due to sensitivity to noise and (ii) lack explainability regarding their decision process. To address these challenges, we introduce ATTUN, a novel transformer architecture designed to enhance model transparency and resilience to noise by refining the attention mechanisms. Our approach involves a dedicated module that directly modifies attention weights, allowing the model to both improve predictions and identify the most relevant sections of input data. We validate our methodology using fact-checking datasets and show promising results in question answering. Experiments demonstrate improvements of up to 51% in F1 score for detecting relevant context, and gains of up to 18% in task accuracy when integrating ATTUN into a model.
pdf
bib
abs
Harmful Prompt Laundering: Jailbreaking LLMs with Abductive Styles and Symbolic Encoding
Seongho Joo
|
Hyukhun Koh
|
Kyomin Jung
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but their potential misuse for harmful purposes remains a significant concern. To strengthen defenses against such vulnerabilities, it is essential to investigate universal jailbreak attacks that exploit intrinsic weaknesses in the architecture and learning paradigms of LLMs. In response, we propose Harmful Prompt Laundering (HaPLa), a novel and broadly applicable jailbreaking technique that requires only black-box access to target models. HaPLa incorporates two primary strategies: 1) abductive framing, which instructs LLMs to infer plausible intermediate steps toward harmful activities, rather than directly responding to explicit harmful queries; and 2) symbolic encoding, a lightweight and flexible approach designed to obfuscate harmful content, given that current LLMs remain sensitive primarily to explicit harmful keywords. Experimental results show that HaPLa achieves over 95% attack success rate on GPT-series models and 70% across all targets. Further analysis with diverse symbolic encoding rules also reveals a fundamental challenge: it remains difficult to safely tune LLMs without significantly diminishing their helpfulness in responding to benign queries.
pdf
bib
abs
Pathway to Relevance: How Cross-Encoders Implement a Semantic Variant of BM25
Meng Lu
|
Catherine Chen
|
Carsten Eickhoff
Mechanistic interpretation has greatly contributed to a more detailed understanding of generative language models, enabling significant progress in identifying structures that implement key behaviors through interactions between internal components. In contrast, interpretability in information retrieval (IR) remains relatively coarse-grained, and much is still unknown as to how IR models determine whether a document is relevant to a query. In this work, we address this gap by mechanistically analyzing how one commonly used model, a cross-encoder, estimates relevance. We find that the model extracts traditional relevance signals, such as term frequency and inverse document frequency, in early-to-middle layers. These concepts are then combined in later layers, similar to the well-known probabilistic ranking function, BM25. Overall, our analysis offers a more nuanced understanding of how IR models compute relevance. Isolating these components lays the groundwork for future interventions that could enhance transparency, mitigate safety risks, and improve scalability.
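As a reference point for the BM25-style signal mentioned above, the snippet below computes the standard Okapi BM25 contribution of a single query term; this is textbook BM25, not code or parameter values from the paper.

```python
# Standard Okapi BM25 term score (textbook form); included only as a reference
# for the relevance signals discussed above, not code from the paper.
import math

def bm25_term(tf, df, num_docs, doc_len, avg_len, k1=1.2, b=0.75):
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

# Example: a term appearing 3 times in a 120-token document, in 50 of 10,000 documents.
print(bm25_term(tf=3, df=50, num_docs=10_000, doc_len=120, avg_len=100.0))
```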
pdf
bib
abs
Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening
Andre Wang He
|
Daniel Fried
|
Sean Welleck
Reinforcement learning is emerging as a primary driver for improving language model reasoning capabilities. A fundamental question is whether current reinforcement learning algorithms—such as Group Relative Policy Optimization (GRPO), the de facto standard algorithm used to improve language model reasoning—merely sharpen the base model’s distribution around problems it can already solve. We investigate this question in the context of formal theorem proving, which has access to a perfect verifier. We identify a degenerate rank bias in GRPO in which highly probable trajectories are reinforced and rare ones are neglected. This results in distribution sharpening: the model can solve some problems with fewer samples, but underperforms simply sampling more solutions from the original model. To overcome GRPO’s rank bias we introduce unlikeliness reward, a simple method for explicitly up-weighting rare but correct solutions. We show that unlikeliness reward mitigates rank bias and improves pass@N across a large range of N in both synthetic and real theorem proving settings. We also uncover an unexpected link between rank bias and a seemingly mundane hyperparameter—the number of updates per batch—that leads to a second, complementary mitigation. We combine our insights into a revised GRPO training recipe for formal theorem proving, yielding an open pipeline that achieves competitive performance to DeepSeek-Prover-V1.5-RL on the miniF2F-test benchmark.
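To make the idea of up-weighting rare but correct solutions concrete, here is a small illustrative sketch: correct samples that the policy assigns low likelihood receive a bonus before group-relative advantages are computed. The function name, the rank-based rarity weight, and the bonus value are assumptions for illustration, not the paper's exact reward.

```python
# Illustrative sketch (not the paper's code): an "unlikeliness"-style bonus that
# up-weights correct solutions the policy assigns low probability to, before
# computing group-relative advantages as in GRPO.
import numpy as np

def group_advantages(correct, logprobs, bonus=0.5):
    """correct: 0/1 per sampled solution; logprobs: sequence log-probabilities."""
    correct = np.asarray(correct, dtype=float)
    logprobs = np.asarray(logprobs, dtype=float)
    rank = np.argsort(np.argsort(logprobs))          # 0 = least likely sample
    rarity = 1.0 - rank / max(len(logprobs) - 1, 1)  # 1 for rarest, 0 for most likely
    reward = correct + bonus * correct * rarity      # bonus only for rare, correct samples
    # Group-relative advantage: standardize rewards within the sampled group.
    return (reward - reward.mean()) / (reward.std() + 1e-8)

print(group_advantages([1, 1, 0, 0], [-10.0, -45.0, -12.0, -50.0]))
```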
pdf
bib
abs
PhoniTale: Phonologically Grounded Mnemonic Generation for Typologically Distant Language Pairs
Sana Kang
|
Myeongseok Gwon
|
Su Young Kwon
|
Jaewook Lee
|
Andrew Lan
|
Bhiksha Raj
|
Rita Singh
Vocabulary acquisition poses a significant challenge for second-language (L2) learners, especially when learning typologically distant languages such as English and Korean, where phonological and structural mismatches complicate vocabulary learning. Recently, large language models (LLMs) have been used to generate keyword mnemonics by leveraging similar keywords from a learner’s first language (L1) to aid in acquiring L2 vocabulary. However, most methods still rely on direct IPA-based phonetic matching or employ LLMs without phonological guidance. In this paper, we present PhoniTale, a novel cross-lingual mnemonic generation system that performs IPA-based phonological adaptation and syllable-aware alignment to retrieve L1 keyword sequences and uses LLMs to generate verbal cues. We evaluate PhoniTale through automated metrics and a short-term recall test with human participants, comparing its output to human-written and prior automated mnemonics. Our findings show that PhoniTale consistently outperforms previous automated approaches and achieves quality comparable to human-written mnemonics.
pdf
bib
abs
Amulet: Putting Complex Multi-Turn Conversations on the Stand with LLM Juries
Sahana Ramnath
|
Anurag Mudgil
|
Brihi Joshi
|
Skyler Hallinan
|
Xiang Ren
Today, large language models are widely used as judges to evaluate responses from other language models. Hence, it is imperative to benchmark and improve these LLM-judges on real-world language model usage: a typical human-assistant conversation is lengthy, and shows significant diversity in topics, intents, and requirements across turns, e.g. social interactions, task requests, feedback. We present Amulet, a framework that leverages pertinent linguistic concepts of dialog-acts and maxims to improve the accuracy of LLM-judges on preference data with complex, multi-turn conversational context. Amulet presents valuable insights about (a) the communicative structures and intents present in the conversation (dialog acts), and (b) the satisfaction of conversational principles (maxims) by the preference pair responses, and uses them to make judgments. On 4 challenging datasets, Amulet shows that (a) humans frequently (60-70% of the time) change their intents from one turn of the conversation to the next, and (b) in ∼75% of instances, the preference responses can be differentiated via dialog acts and/or maxims, reiterating the latter’s significance in judging such data. Amulet can be used either as a judge by applying the framework to a single LLM, or integrated into a jury with different LLM judges; our judges and juries show strong improvements on relevant baselines for all 4 datasets.
pdf
bib
abs
Exploring Chain-of-Thought Reasoning for Steerable Pluralistic Alignment
Yunfan Zhang
|
Kathleen McKeown
|
Smaranda Muresan
Large Language Models (LLMs) are typically trained to reflect a relatively uniform set of values, which limits their applicability to tasks that require understanding of nuanced human perspectives. Recent research has underscored the importance of enabling LLMs to support steerable pluralism — the capacity to adopt a specific perspective and align generated outputs with it. In this work, we investigate whether Chain-of-Thought (CoT) reasoning techniques can be applied to building steerable pluralistic models. We explore several methods, including CoT prompting, fine-tuning on human-authored CoT, fine-tuning on synthetic explanations, and Reinforcement Learning with Verifiable Rewards (RLVR). We evaluate these approaches using the Value Kaleidoscope and OpinionQA datasets. Among the methods studied, RLVR consistently outperforms others and demonstrates strong training sample efficiency. We further analyze the generated CoT traces with respect to faithfulness and safety.
pdf
bib
abs
CMedCalc-Bench: A Fine-Grained Benchmark for Chinese Medical Calculations in LLM
Yunyan Zhang
|
Zhihong Zhu
|
Xian Wu
Large Language Models (LLMs) have demonstrated significant potential in medical diagnostics and clinical decision-making. While benchmarks such as MedQA and PubMedQA have advanced the evaluation of qualitative reasoning, existing medical NLP benchmarks still face two limitations: the absence of a Chinese benchmark for medical calculation tasks, and the lack of fine-grained evaluation of intermediate reasoning. In this paper, we introduce CMedCalc-Bench, a new benchmark designed for Chinese medical calculation. CMedCalc-Bench covers 69 calculators across 12 clinical departments, featuring over 1,000 real-world patient cases. Building on this, we design a fine-grained evaluation framework that disentangles clinical entity extraction from numerical computation, enabling systematic diagnosis of model deficiencies. Experiments across four model families, including medical-specialized and reasoning-focused models, provide an assessment of their strengths and limitations on Chinese medical calculation. Furthermore, explorations of faithful reasoning and the demonstration effect offer early insights into advancing safe and reliable clinical computation.
pdf
bib
abs
Evaluating Robustness of Large Audio Language Models to Audio Injection: An Empirical Study
Guanyu Hou
|
Jiaming He
|
Yinhang Zhou
|
Ji Guo
|
Yitong Qiao
|
Rui Zhang
|
Wenbo Jiang
Large Audio-Language Models (LALMs) are increasingly deployed in real-world applications, yet their robustness against malicious audio injection remains underexplored. To address this gap, this study systematically evaluates five leading LALMs across four attack scenarios: Audio Interference Attack, Instruction Following Attack, Context Injection Attack, and Judgment Hijacking Attack. We quantitatively assess their vulnerabilities and resilience using three metrics: the Defense Success Rate, Context Robustness Score, and Judgment Robustness Index. The experiments reveal significant performance disparities, with no single model demonstrating consistent robustness across all attack types. Attack effectiveness is significantly influenced by the position of the malicious content, particularly when injected at the beginning of a sequence. Furthermore, our analysis uncovers a negative correlation between a model’s instruction-following capability and its robustness: models that strictly adhere to instructions tend to be more susceptible, whereas safety-aligned models exhibit greater resistance. To facilitate future research, this work introduces a comprehensive benchmark framework. Our findings underscore the critical need for integrating robustness into training pipelines and developing multi-modal defenses, ultimately facilitating the secure deployment of LALMs. The dataset used in this work is available on Hugging Face.
pdf
bib
abs
How Far Can LLMs Improve from Experience? Measuring Test-Time Learning Ability in LLMs with Human Comparison
Jiayin Wang
|
Zhiqiang Guo
|
Weizhi Ma
|
Min Zhang
As evaluation designs of large language models may shape our trajectory toward artificial general intelligence, comprehensive and forward-looking assessment is essential. Existing benchmarks primarily assess static knowledge, while intelligence also entails the ability to rapidly learn from experience. To this end, we advocate for the evaluation of Test-time Learning, the capacity to improve performance in experience-based, reasoning-intensive tasks during test time. In this work, we propose semantic games as effective testbeds for evaluating test-time learning, due to their resistance to saturation and inherent demand for strategic reasoning. We introduce an objective evaluation framework that compares model performance under both limited and cumulative experience settings, and contains four forms of experience representation. To provide a comparative baseline, we recruit eight human participants to complete the same task. Results show that LLMs exhibit measurable test-time learning capabilities; however, their improvements are less stable under cumulative experience and progress more slowly than those observed in humans. These findings underscore the potential of LLMs as general-purpose learning machines, while also revealing a substantial intellectual gap between models and humans, irrespective of how well LLMs perform on static benchmarks. The code and data are available.
pdf
bib
abs
Subtle Risks, Critical Failures: A Framework for Diagnosing Physical Safety of LLMs for Embodied Decision Making
Yejin Son
|
Minseo Kim
|
Sungwoong Kim
|
Seungju Han
|
Jian Kim
|
Dongju Jang
|
Youngjae Yu
|
Chan Young Park
Large Language Models (LLMs) are increasingly used for decision making in embodied agents, yet existing safety evaluations often rely on coarse success rates and domain-specific setups, making it difficult to diagnose why and where these models fail. This obscures our understanding of embodied safety and limits the selective deployment of LLMs in high-risk physical environments. We introduce SAFEL, a framework for systematically evaluating the physical safety of LLMs in embodied decision making. SAFEL assesses two key competencies: (1) rejecting unsafe commands via the Command Refusal Test, and (2) generating safe and executable plans via the Plan Safety Test. Critically, the latter is decomposed into functional modules (goal interpretation, transition modeling, and action sequencing), enabling fine-grained diagnosis of safety failures. To support this framework, we introduce EMBODYGUARD, a PDDL-grounded benchmark containing 942 LLM-generated scenarios covering both overtly malicious and contextually hazardous instructions. Evaluation across 13 state-of-the-art LLMs reveals that while models often reject clearly unsafe commands, they struggle to anticipate and mitigate subtle, situational risks. Our results highlight critical limitations in current LLMs and provide a foundation for more targeted, modular improvements in safe embodied reasoning.
pdf
bib
abs
SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation
Aurick Qiao
|
Zhewei Yao
|
Samyam Rajbhandari
|
Yuxiong He
LLM inference for enterprise applications, such as summarization, RAG, and code-generation, typically observes much longer prompts than generations, leading to high prefill cost and response latency. We present SwiftKV, a novel model transformation and distillation procedure targeted at reducing the prefill compute (in FLOPs) of prompt tokens while preserving high generation quality. First, SwiftKV prefills later layers’ KV cache using an earlier layer’s output, allowing prompt tokens to skip those later layers. Second, SwiftKV employs a lightweight knowledge-preserving distillation procedure that can adapt existing LLMs with minimal accuracy impact. Third, SwiftKV can naturally incorporate KV cache compression to improve inference performance in low-memory scenarios. Our comprehensive experiments show that SwiftKV can effectively reduce prefill computation by 25-50% across several LLM families while incurring minimal quality degradation. In end-to-end inference serving, SwiftKV realizes up to 2x higher aggregate throughput and 60% lower time per output token. It can achieve a staggering 560 TFlops/GPU of normalized inference throughput, which translates to 16K tokens/s for Llama-3.1-70B. SwiftKV is open-sourced at https://github.com/snowflakedb/arctictraining and https://github.com/snowflakedb/arcticinference.
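A toy sketch of the prefill idea described above, under the assumption that later layers' KV caches can be projected from an earlier layer's output so prompt tokens skip the later layers; the module layout below is illustrative and is not the SwiftKV implementation.

```python
# Toy illustration (not SwiftKV's code): only the first `skip_after` layers run
# fully during prefill; later layers' KV caches are projected from the output of
# the last fully-run layer, so prompt tokens never pass through them.
import torch, torch.nn as nn

class ToyLayer(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.k_proj, self.v_proj = nn.Linear(d, d), nn.Linear(d, d)
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, h):                   # full layer compute
        return h + self.mlp(h)

    def kv_from(self, h):                   # cheap KV-only projection
        return self.k_proj(h), self.v_proj(h)

def prefill(layers, h, skip_after):
    kv_cache = []
    for i, layer in enumerate(layers):
        kv_cache.append(layer.kv_from(h))    # KV projected from the current hidden state
        if i < skip_after:                   # only early layers update h; later layers
            h = layer(h)                     # reuse the last fully-run layer's output
    return kv_cache

layers = nn.ModuleList(ToyLayer(16) for _ in range(8))
cache = prefill(layers, torch.randn(1, 5, 16), skip_after=4)
print(len(cache), cache[-1][0].shape)        # 8 KV entries, each of shape (1, 5, 16)
```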
pdf
bib
abs
Co-Eval: Augmenting LLM-based Evaluation with Machine Metrics
Ling-I Wu
|
Weijie Wu
|
Minyu Chen
|
Jianxin Xue
|
Guoqiang Li
Large language models (LLMs) are increasingly used as evaluators in natural language generation tasks, offering advantages in scalability and interpretability over traditional evaluation methods. However, existing LLM-based evaluations often suffer from biases and misalignment, particularly in domain-specific tasks, due to limited functional understanding and knowledge gaps. To address these challenges, we first investigate the relationship between an LLM-based evaluator’s familiarity with the target task and its evaluation performance. We then introduce the Co-Eval framework, which leverages a criteria planner model and optimized machine metrics to enhance the scalability and fairness of LLM-based evaluation. Experimental results on both general and domain-specific tasks demonstrate that Co-Eval reduces biases, achieving up to a 0.4903 reduction in self-preference bias, and improves alignment with human preferences, with gains of up to 0.324 in Spearman correlation.
pdf
bib
abs
Sali4Vid: Saliency-Aware Video Reweighting and Adaptive Caption Retrieval for Dense Video Captioning
MinJu Jeon
|
Si-Woo Kim
|
Ye-Chan Kim
|
HyunGee Kim
|
Dong-Jin Kim
Dense video captioning aims to temporally localize events in video and generate captions for each event. While recent works propose end-to-end models, they suffer from two limitations: (1) applying timestamp supervision only to text while treating all video frames equally, and (2) retrieving captions from fixed-size video chunks, overlooking scene transitions. To address these, we propose **Sali4Vid**, a simple yet effective saliency-aware framework. We introduce Saliency-aware Video Reweighting, which converts timestamp annotations into sigmoid-based frame importance weights, and Semantic-based Adaptive Caption Retrieval, which segments videos by frame similarity to capture scene transitions and improve caption retrieval. Sali4Vid achieves state-of-the-art results on YouCook2 and ViTT, demonstrating the benefit of jointly improving video weighting and retrieval for dense video captioning.
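A minimal sketch of how timestamp annotations could be converted into sigmoid-based frame importance weights as described above; the exact parameterization here (a product of two sigmoids with an assumed sharpness) is illustrative rather than the paper's formula.

```python
# Toy "timestamp -> soft frame weight" conversion: frames inside an annotated
# event get weight near 1, with sigmoid ramps at the boundaries. Parameter values
# are assumptions for illustration.
import numpy as np

def frame_weights(num_frames, fps, start_s, end_s, sharpness=2.0):
    t = np.arange(num_frames) / fps                      # frame timestamps in seconds
    rise = 1.0 / (1.0 + np.exp(-sharpness * (t - start_s)))
    fall = 1.0 / (1.0 + np.exp(-sharpness * (end_s - t)))
    return rise * fall                                    # ~1 inside [start_s, end_s], soft edges

w = frame_weights(num_frames=30, fps=1, start_s=8, end_s=20)
print(w.round(2))
```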
pdf
bib
abs
Semantic Networks Extracted from Students’ Think-Aloud Data are Correlated with Students’ Learning Performance
Pingjing Yang
|
Sullam Jeoung
|
Jennifer Cromley
|
Jana Diesner
When students reflect on their learning from a textbook via think-aloud processes, network representations can be used to capture the concepts and relations from these data. What can we learn from the resulting network representations about students’ learning processes, knowledge acquisition, and learning outcomes? This study brings methods from entity and relation extraction using classic and LLM-based methods to the application domain of educational psychology. We built a ground-truth baseline of relational data that represents relevant (to educational science), textbook-based information as a semantic network. Among the tested models, SPN4RE and LUKE achieved the best performance in extracting concepts and relations from students’ verbal data. Network representations of students’ verbalizations varied in structure, reflecting different learning processes. Correlating the students’ semantic networks with learning outcomes revealed that denser and more interconnected semantic networks were associated with more elaborated knowledge acquisition. Structural features such as the number of edges and surface overlap with textbook networks significantly correlated with students’ posttest performance.
pdf
bib
abs
Less is More: The Effectiveness of Compact Typological Language Representations
York Hay Ng
|
Phuong Hanh Hoang
|
En-Shiun Annie Lee
Linguistic feature datasets such as URIEL+ are valuable for modelling cross-lingual relationships, but their high dimensionality and sparsity, especially for low-resource languages, limit the effectiveness of distance metrics. We propose a pipeline to optimize the URIEL+ typological feature space by combining feature selection and imputation, producing compact yet interpretable typological representations. We evaluate these feature subsets on linguistic distance alignment and downstream tasks, demonstrating that reduced-size representations of language typology can yield more informative distance metrics and improve performance in multilingual NLP applications.
pdf
bib
abs
Sparse Activation Editing for Reliable Instruction Following in Narratives
Runcong Zhao
|
Chengyu Cao
|
Qinglin Zhu
|
Xiucheng Ly
|
Shun Shao
|
Lin Gui
|
Ruifeng Xu
|
Yulan He
Complex narrative contexts often challenge language models’ ability to follow instructions, and existing benchmarks fail to capture these difficulties. To address this, we propose Concise-SAE, a training-free framework that improves instruction following by identifying and editing instruction-relevant neurons using only natural language instructions, without requiring labelled data. To thoroughly evaluate our method, we introduce FreeInstruct, a diverse and realistic benchmark that highlights the challenges of instruction following in narrative-rich settings. While initially motivated by complex narratives, Concise-SAE demonstrates state-of-the-art instruction adherence across varied tasks without compromising generation quality.
pdf
bib
abs
Inceptive Transformers: Enhancing Contextual Representations through Multi-Scale Feature Learning Across Domains and Languages
Asif Shahriar
|
Rifat Shahriyar
|
M Saifur Rahman
Encoder transformer models compress information from all tokens in a sequence into a single [CLS] token to represent global context. This approach risks diluting fine-grained or hierarchical features, leading to information loss in downstream tasks where local patterns are important. To remedy this, we propose a lightweight architectural enhancement: an inception-style 1-D convolution module that sits on top of the transformer layer and augments token representations with multi-scale local features. This enriched feature space is then processed by a self-attention layer that dynamically weights tokens based on their task relevance. Experiments on five diverse tasks show that our framework consistently improves general-purpose, domain-specific, and multilingual models, outperforming baselines by 1% to 14% while maintaining efficiency. Ablation studies show that multi-scale convolution performs better than any single kernel and that the self-attention layer is critical for performance.
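A minimal sketch of an inception-style multi-scale 1-D convolution block over token representations, in the spirit of the enhancement described above; the kernel sizes, concatenation scheme, and residual projection are assumptions, not the paper's exact architecture.

```python
# Illustrative multi-scale 1-D convolution block: parallel branches with different
# kernel sizes are concatenated and projected back, augmenting each token with
# local features at several scales.
import torch, torch.nn as nn

class MultiScaleConv(nn.Module):
    def __init__(self, d_model, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(d_model, d_model, k, padding=k // 2) for k in kernel_sizes
        )
        self.proj = nn.Linear(d_model * len(kernel_sizes), d_model)

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        h = x.transpose(1, 2)                # Conv1d expects (batch, channels, seq_len)
        feats = torch.cat([branch(h) for branch in self.branches], dim=1)
        return self.proj(feats.transpose(1, 2)) + x   # residual connection

block = MultiScaleConv(d_model=32)
print(block(torch.randn(2, 10, 32)).shape)   # torch.Size([2, 10, 32])
```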
pdf
bib
abs
Causal Tree Extraction from Medical Case Reports: A Novel Task for Experts-like Text Comprehension
Sakiko Yahata
|
Zhen Wan
|
Fei Cheng
|
Sadao Kurohashi
|
Hisahiko Sato
|
Ryozo Nagai
Extracting causal relationships from a medical case report is essential for comprehending the case, particularly its diagnostic process. Since the diagnostic process is regarded as a bottom-up inference, causal relationships in cases naturally form a multi-layered tree structure. The existing tasks, such as medical relation extraction, are insufficient for capturing the causal relationships of an entire case, as they treat all relations equally without considering the hierarchical structure inherent in the diagnostic process. Thus, we propose a novel task, Causal Tree Extraction (CTE), which receives a case report and generates a causal tree with the primary disease as the root, providing an intuitive understanding of a case’s diagnostic process. Subsequently, we construct a Japanese case report CTE dataset, J-Casemap, propose a generation-based CTE method that outperforms the baseline by 20.2 points in the human evaluation, and introduce evaluation metrics that reflect clinician preferences. Further experiments also show that J-Casemap enhances the performance of solving other medical tasks, such as question answering.
pdf
bib
abs
OWL: Probing Cross-Lingual Recall of Memorized Texts via World Literature
Alisha Srivastava
|
Emir Kaan Korukluoglu
|
Minh Nhat Le
|
Duyen Tran
|
Chau Minh Pham
|
Marzena Karpinska
|
Mohit Iyyer
Large language models (LLMs) are known to memorize and recall English text from their pretraining data. However, the extent to which this ability generalizes to non-English languages or transfers across languages remains unclear. This paper investigates multilingual and cross-lingual memorization in LLMs, probing if memorized content in one language (e.g., English) can be recalled when presented in translation. To do so, we introduce OWL, a dataset of **31.5K** aligned excerpts from 20 books in ten languages, including English originals, official translations (Vietnamese, Spanish, Turkish), and new translations in six low-resource languages (Sesotho, Yoruba, Maithili, Malagasy, Setswana, Tahitian). We evaluate memorization across model families and sizes through three tasks: (1) **direct probing**, which asks the model to identify a book’s title and author; (2) **name cloze**, which requires predicting masked character names; and (3) **prefix probing**, which involves generating continuations. We find that some LLMs consistently recall content across languages, even for texts without existing translation. GPT-4o, for example, identifies authors and titles 69.4% of the time and masked entities 6.3% of the time in newly translated excerpts. While perturbations (e.g., masking characters, shuffling words) reduce accuracy, the model’s performance remains above chance level. Our results highlight the extent of cross-lingual memorization and provide insights into the differences between the models.
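For illustration, a toy version of the name-cloze probe described in task (2), which masks a character name in an excerpt and asks the model to fill it in; the prompt format below is assumed, not the one used in the paper.

```python
# Toy name-cloze prompt construction: replace a character name with [MASK] and
# ask the model to recover it. The instruction wording is an illustrative assumption.
def name_cloze(excerpt, character):
    masked = excerpt.replace(character, "[MASK]")
    return (
        "Fill in the [MASK] with the correct character name.\n"
        f"Passage: {masked}\nAnswer:"
    )

print(name_cloze("Elizabeth smiled at Mr. Darcy across the room.", "Darcy"))
```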
pdf
bib
abs
Enhanced Noun-Noun Compound Interpretation through Textual Enrichment
Bingyang Ye
|
Jingxuan Tu
|
James Pustejovsky
Interpreting Noun-Noun Compounds remains a persistent challenge for Large Language Models (LLMs) because the semantic relation between the modifier and the head is rarely stated explicitly. Recent benchmarks frame Noun-Noun Compound Interpretation as a multiple-choice question. While this setting allows LLMs to produce more controlled results, it still faces two key limitations: vague relation descriptions as options and the inability to handle polysemous compounds. We introduce a dual-faceted textual enrichment framework that augments prompts. Description enrichment paraphrases relations into event‐oriented descriptions instantiated with the target compound to explicitly surface the hidden event connecting head and modifier. Conditioned context enrichment identifies polysemous compounds leveraging qualia-role binding and assigns each compound with condition cues for disambiguation. Our method yields consistently higher accuracy across three LLM families. These gains suggest that surfacing latent compositional structure and contextual constraint is a promising path toward deeper semantic understanding in language models.
pdf
bib
abs
ICL CIPHERS: Quantifying ”Learning” in In-Context Learning via Substitution Ciphers
Zhouxiang Fang
|
Aayush Mishra
|
Muhan Gao
|
Anqi Liu
|
Daniel Khashabi
Recent works have suggested that In-Context Learning (ICL) operates in dual modes, i.e. task retrieval (remembering learned patterns from pre-training) and task learning (inference-time “learning” from demonstrations). However, disentangling these two modes remains a challenging goal. We introduce ICL CIPHERS, a class of task reformulations based on substitution ciphers borrowed from classic cryptography. In this approach, a subset of tokens in the in-context inputs is substituted with other (irrelevant) tokens, rendering English sentences less comprehensible to the human eye. However, by design, there is a latent, fixed pattern to this substitution, making it reversible. This bijective (reversible) cipher ensures that the task remains well-defined in some abstract sense, despite the transformations. It is a curious question whether LLMs can solve tasks reformulated by ICL CIPHERS with a BIJECTIVE mapping, which requires “deciphering” the latent cipher. We show that LLMs are better at solving tasks reformulated by ICL CIPHERS with BIJECTIVE mappings than the NON-BIJECTIVE (irreversible) baseline, providing a novel approach to quantify “learning” in ICL. While this gap is small, it is consistent across the board on four datasets and six models. Finally, our interpretability analysis shows evidence that LLMs can internally decode ciphered inputs.
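A small sketch of a bijective substitution cipher of the kind described above: the in-context vocabulary is permuted by a fixed, reversible mapping. The word-level vocabulary and sampling here are simplifications for illustration.

```python
# Illustrative bijective substitution cipher: a fixed permutation of the vocabulary
# is applied consistently to demonstration text, so the mapping remains invertible.
import random

def make_cipher(vocab, seed=0):
    rng = random.Random(seed)
    shuffled = vocab[:]
    rng.shuffle(shuffled)
    return dict(zip(vocab, shuffled))          # bijective: a permutation of the vocab

def encode(words, cipher):
    return [cipher.get(w, w) for w in words]

vocab = ["the", "movie", "was", "great", "terrible", "plot", "acting"]
cipher = make_cipher(vocab)
demo = "the movie was great".split()
print(encode(demo, cipher))                     # same sentence, consistently substituted
```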
pdf
bib
abs
Corrupted but Not Broken: Understanding and Mitigating the Negative Impacts of Corrupted Data in Visual Instruction Tuning
Yunhao Gou
|
Hansi Yang
|
Zhili Liu
|
Kai Chen
|
Yihan Zeng
|
Lanqing Hong
|
Zhenguo Li
|
Qun Liu
|
Bo Han
|
James Kwok
|
Yu Zhang
Visual Instruction Tuning (VIT) aims to enhance Multimodal Large Language Models (MLLMs), yet its effectiveness is often compromised by corrupted datasets with issues such as hallucinated content, incorrect responses, and poor OCR quality. Previous approaches to address these challenges have focused on refining datasets through high-quality data collection or rule-based filtering that can be costly or limited in scope. In this paper, we conduct a systematic investigation into the impact of corrupted data on MLLMs and discover that, although corrupted data degrade model performance, such adverse effects are largely reversible, and MLLMs are corrupted but not broken. Specifically, we find that disabling a small subset of parameters can almost fully restore performance. Moreover, corrupted MLLMs inherently possess the capability to differentiate between clean and corrupted samples, facilitating dataset cleaning without external intervention. Building on these insights, we introduce a corruption-robust training paradigm that significantly surpasses existing strategies for mitigating the effects of corrupted data.
pdf
bib
abs
Memory OS of AI Agent
Jiazheng Kang
|
Mingming Ji
|
Zhe Zhao
|
Ting Bai
Large Language Models (LLMs) face a crucial challenge from fixed context windows and inadequate memory management, leading to a severe shortage of long-term memory capabilities and limited personalization in the interactive experience with AI agents. To overcome this challenge, we innovatively propose a Memory Operating System, i.e., MemoryOS, to achieve comprehensive and efficient memory management for AI agents. Inspired by the memory management principles in operating systems, MemoryOS designs a hierarchical storage architecture and consists of four key modules: memory Storage, Updating, Retrieval, and Generation. Specifically, the architecture comprises three levels of storage units: short-term memory, mid-term memory, and long-term personal memory. Key operations within MemoryOS include dynamic updates between storage units: short-term to mid-term updates follow a dialogue-chain-based FIFO principle, while mid-term to long-term updates use a segmented page organization strategy. Our pioneering MemoryOS enables hierarchical memory integration and dynamic updating. Extensive experiments on the LoCoMo benchmark show an average improvement of 48.36% on F1 and 46.18% on BLEU-1 over the baselines on GPT-4o-mini, showing contextual coherence and personalized memory retention in long conversations.
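A toy sketch of a hierarchical memory store with a FIFO short-term-to-mid-term update, in the spirit of the design described above; the class name, capacities, and eviction granularity are illustrative choices, not the MemoryOS implementation.

```python
# Toy hierarchical memory: recent turns live in a bounded short-term store, and
# the oldest turn is evicted into mid-term memory in FIFO order when it overflows.
from collections import deque

class ToyMemory:
    def __init__(self, short_cap=4):
        self.short = deque(maxlen=short_cap)   # recent dialogue turns
        self.mid = []                          # evicted turns, kept for later retrieval

    def add_turn(self, turn):
        if len(self.short) == self.short.maxlen:
            self.mid.append(self.short.popleft())   # FIFO eviction into mid-term
        self.short.append(turn)

mem = ToyMemory()
for i in range(6):
    mem.add_turn(f"turn-{i}")
print(list(mem.short), mem.mid)
```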
pdf
bib
abs
Rule Discovery for Natural Language Inference Data Generation Using Out-of-Distribution Detection
Juyoung Han
|
Hyunsun Hwang
|
Changki Lee
Natural Language Inference (NLI) is a fundamental task in Natural Language Processing (NLP), yet adapting NLI models to new domains remains challenging due to the high cost of collecting domain-specific training data. While prior work proposed 15 sentence transformation rules to automate training data generation, these rules insufficiently capture the diversity of natural language. We propose a novel framework that combines Out-of-Distribution (OOD) detection and BERT-based clustering to identify premise–hypothesis pairs in the SNLI dataset that are not covered by existing rules and to discover four new transformation rules from them. Using these rules with Chain-of-Thought (CoT) prompting and Large Language Models (LLMs), we generate high-quality training data and augment the SNLI dataset. Our method yields consistent performance improvements across dataset sizes, achieving +0.85%p accuracy on 2k and +0.15%p on 550k samples. Furthermore, a distribution-aware augmentation strategy enhances performance across all scales. Beyond manual explanations, we extend our framework to automatically generated explanations (CoT-Ex), demonstrating that they provide a scalable alternative to human-written explanations and enable reliable rule discovery.
pdf
bib
abs
Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models
Zesen Lyu
|
Dandan Zhang
|
Wei Ye
|
Fangdi Li
|
Zhihang Jiang
|
Yao Yang
Spatial reasoning is a core component of human cognition, enabling individuals to perceive, comprehend, and interact with the physical world. It relies on a nuanced understanding of spatial structures and inter-object relationships, serving as the foundation for complex reasoning and decision-making. To investigate whether current vision-language models (VLMs) exhibit similar capability, we introduce Jigsaw-Puzzles, a novel benchmark consisting of 1,100 carefully curated real-world images with high spatial complexity. Based on this dataset, we design five tasks to rigorously evaluate VLMs’ spatial perception, structural understanding, and reasoning capabilities, while deliberately minimizing reliance on domain-specific knowledge to better isolate and assess the general spatial reasoning capability. We conduct a comprehensive evaluation across 24 state-of-the-art VLMs. The results show that even the strongest model, Gemini-2.5-Pro, achieves only 77.14% overall accuracy and performs particularly poorly on the Order Generation task, with only 30.00% accuracy, far below the 90%+ performance achieved by human participants. This persistent gap underscores the need for continued progress, positioning Jigsaw-Puzzles as a challenging and diagnostic benchmark for advancing spatial reasoning research in VLMs. Our project page is at https://zesen01.github.io/jigsaw-puzzles.
pdf
bib
abs
Definition Generation for Word Meaning Modeling: Monolingual, Multilingual, and Cross-Lingual Perspectives
Francesco Periti
|
Roksana Goworek
|
Haim Dubossarsky
|
Nina Tahmasebi
The task of Definition Generation has recently gained attention as an interpretable approach to modeling word meaning. Thus far, most research has been conducted in English, with limited work and resources for other languages. In this work, we expand Definition Generation beyond English to a suite of 22 languages and evaluate Llama-based models within a monolingual, multilingual, and cross-lingual setting. Our experiments show that monolingual fine-tuning consistently outperforms pretrained baselines, with the largest gains observed in languages with lower initial performance; and that multilingual fine-tuning does not consistently improve performance on the individual fine-tuning languages. Our cross-lingual evaluation reveals that models fine-tuned on a single language typically lose the ability to generate definitions in other languages, whereas multilingual models exhibit robust generalization even to languages unseen during fine-tuning.
pdf
bib
abs
Language Model Based Text-to-Audio Generation: Anti-Causally Aligned Collaborative Residual Transformers
Juncheng Wang
|
Chao Xu
|
Cheng Yu
|
Zhe Hu
|
Haoyu Xie
|
Guoqi Yu
|
Lei Shang
|
Shujun Wang
While language models (LMs) paired with residual vector quantization (RVQ) tokenizers have shown promise in text-to-audio (T2A) generation, they still lag behind diffusion-based models by a non-trivial margin. We identify a critical dilemma underpinning this gap: incorporating more RVQ layers improves audio reconstruction fidelity but exceeds the generation capacity of conventional LMs. To address this, we first analyze RVQ dynamics and uncover two key limitations: 1) orthogonality of features across RVQ layers hinders effective LMs training, and 2) descending semantic richness in tokens from deeper RVQ layers exacerbates exposure bias during autoregressive decoding. Based on these insights, we propose Siren, a novel LM-based framework that employs multiple isolated transformers with causal conditioning and anti-causal alignment via reinforcement learning. Extensive experiments demonstrate that Siren outperforms both existing LM-based and diffusion-based T2A systems, achieving state-of-the-art results. By bridging the representational strengths of LMs with the fidelity demands of audio synthesis, our approach repositions LMs as competitive contenders against diffusion models in T2A tasks. Moreover, by aligning audio representations with linguistic structures, Siren opens a promising pathway toward unified multi-modal generation frameworks.
pdf
bib
abs
HELENE: Hessian Layer-wise Clipping and Gradient Annealing for Accelerating Fine-tuning LLM with Zeroth-order Optimization
Huaqin Zhao
|
Jiaxi Li
|
Yi Pan
|
Shizhe Liang
|
Xiaofeng Yang
|
Fei Dou
|
Tianming Liu
|
Jin Lu
Fine-tuning large language models (LLMs) faces significant memory challenges due to the high cost of back-propagation. MeZO addresses this using zeroth-order (ZO) optimization, matching memory usage to inference but suffering from slow convergence due to varying curvatures across model parameters. To overcome this limitation, we propose HELENE, a scalable and memory-efficient optimizer that integrates annealed A-GNB gradients with diagonal Hessian estimation and layer-wise clipping as a second-order pre-conditioner. HELENE provably accelerates and stabilizes convergence by reducing dependence on the total parameter space and scaling with the largest layer dimension. Experiments on RoBERTa-large and OPT-1.3B show up to a 20× speedup over MeZO with an average accuracy improvement of 1.5%. HELENE supports full and parameter-efficient fine-tuning, outperforming several state-of-the-art optimizers.
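For context, a generic two-point zeroth-order (SPSA-style) update with a diagonal pre-conditioner and clipping, which sketches the ingredients discussed above in simplified form; this is not the HELENE algorithm, and the A-GNB gradient and annealing schedule are omitted.

```python
# Generic zeroth-order step: estimate the gradient from two function evaluations
# along a random direction, then apply a diagonal pre-conditioner and clipping.
import numpy as np

def zo_step(params, loss_fn, diag_hess, lr=1e-3, eps=1e-3, clip=1.0):
    z = np.random.randn(*params.shape)                 # shared perturbation direction
    g_scalar = (loss_fn(params + eps * z) - loss_fn(params - eps * z)) / (2 * eps)
    grad_est = g_scalar * z                            # rank-1 ZO gradient estimate
    precond = grad_est / (np.abs(diag_hess) + 1e-8)    # diagonal "Hessian" pre-conditioning
    precond = np.clip(precond, -clip, clip)            # clipping (one layer shown here)
    return params - lr * precond

w = np.array([1.0, -2.0])
loss = lambda p: ((p - np.array([0.5, 0.5])) ** 2).sum()
print(zo_step(w, loss, diag_hess=np.array([2.0, 2.0])))
```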
pdf
bib
abs
Zero-shot Multimodal Document Retrieval via Cross-modal Question Generation
Yejin Choi
|
Jaewoo Park
|
Janghan Yoon
|
Saejin Kim
|
Jaehyun Jeon
|
Youngjae Yu
Rapid advances in Multimodal Large Language Models (MLLMs) have extended information retrieval beyond text, enabling access to complex real-world documents that combine both textual and visual content. However, most documents are private, either owned by individuals or confined within corporate silos, and current retrievers struggle when faced with unseen domains or languages. To address this gap, we introduce PREMIR, a simple yet effective framework that leverages the broad knowledge of an MLLM to generate cross-modal pre-questions (preQs) before retrieval. Unlike earlier multimodal retrievers that embed entire documents as a single vector, PREMIR leverages preQs, decomposed from documents into finer token-level representations across modalities, enabling richer contextual understanding. Experiments show that PREMIR achieves state-of-the-art performance on out-of-distribution benchmarks, including closed-domain and multilingual settings, outperforming strong baselines across all metrics. We confirm the contribution of each component through in-depth ablation studies, and qualitative analyses of the generated preQs further highlight the framework’s robustness in real-world settings.
pdf
bib
abs
From Parameters to Performance: A Data-Driven Study on LLM Structure and Development
Suqing Wang
|
Zuchao Li
|
Shi Luohe
|
Bo Du
|
Hai Zhao
|
Yun Li
|
Qianren Wang
Large language models (LLMs) have achieved remarkable success across various domains, driving significant technological advancements and innovations. Despite the rapid growth in model scale and capability, systematic, data-driven research on how structural configurations affect performance remains scarce. To address this gap, we present a large-scale dataset encompassing diverse open-source LLM structures and their performance across multiple benchmarks. Leveraging this dataset, we conduct a systematic, data mining-driven analysis to validate and quantify the relationship between structural configurations and performance. Our study begins with a review of the historical development of LLMs and an exploration of potential future trends. We then analyze how various structural choices impact performance across benchmarks and further corroborate our findings using mechanistic interpretability techniques. By providing data-driven insights into LLM optimization, our work aims to guide the targeted development and application of future models.
pdf
bib
abs
Logical Reasoning with Outcome Reward Models for Test-Time Scaling
Ramya Keerthy Thatikonda
|
Wray Buntine
|
Ehsan Shareghi
Logical reasoning is a critical benchmark for evaluating the capabilities of large language models (LLMs), as it reflects their ability to derive valid conclusions from given premises. While the combination of test-time scaling with dedicated outcome or process reward models has opened up new avenues to enhance LLM performance in complex reasoning tasks, this space is under-explored in deductive logical reasoning. We present a set of Outcome Reward Models (ORMs) for deductive reasoning. To train the ORMs, we mainly generate data using Chain-of-Thought (CoT) with single and multiple samples. Additionally, we propose a novel tactic to further expand the type of errors covered in the training dataset of the ORM. In particular, we propose an echo generation technique that leverages LLMs’ tendency to reflect incorrect assumptions made in prompts to extract additional training data, covering previously unexplored error types. While a standard CoT chain may contain errors likely to be made by the reasoner, the echo strategy deliberately steers the model toward incorrect reasoning. We show that ORMs trained on CoT and echo-augmented data demonstrate improved performance on the FOLIO, JustLogic, and ProverQA datasets across four different LLMs.
pdf
bib
abs
Speculating LLMs’ Chinese Training Data Pollution from Their Tokens
Qingjie Zhang
|
Di Wang
|
Haoting Qian
|
Liu Yan
|
Tianwei Zhang
|
Ke Xu
|
Qi Li
|
Minlie Huang
|
Hewu Li
|
Han Qiu
Tokens are basic elements in the datasets for LLM training. It is well known that many tokens representing Chinese phrases in the vocabulary of GPT (4o/4o-mini/o1/o3/4.5/4.1/o4-mini) indicate content such as pornography or online gambling. Based on this observation, our goal is to locate Polluted Chinese (PoC) tokens in LLMs and study the relationship between PoC tokens’ existence and training data. (1) We give a formal definition and taxonomy of PoC tokens based on GPT’s vocabulary. (2) We build a PoC token detector via fine-tuning an LLM to label PoC tokens in vocabularies by considering both each token’s semantics and related content from search engines. (3) We speculate about training data pollution via PoC tokens’ appearances (token IDs). Experiments on GPT and 23 other LLMs indicate that PoC tokens are widespread, while GPT’s vocabulary behaves the worst: more than 23% of long Chinese tokens (i.e., tokens with more than two Chinese characters) relate to either pornography or online gambling. We validate the accuracy of our speculation method on well-known pre-training datasets such as C4 and the Pile. Then, considering GPT-4o, we speculate that the ratio of “波*野结衣”-related webpages in GPT-4o’s training data is around 0.5%.
pdf
bib
abs
NovelHopQA: Diagnosing Multi-Hop Reasoning Failures in Long Narrative Contexts
Abhay Gupta
|
Kevin Zhu
|
Vasu Sharma
|
Sean O’Brien
|
Michael Lu
Current large language models (LLMs) struggle to answer questions that span tens of thousands of tokens, especially when multi-hop reasoning is involved. While prior benchmarks explore long-context comprehension or multi-hop reasoning in isolation, none jointly vary context length and reasoning depth in natural narrative settings. We introduce NovelHopQA, the first benchmark to evaluate 1–4 hop QA over 64k–128k-token excerpts from 83 full-length public-domain novels. A keyword-guided pipeline builds hop-separated chains grounded in coherent storylines. We evaluate six state-of-the-art (SOTA) models and apply golden context filtering to ensure all questions are genuinely answerable. Human annotators validate both alignment and hop depth. We observe consistent accuracy drops with increased hops and context length, even in frontier models—revealing that sheer scale does not guarantee robust reasoning. Our failure mode analysis highlights common breakdowns, such as missed final-hop integration and long-range drift. NovelHopQA offers a controlled diagnostic setting to stress-test multi-hop reasoning at scale.
pdf
bib
abs
Weights-Rotated Preference Optimization for Large Language Models
Chenxu Yang
|
Ruipeng Jia
|
Mingyu Zheng
|
Naibin Gu
|
Zheng Lin
|
Siyuan Chen
|
Weichong Yin
|
Hua Wu
|
Weiping Wang
Despite the efficacy of Direct Preference Optimization (DPO) in aligning Large Language Models (LLMs), reward hacking remains a pivotal challenge. This issue emerges when LLMs excessively reduce the probability of rejected completions to achieve high rewards, without genuinely meeting their intended goals. As a result, generations become overly lengthy and lack diversity, and previously acquired knowledge is catastrophically forgotten. We investigate the underlying reason behind this issue, which is representation redundancy caused by neuron collapse in the parameter space. Hence, we propose a novel Weights-Rotated Preference Optimization (RoPO) algorithm, which implicitly constrains the output layer logits with the KL divergence inherited from DPO and explicitly constrains the intermediate hidden states by fine-tuning on a multi-granularity orthogonal matrix. This design prevents the policy model from deviating too far from the reference model, thereby retaining the knowledge and expressive capabilities acquired during pre-training and SFT stages. Our RoPO achieves up to a 3.27-point improvement on AlpacaEval 2, and surpasses the best baseline by 6.2 to 7.5 points on MT-Bench with merely 0.015% of the trainable parameters, demonstrating its effectiveness in alleviating the reward hacking problem of DPO.
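One way to picture fine-tuning through an orthogonal rotation of a frozen weight, as a loose illustration of the "weights-rotated" idea above: parameterize a rotation as the matrix exponential of a skew-symmetric matrix and train only that. The parameterization and single-granularity setup are assumptions for illustration, not the RoPO method.

```python
# Illustrative sketch: keep the pre-trained linear weight frozen and learn only an
# orthogonal rotation applied to it (exp of a skew-symmetric matrix is orthogonal).
import torch, torch.nn as nn

class RotatedLinear(nn.Module):
    def __init__(self, frozen_linear):
        super().__init__()
        d = frozen_linear.out_features
        self.base = frozen_linear
        for p in self.base.parameters():
            p.requires_grad_(False)               # pre-trained weight stays fixed
        self.A = nn.Parameter(torch.zeros(d, d))  # only the rotation parameters train

    def forward(self, x):
        R = torch.matrix_exp(self.A - self.A.t())         # orthogonal rotation
        return x @ (R @ self.base.weight).t() + self.base.bias

layer = RotatedLinear(nn.Linear(8, 8))
print(layer(torch.randn(2, 8)).shape)             # torch.Size([2, 8])
```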
pdf
bib
abs
The Stepwise Deception: Simulating the Evolution from True News to Fake News with LLM Agents
Yuhan Liu
|
Zirui Song
|
Juntian Zhang
|
Xiaoqing Zhang
|
Xiuying Chen
|
Rui Yan
With the growing spread of misinformation online, understanding how true news evolves into fake news has become crucial for early detection and prevention. However, previous research has often assumed fake news inherently exists rather than exploring its gradual formation. To address this gap, we propose FUSE (Fake news evolUtion Simulation framEwork), a novel Large Language Model (LLM)-based simulation approach explicitly focusing on fake news evolution from real news. Our framework models a social network with four distinct types of LLM agents commonly observed in daily interactions: spreaders who propagate information, commentators who provide interpretations, verifiers who fact-check, and standers who observe passively, simulating realistic daily interactions that progressively distort true news. To quantify these gradual distortions, we develop FUSE-EVAL, a comprehensive evaluation framework measuring truth deviation along multiple linguistic and semantic dimensions. Experiments demonstrate that FUSE accurately reproduces known fake news evolution scenarios, aligns closely with human judgment, and highlights the importance of timely intervention at early stages. Our framework is extensible, enabling future research on broader scenarios of fake news: https://github.com/LiuYuHan31/FUSE
pdf
bib
abs
How to inject knowledge efficiently? Knowledge Infusion Scaling Law for Pre-training Large Language Models
Kangtao Lv
|
Haibin Chen
|
Yujin Yuan
|
Langming Liu
|
Shilei Liu
|
Yongwei Wang
|
Wenbo Su
|
Bo Zheng
Large language models (LLMs) have attracted significant attention due to their impressive general capabilities across diverse downstream tasks. However, without domain-specific optimization, they often underperform on specialized knowledge benchmarks and even produce hallucinations. Recent studies show that strategically infusing domain knowledge during pretraining can substantially improve downstream performance. A critical challenge lies in balancing this infusion trade-off: injecting too little domain-specific data yields insufficient specialization, whereas excessive infusion triggers catastrophic forgetting of previously acquired knowledge. In this work, we focus on the phenomenon of memory collapse induced by over-infusion. Through systematic experiments, we make two key observations: 1) Critical collapse point: each model exhibits a threshold beyond which its knowledge retention capabilities sharply degrade. 2) Scale correlation: these collapse points scale consistently with the model’s size. Building on these insights, we propose a knowledge infusion scaling law that predicts the optimal amount of domain knowledge to inject into large LLMs by analyzing their smaller counterparts. Extensive experiments across different model sizes and pretraining token budgets validate both the effectiveness and generalizability of our scaling law.
pdf
bib
abs
SMEC: Rethinking Matryoshka Representation Learning for Retrieval Embedding Compression
Biao Zhang
|
Lixin Chen
|
Tong Liu
|
Bo Zheng
Large language models (LLMs) generate high-dimensional embeddings that capture rich semantic and syntactic information. However, high-dimensional embeddings exacerbate computational complexity and storage requirements, thereby hindering practical deployment. To address these challenges, we propose a novel training framework named Sequential Matryoshka Embedding Compression (SMEC). This framework introduces the Sequential Matryoshka Representation Learning (SMRL) method to mitigate gradient variance during training, the Adaptive Dimension Selection (ADS) module to reduce information degradation during dimension pruning, and the Selectable Cross-batch Memory (S-XBM) module to enhance unsupervised learning between high- and low-dimensional embeddings. Experiments on image, text, and multimodal datasets demonstrate that SMEC achieves significant dimensionality reduction while maintaining performance. For instance, on the BEIR dataset, our approach improves the performance of compressed LLM2Vec embeddings (256 dimensions) by 1.1 points and 2.7 points compared to the Matryoshka-Adaptor and Search-Adaptor models, respectively.
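For background, a minimal sketch of Matryoshka-style representation learning, the starting point this work rethinks: the same embedding is supervised at several truncated dimensionalities so that prefixes remain usable after compression. The loss form, dimensions, and temperature below are illustrative assumptions, not the SMEC training objective.

```python
# Minimal Matryoshka-style contrastive loss: the same query/document embeddings are
# truncated to several prefix lengths and each truncation is supervised in-batch.
import torch
import torch.nn.functional as F

def matryoshka_loss(query_emb, doc_emb, dims=(64, 128, 256)):
    total = 0.0
    for d in dims:
        q = F.normalize(query_emb[:, :d], dim=-1)   # truncate to the first d dimensions
        p = F.normalize(doc_emb[:, :d], dim=-1)
        logits = q @ p.t() / 0.05                    # in-batch contrastive logits
        labels = torch.arange(q.size(0))
        total = total + F.cross_entropy(logits, labels)
    return total / len(dims)

q, p = torch.randn(8, 256), torch.randn(8, 256)
print(matryoshka_loss(q, p).item())
```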
pdf
bib
abs
Reverse Prompt Engineering: A Zero-Shot, Genetic Algorithm Approach to Language Model Inversion
Hanqing Li
|
Diego Klabjan
We explore a new language model inversion problem under strict black-box, zero-shot, and limited data conditions. We propose a novel training-free framework that reconstructs prompts using only a limited number of text outputs from a language model. Existing methods rely on the availability of a large number of outputs for both training and inference, an assumption that is unrealistic in the real world, and they can sometimes produce garbled text. In contrast, our approach, which relies on limited resources, consistently yields coherent and semantically meaningful prompts. Our framework leverages a large language model together with an optimization process inspired by the genetic algorithm to effectively recover prompts. Experimental results on several datasets derived from public sources indicate that our approach achieves high-quality prompt recovery and generates prompts more semantically and functionally aligned with the originals than current state-of-the-art methods. Additionally, the introduced use-case studies demonstrate the method’s strong potential for generating high-quality text data from perturbed prompts.
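A genetic-algorithm loop of this kind can be sketched as follows. The `generate`, `fitness`, `mutate`, and `crossover` functions here are crude stand-ins (the paper uses an LLM both to query the black box and to propose prompt variants), so treat this only as an outline of the search procedure, not the authors' method.

```python
import random

def generate(prompt: str) -> str:
    """Placeholder for querying the black-box model under inversion."""
    return "output for: " + prompt

def fitness(candidate: str, observed_outputs: list[str]) -> float:
    """Assumed fitness: token overlap between the candidate's output and the observed outputs."""
    produced = set(generate(candidate).lower().split())
    score = 0.0
    for out in observed_outputs:
        ref = set(out.lower().split())
        score += len(produced & ref) / max(len(ref), 1)
    return score / len(observed_outputs)

def mutate(prompt: str) -> str:
    """Placeholder mutation; the paper relies on an LLM to propose edited prompt variants."""
    words = prompt.split()
    if words:
        words[random.randrange(len(words))] = "<edited>"
    return " ".join(words)

def crossover(a: str, b: str) -> str:
    wa, wb = a.split(), b.split()
    return " ".join(wa[: len(wa) // 2] + wb[len(wa) // 2 :])

def recover_prompt(observed_outputs, seeds, generations=10, pop_size=8):
    population = list(seeds)
    for _ in range(generations):
        scored = sorted(population, key=lambda p: fitness(p, observed_outputs), reverse=True)
        parents = scored[: pop_size // 2]
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=lambda p: fitness(p, observed_outputs))

outputs = ["The capital of France is Paris."]
seeds = ["What is the capital of France?", "Tell me about Paris.",
         "Name a European capital.", "Where is the Eiffel Tower?"]
print(recover_prompt(outputs, seeds, generations=5, pop_size=4))
```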
pdf
bib
abs
DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning
Hang Wu
|
Hongkai Chen
|
Yujun Cai
|
Chang Liu
|
Qingwen Ye
|
Ming-Hsuan Yang
|
Yiwei Wang
Grounding natural language queries in graphical user interfaces (GUIs) poses unique challenges due to the diversity of visual elements, spatial clutter, and the ambiguity of language. In this paper, we introduce DiMo-GUI, a training-free framework for GUI grounding that leverages two core strategies: dynamic visual grounding and modality-aware optimization. Instead of treating the GUI as a monolithic image, our method splits the input into textual elements and iconic elements, allowing the model to reason over each modality independently using general-purpose vision-language models. When predictions are ambiguous or incorrect, DiMo-GUI dynamically focuses attention by generating candidate focal regions centered on the model’s initial predictions and incrementally zooms into subregions to refine the grounding result. This hierarchical refinement process helps disambiguate visually crowded layouts without the need for additional training or annotations. We evaluate our approach on standard GUI grounding benchmarks and demonstrate consistent improvements over baseline inference pipelines, highlighting the effectiveness of combining modality separation with region-focused reasoning.
pdf
bib
abs
SocioBench: Modeling Human Behavior in Sociological Surveys with Large Language Models
Jia Wang
|
Ziyu Zhao
|
Tingjuntao Ni
|
Zhongyu Wei
Large language models (LLMs) show strong potential for simulating human social behaviors and interactions, yet lack large-scale, systematically constructed benchmarks for evaluating their alignment with real-world social attitudes. To bridge this gap, we introduce SocioBench—a comprehensive benchmark derived from the annually collected, standardized survey data of the International Social Survey Programme (ISSP). The benchmark aggregates over 480,000 real respondent records from more than 30 countries, spanning 10 sociological domains and over 40 demographic attributes. Our experiments indicate that LLMs achieve only 30–40% accuracy when simulating individuals in complex survey scenarios, with statistically significant differences across domains and demographic subgroups. These findings highlight several limitations of current LLMs in survey scenarios, including insufficient individual-level data coverage, inadequate scenario diversity, and missing group-level modeling. We have open-sourced SocioBench at https://github.com/JiaWANG-TJ/SocioBench.
pdf
bib
abs
Financial Risk Relation Identification through Dual-view Adaptation
Wei-Ning Chiu
|
Yu-Hsiang Wang
|
Andy Hsiao
|
Yu-Shiang Huang
|
Chuan-Ju Wang
A multitude of interconnected risk events—ranging from regulatory changes to geopolitical tensions—can trigger ripple effects across firms. Identifying inter-firm risk relations is thus crucial for applications like portfolio management and investment strategy. Traditionally, such assessments rely on expert judgment and manual analysis, which are, however, subjective, labor-intensive, and difficult to scale. To address this, we propose a systematic method for extracting inter-firm risk relations using Form 10-K filings—authoritative, standardized financial documents—as our data source. Leveraging recent advances in natural language processing, our approach captures implicit and abstract risk connections through unsupervised fine-tuning based on chronological and lexical patterns in the filings. This enables the development of a domain-specific financial encoder with a deeper contextual understanding and introduces a quantitative risk relation score for transparent, interpretable analysis. Extensive experiments demonstrate that our method outperforms strong baselines across multiple evaluation settings.
pdf
bib
abs
CopySpec: Accelerating LLMs with Speculative Copy-and-Paste
Razvan-Gabriel Dumitru
|
Minglai Yang
|
Vikas Yadav
|
Mihai Surdeanu
We introduce CopySpec, a simple yet effective technique to tackle the inefficiencies LLMs face when generating responses that closely resemble previous outputs or responses that can be verbatim extracted from context. CopySpec identifies repeated sequences in the model’s chat history or context and speculates that the same tokens will follow, enabling seamless copying without compromising output quality and without requiring additional GPU memory. To evaluate the effectiveness of our approach, we conducted experiments using seven LLMs and five datasets: MT-Bench, CNN/DM, GSM8K, HumanEval, and our newly created dataset, MT-Redundant. MT-Redundant, introduced in this paper, transforms the second turn of MT-Bench into a request for variations of the first turn’s answer, simulating real-world scenarios where users request modifications to prior responses. Our results demonstrate significant speed-ups: up to 2.35x on CNN/DM, 3.08x on the second turn of select MT-Redundant categories, and 2.66x on the third turn of GSM8K’s self-correction tasks. Importantly, we show that CopySpec integrates seamlessly with speculative decoding, yielding an average 49% additional speed-up over speculative decoding for the second turn of MT-Redundant across all eight categories. While LLMs, even with speculative decoding, suffer from slower inference as context size grows, CopySpec leverages larger contexts to accelerate inference, making it a faster complementary solution. Our code and dataset are publicly available at https://github.com/RazvanDu/CopySpec.
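The drafting step can be illustrated with a few lines of token matching: if the current suffix of the sequence already appeared earlier in the context, the tokens that followed that earlier occurrence are proposed as drafts and then verified by the target model in a single forward pass, as in ordinary speculative decoding (verification not shown). The match length and copy length below are arbitrary choices for illustration, not the paper's settings.

```python
def copy_draft(context_ids: list[int], min_match: int = 4, max_copy: int = 16) -> list[int]:
    """Speculate the next tokens by copying from an earlier occurrence of the current suffix."""
    if len(context_ids) < min_match + 1:
        return []
    suffix = context_ids[-min_match:]
    # Scan backwards over occurrences that end before the current suffix.
    for start in range(len(context_ids) - min_match - 1, -1, -1):
        if context_ids[start:start + min_match] == suffix:
            return context_ids[start + min_match : start + min_match + max_copy]
    return []

# Toy example: [7, 8, 9] already occurred earlier followed by 4, 5, 6, ...,
# so those continuation tokens are proposed as drafts.
print(copy_draft([1, 7, 8, 9, 4, 5, 6, 2, 3, 7, 8, 9], min_match=3))
```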
pdf
bib
abs
GRASP: Replace Redundant Layers with Adaptive Singular Parameters for Efficient Model Compression
Kainan Liu
|
Yong Zhang
|
Ning Cheng
|
Zhitao Li
|
Shaojun Wang
|
Jing Xiao
Recent studies have demonstrated that many layers are functionally redundant in large language models (LLMs), enabling model compression by removing these layers to reduce inference cost. While such approaches can improve efficiency, indiscriminate layer pruning often results in significant performance degradation. In this paper, we propose **GRASP** (**G**radient-based **R**etention of **A**daptive **S**ingular **P**arameters), a novel compression framework that mitigates this issue by preserving sensitivity-aware singular values. Unlike direct layer pruning, GRASP leverages gradient-based attribution on a small calibration dataset to adaptively identify and retain critical singular components. By replacing redundant layers with only a minimal set of parameters, GRASP achieves efficient compression while maintaining strong performance with minimal overhead. Experiments across multiple LLMs show that GRASP consistently outperforms existing compression methods, achieving 90% of the original model’s performance under a 20% compression ratio.
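A rough sketch of gradient-guided singular-component retention is shown below: the weight of a layer marked redundant is factorized with an SVD, each component is scored by a first-order sensitivity of the calibration loss, and only the top components are kept as a compact low-rank replacement. This is an illustrative approximation of the idea; the actual GRASP attribution and how the replacement is wired into the network may differ.

```python
import torch

def low_rank_replacement(weight: torch.Tensor, weight_grad: torch.Tensor, keep: int):
    """Keep only the singular components of `weight` with the highest first-order sensitivity.

    weight:      (out, in) weight matrix of a redundant layer.
    weight_grad: gradient of a calibration loss w.r.t. `weight`.
    Returns (A, B) with A @ B approximating the retained part of `weight`.
    """
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    # Sensitivity of the loss to each singular value: |s_i * u_i^T G v_i|.
    grad_s = torch.einsum("oi,or,ri->r", weight_grad, U, Vh)
    scores = (S * grad_s).abs()
    idx = torch.topk(scores, keep).indices
    A = U[:, idx] * S[idx]   # (out, keep)
    B = Vh[idx, :]           # (keep, in)
    return A, B

W = torch.randn(64, 64)
G = torch.randn(64, 64)      # stand-in for a calibration-set gradient
A, B = low_rank_replacement(W, G, keep=8)
approx = A @ B               # compact replacement for the pruned layer
```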
pdf
bib
abs
GraphAgent: Agentic Graph Language Assistant
Yuhao Yang
|
Jiabin Tang
|
Lianghao Xia
|
Xingchen Zou
|
Yuxuan Liang
|
Chao Huang
Real-world data combines structured (e.g., graph connections) and unstructured (e.g., text, visuals) formats, capturing explicit relationships (e.g., social links) and implicit semantic interdependencies (e.g., knowledge graphs). We propose GraphAgent, an automated agent pipeline addressing both explicit and implicit graph-enhanced semantic dependencies for predictive (e.g., node classification) and generative (e.g., text generation) tasks. GraphAgent integrates three components: (i) a Graph Generator Agent creating knowledge graphs for semantic dependencies; (ii) a Task Planning Agent interpreting user queries and formulating tasks via self-planning; and (iii) a Task Execution Agent automating task execution with tool matching. These agents combine language and graph language models to reveal complex relational and semantic patterns. Extensive experiments on diverse datasets validate GraphAgent’s effectiveness in graph-related predictive and text generative tasks. GraphAgent is open-sourced at: https://anonymous.4open.science/r/GraphAgent-Submit-6F52/.
pdf
bib
abs
DDO: Dual-Decision Optimization for LLM-Based Medical Consultation via Multi-Agent Collaboration
Zhihao Jia
|
Mingyi Jia
|
Junwen Duan
|
Jianxin Wang
Large Language Models (LLMs) demonstrate strong generalization and reasoning abilities, making them well-suited for complex decision-making tasks such as medical consultation (MC). However, existing LLM-based methods often fail to capture the dual nature of MC, which entails two distinct sub-tasks: symptom inquiry, a sequential decision-making process, and disease diagnosis, a classification problem. This mismatch often results in ineffective symptom inquiry and unreliable disease diagnosis. To address this, we propose DDO, a novel LLM-based framework that performs Dual-Decision Optimization by decoupling the two sub-tasks and optimizing them with distinct objectives through a collaborative multi-agent workflow. Experiments on three real-world MC datasets show that DDO consistently outperforms existing LLM-based approaches and achieves competitive performance with state-of-the-art generation-based methods, demonstrating its effectiveness in the MC task. The code is available at https://github.com/zh-jia/DDO.
pdf
bib
abs
FedMABench: Benchmarking Mobile GUI Agents on Decentralized Heterogeneous User Data
WenHao Wang
|
Zijie Yu
|
Rui Ye
|
Jianqing Zhang
|
Guangyi Liu
|
Liang Liu
|
Siheng Chen
|
Yanfeng Wang
Mobile GUI agents have attracted tremendous research interest recently. Traditional approaches to mobile agent training rely on centralized data collection, leading to high cost and limited scalability. Distributed training utilizing federated learning offers an alternative by harnessing real-world user data, providing scalability and reducing costs. However, pivotal challenges, including the absence of standardized benchmarks, hinder progress in this field. To tackle the challenges, we introduce FedMABench, the first benchmark for federated training and evaluation of mobile GUI agents, specifically designed for heterogeneous scenarios. FedMABench features 6 datasets with 30+ subsets, 8 federated algorithms, 10+ base models, and over 800 apps across 5 categories, providing a comprehensive framework for evaluating mobile agents across diverse environments. Through extensive experiments, we uncover several key insights: federated algorithms consistently outperform local training; the distribution of specific apps plays a crucial role in heterogeneity; and, even apps from distinct categories can exhibit correlations during training. FedMABench is publicly available at: https://github.com/wwh0411/FedMABench.
pdf
bib
abs
VLA-Mark: A cross modal watermark for large vision-language alignment models
Shuliang Liu
|
Zheng Qi
|
Jesse Jiaxi Xu
|
Yibo Yan
|
Junyan Zhang
|
He Geng
|
Aiwei Liu
|
Peijie Jiang
|
Jia Liu
|
Yik-Cheung Tam
|
Xuming Hu
Vision-language models demand watermarking solutions that protect intellectual property without compromising multimodal coherence. Existing text watermarking methods disrupt visual-textual alignment through biased token selection and static strategies, leaving semantic-critical concepts vulnerable. We propose VLA-Mark, a vision-aligned framework that embeds detectable watermarks while preserving semantic fidelity through cross-modal coordination. Our approach integrates multiscale visual-textual alignment metrics, combining localized patch affinity, global semantic coherence, and contextual attention patterns, to guide watermark injection without model retraining. An entropy-sensitive mechanism dynamically balances watermark strength and semantic preservation, prioritizing visual grounding during low-uncertainty generation phases. Experiments show 7.4% lower PPL and 26.6% higher BLEU than conventional methods, with near-perfect detection (98.8% AUC). The framework demonstrates 96.1% resilience against attacks such as paraphrasing and synonym substitution, while maintaining text-visual consistency, establishing new standards for quality-preserving multimodal watermarking.
pdf
bib
abs
Sentence Smith: Controllable Edits for Evaluating Text Embeddings
Hongji Li
|
Andrianos Michail
|
Reto Gubelmann
|
Simon Clematide
|
Juri Opitz
Controllable and transparent text generation has been a long-standing goal in NLP. Almost as long-standing is a general idea for addressing this challenge: Parsing text to a symbolic representation, and generating from it. However, earlier approaches were hindered by parsing and generation insufficiencies. Using modern parsers and a safety supervision mechanism, we show how close current methods come to this goal. Concretely, we propose the Sentence Smith framework for English, which has three steps: 1. Parsing a sentence into a semantic graph. 2. Applying human-designed semantic manipulation rules. 3. Generating text from the manipulated graph. A final entailment check (4.) verifies the validity of the applied transformation. To demonstrate our framework’s utility, we use it to induce hard negative text pairs that challenge text embedding models. Since the controllable generation makes it possible to clearly isolate different types of semantic shifts, we can evaluate text embedding models in a fine-grained way, also addressing an issue in current benchmarking where linguistic phenomena remain opaque. Human validation confirms that our transparent generation process produces texts of good quality. Notably, our way of generation is very resource-efficient, since it relies only on smaller neural networks.
pdf
bib
abs
ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning
Yu Sun
|
Xingyu Qian
|
Weiwen Xu
|
Hao Zhang
|
Chenghao Xiao
|
Long Li
|
Deli Zhao
|
Wenbing Huang
|
Tingyang Xu
|
Qifeng Bai
|
Yu Rong
Reasoning-based large language models have excelled in mathematics and programming, yet their potential in knowledge-intensive medical question answering remains underexplored and insufficiently validated in clinical contexts. To bridge this gap, we introduce ReasonMed, the largest medical reasoning dataset to date, comprising 370k high-quality examples distilled from 1.75 million initial reasoning paths generated by complementary LLMs and curated through a cost-efficient easy-medium-difficult (EMD) pipeline. ReasonMed is built through a multi-agent generation, verification, and refinement process, in which an Error Refiner improves reasoning paths by correcting error-prone steps identified by a verifier. Using ReasonMed, we investigate effective strategies for training medical reasoning models and find that integrating detailed CoT reasoning with concise answer summaries yields the most robust fine-tuning results. Models trained on ReasonMed set a new benchmark: ReasonMed-7B surpasses the prior best sub-10B models by 4.17% and even exceeds LLaMA3.1-70B on PubMedQA by 4.60%. When scaled to ReasonMed-14B, it remains highly competitive, underscoring consistent scaling potential. The codes and datasets are available at https://github.com/YuSun-Work/ReasonMed.
pdf
bib
abs
Decoding Dense Embeddings: Sparse Autoencoders for Interpreting and Discretizing Dense Retrieval
Seongwan Park
|
Taeklim Kim
|
Youngjoong Ko
Despite their strong performance, Dense Passage Retrieval (DPR) models suffer from a lack of interpretability. In this work, we propose a novel interpretability framework that leverages Sparse Autoencoders (SAEs) to decompose previously uninterpretable dense embeddings from DPR models into distinct, interpretable latent concepts. We generate natural language descriptions for each latent concept, enabling human interpretations of both the dense embeddings and the query-document similarity scores of DPR models. We further introduce Concept-Level Sparse Retrieval (CL-SR), a retrieval framework that directly utilizes the extracted latent concepts as indexing units. CL-SR effectively combines the semantic expressiveness of dense embeddings with the transparency and efficiency of sparse representations. We show that CL-SR achieves high index-space and computational efficiency while maintaining robust performance across vocabulary and semantic mismatches.
pdf
bib
abs
UICOMPASS: UI Map Guided Mobile Task Automation via Adaptive Action Generation
Yuanzhang Lin
|
Zhe Zhang
|
He Rui
|
Qingao Dong
|
Mingyi Zhou
|
Jing Zhang
|
Xiang Gao
|
Hailong Sun
Mobile task automation is an emerging technology that leverages AI to automatically execute routine tasks based on users’ commands on mobile devices like Android, thus enhancing efficiency and productivity. While large language models (LLMs) excel at general mobile tasks through training on massive datasets, they struggle with app-specific workflows. To solve this problem, we design UI Map, a structured representation of the target app’s UI information. We further propose a UI Map-guided LLM-based approach, UICompass, to automate mobile tasks. Specifically, UICompass first leverages static analysis and LLMs to automatically build the UI Map from either an app’s source code or its byte code (i.e., APK package). During task execution, UICompass mines task-relevant information from the UI Map to feed into the LLMs, generates a planned path, and adaptively adjusts the path based on the actual app state and action history. Experimental results demonstrate that UICompass achieves a 15.87% higher task execution success rate than SOTA approaches. Even when only the APK is available, UICompass maintains superior performance, demonstrating its applicability to closed-source apps.
pdf
bib
abs
Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers
Tommaso Green
|
Martin Gubri
|
Haritz Puerto
|
Sangdoo Yun
|
Seong Joon Oh
We study privacy leakage in the reasoning traces of large reasoning models used as personal agents which handle sensitive user data. Unlike final outputs, reasoning traces are often assumed to be internal and safe. We challenge this assumption by showing that reasoning traces frequently contain sensitive user data, which can be extracted via prompt injections or accidentally leak into outputs. Through probing and agentic evaluations, we demonstrate that test-time compute approaches, particularly increased reasoning steps, amplify such leakage. While increasing the budget of those test-time compute approaches makes models more cautious in their final answers, it also leads them to reason more verbosely and leak more in their own thinking. This reveals a core tension: reasoning improves utility but enlarges the privacy attack surface. We argue that safety efforts must extend to the model’s internal thinking, not just its outputs.
pdf
bib
abs
Model Unlearning via Sparse Autoencoder Subspace Guided Projections
Xu Wang
|
Zihao Li
|
Benyou Wang
|
Yan Hu
|
Difan Zou
Large language models (LLMs) store vast amounts of information, making them powerful yet raising privacy and safety concerns when selective knowledge removal is required. Existing unlearning strategies, ranging from gradient-based fine-tuning and model editing to sparse autoencoder (SAE) steering, either lack interpretability or fail to provide a robust defense against adversarial prompts. We propose **S**AE–Guided **S**ubspace **P**rojection **U**nlearning (**SSPU**), a novel framework that leverages SAE features to drive targeted updates in the model’s parameter space, enabling precise, interpretable, and robust unlearning. SSPU’s three-stage pipeline performs data-driven layer and feature selection, subspace construction via QR decomposition, and constrained optimization that steers activations into an “irrelevant” subspace while preserving retained knowledge. Overall, we use SAE features to construct a subspace that supervises unlearning, refining the loss and adding a regularization term to guide interpretable parameter updates. In experiments on the WMDP–Cyber forget set and three utility benchmarks (MMLU, TruthfulQA, GSM8K), SSPU reduces harmful knowledge accuracy by 3.22% compared to the strongest baseline. It also improves adversarial robustness, lowering malicious accuracy under jailbreak prompts compared to baselines. Our findings expose the limitations of prior unlearning methods and demonstrate how interpretable subspace-guided optimization can achieve robust, controllable model behavior.
pdf
bib
abs
ConvSearch-R1: Enhancing Query Reformulation for Conversational Search with Reasoning via Reinforcement Learning
Changtai Zhu
|
Siyin Wang
|
Ruijun Feng
|
Kai Song
|
Xipeng Qiu
Conversational search systems require effective handling of context-dependent queries that often contain ambiguity, omission, and coreference. Conversational Query Reformulation (CQR) addresses this challenge by transforming these queries into self-contained forms suitable for off-the-shelf retrievers. However, existing CQR approaches suffer from two critical constraints: high dependency on costly external supervision from human annotations or large language models, and insufficient alignment between the rewriting model and downstream retrievers. We present ConvSearch-R1, the first self-driven framework that completely eliminates dependency on external rewrite supervision by leveraging reinforcement learning to optimize reformulation directly through retrieval signals. Our novel two-stage approach combines Self-Driven Policy Warm-Up to address the cold-start problem through retrieval-guided self-distillation, followed by Retrieval-Guided Reinforcement Learning with a specially designed rank-incentive reward shaping mechanism that addresses the sparsity issue in conventional retrieval metrics. Extensive experiments on TopiOCQA and QReCC datasets demonstrate that ConvSearch-R1 significantly outperforms previous state-of-the-art methods, achieving over 10% improvement on the challenging TopiOCQA dataset while using smaller 3B parameter models without any external supervision.
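The rank-incentive idea can be illustrated with a shaped reward that increases smoothly as the gold passage climbs the ranking, instead of the sparse 0/1 signal of a hit@k metric. The exact shaping function used by ConvSearch-R1 is not specified here, so the log-discounted form below is only an assumed example.

```python
import math
from typing import Optional

def rank_incentive_reward(gold_rank: Optional[int], max_rank: int = 100) -> float:
    """A smooth, rank-sensitive reward as a stand-in for the paper's shaping scheme.

    gold_rank: 1-based rank of the gold passage when retrieving with the rewritten
               query, or None if it is not retrieved within `max_rank`.
    A sparse hit@k metric would give 0 for most rewrites; this shaped variant still
    rewards moving the gold passage from, say, rank 80 up to rank 20.
    """
    if gold_rank is None or gold_rank > max_rank:
        return 0.0
    return 1.0 / math.log2(gold_rank + 1)  # 1.0 at rank 1, decaying with rank

print(rank_incentive_reward(1), rank_incentive_reward(20), rank_incentive_reward(None))
```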
pdf
bib
abs
How to Make Large Language Models Generate 100% Valid Molecules?
Wen Tao
|
Jing Tang
|
Alvin Chan
|
Bryan Hooi
|
Baolong Bi
|
Nanyun Peng
|
Yuansheng Liu
|
Yiwei Wang
Molecule generation is key to drug discovery and materials science, enabling the design of novel compounds with specific properties. Large language models (LLMs) can learn to perform a wide range of tasks from just a few examples. However, generating valid molecules using representations like SMILES is challenging for LLMs in few-shot settings. In this work, we explore how LLMs can generate 100% valid molecules. We evaluate whether LLMs can use SELFIES, a representation where every string corresponds to a valid molecule, for valid molecule generation but find that LLMs perform worse with SELFIES than with SMILES. We then examine LLMs’ ability to correct invalid SMILES and find their capacity limited. Finally, we introduce SmiSelf, a cross-chemical language framework for invalid SMILES correction. SmiSelf converts invalid SMILES to SELFIES using grammatical rules, leveraging SELFIES’ mechanisms to correct the invalid SMILES. Experiments show that SmiSelf ensures 100% validity while preserving molecular characteristics and maintaining or even enhancing performance on other metrics. SmiSelf helps expand LLMs’ practical applications in biomedicine and is compatible with all SMILES-based generative models. Code is available at https://github.com/wentao228/SmiSelf.
pdf
bib
abs
Exploring Quality and Diversity in Synthetic Data Generation for Argument Mining
Jianzhu Bao
|
Yuqi Huang
|
Yang Sun
|
Wenya Wang
|
Yice Zhang
|
Bojun Jin
|
Ruifeng Xu
The advancement of Argument Mining (AM) is hindered by a critical bottleneck: the scarcity of structure-annotated datasets, which are expensive to create manually. Inspired by recent successes in synthetic data generation across various NLP tasks, this paper explores methodologies for LLMs to generate synthetic data for AM.We investigate two complementary synthesis perspectives: a quality-oriented synthesis approach, which employs structure-aware paraphrasing to preserve annotation quality, and a diversity-oriented synthesis approach, which generates novel argumentative texts with diverse topics and argument structures.Experiments on three datasets show that augmenting original training data with our synthetic data, particularly when combining both quality- and diversity-oriented instances, significantly enhances the performance of existing AM models, both in full-data and low-resource settings.Moreover, the positive correlation between synthetic data volume and model performance highlights the scalability of our methods.
pdf
bib
abs
Dynamic Jointly Batch Selection for Data Efficient Machine Translation Fine-Tuning
Mohammad Amin Ghanizadeh
|
Mohammad Javad Dousti
Data quality and its effective selection are fundamental to improving the performance of machine translation models, serving as cornerstones for achieving robust and reliable translation systems. This paper presents a data selection methodology specifically designed for fine-tuning machine translation systems, which leverages the synergy between a learner model and a pre-trained reference model to enhance overall training effectiveness. By defining a learnability score, our approach systematically evaluates the utility of data points for training, ensuring that only the most relevant and impactful examples contribute to the fine-tuning process. Furthermore, our method employs a batch selection strategy which considers interdependencies among data points, optimizing the efficiency of the training process while maintaining a focus on data relevance. Experiments on English ↔ Persian and several other language pairs using an mBART model fine-tuned on the CCMatrix dataset demonstrate that our method can achieve up to a fivefold improvement in data efficiency compared to an iid baseline. Experimental results indicate that our approach improves computational efficiency by 24 when utilizing cached embeddings, as it requires fewer training data points. Additionally, it enhances generalization, resulting in superior translation performance compared to a random selection method.
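A minimal version of the pointwise part of such a score might compare per-example losses under the learner and the reference model, then keep the highest-scoring examples for the next fine-tuning step. The batch-level interdependency modeling described in the paper is not shown, and the loss values below are made up.

```python
import torch

def learnability_scores(learner_loss: torch.Tensor, reference_loss: torch.Tensor) -> torch.Tensor:
    """Assumed learnability score per example: how much worse the learner currently is
    than the reference model. High scores mark sentence pairs that are still learnable."""
    return learner_loss - reference_loss

def select_batch(learner_loss: torch.Tensor, reference_loss: torch.Tensor, budget: int):
    scores = learnability_scores(learner_loss, reference_loss)
    return torch.topk(scores, budget).indices  # indices of examples kept for fine-tuning

learner_loss = torch.tensor([2.3, 0.4, 1.8, 0.9, 3.1])
reference_loss = torch.tensor([0.7, 0.5, 0.6, 0.8, 2.9])
print(select_batch(learner_loss, reference_loss, budget=2))
```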
pdf
bib
abs
3MDBench: Medical Multimodal Multi-agent Dialogue Benchmark
Ivan Sviridov
|
Amina Miftakhova
|
Tereshchenko Artemiy Vladimirovich
|
Galina Zubkova
|
Pavel Blinov
|
Andrey Savchenko
Though Large Vision-Language Models (LVLMs) are being actively explored in medicine, their ability to conduct complex real-world telemedicine consultations combining accurate diagnosis with professional dialogue remains underexplored. This paper presents **3MDBench** (**M**edical **M**ultimodal **M**ulti-agent **D**ialogue **Bench**mark), an open-source framework for simulating and evaluating LVLM-driven telemedical consultations. 3MDBench simulates patient variability through a temperament-based Patient Agent and evaluates diagnostic accuracy and dialogue quality via an Assessor Agent. It includes 2996 cases across 34 diagnoses from real-world telemedicine interactions, combining textual and image-based data. The experimental study compares diagnostic strategies for widely used open and closed-source LVLMs. We demonstrate that multimodal dialogue with internal reasoning improves F1 score by 6.5% over non-dialogue settings, highlighting the importance of context-aware, information-seeking questioning. Moreover, injecting predictions from a diagnostic convolutional neural network into the LVLM’s context boosts F1 by up to 20%. Source code is available at https://github.com/univanxx/3mdbench.
pdf
bib
abs
OpenTuringBench: An Open-Model-based Benchmark and Framework for Machine-Generated Text Detection and Attribution
Lucio La Cava
|
Andrea Tagarelli
Open Large Language Models (OLLMs) are increasingly leveraged in generative AI applications, posing new challenges for detecting their outputs. We propose OpenTuringBench, a new benchmark based on OLLMs, designed to train and evaluate machine-generated text detectors on the Turing Test and Authorship Attribution problems. OpenTuringBench focuses on a representative set of OLLMs, and features a number of challenging evaluation tasks, including human/machine-manipulated texts, out-of-domain texts, and texts from previously unseen models. We also provide OTBDetector, a contrastive learning framework to detect and attribute OLLM-based machine-generated texts. Results highlight the relevance and varying degrees of difficulty of the OpenTuringBench tasks, with our detector achieving remarkable capabilities across the various tasks and outperforming most existing detectors.
pdf
bib
abs
CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios
Shiting Huang
|
Zhen Fang
|
Zehui Chen
|
Siyu Yuan
|
Junjie Ye
|
Yu Zeng
|
Lin Chen
|
Qi Mao
|
Feng Zhao
The ability of large language models (LLMs) to utilize external tools has enabled them to tackle an increasingly diverse range of tasks. However, as the tasks become more complex and long-horizon, the intricate tool utilization process may trigger various unexpected errors. Therefore, how to effectively handle such errors, including identifying, diagnosing, and recovering from them, has emerged as a key research direction for advancing tool learning. In this work, we first extensively analyze the types of errors encountered during the function-calling process on several competitive tool evaluation benchmarks. Based on this analysis, we introduce CRITICTOOL, a comprehensive critique evaluation benchmark specialized for tool learning. Building upon a novel evolutionary strategy for dataset construction, CRITICTOOL holds diverse tool-use errors with varying complexities, which better reflects real-world scenarios. We conduct extensive experiments on CRITICTOOL, and validate the generalization and effectiveness of our benchmark construction strategy. We also provide an in-depth analysis of the tool reflection ability on various LLMs, offering a new perspective on the field of tool learning in LLMs. The code is available at https://github.com/Shellorley0513/CriticTool.
pdf
bib
abs
Pre-trained Language Models Learn Remarkably Accurate Representations of Numbers
Marek Kadlčík
|
Michal Štefánik
|
Timothee Mickus
|
Josef Kuchař
|
Michal Spiegel
Pretrained language models (LMs) are prone to arithmetic errors. Existing work showed limited success in probing numeric values from models’ representations, indicating that these errors can be attributed to the inherent unreliability of distributionally learned embeddings in representing exact quantities. However, we observe that previous probing methods are inadequate for the emergent structure of learned number embeddings with sinusoidal patterns. In response, we propose a novel probing technique that decodes numeric values from input embeddings with near-perfect accuracy across a range of open-source LMs. This proves that, after pre-training alone, LMs represent numbers with remarkable precision. Finally, we find that the preciseness of the embeddings, as judged by our probe’s accuracy, explains a large portion of LMs’ errors in elementary arithmetic, and show that aligning the embeddings with the pattern discovered by our probe can mitigate these errors.
pdf
bib
abs
Enhancing Large Vision-Language Models with Ultra-Detailed Image Caption Generation
Yu Zeng
|
Yukun Qi
|
Yiming Zhao
|
Xikun Bao
|
Lin Chen
|
Zehui Chen
|
Shiting Huang
|
Jie Zhao
|
Feng Zhao
High-quality image captions are essential for improving modality alignment and visual understanding in Large Vision-Language Models (LVLMs). However, the scarcity of ultra-detailed image caption data limits further advancements. This paper presents a systematic pipeline for generating high-quality, ultra-detailed image captions, encompassing both pre-processing and post-processing stages. In the pre-processing stage, we classify and deduplicate images, extract visual information using expert tools, and leverage GPT-4o with structured prompts to generate initial captions. To enhance comprehensiveness, we introduce an expansion strategy based on Large Language Models (LLMs), defining eight descriptive dimensions to refine and extend captions, which serve as seed data for training a proprietary captioner model. In the post-processing stage, we incorporate human error-correction annotations and an active learning-inspired approach to refine low-quality samples. Using high-quality corrected data, we apply Direct Preference Optimization (DPO) and develop a critic-rewrite pipeline, training a sentence-level critic model to mitigate hallucinations. Experimental results demonstrate that our ultra-detailed captions significantly enhance LVLMs’ perception and cognitive abilities across multiple vision-language benchmarks. The code and dataset are available at https://github.com/yuzeng0-0/UltraCaption.
pdf
bib
abs
Translate Smart, not Hard: Cascaded Translation Systems with Quality-Aware Deferral
António Farinhas
|
Nuno M Guerreiro
|
Sweta Agrawal
|
Ricardo Rei
|
Andre Martins
Larger models often outperform smaller ones but come with high computational costs. Cascading offers a potential solution. By default, it uses smaller models and defers only some instances to larger, more powerful models. However, designing effective deferral rules remains a challenge. In this paper, we propose a simple yet effective approach for machine translation, using existing quality estimation (QE) metrics as deferral rules. We show that QE-based deferral allows a cascaded system to match the performance of a larger model while invoking it for a small fraction (30% to 50%) of the examples, significantly reducing computational costs. We validate this approach through both automatic and human evaluation.
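The deferral rule itself is simple to express: translate with the small model, score the hypothesis with a reference-free QE metric, and invoke the large model only below a quality threshold. In the sketch below, the two translators and the QE scorer are placeholders and the threshold is an arbitrary value; the paper evaluates several existing QE metrics in this role.

```python
def translate_small(src: str) -> str:
    """Placeholder for the small, cheap translation model."""
    return "small-model translation of: " + src

def translate_large(src: str) -> str:
    """Placeholder for the large, expensive translation model."""
    return "large-model translation of: " + src

def quality_estimate(src: str, hyp: str) -> float:
    """Placeholder for a reference-free QE metric returning a score in [0, 1]."""
    return 0.8  # stub value so the sketch runs

def cascaded_translate(src: str, threshold: float = 0.75) -> str:
    """Translate with the small model; defer to the large model only when the
    estimated quality of the small model's output falls below the threshold."""
    hyp = translate_small(src)
    if quality_estimate(src, hyp) >= threshold:
        return hyp                   # keep the cheap translation
    return translate_large(src)      # defer the hard instance

print(cascaded_translate("Ce chat est très mignon."))
```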
pdf
bib
abs
iVISPAR — An Interactive Visual-Spatial Reasoning Benchmark for VLMs
Julius Mayer
|
Mohamad Ballout
|
Serwan Jassim
|
Farbod Nosrat Nezami
|
Elia Bruni
Vision-Language Models (VLMs) are known to struggle with spatial reasoning and visual alignment. To help overcome these limitations, we introduce iVISPAR, an interactive multimodal benchmark designed to evaluate the spatial reasoning capabilities of VLMs acting as agents. iVISPAR is based on a variant of the sliding tile puzzle—a classic problem that demands logical planning, spatial awareness, and multi-step reasoning. The benchmark supports visual 3D, 2D, and text-based input modalities, enabling comprehensive assessments of VLMs’ planning and reasoning skills. We evaluate a broad suite of state-of-the-art open-source and closed-source VLMs, comparing their performance while also providing optimal path solutions and a human baseline to assess the task’s complexity and feasibility for humans. Results indicate that while VLMs perform better on 2D tasks compared to 3D or text-based settings, they struggle with complex spatial configurations and consistently fall short of human performance, illustrating the persistent challenge of visual alignment. This underscores critical gaps in current VLM capabilities, highlighting their limitations in achieving human-level cognition. Project website: https://microcosm.ai/ivispar.
pdf
bib
abs
Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance
Omer Nahum
|
Nitay Calderon
|
Orgad Keller
|
Idan Szpektor
|
Roi Reichart
NLP benchmarks rely on standardized datasets for training and evaluating models and are crucial for advancing the field. Traditionally, expert annotations ensure high-quality labels; however, the cost of expert annotation does not scale well with the growing demand for larger datasets required by modern models. While crowd-sourcing provides a more scalable solution, it often comes at the expense of annotation precision and consistency. Recent advancements in large language models (LLMs) offer new opportunities to enhance the annotation process, particularly for detecting label errors in existing datasets. In this work, we consider the recent approach of LLM-as-a-judge, leveraging an ensemble of LLMs to flag potentially mislabeled examples. We conduct a case study on four factual consistency datasets from the TRUE benchmark, spanning diverse NLP tasks, and on SummEval, which uses Likert-scale ratings of summary quality across multiple dimensions. We empirically analyze the labeling quality of existing datasets and compare expert, crowd-sourced, and LLM-based annotations in terms of the agreement, label quality, and efficiency, demonstrating the strengths and limitations of each annotation method. Our findings reveal a substantial number of label errors, which, when corrected, induce a significant upward shift in reported model performance. This suggests that many of the LLMs’ so-called mistakes are due to label errors rather than genuine model failures. Additionally, we discuss the implications of mislabeled data and propose methods to mitigate them in training to improve performance.
pdf
bib
abs
Detecting Legal Citations in United Kingdom Court Judgments
Holli Sargeant
|
Andreas Östling
|
Måns Magnusson
Legal citation detection in court judgments underpins reliable precedent mapping, citation analytics, and document retrieval. Extracting references to legislation and case law in the United Kingdom is especially challenging: citation styles have evolved over centuries, and judgments routinely cite foreign or historical authorities. We conduct the first systematic comparison of three modelling paradigms on this task using the Cambridge Law Corpus: (i) rule‐based regular expressions; (ii) transformer-based encoders (BERT, RoBERTa, LEGAL‐BERT, ModernBERT); and (iii) large language models (GPT‐4.1). We produced a gold‐standard high-quality corpus of 190 court judgments containing 45,179 fine-grained annotations for UK and non-UK legislation and case references. ModernBERT achieves a macro-averaged F1 of 93.3%, only marginally ahead of the other encoder-only models, yet significantly outperforming the strongest regular-expression baseline (35.42% F1) and GPT-4.1 (76.57% F1).
pdf
bib
abs
Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements
Guangxiang Zhao
|
Saier Hu
|
Xiaoqi Jian
|
Wu Jinzhu
|
Yuhan Wu
|
Lin Sun
|
Xiangzheng Zhang
In this paper, we propose a “Generalization Stress Test” to assess Large Language Models’ (LLMs) generalization ability under slight and controlled perturbations, including option length, problem types, and irrelevant noun replacements. We achieve novel and significant findings that, despite high benchmark scores, LLMs exhibit severe accuracy drops and unexpected biases (e.g., preference for longer distractors) when faced with these minor but content-preserving modifications. For example, Qwen 2.5 1.5B’s MMLU score rises from 60 to 89 and drops from 89 to 36 when option lengths are changed without altering the question. Even GPT4o experiences a 25-point accuracy loss when problem types are changed, with a 6-point drop across all three modification categories. These analyses suggest that LLMs rely heavily on superficial cues rather than forming robust, abstract representations that generalize across formats, lexical variations, and shifts in irrelevant content.
pdf
bib
abs
Studying the Role of Input-Neighbor Overlap in Retrieval-Augmented Language Models Training Efficiency
Ehsan Doostmohammadi
|
Marco Kuhlmann
Retrieval-augmented language models have demonstrated performance comparable to much larger models while requiring fewer computational resources. The effectiveness of these models crucially depends on the overlap between query and retrieved context, but the optimal degree of this overlap remains unexplored. In this paper, we systematically investigate how varying levels of query–context overlap affect model performance during both training and inference. Our experiments reveal that increased overlap initially has minimal effect, but substantially improves test-time perplexity and accelerates model learning above a critical threshold. Building on these findings, we demonstrate that deliberately increasing overlap through synthetic context can enhance data efficiency and reduce training time by approximately 40% without compromising performance. We specifically generate synthetic context through paraphrasing queries. We validate our perplexity-based findings on question-answering tasks, confirming that the benefits of retrieval-augmented language modeling extend to practical applications. Our results provide empirical evidence of significant optimization potential for retrieval mechanisms in language model pretraining.
pdf
bib
abs
Principled Personas: Defining and Measuring the Intended Effects of Persona Prompting on Task Performance
Pedro Henrique Luz de Araujo
|
Paul Röttger
|
Dirk Hovy
|
Benjamin Roth
Expert persona prompting—assigning roles such as expert in math to language models—is widely used for task improvement. However, prior work shows mixed results on its effectiveness, and does not consider when and why personas should improve performance. We analyze the literature on persona prompting for task improvement and distill three desiderata: 1) performance advantage of expert personas, 2) robustness to irrelevant persona attributes, and 3) fidelity to persona attributes. We then evaluate 9 state-of-the-art LLMs across 27 tasks with respect to these desiderata. We find that expert personas usually lead to positive or non-significant performance changes. Surprisingly, models are highly sensitive to irrelevant persona details, with performance drops of almost 30 percentage points. In terms of fidelity, we find that while higher education, specialization, and domain-relatedness can boost performance, their effects are often inconsistent or negligible across tasks. We propose mitigation strategies to improve robustness—but find they only work for the largest, most capable models. Our findings underscore the need for more careful persona design and for evaluation schemes that reflect the intended effects of persona usage.
pdf
bib
abs
HydraOpt: Navigating the Efficiency-Performance Trade-off of Adapter Merging
Taha Ceritli
|
Ondrej Bohdal
|
Mete Ozay
|
Jijoong Moon
|
Kyenghun Lee
|
Hyeonmok Ko
|
Umberto Michieli
Large language models (LLMs) often leverage adapters, such as low-rank-based adapters, to achieve strong performance on downstream tasks. However, storing a separate adapter for each task significantly increases memory requirements, posing a challenge for resource-constrained environments such as mobile devices. Although model merging techniques can reduce storage costs, they typically result in substantial performance degradation. In this work, we introduce HydraOpt, a new model merging technique that capitalizes on the inherent similarities between the matrices of low-rank adapters. Unlike existing methods that produce a fixed trade-off between storage size and performance, HydraOpt allows us to navigate this spectrum of efficiency and performance. Our experiments show that HydraOpt significantly reduces storage size (48% reduction) compared to storing all adapters, while achieving competitive performance (0.2-1.8% drop). Furthermore, it outperforms existing merging techniques in terms of performance at the same or slightly worse storage efficiency.
pdf
bib
abs
Parrot: A Training Pipeline Enhances Both Program CoT and Natural Language CoT for Reasoning
Senjie Jin
|
Lu Chen
|
Zhiheng Xi
|
Yuhui Wang
|
Sirui Song
|
Yuhao Zhou
|
Xinbo Zhang
|
Peng Sun
|
Hong Lu
|
Tao Gui
|
Qi Zhang
|
Xuanjing Huang
Natural language chain-of-thought (N-CoT) and Program chain-of-thought (P-CoT) have emerged as two primary paradigms for large language models (LLMs) to solve mathematical reasoning problems. Current research typically endeavors to achieve unidirectional enhancement: P-CoT enhanced N-CoT or N-CoT enhanced P-CoT. In this paper, we seek to fully unleash the two paradigms’ strengths for mutual enhancement and ultimately achieve simultaneous improvements. We conduct a detailed analysis of the error types across two paradigms, based on which we propose Parrot, a novel training pipeline for mathematical problems: 1) Three target-designed subtasks integrate sequential P-CoT and N-CoT generation. 2) A subtask hybrid training strategy to facilitate natural language semantic transferability. 3) The converted N-CoT auxiliary reward is designed to alleviate the sparse rewards in P-CoT optimization. Extensive experiments demonstrate that Parrot significantly enhances both the performance of N-CoT and P-CoT, especially on N-CoT. Using Parrot SFT, the LLaMA2’s and CodeLLaMA’s N-CoT performance achieve gains of +21.87 and +21.48 on MathQA over the RL baseline, which is resource-intensive.
pdf
bib
abs
Spec-VLA: Speculative Decoding for Vision-Language-Action Models with Relaxed Acceptance
Songsheng Wang
|
Rucheng Yu
|
Zhihang Yuan
|
Chao Yu
|
Feng Gao
|
Yu Wang
|
Derek F. Wong
Vision-Language-Action (VLA) models have made substantial progress by leveraging the robust capabilities of Visual Language Models (VLMs). However, VLMs’ significant parameter size and autoregressive (AR) decoding nature impose considerable computational demands on VLA models. While Speculative Decoding (SD) has shown efficacy in accelerating Large Language Models (LLMs) by incorporating efficient drafting and parallel verification, allowing multiple tokens to be generated in one forward pass, its application to VLA models remains unexplored. This work introduces Spec-VLA, an SD framework designed to accelerate VLA models. Due to the difficulty of the action prediction task and the greedy decoding mechanism of the VLA models, the direct application of the advanced SD framework to the VLA prediction task yields a minor speed improvement. To boost the generation speed, we propose an effective mechanism to relax acceptance utilizing the relative distances represented by the action tokens of the VLA model. Empirical results across diverse test scenarios affirm the effectiveness of the Spec-VLA framework, and further analysis substantiates the impact of our proposed strategies, which enhance the acceptance length by 44%, achieving 1.42× speedup compared with the OpenVLA baseline, without compromising the success rate. The success of the Spec-VLA framework highlights the potential for broader application of speculative execution in VLA prediction scenarios.
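A simplified picture of relaxed acceptance: because VLA action tokens index discretized action bins, a draft token whose bin is close enough to the target model's choice can be accepted even if it is not identical, which lengthens the accepted draft prefix. The tolerance and token IDs below are invented for illustration and do not reflect the paper's actual acceptance criterion.

```python
def relaxed_accept(draft_ids: list[int], target_ids: list[int], tolerance: int = 1) -> int:
    """Count how many leading draft action tokens are accepted under relaxed acceptance.

    Instead of requiring exact agreement as in vanilla speculative decoding, draft
    tokens whose action-bin index is within `tolerance` of the target model's choice
    are treated as acceptable.
    """
    accepted = 0
    for d, t in zip(draft_ids, target_ids):
        if abs(d - t) <= tolerance:  # close-enough action bins count as matches
            accepted += 1
        else:
            break
    return accepted

print(relaxed_accept([31743, 31800, 31752], [31744, 31800, 31600]))  # -> 2
```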
pdf
bib
abs
Leveraging Text-to-Text Transformers as Classifier Chain for Few-Shot Multi-Label Classification
Quang Anh Nguyen
|
Nadi Tomeh
|
Mustapha Lebbah
|
Thierry Charnois
|
Hanane Azzag
Multilabel text classification (MLTC) is an essential task in NLP applications. Traditional methods require extensive labeled data and are limited to fixed label sets. Extracting labels with LLMs is more effective and universal, but incurs high computational costs. In this work, we introduce a distillation-based T5 generalist model for zero-shot MLTC and few-shot fine-tuning. Our model accommodates variable label sets with general domain-agnostic pretraining, while modeling dependencies between labels. Experiments show that our approach outperforms baselines of similar size on three few-shot tasks. Our code is available at https://anonymous.4open.science/r/t5-multilabel-0C32/README.md
pdf
bib
abs
M-Wanda: Improving One-Shot Pruning for Multilingual LLMs
Rochelle Choenni
|
Ivan Titov
Multilingual LLM performance is often critically dependent on model size. With an eye on efficiency, this has led to a surge in interest in one-shot pruning methods that retain the benefits of large-scale pretraining while shrinking the model size. However, as pruning tends to come with performance loss, it is important to understand the trade-offs between multilinguality and sparsification. In this work, we study multilingual performance under different sparsity constraints and show that moderate ratios already substantially harm performance. To help bridge this gap, we propose M-Wanda, a pruning method that models cross-lingual variation by incorporating language-aware activation statistics into its pruning criterion and dynamically adjusts layerwise sparsity based on cross-lingual importance. We show that M-Wanda consistently improves performance at minimal additional costs. We are the first to explicitly optimize pruning to retain multilingual performance, and hope to inspire future advances in multilingual pruning.
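For context, the underlying Wanda criterion scores each weight by its magnitude times the norm of its input activations; a multilingual variant additionally has to decide how to combine activation statistics gathered from different languages. The sketch below uses a plain mean over per-language norms and a fixed per-row sparsity, which is a simplification of M-Wanda's language-aware aggregation and dynamic layerwise sparsity.

```python
import torch

def wanda_scores(weight: torch.Tensor, act_norm: torch.Tensor) -> torch.Tensor:
    """Standard Wanda importance: |W_ij| * ||X_j||, with X the layer's input activations."""
    return weight.abs() * act_norm.unsqueeze(0)  # (out, in)

def prune_layer(weight: torch.Tensor, act_norms_per_lang: dict[str, torch.Tensor],
                sparsity: float = 0.5) -> torch.Tensor:
    """Simplified multilingual pruning step: aggregate activation norms across languages
    (here a plain mean) and zero out the lowest-scoring weights in each output row."""
    agg = torch.stack(list(act_norms_per_lang.values())).mean(dim=0)
    scores = wanda_scores(weight, agg)
    k = int(weight.size(1) * sparsity)  # weights to remove per output row
    prune_idx = torch.topk(scores, k, dim=1, largest=False).indices
    pruned = weight.clone()
    pruned.scatter_(1, prune_idx, 0.0)
    return pruned

W = torch.randn(16, 32)
norms = {"en": torch.rand(32), "fa": torch.rand(32), "sw": torch.rand(32)}
W_pruned = prune_layer(W, norms, sparsity=0.5)
```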
pdf
bib
abs
Beyond Hate Speech: NLP’s Challenges and Opportunities in Uncovering Dehumanizing Language
Hamidreza Saffari
|
Mohammadamin Shafiei
|
Hezhao Zhang
|
Lasana T. Harris
|
Nafise Sadat Moosavi
Dehumanization, i.e., denying human qualities to individuals or groups, is a particularly harmful form of hate speech that can normalize violence against marginalized communities. Despite advances in NLP for detecting general hate speech, approaches to identifying dehumanizing language remain limited due to scarce annotated data and the subtle nature of such expressions. In this work, we systematically evaluate four state-of-the-art large language models (LLMs) — Claude, GPT, Mistral, and Qwen — for dehumanization detection. Our results show that only one model—Claude—achieves strong performance (over 80% F1) under an optimized configuration, while others, despite their capabilities, perform only moderately. Performance drops further when distinguishing dehumanization from related hate types such as derogation. We also identify systematic disparities across target groups: models tend to over-predict dehumanization for some identities (e.g., Gay men), while under-identifying it for others (e.g., Refugees). These findings motivate the need for systematic, group-level evaluation when applying pretrained language models to dehumanization detection tasks.
pdf
bib
abs
Conflict-Aware Soft Prompting for Retrieval-Augmented Generation
Eunseong Choi
|
June Park
|
Hyeri Lee
|
Jongwuk Lee
Retrieval-augmented generation (RAG) enhances the capabilities of large language models (LLMs) by incorporating external knowledge into their input prompts. However, when the retrieved context contradicts the LLM’s parametric knowledge, it often fails to resolve the conflict between incorrect external context and correct parametric knowledge, known as context-memory conflict. To tackle this problem, we introduce Conflict-Aware REtrieval-Augmented Generation (CARE), consisting of a context assessor and a base LLM. The context assessor encodes external context into compact memory embeddings. Through grounded/adversarial soft prompting, the context assessor is trained to discern unreliable context and capture a guidance signal that directs reasoning toward the more reliable knowledge source. Extensive experiments show that CARE effectively mitigates context-memory conflicts, leading to an average performance gain of 5.0% on QA and fact-checking benchmarks, establishing a promising direction for trustworthy and adaptive RAG systems.
pdf
bib
abs
R-CHAR: A Metacognition-Driven Framework for Role-Playing in Large Language Models
Haiming Qin
|
Jiwei Zhang
|
Wei Zhang
|
KeZhong Lu
|
Mingyang Zhou
|
Hao Liao
|
Rui Mao
Role-playing capabilities in large language models (LLMs) often lack cognitive consistency in complex scenarios that require deep understanding and coherent reasoning. While recent reasoning models excel in math and coding tasks, they show limited effectiveness in open-ended role-playing scenarios. We introduce R-CHAR (Role-Consistent Hierarchical Adaptive Reasoning), a metacognition-driven framework that enhances role-playing performance through guided thinking trajectories synthesis and adaptive evaluation. Our approach demonstrates that concise thinking processes can achieve superior performance efficiently compared to elaborate reasoning chains in role-playing social intelligence tasks, outperforming existing specialized models. Experimental results on the SocialBench benchmark show significant and stable performance improvements across varying scenario complexities, showing particular strength in long-context comprehension (from 34.64% to 68.59%) and group-level social interactions. Our work advances the development of cognitively consistent role-playing systems, bridging the gap between surface-level mimicry and authentic character simulation.
pdf
bib
abs
Annotating Training Data for Conditional Semantic Textual Similarity Measurement using Large Language Models
Gaifan Zhang
|
Yi Zhou
|
Danushka Bollegala
Semantic similarity between two sentences depends on the aspects considered between those sentences. To study this phenomenon, Deshpande et al. (2023) proposed the Conditional Semantic Textual Similarity (C-STS) task and annotated a human-rated similarity dataset containing pairs of sentences compared under two different conditions. However, Tu et al. (2024) found various annotation issues in this dataset and showed that manually re-annotating a small portion of it leads to more accurate C-STS models. Despite these pioneering efforts, the lack of large and accurately annotated C-STS datasets remains a blocker for making progress on this task as evidenced by the subpar performance of the C-STS models. To address this training data need, we resort to Large Language Models (LLMs) to correct the condition statements and similarity ratings in the original dataset proposed by Deshpande et al. (2023). Our proposed method is able to re-annotate a large training dataset for the C-STS task with minimal manual effort. Importantly, by training a supervised C-STS model on our cleaned and re-annotated dataset, we achieve a 5.4% statistically significant improvement in Spearman correlation. The re-annotated dataset is available at https://LivNLP.github.io/CSTS-reannotation.
pdf
bib
abs
When Words Smile: Generating Diverse Emotional Facial Expressions from Text
Haidong Xu
|
Meishan Zhang
|
Hao Ju
|
Zhedong Zheng
|
Erik Cambria
|
Min Zhang
|
Hao Fei
Enabling digital humans to express rich emotions has significant applications in dialogue systems, gaming, and other interactive scenarios. While recent advances in talking head synthesis have achieved impressive results in lip synchronization, they tend to overlook the rich and dynamic nature of facial expressions. To fill this critical gap, we introduce an end-to-end text-to-expression model that explicitly focuses on emotional dynamics. Our model learns expressive facial variations in a continuous latent space and generates expressions that are diverse, fluid, and emotionally coherent. To support this task, we introduce EmoAva, a large-scale and high-quality dataset containing 15,000 text–3D expression pairs. Extensive experiments on both existing datasets and EmoAva demonstrate that our method significantly outperforms baselines across multiple evaluation metrics, marking a significant advancement in the field.
pdf
bib
abs
Improving Online Job Advertisement Analysis via Compositional Entity Extraction
Kai Krüger
|
Johanna Binnewitt
|
Kathrin Ehmann
|
Stefan Winnige
|
Alan Akbik
We propose a compositional entity modeling framework for requirement extraction from online job advertisements (OJAs), representing complex, tree-like structures that connect atomic entities via typed relations. Based on this schema, we introduce GOJA, a manually annotated dataset of 500 German job ads that captures roles, tools, experience levels, attitudes, and their functional context. We report strong inter-annotator agreement and benchmark transformer models, demonstrating the feasibility of learning this structure. A focused case study on AI-related requirements illustrates the analytical value of our approach for labor market research.
pdf
bib
abs
Correlation-Aware Example Selection for In-Context Learning with Nonsymmetric Determinantal Point Processes
Qiunan Du
|
Zhiliang Tian
|
Zhen Huang
|
Kailun Bian
|
Tianlun Liu
|
Zhaoning Zhang
|
Xinwang Liu
|
Feng Liu
|
Dongsheng Li
LLMs with in-context learning (ICL) obtain remarkable performance but are sensitive to the quality of ICL examples. Prior works on ICL example selection explored unsupervised heuristic methods and supervised LLM-based methods, but they typically focus on the selection of individual examples and ignore correlations among examples. Researchers use the determinantal point process (DPP) to model negative correlations among examples and thereby select diverse examples. However, the DPP fails to model positive correlations among examples, even though ICL relies on such correlations to keep the selected examples consistent and thus provide clear guidance to LLMs. In this paper, we propose an ICL example selection method based on the nonsymmetric determinantal point process (NDPP) to capture positive and negative correlations, considering both the diversity and the relevance among ICL examples. Specifically, we optimize NDPP via kernel decomposition-based MLE to fit a constructed pseudo-labeled dataset, where we also propose a low-rank decomposition to reduce the computational cost. Further, we perform query-aware kernel adaptation on our NDPP to adapt it to the input query, and we select examples via MAP inference based on the adapted NDPP. Experimental results show our model outperforms strong baselines in ICL example selection.
pdf
bib
abs
Leveraging Cognitive Complexity of Texts for Contextualization in Dense Retrieval
Effrosyni Sokli
|
Georgios Peikos
|
Pranav Kasela
|
Gabriella Pasi
Dense Retrieval Models (DRMs) estimate the semantic similarity between queries and documents based on their embeddings. Prior studies highlight the importance of embedding contextualization in enhancing retrieval performance. To this aim, existing approaches primarily leverage token-level information derived from query/document interactions. In this paper, we introduce a novel DRM, namely DenseC3, which leverages query/document interactions based on the full embedding representations generated by a Transformer-based model. To enhance similarity estimation, DenseC3 integrates external linguistic information about the Cognitive Complexity of texts, enriching the contextualization of embeddings. We empirically evaluate our approach across seven benchmarks and three different IR tasks to assess the impact of Cognitive Complexity-aware query and document embeddings for contextualization in dense retrieval. Results show that our approach consistently outperforms standard fine-tuning techniques on lightweight bi-encoders (e.g., BERT-based) and traditional late-interaction models (i.e., ColBERT) across all benchmarks. On larger retrieval-optimized bi-encoders like Contriever, our model achieves comparable or higher performance on four of the considered evaluation benchmarks. Our findings suggest that Cognitive Complexity-aware embeddings enhance query and document representations, improving retrieval effectiveness in DRMs. Our code is available online at: https://github.com/FaySokli/DenseC3.
pdf
bib
abs
Beyond Online Sampling: Bridging Offline-to-Online Alignment via Dynamic Data Transformation for LLMs
Zhang Zhang
|
Guhao Feng
|
Jian Guan
|
Di He
|
Wei Wu
While Direct Preference Optimization (DPO) eliminates complex reward modeling in aligning large language models (LLMs) with human preferences, its online variant faces significant efficiency bottlenecks due to costly real-time preference sampling and the reward model annotation. We propose a novel framework that bridges offline-to-online alignment by systematically transforming static datasets into dynamically adaptive equivalents, without the need for an explicit reward model. Our approach employs paraphrasing techniques to preserve response correctness while aligning data distributions with model-generated outputs, circumventing the need for resource-intensive online interactions. Experiments on mathematical reasoning and conversational tasks demonstrate that our method matches or exceeds the performance of a fully online DPO. This work establishes a computationally sustainable paradigm for LLM alignment, particularly benefiting scenarios requiring iterative preference updates and domain adaptation.
pdf
bib
abs
CAVE : Detecting and Explaining Commonsense Anomalies in Visual Environments
Rishika Bhagwatkar
|
Syrielle Montariol
|
Angelika Romanou
|
Beatriz Borges
|
Irina Rish
|
Antoine Bosselut
Humans can naturally identify, reason about, and explain anomalies in their environment. In computer vision, work on this long-standing challenge remains largely limited to industrial defects or unrealistic, synthetically generated anomalies, failing to capture the richness and unpredictability of real-world anomalies. In this work, we introduce CAVE, the first benchmark of real-world visual anomalies. CAVE supports three open-ended tasks: anomaly description, explanation, and justification, with fine-grained annotations for visual grounding and for categorizing anomalies by their visual manifestations, complexity, severity, and commonness. These annotations draw inspiration from cognitive science research on how humans identify and resolve anomalies, providing a comprehensive framework for evaluating Vision-Language Models (VLMs) in detecting and understanding anomalies. We show that state-of-the-art VLMs struggle with visual anomaly perception and commonsense reasoning, even with advanced prompting strategies. By offering a realistic and cognitively grounded benchmark, CAVE serves as a valuable resource for advancing research in anomaly detection and commonsense reasoning in VLMs.
pdf
bib
abs
Enhancing LLM Language Adaption through Cross-lingual In-Context Pre-training
Linjuan Wu
|
Hao-Ran Wei
|
Huan Lin
|
Tianhao Li
|
Baosong Yang
|
Fei Huang
|
Weiming Lu
Large language models (LLMs) exhibit remarkable multilingual capabilities despite English-dominated pre-training, attributed to cross-lingual mechanisms during pre-training. Existing methods for enhancing cross-lingual transfer remain constrained by parallel resources, suffering from limited linguistic and domain coverage. We propose Cross-lingual In-context Pre-training (CrossIC-PT), a simple and scalable approach that enhances cross-lingual transfer by leveraging semantically related bilingual texts via simple next-word prediction. We construct CrossIC-PT samples by interleaving semantically related bilingual Wikipedia documents into a single context window. To address window size constraints, we implement a systematic segmentation policy to split long bilingual document pairs into chunks while adjusting the sliding window mechanism to preserve contextual coherence. We further extend data availability through a semantic retrieval framework to construct CrossIC-PT samples from a web-crawled corpus. Experimental results demonstrate that CrossIC-PT improves multilingual performance on three models (Llama-3.1-8B, Qwen2.5-7B, and Qwen2.5-1.5B) across six target languages, yielding performance gains of 3.79%, 3.99%, and 1.95%, respectively, with additional improvements after data augmentation.
pdf
bib
abs
SemVink: Advancing VLMs’ Semantic Understanding of Optical Illusions via Visual Global Thinking
Sifan Li
|
Yujun Cai
|
Yiwei Wang
Vision-language models (VLMs) excel in semantic tasks but falter at a core human capability: detecting hidden content in optical illusions or AI-generated images through perceptual adjustments like zooming. We introduce HC-Bench, a benchmark of 112 images with hidden texts, objects, and illusions, revealing that leading VLMs achieve near-zero accuracy (0–5.36%) even with explicit prompting. Humans resolve such ambiguities instinctively, yet VLMs fail due to an overreliance on high-level semantics. Strikingly, we propose SemVink (Semantic Visual Thinking), which simply scales images to low resolutions and unlocks over 99% accuracy by eliminating redundant visual noise. This exposes a critical architectural flaw: VLMs prioritize abstract reasoning over low-level visual operations crucial for real-world robustness. Our work urges a shift toward hybrid models integrating multi-scale processing, bridging the gap between computational vision and human cognition for applications in medical imaging, security, and beyond.
pdf
bib
abs
Order Doesn’t Matter, But Reasoning Does: Training LLMs with Order-Centric Augmentation
Qianxi He
|
Qianyu He
|
Jiaqing Liang
|
Weikang Zhou
|
Zeye Sun
|
Fei Yu
|
Yanghua Xiao
Logical reasoning is essential for large language models (LLMs) to ensure accurate and coherent inference. However, LLMs struggle with reasoning order variations and fail to generalize across logically equivalent transformations, often relying on fixed sequential patterns rather than true logical understanding. To address this issue, we introduce an order-centric data augmentation framework based on commutativity in logical reasoning. We first randomly shuffle independent premises to introduce condition order augmentation. For reasoning steps, we construct a directed acyclic graph (DAG) to model dependencies between steps, which allows us to identify valid reorderings of steps while preserving logical correctness. By leveraging order-centric augmentations, models can develop a more flexible and generalized reasoning process. Finally, we conduct extensive experiments across multiple logical reasoning benchmarks, demonstrating that our method significantly enhances LLMs’ reasoning performance and adaptability to diverse logical structures. We release our code and augmented data at https://anonymous.4open.science/r/Order-Centric-Data-Augmentation-822C.
pdf
bib
abs
Type-Less yet Type-Aware Inductive Link Prediction with Pretrained Language Models
Alessandro De Bellis
|
Salvatore Bufi
|
Giovanni Servedio
|
Vito Walter Anelli
|
Tommaso Di Noia
|
Eugenio Di Sciascio
Inductive link prediction is emerging as a key paradigm for real-world knowledge graphs (KGs), where new entities frequently appear and models must generalize to them without retraining. Predicting links in a KG faces the challenge of guessing previously unseen entities by leveraging generalizable node features such as subgraph structure, type annotations, and ontological constraints. However, explicit type information is often lacking or incomplete. Even when available, type information in most KGs is often coarse-grained, sparse, and prone to errors due to human annotation. In this work, we explore the potential of pre-trained language models (PLMs) to enrich node representations with implicit type signals. We introduce TyleR, a Type-less yet type-awaRe approach for subgraph-based inductive link prediction that leverages PLMs for semantic enrichment. Experiments on standard benchmarks demonstrate that TyleR outperforms state-of-the-art baselines in scenarios with scarce type annotations and sparse graph connectivity. To ensure reproducibility, we share our code at https://github.com/sisinflab/tyler.
pdf
bib
abs
Extracting Linguistic Information from Large Language Models: Syntactic Relations and Derivational Knowledge
Tsedeniya Kinfe Temesgen
|
Marion Di Marco
|
Alexander Fraser
This paper presents a study of the linguistic knowledge and generalization capabilities of Large Language Models (LLMs), focusing on their morphosyntactic competence. We design three diagnostic tasks: (i) labeling syntactic information at the sentence level, identifying subjects, objects, and indirect objects; (ii) derivational decomposition at the word level, identifying morpheme boundaries and labeling the decomposed sequence; and (iii) an in-depth study of morphological decomposition in German and Amharic. We evaluate prompting strategies in GPT-4o and LLaMA 3.3-70B to extract different types of linguistic structure for typologically diverse languages. Our results show that GPT-4o consistently outperforms LLaMA in all tasks; however, both models exhibit limitations and show little evidence of abstract morphological rule learning. Importantly, we find strong evidence that the models fail to learn underlying morphological structures, raising serious doubts about their ability to generalize.
pdf
bib
abs
Beyond Correctness: Confidence-Aware Reward Modeling for Enhancing Large Language Model Reasoning
Qianxi He
|
Qingyu Ren
|
Shanzhe Lei
|
Xuhong Wang
|
Yingchun Wang
Recent advancements in large language models (LLMs) have shifted the post-training paradigm from traditional instruction tuning and human preference alignment toward reinforcement learning (RL) focused on reasoning capabilities. However, most current methods rely on rule-based evaluations of answer correctness, overlooking the importance of confidence-aware reasoning, especially for small to medium-sized models. These models often receive rewards for speculative answers without generating coherent reasoning chains. To address this limitation, we propose a novel confidence-based reward model tailored for enhancing STEM reasoning capabilities. Unlike conventional approaches, our model penalizes not only incorrect answers but also low-confidence correct responses, thereby promoting more robust and logically consistent reasoning. We validate the effectiveness of our approach through static evaluations, Best-of-N inference tests, and PPO-based RL training. Our method outperforms several state-of-the-art open-source reward models across diverse STEM benchmarks. We release our code and model at https://github.com/qianxiHe147/C2RM.
pdf
bib
abs
TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent
Dominik Meier
|
Jan Philip Wahle
|
Paul Röttger
|
Terry Ruas
|
Bela Gipp
As large language models (LLMs) become integrated into sensitive workflows, concerns grow over their potential to leak confidential information (“secrets”). We propose TrojanStego, a novel threat model in which an adversary fine-tunes an LLM to embed sensitive context information into natural-looking outputs via linguistic steganography, without requiring explicit control over inference inputs. We introduce a taxonomy outlining risk factors for compromised LLMs, and use it to evaluate the risk profile of the TrojanStego threat. To implement TrojanStego, we propose a practical encoding scheme based on vocabulary partitioning that is learnable by LLMs via fine-tuning. Experimental results show that compromised models reliably transmit 32-bit secrets with 87% accuracy on held-out prompts, reaching over 97% accuracy using majority voting across three generations. Further, the compromised LLMs maintain high utility and coherence, and can evade human detection. Our results highlight a new type of LLM data exfiltration attack that is covert, practical, and dangerous.
pdf
bib
abs
Frequency & Compositionality in Emergent Communication
Jean-Baptiste Sevestre
|
Emmanuel Dupoux
In natural languages, frequency and compositionality exhibit an inverse relationship: the most frequent words often resist regular patterns, developing idiosyncratic forms. This phenomenon, exemplified by irregular verbs, raises a compelling question: do artificial communication systems follow similar principles? Through systematic experiments with neural network agents in a referential game setting, and by manipulating input frequency through Zipfian distributions, we investigate whether these systems mirror the irregular-verb phenomenon, with messages referring to frequent objects developing less compositional structure than messages referring to rare ones. We establish that compositionality is not an inherent property of frequency itself, and we provide compelling evidence that limited data exposure, which frequency distributions naturally create, serves as a fundamental driver for the emergence of compositional structure in communication systems, offering insights into the cognitive and computational pressures that shape linguistic systems.
pdf
bib
abs
Summarizing Speech: A Comprehensive Survey
Fabian Retkowski
|
Maike Züfle
|
Andreas Sudmann
|
Dinah Pfau
|
Shinji Watanabe
|
Jan Niehues
|
Alexander Waibel
Speech summarization has become an essential tool for efficiently managing and accessing the growing volume of spoken and audiovisual content. However, despite its increasing importance, speech summarization remains loosely defined. The field intersects with several research areas, including speech recognition, text summarization, and specific applications like meeting summarization. This survey not only examines existing datasets and evaluation protocols, which are crucial for assessing the quality of summarization approaches, but also synthesizes recent developments in the field, highlighting the shift from traditional systems to advanced models like fine-tuned cascaded architectures and end-to-end solutions. In doing so, we surface the ongoing challenges, such as the need for realistic evaluation benchmarks, multilingual datasets, and long-context handling.
pdf
bib
abs
CogDual: Enhancing Dual Cognition of LLMs via Reinforcement Learning with Implicit Rule-Based Rewards
Cheng Liu
|
Yifei Lu
|
Fanghua Ye
|
Jian Li
|
Xingyu Chen
|
Feiliang Ren
|
Zhaopeng Tu
|
Xiaolong Li
Role-Playing Language Agents (RPLAs) have emerged as a significant application direction for Large Language Models (LLMs). Existing approaches typically rely on prompt engineering or supervised fine-tuning to enable models to imitate character behaviors in specific scenarios, but often neglect the underlying cognitive mechanisms driving these behaviors. Inspired by cognitive psychology, we introduce CogDual, a novel RPLA adopting a cognize-then-respond reasoning paradigm. By jointly modeling external situational awareness and internal self-awareness, CogDual generates responses with improved character consistency and contextual alignment. To further optimize the performance, we employ reinforcement learning with two general-purpose reward schemes designed for open-domain text generation. Extensive experiments on the CoSER benchmark, as well as Cross-MR and LifeChoice, demonstrate that CogDual consistently outperforms existing baselines and generalizes effectively across diverse role-playing tasks.
pdf
bib
abs
Assay2Mol: Large Language Model-based Drug Design Using BioAssay Context
Yifan Deng
|
Spencer S Ericksen
|
Anthony Gitter
Scientific databases aggregate vast amounts of quantitative data alongside descriptive text. In biochemistry, chemical screening assays evaluate the functional responses of candidate compounds against disease targets. The unstructured text that describes the biological mechanisms through which these targets operate, experimental screening protocols, and other attributes of assays offers rich information for new drug discovery campaigns, but it has remained untapped because of its unstructured format. We present Assay2Mol, a large language model-based workflow that can capitalize on the vast existing biochemical screening assays for early-stage drug discovery. Assay2Mol retrieves existing assay records involving targets similar to the new target and generates candidate compounds using in-context learning with the retrieved assay screening data. Assay2Mol outperforms recent machine learning approaches that generate candidate ligand compounds for target protein structures, while also promoting more synthesizable molecule generation.
pdf
bib
abs
Frame First, Then Extract: A Frame-Semantic Reasoning Pipeline for Zero-Shot Relation Triplet Extraction
Zehan Li
|
Fu Zhang
|
Wenqing Zhang
|
Jiawei Li
|
Zhou Li
|
Jingwei Cheng
|
Tianyue Peng
Large Language Models (LLMs) have shown impressive capabilities in language understanding and generation, leading to growing interest in zero-shot relation triplet extraction (ZeroRTE), a task that aims to extract triplets for unseen relations without annotated data. However, existing methods typically depend on costly fine-tuning and lack the structured semantic guidance required for accurate and interpretable extraction. To overcome these limitations, we propose FrameRTE, a novel ZeroRTE framework that adopts a “frame first, then extract” paradigm. Rather than extracting triplets directly, FrameRTE first constructs high-quality Relation Semantic Frames (RSFs) through a unified pipeline that integrates frame retrieval, synthesis, and enhancement. These RSFs serve as structured and interpretable knowledge scaffolds that guide frozen LLMs in the extraction process. Building upon these RSFs, we further introduce a human-inspired three-stage reasoning pipeline consisting of semantic frame evocation, frame-guided triplet extraction, and core frame elements validation to achieve semantically constrained extraction. Experiments demonstrate that FrameRTE achieves competitive zero-shot performance on multiple benchmarks. Moreover, the RSFs we construct serve as high-quality semantic resources that can enhance other extraction methods, showcasing the synergy between linguistic knowledge and foundation models.
pdf
bib
abs
MrGuard: A Multilingual Reasoning Guardrail for Universal LLM Safety
Yahan Yang
|
Soham Dan
|
Shuo Li
|
Dan Roth
|
Insup Lee
Large Language Models (LLMs) are susceptible to adversarial attacks such as jailbreaking, which can elicit harmful or unsafe behaviors. This vulnerability is exacerbated in multilingual settings, where multilingual safety-aligned data is often limited. Thus, developing a guardrail capable of detecting and filtering unsafe content across diverse languages is critical for deploying LLMs in real-world applications. In this work, we introduce a multilingual guardrail with reasoning for prompt classification. Our method consists of: (1) synthetic multilingual data generation incorporating culturally and linguistically nuanced variants, (2) supervised fine-tuning, and (3) a curriculum-based Group Relative Policy Optimization (GRPO) framework that further improves performance. Experimental results demonstrate that our multilingual guardrail, MrGuard, consistently outperforms recent baselines across both in-domain and out-of-domain languages by more than 15%. We also evaluate MrGuard’s robustness to multilingual variations, such as code-switching and low-resource language distractors in the prompt, and demonstrate that it preserves safety judgments under these challenging conditions. The multilingual reasoning capability of our guardrail enables it to generate explanations, which are particularly useful for understanding language-specific risks and ambiguities in multilingual content moderation.
pdf
bib
abs
TALON: A Multi-Agent Framework for Long-Table Exploration and Question Answering
Ruochun Jin
|
Xiyue Wang
|
Dong Wang
|
Haoqi Zheng
|
Yunpeng Qi
|
Silin Yang
|
Meng Zhang
Table question answering (TQA) requires accurate retrieval and reasoning over tabular data. Existing approaches attempt to retrieve query-relevant content before leveraging large language models (LLMs) to reason over long tables. However, these methods often fail to accurately retrieve contextually relevant data which results in information loss, and suffer from excessive encoding overhead. In this paper, we propose TALON, a multi-agent framework designed for question answering over long tables. TALON features a planning agent that iteratively invokes a tool agent to access and manipulate tabular data based on intermediate feedback, which progressively collects necessary information for answer generation, while a critic agent ensures accuracy and efficiency in tool usage and planning. In order to comprehensively assess the effectiveness of TALON, we introduce two benchmarks derived from the WikiTableQuestion and BIRD-SQL datasets, which contain tables ranging from 50 to over 10,000 rows. Experiments demonstrate that TALON achieves average accuracy improvements of 7.5% and 12.0% across all language models, establishing a new state-of-the-art in long-table question answering. Our code is publicly available at: https://github.com/Wwestmoon/TALON.
pdf
bib
abs
You Are What You Train: Effects of Data Composition on Training Context-aware Machine Translation Models
Pawel Maka
|
Yusuf Can Semerci
|
Jan Scholtes
|
Gerasimos Spanakis
Achieving human-level translations requires leveraging context to ensure coherence and handle complex phenomena like pronoun disambiguation. Sparsity of contextually rich examples in the standard training data has been hypothesized as the reason for the difficulty of context utilization. In this work, we systematically validate this claim in both single- and multilingual settings by constructing training datasets with controlled proportions of contextually relevant examples. We demonstrate a strong association between training data sparsity and model performance, confirming sparsity as a key bottleneck. Importantly, we reveal that improvements in one contextual phenomenon do not generalize to others. While we observe some cross-lingual transfer, it is not significantly higher between languages within the same sub-family. Finally, we propose and empirically evaluate two training strategies designed to better leverage the available data. These strategies improve context utilization, resulting in accuracy gains of up to 6 and 8 percentage points on the ctxPro evaluation in single- and multilingual settings, respectively.
pdf
bib
abs
Improving Neutral Point-of-View Generation with Data- and Parameter-Efficient RL
Jessica Hoffmann
|
Christiane Ahlheim
|
Zac Yu
|
Aria Walfrand
|
Jarvis Jin
|
Marie Tano
|
Ahmad Beirami
|
Erin MacMurray van Liemt
|
Nithum Thain
|
Hakim Sidahmed
|
Lucas Dixon
The paper shows that parameter-efficient reinforcement learning (PE-RL) is a highly effective training regime for improving large language models’ (LLMs) ability to answer queries on sensitive topics with a Neutral Point of View (NPOV), i.e. to provide significantly more informative, diverse and impartial answers. This is shown by evaluating PE-RL against multiple strong baselines—including LoRA finetuning (strongest baseline), SFT and RLHF. PE-RL not only improves on overall NPOV quality compared to the strongest baseline (97.06% → 99.08%), but also scores much higher on features linguists identify as key to separating good answers from the best answers (60.25% → 85.21% for presence of supportive details, 68.74% → 91.43% for absence of oversimplification). A qualitative analysis corroborates this. Finally, our evaluation finds no statistical differences between results on topics that appear in the training dataset and those on separate evaluation topics, which provides strong evidence that our approach to training PE-RL exhibits very effective out-of-topic generalization. To enable this study and further future studies, we also release the dataset, SHQ-NPOV, and provide a methodology to create such datasets through iterative rounds of human peer-critique and annotator training.
pdf
bib
abs
Randomized Smoothing Meets Vision-Language Models
Emmanouil Seferis
|
Changshun Wu
|
Stefanos Kollias
|
Saddek Bensalem
|
Chih-Hong Cheng
Randomized smoothing (RS) is one of the prominent techniques to ensure the correctness of machine learning models, where point-wise robustness certificates can be derived analytically. While RS is well understood for classification, its application to generative models is unclear, since their outputs are sequences rather than labels. We resolve this by connecting generative outputs to an oracle classification task and showing that RS can still be enabled: the final response can be classified as a discrete action (e.g., service-robot commands in VLAs), as harmful vs. harmless (content moderation or toxicity detection in VLMs), or even applying oracles to cluster answers into semantically equivalent ones. Provided that the error rate for the oracle classifier comparison is bounded, we develop the theory that associates the number of samples with the corresponding robustness radius. We further derive improved scaling laws analytically relating the certified radius and accuracy to the number of samples, showing that the earlier result of 2 to 3 orders of magnitude fewer samples sufficing with minimal loss remains valid even under weaker assumptions. Together, these advances make robustness certification both well-defined and computationally feasible for state-of-the-art VLMs, as validated against recent jailbreak-style adversarial attacks.
pdf
bib
abs
PIIvot: A Lightweight NLP Anonymization Framework for Question-Anchored Tutoring Dialogues
Matthew Zent
|
Digory Smith
|
Simon Woodhead
Personally identifiable information (PII) anonymization is a high-stakes task that poses a barrier to many open-science data sharing initiatives. While PII identification has made large strides in recent years, in practice, error thresholds and the recall/precision trade-off still limit the uptake of these anonymization pipelines. We present PIIvot, a lighter-weight framework for PII anonymization that leverages knowledge of the data context to simplify the PII detection problem. To demonstrate its effectiveness, we also contribute QATD_2k, the largest open-source real-world tutoring dataset of its kind, to support the demand for quality educational dialogue data.
pdf
bib
abs
Trustworthy Medical Question Answering: An Evaluation-Centric Survey
Yinuo Wang
|
Baiyang Wang
|
Robert Mercer
|
Frank Rudzicz
|
Sudipta Singha Roy
|
Pengjie Ren
|
Zhumin Chen
|
Xindi Wang
Trustworthiness in healthcare question-answering (QA) systems is important for ensuring patient safety, clinical effectiveness, and user confidence. As large language models (LLMs) become increasingly integrated into medical settings, the reliability of their responses directly influences clinical decision-making and patient outcomes. However, achieving comprehensive trustworthiness in medical QA poses significant challenges due to the inherent complexity of healthcare data, the critical nature of clinical scenarios, and the multifaceted dimensions of trustworthy AI. In this survey, we systematically examine six key dimensions of trustworthiness in medical QA, i.e., Factuality, Robustness, Fairness, Safety, Explainability, and Calibration. We review how each dimension is evaluated in existing LLM-based medical QA systems. We compile and compare major benchmarks designed to assess these dimensions and analyze evaluation-guided techniques that drive model improvements, such as retrieval-augmented grounding, adversarial fine-tuning, and safety alignment. Finally, we identify open challenges—such as scalable expert evaluation, integrated multi-dimensional metrics, and real-world deployment studies—and propose future research directions to advance the safe, reliable, and transparent deployment of LLM-powered medical QA.
pdf
bib
abs
Unpacking Let Alone: Human-Scale Models Generalize to a Rare Construction in Form but not Meaning
Wesley Scivetti
|
Tatsuya Aoyama
|
Ethan Wilcox
|
Nathan Schneider
Humans have a remarkable ability to acquire and understand grammatical phenomena that are seen rarely, if ever, during childhood. Recent evidence suggests that language models with human-scale pretraining data may possess a similar ability by generalizing from frequent to rare constructions. However, it remains an open question how widespread this generalization ability is, and to what extent this knowledge extends to meanings of rare constructions, as opposed to just their forms. We fill this gap by testing human-scale transformer language models on their knowledge of both the form and meaning of the (rare and quirky) English Let-Alone construction. To evaluate our LMs we construct a bespoke synthetic benchmark that targets syntactic and semantic properties of the construction. We find that human-scale LMs are sensitive to form, even when related constructions are filtered from the dataset. However, human-scale LMs do not make correct generalizations about Let-Alone’s meaning. These results point to an asymmetry in the current architectures’ sample efficiency between language form and meaning, something which is not present in human language learners.
pdf
bib
abs
BOUQuET : dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation
Pierre Andrews
|
Mikel Artetxe
|
Mariano Coria Meglioli
|
Marta R. Costa-jussà
|
Joe Chuang
|
David Dale
|
Mark Duppenthaler
|
Nathanial Paul Ekberg
|
Cynthia Gao
|
Daniel Edward Licht
|
Jean Maillard
|
Alexandre Mourachko
|
Christophe Ropers
|
Safiyyah Saleem
|
Eduardo Sánchez
|
Ioannis Tsiamas
|
Arina Turkatenko
|
Albert Ventayol-Boada
|
Shireen Yates
BOUQuET is a multi-way, multicentric and multi-register/domain dataset and benchmark, and a broader collaborative initiative. This dataset is handcrafted in 8 non-English languages (i.e. Egyptian Arabic and Modern Standard Arabic, French, German, Hindi, Indonesian, Mandarin Chinese, Russian, and Spanish). Each of these source languages is representative of the most widely spoken languages and therefore has the potential to serve as a pivot language that will enable more accurate translations. The dataset is multicentric to enforce representation of multilingual language features. In addition, the dataset goes beyond the sentence level, as it is organized in paragraphs of various lengths. Compared with related machine translation datasets, we show that BOUQuET has a broader representation of domains while simplifying the translation task for non-experts. Therefore, BOUQuET is especially suitable for crowd-sourced extension, for which we are launching a call aiming to collect a multi-way parallel corpus covering any written language. The dataset is freely available at https://huggingface.co/datasets/facebook/bouquet.
pdf
bib
abs
HealthCards: Exploring Text-to-Image Generation as Visual Aids for Healthcare Knowledge Democratizing and Education
Qian Wu
|
Zheyao Gao
|
Longfei Gou
|
Yifan Hou
|
Ann Sin Nga Lau
|
Qi Dou
The evolution of text-to-image (T2I) generation techniques has introduced new capabilities for information visualization, with the potential to advance knowledge democratization and education. In this paper, we investigate how T2I models can be adapted to generate educational health knowledge contents, exploring their potential to make healthcare information more visually accessible and engaging. We explore methods to harness recent T2I models for generating health knowledge flashcards—visual educational aids that present healthcare information through appealing and concise imagery. To support this goal, we curated a diverse, high-quality healthcare knowledge flashcard dataset containing 2,034 samples sourced from credible medical resources. We further validate the effectiveness of fine-tuning open-source models with our dataset, demonstrating their promise as specialized health flashcard generators. Our code and dataset are available at: https://github.com/med-air/HealthCards.
pdf
bib
abs
When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs
Ammar Khairi
|
Daniel D’souza
|
Ye Shen
|
Julia Kreutzer
|
Sara Hooker
Recent advancements in large language models (LLMs) have shifted focus toward scaling inference-time compute—improving performance without retraining the model. A common approach is to sample multiple outputs in parallel, and select one of these as the final output. While existing work has focused on English and specific domains, we study how to robustly scale inference-time compute in a multilingual, multi-task setting: spanning open-ended generations, math and translation tasks, for open models at 8B and 111B scale, across seven languages. Our findings highlight the need for tailored sampling and selection strategies. We propose novel solutions tailored for this multi-faceted inference scenario, demonstrating notable gains across languages and tasks. Our methods achieve an average +6.8 jump in win-rates for 8B models on m-ArenaHard-v2.0 prompts in non-English languages against proprietary models like Gemini. At larger scale, our 111B model shows a +9.0 improvement with just five samples compared to single-sample decoding. These results emphasize the importance of language- and task-aware approaches to democratize inference-time improvements.
pdf
bib
abs
Creativity in LLM-based Multi-Agent Systems: A Survey
Yi-Cheng Lin
|
Kang-Chieh Chen
|
Zhe-Yan Li
|
Tzu-Heng Wu
|
Tzu-Hsuan Wu
|
Kuan-Yu Chen
|
Hung-yi Lee
|
Yun-Nung Chen
Large language model (LLM)-driven multi-agent systems (MAS) are transforming how humans and AIs collaboratively generate ideas and artifacts. While existing surveys provide comprehensive overviews of MAS infrastructures, they largely overlook the dimension of creativity, including how novel outputs are generated and evaluated, how creativity informs agent personas, and how creative workflows are coordinated. This is the first survey dedicated to creativity in MAS. We focus on text and image generation tasks, and present: (1) a taxonomy of agent proactivity and persona design; (2) an overview of generation techniques, including divergent exploration, iterative refinement, and collaborative synthesis, as well as relevant datasets and evaluation metrics; and (3) a discussion of key challenges, such as inconsistent evaluation standards, insufficient bias mitigation, coordination conflicts, and the lack of unified benchmarks. This survey offers a structured framework and roadmap for advancing the development, evaluation, and standardization of creative MAS.
pdf
bib
abs
Context and POS in Action: A Comparative Study of Chinese Homonym Disambiguation in Human and Language Models
Xie Chenwei
|
Matthew King-Hang Ma
|
Wenbo Wang
|
William Shiyuan Wang
Ambiguity is pervasive in language, yet we resolve it effortlessly and unconsciously, often aided by context and part-of-speech (POS) cues. This study investigates how context similarity and POS influence homonym disambiguation in humans and large language models (LLMs). To enable comparable analyses between humans and LLMs, we first built an expert-curated sentence-pair dataset, manipulating context similarity and homonym POS categories (nouns vs. verbs). Participants (n = 55) and LLMs (via prompting) were asked to rate the sense similarity of target homonyms embedded within each sentence on a 7-point Likert scale. We found that context similarity influenced both groups similarly, but only humans utilized POS information, likely contributing to their superior performance. Model-derived metrics (surprisal, entropy) predicted human reaction times, and angular similarity between homonym representations accounted for additional variance, highlighting the roles of both expectation-based and semantic processes. Psycholinguistic factors like age of acquisition affected only human responses, underscoring distinct language acquisition mechanisms. Together, our findings illustrate how context and POS information interactively shape homonym resolution in humans, while exposing the limitations of current language models in capturing these nuanced processes. Dataset and codes are publicly available at https://github.com/neurothew/context-and-pos-in-action.
pdf
bib
abs
Attacking Misinformation Detection Using Adversarial Examples Generated by Language Models
Piotr Przybyła
|
Euan McGill
|
Horacio Saggion
Large language models have many beneficial applications, but can they also be used to attack content-filtering algorithms in social media platforms? We investigate the challenge of generating adversarial examples to test the robustness of text classification algorithms detecting low-credibility content, including propaganda, false claims, rumours and hyperpartisan news. We focus on simulation of content moderation by setting realistic limits on the number of queries an attacker is allowed to attempt. Within our solution (TREPAT), initial rephrasings are generated by large language models with prompts inspired by meaning-preserving NLP tasks, such as text simplification and style transfer. Subsequently, these modifications are decomposed into small changes and applied through a beam search procedure until the victim classifier changes its decision. We perform (1) quantitative evaluation using various prompts, models and query limits, (2) targeted manual assessment of the generated text and (3) qualitative linguistic analysis. The results confirm the superiority of our approach in the constrained scenario, especially in the case of long input texts (news articles), where exhaustive search is not feasible.
pdf
bib
abs
Leveraging Loanword Constraints for Improving Machine Translation in a Low-Resource Multilingual Context
Felermino D. M. A. Ali
|
Henrique Lopes Cardoso
|
Rui Sousa-Silva
This research investigates how to improve machine translation systems for low-resource languages by integrating loanword constraints as external linguistic knowledge. Focusing on the Portuguese-Emakhuwa language pair, which exhibits significant lexical borrowing, we address the challenge of effectively adapting loanwords during the translation process. To tackle this, we propose a novel approach that augments source sentences with loanword constraints, explicitly linking source-language loanwords to their target-language equivalents. Then, we perform supervised fine-tuning on multilingual neural machine translation models and multiple Large Language Models of different sizes. Our results demonstrate that incorporating loanword constraints leads to significant improvements in translation quality as well as in handling loanword adaptation correctly in target languages, as measured by different machine translation metrics. This approach offers a promising direction for improving machine translation performance in low-resource settings characterized by frequent lexical borrowing.
pdf
bib
abs
Linguistic Neuron Overlap Patterns to Facilitate Cross-lingual Transfer on Low-resource Languages
Yuemei Xu
|
Kexin Xu
|
Jian Zhou
|
Ling Hu
|
Lin Gui
Current Large Language Models (LLMs) face significant challenges in improving their performance on low-resource languages and urgently need data-efficient methods that do not require costly fine-tuning. From a language-bridge perspective, we propose a simple yet effective method, namely BridgeX-ICL, to improve zero-shot Cross-lingual In-Context Learning (X-ICL) for low-resource languages. Unlike existing works focusing on language-specific neurons, BridgeX-ICL explores whether shared neurons can improve cross-lingual performance in LLMs. We construct neuron probe data from the ground-truth MUSE bilingual dictionaries, and accordingly define a subset of language-overlap neurons to ensure full activation of these anchored neurons. Subsequently, we propose an HSIC-based metric to quantify LLMs’ internal linguistic spectrum based on overlapping neurons, guiding optimal bridge selection. Experiments conducted on 4 cross-lingual tasks and 15 language pairs from 7 diverse families, covering both high-low and moderate-low pairs, validate the effectiveness of BridgeX-ICL and offer empirical insights into the underlying multilingual mechanisms of LLMs. The code is publicly available at https://github.com/xuyuemei/BridgeX-ICL.
pdf
bib
abs
Scaling Low-Resource MT via Synthetic Data Generation with LLMs
Ona de Gibert
|
Joseph Attieh
|
Teemu Vahtola
|
Mikko Aulamo
|
Zihao Li
|
Raúl Vázquez
|
Tiancheng Hu
|
Jörg Tiedemann
We investigate the potential of LLM-generated synthetic data for improving low-resource Machine Translation (MT). Focusing on seven diverse target languages, we construct a document-level synthetic corpus from English Europarl, and extend it via pivoting to 147 additional language pairs. Automatic and human evaluation confirm its overall high quality. We study its practical application by (i) identifying effective training regimes, (ii) comparing our data with the HPLT dataset, (iii) studying the effect of varying training data size, and (iv) testing its utility beyond English-centric MT. Finally, we introduce SynOPUS, a public repository for synthetic parallel datasets. Our findings show that LLM-generated synthetic data, even when noisy, can substantially improve MT performance for low-resource languages.
pdf
bib
abs
Tailoring Table Retrieval from a Field-aware Hybrid Matching Perspective
Da Li
|
Keping Bi
|
Jiafeng Guo
|
Xueqi Cheng
Table retrieval, essential for accessing information through tabular data, is less explored compared to text retrieval. The row/column structure and distinct fields of tables (including titles, headers, and cells) present unique challenges. For example, different table fields have varying matching preferences: cells may favor finer-grained (word/phrase level) matching over broader (sentence/passage level) matching due to their fragmented and detailed nature, unlike titles. This necessitates a table-specific retriever to accommodate the various matching needs of each table field. Therefore, we introduce a Table-tailored HYbrid Matching rEtriever (THYME), which approaches table retrieval from a field-aware hybrid matching perspective. Empirical results on two table retrieval benchmarks, NQ-TABLES and OTT-QA, show that THYME significantly outperforms state-of-the-art baselines. Comprehensive analyses have confirmed the differing matching preferences across table fields and validated the efficacy of THYME.
pdf
bib
abs
Randomly Removing 50% of Dimensions in Text Embeddings has Minimal Impact on Retrieval and Classification Tasks
Sotaro Takeshita
|
Yurina Takeshita
|
Daniel Ruffinelli
|
Simone Paolo Ponzetto
In this paper, we study the surprising impact that truncating text embeddings has on downstream performance. We consistently observe across 6 state-of-the-art text encoders and 26 downstream tasks, that randomly removing up to 50% of embedding dimensions results in only a minor drop in performance, less than 10%, in retrieval and classification tasks. Given the benefits of using smaller-sized embeddings, as well as the potential insights about text encoding, we study this phenomenon and find that, contrary to what is suggested in prior work, this is not the result of an ineffective use of representation space. Instead, we find that a large number of uniformly distributed dimensions actually cause an increase in performance when removed. This would explain why, on average, removing a large number of embedding dimensions results in a marginal drop in performance. We make similar observations when truncating the embeddings used by large language models to make next-token predictions on generative tasks, suggesting that this phenomenon is not isolated to classification or retrieval tasks.
pdf
bib
abs
Morables: A Benchmark for Assessing Abstract Moral Reasoning in LLMs with Fables
Matteo Marcuzzo
|
Alessandro Zangari
|
Andrea Albarelli
|
Jose Camacho-Collados
|
Mohammad Taher Pilehvar
As LLMs excel on standard reading comprehension benchmarks, attention is shifting toward evaluating their capacity for complex abstract reasoning and inference. Literature-based benchmarks, with their rich narrative and moral depth, provide a compelling framework for evaluating such deeper comprehension skills. Here, we present Morables, a human-verified benchmark built from fables and short stories drawn from historical literature. The main task is structured as multiple-choice questions targeting moral inference, with carefully crafted distractors that challenge models to go beyond shallow, extractive question answering. To further stress-test model robustness, we introduce adversarial variants designed to surface LLM vulnerabilities and shortcuts due to issues such as data contamination. Our findings show that, while larger models outperform smaller ones, they remain susceptible to adversarial manipulation and often rely on superficial patterns rather than true moral reasoning. This brittleness results in significant self-contradiction, with the best models refuting their own answers in roughly 20% of cases depending on the framing of the moral choice. Interestingly, reasoning-enhanced models fail to bridge this gap, suggesting that scale - not reasoning ability - is the primary driver of performance.
pdf
bib
abs
MessIRve: A Large-Scale Spanish Information Retrieval Dataset
Francisco Valentini
|
Viviana Cotik
|
Damián Furman
|
Ivan Bercovich
|
Edgar Altszyler
|
Juan Manuel Pérez
Information retrieval (IR) is the task of finding relevant documents in response to a user query. Although Spanish is the second most spoken native language, there are few Spanish IR datasets, which limits the development of information access tools for Spanish speakers. We introduce MessIRve, a large-scale Spanish IR dataset with almost 700,000 queries from Google’s autocomplete API and relevant documents sourced from Wikipedia. MessIRve’s queries reflect diverse Spanish-speaking regions, unlike other datasets that are translated from English or do not consider dialectal variations. The large size of the dataset allows it to cover a wide variety of topics, unlike smaller datasets. We provide a comprehensive description of the dataset, comparisons with existing datasets, and baseline evaluations of prominent IR models. Our contributions aim to advance Spanish IR research and improve information access for Spanish speakers.
pdf
bib
abs
AFRIDOC-MT: Document-level MT Corpus for African Languages
Jesujoba Oluwadara Alabi
|
Israel Abebe Azime
|
Miaoran Zhang
|
Cristina España-Bonet
|
Rachel Bawden
|
Dawei Zhu
|
David Ifeoluwa Adelani
|
Clement Oyeleke Odoje
|
Idris Akinade
|
Iffat Maab
|
Davis David
|
Shamsuddeen Hassan Muhammad
|
Neo Putini
|
David O. Ademuyiwa
|
Andrew Caines
|
Dietrich Klakow
This paper introduces AFRIDOC-MT, a document-level multi-parallel translation dataset covering English and five African languages: Amharic, Hausa, Swahili, Yorùbá, and Zulu. The dataset comprises 334 health and 271 information technology news documents, all human-translated from English to these languages. We conduct document-level translation benchmark experiments by evaluating the ability of neural machine translation (NMT) models and large language models (LLMs) to translate between English and these languages, at both the sentence and pseudo-document levels, the outputs being realigned to form complete documents for evaluation. Our results indicate that NLLB-200 achieves the best average performance among the standard NMT models, while GPT-4o outperforms general-purpose LLMs. Fine-tuning selected models leads to substantial performance gains, but models trained on sentences struggle to generalize effectively to longer documents. Furthermore, our analysis reveals that some LLMs exhibit issues such as under-generation, over-generation, repetition of words and phrases, and off-target translations, specifically for translation into African languages.
pdf
bib
abs
Charting the Landscape of African NLP: Mapping Progress and Shaping the Road Ahead
Jesujoba Oluwadara Alabi
|
Michael A. Hedderich
|
David Ifeoluwa Adelani
|
Dietrich Klakow
With over 2,000 languages and potentially millions of speakers, Africa represents one of the richest linguistic regions in the world. Yet, this diversity is scarcely reflected in state-of-the-art natural language processing (NLP) systems and large language models (LLMs), which predominantly support a narrow set of high-resource languages. This exclusion not only limits the reach and utility of modern NLP technologies but also risks widening the digital divide across linguistic communities. Nevertheless, NLP research on African languages is active and growing. In recent years, there has been a surge of interest in this area, driven by several factors—including the creation of multilingual language resources, the rise of community-led initiatives, and increased support through funding programs. In this survey, we analyze 884 research papers on NLP for African languages published over the past five years, offering a comprehensive overview of recent progress across core tasks. We identify key trends shaping the field and conclude by outlining promising directions to foster more inclusive and sustainable NLP research for African languages.
pdf
bib
abs
GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them?
Yiyang Zhou
|
Linjie Li
|
Shi Qiu
|
Zhengyuan Yang
|
Yuyang Zhao
|
Siwei Han
|
Yangfan He
|
Kangqi Li
|
Haonian Ji
|
Zihao Zhao
|
Haibo Tong
|
Lijuan Wang
|
Huaxiu Yao
Existing video benchmarks often resemble image-based benchmarks, with question types like “What actions does the person perform throughout the video?” or “What color is the woman’s dress in the video?” For these, models can often answer by scanning just a few key frames, without deep temporal reasoning. This limits our ability to assess whether large vision-language models (LVLMs) can truly think with videos rather than perform superficial frame-level analysis. To address this, we introduce GLIMPSE, a benchmark specifically designed to evaluate whether LVLMs can genuinely think with videos. Unlike prior benchmarks, GLIMPSE emphasizes comprehensive video understanding beyond static image cues. It consists of 3,269 videos and over 4,342 highly visual-centric questions across 11 categories, including Trajectory Analysis, Temporal Reasoning, and Forensics Detection. All questions are carefully crafted by human annotators and require watching the entire video and reasoning over full video context—this is what we mean by thinking with video. These questions cannot be answered by scanning selected frames or relying on text alone. Human evaluators achieve 94.82% accuracy, but current LVLMs face significant challenges. Even the best-performing model, GPT-o3, reaches only 66.43%, highlighting that LVLMs still struggle to move beyond surface-level reasoning to truly think with videos. We publicly release our benchmark and code at https://github.com/aiming-lab/GLIMPSE.
pdf
bib
abs
Social Bias in Multilingual Language Models: A Survey
Lance Calvin Lim Gamboa
|
Yue Feng
|
Mark G. Lee
Pretrained multilingual models exhibit the same social bias as models processing English texts. This systematic review analyzes emerging research that extends bias evaluation and mitigation approaches into multilingual and non-English contexts. We examine these studies with respect to linguistic diversity, cultural awareness, and their choice of evaluation metrics and mitigation techniques. Our survey illuminates gaps in the field’s dominant methodological design choices (e.g., preference for certain languages, scarcity of multilingual mitigation experiments) while cataloging common issues encountered and solutions implemented in adapting bias benchmarks across languages and cultures. Drawing from the implications of our findings, we chart directions for future research that can reinforce the multilingual bias literature’s inclusivity, cross-cultural appropriateness, and alignment with state-of-the-art NLP advancements.
pdf
bib
abs
BYOKG-RAG: Multi-Strategy Graph Retrieval for Knowledge Graph Question Answering
Costas Mavromatis
|
Soji Adeshina
|
Vassilis N. Ioannidis
|
Zhen Han
|
Qi Zhu
|
Ian Robinson
|
Bryan Thompson
|
Huzefa Rangwala
|
George Karypis
Knowledge graph question answering (KGQA) presents significant challenges due to the structural and semantic variations across input graphs. Existing works rely on Large Language Model (LLM) agents for graph traversal and retrieval; an approach that is sensitive to traversal initialization, as it is prone to entity linking errors and may not generalize well to custom (“bring-your-own”) KGs. We introduce BYOKG-RAG, a framework that enhances KGQA by synergistically combining LLMs with specialized graph retrieval tools. In BYOKG-RAG, LLMs generate critical graph artifacts (question entities, candidate answers, reasoning paths, and OpenCypher queries), and graph tools link these artifacts to the KG and retrieve relevant graph context. The retrieved context enables the LLM to iteratively refine its graph linking and retrieval, before final answer generation. By retrieving context from different graph tools, BYOKG-RAG offers a more general and robust solution for QA over custom KGs. Through experiments on five benchmarks spanning diverse KG types, we demonstrate that BYOKG-RAG outperforms the second-best graph retrieval method by 4.5 percentage points while showing better generalization to custom KGs. The BYOKG-RAG framework is open-sourced at https://github.com/awslabs/graphrag-toolkit.
pdf
bib
abs
Synth-SBDH: A Synthetic Dataset of Social and Behavioral Determinants of Health for Clinical Text
Avijit Mitra
|
Zhichao Yang
|
Emily Druhl
|
Raelene Goodwin
|
Hong Yu
Social and behavioral determinants of health (SBDH) play a crucial role in health outcomes and are frequently documented in clinical text. Automatically extracting SBDH information from clinical text relies on publicly available good-quality datasets. However, existing SBDH datasets exhibit substantial limitations in their availability and coverage. In this study, we introduce Synth-SBDH, a novel synthetic dataset with detailed SBDH annotations, encompassing status, temporal information, and rationale across 15 SBDH categories. We showcase the utility of Synth-SBDH on three tasks using real-world clinical datasets from two distinct hospital settings, highlighting its versatility, generalizability, and distillation capabilities. Models trained on Synth-SBDH consistently outperform counterparts with no Synth-SBDH training, achieving up to 63.75% macro-F improvements. Additionally, Synth-SBDH proves effective for rare SBDH categories and under resource constraints while being substantially cheaper than expert-annotated real-world data. Human evaluation reveals a 71.06% Human-LLM alignment and uncovers areas for future refinements.
pdf
bib
abs
Pun Unintended: LLMs and the Illusion of Humor Understanding
Alessandro Zangari
|
Matteo Marcuzzo
|
Andrea Albarelli
|
Mohammad Taher Pilehvar
|
Jose Camacho-Collados
Puns are a form of humorous wordplay that exploits polysemy and phonetic similarity. While LLMs have shown promise in detecting puns, we show in this paper that their understanding often remains shallow, lacking the nuanced grasp typical of human interpretation. By systematically analyzing and reformulating existing pun benchmarks, we demonstrate how subtle changes in puns are sufficient to mislead LLMs. Our contributions include comprehensive and nuanced pun detection benchmarks, human evaluation across recent LLMs, and an analysis of the robustness challenges these models face in processing puns.
pdf
bib
abs
RACCooN: Versatile Instructional Video Editing with Auto-Generated Narratives
Jaehong Yoon
|
Shoubin Yu
|
Mohit Bansal
Recent video generative models primarily rely on detailed, labor-intensive text prompts for tasks like inpainting or style editing, limiting their adaptability to personal/raw videos. This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video editing method, supporting diverse video editing capabilities, such as removal, addition, and modification, through a unified pipeline. RACCooN consists of two principal stages: Video-to-Paragraph (V2P), which automatically generates structured video descriptions capturing both scene context and object details, and Paragraph-to-Video (P2V), where users (optionally) refine these descriptions to guide a video diffusion model for flexible content modifications, including removing, changing subjects, and/or adding new objects. Key contributions of RACCooN include: (1) A multi-granular spatiotemporal pooling strategy for structured video understanding, capturing both broad context and fine-grained details of major objects to enable precise text-based video editing without the need for complex human annotations. (2) A video generative model fine-tuned on our curated video-paragraph-mask dataset, which enhances editing and inpainting quality. (3) The capability to seamlessly generate new objects in videos by forecasting their movements through automatically generated mask planning. In the end, users can easily edit complex videos with RACCooN’s automatic explanations and guidance. We demonstrate its versatile capabilities in video-to-paragraph generation (up to 9.4%p absolute improvement in human evaluations) and video content editing (up to a 49.7% relative reduction in FVD), and show that it can be integrated with SoTA video generation models for further enhancement.
pdf
bib
abs
Pre-trained Models Perform the Best When Token Distributions Follow Zipf’s Law
Yanjin He
|
Qingkai Zeng
|
Meng Jiang
Tokenization is a fundamental step in natural language processing (NLP) and other sequence modeling domains, where the choice of vocabulary size significantly impacts model performance. Despite its importance, selecting an optimal vocabulary size remains underexplored, typically relying on heuristics or dataset-specific choices. In this work, we propose a principled method for determining the vocabulary size by analyzing token frequency distributions through Zipf’s law. We show that downstream task performance correlates with how closely token distributions follow power-law behavior, and that aligning with Zipfian scaling improves both model efficiency and effectiveness. Extensive experiments across NLP, genomics, and chemistry demonstrate that models consistently achieve peak performance when the token distribution closely adheres to Zipf’s law, establishing Zipfian alignment as a robust and generalizable criterion for vocabulary size selection. The code and data are available at: https://github.com/yanjinhe/Tokenizer
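The abstract ties vocabulary-size selection to how closely the resulting token distribution follows a power law. As a rough illustration of that idea (not the paper's exact procedure), the sketch below fits a line to log-frequency versus log-rank and reports the goodness of fit; the R² criterion and the toy corpus are assumptions for illustration only.

```python
# Minimal sketch: measure how closely a tokenized corpus follows Zipf's law.
# The goodness-of-fit criterion (R^2 of a log-log linear fit) is an illustrative
# assumption, not necessarily the metric used in the paper.
from collections import Counter
import numpy as np

def zipf_fit_r2(tokens):
    """Return (R^2, slope) of a linear fit of log(frequency) vs. log(rank)."""
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    x, y = np.log(ranks), np.log(freqs)
    slope, intercept = np.polyfit(x, y, 1)
    ss_res = np.sum((y - (slope * x + intercept)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot, slope

if __name__ == "__main__":
    corpus = "the cat sat on the mat and the dog sat on the log".split()
    r2, slope = zipf_fit_r2(corpus)
    print(f"Zipf fit R^2={r2:.3f}, exponent={-slope:.2f}")
    # One could tokenize a held-out corpus under several candidate vocabulary
    # sizes and prefer the vocabulary whose distribution adheres most closely
    # to a power law.
```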
pdf
bib
abs
Do RAG Systems Really Suffer From Positional Bias?
Florin Cuconasu
|
Simone Filice
|
Guy Horowitz
|
Yoelle Maarek
|
Fabrizio Silvestri
Retrieval Augmented Generation enhances LLM accuracy by adding passages retrieved from an external corpus to the LLM prompt. This paper investigates how positional bias - the tendency of LLMs to weight information differently based on its position in the prompt - affects not only the LLM’s capability to capitalize on relevant passages, but also its susceptibility to distracting passages. Through extensive experiments on three benchmarks, we show how state-of-the-art retrieval pipelines, while attempting to retrieve relevant passages, systematically bring highly distracting ones to the top ranks, with over 60% of queries containing at least one highly distracting passage among the top-10 retrieved passages. As a result, the impact of the LLM positional bias, which in controlled settings is often reported as very prominent by related works, is actually marginal in real scenarios since both relevant and distracting passages are, in turn, penalized. Indeed, our findings reveal that sophisticated strategies that attempt to rearrange the passages based on LLM positional preferences do not perform better than random shuffling.
pdf
bib
abs
Aspect-Oriented Summarization for Psychiatric Short-Term Readmission Prediction
WonJin Yoon
|
Boyu Ren
|
Spencer Thomas
|
Chanhwi Kim
|
Guergana K Savova
|
Mei-Hua Hall
|
Timothy A. Miller
Recent progress in large language models (LLMs) has enabled the automated processing of lengthy documents even without supervised training on a task-specific dataset. Yet, their zero-shot performance in complex tasks as opposed to straightforward information extraction tasks remains suboptimal. One feasible approach for tasks with lengthy, complex input is to first summarize the document and then apply supervised fine-tuning to the summary. However, the summarization process inevitably results in some loss of information. In this study we present a method for processing the summaries of long documents aimed to capture different important aspects of the original document. We hypothesize that LLM summaries generated with different aspect-oriented prompts contain different information signals, and we propose methods to measure these differences. We introduce approaches to effectively integrate signals from these different summaries for supervised training of transformer models. We validate our hypotheses on a high-impact task – 30-day readmission prediction from a psychiatric discharge – using real-world data from four hospitals, and show that our proposed method increases the prediction performance for the complex task of predicting patient outcome.
pdf
bib
abs
Adapting Bias Evaluation to Domain Contexts using Generative Models
Tamara Quiroga
|
Felipe Bravo-Marquez
|
Valentin Barriere
Numerous datasets have been proposed to evaluate social bias in Natural Language Processing (NLP) systems. However, assessing bias within specific application domains remains challenging, as existing approaches often face limitations in scalability and fidelity across domains. In this work, we introduce a domain-adaptive framework that utilizes prompting with Large Language Models (LLMs) to automatically transform template-based bias datasets into domain-specific variants. We apply our method to two widely used benchmarks—Equity Evaluation Corpus (EEC) and Identity Phrase Templates Test Set (IPTTS)—adapting them to the Twitter and Wikipedia Talk data. Our results show that the adapted datasets yield bias estimates more closely aligned with real-world data. These findings highlight the potential of LLM-based prompting to enhance the realism and contextual relevance of bias evaluation in NLP systems.
pdf
bib
abs
Emergent morpho-phonological representations in self-supervised speech models
Jon Gauthier
|
Canaan Breiss
|
Matthew K Leonard
|
Edward F. Chang
Self-supervised speech models can be trained to efficiently recognize spoken words in naturalistic, noisy environments. However, we do not understand the types of linguistic representations these models use to accomplish this task. To address this question, we study how S3M variants optimized for word recognition represent phonological and morphological phenomena in frequent English noun and verb inflections. We find that their representations exhibit a global linear geometry which can be used to link English nouns and verbs to their regular inflected forms. This geometric structure does not directly track phonological or morphological units. Instead, it tracks the regular distributional relationships linking many word pairs in the English lexicon—often, but not always, due to morphological inflection. These findings point to candidate representational strategies that may support human spoken word recognition, challenging the presumed necessity of distinct linguistic representations of phonology and morphology.
pdf
bib
abs
Multilingual Language Model Pretraining using Machine-translated Data
Jiayi Wang
|
Yao Lu
|
Maurice Weber
|
Max Ryabinin
|
David Ifeoluwa Adelani
|
Yihong Chen
|
Raphael Tang
|
Pontus Stenetorp
English, as a very high-resource language, enables the pretraining of high-quality large language models (LLMs). However, the same cannot be said for most other languages, likely due to a gap in the quality and diversity of available multilingual pretraining corpora. In this work, we find that documents machine-translated from a high-quality English corpus can contribute significantly to the pretraining quality of multilingual LLMs. Concretely, we translate FineWeb-Edu, a high-quality English web corpus, into nine languages, resulting in a 1.7-trillion-token corpus, which we call TransWebEdu, and pretrain a 1.3B-parameter model, TransWebLLM, from scratch on this corpus. Across non-English understanding and reasoning tasks, we show that TransWebLLM matches or even outperforms multilingual LLMs of similar size, including Llama3.2, Qwen2.5, and Gemma3, despite being trained on an order of magnitude less data. Moreover, we show that adding fewer than 5% of TransWebLLM’s training tokens as domain-specific data for continued pretraining yields state-of-the-art results in Arabic, Indonesian, Swahili, and Welsh for understanding and commonsense reasoning tasks. To promote reproducibility, we release our corpus and models under Open Source Initiative-approved licenses.
pdf
bib
abs
IntentionFrame: A Semi-Structured, Multi-Aspect Framework for Fine-Grained Conversational Intention Understanding
Jinggui Liang
|
Dung Vo
|
Lizi Liao
Understanding user intentions in multi-turn dialogues is critical for conversational AI, yet existing approaches—relying on rigid slot-value structures or unstructured free-text—fail to fully capture conversational complexity. In this paper, we propose IntentionFrame, a semi-structured framework inspired by psychological and cognitive intention theories, which organizes conversational intents into four interrelated aspects: situation, emotion, action, and knowledge. This design not only retains interpretability but also provides LLMs with a rich context to accurately parse and respond to nuanced user inputs. To efficiently scale IntentionFrame annotations, we introduce a Weakly-supervised Reinforced Generation (WeRG) method that leverages a small set of high-quality human annotations in conjunction with abundant coarsely labeled data. By applying reinforcement learning to balance these diverse signals, WeRG aims to effectively generate reliable IntentionFrame annotations, which serve as essential grounding for downstream tasks—leading to substantial improvements in response generation and task completion. Our experiments, supported by both automatic metrics and human evaluations, show that integrating IntentionFrame with WeRG significantly improves LLMs’ conversational understanding and sets a new benchmark for intent analysis.
pdf
bib
abs
Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning
Ziyang Wang
|
Jaehong Yoon
|
Shoubin Yu
|
Md Mohaiminul Islam
|
Gedas Bertasius
|
Mohit Bansal
Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and fine-tuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach to improve video reasoning capability with drastically improved data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Building on observations about the data scaling, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to utilize computational resources more efficiently, we introduce a sparse-to-dense video TTS strategy that improves inference by iteratively adding frames based on output consistency. We validate our approach on multiple video reasoning benchmarks, showing that Video-RTS surpasses existing video reasoning models by 2.4% in accuracy using only 3.6% training samples. Specifically, Video-RTS achieves a 4.2% improvement on Video-Holmes, a recent and challenging video reasoning benchmark. Notably, our pure RL training and adaptive video TTS offer complementary strengths, enabling Video-RTS’s strong reasoning performance.
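The sparse-to-dense test-time scaling idea described above lends itself to a simple control loop. The sketch below is a hedged illustration only: `ask_model` is a hypothetical stand-in for a video LLM call, and the majority-vote consistency check and frame-step schedule are assumptions rather than the paper's exact design.

```python
# Sketch of a sparse-to-dense video test-time scaling loop: start with a few
# frames, sample several answers, and add frames until the answers agree.
from collections import Counter
import random

def ask_model(frames, question, seed):
    # Placeholder: a real implementation would query a video LLM here.
    random.seed(seed + len(frames))
    return random.choice(["A", "B"]) if len(frames) < 8 else "A"

def video_tts(all_frames, question, start=4, step=4, samples=5, threshold=0.8):
    answer, num = None, start
    while num <= len(all_frames):
        # Uniformly subsample `num` frames from the full video.
        idx = [int(i * (len(all_frames) - 1) / max(num - 1, 1)) for i in range(num)]
        frames = [all_frames[i] for i in idx]
        answers = [ask_model(frames, question, seed=s) for s in range(samples)]
        answer, count = Counter(answers).most_common(1)[0]
        if count / samples >= threshold:   # answers agree: stop early
            return answer, num
        num += step                        # otherwise densify the frame set
    return answer, len(all_frames)

if __name__ == "__main__":
    print(video_tts(list(range(32)), "What happens at the end?"))
```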
pdf
bib
abs
Efficient Compositional Multi-tasking for On-device Large Language Models
Ondrej Bohdal
|
Mete Ozay
|
Jijoong Moon
|
Kyenghun Lee
|
Hyeonmok Ko
|
Umberto Michieli
Adapter parameters provide a mechanism to modify the behavior of machine learning models and have gained significant popularity in the context of large language models (LLMs) and generative AI. These parameters can be merged to support multiple tasks via a process known as task merging. However, prior work on merging in LLMs, particularly in natural language processing, has been limited to scenarios where each test example addresses only a single task. In this paper, we focus on on-device settings and study the problem of text-based compositional multi-tasking, where each test example involves the simultaneous execution of multiple tasks. For instance, generating a translated summary of a long text requires solving both translation and summarization tasks concurrently. To facilitate research in this setting, we propose a benchmark comprising four practically relevant compositional tasks. We also present an efficient method (Learnable Calibration) tailored for on-device applications, where computational resources are limited, emphasizing the need for solutions that are both resource-efficient and high-performing. Our contributions lay the groundwork for advancing the capabilities of LLMs in real-world multi-tasking scenarios, expanding their applicability to complex, resource-constrained use cases.
pdf
bib
abs
Improving Large Language Model Safety with Contrastive Representation Learning
Samuel Simko
|
Mrinmaya Sachan
|
Bernhard Schölkopf
|
Zhijing Jin
Large Language Models (LLMs) are powerful tools with profound societal impacts, yet their ability to generate responses to diverse and uncontrolled inputs leaves them vulnerable to adversarial attacks. While existing defenses often struggle to generalize across varying attack types, recent advancements in representation engineering offer promising alternatives. In this work, we propose a defense framework that formulates model defense as a contrastive representation learning (CRL) problem. Our method finetunes a model using a triplet-based loss combined with adversarial hard negative mining to encourage separation between benign and harmful representations. Our experimental results across multiple models demonstrate that our approach outperforms prior representation engineering-based defenses, improving robustness against both input-level and embedding-space attacks without compromising standard performance.
pdf
bib
abs
Leveraging What’s Overfixed: Post-Correction via LLM Grammatical Error Overcorrection
Taehee Park
|
Heejin Do
|
Gary Lee
Robust supervised fine-tuned small Language Models (sLMs) often show high reliability but tend to undercorrect. They achieve high precision at the cost of low recall. Conversely, Large Language Models (LLMs) often show the opposite tendency, making excessive overcorrection, leading to low precision. To effectively harness the strengths of LLMs to address the recall challenges in sLMs, we propose Post-Correction via Overcorrection (PoCO), a novel approach that strategically balances recall and precision. PoCO first intentionally triggers overcorrection via LLM to maximize recall by allowing comprehensive revisions, then applies a targeted post-correction step via fine-tuning smaller models to identify and refine erroneous outputs. We aim to harmonize both aspects by leveraging the generative power of LLMs while preserving the reliability of smaller supervised models. Our extensive experiments demonstrate that PoCO effectively balances GEC performance by increasing recall with competitive precision, ultimately improving the overall quality of grammatical error correction.
pdf
bib
abs
Scaling Up Temporal Domain Generalization via Temporal Experts Averaging
Aoming Liu
|
Kevin Miller
|
Venkatesh Saligrama
|
Kate Saenko
|
Boqing Gong
|
Ser-Nam Lim
|
Bryan A. Plummer
Temporal Domain Generalization (TDG) aims to generalize across temporal distribution shifts, e.g., lexical change over time. Prior work often addresses this by predicting future model weights. However, full model prediction is prohibitively expensive for even reasonably sized models. Thus, recent methods only predict the classifier layer, limiting generalization by failing to adjust other model components. To address this, we propose Temporal Expert Averaging (TEA), a novel and scalable TDG framework that updates the entire model using weight averaging to maximize generalization potential while minimizing computational costs. Our theoretical analysis guides us to two steps that enhance generalization to future domains. First, we create expert models with functional diversity yet parameter similarity by fine-tuning a domain-agnostic base model on individual temporal domains while constraining weight changes. Second, we optimize the bias-variance tradeoff through adaptive averaging coefficients derived from modeling temporal weight trajectories in a principal component subspace. Experts’ contributions are based on their projected proximity to future domains. Extensive experiments across 7 TDG benchmarks, 5 models, and 2 TDG settings show that TEA outperforms prior TDG methods by up to 69% while being up to 60x more efficient.
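To make the averaging step above concrete, here is a loose numpy sketch under stated assumptions: experts are flattened weight vectors, the "future" point is a per-component linear extrapolation in a principal-component subspace, and averaging coefficients come from a softmax over negative distances. None of these choices are taken from the paper; they only illustrate the general shape of proximity-weighted expert averaging.

```python
# Sketch: average temporal experts with coefficients based on proximity to an
# extrapolated "future" point in a principal-component subspace.
import numpy as np

def tea_average(expert_weights):
    """expert_weights: array of shape (T, D), one flattened expert per time step."""
    W = np.asarray(expert_weights, dtype=float)
    centered = W - W.mean(axis=0)
    # Principal components of the weight trajectory.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt.T                       # coordinates in the PC subspace, (T, k)
    t = np.arange(len(W), dtype=float)
    # Linearly extrapolate each PC coordinate one step into the future.
    future = np.array([np.polyval(np.polyfit(t, proj[:, j], 1), len(W))
                       for j in range(proj.shape[1])])
    dists = np.linalg.norm(proj - future, axis=1)
    coeffs = np.exp(-dists) / np.exp(-dists).sum()   # closer experts weigh more
    return coeffs @ W, coeffs

if __name__ == "__main__":
    experts = np.random.default_rng(0).normal(size=(5, 10)) + np.arange(5)[:, None] * 0.1
    merged, coeffs = tea_average(experts)
    print("averaging coefficients:", np.round(coeffs, 3))
```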
pdf
bib
abs
LinguaLens: Towards Interpreting Linguistic Mechanisms of Large Language Models via Sparse Auto-Encoder
Yi Jing
|
Zijun Yao
|
Hongzhu Guo
|
Lingxu Ran
|
Xiaozhi Wang
|
Lei Hou
|
Juanzi Li
Large language models (LLMs) demonstrate exceptional performance on tasks requiring complex linguistic abilities, such as reference disambiguation and metaphor recognition/generation. Although LLMs possess impressive capabilities, their internal mechanisms for processing and representing linguistic knowledge remain largely opaque. Prior research on linguistic mechanisms is limited by coarse granularity, limited analysis scale, and narrow focus. In this study, we propose LinguaLens, a systematic and comprehensive framework for analyzing the linguistic mechanisms of large language models, based on Sparse Auto-Encoders (SAEs). We extract a broad set of Chinese and English linguistic features across four dimensions—morphology, syntax, semantics, and pragmatics. By employing counterfactual methods, we construct a large-scale counterfactual dataset of linguistic features for mechanism analysis. Our findings reveal intrinsic representations of linguistic knowledge in LLMs, uncover patterns of cross-layer and cross-lingual distribution, and demonstrate the potential to control model outputs. This work provides a systematic suite of resources and methods for studying linguistic mechanisms, offers strong evidence that LLMs possess genuine linguistic knowledge, and lays the foundation for more interpretable and controllable language modeling in future research.
pdf
bib
abs
The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models
Adrian Cosma
|
Stefan Ruseti
|
Emilian Radoi
|
Mihai Dascalu
Despite their remarkable progress across diverse domains, Large Language Models (LLMs) consistently fail at simple character-level tasks, such as counting letters in words, due to a fundamental limitation: tokenization. In this work, we frame this limitation as a problem of low mutual information and analyze it in terms of concept emergence. Using a suite of 19 synthetic tasks that isolate character-level reasoning in a controlled setting, we show that such capabilities emerge suddenly and only late in training. We find that percolation-based models of concept emergence explain these patterns, suggesting that learning character composition is not fundamentally different from learning commonsense knowledge. To address this bottleneck, we propose a lightweight architectural modification that significantly improves character-level reasoning while preserving the inductive advantages of subword models. Together, our results bridge low-level perceptual gaps in tokenized LMs and provide a principled framework for understanding and mitigating their structural blind spots. We make our code publicly available.
pdf
bib
abs
Improving the Quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics
Surangika Ranathunga
|
Aloka Fernando
|
Menan Velayuthan
|
Charitha Rathnayaka
|
Nisansa de Silva
Parallel Data Curation (PDC) techniques aim to filter out noisy parallel sentences from web-mined corpora. Ranking sentence pairs using similarity scores on sentence embeddings derived from Pre-trained Multilingual Language Models (multiPLMs) is the most common PDC technique. However, previous research has shown that the choice of the multiPLM significantly impacts the quality of the filtered parallel corpus, and the Neural Machine Translation (NMT) models trained using such data show a disparity across multiPLMs. This paper shows that this disparity is due to different multiPLMs being biased towards certain types of sentence pairs, which are treated as noise from an NMT point of view. We show that such noisy parallel sentences can be removed to a certain extent by employing a series of heuristics. The NMT models trained using the curated corpus produce better results while minimizing the disparity across multiPLMs. We publicly release the source code and the curated datasets.
pdf
bib
abs
Weaver: Interweaving SQL and LLM for Table Reasoning
Rohit Khoja
|
Devanshu Gupta
|
Yanjie Fu
|
Dan Roth
|
Vivek Gupta
Querying tables with unstructured data is challenging due to the presence of text (or image), either embedded in the table or in external paragraphs, which traditional SQL struggles to process, especially for tasks requiring semantic reasoning. While Large Language Models (LLMs) excel at understanding context, they face limitations with long input sequences. Existing approaches that combine SQL and LLM typically rely on rigid, predefined workflows, limiting their adaptability to complex queries. To address these issues, we introduce Weaver, a modular pipeline that dynamically integrates SQL and LLM for table-based question answering (Table QA). Weaver generates a flexible, step-by-step plan that combines SQL for structured data retrieval with LLMs for semantic processing. By decomposing complex queries into manageable subtasks, Weaver improves accuracy and generalization. Our experiments show that Weaver consistently outperforms state-of-the-art methods across four Table QA datasets, reducing both API calls and error rates.
pdf
bib
abs
ECO Decoding: Entropy-Based Control for Controllability and Fluency in Controllable Dialogue Generation
Seungmin Shin
|
Dooyoung Kim
|
Youngjoong Ko
Controllable Dialogue Generation (CDG) enables chatbots to generate responses with desired attributes, and weighted decoding methods have achieved significant success in the CDG task. However, using a fixed constant value to manage the bias of attribute probabilities makes it challenging to find an ideal control strength that satisfies both controllability and fluency. To address this issue, we propose ECO decoding (Entropy-based COntrol), which dynamically adjusts the control strength at each generation step according to the model’s entropy in both the language model and attribute classifier probability distributions. Experimental results on DailyDialog and MultiWOZ datasets show that our method achieves improved control accuracy while maintaining fluency and grammar, outperforming previous decoding methods across various models and settings. Furthermore, ECO decoding alleviates probability interpolation issues in multi-attribute generation, demonstrating its robust performance in both single and multi-attribute scenarios.
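As a rough illustration of entropy-adaptive weighted decoding (not the paper's exact formula), the sketch below scales the attribute classifier's contribution at each step by the entropies of the language-model and classifier distributions; the specific scaling rule is an assumption.

```python
# Sketch of entropy-adaptive weighted decoding: the attribute classifier's bias
# is scaled at each step by the entropy of the two distributions.
import numpy as np

def entropy(p, eps=1e-12):
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p))

def eco_step(lm_logits, attr_logits, base_strength=2.0):
    p_lm = np.exp(lm_logits - lm_logits.max()); p_lm /= p_lm.sum()
    p_attr = np.exp(attr_logits - attr_logits.max()); p_attr /= p_attr.sum()
    # Illustrative rule: when the LM is uncertain (high entropy) and the
    # classifier is confident (low entropy), apply stronger attribute control.
    strength = base_strength * entropy(p_lm) / (entropy(p_attr) + 1e-6)
    combined = lm_logits + strength * attr_logits
    p = np.exp(combined - combined.max())
    return p / p.sum()

if __name__ == "__main__":
    lm = np.array([2.0, 1.9, 0.1])        # fairly flat language model logits
    attr = np.array([0.0, 3.0, -1.0])     # classifier strongly prefers token 1
    print(np.round(eco_step(lm, attr), 3))
```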
pdf
bib
abs
Investigating the interaction of linguistic and mathematical reasoning in language models using multilingual number puzzles
Antara Raaghavi Bhattacharya
|
Isabel Papadimitriou
|
Kathryn Davidson
|
David Alvarez-Melis
Across languages, numeral systems vary widely in how they construct and combine numbers. While humans consistently learn to navigate this diversity, large language models (LLMs) struggle with linguistic-mathematical puzzles involving cross-linguistic numeral systems, which humans can learn to solve successfully. We investigate why this task is difficult for LLMs through a series of experiments that untangle the linguistic and mathematical aspects of numbers in language. Our experiments establish that models cannot consistently solve such problems unless the mathematical operations in the problems are explicitly marked using known symbols (+, ×, etc., as in “twenty + three”). In further ablation studies, we probe how individual parameters of numeral construction and combination affect performance. While humans use their linguistic understanding of numbers to make inferences about the implicit compositional structure of numerals, LLMs seem to lack this notion of implicit numeral structure. We conclude that the ability to flexibly infer compositional rules from implicit patterns in human-scale data remains an open challenge for current reasoning models.
pdf
bib
abs
Unsupervised Concept Vector Extraction for Bias Control in LLMs
Hannah Cyberey
|
Yangfeng Ji
|
David Evans
Large language models (LLMs) are known to perpetuate stereotypes and exhibit biases. Various strategies have been proposed to mitigate these biases, but most work studies biases as a black-box problem without considering how concepts are represented within the model. We adapt techniques from representation engineering to study how the concept of “gender” is represented within LLMs. We introduce a new method that extracts concept representations via probability weighting without labeled data and efficiently selects a steering vector for measuring and manipulating the model’s representation. We develop a projection-based method that enables precise steering of model predictions, demonstrate its effectiveness in mitigating gender bias in LLMs, and show that it also generalizes to racial bias.
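A minimal sketch of the general recipe, under assumptions: the concept direction is taken here as a probability-weighted mean difference over hidden activations, and steering is a projection along that direction. The paper's exact extraction and vector-selection procedure may differ.

```python
# Sketch of projection-based steering: remove or scale a concept direction in
# hidden activations. Deriving the direction as a probability-weighted mean
# difference is an illustrative assumption.
import numpy as np

def concept_direction(hidden, weights):
    """hidden: (N, D) activations; weights: (N,) soft concept scores in [0, 1]."""
    w = np.asarray(weights, dtype=float)
    pos = (w[:, None] * hidden).sum(0) / w.sum()
    neg = ((1 - w)[:, None] * hidden).sum(0) / (1 - w).sum()
    v = pos - neg
    return v / np.linalg.norm(v)

def steer(h, v, alpha=0.0):
    """alpha=0 removes the concept component; alpha>0 or <0 steers along it."""
    return h - (h @ v) * v + alpha * v

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    hidden = rng.normal(size=(100, 16))
    scores = rng.uniform(size=100)            # soft, unsupervised concept scores
    v = concept_direction(hidden, scores)
    h = hidden[0]
    print("component before:", float(h @ v), "after:", float(steer(h, v) @ v))
```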
pdf
bib
abs
Seeing the Same Story Differently: Framing‐Divergent Event Coreference for Computational Framing Analysis
Jin Zhao
|
Xinrui Hu
|
Nianwen Xue
News articles often describe the same real-world event in strikingly different ways, shaping perception through framing rather than factual disagreement. However, traditional computational framing approaches often rely on coarse-grained topic classification, limiting their ability to capture subtle, event-level differences in how the same occurrences are presented across sources. We introduce Framing-divergent Event Coreference (FrECo), a novel task that identifies pairs of event mentions referring to the same underlying occurrence but differing in framing across documents, providing an event-centric lens for computational framing analysis. To support this task, we construct the high-agreement and diverse FrECo corpus. We evaluate the FrECo task on the corpus through supervised and preference-based tuning of large language models, providing strong baseline performance. To scale beyond the annotated data, we develop a bootstrapped mining pipeline that iteratively expands the training set with high-confidence FrECo pairs. Our approach enables scalable, interpretable analysis of how media frame the same events differently, offering a new lens for contrastive framing analysis at the event level.
pdf
bib
abs
LLMs are Better Than You Think: Label-Guided In-Context Learning for Named Entity Recognition
Fan Bai
|
Hamid Hassanzadeh
|
Ardavan Saeedi
|
Mark Dredze
In-context learning (ICL) enables large language models (LLMs) to perform new tasks using only a few demonstrations. In Named Entity Recognition (NER), demonstrations are typically selected based on semantic similarity to the test instance, ignoring training labels and resulting in suboptimal performance. We introduce DEER, a new method that leverages training labels through token-level statistics to improve ICL performance. DEER first enhances example selection with a label-guided, token-based retriever that prioritizes tokens most informative for entity recognition. It then prompts the LLM to revisit error-prone tokens, which are also identified using label statistics, and make targeted corrections. Evaluated on five NER datasets using four different LLMs, DEER consistently outperforms existing ICL methods and approaches the performance of supervised fine-tuning. Further analysis shows its effectiveness on both seen and unseen entities and its robustness in low-resource settings.
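To illustrate the flavor of label-guided, token-level statistics (not DEER's actual retriever), the sketch below estimates how often each training token occurs inside an entity and scores candidate demonstrations by their overlap with the test sentence's entity-informative tokens; the scoring rule and toy data are assumptions.

```python
# Sketch of label-guided demonstration selection for NER in-context learning:
# score candidates by shared entity-informative tokens with the test sentence.
from collections import Counter

def token_entity_rates(train):
    """train: list of (tokens, bio_tags). Returns P(token occurs inside an entity)."""
    inside, total = Counter(), Counter()
    for tokens, tags in train:
        for tok, tag in zip(tokens, tags):
            total[tok] += 1
            if tag != "O":
                inside[tok] += 1
    return {t: inside[t] / total[t] for t in total}

def score_demo(demo_tokens, test_tokens, rates, min_rate=0.5):
    """Count entity-informative test tokens that a candidate demonstration also contains."""
    informative = {t for t in test_tokens if rates.get(t, 0.0) >= min_rate}
    return len(informative & set(demo_tokens))

if __name__ == "__main__":
    train = [
        ("John lives in Paris".split(), ["B-PER", "O", "O", "B-LOC"]),
        ("She lives in London".split(), ["O", "O", "O", "B-LOC"]),
    ]
    rates = token_entity_rates(train)
    test = "Does Anna live in Paris ?".split()
    demos = [toks for toks, _ in train]
    ranked = sorted(demos, key=lambda d: -score_demo(d, test, rates))
    print(ranked[0])   # the demonstration sharing the entity-informative token "Paris"
```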
pdf
bib
abs
COUNTDOWN: Contextually Sparse Activation Filtering Out Unnecessary Weights in Down Projection
Jaewon Cheon
|
Pilsung Kang
The growing size of large language models has created significant computational inefficiencies. To address this challenge, sparse activation selectively deactivates non-essential parameters during inference, reducing computational costs in FFNN layers. While existing methods focus on non-linear gating mechanisms, we hypothesize that the sparsity of the FFNN layer lies globally in the form of a linear combination over its internal down projection matrix. Based on this insight, we propose two methods: M-COUNTDOWN, leveraging indirect coefficients, and D-COUNTDOWN, utilizing direct coefficients of the linear combination. Experimental results demonstrate that D-COUNTDOWN can omit 90% of computations with performance loss as low as 5.5% ideally, while M-COUNTDOWN provides a predictor-free solution with up to 29.4% better performance preservation compared to existing methods. Our specialized kernel implementations effectively translate these theoretical gains into substantial real-world acceleration.
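The "linear combination over the down projection" view can be illustrated with a small numpy sketch: the FFN output is a weighted sum of W_down's columns, so columns with small coefficients can be skipped. The ReLU FFN without gating and the top-k magnitude threshold are illustrative assumptions, not the paper's selection rule.

```python
# Sketch of sparse FFN computation: skip down-projection columns whose
# intermediate activation coefficients are small.
import numpy as np

def ffn_dense(x, W_up, W_down):
    act = np.maximum(W_up @ x, 0.0)                 # simple ReLU FFN, no gating
    return W_down @ act

def ffn_sparse(x, W_up, W_down, keep_ratio=0.1):
    act = np.maximum(W_up @ x, 0.0)
    k = max(1, int(len(act) * keep_ratio))
    keep = np.argsort(-np.abs(act))[:k]             # largest coefficients only
    return W_down[:, keep] @ act[keep]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, h = 64, 256
    x = rng.normal(size=d)
    W_up = rng.normal(size=(h, d)) / d ** 0.5
    W_down = rng.normal(size=(d, h)) / h ** 0.5
    dense, sparse = ffn_dense(x, W_up, W_down), ffn_sparse(x, W_up, W_down)
    print("relative error:", np.linalg.norm(dense - sparse) / np.linalg.norm(dense))
```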
pdf
bib
abs
SimpleDoc: Multi‐Modal Document Understanding with Dual‐Cue Page Retrieval and Iterative Refinement
Chelsi Jain
|
Yiran Wu
|
Yifan Zeng
|
Jiale Liu
|
Shengyu Dai
|
Zhenwen Shao
|
Qingyun Wu
|
Huazheng Wang
Document Visual Question Answering (DocVQA) is a practical yet challenging task that requires answering questions about documents while referring to multiple pages and different modalities of information, e.g., images and tables. To handle multi-modality, recent methods follow a similar Retrieval Augmented Generation (RAG) pipeline, but utilize Visual Language Model (VLM)-based embedding models to embed and retrieve relevant pages as images, and generate answers with VLMs that can accept an image as input. In this paper, we introduce SimpleDoc, a lightweight yet powerful retrieval-augmented framework for DocVQA. It boosts evidence page gathering by first retrieving candidates through embedding similarity and then filtering and re-ranking these candidates based on page summaries. A single VLM-based reasoner agent repeatedly invokes this dual-cue retriever, iteratively pulling fresh pages into a working memory until the question is confidently answered. SimpleDoc outperforms previous baselines by 3.2% on average on 4 DocVQA datasets with far fewer pages retrieved. Our code is available at https://github.com/ag2ai/SimpleDoc.
pdf
bib
abs
VLP: Vision-Language Preference Learning for Embodied Manipulation
Runze Liu
|
Chenjia Bai
|
Jiafei Lyu
|
Shengjie Sun
|
Yali Du
|
Xiu Li
Reward engineering is one of the key challenges in Reinforcement Learning (RL). Preference-based RL effectively addresses this issue by learning from human feedback. However, it is both time-consuming and expensive to collect human preference labels. In this paper, we propose a novel Vision-Language Preference learning framework, named VLP, which learns a vision-language preference model to provide feedback for embodied manipulation tasks. To achieve this, we define three types of language-conditioned preferences and construct a vision-language preference dataset, which contains versatile implicit preference orders. The model learns to extract language-related features, and then serves as a predictor in various downstream tasks. The policy can be learned according to the annotated labels via reward learning or direct policy optimization. Extensive empirical results on simulated embodied manipulation tasks demonstrate that our method provides accurate preferences and generalizes to unseen tasks and unseen language instructions, outperforming the baselines by a large margin and shifting the burden from continuous, per-task human annotation to one-time, per-domain data collection.
pdf
bib
abs
QG-CoC: Question-Guided Chain-of-Captions for Large Multimodal Models
Kuei-Chun Kao
|
Hsu Tzu-Yin
|
Yunqi Hong
|
Ruochen Wang
|
Cho-Jui Hsieh
Recently, Multimodal Large Language Models (MLLMs) encounter two key issues in multi-image contexts: (1) a lack of fine-grained perception across disparate images, and (2) a diminished capability to effectively reason over and synthesize information from multiple visual inputs. However, while various prompting methods aim to describe visual content, many existing studies focus primarily on single-image settings or specific, constrained scenarios. This leaves a critical gap in understanding and addressing how MLLMs tackle more general and complex multi-image reasoning tasks. Thus, we first extensively investigate how current prompting methods perceive fine-grained visual details and process visual information when dealing with multiple images. Our findings reveal that existing prompting methods fall short in attending to needed clues and seamlessly integrating perception and reasoning. Inspired by the findings, we propose a new zero-shot prompting method, Question-Guided Chain-of-Captions (QG-CoC), a generalized prompting approach that effectively handles problems with an arbitrary number of images. We evaluate our method on various open-source and closed-source MLLMs for multi-image and single-image benchmarks. Experimental results indicate that QG-CoC demonstrates competitive performance across tasks and exhibits robust improvements in the challenging scenarios where existing prompting methods fail.
pdf
bib
abs
EGOILLUSION: Benchmarking Hallucinations in Egocentric Video Understanding
Ashish Seth
|
Utkarsh Tyagi
|
Ramaneswaran Selvakumar
|
Nishit Anand
|
Sonal Kumar
|
Sreyan Ghosh
|
Ramani Duraiswami
|
Chirag Agarwal
|
Dinesh Manocha
Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in complex multimodal tasks. While MLLMs excel at visual perception and reasoning in third-person and egocentric videos, they are prone to hallucinations, generating coherent yet inaccurate responses. We present EGOILLUSION, a first benchmark to evaluate MLLM hallucinations in egocentric videos. EGOILLUSION comprises 1,400 videos paired with 8,000 human-annotated open and closed-ended questions designed to trigger hallucinations in both visual and auditory cues in egocentric videos. Evaluations across ten MLLMs reveal significant challenges, with even powerful models like GPT-4o and Gemini achieving only 59% accuracy. EGOILLUSION lays the foundation for developing robust benchmarks to evaluate the effectiveness of MLLMs and spurs the development of better egocentric MLLMs with reduced hallucination rates. Our benchmark will be open-sourced for reproducibility.
pdf
bib
abs
MULTIVOX: A Benchmark for Evaluating Voice Assistants for Multimodal Interactions
Ramaneswaran Selvakumar
|
Ashish Seth
|
Nishit Anand
|
Utkarsh Tyagi
|
Sonal Kumar
|
Sreyan Ghosh
|
Dinesh Manocha
The rapid progress of Large Language Models (LLMs) has empowered omni models to act as voice assistants capable of understanding spoken dialogues. These models can process multimodal inputs beyond text, such as speech and visual data, enabling more context-aware interactions. However, current benchmarks fall short in comprehensively evaluating how well these models generate context-aware responses, particularly when it comes to implicitly understanding fine-grained speech characteristics, such as pitch, emotion, timbre, and volume or the environmental acoustic context such as background sounds. Additionally, they inadequately assess the ability of models to align paralinguistic cues with complementary visual signals to inform their responses. To address these gaps, we introduce MultiVox, the first omni voice assistant benchmark designed to evaluate the ability of voice assistants to integrate spoken and visual cues including paralinguistic speech features for truly multimodal understanding. Specifically, MultiVox includes 1000 human-annotated and recorded speech dialogues that encompass diverse paralinguistic features and a range of visual cues such as images and videos. Our evaluation on 10 state-of-the-art models reveals that, although humans excel at these tasks, current open-source models consistently struggle to produce contextually grounded responses.
pdf
bib
abs
Do All Autoregressive Transformers Remember Facts the Same Way? A Cross-Architecture Analysis of Recall Mechanisms
Minyeong Choe
|
Haehyun Cho
|
Changho Seo
|
Hyunil Kim
Understanding how Transformer-based language models store and retrieve factual associations is critical for improving interpretability and enabling targeted model editing. Prior work, primarily on GPT-style models, has identified MLP modules in early layers as key contributors to factual recall. However, it remains unclear whether these findings generalize across different autoregressive architectures. To address this, we conduct a comprehensive evaluation of factual recall across several models—including GPT, LLaMA, Qwen, and DeepSeek—analyzing where and how factual information is encoded and accessed. Consequently, we find that Qwen-based models behave differently from previous patterns: attention modules in the earliest layers contribute more to factual recall than MLP modules. Our findings suggest that even within the autoregressive Transformer family, architectural variations can lead to fundamentally different mechanisms of factual recall.
pdf
bib
abs
Probing Narrative Morals: A New Character-Focused MFT Framework for Use with Large Language Models
Luca Mitran
|
Sophie Wu
|
Andrew Piper
Moral Foundations Theory (MFT) provides a framework for categorizing different forms of moral reasoning, but its application to computational narrative analysis remains limited. We propose a novel character-centric method to quantify moral foundations in storytelling, using large language models (LLMs) and a novel Moral Foundations Character Action Questionnaire (MFCAQ) to evaluate the moral foundations supported by the behaviour of characters in stories. We validate our approach against human annotations and then apply it to a study of 2,697 folktales from 55 countries. Our findings reveal: (1) broad distribution of moral foundations across cultures, (2) significant cross-cultural consistency with some key regional differences, and (3) a more balanced distribution of positive and negative moral content than suggested by prior work. This work connects MFT and computational narrative analysis, demonstrating LLMs’ potential for scalable moral reasoning in narratives.
pdf
bib
abs
Probing and Boosting Large Language Models Capabilities via Attention Heads
Dezhi Zhao
|
Xin Liu
|
Xiaocheng Feng
|
Hui Wang
|
Bing Qin
Understanding the internal origins of capabilities in large language models (LLMs) is crucial for interpretability and efficient adaptation. However, the emergence of specific capabilities remains poorly understood, as most existing approaches rely on external signals (e.g., performance shifts or gradient similarities) with limited structural grounding. To address these issues, this paper proposes a lightweight and highly interpretable approach that links LLM capabilities to internal components by identifying correspondences at the level of attention heads. Specifically, we first define five fundamental capabilities, namely Mathematical Reasoning, Reading Comprehension, Commonsense Reasoning, Scientific Reasoning, and Professional Expertise, and employ probing techniques to detect the attention heads most predictive of each, thereby establishing capability–head mappings. For targeted instruction tuning, complex tasks are decomposed into these fundamental capabilities, and training data are selected accordingly. Experiments on LLaMA3.1-8B and Qwen2.5-7B show over 70% discrimination accuracy in identifying capabilities. On MMLU and BBH, our method improves accuracy by 1 to 1.5 points over the gradient-based method LESS and by 5 to 6 points over other intermediate-state baselines.
pdf
bib
abs
A Survey of Link Prediction in N-ary Knowledge Graphs
Jiyao Wei
|
Saiping Guan
|
Da Li
|
Zhongni Hou
|
Miao Su
|
Yucan Guo
|
Xiaolong Jin
|
Jiafeng Guo
|
Xueqi Cheng
N-ary Knowledge Graphs (NKGs) are a specialized type of knowledge graph designed to efficiently represent complex real-world facts. Unlike traditional knowledge graphs, where a fact typically involves two entities, NKGs can capture n-ary facts containing more than two entities. Link prediction in NKGs aims to predict missing elements within these n-ary facts, which is essential for completing NKGs and improving the performance of downstream applications. This task has recently gained significant attention. In this paper, we present the first comprehensive survey of link prediction in NKGs, providing an overview of the field, systematically categorizing existing methods, and analyzing their performance and application scenarios. We also outline promising directions for future research.
pdf
bib
abs
Multi-Frequency Contrastive Decoding: Alleviating Hallucinations for Large Vision-Language Models
Bingqian Liu
|
Fu Zhang
|
Guoqing Chen
|
Jingwei Cheng
Large visual-language models (LVLMs) have demonstrated remarkable performance in visual-language tasks. However, object hallucination remains a significant challenge for LVLMs. Existing studies attribute object hallucinations in LVLMs mainly to linguistic priors and data biases. We further explore the causes of object hallucinations from the perspective of the frequency domain and reveal that insufficient frequency information in images amplifies these linguistic priors, increasing the likelihood of hallucinations. To mitigate this issue, we propose the Multi-Frequency Contrastive Decoding (MFCD) method, a simple yet training-free approach that removes the hallucination distribution from the original output distribution, which arises from LVLMs neglecting the high-frequency or low-frequency information in the image input. Without compromising the general capabilities of LVLMs, the proposed MFCD effectively mitigates object hallucinations. Our experiments demonstrate that MFCD significantly mitigates object hallucination across diverse large-scale vision-language models, without requiring additional training or external tools. In addition, MFCD can be applied to various LVLMs without modifying the model architecture or requiring additional training, demonstrating its generality and robustness. Codes are available at https://github.com/liubq-dev/mfcd.
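As a generic illustration of frequency-aware contrastive decoding (not MFCD's exact formulation), the sketch below low-pass filters an image and contrasts the logits obtained from the original input against those from the degraded one, suppressing tokens that the degraded input favors. The filter, the contrast rule, and the toy logits are assumptions.

```python
# Generic contrastive-decoding sketch: suppress tokens favored under a
# frequency-degraded copy of the input image.
import numpy as np

def low_pass(image, keep=0.25):
    """Crude low-pass filter: keep only the central FFT coefficients."""
    f = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    mask = np.zeros_like(f)
    ch, cw = int(h * keep / 2), int(w * keep / 2)
    mask[h // 2 - ch:h // 2 + ch, w // 2 - cw:w // 2 + cw] = 1
    return np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))

def contrastive_logits(logits_full, logits_degraded, alpha=1.0):
    # Tokens favored only under the degraded input are pushed down.
    return (1 + alpha) * logits_full - alpha * logits_degraded

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(32, 32))
    blurred = low_pass(img)                        # a real system would re-run the VLM on this
    print("mean change after filtering:", round(float(np.abs(img - blurred).mean()), 3))
    logits_full = np.array([1.0, 2.0, 0.5])        # hypothetical next-token logits
    logits_degraded = np.array([1.0, 0.5, 2.5])    # token 2 favored under the blurred input
    print(np.round(contrastive_logits(logits_full, logits_degraded), 2))
```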
pdf
bib
abs
ORPP: Self-Optimizing Role-playing Prompts to Enhance Language Model Capabilities
Yifan Duan
|
Yihong Tang
|
Kehai Chen
|
Liqiang Nie
|
Min Zhang
High-quality prompts are crucial for eliciting outstanding performance from large language models (LLMs) on complex tasks. Existing research has explored model-driven strategies for prompt optimization. However, these methods often suffer from high computational overhead or require strong optimization capabilities from the model itself, which limits their broad applicability. To address these challenges, we propose ORPP, a framework that enhances model performance by optimizing and generating role-playing prompts. The core idea of ORPP is to confine the prompt search space to role-playing scenarios, thereby fully activating the model’s intrinsic capabilities through carefully crafted, high-quality role-playing prompts. Specifically, ORPP first performs iterative optimization on a small subset of training samples to generate high-quality role-playing prompts. Then, leveraging the model’s few-shot learning capability, it transfers the optimization experience to efficiently generate suitable prompts for the remaining samples. Our experimental results show that ORPP not only matches but in most cases surpasses existing mainstream prompt optimization methods in terms of performance. Notably, ORPP shows strong “plug-and-play” capability. In most cases, it can be integrated with various other prompt methods and further enhance their effectiveness.
pdf
bib
abs
BrailleLLM: Braille Instruction Tuning with Large Language Models for Braille Domain Tasks
Tianyuan Huang
|
Zepeng Zhu
|
Hangdi Xing
|
Zirui Shao
|
Zhi Yu
|
Chaoxiong Yang
|
Jiaxian He
|
Xiaozhong Liu
|
Jiajun Bu
Braille plays a vital role in education and information accessibility for visually impaired individuals. However, Braille information processing faces challenges such as data scarcity and ambiguities in mixed-text contexts. We construct English and Chinese Braille Mixed Datasets (EBMD/CBMD) with mathematical formulas to support diverse Braille domain research, and propose a syntax tree-based augmentation method tailored for Braille data. To address the underperformance of traditional fine-tuning methods in braille-related tasks, we investigate Braille Knowledge-Based Fine-Tuning (BKFT), which reduces the learning difficulty of Braille contextual features. BrailleLLM employs BKFT via instruction tuning to achieve unified Braille translation, formula-to-Braille conversion, and mixed-text translation. Experiments demonstrate that BKFT achieves significant performance improvements over conventional fine-tuning in Braille translation scenarios. Our open-sourced datasets and methodologies establish a foundation for low-resource multilingual Braille research.
pdf
bib
abs
MAviS: A Multimodal Conversational Assistant For Avian Species
Yevheniia Kryklyvets
|
Mohammed Irfan Kurpath
|
Sahal Shaji Mullappilly
|
Jinxing Zhou
|
Fahad Shahbaz Khan
|
Rao Muhammad Anwer
|
Salman Khan
|
Hisham Cholakkal
Fine-grained understanding and species-specific, multimodal question answering are vital for advancing biodiversity conservation and ecological monitoring. However, existing multimodal large language models (MM-LLMs) face challenges when it comes to specialized topics like avian species, making it harder to provide accurate and contextually relevant information in these areas. To address this limitation, we introduce the **MAviS-Dataset**, a large-scale multimodal avian species dataset that integrates image, audio, and text modalities for over 1,000 bird species, comprising both pretraining and instruction-tuning subsets enriched with structured question–answer pairs. Building on the MAviS-Dataset, we introduce **MAviS-Chat**, a multimodal LLM that supports audio, vision, and text designed for fine-grained species understanding, multimodal question answering, and scene-specific description generation. Finally, for quantitative evaluation, we present **MAviS-Bench**, a benchmark of over 25,000 Q&A pairs designed to assess avian species-specific perceptual and reasoning abilities across modalities. Experimental results show that MAviS-Chat outperforms the baseline MiniCPM-o-2.6 by a large margin, achieving state-of-the-art open-source results and demonstrating the effectiveness of our instruction-tuned MAviS-Dataset. Our findings highlight the necessity of domain-adaptive MM-LLMs for ecological applications. Our code, training data, evaluation benchmark, and models are available at https://github.com/yevheniia-uv/MAviS.
pdf
bib
abs
Refining Text Generation for Realistic Conversational Recommendation via Direct Preference Optimization
Manato Tajiri
|
Michimasa Inaba
Conversational Recommender Systems (CRSs) aim to elicit user preferences via natural dialogue to provide suitable item recommendations. However, current CRSs often deviate from realistic human interactions by rapidly recommending items in brief sessions. This work addresses this gap by leveraging Large Language Models (LLMs) to generate dialogue summaries from dialogue history and item recommendation information from item description. This approach enables the extraction of both explicit user statements and implicit preferences inferred from the dialogue context. We introduce a method using Direct Preference Optimization (DPO) to ensure dialogue summary and item recommendation information are rich in information crucial for effective recommendations. Experiments on two public datasets validate our method’s effectiveness in fostering more natural and realistic conversational recommendation processes. Our implementation is publicly available at: https://github.com/UEC-InabaLab/Refining-LLM-Text
pdf
bib
abs
Large Language Models Threaten Language’s Epistemic and Communicative Foundations
Shashank Srivastava
Large language models are reshaping the norms of human communication, sometimes decoupling words from genuine human thought. This transformation is deep, and undermines norms historically tied to authorship of text. We draw from linguistic philosophy and AI ethics to detail how large-scale text generation can induce semantic drift, erode accountability, and obfuscate intent and authorship. Our work here introduces hybrid authorship graphs (modeling humans, LLMs, and texts in a provenance network), epistemic doppelgängers (LLM-generated texts that are indistinguishable from human-authored texts), and authorship entropy. We explore mechanisms such as “proof-of-interaction” authorship verification and educational reforms to restore confidence in language. LLMs’ benefits (broader access, increased fluency, automation, etc.) are undeniable, but the upheavals they introduce to the linguistic landscape demand reckoning.
pdf
bib
abs
Detecting Knowledge Boundary of Vision Large Language Models by Sampling-Based Inference
Zhuo Chen
|
Xinyu Wang
|
Yong Jiang
|
Zhen Zhang
|
Xinyu Geng
|
Pengjun Xie
|
Fei Huang
|
Kewei Tu
Despite the advancements made in Vision Large Language Models (VLLMs), like text Large Language Models (LLMs), they have limitations in addressing questions that require real-time information or are knowledge-intensive. Indiscriminately adopting Retrieval Augmented Generation (RAG) techniques is an effective yet expensive way to enable models to answer queries beyond their knowledge scopes. To mitigate the dependence on retrieval and simultaneously maintain, or even improve, the performance benefits provided by retrieval, we propose a method to detect the knowledge boundary of VLLMs, allowing for more efficient use of techniques like RAG. Specifically, we propose a method with two variants that fine-tune a VLLM on an automatically constructed dataset for boundary identification. Experimental results on various types of Visual Question Answering datasets show that our method successfully depicts a VLLM’s knowledge boundary, based on which we are able to reduce indiscriminate retrieval while maintaining or improving the performance. In addition, we show that the knowledge boundary identified by our method for one VLLM can be used as a surrogate boundary for other VLLMs. Code will be released at https://github.com/Chord-Chen-30/VLLM-KnowledgeBoundary
pdf
bib
abs
Multi-view-guided Passage Reranking with Large Language Models
Jeongwoo Na
|
Jun Kwon
|
Eunseong Choi
|
Jongwuk Lee
Recent advances in large language models (LLMs) have shown impressive performance in passage reranking tasks. Despite their success, LLM-based methods still face challenges in efficiency and sensitivity to external biases. (1) Existing models rely mostly on autoregressive generation and sliding window strategies to rank passages, which incur heavy computational overhead as the number of passages increases. (2) External biases, such as position or selection bias, hinder the model’s ability to accurately represent passages and increase input-order sensitivity. To address these limitations, we introduce a novel passage reranking model, called Multi-View-guided Passage Reranking (MVP). MVP is a non-generative LLM-based reranking method that encodes query-passage information into diverse view embeddings without being influenced by external biases. For each view, it combines query-aware passage embeddings to produce a distinct anchor vector, which is then used to directly compute relevance scores in a single decoding step. In addition, it employs an orthogonal loss to make the views more distinctive. Extensive experiments demonstrate that MVP, with just 220M parameters, matches the performance of much larger 7B-scale fine-tuned models while achieving a 100x reduction in inference latency. Notably, the 3B-parameter variant of MVP achieves state-of-the-art performance on both in-domain and out-of-domain benchmarks. The source code is available at: https://github.com/bulbna/MVP.
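A loose sketch of multi-view scoring under assumptions: each view's anchor is a mean of query-aware passage embeddings, relevance is an anchor-passage dot product averaged over views, and view distinctiveness is encouraged with a simple off-diagonal Gram penalty. These choices are illustrative, not MVP's actual architecture.

```python
# Sketch of multi-view relevance scoring with an orthogonality penalty on the
# per-view anchor vectors.
import numpy as np

def mvp_scores(view_passage_embs):
    """view_passage_embs: (V, N, D) query-aware passage embeddings, one set per view."""
    E = np.asarray(view_passage_embs, dtype=float)
    anchors = E.mean(axis=1)                        # (V, D): one anchor vector per view
    scores = np.einsum("vd,vnd->vn", anchors, E)    # per-view relevance scores
    return scores.mean(axis=0), anchors             # aggregate over views

def orthogonality_penalty(anchors):
    """Penalize similarity between view anchors to keep views distinctive."""
    A = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    G = A @ A.T
    return float(np.sum((G - np.diag(np.diag(G))) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    embs = rng.normal(size=(4, 5, 8))               # 4 views, 5 passages, dim 8
    relevance, anchors = mvp_scores(embs)
    print("relevance:", np.round(relevance, 2))
    print("orthogonality penalty:", round(orthogonality_penalty(anchors), 3))
```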
pdf
bib
abs
Disentangling Subjectivity and Uncertainty for Hate Speech Annotation and Modeling using Gaze
Özge Alacam
|
Sanne Hoeken
|
Andreas Säuberli
|
Hannes Gröner
|
Diego Frassinelli
|
Sina Zarrieß
|
Barbara Plank
Variation is inherent in opinion-based annotation tasks like sentiment or hate speech analysis. It arises not only from errors, fatigue, or sentence ambiguity but also from genuine differences in opinion shaped by background, experience, and culture. In this paper, first, we show how annotators’ confidence ratings can be of great use for disentangling subjective variation from uncertainty, without relying on specific features present in the data (text, gaze, etc.). Our goal is to establish distinctive dimensions of variation which are often not clearly separated in existing work on modeling annotator variation. We illustrate our approach through a hate speech detection task, demonstrating that models are affected differently by instances of uncertainty and subjectivity. In addition, we show that human gaze patterns offer valuable indicators of subjective evaluation and uncertainty. Disclaimer: This paper contains sentences that may be offensive.
pdf
bib
abs
VoiceBBQ: Investigating Effect of Content and Acoustics in Social Bias of Spoken Language Model
Junhyuk Choi
|
Ro-hoon Oh
|
Jihwan Seol
|
Bugeun Kim
We introduce VoiceBBQ, a spoken extension of the BBQ (Bias Benchmark for Question answering), a dataset that measures social bias by presenting ambiguous or disambiguated contexts followed by questions that may elicit stereotypical responses. Due to the nature of the speech modality, social bias in Spoken Language Models (SLMs) can emerge from two distinct sources: 1) the content aspect and 2) the acoustic aspect. The dataset converts every BBQ context into controlled voice conditions, enabling per-axis accuracy, bias, and consistency scores that remain comparable to the original text benchmark. Using VoiceBBQ, we evaluate two SLMs—LLaMA-Omni and Qwen2-Audio—and observe architectural contrasts: LLaMA-Omni retains strong acoustic sensitivity, amplifying gender and accent bias, whereas Qwen2-Audio substantially dampens these cues while preserving content fidelity. VoiceBBQ thus provides a compact, drop-in testbed for jointly diagnosing content and acoustic bias across spoken language models.
pdf
bib
abs
Explaining Differences Between Model Pairs in Natural Language through Sample Learning
Advaith Malladi
|
Rakesh R Menon
|
Yuvraj Jain
|
Shashank Srivastava
With the growing adoption of machine learning models in critical domains, techniques for explaining differences between models have become essential for trust, debugging, and informed deployment. Previous approaches address this by identifying input transformations that cause divergent predictions or by learning joint surrogate models to align and contrast behaviors. These methods often require access to training data and do not produce natural language explanations. In this paper, we introduce SLED, a framework that generates faithful natural language explanations of when and how two ML models converge or diverge in their predictions. SLED first uses gradient-based optimization to synthesize input samples that highlight divergence and convergence patterns, and then leverages a large language model (LLM) to generate explanations grounded in these synthetic samples. Across both text-based (3 tasks, 7 models) and structured (10 tasks, 4 models) classification tasks, we show that SLED explanations are 18–24% more faithful than the strongest baselines. User studies also indicate that SLED explanations achieve a real-world simulatability of 63.5%. Importantly, SLED requires minimal access to training data and generalizes well to real-world samples, enabling transparent and data-efficient model comparison.
pdf
bib
abs
Compound AI Systems Optimization: A Survey of Methods, Challenges, and Future Directions
Yu-Ang Lee
|
Guan-Ting Yi
|
Mei-Yi Liu
|
Jui-Chao Lu
|
Guan-Bo Yang
|
Yun-Nung Chen
Recent advancements in large language models (LLMs) and AI systems have led to a paradigm shift in the design and optimization of complex AI workflows. By integrating multiple components, compound AI systems have become increasingly adept at performing sophisticated tasks. However, as these systems grow in complexity, new challenges arise in optimizing not only individual components but also their interactions. While traditional optimization methods such as supervised fine-tuning (SFT) and reinforcement learning (RL) remain foundational, the rise of natural language feedback introduces promising new approaches, especially for optimizing non-differentiable systems. This paper provides a systematic review of recent progress in optimizing compound AI systems, encompassing both numerical and language-based techniques. We formalize the notion of compound AI system optimization, classify existing methods along several key dimensions, and highlight open research challenges and future directions in this rapidly evolving field.
pdf
bib
abs
A Multi-Level Benchmark for Causal Language Understanding in Social Media Discourse
Xiaohan Ding
|
Kaike Ping
|
Buse Çarık
|
Eugenia Rho
Understanding causal language in informal discourse is a core yet underexplored challenge in NLP. Existing datasets largely focus on explicit causality in structured text, providing limited support for detecting implicit causal expressions, particularly those found in informal, user-generated social media posts. We introduce CausalTalk, a multi-level dataset of five years of Reddit posts (2020–2024) discussing public health related to the COVID-19 pandemic, among which 10,120 posts are annotated across four causal tasks: (1) binary causal classification, (2) explicit vs. implicit causality, (3) cause–effect span extraction, and (4) causal gist generation. Annotations comprise both gold-standard labels created by domain experts and silver-standard labels generated by GPT-4o and verified by human annotators. CausalTalk bridges fine-grained causal detection and gist-based reasoning over informal text. It enables benchmarking across both discriminative and generative models, and provides a rich resource for studying causal reasoning in social media contexts.
pdf
bib
abs
Causal Representation Learning from Multimodal Clinical Records under Non-Random Modality Missingness
Zihan Liang
|
Ziwen Pan
|
Ruoxuan Xiong
Clinical notes contain rich patient information, such as diagnoses or medications, making them valuable for patient representation learning. Recent advances in large language models have further improved the ability to extract meaningful representations from clinical texts. However, clinical notes are often missing. For example, in our analysis of the MIMIC-IV dataset, 24.5% of patients have no available discharge summaries. In such cases, representations can be learned from other modalities such as structured data, chest X-rays, or radiology reports. Yet the availability of these modalities is influenced by clinical decision-making and varies across patients, resulting in modality missing-not-at-random (MMNAR) patterns. We propose a causal representation learning framework that leverages observed data and informative missingness in multimodal clinical records. It consists of: (1) an MMNAR-aware modality fusion component that integrates structured data, imaging, and text while conditioning on missingness patterns to capture patient health and clinician-driven assignment; (2) a modality reconstruction component with contrastive learning to ensure semantic sufficiency in representation learning; and (3) a multitask outcome prediction model with a rectifier that corrects for residual bias from specific modality observation patterns. Comprehensive evaluations across MIMIC-IV and eICU show consistent gains over the strongest baselines, achieving up to 13.8% improvement for hospital readmission and 13.1% for ICU admission (AUC, relative to best baseline).
pdf
bib
abs
XLQA: A Benchmark for Locale-Aware Multilingual Open-Domain Question Answering
Keonwoo Roh
|
Yeong-Joon Ju
|
Seong-Whan Lee
Large Language Models (LLMs) have shown significant progress in open-domain question answering (ODQA), yet most evaluations focus on English and assume locale-invariant answers across languages. This assumption neglects the cultural and regional variations that affect question understanding and answering, leading to biased evaluation in multilingual benchmarks. To address these limitations, we introduce XLQA, a novel benchmark explicitly designed for locale-sensitive multilingual ODQA. XLQA contains 3,000 English seed questions expanded to eight languages, with careful filtering for semantic consistency and human-verified annotations distinguishing locale-invariant and locale-sensitive cases. Our evaluation of five state-of-the-art multilingual LLMs reveals notable failures on locale-sensitive questions, exposing gaps between English and other languages due to a lack of locale-grounding knowledge. We provide a systematic framework and scalable methodology for assessing multilingual QA under diverse cultural contexts, offering a critical resource to advance the real-world applicability of multilingual ODQA systems. Our findings suggest that disparities in training data distribution contribute to differences in both linguistic competence and locale-awareness across models.
pdf
bib
abs
Transformer-Based Temporal Information Extraction and Application: A Review
Xin Su
|
Phillip Howard
|
Steven Bethard
Temporal information extraction (IE) aims to extract structured temporal information from unstructured text, thereby uncovering the implicit timelines within. This technique is applied across domains such as healthcare, newswire, and intelligence analysis, aiding models in these areas to perform temporal reasoning and enabling human users to grasp the temporal structure of text. Transformer-based pre-trained language models have produced revolutionary advancements in natural language processing, demonstrating exceptional performance across a multitude of tasks. Despite the achievements garnered by Transformer-based approaches in temporal IE, there is a lack of comprehensive reviews on these endeavors. In this paper, we aim to bridge this gap by systematically summarizing and analyzing the body of work on temporal IE using Transformers while highlighting potential future research directions.
pdf
bib
abs
How to Protect Yourself from 5G Radiation? Investigating LLM Responses to Implicit Misinformation
Ruohao Guo
|
Wei Xu
|
Alan Ritter
As Large Language Models (LLMs) are widely deployed in diverse scenarios, the extent to which they could tacitly spread misinformation emerges as a critical safety concern. Current research primarily evaluates LLMs on explicit false statements, overlooking how misinformation often manifests subtly as unchallenged premises in real-world interactions. We curated EchoMist, the first comprehensive benchmark for implicit misinformation, where false assumptions are embedded in the query to LLMs. EchoMist targets circulated, harmful, and ever-evolving implicit misinformation from diverse sources, including realistic human-AI conversations and social media interactions. Through extensive empirical studies on 15 state-of-the-art LLMs, we find that current models perform alarmingly poorly on this task, often failing to detect false premises and generating counterfactual explanations. We also investigate two mitigation methods, i.e., Self-Alert and RAG, to enhance LLMs’ capability to counter implicit misinformation. Our findings indicate that EchoMist remains a persistent challenge and underscore the critical need to safeguard against the risk of implicit misinformation.
pdf
bib
abs
AmpleHate: Amplifying the Attention for Versatile Implicit Hate Detection
Yejin Lee
|
Joonghyuk Hahn
|
Hyeseon Ahn
|
Yo-Sub Han
Implicit hate speech detection is challenging due to its subtlety and reliance on contextual interpretation rather than explicit offensive words. Current approaches rely on contrastive learning, which has been shown to be effective in distinguishing hate from non-hate sentences. Humans, however, detect implicit hate speech by first identifying specific targets within the text and subsequently interpreting how these targets relate to their surrounding context. Motivated by this reasoning process, we propose AmpleHate, a novel approach designed to mirror human inference for implicit hate detection. AmpleHate identifies explicit targets using a pretrained Named Entity Recognition model and captures implicit target information via [CLS] tokens. It computes attention-based relationships between explicit targets, implicit targets, and sentence context, and then directly injects these relational vectors into the final sentence representation. This amplifies the critical signals of target-context relations for determining implicit hate. Experiments demonstrate that AmpleHate achieves state-of-the-art performance, outperforming contrastive learning baselines by an average of 82.14% and achieving faster convergence. Qualitative analyses further reveal that attention patterns produced by AmpleHate closely align with human judgement, underscoring its interpretability and robustness.
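The sketch below illustrates, under stated assumptions, the attention-based injection of target-context relation vectors into the sentence representation described above; the module structure, dimensions, and pooling are illustrative choices rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class TargetContextInjection(nn.Module):
    """Illustrative target-context relation injection (not the AmpleHate release)."""

    def __init__(self, hidden: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, cls_vec, target_vecs, token_vecs):
        # cls_vec: (B, H) implicit target; target_vecs: (B, T, H) explicit NER
        # targets; token_vecs: (B, L, H) sentence context tokens.
        queries = torch.cat([cls_vec.unsqueeze(1), target_vecs], dim=1)
        relation, _ = self.attn(queries, token_vecs, token_vecs)   # target-context attention
        relation = relation.mean(dim=1)                            # pool relation vectors
        return cls_vec + self.proj(relation)                       # inject into sentence representation
```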
pdf
bib
abs
Can Large Language Models Act as Ensembler for Multi-GNNs?
Hanqi Duan
|
Yao Cheng
|
Jianxiang Yu
|
Yao Liu
|
Xiang Li
Graph Neural Networks (GNNs) have emerged as powerful models for learning from graph-structured data. However, GNNs lack the inherent ability to semantically understand rich textual node attributes, limiting their effectiveness in applications. On the other hand, we empirically observe that no existing GNN model consistently outperforms the others across diverse datasets. In this paper, we study whether LLMs can act as an ensembler for multiple GNNs and propose the LensGNN model. The model first aligns multiple GNNs, mapping the representations of different GNNs into the same space. Then, through LoRA fine-tuning, it aligns the space between the GNN and the LLM, injecting graph tokens and textual information into LLMs. This allows LensGNN to ensemble multiple GNNs and take advantage of the strengths of LLMs, leading to a deeper understanding of both textual semantic information and graph structural information. The experimental results show that LensGNN outperforms existing models. This research advances text-attributed graph ensemble learning by providing a robust and superior solution for integrating semantic and structural information. We provide our code and data here: https://github.com/AquariusAQ/LensGNN.
pdf
bib
abs
Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models
Younwoo Choi
|
Changling Li
|
Yongjin Yang
|
Zhijing Jin
As large language models (LLMs) are increasingly integrated into multi-agent and human-AI systems, understanding their awareness of both self-context and conversational partners is essential for ensuring reliable performance and robust safety. While prior work has extensively studied situational awareness, which refers to an LLM’s ability to recognize its operating phase and constraints, it has largely overlooked the complementary capacity to identify and adapt to the identity and characteristics of a dialogue partner. In this paper, we formalize this latter capability as interlocutor awareness and present the first systematic evaluation of its emergence in contemporary LLMs. We examine interlocutor inference across three dimensions—reasoning patterns, linguistic style, and alignment preferences—and show that LLMs reliably identify same-family peers and certain prominent model families, such as GPT and Claude. To demonstrate its practical significance, we develop three case studies in which interlocutor awareness both enhances multi-LLM collaboration through prompt adaptation and introduces new alignment and safety vulnerabilities, including reward-hacking behaviors and increased jailbreak susceptibility. Our findings highlight the dual promise and peril of identity-sensitive behavior in LLMs, underscoring the need for further understanding of interlocutor awareness and new safeguards in multi-agent deployments.
pdf
bib
abs
From Charts to Fair Narratives: Uncovering and Mitigating Geo-Economic Biases in Chart-to-Text
Ridwan Mahbub
|
Mohammed Saidul Islam
|
Mir Tafseer Nayeem
|
Md Tahmid Rahman Laskar
|
Mizanur Rahman
|
Shafiq Joty
|
Enamul Hoque
Charts are very common for exploring data and communicating insights, but extracting key takeaways from charts and articulating them in natural language can be challenging. The chart-to-text task aims to automate this process by generating textual summaries of charts. While the rapid advancement of large Vision-Language Models (VLMs) has brought great progress in this domain, little to no attention has been given to potential biases in their outputs. This paper investigates how VLMs can amplify geo-economic biases when generating chart summaries, potentially causing societal harm. Specifically, we conduct a large-scale evaluation of geo-economic biases in VLM-generated chart summaries across 6,000 chart-country pairs from six widely used proprietary and open-source models to understand how a country’s economic status influences the sentiment of generated summaries. Our analysis reveals that existing VLMs tend to produce more positive descriptions for high-income countries compared to middle- or low-income countries, even when country attribution is the only variable changed. We also find that models such as GPT-4o-mini, Gemini-1.5-Flash, and Phi-3.5 exhibit varying degrees of bias. We further explore inference-time prompt-based debiasing techniques using positive distractors but find them only partially effective, underscoring the complexity of the issue and the need for more robust debiasing strategies. Our code and dataset are available at <redacted>.
pdf
bib
abs
Real-time Ad Retrieval via LLM-generative Commercial Intention for Sponsored Search Advertising
Tongtong Liu
|
Zhaohui Wang
|
Meiyue Qin
|
Zenghui Lu
|
Xudong Chen
|
Yuekui Yang
|
Peng Shu
The integration of Large Language Models (LLMs) with retrieval systems has shown promising potential in retrieving documents (docs) or advertisements (ads) for a given query. Existing LLM-based retrieval methods generate numeric or content-based DocIDs to retrieve docs/ads. However, the one-to-few mapping between numeric IDs and docs, along with the time-consuming content extraction, leads to semantic inefficiency and limits the scalability of existing methods on large-scale corpora. In this paper, we propose the **R**eal-time **A**d **RE**trieval (RARE) framework, which leverages LLM-generated text called Commercial Intentions (CIs) as an intermediate semantic representation to directly retrieve ads for queries in real-time. These CIs are generated by a customized LLM injected with commercial knowledge, enhancing its domain relevance. Each CI corresponds to multiple ads, yielding a lightweight and scalable set of CIs. RARE has been implemented in a real-world online system, handling daily search volumes in the billions. The online implementation has yielded significant benefits: a 5.04% increase in consumption, a 6.37% rise in Gross Merchandise Volume (GMV), a 1.28% enhancement in click-through rate (CTR) and a 5.29% increase in shallow conversions. Extensive offline experiments show RARE’s superiority over ten competitive baselines in four major categories.
pdf
bib
abs
Toward Efficient Sparse Autoencoder-Guided Steering for Improved In-Context Learning in Large Language Models
Ikhyun Cho
|
Julia Hockenmaier
Sparse autoencoders (SAEs) have emerged as a powerful analytical tool in mechanistic interpretability for large language models (LLMs), with growing success in applications beyond interpretability. Building on this momentum, we present a novel approach that leverages SAEs to enhance the general in-context learning (ICL) performance of LLMs. Specifically, we introduce Feature Detection through Prompt Variation (FDPV), which leverages the SAE’s remarkable ability to capture subtle differences between prompts, enabling efficient feature selection for downstream steering. In addition, we propose a novel steering method tailored to ICL—Selective In-Context Steering (SISTER)—grounded in recent insights from ICL research that LLMs utilize label words as key anchors. Our method yields a 3.5% average performance improvement across diverse text classification tasks and exhibits greater robustness to hyperparameter variations compared to standard steering approaches. Our code is available at https://github.com/ihcho2/SAE-ICL.
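To make the feature-selection idea concrete, here is a minimal sketch of picking SAE features whose average activation differs most between two prompt variants; the callable signature and token-averaging are assumptions for illustration, not the released FDPV code.

```python
import torch

def features_by_prompt_variation(sae_encode, acts_a, acts_b, top_k=10):
    """Rank SAE features by how strongly they separate two prompt variants.
    sae_encode: callable mapping residual activations (T, d) -> (T, n_features);
    acts_a, acts_b: activations collected for the two prompts."""
    f_a = sae_encode(acts_a).mean(dim=0)       # mean feature activation, prompt A
    f_b = sae_encode(acts_b).mean(dim=0)       # mean feature activation, prompt B
    diff = (f_a - f_b).abs()
    return torch.topk(diff, top_k).indices     # candidate features for downstream steering
```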
pdf
bib
abs
CLMTracing: Black-box User-level Watermarking for Code Language Model Tracing
Boyu Zhang
|
Ping He
|
Tianyu Du
|
Xuhong Zhang
|
Lei Yun
|
Kingsum Chow
|
Jianwei Yin
With the widespread adoption of open-source code language models (code LMs), intellectual property (IP) protection has become an increasingly critical concern. While current watermarking techniques have the potential to identify a code LM and protect its IP, they fall short of the more practical and complex demand of individual user-level tracing in the black-box setting. This work presents CLMTracing, a black-box code LM watermarking framework that employs rule-based watermarks and a utility-preserving injection method for user-level model tracing. CLMTracing further incorporates a parameter selection algorithm sensitive to the robust watermark and adversarial training to enhance robustness against watermark removal attacks. Comprehensive evaluations demonstrate that CLMTracing is effective across multiple state-of-the-art (SOTA) code LMs, showing significant improvements in harmlessness compared to existing SOTA baselines and strong robustness against various removal attacks.
pdf
bib
abs
The Good, the Bad and the Constructive: Automatically Measuring Peer Review’s Utility for Authors
Abdelrahman Sadallah
|
Tim Baumgärtner
|
Iryna Gurevych
|
Ted Briscoe
Providing constructive feedback to paper authors is a core component of peer review. With reviewers increasingly having less time to perform reviews, automated support systems are required to ensure high reviewing quality, thus making the feedback in reviews useful for authors. To this end, we identify four key aspects of review comments (individual points in weakness sections of reviews) that drive the utility for authors: Actionability, Grounding & Specificity, Verifiability, and Helpfulness. To enable evaluation and development of models assessing review comments, we introduce the RevUtil dataset. We collect 1,430 human-labeled review comments and scale our data with 10k synthetically labeled comments for training purposes. The synthetic data additionally contains rationales, i.e., explanations for the aspect score of a review comment. Employing the RevUtil dataset, we benchmark fine-tuned models for assessing review comments on these aspects and generating rationales. Our experiments demonstrate that these fine-tuned models achieve agreement levels with humans comparable to, and in some cases exceeding, those of powerful closed models like GPT-4o. Our analysis further reveals that machine-generated reviews generally underperform human reviews on our four aspects.
pdf
bib
abs
Evolving Chinese Spelling Correction with Corrector-Verifier Collaboration
Linfeng Liu
|
Hongqiu Wu
|
Hai Zhao
Recent methods address Chinese Spelling Correction (CSC) with either BERT-based models or large language models (LLMs) independently. However, both of them face challenges. BERT-based models are efficient for this task but struggle with limited generalizability to error patterns, thus failing in open-domain CSC. LLMs are advantageous in their extensive knowledge but fall into low efficiency in character-level editing. To address this dilemma, we propose Automatic Corrector Iteration (ACI), a novel model collaboration pipeline to iteratively optimize a BERT-based corrector. This pipeline is free of human annotation, by leveraging an LLM verifier to provide useful signals for the corrector. Experimental results demonstrate that our pipeline consistently improves the model performance across iterations and significantly outperforms existing data augmentation methods, achieving comparable performance with human annotation.
pdf
bib
abs
M2Edit: Locate and Edit Multi-Granularity Knowledge in Multimodal Large Language Model
Yang Zhou
|
Pengfei Cao
|
Yubo Chen
|
Qingbin Liu
|
Dianbo Sui
|
Xi Chen
|
Kang Liu
|
Jun Zhao
Multimodal knowledge editing is an important method for modifying outdated or incorrect knowledge in Multimodal Large Language Models (MLLMs). However, existing datasets for multimodal knowledge editing lack multi-granularity knowledge. In this paper, we present a more realistic dataset called M2Edit, which includes three distinct types of knowledge: entity, relation, and action. Additionally, existing knowledge editing methods for MLLMs lack the ability to handle multi-granularity knowledge and generalize to multimodal data. To address these limitations, we propose the multimodal knowledge editing method MLE. This approach identifies key knowledge layers within different components and collaboratively edits the various components of MLLMs. As a result, we observe significant improvements in visual generality performance, ranging from 4.8 to 10.8, and achieve the best overall performance on knowledge data of different granularities.
pdf
bib
abs
Do LLMs Behave as Claimed? Investigating How LLMs Follow Their Own Claims using Counterfactual Questions
Haochen Shi
|
Shaobo Li
|
Guoqing Chao
|
Xiaoliang Shi
|
Wentao Chen
|
Zhenzhou Ji
Large Language Models (LLMs) require robust evaluation. However, existing frameworks often rely on curated datasets that, once public, may be accessed by newer LLMs. This creates a risk of data leakage, where test sets inadvertently become part of training data, compromising evaluation fairness and integrity. To mitigate this issue, we propose Behave as Claimed (BaC), a novel evaluation framework inspired by counterfactual reasoning. BaC constructs a “what-if” scenario where LLMs respond to counterfactual questions about how they would behave if the input were manipulated. We refer to these responses as claims, which are verifiable by observing the LLMs’ actual behavior when given the manipulated input. BaC dynamically generates and verifies counterfactual questions using various few-shot in-context learning evaluation datasets, reducing their susceptibility to data leakage. Moreover, BaC provides a more challenging evaluation paradigm for LLMs. LLMs must thoroughly understand the prompt, the task, and the consequences of their responses to achieve better performance. We evaluate several state-of-the-art LLMs and find that, while most perform well on the original datasets, they struggle with BaC. This suggests that LLMs usually fail to align their claims with their actual behavior and that high performance on standard datasets may be less stable than previously assumed.
pdf
bib
abs
Multilingual vs Crosslingual Retrieval of Fact-Checked Claims: A Tale of Two Approaches
Alan Ramponi
|
Marco Rovera
|
Robert Moro
|
Sara Tonelli
Retrieval of previously fact-checked claims is a well-established task, whose automation can assist professional fact-checkers in the initial steps of information verification. Previous works have mostly tackled the task monolingually, i.e., having both the input and the retrieved claims in the same language. However, especially for languages with a limited availability of fact-checks and in the case of global narratives, such as pandemics, wars, or international politics, it is crucial to be able to retrieve claims across languages. In this work, we examine strategies to improve the multilingual and crosslingual performance, namely selection of negative examples (in the supervised setting) and re-ranking (in the unsupervised setting). We evaluate all approaches on a dataset containing posts and claims in 47 languages (283 language combinations). We observe that the best results are obtained by using LLM-based re-ranking, followed by fine-tuning with negative examples sampled using a sentence similarity-based strategy. Most importantly, we show that crosslinguality is a setup with its own unique characteristics compared to the multilingual setup.
pdf
bib
abs
How Much Do LLMs Hallucinate across Languages? On Realistic Multilingual Estimation of LLM Hallucination
Saad Obaid Ul Islam
|
Anne Lauscher
|
Goran Glavaš
In the age of misinformation, hallucination—the tendency of Large Language Models (LLMs) to generate non-factual or unfaithful responses—represents the main risk for their global utility. Despite LLMs becoming increasingly multilingual, the vast majority of research on detecting and quantifying LLM hallucination is (a) English-centric and (b) focused on machine translation (MT) and summarization, tasks that are less common in realistic settings than open information seeking. In contrast, we aim to quantify the extent of LLM hallucination across languages in knowledge-intensive long-form question answering (LFQA). To this end, we train a multilingual hallucination detection model and conduct a large-scale study across 30 languages and 6 open-source LLM families. We start from an English hallucination detection dataset and rely on MT to translate-train a detection model. We also manually annotate gold data for five high-resource languages; we then demonstrate, for these languages, that the estimates of hallucination rates are similar between silver (LLM-generated) and gold test sets, validating the use of silver data for estimating hallucination rates for other languages. For the final rate estimation, we build an open-domain QA dataset for 30 languages with LLM-generated prompts and Wikipedia articles as references. Our analysis shows that LLMs, in absolute terms, hallucinate more tokens in high-resource languages due to longer responses, but that the actual hallucination rates (i.e., normalized for length) seem uncorrelated with the sizes of languages’ digital footprints. We also find that smaller LLMs hallucinate more and, significantly, that LLMs with broader language support display higher hallucination rates.
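The length normalization mentioned above can be read as a simple per-token rate; a trivial sketch, assuming binary token-level hallucination labels from the detection model, is:

```python
def hallucination_rate(token_labels):
    """Length-normalized rate: fraction of generated tokens flagged as
    hallucinated (1 = hallucinated, 0 = supported). Illustrative only."""
    return sum(token_labels) / max(len(token_labels), 1)
```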
pdf
bib
abs
LiTransProQA: An LLM-based Literary Translation Evaluation Metric with Professional Question Answering
Ran Zhang
|
Wei Zhao
|
Lieve Macken
|
Steffen Eger
The impact of Large Language Models (LLMs) has extended into literary domains. However, existing evaluation metrics for literature prioritize mechanical accuracy over artistic expression and tend to overrate machine translation as being superior to human translation from experienced professionals. In the long run, this bias could result in an irreversible decline in translation quality and cultural authenticity. In response to the urgent need for a specialized literary evaluation metric, we introduce LITRANSPROQA, a novel, reference-free, LLM-based question-answering framework designed for literary translation evaluation. LITRANSPROQA integrates humans in the loop to incorporate insights from professional literary translators and researchers, focusing on critical elements in literary quality assessment such as literary devices, cultural understanding, and authorial voice. Our extensive evaluation shows that while literary-finetuned XCOMET-XL yields marginal gains, LITRANSPROQA substantially outperforms current metrics, achieving up to 0.07 gain in correlation and surpassing the best state-of-the-art metrics by over 15 points in adequacy assessments. Incorporating professional translator insights as weights further improves performance, highlighting the value of translator inputs. Notably, LITRANSPROQA reaches an adequacy performance comparable to trained linguistic student evaluators, though it still falls behind experienced professional translators. LITRANSPROQA shows broad applicability to open-source models like LLaMA3.3-70b and Qwen2.5-32b, indicating its potential as an accessible and training-free tool for evaluating literary translations that require local processing due to copyright or ethical considerations.
pdf
bib
abs
Improving Handshape Representations for Sign Language Processing: A Graph Neural Network Approach
Alessa Carbo
|
Eric Nalisnick
Handshapes serve a fundamental phonological role in signed languages, with American Sign Language employing approximately 50 distinct shapes. However, computational approaches rarely model handshapes explicitly, which limits both recognition accuracy and linguistic analysis. We introduce a novel graph neural network that separates temporal dynamics from static handshape configurations. Our approach combines anatomically informed graph structures with contrastive learning to address key challenges in handshape recognition, including subtle inter-class distinctions and temporal variations. We establish the first benchmark for structured handshape recognition in signing sequences, achieving 46% accuracy across 37 handshape classes, compared to 25% for baseline methods.
pdf
bib
abs
Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque
Oscar Sainz
|
Naiara Perez
|
Julen Etxaniz
|
Joseba Fernandez de Landa
|
Itziar Aldabe
|
Iker García-Ferrero
|
Aimar Zabala
|
Ekhi Azurmendi
|
German Rigau
|
Eneko Agirre
|
Mikel Artetxe
|
Aitor Soroa
Instructing language models with user intent requires large instruction datasets, which are only available for a limited set of languages. In this paper, we explore alternatives to conventional instruction adaptation pipelines in low-resource scenarios. We assume a realistic scenario for low-resource languages, where only the following are available: corpora in the target language, existing open-weight multilingual base and instructed backbone LLMs, and synthetically generated instructions sampled from the instructed backbone. We present a comprehensive set of experiments for Basque that systematically study different combinations of these components, evaluated on benchmarks and human preferences from 1,680 participants. Our conclusions show that target language corpora are essential, that synthetic instructions yield robust models, and, most importantly, that using an instruction-tuned model as the backbone outperforms using a non-instructed base model. Scaling up to Llama 3.1 Instruct 70B as the backbone, our model comes near frontier models of much larger sizes for Basque, without using any Basque instructions. We release code, models, instruction datasets, and human preferences to support full reproducibility in future research on low-resource language adaptation.
pdf
bib
abs
SOCIAL SCAFFOLDS: A Generalization Framework for Social Understanding Tasks
Ritam Dutt
|
Carolyn Rose
|
Maarten Sap
Effective human communication in social settings is contingent on recognizing subtle cues, such as intents or implications. Without such cues, NLP models risk missing social signals, instead relying on surface patterns. We introduce SOCIAL SCAFFOLDS, an automated framework for facilitating generalization across social reasoning tasks by generating rationales that make these social cues explicit. Grounded in narrative modeling principles, we generate task-agnostic rationales that capture different perspectives, i.e., that of the speaker, the listener, and the general world-view. Our experimental suite showcases that providing rationales as augmentations aids task performance for both supervised fine-tuning and in-context learning paradigms. Notably, providing all three rationale types significantly improves cross-task performance in 44% of cases, and inferred speaker intent in 31.3% of cases. We conduct statistical and ablation analyses that show how rationales complement the input text and are used effectively by models.
pdf
bib
abs
Beyond A Single AI Cluster: A Survey of Decentralized LLM Training
Haotian Dong
|
Jingyan Jiang
|
Rongwei Lu
|
Jiajun Luo
|
Jiajun Song
|
Bowen Li
|
Ying Shen
|
Zhi Wang
The emergence of large language models (LLMs) has revolutionized AI development, yet their resource demands extend beyond a single cluster or even a single datacenter, limiting accessibility to well-resourced organizations. Decentralized training has emerged as a promising paradigm to leverage dispersed resources across clusters, datacenters, and even regions, offering the potential to democratize LLM development for broader communities. As the first comprehensive exploration of this emerging field, we present decentralized LLM training as a resource-driven paradigm and categorize existing efforts into community-driven and organizational approaches. We further clarify this through: (1) a comparison with related paradigms, (2) characterization of decentralized resources, and (3) a taxonomy of recent advancements. We also provide up-to-date case studies and outline future directions to advance research in decentralized LLM training.
pdf
bib
abs
Can LLM Agents Maintain a Persona in Discourse?
Pranav Bhandari
|
Nicolas Fay
|
Michael J Wise
|
Amitava Datta
|
Stephanie Meek
|
Usman Naseem
|
Mehwish Nasim
Large Language Models (LLMs) are widely used as conversational agents, exploiting their capabilities in various sectors such as education, law, medicine, and more. However, LLMs often exhibit context-shifting behaviour, resulting in a lack of consistent and interpretable personality-aligned interactions. Their adherence to assigned psychological traits has not been comprehensively analysed, especially in the case of dyadic (pairwise) conversations. We examine this challenge from two viewpoints, first using two conversation agents to generate a discourse on a certain topic with an assigned personality from the OCEAN framework (Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism), specified as High/Low for each trait. This is followed by using multiple judge agents to infer the original traits assigned, to explore prediction consistency, inter-model agreement, and alignment with the assigned personality. Our findings indicate that while LLMs can be guided toward personality-driven dialogue, their ability to maintain personality traits varies significantly depending on the combination of models and discourse settings. These inconsistencies emphasise the challenges in achieving stable and interpretable personality-aligned interactions in LLMs.
pdf
bib
abs
Iterative Multilingual Spectral Attribute Erasure
Shun Shao
|
Yftah Ziser
|
Zheng Zhao
|
Yifu Qiu
|
Shay B Cohen
|
Anna Korhonen
Multilingual representations embed words with similar meanings to share a common semantic space across languages, creating opportunities to transfer debiasing effects between languages. However, existing methods for debiasing are unable to exploit this opportunity because they operate on individual languages. We present Iterative Multilingual Spectral Attribute Erasure (IMSAE), which identifies and mitigates joint bias subspaces across multiple languages through iterative SVD-based truncation. Evaluating IMSAE across eight languages and five demographic dimensions, we demonstrate its effectiveness in both standard and zero-shot settings, where target language data is unavailable, but linguistically similar languages can be used for debiasing. Our comprehensive experiments across diverse language models (BERT, LLaMA, Mistral) show that IMSAE outperforms traditional monolingual and cross-lingual approaches while maintaining model utility.
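A minimal sketch of iterative SVD-based erasure of a joint bias subspace, assuming per-language matrices of bias-direction vectors (e.g., mean-centered group differences), follows; the rank, iteration count, and data layout are illustrative assumptions rather than the IMSAE release.

```python
import numpy as np

def iterative_spectral_erasure(embs_by_lang, n_iters=5, rank=1):
    """Remove bias directions shared across languages by repeated SVD truncation.
    embs_by_lang: dict lang -> (n_i, d) matrix of bias-difference vectors."""
    projection = None
    for _ in range(n_iters):
        stacked = np.vstack([m - m.mean(axis=0) for m in embs_by_lang.values()])
        _, _, vt = np.linalg.svd(stacked, full_matrices=False)
        bias_dirs = vt[:rank]                                    # joint bias directions (rank, d)
        p = np.eye(stacked.shape[1]) - bias_dirs.T @ bias_dirs   # projector onto their complement
        projection = p if projection is None else p @ projection
        embs_by_lang = {k: m @ p.T for k, m in embs_by_lang.items()}
    return projection                                            # apply to new embeddings: x @ projection.T
```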
pdf
bib
abs
TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research
Abir Harrasse
|
Philip Quirke
|
Clement Neo
|
Dhruv Nathawani
|
Luke Marks
|
Amir Abdullah
Mechanistic interpretability research faces a gap between analyzing simple circuits in toy tasks and discovering features in large models. To bridge this gap, we propose text-to-SQL generation as an ideal task to study, as it combines the formal structure of toy tasks with real-world complexity. We introduce TinySQL, a synthetic dataset progressing from basic to advanced SQL operations, and train models ranging from 33M to 1B parameters to establish a comprehensive testbed for interpretability. We apply multiple complementary interpretability techniques, including Edge Attribution Patching and Sparse Autoencoders, to identify minimal circuits and components supporting SQL generation. We compare circuits for different SQL subskills, evaluating their minimality, reliability, and identifiability. Finally, we conduct a layerwise logit lens analysis to reveal how models compose SQL queries across layers: from intent recognition to schema resolution to structured generation. Our work provides a robust framework for probing and comparing interpretability methods in a structured, progressively complex setting.
pdf
bib
abs
SCRIBE: Structured Chain Reasoning for Interactive Behaviour Explanations using Tool Calling
Fares Fawzi
|
Vinitra Swamy
|
Dominik Glandorf
|
Tanya Nazaretsky
|
Tanja Käser
Language models can be used to provide interactive, personalized student feedback in educational settings. However, real-world deployment faces three key challenges: privacy concerns, limited computational resources, and the need for pedagogically valid responses. These constraints require small, open-source models that can run locally and reliably ground their outputs in correct information. We introduce SCRIBE, a framework for multi-hop, tool-augmented reasoning designed to generate valid responses to student questions about feedback reports. SCRIBE combines domain-specific tools with a self-reflective inference pipeline that supports iterative reasoning, tool use, and error recovery. We distil these capabilities into 3B and 8B models via two-stage LoRA fine-tuning on synthetic GPT-4o-generated data. Evaluation with a human-aligned GPT-Judge and a user study with 108 students shows that 8B-SCRIBE models achieve comparable or superior quality to much larger models in key dimensions such as relevance and actionability, while being perceived on par with GPT-4o and Llama-3.3 70B by students. These findings demonstrate the viability of SCRIBE for low-resource, privacy-sensitive educational applications.
pdf
bib
abs
Logit Space Constrained Fine-Tuning for Mitigating Hallucinations in LLM-Based Recommender Systems
Jianfeng Deng
|
Qingfeng Chen
|
Debo Cheng
|
Jiuyong Li
|
Lin Liu
Large language models (LLMs) have gained increasing attention in recommender systems, but their inherent hallucination issues significantly compromise the accuracy and reliability of recommendation results. Existing LLM-based recommender systems predominantly rely on standard fine-tuning methodologies, often ignoring hallucination issues during the fine-tuning process. To address this challenge, we propose Logit Space Constrained Fine-Tuning (LCFT), a novel fine-tuning framework designed to mitigate hallucination in LLM-based recommenders. Specifically, LCFT takes as input semantically positive and negative instruction pairs and incorporates Kullback–Leibler (KL) divergence into the training objective to explicitly maximise their distributional disparity in the logit space. By conducting such logit space-constrained fine-tuning, LCFT encourages more distinguishable and semantically grounded representations, thereby reducing the model’s susceptibility to hallucination. Extensive experiments on two recommendation models with distinct LLM backbones and four real-world datasets demonstrate that LCFT consistently reduces hallucination and enhances recommendation performance.
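As a hedged sketch of the training objective described above, the snippet below adds a KL term over the logits of a semantically positive/negative instruction pair to the standard language-modeling loss; the sign convention, weighting, and reduction are illustrative assumptions, not the LCFT implementation.

```python
import torch.nn.functional as F

def lcft_style_loss(pos_logits, neg_logits, lm_loss, alpha=0.1):
    """Encourage distributional disparity in logit space between a positive and
    a negative instruction (illustrative; pos_logits/neg_logits: (B, V))."""
    log_p = F.log_softmax(pos_logits, dim=-1)
    q = F.softmax(neg_logits, dim=-1)
    kl = F.kl_div(log_p, q, reduction="batchmean")   # disparity between the pair
    return lm_loss - alpha * kl                      # maximizing disparity lowers the loss
```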
pdf
bib
abs
PACHAT: Persona-Aware Speech Assistant for Multi-party Dialogue
Dongjie Fu
|
Xize Cheng
|
Linjun Li
|
Xiaoda Yang
|
Lujia Yang
|
Tao Jin
Extensive research on LLM-based spoken dialogue systems has significantly advanced the development of intelligent voice assistants. However, the integration of role information within speech remains an underexplored area, limiting its application in real-world scenarios, particularly in multi-party dialogue settings. With the growing demand for personalization, voice assistants that can recognize and remember users establish a deeper connection with them. We focus on enabling LLMs with speaker-awareness capabilities and enhancing their understanding of character settings through synthetic data to generate contextually appropriate responses. We introduce Persona-Dialogue, the first large-scale multi-party spoken dialogue dataset that incorporates speaker profiles. Based on this dataset, we propose PAChat, an architecture that simultaneously models both linguistic content and speaker features, allowing LLMs to map character settings to speaker identities in speech. Through extensive experiments, we demonstrate that PAChat successfully achieves speaker-specific responses, character understanding, and the generation of targeted replies in multi-party dialogue scenarios, surpassing existing spoken dialogue systems.
pdf
bib
abs
Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking
Junda Zhu
|
Lingyong Yan
|
Shuaiqiang Wang
|
Dawei Yin
|
Lei Sha
Large Reasoning Models (LRMs) have recently demonstrated impressive performance across diverse domains. However, how the safety of Large Language Models (LLMs) benefits from enhanced reasoning capabilities against jailbreak queries remains unexplored. To bridge this gap, in this paper, we propose Reasoning-to-Defend (R2D), a novel training paradigm that integrates a safety-aware reasoning mechanism into LLMs’ generation process. This enables self-evaluation at each step of the reasoning process, forming safety pivot tokens as indicators of the safety status of responses. Furthermore, to improve the accuracy of predicting pivot tokens, we propose Contrastive Pivot Optimization (CPO), which enhances the model’s perception of the safety status of given dialogues. LLMs dynamically adjust their response strategies during reasoning, significantly enhancing their safety in defending against jailbreak attacks. Extensive experiments demonstrate that R2D effectively mitigates various attacks and improves overall safety while maintaining the original performance. This highlights the substantial potential of safety-aware reasoning in improving the robustness of LRMs and LLMs against various jailbreaks.
pdf
bib
abs
Graph-Guided Textual Explanation Generation Framework
Shuzhou Yuan
|
Jingyi Sun
|
Ran Zhang
|
Michael Färber
|
Steffen Eger
|
Pepa Atanasova
|
Isabelle Augenstein
Natural language explanations (NLEs) are commonly used to provide plausible free-text explanations of a model’s reasoning about its predictions. However, recent work has questioned their faithfulness, as they may not accurately reflect the model’s internal reasoning process regarding its predicted answer. In contrast, highlight explanations–input fragments critical for the model’s predicted answers–exhibit measurable faithfulness. Building on this foundation, we propose G-TEx, a Graph-Guided Textual Explanation Generation framework designed to enhance the faithfulness of NLEs. Specifically, highlight explanations are first extracted as faithful cues reflecting the model’s reasoning logic toward answer prediction. They are subsequently encoded through a graph neural network layer to guide the NLE generation, which aligns the generated explanations with the model’s underlying reasoning toward the predicted answer. Experiments on both encoder-decoder and decoder-only models across three reasoning datasets demonstrate that G-TEx improves NLE faithfulness by up to 12.18% compared to baseline methods. Additionally, G-TEx generates NLEs with greater semantic and lexical similarity to human-written ones. Human evaluations show that G-TEx can decrease redundant content and enhance the overall quality of NLEs. Our work presents a novel method for explicitly guiding NLE generation to enhance faithfulness, serving as a foundation for addressing broader criteria in NLE and generated text.
pdf
bib
abs
The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It
Leonardo Bertolazzi
|
Philipp Mondorf
|
Barbara Plank
|
Raffaella Bernardi
The ability of large language models (LLMs) to validate their output and identify potential errors is crucial for ensuring robustness and reliability. However, current research indicates that LLMs struggle with self-correction, encountering significant challenges in detecting errors. While studies have explored methods to enhance self-correction in LLMs, relatively little attention has been given to understanding the models’ internal mechanisms underlying error detection. In this paper, we present a mechanistic analysis of error detection in LLMs, focusing on simple arithmetic problems. Through circuit analysis, we identify the computational subgraphs responsible for detecting arithmetic errors across four smaller-sized LLMs. Our findings reveal that all models heavily rely on consistency heads, i.e., attention heads that assess surface-level alignment of numerical values in arithmetic solutions. Moreover, we observe that the models’ internal arithmetic computation primarily occurs in higher layers, whereas validation takes place in middle layers, before the final arithmetic results are fully encoded. This structural dissociation between arithmetic computation and validation seems to explain why smaller-sized LLMs struggle to detect even simple arithmetic errors.
pdf
bib
abs
A Causal Lens for Evaluating Faithfulness Metrics
Kerem Zaman
|
Shashank Srivastava
Large Language Models (LLMs) offer natural language explanations as an alternative to feature attribution methods for model interpretability. However, despite their plausibility, they may not reflect the model’s true reasoning faithfully. While several faithfulness metrics have been proposed, they are often evaluated in isolation, making principled comparisons between them difficult. We present Causal Diagnosticity, a testbed framework for evaluating faithfulness metrics for natural language explanations. We use the concept of diagnosticity, and employ model-editing methods to generate faithful-unfaithful explanation pairs. Our benchmark includes four tasks: fact-checking, analogy, object counting, and multi-hop reasoning. We evaluate prominent faithfulness metrics, including post-hoc explanation and chain-of-thought methods. Diagnostic performance varies across tasks and models, with Filler Tokens performing best overall. Additionally, continuous metrics are generally more diagnostic than binary ones but can be sensitive to noise and model choice. Our results highlight the need for more robust faithfulness metrics.
pdf
bib
abs
Sequential-NIAH: A Needle-In-A-Haystack Benchmark for Extracting Sequential Needles from Long Contexts
Yifei Yu
|
Qian-Wen Zhang
|
Lingfeng Qiao
|
Di Yin
|
Fang Li
|
Jie Wang
|
Chen Zeng Xi
|
Suncong Zheng
|
Xiaolong Liang
|
Xing Sun
Evaluating the ability of large language models (LLMs) to process lengthy contexts is critical, especially for retrieving query-relevant information embedded within them. We introduce Sequential-NIAH, a benchmark specifically designed to evaluate the capability of LLMs to extract sequential information items (known as needles) from long contexts. The benchmark includes three needle generation pipelines: synthetic-temporal, real-temporal, and real-logical orders, with context lengths ranging from 8K to 128K, and comprises 14,000 samples (2,000 for testing). To facilitate the evaluation of this benchmark, we trained an evaluation model that assesses the correctness of LLM responses by comparing their completeness and sequential consistency against the ground truth, which provides a more reliable evaluation metric than GPT-4 or Claude. We conducted experiments on six well-known LLMs, revealing that even the best-performing model achieved a maximum accuracy of only 63.50% on the test set of this benchmark. Further analysis highlights the growing challenges posed by increasing the context length or the number of needles, underscoring substantial room for improvement of LLMs. Additionally, noise analysis validates the reliability and challenge of the benchmark, making Sequential-NIAH an important reference for advancing research on long text information extraction capabilities of LLMs.
pdf
bib
abs
FISTAPruner: Layer-wise Post-training Pruning for Large Language Models
Pengxiang Zhao
|
Hanyu Hu
|
Ping Li
|
Yi Zheng
|
Zhefeng Wang
|
Xiaoming Yuan
Pruning is a critical strategy for compressing trained large language models (LLMs), aiming at substantial memory conservation and computational acceleration without compromising performance. However, existing pruning methods typically necessitate inefficient retraining for billion-scale LLMs or rely on heuristically designed metrics to determine pruning masks, leading to performance degradation. This paper presents, for the first time, a LASSO-like convex optimization model crafted to induce sparsity in LLMs. By leveraging FISTA, we introduce FISTAPruner, a novel method that includes a cumulative error elimination mechanism within decoder layers and supports parallel pruning for unstructured pruning. Additionally, we extend this method to 2:4 semi-structured pruning. We comprehensively evaluate FISTAPruner on models such as OPT, LLaMA, and Qwen variants with 125M to 70B parameters under unstructured and 2:4 semi-structured sparsity, showcasing superior performance over existing methods across various language benchmarks. Notably, it can remove 50% of the model parameters for LLaMA-3-70B while retaining 98.6% and 95.6% of the zero-shot task performance under these two sparsity patterns, respectively.
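For intuition, a toy FISTA loop for a LASSO-like layer-wise reconstruction objective (minimize (1/n)||X W^T - X M^T||_F^2 + lam*||M||_1 over pruned weights M, given calibration activations X) is sketched below; this is a generic illustration under stated assumptions, not the FISTAPruner code, and it omits the cumulative error elimination mechanism and 2:4 structural constraints.

```python
import torch

def fista_lasso_prune(W, X, lam=0.01, steps=100):
    """Toy FISTA for min_M (1/n)||X W^T - X M^T||_F^2 + lam*||M||_1.
    W: (out, in) dense weights; X: (n, in) calibration activations."""
    n = X.shape[0]
    S = X.T @ X                                                   # (in, in) second-moment matrix
    L = (2.0 / n) * torch.linalg.matrix_norm(S, ord=2)            # Lipschitz constant of the gradient
    M, Y, t = W.clone(), W.clone(), 1.0
    for _ in range(steps):
        grad = (2.0 / n) * (Y @ S - W @ S)                        # gradient of the quadratic term
        Z = Y - grad / L
        M_next = torch.sign(Z) * torch.clamp(Z.abs() - lam / L, min=0.0)  # soft-thresholding
        t_next = (1 + (1 + 4 * t * t) ** 0.5) / 2                 # FISTA momentum schedule
        Y = M_next + ((t - 1) / t_next) * (M_next - M)
        M, t = M_next, t_next
    return M                                                      # sparse replacement for W
```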
pdf
bib
abs
Do LLMs Encode Frame Semantics? Evidence from Frame Identification
Jayanth Krishna Chundru
|
Rudrashis Poddar
|
Jie Cao
|
Tianyu Jiang
We investigate whether large language models encode latent knowledge of frame semantics, focusing on frame identification, a core challenge in frame semantic parsing that involves selecting the appropriate semantic frame for a target word in context. Using the FrameNet lexical resource, we evaluate models under prompt-based inference and observe that they can perform frame identification effectively even without explicit supervision. To assess the impact of task-specific training, we fine-tune the model on FrameNet data, which substantially improves in-domain accuracy while generalizing well to out-of-domain benchmarks. Further analysis shows that the models can generate semantically coherent frame definitions, highlighting the model’s internalized understanding of frame semantics.
pdf
bib
abs
StepER: Step-wise Knowledge Distillation for Enhancing Reasoning Ability in Multi-Step Retrieval-Augmented Language Models
Kyumin Lee
|
Minjin Jeon
|
Sanghwan Jang
|
Hwanjo Yu
Answering complex real-world questions requires step-by-step retrieval and integration of relevant information to generate well-grounded responses. However, existing knowledge distillation methods overlook the need for different reasoning abilities at different steps, hindering transfer in multi-step retrieval-augmented frameworks. To address this, we propose Step-wise Knowledge Distillation for Enhancing Reasoning Ability in Multi-Step Retrieval-Augmented Language Models (StepER). StepER employs step-wise supervision to align with evolving information and reasoning demands across stages. Additionally, it incorporates difficulty-aware training to progressively optimize learning by prioritizing suitable steps. Our method is highly adaptable across various frameworks of multi-step retrieval-augmented language models, including those based on reasoning paths or question decomposition. Extensive experiments show that StepER outperforms prior methods on multi-hop QA benchmarks, with an 8B model achieving performance comparable to a 70B teacher model.
pdf
bib
abs
How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis
Yushi Yang
|
Filip Sondej
|
Harry Mayne
|
Andrew Lee
|
Adam Mahdi
Safety fine-tuning algorithms reduce harmful outputs in language models, yet their mechanisms remain under-explored. Direct Preference Optimization (DPO) is a popular choice of algorithm, but prior explanations—attributing its effects solely to dampened toxic neurons in the MLP layers—are incomplete. In this study, we analyse four language models (Llama-3.1-8B, Gemma-2-2B, Mistral-7B, GPT-2-Medium) and show that toxic neurons only account for 2.5% to 24% of DPO’s effects across models. Instead, DPO induces distributed activation shifts across all MLP neurons to create a net toxicity reduction. We attribute this reduction to four neuron groups—two aligned with reducing toxicity and two promoting anti-toxicity—whose combined effects replicate DPO across models. To further validate this understanding, we develop an activation editing method that mimics DPO through distributed shifts along a toxicity representation. This method outperforms DPO in reducing toxicity while preserving perplexity, without requiring any weight updates. Our work provides a mechanistic understanding of DPO and introduces an efficient, tuning-free alternative for safety fine-tuning.
pdf
bib
abs
It’s All About In-Context Learning! Teaching Extremely Low-Resource Languages to LLMs
Yue Li
|
Zhixue Zhao
|
Carolina Scarton
Extremely low-resource languages, especially those written in rare scripts, remain largely unsupported by large language models (LLMs). This is due in part to compounding factors such as the lack of training data. This paper delivers the first comprehensive analysis of whether LLMs can acquire such languages purely via in-context learning (ICL), with or without auxiliary alignment signals, and how these methods compare to parameter-efficient fine-tuning (PEFT). We systematically evaluate 20 under-represented languages across three state-of-the-art multilingual LLMs. Our findings highlight the limitation of PEFT when both language and its script are extremely under-represented by the LLM. In contrast, zero-shot ICL with language alignment is impressively effective on extremely low-resource languages, while few-shot ICL or PEFT is more beneficial for languages relatively better represented by LLMs. For LLM practitioners working on extremely low-resource languages, we summarise guidelines grounded by our results on adapting LLMs to low-resource languages, e.g., avoiding fine-tuning a multilingual model on languages of unseen scripts.
pdf
bib
abs
Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning
Kwesi Adu Cobbina
|
Tianyi Zhou
In-context learning (ICL) is a critical emerging capability of large language models (LLMs), enabling few-shot learning during inference by including a few demonstrations (demos) in the prompt. However, it has been found that ICL’s performance can be sensitive to the choices of demos and their order. This paper investigates a previously unexplored positional bias of ICL: we observe that predictions and accuracy can drift drastically when the positions of demos, system prompt, and user message in the LLM input are varied. We refer to this bias as the DEMOS’ POSITION IN PROMPT (DPP) bias. We design a systematic evaluation pipeline to study this type of positional bias across classification, QA, summarization, and reasoning tasks. We introduce two metrics, ACCURACY-CHANGE and PREDICTION-CHANGE, to quantify net gains and output volatility induced by demos’ position change. Extensive experiments on ten LLMs from four open-source model families (QWEN, LLAMA3, MISTRAL, COHERE) verify that the bias significantly affects their accuracy and predictions: placing demos at the start of the prompt yields the most stable and accurate outputs, with gains of up to +6 points. In contrast, placing demos at the end of the user message flips over 30% of predictions without improving correctness in QA tasks. Smaller models are most affected by this sensitivity, though even large models remain marginally affected on complex tasks.
pdf
bib
abs
Multilingual Pretraining for Pixel Language Models
Ilker Kesen
|
Jonas F. Lotz
|
Ingo Ziegler
|
Phillip Rust
|
Desmond Elliott
Pixel language models operate directly on images of rendered text, eliminating the need for a fixed vocabulary. While these models have demonstrated strong capabilities for downstream cross-lingual transfer, multilingual pretraining remains underexplored. We introduce PIXEL-M4, a model pretrained on four visually and linguistically diverse languages: English, Hindi, Ukrainian, and Simplified Chinese. Multilingual evaluations on semantic and syntactic tasks show that PIXEL-M4 outperforms an English-only counterpart on non-Latin scripts. Word-level probing analyses confirm that PIXEL-M4 captures rich linguistic features, even in languages not seen during pretraining. Furthermore, an analysis of its hidden representations shows that multilingual pretraining yields a semantic embedding space closely aligned across the languages used for pretraining. This work demonstrates that multilingual pretraining substantially enhances the capability of pixel language models to effectively support a diverse set of languages.
pdf
bib
abs
MetaFaith: Faithful Natural Language Uncertainty Expression in LLMs
Gabrielle Kaili-May Liu
|
Gal Yona
|
Avi Caciularu
|
Idan Szpektor
|
Tim G. J. Rudner
|
Arman Cohan
A critical component in the trustworthiness of LLMs is reliable uncertainty communication, yet LLMs often use assertive language when conveying false claims, leading to over-reliance and eroded trust. We present the first systematic study of _faithful confidence calibration_ of LLMs, benchmarking models’ ability to use linguistic expressions of uncertainty that _faithfully reflect_ their intrinsic uncertainty, across a comprehensive array of models, datasets, and prompting strategies. Our results demonstrate that LLMs largely fail at this task, and that existing interventions are insufficient: standard prompt approaches provide only marginal gains, and existing, factuality-based calibration techniques can even harm faithful calibration. To address this critical gap, we introduce MetaFaith, a novel prompt-based calibration approach inspired by human metacognition. We show that MetaFaith robustly improves faithful calibration across diverse models and task domains, enabling up to 61% improvement in faithfulness and achieving an 83% win rate over original generations as judged by humans.
pdf
bib
abs
Machine-generated text detection prevents language model collapse
George Drayson
|
Emine Yilmaz
|
Vasileios Lampos
As Large Language Models (LLMs) become increasingly prevalent, their generated outputs are proliferating across the web, risking a future where machine-generated content dilutes human-authored text. Since online data is the primary resource for LLM pre-training, subsequent models could be trained on an unknown portion of synthetic samples. This could lead to model collapse, a degenerative process whereby LLMs reinforce their own errors, reduce output diversity, and ultimately yield declining performance. In this study, we investigate the impact of decoding strategy on model collapse, analysing the text characteristics at each model generation, the similarity to human references, and the resulting model performance. Using the decoding strategies that lead to the most significant degradation, we evaluate model collapse in a more realistic scenario where the origin of the data (human or synthetic) is unknown. We train a machine-generated text detector and propose an importance resampling approach to prevent model collapse by up-sampling likely human content in the training data. Our method is validated on four LLMs from two model families (GPT-2 and SmolLM2), across a range of model sizes (124M to 1.7B). We demonstrate that it not only prevents model collapse but also improves performance compared to training on purely human data, underscoring the benefit of synthetic samples and the importance of data curation.
pdf
bib
abs
Data-Efficient Hate Speech Detection via Cross-Lingual Nearest Neighbor Retrieval with Limited Labeled Data
Faeze Ghorbanpour
|
Daryna Dementieva
|
Alexander Fraser
Although detecting hateful language is important, labeled hate speech data is expensive and time-consuming to collect, particularly for low-resource languages. Prior work has demonstrated the effectiveness of cross-lingual transfer learning and data augmentation in improving performance on tasks with limited labeled data. To develop an efficient and scalable cross-lingual transfer learning approach, we leverage nearest-neighbor retrieval to augment minimal labeled data in the target language, thereby enhancing detection performance. Specifically, we assume access to a small set of labeled training instances in the target language and use these to retrieve the most relevant labeled examples from a large multilingual hate speech detection pool. We evaluate our approach on eight languages and demonstrate that it consistently outperforms models trained solely on the target language data. Furthermore, in most cases, our method surpasses the current state-of-the-art. Notably, our approach is highly data-efficient, retrieving as few as 200 instances in some cases while maintaining superior performance. Moreover, it is scalable, as the retrieval pool can be easily expanded, and the method can be readily adapted to new languages and tasks. We also apply maximum marginal relevance to mitigate redundancy and filter out highly similar retrieved instances, resulting in improvements in some languages.
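As a hedged sketch of the retrieval-plus-MMR step described above, the following assumes pre-computed, L2-normalised multilingual sentence embeddings; the function name, the `lambda_` trade-off, and the toy data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mmr_retrieve(seed_embs, pool_embs, k=200, lambda_=0.7):
    """Select k pool items most relevant to the target-language seed set while
    penalising redundancy, in the spirit of maximal marginal relevance (MMR)."""
    # Cosine similarities (embeddings assumed L2-normalised).
    relevance = (pool_embs @ seed_embs.T).max(axis=1)   # best match against any seed
    selected, candidates = [], list(range(len(pool_embs)))
    while candidates and len(selected) < k:
        if selected:
            redundancy = (pool_embs[candidates] @ pool_embs[selected].T).max(axis=1)
        else:
            redundancy = np.zeros(len(candidates))
        mmr = lambda_ * relevance[candidates] - (1 - lambda_) * redundancy
        best = candidates.pop(int(np.argmax(mmr)))       # pool index of the chosen item
        selected.append(best)
    return selected

# Toy usage with random normalised embeddings standing in for a multilingual pool.
rng = np.random.default_rng(0)
seed = rng.normal(size=(20, 64))
seed /= np.linalg.norm(seed, axis=1, keepdims=True)
pool = rng.normal(size=(1000, 64))
pool /= np.linalg.norm(pool, axis=1, keepdims=True)
print(mmr_retrieve(seed, pool, k=5))
```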
pdf
bib
abs
V-VAE: A Variational Auto Encoding Framework Towards Fine-Grained Control over Human-Like Chat
Qi Lin
|
Weikai Xu
|
Lisi Chen
|
Bin Dai
With the continued proliferation of Large Language Model (LLM) based chatbots, there is a growing demand for generating responses that are not only linguistically fluent but also consistently aligned with persona-specific traits in conversations. However, existing role-play and persona-based chat approaches rely heavily on static role descriptions, coarse-grained signal space, and low-quality synthetic data, which fail to capture dynamic fine-grained details in human-like chat. Human-like chat requires modeling subtle latent traits, such as emotional tone, situational awareness, and evolving personality, which are difficult to predefine and cannot be easily learned from synthetic or distillation-based data. To address these limitations, we propose a Verbal Variational Auto-Encoding (V-VAE) framework, containing a variational auto-encoding module and fine-grained control space which dynamically adapts dialogue behaviour based on fine-grained, interpretable latent variables across talking style, interaction patterns, and personal attributes. We also construct a high-quality dataset, HumanChatData, and benchmark HumanChatBench to address the scarcity of high-quality data in the human-like domain. Experiments show that LLMs based on V-VAE consistently outperform standard baselines on HumanChatBench and DialogBench, which further demonstrates the effectiveness of V-VAE and HumanChatData.
pdf
bib
abs
Mixture of Languages: Improved Multilingual Encoders Through Language Grouping
João Maria Janeiro
|
Belen Alastruey
|
Francisco Massa
|
Maha Elbayad
|
Benjamin Piwowarski
|
Patrick Gallinari
|
Loic Barrault
We propose Mixture of Languages (MoL), a new strategy to pretrain largely multilingual encoders. Recent work in this field has relied on training transformer encoders on a large amount of multilingual data, with all parameters shared across all languages, without studying how to optimally balance language transfer and interference to achieve better performance. To address this, MoL proposes to group languages based on their similarity, and add parallel, sparsely activated layers that process each group independently. This architecture allows MoL to boost language transfer while minimizing interference, without increasing the active parameter count. We show that MoL largely outperforms a dense counterpart trained with the same configuration, as well as MoE models and public multilingual encoders such as XLM-R or mBERT on downstream tasks.
pdf
bib
abs
Too Helpful, Too Harmless, Too Honest or Just Right?
Gautam Siddharth Kashyap
|
Mark Dras
|
Usman Naseem
Large Language Models (LLMs) exhibit strong performance across a wide range of NLP tasks, yet aligning their outputs with the principles of Helpfulness, Harmlessness, and Honesty (HHH) remains a persistent challenge. Existing methods often optimize for individual alignment dimensions in isolation, leading to trade-offs and inconsistent behavior. While Mixture-of-Experts (MoE) architectures offer modularity, they suffer from poorly calibrated routing, limiting their effectiveness in alignment tasks. We propose TrinityX, a modular alignment framework that incorporates a Mixture of Calibrated Experts (MoCaE) within the Transformer architecture. TrinityX leverages separately trained experts for each HHH dimension, integrating their outputs through a calibrated, task-adaptive routing mechanism that combines expert signals into a unified, alignment-aware representation. Extensive experiments on three standard alignment benchmarks—Alpaca (Helpfulness), BeaverTails (Harmlessness), and TruthfulQA (Honesty)—demonstrate that TrinityX outperforms strong baselines, achieving relative improvements of 32.5% in win rate, 33.9% in safety score, and 28.4% in truthfulness. In addition, TrinityX reduces memory usage and inference latency by over 40% compared to prior MoE-based approaches. Ablation studies highlight the importance of calibrated routing, and cross-model evaluations confirm TrinityX’s generalization across diverse LLM backbones. Our code is available at: https://github.com/gskgautam/TrinityX
pdf
bib
abs
Cardiverse: Harnessing LLMs for Novel Card Game Prototyping
Danrui Li
|
Sen Zhang
|
Samuel S. Sohn
|
Kaidong Hu
|
Muhammad Usman
|
Mubbasir Kapadia
The prototyping of computer games, particularly card games, requires extensive human effort in creative ideation and gameplay evaluation. Recent advances in Large Language Models (LLMs) offer opportunities to automate and streamline these processes. However, it remains challenging for LLMs to design novel game mechanics beyond existing databases, generate consistent gameplay environments, and develop scalable gameplay AI for large-scale evaluations. This paper addresses these challenges by introducing a comprehensive automated card game prototyping framework. The approach highlights a graph-based indexing method for generating novel game variations, an LLM-driven system for consistent game code generation validated by gameplay records, and a method for constructing gameplay AI that uses an ensemble of LLM-generated action-value functions optimized through self-play. These contributions aim to accelerate card game prototyping, reduce human labor, and lower barriers to entry for game developers.
pdf
bib
abs
Assessing effective de-escalation of crisis conversations using transformer-based models and trend statistics
Ignacio J. Tripodi
|
Greg Buda
|
Margaret Meagher
|
Elizabeth A. Olson
One of the core goals of crisis counseling services is to support emotional de-escalation of the individual in crisis, by reducing intense negative emotional affect and emotional dysregulation. The science of crisis intervention has been impeded, however, by a lack of quantitative approaches that allow for detailed analysis of emotion in crisis conversations. In order to measure de-escalation at scale (millions of text-based conversations), lightweight models are needed that can assign not just binary sentiment predictions but quantitative scores to capture graded change in emotional valence. Accordingly, we developed a transformer-based emotional valence scoring model fit for crisis conversations, BERT-EV, that assigns numerical emotional valence scores to rate the intensity of expressed negative versus positive emotion. This transformer-based model can run on modest hardware configurations, allowing it to scale affordably and efficiently to a massive corpus of crisis conversations. We evaluated model performance on a corpus of hand-scored social media messages, and found that BERT-EV outperforms existing dictionary-based standard tools in the field, as well as other transformer-based implementations and an LLM in accurately matching scores from human annotators. Finally, we show that trends in these emotional valence scores can be used to assess emotional de-escalation during crisis conversations, with sufficient turn-by-turn granularity to help identify helpful vs. detrimental crisis counselor statements.
pdf
bib
abs
Measuring and Mitigating Media Outlet Name Bias in Large Language Models
Seong-Jin Park
|
Kang-Min Kim
Large language models (LLMs) have achieved remarkable performance across diverse natural language processing tasks, but concerns persist regarding their potential political biases. While prior research has extensively explored political biases in LLMs’ text generation and perception, limited attention has been devoted to biases associated with media outlet names. In this study, we systematically investigate the presence of media outlet name biases in LLMs and evaluate their impact on downstream tasks, such as political bias prediction and news summarization. Our findings demonstrate that LLMs consistently exhibit biases toward the known political leanings of media outlets, with variations across model families and scales. We propose a novel metric to quantify media outlet name biases in LLMs and leverage this metric to develop an automated prompt optimization framework. Our framework effectively mitigates media outlet name biases, offering a scalable approach to enhancing the fairness of LLMs in news-related applications.
pdf
bib
abs
The Good, the Bad, and the Debatable: A Survey on the Impacts of Data for In-Context Learning
Stephanie Schoch
|
Yangfeng Ji
In-context learning is an emergent learning paradigm that enables an LLM to learn an unseen task by seeing a number of demonstrations in the context window. The quality of the demonstrations is of paramount importance as 1) context window size limitations restrict the number of demonstrations that can be presented to the model, and 2) the model must identify the task and potentially learn new, unseen input-output mappings from the limited demonstration set. An increasing body of work has also shown the sensitivity of predictions to perturbations on the demonstration set. Given this importance, this work presents a survey on the current literature pertaining to the relationship between data and in-context learning. We present our survey in three parts: the “good” – qualities that are desirable when selecting demonstrations, the “bad” – qualities of demonstrations that can negatively impact the model, as well as issues that can arise in presenting demonstrations, and the “debatable” – qualities of demonstrations with mixed results or factors modulating data impacts.
pdf
bib
abs
Where Confabulation Lives: Latent Feature Discovery in LLMs
Thibaud Ardoin
|
Yi Cai
|
Gerhard Wunder
Hallucination remains a critical failure mode of large language models (LLMs), undermining their trustworthiness in real-world applications. In this work, we focus on confabulation, a foundational aspect of hallucination where the model fabricates facts about unknown entities. We introduce a targeted dataset designed to isolate and analyze this behavior across diverse prompt types. Using this dataset, and building on recent progress in interpreting LLM internals, we extract latent directions associated with confabulation using sparse projections. A simple vector-based steering method demonstrates that these directions can modulate model behavior with minimal disruption, shedding light on the inner representations that drive factual and non-factual output. Our findings contribute to a deeper mechanistic understanding of LLMs and pave the way toward more trustworthy and controllable generation. We release the code and dataset at https://github.com/Thibaud-Ardoin/where-confabulation-lives.
pdf
bib
abs
Analysing Chain of Thought Dynamics: Active Guidance or Unfaithful Post-hoc Rationalisation?
Samuel Lewis-Lim
|
Xingwei Tan
|
Zhixue Zhao
|
Nikolaos Aletras
Recent work has demonstrated that using chain of thought (CoT) on soft-reasoning problems, such as analytical and commonsense reasoning, often yields limited or even negative performance gains. CoT can also be unfaithful to the model’s actual reasoning. This paper investigates the dynamics and unfaithfulness of CoT in soft-reasoning tasks across instruction-tuned, reasoning, and reasoning-distilled models. Our findings show that distilled-reasoning models rely heavily on CoT for these tasks, while instruction-tuned and reasoning models often use it post-hoc. Additionally, we find that CoT can steer model predictions without faithfully reflecting reasoning, indicating a disconnect between CoT influence and faithfulness.
pdf
bib
abs
Playpen: An Environment for Exploring Learning From Dialogue Game Feedback
Nicola Horst
|
Davide Mazzaccara
|
Antonia Schmidt
|
Michael Sullivan
|
Filippo Momentè
|
Luca Franceschetti
|
Philipp Sadler
|
Sherzod Hakimov
|
Alberto Testoni
|
Raffaella Bernardi
|
Raquel Fernández
|
Alexander Koller
|
Oliver Lemon
|
David Schlangen
|
Mario Giulianelli
|
Alessandro Suglia
Interaction between learner and feedback-giver has come into focus recently for post-training of Large Language Models (LLMs), through the use of reward models that judge the appropriateness of a model’s response. In this paper, we investigate whether Dialogue Games—goal-directed and rule-governed activities driven predominantly by verbal actions—can also serve as a source of feedback signals for learning. We introduce Playpen, an environment for off- and online learning through Dialogue Game self-play, and investigate a representative set of post-training methods: supervised fine-tuning; direct alignment (DPO); and reinforcement learning with Group Relative Policy Optimization (GRPO). We experiment with post-training a small LLM (Llama-3.1-8B-Instruct), evaluating performance on unseen instances of training games as well as unseen games, and on standard benchmarks. We find that imitation learning through SFT improves performance on unseen instances, but negatively impacts other skills, while interactive learning with GRPO shows balanced improvements without loss of skills. We release the framework and the baseline training setups to foster research in this promising new direction of “learning in (synthetic) interaction”.
pdf
bib
abs
GenLink: Generation-Driven Schema-Linking via Multi-Model Learning for Text-to-SQL
Zhifeng Hao
|
Junqi Huang
|
Shaobin Shi
|
Ruichu Cai
|
Boyan Xu
Schema linking is widely recognized as a key factor in improving text-to-SQL performance. Supervised fine-tuning approaches enhance SQL generation quality by explicitly fine-tuning schema linking as an extraction task. However, they suffer from two major limitations: (i) The training corpus of small language models restricts their cross-domain generalization ability. (ii) The extraction-based fine-tuning process struggles to capture complex linking patterns. To address these issues, we propose GenLink, a generation-driven schema-linking framework based on multi-model learning. Instead of explicitly extracting schema elements, GenLink enhances linking through a generation-based learning process, effectively capturing implicit schema relationships. By integrating multiple small language models, GenLink improves schema-linking recall rate and ensures robust cross-domain adaptability. Experimental results on the BIRD and Spider benchmarks validate the effectiveness of GenLink, achieving execution accuracies of 67.34% (BIRD), 89.7% (Spider development set), and 87.8% (Spider test set), demonstrating its superiority in handling diverse and complex database schemas.
pdf
bib
abs
TSVer: A Benchmark for Fact Verification Against Time-Series Evidence
Marek Strong
|
Andreas Vlachos
Reasoning over temporal and numerical data, such as time series, is a crucial aspect of fact-checking. While many systems have recently been developed to handle this form of evidence, their evaluation remains limited by existing datasets, which often lack structured evidence, provide insufficient justifications for verdicts, or rely on synthetic claims. In this paper, we introduce TSVer, a new benchmark dataset for fact verification focusing on temporal and numerical reasoning with time-series evidence. TSVer contains 287 real-world claims sourced from 38 fact-checking organizations and a curated database of 400 time series covering diverse domains. Each claim is annotated with time frames across all pertinent time series, along with a verdict and justifications reflecting how the evidence is used to reach the verdict. Using an LLM-assisted multi-step annotation process, we improve the quality of our annotations and achieve an inter-annotator agreement of 𝜅 = 0.745 on verdicts. We also develop a baseline for verifying claims against time-series evidence and show that even state-of-the-art reasoning models like Gemini-2.5-Pro are challenged by time series, achieving a 63.37 accuracy score on verdicts and an Ev2R score of 48.63 on verdict justifications.
pdf
bib
abs
Cross-MoE: An Efficient Temporal Prediction Framework Integrating Textual Modality
Ruizheng Huang
|
Zhicheng Zhang
|
Yong Wang
It has been demonstrated that incorporating external information as textual modality can effectively improve time series forecasting accuracy. However, current multi-modal models ignore the dynamic and different relations between time series patterns and textual features, which leads to poor performance in temporal-textual feature fusion. In this paper, we propose a lightweight and model-agnostic temporal-textual fusion framework named Cross-MoE. It replaces Cross Attention with Cross-Ranker to reduce computational complexity, and enhances modality-aware correlation memorization with Mixture-of-Experts (MoE) networks to tolerate the distributional shifts in time series. The experimental results demonstrate an 8.78% average reduction in Mean Squared Error (MSE) compared to the SOTA multi-modal time series framework. Notably, our method requires only 75% of the computational overhead and 12.5% of the activated parameters compared with the Cross Attention mechanism. Our codes are available at
https://github.com/Kilosigh/Cross-MoE.git
pdf
bib
abs
Sparse Autoencoder Features for Classifications and Transferability
Jack Gallifant
|
Shan Chen
|
Kuleen Sasse
|
Hugo Aerts
|
Thomas Hartvigsen
|
Danielle Bitterman
Sparse Autoencoders (SAEs) provide potential for uncovering structured, human-interpretable representations in Large Language Models (LLMs), making them a crucial tool for transparent and controllable AI systems. We systematically analyze SAEs for interpretable feature extraction from LLMs in safety-critical classification tasks. Our framework evaluates (1) model-layer selection and scaling properties, (2) SAE architectural configurations, including width and pooling strategies, and (3) the effect of binarizing continuous SAE activations. SAE-derived features achieve macro F1 > 0.8, outperforming hidden-state and BoW baselines while demonstrating cross-model transfer from Gemma 2 2B to 9B-IT models. These features generalize in a zero-shot manner to cross-lingual toxicity detection and visual classification tasks. Our analysis highlights the significant impact of pooling strategies and binarization thresholds, showing that binarization offers an efficient alternative to traditional feature selection while maintaining or improving performance. These findings establish new best practices for SAE-based interpretability and enable scalable, transparent deployment of LLMs in real-world applications.
pdf
bib
abs
KGE Calibrator: An Efficient Probability Calibration Method of Knowledge Graph Embedding Models for Trustworthy Link Prediction
Yang Yang
|
Mohan Timilsina
|
Edward Curry
Knowledge graph embedding (KGE) models are designed for the task of link prediction, which aims to infer missing triples by learning representations for entities and relations. While KGE models excel at ranking-based link prediction, the critical issue of probability calibration has been largely overlooked, resulting in uncalibrated estimates that limit their adoption in high-stakes domains where trustworthy predictions are essential. Addressing this is challenging, as we demonstrate that existing calibration methods are ill-suited to KGEs, often significantly degrading the essential ranking performance they are meant to support. To overcome this, we introduce the KGE Calibrator (KGEC), the first probability calibration method tailored for KGE models to enhance the trustworthiness of their predictions. KGEC integrates three key techniques: a Jump Selection Strategy that improves efficiency by selecting the most informative instances while filtering out less significant ones; Multi-Binning Scaling, which models different confidence levels separately to increase capacity and flexibility; and a Wasserstein distance-based calibration loss that further boosts calibration performance. Extensive experiments across multiple datasets demonstrate that KGEC consistently outperforms existing calibration methods in terms of both effectiveness and efficiency, making it a promising solution for calibration in KGE models.
pdf
bib
abs
LCES: Zero-shot Automated Essay Scoring via Pairwise Comparisons Using Large Language Models
Takumi Shibata
|
Yuichi Miyamura
Recent advances in large language models (LLMs) have enabled zero-shot automated essay scoring (AES), providing a promising way to reduce the cost and effort of essay scoring in comparison with manual grading. However, most existing zero-shot approaches rely on LLMs to directly generate absolute scores, which often diverge from human evaluations owing to model biases and inconsistent scoring. To address these limitations, we propose LLM-based Comparative Essay Scoring (LCES), a method that formulates AES as a pairwise comparison task. Specifically, we instruct LLMs to judge which of two essays is better, collect many such comparisons, and convert them into continuous scores. Considering that the number of possible comparisons grows quadratically with the number of essays, we improve scalability by employing RankNet to efficiently transform LLM preferences into scalar scores. Experiments using AES benchmark datasets show that LCES outperforms conventional zero-shot methods in accuracy while maintaining computational efficiency. Moreover, LCES is robust across different LLM backbones, highlighting its applicability to real-world zero-shot AES.
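A minimal sketch of turning pairwise verdicts into continuous scores: the paper employs RankNet, but with one free score per essay the pairwise logistic loss reduces to the Bradley-Terry form shown below; the comparisons and hyperparameters here are purely illustrative.

```python
import numpy as np

def fit_scores(n_essays, comparisons, lr=0.1, epochs=200):
    """Fit one scalar score per essay from pairwise wins, using the logistic
    (RankNet/Bradley-Terry style) objective: P(i beats j) = sigmoid(s_i - s_j)."""
    scores = np.zeros(n_essays)
    for _ in range(epochs):
        grad = np.zeros(n_essays)
        for winner, loser in comparisons:
            # p = 1 - P(winner beats loser); gradient of the log-likelihood.
            p = 1.0 / (1.0 + np.exp(scores[winner] - scores[loser]))
            grad[winner] += p
            grad[loser] -= p
        scores += lr * grad / len(comparisons)   # gradient ascent on the log-likelihood
    return scores

# Toy usage: essay 2 wins most LLM comparisons, essay 0 loses most.
comparisons = [(2, 0), (2, 1), (1, 0), (2, 0), (1, 0)]
print(fit_scores(3, comparisons).round(2))
```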
pdf
bib
abs
The Arabic Generality Score: Another Dimension of Modeling Arabic Dialectness
Sanad Sha’ban
|
Nizar Habash
Arabic dialects form a diverse continuum, yet NLP models often treat them as discrete categories. Recent work addresses this issue by modeling dialectness as a continuous variable, notably through the Arabic Level of Dialectness (ALDi). However, ALDi reduces complex variation to a single dimension. We propose a complementary measure: the Arabic Generality Score (AGS), which quantifies how widely a word is used across dialects. We introduce a pipeline that combines word alignment, etymology-aware edit distance, and smoothing to annotate a parallel corpus with word-level AGS. A regression model is then trained to predict AGS in context. Our approach outperforms strong baselines, including state-of-the-art dialect ID systems, on a multi-dialect benchmark. AGS offers a scalable, linguistically grounded way to model lexical generality, enriching representations of Arabic dialectness. Code is publicly available at https://github.com/CAMeL-Lab/arabic-generality-score.
pdf
bib
abs
Lemmatization as a Classification Task: Results from Arabic across Multiple Genres
Mostafa Saeed
|
Nizar Habash
Lemmatization is crucial for NLP tasks in morphologically rich languages with ambiguous orthography like Arabic, but existing tools face challenges due to inconsistent standards and limited genre coverage. This paper introduces two novel approaches that frame lemmatization as classification into a Lemma-POS-Gloss (LPG) tagset, leveraging machine translation and semantic clustering. We also present a new Arabic lemmatization test set covering diverse genres, standardized alongside existing datasets. We evaluate character-level sequence-to-sequence models, which perform competitively and offer complementary value, but are limited to lemma prediction (not LPG) and prone to hallucinating implausible forms. Our results show that classification and clustering yield more robust, interpretable outputs, setting new benchmarks for Arabic lemmatization.
pdf
bib
abs
A Comprehensive Framework to Operationalize Social Stereotypes for Responsible AI Evaluations
Aida Mostafazadeh Davani
|
Sunipa Dev
|
Héctor Pérez-Urbina
|
Vinodkumar Prabhakaran
Societal stereotypes are at the center of a myriad of responsible AI interventions targeted at reducing the generation and propagation of potentially harmful outcomes. While these efforts are much needed, they tend to be fragmented and often address different parts of the issue without adopting a unified or holistic approach to social stereotypes and how they impact various parts of the machine learning pipeline. As a result, current interventions fail to capitalize on the underlying mechanisms that are common across different types of stereotypes, and to anchor on particular aspects that are relevant in certain cases. In this paper, we draw on social psychological research and build on NLP data and methods, to propose a unified framework to operationalize stereotypes in generative AI evaluations. Our framework identifies key components of stereotypes that are crucial in AI evaluation, including the target group, associated attribute, relationship characteristics, perceiving group, and context. We also provide considerations and recommendations for its responsible use.
pdf
bib
abs
Correct-Detect: Balancing Performance and Ambiguity Through the Lens of Coreference Resolution in LLMs
Amber Shore
|
Russell Scheinberg
|
Ameeta Agrawal
|
So Young Lee
Large Language Models (LLMs) are intended to reflect human linguistic competencies. But humans have access to a broad and embodied context, which is key in detecting and resolving linguistic ambiguities, even in isolated text spans. A foundational case of semantic ambiguity is found in the task of coreference resolution: how is a pronoun related to an earlier person mention? This capability is implicit in nearly every downstream task, and the presence of ambiguity at this level can alter performance significantly. We show that LLMs can achieve good performance with minimal prompting in both coreference disambiguation and the detection of ambiguity in coreference, however, they cannot do both at the same time. We present the CORRECT-DETECT trade-off: though models have both capabilities and deploy them implicitly, successful performance balancing these two abilities remains elusive.
pdf
bib
abs
GRAID: Synthetic Data Generation with Geometric Constraints and Multi-Agentic Reflection for Harmful Content Detection
Melissa Kazemi Rad
|
Alberto Purpura
|
Himanshu Kumar
|
Emily Chen
|
Mohammad Shahed Sorower
We address the problem of data scarcity in harmful text classification for guardrailing applications and introduce GRAID (Geometric and Reflective AI-Driven Data Augmentation), a novel pipeline that leverages Large Language Models (LLMs) for dataset augmentation. GRAID consists of two stages: (i) generation of geometrically controlled examples using a constrained LLM, and (ii) augmentation through a multi-agentic reflective process that promotes stylistic diversity and uncovers edge cases. This combination enables both reliable coverage of the input space and nuanced exploration of harmful content. Using two benchmark data sets, we demonstrate that augmenting a harmful text classification dataset with GRAID leads to significant improvements in downstream guardrail model performance.
pdf
bib
abs
LaMDAgent: An Autonomous Framework for Post-Training Pipeline Optimization via LLM Agents
Taro Yano
|
Yoichi Ishibashi
|
Masafumi Oyamada
Large Language Models (LLMs) excel across diverse tasks, with post-training methods like Supervised Fine-Tuning (SFT), Preference Learning, and Model Merging enabling effective domain and task adaptation. Outcomes can vary with data orderings or component combinations, yet manual pipeline optimization is costly and labor-intensive. Existing approaches typically rely on manual design or focus narrowly on optimizing individual components, such as data ordering or merging parameters. We propose LaMDAgent, an LLM Agent-driven framework that autonomously constructs and optimizes end-to-end post-training pipelines by exploring various model-improving methods, objects, and their applied orderings based on task-based feedback. LaMDAgent achieves a 9.0-point gain in tool-use accuracy without degrading instruction-following, and identifies high-performing strategies overlooked by manual design. We further analyze the impact of data and model scaling to reduce the computational cost of exploration, finding that scaling model size introduces new challenges, whereas scaling data size enables cost-effective pipeline discovery.
pdf
bib
abs
Finetuning LLMs for Human Behavior Prediction in Social Science Experiments
Akaash Kolluri
|
Shengguang Wu
|
Joon Sung Park
|
Michael S. Bernstein
Large language models (LLMs) offer a powerful opportunity to simulate the results of social science experiments. In this work, we demonstrate that finetuning LLMs directly on individual-level responses from past experiments meaningfully improves the accuracy of such simulations. Via an automatic pipeline, we construct SocSci210, a dataset comprising 2.9 million responses from 400,491 participants in 210 open-source social science experiments. Through finetuning, we achieve multiple levels of generalization. In completely unseen studies, our strongest model, Socrates-Qwen-14B, produces predictions that are 36% more aligned with distributions of human responses to diverse outcome questions under varying conditions relative to its base model (Qwen2.5-14B), outperforming GPT-4o by 15%. By finetuning on a subset of conditions in a study, generalization to new unseen conditions is particularly robust, improving by 71%. Since SocSci210 contains rich demographic information, we reduce demographic parity difference, a measure of bias, by 10.6% through finetuning. Because social sciences routinely generate rich, topic-specific datasets, our findings indicate that finetuning on such data could enable more accurate simulations for experimental hypothesis screening. We release our data, models and finetuning code.
pdf
bib
abs
How Private are Language Models in Abstractive Summarization?
Anthony Hughes
|
Nikolaos Aletras
|
Ning Ma
In sensitive domains such as medical and legal, protecting sensitive information is critical, with protective laws strictly prohibiting the disclosure of personal data. This poses challenges for sharing valuable data such as medical reports and legal cases summaries. While language models (LMs) have shown strong performance in text summarization, it is still an open question to what extent they can provide privacy-preserving summaries from non-private source documents. In this paper, we perform a comprehensive study of privacy risks in LM-based summarization across two closed- and four open-weight models of different sizes and families. We experiment with both prompting and fine-tuning strategies for privacy-preservation across a range of summarization datasets including medical and legal domains. Our quantitative and qualitative analysis, including human evaluation, shows that LMs frequently leak personally identifiable information in their summaries, in contrast to human-generated privacy-preserving summaries, which demonstrate significantly higher privacy protection levels. These findings highlight a substantial gap between current LM capabilities and expert human performance in privacy-sensitive summarization tasks.
pdf
bib
abs
Expectation Preference Optimization: Reliable Preference Estimation for Improving the Reasoning Capability of Large Language Models
Zelin Li
|
Dawei Song
Pairwise preference optimization, such as Direct Preference Optimization (DPO), was originally designed to align large language models (LLMs) with human values. It has recently been used to improve the supervised fine-tuning (SFT) performance of LLMs. Using pairs of single samples, DPO estimates the probability distribution of the preferences of picking one response over another. However, in tasks that involve more complicated preferences (e.g., reasoning tasks) than those in the human value alignment task, this sampling method is likely to bring deviations from the ground-truth distribution. To solve the problem, extra efforts (e.g., external annotations or amendment of the loss function) are often required. In this paper, we hypothesise that the preferences can be better estimated through a multi-sampling process. Accordingly, we propose an Expectation Preference Optimization (EPO) algorithm that takes pairs of sample groups, instead of pairs of single samples as in DPO, for preference learning. Compared to pairwise DPO, the proposed EPO tends to produce more reliable preference estimations. Applying different preference optimization methods in a self-training paradigm, we have conducted extensive experiments on various reasoning benchmarks. The results show that our EPO approach outperforms a range of baseline approaches in terms of zero-shot accuracy on all benchmarks.
pdf
bib
abs
Split-Merge: Scalable and Memory-Efficient Merging of Expert LLMs
Sruthi Gorantla
|
Aditya Rawal
|
Devamanyu Hazarika
|
Kaixiang Lin
|
Mingyi Hong
|
Mahdi Namazifar
We introduce a zero-shot merging framework for large language models (LLMs) that consolidates specialized domain experts into a single model without any further training. Our core contribution lies in leveraging relative task vectors—difference representations encoding each expert’s unique traits with respect to a shared base model—to guide a principled and efficient merging process. By dissecting parameters into common dimensions (averaged across experts) and complementary dimensions (unique to each expert), we strike an optimal balance between generalization and specialization. We further devise a compression mechanism for the complementary parameters, retaining only principal components and scalar multipliers per expert, thereby minimizing overhead. A dynamic router then selects the most relevant domain at inference, ensuring that domain-specific precision is preserved. Experiments on code generation, mathematical reasoning, medical question answering, and instruction-following benchmarks confirm the versatility and effectiveness of our approach. Altogether, this framework enables truly adaptive and scalable LLMs that seamlessly integrate specialized knowledge for improved zero-shot performance.
pdf
bib
abs
Model Consistency as a Cheap yet Predictive Proxy for LLM Elo Scores
Ashwin Ramaswamy
|
Nestor Demeure
|
Ermal Rrapaj
New large language models (LLMs) are being released every day. Some perform significantly better or worse than expected given their parameter count. Therefore, there is a need for a method to independently evaluate models. The current best way to evaluate a model is to measure its Elo score by comparing it to other models in a series of contests—an expensive operation since humans are ideally required to compare LLM outputs. We observe that when an LLM is asked to judge such contests, the consistency with which it selects a model as the best in a matchup produces a metric that is 91% correlated with its own human-produced Elo score. This provides a simple proxy for Elo scores that can be computed cheaply, without any human data or prior knowledge.
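A rough sketch of how such a consistency metric might be computed, assuming repeated judge verdicts per matchup are available; the aggregation rule, toy verdicts, and Elo values below are illustrative assumptions, not the authors' exact protocol.

```python
import numpy as np

def judge_consistency(verdicts_per_matchup):
    """Average, over matchups, of how often a judge model repeats its own
    majority verdict when the same matchup is judged several times."""
    rates = []
    for verdicts in verdicts_per_matchup:
        counts = {}
        for v in verdicts:
            counts[v] = counts.get(v, 0) + 1
        rates.append(max(counts.values()) / len(verdicts))
    return float(np.mean(rates))

# Toy usage: one consistency value per candidate model acting as judge,
# correlated against illustrative Elo scores for those same models.
judge_verdicts = [
    [["A", "A", "A"], ["B", "B", "B"]],   # highly consistent judge
    [["A", "A", "B"], ["B", "B", "B"]],   # fairly consistent judge
    [["A", "B", "A"], ["B", "A", "B"]],   # inconsistent judge
]
consistency = [judge_consistency(v) for v in judge_verdicts]
elo = [1380.0, 1250.0, 1020.0]            # hypothetical Elo values only
print(np.corrcoef(consistency, elo)[0, 1])
```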
pdf
bib
abs
Plutus: Benchmarking Large Language Models in Low-Resource Greek Finance
Xueqing Peng
|
Triantafillos Papadopoulos
|
Efstathia Soufleri
|
Polydoros Giannouris
|
Ruoyu Xiang
|
Yan Wang
|
Lingfei Qian
|
Jimin Huang
|
Qianqian Xie
|
Sophia Ananiadou
Despite Greece’s pivotal role in the global economy, large language models (LLMs) remain underexplored in the Greek financial context due to the linguistic complexity of Greek and the scarcity of domain-specific datasets. While multilingual financial NLP has revealed large performance gaps across languages, no benchmarks or LLMs have been tailored for Greek financial tasks until now. To bridge this gap, we introduce Plutus-ben, the first Greek Financial Evaluation Benchmark, and Plutus-8B, the first financial LLM fine-tuned on Greek-specific financial data. Plutus-ben addresses six core tasks: numeric/textual named entity recognition, question answering, extractive summarization, abstractive summarization, and topic classification. To support these tasks, we release four new expert-annotated Greek financial datasets and incorporate two existing resources. Our comprehensive evaluation of 24 LLMs reveals persistent challenges in Greek financial NLP, driven by linguistic complexity, domain terminology, and financial reasoning gaps. Experiment results underscore the limitations of cross-lingual transfer and the need for Greek-specific financial modeling. We publicly release Plutus-ben, Plutus-8B, and all associated datasets to promote reproducible research and advance multilingual financial NLP.
pdf
bib
abs
TaxoAlign: Scholarly Taxonomy Generation Using Language Models
Avishek Lahiri
|
Yufang Hou
|
Debarshi Kumar Sanyal
Taxonomies play a crucial role in helping researchers structure and navigate knowledge in a hierarchical manner. They also form an important part in the creation of comprehensive literature surveys. The existing approaches to automatic survey generation do not compare the structure of the generated surveys with those written by human experts. To address this gap, we present our own method for automated taxonomy creation that can bridge the gap between human-generated and automatically-created taxonomies. For this purpose, we create the CS-TaxoBench benchmark which consists of 460 taxonomies that have been extracted from human-written survey papers. We also include an additional test set of 80 taxonomies curated from conference survey papers. We propose TaxoAlign, a three-phase topic-based instruction-guided method for scholarly taxonomy generation. Additionally, we propose a stringent automated evaluation framework that measures the structural alignment and semantic coherence of automatically generated taxonomies in comparison to those created by human experts. We evaluate our method and various baselines on CS-TaxoBench, using both automated evaluation metrics and human evaluation studies. The results show that TaxoAlign consistently surpasses the baselines on nearly all metrics. The code and data can be found at https://github.com/AvishekLahiri/TaxoAlign.
pdf
bib
abs
DiNaM: Disinformation Narrative Mining with Large Language Models
Witold Sosnowski
|
Arkadiusz Modzelewski
|
Kinga Skorupska
|
Adam Wierzbicki
Disinformation poses a significant threat to democratic societies, public health, and national security. To address this challenge, fact-checking experts analyze and track disinformation narratives. However, the process of manually identifying these narratives is highly time-consuming and resource-intensive. In this article, we introduce DiNaM, the first algorithm and structured framework specifically designed for mining disinformation narratives. DiNaM uses a multi-step approach to uncover disinformation narratives. It first leverages Large Language Models (LLMs) to detect false information, then applies clustering techniques to identify underlying disinformation narratives. We evaluated DiNaM’s performance using ground-truth disinformation narratives from the EUDisinfoTest dataset. The evaluation employed the Weighted Chamfer Distance (WCD), which measures the similarity between two sets of embeddings: the ground truth and the predicted disinformation narratives. DiNaM achieved a state-of-the-art WCD score of 0.73, outperforming general-purpose narrative mining methods by a notable margin of 16.4–24.7%. We are releasing DiNaM’s codebase and the dataset to the public.
pdf
bib
abs
VeriLocc: End-to-End Cross-Architecture Register Allocation via LLM
Lesheng Jin
|
Zhenyuan Ruan
|
Haohui Mai
|
Jingbo Shang
Modern GPUs evolve rapidly, yet production compilers still rely on hand-crafted register allocation heuristics that require substantial re-tuning for each hardware generation. We introduce VeriLocc, a framework that combines large language models (LLMs) with formal compiler techniques to enable generalizable and verifiable register allocation across GPU architectures. VeriLocc fine-tunes an LLM to translate intermediate representations (MIRs) into target-specific register assignments, aided by static analysis for cross-architecture normalization and generalization and a verifier-guided regeneration loop to ensure correctness. Evaluated on matrix multiplication (GEMM) and multi-head attention (MHA), VeriLocc achieves 85–99% single-shot accuracy and near-100% pass@100. Case study shows that VeriLocc discovers more performant assignments than expert-tuned libraries, outperforming rocBLAS by over 10% in runtime.
pdf
bib
abs
MemeIntel: Explainable Detection of Propagandistic and Hateful Memes
Mohamed Bayan Kmainasi
|
Abul Hasnat
|
Md Arid Hasan
|
Ali Ezzat Shahroor
|
Firoj Alam
The proliferation of multimodal content on social media presents significant challenges in understanding and moderating complex, context-dependent issues such as misinformation, hate speech, and propaganda. While efforts have been made to develop resources and propose new methods for automatic detection, limited attention has been given to label detection and the generation of explanation-based rationales for predicted labels. To address this challenge, we introduce MemeXplain, an explanation-enhanced dataset for propaganda memes in Arabic and hateful memes in English, making it the first large-scale resource for these tasks. To solve these tasks, we propose a novel multi-stage optimization approach and train Vision-Language Models (VLMs). Our results demonstrate that this approach significantly improves performance over the base model for both label detection and explanation generation, outperforming the current state-of-the-art with an absolute improvement of approximately 3% on ArMeme and 7% on Hateful Memes. For reproducibility and future research, we aim to make the MemeXplain dataset and scripts publicly available.
pdf
bib
abs
FLUID QA: A Multilingual Benchmark for Figurative Language Usage in Dialogue across English, Chinese, and Korean
Seoyoon Park
|
Hyeji Choi
|
Minseon Kim
|
Subin An
|
Xiaonan Wang
|
Gyuri Choi
|
Hansaem Kim
Figurative language conveys stance, emotion, and social nuance, making its appropriate use essential in dialogue. While large language models (LLMs) often succeed in recognizing figurative expressions at the sentence level, their ability to use them coherently in conversation remains uncertain. We introduce FLUID QA, the first multilingual benchmark that evaluates figurative usage in dialogue across English, Korean, and Chinese. Each item embeds figurative choices into multi-turn contexts. To support interpretation, we include FLUTE-bi, a sentence-level diagnostic task. Results reveal a persistent gap: models that perform well on FLUTE-bi frequently fail on FLUID QA, especially in sarcasm and metaphor. These errors reflect systematic rhetorical confusion and limited discourse reasoning. FLUID QA provides a scalable framework for assessing usage-level figurative competence across languages.
pdf
bib
abs
Structured Moral Reasoning in Language Models: A Value-Grounded Evaluation Framework
Mohna Chakraborty
|
Lu Wang
|
David Jurgens
Large language models (LLMs) are increasingly deployed in domains requiring moral understanding, yet their reasoning often remains shallow, and misaligned with human reasoning. Unlike humans, whose moral reasoning integrates contextual trade-offs, value systems, and ethical theories, LLMs often rely on surface patterns, leading to biased decisions in morally and ethically complex scenarios. To address this gap, we present a value-grounded framework for evaluating and distilling structured moral reasoning in LLMs. We benchmark 12 open-source models across four moral datasets using a taxonomy of prompts grounded in value systems, ethical theories, and cognitive reasoning strategies. Our evaluation is guided by four questions: (1) Does reasoning improve LLM decision-making over direct prompting? (2) Which types of value/ethical frameworks most effectively guide LLM reasoning? (3) Which cognitive reasoning strategies lead to better moral performance? (4) Can small-sized LLMs acquire moral competence through distillation? We find that prompting with explicit moral structure consistently improves accuracy and coherence, with first-principles reasoning and Schwartz’s + care-ethics scaffolds yielding the strongest gains. Furthermore, our supervised distillation approach transfers moral competence from large to small models without additional inference cost. Together, our results offer a scalable path toward interpretable and value-grounded models.
pdf
bib
abs
VerIF: Verification Engineering for Reinforcement Learning in Instruction Following
Hao Peng
|
Yunjia Qi
|
Xiaozhi Wang
|
Bin Xu
|
Lei Hou
|
Juanzi Li
Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing large language models (LLMs), with verification engineering playing a central role. However, best practices for RL in instruction following remain underexplored. In this work, we explore the verification challenge in RL for instruction following and propose VerIF, a verification method that combines rule-based code verification with LLM-based verification from a large reasoning model (e.g., QwQ-32B). To support this approach, we construct a high-quality instruction-following dataset, VerInstruct, containing approximately 22,000 instances with associated verification signals. We apply RL training with VerIF to two models, achieving significant improvements across several representative instruction-following benchmarks. The trained models reach state-of-the-art performance among models of comparable size and generalize well to unseen constraints. We further observe that their general capabilities remain unaffected, suggesting that RL with VerIF can be integrated into existing RL recipes to enhance overall model performance. We will release our datasets, codes, and models to facilitate future research.
pdf
bib
abs
UNCLE: Benchmarking Uncertainty Expressions in Long-Form Generation
Ruihan Yang
|
Caiqi Zhang
|
Zhisong Zhang
|
Xinting Huang
|
Dong Yu
|
Nigel Collier
|
Deqing Yang
Large Language Models (LLMs) are prone to hallucination, particularly in long-form generations. A promising direction to mitigate hallucination is to teach LLMs to express uncertainty explicitly when they lack sufficient knowledge. However, existing work lacks direct and fair evaluation of LLMs’ ability to express uncertainty effectively in long-form generation. To address this gap, we first introduce UNCLE, a benchmark designed to evaluate uncertainty expression in both long- and short-form question answering (QA). UNCLE covers five domains and includes more than 1,000 entities, each with paired short- and long-form QA items. Our dataset is the first to directly link short- and long-form QA through aligned questions and gold-standard answers. Along with UNCLE, we propose a suite of new metrics to assess the models’ capabilities to selectively express uncertainty. We then demonstrate that current models fail to convey uncertainty appropriately in long-form generation. We further explore both prompt-based and training-based methods to improve models’ performance, with the training-based methods yielding greater gains. Further analysis of alignment gaps between short- and long-form uncertainty expression highlights promising directions for future research using UNCLE.
pdf
bib
abs
Enhancing Study-Level Inference from Clinical Trial Papers via Reinforcement Learning-Based Numeric Reasoning
Massimiliano Pronesti
|
Michela Lorandi
|
Paul Flanagan
|
Oisín Redmond
|
Anya Belz
|
Yufang Hou
Systematic reviews in medicine play a critical role in evidence-based decision-making by aggregating findings from multiple studies. A central bottleneck in automating this process is extracting numeric evidence and determining study-level conclusions for specific outcomes and comparisons. Prior work has framed this problem as a textual inference task by retrieving relevant content fragments and inferring conclusions from them. However, such approaches often rely on shallow textual cues and fail to capture the underlying numeric reasoning behind expert assessments. In this work, we conceptualise the problem as one of quantitative reasoning. Rather than inferring conclusions from surface text, we extract structured numerical evidence (e.g., event counts or standard deviations) and apply domain-knowledge-informed logic to derive outcome-specific conclusions. We develop a numeric reasoning system composed of a numeric data extraction model and an effect estimate component, enabling more accurate and interpretable inference aligned with domain expert principles. We train the numeric data extraction model using different strategies, including supervised fine-tuning (SFT) and reinforcement learning (RL) with a new value reward model. When evaluated on the CochraneForest benchmark, our best-performing approach – using RL to train a small-scale number extraction model – yields up to a 21% absolute improvement in F1 score over retrieval-based systems and outperforms general-purpose LLMs of over 400B parameters by up to 9%. Our results demonstrate the promise of reasoning-driven approaches for automating systematic evidence synthesis.
pdf
bib
abs
Context-aware Biases for Length Extrapolation
Ali Veisi
|
Hamidreza Amirzadeh
|
Amir M. Mansourian
Transformers often struggle to generalize to longer sequences than those seen during training - a limitation known as length extrapolation. Most existing Relative Positional Encoding (RPE) methods attempt to address this by introducing either fixed linear biases or globally learned biases, which lack the capacity to adapt to different input contexts. In this work, we propose an additive RPE, Context-Aware Biases for Length Extrapolation (CABLE), a method that learns token-specific, context-aware biases for each attention head in transformers. By dynamically adjusting positional biases based on the input sequence, CABLE overcomes the rigidity of fixed RPEs. When evaluated on sequences longer than originally trained with, GPT-2 Medium (334M parameters) with CABLE achieves lower perplexity than counterparts using other widely adopted positional encoding methods. Additionally, by applying CABLE to the BERT base model we improved performance in long-context retrieval tasks. Our method significantly enhances the extrapolation performance of existing RPE methods tested on the FineWeb-Edu-10B and WikiText-103 datasets. Our code is available at: https://github.com/AlgonetLabs/Cable.
pdf
bib
abs
AutoSDT: Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists
Yifei Li
|
Hanane Nour Moussa
|
Ziru Chen
|
Shijie Chen
|
Botao Yu
|
Mingyi Xue
|
Benjamin Burns
|
Tzu-Yao Chiu
|
Vishal Dey
|
Zitong Lu
|
Chen Wei
|
Qianheng Zhang
|
Tianyu Zhang
|
Song Gao
|
Xuhui Huang
|
Xia Ning
|
Nesreen K. Ahmed
|
Ali Payani
|
Huan Sun
Despite long-standing efforts in accelerating scientific discovery with AI, building AI co-scientists remains challenging due to limited high-quality data for training and evaluation. To tackle this data scarcity issue, we present AutoSDT, an automatic pipeline that collects high-quality coding tasks in real-world data-driven discovery workflows. AutoSDT leverages the coding capabilities and parametric knowledge of LLMs to search for diverse sources, select ecologically valid tasks, and synthesize accurate task instructions and code solutions. Using our pipeline, we construct AutoSDT-5K, a dataset of 5,404 coding tasks for data-driven discovery that covers four scientific disciplines and 756 unique Python packages. To the best of our knowledge, AutoSDT-5K is the only automatically collected and the largest open dataset for data-driven scientific discovery. Expert feedback on a subset of 256 tasks shows the effectiveness of AutoSDT: 93% of the collected tasks are ecologically valid, and 92.2% of the synthesized programs are functionally correct. Trained on AutoSDT-5K, the Qwen2.5-Coder-Instruct LLM series, dubbed AutoSDT-Coder, shows substantial improvement on two challenging data-driven discovery benchmarks, ScienceAgentBench and DiscoveryBench. Most notably, AutoSDT-Coder-32B reaches the same level of performance as GPT-4o on ScienceAgentBench with a success rate of 7.8%, doubling the performance of its base model. On DiscoveryBench, it lifts the hypothesis matching score to 8.1, bringing a 17.4% relative improvement and closing the gap between open-weight models and GPT-4o.
pdf
bib
abs
Finding your MUSE: Mining Unexpected Solutions Engine
Nir Sweed
|
Hanit Hakim
|
Ben Wolfson
|
Hila Lifshitz
|
Dafna Shahaf
Innovators often exhibit cognitive fixation on existing solutions or nascent ideas, hindering the exploration of novel alternatives. This paper introduces a methodology for constructing Functional Concept Graphs (FCGs), interconnected representations of functional elements that support abstraction, problem reframing, and analogical inspiration. Our approach yields large-scale, high-quality FCGs with explicit abstraction relations, overcoming limitations of prior work. We further present MUSE, an algorithm leveraging FCGs to generate creative inspirations for a given problem: its inspiration graph, whose problem and solution nodes are extracted from 500K patent descriptions, is sampled to provide users with inspirations that enhance their creative problem solving. We demonstrate our method by computing an FCG on these 500K patents, which we release for further research. A user study indicates that participants exposed to MUSE’s inspirations generated more creative ideas, both in terms of absolute number (up to 19% increase over participants not given inspirations) and ratio (75%, compared to 49% for no inspirations).
pdf
bib
abs
Quantized but Deceptive? A Multi-Dimensional Truthfulness Evaluation of Quantized LLMs
Yao Fu
|
Xianxuan Long
|
Runchao Li
|
Haotian Yu
|
Mu Sheng
|
Xiaotian Han
|
Yu Yin
|
Pan Li
Quantization enables efficient deployment of large language models (LLMs) in resource-constrained environments by significantly reducing memory and computation costs. While quantized LLMs often maintain performance on perplexity and zero-shot tasks, their impact on truthfulness—whether generating truthful or deceptive responses—remains largely unexplored. In this work, we introduce TruthfulnessEval, a comprehensive evaluation framework for assessing the truthfulness of quantized LLMs across three dimensions: (1) Truthfulness on Logical Reasoning; (2) Truthfulness on Common Sense; and (3) Truthfulness on Imitative Falsehoods. Using this framework, we examine mainstream quantization techniques (ranging from 4-bit to extreme 2-bit) across several open-source LLMs. Surprisingly, we find that while quantized models retain internally truthful representations, they are more susceptible to producing false outputs under misleading prompts. To probe this vulnerability, we test 15 rephrased variants of “honest”, “neutral” and “deceptive” prompts and observe that “deceptive” prompts can override truth-consistent behavior, whereas “honest” and “neutral” prompts maintain stable outputs. Further, via layer-wise probing and PCA visualizations, we reveal that quantized models “know” the truth internally yet still produce false outputs when guided by “deceptive” prompts. Our findings provide insights into future designs of quantization-aware alignment and truthfulness interventions.
pdf
bib
abs
Leveraging Knowledge Graph-Enhanced LLMs for Context-Aware Medical Consultation
Su-Hyeong Park
|
Ho-Beom Kim
|
Seong-Jin Park
|
Dinara Aliyeva
|
Kang-Min Kim
Recent advancements in large language models have significantly influenced the field of online medical consultations. However, critical challenges remain, such as the generation of hallucinated information and the integration of up-to-date medical knowledge. To address these issues, we propose **I**nformatics **Llama** (ILlama), a novel framework that combines retrieval-augmented generation with a structured medical knowledge graph. ILlama incorporates relevant medical knowledge by transforming subgraphs from a structured medical knowledge graph into text for retrieval-augmented generation. By generating subgraphs from the medical knowledge graph in advance, specifically focusing on diseases and symptoms, ILlama is able to enhance the accuracy and relevance of its medical reasoning. This framework enables effective incorporation of causal relationships between symptoms and diseases. It also delivers context-aware consultations aligned with user queries. Experimental results on two medical consultation datasets demonstrate that ILlama outperforms strong baselines, achieving a semantic similarity F1-score of 0.884 when compared with ground truth consultation answers. Furthermore, qualitative analysis of ILlama’s responses reveals significant improvements in hallucination reduction and clinical usefulness. These results suggest that ILlama has strong potential as a reliable tool for real-world medical consultation environments.
pdf
bib
abs
Reflective Agreement: Combining Self-Mixture of Agents with a Sequence Tagger for Robust Event Extraction
Fatemeh Haji
|
Mazal Bethany
|
Cho-Yu Jason Chiang
|
Anthony Rios
|
Peyman Najafirad
Event Extraction (EE) involves automatically identifying and extracting structured information about events from unstructured text, including triggers, event types, and arguments. Traditional discriminative models demonstrate high precision but often exhibit limited recall, particularly for nuanced or infrequent events. Conversely, generative approaches leveraging Large Language Models (LLMs) provide higher semantic flexibility and recall but suffer from hallucinations and inconsistent predictions. To address these challenges, we propose Agreement-based Reflective Inference System (ARIS), a hybrid approach combining a Self Mixture of Agents with a discriminative sequence tagger. ARIS explicitly leverages structured model consensus, confidence-based filtering, and an LLM reflective inference module to reliably resolve ambiguities and enhance overall event prediction quality. We further investigate decomposed instruction fine-tuning for enhanced LLM event extraction understanding. Experiments demonstrate our approach outperforms existing state-of-the-art event extraction methods across three benchmark datasets.
pdf
bib
abs
Simple Yet Effective: An Information-Theoretic Approach to Multi-LLM Uncertainty Quantification
Maya Kruse
|
Majid Afshar
|
Saksham Khatwani
|
Anoop Mayampurath
|
Guanhua Chen
|
Yanjun Gao
Large language models (LLMs) often behave inconsistently across inputs, indicating uncertainty and motivating the need for its quantification in high-stakes settings. Prior work on calibration and uncertainty quantification often focuses on individual models, overlooking the potential of model diversity. We hypothesize that LLMs make complementary predictions due to differences in training and the Zipfian nature of language, and that aggregating their outputs leads to more reliable uncertainty estimates. To leverage this, we propose MUSE (Multi-LLM Uncertainty via Subset Ensembles), a simple information-theoretic method that uses Jensen-Shannon Divergence to identify and aggregate well-calibrated subsets of LLMs. Experiments on binary prediction tasks demonstrate improved calibration and predictive performance compared to single-model and naive ensemble baselines. In addition, we explore using MUSE as guided signals with chain-of-thought distillation to fine-tune LLMs for calibration. MUSE is available at: https://github.com/LARK-NLP-Lab/MUSE.
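A minimal sketch of the information-theoretic ingredient named in the abstract: Jensen-Shannon divergence between models' predictive distributions on one binary example, with a simple subset rule that keeps mutually agreeing models before averaging. The selection threshold and the keep-if-agreeing rule are illustrative assumptions, not MUSE's actual subset-selection procedure.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def subset_ensemble(probs, max_mean_jsd=0.2):
    """probs: one [p(neg), p(pos)] pair per model for a single binary example.
    Keep models whose mean pairwise JSD to the others is small, then average
    (threshold and keep-rule are illustrative, not MUSE's actual procedure)."""
    n = len(probs)
    mean_jsd = [np.mean([js_divergence(probs[i], probs[j]) for j in range(n) if j != i])
                for i in range(n)]
    kept = [probs[i] for i in range(n) if mean_jsd[i] <= max_mean_jsd] or probs
    return np.mean(kept, axis=0)

print(subset_ensemble([[0.2, 0.8], [0.25, 0.75], [0.9, 0.1]]))  # outlier model dropped
```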
pdf
bib
abs
Exploring morphology-aware tokenization: A case study on Spanish language modeling
Alba Táboas García
|
Piotr Przybyła
|
Leo Wanner
This paper investigates to what extent the integration of morphological information can improve subword tokenization and thus also language modeling performance. We focus on Spanish, a language with fusional morphology, where subword segmentation can benefit from linguistic structure. Instead of relying on purely data-driven strategies like Byte Pair Encoding (BPE), we explore a linguistically grounded approach: training a tokenizer on morphologically segmented data. To do so, we develop a semi-supervised segmentation model for Spanish, building gold-standard datasets to guide and evaluate it. We then use this tokenizer to pre-train a masked language model and assess its performance on several downstream tasks. Our results show improvements over a baseline with a standard tokenizer, supporting our hypothesis that morphology-aware tokenization offers a viable and principled alternative for improving language modeling.
pdf
bib
abs
Studying Rhetorically Ambiguous Questions
Oghenevovwe Ikumariegbe
|
Eduardo Blanco
|
Ellen Riloff
Distinguishing between rhetorical questions and informational questions is a challenging task, as many rhetorical questions have similar surface forms to informational questions. Existing datasets, however, do not contain many questions that can be rhetorical or informational in different contexts. We introduce Studying Rhetorically Ambiguous Questions (SRAQ), a new dataset explicitly constructed to support the study of such rhetorical ambiguity. The questions in SRAQ can be interpreted as either rhetorical or informational depending on the context. We evaluate the performance of state-of-the-art language models on this dataset and find that they struggle to recognize many rhetorical questions.
pdf
bib
abs
Estimating LLM Consistency: A User Baseline vs Surrogate Metrics
Xiaoyuan Wu
|
Weiran Lin
|
Omer Akgul
|
Lujo Bauer
Large language models (LLMs) are prone to hallucinations and sensitive to prompt perturbations, often resulting in inconsistent or unreliable generated text. Different methods have been proposed to mitigate such hallucinations and fragility, one of which is to measure the consistency of LLM responses–the model’s confidence in the response or likelihood of generating a similar response when resampled. In previous work, measuring LLM response consistency often relied on calculating the probability of a response appearing within a pool of resampled responses, analyzing internal states, or evaluating logits of responses. However, it was not clear how well these approaches approximated users’ perceptions of consistency of LLM responses. To find out, we performed a user study (n=2,976) demonstrating that current methods for measuring LLM response consistency typically do not align well with humans’ perceptions of LLM consistency. We propose a logit-based ensemble method for estimating LLM consistency and show that our method matches the performance of the best-performing existing metric in estimating human ratings of LLM consistency. Our results suggest that methods for estimating LLM consistency without human evaluation are sufficiently imperfect to warrant broader use of evaluation with human input; this would avoid misjudging the adequacy of models because of the imperfections of automated consistency metrics.
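Among the surrogate metrics surveyed above, the resampling-pool estimate is the simplest to picture; a toy version is sketched below, scoring consistency as the fraction of resampled responses whose embedding is close to the original response. The similarity threshold is an assumption, and this baseline-style estimate is distinct from the logit-based ensemble method the paper proposes.

```python
import numpy as np

def resample_consistency(orig_vec, resample_vecs, sim_threshold=0.8):
    """Baseline-style estimate: fraction of resampled responses whose embedding is
    close to the original response (threshold is an illustrative assumption)."""
    cos = lambda a, b: float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    matches = [cos(orig_vec, v) >= sim_threshold for v in resample_vecs]
    return sum(matches) / len(matches)

rng = np.random.default_rng(0)
base = rng.standard_normal(16)
pool = [base + 0.1 * rng.standard_normal(16) for _ in range(5)] + [rng.standard_normal(16)]
print(resample_consistency(base, pool))
```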
pdf
bib
abs
Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study
DongGeon Lee
|
Joonwon Jang
|
Jihae Jeong
|
Hwanjo Yu
Rapid deployment of vision-language models (VLMs) magnifies safety risks, yet most evaluations rely on artificial images. This study asks: How safe are current VLMs when confronted with meme images that ordinary users share? To investigate this question, we introduce MemeSafetyBench, a 50,430-instance benchmark pairing real meme images with both harmful and benign instructions. Using a comprehensive safety taxonomy and LLM-based instruction generation, we assess multiple VLMs across single and multi-turn interactions. We investigate how real-world memes influence harmful outputs, the mitigating effects of conversational context, and the relationship between model scale and safety metrics. Our findings demonstrate that VLMs are more vulnerable to meme-based harmful prompts than to synthetic or typographic images. Memes significantly increase harmful responses and decrease refusals compared to text-only inputs. Though multi-turn interactions provide partial mitigation, elevated vulnerability persists. These results highlight the need for ecologically valid evaluations and stronger safety mechanisms. MemeSafetyBench is publicly available at https://github.com/oneonlee/Meme-Safety-Bench.
pdf
bib
abs
Improving Rule-based Reasoning in LLMs using Neurosymbolic Representations
Varun Dhanraj
|
Chris Eliasmith
Large language models (LLMs) continue to face challenges in reliably solving reasoning tasks, particularly tasks that involve precise rule following, as often found in mathematical reasoning tasks. This paper introduces a novel neurosymbolic method that improves LLM reasoning by encoding hidden states into neurosymbolic vectors, enabling problem-solving within a neurosymbolic vector space. The results are decoded and merged with the original hidden state, significantly boosting the model’s performance on numerical reasoning tasks. By offloading computation through neurosymbolic representations, this method enhances efficiency, reliability, and interpretability. Our experimental results demonstrate an average of 88.6% lower cross-entropy loss and 15.4 times more problems correctly solved on a suite of mathematical reasoning tasks compared to chain-of-thought prompting and supervised fine-tuning (LoRA), while not hindering the LLM’s performance on other tasks. We make our code available at https://github.com/vdhanraj/Neurosymbolic-LLM.
pdf
bib
abs
Can LLMs Extract Frame-Semantic Arguments?
Jacob Devasier
|
Rishabh Mediratta
|
Chengkai Li
Frame-semantic parsing is a critical task in natural language understanding, yet the ability of large language models (LLMs) to extract frame-semantic arguments remains underexplored. This paper presents a comprehensive evaluation of LLMs on frame-semantic argument identification, analyzing the impact of input representation formats, model architectures, and generalization to unseen and out-of-domain samples. Our experiments, spanning models from 0.5B to 72B parameters, reveal that JSON-based representations significantly enhance performance, and while larger models generally perform better, smaller models can achieve competitive results through fine-tuning. We also introduce a novel approach to frame identification leveraging predicted frame elements, achieving state-of-the-art performance on ambiguous targets. Despite strong generalization capabilities, our analysis finds that LLMs still struggle with out-of-domain data.
pdf
bib
abs
Accelerated Test-Time Scaling with Model-Free Speculative Sampling
Woomin Song
|
Saket Dingliwal
|
Sai Muralidhar Jayanthi
|
Bhavana Ganesh
|
Jinwoo Shin
|
Aram Galstyan
|
Sravan Babu Bodapati
Language models have demonstrated remarkable capabilities in reasoning tasks through test-time scaling techniques like best-of-N sampling and tree search. However, these approaches often demand substantial computational resources, creating a critical trade-off between performance and efficiency. We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach that exploits the inherent redundancy in reasoning trajectories to achieve significant acceleration without compromising accuracy. Our analysis shows that reasoning paths frequently reuse similar reasoning patterns, enabling efficient model-free token prediction without requiring separate draft models. By introducing stochastic drafting and preserving probabilistic information through a memory-efficient logit-based N-gram module, combined with optimized Gumbel-Top-K sampling and data-driven tree construction, STAND significantly improves token acceptance rates. Extensive evaluations across multiple models and reasoning tasks (AIME-2024, GPQA-Diamond, and LiveCodeBench) demonstrate that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding while maintaining accuracy. Furthermore, STAND consistently outperforms state-of-the-art speculative decoding methods across diverse inference patterns, including single-trajectory decoding, batch decoding, and test-time tree search. As a model-free approach, STAND can be applied to any existing language model without additional training, making it a powerful plug-and-play solution for accelerating language model reasoning.
pdf
bib
abs
Enhancing RLHF with Human Gaze Modeling
Karim Galliamov
|
Ivan Titov
|
Ilya Pershin
Reinforcement Learning from Human Feedback (RLHF) aligns language models with human preferences but faces efficiency challenges. We explore two approaches leveraging human gaze prediction to enhance RLHF: (1) gaze-aware reward models and (2) gaze-based distribution of sparse rewards at token level. Our experiments show gaze-informed RLHF achieves faster convergence while maintaining or slightly improving performance, reducing computational requirements during policy optimization. Human visual attention patterns provide valuable signals for policy training, suggesting a promising direction for improving RLHF efficiency through human-like attention mechanisms.
pdf
bib
abs
Mapping semantic networks to Dutch word embeddings as a diagnostic tool for cognitive decline
Maithe van Noort
|
Michal Korenar
|
Jelke Bloem
We explore the possibility of semantic networks as a diagnostic tool for cognitive decline by using Dutch verbal fluency data to investigate the relationship between semantic networks and cognitive health. In psychology, semantic networks serve as abstract representations of the semantic memory system. Semantic verbal fluency data can be used to estimate said networks. Traditionally, this is done by counting the number of raw items produced by participants in a verbal fluency task. We used static and contextual word embedding models to connect the elicited words through semantic similarity scores, and extracted three network distance metrics. We then tested how well these metrics predict participants’ cognitive health scores on the Mini-Mental State Examination (MMSE). While the significant predictors differed per model, the traditional number-of-words measure was not significant in any case. These findings suggest that semantic network metrics may provide a more sensitive measure of cognitive health than traditional scoring.
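The embedding-to-network step can be pictured as follows: link the words a participant produced whenever their embedding cosine similarity clears a threshold, then read off a graph distance metric. The threshold, the toy embedding used in the usage lines, and the choice of average shortest path length as the metric are illustrative assumptions; the paper's actual metrics are not reproduced here.

```python
import numpy as np
import networkx as nx

def fluency_network(words, embed, sim_threshold=0.4):
    """Link words from one fluency list whenever their embedding cosine similarity
    exceeds a threshold (illustrative value)."""
    vecs = {w: np.asarray(embed(w), dtype=float) for w in words}
    cos = lambda a, b: float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    g = nx.Graph()
    g.add_nodes_from(words)
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            if cos(vecs[w1], vecs[w2]) >= sim_threshold:
                g.add_edge(w1, w2)
    return g

def network_metric(g):
    """One example of a network distance metric per participant."""
    if g.number_of_nodes() > 1 and nx.is_connected(g):
        return nx.average_shortest_path_length(g)
    return float("nan")

# `toy_embed` stands in for a real (e.g. contextual) word-embedding lookup
toy_embed = lambda w: [len(w), sum(map(ord, w)) % 7 + 1]
g = fluency_network(["hond", "kat", "paard", "mus"], toy_embed)
print(network_metric(g))
```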
pdf
bib
abs
CausalVLBench: Benchmarking Visual Causal Reasoning in Large Vision-Language Models
Aneesh Komanduri
|
Karuna Bhaila
|
Xintao Wu
Large language models (LLMs) have shown remarkable ability in various language tasks, especially with their emergent in-context learning capability. Extending LLMs to incorporate visual inputs, large vision-language models (LVLMs) have shown impressive performance in tasks such as recognition and visual question answering (VQA). Despite increasing interest in the utility of LLMs in causal reasoning tasks such as causal discovery and counterfactual reasoning, there has been relatively little work showcasing the abilities of LVLMs on visual causal reasoning tasks. We take this opportunity to formally introduce a comprehensive causal reasoning benchmark for multi-modal in-context learning from LVLMs. Our CausalVLBench encompasses three representative tasks: causal structure inference, intervention target prediction, and counterfactual prediction. We evaluate the ability of state-of-the-art open-source LVLMs on our causal reasoning tasks across three causal representation learning datasets and demonstrate their fundamental strengths and weaknesses. We hope that our benchmark elucidates the drawbacks of existing vision-language models and motivates new directions and paradigms in improving the visual causal reasoning abilities of LVLMs.
pdf
bib
abs
Implicit Behavioral Alignment of Language Agents in High-Stakes Crowd Simulations
Yunzhe Wang
|
Gale Lucas
|
Burcin Becerik-Gerber
|
Volkan Ustun
Language-driven generative agents have enabled large-scale social simulations with transformative uses, from interpersonal training to aiding global policy-making. However, recent studies indicate that generative agent behaviors often deviate from expert expectations and real-world data—a phenomenon we term the *Behavior-Realism Gap*. To address this, we introduce a theoretical framework called Persona-Environment Behavioral Alignment (PEBA), formulated as a distribution matching problem grounded in Lewin’s behavior equation stating that behavior is a function of the person and their environment. Leveraging PEBA, we propose PersonaEvolve (PEvo), an LLM-based optimization algorithm that iteratively refines agent personas, implicitly aligning their collective behaviors with realistic expert benchmarks within a specified environmental context. We validate PEvo in an active shooter incident simulation we developed, achieving an 84% average reduction in distributional divergence compared to no steering and a 34% improvement over explicit instruction baselines. Results also show PEvo-refined personas generalize to novel, related simulation scenarios. Our method greatly enhances behavioral realism and reliability in high-stakes social simulations. More broadly, the PEBA-PEvo framework provides a principled approach to developing trustworthy LLM-driven social simulations.
pdf
bib
abs
Are Language Models Consequentialist or Deontological Moral Reasoners?
Keenan Samway
|
Max Kleiman-Weiner
|
David Guzman Piedrahita
|
Rada Mihalcea
|
Bernhard Schölkopf
|
Zhijing Jin
As AI systems increasingly navigate applications in healthcare, law, and governance, understanding how they handle ethically complex scenarios becomes critical. Previous work has mainly examined the moral judgments in large language models (LLMs), rather than their underlying moral reasoning process. In contrast, we focus on a large-scale analysis of the moral reasoning traces provided by LLMs. Furthermore, unlike prior work that attempted to draw inferences from only a handful of moral dilemmas, our study leverages over 600 distinct trolley problems as probes for revealing the reasoning patterns that emerge within different LLMs. We introduce and test a taxonomy of moral rationales to systematically classify reasoning traces according to two main normative ethical theories: consequentialism and deontology. Our analysis reveals that LLM chains-of-thought favor deontological principles based on moral obligations, while post-hoc explanations shift notably toward consequentialist rationales that emphasize utility. Our framework provides a foundation for understanding how LLMs process and articulate ethical considerations, an important step toward safe and interpretable deployment of LLMs in high-stakes decision-making environments.
pdf
bib
abs
PatentScore: Multi-dimensional Evaluation of LLM-Generated Patent Claims
Yongmin Yoo
|
Qiongkai Xu
|
Longbing Cao
High-stakes texts such as patent claims, medical records, and technical reports are structurally complex and demand a high degree of reliability and precision. While large language models (LLMs) have recently been applied to automate their generation in high-stakes domains, reliably evaluating such outputs remains a major challenge. Conventional natural language generation (NLG) metrics are effective for generic documents but fail to capture the structural and legal characteristics essential to evaluating complex high-stakes documents. To address this gap, we propose PatentScore, a multi-dimensional evaluation framework specifically designed for one of the most intricate and rigorous domains, patent claims. PatentScore integrates hierarchical decomposition of claim elements, validation patterns grounded in legal and technical standards, and scoring across structural, semantic, and legal dimensions. In experiments on our dataset of 400 first claims (Claim 1), PatentScore achieved the highest correlation with expert annotations (r = 0.819), significantly outperforming widely used NLG metrics. This work establishes a new standard for evaluating LLM-generated patent claims, providing a solid foundation for research on patent generation and validation.
pdf
bib
abs
All for One: LLMs Solve Mental Math at the Last Token With Information Transferred From Other Tokens
Siddarth Mamidanna
|
Daking Rai
|
Ziyu Yao
|
Yilun Zhou
Large language models (LLMs) demonstrate proficiency across numerous computational tasks, yet their inner workings remain unclear. In theory, the combination of causal self-attention and multilayer perceptron allows every token to access and compute information based on all preceding tokens. In practice, to what extent are such operations present? In this paper, on mental math tasks (i.e., direct math calculation via next-token prediction without explicit reasoning), we investigate this question in three steps: inhibiting input-specific token computations in the initial layers, restricting the routes of information transfer in the next few layers, and forcing all computation to happen at the last token in the remaining layers. With two proposed techniques, Context-Aware Mean Ablation (CAMA) and Attention-Based Peeking (ABP), we identify an All-for-One subgraph (AF1) with high accuracy on a wide variety of mental math tasks, where meaningful computation occurs very late (in terms of layer depth) and only at the last token, which receives information of other tokens in few specific layers. Experiments show that this circuit is sufficient and necessary for high model performance, transfers across different models, and works on a variety of input styles. Ablations on different CAMA and ABP alternatives reveal their unique advantages over other methods, which may be of independent interest.
pdf
bib
abs
A Position Paper on the Automatic Generation of Machine Learning Leaderboards
Roelien C. Timmer
|
Yufang Hou
|
Stephen Wan
An important task in machine learning (ML) research is comparing prior work, which is often performed via ML leaderboards: a tabular overview of experiments with comparable conditions (e.g. same task, dataset, and metric). However, the growing volume of literature creates challenges in creating and maintaining these leaderboards. To ease this burden, researchers have developed methods to extract leaderboard entries from research papers for automated leaderboard curation. Yet, prior work varies in problem framing, complicating comparisons and limiting real-world applicability. In this position paper, we present the first overview of Automatic Leaderboard Generation (ALG) research, identifying fundamental differences in assumptions, scope, and output formats. We propose an ALG unified conceptual framework to standardise how the ALG task is defined. We offer ALG benchmarking guidelines, including recommendations for datasets and metrics that promote fair, reproducible evaluation. Lastly, we outline challenges and new directions for ALG, advocating for broader coverage by including all reported results and richer metadata.
pdf
bib
abs
SimMark: A Robust Sentence-Level Similarity-Based Watermarking Algorithm for Large Language Models
Amirhossein Dabiriaghdam
|
Lele Wang
The widespread adoption of large language models (LLMs) necessitates reliable methods to detect LLM-generated text. We introduce SimMark, a robust sentence-level watermarking algorithm that makes LLMs’ outputs traceable without requiring access to model internals, making it compatible with both open and API-based LLMs. By leveraging the similarity of semantic sentence embeddings combined with rejection sampling to embed detectable statistical patterns imperceptible to humans, and employing a soft counting mechanism, SimMark achieves robustness against paraphrasing attacks. Experimental results demonstrate that SimMark sets a new benchmark for robust watermarking of LLM-generated content, surpassing prior sentence-level watermarking techniques in robustness, sampling efficiency, and applicability across diverse domains, all while maintaining the text quality and fluency.
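The abstract names the ingredients (sentence-embedding similarity, rejection sampling, soft counting) without the full recipe; one plausible toy rendering of the detection side is sketched below, giving partial credit to consecutive-sentence similarities near a designated interval. The interval bounds, margin, and decision rule are guesses for illustration only and should not be read as the actual SimMark algorithm.

```python
def soft_count(sims, lo=0.55, hi=0.85, margin=0.05):
    """Toy soft counting over consecutive-sentence embedding similarities: full credit
    inside [lo, hi], linearly decaying credit within `margin` of the interval.
    All interval values are illustrative guesses, not SimMark's parameters."""
    credit = 0.0
    for s in sims:
        if lo <= s <= hi:
            credit += 1.0
        elif lo - margin < s < lo:
            credit += (s - (lo - margin)) / margin
        elif hi < s < hi + margin:
            credit += ((hi + margin) - s) / margin
    return credit / max(len(sims), 1)

# A high soft count on a suspect text would be flagged as likely watermarked
print(soft_count([0.60, 0.70, 0.90, 0.58, 0.52]))  # ~0.68
```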
pdf
bib
abs
SERVAL: Surprisingly Effective Zero-Shot Visual Document Retrieval Powered by Large Vision and Language Models
Thong Nguyen
|
Yibin Lei
|
Jia-Huei Ju
|
Andrew Yates
Visual Document Retrieval (VDR) typically operates as text-to-image retrieval using specialized bi-encoders trained to directly embed document images. We revisit a zero-shot generate-and-encode pipeline: a vision–language model first produces a detailed textual description of each document image, which is then embedded by a standard text encoder. On the ViDoRe-v2 benchmark, the method reaches 63.4% nDCG@5, surpassing the strongest specialised multi-vector visual document encoder, and it scales similarly on MIRACL-VISION with broader multilingual coverage. Analysis shows that modern vision–language models capture complex textual and visual cues with sufficient granularity to act as a reusable semantic proxy. By off-loading modality alignment to pretrained vision–language models, our approach removes the need for computationally intensive text-image contrastive training and establishes a strong zero-shot baseline for future VDR systems.
pdf
bib
abs
Meta-Semantics Augmented Few-Shot Relational Learning
Han Wu
|
Jie Yin
Few-shot relational learning on knowledge graphs (KGs) aims to perform reasoning over relations with only a few training examples. While current methods have focused primarily on leveraging specific relational information, rich semantics inherent in KGs have been largely overlooked. To bridge this gap, we propose PromptMeta, a novel prompted meta-learning framework that seamlessly integrates meta-semantics with relational information for few-shot relational learning. PromptMeta introduces two core innovations: (1) a Meta-Semantic Prompt (MSP) pool that learns and consolidates high-level meta-semantics shared across tasks, enabling effective knowledge transfer and adaptation to newly emerging relations; and (2) a learnable fusion mechanism that dynamically combines meta-semantics with task-specific relational information tailored to different few-shot tasks. Both components are optimized jointly with model parameters within a meta-learning framework. Extensive experiments and analyses on two real-world KG benchmarks validate the effectiveness of PromptMeta in adapting to new relations with limited supervision.
pdf
bib
abs
ProLongVid: A Simple but Strong Baseline for Long-context Video Instruction Tuning
Rui Wang
|
Bohao Li
|
Xiyang Dai
|
Jianwei Yang
|
Yi-Ling Chen
|
Zhen Xing
|
Yifan Yang
|
Dongdong Chen
|
Xipeng Qiu
|
Zuxuan Wu
|
Yu-Gang Jiang
Video understanding is essential for multimodal large language models (MLLMs) to interact effectively with users and the real world. However, analyzing long videos remains a major challenge due to the lack of high-quality video instruction data and effective training strategies. In this paper, we introduce a simple yet effective baseline for long-context video understanding, including dataset construction and training recipes. We curate a large-scale video instruction dataset with over 1M samples, encompassing videos from a few seconds to several minutes across diverse sources, without any human annotations. Additionally, we propose a progressive video instruction tuning strategy that incrementally increases input context length, enabling better utilization of videos of varying durations. Comprehensive experiments demonstrate that our dataset significantly outperforms existing video instruction datasets for fine-tuning MLLMs. Furthermore, our training approach establishes a strong video MLLM baseline, surpassing previous open-source models on video benchmarks and outperforming proprietary models like GPT-4V and GPT-4o-mini on VideoMME, even with a compact 7B model.
pdf
bib
abs
ModelCitizens: Representing Community Voices in Online Safety
Ashima Suvarna
|
Christina A Chance
|
Karolina Naranjo
|
Hamid Palangi
|
Sophie Hao
|
Thomas Hartvigsen
|
Saadia Gabriel
Automatic toxic language detection is important for creating safe, inclusive online spaces. However, it is a highly subjective task, with perceptions of toxic language shaped by community norms and lived experience. Existing toxicity detection models are typically trained on annotations that collapse diverse annotator perspectives into a single ground truth, erasing important context-specific notions of toxicity such as reclaimed language. To address this, we introduce MODELCITIZENS, a dataset of 6.8K social media posts and 40K toxicity annotations across diverse identity groups. To reflect the impact of conversational context on toxicity, typical of social media posts, we augment MODELCITIZENS posts with LLM-generated conversational scenarios. State-of-the-art toxicity detection tools (e.g. OpenAI Moderation API, GPT-o4-mini) underperform on MODELCITIZENS with further degradation on context-augmented posts. Finally, we release LLAMACITIZEN-8B and GEMMACITIZEN-12B, LLaMA and Gemma-based models finetuned on our dataset, which outperform GPT-o4-mini by 5.5% on in-distribution evaluations. Our findings highlight the importance of community-informed annotation and modeling for inclusive content moderation. We will release all code, data and models upon publication.
pdf
bib
abs
UnifiedVisual: A Framework for Constructing Unified Vision-Language Datasets
Pengyu Wang
|
Shaojun Zhou
|
Chenkun Tan
|
Xinghao Wang
|
Wei Huang
|
Zhen Ye
|
Zhaowei Li
|
Botian Jiang
|
Dong Zhang
|
Xipeng Qiu
Unified vision large language models (VLLMs) have recently achieved impressive advancements in both multimodal understanding and generation, powering applications such as visual question answering and text-guided image synthesis. However, progress in unified VLLMs remains constrained by the lack of datasets that fully exploit the synergistic potential between these two core abilities. Existing datasets typically address understanding and generation in isolation, thereby limiting the performance of unified VLLMs. To bridge this critical gap, we introduce a novel dataset construction framework, UnifiedVisual, and present UnifiedVisual-240K, a high-quality dataset meticulously designed to facilitate mutual enhancement between multimodal understanding and generation. UnifiedVisual-240K seamlessly integrates diverse visual and textual inputs and outputs, enabling comprehensive cross-modal reasoning and precise text-to-image alignment. Our dataset encompasses a wide spectrum of tasks and data sources, ensuring rich diversity and addressing key shortcomings of prior resources. Extensive experiments demonstrate that models trained on UnifiedVisual-240K consistently achieve strong performance across a wide range of tasks. Notably, these models exhibit significant mutual reinforcement between multimodal understanding and generation, further validating the effectiveness of our framework and dataset. We believe UnifiedVisual represents a new growth point for advancing unified VLLMs and unlocking their full potential.
pdf
bib
abs
The Pursuit of Empathy: Evaluating Small Language Models for PTSD Dialogue Support
Suhas Bn
|
Yash Mahajan
|
Dominik O. Mattioli
|
Andrew M. Sherrill
|
Rosa I. Arriaga
|
Christopher Wiese
|
Saeed Abdullah
This paper investigates the capacity of small language models (0.5B-5B parameters) to generate empathetic responses for individuals with PTSD. We introduce Trauma-Informed Dialogue for Empathy (TIDE), a novel dataset comprising 10,000 two-turn conversations across 500 diverse, clinically-grounded PTSD personas (https://huggingface.co/datasets/yenopoya/TIDE). Using frontier model outputs as ground truth, we evaluate eight small LLMs in zero-shot settings and after fine-tuning. Fine-tuning enhances empathetic capabilities, improving cosine similarity and perceived empathy, although gains vary across emotional scenarios and smaller models exhibit a “knowledge transfer ceiling.” As expected, Claude Sonnet 3.5 consistently outperforms all models, but surprisingly, the smaller models often approach human-rated empathy levels. Demographic analyses showed that older adults favored responses that validated distress before offering support (p = .004), while graduate-educated users preferred emotionally layered replies in specific scenarios. Gender-based differences were minimal (p > 0.15), suggesting the feasibility of broadly empathetic model designs. This work offers insights into building resource-efficient, emotionally intelligent systems for mental health support.
pdf
bib
abs
Is Cognition Consistent with Perception? Assessing and Mitigating Multimodal Knowledge Conflicts in Document Understanding
Zirui Shao
|
Feiyu Gao
|
Zhaoqing Zhu
|
Chuwei Luo
|
Hangdi Xing
|
Zhi Yu
|
Qi Zheng
|
Ming Yan
|
Jiajun Bu
Multimodal large language models (MLLMs) have shown impressive capabilities in document understanding, a rapidly growing research area with significant industrial demand. As a multimodal task, document understanding requires models to possess both perceptual and cognitive abilities. However, due to different types of annotation noise in training, current MLLMs often face conflicts between perception and cognition. Taking a document VQA task (cognition) as an example, an MLLM might generate answers that do not match the corresponding visual content identified by its OCR (perception). This conflict suggests that the MLLM might struggle to establish an intrinsic connection between the information it “sees” and what it “understands”. Such conflicts challenge the intuitive notion that cognition is consistent with perception, hindering the performance and explainability of MLLMs. In this paper, we define the conflicts between cognition and perception as Cognition and Perception (C&P) knowledge conflicts, a form of multimodal knowledge conflicts, and systematically assess them with a focus on document understanding. Our analysis reveals that even GPT-4o, a leading MLLM, achieves only 75.26% C&P consistency. To mitigate the C&P knowledge conflicts, we propose a novel method called Multimodal Knowledge Consistency Fine-tuning. Our method reduces C&P knowledge conflicts across all tested MLLMs and enhances their performance in both cognitive and perceptual tasks.
pdf
bib
abs
AutoCT: Automating Interpretable Clinical Trial Prediction with LLM Agents
Fengze Liu
|
Haoyu Wang
|
Joonhyuk Cho
|
Dan Roth
|
Andrew Lo
Clinical trials are critical for advancing medical treatments but remain prohibitively expensive and time-consuming. Accurate prediction of clinical trial outcomes can significantly reduce research and development costs and accelerate drug discovery. While recent deep learning models have shown promise by leveraging unstructured data, their black-box nature, lack of interpretability, and vulnerability to label leakage limit their practical use in high-stakes biomedical contexts. In this work, we propose AutoCT, a novel framework that combines the reasoning capabilities of large language models with the explainability of classical machine learning. AutoCT autonomously generates, evaluates, and refines tabular features based on public information without human input. Our method uses Monte Carlo Tree Search to iteratively optimize predictive performance. Experimental results show that AutoCT performs on par with or better than SOTA methods on clinical trial prediction tasks within only a limited number of self-refinement iterations, establishing a new paradigm for scalable, interpretable, and cost-efficient clinical trial prediction.
pdf
bib
abs
MMDocIR: Benchmarking Multimodal Retrieval for Long Documents
Kuicai Dong
|
Yujing Chang
|
Derrick Goh Xin Deik
|
Dexun Li
|
Ruiming Tang
|
Yong Liu
Multimodal document retrieval aims to identify and retrieve various forms of multimodal content, such as figures, tables, charts, and layout information from extensive documents. Despite its increasing popularity, there is a notable lack of a comprehensive and robust benchmark to effectively evaluate the performance of systems in such tasks. To address this gap, this work introduces a new benchmark, named MMDocIR, that encompasses two distinct tasks: page-level and layout-level retrieval. The former evaluates the performance of identifying the most relevant pages within a long document, while the latter assesses the ability to detect specific layouts, providing a more fine-grained measure than whole-page analysis. A layout refers to a variety of elements, including textual paragraphs, equations, figures, tables, or charts. The MMDocIR benchmark comprises a rich dataset featuring 1,685 questions annotated by experts and 173,843 questions with bootstrapped labels, making it a valuable resource in multimodal document retrieval for both training and evaluation. Through rigorous experiments, we demonstrate that (i) visual retrievers significantly outperform their text counterparts, (ii) the MMDocIR training set effectively enhances the performance of multimodal document retrieval, and (iii) text retrievers leveraging VLM-text significantly outperform retrievers relying on OCR-text.
pdf
bib
abs
Program of Thoughts for Financial Reasoning: Leveraging Dynamic In-Context Examples and Generative Retrieval
Subhendu Khatuya
|
Shashwat Naidu
|
Pawan Goyal
|
Niloy Ganguly
Despite continuous advancements in the capabilities of large language models (LLMs), numerical reasoning remains a challenging area. Techniques like chain-of-thought prompting, tree-of-thought prompting, and program-of-thought prompting guide LLMs through intermediate reasoning steps. Although in-context learning with few-shot prompting has improved performance, LLMs still lag behind state-of-the-art models on financial numerical reasoning datasets such as FinQA and ConvFinQA. In this work, we introduce FINDER, a novel two-step framework, to enhance LLM’s capabilities in financial numerical reasoning. The first step utilizes a generative retriever to extract relevant facts from unstructured data, including both text and tables. This is followed by context-aware Program of Thought prompting with dynamic selection of in-context examples. Our model FINDER achieves a new state-of-the-art performance on both the FinQA and ConvFinQA datasets, surpassing previous benchmarks with execution accuracy improvements of 5.98% and 4.05%, respectively.
pdf
bib
abs
Waste-Bench: A Comprehensive Benchmark for Evaluating VLLMs in Cluttered Environments
Muhammad Ali
|
Salman Khan
Recent advancements in Large Language Models (LLMs) have paved the way for Vision Large Language Models (VLLMs) capable of performing a wide range of visual understanding tasks. While LLMs have demonstrated impressive performance on standard natural images, their capabilities have not been thoroughly explored on cluttered datasets featuring complex environments and deformed objects. In this work, we introduce a novel dataset specifically designed for waste classification in real-world scenarios, characterized by complex environments and deformed objects. Along with this dataset, we present an in-depth evaluation approach to rigorously assess the robustness and accuracy of VLLMs. The introduced dataset and comprehensive analysis provide valuable insights into the performance of VLLMs under challenging conditions. Our findings highlight the critical need for further advancements in VLLM robustness to perform better in complex environments. The dataset and code for our experiments are available at https://github.com/aliman80/wastebench.
pdf
bib
abs
Demystifying Domain-adaptive Post-training for Financial LLMs
Zixuan Ke
|
Yifei Ming
|
Xuan-Phi Nguyen
|
Caiming Xiong
|
Shafiq Joty
Domain-adaptive post-training of large language models (LLMs) has emerged as a promising approach for specialized domains such as medicine and finance. However, significant challenges remain in identifying optimal adaptation criteria and training strategies across varying data and model configurations. To address these challenges, we introduce FINDAP, a systematic and fine-grained investigation into domain-adaptive post-training of LLMs for the finance domain. Our approach consists of four key components: FinCap, which defines the core capabilities required for the target domain; FinRec, an effective training recipe that jointly optimizes continual pre-training and instruction-following, along with a novel preference data distillation method leveraging process signals from a generative reward model; FinTrain, a curated set of training datasets supporting FinRec; and FinEval, a comprehensive evaluation suite aligned with FinCap. The resulting model, Llama-Fin, achieves state-of-the-art performance across a wide range of financial tasks. Our analysis also highlights how each post-training stage contributes to distinct capabilities, uncovering specific challenges and effective solutions, providing valuable insights for domain adaptation of LLMs.
pdf
bib
abs
HICode: Hierarchical Inductive Coding with LLMs
Mian Zhong
|
Pristina Wang
|
Anjalie Field
Despite numerous applications for fine-grained corpus analysis, researchers continue to rely on manual labeling, which does not scale, or statistical tools like topic modeling, which are difficult to control. We propose that LLMs have the potential to scale the nuanced analyses that researchers typically conduct manually to large text corpora. To this effect, inspired by qualitative research methods, we develop HICode, a two-part pipeline that first inductively generates labels directly from analysis data and then hierarchically clusters them to surface emergent themes. We validate this approach across three diverse datasets by measuring alignment with human-constructed themes and demonstrating its robustness through automated and human evaluations. Finally, we conduct a case study of litigation documents related to the ongoing opioid crisis in the U.S., revealing aggressive marketing strategies employed by pharmaceutical companies and demonstrating HICode’s potential for facilitating nuanced analyses in large-scale data.
pdf
bib
abs
Cacheback: Speculative Decoding With Nothing But Cache
Zhiyao Ma
|
In Gim
|
Lin Zhong
We present Cacheback Decoding, a training-free and model-agnostic speculative decoding method that exploits the locality in language to accelerate Large Language Model (LLM) inference. Cacheback leverages only Least Recently Used (LRU) cache tables of token n-grams to generate draft sequences. Cacheback achieves state-of-the-art performance among comparable methods despite its minimalist design, and its simplicity allows easy integration into existing systems. Cacheback also shows potential for fast adaptation to new domains.
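The mechanism described above fits in a few lines: an LRU table mapping fixed-length token prefixes to their most recent continuation, chained to propose a draft that the target model then verifies in parallel. Prefix length, cache capacity, and draft length below are illustrative values, not Cacheback's tuned settings.

```python
from collections import OrderedDict

class NGramDrafter:
    """Toy LRU n-gram table: map a fixed-length token prefix to the token that most
    recently followed it, then chain lookups to propose a draft sequence that the
    target model verifies in parallel (prefix length and capacity are illustrative)."""
    def __init__(self, n=3, capacity=10_000):
        self.n, self.capacity = n, capacity
        self.table = OrderedDict()

    def observe(self, tokens):
        for i in range(len(tokens) - self.n):
            key = tuple(tokens[i:i + self.n])
            self.table[key] = tokens[i + self.n]
            self.table.move_to_end(key)            # mark as most recently used
            if len(self.table) > self.capacity:
                self.table.popitem(last=False)     # evict the least recently used entry

    def draft(self, context, max_draft=5):
        out, ctx = [], list(context)
        for _ in range(max_draft):
            key = tuple(ctx[-self.n:])
            if key not in self.table:
                break
            out.append(self.table[key])
            ctx.append(self.table[key])
        return out

d = NGramDrafter()
d.observe([1, 2, 3, 4, 2, 3, 4, 5])
print(d.draft([1, 2, 3]))  # [4, 5]
```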
pdf
bib
abs
MA-DPR: Manifold-aware Distance Metrics for Dense Passage Retrieval
Yifan Liu
|
Qianfeng Wen
|
Mark Zhao
|
Jiazhou Liang
|
Scott Sanner
Dense Passage Retrieval (DPR) typically relies on Euclidean or cosine distance to measure query–passage relevance in embedding space, which is effective when embeddings lie on a linear manifold. However, our experiments across DPR benchmarks suggest that embeddings often lie on lower-dimensional, non-linear manifolds, especially in out-of-distribution (OOD) settings, where cosine and Euclidean distance fail to capture semantic similarity. To address this limitation, we propose a *manifold-aware* distance metric for DPR (**MA-DPR**) that models the intrinsic manifold structure of passages using a nearest-neighbor graph and measures query–passage distance based on their shortest path in this graph. We show that MA-DPR outperforms Euclidean and cosine distances by up to **26%** on OOD passage retrieval, with comparable in-distribution performance across various embedding models, while incurring a minimal increase in query inference time. Empirical evidence suggests that manifold-aware distance allows DPR to leverage context from related neighboring passages, making it effective even in the absence of direct semantic overlap. MA-DPR can be applied to a wide range of dense embedding and retrieval tasks, offering potential benefits across a wide spectrum of domains.
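The geodesic idea can be sketched directly: build a k-nearest-neighbour graph over passage embeddings, attach the query as an extra node, and rank passages by shortest-path distance from it. The value of k, the Euclidean edge weights, and the symmetrization step below are illustrative choices rather than MA-DPR's exact construction.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def manifold_rank(query_vec, passage_vecs, k=5):
    """Rank passages by graph-geodesic distance from the query over a k-NN graph
    (k, Euclidean edge weights, and symmetrization are illustrative choices)."""
    X = np.vstack([query_vec[None, :], passage_vecs])        # node 0 is the query
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    adj = np.full_like(d, np.inf)                             # inf = no edge
    for i in range(len(X)):
        nn = np.argsort(d[i])[1:k + 1]                        # k nearest, skipping self
        adj[i, nn] = d[i, nn]
    adj = np.minimum(adj, adj.T)                              # make the graph undirected
    geo = shortest_path(adj, method="D", directed=False, indices=[0])[0]
    return np.argsort(geo[1:])                                # passage indices, nearest first

rng = np.random.default_rng(0)
print(manifold_rank(rng.standard_normal(8), rng.standard_normal((20, 8))))
```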
pdf
bib
abs
LLM-Guided Co-Training for Text Classification
Md Mezbaur Rahman
|
Cornelia Caragea
In this paper, we introduce a novel weighted co-training approach that is guided by Large Language Models (LLMs). Namely, in our co-training approach, we use LLM labels on unlabeled data as target labels and co-train two encoder-only based networks that train each other over multiple iterations: first, all samples are forwarded through each network and historical estimates of each network’s confidence in the LLM label are recorded; second, a dynamic importance weight is derived for each sample according to each network’s belief (or confidence) in the quality of the LLM label for that sample; finally, the two networks exchange importance weights with each other—each network back-propagates all samples weighted with the importance weights coming from its peer network and updates its own parameters. By strategically utilizing LLM-generated guidance, our approach significantly outperforms conventional SSL methods, particularly in settings with abundant unlabeled data. Empirical results show that it achieves state-of-the-art performance on 4 out of 5 benchmark datasets and ranks first among 14 compared methods according to the Friedman test. Our results highlight a new direction in semi-supervised learning—where LLMs serve as knowledge amplifiers, enabling backbone co-training models to achieve SOTA performance efficiently.
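The exchange of importance weights can be pictured as a per-sample weighted cross-entropy on the LLM labels, where each network's loss is weighted by its peer's confidence in those labels. The normalization used below is a simplified stand-in for the historical, dynamically derived weights described above.

```python
import torch
import torch.nn.functional as F

def peer_weighted_loss(logits, llm_labels, peer_confidence):
    """Cross-entropy on LLM-provided labels, weighted per sample by the peer network's
    confidence in those labels. The simple normalization is a stand-in for the
    dynamically derived historical importance weights described above."""
    per_sample = F.cross_entropy(logits, llm_labels, reduction="none")
    weights = peer_confidence / (peer_confidence.sum() + 1e-8)
    return (weights * per_sample).sum()

loss = peer_weighted_loss(torch.randn(4, 3),                  # 4 samples, 3 classes
                          torch.tensor([0, 2, 1, 1]),         # labels assigned by the LLM
                          torch.tensor([0.9, 0.2, 0.7, 0.5])) # peer's confidence per sample
print(loss)
```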
pdf
bib
abs
LeanK: Learnable K Cache Channel Pruning for Efficient Decoding
Yike Zhang
|
Zhiyuan He
|
Huiqiang Jiang
|
Chengruidong Zhang
|
Yuqing Yang
|
Jianyong Wang
|
Lili Qiu
Large language models (LLMs) enable long-context tasks but face efficiency challenges due to the growing key-value (KV) cache. We propose LeanK, a learning-based method that prunes unimportant key (K) cache channels by leveraging static channel sparsity. LeanK reduces GPU memory and accelerates decoding without sacrificing accuracy. Experiments demonstrate up to 70% K cache and 16%–18% V cache memory reduction, and 1.45× decoding speedup. We also provide insights into model channels and attention heads during long-context inference by analyzing the learned importance distribution. Our code is anonymously available at https://anonymous.4open.science/r/LeanK-7A87/README.md.
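The pruning step itself is easy to picture: given a static per-head, per-channel importance score, keep only the top-scoring K channels when computing attention logits. How LeanK learns those scores is not shown here, and the keep ratio and the unchanged softmax scaling in the torch sketch below are simplifying assumptions.

```python
import torch

def attn_with_pruned_k(q, k_cache, v_cache, importance, keep_ratio=0.5):
    """q: [heads, dim]; k_cache/v_cache: [heads, seq, dim]; importance: [heads, dim].
    Score attention using only the highest-importance K channels per head; the full
    V cache is kept here and the softmax scaling is left unchanged (simplifications)."""
    heads, seq, dim = k_cache.shape
    keep = max(1, int(dim * keep_ratio))
    idx = importance.topk(keep, dim=-1).indices                          # [heads, keep]
    q_kept = torch.gather(q, -1, idx)                                    # [heads, keep]
    k_kept = torch.gather(k_cache, -1, idx.unsqueeze(1).expand(heads, seq, keep))
    scores = torch.einsum("hk,hsk->hs", q_kept, k_kept) / dim ** 0.5     # [heads, seq]
    return torch.einsum("hs,hsd->hd", scores.softmax(dim=-1), v_cache)   # [heads, dim]

h, s, d = 4, 16, 64
out = attn_with_pruned_k(torch.randn(h, d), torch.randn(h, s, d),
                         torch.randn(h, s, d), torch.rand(h, d))
print(out.shape)  # torch.Size([4, 64])
```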
pdf
bib
abs
DELOC: Document Element Localizer
Hammad Ayyubi
|
Puneet Mathur
|
Mehrab Tanjim
|
Vlad I Morariu
Editing documents and PDFs using natural language instructions is desirable for many reasons – ease of use, increasing accessibility to non-technical users, and for creativity. To do this automatically, a system needs to first understand the user’s intent and convert this to an executable plan or command, and then the system needs to identify or localize the elements that the user desires to edit. While there exist methods that can accomplish these tasks, a major bottleneck in these systems is the inability to ground the spatial edit location effectively. We address this gap through our proposed system, DELOC (Document Element LOCalizer). DELOC adapts the grounding capabilities of an existing Multimodal Large Language Model (MLLM) from natural images to PDFs. This adaptation involves two novel contributions: 1) synthetically generating PDF-grounding instruction tuning data from partially annotated datasets; and 2) synthetic data cleaning via Code-NLI, an NLI-inspired process to clean data using generated Python code. The effectiveness of DELOC is apparent in the >3x zero-shot improvement it achieves over the next best Multimodal LLM, GPT-4o.
pdf
bib
abs
NL2Lean: Translating Natural Language into Lean 4 through Multi-Aspect Reinforcement Learning
Yue Fang
|
Shaohan Huang
|
Xin Yu
|
Haizhen Huang
|
Zihan Zhang
|
Weiwei Deng
|
Furu Wei
|
Feng Sun
|
Qi Zhang
|
Zhi Jin
Translating natural language into formal language such as Lean 4 has gained attention for its potential to automate formal proof development. Automated methods provide a scalable and cost-effective alternative to manual formalization, driving increasing interest in this task. However, existing LLMs mainly rely on instruction tuning and lack fine-grained structural and semantic alignment, making it difficult to generate syntactically and logically sound formal proofs. To address this, we propose a reinforcement learning framework, ReLean, that enables LLMs to generate high-quality Lean 4 statements from natural language. We first fine-tune a LLaMA3-8B model on NL–Lean 4 data to obtain a base translator with basic translation ability. Then, we design a multi-aspect dense reward mechanism covering four key dimensions: semantic alignment, term-level alignment, global-level alignment, and compile-checking. Separate reward models are trained via preference modeling, and their normalized outputs are combined to guide optimization via PPO. Finally, a curriculum learning strategy based on multi-dimensional difficulty allows the model to learn progressively from simple to complex cases. Experiments on NL-to-Lean 4 tasks show that our method consistently outperforms baseline models. Further analysis of the reward models and curriculum learning confirms their effectiveness in enhancing model performance.
pdf
bib
abs
A Multilingual, Culture-First Approach to Addressing Misgendering in LLM Applications
Sunayana Sitaram
|
Adrian de Wynter
|
Isobel McCrum
|
Qilong Gu
|
Si-Qing Chen
Misgendering is the act of referring to someone by a gender that does not match their chosen identity. It marginalizes and undermines a person’s sense of self, causing significant harm. English-based approaches have clear-cut approaches to avoiding misgendering, such as the use of the pronoun “they”. However, other languages pose unique challenges due to both grammatical and cultural constructs. In this work we develop methodologies to assess and mitigate misgendering across 42 languages and dialects using a participatory-design approach to design effective and appropriate guardrails across all languages. We test these guardrails in a standard LLM-based application (meeting transcript summarization), where both the data generation and the annotation steps followed a human-in-the-loop approach. We find that the proposed guardrails are very effective in reducing misgendering rates across all languages in the summaries generated, and without incurring loss of quality. Our human-in-the-loop approach demonstrates a method to feasibly scale inclusive and responsible AI-based solutions across multiple languages and cultures. We release the guardrails and synthetic dataset encompassing 42 languages, along with human and LLM-judge evaluations, to encourage further research on this subject.
pdf
bib
abs
X-CoT: Explainable Text-to-Video Retrieval via LLM-based Chain-of-Thought Reasoning
Prasanna Reddy Pulakurthi
|
Jiamian Wang
|
Majid Rabbani
|
Sohail Dianat
|
Raghuveer Rao
|
Zhiqiang Tao
Prevalent text-to-video retrieval systems mainly adopt embedding models for feature extraction and compute cosine similarities for ranking. However, this design presents two limitations. Low-quality text-video data pairs could compromise the retrieval, yet are hard to identify and examine. Cosine similarity alone provides no explanation for the ranking results, limiting the interpretability. We ask: can we interpret the ranking results, so as to assess the retrieval models and examine the text-video data? This work proposes X-CoT, an explainable retrieval framework built upon LLM CoT reasoning in place of embedding model-based similarity ranking. We first expand the existing benchmarks with additional video annotations to support semantic understanding and reduce data bias. We also devise a retrieval CoT consisting of pairwise comparison steps, yielding detailed reasoning and a complete ranking. X-CoT empirically improves the retrieval performance and produces detailed rationales. It also facilitates analysis of model behavior and data quality. Code and data are available at: https://github.com/PrasannaPulakurthi/X-CoT.
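The pairwise-comparison ranking can be pictured with a tiny round-robin routine: an LLM judge compares candidates two at a time and candidates are ordered by wins. The `compare` callable below is a placeholder for such a chain-of-thought judgment, and win-counting is an illustrative aggregation rather than X-CoT's actual procedure.

```python
def pairwise_rank(candidates, compare):
    """Round-robin ranking: `compare(a, b)` stands in for an LLM chain-of-thought
    judgment returning the preferred candidate; items are ordered by win count."""
    wins = {c: 0 for c in candidates}
    for i, a in enumerate(candidates):
        for b in candidates[i + 1:]:
            wins[compare(a, b)] += 1
    return sorted(candidates, key=lambda c: wins[c], reverse=True)

# Toy judge (lexicographic preference) just to show the plumbing
print(pairwise_rank(["video_b", "video_a", "video_c"], lambda a, b: min(a, b)))
```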
pdf
bib
abs
Token-level Proximal Policy Optimization for Query Generation
Yichen Ouyang
|
Lu Wang
|
Fangkai Yang
|
Pu Zhao
|
Chenghua Huang
|
Jianfeng Liu
|
Bochen Pang
|
Yaming Yang
|
Yuefeng Zhan
|
Hao Sun
|
Qingwei Lin
|
Saravan Rajmohan
|
Weiwei Deng
|
Dongmei Zhang
|
Feng Sun
Query generation is a critical task for web search engines (e.g. Google, Bing) and recommendation systems. Recently, state-of-the-art query generation methods leverage Large Language Models (LLMs) for their strong capabilities in context understanding and text generation. However, they still face challenges in generating high-quality queries in terms of inferring user intent from their web search interaction history. In this paper, we propose Token-level Proximal Policy Optimization (TPPO), a novel approach designed to empower LLMs to perform better in query generation through fine-tuning. TPPO is based on the Reinforcement Learning from AI Feedback (RLAIF) paradigm, consisting of a token-level reward model and a token-level proximal policy optimization module to address the sparse reward challenge in traditional RLAIF frameworks. We conducted experiments on both an open-source dataset and an industrial dataset collected from a globally-used search engine, demonstrating that TPPO significantly improves the performance of query generation for LLMs and outperforms its existing competitors.
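The contrast between a sparse sequence-level reward and a dense token-level reward can be illustrated with a short sketch. This is not the paper's implementation; the token scores stand in for outputs of a hypothetical token-level reward model.

```python
# Sparse RLAIF reward: the whole score arrives at the final token.
def sparse_rewards(tokens, sequence_score):
    return [0.0] * (len(tokens) - 1) + [sequence_score]

# Token-level reward (TPPO-style, as described in the abstract): every token
# receives its own reward from a token-level reward model (scores assumed here).
def token_level_rewards(tokens, token_scores):
    assert len(tokens) == len(token_scores)
    return token_scores

query_tokens = ["best", "hiking", "boots", "2024"]
print(sparse_rewards(query_tokens, 0.9))                        # [0.0, 0.0, 0.0, 0.9]
print(token_level_rewards(query_tokens, [0.2, 0.8, 0.7, 0.5]))  # dense per-token signal
```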
pdf
bib
abs
Prior Prompt Engineering for Reinforcement Fine-Tuning
Pittawat Taveekitworachai
|
Potsawee Manakul
|
Sarana Nutanong
|
Kunat Pipatanakul
This paper investigates prior prompt engineering (pPE) in the context of reinforcement fine-tuning (RFT), where language models (LMs) are incentivized to exhibit behaviors that maximize performance through reward signals. While existing RFT research has primarily focused on algorithms, reward shaping, and data curation, the design of the prior prompt–the instructions prepended to queries during training to elicit behaviors such as step-by-step reasoning–remains underexplored. We investigate whether different pPE approaches can guide LMs to internalize distinct behaviors after RFT. Inspired by inference-time prompt engineering (iPE), we translate five representative iPE strategies–reasoning, planning, code-based reasoning, knowledge recall, and null-example utilization–into corresponding pPE approaches. We experiment with Qwen2.5-7B using each of the pPE approaches, then evaluate performance on in-domain and out-of-domain benchmarks (e.g., AIME2024, HumanEval+, and GPQA-Diamond). Our results show that all pPE-trained models surpass their iPE-prompted counterparts, with the null-example pPE approach achieving the largest average performance gain and the highest improvement on AIME2024 and GPQA-Diamond, surpassing the commonly used reasoning approach. Furthermore, by adapting a behavior-classification framework, we demonstrate that different pPE strategies instill distinct behavioral styles in the resulting models. These findings position pPE as a powerful yet understudied axis for RFT.
pdf
bib
abs
Beyond WER: Probing Whisper’s Sub‐token Decoder Across Diverse Language Resource Levels
Siyu Liang
|
Nicolas Ballier
|
Gina-Anne Levow
|
Richard Wright
While large multilingual automatic speech recognition (ASR) models achieve remarkable performance, the internal mechanisms of the end-to-end pipeline, particularly concerning fairness and efficacy across languages, remain underexplored. This paper introduces a fine-grained analysis of Whisper’s multilingual decoder, examining its sub-token hypotheses during transcription across languages with various resource levels. Our method traces the beam search path, capturing sub-token guesses and their associated probabilities. Results reveal that higher-resource languages benefit from a higher likelihood of the correct token being top-ranked, greater confidence, lower predictive entropy, and more diverse alternative candidates. Lower-resource languages fare worse on these metrics, but also exhibit distinct clustering patterns in sub-token usage, sometimes influenced by typology, in our PCA and t-SNE analyses. This sub-token probing uncovers systematic decoding disparities masked by aggregate error rates and points towards targeted interventions to ameliorate the imbalanced development of speech technology.
pdf
bib
abs
ThinkTuning: Instilling Cognitive Reflections without Distillation
Aswin Rrv
|
Jacob Dineen
|
Divij Handa
|
Md Nayem Uddin
|
Mihir Parmar
|
Chitta Baral
|
Ben Zhou
Recent advances in test-time scaling have led to the emergence of thinking LLMs that exhibit self-reflective behaviors and multi-step reasoning. While RL drives this self-improvement paradigm, recent studies show that RL alone does not truly instill these new reasoning abilities; it merely draws out behaviors already present in the base models. This raises a question: How can we train models that don’t exhibit such thinking behavior to develop it in the first place? To this end, we propose ThinkTuning, a GRPO-based interactive training approach where we augment the rollouts of a student model with guidance from a teacher model. A simple idea from classroom practice inspires our method: a teacher poses a problem, lets the student attempt an answer, then gives corrective feedback that points the mind in the right direction before showing the solution. Each piece of feedback reshapes the student’s thoughts, leading them to arrive at the correct solution. Similarly, we find that this type of implicit supervision through feedback from a teacher model of the same size improves the reasoning capabilities of the student model. In particular, on average, our method shows a 3.69% improvement over zero-shot baselines across benchmarks, and on MATH-500 and GPQA-Diamond it shows 2.08% and 3.99% improvement over the vanilla-GRPO baseline.
pdf
bib
abs
Droid: A Resource Suite for AI-Generated Code Detection
Daniil Orel
|
Indraneil Paul
|
Iryna Gurevych
|
Preslav Nakov
We present DroidCollection, the most extensive open data suite for training and evaluating machine-generated code detectors, comprising over a million code samples, seven programming languages, outputs from 43 coding models, and three real-world coding domains. Alongside fully AI-generated examples, our collection includes human-AI co-authored code, as well as adversarial examples explicitly crafted to evade detection. Subsequently, we develop DroidDetect, a suite of encoder-only detectors trained using a multi-task objective over DroidCollection. Our experiments show that existing detectors’ performance fails to generalise to diverse coding domains and programming languages outside of their narrow training data. We further demonstrate that while most detectors are easily compromised by humanising the output distributions using superficial prompting and alignment approaches, this problem can be easily remedied by training on a small number of adversarial examples. Finally, we demonstrate the effectiveness of metric learning and uncertainty-based resampling as ways to enhance detector training on possibly noisy distributions.
pdf
bib
abs
LoRACoE: Improving Large Language Model via Composition-based LoRA Expert
Guanyu Li
|
Zhiheng Xi
|
Zhihao Zhang
|
Boyang Hong
|
Tao Gui
|
Qi Zhang
|
Xuanjing Huang
The Mixture of Experts (MoE) architecture improves large language models (LLMs) by utilizing sparsely activated expert sub-networks with a routing module, but it typically demands high training cost. Previous work introduces parameter-efficient fine-tuning (PEFT) modules, e.g., LoRA, to achieve a lightweight MoE for training efficiency. However, they construct static experts by manually splitting the LoRA parameters into fixed groups, which limits flexibility and dynamism. Furthermore, this manual partitioning also hinders the effective utilization of well-initialized LoRA modules. To address these challenges, we first delve into the parameter patterns in LoRA modules, revealing that there exist task-relevant parameters concentrated along the rank dimension of the LoRA parameters. Based on this, we redesign the construction of experts and propose LoRACoE (LoRA Composition of Experts). Specifically, when confronted with a task, it dynamically builds experts based on rank-level parameter composition, i.e., experts can flexibly combine rank-level parameters in the LoRA module. Extensive experiments demonstrate that compared to other LoRA-based MoE methods, our method achieves better task performance across a broader range of tasks.
pdf
bib
abs
Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness
Tingchen Fu
|
Fazl Barez
Insensitivity to semantically-preserving variations of prompts (paraphrases) is crucial for reliable behavior and real-world deployment of large language models. However, language models exhibit significant performance degradation with semantically equivalent but differently phrased prompts, and existing solutions either depend on trial-and-error prompt engineering or require computationally expensive inference-time algorithms. In this study, built on the key insight that worst-case prompts exhibit a drift in embedding space, we present Latent Adversarial Paraphrasing (LAP), a dual-loop adversarial framework that iteratively optimizes a trainable perturbation as a “latent continuous paraphrase” and the language model’s performance on these perturbations. Extensive experiments demonstrate the effectiveness of LAP across multiple backbones on the RobustAlpaca benchmark, with a 0.5%–4% absolute improvement in worst-case win-rate.
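A minimal sketch of the dual-loop idea follows, with a toy linear model standing in for the language model and all hyperparameters assumed: the inner loop ascends the loss with respect to an embedding-space perturbation (the "latent paraphrase"), and the outer step trains the model on the perturbed input.

```python
# Assumption-laden sketch of one dual-loop latent adversarial training step.
import torch
import torch.nn as nn

model = nn.Linear(16, 2)                       # stand-in for an LM over prompt embeddings
loss_fn = nn.CrossEntropyLoss()
emb = torch.randn(4, 16)                       # prompt embeddings
labels = torch.tensor([0, 1, 0, 1])

delta = torch.zeros_like(emb, requires_grad=True)
inner_opt = torch.optim.SGD([delta], lr=0.1)
for _ in range(5):                             # inner loop: ascend the loss w.r.t. delta
    loss = -loss_fn(model(emb + delta), labels)
    inner_opt.zero_grad(); loss.backward(); inner_opt.step()
    with torch.no_grad():
        delta.clamp_(-0.5, 0.5)                # keep the "paraphrase" close to the original

outer_opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss = loss_fn(model(emb + delta.detach()), labels)   # outer step: train on the worst case
outer_opt.zero_grad(); loss.backward(); outer_opt.step()
print(float(loss))
```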
pdf
bib
abs
Pluralistic Alignment for Healthcare: A Role-Driven Framework
Jiayou Zhong
|
Anudeex Shetty
|
Chao Jia
|
Xuanrui Lin
|
Usman Naseem
As large language models are increasingly deployed in sensitive domains such as healthcare, ensuring their outputs reflect the diverse values and perspectives held across populations is critical. However, existing alignment approaches, including pluralistic paradigms like Modular Pluralism, often fall short in the health domain, where personal, cultural, and situational factors shape pluralism. Motivated by these healthcare challenges, we propose ETHOSAGENTS, the first lightweight, generalizable, pluralistic alignment approach designed to simulate diverse perspectives and values. We empirically show that it advances pluralistic alignment for all three modes across seven varying-sized open and closed models. Our findings reveal that health-related pluralism demands adaptable and normatively aware approaches, offering insights into how these models can better respect diversity in other high-stakes domains.
pdf
bib
abs
Flexible-length Text Infilling for Discrete Diffusion Models
Andrew Zhang
|
Anushka Sivakumar
|
Chia-Wei Tang
|
Chris Thomas
Discrete diffusion models are a new class of text generators that offer advantages such as bidirectional context use, parallelizable generation, and flexible prompting compared to autoregressive models. However, a critical limitation of discrete diffusion models is their inability to perform flexible-length or flexible-position text infilling without access to ground-truth positional data. We introduce DDOT (Discrete Diffusion with Optimal Transport Position Coupling), the first discrete diffusion model to overcome this challenge. DDOT jointly denoises token values and token positions, employing a novel sample-level Optimal Transport (OT) coupling. This coupling preserves relative token ordering while dynamically adjusting the positions and length of infilled segments, a capability previously missing in text diffusion. Our method is orthogonal to existing discrete text diffusion methods and is compatible with various pretrained text denoisers. Extensive experiments on text infilling benchmarks such as One-Billion-Word and Yelp demonstrate that DDOT outperforms naive diffusion baselines. Furthermore, DDOT achieves performance on par with state-of-the-art non-autoregressive models and enables significant improvements in training efficiency and flexibility.
pdf
bib
abs
Beyond the Leaderboard: Understanding Performance Disparities in Large Language Models via Model Diffing
Sabri Boughorbel
|
Fahim Dalvi
|
Nadir Durrani
|
Majd Hawasly
As fine-tuning becomes the dominant paradigm for improving large language models (LLMs), understanding what changes during this process is increasingly important. Traditional benchmarking often fails to explain _why_ one model outperforms another. In this work, we use model diffing, a mechanistic interpretability approach, to analyze the specific capability differences between Gemma-2-9b-it and a SimPO-enhanced variant. Using crosscoders, we identify and categorize latent representations that differentiate the two models. We find that SimPO-acquired latent concepts predominantly enhance safety mechanisms (+32.8%), multilingual capabilities (+43.8%), and instruction-following (+151.7%), while its additional training also reduces emphasis on model self-reference (-44.1%) and hallucination management (-68.5%). Our analysis shows that model diffing can yield fine-grained insights beyond leaderboard metrics, attributing performance gaps to concrete mechanistic capabilities. This approach offers a transparent and targeted framework for comparing LLMs.
pdf
bib
abs
Explicit Learning and the LLM in Machine Translation
Malik Marmonier
|
Rachel Bawden
|
Benoît Sagot
This study explores an LLM’s ability to learn new languages using explanations found in a grammar book—a process we term “explicit learning.” To rigorously assess this ability, we design controlled translation experiments between English and constructed languages generated—through specific cryptographic means—from Latin or French. Contrary to previous studies, our results demonstrate that LLMs do possess a measurable capacity for explicit learning. This ability, however, diminishes as the complexity of the linguistic phenomena to be learned increases. Supervised fine-tuning on ad hoc chains of thought significantly enhances LLM performance but struggles to generalize to typologically novel or more complex linguistic features. These findings point to the need for more diverse training sets and alternative fine-tuning strategies to further improve explicit learning by LLMs, benefiting low-resource languages typically described in grammar books but lacking extensive corpora.
pdf
bib
abs
Towards Language-Agnostic STIPA: Universal Phonetic Transcription to Support Language Documentation at Scale
Jacob Lee Suchardt
|
Hana El-Shazli
|
Pierluigi Cassotti
This paper explores the use of existing state-of-the-art speech recognition models (ASR) for the task of generating narrow phonetic transcriptions using the International Phonetic Alphabet (STIPA). Unlike conventional ASR systems focused on orthographic output for high-resource languages, STIPA can be used as a language-agnostic interface valuable for documenting under-resourced and unwritten languages. We introduce a new dataset for South Levantine Arabic and present the first large-scale evaluation of STIPA models across 51 language families. Additionally, we provide a use case on Sanna, a severely endangered language. Our findings show that fine-tuned ASR models can produce accurate IPA transcriptions with limited supervision, significantly reducing phonetic error rates even in extremely low-resource settings. The results highlight the potential of STIPA for scalable language documentation.
pdf
bib
abs
Beyond Pairwise: Global Zero-shot Temporal Graph Generation
Alon Eirew
|
Kfir Bar
|
Ido Dagan
Temporal relation extraction (TRE) is a fundamental task in natural language processing (NLP) that involves identifying the temporal relationships between events in a document. Despite the advances in large language models (LLMs), their application to TRE remains limited. Most existing approaches rely on pairwise classification, where event pairs are classified in isolation, leading to computational inefficiency and a lack of global consistency in the resulting temporal graph. In this work, we propose a novel zero-shot method for TRE that generates a document’s complete temporal graph in a single step, followed by temporal constraint optimization to refine predictions and enforce temporal consistency across relations. Additionally, we introduce OmniTemp, a new dataset with complete annotations for all pairs of targeted events within a document. Through experiments and analyses, we demonstrate that our method outperforms existing zero-shot approaches and offers a competitive alternative to supervised TRE models.
pdf
bib
abs
“Feels Feminine to Me”: Understanding Perceived Gendered Style through Human Annotations
Hongyu Chen
|
Neele Falk
|
Michael Roth
|
Agnieszka Falenska
In NLP, language–gender associations are commonly grounded in the author’s gender identity, inferred from their language use. However, this identity-based framing risks reinforcing stereotypes and marginalizing individuals who do not conform to normative language–gender associations. To address this, we operationalize the language–gender association as a perceived gender expression of language, focusing on how such expression is externally interpreted by humans, independent of the author’s gender identity. We present the first dataset of its kind: 5,100 human annotations of perceived gendered style—human-written texts rated on a five-point scale from very feminine to very masculine. While perception is inherently subjective, our analysis identifies textual features associated with higher agreement among annotators: formal expressions and lower emotional intensity. Moreover, annotator demographics influence their perception: women annotators are more likely to label texts as feminine, and men and non-binary annotators as masculine. Finally, feature analysis reveals that the text’s perceived gendered style is shaped by both affective and function words, partially overlapping with known patterns of language variation across gender identities. Our findings lay the groundwork for operationalizing gendered style through human annotation, while also highlighting annotators’ subjective judgments as meaningful signals to understand perception-based concepts.
pdf
bib
abs
RALS: Resources and Baselines for Romanian Automatic Lexical Simplification
Fabian Anghel
|
Cristea Petru-Theodor
|
Claudiu Creanga
|
Sergiu Nisioi
We introduce the first dataset that jointly covers both lexical complexity prediction (LCP) annotations and lexical simplification (LS) for Romanian, along with a comparison of lexical simplification approaches. We propose a methodology for ordering simplification suggestions using a pairwise ranking approximation method, arranging candidates from simple to complex based on a separate set of human judgments. In addition, we provide human lexical complexity annotations for 3,921 word samples in context. Finally, we explore several novel pipelines for complexity prediction and simplification and present the first text simplification system for Romanian.
pdf
bib
abs
How Do Social Bots Participate in Misinformation Spread? A Comprehensive Dataset and Analysis
Herun Wan
|
Minnan Luo
|
Zihan Ma
|
Guang Dai
|
Xiang Zhao
Social media platforms provide an ideal environment to spread misinformation, where social bots can accelerate the spread. This paper explores the interplay between social bots and misinformation on the Sina Weibo platform. We construct a large-scale dataset that includes annotations for both misinformation and social bots. From the misinformation perspective, the dataset is multimodal, containing 11,393 pieces of misinformation and 16,416 pieces of verified information. From the social bot perspective, this dataset contains 65,749 social bots and 345,886 genuine accounts, annotated using a weakly supervised annotator. Extensive experiments demonstrate the comprehensiveness of the dataset, the clear distinction between misinformation and real information, and the high quality of social bot annotations. Further analysis illustrates that: (i) social bots are deeply involved in information spread; (ii) misinformation with the same topics has similar content, providing the basis of echo chambers, and social bots would amplify this phenomenon; and (iii) social bots generate similar content aiming to manipulate public opinions.
pdf
bib
abs
Are Stereotypes Leading LLMs’ Zero-Shot Stance Detection?
Anthony Dubreuil
|
Antoine Gourru
|
Christine Largeron
|
Amine Trabelsi
Large Language Models inherit stereotypes from their pretraining data, leading to biased behavior toward certain social groups in many Natural Language Processing tasks, such as hateful speech detection or sentiment analysis. Surprisingly, the evaluation of this kind of bias in stance detection methods has been largely overlooked by the community. Stance Detection involves labeling a statement as being against, in favor, or neutral towards a specific target and is among the most sensitive NLP tasks, as it often relates to political leanings. In this paper, we focus on the bias of Large Language Models when performing stance detection in a zero-shot setting. We automatically annotate posts in pre-existing stance detection datasets with two attributes: dialect or vernacular of a specific group and text complexity/readability, to investigate whether these attributes influence the model’s stance detection decisions. Our results show that LLMs exhibit significant stereotypes in stance detection tasks, such as incorrectly associating pro-marijuana views with low text complexity and African American dialect with opposition to Donald Trump.
pdf
bib
abs
Multi-Modal Framing Analysis of News
Arnav Arora
|
Srishti Yadav
|
Maria Antoniak
|
Serge Belongie
|
Isabelle Augenstein
Automated frame analysis of political communication is a popular task in computational social science that is used to study how authors select aspects of a topic to frame its reception. So far, such studies have been narrow, in that they use a fixed set of pre-defined frames and focus only on the text, ignoring the visual contexts in which those texts appear. Especially for framing in the news, this leaves out valuable information about editorial choices, which include not just the written article but also accompanying photographs. To overcome such limitations, we present a method for conducting multi-modal, multi-label framing analysis at scale using large (vision-) language models. Grounding our work in framing theory, we extract latent meaning embedded in images used to convey a certain point and contrast that to the text by comparing the respective frames used. We also identify highly partisan framing of topics with issue-specific frame analysis found in prior qualitative work. We demonstrate a method for doing scalable integrative framing analysis of both text and image in news, providing a more complete picture for understanding media bias.
pdf
bib
abs
TempParaphraser: “Heating Up” Text to Evade AI-Text Detection through Paraphrasing
Junjie Huang
|
Ruiquan Zhang
|
Jinsong Su
|
Yidong Chen
The widespread adoption of large language models (LLMs) has increased the need for reliable AI-text detection. While current detectors perform well on benchmark datasets, we highlight a critical vulnerability: increasing the temperature parameter during inference significantly reduces detection accuracy. Based on this weakness, we propose TempParaphraser, a simple yet effective paraphrasing framework that simulates high-temperature sampling effects through multiple normal-temperature generations, effectively evading detection. Experiments show that TempParaphraser reduces detector accuracy by an average of 82.5% while preserving high text quality. We also demonstrate that training on TempParaphraser-augmented data improves detector robustness. All resources are publicly available at
https://github.com/HJJWorks/TempParaphraser.
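The temperature effect that the abstract exploits can be illustrated directly: dividing the next-token logits by a temperature T > 1 flattens the sampling distribution, so generated text contains more lower-likelihood tokens that detectors tend to miss. The logits below are invented for illustration only.

```python
# Illustrative sketch of temperature scaling over next-token logits.
import math

def softmax_with_temperature(logits, T=1.0):
    scaled = [x / T for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [4.0, 2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, T=1.0))   # peaked distribution
print(softmax_with_temperature(logits, T=2.0))   # flatter: rarer tokens become more likely
```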
pdf
bib
abs
ComicScene154: A Scene Dataset for Comic Analysis
Sandro Paval
|
Pascal Meißner
|
Ivan P. Yamshchikov
Comics offer a compelling yet under-explored domain for computational narrative analysis, combining text and imagery in ways distinct from purely textual or audiovisual media. We introduce ComicScene154, a manually annotated dataset of scene-level narrative arcs derived from public-domain comic books spanning diverse genres. By conceptualizing comics as an abstraction for narrative-driven, multimodal data, we highlight their potential to inform broader research on multi-modal storytelling. To demonstrate the utility of ComicScene154, we present a baseline scene segmentation pipeline, providing an initial benchmark that future studies can build upon. Our results indicate that ComicScene154 constitutes a valuable resource for advancing computational methods in multimodal narrative understanding and expanding the scope of comic analysis within the Natural Language Processing community.
pdf
bib
abs
MedLinkDE – MedDRA Entity Linking for German with Guided Chain of Thought Reasoning
Roman Christof
|
Farnaz Zeidi
|
Manuela Messelhäußer
|
Dirk Mentzer
|
Renate Koenig
|
Liam Childs
|
Alexander Mehler
In pharmacovigilance, effective automation of medical data structuring, especially linking entities to standardized terminologies such as MedDRA, is critical. This challenge is rarely addressed for German data. With MedLinkDE we address German MedDRA entity linking for adverse drug reactions in a two-step approach: (1) retrieval of medical terms with fine-tuned embedding models, followed by (2) guided chain-of-thought re-ranking using LLMs. To this end, we introduce RENOde, a German real-world MedDRA dataset consisting of reports from patients and healthcare professionals. To overcome the challenges posed by the linguistic diversity of these reports, we generate synthetic data mapping the two reporting styles of patients and healthcare professionals. Our embedding models, fine-tuned on these synthetic, quasi-personalized datasets, show competitive performance with real datasets in terms of accuracy at high top- recall, providing a robust basis for re-ranking. Our subsequent guided Chain of Thought (CoT) re-ranking, informed by MedDRA coding guidelines, improves entity linking accuracy by approximately 15% (Acc@1) compared to embedding-only strategies. In this way, our approach demonstrates the feasibility of entity linking in medical reports under the constraints of data scarcity by relying on synthetic data reflecting the different informant roles of reporting persons.
pdf
bib
abs
HookMoE: A learnable performance compensation strategy of Mixture-of-Experts for LLM inference acceleration
Cheng Longkai
|
Along He
|
Mulin Li
|
Xie Xueshuo
|
Tao Li
Mixture of Experts (MoE) architectures have emerged as a promising paradigm for scaling model capacity through top-k routing mechanisms. Although reducing the number of activated experts inherently enables inference acceleration, this efficiency gain typically comes at the cost of significant performance degradation. To address this trade-off between efficiency and performance, we propose HookMoE, a plug-and-play single-layer compensation framework that effectively restores performance using only a small post-training calibration set. Our method strategically inserts a lightweight trainable Hook module immediately preceding selected transformer blocks. In comprehensive evaluations on four popular MoE models, our method reduces the number of activated experts by more than 50% and achieves a 1.42× inference speed-up during the prefill stage, with an average performance degradation of only 2.5% across various benchmarks. Through systematic analysis, we further reveal that the upper layers require fewer active experts, offering actionable insights for refining dynamic expert selection strategies and enhancing the overall efficiency of MoE models.
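As a rough illustration of what a lightweight trainable Hook module inserted before a transformer block might look like, here is a minimal residual adapter sketch; the module shape, hidden size, and placement are assumptions rather than the paper's design.

```python
# Assumed sketch: a small trainable "hook" applied to hidden states before a (frozen) MoE block.
import torch
import torch.nn as nn

class Hook(nn.Module):
    def __init__(self, d_model: int, d_hidden: int = 64):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                  nn.Linear(d_hidden, d_model))

    def forward(self, x):
        # Residual correction added to the hidden states entering the block.
        return x + self.proj(x)

x = torch.randn(2, 16, 512)          # (batch, seq, d_model)
hook = Hook(d_model=512)
block_input = hook(x)                # would then be passed to the MoE block with fewer experts
print(block_input.shape)
```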
pdf
bib
abs
Cross-Document Cross-Lingual NLI via RST-Enhanced Graph Fusion and Interpretability Prediction
Mengying Yuan
|
WenHao Wang
|
Zixuan Wang
|
Yujie Huang
|
Kangli Wei
|
Fei Li
|
Chong Teng
|
Donghong Ji
Natural Language Inference (NLI) is a fundamental task in natural language processing. While NLI has developed many subdirections such as sentence-level NLI, document-level NLI and cross-lingual NLI, Cross-Document Cross-Lingual NLI (CDCL-NLI) remains largely unexplored. In this paper, we propose a novel paradigm: CDCL-NLI, which extends traditional NLI capabilities to multi-document, multilingual scenarios. To support this task, we construct a high-quality CDCL-NLI dataset including 25,410 instances and spanning 26 languages. To address the limitations of previous methods on the CDCL-NLI task, we further propose an innovative method that integrates RST-enhanced graph fusion with interpretability-aware prediction. Our approach leverages RST (Rhetorical Structure Theory) within heterogeneous graph neural networks for cross-document context modeling, and employs structure-aware semantic alignment based on lexical chains for cross-lingual understanding. For NLI interpretability, we develop an EDU (Elementary Discourse Unit)-level attribution framework that produces extractive explanations. Extensive experiments demonstrate our approach’s superior performance, achieving significant improvements over both conventional NLI models and large language models. Our work sheds light on the study of NLI and will bring research interest to cross-document cross-lingual context understanding, hallucination elimination, and interpretability inference. Our code and dataset are available at CDCL-NLI-link.
pdf
bib
abs
3R: Enhancing Sentence Representation Learning via Redundant Representation Reduction
Longxuan Ma
|
Xiao Wu
|
Yuxin Huang
|
Shengxiang Gao
|
Zhengtao Yu
Sentence representation learning (SRL) aims to learn sentence embeddings that conform to the semantic information of sentences. In recent years, fine-tuning methods based on pre-trained models and contrastive learning frameworks have significantly advanced the quality of sentence representations. However, within the semantic space of SRL models, both word embeddings and sentence representations derived from word embeddings exhibit substantial redundant information, which can adversely affect the precision of sentence representations. Existing approaches predominantly optimize training strategies to alleviate the redundancy problem, lacking fine-grained guidance on reducing redundant representations. This paper proposes a novel approach that dynamically identifies and reduces redundant information from a dimensional perspective, training the SRL model to redistribute semantics across different dimensions and yielding better sentence representations. Extensive experiments across seven semantic text similarity benchmarks demonstrate the effectiveness and generality of the proposed method. A comprehensive analysis of the experimental results is conducted, and the code/data will be released.
pdf
bib
abs
When Big Models Train Small Ones: Label-Free Model Parity Alignment for Efficient Visual Question Answering using Small VLMs
Abhirama Subramanyam Penamakuri
|
Navlika Singh
|
Piyush Arora
|
Anand Mishra
Large Vision-Language Models (L-VLMs) have demonstrated remarkable performance in various vision and language tasks, including Visual Question Answering (VQA). However, their high computational cost makes them impractical for resource-constrained settings and inference-heavy applications. In contrast, Small Vision-Language Models (S-VLMs) offer efficiency but suffer from a significant performance gap compared to their larger counterparts. In this work, we introduce the Model Parity Aligner (MPA), a novel framework designed to systematically improve S-VLMs by leveraging unlabeled images and effective knowledge transfer from L-VLMs. Instead of traditional knowledge distillation methods that rely on labeled training data, MPA employs a strategic parity-based approach that precisely identifies the knowledge disparities between S-VLMs and L-VLMs, and optimizes training by targeting only these disparities. We conduct extensive experiments on four diverse VQA benchmarks, namely TextVQA, ST-VQA, ChartQA, and OKVQA, each of which requires specialized reasoning capabilities such as text recognition, chart interpretation, and commonsense and factual understanding. Our results demonstrate that MPA consistently enhances the performance of S-VLMs on all benchmarks, reducing the performance gap while maintaining computational efficiency. We shall make our code and MPA-aligned models publicly available upon acceptance of this work.
pdf
bib
abs
ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom
Jingqi Zhou
|
Sheng Wang
|
Jingwei Dong
|
Kai Liu
|
Lei Li
|
Jiahui Gao
|
Jiyue Jiang
|
Lingpeng Kong
|
Chuan Wu
Large vision-language models (LVLMs) have witnessed significant progress on visual understanding tasks. However, they often prioritize language knowledge over image information on visual reasoning tasks, incurring performance degradation. To tackle this issue, we first identify the drawbacks of existing solutions (i.e., limited multi-modal reasoning capacities, and insufficient and irrelevant visual descriptions). We then decompose the visual reasoning process into two stages: proactive visual perception (i.e., eyesight) and textual reasoning (i.e., wisdom), and introduce a novel visual reasoning framework named ProReason. This framework features decoupled vision-reasoning capabilities and multi-run proactive perception. Briefly, given a multi-modal question, ProReason iterates proactive information collection and reasoning until the answer can be concluded with necessary and sufficient visual descriptions. Notably, the disassociation of capabilities allows seamless integration of existing large language models (LLMs) to compensate for the reasoning deficits of LVLMs. Our extensive experiments demonstrate that ProReason outperforms existing multi-step reasoning frameworks on various benchmarks for both open-source and closed-source models, with the average performance gain reaching 13.2%. Besides, the integration of LLMs allows ProReason to produce high-quality visual reasoning data, which empowers ProReason-distilled models (i.e., ProReason-VL and ProReason-Q3) to achieve superior performance in downstream tasks. Our insights into existing solutions and the decoupled perspective for feasible integration of LLMs illuminate future research on visual reasoning techniques, especially LLM-assisted ones. The code is available at https://github.com/lian-tian-mo-zun/Pro_Reason.
pdf
bib
abs
Extractive Fact Decomposition for Interpretable Natural Language Inference in one Forward Pass
Nicholas Popovič
|
Michael Färber
Recent works in Natural Language Inference (NLI) and related tasks, such as automated fact-checking, employ atomic fact decomposition to enhance interpretability and robustness. For this, existing methods rely on resource-intensive generative large language models (LLMs) to perform decomposition. We propose JEDI, an encoder-only architecture that jointly performs extractive atomic fact decomposition and interpretable inference without requiring generative models during inference. To facilitate training, we produce a large corpus of synthetic rationales covering multiple NLI benchmarks. Experimental results demonstrate that JEDI achieves competitive accuracy in distribution and significantly improves robustness out of distribution and in adversarial settings over models based solely on extractive rationale supervision. Our findings show that interpretability and robust generalization in NLI can be realized using encoder-only architectures and synthetic rationales.
pdf
bib
abs
Structure-Conditional Minimum Bayes Risk Decoding
Bryan Eikema
|
Anna Rutkiewicz
|
Mario Giulianelli
Minimum Bayes Risk (MBR) decoding has seen renewed interest as an alternative to traditional generation strategies. While MBR has proven effective in machine translation, where the variability of a language model’s outcome space is naturally constrained, it may face challenges in more open-ended tasks such as dialogue or instruction-following. We hypothesise that in such settings, applying MBR with standard similarity-based utility functions may result in selecting responses that are broadly representative of the model’s distribution, yet sub-optimal with respect to any particular grouping of generations that share an underlying latent structure. In this work, we introduce three lightweight adaptations to the utility function, designed to make MBR more sensitive to structural variability in the outcome space. To test our hypothesis, we curate a dataset capturing three representative types of latent structure—dialogue act, emotion, and response structure (e.g., a sentence, a paragraph, or a list)—and we propose two metrics to evaluate the structural optimality of MBR. Our analysis demonstrates that common similarity-based utility functions fall short by these metrics. In contrast, our proposed adaptations considerably improve structural optimality. Finally, we evaluate our approaches on real-world instruction-following benchmarks, AlpacaEval and MT-Bench, and show that increased structural sensitivity improves generation quality by up to 13.7 percentage points in win rate.
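For readers unfamiliar with the baseline being adapted, plain MBR decoding selects the sampled candidate with the highest expected utility against the other samples. The sketch below uses a simple token-overlap F1 as a stand-in utility, not the structure-conditional utilities proposed in the paper, and the candidate responses are invented.

```python
# Minimal sketch of standard Minimum Bayes Risk selection over sampled candidates.
from collections import Counter

def f1_overlap(a: str, b: str) -> float:
    ta, tb = Counter(a.split()), Counter(b.split())
    overlap = sum((ta & tb).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / sum(tb.values()), overlap / sum(ta.values())
    return 2 * p * r / (p + r)

def mbr_select(candidates):
    # Pick the candidate with the highest total utility against all other samples.
    scores = [sum(f1_overlap(c, o) for o in candidates if o is not c) for c in candidates]
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]

samples = ["Sure, here is a short list: apples, pears.",
           "Here is a short list: apples, pears, plums.",
           "I cannot help with that."]
print(mbr_select(samples))
```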
pdf
bib
abs
Label Set Optimization via Activation Distribution Kurtosis for Zero-Shot Classification with Generative Models
Yue Li
|
Zhixue Zhao
|
Carolina Scarton
In-context learning (ICL) performance is highly sensitive to prompt design, yet the impact of class label options (e.g. lexicon or order) in zero-shot classification remains underexplored. This study proposes LOADS (Label set Optimization via Activation Distribution kurtosiS), a post-hoc method for selecting optimal label sets in zero-shot ICL with large language models (LLMs). LOADS is built upon the observations in our empirical analysis, the first to systematically examine how label option design (i.e., lexical choice, order, and elaboration) impacts classification performance. This analysis shows that the lexical choice of the labels in the prompt (such as agree vs. support in stance classification) plays an important role in both model performance and the model’s sensitivity to label order. A further investigation demonstrates that optimal label words tend to activate fewer outlier neurons in LLMs’ feed-forward networks. LOADS then leverages kurtosis to measure the neuron activation distribution for label selection, requiring only a single forward pass without gradient propagation or labelled data. The LOADS-selected label words consistently demonstrate effectiveness for zero-shot ICL across classification tasks, datasets, models and languages, achieving a maximum performance gain from 0.54 to 0.76 compared to the conventional approach of using original dataset label words.
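The kurtosis criterion can be sketched in a few lines: given feed-forward activations collected in one forward pass per candidate label word, prefer the label whose activation distribution is less heavy-tailed (fewer outlier neurons). The activations below are synthetic placeholders, not real model activations.

```python
# Hedged sketch of kurtosis-based label selection; activations are simulated.
import numpy as np
from scipy.stats import kurtosis

def label_score(activations: np.ndarray) -> float:
    # One forward pass per label; flatten activations across layers/neurons.
    return kurtosis(activations.ravel())

rng = np.random.default_rng(0)
candidates = {
    "agree":   rng.normal(0, 1, 4096),                                           # few outliers
    "support": np.concatenate([rng.normal(0, 1, 4090), rng.normal(0, 30, 6)]),   # outlier neurons
}
best = min(candidates, key=lambda w: label_score(candidates[w]))
print(best)   # the label word with the less heavy-tailed activation profile
```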
pdf
bib
abs
The Transfer Neurons Hypothesis: An Underlying Mechanism for Language Latent Space Transitions in Multilingual LLMs
Hinata Tezuka
|
Naoya Inoue
Recent studies have suggested a processing framework for multilingual inputs in decoder-based LLMs: early layers convert inputs into English-centric and language-agnostic representations; middle layers perform reasoning within an English-centric latent space; and final layers generate outputs by transforming these representations back into language-specific latent spaces. However, the internal dynamics of such transformation and the underlying mechanism remain underexplored. Towards a deeper understanding of this framework, we propose and empirically validate **The Transfer Neurons Hypothesis**: certain neurons in the MLP module are responsible for transferring representations between language-specific latent spaces and a shared semantic latent space. Furthermore, we show that one function of language-specific neurons, as identified in recent studies, is to facilitate movement between latent spaces. Finally, we show that transfer neurons are critical for reasoning in multilingual LLMs.
pdf
bib
abs
VEHME: A Vision-Language Model For Evaluating Handwritten Mathematics Expressions
Thu Phuong Nguyen
|
Duc M. Nguyen
|
Hyotaek Jeon
|
Hyunwook Lee
|
Hyunmin Song
|
Sungahn Ko
|
Taehwan Kim
Automatically assessing handwritten mathematical solutions is an important problem in educational technology with practical applications, but remains a significant challenge due to the diverse formats, unstructured layouts, and symbolic complexity of student work. To address this challenge, we introduce VEHME, a Vision-Language Model for Evaluating Handwritten Mathematics Expressions, designed to assess open-form handwritten math responses with high accuracy and interpretable reasoning traces. VEHME integrates a two-phase training pipeline: (i) supervised fine-tuning using structured reasoning data, and (ii) reinforcement learning that aligns model outputs with multi-dimensional grading objectives, including correctness, reasoning depth, and error localization. To enhance spatial understanding, we propose an Expression-Aware Visual Prompting Module, trained on our synthesized multi-line math expressions dataset to robustly guide attention in visually heterogeneous inputs. Evaluated on AIHub and FERMAT datasets, VEHME achieves state-of-the-art performance among open-source models and approaches the accuracy of proprietary systems, demonstrating its potential as a scalable and accessible tool for automated math assessment. Our training and experiment code is publicly available at our GitHub repository.
pdf
bib
abs
All Roads Lead to Rome: Graph-Based Confidence Estimation for Large Language Model Reasoning
Caiqi Zhang
|
Chang Shu
|
Ehsan Shareghi
|
Nigel Collier
Confidence estimation is essential for the reliable deployment of large language models (LLMs). Existing methods are primarily designed for factual QA tasks and often fail to generalize to reasoning tasks. To address this gap, we propose a set of training-free, graph-based confidence estimation methods tailored to reasoning tasks. Our approach models reasoning paths as directed graphs and estimates confidence by exploiting graph properties such as centrality, path convergence, and path weighting. Experiments with two LLMs on three reasoning datasets demonstrate improved confidence estimation and enhanced performance on two downstream tasks.
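One simple way to picture the graph-based idea, under assumptions that go well beyond the abstract, is to treat each sampled reasoning chain as a path of step-to-step edges ending in an answer and to credit answers reached by heavily traversed, convergent paths. The chains and scoring rule below are invented for illustration.

```python
# Toy sketch: confidence from path convergence over a graph of reasoning steps.
from collections import defaultdict

chains = [
    (["parse", "compute 2+3", "compute 5*4"], "20"),
    (["parse", "compute 5*4 first", "then add"], "23"),
    (["parse", "compute 2+3", "compute 5*4"], "20"),
]

edge_weight = defaultdict(int)
for steps, answer in chains:
    for a, b in zip(steps, steps[1:] + [answer]):
        edge_weight[(a, b)] += 1

answer_mass = defaultdict(float)
for steps, answer in chains:
    path = list(zip(steps, steps[1:] + [answer]))
    # A path supported by frequently traversed edges contributes more confidence.
    answer_mass[answer] += sum(edge_weight[e] for e in path) / len(path)

total = sum(answer_mass.values())
print({a: round(m / total, 3) for a, m in answer_mass.items()})
```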
pdf
bib
abs
SEMMA: A Semantic Aware Knowledge Graph Foundation Model
Arvindh Arun
|
Sumit Kumar
|
Mojtaba Nayyeri
|
Bo Xiong
|
Ponnurangam Kumaraguru
|
Antonio Vergari
|
Steffen Staab
Knowledge Graph Foundation Models (KGFMs) have shown promise in enabling zero-shot reasoning over unseen graphs by learning transferable patterns. However, most existing KGFMs rely solely on graph structure, overlooking the rich semantic signals encoded in textual attributes. We introduce SEMMA, a dual-module KGFM that systematically integrates transferable textual semantics alongside structure. SEMMA leverages Large Language Models (LLMs) to enrich relation identifiers, generating semantic embeddings that subsequently form a textual relation graph, which is fused with the structural component. Across 54 diverse KGs, SEMMA outperforms purely structural baselines like ULTRA in fully inductive link prediction. Crucially, we show that in more challenging generalization settings, where the test-time relation vocabulary is entirely unseen, structural methods collapse while SEMMA is 2x more effective. Our findings demonstrate that textual semantics are critical for generalization in settings where structure alone fails, highlighting the need for foundation models that unify structural and linguistic signals in knowledge reasoning.
pdf
bib
abs
Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text
Mizanur Rahman
|
Md Tahmid Rahman Laskar
|
Shafiq Joty
|
Enamul Hoque
Automated data visualization plays a crucial role in simplifying data interpretation, enhancing decision-making, and improving efficiency. While large language models (LLMs) have shown promise in generating visualizations from natural language, the absence of comprehensive benchmarks limits the rigorous evaluation of their capabilities. We introduce Text2Vis, a benchmark designed to assess text-to-visualization models, covering 20+ chart types and diverse data science queries, including trend analysis, correlation, outlier detection, and predictive analytics. It comprises 1,985 samples, each with a data table, natural language query, short answer, visualization code, and annotated charts. The queries involve complex reasoning, conversational turns, and dynamic data retrieval. We benchmark 11 open-source and closed-source models, revealing significant performance gaps, highlighting key challenges, and offering insights for future advancements. To close this gap, we propose the first cross-modal actor-critic agentic framework that jointly refines the textual answer and visualization code, increasing GPT-4o’s pass rate from 26% to 42% over the direct approach and improving chart quality. We also introduce an automated LLM-based evaluation framework that enables scalable assessment across thousands of samples without human annotation, measuring answer correctness, code execution success, visualization readability, and chart accuracy. We release Text2Vis at <redacted>.
pdf
bib
abs
Predicting Prosodic Boundaries for Children’s Texts
Mansi Dhamne
|
Sneha Raman
|
Preeti Rao
Reading fluency in any language requires accurate word decoding but also natural prosodic phrasing, i.e., the grouping of words into rhythmically and syntactically coherent units. This holds for both reading aloud and silent reading. While adults pause meaningfully at clause or punctuation boundaries, children aged 8-13 often insert inappropriate pauses due to limited breath control and underdeveloped prosodic awareness. We present a text-based model to predict cognitively appropriate pause locations in children’s reading material. Using a curated dataset of 54 leveled English stories annotated for potential pauses, or prosodic boundaries, by 21 fluent speakers, we find that nearly 30% of pauses occur at non-punctuation locations of the text, highlighting the limitations of using only punctuation-based cues. Our model combines lexical, syntactic, and contextual features with a novel breath duration feature that captures syllable load since the last major boundary. This cognitively motivated approach can model both allowed and “forbidden” pauses. The proposed framework supports applications such as child-directed TTS and oral reading fluency assessment, where the proper grouping of words is considered critical to reading comprehension.
pdf
bib
abs
Enhancing Logical Reasoning in Language Models via Symbolically-Guided Monte Carlo Process Supervision
Xingwei Tan
|
Marco Valentino
|
Mahmud Elahi Akhter
|
Maria Liakata
|
Nikolaos Aletras
Large language models (LLMs) have shown strong performance in many reasoning benchmarks. However, recent studies have pointed to memorization, rather than generalization, as one of the leading causes for such performance. LLMs, in fact, are susceptible to content variations, demonstrating a lack of robust planning or symbolic abstractions supporting their reasoning process. To improve reliability, many attempts have been made to combine LLMs with symbolic methods. Nevertheless, existing approaches fail to effectively leverage symbolic representations due to the challenges involved in developing reliable and scalable verification mechanisms. In this paper, we propose to overcome such limitations by synthesizing high-quality symbolic reasoning trajectories with stepwise pseudo-labels at scale via Monte Carlo estimation. A Process Reward Model (PRM) can be efficiently trained based on the synthesized data and then used to select more symbolic trajectories. The trajectories are then employed with Direct Preference Optimization (DPO) and Supervised Fine-Tuning (SFT) to improve logical reasoning and generalization. Our results on benchmarks (i.e., FOLIO and LogicAsker) show the effectiveness of the proposed method with gains on frontier and open-weight models. Moreover, additional experiments on claim verification data reveal that fine-tuning on the generated symbolic reasoning trajectories enhances out-of-domain generalizability, suggesting the potential impact of the proposed method in enhancing planning and logical reasoning.
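A toy sketch of Monte Carlo step labeling in the spirit the abstract describes: score each partial reasoning prefix by the fraction of sampled continuations that reach a correct answer, and use that fraction as the stepwise pseudo-label for training a Process Reward Model. The rollout function and quality values below are stand-ins, not the paper's procedure.

```python
# Illustrative sketch of Monte Carlo estimation of stepwise pseudo-labels.
import random

def rollout(prefix_quality: float) -> bool:
    # Stand-in for "continue the trajectory from this prefix and check the final answer".
    return random.random() < prefix_quality

def mc_step_label(prefix_quality: float, n_rollouts: int = 32) -> float:
    return sum(rollout(prefix_quality) for _ in range(n_rollouts)) / n_rollouts

random.seed(0)
trajectory = {"step 1": 0.9, "step 2 (flawed)": 0.3, "step 3": 0.2}   # assumed latent quality
labels = {step: mc_step_label(q) for step, q in trajectory.items()}
print(labels)   # would serve as training targets for a Process Reward Model
```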
pdf
bib
abs
Can Large Language Models Outperform Non-Experts in Poetry Evaluation? A Comparative Study Using the Consensual Assessment Technique
Piotr Sawicki
|
Marek Grzes
|
Dan Brown
|
Fabricio Goes
This study adapts the Consensual Assessment Technique (CAT) for Large Language Models (LLMs), introducing a novel methodology for poetry evaluation. Using a 90-poem dataset with a ground truth based on publication venue, we demonstrate that this approach allows LLMs to significantly surpass the performance of non-expert human judges. Our method, which leverages forced-choice ranking within small, randomized batches, enabled Claude-3-Opus to achieve a Spearman’s Rank Correlation of 0.87 with the ground truth, dramatically outperforming the best human non-expert evaluation (SRC = 0.38). The LLM assessments also exhibited high inter-rater reliability, underscoring the methodology’s robustness. These findings establish that LLMs, when guided by a comparative framework, can be effective and reliable tools for assessing poetry, paving the way for their broader application in other creative domains.
pdf
bib
abs
Beyond Human Labels: A Multi-Linguistic Auto-Generated Benchmark for Evaluating Large Language Models on Resume Parsing
Zijian Ling
|
Han Zhang
|
Jiahao Cui
|
Zhequn Wu
|
Xu Sun
|
Guohao Li
|
Xiangjian He
Efficient resume parsing is critical for global hiring, yet the absence of dedicated benchmarks for evaluating large language models (LLMs) on multilingual, structure-rich resumes hinders progress. To address this, we introduce ResumeBench, the first privacy-compliant benchmark comprising 2,500 synthetic resumes spanning 50 templates, 30 career fields, and 5 languages. These resumes are generated through a human-in-the-loop pipeline that prioritizes realism, diversity, and privacy compliance, which are validated against real-world resumes. This paper evaluates 24 state-of-the-art LLMs on ResumeBench, revealing substantial variations in handling resume complexities. Specifically, top-performing models like GPT-4o exhibit challenges in cross-lingual structural alignment while smaller models show inconsistent scaling effects. Code-specialized LLMs underperform relative to generalists, while JSON outputs enhance schema compliance but fail to address semantic ambiguities. Our findings underscore the necessity for domain-specific optimization and hybrid training strategies to enhance structural and contextual reasoning in LLMs.
pdf
bib
abs
Orthogonal Finetuning Made Scalable
Zeju Qiu
|
Weiyang Liu
|
Adrian Weller
|
Bernhard Schölkopf
Orthogonal finetuning (OFT) offers highly parameter-efficient adaptation while preventing catastrophic forgetting, but its high runtime and memory demands limit practical deployment. We identify the core computational bottleneck in OFT as its weight-centric implementation, which relies on costly matrix-matrix multiplications with cubic complexity. To overcome this, we propose OFTv2, an input-centric reformulation that instead uses matrix-vector multiplications (i.e., matrix-free computation), reducing the computational cost to quadratic. We further introduce the Cayley–Neumann parameterization, an efficient orthogonal parameterization that approximates the matrix inversion in the Cayley transform via a truncated Neumann series. These modifications allow OFTv2 to achieve up to 10x faster training and 3x lower GPU memory usage without compromising performance. In addition, we extend OFTv2 to support finetuning quantized foundation models and show that it outperforms the popular QLoRA in training stability, efficiency, and memory usage.
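The two ingredients named in the abstract can be sketched together, with the exact parameterization details assumed: approximate the Cayley transform R = (I - Q)^-1 (I + Q) of a skew-symmetric Q by a truncated Neumann series, and apply it to an input using only matrix-vector products.

```python
# Assumed sketch of a Cayley-Neumann, matrix-free application of an orthogonal transform.
import torch

def cayley_neumann_apply(Q: torch.Tensor, x: torch.Tensor, terms: int = 10) -> torch.Tensor:
    """Approximate R @ x for R = (I - Q)^-1 (I + Q) using only matrix-vector products."""
    y = x + Q @ x                      # (I + Q) x
    out, power = y.clone(), y.clone()
    for _ in range(terms - 1):         # (I - Q)^-1 y ~ (I + Q + Q^2 + ...) y, valid for ||Q|| < 1
        power = Q @ power
        out = out + power
    return out

d = 8
A = torch.randn(d, d)
Q = 0.05 * (A - A.T)                   # small skew-symmetric parameter keeps the series convergent
x = torch.randn(d)
approx = cayley_neumann_apply(Q, x)
exact = torch.linalg.solve(torch.eye(d) - Q, (torch.eye(d) + Q) @ x)
print(float((approx - exact).abs().max()))   # tiny approximation error
```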
pdf
bib
abs
AIR: Complex Instruction Generation via Automatic Iterative Refinement
Wei Liu
|
Yancheng He
|
Yu Li
|
Hui Huang
|
Chengwei Hu
|
Jiaheng Liu
|
Shilong Li
|
Wenbo Su
|
Bo Zheng
With the development of large language models, their ability to follow simple instructions has significantly improved. However, adhering to complex instructions remains a major challenge. Current approaches to generating complex instructions are often irrelevant to the current instruction requirements or suffer from limited scalability and diversity. Moreover, methods such as back-translation, while effective for simple instruction generation, fail to leverage the rich knowledge and formatting in human written documents. In this paper, we propose a novel **A**utomatic **I**terative **R**efinement (**AIR**) framework to generate complex instructions with constraints, which not only better reflects the requirements of real scenarios but also significantly enhances LLMs’ ability to follow complex instructions. The AIR framework consists of two stages: 1) Generate an initial instruction from a document; 2) Iteratively refine instructions with LLM-as-judge guidance by comparing the model’s output with the document to incorporate valuable constraints. Finally, we construct the AIR-10K dataset with 10K complex instructions and demonstrate that instructions generated with our approach significantly improve the model’s ability to follow complex instructions, outperforming existing methods for instruction generation.
pdf
bib
abs
SQUiD: Synthesizing Relational Databases from Unstructured Text
Mushtari Sadia
|
Zhenning Yang
|
Yunming Xiao
|
Ang Chen
|
Amrita Roy Chowdhury
Relational databases are central to modern data management, yet most data exists in unstructured forms like text documents. To bridge this gap, we leverage large language models (LLMs) to automatically synthesize a relational database by generating its schema and populating its tables from raw text. We introduce SQUiD, a novel neurosymbolic framework that decomposes this task into four stages, each with specialized techniques. Our experiments show that SQUiD consistently outperforms baselines across diverse datasets. Our code and datasets are publicly available at: https://github.com/Mushtari-Sadia/SQUiD.
pdf
bib
abs
RAG+: Enhancing Retrieval-Augmented Generation with Application-Aware Reasoning
Yu Wang
|
Shiwan Zhao
|
Zhihu Wang
|
Ming Fan
|
Xicheng Zhang
|
Yubo Zhang
|
Zhengfan Wang
|
Heyuan Huang
|
Ting Liu
The integration of external knowledge through Retrieval-Augmented Generation (RAG) has become foundational in enhancing large language models (LLMs) for knowledge-intensive tasks. However, existing RAG paradigms often overlook the cognitive step of applying knowledge, leaving a gap between retrieved facts and task-specific reasoning. In this work, we introduce RAG+, a principled and modular extension that explicitly incorporates application-aware reasoning into the RAG pipeline. RAG+ constructs a dual corpus consisting of knowledge and aligned application examples, created either manually or automatically, and jointly retrieves both during inference. This design enables LLMs not only to access relevant information but also to apply it within structured, goal-oriented reasoning processes. Experiments across mathematical, law, and medical domains, conducted on multiple models, demonstrate that RAG+ consistently outperforms standard RAG variants, achieving average improvements of 3–5%, and peak gains up to 13.5% in complex scenarios. By bridging retrieval with actionable application, RAG+ advances a more cognitively grounded framework for knowledge integration, representing a step toward more interpretable and capable LLMs.
pdf
bib
abs
Rapid Word Learning Through Meta In-Context Learning
Wentao Wang
|
Guangyuan Jiang
|
Tal Linzen
|
Brenden Lake
Humans can quickly learn a new word from a few illustrative examples, and then systematically and flexibly use it in novel contexts. Yet the abilities of current language models for few-shot word learning, and methods for improving these abilities, are underexplored. In this study, we introduce a novel method, Meta-training for IN-context learNing Of Words (Minnow). This method trains language models to generate new examples of a word’s usage given a few in-context examples, using a special placeholder token to represent the new word. This training is repeated on many new words to develop a general word-learning ability. We find that training models from scratch with Minnow on human-scale child-directed language enables strong few-shot word learning, comparable to a large language model (LLM) pre-trained on orders of magnitude more data. Furthermore, through discriminative and generative evaluations, we demonstrate that finetuning pre-trained LLMs with Minnow improves their ability to discriminate between new words, identify syntactic categories of new words, and generate reasonable new usages and definitions for new words, based on one or a few in-context examples. These findings highlight the data efficiency of Minnow and its potential to improve language model performance in word learning tasks.
pdf
bib
abs
EuroGEST: Investigating gender stereotypes in multilingual language models
Jacqueline Rowe
|
Mateusz Klimaszewski
|
Liane Guillou
|
Shannon Vallor
|
Alexandra Birch
Large language models increasingly support multiple languages, yet most benchmarks for gender bias remain English-centric. We introduce EuroGEST, a dataset designed to measure gender-stereotypical reasoning in LLMs across English and 29 European languages. EuroGEST builds on an existing expert-informed benchmark covering 16 gender stereotypes, expanded in this work using translation tools, quality estimation metrics, and morphological heuristics. Human evaluations confirm that our data generation method results in high accuracy of both translations and gender labels across languages. We use EuroGEST to evaluate 24 multilingual language models from six model families, demonstrating that the strongest stereotypes in all models across all languages are that women are beautiful, empathetic and neat and men are leaders, strong, tough and professional. We also show that larger models encode gendered stereotypes more strongly and that instruction finetuned models continue to exhibit gendered stereotypes. Our work highlights the need for more multilingual studies of fairness in LLMs and offers scalable methods and resources to audit gender bias across languages.
pdf
bib
abs
How Persuasive Is Your Context?
Tu Nguyen
|
Kevin Du
|
Alexander Miserlis Hoyle
|
Ryan Cotterell
Two central capabilities of language models (LMs) are: (i) drawing on prior knowledge about entities, which allows them to answer queries such as What’s the official language of Austria?, and (ii) adapting to new information provided in context, e.g., Pretend the official language of Austria is Tagalog., that is prepended to the question. In this article, we introduce the targeted persuasion score (TPS), designed to quantify how persuasive a given context is to an LM, where persuasion is operationalized as the ability of the context to alter the LM’s answer to the question. In contrast to evaluating persuasiveness only through a model’s most likely answer, TPS provides a more fine-grained view of model behavior. Based on the Wasserstein distance, TPS measures how much a context shifts a model’s original answer distribution toward a target distribution. Empirically, through a series of experiments, we show that TPS captures a more nuanced notion of persuasiveness than previously proposed metrics.
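As a rough illustration of the idea described above, the following sketch computes a persuasion-style score from answer distributions using SciPy's 1-Wasserstein distance. The normalization, the placement of candidate answers on a 1-D support, and the function name are assumptions for illustration, not the paper's exact definition.

```python
# Hypothetical sketch of a targeted-persuasion-style score; the paper's exact
# definition may differ.  Distributions are over a small set of candidate
# answers, placed here on an arbitrary 1-D support for SciPy's
# 1-Wasserstein distance.
import numpy as np
from scipy.stats import wasserstein_distance

def targeted_persuasion_score(p_orig, p_ctx, p_target):
    """Fraction of the original-to-target Wasserstein distance that the
    context closes (1.0 = fully persuaded, 0.0 = no shift)."""
    support = np.arange(len(p_orig))          # candidate-answer indices
    d_orig = wasserstein_distance(support, support, p_orig, p_target)
    d_ctx = wasserstein_distance(support, support, p_ctx, p_target)
    return (d_orig - d_ctx) / d_orig if d_orig > 0 else 0.0

# Toy example: three candidate answers; the context moves mass toward answer 2.
p_orig = np.array([0.7, 0.2, 0.1])
p_ctx = np.array([0.2, 0.3, 0.5])
p_target = np.array([0.0, 0.0, 1.0])
print(targeted_persuasion_score(p_orig, p_ctx, p_target))   # ~0.56
```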
pdf
bib
abs
The Medium Is Not the Message: Deconfounding Document Embeddings via Linear Concept Erasure
Yu Fan
|
Yang Tian
|
Shauli Ravfogel
|
Mrinmaya Sachan
|
Elliott Ash
|
Alexander Miserlis Hoyle
Embedding-based similarity metrics between text sequences are influenced not just by the content dimensions we care about most, but also by spurious attributes like the text’s source or language. These document confounders cause problems for many applications, but especially those that need to pool texts from different corpora. This paper shows that a debiasing algorithm that removes information about observed confounders from the encoder representations substantially reduces these biases at a minimal computational cost. Document similarity and clustering metrics improve across every embedding variant and task we evaluate—often dramatically. Interestingly, performance on out-of-distribution benchmarks is not impacted, indicating that the embeddings are not otherwise degraded.
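For readers unfamiliar with linear concept erasure, here is a minimal sketch of the general idea: project out the direction a linear probe uses to predict an observed confounder from the embeddings. The paper's actual erasure algorithm and implementation details may differ; this only illustrates the mechanism.

```python
# Minimal sketch of removing a single observed confounder direction (e.g.,
# corpus source) from document embeddings by linear projection.  This is a
# generic rank-1 erasure, not necessarily the algorithm used in the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression

def erase_confounder_direction(X, confounder_labels):
    """Project out the direction a linear probe uses to predict the confounder."""
    probe = LogisticRegression(max_iter=1000).fit(X, confounder_labels)
    w = probe.coef_[0]
    w = w / np.linalg.norm(w)
    # Rank-1 projection onto the orthogonal complement of w.
    return X - np.outer(X @ w, w)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))            # toy stand-in for document embeddings
source = (X[:, 0] > 0).astype(int)        # toy confounder correlated with dim 0
X_debiased = erase_confounder_direction(X, source)
```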
pdf
bib
abs
Measuring scalar constructs in social science with LLMs
Hauke Licht
|
Rupak Sarkar
|
Patrick Y. Wu
|
Pranav Goel
|
Niklas Stoehr
|
Elliott Ash
|
Alexander Miserlis Hoyle
Many constructs that characterize language, like its complexity or emotionality, have a naturally continuous semantic structure; a public speech is not just “simple” or “complex”, but exists on a continuum between extremes. Although large language models (LLMs) are an attractive tool for measuring scalar constructs, their idiosyncratic treatment of numerical outputs raises questions of how to best apply them. We address these questions with a comprehensive evaluation of LLM-based approaches to scalar construct measurement in social science. Using multiple datasets sourced from the political science literature, we evaluate four approaches: unweighted direct pointwise scoring, aggregation of pairwise comparisons, token-probability-weighted pointwise scoring, and finetuning. Our study finds that pairwise comparisons made by LLMs produce better measurements than simply prompting the LLM to directly output the scores, which suffers from bunching around arbitrary numbers. However, taking the weighted mean over the token probability of scores further improves the measurements over the two previous approaches. Finally, finetuning smaller models with as few as 1,000 training pairs can match or exceed the performance of prompted LLMs.
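The token-probability-weighted pointwise scoring mentioned above can be illustrated with a tiny sketch: instead of keeping the single score token the model emits, take the expectation of the score under the model's probabilities over the score tokens. The dictionary-based interface below is illustrative only.

```python
# Sketch of token-probability-weighted pointwise scoring: the measurement is
# the expected score under the model's probabilities for each score token.
def weighted_score(score_token_probs):
    """score_token_probs: dict mapping integer scores (e.g. 1..7) to the
    model's probability of emitting that score token."""
    total = sum(score_token_probs.values())
    return sum(s * p for s, p in score_token_probs.items()) / total

# Toy example: probability mass bunched around 6 and 7.
print(weighted_score({5: 0.1, 6: 0.55, 7: 0.35}))   # -> 6.25
```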
pdf
bib
abs
Text Detoxification: Data Efficiency, Semantic Preservation and Model Generalization
Jing Yu
|
Yibo Zhao
|
Jiapeng Zhu
|
Wenming Shao
|
Bo Pang
|
Zhao Zhang
|
Xiang Li
The widespread dissemination of toxic content on social media poses a serious threat to both online environments and public discourse, highlighting the urgent need for detoxification methods that effectively remove toxicity while preserving the original semantics. However, existing approaches often struggle to simultaneously achieve strong detoxification performance, semantic preservation, and robustness to out-of-distribution data. Moreover, they typically rely on costly, manually annotated parallel corpora while showing poor data efficiency. To address these challenges, we propose GEM, a two-stage training framework that jointly optimizes Model Generalization, Data Efficiency, and Semantic Preservation. We first perform supervised fine-tuning on a small set of high-quality, filtered parallel data to establish a strong initialization. Then, we leverage unlabeled toxic inputs and a custom-designed reward model to train the LLM using Group Relative Policy Optimization. Experimental results demonstrate that our method effectively mitigates the trade-offs faced by previous work, achieving state-of-the-art performance with improved generalization and significantly reduced dependence on annotated data. Our code is available at https://github.com/allacnobug/Detoxification-of-Text.
pdf
bib
abs
Not What the Doctor Ordered: Surveying LLM-based De-identification and Quantifying Clinical Information Loss
Kiana Aghakasiri
|
Noopur Zambare
|
JoAnn Thai
|
Carrie Ye
|
Mayur Mehta
|
J Ross Mitchell
|
Mohamed Abdalla
De-identification in the healthcare setting is an application of NLP where automated algorithms are used to remove personally identifying information of patients (and, sometimes, providers). With the recent rise of generative large language models (LLMs), there has been a corresponding rise in the number of papers that apply LLMs to de-identification. Although these approaches often report near-perfect results, significant challenges concerning reproducibility and utility of the research papers persist. This paper identifies three key limitations in the current literature: inconsistent reporting metrics hindering direct comparisons, the inadequacy of traditional classification metrics in capturing errors which LLMs may be more prone to (i.e., altering clinically relevant information), and lack of manual validation of automated metrics which aim to quantify these errors. To address these issues, we first present a survey of LLM-based de-identification research, highlighting the heterogeneity in reporting standards. Second, we evaluate a diverse set of models to quantify the extent of inappropriate removal of clinical information. Next, we conduct a manual validation of an existing evaluation metric to measure the removal of clinical information, employing clinical experts to assess its efficacy. We highlight poor performance and describe the inherent limitations of such metrics in identifying clinically significant changes. Lastly, we propose a novel methodology for the detection of clinically relevant information removal.
pdf
bib
abs
Reasoning under Uncertainty: Efficient LLM Inference via Unsupervised Confidence Dilution and Convergent Adaptive Sampling
Zhenning Shi
|
Yijia Zhu
|
Yi Xie
|
Junhan Shi
|
Guorui Xie
|
Haotian Zhang
|
Yong Jiang
|
Congcong Miao
|
Qing Li
Large language models (LLMs) excel at complex reasoning tasks but often suffer from overconfidence and computational inefficiency due to fixed computation budgets and miscalibrated confidence estimates. We present a novel framework for computationally efficient, trustworthy reasoning under uncertainty, introducing two complementary techniques: Diversity-Aware Self-Signal Dilution (DASD) and Convergent Adaptive Weighted Sampling (CAWS). DASD operates in an unsupervised manner to dilute overconfident, semantically redundant reasoning paths, thereby producing better-calibrated internal confidence estimates. CAWS dynamically allocates computational resources at inference time by aggregating these signals and terminating computation once answer dominance and stability are achieved. Comprehensive experiments across three reasoning datasets demonstrate that our approach maintains accuracy levels while achieving over 70% reduction in inference cost, surpassing competitive baselines. Our framework provides a scalable, unsupervised solution for reliable and efficient LLM reasoning.
pdf
bib
abs
Africa Health Check: Probing Cultural Bias in Medical LLMs
Charles Nimo
|
Shuheng Liu
|
Irfan Essa
|
Michael L. Best
Large language models (LLMs) are increasingly deployed in global healthcare, yet their outputs often reflect Western-centric training data and omit indigenous medical systems and region-specific treatments. This study investigates cultural bias in instruction-tuned medical LLMs using a curated dataset of African traditional herbal medicine. We evaluate model behavior across two complementary tasks, namely, multiple-choice questions and fill-in-the-blank completions, designed to capture both treatment preferences and responsiveness to cultural context. To quantify outcome preferences and prompt influences, we apply two complementary metrics: Cultural Bias Score (CBS) and Cultural Bias Attribution (CBA). Our results show that while prompt adaptation can reduce inherent bias and enhance cultural alignment, models vary in how responsive they are to contextual guidance. Persistent default to allopathic (Western) treatments in zero-shot scenarios suggests that many biases remain embedded in model training. These findings underscore the need for culturally informed evaluation strategies to guide the development of AI systems that equitably serve diverse global health contexts. By releasing our dataset and providing a dual-metric evaluation approach, we offer practical tools for developing more culturally aware and clinically grounded AI systems for healthcare settings in the Global South.
pdf
bib
abs
Assumed Identities: Quantifying Gender Bias in Machine Translation of Gender-Ambiguous Occupational Terms
Orfeas Menis Mastromichalakis
|
Giorgos Filandrianos
|
Maria Symeonaki
|
Giorgos Stamou
Machine Translation (MT) systems frequently encounter gender-ambiguous occupational terms, where they must assign gender without explicit contextual cues. While individual translations in such cases may not be inherently biased, systematic patterns—such as consistently translating certain professions with specific genders—can emerge, reflecting and perpetuating societal stereotypes. This ambiguity challenges traditional instance-level single-answer evaluation approaches, as no single gold standard translation exists. To address this, we introduce GRAPE, a probability-based metric designed to evaluate gender bias by analyzing aggregated model responses. Alongside this, we present GAMBIT, a benchmarking dataset in English with gender-ambiguous occupational terms. Using GRAPE, we evaluate several MT systems and examine whether their gendered translations in Greek and French align with or diverge from societal stereotypes, real-world occupational gender distributions, and normative standards.
pdf
bib
abs
REVIVING YOUR MNEME: Predicting The Side Effects of LLM Unlearning and Fine-Tuning via Sparse Model Diffing
Aly M. Kassem
|
Zhuan Shi
|
Negar Rostamzadeh
|
Golnoosh Farnadi
LLMs are frequently fine-tuned or unlearned to adapt to new tasks or eliminate undesirable behaviors. While existing evaluation methods assess performance after such interventions, there remains no general approach for detecting unintended side effects—such as unlearning biology content degrading performance on chemistry tasks, particularly when these effects are unpredictable or emergent. To address this issue, we introduce MNEME, Model diffiNg for Evaluating Mechanistic Effects, a framework for identifying these side effects using sparse model diffing. MNEME compares base and fine-tuned models on out-of-distribution (OOD) data (e.g., The Pile, LMSYS-Chat-1M), without access to fine-tuning data, to isolate behavioral shifts. Applied to five LLMs across three scenarios, WMDP knowledge unlearning, emergent misalignment, and benign fine-tuning, MNEME achieves up to 95% accuracy in predicting side effects, aligning with known benchmarks and requiring no custom heuristics. Our results demonstrate that sparse probing and diffing offer a scalable and automated lens into fine-tuning-induced model changes, providing practical tools for understanding and managing LLM behavior.
pdf
bib
abs
ToM-SSI: Evaluating Theory of Mind in Situated Social Interactions
Matteo Bortoletto
|
Constantin Ruhdorfer
|
Andreas Bulling
Most existing Theory of Mind (ToM) benchmarks for foundation models rely on variations of the Sally-Anne test, offering only a very limited perspective on ToM and neglecting the complexity of human social interactions. To address this gap, we propose ToM-SSI: a new benchmark specifically designed to test ToM capabilities in environments rich with social interactions and spatial dynamics. While current ToM benchmarks are limited to text-only or dyadic interactions, ToM-SSI is multimodal and includes group interactions of up to four agents that communicate and move in situated environments. This unique design allows us to study, for the first time, mixed cooperative-obstructive settings and reasoning about multiple agents’ mental states in parallel, thus capturing a wider range of social cognition than existing benchmarks. Our evaluations reveal that the current models’ performance is still severely limited, especially in these new tasks, highlighting critical gaps for future research.
pdf
bib
abs
Recursive Training Loops in LLMs: How training data properties modulate distribution shift in generated data?
Grgur Kovač
|
Jérémy Perez
|
Rémy Portelas
|
Peter Ford Dominey
|
Pierre-Yves Oudeyer
Large language models (LLMs) are increasingly used in the creation of online content, creating feedback loops as subsequent generations of models will be trained on this synthetic data. Such loops were shown to lead to distribution shifts - models misrepresenting the true underlying distributions of human data (also called model collapse). However, how human data properties affect such shifts remains poorly understood. In this paper, we provide the first empirical examination of the effect of such properties on the outcome of recursive training. We first confirm that using different human datasets leads to distribution shifts of different magnitudes. Through exhaustive manipulation of dataset properties combined with regression analyses, we then identify a set of properties predicting distribution shift magnitudes. Lexical diversity is found to amplify these shifts, while semantic diversity and data quality mitigate them. Furthermore, we find that these influences are highly modular: data scraped from a given internet domain has little influence on the content generated for another domain. Finally, experiments on political bias reveal that human data properties affect whether the initial bias will be amplified or reduced. Overall, our results portray a novel view, where different parts of the internet may undergo different types of distribution shift.
pdf
bib
abs
Detecting LLM Hallucination Through Layer-wise Information Deficiency: Analysis of Ambiguous Prompts and Unanswerable Questions
Hazel Kim
|
Tom A. Lamb
|
Adel Bibi
|
Philip Torr
|
Yarin Gal
Large language models (LLMs) frequently generate confident yet inaccurate responses, introducing significant risks for deployment in safety-critical domains. We present a novel, test-time approach to detecting model hallucination through systematic analysis of information flow across model layers. We target cases when LLMs process inputs with ambiguous or insufficient context. Our investigation reveals that hallucination manifests as usable information deficiencies in inter-layer transmissions. While existing approaches primarily focus on final-layer output analysis, we demonstrate that tracking cross-layer information dynamics (ℒI) provides robust indicators of model reliability, accounting for both information gain and loss during computation. ℒI improves model reliability and integrates immediately with general-purpose LLMs without additional training or architectural modifications.
pdf
bib
abs
Extending Automatic Machine Translation Evaluation to Book-Length Documents
Kuang-Da Wang
|
Shuoyang Ding
|
Chao-Han Huck Yang
|
Ping-Chun Hsieh
|
Wen-Chih Peng
|
Vitaly Lavrukhin
|
Boris Ginsburg
Despite Large Language Models (LLMs) demonstrating superior translation performance and long-context capabilities, evaluation methodologies remain constrained to sentence-level assessment due to dataset limitations, token number restrictions in metrics, and rigid sentence boundary requirements. We introduce SEGALE, an evaluation scheme that extends existing automatic metrics to long-document translation by treating documents as continuous text and applying sentence segmentation and alignment methods. Our approach enables previously unattainable document-level evaluation, handling translations of arbitrary length generated with document-level prompts while accounting for under-/over-translations and varied sentence boundaries. Experiments show our scheme significantly outperforms existing long-form document evaluation schemes, while being comparable to evaluations performed with ground-truth sentence alignments. Additionally, we apply our scheme to book-length texts and newly demonstrate that many open-weight LLMs fail to effectively translate documents at their reported maximum context lengths.
pdf
bib
abs
MedFact: A Large-scale Chinese Dataset for Evidence-based Medical Fact-checking of LLM Responses
Tong Chen
|
Zimu Wang
|
Yiyi Miao
|
Haoran Luo
|
Sun Yuanfei
|
Wei Wang
|
Zhengyong Jiang
|
Procheta Sen
|
Jionglong Su
Medical fact-checking has become increasingly critical as more individuals seek medical information online. However, existing datasets predominantly focus on human-generated content, leaving the verification of content generated by large language models (LLMs) relatively unexplored. To address this gap, we introduce MedFact, the first evidence-based Chinese medical fact-checking dataset of LLM-generated medical content. It consists of 1,321 questions and 7,409 claims, mirroring the complexities of real-world medical scenarios. We conduct comprehensive experiments in both in-context learning (ICL) and fine-tuning settings, showcasing the capability and challenges of current LLMs on this task, accompanied by an in-depth error analysis to point out key directions for future research. Our dataset is publicly available at https://github.com/AshleyChenNLP/MedFact.
pdf
bib
abs
VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment
Yogesh Kulkarni
|
Pooyan Fazli
Video-language models (Video-LLMs) excel at understanding video content but struggle with spatial relationships, temporal ordering, and cross-frame continuity. To address these limitations, we introduce VideoPASTA (Preference Alignment with Spatio-Temporal-Cross Frame Adversaries), a framework that enhances Video-LLMs through targeted preference optimization. VideoPASTA trains models to distinguish accurate video representations from carefully crafted adversarial examples that deliberately violate spatial, temporal, or cross-frame relationships. With only 7,020 preference pairs and Direct Preference Optimization, VideoPASTA enables models to learn robust representations that capture fine-grained spatial details and long-range temporal dynamics. Experiments demonstrate that VideoPASTA is model agnostic and significantly improves performance, for example, achieving gains of up to +3.8 percentage points on LongVideoBench, +4.1 on VideoMME, and +4.0 on MVBench, when applied to various state-of-the-art Video-LLMs. These results demonstrate that targeted alignment, rather than massive pretraining or architectural modifications, effectively addresses core video-language challenges. Notably, VideoPASTA achieves these improvements without any human annotation or captioning, relying solely on 32-frame sampling. This efficiency makes our approach a scalable plug-and-play solution that seamlessly integrates with existing models while preserving their original capabilities.
pdf
bib
abs
Do LLMs Adhere to Label Definitions? Examining Their Receptivity to External Label Definitions
Seyedali Mohammadi
|
Bhaskara Hanuma Vedula
|
Hemank Lamba
|
Edward Raff
|
Ponnurangam Kumaraguru
|
Francis Ferraro
|
Manas Gaur
Do LLMs genuinely incorporate external definitions, or do they primarily rely on their parametric knowledge? To address these questions, we conduct controlled experiments across multiple explanation benchmark datasets (general and domain-specific) and label definition conditions, including expert-curated, LLM-generated, perturbed, and swapped definitions. Our results reveal that while explicit label definitions can enhance accuracy and explainability, their integration into an LLM’s task-solving processes is neither guaranteed nor consistent, suggesting reliance on internalized representations in many cases. Models often default to their internal representations, particularly in general tasks, whereas domain-specific tasks benefit more from explicit definitions. These findings underscore the need for a deeper understanding of how LLMs process external knowledge alongside their pre-existing capabilities.
pdf
bib
abs
Group-Aware Reinforcement Learning for Output Diversity in Large Language Models
Oron Anschel
|
Alon Shoshan
|
Adam Botach
|
Shunit Haviv Hakimi
|
Asaf Gendler
|
Emanuel Ben Baruch
|
Nadav Bhonker
|
Igor Kviatkovsky
|
Manoj Aggarwal
|
Gerard Medioni
Large Language Models (LLMs) often suffer from mode collapse, repeatedly generating the same few completions even when many valid answers exist, limiting their diversity across a wide range of tasks. We introduce Group-Aware Policy Optimization (GAPO), a simple extension of the recent and popular Group Relative Policy Optimization (GRPO) that computes rewards over the group as a whole. GAPO enables learning from group-level properties such as diversity and coverage. We demonstrate GAPO using a frequency-aware reward function that encourages uniform sampling over valid LLM completions, and show that GAPO-trained models produce valid and more diverse model responses. Beyond this setup, GAPO generalizes to open-ended prompts and improves response diversity without compromising accuracy on standard LLM benchmarks (GSM8K, MATH, HumanEval, MMLU-Pro). Our code will be made publicly available.
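A hedged sketch of what a frequency-aware, group-level reward could look like is given below: valid completions are rewarded in inverse proportion to how often they occur within the sampled group, which pushes the policy toward uniform coverage of valid answers. The exact reward used in GAPO may differ.

```python
# Illustrative group-level, frequency-aware reward: rarer valid completions
# within one sampled group receive higher reward.  Assumed form, not the
# paper's reported reward function.
from collections import Counter

def group_rewards(completions, is_valid):
    """completions: list of strings sampled for one prompt.
    is_valid: callable marking acceptable completions."""
    counts = Counter(completions)
    rewards = []
    for c in completions:
        if not is_valid(c):
            rewards.append(0.0)                # invalid completions get no reward
        else:
            rewards.append(1.0 / counts[c])    # rarer valid completions score higher
    return rewards

samples = ["cat", "cat", "cat", "dog", "owl"]
print(group_rewards(samples, is_valid=lambda c: c in {"cat", "dog", "owl"}))
# -> [0.333..., 0.333..., 0.333..., 1.0, 1.0]
```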
pdf
bib
abs
Model-Based Ranking of Source Languages for Zero-Shot Cross-Lingual Transfer
Abteen Ebrahimi
|
Adam Wiemerslage
|
Katharina von der Wense
We present NN-Rank, an algorithm for ranking source languages for cross-lingual transfer, which leverages hidden representations from multilingual models and unlabeled target-language data. We experiment with two pretrained multilingual models and two tasks: part-of-speech tagging (POS) and named entity recognition (NER). We consider 51 source languages and evaluate on 56 and 72 target languages for POS and NER, respectively. When using in-domain data, NN-Rank beats state-of-the-art baselines that leverage lexical and linguistic features, with average improvements of up to 35.56 NDCG for POS and 18.14 NDCG for NER. As prior approaches can fall back to language-level features if target language data is not available, we show that NN-Rank remains competitive using only the Bible, an out-of-domain corpus available for a large number of languages. Ablations on the amount of unlabeled target data show that, for subsets consisting of as few as 25 examples, NN-Rank produces high-quality rankings which achieve 92.8% of the NDCG achieved using all available target data for ranking.
pdf
bib
abs
PruneCD: Contrasting Pruned Self Model to Improve Decoding Factuality
Byeongho Yu
|
Changhun Lee
|
Jun-gyu Jin
|
Eunhyeok Park
To mitigate the hallucination problem in large language models, DoLa exploits early exit logits from the same model as a contrastive prior. However, we found that these early exit logits tend to be flat, low in magnitude, and fail to reflect meaningful contrasts. To address this, we propose PruneCD, a novel contrastive decoding method that constructs the amateur model via layer pruning rather than early exit. This design leads to more informative and well-aligned logits, enabling more effective contrastive decoding. Through qualitative and quantitative analyses, we demonstrate that PruneCD consistently improves factuality with minimal inference overhead, offering a robust and practical approach to mitigating hallucinations in LLMs.
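To make the contrastive setup concrete, the sketch below applies a generic contrastive-decoding scoring rule between an expert model's logits and a weaker "amateur" model's logits (here imagined as coming from a layer-pruned copy). The plausibility threshold, weights, and scoring rule are assumptions, not PruneCD's reported configuration.

```python
# Generic contrastive-decoding scoring with an expert and a weaker amateur
# model; the amateur logits are assumed to come from a layer-pruned copy.
import numpy as np

def contrastive_logits(expert_logits, amateur_logits, alpha=0.1, beta=1.0):
    """Keep only tokens the expert finds plausible, then reward tokens the
    expert prefers more strongly than the amateur."""
    expert_logp = expert_logits - np.log(np.exp(expert_logits).sum())
    amateur_logp = amateur_logits - np.log(np.exp(amateur_logits).sum())
    plausible = expert_logp >= np.log(alpha) + expert_logp.max()
    return np.where(plausible, expert_logp - beta * amateur_logp, -np.inf)

expert = np.array([4.0, 3.5, 0.2, -1.0])
amateur = np.array([3.9, 1.0, 0.1, -0.5])   # flatter, less informed distribution
print(int(np.argmax(contrastive_logits(expert, amateur))))   # -> 1
```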
pdf
bib
abs
Crisp: Cognitive Restructuring of Negative Thoughts through Multi-turn Supportive Dialogues
Jinfeng Zhou
|
Yuxuan Chen
|
Jianing Yin
|
Yongkang Huang
|
Yihan Shi
|
Xikun Zhang
|
Libiao Peng
|
Rongsheng Zhang
|
Tangjie Lv
|
Zhipeng Hu
|
Hongning Wang
|
Minlie Huang
Cognitive Restructuring (CR) uses multi-turn dialogue to identify and restructure one’s negative thoughts, arising from mental health issues, into more helpful and positive ones. Clinician shortage and stigma urge the development of human-LLM interactive psychotherapy for CR. Yet, effectively implementing CR is hindered by entrenched cognitive distortions, emotional resistance, and individual differences, which existing works have not overcome. To bridge this gap, we propose CRDial, a novel framework that structures CR as theory-grounded multi-stage multi-turn dialogue, integrating multi-aspect supportive strategies for emotional management and a multi-channel loop mechanism to account for diverse individual distortions. With CRDial, we distill Crisp, a large-scale and high-quality bilingual dialogue dataset, from an LLM. We then train Crispers, Crisp-based conversational LLMs for CR, at 7B and 14B scales. Extensive human studies show the superiority of Crispers in pointwise, pairwise, and intervention evaluations.
pdf
bib
abs
AccessEval: Benchmarking Disability Bias in Large Language Models
Srikant Panda
|
Amit Agarwal
|
Hitesh Laxmichand Patel
Large Language Models (LLMs) are increasingly deployed across diverse domains but often exhibit disparities in how they handle real-life queries. To systematically investigate these effects across various disability contexts, we introduce AccessEval, a large-scale benchmark evaluating a total of 21 closed- and open-source LLMs across six real-world domains and nine disability types using paired Neutral and Disability-Aware Queries. We evaluate model outputs with metrics for factual accuracy, sentiment, and social perception. Our analysis reveals that responses to disability-aware queries tend to have higher factual error, more negative tone, and increased stereotyping in social perception compared to neutral queries. These effects show notable variation by domain and disability type. Disabilities affecting hearing, speech, and mobility are disproportionately impacted. These disparities reveal persistent forms of ableism, highlighting the need for more comprehensive and nuanced assessment. We further argue that framing bias in terms of model performance within real-world decision making helps to better link model behaviors to the potential harms users may face. This approach guides the development of more effective and tailored fairness interventions. AccessEval, therefore, serves as a crucial tool for advancing equitable and inclusive language technologies.
pdf
bib
abs
The Impact of Language Mixing on Bilingual LLM Reasoning
Yihao Li
|
Jiayi Xin
|
Miranda Muqing Miao
|
Qi Long
|
Lyle Ungar
Proficient multilingual speakers often intentionally switch languages in the middle of a conversation. Similarly, recent reasoning-focused bilingual large language models (LLMs) with strong capabilities in both languages exhibit **language mixing**—alternating languages within their chain of thought. Discouraging this behavior in DeepSeek-R1 was found to degrade accuracy, suggesting that language mixing may benefit reasoning. In this work, we study language switching in Chinese-English bilingual reasoning models. We identify reinforcement learning with verifiable rewards (RLVR) as the critical training stage that leads to language mixing. We show that language mixing can enhance reasoning: enforcing monolingual decoding reduces accuracy by 5.6 percentage points on MATH500. Additionally, a lightweight probe can be trained to predict whether a potential language switch would benefit or harm reasoning, and when used to guide decoding, increases accuracy by 2.92 percentage points. Our findings suggest that language mixing is not merely a byproduct of multilingual training, but is a *strategic reasoning behavior*.
pdf
bib
abs
VISaGE: Understanding Visual Generics and Exceptions
Stella Frank
|
Emily Allaway
While Vision Language Models (VLMs) learn conceptual representations, in the form of generalized knowledge, during training, they are typically used to analyze individual instances. When evaluation instances are atypical, this paradigm results in tension between two priors in the model. The first is a pragmatic prior that the textual and visual input are both relevant, arising from VLM finetuning on congruent inputs; the second is a semantic prior that the conceptual representation is generally true for instances of the category. In order to understand how VLMs trade off these priors, we introduce a new evaluation dataset, VISaGE, consisting of both typical and exceptional images. In carefully balanced experiments, we show that conceptual understanding degrades when the assumption of congruency underlying the pragmatic prior is violated with incongruent images. This effect is stronger than the effect of the semantic prior when querying about individual instances.
pdf
bib
abs
Stronger Baselines for Retrieval-Augmented Generation with Long-Context Language Models
Alex Laitenberger
|
Christopher D Manning
|
Nelson F. Liu
With the rise of long-context language models (LMs) capable of processing tens of thousands of tokens in a single context window, do multi-stage retrieval-augmented generation (RAG) pipelines still offer measurable benefits over simpler, single-stage approaches? To assess this question, we conduct a controlled evaluation for QA tasks under systematically scaled token budgets, comparing two recent multi-stage pipelines, ReadAgent and RAPTOR, against three baselines, including DOS RAG (Document’s Original Structure RAG), a simple retrieve-then-read method that preserves original passage order. Despite its straightforward design, DOS RAG consistently matches or outperforms more intricate methods on multiple long-context QA benchmarks. We trace this strength to a combination of maintaining source fidelity and document structure, prioritizing recall within effective context windows, and favoring simplicity over added pipeline complexity. We recommend establishing DOS RAG as a simple yet strong baseline for future RAG evaluations, paired with state-of-the-art embedding and language models, and benchmarked under matched token budgets, to ensure that added pipeline complexity is justified by clear performance gains as models continue to improve.
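Below is a minimal sketch of a retrieve-then-read step that preserves the document's original passage order, in the spirit of the DOS RAG baseline described above; the retriever object and scoring interface are hypothetical stand-ins.

```python
# Sketch of a DOS-RAG-style retrieve-then-read step: pick the top-k passages,
# then restore the document's original passage order before building the
# prompt context.  Retriever and scoring are illustrative stand-ins.
def build_context(question, passages, retriever, k=8):
    """passages: list of (position_in_document, text) tuples."""
    scored = [(retriever.score(question, text), pos, text) for pos, text in passages]
    top_k = sorted(scored, reverse=True)[:k]                  # best-scoring passages
    in_doc_order = sorted(top_k, key=lambda item: item[1])    # restore original order
    return "\n\n".join(text for _, _, text in in_doc_order)
```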
pdf
bib
abs
Discursive Circuits: How Do Language Models Understand Discourse Relations?
Yisong Miao
|
Min-Yen Kan
Which components in transformer language models are responsible for discourse understanding? We hypothesize that sparse computational graphs, termed as discursive circuits, control how models process discourse relations. Unlike simpler tasks, discourse relations involve longer spans and complex reasoning. To make circuit discovery feasible, we introduce a task called Completion under Discourse Relation (CuDR), where a model completes a discourse given a specified relation. To support this task, we construct a corpus of minimal contrastive pairs tailored for activation patching in circuit discovery. Experiments show that sparse circuits (≈0.2% of a full GPT-2 model) recover discourse understanding in the English PDTB-based CuDR task. These circuits generalize well to unseen discourse frameworks such as RST and SDRT. Further analysis shows lower layers capture linguistic features such as lexical semantics and coreference, while upper layers encode discourse-level abstractions. Feature utility is consistent across frameworks (e.g., coreference supports Expansion-like relations).
pdf
bib
abs
Making VLMs More Robot-Friendly: Self-Critical Distillation of Low-Level Procedural Reasoning
Chan Young Park
|
Jillian Fisher
|
Marius Memmel
|
Dipika Khullar
|
Seoho Yun
|
Abhishek Gupta
|
Yejin Choi
Large language models (LLMs) have shown promise in robotic procedural planning, yet their human-centric reasoning often omits the low-level, grounded details needed for robotic execution. Vision-language models (VLMs) offer a path toward more perceptually grounded plans, but current methods either rely on expensive, large-scale models or are constrained to narrow simulation settings. We introduce SelfReVision, a lightweight and scalable self-improvement framework for vision-language procedural planning. SelfReVision enables small VLMs to iteratively critique, revise, and verify their own plans, without external supervision or teacher models, drawing inspiration from chain-of-thought prompting and self-instruct paradigms. Through this self-distillation loop, models generate higher-quality, execution-ready plans that can be used both at inference and for continued fine-tuning. Using models varying from 3B to 72B, our results show that SelfReVision not only boosts performance over weak base VLMs but also outperforms models 100X the size, yielding improved control in downstream embodied tasks.
pdf
bib
abs
ThinkSLM: Towards Reasoning in Small Language Models
Gaurav Srivastava
|
Shuxiang Cao
|
Xuan Wang
Reasoning has long been viewed as an emergent property of large language models (LLMs). However, recent studies challenge this assumption, showing that small language models (SLMs) can also achieve competitive reasoning performance. This paper introduces ThinkSLM, the first extensive benchmark to systematically evaluate and study the reasoning abilities of SLMs trained from scratch or derived from LLMs through quantization, pruning, and distillation. We first establish a reliable evaluation criterion comparing available methods and LLM judges against our human evaluations. Then we present a study evaluating 72 diverse SLMs from six major model families across 17 reasoning benchmarks. We repeat all our experiments three times to ensure a robust assessment. Our findings show that: 1) reasoning ability in SLMs is strongly influenced by training methods and data quality rather than solely model scale; 2) quantization preserves reasoning capability, while pruning significantly disrupts it; 3) larger models consistently exhibit higher robustness against adversarial perturbations and intermediate reasoning, but certain smaller models closely match or exceed the larger models’ performance. Our findings challenge the assumption that scaling is the only way to achieve strong reasoning. Instead, we foresee a future where SLMs with strong reasoning capabilities can be developed through structured training or post-training compression. Our ThinkSLM Leaderboard is publicly available at: https://ctrl-gaurav.github.io/thinkslm.github.io/.
pdf
bib
abs
MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning
Justin Chen
|
Archiki Prasad
|
Swarnadeep Saha
|
Elias Stengel-Eskin
|
Mohit Bansal
Large language model (LLM) reasoning can be improved by scaling test-time compute with aggregation, i.e., generating multiple samples and aggregating over them. While improving performance, this strategy often reaches a saturation point beyond which additional compute provides no return. Refinement offers an alternative by using model-generated feedback to improve answer quality. However, refinement faces three key challenges: (1) Excessive refinement: Uniformly refining all instances can cause over-correction and reduce overall performance. (2) Inability to localize and address errors: LLMs struggle to identify and correct their own mistakes. (3) Insufficient refinement: Stopping refinement too soon could leave errors unaddressed. To tackle these issues, we propose MAgICoRe, a framework for Multi-Agent Iteration for Coarse-to-fine Refinement. MAgICoRe mitigates excessive refinement by categorizing problems as easy or hard, solving easy problems with coarse-grained aggregation, and solving the hard ones with fine-grained multi-agent refinement. To better localize errors, we incorporate external step-wise reward model scores, and to ensure sufficient refinement, we iteratively refine the solutions using a multi-agent setup. We evaluate MAgICoRe on Llama-3-8B and GPT-3.5 and show its effectiveness across seven reasoning datasets. One iteration of MAgICoRe beats Self-Consistency by 3.4%, Best-of-k by 3.2%, and Self-Refine by 4.0% even when these baselines use k = 120, and MAgICoRe uses less than 50% of the compute.
pdf
bib
abs
Batched Self-Consistency Improves LLM Relevance Assessment and Ranking
Anton Korikov
|
Pan Du
|
Scott Sanner
|
Navid Rekabsaz
LLM query-passage relevance assessment is typically studied using a one-by-one pointwise (PW) strategy where each LLM call judges one passage at a time. However, this strategy requires as many LLM calls as there are passages while also preventing information sharing between passages. We thus hypothesize that batched PW methods, which evaluate multiple passages per LLM call, can improve not only efficiency but also judgment quality — by enabling content from multiple passages to be seen jointly. Moreover, batched PW methods may be better suited to harness the test-time scaling benefits of self-consistency — the ensembling technique of repeating (potentially perturbed) LLM tasks in parallel and aggregating results — since batching can naturally enable prompt diversification through varied batch permutations and compositions to create more robust ensembles. We evaluate several batched PW methods against one-by-one PW and listwise ranking baselines on LLM relevance assessment and ranking tasks, using three passage retrieval datasets and GPT-4o, Claude Sonnet 3, and Amazon Nova Pro. We show that batching can greatly amplify self-consistency benefits, making batched PW methods achieve the best performance while often reducing latency by an order of magnitude or more compared to one-by-one PW methods. For instance, on legal search, batched PW ranking with GPT-4o improves from 43.8% to 51.3% NDCG@10 when using 1 vs. 15 self-consistency calls, compared to one-by-one PW ranking improving from 44.9% to 46.8% and being 15.3x slower.
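The batched pointwise strategy with self-consistency can be sketched as follows: several passages are judged per LLM call, the batch order is permuted across calls, and per-passage labels are aggregated by majority vote. The judge_batch callable is a hypothetical stand-in for an actual LLM call.

```python
# Sketch of batched pointwise relevance judging with self-consistency via
# varied batch permutations.  judge_batch is a hypothetical LLM wrapper that
# returns one label per passage, in the order the passages were given.
import random
from collections import Counter

def batched_self_consistency(query, passages, judge_batch, n_calls=5, seed=0):
    rng = random.Random(seed)
    votes = {p: [] for p in passages}
    for _ in range(n_calls):
        order = passages[:]
        rng.shuffle(order)                      # prompt diversification via permutation
        labels = judge_batch(query, order)      # e.g. "relevant" / "not relevant"
        for passage, label in zip(order, labels):
            votes[passage].append(label)
    # Majority vote per passage across the self-consistency calls.
    return {p: Counter(v).most_common(1)[0][0] for p, v in votes.items()}
```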
pdf
bib
abs
SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts
Marc Felix Brinner
|
Sina Zarrieß
We introduce SemCSE, an unsupervised method for learning semantic embeddings of scientific texts. Building on recent advances in contrastive learning for text embeddings, our approach leverages LLM-generated summaries of scientific abstracts to train a model that positions semantically related summaries closer together in the embedding space. The resulting objective ensures that the model captures the true semantic content of a text, in contrast to traditional citation-based approaches that do not necessarily reflect semantic similarity. To validate this, we propose a novel benchmark designed to assess a model’s ability to understand and encode the semantic content of scientific texts, demonstrating that our method enforces a stronger semantic separation within the embedding space. Additionally, we evaluate SemCSE on the comprehensive SciRepEval benchmark for scientific text embeddings, where it achieves state-of-the-art performance among models of its size, thus highlighting the benefits of a semantically focused training approach.
pdf
bib
abs
Controlled Generation for Private Synthetic Text
Zihao Zhao
|
Anjalie Field
Text anonymization is essential for responsibly developing and deploying AI in high-stakes domains such as healthcare, social services, and law. In this work, we propose a novel methodology for privacy-preserving synthetic text generation that leverages the principles of de-identification and the Hiding In Plain Sight (HIPS) theory. Our approach introduces entity-aware control codes to guide controllable generation using either in-context learning (ICL) or prefix tuning. The ICL variant ensures privacy levels consistent with the underlying de-identification system, while the prefix tuning variant incorporates a custom masking strategy and loss function to support scalable, high-quality generation. Experiments on legal and clinical datasets demonstrate that our method achieves a strong balance between privacy protection and utility, offering a practical and effective solution for synthetic text generation in sensitive domains.
pdf
bib
abs
Towards AI-Assisted Psychotherapy: Emotion-Guided Generative Interventions
Kilichbek Haydarov
|
Youssef Mohamed
|
Emilio Goldenhersch
|
Paul OCallaghan
|
Li-jia Li
|
Mohamed Elhoseiny
Large language models (LLMs) hold promise for therapeutic interventions, yet most existing datasets rely solely on text, overlooking non-verbal emotional cues essential to real-world therapy. To address this, we introduce a multimodal dataset of 1,441 publicly sourced therapy session videos containing both dialogue and non-verbal signals such as facial expressions and vocal tone. Inspired by Hochschild’s concept of emotional labor, we propose a computational formulation of emotional dissonance—the mismatch between facial and vocal emotion—and use it to guide emotionally aware prompting. Our experiments show that integrating multimodal cues, especially dissonance, improves the quality of generated interventions. We also find that LLM-based evaluators misalign with expert assessments in this domain, highlighting the need for human-centered evaluation. Data and code will be released to support future research.
pdf
bib
abs
From Shortcuts to Balance: Attribution Analysis of Speech-Text Feature Utilization in Distinguishing Original from Machine-Translated Texts
Yongjian Chen
|
Antonio Toral
Neural text-based models for detecting machine-translated texts can rely on named entities (NEs) as classification shortcuts. While masking NEs encourages learning genuine translationese signals, it degrades the classification performance. Incorporating speech features compensates for this loss, but their interaction with NE reliance requires careful investigation. Through systematic attribution analysis across modalities, we find that bimodal integration leads to more balanced feature utilization, reducing the reliance on NEs in text while tempering overemphasized attribution patterns in speech features.
pdf
bib
abs
DEBATE, TRAIN, EVOLVE: Self‐Evolution of Language Model Reasoning
Gaurav Srivastava
|
Zhenyu Bi
|
Meng Lu
|
Xuan Wang
Large language models (LLMs) have improved significantly in their reasoning through extensive training on massive datasets. However, relying solely on additional data for improvement is becoming increasingly impractical, highlighting the need for models to autonomously enhance their reasoning without external supervision. In this paper, we propose Debate, Train, Evolve (DTE), a novel ground truth-free training framework that uses multi-agent debate traces to evolve a single language model. We also introduce a new prompting strategy Reflect-Critique-Refine, to improve debate quality by explicitly instructing agents to critique and refine their reasoning. Extensive evaluations on seven reasoning benchmarks with six open-weight models show that our DTE framework achieves substantial improvements, with an average accuracy gain of 8.92% on the challenging GSM-PLUS dataset. Furthermore, we observe strong cross-domain generalization, with an average accuracy gain of 5.8% on all other benchmarks, suggesting that our method captures general reasoning capabilities. Our framework code and trained models are publicly available at https://github.com/ctrl-gaurav/Debate-Train-Evolve.
pdf
bib
abs
From Chat Logs to Collective Insights: Aggregative Question Answering
Wentao Zhang
|
Woojeong Kim
|
Yuntian Deng
Conversational agents powered by large language models (LLMs) are rapidly becoming integral to our daily interactions, generating unprecedented amounts of conversational data. Such datasets offer a powerful lens into societal interests, trending topics, and collective concerns. Yet existing approaches typically treat these interactions as independent, missing critical insights that could emerge from aggregating and reasoning across large-scale conversation logs. In this paper, we introduce Aggregative Question Answering, a novel task requiring models to reason explicitly over thousands of user-chatbot interactions to answer aggregational queries, such as identifying emerging concerns among specific demographics. To enable research in this direction, we construct a benchmark, WildChat-AQA, comprising 6,027 aggregative questions derived from 182,330 real-world chatbot conversations. Experiments show that existing methods either struggle to reason effectively or incur prohibitive computational costs, underscoring the need for new approaches capable of extracting collective insights from large-scale conversational data.
pdf
bib
abs
A Text-Based Recommender System that Leverages Explicit Affective State Preferences
Tonmoy Hasan
|
Razvan Bunescu
The affective attitude of liking a recommended item reflects just one category in a wide spectrum of affective phenomena that also includes emotions such as entranced or intrigued, moods such as cheerful or buoyant, as well as more fine-grained affective states, such as “pleasantly surprised by the conclusion”. In this paper, we introduce a novel recommendation task that can leverage a virtually unbounded range of affective states sought explicitly by the user in order to identify items that, upon consumption, are likely to induce those affective states. Correspondingly, we create a large dataset of user preferences containing expressions of fine-grained affective states that are mined from book reviews, and propose ACRec, a Transformer-based architecture that leverages such affective expressions as input. We then use the resulting dataset of affective states preferences, together with the linked users and their histories of book readings, ratings, and reviews, to train and evaluate multiple recommendation models on the task of matching recommended items with affective preferences. Experimental comparisons with a range of state-of-the-art baselines demonstrate ACRec’s superior ability to leverage explicit affective preferences.
pdf
bib
abs
CARE: Multilingual Human Preference Learning for Cultural Awareness
Geyang Guo
|
Tarek Naous
|
Hiromi Wakaki
|
Yukiko Nishimura
|
Yuki Mitsufuji
|
Alan Ritter
|
Wei Xu
Language Models (LMs) are typically tuned with human preferences to produce helpful responses, but the impact of preference tuning on the ability to handle culturally diverse queries remains understudied. In this paper, we systematically analyze how native human cultural preferences can be incorporated into the preference learning process to train more culturally aware LMs. We introduce CARE, a multilingual resource containing 3,490 culturally specific questions and 31.7k responses with human judgments. We demonstrate how a modest amount of high-quality native preferences improves cultural awareness across various LMs, outperforming larger generic preference data. Our analyses reveal that models with stronger initial cultural performance benefit more from alignment, leading to gaps among models developed in different regions with varying access to culturally relevant data. CARE is publicly available at https://github.com/Guochry/CARE.
pdf
bib
abs
Multilingual Dialogue Generation and Localization with Dialogue Act Scripting
Justin Vasselli
|
Eunike Andriani Kardinata
|
Yusuke Sakai
|
Taro Watanabe
Non-English dialogue datasets are scarce, and models are often trained or evaluated on translations of English-language dialogues, an approach which can introduce artifacts that reduce their naturalness and cultural appropriateness. This work proposes Dialogue Act Script (DAS), a structured framework for encoding, localizing, and generating multilingual dialogues from abstract intent representations. Rather than translating dialogue utterances directly, DAS enables the generation of new dialogues in the target language that are culturally and contextually appropriate. By using structured dialogue act representations, DAS supports flexible localization across languages, mitigating translationese and enabling more fluent, naturalistic conversations. Human evaluations across Italian, German, and Chinese show that DAS-generated dialogues consistently outperform those produced by both machine and human translators on measures of cultural relevance, coherence, and situational appropriateness.
pdf
bib
abs
SUE: Sparsity-based Uncertainty Estimation via Sparse Dictionary Learning
Tamás Ficsor
|
Gábor Berend
The growing deployment of deep learning models in real-world applications necessitates not only high predictive accuracy, but also mechanisms to identify unreliable predictions, especially in high-stakes scenarios where decision risk must be minimized. Existing methods estimate uncertainty by leveraging predictive confidence (e.g., Softmax Response), structural characteristics of representation space (e.g., Mahalanobis distance), or stochastic variation in model outputs (e.g., Bayesian inference techniques such as Monte Carlo Dropout). In this work, we propose a novel uncertainty estimation (UE) framework based on sparse dictionary learning by identifying dictionary atoms associated with misclassified samples. We leverage pointwise mutual information (PMI) to quantify the association between sparse features and predictive failure. Our method – Sparsity-based Uncertainty Estimation (SUE) – is computationally efficient, offers interpretability via atom-level analysis of the dictionary, and makes no assumption about the class distribution (unlike the Mahalanobis distance). We evaluated SUE on several NLU benchmarks (GLUE and ANLI tasks) and sentiment analysis benchmarks (Twitter, ParaDetox, and Jigsaw). In general, SUE outperforms or matches the performance of other methods. SUE performs particularly well when there is considerable uncertainty in the model, i.e., when the model lacks high precision.
pdf
bib
abs
Planning-Aware Code Infilling via Horizon-Length Prediction
Yifeng Ding
|
Hantian Ding
|
Shiqi Wang
|
Qing Sun
|
Varun Kumar
|
Zijian Wang
Fill-in-the-Middle (FIM), or infilling, has become integral to code language models, enabling generation of missing code given both left and right contexts. However, the current FIM training paradigm which performs next-token prediction (NTP) over reordered sequence often leads to models struggling to generate content that aligns well with the surrounding context. We hypothesize that NTP alone is insufficient for models to learn effective planning conditioned on the distant right context, a critical factor for successful code infilling. To overcome this, we propose Horizon-Length Prediction (HLP), a novel training objective that teaches models to predict the number of remaining middle tokens at each step. HLP advances FIM with lookahead planning, enabling models to inherently learn infilling boundaries for arbitrary left and right contexts without relying on dataset-specific post-processing. Our evaluation across different model families and sizes shows that HLP significantly improves FIM performance by up to 24% relatively on diverse benchmarks, across file-level and repository-level. Furthermore, the enhanced planning capability gained through HLP boosts model performance on code reasoning. Importantly, HLP incurs negligible training overhead and no additional inference cost, ensuring its practicality for real-world scenarios.
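One way to picture a horizon-length-style auxiliary objective is sketched below: an extra head predicts, at each middle-span position, how many middle tokens remain, trained here with a simple squared error. The head design and loss are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a Horizon-Length Prediction style auxiliary objective:
# predict the number of remaining middle tokens at every middle-span position.
import torch
import torch.nn as nn

class HorizonHead(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states):                  # (batch, seq, hidden)
        return self.proj(hidden_states).squeeze(-1)    # predicted remaining length

def hlp_loss(pred_horizon, middle_mask, middle_len):
    """pred_horizon: (batch, seq); middle_mask: bool (batch, seq) marking the
    middle span; middle_len: (batch,) number of middle tokens per example."""
    # Target at the i-th middle token counts down from middle_len to 1.
    positions = torch.cumsum(middle_mask.float(), dim=-1)
    target = (middle_len.unsqueeze(-1) - positions + 1).clamp(min=0)
    loss = (pred_horizon - target) ** 2
    return (loss * middle_mask).sum() / middle_mask.sum()
```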
pdf
bib
abs
SinhalaMMLU: A Comprehensive Benchmark for Evaluating Multitask Language Understanding in Sinhala
Ashmari Pramodya
|
Nirasha Nelki
|
Heshan Shalinda
|
Chamila Liyanage
|
Yusuke Sakai
|
Randil Pushpananda
|
Ruvan Weerasinghe
|
Hidetaka Kamigaito
|
Taro Watanabe
Large Language Models (LLMs) demonstrate impressive general knowledge and reasoning abilities, yet their evaluation has predominantly focused on global or anglocentric subjects, often neglecting low-resource languages and culturally specific content. While recent multilingual benchmarks attempt to bridge this gap, many rely on automatic translation, which can introduce errors and misrepresent the original cultural context. To address this, we introduce SinhalaMMLU, the first multiple-choice question answering benchmark designed specifically for Sinhala, a low-resource language. The dataset includes over 7,000 questions spanning secondary to collegiate education levels, aligned with the Sri Lankan national curriculum, and covers six domains and 30 subjects, encompassing both general academic topics and culturally grounded knowledge. We evaluate 26 LLMs on SinhalaMMLU and observe that, while Claude 3.5 Sonnet and GPT-4o achieve the highest average accuracies of 67% and 62%, respectively, overall model performance remains limited. In particular, models struggle in culturally rich domains such as the Humanities, revealing substantial room for improvement in adapting LLMs to low-resource and culturally specific contexts.
pdf
bib
abs
OG-RAG: Ontology-grounded retrieval-augmented generation for large language models
Kartik Sharma
|
Peeyush Kumar
|
Yunqing Li
While LLMs are widely used for generic tasks like question answering and search, they struggle to adapt to specialized knowledge, such as industrial workflows in healthcare, legal, and agricultural sectors, as well as knowledge-driven tasks such as news journalism, investigative research, and consulting without expensive fine-tuning or sub-optimal retrieval methods. Existing retrieval-augmented models, such as RAG, offer improvements but fail to account for structured domain knowledge, leading to suboptimal context generation. Ontologies, which conceptually organize domain knowledge by defining entities and their interrelationships, offer a structured representation to address this gap. This paper presents OG-RAG, an Ontology-Grounded Retrieval Augmented Generation method designed to enhance LLM-generated responses by anchoring retrieval processes in domain-specific ontologies. OG-RAG constructs a hypergraph representation of domain documents, where each hyperedge encapsulates clusters of factual knowledge grounded using domain-specific ontology and retrieves a minimal set of hyperedges for a given query using an optimization algorithm. Our evaluations demonstrate that OG-RAG increases the recall of accurate facts by 55% and improves response correctness by 40% across four different LLMs. Additionally, OG-RAG enables 30% faster attribution of responses to context and boosts fact-based reasoning accuracy by 27% compared to baseline methods. We release the code at https://github.com/microsoft/ograg2.
pdf
bib
abs
Convergence and Divergence of Language Models under Different Random Seeds
Finlay Fehlauer
|
Kyle Mahowald
|
Tiago Pimentel
In this paper, we investigate the convergence of language models (LMs) trained under different random seeds, measuring convergence as the expected per-token Kullback–Leibler (KL) divergence across seeds. By comparing LM convergence as a function of model size and training checkpoint, we identify a four-phase convergence pattern: (i) an initial uniform phase, (ii) a sharp-convergence phase, (iii) a sharp-divergence phase, and (iv) a slow-reconvergence phase. Further, we observe that larger models reconverge faster in later training stages, while smaller models never actually reconverge; these results suggest that a certain model size may be necessary to learn stable distributions. Restricting our analysis to specific token frequencies, or part-of-speech (PoS) tags further reveals that convergence is uneven across linguistic categories: frequent tokens and function words converge faster and more reliably than their counterparts (infrequent tokens and content words). Overall, our findings highlight factors that influence the stability of the learned distributions in model training.
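The convergence measure described above can be written out as follows (a hedged formalization; the notation is ours, not necessarily the paper's): the divergence between two seeds is the expected per-token KL divergence between the two models' next-token distributions on a held-out corpus.

```latex
% Hedged formalization of the seed-convergence measure: expected per-token KL
% divergence between models trained with seeds s and s', over corpus D.
\mathrm{Div}(s, s') \;=\; \mathbb{E}_{x \sim \mathcal{D}}
  \Big[ \tfrac{1}{|x|} \sum_{t=1}^{|x|}
  \mathrm{KL}\big( p_{\theta_s}(\cdot \mid x_{<t}) \,\big\|\,
                   p_{\theta_{s'}}(\cdot \mid x_{<t}) \big) \Big]
```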
pdf
bib
abs
Analyzing and Modeling LLM Response Lengths with Extreme Value Theory: Anchoring Effects and Hybrid Distributions
Liuxuan Jiao
|
Chen Gao
|
Yiqian Yang
|
Chenliang Zhou
|
YiXian Huang
|
Xinlei Chen
|
Yong Li
We present a statistical framework for modeling and controlling large language model (LLM) response lengths using extreme value theory. Analyzing 14,301 GPT-4o responses across temperature and prompting conditions, with cross-validation on Qwen and DeepSeek architectures, we demonstrate that verbosity follows Weibull-type generalized extreme value (GEV) distributions with heavier tails under stochastic generation. Our key contributions include: (1) development of a novel GEV-generalized Pareto (GPD) hybrid model that improves tail fit (R²_CDF = 0.9993 vs. the standalone GEV’s 0.998) while maintaining architectural generalizability; (2) quantitative characterization of prompt anchoring effects across models, showing reduced dispersion but increased outliers under randomization; and (3) identification of temperature-dependent response patterns that persist across architectures, with higher temperatures amplifying length variability while preserving extreme-value mechanisms. The hybrid model’s threshold selection method enables precise verbosity control in production systems regardless of model choice. While validated on multiple architectures, generalizability to emerging model families requires further study.
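The bulk/tail split behind a GEV-GPD hybrid can be illustrated with standard SciPy fits; the synthetic lengths and the 95% exceedance threshold below are assumptions for illustration, not the paper's pipeline or threshold-selection method.

```python
# Hedged sketch: fit a GEV to response lengths and a GPD to tail exceedances.
import numpy as np
from scipy.stats import genextreme, genpareto

# stand-in for observed token counts (SciPy's shape c is the negated GEV shape xi)
lengths = np.clip(genextreme.rvs(c=-0.2, loc=400, scale=120, size=14301), 1, None)

# bulk of the distribution: GEV fit
c_hat, loc_hat, scale_hat = genextreme.fit(lengths)

# tail: GPD fit to exceedances over a high quantile threshold (assumed 95%)
threshold = np.quantile(lengths, 0.95)
exceedances = lengths[lengths > threshold] - threshold
xi_tail, _, scale_tail = genpareto.fit(exceedances, floc=0.0)

print(f"GEV shape {c_hat:.3f}, GPD tail shape {xi_tail:.3f}, threshold {threshold:.0f} tokens")
```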
pdf
bib
abs
Language Models Identify Ambiguities and Exploit Loopholes
Jio Choi
|
Mohit Bansal
|
Elias Stengel-Eskin
Studying the responses of large language models (LLMs) to loopholes presents a two-fold opportunity. First, it affords us a lens through which to examine ambiguity and pragmatics in LLMs, since exploiting a loophole requires identifying ambiguity and performing sophisticated pragmatic reasoning. Second, loopholes pose an interesting and novel alignment problem where the model is presented with conflicting goals and can exploit ambiguities to its own advantage. To address these questions, we design scenarios where LLMs are given a goal and an ambiguous user instruction in conflict with the goal, with scenarios covering scalar implicature, structural ambiguities, and power dynamics. We then measure different models’ abilities to exploit loopholes to satisfy their given goals as opposed to the goals of the user. We find that both closed-source and stronger open-source models can identify ambiguities and exploit their resulting loopholes, presenting a potential AI safety risk. Our analysis indicates that models which exploit loopholes explicitly identify and reason about both ambiguity and conflicting goals.
pdf
bib
abs
Benchmarking LLMs for Translating Classical Chinese Poetry: Evaluating Adequacy, Fluency, and Elegance
Andong Chen
|
Lianzhang Lou
|
Kehai Chen
|
Xuefeng Bai
|
Yang Xiang
|
Muyun Yang
|
Tiejun Zhao
|
Min Zhang
Large language models (LLMs) have shown remarkable performance in general translation tasks. However, there is an increasing demand for high-quality translations that are not only adequate but also fluent and elegant. To assess the extent to which current LLMs can meet these demands, we introduce a suitable benchmark (PoetMT) for translating classical Chinese poetry into English. This task requires not only adequacy in translating culturally and historically significant content but also a strict adherence to linguistic fluency and poetic elegance. Our study reveals that existing LLMs fall short on this task. To address these issues, we propose RAT, a Retrieval-Augmented machine Translation method that enhances the translation process by incorporating knowledge related to classical poetry. Additionally, we propose an automatic evaluation metric based on GPT-4, which better assesses translation quality in terms of adequacy, fluency, and elegance, overcoming the limitations of traditional metrics.
pdf
bib
abs
AraEval: An Arabic Multi-Task Evaluation Suite for Large Language Models
Alhanoof Althnian
|
Norah A. Alzahrani
|
Shaykhah Z. Alsubaie
|
Eman Albilali
|
Ahmed Abdelali
|
Nouf M. Alotaibi
|
M Saiful Bari
|
Yazeed Alnumay
|
Abdulhamed Alothaimen
|
Maryam Saif
|
Shahad D. Alzaidi
|
Faisal Abdulrahman Mirza
|
Yousef Almushayqih
|
Mohammed Al Saleem
|
Ghadah Alabduljabbar
|
Abdulmohsen Al-Thubaity
|
Areeb Alowisheq
|
Nora Al-Twairesh
The rapid advancements of Large Language Models (LLMs) necessitate robust benchmarks. In this paper, we present AraEval, a pioneering and comprehensive evaluation suite specifically developed to assess the advanced knowledge, reasoning, truthfulness, and instruction-following capabilities of foundation models in the Arabic context. AraEval includes a diverse set of evaluation tasks that test various dimensions of knowledge and reasoning, with a total of 24,378 samples. These tasks cover areas such as linguistic understanding, factual recall, logical inference, commonsense reasoning, mathematical problem-solving, and domain-specific expertise, ensuring that the evaluation goes beyond basic language comprehension. It covers multiple domains of knowledge, such as science, history, religion, and literature, ensuring that the LLMs are tested on a broad spectrum of topics relevant to Arabic-speaking contexts. AraEval is designed to facilitate comparisons across different foundation models, enabling LLM developers and users to benchmark performance effectively. In addition, it provides diagnostic insights to identify specific areas where models excel or struggle, guiding further development. AraEval datasets can be found at https://huggingface.co/collections/humain-ai/araeval-datasets-687760e04b12a7afb429a4a0.
pdf
bib
abs
QUIDS: Query Intent Description for Exploratory Search via Dual Space Modeling
Yumeng Wang
|
Xiuying Chen
|
Suzan Verberne
In exploratory search, users often submit vague queries to investigate unfamiliar topics, but receive limited feedback about how the search engine understood their input. This leads to a self-reinforcing cycle of mismatched results and trial-and-error reformulation. To address this, we study the task of generating user-facing natural language query intent descriptions that surface what the system likely inferred the query to mean, based on post-retrieval evidence. We propose QUIDS, a method that leverages dual-space contrastive learning to isolate intent-relevant information while suppressing irrelevant content. QUIDS combines a dual-encoder representation space with a disentangling decoder that works together to produce concise and accurate intent descriptions. Enhanced by intent-driven hard negative sampling, the model significantly outperforms state-of-the-art baselines across ROUGE, BERTScore, and human/LLM evaluations. Our qualitative analysis confirms QUIDS’ effectiveness in generating accurate intent descriptions for exploratory search. Our work contributes to improving the interaction between users and search engines by providing feedback to the user in exploratory search settings.
pdf
bib
abs
A Systematic Survey of Automatic Prompt Optimization Techniques
Kiran Ramnath
|
Kang Zhou
|
Sheng Guan
|
Soumya Smruti Mishra
|
Xuan Qi
|
Zhengyuan Shen
|
Shuai Wang
|
Sangmin Woo
|
Sullam Jeoung
|
Yawei Wang
|
Haozhu Wang
|
Han Ding
|
Yuzhe Lu
|
Zhichao Xu
|
Yun Zhou
|
Balasubramaniam Srinivasan
|
Qiaojing Yan
|
Yueyan Chen
|
Haibo Ding
|
Panpan Xu
|
Lin Lee Cheong
Since the advent of large language models (LLMs), prompt engineering has been a crucial step for eliciting desired responses for various Natural Language Processing (NLP) tasks. However, prompt engineering remains an impediment for end users due to rapid advances in models, tasks, and associated best practices. To mitigate this, Automatic Prompt Optimization (APO) techniques have recently emerged that use various automated techniques to help improve the performance of LLMs on various tasks. In this paper, we present a comprehensive survey summarizing the current progress and remaining challenges in this field. We provide a formal definition of APO, a 5-part unifying framework, and then proceed to rigorously categorize all relevant works based on their salient features therein. We hope to spur further research guided by our framework.
pdf
bib
abs
Threading the Needle: Reweaving Chain-of-Thought Reasoning to Explain Human Label Variation
Beiduo Chen
|
Yang Janet Liu
|
Anna Korhonen
|
Barbara Plank
The recent rise of reasoning-tuned Large Language Models (LLMs)—which generate chains of thought (CoTs) before giving the final answer—has attracted significant attention and offers new opportunities for gaining insights into human label variation, which refers to plausible differences in how multiple annotators label the same data instance. Prior work has shown that LLM-generated explanations can help align model predictions with human label distributions, but typically adopts a *reverse* paradigm: producing explanations based on given answers. In contrast, CoTs provide a *forward* reasoning path that may implicitly embed rationales for each answer option, before generating the answers. We thus propose a novel LLM-based pipeline enriched with linguistically-grounded discourse segmenters to extract supporting and opposing statements for each answer option from CoTs with improved accuracy. We also propose a rank-based HLV evaluation framework that prioritizes the ranking of answers over exact scores, which favor direct comparison of label distributions. Our method outperforms a direct generation method as well as baselines on three datasets, and shows better alignment of ranking methods with humans, highlighting the effectiveness of our approach.
pdf
bib
abs
MemInsight: Autonomous Memory Augmentation for LLM Agents
Rana Salama
|
Jason Cai
|
Michelle Yuan
|
Anna Currey
|
Monica Sunkara
|
Yi Zhang
|
Yassine Benajiba
Large language model (LLM) agents have evolved to intelligently process information, make decisions, and interact with users or tools. A key capability is the integration of long-term memory, enabling these agents to draw upon historical interactions and knowledge. However, the growing memory size and need for semantic structuring pose significant challenges. In this work, we propose an autonomous memory augmentation approach, MemInsight, to enhance semantic data representation and retrieval mechanisms. By leveraging autonomous augmentation of historical interactions, LLM agents are shown to deliver more accurate and contextualized responses. We empirically validate the efficacy of our proposed approach in three task scenarios: conversational recommendation, question answering, and event summarization. On the LLM-REDIAL dataset, MemInsight boosts the persuasiveness of recommendations by up to 14%. Moreover, it outperforms a RAG baseline by 34% in recall for LoCoMo retrieval. Our empirical results show the potential of MemInsight to enhance the contextual performance of LLM agents across multiple tasks.
pdf
bib
abs
Breaking the Noise Barrier: LLM-Guided Semantic Filtering and Enhancement for Multi-Modal Entity Alignment
Chenglong Lu
|
Chenxiao Li
|
Jingwei Cheng
|
Yongquan Ji
|
Guoqing Chen
|
Fu Zhang
Multi-modal entity alignment (MMEA) aims to identify equivalent entities between two multimodal knowledge graphs (MMKGs). However, the intrinsic noise within modalities, such as inconsistency in the visual modality and redundant attributes, has not been thoroughly investigated. Excessive noise not only weakens semantic representation but also increases the risk of overfitting in attention-based fusion methods. To address this, we propose LGEA, a novel LLM-guided MMEA framework that prioritizes noise reduction before fusion. Specifically, LGEA introduces two key strategies: (1) fine-grained visual filtering to remove irrelevant images at the semantic level, and (2) contextual summarization of attribute information to enhance entity semantics. To our knowledge, this is the first work to apply LLMs for both visual filtering and attribute-level semantic enhancement in MMEA. Experiments on multiple benchmarks, including the noisy FB YG dataset, show that LGEA sets a new state-of-the-art (SOTA) in robust multi-modal alignment, highlighting the potential of noise-aware strategies as a promising direction for future MMEA research.
pdf
bib
abs
ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge
Zeinab Sadat Taghavi
|
Ali Modarressi
|
Yunpu Ma
|
Hinrich Schuetze
Retrieval systems are central to many NLP pipelines, but often rely on surface-level cues such as keyword overlap and lexical semantic similarity. To evaluate retrieval beyond these shallow signals, recent benchmarks introduce reasoning-heavy queries; however, they primarily shift the burden to query-side processing techniques – like prompting or multi-hop retrieval – that can help resolve complexity. In contrast, we present ImpliRet, a benchmark that shifts the reasoning challenge to document-side processing: the queries are simple, but relevance depends on facts stated implicitly in documents through temporal (e.g., resolving “two days ago”), arithmetic, and world knowledge relationships. We evaluate a range of sparse and dense retrievers, all of which struggle in this setting: the best nDCG@10 is only 14.91%. We also test whether long-context models can overcome this limitation, but even with a short context of only thirty documents, including the positive document, GPT-o4-mini scores only 55.54%, showing that document-side reasoning remains a challenge. Our code is available at github.com/ZeinabTaghavi/IMPLIRET.
pdf
bib
abs
No Need for Explanations: LLMs can implicitly learn from mistakes in-context
Lisa Alazraki
|
Maximilian Mozes
|
Jon Ander Campos
|
Tan Yi-Chern
|
Marek Rei
|
Max Bartolo
Showing incorrect answers to Large Language Models (LLMs) is a popular strategy to improve their performance in reasoning-intensive tasks. It is widely assumed that, in order to be helpful, the incorrect answers must be accompanied by comprehensive rationales, explicitly detailing where the mistakes are and how to correct them. However, in this work we present a counterintuitive finding: we observe that LLMs perform *better* in math reasoning tasks when these rationales are eliminated from the context and models are left to infer on their own what makes an incorrect answer flawed. This approach also substantially outperforms chain-of-thought prompting in our evaluations. These results are consistent across LLMs of different sizes and varying reasoning abilities. To gain an understanding of *why* LLMs learn from mistakes more effectively without explicit corrective rationales, we perform a thorough analysis, investigating changes in context length and answer diversity between different prompting strategies, and their effect on performance. We also examine evidence of overfitting to the in-context rationales when these are provided, and study the extent to which LLMs are able to autonomously infer high-quality corrective rationales given only incorrect answers as input. We find evidence that, while incorrect answers are more beneficial for LLM learning than additional diverse *correct* answers, explicit corrective rationales over-constrain the model, thus limiting those benefits.
pdf
bib
abs
MoVa: Towards Generalizable Classification of Human Morals and Values
Ziyu Chen
|
Junfei Sun
|
Chenxi Li
|
Tuan Dung Nguyen
|
Jing Yao
|
Xiaoyuan Yi
|
Xing Xie
|
Chenhao Tan
|
Lexing Xie
Identifying human morals and values embedded in language is essential to empirical studies of communication. However, researchers often face substantial difficulty navigating the diversity of theoretical frameworks and data available for their analysis. Here, we contribute MoVa, a well-documented suite of resources for generalizable classification of human morals and values, consisting of (1) 16 labeled datasets and benchmarking results from four theoretically-grounded frameworks; (2) a lightweight LLM prompting strategy that outperforms fine-tuned models across multiple domains and frameworks; and (3) a new application that helps evaluate psychological surveys. In practice, we specifically recommend a classification strategy, all@once, that scores all related concepts simultaneously, resembling the well-known multi-label classifier chain. The data and methods in MoVa can facilitate many fine-grained interpretations of human and machine communication, with potential implications for the alignment of machine behavior.
pdf
bib
abs
GUI-Bee: Align GUI Action Grounding to Novel Environments via Autonomous Exploration
Yue Fan
|
Handong Zhao
|
Ruiyi Zhang
|
Yu Shen
|
Xin Eric Wang
|
Gang Wu
Graphical User Interface (GUI) action grounding, mapping language instructions to actionable elements on GUI screens, is important for assisting users in interactive tutorials, task automation, accessibility support, etc. Most recent works of GUI action grounding use large GUI datasets to fine-tune Multimodal Large Language Models (MLLMs). However, the fine-tuning data is inherently limited to specific GUI environments, leading to significant performance degradation in novel environments due to the generalization challenges in the GUI domain. Therefore, we argue that GUI action grounding models should be further aligned with novel environments before deployment to optimize their performance. To address this, we first propose GUI-Bee, an MLLM-based autonomous agent, to collect high-quality, environment-specific data through exploration and then continuously fine-tune GUI grounding models with the collected data. To ensure the GUI action grounding models generalize to various screens within the target novel environment after the continuous fine-tuning, we equip GUI-Bee with a novel Q-value-Incentive In-Context Reinforcement Learning (Q-ICRL) algorithm that optimizes exploration efficiency and exploration data quality. In the experiment, we introduce NovelScreenSpot to test how well the data can help align GUI action grounding models to novel environments. Furthermore, we conduct an ablation study to validate the Q-ICRL method in enhancing the efficiency of GUI-Bee.
pdf
bib
abs
Revealing and Mitigating the Challenge of Detecting Character Knowledge Errors in LLM Role-Playing
Wenyuan Zhang
|
Shuaiyi Nie
|
Jiawei Sheng
|
Zefeng Zhang
|
Xinghua Zhang
|
Yongquan He
|
Tingwen Liu
Large language model (LLM) role-playing has gained widespread attention. Authentic character knowledge is crucial for constructing realistic LLM role-playing agents. However, existing works usually overlook LLMs’ ability to detect characters’ known knowledge errors (KKE) and unknown knowledge errors (UKE) while playing roles, which leads to low-quality automatic construction of character training corpora. In this paper, we propose RoleKE-Bench to evaluate LLMs’ ability to detect KKE and UKE. The results indicate that even the latest LLMs struggle to detect these two types of errors effectively, especially when it comes to familiar knowledge. We experiment with various reasoning strategies and propose an agent-based reasoning method, Self-Recollection and Self-Doubt (S2RD), to further explore the potential for improving error detection capabilities.
pdf
bib
abs
Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning
Jiazheng Liu
|
Sipeng Zheng
|
Börje F. Karlsson
|
Zongqing Lu
Multimodal large language models (MLLMs), built on large-scale pre-trained vision towers and language models, have shown great capabilities in multimodal understanding. However, most existing MLLMs are trained on single-turn vision question-answering tasks, which do not accurately reflect real-world human conversations. In this paper, we introduce MMDiag, a new large-scale multi-turn multimodal dialogue dataset. This dataset is collaboratively generated through deliberately designed rules and GPT assistance, featuring complex dialogues with contextual dependencies that force models to track, ground, and recall information across multiple turns and disparate visual regions. MMDiag serves as a strong benchmark for multi-turn multimodal dialogue learning and brings more challenges to the grounding and reasoning capabilities of MLLMs. Further, inspired by human vision processing we present DiagNote, equipped with multimodal grounding and reasoning capabilities. DiagNote adopts a novel dual-module architecture that explicitly separates reasoning from grounding: a reasoning module (Deliberate) performs step-by-step Chain-of-Thought, while a grounding module (Gaze) provides precise visual focus by predicting bounding box annotations. These modules interact iteratively, enabling DiagNote to dynamically refine its understanding. We empirically demonstrate the advantages of DiagNote in both grounding and jointly processing and reasoning with vision and language information over existing MLLMs.
pdf
bib
abs
Graph-Based Multi-Trait Essay Scoring
Shengjie Li
|
Vincent Ng
While virtually all existing work on Automated Essay Scoring (AES) models an essay as a word sequence, we put forward the novel view that an essay can be modeled as a graph and subsequently propose GAT-AES, a graph-attention network approach to AES. GAT-AES models the interactions among essay traits in a principled manner by (1) representing each essay trait as a trait node in the graph and connecting each pair of trait nodes with directed edges, and (2) allowing neighboring nodes to influence each other by using a convolutional operator to update node representations. Unlike competing approaches, which can only model one-hop dependencies, GAT-AES allows us to easily model multi-hop dependencies. Experimental results demonstrate that GAT-AES achieves the best multi-trait scoring results to date on the ASAP++ dataset. Further analysis shows that GAT-AES outperforms not only alternative graph neural networks but also approaches that use trait-attention mechanisms to model trait dependencies.
pdf
bib
abs
Benchmarking LLMs on Semantic Overlap Summarization
John Salvador
|
Naman Bansal
|
Mousumi Akter
|
Souvika Sarkar
|
Anupam Das
|
Santu Karmaker
Semantic Overlap Summarization (SOS) is a multi-document summarization task focused on extracting the common information shared across alternative narratives, a capability that is critical for trustworthy generation in domains such as news, law, and healthcare. We benchmark popular Large Language Models (LLMs) on SOS and introduce PrivacyPolicyPairs (3P), a new dataset of 135 high-quality samples from privacy policy documents, which complements existing resources and broadens domain coverage. Using the TELeR prompting taxonomy, we evaluate nearly one million LLM-generated summaries across two SOS datasets and conduct human evaluation on a curated subset. Our analysis reveals strong prompt sensitivity, identifies which automatic metrics align most closely with human judgments, and provides new baselines for future SOS research.
pdf
bib
abs
N-CORE: N-View Consistency Regularization for Disentangled Representation Learning in Nonverbal Vocalizations
Siddhant Bikram Shah
|
Kristina T. Johnson
Nonverbal vocalizations are an essential component of human communication, conveying rich information without linguistic content. However, their computational analysis is hindered by a lack of lexical anchors in the data, compounded by biased and imbalanced data distributions. While disentangled representation learning has shown promise in isolating specific speech features, its application to nonverbal vocalizations remains unexplored. In this paper, we introduce N-CORE, a novel backbone-agnostic framework designed to disentangle intertwined features like emotion and speaker information from nonverbal vocalizations by leveraging N views of audio samples to learn invariance to specific transformations. N-CORE achieves competitive performance compared to state-of-the-art methods for emotion and speaker classification on the VIVAE, ReCANVo, and ReCANVo-Balanced datasets. We further propose an emotion perturbation function that disrupts affective information while preserving speaker information in audio signals for emotion-invariant speaker classification. Our work informs research directions on paralinguistic speech processing, including clinical diagnoses of atypical speech and longitudinal analysis of communicative development. Our code is available at https://github.com/SiddhantBikram/N-CORE.
pdf
bib
abs
Probability Distribution Collapse: A Critical Bottleneck to Compact Unsupervised Neural Grammar Induction
Jinwook Park
|
Kangil Kim
Unsupervised neural grammar induction aims to learn interpretable hierarchical structures from language data. However, existing models face an expressiveness bottleneck, often resulting in unnecessarily large yet underperforming grammars. We identify a core issue, *probability distribution collapse*, as the underlying cause of this limitation. We analyze when and how the collapse emerges across key components of neural parameterization and introduce a targeted solution, *collapse-relaxing neural parameterization*, to mitigate it. Our approach substantially improves parsing performance while enabling the use of significantly more compact grammars across a wide range of languages, as demonstrated through extensive empirical analysis.
pdf
bib
abs
Spatial Layouts in News Homepages Capture Human Preferences
Alexander Spangher
|
Michael Vu
|
Arda Kaz
|
Naitian Zhou
|
Ben Welsh
Information prioritization plays an important role in the way we perceive and understand the world. Homepage layouts, which are daily and manually curated by expert human news editors, serve as a tangible proxy for this prioritization. In this work, we present NewsHomepages, a novel and massive dataset of over 3,000 news website homepages, including local, national, and topic-specific outlets, captured twice daily over a five-year period. We develop a scalable pairwise preference model to capture ranked preferences between news items and confirm that these preferences are stable and learnable: our models infer editorial preference with over 0.7 F1 score (based on human trials). To demonstrate the importance of these learned preferences, we (1) perform a novel analysis showing that outlets across the political spectrum share surprising preference agreements and (2) apply our models to rank-order a collection of local city council policies passed over a ten-year period in San Francisco, assessing their “newsworthiness”. Our findings lay the groundwork for leveraging implicit cues to deepen our understanding of human informational preference.
pdf
bib
abs
KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts
Taebaek Hwang
|
Minseo Kim
|
Gisang Lee
|
Seonuk Kim
|
Hyunjun Eun
Understanding and reasoning over text within visual contexts poses a significant challenge for Vision-Language Models (VLMs), given the complexity and diversity of real-world scenarios. To address this challenge, text-rich Visual Question Answering (VQA) datasets and benchmarks have emerged for high-resource languages like English. However, a critical gap persists for low-resource languages such as Korean, where the lack of comprehensive benchmarks hinders robust model evaluation and comparison. To bridge this gap, we introduce KRETA, a benchmark for Korean Reading and rEasoning in Text-rich VQA Attuned to diverse visual contexts. KRETA facilitates an in-depth evaluation of both visual text understanding and reasoning capabilities, while also supporting a multifaceted assessment across 15 domains and 26 image types. Additionally, we introduce a semi-automated VQA generation pipeline specifically optimized for text-rich settings, leveraging refined stepwise image decomposition and a rigorous seven-metric evaluation protocol to ensure data quality. While KRETA is tailored for Korean, we hope our adaptable and extensible pipeline will facilitate the development of similar benchmarks in other languages, thereby accelerating multilingual VLM research. The code and dataset for KRETA are available at [https://github.com/tabtoyou/KRETA](https://github.com/tabtoyou/KRETA).
pdf
bib
abs
ReflAct: World-Grounded Decision Making in LLM Agents via Goal-State Reflection
Jeonghye Kim
|
Sojeong Rhee
|
Minbeom Kim
|
Dohyung Kim
|
Sangmook Lee
|
Youngchul Sung
|
Kyomin Jung
Recent advances in LLM agents have largely built on reasoning backbones like ReAct, which interleave thought and action in complex environments. However, ReAct often produces ungrounded or incoherent reasoning steps, leading to misalignment between the agent’s actual state and goals. Our analysis finds that this stems from ReAct’s inability to maintain consistent internal beliefs and goal alignment, causing compounding errors and hallucinations. To address this, we introduce ReflAct, a novel backbone that shifts reasoning from merely planning next actions to continuously reflecting on the agent’s state relative to its goal. By explicitly grounding decisions in states and enforcing ongoing goal alignment, ReflAct dramatically improves strategic reliability. This design delivers substantial empirical gains: ReflAct surpasses ReAct by 27.7% on average, achieving a 93.3% success rate in ALFWorld. Notably, ReflAct even outperforms ReAct with added enhancement modules (e.g., Reflexion, WKM), showing that strengthening the core reasoning backbone is key to reliable agent performance.
pdf
bib
abs
CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward
Shudong Liu
|
Hongwei Liu
|
Junnan Liu
|
Linchen Xiao
|
Songyang Gao
|
Chengqi Lyu
|
Yuzhe Gu
|
Wenwei Zhang
|
Derek F. Wong
|
Songyang Zhang
|
Kai Chen
Answer verification is crucial not only for evaluating large language models (LLMs) by matching their unstructured outputs against standard answers, but it also serves as the reward model that guides LLM optimization. Most evaluation frameworks rely on regularized matching or employ general LLMs for answer verification, which demands extensive, repetitive customization for regex rules or evaluation prompts. Two fundamental limitations persist in current methodologies: 1) the absence of comprehensive benchmarks that systematically evaluate verification capabilities across different LLMs; and 2) the nascent stage of verifier development, where existing approaches lack both the robustness to handle complex edge cases and the generalizability across different domains. In this work, we develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types including multi-subproblems, formulas, and sequence answers, while effectively identifying abnormal/invalid responses. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta error patterns to enhance CompassVerifier. We anticipate that CompassVerifier and VerifierBench will facilitate evaluation protocols and reinforcement learning research.
pdf
bib
abs
A Knowledge-driven Adaptive Collaboration of LLMs for Enhancing Medical Decision-making
Xiao Wu
|
Ting-Zhu Huang
|
Liang-Jian Deng
|
Yanyuan Qiao
|
Imran Razzak
|
Yutong Xie
Medical decision-making often involves integrating knowledge from multiple clinical specialties, typically achieved through multidisciplinary teams. Inspired by this collaborative process, recent work has leveraged large language models (LLMs) in multi-agent collaboration frameworks to emulate expert teamwork. While these approaches improve reasoning through agent interaction, they are limited by static, pre-assigned roles, which hinder adaptability and dynamic knowledge integration. To address these limitations, we propose KAMAC, a Knowledge-driven Adaptive Multi-Agent Collaboration framework that enables LLM agents to dynamically form and expand expert teams based on the evolving diagnostic context. KAMAC begins with one or more expert agents and then conducts a knowledge-driven discussion to identify and fill knowledge gaps by recruiting additional specialists as needed. This supports flexible, scalable collaboration in complex clinical scenarios, with decisions finalized through reviewing updated agent comments. Experiments on two real-world medical benchmarks demonstrate that KAMAC significantly outperforms both single-agent and advanced multi-agent methods, particularly in complex clinical scenarios (i.e., cancer prognosis) requiring dynamic, cross-specialty expertise. Our code is publicly available at: https://github.com/XiaoXiao-Woo/KAMAC.
pdf
bib
abs
Castle: Causal Cascade Updates in Relational Databases with Large Language Models
Yongye Su
|
Yucheng Zhang
|
Zeru Shi
|
Bruno Ribeiro
|
Elisa Bertino
This work introduces Castle, the first framework for schema-only cascade update generation using large language models (LLMs). Despite recent advances in LLMs for Text2SQL code generation, existing approaches focus primarily on SELECT queries, neglecting the challenges of SQL update operations and their ripple effects. Traditional CASCADE UPDATE constraints are static and unsuitable for modern, denormalized databases, which demand dynamic, context-aware updates. Castle enables natural language instructions to trigger multi-column, causally consistent SQL UPDATE statements, without revealing table content to the model. By framing UPDATE SQL generation as a divide-and-conquer task with LLMs’ reasoning capacity, Castle can determine not only which columns must be directly updated, but also how those updates propagate through the schema, causing cascading updates — all via nested queries and substructures that ensure data confidentiality. We evaluate it on real-world causal update scenarios, demonstrating its ability to produce accurate SQL updates, and thereby highlighting the reasoning ability of LLMs in automated DBMS.
pdf
bib
abs
Idiosyncratic Versus Normative Modeling of Atypical Speech Recognition: Dysarthric Case Studies
Vishnu Raja
|
Adithya V Ganesan
|
Anand Syamkumar
|
Ritwik Banerjee
|
H. Schwartz
State-of-the-art automatic speech recognition (ASR) models like Whisper perform poorly on atypical speech, such as that produced by individuals with dysarthria. Past works for atypical speech have mostly investigated fully personalized (or idiosyncratic) models, but modeling strategies that can both generalize and handle idiosyncrasy could be more effective for capturing atypical speech. To investigate this, we compare four strategies: (a) *normative* models trained on typical speech (no personalization), (b) *idiosyncratic* models completely personalized to individuals, (c) *dysarthric-normative* models trained on other dysarthric speakers, and (d) *dysarthric-idiosyncratic* models which combine strategies by first modeling normative patterns before adapting to individual speech. In this case study, we find the dysarthric-idiosyncratic model performs better than the idiosyncratic approach while requiring less than half as much personalized data (36.43 WER with 128 train size vs. 36.99 with 256). Further, we found that tuning the speech encoder alone (as opposed to the LM decoder) yielded the best results, reducing word error rate from 71% to 32% on average. Our findings highlight the value of leveraging both normative (cross-speaker) and idiosyncratic (speaker-specific) patterns to improve ASR for underrepresented speech populations. [GitHub: VishnuRaja98/Dysarthric-Speech-Transcription](https://github.com/VishnuRaja98/Dysarthric-Speech-Transcription)
pdf
bib
abs
NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls
Kinjal Basu
|
Ibrahim Abdelaziz
|
Kiran Kate
|
Mayank Agarwal
|
Maxwell Crouse
|
Yara Rizk
|
Kelsey Bradford
|
Asim Munawar
|
Sadhana Kumaravel
|
Saurabh Goyal
|
Xin Wang
|
Luis A. Lastras
|
Pavan Kapanipathi
The resurgence of autonomous agents built using large language models (LLMs) to solve complex real-world tasks has brought increased focus on LLMs’ fundamental ability of tool or function calling. At the core of these agents, an LLM must plan, execute, and respond using external tools, APIs, and custom functions. Research on tool calling has gathered momentum, but evaluation benchmarks and datasets representing the complexity of the tasks have lagged behind. In this work, we focus on one such complexity, nested sequencing, with the goal of extending existing benchmarks and evaluation. Specifically, we present NESTFUL, a benchmark to evaluate LLMs on nested sequences of API calls, i.e., sequences where the output of one API call is passed as input to a subsequent call. NESTFUL contains 1800+ nested sequences where all the function calls are executable. Experimental results on a variety of models show that the best-performing model (GPT-4o) achieves a full sequence match accuracy of 28% and a win-rate of 60%, necessitating a large scope for improvement in the nested sequencing aspect of function calling. Our analysis of these results provides possible future research directions for the community, in addition to a benchmark to track progress.
pdf
bib
abs
Benchmarking and Mitigating MCQA Selection Bias of Large Vision-Language Models
Md. Atabuzzaman
|
Ali Asgarov
|
Chris Thomas
Large Vision-Language Models (LVLMs) have achieved strong performance on vision-language tasks, particularly Visual Question Answering (VQA). While prior work has explored unimodal biases in VQA, the problem of selection bias in Multiple-Choice Question Answering (MCQA), where models may favor specific option tokens (e.g., “A”) or positions, remains underexplored. In this paper, we investigate both the presence and nature of selection bias in LVLMs through fine-grained MCQA benchmarks spanning easy, medium, and hard difficulty levels, defined by the semantic similarity of the options. We further propose an inference-time logit-level debiasing method that estimates an ensemble bias vector from general and contextual prompts and applies confidence-adaptive corrections to the model’s output. Our method mitigates bias without retraining and is compatible with frozen LVLMs. Extensive experiments across several state-of-the-art models reveal consistent selection biases that intensify with task difficulty, and show that our mitigation approach significantly reduces bias while improving accuracy in challenging settings. This work offers new insights into the limitations of LVLMs in MCQA and presents a practical approach to improve their robustness in fine-grained visual reasoning. Datasets and code are available at: https://github.com/Atabuzzaman/Selection-Bias-of-LVLMs
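The described debiasing operates at the logit level: an option-bias vector is estimated from general and contextual prompts and subtracted from the question's option logits with a confidence-adaptive correction. The sketch below is hedged: the specific weighting scheme and numbers are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of inference-time logit debiasing for MCQA.
import numpy as np

def debias_option_logits(option_logits: np.ndarray, bias_logits: np.ndarray) -> np.ndarray:
    # softmax confidence of the uncorrected prediction
    probs = np.exp(option_logits - option_logits.max())
    probs /= probs.sum()
    alpha = 1.0 - probs.max()            # assumed confidence-adaptive weight: correct less when confident
    return option_logits - alpha * bias_logits

# bias vector, e.g. averaged option logits ("A"/"B"/"C"/"D") over content-free prompts
bias = np.array([2.1, 1.3, 1.0, 0.9])
corrected = debias_option_logits(np.array([1.8, 1.9, 0.5, 0.4]), bias)
print(corrected.argmax())  # debiased prediction index
```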
pdf
bib
abs
Can Large Language Models Unlock Novel Scientific Research Ideas?
Sandeep Kumar
|
Tirthankar Ghosal
|
Vinayak Goyal
|
Asif Ekbal
The widespread adoption of Large Language Models (LLMs) and publicly available ChatGPT have marked a significant turning point in the integration of Artificial Intelligence (AI) into people’s everyday lives. This study explores the capability of LLMs in generating novel research ideas based on information from research papers. We conduct a thorough examination of four LLMs across five domains (Chemistry, Computer, Economics, Medical, and Physics). We found that the future research ideas generated by Claude-2 and GPT-4 are more aligned with the author’s perspective than those generated by GPT-3.5 and Gemini. We also found that Claude-2 generates more diverse future research ideas than GPT-4, GPT-3.5, and Gemini 1.0. We further performed a human evaluation of the novelty, relevancy, and feasibility of the generated future research ideas. This investigation offers insights into the evolving role of LLMs in idea generation, highlighting both their capabilities and limitations. Our work contributes to the ongoing efforts in evaluating and utilizing language models for generating future research ideas. We make our datasets and code publicly available.
pdf
bib
abs
Word Salad Chopper: Reasoning Models Waste A Ton Of Decoding Budget On Useless Repetitions, Self-Knowingly
Wenya Xie
|
Shaochen Zhong
|
Hoang Anh Duy Le
|
Zhaozhuo Xu
|
Jianwen Xie
|
Zirui Liu
Large Reasoning Models (LRMs) are often bottlenecked by the high cost of output tokens. We show that a significant portion of these tokens are useless self-repetitions — what we call “word salad” — that exhaust the decoding budget without adding value. Interestingly, we observe that LRMs are self-aware when trapped in these loops: the hidden states of ‘‘ tokens trailing each reasoning chunk exhibit patterns that allow us to detect word salad behavior on-the-fly via a single linear classifier. Once detected, a simple chop appended by a straightforward regeneration prompt yields substantial length savings with minimal quality loss. Our work offers WordSaladChopper (WSC) — a lightweight, turnkey component for LRM that is minimally invasive to its reasoning trajectory. Given its low overhead, strong savings, and the lack of semantic value of word salad tokens, we believe it is not too far-fetched to argue that WSC — or a similar component — is a must-have for all LRM applications with user experience in mind.
pdf
bib
abs
DIWALI - Diversity and Inclusivity aWare cuLture specific Items for India: Dataset and Assessment of LLMs for Cultural Text Adaptation in Indian Context
Pramit Sahoo
|
Maharaj Brahma
|
Maunendra Sankar Desarkar
Large language models (LLMs) are widely used in various tasks and applications. However, despite their wide capabilities, they are shown to lack cultural alignment (CITATION) and produce biased generations (CITATION) due to a lack of cultural knowledge and competence. Evaluation of LLMs for cultural awareness and alignment is particularly challenging due to the lack of proper evaluation metrics and the unavailability of culturally grounded datasets representing the vast complexity of cultures at the regional and sub-regional levels. Existing datasets for culture-specific items (CSIs) focus primarily on concepts at the regional level and may contain false positives. To address this issue, we introduce a novel CSI dataset for Indian culture, covering 17 cultural facets. The dataset comprises ~8k cultural concepts from 36 sub-regions. To measure the cultural competence of LLMs on a cultural text adaptation task, we evaluate the adaptations using the CSIs created, LLM-as-Judge, and human evaluations from diverse socio-demographic regions. Furthermore, we perform quantitative analysis demonstrating selective sub-regional coverage and surface-level adaptations across all considered LLMs. Our dataset is available at https://huggingface.co/datasets/nlip/DIWALI (see also the project webpage), and our codebase with model outputs can be found at https://github.com/pramitsahoo/culture-evaluation.
pdf
bib
abs
SYNC: A Synthetic Long-Context Understanding Benchmark for Controlled Comparisons of Model Capabilities
Shuyang Cao
|
Kaijian Zou
|
Lu Wang
Recently, researchers have turned to synthetic tasks for evaluation of large language models’ long-context capabilities, as they offer more flexibility than realistic benchmarks in scaling both input length and dataset size. However, existing synthetic tasks typically target narrow skill sets such as retrieving information from massive input, limiting their ability to comprehensively assess model capabilities. Furthermore, existing benchmarks often pair each task with a different input context, creating confounding factors that prevent fair cross-task comparison. To address these limitations, we introduce SYNC, a new evaluation suite of synthetic tasks spanning domains including graph understanding and translation. Each domain includes three tasks designed to test a wide range of capabilities—from retrieval, to multi-hop tracking, to global context understanding that requires chain-of-thought (CoT) reasoning. Crucially, all tasks share the same context, enabling controlled comparisons of model performance. We evaluate 14 LLMs on SYNC and observe substantial performance drops on more challenging tasks, underscoring the benchmark’s difficulty. Additional experiments highlight the necessity of CoT reasoning and demonstrate that SYNC poses a robust challenge for future models.
pdf
bib
abs
OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages
Chester Palen-Michel
|
Maxwell Pickering
|
Maya Kruse
|
Jonne Sälevä
|
Constantine Lignos
We present OpenNER 1.0, a standardized collection of openly-available named entity recognition (NER) datasets. OpenNER contains 36 NER corpora that span 52 languages, human-annotated in varying named entity ontologies. We correct annotation format issues, standardize the original datasets into a uniform representation with consistent entity type names across corpora, and provide the collection in a structure that enables research in multilingual and multi-ontology NER. We provide baseline results using three pretrained multilingual language models and two large language models to compare the performance of recent models and facilitate future research in NER. We find that no single model is best in all languages and that significant work remains to obtain high performance from LLMs on the NER task. OpenNER is released at https://github.com/bltlab/open-ner.
pdf
bib
abs
Mondrian: A Framework for Logical Abstract (Re)Structuring
Elizabeth Grace Orwig
|
Shinwoo Park
|
Hyundong Jin
|
Yo-Sub Han
The well-known rhetorical framework, ABT (And, But, Therefore), mirrors natural human cognition in structuring an argument’s logical progression, which is apropos to academic communication. However, distilling the complexities of research into clear and concise prose requires careful sequencing of ideas and formulating clear connections between them. This presents a quiet inequity for contributions from authors who struggle with English proficiency or academic writing conventions. We see this as impetus to introduce Mondrian, a framework that identifies the key components of an abstract and reorders them to reflect the ABT logical progression. The framework is composed of a deconstruction stage, a reconstruction stage, and a rephrasing stage. We introduce a novel metric for evaluating deviation from ABT structure, named EB-DTW, which accounts for both ordinality and a non-uniform distribution of importance in a sequence. Our overall approach aims to improve the comprehensibility of academic writing, particularly for non-native English speakers, along with a complementary metric. The effectiveness of Mondrian is tested with automatic metrics and extensive human evaluation, and demonstrated through impressive quantitative and qualitative results, with the organization and overall coherence of an abstract improving by averages of 27.71% and 24.71%, respectively.
pdf
bib
abs
Case-Based Decision-Theoretic Decoding with Quality Memories
Hiroyuki Deguchi
|
Masaaki Nagata
Minimum Bayes risk (MBR) decoding is a decision rule for text generation that selects the hypothesis maximizing the expected utility, and it robustly generates higher-quality texts than maximum a posteriori (MAP) decoding. However, it depends on sample texts drawn from the text generation model; thus, it is difficult to find a hypothesis that correctly captures out-of-domain knowledge or information. To tackle this issue, we propose case-based decision-theoretic (CBDT) decoding, another method that estimates the expected utility using examples of domain data. CBDT decoding not only generates higher-quality texts than MAP decoding; the combination of MBR and CBDT decoding also outperforms MBR decoding in seven-domain De–En and Ja↔En translation tasks and in image captioning tasks on the MSCOCO and nocaps datasets.
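Both decision rules pick the hypothesis with the highest estimated expected utility; they differ in where the pseudo-references come from (model samples for MBR, in-domain examples for CBDT). A toy sketch with a made-up token-overlap utility, not the paper's implementation:

```python
# Toy sketch contrasting MBR and CBDT decoding with a simple overlap utility.
def utility(hyp: str, ref: str) -> float:
    h, r = set(hyp.split()), set(ref.split())
    return len(h & r) / max(len(h | r), 1)

def mbr_decode(hypotheses: list[str], samples: list[str]) -> str:
    # MBR: expected utility estimated from model samples (pseudo-references)
    return max(hypotheses, key=lambda h: sum(utility(h, s) for s in samples) / len(samples))

def cbdt_decode(hypotheses: list[str], domain_examples: list[str]) -> str:
    # CBDT: expected utility estimated from in-domain example texts instead
    return max(hypotheses, key=lambda h: sum(utility(h, e) for e in domain_examples) / len(domain_examples))

hyps = ["the patient shows fever", "patient have a fever"]
print(mbr_decode(hyps, ["the patient has a fever", "the patient shows a fever"]))
print(cbdt_decode(hyps, ["the patient presents with fever and cough"]))
```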
pdf
bib
abs
PRIME: Large Language Model Personalization with Cognitive Dual-Memory and Personalized Thought Process
Xinliang Frederick Zhang
|
Nicholas Beauchamp
|
Lu Wang
Large language model (LLM) personalization aims to align model outputs with individuals’ unique preferences and opinions. While recent efforts have implemented various personalization methods, a unified theoretical framework that can systematically understand the drivers of effective personalization is still lacking. In this work, we integrate the well-established cognitive dual-memory model into LLM personalization, by mirroring episodic memory to historical user engagements and semantic memory to long-term, evolving user beliefs. Specifically, we systematically investigate memory instantiations and introduce a unified framework, PRIME, using episodic and semantic memory mechanisms. We further augment PRIME with a novel personalized thinking capability inspired by the slow thinking strategy. Moreover, recognizing the absence of suitable benchmarks, we introduce a dataset using Change My View (CMV) from Reddit, specifically designed to evaluate long-context personalization. Extensive experiments validate PRIME’s effectiveness across both long- and short-context scenarios. Further analysis confirms that PRIME effectively captures dynamic personalization beyond mere popularity biases.
pdf
bib
abs
Mechanisms vs. Outcomes: Probing for Syntax Fails to Explain Performance on Targeted Syntactic Evaluations
Ananth Agarwal
|
Jasper Jian
|
Christopher D Manning
|
Shikhar Murty
Large Language Models (LLMs) exhibit a robust mastery of syntax when processing and generating text. While this suggests internalized understanding of hierarchical syntax and dependency relations, the precise mechanism by which they represent syntactic structure is an open area within interpretability research. Probing provides one way to identify syntactic mechanisms linearly encoded in activations; however, no comprehensive study has yet established whether a model’s probing accuracy reliably predicts its downstream syntactic performance. Adopting a “mechanisms vs. outcomes” framework, we evaluate 32 open-weight transformer models and find that syntactic features extracted via probing fail to predict outcomes of targeted syntax evaluations across English linguistic phenomena. Our results highlight a substantial disconnect between latent syntactic representations found via probing and observable syntactic behaviors in downstream tasks.
pdf
bib
abs
Image Difference Captioning via Adversarial Preference Optimization
Zihan Huang
|
Junda Wu
|
Rohan Surana
|
Tong Yu
|
David Arbour
|
Ritwik Sinha
|
Julian McAuley
Image Difference Captioning (IDC) aims to generate natural language descriptions that highlight subtle differences between two visually similar images. While recent advances leverage pre-trained vision-language models to align fine-grained visual differences with textual semantics, existing supervised approaches often overfit to dataset-specific language patterns and fail to capture accurate preferences on IDC, which often indicates fine-grained and context-aware distinctions. To address these limitations, we propose an adversarial direct preference optimization (ADPO) framework for IDC, which formulates IDC as a preference optimization problem under the Bradley-Terry-Luce model, directly aligning the captioning policy with pairwise difference preferences via Direct Preference Optimization (DPO). To model more accurate and diverse IDC preferences, we introduce an adversarially trained hard negative retriever that selects counterfactual captions. This results in a minimax optimization problem, which we solve via policy-gradient reinforcement learning, enabling the policy and retriever to improve jointly. Experiments on benchmark IDC datasets show that our approach outperforms existing baselines, especially in generating fine-grained and accurate difference descriptions.
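The preference-optimization step follows the standard DPO objective under the Bradley-Terry-Luce model. A minimal sketch of that loss is shown below (the adversarial hard-negative retriever and the minimax training loop are not shown):

```python
# Standard DPO loss over preferred/dispreferred captions.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """Inputs are tensors of summed log-probabilities of the preferred and
    dispreferred captions under the policy and a frozen reference model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()

# toy usage with illustrative log-probabilities
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                torch.tensor([-13.0]), torch.tensor([-14.8]))
```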
pdf
bib
abs
seqBench: A Tunable Benchmark to Quantify Sequential Reasoning Limits of LLMs
Mohammad Ramezanali
|
Mo Vazifeh
|
Paolo Santi
We introduce **seqBench**, a parametrized benchmark for probing sequential reasoning limits in Large Language Models (LLMs) through precise, multi-dimensional control over several key complexity dimensions. **seqBench** allows systematic variation of (1) the logical depth, defined as the number of sequential actions required to solve the task; (2) the number of backtracking steps along the optimal path, quantifying how often the agent must revisit prior states to satisfy deferred preconditions (e.g., retrieving a key after encountering a locked door); and (3) the noise ratio, defined as the ratio between supporting and distracting facts about the environment. Our evaluations on state-of-the-art LLMs reveal a universal failure pattern: accuracy collapses exponentially beyond a model-specific logical depth. Unlike existing benchmarks, **seqBench**’s fine-grained control facilitates targeted analyses of these reasoning failures, illuminating universal scaling laws and statistical limits, as detailed in this paper alongside its generation methodology and evaluation metrics. We find that even top-performing models systematically fail on **seqBench**’s structured reasoning tasks despite minimal search complexity, underscoring key limitations in their commonsense reasoning capabilities. Designed for future evolution to keep pace with advancing models, the **seqBench** datasets are publicly released to spur deeper scientific inquiry into LLM reasoning, aiming to establish a clearer understanding of their true potential and current boundaries for robust real-world application.
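The three complexity dimensions can be summarized as a small configuration record; the field names below are illustrative, not the released API:

```python
# Illustrative configuration of the complexity knobs a seqBench-style instance varies.
from dataclasses import dataclass

@dataclass
class SeqBenchConfig:
    logical_depth: int        # sequential actions required on the solution path
    backtracking_steps: int   # revisits of prior states to satisfy deferred preconditions
    noise_ratio: float        # distracting facts per supporting fact in the environment

cfg = SeqBenchConfig(logical_depth=12, backtracking_steps=3, noise_ratio=2.0)
```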
pdf
bib
abs
NormGenesis: Multicultural Dialogue Generation via Exemplar-Guided Social Norm Modeling and Violation Recovery
Minki Hong
|
Jangho Choi
|
Jihie Kim
Social norms govern culturally appropriate behavior in communication, enabling dialogue systems to produce responses that are not only coherent but also socially acceptable. We present NormGenesis, a multicultural framework for generating and annotating socially grounded dialogues across English, Chinese, and Korean. To model the dynamics of social interaction beyond static norm classification, we propose a novel dialogue type, Violation-to-Resolution (V2R), which models the progression of conversations following norm violations through recognition and socially appropriate repair. To improve pragmatic consistency in underrepresented languages, we implement an exemplar-based iterative refinement early in the dialogue synthesis process. This design introduces alignment with linguistic, emotional, and sociocultural expectations before full dialogue generation begins. Using this framework, we construct a dataset of 10,800 multi-turn dialogues annotated at the turn level for norm adherence, speaker intent, and emotional response. Human and LLM-based evaluations demonstrate that NormGenesis significantly outperforms existing datasets in refinement quality, dialogue naturalness, and generalization performance. We show that models trained on our V2R-augmented data exhibit improved pragmatic competence in ethically sensitive contexts. Our work establishes a new benchmark for culturally adaptive dialogue modeling and provides a scalable methodology for norm-aware generation across linguistically and culturally diverse languages.
pdf
bib
abs
SATBench: Benchmarking LLMs’ Logical Reasoning via Automated Puzzle Generation from SAT Formulas
Anjiang Wei
|
Yuheng Wu
|
Yingjia Wan
|
Tarun Suresh
|
Huanmi Tan
|
Zhanke Zhou
|
Sanmi Koyejo
|
Ke Wang
|
Alex Aiken
We introduce SATBench, a benchmark for evaluating the logical reasoning capabilities of large language models (LLMs) through logical puzzles derived from Boolean satisfiability (SAT) problems. Unlike prior work that focuses on inference rule-based reasoning, which often involves deducing conclusions from a set of premises, our approach leverages the search-based nature of SAT problems, where the objective is to find a solution that fulfills a specified set of logical constraints. Each instance in SATBench is generated from a SAT formula, then translated into a puzzle using LLMs. The generation process is fully automated and allows for adjustable difficulty by varying the number of clauses. All 2100 puzzles are validated through both LLM-based and solver-based consistency checks, with human validation on a subset. Experimental results show that even the strongest model, o4-mini, achieves only 65.0% accuracy on hard UNSAT problems, close to the random baseline of 50%. Our error analysis reveals systematic failures such as satisfiability bias, context inconsistency, and condition omission, highlighting limitations of current LLMs in search-based logical reasoning. Our code and data are publicly available at https://github.com/Anjiang-Wei/SATBench.
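The SAT side of such a pipeline can be sketched as a random 3-CNF generator with an adjustable clause count plus a brute-force satisfiability label; the translation of the formula into a natural-language puzzle (done with LLMs in the paper) is not shown, and all parameters below are illustrative:

```python
# Hedged sketch: sample a random 3-CNF formula and label it SAT/UNSAT by brute force.
import itertools
import random

def random_cnf(num_vars: int, num_clauses: int, k: int = 3):
    # each clause is k distinct variables, each negated with probability 1/2
    return [
        [random.choice([v, -v]) for v in random.sample(range(1, num_vars + 1), k)]
        for _ in range(num_clauses)
    ]

def is_satisfiable(cnf, num_vars: int) -> bool:
    for bits in itertools.product([False, True], repeat=num_vars):
        if all(any((lit > 0) == bits[abs(lit) - 1] for lit in clause) for clause in cnf):
            return True
    return False

formula = random_cnf(num_vars=6, num_clauses=20)  # more clauses -> harder / more often UNSAT
print("SAT" if is_satisfiable(formula, 6) else "UNSAT")
```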
pdf
bib
abs
Data Descriptions from Large Language Models with Influence Estimation
Chaeri Kim
|
Jaeyeon Bae
|
Taehwan Kim
Deep learning models have been successful in many areas, but understanding their behavior remains a challenge. Most prior explainable AI (XAI) approaches have focused on interpreting how models make predictions. In contrast, we introduce a novel approach that identifies textual descriptions most beneficial for model training. By analyzing which descriptions contribute most effectively to the model training, our method has the potential to provide insights into how the model prioritizes and utilizes information for decision-making. To achieve this, we propose a pipeline that generates textual descriptions using large language models, incorporates external knowledge bases, and refines them through influence estimation and CLIP score. Furthermore, leveraging the phenomenon of cross-modal transferability, we propose a novel benchmark task named cross-modal transfer classification to examine the effectiveness of our textual descriptions. In zero-shot experiments, we demonstrate that our textual descriptions improve classification accuracy compared to baselines, leading to consistent performance gains across nine image classification datasets. Additionally, understanding which descriptions contribute most to model performance can shed light on how the model utilizes textual information in its decision-making.
pdf
bib
abs
EquiBench: Benchmarking Large Language Models’ Reasoning about Program Semantics via Equivalence Checking
Anjiang Wei
|
Jiannan Cao
|
Ran Li
|
Hongyu Chen
|
Yuhui Zhang
|
Ziheng Wang
|
Yuan Liu
|
Thiago S. F. X. Teixeira
|
Diyi Yang
|
Ke Wang
|
Alex Aiken
As large language models (LLMs) become integral to code-related tasks, a central question emerges: Do LLMs truly understand program semantics? We introduce EquiBench, a new benchmark for evaluating LLMs through equivalence checking, i.e., determining whether two programs produce identical outputs for all possible inputs. Unlike prior code generation benchmarks, this task directly tests a model’s ability to reason about program semantics. EquiBench consists of 2400 program pairs across four languages and six categories. These pairs are generated through program analysis, compiler scheduling, and superoptimization, ensuring high-confidence labels, nontrivial difficulty, and full automation. We evaluate 19 state-of-the-art LLMs and find that in the most challenging categories, the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline. Further analysis reveals that models often rely on syntactic similarity rather than exhibiting robust reasoning about program semantics, highlighting current limitations. Our code and dataset are publicly available at https://github.com/Anjiang-Wei/equibench
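For intuition about the equivalence-checking task itself, the sketch below compares two toy programs on sampled inputs. Note that sampling can only refute equivalence, never prove it over all inputs; the helper names are illustrative and not part of EquiBench.

```python
import random

def prog_a(x: int) -> int:
    return x * (x + 1) // 2          # closed-form sum of 0..x

def prog_b(x: int) -> int:
    return sum(range(x + 1))         # iterative sum of 0..x

def sample_equivalent(f, g, trials: int = 1000, seed: int = 0) -> bool:
    """Return False if a counterexample is found; True means no disagreement was
    observed on sampled inputs, which does not prove equivalence in general."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randint(0, 10_000)
        if f(x) != g(x):
            return False
    return True

print(sample_equivalent(prog_a, prog_b))  # True: no counterexample found
```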
pdf
bib
abs
MicroEdit: Neuron-level Knowledge Disentanglement and Localization in Lifelong Model Editing
Shiqi Wang
|
Qi Wang
|
Runliang Niu
|
He Kong
|
Yi Chang
Large language models (LLMs) require continual knowledge updates to keep pace with the evolving world. While various model editing methods have been proposed, most face critical challenges in the context of lifelong learning due to two fundamental limitations: (1) Edit Overshooting - parameter updates intended for a specific fact spill over to unrelated regions, causing interference with previously retained knowledge; and (2) Knowledge Entanglement - polysemantic neurons’ overlapping encoding of multiple concepts makes it difficult to isolate and edit a single fact. In this paper, we propose MicroEdit, a neuron-level editing method that performs minimal and controlled interventions within LLMs. By leveraging a sparse autoencoder (SAE), MicroEdit disentangles knowledge representations and activates only a minimal set of necessary neurons for precise parameter updates. This targeted design enables fine-grained control over the editing scope, effectively mitigating interference and preserving unrelated knowledge. Extensive experiments show that MicroEdit outperforms prior methods and robustly handles lifelong knowledge editing across QA and Hallucination settings on LLaMA and Mistral.
pdf
bib
abs
Do Large Language Models Understand Word Senses?
Domenico Meconi
|
Simone Stirpe
|
Federico Martelli
|
Leonardo Lavalle
|
Roberto Navigli
Understanding the meaning of words in context is a fundamental capability for Large Language Models (LLMs). Despite extensive evaluation efforts, the extent to which LLMs show evidence that they truly grasp word senses remains underexplored. In this paper, we address this gap by evaluating both i) the Word Sense Disambiguation (WSD) capabilities of instruction-tuned LLMs, comparing their performance to state-of-the-art systems specifically designed for the task, and ii) the ability of two top-performing open- and closed-source LLMs to understand word senses in three generative settings: definition generation, free-form explanation, and example generation. Notably, we find that, in the WSD task, leading models such as GPT-4o and DeepSeek-V3 achieve performance on par with specialized WSD systems, while also demonstrating greater robustness across domains and levels of difficulty. In the generation tasks, results reveal that LLMs can explain the meaning of words in context with up to 98% accuracy, with the highest performance observed in the free-form explanation task, which best aligns with their generative capabilities. We release our code and data at: https://github.com/Babelscape/LLM-WSD.
pdf
bib
abs
Diverse, not Short: A Length-Controlled Data Selection Strategy for Improving Response Diversity of Language Models
Vijeta Deshpande
|
Debasmita Ghose
|
John D Patterson
|
Roger E. Beaty
|
Anna Rumshisky
Diverse language model responses are crucial for creative generation, open-ended tasks, and self-improvement training. We show that common diversity metrics, and even reward models used for preference optimization, systematically bias models toward shorter outputs, limiting expressiveness. To address this, we introduce Diverse, not Short (Diverse-NS), a length-controlled data selection strategy that improves response diversity while maintaining length parity. By generating and filtering preference data that balances diversity, quality, and length, Diverse-NS enables effective training using only 3,000 preference pairs. Applied to LLaMA-3.1-8B and the Olmo-2 family, Diverse-NS substantially enhances lexical and semantic diversity. We show consistent improvement in diversity with only minor reductions, or even gains, in response quality on four creative generation tasks: Divergent Associations, Persona Generation, Alternate Uses, and Creative Writing. Surprisingly, experiments with the Olmo-2 model family (7B and 13B) show that smaller models like Olmo-2-7B can serve as effective “diversity teachers” for larger models. By explicitly addressing length bias, our method efficiently pushes models toward more diverse and expressive outputs.
pdf
bib
abs
Uncovering the Bigger Picture: Comprehensive Event Understanding Via Diverse News Retrieval
Yixuan Tang
|
Yuanyuan Shi
|
Yiqun Sun
|
Anthony Kum Hoe Tung
Access to diverse perspectives is essential for understanding real-world events, yet most news retrieval systems prioritize textual relevance, leading to redundant results and limited viewpoint exposure. We propose NEWSCOPE, a two-stage framework for diverse news retrieval that enhances event coverage by explicitly modeling semantic variation at the sentence level. The first stage retrieves topically relevant content using dense retrieval, while the second stage applies sentence-level clustering and diversity-aware re-ranking to surface complementary information. To evaluate retrieval diversity, we introduce three interpretable metrics, namely Average Pairwise Distance, Positive Cluster Coverage, and Information Density Ratio, and construct two paragraph-level benchmarks: LocalNews and DSGlobal. Experiments show that NEWSCOPE consistently outperforms strong baselines, achieving significantly higher diversity without compromising relevance. Our results demonstrate the effectiveness of fine-grained, interpretable modeling in mitigating redundancy and promoting comprehensive event understanding. The data and code are available at
https://github.com/tangyixuan/NEWSCOPE.
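As a rough illustration of one of the proposed diversity metrics, the snippet below computes an average pairwise cosine distance over sentence embeddings. The random stand-in embeddings and the exact formula are assumptions for demonstration, not the paper's implementation.

```python
import numpy as np

def average_pairwise_distance(embeddings: np.ndarray) -> float:
    """Mean cosine distance over all unordered pairs of sentence embeddings."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T                      # pairwise cosine similarities
    n = len(embeddings)
    iu = np.triu_indices(n, k=1)                  # unordered pairs only
    return float(np.mean(1.0 - sims[iu]))

# Stand-in embeddings; in practice these would come from a sentence encoder.
emb = np.random.default_rng(0).normal(size=(5, 384))
print(round(average_pairwise_distance(emb), 3))
```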
pdf
bib
abs
Personalized LLM Decoding via Contrasting Personal Preference
Hyungjune Bu
|
ChanJoo Jung
|
Minjae Kang
|
Jaehyung Kim
As large language models (LLMs) are progressively deployed in various real-world applications, personalization of LLMs has become increasingly important. While various approaches to LLM personalization such as prompt-based and training-based methods have been actively explored, the development of effective decoding-time algorithms remains largely overlooked, despite their demonstrated potential. In this paper, we propose Contrasting Personal Preference (CoPe), a novel decoding-time approach applied after performing parameter-efficient fine-tuning (PEFT) on user-specific data. Our core idea is to leverage reward-guided decoding specifically for personalization by maximizing each user’s implicit reward signal. We evaluate CoPe across five open-ended personalized text generation tasks. Our empirical results demonstrate that CoPe achieves strong performance, improving personalization by an average of 10.57% in ROUGE-L without relying on external reward models or additional training procedures.
pdf
bib
abs
The Missing Parts: Augmenting Fact Verification with Half Truth Detection
Yixuan Tang
|
Jincheng Wang
|
Anthony Kum Hoe Tung
Fact verification systems typically assess whether a claim is supported by retrieved evidence, assuming that truthfulness depends solely on what is stated. However, many real-world claims are half-truths, factually correct yet misleading due to the omission of critical context. Existing models struggle with such cases, as they are not designed to reason about omitted information. We introduce the task of half-truth detection, and propose PolitiFact-Hidden, a new benchmark with 15k political claims annotated with sentence-level evidence alignment and inferred claim intent. To address this challenge, we present TRACER, a modular re-assessment framework that identifies omission-based misinformation by aligning evidence, inferring implied intent, and estimating the causal impact of hidden content. TRACER can be integrated into existing fact-checking pipelines and consistently improves performance across multiple strong baselines. Notably, it boosts Half-True classification F1 by up to 16 points, highlighting the importance of modeling omissions for trustworthy fact verification. The benchmark and code are available via https://github.com/tangyixuan/TRACER.
pdf
bib
abs
Toward Machine Translation Literacy: How Lay Users Perceive and Rely on Imperfect Translations
Yimin Xiao
|
Yongle Zhang
|
Dayeon Ki
|
Calvin Bao
|
Marianna J. Martindale
|
Charlotte Vaughn
|
Ge Gao
|
Marine Carpuat
As Machine Translation (MT) becomes increasingly commonplace, understanding how the general public perceives and relies on imperfect MT is crucial for contextualizing MT research in real-world applications. We present a human study conducted in a public museum (n=452), investigating how fluency and adequacy errors impact bilingual and non-bilingual users’ reliance on MT during casual use. Our findings reveal that non-bilingual users often over-rely on MT due to a lack of evaluation strategies and alternatives, while experiencing the impact of errors can prompt users to reassess future reliance. This highlights the need for MT evaluation and NLP explanation techniques to promote not only MT quality, but also MT literacy among its users.
pdf
bib
abs
Personalization up to a Point: Why Personalized Content Moderation Needs Boundaries, and How We Can Enforce Them
Emanuele Moscato
|
Tiancheng Hu
|
Matthias Orlikowski
|
Paul Röttger
|
Debora Nozza
Personalized content moderation can protect users from harm while facilitating free expression by tailoring moderation decisions to individual preferences rather than enforcing universal rules. However, content moderation that is fully personalized to individual preferences, no matter what these preferences are, may lead to even the most hazardous types of content being propagated on social media. In this paper, we explore this risk using hate speech as a case study. Certain types of hate speech are illegal in many countries. We show that, while fully personalized hate speech detection models increase overall user welfare (as measured by user-level classification performance), they also make predictions that violate such legal hate speech boundaries, especially when tailored to users who tolerate highly hateful content. To address this problem, we enforce legal boundaries in personalized hate speech detection by overriding predictions from personalized models with those from a boundary classifier. This approach significantly reduces legal violations while minimally affecting overall user welfare. Our findings highlight both the promise and the risks of personalized moderation, and offer a practical solution to balance user preferences with legal and ethical obligations.
pdf
bib
abs
MPCG: Multi-Round Persona-Conditioned Generation for Modeling the Evolution of Misinformation with LLMs
Chong Jun Rong Brian
|
Yixuan Tang
|
Anthony Kum Hoe Tung
Misinformation evolves as it spreads, shifting in language, framing, and moral emphasis to adapt to new audiences. However, current misinformation detection approaches implicitly assume that misinformation is static. We introduce MPCG, a multi-round, persona-conditioned framework that simulates how claims are iteratively reinterpreted by agents with distinct ideological perspectives. Our approach uses an uncensored large language model (LLM) to generate persona-specific claims across multiple rounds, conditioning each generation on outputs from the previous round, enabling the study of misinformation evolution. We evaluate the generated claims through human and LLM-based annotations, cognitive effort metrics (readability, perplexity), emotion evocation metrics (sentiment analysis, morality), clustering, feasibility, and downstream classification. Results show strong agreement between human and GPT-4o-mini annotations, with higher divergence in fluency judgments. Generated claims require greater cognitive effort than the original claims and consistently reflect persona-aligned emotional and moral framing. Clustering and cosine similarity analyses confirm semantic drift across rounds while preserving topical coherence. Feasibility results show a 77% feasibility rate, confirming suitability for downstream tasks. Classification results reveal that commonly used misinformation detectors experience macro-F1 performance drops of up to 49.7%. The code is available at https://github.com/bcjr1997/MPCG.
pdf
bib
abs
LiTEx: A Linguistic Taxonomy of Explanations for Understanding Within-Label Variation in Natural Language Inference
Pingjun Hong
|
Beiduo Chen
|
Siyao Peng
|
Marie-Catherine de Marneffe
|
Barbara Plank
There is increasing evidence of Human Label Variation (HLV) in Natural Language Inference (NLI), where annotators assign different labels to the same premise-hypothesis pair. However, *within-label variation* — cases where annotators agree on the same label but provide divergent reasoning — poses an additional and mostly overlooked challenge. Several NLI datasets contain highlighted words in the NLI item as explanations, but the same spans on the NLI item can be highlighted for different reasons, as evidenced by free-text explanations, which offer a window into annotators’ reasoning. To systematically understand this problem and gain insight into the rationales behind NLI labels, we introduce LiTEx, a linguistically-informed taxonomy for categorizing free-text explanations in English. Using this taxonomy, we annotate a subset of the e-SNLI dataset, validate the taxonomy’s reliability, and analyze how it aligns with NLI labels, highlights, and explanations. We further assess the taxonomy’s usefulness in explanation generation, demonstrating that conditioning generation on LiTEx yields explanations that are linguistically closer to human explanations than those generated using only labels or highlights. Our approach thus not only captures within-label variation but also shows how taxonomy-guided generation for reasoning can bridge the gap between human and model explanations more effectively than existing strategies.
pdf
bib
abs
LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA
Tommaso Bonomo
|
Luca Gioffré
|
Roberto Navigli
Question Answering (QA) on narrative text poses a unique challenge to current systems, requiring a deep understanding of long, complex documents. However, the reliability of NarrativeQA, the most widely used benchmark in this domain, is hindered by noisy documents and flawed QA pairs. In this work, we introduce LiteraryQA, a high-quality subset of NarrativeQA focused on literary works. Using a human- and LLM-validated pipeline, we identify and correct low-quality QA samples while removing extraneous text from source documents. We then carry out a meta-evaluation of automatic metrics to clarify how systems should be evaluated on LiteraryQA. This analysis reveals that all n-gram-based metrics have a low system-level correlation to human judgment, while LLM-as-a-Judge evaluations, even with small open-weight models, can strongly agree with the ranking identified by humans. Finally, we benchmark a set of long-context LLMs on LiteraryQA. We release our code and data at https://github.com/sapienzaNLP/LiteraryQA.
pdf
bib
abs
FillerSpeech: Towards Human-Like Text-to-Speech Synthesis with Filler Insertion and Filler Style Control
Seung-Bin Kim
|
Jun-Hyeok Cha
|
Hyung-Seok Oh
|
Heejin Choi
|
Seong-Whan Lee
Recent advancements in speech synthesis have significantly improved the audio quality and pronunciation of synthesized speech. To further advance toward human-like conversational speech synthesis, this paper presents FillerSpeech, a novel speech synthesis framework that enables natural filler insertion and control over filler style. To this end, we construct filler-inclusive speech data derived from an open-source, large-scale speech corpus. This data includes fillers with pitch and duration information. For the generation and style control of natural fillers, we propose a method that tokenizes the filler style and utilizes cross-attention with the input text. Furthermore, we introduce a large language model-based filler prediction method that enables natural insertion of fillers even when only text input is provided. The experimental results demonstrate that the constructed dataset is valid and that our proposed methods for filler style control and filler prediction are effective.
pdf
bib
abs
Multi-LMentry: Can Multilingual LLMs Solve Elementary Tasks Across Languages?
Luca Moroni
|
Javier Aula-Blasco
|
Simone Conia
|
Irene Baucells
|
Naiara Perez
|
Silvia Paniagua Suárez
|
Anna Sallés
|
Malte Ostendorff
|
Júlia Falcão
|
Guijin Son
|
Aitor Gonzalez-Agirre
|
Roberto Navigli
|
Marta Villegas
As large language models (LLMs) continue to improve, their evaluation increasingly centers on complex, high-level tasks, often at the expense of systematically assessing fundamental capabilities. To address this gap, recent work proposed LMentry, a compact benchmark comprising tasks that are trivial for humans but remain surprisingly difficult for LLMs. However, LMentry is limited to English, leaving its insights linguistically narrow. In this paper, we present Multi-LMentry, a ground-up recreation of LMentry that enables systematic evaluation of LLMs on basic reasoning and understanding tasks across nine diverse languages. Multi-LMentry includes English and expands to Basque, Brazilian Portuguese, Catalan, Galician, German, Italian, Korean, and Spanish, emphasizing the importance of cross-lingual and low-resource settings. To validate that Multi-LMentry is still trivial for humans, we demonstrate that L2 speakers with only elementary proficiency achieve near-perfect scores in a low-resource language, namely, Basque. Through extensive experiments, we reveal that state-of-the-art open-weight multilingual LLMs still fall short of human performance on elementary tasks in many languages. Our results expose new failure modes that remain hidden in monolingual evaluation, underscoring the need for rigorous, language-diverse “unit tests” of core model abilities.
pdf
bib
abs
Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query
Yixuan Wang
|
Shiyu Ji
|
Yijun Liu
|
Yuzhuang Xu
|
Yang Xu
|
Qingfu Zhu
|
Wanxiang Che
Large language models (LLMs) rely on key-value cache (KV cache) to accelerate decoding by reducing redundant computations. However, the KV cache memory usage grows substantially with longer text sequences, posing challenges for efficient deployment. Existing KV cache eviction methods prune tokens using prefilling-stage attention scores, causing inconsistency with actual inference queries, especially under tight memory budgets. In this paper, we propose Lookahead Q-Cache (LAQ), a novel eviction framework that generates low-cost pseudo lookahead queries to better approximate the true decoding-stage queries. By using these lookahead queries as the observation window for importance estimation, LAQ achieves more consistent and accurate KV cache eviction aligned with real inference scenarios. Experimental results on LongBench and Needle-in-a-Haystack benchmarks show that LAQ outperforms existing methods across various budget levels, achieving a 1–4 point improvement on LongBench under limited cache budget. Moreover, LAQ is complementary to existing approaches and can be flexibly combined to yield further improvements.
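The snippet below sketches the general idea of attention-score-based KV cache eviction guided by lookahead (pseudo) queries: cached positions receiving the least attention mass under the pseudo queries are dropped. It is a simplified, single-head illustration with hypothetical shapes, not the LAQ algorithm itself.

```python
import numpy as np

def evict_kv(keys: np.ndarray, values: np.ndarray,
             lookahead_queries: np.ndarray, budget: int):
    """Keep the `budget` cached positions with the highest attention mass under
    the lookahead queries (generic illustration, not the paper's exact method)."""
    d = keys.shape[-1]
    scores = lookahead_queries @ keys.T / np.sqrt(d)           # (num_queries, seq_len)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)                 # softmax per query
    importance = probs.mean(axis=0)                            # average over queries
    keep = np.sort(np.argsort(importance)[-budget:])           # top-`budget`, original order
    return keys[keep], values[keep]

rng = np.random.default_rng(0)
K, V, Q = rng.normal(size=(128, 64)), rng.normal(size=(128, 64)), rng.normal(size=(4, 64))
k_small, v_small = evict_kv(K, V, Q, budget=32)
print(k_small.shape)  # (32, 64)
```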
pdf
bib
abs
PerspectiveMod: A Perspectivist Resource for Deliberative Moderation
Eva Maria Vecchi
|
Neele Falk
|
Carlotta Quensel
|
Iman Jundi
|
Gabriella Lapesa
Human moderators in online discussions face a heterogeneous range of tasks, which go beyond content moderation, or policing. They also support and improve discussion quality, which is challenging to model (and evaluate) in NLP due to its inherent subjectivity and the scarcity of annotated resources. We address this gap by introducing PerspectiveMod, a dataset of online comments annotated for the question: *“Does this comment require moderation, and why?”* Annotations were collected from both expert moderators and trained non-experts. **PerspectiveMod** is unique in its intentional variation across (a) the level of moderation experience embedded in the source data (professional vs. non-professional moderation environments), (b) the annotator profiles (experts vs. trained crowdworkers), and (c) the richness of each moderation judgment, both in terms of fine-grained comment properties (drawn from argumentation and deliberative theory) and in the representation of the individuality of the annotator (socio-demographics and attitudes towards the task). We advance understanding of the task’s complexity by providing interpretation layers that account for its subjectivity. Our statistical analysis highlights the value of collecting annotator perspectives, including their experiences, attitudes, and views on AI, as a foundation for developing more context-aware and interpretively robust moderation tools.
pdf
bib
abs
LoCt-Instruct: An Automatic Pipeline for Constructing Datasets of Logical Continuous Instructions
Hongyu Sun
|
Yusuke Sakai
|
Haruki Sakajo
|
Shintaro Ozaki
|
Kazuki Hayashi
|
Hidetaka Kamigaito
|
Taro Watanabe
Continuous instruction following closely mirrors real-world tasks by requiring models to solve sequences of interdependent steps, yet existing multi-step instruction datasets suffer from three key limitations: (1) lack of logical coherence across turns, (2) narrow topical breadth and depth, and (3) reliance on rigid templates or heavy manual effort. We introduce LoCt-Pipeline, a novel pipeline that leverages modern LLMs’ reasoning capabilities to assemble rich, topic-related single-instruction data into multi-turn dialogues, producing chains that are logically coherent, progressively deepen in content, and span diverse domains without fixed templates or extensive human annotation. We employed this pipeline to construct LoCt-Instruct for assessing models’ problem-solving abilities. The generated chains serve as a testbed for benchmarking a variety of models, including reasoning-oriented architectures, instruction-tuned variants, and state-of-the-art closed-source LLMs on their capacity to follow and correctly respond to each step. Our results reveal a substantial performance gap between current LLMs and human solvers. These findings highlight the need for more robust continuous instruction following. We publicly release the dataset and end-to-end pipeline.
pdf
bib
abs
CodeSSM: Towards State Space Models for Code Understanding
Shweta Verma
|
Abhinav Anand
|
Mira Mezini
Although transformers dominate many code-specific tasks, they have significant limitations. This paper explores State Space Models (SSMs) as a promising alternative for code understanding tasks such as retrieval, classification, and clone detection. We introduce CodeSSM, the first SSM-based model trained on code corpora to assess its effectiveness. Our results demonstrate that SSMs are more sample-efficient and can extrapolate to longer contexts beyond the pretraining length. Extensive experiments show that SSMs offer a viable alternative to transformers, addressing several of their limitations. Additionally, CodeSSM reduces memory usage by up to 64% compared to transformers at a context length of 2048, with greater savings as context length grows. The code is available [here](https://github.com/abx04/CodeSSM).
pdf
bib
abs
EduAdapt: A Question Answer Benchmark Dataset for Evaluating Grade-Level Adaptability in LLMs
Numaan Naeem
|
Abdellah El Mekki
|
Muhammad Abdul-Mageed
Large language models (LLMs) are transforming education by answering questions, explaining complex concepts, and generating content across a wide range of subjects. Despite strong performance on academic benchmarks, they often fail to tailor responses to students’ grade levels. This is a critical need in K-12 education, where age-appropriate vocabulary and explanation are essential for effective learning. Existing models frequently produce outputs that are too advanced or vague for younger learners, and there are no standardized benchmarks to evaluate their ability to adjust across cognitive and developmental stages. To address this gap, we introduce EduAdapt, a benchmark of nearly 48k grade-labeled QA pairs across nine science subjects, spanning Grades 1-12 and grouped into four grade levels. We evaluate a diverse set of open-source LLMs on EduAdapt and find that while larger models generally perform better, they still struggle with generating suitable responses for early-grade students (Grades 1-5). Our work presents the first dataset and evaluation framework for assessing grade-level adaptability in LLMs, aiming to foster more developmentally aligned educational AI systems through better training and prompting strategies. EduAdapt code and datasets are publicly available at https://github.com/NaumanNaeem/EduAdapt.
pdf
bib
abs
xCoRe: Cross-context Coreference Resolution
Giuliano Martinelli
|
Bruno Gatti
|
Roberto Navigli
Current coreference resolution systems are typically tailored for short- or medium-sized texts and struggle to scale to very long documents due to architectural limitations and implied memory costs. However, a few available solutions can be applied by inputting documents split into smaller windows. This is inherently similar to what happens in the cross-document setting, in which systems infer coreference relations between mentions that are found in separate documents. In this paper, we unify these two challenging settings under the general framework of cross-context coreference, and introduce xCoRe, a new unified approach designed to efficiently handle short-, long-, and cross-document coreference resolution. xCoRe adopts a three-step pipeline that first identifies mentions, then creates clusters within individual contexts, and finally merges clusters across contexts. In our experiments, we show that our formulation enables joint training on shared long- and cross-document resources, increasing data availability and particularly benefiting the challenging cross-document task. Our model achieves new state-of-the-art results on cross-document benchmarks and strong performance on long-document data, while retaining top-tier results on traditional datasets, positioning it as a robust, versatile solution that can be applied across all end-to-end coreference settings. We release our models and code at http://github.com/sapienzanlp/xcore.
pdf
bib
abs
Retrieval-Augmented Generation with Estimation of Source Reliability
Jeongyeon Hwang
|
Junyoung Park
|
Hyejin Park
|
Dongwoo Kim
|
Sangdon Park
|
Jungseul Ok
Retrieval-Augmented Generation (RAG) is an effective approach to enhance the factual accuracy of large language models (LLMs) by retrieving information from external databases, which are typically composed of diverse sources, to supplement the limited internal knowledge of LLMs. However, the standard RAG often risks retrieving incorrect information, as it relies solely on relevance between a query and a document, overlooking the heterogeneous reliability of these sources. To address this issue, we propose Reliability-Aware RAG (RA-RAG), a new multi-source RAG framework that estimates the reliability of sources and leverages this information to prioritize highly reliable and relevant documents, ensuring more robust and accurate response generation. Specifically, RA-RAG first estimates source reliability by cross-checking information across multiple sources. It then retrieves documents from the top-𝜅 reliable and relevant sources and aggregates their information using weighted majority voting (WMV), where the selective retrieval ensures scalability while not compromising the performance. Comprehensive experiments show that RA-RAG consistently outperforms baselines in scenarios with heterogeneous source reliability while scaling efficiently as the number of sources increases. Furthermore, we demonstrate the ability of RA-RAG to estimate real-world sources’ reliability, highlighting its practical applicability. Our code and data are available at RA-RAG.
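A minimal sketch of the weighted majority voting (WMV) step described above, assuming per-source reliability weights have already been estimated; the names and values below are illustrative only.

```python
from collections import defaultdict

def weighted_majority_vote(answers, reliabilities):
    """Aggregate per-source answers by summing each source's reliability weight."""
    votes = defaultdict(float)
    for answer, weight in zip(answers, reliabilities):
        votes[answer] += weight
    return max(votes, key=votes.get)

answers = ["Paris", "Paris", "Lyon"]
reliabilities = [0.9, 0.7, 0.4]   # e.g., estimated by cross-checking sources
print(weighted_majority_vote(answers, reliabilities))  # Paris
```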
pdf
bib
abs
NitiBench: Benchmarking LLM Frameworks on Thai Legal Question Answering Capabilities
Pawitsapak Akarajaradwong
|
Pirat Pothavorn
|
Chompakorn Chaksangchaichot
|
Panuthep Tasawong
|
Thitiwat Nopparatbundit
|
Keerakiat Pratai
|
Sarana Nutanong
Large language models (LLMs) show promise in legal question answering (QA), yet Thai legal QA systems face challenges due to limited data and complex legal structures. We introduce NitiBench, a novel benchmark featuring two datasets: (1) NitiBench-CCL, covering Thai financial laws, and (2) NitiBench-Tax, containing Thailand’s official tax rulings. Our benchmark also consists of specialized evaluation metrics suited for Thai legal QA. We evaluate retrieval-augmented generation (RAG) and long-context LLM (LCLM) approaches across three key dimensions: (1) the benefits of domain-specific techniques like hierarchy-aware chunking and cross-referencing, (2) comparative performance of RAG components, e.g., retrievers and LLMs, and (3) the potential of long-context LLMs to replace traditional RAG systems. Our results reveal that domain-specific components slightly improve over naive methods. At the same time, existing retrieval models still struggle with complex legal queries, and long-context LLMs have limitations in consistent legal reasoning. Our study highlights current limitations in Thai legal NLP and lays a foundation for future research in this emerging domain.
pdf
bib
abs
From Input Perception to Predictive Insight: Modeling Model Blind Spots Before They Become Errors
Maggie Mi
|
Aline Villavicencio
|
Nafise Sadat Moosavi
Language models often struggle with idiomatic, figurative, or context-sensitive inputs, not because they produce flawed outputs, but because they misinterpret the input from the outset. We propose an input-only method for anticipating such failures using token-level likelihood features inspired by surprisal and the Uniform Information Density hypothesis. These features capture localized uncertainty in input comprehension and outperform standard baselines across five linguistically challenging datasets. We show that span-localized features improve error detection for larger models, while smaller models benefit from global patterns. Our method requires no access to outputs or hidden activations, offering a lightweight and generalizable approach to pre-generation error prediction.
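As a pointer to the kind of token-level likelihood features discussed here, the sketch below computes per-token surprisal under an off-the-shelf causal language model (GPT-2 is used only as a stand-in); it illustrates the basic quantity, not the paper's feature pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def token_surprisal(text: str, model_name: str = "gpt2"):
    """Per-token surprisal (-log2 p) of `text` under a causal LM."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)   # predict token t+1 from prefix
    target = ids[:, 1:]
    nll = -log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    bits = nll[0] / torch.log(torch.tensor(2.0))             # nats -> bits
    tokens = tok.convert_ids_to_tokens(target[0].tolist())
    return list(zip(tokens, bits.tolist()))

for token, s in token_surprisal("He kicked the bucket yesterday."):
    print(f"{token:>12}  {s:.2f} bits")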
pdf
bib
abs
WojoodRelations: Arabic Relation Extraction Corpus and Modeling
Alaa Aljabari
|
Mohammed Khalilia
|
Mustafa Jarrar
Relation extraction (RE) is a core task in natural language processing, crucial for semantic understanding, knowledge graph construction, and enhancing downstream applications. Existing work on Arabic RE remains limited due to the language’s rich morphology and syntactic complexity, and the lack of large, high-quality datasets. In this paper, we present WojoodRelations, the largest and most diverse Arabic RE corpus to date, containing over 33K sentences (∼550K tokens) annotated with ∼15K relation triples across 40 relation types. The corpus is built on top of Wojood NER dataset with manual relation annotations carried out by expert annotators, achieving a Cohen’s 𝜅 of 0.92, indicating high reliability. In addition, we propose two methods: NLI-RE, which formulates RE as a binary natural language inference problem using relation-aware templates, and GPT-Joint, a few-shot LLM framework for joint entity and RE via relation-aware retrieval. Finally, we benchmark the dataset using both supervised models and in-context learning with LLMs. Supervised models achieve 92.89% F1 for RE, while LLMs obtain 72.73% F1 for joint entity and RE. These results establish strong baselines, highlight key challenges, and provide a foundation for advancing Arabic RE research.
pdf
bib
abs
Conflicting Needles in a Haystack: How LLMs behave when faced with contradictory information
Murathan Kurfali
Large Language Models (LLMs) have demonstrated an impressive ability to retrieve and summarize complex information, but their reliability in conflicting contexts remains poorly understood. We introduce an adversarial extension of the Needle-in-a-Haystack framework in which three mutually exclusive “needles” are embedded within long documents. By systematically manipulating factors such as position, repetition, layout, and domain relevance, we evaluate how LLMs handle contradictions. We find that models almost always fail to signal uncertainty and instead confidently select a single answer, exhibiting strong and consistent biases toward repetition, recency, and particular surface forms. We further analyze whether these patterns persist across model families and sizes, and we evaluate both probability-based and generation-based retrieval. Our framework highlights critical limitations in the robustness of current LLMs—including commercial systems—to contradiction. These limitations reveal potential shortcomings in RAG systems’ ability to handle noisy or manipulated inputs and expose risks for deployment in high-stakes applications.
pdf
bib
abs
Towards Event Extraction with Massive Types: LLM-based Collaborative Annotation and Partitioning Extraction
Wenxuan Liu
|
Zixuan Li
|
Long Bai
|
Yuxin Zuo
|
Daozhu Xu
|
Xiaolong Jin
|
Jiafeng Guo
|
Xueqi Cheng
Developing a general-purpose system that can extract events with massive types is a long-standing target in Event Extraction (EE). The basic challenge in doing so comes from the absence of an efficient and effective annotation framework to construct the corresponding datasets. In this paper, we propose an LLM-based collaborative annotation framework. Through collaboration among multiple LLMs and a subsequent voting process, it refines annotations of triggers from distant supervision and then carries out argument annotation. Finally, we create EEMT, the largest EE dataset to date, featuring over **200,000** samples, **3,465** event types, and **6,297** role types. Evaluation on a human-annotated test set demonstrates that the proposed framework achieves F1 scores of **90.1%** and **85.3%** for event detection and argument extraction, strongly validating its effectiveness. Besides, to alleviate the excessively long prompts caused by massive types, we propose an LLM-based Partitioning method for EE called LLM-PEE. It first recalls candidate event types and then splits them into multiple partitions for LLMs to extract. After fine-tuning on the EEMT training set, the distilled LLM-PEE with 7B parameters outperforms state-of-the-art methods by **5.4%** and **6.1%** in event detection and argument extraction. Moreover, it also surpasses mainstream LLMs by **12.9%** on unseen datasets, which strongly demonstrates the event diversity of the EEMT dataset and the generalization capabilities of the LLM-PEE method.
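A tiny sketch of the partitioning idea: recalled candidate event types are split into prompt-sized chunks so that each LLM call only sees a manageable subset of the type inventory. The partition size and type names below are placeholders, not values from the paper.

```python
def partition_types(candidate_types, partition_size: int = 50):
    """Split a long list of candidate event types into prompt-sized partitions."""
    return [candidate_types[i:i + partition_size]
            for i in range(0, len(candidate_types), partition_size)]

candidates = [f"EventType_{i}" for i in range(230)]   # stand-in recalled candidates
partitions = partition_types(candidates, partition_size=50)
print(len(partitions), [len(p) for p in partitions])  # 5 partitions of at most 50 types
```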
pdf
bib
abs
Liaozhai through the Looking-Glass: On Paratextual Explicitation of Culture-Bound Terms in Machine Translation
Sherrie Shen
|
Weixuan Wang
|
Alexandra Birch
The faithful transfer of contextually-embedded meaning continues to challenge contemporary machine translation (MT), particularly in the rendering of culture-bound terms—expressions or concepts rooted in specific languages or cultures, resisting direct linguistic transfer. Existing computational approaches to explicitating these terms have focused exclusively on in-text solutions, overlooking paratextual apparatus in the footnotes and endnotes employed by professional translators. In this paper, we formalize Genette’s (1987) theory of paratexts from literary and translation studies to introduce the task of paratextual explicitation for MT. We construct a dataset of 560 expert-aligned paratexts from four English translations of the classical Chinese short story collection Liaozhai and evaluate LLMs with and without reasoning traces on choice and content of explicitation. Experiments across intrinsic prompting and agentic retrieval methods establish the difficulty of this task, with human evaluation showing that LLM-generated paratexts improve audience comprehension, though remain considerably less effective than translator-authored ones. Beyond model performance, statistical analysis reveals that even professional translators vary widely in their use of paratexts, suggesting that cultural mediation is inherently open-ended rather than prescriptive. Our findings demonstrate the potential of paratextual explicitation in advancing MT beyond linguistic equivalence, with promising extensions to monolingual explanation and personalized adaptation.
pdf
bib
abs
Concept-pedia: a Wide-coverage Semantically-annotated Multimodal Dataset
Karim Ghonim
|
Andrei Stefan Bejgu
|
Alberte Fernández-Castro
|
Roberto Navigli
Vision-language Models (VLMs), such as CLIP and SigLIP, have become the de facto standard for multimodal tasks, serving as essential building blocks for recent Multimodal Large Language Models, including LLaVA and PaliGemma. However, current evaluations for VLMs remain heavily anchored to ImageNet. In this paper, we question whether ImageNet’s coverage is still sufficiently challenging for modern VLMs, and investigate the impact of adding novel and varied concept categories, i.e., semantically grouped fine-grained synsets. To this end, we introduce Concept-pedia, a novel, large-scale, semantically-annotated multimodal resource covering more than 165,000 concepts. Leveraging a language-agnostic, automatic annotation pipeline grounded in Wikipedia, Concept-pedia expands the range of visual concepts, including diverse abstract categories. Building on Concept-pedia, we also present a manually-curated Visual Concept Recognition evaluation benchmark, Concept-10k, that spans thousands of concepts across a wide range of categories. Our experiments show that current models, although excelling on ImageNet, struggle with Concept-10k. Not only do these findings highlight a persistent bias toward ImageNet-centric concepts, but they also underscore the urgent need for more representative benchmarks. By offering a broader and semantically richer testbed, Concept-10k aims to support the development of multimodal systems that better generalize to the complexities of real-world visual concepts.
pdf
bib
abs
RAED: Retrieval-Augmented Entity Description Generation for Emerging Entity Linking and Disambiguation
Karim Ghonim
|
Pere-Lluís Huguet Cabot
|
Riccardo Orlando
|
Roberto Navigli
Entity Linking and Entity Disambiguation systems aim to link entity mentions to their corresponding entries, typically represented by descriptions within a predefined, static knowledge base. Current models assume that these knowledge bases are complete and up-to-date, rendering them incapable of handling entities not yet included therein. However, in an ever-evolving world, new entities emerge regularly, making these static resources insufficient for practical applications. To address this limitation, we introduce RAED, a model that retrieves external knowledge to improve factual grounding in entity descriptions. Using sources such as Wikipedia, RAED effectively disambiguates entities and bases their descriptions on factual information, reducing the dependence on parametric knowledge. Our experiments show that retrieval not only enhances overall description quality metrics, but also reduces hallucinations. Moreover, despite not relying on fixed entity inventories, RAED outperforms systems that require predefined candidate sets at inference time on Entity Disambiguation. Finally, we show that descriptions generated by RAED provide useful entity representations for downstream Entity Linking models, leading to improved performance in the extremely challenging Emerging Entity Linking task.
pdf
bib
abs
Personalized Language Models via Privacy-Preserving Evolutionary Model Merging
Kyuyoung Kim
|
Jinwoo Shin
|
Jaehyung Kim
Personalization in language models aims to tailor model behavior to individual users or user groups. Prompt-based methods incorporate user preferences into queries, while training-based methods encode them into model parameters. Model merging has also been explored for personalization under limited data. However, existing methods often fail to directly optimize task-specific utility and lack explicit mechanisms for privacy preservation. To address the limitations, we propose Privacy-Preserving Model Merging via Evolutionary Algorithms (PriME), a novel personalization approach that employs gradient-free methods to directly optimize utility while reducing privacy risks. By integrating privacy preservation into the optimization objective, PriME creates personalized modules that effectively capture target user preferences while minimizing privacy risks for data-sharing users. Experiments on the LaMP benchmark show that PriME consistently outperforms a range of baselines, achieving up to a 45% improvement in task performance. Further analysis demonstrates that PriME achieves a superior privacy-utility trade-off compared to a prior state-of-the-art, with enhanced robustness to membership inference attacks and greater utility in capturing user preferences.
pdf
bib
abs
Aligning Text/Speech Representations from Multimodal Models with MEG Brain Activity During Listening
Padakanti Srijith
|
Khushbu Pahwa
|
Radhika Mamidi
|
Bapi Raju Surampudi
|
Manish Gupta
|
Subba Reddy Oota
Although speech language models are expected to align well with brain language processing during speech comprehension, recent studies have found that they fail to capture brain-relevant semantics beyond low-level features. Surprisingly, text-based language models exhibit stronger alignment with brain language regions, as they better capture brain-relevant semantics. However, no prior work has examined the alignment effectiveness of text/speech representations from multimodal models. This raises several key questions: Can speech embeddings from such multimodal models capture brain-relevant semantics through cross-modal interactions? Which modality can take advantage of this synergistic multimodal understanding to improve alignment with brain language processing? Can text/speech representations from such multimodal models outperform unimodal models? To address these questions, we systematically analyze multiple multimodal models, extracting both text- and speech-based representations to assess their alignment with MEG brain recordings during naturalistic story listening. We find that text embeddings from both multimodal and unimodal models significantly outperform speech embeddings from these models. Specifically, multimodal text embeddings exhibit a peak around 200 ms, suggesting that they benefit from speech embeddings, with heightened activity during this time period. However, speech embeddings from these multimodal models still show a similar alignment compared to their unimodal counterparts, suggesting that they do not gain meaningful semantic benefits over text-based representations. These results highlight an asymmetry in cross-modal knowledge transfer, where the text modality benefits more from speech information, but not vice versa.
pdf
bib
abs
STARQA: A Question Answering Dataset for Complex Analytical Reasoning over Structured Databases
Mounica Maddela
|
Lingjue Xie
|
Daniel Preotiuc-Pietro
|
Mausam
Our goal is to assess how well current Text2SQL systems support SQL analysts in their primary work of performing complex analytics on specialized relational databases. Although several benchmarks evaluate Text2SQL models, the complexity of questions (and the output SQL queries) in most datasets is inherently limited – they do not focus on intents involving analytics and reasoning. In response, we present STARQA, the first public human-created dataset focused on complex analytical questions and answers (involving nested joins, time series analytics, statistical operations, and more) on three specialized-domain databases. In addition to standard Text2SQL baselines, we also evaluate a novel approach (Text2SQLCode) that decomposes the task through a combination of SQL and Python: SQL is responsible for data fetching, and Python more naturally performs reasoning. Our results demonstrate that both existing Text2SQL systems and our Text2SQLCode approach find STARQA questions quite challenging, even though Text2SQLCode achieves better performance on the more difficult questions. Further analyses assess the typical errors made by existing systems and chart a research path for pushing the capabilities of real-world systems.
pdf
bib
abs
Slim-SC: Thought Pruning for Efficient Scaling with Self-Consistency
Colin Hong
|
Xu Guo
|
Anand Chaanan Singh
|
Esha Choukse
|
Dmitrii Ustiugov
Recently, Test-Time Scaling (TTS) has gained increasing attention for improving LLM reasoning performance at test time without retraining the model. A notable TTS technique is Self-Consistency (SC), which generates multiple reasoning chains in parallel and selects the final answer via majority voting. While effective, the order-of-magnitude computational overhead limits its broad deployment. Prior attempts to accelerate SC mainly rely on model-based confidence scores or heuristics with limited empirical support. For the first time, we theoretically and empirically analyze the inefficiencies of SC and reveal actionable opportunities for improvement. Building on these insights, we propose Slim-SC, a step-wise pruning strategy that identifies and removes redundant chains using inter-chain similarity at the thought level. Experiments on three STEM reasoning datasets and two recent LLM architectures show that Slim-SC reduces inference latency and KVC usage by up to 45% and 26%, respectively, with R1-Distill, while maintaining or improving accuracy, thus offering a simple yet efficient TTS alternative for SC.
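To illustrate the core idea of pruning redundant chains before voting, here is a minimal sketch that drops near-duplicate reasoning chains by embedding similarity and then takes a majority vote over the survivors. The threshold, the stand-in embeddings, and the chain-level (rather than step-wise) granularity are simplifying assumptions, not the Slim-SC procedure.

```python
import numpy as np
from collections import Counter

def prune_and_vote(chain_embeddings: np.ndarray, answers, sim_threshold: float = 0.95):
    """Drop chains that are near-duplicates of an already-kept chain, then
    majority-vote over the answers of the surviving chains."""
    normed = chain_embeddings / np.linalg.norm(chain_embeddings, axis=1, keepdims=True)
    kept = []
    for i in range(len(answers)):
        if all(normed[i] @ normed[j] < sim_threshold for j in kept):
            kept.append(i)
    final = Counter(answers[i] for i in kept).most_common(1)[0][0]
    return final, kept

emb = np.random.default_rng(0).normal(size=(8, 64))   # stand-in chain embeddings
answers = ["42", "42", "41", "42", "40", "42", "41", "42"]
final, survivors = prune_and_vote(emb, answers)
print(final, survivors)
```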
pdf
bib
abs
Long Chain-of-Thought Fine-tuning via Understanding-to-Reasoning Transition
Chenxin An
|
Zhihui Xie
|
Xiaonan Li
|
Ming Zhong
|
Shansan Gong
|
Lei Li
|
Jun Zhang
|
Jingjing Xu
|
Lingpeng Kong
Reasoning models have demonstrated remarkable performance on complex tasks by generating long reasoning traces prior to producing final answers. However, previous research on long-context scaling in language models has generally focused on managing lengthy input prompts instead of producing long outputs. To leverage the strong long context understanding abilities of current models, we introduce Understanding-to-Reasoning Transition (URT) fine-tuning, a sequence-level curriculum learning framework that gradually shifts a model’s focus from interpreting long chain-of-thoughts to generating them. By incorporating partial reasoning steps in the input context, URT naturally exposes the model to diverse prompt lengths during training, preserving its performance on long-context comprehension while developing advanced reasoning capabilities. Experiments on rigorous reasoning benchmarks, including AIME24 and GPQA Diamond, reveal that our approach surpasses standard fine-tuning by over 10%, while maintaining robust performance on the understanding tasks in RULER.
pdf
bib
abs
Exploring Large Language Models for Detecting Mental Disorders
Gleb Kuzmin
|
Petr Strepetov
|
Maksim Stankevich
|
Natalia Chudova
|
Artem Shelmanov
|
Ivan Smirnov
This paper compares the effectiveness of traditional machine learning methods, encoder-based models, and large language models (LLMs) on the task of detecting depression and anxiety. Five Russian-language datasets were considered, each differing in format and in the method used to define the target pathology class. We tested AutoML models based on linguistic features, several variations of encoder-based Transformers such as BERT, and state-of-the-art LLMs as pathology classification models. The results demonstrated that LLMs outperform traditional methods, particularly on noisy and small datasets where training examples vary significantly in text length and genre. However, psycholinguistic features and encoder-based models can achieve performance comparable to language models when trained on texts from individuals with clinically confirmed depression, highlighting their potential effectiveness in targeted clinical applications.
pdf
bib
abs
Efficient Real-time Refinement of Language Model Text Generation
Joonho Ko
|
Jinheon Baek
|
Sung Ju Hwang
Large language models (LLMs) have shown remarkable performance across a wide range of natural language tasks. However, a critical challenge remains in that they sometimes generate factually incorrect answers. To address this, much previous work has focused on identifying errors in LLM generations and refining them; however, such methods are slow in deployment since they are designed to verify the response only after the entire generation (from the first to the last token) is done. Further, we observe that once LLMs generate incorrect tokens early on, there is a higher likelihood that subsequent tokens will also be factually incorrect. To this end, in this work, we propose Streaming-VR (Streaming Verification and Refinement), a novel approach designed to enhance the efficiency of verification and refinement of LLM outputs. Specifically, the proposed Streaming-VR enables on-the-fly verification and correction of tokens as they are being generated, similar to a streaming process, ensuring that each subset of tokens is checked and refined in real-time by another LLM as the LLM constructs its response. Through comprehensive evaluations on multiple datasets, we demonstrate that our approach not only enhances the factual accuracy of LLMs, but also offers a more efficient solution compared to prior refinement methods.
pdf
bib
abs
Reward-Weighted Sampling: Enhancing Non-Autoregressive Characteristics in Masked Diffusion LLMs
Daehoon Gwak
|
Minseo Jung
|
Junwoo Park
|
Minho Park
|
ChaeHun Park
|
Junha Hyung
|
Jaegul Choo
Masked diffusion models (MDMs) offer a promising non-autoregressive alternative for large language modeling. Standard decoding methods for MDMs, such as confidence-based sampling, select tokens independently based on individual token confidences at each diffusion step. However, we observe that this independent token selection often results in generation orders resembling sequential autoregressive processes, limiting the advantages of non-autoregressive modeling. To mitigate this phenomenon, we propose Reward-Weighted Sampling (RWS), a novel decoding strategy that leverages an external reward model to provide a principled global signal during the iterative diffusion process. Specifically, at each diffusion step, RWS evaluates the quality of the entire intermediate sequence and scales token logits accordingly, guiding token selection by integrating global sequence-level coherence. This method selectively increases the confidence of tokens that initially have lower scores, thereby promoting a more non-autoregressive generation order. Furthermore, we provide theoretical justification showing that reward-weighted logit scaling induces beneficial rank reversals in token selection and consistently improves expected reward. Experiments demonstrate that RWS significantly promotes non-autoregressive generation orders, leading to improvements across multiple evaluation metrics. These results highlight the effectiveness of integrating global signals in enhancing both the non-autoregressive properties and overall performance of MDMs.
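The following sketch shows one simple way a sequence-level reward could rescale token logits before sampling at a single decoding step; the multiplicative form, the scaling constant, and the toy reward value are assumptions for illustration, not the exact RWS rule.

```python
import numpy as np

def reward_weighted_logits(logits: np.ndarray, reward: float, alpha: float = 1.0) -> np.ndarray:
    """Rescale token logits by a sequence-level reward signal before sampling
    (simplified illustration of reward-weighted logit scaling)."""
    return logits * (1.0 + alpha * reward)

def sample_token(logits: np.ndarray, rng) -> int:
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
step_logits = rng.normal(size=32)          # logits for one masked position
reward = 0.6                               # e.g., score from an external reward model
print(sample_token(reward_weighted_logits(step_logits, reward), rng))
```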
pdf
bib
abs
AI Argues Differently: Distinct Argumentative and Linguistic Patterns of LLMs in Persuasive Contexts
Esra Dönmez
|
Maximilian Maurer
|
Gabriella Lapesa
|
Agnieszka Falenska
Distinguishing LLM-generated text from human-written is a key challenge for safe and ethical NLP, particularly in high-stake settings such as persuasive online discourse. While recent work focuses on detection, real-world use cases also demand interpretable tools to help humans understand and distinguish LLM-generated texts. To this end, we present an analysis framework comparing human- and LLM-authored arguments using two easily-interpretable feature sets: general-purpose linguistic features (e.g., lexical richness, syntactic complexity) and domain-specific features related to argument quality (e.g., logical soundness, engagement strategies). Applied to */r/ChangeMyView* arguments by humans and three LLMs, our method reveals clear patterns: LLM-generated counter-arguments show lower type-token and lemma-token ratios but higher emotional intensity — particularly in anticipation and trust. They more closely resemble textbook-quality arguments — cogent, justified, explicitly respectful toward others, and positive in tone. Moreover, counter-arguments generated by LLMs converge more closely with the original post’s style and quality than those written by humans. Finally, we demonstrate that these differences enable a lightweight, interpretable, and highly effective classifier for detecting LLM-generated comments in CMV.
pdf
bib
abs
TounsiBench: Benchmarking Large Language Models for Tunisian Arabic
Souha Ben Hassine
|
Asma Arrak
|
Marouene Addhoum
|
Steven R Wilson
In this work, we introduce the first benchmark for evaluating the capabilities of large language models (LLMs) in understanding and generating responses in Tunisian Arabic. To achieve this, we construct a dataset of Tunisian Arabic instructions and prompt ten widely-used LLMs that claim to support Arabic. We then assess the LLM responses through both human and LLM-based evaluations across four criteria: quality, correctness, relevance, and dialectal adherence. We analyze the agreement and correlation between these judgments and identify GPT-4o as our automated judge model based on its high correlation with human ratings, and generate a final leaderboard using this model. Our error analysis reveals that most LLMs struggle with recognizing and properly responding in Tunisian Arabic. To facilitate further research, we release our dataset, along with gold-standard human-written responses for all 744 instructions, and our evaluation framework, allowing others to benchmark their own models.
pdf
bib
abs
Moral Framing in Politics (MFiP): A new resource and models for moral framing
Ines Rehbein
|
Ines Reinig
|
Simone Paolo Ponzetto
The construct of morality permeates our entire lives and influences our behavior and how we perceive others. It therefore comes as no surprise that morality also plays an important role in politics, as morally framed arguments are perceived as more appealing and persuasive. Thus, being able to identify moral framing in political communication and to detect subtle differences in politicians’ moral framing can provide the basis for many interesting analyses in the political sciences. In this paper, we release MoralFramingInPolitics (MFiP), a new corpus of German parliamentary debates where the speakers’ moral framing has been coded, using the framework of Moral Foundations Theory (MFT). Our fine-grained annotations distinguish different types of moral frames and also include narrative roles, together with the moral foundations for each frame. We then present models for frame type and moral foundation classification and explore the benefits of data augmentation (DA) and contrastive learning (CL) for the two tasks. All data and code will be made available to the research community.
pdf
bib
abs
ReDepress: A Cognitive Framework for Detecting Depression Relapse from Social Media
Aakash Kumar Agarwal
|
Saprativa Bhattacharjee
|
Mauli Rastogi
|
Jemima S. Jacob
|
Biplab Banerjee
|
Rashmi Gupta
|
Pushpak Bhattacharyya
Almost 50% of depression patients face the risk of relapse. The risk increases to 80% after the second episode of depression. Although depression detection from social media has attracted considerable attention, depression relapse detection has remained largely unexplored due to the lack of curated datasets and the difficulty of distinguishing relapse and non-relapse users. In this work, we present ReDepress, the first clinically validated social media dataset focused on relapse, comprising 204 Reddit users annotated by mental health professionals. Unlike prior approaches, our framework draws on cognitive theories of depression, incorporating constructs such as attention bias, interpretation bias, memory bias, and rumination into both annotation and modeling. Through statistical analyses and machine learning experiments, we demonstrate that cognitive markers significantly differentiate relapse and non-relapse groups, and that models enriched with these features achieve competitive performance, with transformer-based temporal models attaining an F1 of 0.86. Our findings validate psychological theories in real-world textual data and underscore the potential of cognition-informed computational methods for early relapse detection, paving the way for scalable, low-cost interventions in mental healthcare.
pdf
bib
abs
iKnow-audio: Integrating Knowledge Graphs with Audio-Language Models
Michel Olvera
|
Changhong Wang
|
Paraskevas Stamatiadis
|
Gaël Richard
|
Slim Essid
Contrastive Language–Audio Pretraining (CLAP) models learn by aligning audio and text in a shared embedding space, enabling powerful zero-shot recognition. However, their performance is highly sensitive to prompt formulation and language nuances, and they often inherit semantic ambiguities and spurious correlations from noisy pretraining data. While prior work has explored prompt engineering, adapters, and prefix tuning to address these limitations, the use of structured prior knowledge remains largely unexplored. We present iKnow-audio, a framework that integrates knowledge graphs with audio-language models to provide robust semantic grounding. iKnow-audio builds on the Audio-centric Knowledge Graph (AKG), which encodes ontological relations comprising semantic, causal, and taxonomic connections reflective of everyday sound scenes and events. By training knowledge graph embedding models on the AKG and refining CLAP predictions through this structured knowledge, iKnow-audio improves disambiguation of acoustically similar sounds and reduces reliance on prompt engineering. Comprehensive zero-shot evaluations across six benchmark datasets demonstrate consistent gains over baseline CLAP, supported by embedding-space analyses that highlight improved relational grounding. Resources are publicly available at https://github.com/michelolzam/iknow-audio
pdf
bib
abs
EduVidQA: Generating and Evaluating Long-form Answers to Student Questions based on Lecture Videos
Sourjyadip Ray
|
Shubham Sharma
|
Somak Aditya
|
Pawan Goyal
As digital platforms redefine educational paradigms, ensuring interactivity remains vital for effective learning. This paper explores using Multimodal Large Language Models (MLLMs) to automatically respond to student questions from online lectures - a novel question answering task of real-world significance. We introduce the EduVidQA Dataset with 5252 question-answer pairs (both synthetic and real-world) from 296 computer science videos covering diverse topics and difficulty levels. To understand the needs of the dataset and task evaluation, we empirically study the qualitative preferences of students, which we provide as an important contribution to this line of work. Our benchmarking experiments consist of 6 state-of-the-art MLLMs, through which we study the effectiveness of our synthetic data for finetuning, as well as showing the challenging nature of the task. We evaluate the models using both text-based and qualitative metrics, thus showing a nuanced perspective of the models’ performance, which is paramount to future work. This work not only sets a benchmark for this important problem, but also opens exciting avenues for future research in the field of Natural Language Processing for Education.
pdf
bib
abs
The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs
Denis Janiak
|
Jakub Binkowski
|
Albert Sawczyn
|
Bogdan Gabrys
|
Ravid Shwartz-Ziv
|
Tomasz Jan Kajdanowicz
Large language models (LLMs) have revolutionized natural language processing, yet their tendency to hallucinate poses serious challenges for reliable deployment. Despite numerous hallucination detection methods, their evaluations often rely on ROUGE, a metric based on lexical overlap that misaligns with human judgments. Through comprehensive human studies, we demonstrate that while ROUGE exhibits high recall, its extremely low precision leads to misleading performance estimates. In fact, several established detection methods show performance drops of up to 45.9% when assessed using human-aligned metrics like LLM-as-Judge. Moreover, our analysis reveals that simple heuristics based on response length can rival complex detection techniques, exposing a fundamental flaw in current evaluation practices. We argue that adopting semantically aware and robust evaluation frameworks is essential to accurately gauge the true performance of hallucination detection methods, ultimately ensuring the trustworthiness of LLM outputs.
pdf
bib
abs
Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions
Rachneet Singh Sachdeva
|
Rima Hazra
|
Iryna Gurevych
Large language models, despite extensive alignment with human values and ethical principles, remain vulnerable to sophisticated jailbreak attacks that exploit their reasoning abilities. Existing safety measures often detect overt malicious intent but fail to address subtle, reasoning-driven vulnerabilities. In this work, we introduce POATE (Polar Opposite query generation, Adversarial Template construction, and Elaboration), a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses. POATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety. We conduct extensive evaluation across six diverse language model families of varying parameter sizes to demonstrate the robustness of the attack, achieving significantly higher attack success rates (44%) compared to existing methods. To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses. These methods enhance reasoning robustness and strengthen the model’s defense against adversarial exploits.
pdf
bib
abs
CHURRO: Making History Readable with an Open-Weight Large Vision-Language Model for High-Accuracy, Low-Cost Historical Text Recognition
Sina Semnani
|
Han Zhang
|
Xinyan He
|
Merve Tekgurler
|
Monica Lam
Accurate text recognition for historical documents can greatly advance the study and preservation of cultural heritage. Existing vision-language models (VLMs), however, are designed for modern, standardized texts and are not equipped to read the diverse languages and scripts, irregular layouts, and frequent degradation found in historical materials. This paper presents CHURRO, a 3B-parameter open-weight VLM specialized for historical text recognition. The model is trained on CHURRO-DS, the largest historical text recognition dataset to date. CHURRO-DS unifies 155 historical corpora comprising 99,491 pages, spanning 22 centuries of textual heritage across 46 language clusters, including historical variants and dead languages. We evaluate several open-weight and closed VLMs and optical character recognition (OCR) systems on CHURRO-DS and find that CHURRO outperforms all other VLMs. On the CHURRO-DS test set, CHURRO achieves 82.3% (printed) and 70.1% (handwritten) normalized Levenshtein similarity, surpassing the second-best model, Gemini 2.5 Pro, by 1.4% and 6.5%, respectively, while being 15.5 times more cost-effective. By releasing the model and dataset, we aim to enable community-driven research to improve the readability of historical texts and accelerate scholarship.
pdf
bib
abs
Towards Author-informed NLP: Mind the Social Bias
Inbar Pendzel
|
Einat Minkov
Social text understanding is prone to fail when opinions are conveyed implicitly or sarcastically. It is therefore desirable to model users’ contexts when processing the texts they author. In this work, we represent users within a social embedding space learned from the Twitter network at large scale. Similar to word embeddings that encode lexical semantics, the network embeddings encode latent dimensions of social semantics. We perform extensive experiments on author-informed stance prediction, demonstrating improved generalization through inductive social user modeling, both within and across topics. Similar results were obtained for author-informed toxicity and incivility detection. The proposed approach may pave the way to social NLP that considers user embeddings as a contextual modality. However, our investigation also reveals that user stances are correlated with the personal socio-demographic traits encoded in their embeddings. Hence, author-informed NLP approaches may inadvertently model and reinforce socio-demographic and other social biases.
pdf
bib
abs
Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large Language Models
Sina Semnani
|
Jirayu Burapacheep
|
Arpandeep Khatua
|
Thanawan Atchariyachanvanit
|
Zheng Wang
|
Monica Lam
Wikipedia is the largest open knowledge corpus, widely used worldwide and serving as a key resource for training large language models (LLMs) and retrieval-augmented generation (RAG) systems. Ensuring its accuracy is therefore critical. But how accurate is Wikipedia, and how can we improve it? We focus on inconsistencies, a specific type of factual inaccuracy, and introduce the task of corpus-level inconsistency detection. We present CLAIRE, an agentic system that combines LLM reasoning with retrieval to surface potentially inconsistent claims along with contextual evidence for human review. In a user study with experienced Wikipedia editors, 87.5% reported higher confidence when using CLAIRE, and participants identified 64.7% more inconsistencies in the same amount of time. Combining CLAIRE with human annotation, we contribute WIKICOLLIDE, the first benchmark of real Wikipedia inconsistencies. Using random sampling with CLAIRE-assisted analysis, we find that at least 3.3% of English Wikipedia facts contradict another fact, with inconsistencies propagating into 7.3% of FEVEROUS and 4.0% of AmbigQA examples. Benchmarking strong baselines on this dataset reveals substantial headroom: the best fully automated system achieves an AUROC of only 75.1%. Our results show that contradictions are a measurable component of Wikipedia and that LLM-based systems like CLAIRE can provide a practical tool to help editors improve knowledge consistency at scale.
pdf
bib
abs
Leveraging Multilingual Training for Authorship Representation: Enhancing Generalization across Languages and Domains
Junghwan Kim
|
Haotian Zhang
|
David Jurgens
Authorship representation (AR) learning, which models an author’s unique writing style, has demonstrated strong performance in authorship attribution tasks. However, prior research has primarily focused on monolingual settings—mostly in English—leaving the potential benefits of multilingual AR models underexplored. We introduce a novel method for multilingual AR learning that incorporates two key innovations: probabilistic content masking, which encourages the model to focus on stylistically indicative words rather than content-specific words, and language-aware batching, which improves contrastive learning by reducing cross-lingual interference. Our model is trained on over 4.5 million authors across 36 languages and 13 domains. It consistently outperforms monolingual baselines in 21 out of 22 non-English languages, achieving an average Recall@8 improvement of 4.85%, with a maximum gain of 15.91% in a single language. Furthermore, it exhibits stronger cross-lingual and cross-domain generalization compared to a monolingual model trained solely on English. Our analysis confirms the effectiveness of both proposed techniques, highlighting their critical roles in the model’s improved performance.
pdf
bib
abs
DrFrattn: Directly Learn Adaptive Policy from Attention for Simultaneous Machine Translation
Libo Zhao
|
Jing Li
|
Ziqian Zeng
Simultaneous machine translation (SiMT) necessitates a robust read/write (R/W) policy to determine the optimal moments for translation, thereby balancing translation quality and latency. Effective timing in translation can align source and target tokens accurately. The attention mechanism within translation models inherently provides valuable alignment information. Building on this, previous research has attempted to modify the attention mechanism’s structure to leverage its alignment properties during training, employing multi-task learning to derive the read/write policy. However, this multi-task learning approach may compromise the efficacy of the attention mechanism itself. This raises a natural question: why not directly learn the read/write policy from the well-trained attention mechanism? In this study, we propose DrFrattn, a method that directly learns adaptive policies from the attention mechanism. Experimental results across various benchmarks demonstrate that our approach achieves an improved balance between translation accuracy and latency.
pdf
bib
abs
The Sound of Syntax: Finetuning and Comprehensive Evaluation of Language Models for Speech Pathology
Fagun Patel
|
Duc Quang Nguyen
|
Sang T. Truong
|
Jody Vaynshtok
|
Sanmi Koyejo
|
Nick Haber
According to the U.S. National Institutes of Health, more than 3.4 million children experience speech disorders that require clinical intervention. The number of speech-language pathologists (SLPs) is roughly 20 times fewer than the number of affected children, highlighting a significant gap in children’s care and a pressing need for technological support that improves the productivity of SLPs. State-of-the-art multimodal language models (MLMs) show promise for supporting SLPs, but their use remains underexplored largely due to a limited understanding of their performance in high-stakes clinical settings. To address this gap, we collaborate with domain experts to develop a taxonomy of real-world use cases of MLMs in speech-language pathology. Building on this taxonomy, we introduce the first comprehensive benchmark for evaluating MLMs across five core use cases, each containing 1,000 manually annotated data points. This benchmark includes robustness and sensitivity tests under various settings, including background noise, speaker gender, and accent. Our evaluation of 15 state-of-the-art MLMs reveals that no single model consistently outperforms others across all tasks. Notably, we find systematic disparities, with models performing better on male speakers, and observe that chain-of-thought prompting can degrade performance on classification tasks with large label spaces and narrow decision boundaries. Furthermore, we study fine-tuning MLMs on domain-specific data, achieving improvements of over 30% compared to base models. These findings highlight both the potential and limitations of current MLMs for speech-language pathology applications, underscoring the need for further research and targeted development.
pdf
bib
abs
NormXLogit: The Head-on-Top Never Lies
Sina Abbasi
|
Mohammad Reza Modarres
|
Mohammad Taher Pilehvar
With new large language models (LLMs) emerging frequently, it is important to consider the potential value of model-agnostic approaches that can provide interpretability across a variety of architectures. While recent advances in LLM interpretability show promise, many rely on complex, model-specific methods with high computational costs. To address these limitations, we propose NormXLogit, a novel technique for assessing the significance of individual input tokens. This method operates based on the input and output representations associated with each token. First, we demonstrate that the norm of word embeddings can be utilized as a measure of token importance. Second, we reveal a significant relationship between a token’s importance and how predictive its representation is of the model’s final output. Extensive analyses indicate that our approach outperforms existing gradient-based methods in terms of faithfulness and offers competitive performance compared to leading architecture-specific techniques.
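As a minimal illustration of the norm-based importance idea described above (the exact formulation in the paper may differ, so treat this as an assumption), one can score each input token by the L2 norm of its per-token representation and normalize the scores:

```python
import numpy as np

def norm_importance(token_reprs: np.ndarray) -> np.ndarray:
    """token_reprs: (seq_len, hidden_dim) array of per-token vectors."""
    norms = np.linalg.norm(token_reprs, axis=-1)   # one norm per token
    return norms / norms.sum()                     # normalized importance scores

reprs = np.random.default_rng(0).normal(size=(5, 16))  # placeholder representations
print(norm_importance(reprs).round(3))
```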
pdf
bib
abs
Doc2Chart: Intent-Driven Zero-Shot Chart Generation from Documents
Akriti Jain
|
Pritika Ramu
|
Aparna Garimella
|
Apoorv Saxena
Large Language Models (LLMs) have demonstrated strong capabilities in transforming text descriptions or tables to data visualizations via instruction-tuning methods. However, it is not straightforward to apply these methods directly for a more real-world use case of visualizing data from long documents based on user-given intents, as opposed to the user pre-selecting the relevant content manually. We introduce the task of _intent-based chart generation_ from documents: given a user-specified intent and document(s), the goal is to generate a chart adhering to the intent and grounded on the document(s) in a zero-shot setting. We propose an unsupervised, two-stage framework in which an LLM first extracts relevant information from the document(s) by decomposing the intent and iteratively validates and refines this data. Next, a heuristic-guided module selects an appropriate chart type before final code generation. To assess the data accuracy of the generated charts, we propose an attribution-based metric that uses a structured textual representation of charts, instead of relying on visual decoding metrics that often fail to capture the chart data effectively. To validate our approach, we curate a dataset comprising 1,242 <intent, document, charts> tuples from two domains, finance and scientific, in contrast to the existing datasets that are largely limited to parallel text descriptions/tables and their corresponding charts. We compare our approach with baselines using single-shot chart generation using LLMs and query-based retrieval methods; our method outperforms the best baselines by up to 9 points in chart data accuracy and 17 points in chart type accuracy.
pdf
bib
abs
Breaking Agents: Compromising Autonomous LLM Agents Through Malfunction Amplification
Boyang Zhang
|
Yicong Tan
|
Yun Shen
|
Ahmed Salem
|
Michael Backes
|
Savvas Zannettou
|
Yang Zhang
Recently, autonomous agents built on large language models (LLMs) have experienced significant development and are being deployed in real-world applications. Through the usage of tools, these systems can perform actions in the real world. Given the agents’ practical applications and ability to execute consequential actions, such autonomous systems can cause more severe damage than a standalone LLM if compromised. While some existing research has explored harmful actions by LLM agents, our study approaches the vulnerability from a different perspective. We introduce a new type of attack that causes malfunctions by misleading the agent into executing repetitive or irrelevant actions. Our experiments reveal that these attacks can induce failure rates exceeding 80% in multiple scenarios. Through attacks on implemented and deployable agents in multi-agent scenarios, we accentuate the realistic risks associated with these vulnerabilities. To mitigate such attacks, we propose self-examination defense methods. Our findings indicate these attacks are more difficult to detect compared to previous overtly harmful attacks, highlighting the substantial risks associated with this vulnerability.
pdf
bib
abs
FoREST: Frame of Reference Evaluation in Spatial Reasoning Tasks
Tanawan Premsri
|
Parisa Kordjamshidi
Spatial reasoning is a fundamental aspect of human intelligence. One key concept in spatial cognition is the Frame of Reference (FoR), which identifies the perspective of spatial expressions. Despite its significance, FoR has received limited attention in AI models that need spatial intelligence. There is a lack of dedicated benchmarks and in-depth evaluation of large language models (LLMs) in this area. To address this issue, we introduce the Frame of Reference Evaluation in Spatial Reasoning Tasks (FoREST) benchmark, designed to assess FoR comprehension in LLMs. We evaluate LLMs on answering questions that require FoR comprehension and layout generation in text-to-image models using FoREST. Our results reveal a notable performance gap across different FoR classes in various LLMs, affecting their ability to generate accurate layouts for text-to-image generation. This highlights critical shortcomings in FoR comprehension. To improve FoR understanding, we propose Spatial-Guided prompting, which improves LLMs’ ability to extract essential spatial concepts. Our proposed method improves overall performance across spatial reasoning tasks.
pdf
bib
abs
Multilinguality Does not Make Sense: Investigating Factors Behind Zero-Shot Cross-Lingual Transfer in Sense-Aware Tasks
Roksana Goworek
|
Haim Dubossarsky
Cross-lingual transfer allows models to perform tasks in languages unseen during training and is often assumed to benefit from increased multilinguality. In this work, we challenge this assumption in the context of two underexplored, sense-aware tasks: polysemy disambiguation and lexical semantic change. Through a large-scale analysis across 28 languages, we show that multilingual training is neither necessary nor inherently beneficial for effective transfer. Instead, we find that confounding factors, such as fine-tuning data composition and evaluation artifacts, can better account for the perceived advantages of multilinguality. Our findings call for more rigorous evaluations in multilingual NLP, and more nuanced and sensible choice of models for transfer. We release fine-tuned models and benchmarks to support further research, with implications extending to low-resource and typologically diverse languages.
pdf
bib
abs
Translating Domain-Specific Terminology in Typologically-Diverse Languages: A Study in Tax and Financial Education
Arturo Oncevay
|
Elena Kochkina
|
Keshav Ramani
|
Toyin Aguda
|
Simerjot Kaur
|
Charese Smiley
Domain-specific multilingual terminology is essential for accurate machine translation (MT) and cross-lingual NLP applications. We present a gold-standard terminology resource for the tax and financial education domains, built from curated governmental publications and covering seven typologically diverse languages: English, Spanish, Russian, Vietnamese, Korean, Chinese (traditional and simplified) and Haitian Creole. Using this resource, we assess various MT systems and LLMs on translation quality and term accuracy. We annotate over 3,000 terms for domain-specificity, facilitating a comparison between domain-specific and general term translations, and observe models’ challenges with specialized tax terms. We also analyze the case of terminology-aided translation, and the LLMs’ performance in extracting the translated term given the context. Our results highlight model limitations and the value of high-quality terminologies for advancing MT research in specialized contexts.
pdf
bib
abs
Train It and Forget It: Merge Lists are Unnecessary for BPE Inference in Language Models
Tomohiro Sawada
|
Kartik Goyal
Standard Byte-Pair Encoding (BPE) tokenization compresses text by pairing a learned token vocabulary with a detailed merge list. Recent work has shown that this merge list exposes a potential attack surface for extracting information about a language model’s training data. In this paper, we explore the downstream impact of BPE inference algorithms that do not rely on this merge list at all, and hence differ from the encoding process used during BPE training. To study this, we investigate two broad classes of BPE inference schemes that differ from BPE application during training: a) targeted deviations from merge lists, including random merge orders and various corruptions of the merge list involving deletion/truncation, and b) non-targeted BPE inference algorithms that do not depend on the merge list but focus on compressing the text either greedily or exactly. Extensive experiments across diverse language modeling tasks like accuracy-based QA benchmarks, machine translation, and open-ended generation reveal that while the targeted deviations from the merge lists exhibit significant degradation in language model performance, the non-targeted merge-list-free inference algorithms result in minimal impact on downstream performance that is often much smaller than expected. These findings pave the way for simpler and potentially more privacy-preserving tokenization schemes that do not catastrophically compromise model performance.
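A hedged sketch of one merge-list-free inference scheme of the greedy kind named above: longest-prefix matching against the BPE vocabulary alone, with no merge order. The toy vocabulary and fallback to single characters are assumptions for illustration, not the paper's implementation.

```python
def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    tokens, i = [], 0
    while i < len(text):
        # take the longest vocabulary entry that prefixes the remaining text,
        # falling back to a single character if nothing matches
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

toy_vocab = {"un", "believ", "able", "token", "iz", "ation"}
print(greedy_tokenize("unbelievable", toy_vocab))  # ['un', 'believ', 'able']
```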
pdf
bib
abs
Spectral Scaling Laws in Language Models: How Effectively Do Feed-Forward Networks Use Their Latent Space?
Nandan Kumar Jha
|
Brandon Reagen
As Large Language Models (LLMs) scale, the question is not just how large they become, but how much of their capacity is effectively utilized. Existing scaling laws relate model size to loss, yet overlook how components exploit their latent space. In this work, we focus on Feed-Forward Networks (FFNs) and recast width selection as a spectral utilization optimization problem. Using a lightweight diagnostic suite: Hard Rank (participation ratio), Soft Rank (Shannon Rank), Spectral Concentration, and the composite Spectral Utilization Index (SUI), we quantify how many latent directions are meaningfully activated across LLaMA, GPT-2, and nGPT families. Our key finding is an Asymmetric Spectral Scaling Law: soft rank follows an almost perfect power law with FFN width, while hard rank grows only sublinearly, with high variance. This asymmetry suggests that widening FFNs mostly adds low-energy tail directions, while dominant-mode subspaces saturate early. Moreover, at larger widths, variance further collapses into a narrow subspace, leaving much of the latent space under-utilized. These results recast FFN width selection as a principled trade-off between tail capacity and dominant-mode capacity, offering concrete guidance for inference-efficient LLM design.
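A minimal sketch of the two spectral diagnostics named above, under the assumption that they are computed from the singular values of an FFN weight or activation matrix: the participation ratio as a "hard rank" and the exponential of the Shannon entropy of the normalized spectrum as a "soft rank". The matrix below is a random placeholder.

```python
import numpy as np

def spectral_ranks(matrix: np.ndarray) -> tuple[float, float]:
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s**2 / np.sum(s**2)                                 # normalized spectral energy
    hard = float((np.sum(s**2) ** 2) / np.sum(s**4))        # participation ratio
    soft = float(np.exp(-np.sum(p * np.log(p + 1e-12))))    # exp(Shannon entropy)
    return hard, soft

W = np.random.default_rng(0).normal(size=(256, 1024))       # placeholder FFN matrix
print(spectral_ranks(W))
```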
pdf
bib
abs
TLUE: A Tibetan Language Understanding Evaluation Benchmark
Fan Gao
|
Cheng Huang
|
Yutong Liu
|
Nyima Tashi
|
Xiangxiang Wang
|
Thupten Tsering
|
Ban Ma-bao
|
Renzeng Duojie
|
Gadeng Luosang
|
Rinchen Dongrub
|
Dorje Tashi
|
Xiao Feng Cd
|
Yongbin Yu
|
Hao Wang
Large language models have made tremendous progress in recent years, but low-resource languages, like Tibetan, remain significantly underrepresented in their evaluation. Despite Tibetan being spoken by over seven million people, it has largely been neglected in the development and assessment of LLMs. To address this gap, we present a Tibetan Language Understanding Evaluation Benchmark, TLUE, which is also the first large-scale benchmark for measuring the proficiency of large language models in the Tibetan language. TLUE comprises two major components: a comprehensive multi-task understanding benchmark spanning 5 domains and 67 subdomains, and a safety benchmark encompassing 7 subdomains. Finally, we evaluate a diverse set of state-of-the-art LLMs. Experimental results demonstrate that most large language models perform below the random baseline, especially highlighting the considerable challenges they face in Tibetan language processing. TLUE provides a crucial foundation for advancing future research in Tibetan language understanding and highlights the importance of promoting greater inclusivity in the development of large language models.
pdf
bib
abs
Retrieving Support to Rank Answers in Open-Domain Question Answering
Zeyu Zhang
|
Alessandro Moschitti
|
Thuy Vu
We introduce a novel Question Answering (QA) architecture that enhances answer selection by retrieving targeted supporting evidence. Unlike traditional methods, which retrieve documents or passages relevant only to a query q, our approach retrieves content relevant to the combined pair (q, a), explicitly emphasizing the supporting relation between the query and a candidate answer a. By prioritizing this relational context, our model effectively identifies paragraphs that directly substantiate the correctness of a with respect to q, leading to more accurate answer verification than standard retrieval systems. Our neural retrieval method also scales efficiently to collections containing hundreds of millions of paragraphs. Moreover, this approach can be used by large language models (LLMs) to retrieve explanatory paragraphs that ground their reasoning, enabling them to tackle more complex QA tasks with greater reliability and interpretability.
pdf
bib
abs
Trojsten Benchmark: Evaluating LLM Problem-Solving in Slovak STEM Competition Problems
Adam Zahradník
|
Marek Suppa
Large language models show promising performance on reasoning tasks, yet evaluation methods for low-resource languages remain limited, particularly for complex STEM problem-solving. We introduce Trojsten Benchmark, a Slovak-language dataset of 1,108 high-school competition problems with reference solutions across mathematics, physics, and programming, and a rubric-based LLM grading framework. Using GPT-4 to generate rubrics and grade solutions, we observe 1.05 average absolute deviation from human graders (5-point scale), while benchmarking GPT-3.5-Turbo, GPT-4, GPT-4o, and open-weight models (Llama 3, Phi-3). We quantify multistep reasoning performance by difficulty, show consistent underperformance on harder items, and demonstrate language sensitivity: accuracy drops on English translations of Slovak statements, evidencing challenges beyond translation. Trojsten Benchmark complements English-centric math datasets (e.g., MATH, GSM8K) by targeting open-response, rubric-gradable reasoning under low-resource linguistic framing. We release code and data to enable reproducible evaluation and human-aligned auto-grading for STEM in under-served languages.
pdf
bib
abs
BRSpeech-DF: A Deep Fake Synthetic Speech Dataset for Portuguese Zero-Shot TTS
Alexandre Costa Ferro Filho
|
Rafaello Virgilli
|
Lucas Alcantara Souza
|
F S de Oliveira
|
Marcelo Henrique Lopes Ferreira
|
Daniel Tunnermann
|
Gustavo Dos Reis Oliveira
|
Anderson Da Silva Soares
|
Arlindo Rodrigues Galvão Filho
The detection of audio deepfakes (ADD) has become increasingly important due to the rapid evolution of generative speech models. However, progress in this field remains uneven across languages, particularly for low-resource languages like Portuguese, which lack high-quality datasets. In this paper, we introduce BRSpeech-DF, the first publicly available ADD dataset for Portuguese, encompassing both Brazilian and European variants. The dataset contains over 458,000 utterances, including a smaller portion of real speech from 62 speakers and a large collection of synthetic samples generated using multiple zero-shot text-to-speech (TTS) models, each conditioned on the original speaker’s voice. By providing this resource, our objective is to support the development of robust, multilingual detection systems, thereby advancing equity in speech forensics and security research. BRSpeech-DF addresses a significant gap in annotated data for underrepresented languages, facilitating more inclusive and generalizable advancements in synthetic speech detection.
pdf
bib
abs
A Simple Yet Effective Method for Non-Refusing Context Relevant Fine-grained Safety Steering in LLMs
Shaona Ghosh
|
Amrita Bhattacharjee
|
Yftah Ziser
|
Christopher Parisien
Fine-tuning large language models (LLMs) to meet evolving safety policies is costly and impractical. Mechanistic interpretability enables inference-time control through latent activation steering, but its potential for precise, customizable safety adjustments remains underexplored. We propose SafeSteer, a simple and effective method to guide LLM outputs by (i) leveraging category-specific steering vectors for fine-grained control, (ii) applying a gradient-free, unsupervised approach that enhances safety while preserving text quality and topic relevance without forcing explicit refusals, and (iii) eliminating the need for contrastive safe data. Across multiple LLMs, datasets, and risk categories, SafeSteer provides precise control, avoids blanket refusals, and directs models to generate safe, relevant content, aligning with recent findings that simple activation-steering techniques often outperform more complex alternatives.
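To make the inference-time mechanism concrete, here is a hedged sketch of category-specific activation steering in the spirit described above; the additive form, the scaling factor, and the category names are illustrative assumptions rather than the authors' exact procedure.

```python
import numpy as np

def steer(hidden: np.ndarray, category_vectors: dict, category: str,
          alpha: float = 4.0) -> np.ndarray:
    """hidden: (seq_len, hidden_dim) layer activations; returns steered activations."""
    v = category_vectors[category]
    v = v / (np.linalg.norm(v) + 1e-8)      # unit-norm steering direction
    return hidden + alpha * v               # broadcast over sequence positions

rng = np.random.default_rng(0)
vectors = {"violence": rng.normal(size=64), "self_harm": rng.normal(size=64)}
h = rng.normal(size=(10, 64))               # placeholder hidden states
print(steer(h, vectors, "violence").shape)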
pdf
bib
abs
Statistical and Neural Methods for Hawaiian Orthography Modernization
Jaden Kapali
|
Keaton Williamson
|
Winston Wu
Hawaiian orthography employs two distinct spelling systems, both of which are used by communities of speakers today. These two spelling systems are distinguished by the presence of the ‘okina letter and kahakō diacritic, which represent glottal stops and long vowels, respectively. We develop several models ranging in complexity to convert between these two orthographies. Our results demonstrate that simple statistical n-gram models surprisingly outperform neural seq2seq models and LLMs, highlighting the potential for traditional machine learning approaches in a low-resource setting.
pdf
bib
abs
so much depends / upon / a whitespace: Why Whitespace Matters for Poets and LLMs
Sriharsh Bhyravajjula
|
Melanie Walsh
|
Anna Preus
|
Maria Antoniak
Whitespace is a critical component of poetic form, reflecting both adherence to standardized forms and rebellion against those forms. Each poem’s whitespace distribution reflects the artistic choices of the poet and is an integral semantic and spatial feature of the poem. Yet, despite the popularity of poetry as both a long-standing art form and as a generation task for large language models (LLMs), whitespace has not received sufficient attention from the NLP community. Using a corpus of 19k English-language published poems from Poetry Foundation, we investigate how 4k poets have used whitespace in their works. We release a subset of 2.8k public-domain poems with preserved formatting to facilitate further research in this area. We compare whitespace usage in the published poems to (1) 51k LLM-generated poems, and (2) 12k unpublished poems posted in an online community. We also explore whitespace usage across time periods, poetic forms, and data sources. Additionally, we find that different text processing methods can result in significantly different representations of whitespace in poetry data, motivating us to use these poems and whitespace patterns to discuss implications for the processing strategies used to assemble pretraining datasets for LLMs.
pdf
bib
abs
Certified Mitigation of Worst-Case LLM Copyright Infringement
Jingyu Zhang
|
Jiacan Yu
|
Marc Marone
|
Benjamin Van Durme
|
Daniel Khashabi
The exposure of large language models (LLMs) to copyrighted material during pre-training raises concerns about unintentional copyright infringement post deployment. This has driven the development of “copyright takedown” methods—post-training approaches aimed at preventing models from generating content substantially similar to copyrighted ones. While current mitigation approaches are somewhat effective for average-case risks, we demonstrate that they overlook worst-case copyright risks exhibited by the existence of long, verbatim quotes from copyrighted sources. We propose BloomScrub, a remarkably simple yet highly effective inference-time approach that provides certified copyright takedown. Our method repeatedly interleaves quote detection with rewriting techniques to transform potentially infringing segments. By leveraging efficient data sketches (Bloom filters), our approach enables scalable copyright screening—even for large-scale real-world corpora. When quotes beyond a length threshold cannot be removed, the system can abstain from responding, offering certified risk reduction. Experimental results show that BloomScrub reduces infringement risk, preserves utility, and accommodates different levels of enforcement stringency with adaptive abstention. Our results suggest that lightweight, inference-time methods can be surprisingly effective for copyright prevention.
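As an illustrative aside, a minimal sketch of the data-sketch ingredient mentioned above: screening generated text for long verbatim n-grams from a protected corpus using a Bloom filter. The filter size, hash scheme, and 8-gram window are assumptions chosen for the example, not the paper's configuration.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
        self.bits = bytearray(num_bits // 8)
        self.m, self.k = num_bits, num_hashes

    def _positions(self, item: str):
        for i in range(self.k):
            h = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8)
            yield int.from_bytes(h.digest(), "big") % self.m

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def ngrams(text: str, n: int = 8):
    toks = text.split()
    return (" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1))

# index the protected corpus, then flag outputs containing any indexed 8-gram
bf = BloomFilter()
for g in ngrams("it was the best of times it was the worst of times"):
    bf.add(g)
output = "the model wrote that it was the best of times it was the worst of times indeed"
print(any(g in bf for g in ngrams(output)))  # True -> candidate span for rewriting
```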
pdf
bib
abs
Quantifying Logical Consistency in Transformers via Query-Key Alignment
Eduard Tulchinskii
|
Laida Kushnareva
|
Anastasia Voznyuk
|
Andrei Andriiainen
|
Irina Piontkovskaya
|
Evgeny Burnaev
|
Serguei Barannikov
Large language models (LLMs) excel at many NLP tasks, yet their multi-step logical reasoning remains unreliable. Existing solutions such as Chain-of-Thought prompting generate intermediate steps but provide no internal check of their logical coherence. In this paper, we use the “QK-score”, a lightweight metric based on query–key alignments within transformer attention heads, to evaluate the logical reasoning capabilities of LLMs. Our method automatically identifies attention heads that play a key role in distinguishing valid from invalid logical inferences, enabling efficient inference-time evaluation via a single forward pass. It reveals latent reasoning structure in LLMs and provides a scalable mechanistic alternative to ablation-based analysis. Across three benchmarks (ProntoQA-OOD, PARARULE-Plus, and MultiLogicEval), with models ranging from 1.5B to 70B parameters, the selected heads predict logical validity up to 14% better than the model probabilities, and remain robust under distractors and increasing reasoning depth of d ≤ 6.
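A simplified sketch of a query-key alignment score in the spirit of the metric above, assuming (as a simplification of the paper's setup) that we read a single head's query and key projections and score each candidate answer by the attention logit from the final position:

```python
import numpy as np

def qk_score(q_final: np.ndarray, keys: np.ndarray, option_positions: list[int]) -> int:
    """q_final: (d_head,) query at the last position; keys: (seq_len, d_head)."""
    logits = keys @ q_final / np.sqrt(q_final.shape[-1])   # scaled dot products
    return int(np.argmax([logits[p] for p in option_positions]))

rng = np.random.default_rng(0)
best = qk_score(rng.normal(size=32), rng.normal(size=(40, 32)), option_positions=[10, 20, 30])
print(best)  # index of the option whose position the head attends to most
```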
pdf
bib
abs
SimulatorArena: Are User Simulators Reliable Proxies for Multi-Turn Evaluation of AI Assistants?
Yao Dou
|
Michel Galley
|
Baolin Peng
|
Chris Kedzie
|
Weixin Cai
|
Alan Ritter
|
Chris Quirk
|
Wei Xu
|
Jianfeng Gao
Large language models (LLMs) are increasingly used in interactive applications, and human evaluation remains the gold standard for assessing their performance in multi-turn conversations. Since human studies are costly, time-consuming, and hard to reproduce, recent work explores using LLMs to simulate users for automatic assistant evaluation. However, there is no benchmark or systematic study to evaluate whether these simulated users are reliable stand-ins for real users. To address this, we introduce SimulatorArena, a benchmark of 909 annotated human–LLM conversations on two interactive tasks—math tutoring and document creation. SimulatorArena evaluates simulators based on how closely their messages match human behavior and how well their assistant ratings align with human judgments. Experiments on various simulator methods show that simulators conditioned on user profiles, capturing traits like background and message styles, align closely with human judgments. They reach Spearman’s 𝜌 of 0.7 on both tasks, providing a practical, scalable alternative to human evaluation. Using the best simulator for each task, we benchmark 18 assistants, including the latest LLMs such as GPT-5, Claude 4.1 Opus, and Gemini 2.5 Pro.
pdf
bib
abs
CourtReasoner: Can LLM Agents Reason Like Judges?
Sophia Simeng Han
|
Yoshiki Takashima
|
Shannon Zejiang Shen
|
Chen Liu
|
Yixin Liu
|
Roque K. Thuo
|
Sonia Knowlton
|
Ruzica Piskac
|
Scott J Shapiro
|
Arman Cohan
LLMs are increasingly applied in the legal domain for tasks such as summarizing legal texts and providing basic legal advice. Yet their capacity to draft full judicial analyses, such as generating entire judicial reasoning sections in U.S. court opinions, remains largely unexplored. Given the continued adoption of LLMs and the significance of law to society at large, measuring LLMs’ legal reasoning capabilities is a pressing task. We propose CourtReasoner, a novel expert-annotated judicial reasoning benchmark for evaluating LLM agents’ capabilities in complex legal reasoning. Sourcing U.S. court opinions, we construct benchmarks that measure LLMs’ ability to construct goal-oriented legal reasoning. CourtReasoner measures an agent’s ability to argue both ways in a legal dispute, rather than simple Q/A. Our results show that more than 60% of frontier model outputs contain invalid arguments and more than 53% of frontier model outputs contain irrelevant citations when conducting complex legal reasoning. We also introduce a meta-evaluation benchmark to provide insights into the capabilities of LLMs as evaluators of legal reasoning. We will release our data, code, and full annotation guidelines publicly for future research.
pdf
bib
abs
Not Your Typical Government Tipline: LLM-Assisted Routing of Environmental Protection Agency Citizen Tips
Sharanya Majumder
|
Zehua Li
|
Derek Ouyang
|
Kit T Rodolfa
|
Elena Eneva
|
Julian Nyarko
|
Daniel E. Ho
Regulatory agencies often operate with limited resources and rely on tips from the public to identify potential violations. However, processing these tips at scale presents significant operational challenges, as agencies must correctly identify and route relevant tips to the appropriate enforcement divisions. Through a case study, we demonstrate how advances in large language models can be utilized to support overburdened agencies with limited capacities. In partnership with the U.S. Environmental Protection Agency, we leverage previously unstudied citizen tips data from their “Report a Violation” system to develop an LLM-assisted pipeline for tip routing. Our approach filters out 80.5% of irrelevant tips and increases overall routing accuracy from 31.8% to 82.4% compared to the current routing system. At a time of increased focus on government efficiencies, our approach provides a constructive path forward by using technology to empower civil servants.
pdf
bib
abs
Retracing the Past: LLMs Emit Training Data When They Get Lost
Myeongseob Ko
|
Nikhil Reddy Billa
|
Adam Nguyen
|
Charles Fleming
|
Ming Jin
|
Ruoxi Jia
The memorization of training data in large language models (LLMs) poses significant privacy and copyright concerns. Existing data extraction methods, particularly heuristic-based divergence attacks, often exhibit limited success and offer limited insight into the fundamental drivers of memorization leakage. This paper introduces Confusion-Inducing Attacks (CIA), a principled framework for extracting memorized data by systematically maximizing model uncertainty. We empirically demonstrate that the emission of memorized text during divergence is preceded by a sustained spike in token-level prediction entropy. CIA leverages this insight by optimizing input snippets to deliberately induce this consecutive high-entropy state. For aligned LLMs, we further propose Mismatched Supervised Fine-tuning (SFT) to simultaneously weaken their alignment and induce targeted confusion, thereby increasing susceptibility to our attacks. Experiments on various unaligned and aligned LLMs demonstrate that our proposed attacks outperform existing baselines in extracting verbatim and near-verbatim training data without requiring prior knowledge of the training data. Our findings highlight persistent memorization risks across various LLMs and offer a more systematic method for assessing these vulnerabilities.
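A hedged sketch of the diagnostic signal described above: token-level prediction entropy computed from next-token distributions, with a simple check for a sustained high-entropy span. The window length, threshold, and placeholder distributions are illustrative assumptions.

```python
import numpy as np

def token_entropy(probs: np.ndarray) -> np.ndarray:
    """probs: (num_steps, vocab_size) next-token distributions."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=-1)

def sustained_high_entropy(probs: np.ndarray, window: int = 5, thresh: float = 4.0) -> bool:
    ent = token_entropy(probs)
    # flag if every step in some window exceeds the entropy threshold
    return any(ent[i:i + window].min() > thresh for i in range(len(ent) - window + 1))

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(1000) * 0.05, size=32)   # peaked placeholder distributions
print(sustained_high_entropy(p))
```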
pdf
bib
abs
Layer-wise Minimal Pair Probing Reveals Contextual Grammatical-Conceptual Hierarchy in Speech Representations
Linyang He
|
Qiaolin Wang
|
Xilin Jiang
|
Nima Mesgarani
Transformer-based speech language models (SLMs) have significantly improved neural speech recognition and understanding. While existing research has examined how well SLMs encode shallow acoustic and phonetic features, the extent to which SLMs encode nuanced syntactic and conceptual features remains unclear. By drawing parallels with linguistic competence assessments for large language models, this study is the first to systematically evaluate the presence of contextual syntactic and semantic features across SLMs for self-supervised learning (S3M), automatic speech recognition (ASR), speech compression (codec), and the encoders of auditory large language models (AudioLLMs). Through minimal pair designs and diagnostic feature analysis across 71 tasks spanning diverse linguistic levels, our layer-wise and time-resolved analysis uncovers that 1) all speech models encode grammatical features more robustly than conceptual ones. 2) Despite never seeing text, S3Ms match or surpass ASR encoders on every linguistic level, demonstrating that rich grammatical and even conceptual knowledge can arise purely from audio. 3) S3M representations peak mid-network and then crash in the final layers, whereas ASR and AudioLLM encoders maintain or improve, reflecting how pre-training objectives reshape late-layer content. 4) Temporal probing further shows that S3Ms encode grammatical cues 500 ms before a word begins, whereas AudioLLMs distribute evidence more evenly—indicating that objectives shape not only where but also when linguistic information is most salient. Together, these findings establish the first large-scale map of contextual syntax and semantics in speech models and highlight both the promise and the limits of current SLM training paradigms.
pdf
bib
abs
Current Semantic-change Quantification Methods Struggle with Semantic Change Discovery in the Wild
Khonzoda Umarova
|
Lillian Lee
|
Laerdon Kim
Methods for lexical semantic-change detection quantify changes in the meaning of words over time. Prior methods have excelled on established benchmarks consisting of pre-selected target words, chosen ahead of time due to the prohibitive cost of manually annotating all words. However, performance measured on small curated wordsets cannot reveal how well these methods perform at discovering semantic changes among the full corpus vocabulary, which is the actual end goal for many applications. In this paper, we implement a top-k setup to evaluate semantic-change discovery despite lacking complete annotations. (At the same time, we also extend the annotations in the commonly used LiverpoolFC and SemEval-EN benchmarks by 85% and 90%, respectively.) We deploy our evaluation setup on a battery of semantic-change detection methods under multiple variations. We find that when presented with a natural distribution of instances, all the methods struggle to rank known large changes higher than other words in the vocabulary. Furthermore, we manually verify that the majority of words with high detected-change scores in LiverpoolFC do not actually experience meaning changes. In fact, for most of the methods, less than half of the highest-ranked changes were determined to have changed in meaning. Given the large performance discrepancies between existing benchmark results and discovery “in the wild”, we recommend that researchers direct more attention to semantic-change discovery and include it in their suite of evaluations. Our annotations and code for running evaluations are available at https://github.com/khonzoda/semantic-change-discovery-emnlp2025.
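A minimal sketch of the top-k discovery setup in the spirit described above (not the authors' implementation): rank every shared vocabulary word by cosine distance between its aligned embeddings from two time periods and inspect the top k. The toy vocabulary and the perturbation of "stream" are illustrative assumptions.

```python
import numpy as np

def top_k_changes(emb_t1: dict, emb_t2: dict, k: int = 10) -> list[tuple[str, float]]:
    scores = {}
    for w in emb_t1.keys() & emb_t2.keys():
        a, b = emb_t1[w], emb_t2[w]
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        scores[w] = 1.0 - cos                       # cosine distance as change score
    return sorted(scores.items(), key=lambda x: -x[1])[:k]

rng = np.random.default_rng(0)
v1 = {w: rng.normal(size=50) for w in ["goal", "pitch", "stream", "cloud"]}
v2 = {w: v1[w] + rng.normal(scale=(2.0 if w == "stream" else 0.1), size=50) for w in v1}
print(top_k_changes(v1, v2, k=2))  # "stream" should surface near the top
```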
pdf
bib
abs
Evaluating Large Language Models for Detecting Antisemitism
Jay Patel
|
Hrudayangam Mehta
|
Jeremy Blackburn
Detecting hateful content is a challenging and important problem. Automated tools, like machine‐learning models, can help, but they require continuous training to adapt to the ever-changing landscape of social media. In this work, we evaluate eight open-source LLMs’ capability to detect antisemitic content, specifically leveraging in-context definition as a policy guideline. We explore various prompting techniques and design a new CoT-like prompt, Guided-CoT. Guided‐CoT handles the in-context policy well, increasing performance across all evaluated models, regardless of decoding configuration, model sizes, or reasoning capability. Notably, Llama 3.1 70B outperforms fine-tuned GPT-3.5. Additionally, we examine LLM errors and introduce metrics to quantify semantic divergence in model-generated rationales, revealing notable differences and paradoxical behaviors among LLMs. Our experiments highlight the differences observed across LLMs’ utility, explainability, and reliability.
pdf
bib
abs
D-RAG: Differentiable Retrieval-Augmented Generation for Knowledge Graph Question Answering
Guangze Gao
|
Zixuan Li
|
Chunfeng Yuan
|
Jiawei Li
|
Wu Jianzhuo
|
Yuehao Zhang
|
Xiaolong Jin
|
Bing Li
|
Weiming Hu
Knowledge Graph Question Answering (KGQA) aims to answer natural language questions based on knowledge graphs. Recent approaches apply the Retrieval-Augmented Generation (RAG) paradigm to incorporate Large Language Models (LLMs) into this task, where a retriever selects a question-related subgraph and an LLM-based generator is then adopted to predict answers based on the retrieved subgraph. However, the subgraph selection process is non-differentiable, preventing end-to-end training of the retriever and the generator in these approaches, which leads to sub-optimal performance. To overcome this limitation, this paper proposes a Differentiable RAG (D-RAG) approach that jointly optimizes the retriever and the generator for KGQA. By reformulating the optimization objective as an expectation over a subgraph distribution with respect to answer generation likelihood, D-RAG makes the joint optimization feasible. Specifically, it implements this joint optimization through a differentiable subgraph sampling and prompting module that integrates Gumbel-Softmax reparameterization for sampling and a neural prompt construction process that fuses semantic and structural information. Experimental results on WebQSP and CWQ demonstrate that D-RAG outperforms state-of-the-art approaches.
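A hedged sketch of the differentiable-sampling ingredient named above: a Gumbel-Softmax relaxation over candidate triples, so that subgraph selection stays differentiable with respect to retriever scores. Shapes, temperature, and the toy scores are assumptions for illustration.

```python
import numpy as np

def gumbel_softmax(logits: np.ndarray, tau: float = 0.5, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-12) + 1e-12)
    y = (logits + gumbel) / tau
    y = y - y.max()                 # numerical stability
    e = np.exp(y)
    return e / e.sum()              # soft one-hot over candidate triples

retriever_scores = np.array([2.1, 0.3, -1.0, 1.7])   # one score per candidate triple
print(gumbel_softmax(retriever_scores).round(3))
```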
pdf
bib
abs
Towards Robust Mathematical Reasoning
Thang Luong
|
Dawsen Hwang
|
Hoang H Nguyen
|
Golnaz Ghiasi
|
Yuri Chervonyi
|
Insuk Seo
|
Junsu Kim
|
Garrett Bingham
|
Jonathan Lee
|
Swaroop Mishra
|
Alex Zhai
|
Huiyi Hu
|
Henryk Michalewski
|
Jimin Kim
|
Jeonghyun Ahn
|
Junhwi Bae
|
Xingyou Song
|
Trieu Hoang Trinh
|
Quoc V Le
|
Junehyuk Jung
Finding the right north-star metrics is highly critical for advancing the mathematical reasoning capabilities of foundation models, especially given that existing evaluations are either too easy or focus only on getting correct short answers. To address these issues, we present IMO-Bench, a suite of advanced reasoning benchmarks that specifically targets the level of the International Mathematical Olympiad (IMO), the most prestigious venue for young mathematicians. IMO-AnswerBench first tests models on 400 diverse Olympiad problems with verifiable short answers. IMO-ProofBench is the next-level evaluation for proof-writing capabilities, which includes both basic and advanced IMO problems as well as detailed grading guidelines to facilitate automatic grading. These benchmarks played a crucial role in our historic achievement of gold-level performance at IMO 2025 with Gemini Deep Think (Luong and Lockhart, 2025). Our model achieved 80.0% on IMO-AnswerBench and 65.7% on the advanced IMO-ProofBench, surpassing the best non-Gemini models by large margins of 6.9% and 42.4%, respectively. We also show that autograders built with Gemini reasoning correlate well with human evaluations, and we construct IMO-GradingBench, with 1,000 human gradings of proofs, to enable further progress in the automatic evaluation of long-form answers. We hope that IMO-Bench will help the community advance robust mathematical reasoning, and we release it at https://github.com/google-deepmind/superhuman/imobench.
pdf
bib
abs
Table-LLM-Specialist: Language Model Specialists for Tables using Iterative Fine-tuning
Junjie Xing
|
Yeye He
|
Mengyu Zhou
|
Haoyu Dong
|
Shi Han
|
Dongmei Zhang
|
Surajit Chaudhuri
Language models such as GPT and Llama have shown remarkable ability on diverse natural language tasks, yet their performance on complex table tasks (e.g., NL-to-Code, data cleaning, etc.) continues to be suboptimal. To improve their performance, task-specific fine-tuning is often needed, which, however, requires expensive human labeling and is prone to over-fitting. In this work, we propose Table-Specialist, a new self-trained fine-tuning paradigm specifically designed for table tasks. Our insight is that for each table task, there often exist two dual versions of the same task, one generative and one classification in nature. Leveraging their duality, we propose a Generator-Validator paradigm to iteratively generate-then-validate training data from language models, to fine-tune stronger Table-Specialist models that can specialize in a given task, without using manually labeled data. Extensive evaluations of Table-Specialist on Llama, GPT-3.5 and GPT-4 suggest that our Table-Specialist has (1) **strong performance** on diverse table tasks over vanilla language models – for example, Table-Specialist fine-tuned on GPT-3.5 not only outperforms vanilla GPT-3.5, but can often match or surpass GPT-4 level quality, (2) **lower cost** to deploy, because when Table-Specialist fine-tuned on GPT-3.5 achieves GPT-4 level quality, it becomes possible to deploy smaller models with lower latency/cost at comparable quality, and (3) **better generalizability** when evaluated across multiple benchmarks, since Table-Specialist is fine-tuned on a broad range of training data systematically generated from diverse real tables. Our code is available at [microsoft/Table-Specialist](https://github.com/microsoft/Table-Specialist). Specialist models fine-tuned using Table-Specialist have been integrated into Microsoft Excel for use cases such as automated table data cleaning.
pdf
bib
abs
Introducing Spotlight: A Novel Approach for Generating Captivating Key Information from Documents
Ankan Mullick
|
Sombit Bose
|
Rounak Saha
|
Ayan Kumar Bhowmick
|
Aditya Vempaty
|
Prasenjit Dey
|
Ravi Kokku
|
Pawan Goyal
|
Niloy Ganguly
Analyzing and processing vast amounts of textual data presents significant challenges in efficiently extracting key information. In this paper, we introduce ***Spotlight***, a novel paradigm for information extraction that produces concise, engaging narratives by highlighting the most compelling aspects of a document. Unlike highlights (fragmented key points) and traditional summaries, which prioritize comprehensive coverage, spotlights selectively emphasize intriguing content to foster deeper reader engagement with the source material. We formally differentiate spotlights from related constructs and support our analysis with a detailed benchmarking study using new datasets curated for this work. To generate high-quality spotlights, we propose a two-stage approach: fine-tuning a large language model on our benchmark data, followed by alignment via Direct Preference Optimization (DPO). Our comprehensive evaluation demonstrates that the resulting model not only identifies key elements with precision but also enhances readability and boosts the engagement value of the original document. Datasets and code are available at https://github.com/ankan2/Spotlight-EMNLP2025.
pdf
bib
abs
Argument Summarization and its Evaluation in the Era of Large Language Models
Moritz Altemeyer
|
Steffen Eger
|
Johannes Daxenberger
|
Yanran Chen
|
Tim Altendorf
|
Philipp Cimiano
|
Benjamin Schiller
Large Language Models (LLMs) have revolutionized various Natural Language Generation (NLG) tasks, including Argument Summarization (ArgSum), a key subfield of Argument Mining. This paper investigates the integration of state-of-the-art LLMs into ArgSum systems and their evaluation. In particular, we propose a novel prompt-based evaluation scheme, and validate it through a novel human benchmark dataset. Our work makes three main contributions: (i) the integration of LLMs into existing ArgSum systems, (ii) the development of two new LLM-based ArgSum systems, benchmarked against prior methods, and (iii) the introduction of an advanced LLM-based evaluation scheme. We demonstrate that the use of LLMs substantially improves both the generation and evaluation of argument summaries, achieving state-of-the-art results and advancing the field of ArgSum. We also show that among the four LLMs integrated in (i) and (ii), Qwen-3-32B, despite having the fewest parameters, performs best, even surpassing GPT-4o.
pdf
bib
abs
Computational Analysis of Conversation Dynamics through Participant Responsivity
Margaret Hughes
|
Brandon Roy
|
Elinor Poole-Dayan
|
Deb Roy
|
Jad Kabbara
A growing literature explores toxicity and polarization in discourse, with comparatively less work characterizing what makes dialogue prosocial and constructive. We explore conversational discourse and investigate a method for characterizing its quality built upon the notion of “responsivity”, i.e., whether one person’s conversational turn responds to a preceding turn. We develop and evaluate methods for quantifying responsivity: first through semantic similarity of speaker turns, and second by leveraging state-of-the-art large language models (LLMs) to identify the relation between two speaker turns. We evaluate both methods against a ground-truth set of human-annotated conversations. Furthermore, using the better-performing LLM-based approach, we characterize the nature of each response, namely whether it engages with the preceding turn in a substantive way. We view these responsivity links as a fundamental aspect of dialogue but note that conversations can exhibit significantly different responsivity structures. Accordingly, we then develop conversation-level derived metrics that address various aspects of conversational discourse. We use these derived metrics to explore other conversations and show that they support meaningful characterizations and differentiations across a diverse collection of conversations.
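A minimal sketch of the semantic-similarity variant of responsivity described above: a turn is linked to the preceding turn when their embeddings are sufficiently similar. The embedding model name and the threshold are assumptions for the example, not values from the paper.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def responsivity_links(turns, threshold=0.5):
    """Return indices i such that turn i is judged to respond to turn i-1."""
    emb = model.encode(turns, normalize_embeddings=True)
    sims = (emb[1:] * emb[:-1]).sum(axis=1)  # cosine similarity of adjacent turns
    return [i + 1 for i, s in enumerate(sims) if s >= threshold]

turns = [
    "I think the new park proposal is great for the neighborhood.",
    "Agreed, especially the extra green space near the school.",
    "Does anyone know when the next council meeting is?",
]
print(responsivity_links(turns))
```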
pdf
bib
abs
AMQ: Enabling AutoML for Mixed-precision Weight-Only Quantization of Large Language Models
Sangjun Lee
|
Seung-taek Woo
|
Jun-gyu Jin
|
Changhun Lee
|
Eunhyeok Park
To enable broader deployment of Large Language Models (LLMs), it is essential to identify the best-performing model under strict memory constraints. We present AMQ, Automated Mixed-Precision Weight-Only Quantization, a framework that assigns layer-wise quantization bit-widths to optimally balance model quality and memory usage. However, the combinatorial search space, with over 10^100 possible configurations, makes conventional black-box optimization infeasible. AMQ overcomes this challenge through four key innovations: (1) **search space pruning** using prior knowledge to exclude unpromising configurations, (2) **quantization proxy** to bypass costly format conversions during search, (3) **quality predictor** to minimize evaluation overhead, and (4) **iterative search-and-update** strategy for fast and stable convergence. By integrating these components, AMQ efficiently explores the quality–efficiency landscape, reaching the Pareto frontier and yielding LLMs that are both compact and high-performing.
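The abstract frames mixed-precision quantization as a combinatorial search over per-layer bit-widths under a memory budget. The toy sketch below illustrates that search problem with a naive random search and a made-up quality proxy, i.e., the kind of black-box baseline that AMQ's pruning, proxies, and iterative updates are designed to improve upon; all numbers and the proxy are illustrative assumptions.

```python
import random

LAYERS = 8
CHOICES = [2, 3, 4, 8]        # candidate bit-widths per layer
BUDGET = 4 * LAYERS           # memory budget: average of 4 bits per layer

def memory(bits):
    return sum(bits)

def quality_proxy(bits):
    # Made-up quality predictor: lower precision hurts, earlier layers more so.
    return -sum((8 - b) * (LAYERS - i) for i, b in enumerate(bits))

def random_search(iters=2000, seed=0):
    rng = random.Random(seed)
    best, best_q = None, float("-inf")
    for _ in range(iters):
        cand = [rng.choice(CHOICES) for _ in range(LAYERS)]
        if memory(cand) <= BUDGET and quality_proxy(cand) > best_q:
            best, best_q = cand, quality_proxy(cand)
    return best, best_q

print(random_search())
```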
pdf
bib
abs
Beyond Averages: Learning with Annotator Disagreement in STS
Alejandro Benito-Santos
|
Adrian Ghajari
This work investigates capturing and modeling disagreement in Semantic Textual Similarity (STS), where sentence pairs are assigned ordinal similarity labels (0–5). Conventional STS systems average multiple annotator scores and focus on a single numeric estimate, overlooking label dispersion. By leveraging the disaggregated SemEval-2015 dataset (Soft-STS-15), this paper proposes and compares two disagreement-aware strategies that treat STS as an ordinal distribution prediction problem: a lightweight truncated Gaussian head for standard regression models, and a cross-encoder trained with a distance-aware objective, refined with temperature scaling. Results show improved performance in distance-based metrics, with the calibrated soft-label model proving best overall and notably more accurate on the most ambiguous pairs. This demonstrates that modeling disagreement benefits both calibration and ranking accuracy, highlighting the value of retaining and modeling full annotation distributions rather than collapsing them to a single mean label.
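A minimal sketch of how a truncated-Gaussian head can turn a predicted (mean, sigma) into a distribution over the ordinal labels 0-5, assuming label k collects the Gaussian mass around k; the bin edges and the example values are illustrative and may differ from the paper's parameterization.

```python
import math

def norm_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def truncated_gaussian_probs(mu, sigma, low=0.0, high=5.0):
    """Distribution over ordinal labels 0-5 from a Gaussian truncated to [0, 5]."""
    z = norm_cdf(high, mu, sigma) - norm_cdf(low, mu, sigma)  # truncation mass
    # Label k collects the mass of [k - 0.5, k + 0.5], clipped to [low, high].
    edges = [low] + [k + 0.5 for k in range(5)] + [high]
    return [(norm_cdf(edges[k + 1], mu, sigma) - norm_cdf(edges[k], mu, sigma)) / z
            for k in range(6)]

print([round(p, 3) for p in truncated_gaussian_probs(mu=3.2, sigma=0.8)])
```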
pdf
bib
abs
Dipper: Diversity in Prompts for Producing Large Language Model Ensembles in Reasoning Tasks
Wenyang Hu
|
Gregory Kang Ruey Lau
|
Liu Diwen
|
Chen Jizhuo
|
See-Kiong Ng
|
Bryan Kian Hsiang Low
Large Language Models (LLMs), particularly smaller variants, still struggle with complex reasoning tasks. While inference-time prompting can guide reasoning, existing methods often rely on sequential queries. Ensemble approaches offer a promising path to performance gains, especially given recent batch inference speed-ups. This work introduces DIPPER, a novel, training-free framework that transforms a single LLM into an effective inference-time ensemble. By feeding the model an optimized and diverse set of prompts in parallel, DIPPER elicits varied reasoning paths, leading to performance gains. We empirically demonstrate significant improvements on mathematical reasoning benchmarks, such as MATH, where a DIPPER ensemble of three Qwen2-MATH-1.5B instances (via parallel prompting of a single model) outperforms a larger Qwen2-MATH-7B model.
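The sketch below illustrates the general pattern of querying a single model with several diverse prompts in parallel and aggregating answers by majority vote. The prompts, the query_model stub, and the voting rule are assumptions for illustration, not DIPPER's actual prompt-optimization or aggregation procedure.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

PROMPTS = [
    "Solve step by step, then give the final answer on the last line.",
    "Explain the solution as if teaching a student, ending with the answer.",
    "Work backwards from what is asked, then state the answer.",
]

def query_model(prompt, question):
    # Placeholder for a real (batched) LLM call; returns a final-answer string.
    return "42"

def dipper_answer(question):
    with ThreadPoolExecutor(max_workers=len(PROMPTS)) as pool:
        answers = list(pool.map(lambda p: query_model(p, question), PROMPTS))
    return Counter(answers).most_common(1)[0][0]  # majority vote over prompts

print(dipper_answer("What is 6 * 7?"))
```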
pdf
bib
abs
Constrained Non-negative Matrix Factorization for Guided Topic Modeling of Minority Topics
Seyedeh Fatemeh Ebrahimi
|
Jaakko Peltonen
Topic models often fail to capture low-prevalence, domain-critical themes (so-called minority topics), such as mental health themes in online comments. While some existing methods can incorporate domain knowledge such as expected topical content, methods allowing guidance may require overly detailed expected topics, hindering the discovery of topic divisions and variation. We propose a topic modeling solution via a specially constrained NMF. We incorporate a seed word list characterizing minority content of interest, but we do not require experts to pre-specify its division across minority topics. Through prevalence constraints on minority topics and seed word content across topics, we learn distinct data-driven minority topics as well as majority topics. The constrained NMF is fitted via Karush-Kuhn-Tucker (KKT) conditions with multiplicative updates. We outperform several baselines on synthetic data in terms of topic purity and normalized mutual information, and we additionally evaluate topic quality using Jensen-Shannon divergence (JSD). We conduct a case study on YouTube vlog comments, analyzing viewer discussion of mental health content; our model successfully identifies and reveals this domain-relevant minority content.
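For intuition, the sketch below runs plain multiplicative-update NMF and injects seed-word guidance only through the initialization of designated minority topics. This is a simplification: the paper's KKT-derived prevalence and seed-content constraints are not reproduced here, and all sizes and seed indices are illustrative.

```python
import numpy as np

def seeded_nmf(V, k, minority_topics, seed_term_ids, iters=200, seed=0):
    """Plain multiplicative-update NMF with seed words boosted at initialization."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, k))           # document-topic weights
    H = rng.random((k, n_terms))          # topic-term weights
    H[np.ix_(minority_topics, seed_term_ids)] += 5.0  # bias minority topics toward seeds
    eps = 1e-9
    for _ in range(iters):                # standard updates for ||V - WH||_F^2
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.random.default_rng(1).random((50, 30))   # toy nonnegative doc-term matrix
W, H = seeded_nmf(V, k=5, minority_topics=[4], seed_term_ids=[0, 1, 2])
print(H[4, :5].round(3))                        # seed terms within the minority topic
```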
pdf
bib
abs
Which Word Orders Facilitate Length Generalization in LMs? An Investigation with GCG-Based Artificial Languages
Nadine El-Naggar
|
Tatsuki Kuribayashi
|
Ted Briscoe
Whether language models (LMs) have inductive biases that favor typologically frequent grammatical properties over rare, implausible ones has been investigated, typically using artificial languages (ALs) (White and Cotterell, 2021; Kuribayashi et al., 2024). In this paper, we extend these works from two perspectives. First, we extend their context-free AL formalization by adopting Generalized Categorial Grammar (GCG) (Wood, 2014), which allows ALs to cover attested but previously overlooked constructions, such as unbounded dependency and mildly context-sensitive structures. Second, our evaluation focuses more on the generalization ability of LMs to process unseen longer test sentences. Thus, our ALs better capture features of natural languages and our experimental paradigm leads to clearer conclusions — typologically plausible word orders tend to be easier for LMs to productively generalize.
pdf
bib
abs
Training compute-optimal transformer encoder models
Megi Dervishi
|
Alexandre Allauzen
|
Gabriel Synnaeve
|
Yann LeCun
Transformer encoders are critical for a wide range of Natural Language Processing (NLP) tasks, yet their compute efficiency remains poorly understood. We present the first comprehensive empirical investigation of compute-optimal pretraining for encoder transformers using the Masked Language Modeling (MLM) objective. Across hundreds of carefully controlled runs, we vary model size, data size, batch size, learning rate, and masking ratio over increasing compute budgets. The compute-optimal data-to-model ratio of Transformer encoder models is 10 to 100 times larger than that of auto-regressive models. Using these recipes, we train OptiBERT, a family of compute-optimal BERT-style models that matches or surpasses leading baselines, including ModernBERT and NeoBERT, on GLUE and MTEB while training with dramatically fewer FLOPs.
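To see what a 10-100x larger data-to-model ratio implies in practice, the snippet below sizes a model under the common C ≈ 6ND approximation for training FLOPs (N parameters, D training tokens); the compute budget and ratios are illustrative assumptions, not values from the paper.

```python
import math

def optimal_n_d(compute_flops, data_to_model_ratio):
    """Given C ≈ 6*N*D and D = r*N, solve for N and D."""
    n = math.sqrt(compute_flops / (6.0 * data_to_model_ratio))
    return n, data_to_model_ratio * n

C = 1e21  # toy compute budget in FLOPs
for r in (20, 200, 2000):  # e.g., an autoregressive-style ratio vs. 10-100x larger
    n, d = optimal_n_d(C, r)
    print(f"D/N={r}: params ~ {n:.2e}, tokens ~ {d:.2e}")
```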
pdf
bib
abs
Mind the Blind Spots: A Focus-Level Evaluation Framework for LLM Reviews
Hyungyu Shin
|
Jingyu Tang
|
Yoonjoo Lee
|
Nayoung Kim
|
Hyunseung Lim
|
Ji Yong Cho
|
Hwajung Hong
|
Moontae Lee
|
Juho Kim
Peer review underpins scientific progress, but it is increasingly strained by reviewer shortages and growing workloads. Large Language Models (LLMs) can now draft reviews automatically, but determining whether LLM-generated reviews are trustworthy requires systematic evaluation. Researchers have evaluated LLM reviews at either the surface level (e.g., BLEU and ROUGE) or the content level (e.g., specificity and factual accuracy). Yet it remains uncertain whether LLM-generated reviews attend to the same critical facets that human experts weigh, namely the strengths and weaknesses that ultimately drive an accept-or-reject decision. We introduce a focus-level evaluation framework that operationalizes focus as a normalized distribution of attention across predefined facets in paper reviews. Based on the framework, we developed an automatic focus-level evaluation pipeline built on two sets of facets: target (e.g., problem, method, and experiment) and aspect (e.g., validity, clarity, and novelty), leveraging 676 paper reviews from OpenReview comprising 3,657 strengths and weaknesses identified by human experts. The comparison of focus distributions between LLMs and human experts showed that off-the-shelf LLMs consistently focus more on examining technical validity while significantly overlooking novelty assessment when criticizing papers. Dataset: https://figshare.com/s/d5adf26c802527dd0f62
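The focus-level idea can be illustrated as follows: map each review point to a facet, normalize the counts into a distribution, and compare distributions across reviewer populations, for example with Jensen-Shannon distance. The facet names, toy labels, and distance choice below are assumptions for the example, not the paper's pipeline.

```python
from collections import Counter
from scipy.spatial.distance import jensenshannon

FACETS = ["problem", "method", "experiment"]  # illustrative target facets

def focus_distribution(facet_labels):
    """Normalized distribution of attention over the predefined facets."""
    counts = Counter(facet_labels)
    total = sum(counts.values())
    return [counts.get(f, 0) / total for f in FACETS]

human = focus_distribution(["method", "experiment", "problem", "method"])
llm = focus_distribution(["experiment", "experiment", "method", "experiment"])
print(human, llm, round(jensenshannon(human, llm), 3))
```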
pdf
bib
abs
Seeing Through Words, Speaking Through Pixels: Deep Representational Alignment Between Vision and Language Models
Zoe Wanying He
|
Sean Trott
|
Meenakshi Khosla
Recent studies show that deep vision-only and language-only models—trained on disjoint modalities—nonetheless project their inputs into a partially aligned representational space. Yet we still lack a clear picture of _where_ in each network this convergence emerges, _what_ visual or linguistic cues support it, _whether_ it captures human preferences in many-to-many image-text scenarios, and _how_ aggregating exemplars of the same concept affects alignment. Here, we systematically investigate these questions. We find that alignment peaks in mid-to-late layers of both model types, reflecting a shift from modality-specific to conceptually shared representations. This alignment is robust to appearance-only changes but collapses when semantics are altered (e.g., object removal or word-order scrambling), highlighting that the shared code is truly semantic. Moving beyond the one-to-one image-caption paradigm, a forced-choice “Pick-a-Pic” task shows that human preferences for image-caption matches are mirrored in the embedding spaces across all vision-language model pairs. This pattern holds bidirectionally when multiple captions correspond to a single image, demonstrating that models capture fine-grained semantic distinctions akin to human judgments. Surprisingly, averaging embeddings across exemplars amplifies alignment rather than blurring detail. Together, our results demonstrate that unimodal networks converge on a shared semantic code that aligns with human judgments and strengthens with exemplar aggregation.
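One common way to quantify this kind of cross-model representational alignment is linear CKA over paired image and caption embeddings; the abstract does not specify the paper's exact metric, so the snippet below is only an illustrative sketch on random data.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two sets of paired representations (rows are examples)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
vision = rng.standard_normal((100, 512))   # e.g., mid-layer vision features
# Toy "language" features correlated with the vision features plus noise.
text = vision @ rng.standard_normal((512, 384)) + 0.1 * rng.standard_normal((100, 384))
print(round(linear_cka(vision, text), 3))
```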
pdf
bib
abs
Unconditional Truthfulness: Learning Unconditional Uncertainty of Large Language Models
Artem Vazhentsev
|
Ekaterina Fadeeva
|
Rui Xing
|
Gleb Kuzmin
|
Ivan Lazichny
|
Alexander Panchenko
|
Preslav Nakov
|
Timothy Baldwin
|
Maxim Panov
|
Artem Shelmanov
Uncertainty quantification (UQ) has emerged as a promising approach for detecting hallucinations and low-quality output of Large Language Models (LLMs). However, obtaining proper uncertainty scores is complicated by the conditional dependency between the generation steps of an autoregressive LLM, which is hard to model explicitly. Here, we propose to learn this dependency from attention-based features. In particular, we train a regression model that leverages LLM attention maps, probabilities at the current generation step, and recurrently computed uncertainty scores from previously generated tokens. To incorporate the recurrent features, we also suggest a two-stage training procedure. Our experimental evaluation on ten datasets and three LLMs shows that the proposed method is highly effective for selective generation, achieving substantial improvements over rival unsupervised and supervised approaches.
pdf
bib
abs
Chinese Toxic Language Mitigation via Sentiment Polarity Consistent Rewrites
Xintong Wang
|
Yixiao Liu
|
Jingheng Pan
|
Liang Ding
|
Longyue Wang
|
Chris Biemann
Detoxifying offensive language while preserving the speaker’s original intent is a challenging yet critical goal for improving the quality of online interactions. Although large language models (LLMs) show promise in rewriting toxic content, they often default to overly polite rewrites, distorting the emotional tone and communicative intent. This problem is especially acute in Chinese, where toxicity often arises implicitly through emojis, homophones, or discourse context. We present ToxiRewriteCN, the first Chinese detoxification dataset explicitly designed to preserve sentiment polarity. The dataset comprises 1,556 carefully annotated triplets, each containing a toxic sentence, a sentiment-aligned non-toxic rewrite, and labeled toxic spans. It covers five real-world scenarios: standard expressions, emoji-induced and homophonic toxicity, as well as single-turn and multi-turn dialogues. We evaluate 17 LLMs, including commercial and open-source models with varying architectures, across four dimensions: detoxification accuracy, fluency, content preservation, and sentiment polarity. Results show that while commercial and MoE models perform best overall, all models struggle to balance safety with emotional fidelity in more subtle or context-heavy settings such as emoji-, homophone-, and dialogue-based inputs. We release ToxiRewriteCN to support future research on controllable, sentiment-aware detoxification for Chinese.
pdf
bib
abs
A Head to Predict and a Head to Question: Pre-trained Uncertainty Quantification Heads for Hallucination Detection in LLM Outputs
Artem Shelmanov
|
Ekaterina Fadeeva
|
Akim Tsvigun
|
Ivan Tsvigun
|
Zhuohan Xie
|
Igor Kiselev
|
Nico Daheim
|
Caiqi Zhang
|
Artem Vazhentsev
|
Mrinmaya Sachan
|
Preslav Nakov
|
Timothy Baldwin
LLMs have a tendency to hallucinate, i.e., to sporadically generate false or fabricated information, and users generally lack the tools to detect when this happens. Uncertainty quantification (UQ) provides a framework for assessing the reliability of model outputs, aiding in the identification of potential hallucinations. In this work, we introduce pre-trained UQ heads: supervised auxiliary modules for LLMs that substantially enhance their ability to capture uncertainty compared to unsupervised UQ methods. Their strong performance stems from the transformer architecture in their design and from informative features derived from LLM attention maps and logits. Our experiments show that these heads are highly robust and achieve state-of-the-art performance in claim-level hallucination detection across both in-domain and out-of-domain prompts. Moreover, these modules demonstrate strong generalization to languages they were not explicitly trained on. We pre-train a collection of UQ heads for popular LLM series, including Mistral, Llama, and Gemma. We publicly release both the code and the pre-trained heads.
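A hypothetical sketch of what such an auxiliary uncertainty head might look like: a small transformer encoder over per-token features (for example, derived from the base LLM's attention maps and logits) that outputs a per-claim uncertainty score. The dimensions, feature extraction, and pooling are illustrative assumptions, not the authors' design.

```python
import torch
import torch.nn as nn

class UQHead(nn.Module):
    def __init__(self, feat_dim=32, hidden=64, layers=2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)
        enc_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.score = nn.Linear(hidden, 1)

    def forward(self, feats):            # feats: (batch, tokens, feat_dim)
        h = self.encoder(self.proj(feats))
        return torch.sigmoid(self.score(h.mean(dim=1)))  # (batch, 1) score in [0, 1]

feats = torch.randn(2, 10, 32)           # toy attention/logit-derived features
print(UQHead()(feats).shape)
```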