19th Conference of the European Chapter of the Association for Computational Linguistics
Rabat, Morocco
March 24–29, 2026
Volumes
- Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) 395 papers
- Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers) 52 papers
- Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 3: System Demonstrations) 45 papers
- Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop) 72 papers
- Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track) 72 papers
- Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 6: Tutorials) 5 papers
- Findings of the Association for Computational Linguistics: EACL 2026 355 papers
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Vera Demberg | Kentaro Inui | Lluís Marquez
LM-Lexicon: Improving Definition Modeling via Harmonizing Semantic Experts
Yang Liu | Jiaye Yang | Weikang Li | Jiahui Liang | Yang Li | Lingyong Yan
We introduce LM-Lexicon, an innovative definition modeling approach that incorporates data clustering, semantic expert learning, and model merging using a sparse mixture-of-experts architecture. By decomposing the definition modeling task into specialized semantic domains, where small language models are trained as domain experts, LM-Lexicon achieves substantial improvements (+7% BLEU score compared with the prior state-of-the-art model) over existing methods on five widely used benchmarks. Empirically, we demonstrate that 1) the clustering strategy enables fine-grained expert specialization with nearly 10% improvement in definition quality; 2) the semantic-aware domain-level routing mechanism achieves higher expert efficacy (+1%) than conventional token-level routing; and 3) further performance gains can be obtained through test-time compute and semantic expert scaling. Our work advances definition modeling while providing insights into the development of efficient language models for semantic-intensive applications.
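The contrast drawn above between domain-level and token-level routing can be pictured with a minimal sketch. It is an illustration under our own assumptions (centroid-based scoring, a single expert per query, invented names), not the paper's implementation.

```python
# Illustrative domain-level routing for a sparse mixture of semantic experts:
# the whole query is assigned to one domain expert, rather than routing each
# token independently as in conventional token-level MoE routing.
import numpy as np

def domain_level_route(query_embedding, domain_centroids, experts):
    """Pick the expert whose semantic-domain centroid best matches the query."""
    sims = [float(np.dot(query_embedding, c)) for c in domain_centroids]
    return experts[int(np.argmax(sims))]
```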
Teams of LLM Agents can Exploit Zero-Day Vulnerabilities
Yuxuan Zhu | Antony Kellermann | Akul Gupta | Philip Li | Richard Fang | Rohan Bindu | Daniel Kang
LLM agents have become increasingly sophisticated, especially in the realm of cybersecurity. Researchers have shown that LLM agents can exploit real-world vulnerabilities when given a description of the vulnerability and toy capture-the-flag problems. However, these agents still perform poorly on real-world vulnerabilities that are unknown to the agent ahead of time (zero-day vulnerabilities). In this work, we show that teams of LLM agents can exploit real-world, zero-day vulnerabilities. Prior agents struggle with exploring many different vulnerabilities and long-range planning when used alone. To resolve this, we introduce HPTSA, a system of agents with a planning agent that can launch subagents. The planning agent explores the system and determines which subagents to call, resolving long-term planning issues when trying different vulnerabilities. We construct a benchmark of 14 real-world vulnerabilities and show that our team of agents improves over prior agent frameworks by up to 4.3×.
Can Reasoning Help Large Language Models Capture Human Annotator Disagreement?
Jingwei Ni | Yu Fan | Vilém Zouhar | Donya Rooein | Alexander Miserlis Hoyle | Mrinmaya Sachan | Markus Leippold | Dirk Hovy | Elliott Ash
Variation in human annotation (i.e., disagreements) is common in NLP, often reflecting important information like task subjectivity and sample ambiguity. Modeling this variation is important for applications that are sensitive to such information. Although RLVR-style reasoning (Reinforcement Learning with Verifiable Rewards) has improved Large Language Model (LLM) performance on many tasks, it remains unclear whether such reasoning enables LLMs to capture informative variation in human annotation. In this work, we evaluate the influence of different reasoning settings on LLM disagreement modeling. We systematically evaluate each reasoning setting across model sizes, distribution expression methods, and steering methods, resulting in 60 experimental setups across 3 tasks. Surprisingly, our results show that RLVR-style reasoning degrades performance in disagreement modeling, while naive Chain-of-Thought (CoT) reasoning improves the performance of RLHF LLMs (RL from human feedback). These findings underscore the potential risk of replacing human annotators with reasoning LLMs, especially when disagreements are important.
Early-Exit and Instant Confidence Translation Quality Estimation
Vilém Zouhar | Maike Züfle | Beni Egressy | Julius Cheng | Mrinmaya Sachan | Jan Niehues
Quality estimation is omnipresent in machine translation, for both evaluation and generation. Unfortunately, quality estimation models are often opaque and computationally expensive, making them impractical to be part of large-scale pipelines. In this work, we tackle two connected challenges: (1) reducing the cost of quality estimation at scale, and (2) developing an inexpensive uncertainty estimation method for quality estimation. To address the latter, we introduce Instant Confidence COMET, an uncertainty-aware quality estimation model that matches the performance of previous approaches at a fraction of their costs. We extend this to Early-Exit COMET, a quality estimation model that can compute quality scores and associated confidences already at early model layers, allowing us to early-exit computations and reduce evaluation costs. We also apply our model to machine translation reranking. We combine Early-Exit COMET with an upper confidence bound bandit algorithm to find the best candidate from a large pool without having to run the full evaluation model on all candidates. In both cases (evaluation and reranking) our methods reduce the required compute by 50% with very little degradation in performance. Finally, we show how Instant Confidence COMET can be used to decide which translations a human evaluator should score rather than relying on the COMET score.
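As a rough illustration of how an early-exit quality estimator can be paired with an upper-confidence-bound rule for reranking, the sketch below spends a fixed layer budget on the most promising candidates. The `partial_score` callable and the simple stopping rule are hypothetical stand-ins, not the authors' algorithm.

```python
# Hedged sketch: upper-confidence-bound reranking with an early-exit QE model.
# `partial_score(candidate, layer)` is assumed to return (score, confidence width)
# after evaluating one more layer of the quality-estimation model.
def ucb_rerank(candidates, partial_score, num_layers, budget):
    """Return the index of the best candidate under a total layer budget."""
    layers_done = {i: 0 for i in range(len(candidates))}
    scores = {i: 0.0 for i in range(len(candidates))}
    widths = {i: float("inf") for i in range(len(candidates))}

    for _ in range(budget):
        # Evaluate one more layer for the candidate with the highest UCB.
        best = max(layers_done, key=lambda i: scores[i] + widths[i])
        if layers_done[best] == num_layers:
            break  # the current leader is already fully evaluated
        layers_done[best] += 1
        scores[best], widths[best] = partial_score(candidates[best], layers_done[best])

    return max(scores, key=scores.get)
```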
GRITHopper: Decomposition-Free Multi-Hop Dense Retrieval
Justus-Jonas Erker | Nils Reimers | Iryna Gurevych
Decomposition-based multi-hop retrieval methods rely on many autoregressive steps to break down complex queries, which breaks end-to-end differentiability and is computationally expensive. Decomposition-free methods tackle this, but current approaches struggle with longer multi-hop problems and generalization to out-of-distribution data. To address these challenges, we introduce GRITHopper-7B, a novel multi-hop dense retrieval model that achieves state-of-the-art performance on both in-distribution and out-of-distribution benchmarks. GRITHopper-7B combines generative and representational instruction tuning by integrating causal language modeling with dense retrieval training. Through controlled studies, we find that incorporating additional context after the retrieval process, referred to as post-retrieval language modeling, enhances dense retrieval performance. By including elements such as final answers during training, the model learns to better contextualize and retrieve relevant information. GRITHopper-7B offers a robust, scalable, and generalizable solution for multi-hop dense retrieval, and we release it to the community for future research and applications requiring complex reasoning and retrieval capabilities.
SCoPE VLM: Selective Context Processing for Efficient Document Navigation in Vision-Language Models
Gyubeum Lim | Yemo Koo | Vijay Krishna Madisetti
Understanding long-context visual information remains a fundamental challenge for vision-language models, particularly in agentic tasks such as GUI control and web navigation. While web pages and GUI environments are inherently structured documents, current VLMs typically neglect decision-oriented document understanding in their training objectives. Existing approaches primarily extend visual embeddings to process long, high-resolution inputs, but these methods are memory-intensive and impractical for locally deployable solutions. To address these issues, we propose SCoPE VLM, a document navigation expert that leverages a novel Chain of Scroll mechanism to selectively and recursively navigate documents, focusing exclusively on relevant segments. We introduce a dedicated data generation pipeline to construct informative Chain of Scroll trajectories and Episodic Group Relative Policy Optimization, a tailored reinforcement learning method to bridge the gap between training and inference. Our method substantially reduces memory usage and effectively models human-like reading behaviors. To the best of our knowledge, SCoPE VLM is the first framework to explicitly model agentic reading patterns in multi-page document question answering, advancing the capabilities of multimodal agents.
Beyond Sample-Level Feedback: Using Reference-Level Feedback to Guide Data Synthesis
Shuhaib Mehri | Xiusi Chen | Heng Ji | Dilek Hakkani-Tür
High-quality instruction-tuning data is crucial for developing Large Language Models (LLMs) that can effectively navigate real-world tasks and follow human instructions. While synthetic data generation offers a scalable approach for creating such datasets, it imposes a quality ceiling where models trained on the data cannot outperform the LLM generating it. To overcome this limitation, we introduce Reference-Level Feedback, a paradigm that extracts desirable characteristics from carefully curated reference samples to guide the synthesis of higher-quality instruction-response pairs. Using this approach, we synthesize REFED, a dataset of 10K instruction-response pairs. Fine-tuning Llama-3.1-8B-Instruct and Mistral-7B-Instruct on REFED demonstrates state-of-the-art performance among similarly sized models, notably reaching a 43.96% length-controlled win-rate on AlpacaEval 2.0. Extensive experiments demonstrate that Reference-Level Feedback consistently outperforms traditional sample-level feedback methods, generalizes across model architectures, and produces high-quality and diverse data at low cost.
T2-RAGBench: Text-and-Table Aware Retrieval-Augmented Generation
Jan Strich | Enes Kutay Isgorur | Maximilian Trescher | Chris Biemann | Martin Semmann
The Pragmatic Mind of Machines: Tracing the Emergence of Pragmatic Competence in Large Language Models
Kefan Yu | Qingcheng Zeng | Weihao Xuan | Wanxin Li | Jingyi Wu | Rob Voigt
Current large language models (LLMs) have demonstrated emerging capabilities in social intelligence tasks, including implicature resolution and theory-of-mind reasoning, both of which require substantial pragmatic understanding. However, how LLMs acquire this pragmatic competence throughout the training process remains poorly understood. In this work, we introduce ALTPRAG, a dataset grounded in the pragmatic concept of alternatives, to evaluate whether LLMs at different training stages can accurately infer nuanced speaker intentions. Each instance pairs two equally plausible yet pragmatically divergent continuations and requires the model to (i) infer the speaker’s intended meaning and (ii) explain when and why a speaker would choose one utterance over its alternative, thus directly probing pragmatic competence through contrastive reasoning. We systematically evaluate 22 LLMs across three key training stages: after pre-training, supervised fine-tuning (SFT), and preference optimization, to examine the development of pragmatic competence. Our results show that even base models exhibit notable sensitivity to pragmatic cues, which improves consistently with increases in model and data scale. Additionally, SFT and RLHF contribute further gains, particularly in cognitive-pragmatic scenarios. These findings highlight pragmatic competence as an emergent and compositional property of LLM training and offer new insights for aligning models with human communicative norms.
Hierarchical Text Classification with LLM-Refined Taxonomies
Jonas Golde | Nicolaas Paul Jedema | RaviKiran Krishnan | Phong Le
Hierarchical text classification (HTC) depends on taxonomies that organize labels into structured hierarchies. However, many real-world taxonomies introduce ambiguities, such as identical leaf names under similar parent nodes, which prevent language models (LMs) from learning clear decision boundaries. In this paper, we present TaxMorph, a framework that uses large language models (LLMs) to transform entire taxonomies through operations such as renaming, merging, splitting, and reordering. Unlike prior work, our method revises the full hierarchy to better match the semantics encoded by LMs. Experiments across three HTC benchmarks show that LLM-refined taxonomies consistently outperform human-curated ones in various settings up to +2.9pp. in F1. To better understand these improvements, we compare how well LMs can assign leaf nodes to parent nodes and vice versa across human-curated and LLM-refined taxonomies. We find that human-curated taxonomies lead to more easily separable clusters in embedding space. However, the LLM-refined taxonomies align more closely with the model’s actual confusion patterns during classification. In other words, even though they are harder to separate, they better reflect the model’s inductive biases. These findings suggest that LLM-guided refinement creates taxonomies that are more compatible with how models learn, improving HTC performance.
Divide, Reweight, and Conquer: A Logit Arithmetic Approach for In-Context Learning
Chengsong Huang | Langlin Huang | Jiaxin Huang
In-Context Learning (ICL) emerges as a key feature for Large Language Models (LLMs), allowing them to adapt to new tasks by leveraging task-specific examples without updating model parameters. However, ICL faces challenges with increasing numbers of examples due to performance degradation and quadratic computational costs. In this paper, we propose the Logit Arithmetic Reweighting Approach (LARA), a novel framework that enhances ICL by using logit-based ensembling of multiple demonstrations. Our approach divides long input demonstrations into parallelizable shorter inputs to significantly reduce memory requirements, and then effectively aggregates the information by reweighting logits of each group via a non-gradient optimization approach. We further introduce Binary LARA (B-LARA), a variant that constrains weights to binary values to simplify the search space and reduce memory usage by filtering out less informative demonstration groups. Experiments on BBH and MMLU demonstrate that LARA and B-LARA outperform all baseline methods in both accuracy and memory efficiency. We also conduct extensive analysis to show that LARA generalizes well to scenarios with varying numbers of examples, from limited to many-shot demonstrations.
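A minimal sketch of the logit-arithmetic ensembling idea: demonstrations are split into shorter groups, each group is scored independently, and the per-group next-token logits are combined with (possibly binary) weights. Shapes, names, and the normalization are our assumptions, not the released implementation.

```python
# Hedged sketch of LARA-style logit ensembling over demonstration groups.
import numpy as np

def ensemble_logits(group_logits, weights):
    """group_logits: list of [vocab]-sized arrays, one per demonstration group;
    weights: non-negative coefficients found by a non-gradient search."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()             # normalize the reweighting
    stacked = np.stack(group_logits)              # [n_groups, vocab]
    return (weights[:, None] * stacked).sum(0)    # weighted combination of logits

def binary_ensemble(group_logits, keep_mask):
    """B-LARA variant: weights restricted to {0, 1}, i.e. drop uninformative groups."""
    kept = [g for g, keep in zip(group_logits, keep_mask) if keep]
    return np.mean(np.stack(kept), axis=0)
```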
Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models
Sarah Ball | Frauke Kreuter | Nina Panickssery
Conversational large language models are trained to refuse to answer harmful questions. However, emergent jailbreaking techniques can still elicit unsafe outputs, presenting an ongoing challenge for model alignment. This paper aims to deepen our understanding of how different jailbreak types circumvent safeguards by analyzing model activations on different jailbreak inputs. We find that it is possible to extract a jailbreak vector from a single class of jailbreaks that works to mitigate jailbreak effectiveness from other, semantically-dissimilar classes. This suggests that diverse jailbreaks may exploit a common internal mechanism. We investigate a potential common mechanism of harmfulness feature suppression, and find evidence that effective jailbreaks noticeably reduce a model’s perception of prompt harmfulness. These insights pave the way for developing more robust jailbreak countermeasures and lay the groundwork for a deeper, mechanistic understanding of jailbreak dynamics in language models.
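A hedged sketch of the kind of analysis described above: a jailbreak vector is taken as the difference of mean activations between jailbroken and plain harmful prompts at some layer, and subtracting it from the hidden states is used to mitigate other jailbreak classes. The layer choice, scaling factor, and activation-capture step are assumptions, not the paper's exact procedure.

```python
# Hedged sketch: difference-of-means "jailbreak vector" and activation steering.
import torch

def jailbreak_vector(acts_jailbreak, acts_plain):
    """Both inputs: [n_prompts, hidden] activations collected at a chosen layer."""
    return acts_jailbreak.mean(dim=0) - acts_plain.mean(dim=0)

def steer(hidden_states, vector, alpha=1.0):
    """Subtract the vector from hidden states to reduce jailbreak effectiveness."""
    return hidden_states - alpha * vector
```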
Out of Style: RAG’s Fragility to Linguistic Variation
Tianyu Cao | Neel Bhandari | Akhila Yerukola | Akari Asai | Maarten Sap
Despite the impressive performance of Retrieval-augmented Generation (RAG) systems across various NLP benchmarks, their robustness in handling real-world user-LLM interaction queries remains largely underexplored. This presents a critical gap for practical deployment, where user queries exhibit greater linguistic variations and can trigger cascading errors across interdependent RAG components. In this work, we systematically analyze how varying four linguistic dimensions (formality, readability, politeness, and grammatical correctness) impact RAG performance. We evaluate two retrieval models and nine LLMs, ranging from 3 to 72 billion parameters, across four information-seeking Question Answering (QA) datasets. Our results reveal that linguistic reformulations significantly impact both retrieval and generation stages, leading to a relative performance drop of up to 40.41% in Recall@5 scores for less formal queries and 38.86% in answer match scores for queries containing grammatical errors. Notably, RAG systems exhibit greater sensitivity to such variations compared to LLM-only generations, highlighting their vulnerability to error propagation due to linguistic shifts. These findings highlight the need for improved robustness techniques to enhance reliability in diverse user interactions.
Do Political Opinions Transfer Between Western Languages? An Analysis of Unaligned and Aligned Multilingual LLMs
Franziska Weeber | Tanise Ceron | Sebastian Padó
Public opinion surveys show cross-cultural differences in political opinions between socio-cultural contexts. However, there is no clear evidence whether these differences translate to cross-lingual differences in multilingual large language models (MLLMs). We analyze whether opinions transfer between languages or whether there are separate opinions for each language in MLLMs of various sizes across five Western languages. We evaluate MLLMs’ opinions by prompting them to report their (dis)agreement with political statements from voting advice applications. To better understand the interaction between languages in the models, we evaluate them both before and after aligning them with more left or right views using direct preference optimization and English alignment data only. Our findings reveal that unaligned models show only very few significant cross-lingual differences in the political opinions they reflect. The political alignment shifts opinions almost uniformly across all five languages. We conclude that in Western language contexts, political opinions transfer between languages, demonstrating the challenges in achieving explicit socio-linguistic, cultural, and political alignment of MLLMs.
H-MEM: Hierarchical Memory for High-Efficiency Long-Term Reasoning in LLM Agents
Haoran Sun | Shaoning Zeng | Bob Zhang
Long-term memory is one of the key factors influencing the reasoning capabilities of Large Language Model Agents (LLM Agents). Incorporating a memory mechanism that effectively integrates past interactions can significantly enhance the decision-making and contextual coherence of LLM Agents. While recent works have made progress in memory storage and retrieval, such as encoding memory into dense vectors for similarity-based search or organizing knowledge in the form of graphs, these approaches often fall short in structured memory organization and efficient retrieval. To address these limitations, we propose a Hierarchical Memory Architecture that organizes and updates memory in a multi-level fashion based on the degree of semantic abstraction. Each memory vector at a higher level is embedded with a positional index encoding pointing to its semantically related sub-memories in the next layer. During the reasoning phase, an index-based routing mechanism enables efficient, layer-by-layer retrieval without performing exhaustive similarity computations. We evaluate our method on five task settings from the LoCoMo dataset. Experimental results show that our approach consistently outperforms five baseline methods, demonstrating its effectiveness in long-term dialogue scenarios.
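A small sketch, under our own assumptions about data layout, of the index-based layer-by-layer retrieval described above: each higher-level memory stores the indices of its sub-memories, so only the pointed-to candidates are scored at the next level instead of searching the whole store.

```python
# Hedged sketch of hierarchical, index-routed memory retrieval.
import numpy as np

def retrieve(query_vec, levels, top_k=1):
    """levels: list (top to bottom) of dicts mapping index -> (vector, child_indices)."""
    candidates = list(levels[0].keys())            # start from the most abstract level
    for depth, level in enumerate(levels):
        scored = sorted(candidates,
                        key=lambda i: -float(np.dot(query_vec, level[i][0])))[:top_k]
        if depth == len(levels) - 1:
            return scored                          # indices of retrieved leaf memories
        # Follow the stored positional indices into the next, more concrete level.
        candidates = [c for i in scored for c in level[i][1]]
```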
MULSUM: A Multimodal Summarization System with Vis-Aligner and Diversity-Aware Image Selection
Abid Ali | Diego Molla | Usman Naseem
The abundance of multimodal news in digital form has intensified demand for systems that condense articles and images into concise, faithful digests. Yet most approaches simply conduct unimodal text summarization and attach the most-similar images to the text summary, which leads to redundancy both in processing visual content and in selecting images to complement the summary. We propose MULSUM, a two-step framework: (i) a Cross-Vis Aligner that projects image-level embeddings into a shared space and conditions a pre-trained LLM decoder to generate a visually informed text summary, and (ii) a Diversity-Aware Image Selector that, after the summary is produced, maximizes image relevance to the summary while enforcing pairwise image diversity, yielding a compact, complementary image set. Experimental results on the benchmark MSMO (Multimodal Summarization with Multimodal Output) corpus show that MULSUM consistently outperforms strong baselines on automatic metrics such as ROUGE, while qualitative inspection shows that selected images act as explanatory evidence rather than ornamental add-ons. Human evaluation results show that our diverse set of selected images was 13% more helpful than mere similarity-based image selection.
How Quantization Shapes Bias in Large Language Models
Federico Marcuzzi | Xuefei Ning | Roy Schwartz | Iryna Gurevych
This work presents a comprehensive evaluation of how quantization affects model bias, with particular attention to its impact on individual demographic subgroups. We focus on weight and activation quantization strategies and examine their effects across a broad range of bias types, including stereotypes, fairness, toxicity, and sentiment. We employ both probability- and generated text-based metrics across 13 benchmarks and evaluate models that differ in architecture family and reasoning ability. Our findings show that quantization has a nuanced impact on bias: while it can reduce model toxicity and does not significantly impact sentiment, it tends to slightly increase stereotypes and unfairness in generative tasks, especially under aggressive compression. These trends are generally consistent across demographic categories and subgroups, and model types, although their magnitude depends on the specific setting. Overall, our results highlight the importance of carefully balancing efficiency and ethical considerations when applying quantization in practice.
If Probable, Then Acceptable? Understanding Conditional Acceptability Judgments in Large Language Models
Jasmin Orth | Philipp Mondorf | Barbara Plank
The Dog the Cat Chased Stumped the Model: Measuring When Language Models Abandon Structure for Shortcuts
Sangmitra Madhusudan | Kaige Chen | Ali Emami
When language models correctly parse “The cat that the dog chased meowed,” are they analyzing syntax or simply familiar with dogs chasing cats? Despite extensive benchmarking, we lack methods to distinguish structural understanding from semantic pattern matching. We introduce **CenterBench**, a dataset of 9,720 comprehension questions on center-embedded sentences (like “The cat [that the dog chased] meowed”) where relative clauses nest recursively, creating processing demands from simple to deeply nested structures. Each sentence has a syntactically identical but semantically implausible counterpart (e.g., mailmen prescribe medicine, doctors deliver mail) and six comprehension questions testing surface understanding, syntactic dependencies, and causal reasoning. Testing six models reveals that performance gaps between plausible and implausible sentences widen systematically with complexity, with models showing median gaps of up to 26.8 percentage points, quantifying when they abandon structural analysis for semantic associations. Notably, semantic plausibility harms performance on questions about resulting actions, where following causal relationships matters more than semantic coherence. Reasoning models improve accuracy, but their traces show semantic shortcuts, overthinking, and answer refusal. Unlike models, whose plausibility advantage systematically widens with complexity, humans show variable semantic effects. CenterBench provides the first framework to identify when models shift from structural analysis to pattern matching.
Automated Screening of Antibacterial Nanoparticle Literature: Dataset Curation and Model Evaluation
Alperen Ozturk | Şaziye Betül Özateş | Sophia Bahar Root | Angela Violi | Nicholas Kotov | J. Scott VanEpps | Emine Sumeyra Turali Emre
Antimicrobial resistance is a growing global health threat, driving interest in nanoparticle-based alternatives to conventional antibiotics. Inorganic nanoparticles (NPs) with intrinsic antibacterial properties show significant promise; however, efficiently identifying relevant studies from the rapidly expanding literature remains a major challenge. This step is crucial for enabling computational approaches that aim to model and predict NP efficacy based on physicochemical and structural features. In this study, we explore the effectiveness of traditional machine learning and deep learning methods in classifying scientific abstracts in the domain of NP-based antimicrobial research. We introduce the “Antibacterial Inorganic NAnoparticles” dataset (AINA) of 7,910 articles, curated to distinguish intrinsic antibacterial NPs from studies focusing on drug carriers or surface-bound applications. Our comparative evaluation shows that a fine-tuned BioBERT classifier achieved the highest macro F1 (0.82), while a lightweight SVM model with TF-IDF features remained competitive (0.78), highlighting their utility in low-resource settings. AINA enables reproducible, large-scale identification of intrinsically bactericidal inorganic NPs. By reducing noise from non-intrinsic contexts, this work provides a foundation for mechanism-aware screening, database construction, and predictive modeling in antimicrobial NP research.
Intention Knowledge Graph Construction for User Intention Relation Modeling
Jiaxin Bai | Zhaobo Wang | Junfei Cheng | Dan Yu | Zerui Huang | Weiqi Wang | Xin Liu | Chen Luo | Yanming Zhu | Bo Li | Yangqiu Song
Understanding user intentions is challenging for online platforms. Recent work on intention knowledge graphs addresses this but often lacks focus on connecting intentions, which is crucial for modeling user behavior and predicting future actions. This paper introduces a framework to automatically generate an intention knowledge graph, capturing connections between user intentions. Using the Amazon m2 dataset, we construct an intention graph with 351 million edges, demonstrating high plausibility and acceptance. Our model effectively predicts new session intentions and enhances product recommendations, outperforming previous state-of-the-art methods and showcasing the approach’s practical utility.
Analogical Structure, Minimal Contextual Cues and Contrastive Distractors: Input Design for Sample-Efficient Linguistic Rule Induction
Chunyang Jiang | Paola Merlo
Large language models achieve strong performance on many tasks, but their training makes it hard to see which properties of the input support efficient linguistic rule learning. We ask how three cognitively-inspired principles of input design support sample-efficient linguistic rule induction: analogical structure, contrastive learning, and minimal contextual cues. We also ask how their effects compare to those of LLMs on the same controlled tasks. We implement these principles in structured sentence completion tasks that test English verb alternations. Lightweight models trained on hundreds to one thousand such examples learn the alternation rules with high F1 on these tasks. Ablation studies show that analogical organisation is the main driver of sample efficiency, and that contrastive distractors and minimal context bring further gains. We also evaluate zero- and few-shot LLMs on the same tasks. In this controlled setting, the lightweight models reach higher F1 with far less task-specific data. We treat this contrast as a comparison between learning regimes rather than a general verdict on LLMs. Our results show that careful input organisation supports sample-efficient learning of linguistic rules and reveals distinct learning signatures for trained lightweight models and prompted LLMs.
JiraiBench: A Bilingual Benchmark for Evaluating Large Language Models’ Detection of Human risky health behavior Content in Jirai Community
Yunze Xiao | Tingyu He | Lionel Z. Wang | Yiming Ma | Xingyu Song | Xiaohang Xu | Mona T. Diab | Irene Li | Ka Chung Ng
In this paper, we present the first cross-lingual dataset that captures a transnational cultural phenomenon, focusing on the Chinese and Japanese "Jirai" subculture and its association with risky health behaviors. Our dataset of more than 15,000 annotated social media posts forms the core of JiraiBench, a benchmark designed to evaluate LLMs on culturally specific content. This unique resource allowed us to uncover an unexpected cross-cultural transfer in which Japanese prompts better handle Chinese content, indicating that cultural context can be more influential than linguistic similarity. Further evidence suggests potential cross-lingual knowledge transfer in fine-tuned models. This work proves the indispensable role of developing culturally informed, cross-lingual datasets for creating effective content moderation tools that can protect vulnerable communities across linguistic borders.
Chandomitra: Towards Generating Structured Sanskrit Poetry from Natural Language Inputs
Manoj Balaji Jagadeeshan | Samarth Bhatia | Pretam Ray | Harshul Raj Surana | Akhil Rajeev P | Priya Mishra | Annarao Kulkarni | Ganesh Ramakrishnan | Prathosh Ap | Pawan Goyal
Tailored Emotional LLM-Supporter: Enhancing Cultural Sensitivity
Chen Cecilia Liu | Hiba Arnaout | Nils Kovačić | Dana Atzil-Slonim | Iryna Gurevych
Large language models (LLMs) show promise in offering emotional support and generating empathetic responses for individuals in distress, but their ability to deliver culturally sensitive support remains underexplored due to a lack of resources. In this work, we introduce the first dataset designed for this task, spanning four cultures and including 1,729 distress messages, 1,523 cultural signals, and 1,041 support strategies with fine-grained emotional and cultural annotations. Leveraging this dataset, we (i) develop and test four adaptation strategies for guiding three state-of-the-art LLMs toward culturally sensitive responses; (ii) conduct comprehensive evaluations using LLM-as-a-Judge, in-culture human annotators, and clinical psychologists; (iii) show that adapted LLMs outperform anonymous online peer responses, and that simple cultural role-play is insufficient for cultural sensitivity; and (iv) explore the application of LLMs in clinical training, where experts highlight their potential in fostering cultural competence in novice therapists.
Leveraging LLM-GNN Integration for Open-World Question Answering over Knowledge Graphs
Hussein Abdallah | Ibrahim Abdelaziz | Panos Kalnis | Essam Mansour
Open-world Question Answering (OW-QA) over knowledge graphs (KGs) aims to answer questions over incomplete or evolving KGs. Traditional KGQA assumes a closed world where answers must exist in the KG, limiting real-world applicability. In contrast, open-world QA requires inferring missing knowledge based on graph structure and context. Large language models (LLMs) excel at language understanding but lack structured reasoning. Graph neural networks (GNNs) model graph topology but struggle with semantic interpretation. Existing systems integrate LLMs with GNNs or graph retrievers. Some support open-world QA but rely on structural embeddings without semantic grounding. Most assume observed paths or complete graphs, making them unreliable under missing links or multi-hop reasoning. We present GLOW, a hybrid system that combines a pre-trained GNN and an LLM for open-world KGQA. The GNN predicts top-k candidate answers from the graph structure. These, along with relevant KG facts, are serialized into a structured prompt (e.g., triples and candidates) to guide the LLM’s reasoning. This enables joint reasoning over symbolic and semantic signals, without relying on retrieval or fine-tuning. To evaluate generalization, we introduce GLOW-BENCH, a 1,000-question benchmark over incomplete KGs across diverse domains. GLOW outperforms existing LLM–GNN systems on standard benchmarks and GLOW-BENCH, achieving up to 53.3% and an average 38% improvement.
Democratic or Authoritarian? Probing a New Dimension of Political Biases in Large Language Models
David Guzman Piedrahita | Irene Strauss | Rada Mihalcea | Zhijing Jin
As Large Language Models (LLMs) become increasingly integrated into everyday life and information ecosystems, concerns about their implicit biases continue to persist. While prior work has primarily examined socio-demographic and left–right political dimensions, little attention has been paid to how LLMs align with broader geopolitical value systems, particularly the democracy–authoritarianism spectrum. In this paper, we propose a novel methodology to assess such alignment, combining (1) the F-scale, a psychometric tool for measuring authoritarian tendencies, (2) FavScore, a newly introduced metric for evaluating model favorability toward world leaders, and (3) role-model probing to assess which figures are cited as general role models by LLMs. We find that LLMs generally favor democratic values and leaders, but exhibit increased favorability toward authoritarian figures when prompted in Mandarin. Further, models are found to often cite authoritarian figures as role models, even outside explicitly political contexts. These results shed light on ways LLMs may reflect and potentially reinforce global political ideologies, highlighting the importance of evaluating bias beyond conventional socio-political axes.
PromptFE: Automated Feature Engineering by Prompting
Yufeng Zou | Jean Utke | Diego Klabjan | Han Liu
Automated feature engineering (AutoFE) liberates data scientists from the burden of manual feature construction. The semantic information of datasets contains rich context for feature engineering but has been underutilized in many existing AutoFE works. We present PromptFE, a novel AutoFE framework that leverages large language models (LLMs) to automatically construct features in a compact string format and generate semantic explanations based on dataset descriptions. By learning from the performance of constructed features in context, the LLM iteratively improves feature construction. We demonstrate through experiments on real-world datasets the superior performance of PromptFE over state-of-the-art AutoFE methods. We verify the impact of dataset semantic information and provide a comprehensive study of the LLM-based feature construction process.
Detecting (Un)answerability in Large Language Models with Linear Directions
Maor Juliet Lavi | Tova Milo | Mor Geva
Large language models (LLMs) often respond confidently to questions even when they lack the necessary information, leading to hallucinated answers. In this work, we study the problem of (un)answerability detection, focusing on extractive question answering (QA), where the model should determine if a passage contains sufficient information to answer a given question. We propose a simple approach for identifying a direction in the model’s activation space that captures unanswerability and using it for classification. This direction is selected by applying activation additions during inference and measuring their impact on the model’s abstention behavior. We show that projecting hidden activations onto this direction yields a reliable score for (un)answerability classification. Experiments on two open-weight LLMs and four extractive QA benchmarks show that our method effectively detects unanswerable questions and generalizes better across datasets than existing prompt-based and classifier-based approaches. Moreover, the obtained directions extend beyond extractive QA to unanswerability that stems from factors such as a lack of scientific consensus and subjectivity. Finally, causal interventions show that adding or ablating the directions effectively controls the abstention behavior of the model.
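The classification step described above amounts to a projection and a threshold. The sketch below assumes the direction has already been selected (via activation additions and their effect on abstention, as described); names and the zero threshold are illustrative, not the paper's settings.

```python
# Hedged sketch: scoring (un)answerability by projecting onto a linear direction.
import torch

def unanswerability_score(hidden, direction):
    """hidden: [hidden_dim] activation at a chosen layer/position;
    direction: [hidden_dim] candidate unanswerability direction."""
    direction = direction / direction.norm()
    return torch.dot(hidden, direction).item()

def is_unanswerable(hidden, direction, threshold=0.0):
    return unanswerability_score(hidden, direction) > threshold
```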
Online Difficulty Filtering for Reasoning Oriented Reinforcement Learning
Sanghwan Bae | Jiwoo Hong | Min Young Lee | Hanbyul Kim | Jeongyeon Nam | Donghyun Kwak
Recent advances in reinforcement learning with verifiable rewards (RLVR) show that large language models enhance their reasoning abilities when trained with verifiable signals. However, due to reward sparsity, effectiveness depends heavily on selecting samples of appropriate difficulty. In this work, we present a formal analysis of online difficulty-aware filtering and establish its theoretical foundations. We show that expected policy improvement is lower-bounded by the variance of task-level success probabilities, implying that selecting tasks of intermediate difficulty maximizes learning efficiency. Building on this, we demonstrate that balanced filtering maximizes this lower bound, leading to superior performance and sample efficiency. Evaluations across multiple math reasoning benchmarks validate that balanced filtering consistently enhances convergence speed and final performance, achieving up to +12% gains in less than half the training steps of standard GRPO. By extending our analysis to various reward distributions, we provide a principled foundation for future RLVR curriculum strategies, confirmed through both theoretical analysis and extensive empirical results.
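A minimal sketch of balanced online difficulty filtering consistent with the variance argument above: prompts whose empirical success rate under the current policy falls in an intermediate band are kept, since p(1 − p) is largest near p = 0.5. The thresholds, rollout count, and function names are illustrative, not the paper's settings.

```python
# Hedged sketch of online difficulty filtering for RLVR training batches.
def filter_batch(prompts, rollout_succeeds, n_rollouts=8, low=0.2, high=0.8):
    """rollout_succeeds(prompt) -> bool: one verifiable-reward rollout under the policy."""
    kept = []
    for prompt in prompts:
        successes = sum(rollout_succeeds(prompt) for _ in range(n_rollouts))
        p_hat = successes / n_rollouts
        if low <= p_hat <= high:   # intermediate difficulty maximizes p_hat * (1 - p_hat)
            kept.append(prompt)
    return kept
```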
BERT, are you paying attention? Attention regularization with human-annotated rationales
Elize Herrewijnen | Dong Nguyen | Floris Bex | Albert Gatt
Attention regularisation aims to supervise the attention patterns in language models like BERT. Various studies have shown that using human-annotated rationales, in the form of highlights that explain why a text has a specific label, can have positive effects on model generalisability. In this work, we ask to what extent attention regularisation with human-annotated rationales improves model performance and model robustness, as well as susceptibility to spurious correlations. We compare regularisation on human rationales with regularisation on randomly selected tokens, a baseline which has hitherto remained unexplored. Our results suggest that, often, attention regularisation with randomly selected tokens yields similar improvements to attention regularisation with human-annotated rationales. Nevertheless, we find that human-annotated rationales surpass randomly selected tokens when it comes to reducing model sensitivity to strong spurious correlations.
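One common way to implement such supervision is an auxiliary loss pulling a chosen attention distribution toward a target built from the highlighted tokens; swapping the rationale mask for a random-token mask gives the baseline compared above. The sketch below is our assumed form of this loss, not necessarily the paper's exact formulation.

```python
# Hedged sketch of an attention-regularisation term with rationale supervision.
import torch

def attention_regularisation_loss(attn, rationale_mask):
    """attn: [batch, seq] attention weights (a distribution over tokens);
    rationale_mask: [batch, seq] binary mask of highlighted tokens
    (assumed to contain at least one highlighted token per example)."""
    target = rationale_mask.float()
    target = target / target.sum(dim=-1, keepdim=True)
    log_attn = attn.clamp(min=1e-12).log()
    return -(target * log_attn).sum(dim=-1).mean()  # cross-entropy to the target
```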
Categorization is a core component of human linguistic competence. We investigate how a transformer-based language model (LM) learns linguistic categories by comparing its behaviour over the course of training to behaviours which characterize abstract feature–based and concrete exemplar–based accounts of human language acquisition. We investigate how lexical semantic and syntactic categories emerge using novel divergence-based metrics that track learning trajectories using next-token distributions. In experiments with GPT-2 small, we find that (i) when a construction is learned, abstract class-level behaviour is evident at earlier steps than lexical item–specific behaviour, and (ii) that different linguistic behaviours emerge abruptly in sequence at different points in training, revealing that abstraction plays a key role in how LMs learn. This result informs the models of human language acquisition that LMs may serve as an existence proof for.
BigTokDetect: A Clinically-Informed Vision–Language Modeling Framework for Detecting Pro-Bigorexia Videos on TikTok
Minh Duc Chu | Kshitij Pawar | Zihao He | Roxanna Sharifi | Ross M. Sonnenblick | Magdalayna Curry | Laura DAdamo | Lindsay Young | Stuart Murray | Kristina Lerman
Social media platforms face escalating challenges in detecting harmful content that promotes muscle dysmorphic behaviors and cognitions (bigorexia). This content can evade moderation by camouflaging as legitimate fitness advice and disproportionately affects adolescent males. We address this challenge with BigTokDetect, a clinically informed framework for identifying pro-bigorexia content on TikTok. We introduce BigTok, the first expert-annotated multimodal benchmark dataset of over 2,200 TikTok videos labeled by clinical psychiatrists across five categories and eighteen fine-grained subcategories. Comprehensive evaluation of state-of-the-art vision-language models reveals that while commercial zero-shot models achieve the highest accuracy on broad primary categories, supervised fine-tuning enables smaller open-source models to perform better on fine-grained subcategory detection. Ablation studies show that multimodal fusion improves performance by 5 to 15 percent, with video features providing the most discriminative signals. These findings support a grounded moderation approach that automates detection of explicit harms while flagging ambiguous content for human review, and they establish a scalable framework for harm mitigation in emerging mental health domains.
Do language models accommodate their users? A study of linguistic convergence
Terra Blevins | Susanne Schmalwieser | Benjamin Roth
While large language models (LLMs) are generally considered proficient in generating language, how similar their language usage is to that of humans remains understudied. In this paper, we test whether models exhibit linguistic convergence, a core pragmatic element of human language communication: do models adapt, or converge, to the linguistic patterns of their user? To answer this, we systematically compare model completions of existing dialogues to original human responses across sixteen language models, three dialogue corpora, and various stylometric features. We find that models strongly converge to the conversation’s style, often significantly overfitting relative to the human baseline. While convergence patterns are often feature-specific, we observe consistent shifts in convergence across modeling settings, with instruction-tuned and larger models converging less than their pretrained and smaller counterparts. Given the differences in human and model convergence patterns, we hypothesize that the underlying mechanisms driving these behaviors are very different.
Auditing Language Model Unlearning via Information Decomposition
Anmol Goel | Alan Ritter | Iryna Gurevych
We expose a critical limitation in current approaches to machine unlearning in language models: despite the apparent success of unlearning algorithms, information about the forgotten data remains linearly decodable from internal representations. To systematically assess this discrepancy, we introduce an interpretable, information-theoretic framework for auditing unlearning using Partial Information Decomposition (PID). By comparing model representations before and after unlearning, we decompose the mutual information with the forgotten data into distinct components, formalizing the notions of unlearned and residual knowledge. Our analysis reveals that redundant information, shared across both models, constitutes residual knowledge that persists post-unlearning and correlates with susceptibility to known adversarial reconstruction attacks. Leveraging these insights, we propose a representation-based risk score that can guide abstention on sensitive inputs at inference time, providing a practical mechanism to mitigate privacy leakage. Our work introduces a principled, representation-level audit for unlearning, offering theoretical insight and actionable tools for safer deployment of language models.
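Under the standard partial information decomposition (Williams and Beer), the mutual information between the forgotten data and the pair of pre- and post-unlearning representations splits into four terms. Reading "residual knowledge" as the redundant term and "unlearned knowledge" as the information unique to the pre-unlearning model is our interpretation of the abstract, and the notation is ours:

```latex
I(Z;\, R_{\mathrm{pre}}, R_{\mathrm{post}})
  = \underbrace{\mathrm{Red}(Z;\, R_{\mathrm{pre}}, R_{\mathrm{post}})}_{\text{residual knowledge}}
  + \underbrace{\mathrm{Unq}(Z;\, R_{\mathrm{pre}})}_{\text{unlearned knowledge}}
  + \mathrm{Unq}(Z;\, R_{\mathrm{post}})
  + \mathrm{Syn}(Z;\, R_{\mathrm{pre}}, R_{\mathrm{post}})
```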
OD-Stega: LLM-Based Relatively Secure Steganography via Optimized Distributions
Yu-Shin Huang | Peter Just | Hanyun Yin | Krishna Narayanan | Ruihong Huang | Chao Tian
We consider coverless steganography where a Large Language Model (LLM) is used to generate stego-texts in combination with arithmetic coding. An efficient method should embed secret bits in as few language tokens as possible while keeping the stego-text as natural as possible. We show that this problem is equivalent to maximizing the entropy of a replacement probability distribution for next-token generation, subject to a constraint on the divergence between the new distribution and the original one produced by the LLM. A closed-form solution is provided under either the KL divergence or the total variation constraint. Several important practical issues are also tackled: 1) an often-overlooked tokenization mismatch issue is resolved with a simple prompt selection approach, 2) the combination of the optimized distribution and the vocabulary truncation technique is considered, and 3) the proposed approach is incorporated with existing (potentially non-arithmetic-coding-based) techniques, e.g., the Discop technique.
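In our notation (not necessarily the paper's), the optimization stated above is, for the LLM's next-token distribution p over vocabulary V and a replacement distribution q used for embedding:

```latex
\max_{q}\; H(q) = -\sum_{v \in \mathcal{V}} q(v)\,\log q(v)
\qquad \text{s.t.} \qquad
D_{\mathrm{KL}}(q \,\|\, p) \le \epsilon
\;\;\text{or}\;\;
\tfrac{1}{2}\sum_{v \in \mathcal{V}} \bigl| q(v) - p(v) \bigr| \le \epsilon
```

The abstract notes that a closed-form solution exists under either constraint.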
Continual learning in natural language processing plays a crucial role in adapting to evolving data and preventing catastrophic forgetting. Despite significant progress, existing methods still face challenges, such as inefficient parameter reuse across tasks, the risk of catastrophic forgetting when tasks are dissimilar, and the unnecessary introduction of new parameters for each task, which hampers knowledge sharing among similar tasks. To tackle these issues, we propose a Sparse Adapter Fusion Method (SAFM), which dynamically fuses old and new adapters to address these challenges. SAFM operates in two stages: the decision stage and the tuning stage. In the decision stage, SAFM determines whether to incorporate a new adapter, reuse an existing one, or add an empty adapter. The architecture search procedure, designed to prioritize reusing or adding empty adapters, minimizes parameter consumption and maximizes reuse. In the tuning stage, SAFM applies a layer-wise loss to encourage differentiation between adapters, effectively capturing knowledge within the same task. Experimental results consistently show that SAFM outperforms state-of-the-art (SOTA) methods, achieving comparable performance while utilizing less than 60% of the parameters.
Rethinking Prompt Optimizers: From Prompt Merits to Optimization
Zixiao Zhu | Hanzhang Zhou | Zijian Feng | Tianjiao Li | Chua Jia Jim Deryl | Lee Onn Mak | Gee Wah Ng | Kezhi Mao
Prompt optimization (PO) provides a practical way to improve response quality when users lack the time or expertise to manually craft effective prompts. Existing methods typically rely on LLMs’ self-generation ability to optimize prompts. However, due to limited downward compatibility, the instruction-heavy prompts generated by advanced LLMs can overwhelm lightweight inference models and degrade response quality, while also lacking interpretability due to implicit optimization. In this work, we rethink prompt optimization through the lens of explicit and interpretable design. We first identify a set of model-agnostic prompt quality merits and empirically validate their effectiveness in enhancing prompt and response quality. We then introduce MePO, a merit-guided, locally deployable prompt optimizer trained on our merit-guided prompt preference dataset generated by a lightweight LLM. MePO avoids online optimization, reduces privacy concerns, and, by learning clear, interpretable merits, generalizes effectively to both large-scale and lightweight inference models. Experiments demonstrate that MePO achieves better results across diverse tasks and model types, offering a scalable and robust solution for real-world deployment. The code, model, and dataset can be found at https://github.com/MidiyaZhu/MePO.
A Survey on Multilingual Mental Disorders Detection from Social Media Data
Ana-Maria Bucur | Marcos Zampieri | Tharindu Ranasinghe | Fabio Crestani
The increasing prevalence of mental disorders globally highlights the urgent need for effective digital screening methods that can be used in multilingual contexts. Most existing studies, however, focus on English data, overlooking critical mental health signals that may be present in non-English texts. To address this gap, we present a survey of the detection of mental disorders using social media data beyond the English language. We compile a comprehensive list of 108 datasets spanning 25 languages that can be used for developing NLP models for mental health screening. In addition, we discuss the cultural nuances that influence online language patterns and self-disclosure behaviors, and how these factors can impact the performance of NLP tools. Our survey highlights major challenges, including the scarcity of resources for low- and mid-resource languages and the dominance of depression-focused data over other disorders. By identifying these gaps, we advocate for interdisciplinary collaborations and the development of multilingual benchmarks to enhance mental health screening worldwide.
Identifying Fine-grained Forms of Populism in Political Discourse: A Case Study on Donald Trump’s Presidential Campaigns
Ilias Chalkidis | Stephanie Brandl | Paris Aslanidis
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of instruction-following tasks, yet their grasp of nuanced social science concepts remains underexplored. This paper examines whether LLMs can identify and classify fine-grained forms of populism, a complex and contested concept in both academic and media debates. To this end, we curate and release novel datasets specifically designed to capture populist discourse. We evaluate a range of pre-trained (large) language models, both open-weight and proprietary, across multiple prompting paradigms. Our analysis reveals notable variation in performance, highlighting the limitations of LLMs in detecting populist discourse. We find that a fine-tuned RoBERTa classifier vastly outperforms all new-era instruction-tuned LLMs, unless the latter are also fine-tuned. Additionally, we apply our best-performing model to analyze campaign speeches by Donald Trump, extracting valuable insights into his strategic use of populist rhetoric. Finally, we assess the generalizability of these models by benchmarking them on campaign speeches by European politicians, offering a lens into cross-context transferability in political discourse analysis. In this setting, we find that instruction-tuned LLMs exhibit greater robustness on out-of-domain data.
SCoNE: a Self-Correcting and Noise-Augmented Method for Complex Biological and Chemical Named Entity Recognition
Xingyu Zhu | Claire Nédellec | Balazs Nagy | Laszlo Vidacs | Robert Bossy
Generative methods have recently gained traction in biological and chemical named entity recognition for their ability to overcome tagging limitations and better capture entity-rich contexts. However, in a few-shot setting, they struggle with the scarcity of annotated data and the structural complexity of biological and chemical entities—particularly nested and discontinuous ones—leading to incorrect recognition and error propagation during generation. To address these challenges, we propose SCoNE, a Self-Correcting and Noise-Augmented Method for Complex Biological and Chemical Named Entity Recognition. Specifically, we introduce a Noise Augmentation Module to enhance training diversity and guide the model to better learn complex entity structures. In addition, we design a Confidence-based Self-Correction Module that identifies low-confidence outputs and revises them to improve generation robustness. Benefiting from these designs, our method outperforms the baselines by 1.80 and 2.73 F1 points on CHEMDNER and the microbial ecology dataset Florilege, respectively, highlighting its effectiveness in biological and chemical named entity recognition.
A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models
Iwona Christop | Mateusz Czyżnikiewicz | Paweł Skórzewski | Łukasz Bondaruk | Jakub Kubiak | Marcin Lewandowski | Marek Kubis
The present benchmarks for testing the audio modality of multimodal large language models concentrate on testing various audio tasks, such as speaker diarization or gender identification, in isolation. They cannot be used to verify whether a multimodal model can answer questions that require reasoning skills to combine audio tasks of different categories. To address this issue, we propose Audio Reasoning Tasks (ART), a new benchmark for assessing the ability of multimodal models to solve problems that require reasoning over audio signals.
Nemotron-CrossThink: Scaling Self-Learning beyond Math Reasoning
Syeda Nahida Akter | Shrimai Prabhumoye | Matvei Novikov | Seungju Han | Ying Lin | Evelina Bakhturina | Eric Nyberg | Yejin Choi | Mostofa Patwary | Mohammad Shoeybi | Bryan Catanzaro
Prior work has successfully applied Reinforcement Learning (RL) to mathematical reasoning—where rules and correctness are well-defined. Yet, generalizing these methods to broader reasoning domains remains challenging due to limited data and the lack of verifiable rewards for unstructured domains. In this work, we propose NEMOTRON-CROSSTHINK, a framework that systematically incorporates multi-domain corpora into RL training to improve generalization across diverse reasoning tasks. NEMOTRON-CROSSTHINK addresses key challenges by (1) combining data from varied sources; (2) applying structured templates to control answer-space complexity; (3) filtering for verifiable answers; and (4) optimizing data blending strategies to utilize multi-source data effectively. This enables scalable and verifiable reward modeling beyond math and demonstrates improved accuracies on both math (MATH-500: +30.1%, AMC23: +27.5%) and non-math reasoning benchmarks (MMLU-PRO: +12.8%, GPQA-DIAMOND: +11.3%, AGIEVAL: +15.1%, SUPERGPQA: +3.8%). Moreover, NEMOTRON-CROSSTHINK exhibits significantly improved response efficiency—using 28% fewer tokens for correct answers—highlighting more focused and effective reasoning. Through NEMOTRON-CROSSTHINK, we demonstrate that integrating multi-domain, multi-format data in RL leads to more accurate, efficient, and generalizable LLMs. All of our datasets are available on HuggingFace.
Safety of Large Language Models Beyond English: A Systematic Literature Review of Risks, Biases, and Safeguards
Aleksandra Krasnodębska | Katarzyna Dziewulska | Karolina Seweryn | Maciej Chrabaszcz | Wojciech Kusa
As Large Language Models (LLMs) continue to evolve, ensuring their safety across multiple languages has become a critical concern. While LLMs demonstrate impressive capabilities in English, their safety mechanisms may not generalize effectively to other languages, leading to disparities in toxicity detection, bias mitigation, and harm prevention. This systematic review examines the multilingual safety of LLMs by synthesizing findings from recent studies that evaluate their robustness across diverse linguistic and cultural contexts beyond English. Our review surveys the methodologies used to assess multilingual safety and identifies challenges such as dataset availability and evaluation biases. Based on our analysis, we highlight gaps in multilingual safety research and provide recommendations for future work. This review aims to contribute to the development of fair and effective safety mechanisms for LLMs across all languages. We provide the extracted data in an interactive Streamlit dashboard, enabling transparent access to the raw data and allowing for continuous updates.
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
Yuhang Liu | Pengxiang Li | Zishu Wei | Congkai Xie | Xueyu Hu | Xinchen Xu | Shengyu Zhang | Xiaotian Han | Hongxia Yang | Fei Wu
Graphical User Interface (GUI) Agents, powered by multimodal large language models (MLLMs), have shown great potential for task automation on computing devices such as computers and mobile phones. However, existing agents face challenges in multi-step reasoning and reliance on textual annotations, limiting their effectiveness. We introduce InfiGUIAgent, an MLLM-based GUI Agent trained with a two-stage supervised fine-tuning pipeline. Stage 1 enhances fundamental skills such as GUI understanding and grounding, while Stage 2 integrates hierarchical reasoning and expectation-reflection reasoning skills using synthesized data to enable native reasoning abilities of the agents. InfiGUIAgent achieves competitive performance on several GUI benchmarks, highlighting the impact of native reasoning skills in enhancing GUI interaction for automation tasks.
Cetvel: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish
Yakup Abrek Er | Ilker Kesen | Gözde Gül Şahin | Aykut Erdem
We introduce Cetvel, a comprehensive benchmark designed to evaluate large language models (LLMs) in Turkish. Existing Turkish benchmarks often lack either task diversity or culturally relevant content, or both. Cetvel addresses these gaps by combining a broad range of discriminative and generative tasks while ensuring content that reflects the linguistic and cultural richness of the Turkish language. Cetvel covers 23 tasks grouped into seven categories, including grammatical error correction, machine translation, and question answering rooted in Turkish history and idiomatic language. We evaluate 33 open-weight LLMs (up to 70B parameters) covering different model families and instruction paradigms. Our experiments reveal that Turkish-centric instruction-tuned models generally underperform relative to multilingual or general-purpose models (e.g., Llama 3 and Mistral), despite being tailored for the language. Moreover, we show that tasks such as grammatical error correction and extractive question answering are particularly discriminative in differentiating model capabilities. Cetvel offers a comprehensive and culturally grounded evaluation suite for advancing the development and assessment of LLMs in Turkish.
CALE: Concept-Aligned Embeddings for Both Within-Lemma and Inter-Lemma Sense Differentiation
Bastien Liétard | Gabriel Loiseau
Lexical semantics is concerned with both the multiple senses a word can adopt in different contexts and the semantic relations that exist between the meanings of different words. Contextualized Language Models are a valuable tool for studying them, as they provide context-sensitive representations of lexical meaning. Recent works like XL-LEXEME have leveraged the Word-in-Context task to fine-tune such models toward more semantically accurate representations, but Word-in-Context only compares occurrences of the same lemma, limiting the range of information captured. In this paper, we propose an extension, Concept Differentiation, to include inter-word scenarios. We provide a dataset for this task, derived from SemCor data, and fine-tune several representation models on it. We call these models Concept-Aligned Embeddings (CALE). By challenging our models and other models on various lexical semantic tasks, we demonstrate that the proposed models provide efficient multi-purpose representations of lexical meaning that achieve the best performance in our experiments. We also show that CALE’s fine-tuning brings valuable changes to the spatial organization of embeddings.
Do NOT Classify and Count: Hybrid Attribute Control Success Evaluation
Felix Matthias Saaro | Pius von Däniken | Mark Cieliebak | Jan Milan Deriu
Evaluating attribute control success in controllable text generation and related generation tasks typically relies on pretrained classifiers. We show that this widely used classify-and-count approach yields biased and inconsistent results, with estimates varying significantly across classifiers. We frame control success estimation as a quantification task and apply a hybrid Bayesian method that combines classifier predictions with a small number of human labels for calibration. To test our approach, we collected a two-modality test dataset consisting of 600 human-rated samples and 60,000 automatically rated samples. Our experiments show that our approach produces robust estimates of control success across both text and text-to-image generation tasks, offering a principled alternative to current evaluation practices.
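To make the contrast with the classify-and-count baseline concrete, here is a minimal sketch of the standard "adjusted classify and count" correction from quantification learning, which likewise combines classifier decisions with a small human-labeled calibration set; it is illustrative of the problem setting, not the authors' exact hybrid Bayesian method, and all function names are ours.

```python
# Illustrative quantification sketch (not the paper's exact Bayesian estimator):
# a small human-labeled calibration set is used to estimate the classifier's
# true-positive and false-positive rates, which then correct the naive
# classify-and-count prevalence estimate over a large automatically rated set.
import numpy as np

def adjusted_prevalence(clf_labels_large, clf_labels_calib, human_labels_calib):
    """clf_labels_*: 0/1 classifier decisions; human_labels_calib: 0/1 gold labels."""
    clf_large = np.asarray(clf_labels_large)
    clf_calib = np.asarray(clf_labels_calib)
    gold = np.asarray(human_labels_calib)

    raw = clf_large.mean()                      # naive classify-and-count estimate
    tpr = clf_calib[gold == 1].mean()           # P(clf=1 | gold=1)
    fpr = clf_calib[gold == 0].mean()           # P(clf=1 | gold=0)
    if tpr == fpr:                              # classifier carries no signal
        return float(raw)
    corrected = (raw - fpr) / (tpr - fpr)       # misclassification correction
    return float(np.clip(corrected, 0.0, 1.0))

# toy usage: 60,000 automatic decisions calibrated with 600 human-rated samples
rng = np.random.default_rng(0)
gold_calib = rng.integers(0, 2, 600)
clf_calib = np.where(rng.random(600) < 0.8, gold_calib, 1 - gold_calib)  # ~80% accurate
clf_large = rng.integers(0, 2, 60000)
print(adjusted_prevalence(clf_large, clf_calib, gold_calib))
```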
Detecting Training Data of Large Language Models via Expectation Maximization
Gyuwan Kim | Yang Li | Evangelia Spiliopoulou | Jie Ma | William Yang Wang
Membership inference attacks (MIAs) aim to determine whether a specific example was used to train a given language model. While prior work has explored prompt-based attacks such as ReCALL, these methods rely heavily on the assumption that using known non-members as prompts reliably suppresses the model’s responses to non-member queries. We propose EM-MIA, a new membership inference approach that iteratively refines prefix effectiveness and membership scores using an expectation-maximization strategy without requiring labeled non-member examples. To support controlled evaluation, we introduce OLMoMIA, a benchmark that enables analysis of MIA robustness under systematically varied distributional overlap and difficulty. Experiments on WikiMIA and OLMoMIA show that EM-MIA outperforms existing baselines, particularly in settings with clear distributional separability. We highlight scenarios where EM-MIA succeeds in practical settings with partial distributional overlap, while failure cases expose fundamental limitations of current MIA methods under near-identical conditions. We release our code and evaluation pipeline to encourage reproducible and robust MIA research.
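A minimal sketch of the alternating-refinement structure the abstract describes, assuming a precomputed score matrix S where S[i, j] reflects how much prefixing candidate i with text j changes the model's loss on i; the correlation-based weighting heuristic below is our illustrative stand-in, not the paper's exact EM update.

```python
# Sketch of an EM-style alternation between membership scores and prefix
# effectiveness, under the assumptions stated in the lead-in.
import numpy as np

def em_membership(S, n_iters=10):
    n_candidates, n_prefixes = S.shape
    w = np.ones(n_prefixes) / n_prefixes               # prefix effectiveness (init)
    for _ in range(n_iters):
        m = S @ w                                       # membership scores under current weights
        m = (m - m.mean()) / (m.std() + 1e-8)
        # re-estimate prefix effectiveness: how well each prefix's scores track the
        # current membership ranking (illustrative separation criterion)
        w = np.array([np.corrcoef(S[:, j], m)[0, 1] for j in range(n_prefixes)])
        w = np.clip(w, 0.0, None)
        w = w / (w.sum() + 1e-8)
    return S @ w                                        # final membership scores

# toy usage with random scores
rng = np.random.default_rng(0)
scores = em_membership(rng.normal(size=(100, 8)))
print(scores[:5])
```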
How effective are VLMs in assisting humans in inferring the quality of mental models from Multimodal short answers?
Pritam Sil | Durgaprasad Karnam | Vinay Reddy Venumuddala | Pushpak Bhattacharyya
STEM mental models can play a critical role in assessing students’ conceptual understanding of a topic. They not only offer insights into what students know but also into how effectively they can apply, relate, and integrate concepts across various contexts. Thus, students’ responses are critical markers of the quality of their understanding, not entities that should be merely graded. However, inferring these mental models from student answers is challenging, as it requires deep reasoning skills. We propose MMGrader, an approach that infers the quality of students’ mental models from their multimodal responses using concept graphs as an analytical framework. In our evaluation with 9 openly available models, we found that even the best-performing models fall short of human-level performance, achieving an accuracy of only approximately 40%, a prediction error of 1.1 units, and a scoring distribution fairly aligned with human scoring patterns. With improved accuracy, such models could become highly effective assistants to teachers, helping them infer the mental models of entire classrooms efficiently and improve their pedagogy by designing targeted help sessions and lectures that strengthen areas where students collectively demonstrate lower proficiency.
Don’t Trust Generative Agents to Mimic Communication on Social Networks Unless You Benchmarked their Empirical Realism
Simon Münker | Nils Schwager | Achim Rettinger
The ability of Large Language Models (LLMs) to mimic human behavior has triggered a plethora of computational social science research, assuming that empirical studies of humans can be conducted with AI agents instead. Since there have been conflicting research findings on whether and when this hypothesis holds, there is a need to better understand the differences in their experimental designs. We focus on replicating the behavior of social network users with LLMs for the analysis of communication on social networks. First, we provide a formal framework for the simulation of social networks, before focusing on the sub-task of imitating user communication. We empirically test different approaches to imitating user behavior in English and German. Our findings suggest that social simulations should be validated by their empirical realism, measured in the setting in which the simulation components were fitted. With this paper, we argue for more rigor when applying generative-agent-based modeling for social simulation.
Persona Prompting as a Lens on LLM Social Reasoning
Jing Yang | Moritz Hechtbauer | Elisabeth Khalilov | Evelyn Luise Brinkmann | Vera Schmitt | Nils Feldhus
For socially sensitive tasks like hate speech detection, the quality of explanations from Large Language Models (LLMs) is crucial for factors like user trust and model alignment. While persona prompting (PP) is increasingly used to steer models toward user-specific generation, its effect on model rationales remains underexplored. We investigate how LLM-generated rationales vary when conditioned on different simulated demographic personas. Using datasets annotated with word-level rationales, we measure agreement with human annotations from different demographic groups and assess the impact of PP on model bias and human alignment. Our evaluation across three LLMs reveals three key findings: (1) PP improves classification on the most subjective task (hate speech) but degrades rationale quality. (2) Simulated personas fail to align with their real-world demographic counterparts, and high inter-persona agreement shows models are resistant to significant steering. (3) Models exhibit consistent demographic biases and a strong tendency to over-flag content as harmful, regardless of PP. Our findings reveal a critical trade-off: while PP can improve classification in socially sensitive tasks, it often comes at the cost of rationale quality and fails to mitigate underlying biases, urging caution in its application.
PartisanLens: A Multilingual Dataset of Hyperpartisan and Conspiratorial Immigration Narratives in European Media
Michele Joshua Maggini | Paloma Piot | Anxo Pérez | Erik Bran Marino | Lúa Santamaría Montesinos | Ana Lisboa Cotovio | Marta Vázquez Abuín | Javier Parapar | Pablo Gamallo
Detecting hyperpartisan narratives and Population Replacement Conspiracy Theories (PRCT) is essential to addressing the spread of misinformation. These complex narratives pose a significant threat, as hyperpartisanship drives political polarisation and institutional distrust, while PRCTs directly motivate real-world extremist violence, making their identification critical for social cohesion and public safety. However, existing resources are scarce, predominantly English-centric, and often analyse hyperpartisanship, stance, and rhetorical bias in isolation rather than as interrelated aspects of political discourse. To bridge this gap, we introduce PartisanLens, the first multilingual dataset of 1,617 hyperpartisan news headlines in Spanish, Italian, and Portuguese, annotated along multiple aspects of political discourse. We first evaluate the classification performance of widely used Large Language Models (LLMs) on this dataset, establishing robust baselines for the classification of hyperpartisan and PRCT narratives. In addition, we assess the viability of using LLMs as automatic annotators for this task, analysing their ability to approximate human annotation; results highlight both their potential and current limitations. Next, moving beyond standard judgments, we explore whether LLMs can emulate human annotation patterns by conditioning them on socio-economic and ideological profiles that simulate annotator perspectives. Finally, we release our resources and evaluation; PartisanLens supports future research on detecting partisan and conspiratorial narratives in European contexts.
Adaptive LLM-Symbolic Reasoning via Dynamic Logical Solver Composition
Lei Xu | Pierre Beckmann | Marco Valentino | Andre Freitas
Lexical Popularity: Quantifying the Impact of Pre-training for LLM Performance
Elena Sofia Ruzzetti | Fabio Massimo Zanzotto | Tommaso Caselli
Large Language Models (LLMs) excel in numerous and varied tasks, yet the mechanisms that underlie this success remain insufficiently understood. In particular, the size and limited transparency of their pre-training materials make it difficult to characterize the properties of the pre-training data relative to the test data. In this paper, we investigate whether LLMs learn generalized linguistic abstractions or rely on surface-level features, such as lexical patterns, that match their pre-training data. We explore this by examining the relationship between the lexical overlap of test data with the pre-training material and task performance. We observe that lexical overlap with the pre-training material is mostly beneficial to model performance on tasks requiring functional linguistic knowledge. To further explore the impact of lexical features, we also demonstrate that LLMs are fragile with respect to lexical perturbations that preserve semantics. While we expected models to rely on lexical overlap between test instances and pre-training data for tasks requiring functional knowledge, lexical perturbations reveal that models also exhibit, to a lesser extent, this dependence for tasks requiring formal linguistic knowledge.
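To make the kind of lexical-overlap statistic studied here concrete, a minimal sketch follows: Jaccard-style overlap between the word n-grams of a test instance and the n-grams observed in a sample of pre-training text. The n-gram order and the function names are our illustrative choices, not the paper's exact measure.

```python
# Illustrative lexical-overlap computation: fraction of a test instance's word
# n-grams that also appear in (a sample of) the pre-training corpus.
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def lexical_overlap(test_text, pretraining_texts, n=2):
    test_set = ngrams(test_text.lower().split(), n)
    corpus_set = set()
    for doc in pretraining_texts:
        corpus_set |= ngrams(doc.lower().split(), n)
    if not test_set:
        return 0.0
    return len(test_set & corpus_set) / len(test_set)

print(lexical_overlap("the cat sat on the mat",
                      ["the dog sat on the rug", "a cat on the mat"]))
```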
Uncovering Hidden Correctness in LLM Causal Reasoning via Symbolic Verification
Paul He | Yinya Huang | Mrinmaya Sachan | Zhijing Jin
Large language models (LLMs) are increasingly applied to tasks involving causal reasoning. However, current benchmarks often rely on string matching or surface-level metrics that fail to assess whether a model’s output is formally valid under causal semantics. We propose DoVerifier, a symbolic verification framework that checks whether LLM-generated causal expressions are derivable from a given causal graph using rules from do-calculus and probability theory. This allows us to recover correct answers that would otherwise be marked incorrect due to superficial differences. Evaluations on synthetic data and causal QA benchmarks show that DoVerifier more accurately captures semantic correctness than standard metrics, offering a more rigorous and informative way to evaluate LLMs on causal tasks.
CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures
Punya Syon Pandey | Yongjin Yang | Jiarui Liu | Zhijing Jin
Game-theoretic interactions between agents powered by large language models (LLMs) have revealed many emergent capabilities, yet the linguistic diversity of these interactions has not been sufficiently quantified. In this paper, we present the Conversational Robustness Evaluation Score (CORE), a metric to quantify the effectiveness of language use within multi-agent systems across different game-theoretic interactions. CORE integrates measures of cluster entropy, lexical repetition, and semantic similarity, providing a direct lens on dialogue quality. We apply CORE to pairwise LLM dialogues across competitive, cooperative, and neutral settings, further grounding our analysis in Zipf’s and Heaps’ laws to characterize word frequency distributions and vocabulary growth. Our findings show that cooperative settings exhibit both steeper Zipf distributions and higher Heaps exponents, indicating more repetition alongside greater vocabulary expansion. In contrast, competitive interactions display lower Zipf and Heaps exponents, reflecting less repetition and more constrained vocabularies. These results provide new insights into how social incentives influence language adaptation and highlight CORE as a robust diagnostic for measuring linguistic robustness in multi-agent LLM systems.
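For readers unfamiliar with the two corpus statistics the analysis is grounded in, here is a hedged sketch of estimating a Zipf exponent (slope of the log-log rank-frequency curve) and a Heaps exponent (growth rate of vocabulary size with token count) via least-squares fits in log space; this is a common convention, not necessarily the authors' exact fitting procedure.

```python
# Zipf exponent: freq ∝ rank^(-s); Heaps exponent: |V| ∝ N^beta.
# Both estimated by a linear fit in log-log space (illustrative choice).
import numpy as np
from collections import Counter

def zipf_exponent(tokens):
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope

def heaps_exponent(tokens, n_points=20):
    sizes = np.linspace(10, len(tokens), n_points, dtype=int)
    vocab = np.array([len(set(tokens[:n])) for n in sizes], dtype=float)
    slope, _ = np.polyfit(np.log(sizes), np.log(vocab), 1)
    return slope

# toy dialogue transcript
dialog = ("we propose to cooperate and we agree to cooperate on the plan " * 50).split()
print(zipf_exponent(dialog), heaps_exponent(dialog))
```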
Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation
Thomas F Burns | Letitia Parcalabescu | Stephan Waeldchen | Michael Barlow | Gregor Ziegltrum | Volker Stampa | Bastian Harren | Björn Deiseroth
Scaling data quantity is essential for large language models (LLMs), yet recent findings show that data quality can significantly boost performance and training efficiency. We introduce a German-language dataset curation pipeline that combines heuristic and model-based filtering techniques with synthetic data generation. We use our pipeline to create Aleph-Alpha-GermanWeb, a large-scale German pre-training dataset which draws from: (1) Common Crawl web data, (2) FineWeb2, and (3) synthetically-generated data conditioned on actual, organic web data. We evaluate our dataset by pre-training both a 1B Llama-style model and an 8B tokeniser-free hierarchical autoregressive transformer (HAT). A comparison on German-language benchmarks, including MMMLU, shows significant performance gains of Aleph-Alpha-GermanWeb over FineWeb2 alone. This advantage holds at the 8B scale even when FineWeb2 is enriched by human-curated high-quality data sources such as Wikipedia. Our findings support the growing body of evidence that model-based data curation and synthetic data generation can significantly enhance LLM pre-training datasets.
Large language models achieve state-of-the-art performance but are increasingly costly to fine-tune. Prompt tuning is a parameter-efficient fine-tuning method that learns prompt embeddings, but these embeddings are typically tied to the model’s hidden dimensionality, limiting parameter savings. In this paper, we propose Ultra-Low-dimensional Prompt Tuning (ULPT), a simple yet effective method that optimizes prompts in a low-dimensional space (e.g., 2D) and uses a frozen random matrix for up-projection. ULPT achieves a 98% reduction in trainable parameters compared with vanilla prompt tuning while preserving performance. Our extensive experiments across over 20 NLP tasks demonstrate that ULPT consistently outperforms recent parameter-efficient tuning methods using significantly fewer parameters, making it well-suited as a storage-efficient framework for massive LLM customization.
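A minimal sketch of the ULPT parameterization as described in the abstract: the trainable prompt vectors live in an ultra-low-dimensional space and a frozen random matrix up-projects them to the model's hidden size. The shapes, scaling, and variable names are illustrative assumptions, and no training loop is shown.

```python
# Only z is trained; the up-projection P stays frozen, so the trainable
# parameter count is n_prompt_tokens * d_low instead of n_prompt_tokens * d_model.
import numpy as np

rng = np.random.default_rng(0)
n_prompt_tokens, d_low, d_model = 100, 2, 4096

z = rng.normal(size=(n_prompt_tokens, d_low))            # trainable low-dimensional prompts
P = rng.normal(size=(d_low, d_model)) / np.sqrt(d_low)   # frozen random up-projection

prompt_embeddings = z @ P                                 # (n_prompt_tokens, d_model), prepended to inputs

print("trainable params:", z.size)
print("vanilla prompt tuning params:", n_prompt_tokens * d_model)
```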
We model Semantic Self-Verification (SSV) as the problem of determining whether a statement accurately characterizes its own semantic properties within a given interpretive framework. This formalizes a challenge in AI safety and fairness: can an AI system verify that it has correctly interpreted rules intended to govern its behavior? We prove that SSV, in this specification, is NP-complete by constructing a polynomial-time reduction from 3-Satisfiability (3-SAT). Our reduction maps a 3-SAT formula to an instance of SSV involving ambiguous terms with binary interpretations and semantic constraints derived from logical clauses. This establishes that even simplified forms of semantic self-verification face computational barriers. The NP-complete lower bound has implications for AI safety and fairness approaches that rely on semantic interpretation of instructions, including but not limited to constitutional AI, alignment via natural language, and instruction-following systems. Any approach in which an AI system verifies its understanding of directives faces this computational barrier. We argue that more realistic verification scenarios likely face even greater complexity.
STAMP: Selective Task-Aware Mechanism for Text Privacy
Fengwei Tian | Payel Bhattacharjee | Heidi Hanson | Geoffrey D Rubin | Joseph Y. Lo | Ravi Tandon
We present STAMP (Selective Task-Aware Mechanism for Text Privacy), a new framework for task-aware text privatization that achieves an improved privacy–utility trade-off. STAMP selectively allocates privacy budgets across tokens by jointly considering (i) each token’s importance to the downstream task (as measured via a task- or query-specific representation), and (ii) its privacy sensitivity (e.g., names, dates, identifiers). This token-level partitioning enables fine-grained, group-wise control over the level of noise applied to different parts of the input, balancing privacy protection with task relevance. To privatize individual token embeddings, we introduce the polar mechanism, which perturbs only the direction of embeddings on the unit sphere while preserving their magnitude. Decoding is performed via cosine nearest-neighbor search, aligning the perturbation geometry with the decoding geometry. Unlike isotropic noise mechanisms, the polar mechanism maintains semantic neighborhoods in the embedding space and better preserves downstream utility. Experimental evaluations on SQuAD, Yelp, and AG News datasets demonstrate that STAMP, when combined with the normalized polar mechanism, consistently achieves superior privacy–utility trade-offs across varying per-token privacy budgets.
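An illustrative sketch of the polar mechanism described above: noise perturbs only the direction of a token embedding on the unit sphere while its magnitude is preserved, and decoding picks the cosine-nearest vocabulary embedding. The noise model (isotropic Gaussian added to the unit direction, scaled by 1/epsilon) is our simplification, not the paper's calibrated differentially private mechanism.

```python
# Direction-only perturbation plus cosine nearest-neighbor decoding.
import numpy as np

def polar_perturb(x, epsilon, rng):
    norm = np.linalg.norm(x)
    direction = x / (norm + 1e-12)
    noisy = direction + rng.normal(scale=1.0 / epsilon, size=x.shape)
    noisy_direction = noisy / (np.linalg.norm(noisy) + 1e-12)
    return norm * noisy_direction                      # same magnitude, perturbed direction

def cosine_decode(x, vocab_embeddings):
    v = vocab_embeddings / np.linalg.norm(vocab_embeddings, axis=1, keepdims=True)
    return int(np.argmax(v @ (x / np.linalg.norm(x))))

rng = np.random.default_rng(0)
vocab = rng.normal(size=(1000, 64))                    # toy vocabulary embedding table
token = vocab[42]
private = polar_perturb(token, epsilon=8.0, rng=rng)
print(cosine_decode(private, vocab))                   # tends to recover 42 for large epsilon
```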
Deconstructing Instruction-Following: A New Benchmark for Granular Evaluation of Large Language Model Instruction Compliance Abilities
Alberto Purpura | Li Wang | Sahil Badyal | Gene Beaufrand | Adam Faulkner
Reliably ensuring Large Language Models (LLMs) follow complex instructions is a critical challenge, as existing benchmarks often fail to reflect real-world use or isolate compliance from task success. We introduce MOSAIC (MOdular Synthetic Assessment of Instruction Compliance), a modular framework that uses a dynamically generated dataset with up to 20 application-oriented generation constraints to enable a granular and independent analysis of this capability. Our evaluation of five LLMs from different families based on this new benchmark demonstrates that compliance is not a monolithic capability but varies significantly with constraint type, quantity, and position. The analysis reveals model-specific weaknesses, uncovers synergistic and conflicting interactions between instructions, and identifies distinct positional biases such as primacy and recency effects. These granular insights are critical for diagnosing model failures and developing more reliable LLMs for systems that demand strict adherence to complex instructions.
Utterance-level Detection Framework for LLM-Involved Content Detection in Conversational Setting
Muyang Zhou | Huaxia Rui
As Large Language Models (LLMs) increasingly power chatbots, social media, and other interactive platforms, the ability to detect AI in conversational settings is critical for ensuring transparency and preventing potential misuse. However, existing detection methods focus on static, document-level content, overlooking the dynamic nature of dialogues. To address this, we propose an utterance-level detection framework that integrates features from individual and combined analyses of dialogue participants’ responses to detect LLM-generated text in conversational settings. Leveraging a transformer-based recurrent architecture and a curated dataset of human-human, human-LLM, and LLM-LLM dialogues, the framework achieves an accuracy of 98.14% with high inference speed, supported by extensive experiments across different models and settings. This work provides an effective solution for detecting LLM-generated text in real-time conversations, promoting transparency, and mitigating risks of misuse.
Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL
Yifei Shen | Yilun Zhao | Justice Ou | Tinglin Huang | Arman Cohan
Real-world clinical text-to-SQL requires reasoning over heterogeneous EHR tables, temporal windows, and patient-similarity cohorts to produce executable queries. We introduce ClinSQL, a benchmark of 633 expert-annotated tasks on MIMIC-IV v3.1 that demands multi-table joins, clinically meaningful filters, and executable SQL. Solving ClinSQL entails navigating schema metadata and clinical coding systems, handling long contexts, and composing multi-step queries beyond traditional text-to-SQL. We evaluate 20 proprietary and open-source models under Chain-of-Thought self-refinement and use rubric-based SQL analysis with execution checks that prioritize critical clinical requirements. Despite recent advances, performance remains far from clinical reliability: on the test set, GPT-5-mini attains a 74.7% execution score, DeepSeek-R1 leads open-source models at 69.2%, and Gemini-2.5-Pro drops from 85.5% on Easy to 67.2% on Hard. Progress on ClinSQL marks tangible advances toward clinically reliable text-to-SQL for real-world EHR analytics.
iBERT: Interpretable Embeddings via Sense Decomposition
Vishal Anand | Milad Alshomary | Kathleen McKeown
We present iBERT (interpretable-BERT), an encoder that produces inherently interpretable and controllable embeddings, designed to modularize and expose the discriminative cues present in language, such as semantic or stylistic structure. Each input token is represented as a sparse, non-negative mixture over k context-independent sense vectors, which can be pooled into sentence embeddings or used directly at the token level. This enables modular control over representation before any decoding or downstream use. To demonstrate our model’s interpretability, we evaluate it on a suite of style-focused tasks. On the STEL benchmark, it improves style representation effectiveness by ~8 points over SBERT-style baselines, while maintaining competitive performance on authorship verification. Because each embedding is a structured composition of interpretable senses, we highlight how specific style attributes get assigned to specific sense vectors. While our experiments center on style, iBERT is not limited to stylistic modeling. Its structural modularity is designed to interpretably decompose whichever discriminative signals are present in the data — enabling generalization even when supervision blends semantic or stylistic factors.
Attacker’s Noise Can Manipulate Your Audio-based LLM in the Real World
Vinu Sankar Sadasivan | Soheil Feizi | Rajiv Mathews | Lun Wang
This paper investigates the real-world vulnerabilities of audio-based large language models (ALLMs), such as Qwen2-Audio. We first demonstrate that an adversary can craft stealthy audio perturbations to manipulate ALLMs into exhibiting specific targeted behaviors, such as eliciting responses to wake-keywords (e.g., "Hey Qwen"), or triggering harmful behaviors (e.g., "Change my calendar event"). Subsequently, we show that playing adversarial background noise during user interaction with the ALLMs can significantly degrade the response quality. Crucially, our research illustrates the scalability of these attacks to real-world scenarios, impacting other innocent users when these adversarial noises are played through the air. Further, we discuss the transferability of the attack and potential defensive measures.
Say It Another Way: Auditing LLMs with a User-Grounded Automated Paraphrasing Framework
Clea Chataigner | Rebecca Ma | Prakhar Ganesh | Yuhao Chen | Afaf Taik | Elliot Creager | Golnoosh Farnadi
Large language models (LLMs) are highly sensitive to subtle changes in prompt phrasing, posing challenges for reliable auditing. Prior methods often apply unconstrained prompt paraphrasing, which risks missing linguistic and demographic factors that shape authentic user interactions. We introduce AUGMENT (Automated User-Grounded Modeling and Evaluation of Natural Language Transformations), a framework for generating controlled paraphrases grounded in user behaviors. AUGMENT leverages linguistically informed rules and enforces quality through checks on instruction adherence, semantic similarity, and realism, ensuring paraphrases are both reliable and meaningful for auditing. Through case studies on the BBQ and MMLU datasets, we show that controlled paraphrases uncover systematic weaknesses that remain obscured under unconstrained variation. These results highlight the value of the AUGMENT framework for reliable auditing.
AutoBool: Reinforcement-Learned LLM for Effective Automatic Systematic Reviews Boolean Query Generation
Shuai Wang | Harrisen Scells | Bevan Koopman | Guido Zuccon
We present AutoBool, a reinforcement learning (RL) framework that trains large language models (LLMs) to generate effective Boolean queries for medical systematic reviews. Boolean queries are the primary mechanism for literature retrieval in this domain and must achieve high recall while maintaining reasonable precision—a challenging balance that existing prompt-based LLM approaches often struggle to achieve. A major limitation in this space is the lack of ground-truth best Boolean queries for each topic, which makes supervised fine-tuning impractical. AutoBool addresses this challenge by leveraging RL to directly optimize query generation against retrieval performance metrics, without requiring ideal target queries. To support this effort, we create and release the largest dataset of its kind: 65,588 topics in total for training and evaluating automatic Boolean query formulation. Experiments on our new dataset and two established datasets (CLEF TAR and Seed Collection) show that AutoBool significantly outperforms zero-shot/few-shot prompting and matches or exceeds the effectiveness of much larger GPT-based models (e.g., GPT-4o, O3) using smaller backbones. It also approaches the effectiveness of expert-authored queries while retrieving 10–16 times fewer documents. Ablation studies reveal the critical roles of model backbone, size, decoding temperature, and prompt design. Code and data are available at https://github.com/ielab/AutoBool.
Improving LLM Domain Certification with Pretrained Guide Models
Jiaqian Zhang | Zhaozhi Qian | Faroq AL-Tam | Ignacio Iacobacci | Muhammad AL-Qurishi | Riad Souissi
Large language models (LLMs) often generate off-domain or harmful responses when deployed in specialized, high-stakes domains, motivating the need for rigorous LLM domain certification. While the VALID algorithm (Emde et al., 2025) achieves formal domain certification guarantees using a guide model G trained from scratch on in-domain data, it suffers from poor generalization due to limited training. In this work, we propose PRISM, a novel approach that overcomes this key limitation by leveraging pretrained language models as guide models, enhanced via contrastive fine-tuning to sharply distinguish acceptable from refused content. We explore and experiment with variants of PRISM using different loss functions to ensure that the model exploits the rich world knowledge of pretrained models while remaining aligned to the target domain. We show that two variants of PRISM, PRISM-BC and PRISM-GA, achieve superior OOD rejection and tighter certification bounds across eight diverse data regimes and perturbations, establishing a more reliable approach to domain-adherent LLM deployment.
TDFlow: Agentic Workflows for Test Driven Development
Kevin Han | Siddharth Maddikayala | Tim Knappe | Om Patel | Austen Liao | Amir Barati Farimani
We introduce TDFlow, a novel test-driven agentic workflow that frames repository-scale software engineering as a test-resolution task, specifically designed to solve human-written tests. Given a set of tests, TDFlow repeatedly proposes, revises, and debugs repository-scale patches using precisely engineered sub-agents and tightly constrained tools. The workflow decomposes software engineering program repair into four components governed by respective sub-agents. This simple, forced decoupling of patch proposing, debugging, patch revision, and optional test generation (1) reduces the long-context burden on any individual sub-agent, (2) focuses each sub-agent on specific, pre-defined sub-tasks, and (3) allows for specialized performance improvement on specific sub-tasks. When provided human-written tests, TDFlow attains an 88.8% pass rate on SWE-Bench Lite (an absolute improvement of 27.8% over the next best baseline) and 94.3% on SWE-Bench Verified. In this work, we further show that the primary obstacle to human-level software engineering performance lies in writing successful reproduction tests. Manual inspection of the 800 TDFlow runs on SWE-Bench Lite and Verified uncovers only 7 instances of test hacking, which were subsequently counted as failures. We envision a human-LLM interactive system powered by TDFlow in which human developers write tests that are solved by LLM systems. Together, these results show that modern LLMs, when embedded in a narrowly engineered, test-driven workflow, already achieve human-level test resolution – with the final frontier for fully autonomous repository repair being accurate reproduction test generation.
Contrastive Learning with Narrative Twins for Modeling Story Salience
Igor Sterner | Alex Lascarides | Frank Keller
Understanding narratives requires identifying which events are most salient for a story’s progression. We present a contrastive learning framework for modeling narrative salience that learns story embeddings from narrative twins: stories that share the same plot but differ in surface form. Our model is trained to distinguish a story from both its narrative twin and a distractor with similar surface features but different plot. Using the resulting embeddings, we evaluate four narratologically motivated operations for inferring salience (deletion, shifting, disruption, and summarization). Experiments on short narratives from the ROCStories corpus and longer Wikipedia plot summaries show that contrastively learned story embeddings outperform a masked-language-model baseline, and that summarization is the most reliable operation for identifying salient sentences. If narrative twins are not available, random dropout can be used to generate the twins from a single story. Effective distractors can be obtained either by prompting LLMs or, in long-form narratives, by using different parts of the same story.
ExAnte: A Benchmark for Ex-Ante Inference in Large Language Models
Yachuan Liu | Xiaochun Wei | Lin Shi | Xinnuo Li | Bohan Zhang | Paramveer Dhillon | Qiaozhu Mei
Large language models (LLMs) struggle with ex-ante reasoning—making inferences or predictions without access to future information. Even under explicit temporal cutoffs, they often rely on internalized post-cutoff knowledge. To systematically evaluate this issue, we introduce a benchmark that assesses LLMs’ ex-ante inference ability across four tasks: stock prediction, question answering, Wikipedia event generation, and scientific publication generation. We quantify temporal leakage using a leakage rate metric, which measures models’ reliance on future information beyond cutoff timestamps, and a quality measure that evaluates task performance. Experimental results show that LLMs frequently violate temporal constraints across tasks, revealing persistent challenges in ex-ante reasoning. Our benchmark serves as a rigorous testbed for studying temporal reasoning in time-sensitive contexts and provides complete datasets, results, and evaluation resources to support future research on improving temporal consistency in modern LLMs.
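A minimal sketch of a leakage-rate computation consistent with the description above: given per-response annotations of whether the model relied on information dated after the cutoff, the leakage rate is simply the fraction of such responses, reported alongside a task quality measure. The annotation step itself (detecting post-cutoff facts) and all names below are assumed for illustration.

```python
# Leakage rate and quality over a batch of annotated model responses.
from dataclasses import dataclass

@dataclass
class Response:
    task: str
    uses_post_cutoff_info: bool   # produced by an (assumed) annotation/judging step
    correct: bool

def leakage_rate(responses):
    return sum(r.uses_post_cutoff_info for r in responses) / len(responses)

def quality(responses):
    return sum(r.correct for r in responses) / len(responses)

batch = [Response("stock", True, True), Response("qa", False, True), Response("wiki", False, False)]
print(f"leakage={leakage_rate(batch):.2f}, quality={quality(batch):.2f}")
```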
CRADLE Bench: A Clinician-Annotated Benchmark for Multi-Faceted Mental Health Crisis and Safety Risk Detection
Grace Byun | Rebecca Lipschutz | Sean T. Minton | Abigail Powers | Jinho D. Choi
Detecting mental health crisis situations such as suicide ideation, rape, domestic violence, child abuse, and sexual harassment is a critical yet underexplored challenge for language models. When such situations arise during user–model interactions, models must reliably flag them, as failure to do so can have serious consequences. In this work, we introduce CRADLE BENCH, a benchmark for multi-faceted crisis detection. Unlike previous efforts that focus on a limited set of crisis types, our benchmark covers seven types defined in line with clinical standards and is the first to incorporate temporal labels. Our benchmark provides 600 clinician-annotated evaluation examples and 420 development examples, together with a training corpus of around 4K examples automatically labeled using a majority-vote ensemble of multiple language models, which significantly outperforms single-model annotation. We further fine-tune six crisis detection models on subsets defined by consensus and unanimous ensemble agreement, providing complementary models trained under different agreement criteria. Content warning: This paper discusses sensitive topics such as suicide ideation, self-harm, rape, domestic violence, and child abuse.
Coordinates from Context: Using LLMs to Ground Complex Location References
Tessa Masis | Brendan O'Connor
Geocoding is the task of linking a location reference to an actual geographic location and is essential for many downstream analyses of unstructured text. In this paper, we explore the challenging setting of geocoding compositional location references. Building on recent work demonstrating LLMs’ abilities to reason over geospatial data, we evaluate LLMs’ geospatial knowledge versus reasoning skills relevant to our task. Based on these insights, we propose an LLM-based strategy for geocoding compositional location references. We show that our approach improves performance for the task and that a relatively small fine-tuned LLM can achieve comparable performance with much larger off-the-shelf models.
Discourse Graph Guided Document Translation with Large Language Models
Viet Thanh Pham | Minghan Wang | Hao-Han Liao | Thuy-Trang Vu
Adapting large language models to full document translation remains challenging due to the difficulty of capturing long-range dependencies and preserving discourse coherence throughout extended texts. While recent agentic machine translation systems mitigate context window constraints through multi-agent orchestration and persistent memory, they require substantial computational resources and are sensitive to memory retrieval strategies. We introduce TransGraph, a discourse-guided framework that explicitly models inter-chunk relationships through structured discourse graphs and selectively conditions each translation segment on relevant graph neighbourhoods rather than relying on sequential or exhaustive context. Across three document-level MT benchmarks spanning six languages and diverse domains, TransGraph consistently surpasses strong baselines in translation quality and terminology consistency while incurring significantly lower token overhead.
StarFlow: Generating Structured Workflow Outputs From Sketch Images
Patrice Bechard | Chao Wang | Amirhossein Abaskohi | Juan A. Rodriguez | Christopher Pal | David Vazquez | Spandana Gella | Sai Rajeswar | Perouz Taslakian
Workflows are a fundamental component of automation in enterprise platforms, enabling the orchestration of tasks, data processing, and system integrations. Despite being widely used, building workflows can be complex, often requiring manual configuration through low-code platforms or visual programming tools. To simplify this process, we explore the use of generative foundation models, particularly vision-language models (VLMs), to automatically generate structured workflows from visual inputs. Translating hand-drawn sketches or computer-generated diagrams into executable workflows is challenging due to the ambiguity of free-form drawings, variations in diagram styles, and the difficulty of inferring execution logic from visual elements. To address this, we introduce StarFlow, a framework for generating structured workflow outputs from sketches using vision-language models. We curate a diverse dataset of workflow diagrams – including synthetic, manually annotated, and real-world samples – to enable robust training and evaluation. We finetune and benchmark multiple vision-language models, conducting a series of ablation studies to analyze the strengths and limitations of our approach. Our results show that finetuning significantly enhances structured workflow generation, outperforming large vision-language models on this task.
Adaptive Helpfulness–Harmlessness Alignment with Preference Vectors
Ren-Wei Liang | Chin Ting Hsu | Chan-Hung Yu | Saransh Agrawal | Shih-Cheng Huang | Chieh-Yen Lin | Shang-Tse Chen | Kuan-Hao Huang | Shao-Hua Sun
Ensuring that large language models (LLMs) are both helpful and harmless is a critical challenge, as overly strict constraints can lead to excessive refusals, while permissive models risk generating harmful content. Existing approaches, such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO), attempt to balance these trade-offs but suffer from performance conflicts, limited controllability, and poor extendability. To address these issues, we propose Preference Vector, a novel framework inspired by task arithmetic. Instead of optimizing multiple preferences within a single objective, we train separate models on individual preferences, extract behavior shifts as preference vectors, and dynamically merge them at test time. This modular approach enables fine-grained, user-controllable preference adjustments and facilitates seamless integration of new preferences without retraining. Experiments show that our proposed Preference Vector framework improves helpfulness without excessive conservatism, allows smooth control over preference trade-offs, and supports scalable multi-preference alignment.
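A hedged sketch of the task-arithmetic-style merging the abstract describes: each preference vector is the parameter delta between a preference-specific fine-tuned model and the shared base model, and deltas are combined at test time with user-chosen coefficients. A dict of arrays stands in for real model state, and the coefficient values are illustrative.

```python
# Preference vectors as parameter deltas, merged with user-controlled coefficients.
import numpy as np

def preference_vector(finetuned, base):
    return {k: finetuned[k] - base[k] for k in base}

def merge(base, vectors, coeffs):
    merged = {k: v.copy() for k, v in base.items()}
    for vec, alpha in zip(vectors, coeffs):
        for k in merged:
            merged[k] += alpha * vec[k]
    return merged

rng = np.random.default_rng(0)
base = {"layer.weight": rng.normal(size=(4, 4))}
helpful = {"layer.weight": base["layer.weight"] + 0.1 * rng.normal(size=(4, 4))}
harmless = {"layer.weight": base["layer.weight"] + 0.1 * rng.normal(size=(4, 4))}

v_help = preference_vector(helpful, base)
v_harm = preference_vector(harmless, base)
model = merge(base, [v_help, v_harm], coeffs=[0.7, 0.3])   # adjustable trade-off at test time
print(model["layer.weight"].shape)
```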
How Reliable are Confidence Estimators for Large Reasoning Models? A Systematic Benchmark on High-Stakes Domains
Reza Khanmohammadi | Erfan Miahi | Simerjot Kaur | Charese Smiley | Ivan Brugere | Kundan S Thind | Mohammad M. Ghassemi
The miscalibration of Large Reasoning Models (LRMs) undermines their reliability in high-stakes domains, necessitating methods to accurately estimate the confidence of their long-form, multi-step outputs. To address this gap, we introduce the Reasoning Model Confidence estimation Benchmark (RMCB), a public resource of 347,496 reasoning traces from six popular LRMs across different architectural families. The benchmark is constructed from a diverse suite of datasets spanning high-stakes domains, including clinical, financial, legal, and mathematical reasoning, alongside complex general reasoning benchmarks, with correctness annotations provided for all samples. Using RMCB, we conduct a large-scale empirical evaluation of over ten distinct representation-based methods, spanning sequential, graph-based, and text-based architectures. Our central finding is a persistent trade-off between discrimination (AUROC) and calibration (ECE): text-based encoders achieve the best AUROC (0.672), while structurally-aware models yield the best ECE (0.148), with no single method dominating both. Furthermore, we find that increased architectural complexity does not reliably outperform simpler sequential baselines, suggesting a performance ceiling for methods relying solely on chunk-level hidden states. This work provides the most comprehensive benchmark for this task to date, establishing rigorous baselines and demonstrating the limitations of current representation-based paradigms.
SearchLLM: Detecting LLM Paraphrased Text by Measuring the Similarity with Regeneration of the Candidate Source via Search Engine
Hoang-Quoc Nguyen-Son | Minh-Son Dao | Koji Zettsu
With the advent of large language models (LLMs), it has become common practice for users to draft text and utilize LLMs to enhance its quality through paraphrasing. However, this process can sometimes result in the loss or distortion of the original intended meaning. Due to the human-like quality of LLM-generated text, traditional detection methods often fail, particularly when text is paraphrased to closely mimic original content. In response to these challenges, we propose a novel approach named SearchLLM, designed to identify LLM-paraphrased text by leveraging search engine capabilities to locate potential original text sources. By analyzing similarities between the input and regenerated versions of candidate sources, SearchLLM effectively distinguishes LLM-paraphrased content. SearchLLM is designed as a proxy layer, allowing seamless integration with existing detectors to enhance their performance. Experimental results across various LLMs demonstrate that SearchLLM consistently enhances the accuracy of recent detectors in detecting LLM-paraphrased text that closely mimics original content. Furthermore, SearchLLM also helps the detectors prevent paraphrasing attacks.
RoZO: Geometry-Aware Zeroth-Order Fine-Tuning on Low-Rank Adapters for Black-Box Large Language Models
Zichen Song | Weijia Li
Large language models (LLMs) have achieved remarkable success across a wide range of tasks, yet fine-tuning them efficiently under black-box or memory-constrained settings remains challenging. Parameter-efficient fine-tuning (PEFT) techniques such as LoRA alleviate memory usage by restricting updates to low-rank adapters, while zeroth-order (ZO) optimization further avoids back-propagation by estimating gradients from function evaluations. Recent work, such as LOZO, leverages random low-rank perturbations to reduce the variance of ZO estimates, but it overlooks the intrinsic geometric structure of LoRA adapters and suffers from unstable convergence and limited integration with adaptive optimizers. To address these limitations, we propose RoZO, a Riemannian zeroth-order optimization framework that constrains updates to the tangent space of the LoRA manifold. By exploiting geometry-aware updates with parallel transport, adaptive preconditioning, and trust-region control, RoZO achieves more stable convergence, tighter variance bounds, and superior performance compared to existing ZO methods.
Mitigating Degree Bias in Hypergraphs via Attribute-as-Structure Approach
Ryusei Nishide | Makoto Miwa
Entity representation learning on hypergraphs is hindered by degree bias, where nodes with sparse connections suffer from limited structural information for aggregation. Prevailing “attribute-as-feature” approaches, which treat rich textual attributes (e.g., titles, abstracts, keywords) merely as node features, fail to address this structurally rooted problem as they do not create new aggregation pathways. To overcome this limitation, we propose a novel “attribute-as-structure” approach specifically designed for heterogeneous hypergraphs. Our approach integrates attributes directly into the hypergraph topology as distinct node types, creating new structural pathways to enrich sparsely connected entities while preserving semantic distinctiveness within complex many-to-many hyperedge interactions. We introduce an entity-attribute aware learning framework featuring two key innovations: (1) a specialized heterogeneous hypergraph encoder with dual attention mechanisms—self-attention for entity-entity relationships and cross-type attention for entity-attribute relevance, and (2) Attribute-Attentive Contrastive Learning (AACL), a novel objective that dynamically weighs attribute importance while explicitly aligning entity representations with their structural attributes. Experiments on multiple hypergraph datasets demonstrate consistent improvements in node classification performance, with particularly significant gains for structurally sparse nodes, demonstrating the effectiveness of our approach for degree bias mitigation.
Generative Personality Simulation via Theory-Informed Structured Interview
Pengda Wang | Huiqi Zou | Han Jiang | Hanjie Chen | Tianjun Sun | Xiaoyuan Yi | Ziang Xiao | Frederick L. Oswald
Despite their potential as human proxies, LLMs often fail to generate heterogeneous data with human-like diversity, thereby diminishing their value in advancing social science research. To address this gap, we propose a novel method to incorporate psychological insights into LLM simulation through the Personality Structured Interview (PSI). PSI leverages psychometric scale-development procedures to capture personality-related linguistic information from a formal psychological perspective. To systematically evaluate simulation fidelity, we developed a measurement theory grounded evaluation procedure that considers the latent construct nature of personality and evaluates its reliability, structural validity, and external validity. Results from three experiments demonstrate that PSI effectively improves human-like heterogeneity in LLM-simulated personality data and predicts personality-related behavioral outcomes. We further offer a theoretical framework for designing theory-informed structured interviews to enhance the reliability and effectiveness of LLMs in simulating human-like data for broader psychometric research.
Large Language Models (LLMs) have achieved substantial progress in alignment, ensuring safer and more reliable outputs. However, jailbreak attacks can still bypass these safeguards and provoke harmful responses from well-aligned models. While some studies have achieved defenses against jailbreak attacks by modifying output distributions or detecting harmful content, the underlying mechanism of these attacks remains elusive. In this work, we present a novel neuron-level interpretability method that focuses on the role of safety-related knowledge neurons. Unlike existing approaches, our method projects the model’s internal representation into a more consistent and interpretable vocabulary space. We then show that adjusting the activation of safety-related neurons can effectively control the model’s behavior, with a mean ASR higher than 97%. Building on this insight, we propose SafeTuning, a fine-tuning strategy that reinforces safety-critical neurons to improve model robustness against jailbreaks. SafeTuning consistently reduces attack success rates across multiple LLMs and outperforms all four baseline defenses. These findings offer a new perspective on understanding and defending against jailbreak attacks.
ELLA: Efficient Lifelong Learning for Adapters in Large Language Models
Shristi Das Biswas | Yue Zhang | Anwesan Pal | Radhika Bhargava | Kaushik Roy
Large Language Models (LLMs) suffer from severe catastrophic forgetting when adapted sequentially to new tasks in a continual learning (CL) setting. Existing approaches are fundamentally limited: replay-based methods are impractical and could potentially violate privacy, while strict orthogonality-based methods collapse under scale: each new task is projected onto an orthogonal complement, progressively reducing the residual degrees of freedom and eliminating forward transfer by forbidding overlap in shared representations. In this work, we introduce ELLA, a training framework built on the principle of selective subspace de-correlation. Rather than forbidding all overlap, ELLA explicitly characterizes the structure of past updates and penalizes alignments along their high-energy, task-specific directions, while preserving freedom in the low-energy residual subspaces to enable transfer. Formally, this is realized via a lightweight regularizer on a single aggregated update matrix. This mechanism is proven to be an anisotropic shrinkage operator that bounds interference, yielding a penalty that is both memory- and compute-constant regardless of task sequence length. ELLA requires no data replay, no architectural expansion, and negligible storage. Empirically, it achieves state-of-the-art CL performance on three popular benchmarks spanning both classification and generative tasks, with relative accuracy gains of up to 9.6% and a 35× smaller memory footprint. Furthermore, ELLA scales robustly across architectures and actively enhances the model’s zero-shot generalization performance on unseen tasks, establishing a principled and scalable solution for constructive lifelong LLM adaptation.
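An illustrative sketch of the selective subspace de-correlation idea: take the SVD of the aggregated past update matrix and penalize the energy of a new candidate update that falls along the top (high-energy) singular directions, leaving the low-energy residual subspace free for transfer. The rank threshold and the quadratic penalty form are our assumptions, not the paper's exact regularizer.

```python
# Penalize only the component of a new update that overlaps the high-energy
# directions of past task-specific updates.
import numpy as np

def subspace_decorrelation_penalty(past_updates, new_update, top_k=4):
    # past_updates, new_update: (d_out, d_in) matrices
    U, s, _ = np.linalg.svd(past_updates, full_matrices=False)
    U_top = U[:, :top_k]                               # high-energy, task-specific directions
    projected = U_top.T @ new_update                   # overlap of the new update with them
    return float(np.sum(projected ** 2))

rng = np.random.default_rng(0)
past = rng.normal(size=(64, 16))
aligned = past * 0.1                                   # update lying in past update directions
unrelated = rng.normal(size=(64, 16)) * 0.1            # update mostly in the residual subspace
print(subspace_decorrelation_penalty(past, aligned) >
      subspace_decorrelation_penalty(past, unrelated))  # True: aligned updates are penalized more
```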
LingGen: Scalable Multi-Attribute Linguistic Control via Power-Law Masking
Mohamed Elgaar | Hadi Amiri
Mohamed Elgaar | Hadi Amiri
We present LingGen, a controlled text generation model that allows fine-grained control over a large number of real-valued linguistic attributes. It encodes target attribute values with a dedicated linguistic attribute encoder and conditions the language model by injecting the resulting representation through the beginning-of-sequence (BOS) embeddings. To improve robustness when controlling different attribute subsets, we introduce P-MASKING, which samples per-example attribute masking rates from a truncated Pareto distribution during training. Across 1-40 control attributes, LingGen achieves the lowest average control error among evaluated methods, while remaining efficient at inference and receiving the highest fluency scores in human evaluation. Ablations show that Pareto-sampled masking and BOS-based injection are effective choices compared to alternative masking and integration variants.
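As a rough illustration of the P-MASKING idea (a sketch under assumed parameters: the shape, truncation bounds, and the NaN "unspecified" convention below are not taken from the paper), per-example masking rates can be drawn from a Pareto distribution truncated to a fixed interval and then applied attribute-wise:

```python
import numpy as np

def truncated_pareto(n, alpha=1.5, x_min=0.05, x_max=1.0, rng=None):
    # Inverse-CDF sampling from a Pareto(alpha) distribution truncated to
    # [x_min, x_max]; shape and bounds are illustrative assumptions.
    rng = rng or np.random.default_rng(0)
    u = rng.uniform(size=n)
    norm = 1.0 - (x_min / x_max) ** alpha
    return x_min / (1.0 - u * norm) ** (1.0 / alpha)

def mask_attributes(attr_values, rates, rng=None):
    # attr_values: (batch, n_attrs) real-valued control targets. Each example
    # drops each attribute with probability equal to its masking rate; NaN is
    # used here as a stand-in marker for "attribute unspecified".
    rng = rng or np.random.default_rng(1)
    keep = rng.uniform(size=attr_values.shape) >= rates[:, None]
    return np.where(keep, attr_values, np.nan)

batch = np.random.rand(4, 8)             # 4 examples, 8 linguistic attributes
rates = truncated_pareto(batch.shape[0]) # one masking rate per example
masked = mask_attributes(batch, rates)
```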
RECIPE-TKG: From Sparse History to Structured Reasoning for LLM-based Temporal Knowledge Graph Completion
Ömer Faruk Akgül | Feiyu Zhu | Yuxin Yang | Rajgopal Kannan | Viktor Prasanna
Ömer Faruk Akgül | Feiyu Zhu | Yuxin Yang | Rajgopal Kannan | Viktor Prasanna
Temporal Knowledge Graphs (TKGs) represent dynamic facts as timestamped relations between entities. While Large Language Models (LLMs) show promise for TKG completion, current approaches typically apply generic pipelines (neighborhood sampling, supervised fine-tuning, uncalibrated inference) without task-specific adaptation to temporal relational reasoning. Through systematic analysis under unified evaluation, we reveal three key failure modes: (1) retrieval strategies miss multi-hop dependencies when target entities are not directly observed in history, (2) standard fine-tuning reinforces memorization over relational generalization, and (3) uncalibrated generation produces contextually implausible entities. We present RECIPE-TKG, a parameter-efficient framework that addresses each limitation through principled, task-specific design: rule-based multi-hop sampling for structural grounding, contrastive fine-tuning to shape relational compatibility, and test-time semantic filtering for contextual alignment. Experiments on four benchmarks show that RECIPE-TKG outperforms prior LLM-based methods across input regimes, achieving up to 22.4% relative improvement in Hits@10, with particularly strong gains when historical evidence is sparse or indirect.
Barriers to Discrete Reasoning with Transformers: A Survey Across Depth, Exactness, and Bandwidth
Michelle Yuan | Weiyi Sun | Amir H. Rezaeian | Jyotika Singh | Sandip Ghoshal | Yao-Ting Wang | Miguel Ballesteros | Yassine Benajiba
Michelle Yuan | Weiyi Sun | Amir H. Rezaeian | Jyotika Singh | Sandip Ghoshal | Yao-Ting Wang | Miguel Ballesteros | Yassine Benajiba
Transformers have become the foundational architecture for a broad spectrum of sequence modeling applications, underpinning state-of-the-art systems in natural language processing, vision, and beyond. However, their theoretical limitations in discrete reasoning tasks, such as arithmetic, logical inference, and algorithmic composition, remain a critical open problem. In this survey, we synthesize recent advances from three theoretical perspectives: circuit complexity, approximation theory, and communication complexity, to clarify the structural and computational barriers that transformers face when performing symbolic computations. By connecting these established theoretical frameworks, we provide an accessible and unified account of why current transformer architectures struggle to implement exact discrete algorithms, even as they excel at pattern matching and interpolation. We review key definitions, seminal results, and illustrative examples, highlighting challenges such as depth constraints, difficulty approximating discontinuities, and bottlenecks in inter-token communication. Finally, we discuss implications for model design and suggest promising directions for overcoming these foundational limitations.
PaperSearchQA: Learning to Search and Reason over Scientific Papers with RLVR
James Burgess | Jan N. Hansen | Duo Peng | Yuhui Zhang | Alejandro Lozano | Min Woo Sun | Emma Lundberg | Serena Yeung-Levy
James Burgess | Jan N. Hansen | Duo Peng | Yuhui Zhang | Alejandro Lozano | Min Woo Sun | Emma Lundberg | Serena Yeung-Levy
Search agents are language models (LMs) that reason and search knowledge bases (or the web) to answer questions; recent methods supervise only the final answer accuracy using reinforcement learning with verifiable rewards (RLVR). Most RLVR search agents tackle general-domain QA, which limits their relevance to technical AI systems in science, engineering, and medicine. In this work we propose training agents to search and reason over scientific papers – this tests technical question-answering, it is directly relevant to real scientists, and the capabilities will be crucial to future AI Scientist systems. Concretely, we release a search corpus of 16 million biomedical paper abstracts and construct a challenging factoid QA dataset called PaperSearchQA with 60k samples answerable from the corpus, along with benchmarks. We train search agents in this environment to outperform non-RL retrieval baselines; we also perform further quantitative analysis and observe interesting agent behaviors like planning, reasoning, and self-verification. Our corpus, datasets, and benchmarks are usable with the popular Search-R1 codebase for RLVR training; they are available on Hugging Face. Finally, our data creation methods are scalable and easily extendable to other scientific domains.
Too Open for Opinion? Embracing Open-Endedness in Large Language Models for Social Simulation
Bolei Ma | Yong Cao | Indira Sen | Anna-Carolina Haensch | Frauke Kreuter | Barbara Plank | Daniel Hershcovich
Bolei Ma | Yong Cao | Indira Sen | Anna-Carolina Haensch | Frauke Kreuter | Barbara Plank | Daniel Hershcovich
Large Language Models (LLMs) are increasingly used to simulate public opinion and other social phenomena. Most current studies constrain these simulations to multiple-choice or short-answer formats for ease of scoring and comparison, but such closed designs overlook the inherently generative nature of LLMs. In this position paper, we argue that open-endedness, using free-form text that captures topics, viewpoints, and reasoning processes "in" LLMs, is essential for realistic social simulation. Drawing on decades of survey methodology research and recent advances in NLP, we explain why this open-endedness is valuable in LLM social simulations, showing how it can improve measurement and design, support exploration of unanticipated views, and reduce researcher-imposed directive bias. It also captures expressiveness and individuality, aids in pretesting, and ultimately enhances methodological utility. We call for novel practices and evaluation frameworks that leverage rather than constrain the open-ended generative diversity of LLMs, creating synergies between NLP and social science.
Respecting Temporal-Causal Consistency: Entity-Event Knowledge Graph for Retrieval-Augmented Generation
Ze Yu Zhang | Zitao Li | Yaliang Li | Bolin Ding | Bryan Kian Hsiang Low
Ze Yu Zhang | Zitao Li | Yaliang Li | Bolin Ding | Bryan Kian Hsiang Low
Knowledge Extraction on Semi-Structured Content: Does It Remain Relevant for Question Answering in the Era of LLMs?
Kai Sun | Yin Huang | Srishti Mehra | Mohammad Kachuee | Xilun Chen | Renjie Tao | Zhaojiang Lin | Andrea Jessee | Nirav Shah | Alex L Betty | Yue Liu | Anuj Kumar | Wen-tau Yih | Xin Luna Dong
Kai Sun | Yin Huang | Srishti Mehra | Mohammad Kachuee | Xilun Chen | Renjie Tao | Zhaojiang Lin | Andrea Jessee | Nirav Shah | Alex L Betty | Yue Liu | Anuj Kumar | Wen-tau Yih | Xin Luna Dong
The advent of Large Language Models (LLMs) has significantly advanced web-based Question Answering (QA) systems over semi-structured content, raising questions about the continued utility of knowledge extraction for question answering. This paper investigates the value of triple extraction in this new paradigm by extending an existing benchmark with knowledge extraction annotations and evaluating commercial and open-source LLMs of varying sizes. Our results show that web-scale knowledge extraction remains a challenging task for LLMs. Despite achieving high QA accuracy, LLMs can still benefit from knowledge extraction, through augmentation with extracted triples and multi-task learning. These findings provide insights into the evolving role of knowledge triple extraction in web-based QA and highlight strategies for maximizing LLM effectiveness across different model sizes and resource settings.
Images often communicate more than they literally depict: a set of tools can suggest an occupation and a cultural artifact can suggest a tradition. This kind of indirect visual reference, known as visual metonymy, invites viewers to recover a target concept via associated cues rather than explicit depiction. In this work, we present the first computational investigation of visual metonymy. We introduce a novel pipeline grounded in semiotic theory that leverages large language models and text-to-image models to generate metonymic visual representations. Using this framework, we construct ViMET, the first visual metonymy dataset comprising 2,000 multiple-choice questions to evaluate the cognitive reasoning abilities in multimodal language models. Experimental results on our dataset reveal a significant gap between human performance (86.9%) and state-of-the-art vision-language models (65.9%), highlighting limitations in machines’ ability to interpret indirect visual references. Our dataset is publicly available at: https://github.com/cincynlp/ViMET.
A Tale of Two Scripts: Transliteration and Post-Correction for Judeo-Arabic
Juan Moreno Gonzalez | Bashar Alhafni | Nizar Habash
Juan Moreno Gonzalez | Bashar Alhafni | Nizar Habash
Judeo-Arabic refers to Arabic variants historically spoken by Jewish communities across the Arab world, primarily during the Middle Ages. Unlike standard Arabic, it is written in Hebrew script by Jewish writers and for Jewish audiences. Transliterating Judeo-Arabic into Arabic script is challenging due to ambiguous letter mappings, inconsistent orthographic conventions, and frequent code-switching into Hebrew. In this paper, we introduce a two-step approach to automatically transliterate Judeo-Arabic into Arabic script: simple character-level mapping followed by post-correction to address grammatical and orthographic errors. We also present the first benchmark evaluation of LLMs on this task. Finally, we show that transliteration enables Arabic NLP tools to perform morphosyntactic tagging and machine translation, which would not have been feasible on the original texts. We make our code and data publicly available.
Multimodal Evaluation of Russian-language Architectures
Artem Chervyakov | Ulyana Isaeva | Anton Emelyanov | Artem Safin | Maria Tikhonova | Alexander Kharitonov | Yulia Lyakh | Petr Surovtsev | Denis Shevelev | Vildan Saburov | Vasily Konovalov | Elisei Rykov | Ivan Sviridov | Amina Miftakhova | Ilseyar Alimova | Alexander Panchenko | Alexander Kapitanov | Alena Fenogenova
Artem Chervyakov | Ulyana Isaeva | Anton Emelyanov | Artem Safin | Maria Tikhonova | Alexander Kharitonov | Yulia Lyakh | Petr Surovtsev | Denis Shevelev | Vildan Saburov | Vasily Konovalov | Elisei Rykov | Ivan Sviridov | Amina Miftakhova | Ilseyar Alimova | Alexander Panchenko | Alexander Kapitanov | Alena Fenogenova
Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce MERA Multi, an open multimodal evaluation framework for Russian-language architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.
Don’t Judge a Book by its Cover: Testing LLMs’ Robustness Under Logical Obfuscation
Abhilekh Borah | Shubhra Ghosh | Kedar Joshi | Aditya Kumar Guru | Kripabandhu Ghosh
Abhilekh Borah | Shubhra Ghosh | Kedar Joshi | Aditya Kumar Guru | Kripabandhu Ghosh
Tasks such as solving arithmetic equations, evaluating truth tables, and completing syllogisms are handled well by large language models (LLMs) in their standard form, but they often fail when the same problems are posed in logically equivalent yet obfuscated formats. To study this vulnerability, we introduce Logifus, a structure-preserving logical obfuscation framework, and, utilizing this, we present LogiQAte, a first-of-its-kind diagnostic benchmark with 1,108 questions across four reasoning tasks: (i) Obfus FOL (first-order logic entailment under equivalence-preserving rewrites), (ii) Obfus Blood Relation (family-graph entailment under indirect relational chains), (iii) Obfus Number Series (pattern induction under symbolic substitutions), and (iv) Obfus Direction Sense (navigation reasoning under altered directions and reference frames). Across all the tasks, evaluating six state-of-the-art models, we find that obfuscation severely degrades zero-shot performance, with performance dropping on average by 47% for GPT-4o, 27% for GPT-5, and 22% for the reasoning model o4-mini. Our findings reveal that current LLMs parse questions without deep understanding, highlighting the urgency of building models that genuinely comprehend and preserve meaning beyond surface form.
I know you are different! Towards Persona Driven Knowledge-infused Dialogue Assistant
Shifali Agrahari | Moushumi Mahato | Abhisek Tiwari | Javaid Nabi
Shifali Agrahari | Moushumi Mahato | Abhisek Tiwari | Javaid Nabi
Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning
Qifan Yu | Zhenyu He | Sijie Li | Zhou Xun | Jun Zhang | Jingjing Xu | Di He
Qifan Yu | Zhenyu He | Sijie Li | Zhou Xun | Jun Zhang | Jingjing Xu | Di He
Chain-of-Thought (CoT) prompting has emerged as a powerful technique for enhancing language models’ reasoning capabilities. However, generating long and correct CoT trajectories is challenging. Recent studies have demonstrated that Looped Transformers, standard Transformers with a cross-block parameter-sharing architecture, possess remarkable length generalization capabilities, but their limited generality and adaptability prevent them from serving as an alternative to auto-regressive solutions. To better leverage the strengths of Looped Transformers, we propose **RELAY** (**RE**asoning through **L**oop **A**lignment iterativel**Y**). Specifically, we align the steps of Chain-of-Thought (CoT) reasoning with loop iterations and apply intermediate supervision during the training of Looped Transformers. This additional iteration-wise supervision not only preserves the Looped Transformer’s ability for length generalization but also enables it to predict CoT reasoning steps for unseen data. Therefore, we leverage this Looped Transformer to generate accurate reasoning chains for complex problems that exceed the training length, which will then be used to fine-tune an auto-regressive model. We conduct extensive experiments, and the results demonstrate the effectiveness of our approach, with significant improvements in the performance of the auto-regressive model.
Despite the rapid advancements of Large Language Models (LLMs), safety risks remain a critical challenge for low-resource languages. Existing safety datasets are predominantly English-centric, limiting progress in multilingual safety alignment. As a result, low-resource expert models—fine-tuned on their respective instruction datasets—tend to exhibit higher unsafety rates compared to their high-resource counterparts. In this work, we propose a safety-aware layer swapping method that transfers safety alignment from an English safety expert to low-resource language experts without additional training. To further enhance transferability, our method adaptively selects or blends modules based on their degree of specialization. Our approach preserves performance on general language understanding tasks while enhancing safety in the target languages. Experimental results show that the proposed method achieves comparable performance to the language expert on general benchmarks such as MMMLU, BELEBELE, and MGSM, while producing more aligned and less harmful responses on the MultiJail safety benchmark.
Measuring Idiomaticity in Text Embedding Models with epsilon-compositionality
Sondre Wold | Étienne Simon | Erik Velldal | Lilja Øvrelid
Sondre Wold | Étienne Simon | Erik Velldal | Lilja Øvrelid
The principle of compositionality, which concerns the construction of meaning from constituent parts, is a longstanding topic in various disciplines, most commonly associated with formal semantics. In NLP, recent studies have focused on the compositional properties of text embedding models, particularly regarding their sensitivity to idiomatic expressions, as idioms have traditionally been seen as non-compositional. In this paper, we argue that it is unclear how previous work relates to formal definitions of the principle. To address this limitation, we take a theoretically motivated approach based on definitions in formal semantics. We present 𝜀-compositionality, a continuous relaxation of compositionality derived from these definitions. We measure 𝜀-compositionality on a dataset containing both idiomatic and non-idiomatic sentences, providing a theoretically motivated assessment of sensitivity to idiomaticity. Our findings indicate that most text embedding models differentiate between idiomatic and non-idiomatic phrases, although to varying degrees.
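A toy proxy for the underlying intuition, not the paper's formal 𝜀-compositionality definition, is to compare the embedding of a phrase with an additive composition of its constituents' embeddings:

```python
import numpy as np

def compositionality_gap(phrase_vec, constituent_vecs):
    # Illustrative proxy only: cosine distance between the embedding of the
    # full phrase and the mean of its constituents' embeddings. Larger values
    # suggest the encoder treats the phrase less compositionally.
    composed = np.mean(constituent_vecs, axis=0)
    cos = phrase_vec @ composed / (np.linalg.norm(phrase_vec) * np.linalg.norm(composed))
    return 1.0 - cos

# Random vectors stand in for encoder outputs of "kick", "bucket", and the
# idiom "kick the bucket".
rng = np.random.default_rng(0)
print(compositionality_gap(rng.normal(size=64),
                           [rng.normal(size=64), rng.normal(size=64)]))
```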
Are My Optimized Prompts Compromised? Exploring Vulnerabilities of LLM-based Optimizers
Andrew Zhao | Reshmi Ghosh | Vitor R. Carvalho | Emily Lawton | Keegan Hines | Gao Huang | Jack W. Stokes
Andrew Zhao | Reshmi Ghosh | Vitor R. Carvalho | Emily Lawton | Keegan Hines | Gao Huang | Jack W. Stokes
Large language model (LLM) systems increasingly power everyday AI applications such as chatbots, computer-use assistants, and autonomous robots, where performance often depends on carefully hand-crafted prompts. LLM-based prompt optimizers reduce that effort by iteratively refining prompts from scored feedback, yet the security of this optimization stage remains underexamined. We present the first systematic analysis of poisoning risks in LLM-based prompt optimization. Using HarmBench, we find systems are substantially more vulnerable to manipulated feedback than to query poisoning alone: feedback-based attacks raise attack success rate (ASR) by up to ΔASR = 0.48. We introduce a simple fake reward attack that requires no access to the reward model and significantly increases vulnerability. We also propose a lightweight highlighting defense that reduces the fake reward ΔASR from 0.23 to 0.07 without degrading utility. These results establish prompt optimization pipelines as a first-class attack surface and motivate stronger safeguards for feedback channels and optimization frameworks.
MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling
Qian Wang | Ziqi Huang | Ruoxi Jia | Paul Debevec | Ning Yu
Qian Wang | Ziqi Huang | Ruoxi Jia | Paul Debevec | Ning Yu
Despite recent advances, long-sequence video generation frameworks still suffer from significant limitations: poor assistive capability, suboptimal visual quality, and limited expressiveness. To mitigate these limitations, we propose MAViS, an end-to-end multi-agent collaborative framework for long-sequence video storytelling. MAViS orchestrates specialized agents across multiple stages, including script writing, shot designing, character modeling, keyframe generation, video animation, and audio generation. In each stage, agents operate under the 3E Principle—Explore, Examine, and Enhance—to ensure the completeness of intermediate outputs. Considering the capability limitations of current generative models, we propose the Script Writing Guidelines to optimize compatibility between scripts and generative tools. Experimental results demonstrate that MAViS achieves state-of-the-art performance in assistive capability, visual quality, and video expressiveness. Its modular framework further enables scalability with diverse generative models and tools. With just a brief prompt, MAViS enables users to rapidly explore diverse visual storytelling and creative directions for sequential video generation by efficiently producing high-quality, complete long-sequence videos. To the best of our knowledge, MAViS is the only framework that provides multimodal design output – videos with narratives and background music.
Computational Benchmarks for Egyptian Arabic Child Directed Speech
Salam Khalifa | Abed Qaddoumi | Nizar Habash | Owen Rambow
Salam Khalifa | Abed Qaddoumi | Nizar Habash | Owen Rambow
We present AraBabyTalk-EGY, an enriched release of the Egyptian Arabic CHILDES corpus that opens the child-adult interaction genre to modern Arabic NLP research. Starting from the original CHILDES recordings and IPA transcriptions of caregiver-child sessions, we (i) map each IPA token to fully diacritized Arabic script, and (ii) add core part-of-speech tags and lemmas aligned with existing dialectal Arabic morphological resources. These layers yield ~26K annotated tokens suitable for both text- and speech-based NLP tasks. We provide a benchmark on morphological disambiguation and Arabic ASR. We outline lexical and morphosyntactic differences between AraBabyTalk-EGY and general Egyptian Arabic resources, highlighting the value of genre-specific training data for language acquisition studies and Arabic speech technology.
K-LegalDeID: A Benchmark Dataset and KLUEBERT-CRF for De-identification in Korean Court Judgments
Wooseok Choi | Hyungbin Kim | Yon Dohn Chung
Wooseok Choi | Hyungbin Kim | Yon Dohn Chung
The Korean legal system mandates public access to court judgments to ensure judicial transparency. However, this requirement conflicts with privacy protection obligations due to the prevalence of Personally Identifiable Information (PII) in legal documents. To address this challenge, we introduce **K-LegalDeID**, a large-scale benchmark dataset and an efficient KLUEBERT-CRF model for de-identification of Korean court judgments. Our primary contribution is a new large-scale benchmark dataset spanning 39 legal domains, whose quality is validated by a high inter-annotator agreement (IAA), with a Fleiss’ Kappa of 0.7352. Our results demonstrate that a lightweight KLUEBERT-CRF model, when trained on our dataset, achieves state-of-the-art performance with an entity-level micro F1 score of 0.9923. Our end-to-end framework offers a practical and computationally efficient solution for real-world legal systems.
Specialization through Collaboration: Understanding Expert Interaction in Mixture-of-Expert Large Language Models
Yuanbo Tang | Naifan Zhang | Yan Tang | Meixuan Chen | Shuhan Huang | Tingyu Cao | Yang Li
Yuanbo Tang | Naifan Zhang | Yan Tang | Meixuan Chen | Shuhan Huang | Tingyu Cao | Yang Li
Mixture-of-Experts (MoE) based large language models (LLMs) have gained popularity due to their multi-task capability, where each input token activates only a subset of "expert" subnetworks. However, whether each expert can truly specialize in a certain task remains poorly understood, while activation analysis shows frequent cross-layer co-activation of experts for the same input, resembling a collaborative behavior. In this paper, we use a dictionary learning approach to show that experts in MoE LLMs form hierarchical and semantically coherent collaborative groups that correspond to specific linguistic and cognitive functions (e.g., mathematical reasoning, syntactic processing), mirroring the specialized functional regions observed in neuroscience. Furthermore, leveraging these discovered expert groups enables significant model compression with minimal performance degradation, outperforming existing methods by 2.5% while enabling up to 50% expert reduction. These findings provide the first systematic analysis of expert collaboration mechanisms in MoE LLMs, revealing that specialization emerges from joint activation of experts across all layers. We further developed an interactive visualization platform that enables researchers to explore expert collaboration patterns and their semantic associations.
Compact Language Models with Iterative Text Refinement for Health Dialogue Summarization
Kellen Tan Cheng | Ganesh Ramesh | Nafiul Rashid | Geoffrey Jay Tso | Jilong Kuang
Kellen Tan Cheng | Ganesh Ramesh | Nafiul Rashid | Geoffrey Jay Tso | Jilong Kuang
Health wellness agents typically rely on large language models (LLMs) for response generation, where contextual information from a user’s conversation history can be used for response grounding and personalization. High-quality conversation summaries are one such method which can reduce the number of input tokens during response generation, decreasing overhead and inference latency. However, directly using LLMs for this task is infeasible due to the scale of the task, the compute overhead, and health data compliance regulations. Furthermore, ground truth for real-world datasets is scarce due to privacy concerns and the high cost of health expert annotators. These factors necessitate the development of small, potentially on-device, language models capable of health dialogue summarization, particularly in the absence of ground truth labels. In this paper, we first present a comprehensive empirical study that benchmarks a variety of state-of-the-art smaller language models to better understand their baseline capabilities. Second, we present an unsupervised method that uses the summaries from multiple models, refined with iterative feedback, to generate high-quality summaries of health dialogues. Experiments illustrate that our method is able to outperform baselines on both open-source and proprietary benchmarks. Notably, our method can be run viably on local compute without a GPU, using just a single MacBook with 16 GB of memory.
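A minimal sketch of the multi-model, iterative-feedback idea described above; the summarizer and refiner callables are hypothetical stand-ins for small language models, not the paper's pipeline.

```python
def iterative_refine(dialogue, summarizers, refiner, n_rounds=2):
    # Several small models each draft a summary, then a refiner repeatedly
    # improves the current summary given the dialogue and all drafts so far.
    drafts = [summarize(dialogue) for summarize in summarizers]
    summary = refiner(dialogue, drafts)
    for _ in range(n_rounds - 1):
        summary = refiner(dialogue, drafts + [summary])
    return summary

# Toy demo with string heuristics standing in for on-device language models.
toy_summarizers = [lambda d: d.split(". ")[0] + ".", lambda d: d[:48]]
toy_refiner = lambda d, drafts: max(drafts, key=len)
print(iterative_refine("User slept five hours. Coach suggested an earlier bedtime.",
                       toy_summarizers, toy_refiner))
```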
Mind the Gap: Benchmarking LLM Uncertainty and Calibration with Specialty-Aware Clinical QA and Reasoning-Based Behavioural Features
Alberto Testoni | Iacer Calixto
Alberto Testoni | Iacer Calixto
Reliable uncertainty quantification (UQ) is essential when employing large language models (LLMs) in high-risk domains such as clinical question answering (QA). In this work, we evaluate uncertainty estimation methods for clinical QA focusing, for the first time, on eleven clinical specialties and six question types, and across ten open-source LLMs (general-purpose, biomedical, and reasoning models), alongside representative proprietary models. We analyze score-based UQ methods, present a case study introducing a novel lightweight method based on behavioral features derived from reasoning-oriented models, and examine conformal prediction as a complementary set-based approach. Our findings reveal that uncertainty reliability is not a monolithic property, but one that depends on clinical specialty and question type due to shifts in calibration and discrimination. Our results highlight the need to select or ensemble models based on their distinct, complementary strengths and clinical use.
Controlling Reading Ease with Gaze-Guided Text Generation
Andreas Säuberli | Darja Jepifanova | Diego Frassinelli | Barbara Plank
Andreas Säuberli | Darja Jepifanova | Diego Frassinelli | Barbara Plank
The way our eyes move while reading can tell us about the cognitive effort required to process the text. In the present study, we use this fact to generate texts with controllable reading ease. Our method employs a model that predicts human gaze patterns to steer language model outputs towards eliciting certain reading behaviors. We evaluate the approach in an eye-tracking experiment with native and non-native speakers of English. The results demonstrate that the method is effective at making the generated texts easier or harder to read, measured both in terms of reading times and perceived difficulty of the texts. A statistical analysis reveals that the changes in reading behavior are mostly due to features that affect lexical processing. Possible applications of our approach include text simplification for information accessibility and generation of personalized educational material for language learning.
PictureStories: Predicting the Task Adherence of Language Learner Answers to a Picture Story-Based Writing Task
Marie Bexte | Andrew Caines | Diane Nicholls | Paula Buttery | Torsten Zesch
Marie Bexte | Andrew Caines | Diane Nicholls | Paula Buttery | Torsten Zesch
We investigate the automated evaluation of English language learner answers to writing tasks featuring picture stories. This is usually limited to language proficiency only, neglecting the context of the picture. Instead, our analysis focuses on task adherence, which for example allows detection of off-topic answers. Since there is a lack of suitable training and evaluation data, our first step is to build the PictureStories dataset. To this end, we develop a marking rubric that covers task adherence with respect to both form and content. Six annotators mark 713 learner answers written in response to one of five picture stories. Having assembled the dataset, we then explore to what extent task adherence can be predicted automatically. Our experiments assume a scenario where no or just a few labelled answers are available for the picture story which is being marked. For form-focused criteria, we find that it is beneficial to finetune models across tasks. With content-focused criteria, few-shot prompting Qwen emerges as the best-performing method. We examine the trade-off between including the story image vs. example answers in the prompt and find that examples suffice in many cases. While for some LLMs, few-shot prompting results may look promising on the surface, we demonstrate that a much simpler method can do just as well when shown the same examples.
Assessing the Impact of Typological Features on Multilingual Machine Translation in the Age of Large Language Models
Vitalii Hirak | Jaap Jumelet | Arianna Bisazza
Vitalii Hirak | Jaap Jumelet | Arianna Bisazza
Despite major advances in multilingual modeling, large quality disparities persist across languages. Besides the obvious impact of uneven training resources, typological properties have also been proposed to determine the intrinsic difficulty of modeling a language. The existing evidence, however, is mostly based on small monolingual language models or bilingual translation models trained from scratch. We expand on this line of work by analyzing two large pre-trained multilingual translation models, NLLB-200 and Tower+, which are state-of-the-art representatives of encoder-decoder and decoder-only machine translation, respectively. Based on a broad set of languages, we find that target language typology drives translation quality of both models, even after controlling for more trivial factors, such as data resourcedness and writing script. Additionally, languages with certain typological properties benefit more from a wider search of the output space, suggesting that such languages could profit from alternative decoding strategies beyond the standard left-to-right beam search. To facilitate further research in this area, we release a set of fine-grained typological properties for 212 languages of the FLORES+ MT evaluation benchmark.
Large Language Models as Oracles for Ontology Alignment
Sviatoslav Lushnei | Dmytro Shumskyi | Severyn Shykula | Ernesto Jiménez-Ruiz | Artur d'Avila Garcez
Sviatoslav Lushnei | Dmytro Shumskyi | Severyn Shykula | Ernesto Jiménez-Ruiz | Artur d'Avila Garcez
There are many methods and systems to tackle the ontology alignment problem, yet a major challenge persists in producing high-quality mappings among a set of input ontologies. Adopting a human-in-the-loop approach during the alignment process has become essential in applications requiring very accurate mappings. However, user involvement is expensive when dealing with large ontologies. In this paper, we analyse the feasibility of using Large Language Models (LLMs) to aid the ontology alignment problem. LLMs are used only in the validation of a subset of correspondences for which there is high uncertainty. We have conducted an extensive analysis over several tasks of the Ontology Alignment Evaluation Initiative (OAEI), reporting in this paper the performance of several state-of-the-art LLMs using different prompt templates. Using LLMs as oracles resulted in strong performance in the OAEI 2025, achieving the top-2 overall rank in the bio-ml track.
Reasoning or Knowledge: Stratified Evaluation of Biomedical LLMs
Rahul Thapa | Qingyang Wu | Kevin Wu | Harrison G Zhang | Angela Zhang | Eric Wu | Haotian Ye | James Zou
Rahul Thapa | Qingyang Wu | Kevin Wu | Harrison G Zhang | Angela Zhang | Eric Wu | Haotian Ye | James Zou
Medical reasoning in large language models seeks to replicate clinicians’ cognitive processes in interpreting patient data and making diagnostic decisions. However, widely used benchmarks—such as MedQA, MedMCQA, and PubMedQA—mix questions that require multi-step reasoning with those answerable through factual recall, complicating evaluation. We introduce an expert-validated evaluation framework that disentangles knowledge recall from reasoning by training a PubMedBERT-based classifier and applying it to 11 widely used biomedical QA benchmarks. This framework reveals that only 32.8% of questions require multi-step reasoning, indicating that current evaluations largely measure factual recall. Stratified evaluation of biomedical models (HuatuoGPT-o1, MedReason, m1) and general-domain models (DeepSeek-R1, o4-mini, Qwen3) consistently shows lower performance on reasoning-heavy than knowledge-heavy questions (e.g., HuatuoGPT-o1: 56.9% on knowledge vs. 44.8% on reasoning). Beyond aggregate accuracy, we assess robustness through adversarial evaluations in which models are prefixed with uncertainty-inducing, incorrect statements; biomedical reasoning models degrade sharply in this setting (e.g., MedReason: 50.4% to 24.4%), with declines especially pronounced on reasoning-heavy questions. Finally, we show that fine-tuning on high-quality, reasoning-heavy examples augmented with adversarial traces, followed by reinforcement learning with GRPO, improves both robustness and accuracy across knowledge and reasoning subsets within our evaluation framework.
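The stratified scoring itself is straightforward once each question carries a knowledge-vs-reasoning label from the paper's PubMedBERT-based classifier (not reproduced here); a minimal sketch:

```python
from collections import defaultdict

def stratified_accuracy(examples):
    # examples: dicts with "category" ("knowledge" or "reasoning"), the model
    # "prediction", and the gold "answer"; the category label would come from
    # the question classifier described above.
    totals, correct = defaultdict(int), defaultdict(int)
    for ex in examples:
        totals[ex["category"]] += 1
        correct[ex["category"]] += int(ex["prediction"] == ex["answer"])
    return {cat: correct[cat] / totals[cat] for cat in totals}

demo = [
    {"category": "knowledge", "prediction": "B", "answer": "B"},
    {"category": "reasoning", "prediction": "A", "answer": "C"},
    {"category": "reasoning", "prediction": "D", "answer": "D"},
]
print(stratified_accuracy(demo))  # {'knowledge': 1.0, 'reasoning': 0.5}
```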
Effective QA-Driven Annotation of Predicate–Argument Relations Across Languages
Jonathan Davidov | Aviv Slobodkin | Shmuel Tomi Klein | Reut Tsarfaty | Ido Dagan | Ayal Klein
Jonathan Davidov | Aviv Slobodkin | Shmuel Tomi Klein | Reut Tsarfaty | Ido Dagan | Ayal Klein
Explicit representations of predicate-argument relations form the basis of interpretable semantic analysis, supporting reasoning, generation, and evaluation. However, attaining such semantic structures requires costly annotation efforts and has remained largely confined to English. We leverage the Question-Answer driven Semantic Role Labeling (QA-SRL) framework — a natural-language formulation of predicate-argument relations — as the foundation for extending semantic annotation to new languages. To this end, we introduce a cross-linguistic projection approach that reuses an English QA-SRL parser within a constrained translation and word-alignment pipeline to automatically generate question-answer annotations aligned with target-language predicates. Applied to Hebrew, Russian, and French — spanning diverse language families — the method yields high-quality training data and fine-tuned, language-specific parsers that outperform strong multilingual LLM baselines (GPT-4o, LLaMA-Maverick). By leveraging QA-SRL as a transferable natural-language interface for semantics, our approach enables efficient and broadly accessible predicate-argument parsing across languages.
Intrinsic evaluation metrics for conditional language models, such as perplexity or bits-per-character, are widely used in both mono- and multilingual settings. These metrics are rather straightforward to use and compare in monolingual setups, but rest on a number of assumptions in multilingual setups. One such assumption is that comparing the perplexity of CLMs on parallel sentences is indicative of their quality since the information content (here understood as the semantic meaning) is the same. However, the metrics are inherently measuring information content in the information-theoretic sense. We make this and other such assumptions explicit and discuss their implications. We perform experiments with six metrics on two multi-parallel corpora both with mono- and multilingual models. Ultimately, we find that current metrics are not universally comparable. We look at the form-meaning debate to provide some explanation for this.
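For concreteness, bits-per-character is the summed negative log-likelihood converted from nats to bits and normalized by character count, which already hints at one comparability problem: parallel sentences in different languages rarely have the same character count. A minimal sketch:

```python
import math

def bits_per_character(token_logprobs, text):
    # token_logprobs: per-token log-probabilities in natural log (nats), as
    # returned by a (conditional) language model for this text.
    total_nats = -sum(token_logprobs)
    return total_nats / (len(text) * math.log(2))

# Toy example: four tokens scored over a 20-character sentence.
print(bits_per_character([-2.1, -0.7, -1.3, -0.9], "the cat sat quietly."))
```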
What Breaks Knowledge Graph based RAG? Benchmarking and Empirical Insights into Reasoning under Incomplete Knowledge
Dongzhuoran Zhou | Yuqicheng Zhu | Xiaxia Wang | Hongkuan Zhou | Yuan He | Jiaoyan Chen | Steffen Staab | Evgeny Kharlamov
Dongzhuoran Zhou | Yuqicheng Zhu | Xiaxia Wang | Hongkuan Zhou | Yuan He | Jiaoyan Chen | Steffen Staab | Evgeny Kharlamov
Knowledge Graph-based Retrieval-Augmented Generation (KG-RAG) is an increasingly explored approach for combining the reasoning capabilities of large language models with the structured evidence of knowledge graphs. However, current evaluation practices fall short: existing benchmarks often include questions that can be directly answered using existing triples in KG, making it unclear whether models perform reasoning or simply retrieve answers directly. Moreover, inconsistent evaluation metrics and lenient answer matching criteria further obscure meaningful comparisons. In this work, we introduce a general method for constructing benchmarks and present BRINK (Benchmark for Reasoning under Incomplete Knowledge) to systematically assess KG-RAG methods under knowledge incompleteness. Our empirical results show that current KG-RAG methods have limited reasoning ability under missing knowledge, often rely on internal memorization, and exhibit varying degrees of generalization depending on their design.
Assessing Web Search Credibility and Response Groundedness in Chat Assistants
Ivan Vykopal | Matúš Pikuliak | Simon Ostermann | Marian Simko
Ivan Vykopal | Matúš Pikuliak | Simon Ostermann | Marian Simko
Chat assistants increasingly integrate web search functionality, enabling them to retrieve and cite external sources. While this promises more reliable answers, it also raises the risk of amplifying misinformation from low-credibility sources. In this paper, we introduce a novel methodology for evaluating assistants’ web search behavior, focusing on source credibility and the groundedness of responses with respect to cited sources. Using 100 claims across five misinformation-prone topics, we assess GPT-4o, GPT-5, Perplexity, and Qwen Chat. Our findings reveal differences between the assistants, with Perplexity achieving the highest source credibility, whereas GPT-4o exhibits elevated citation of non-credible sources on sensitive topics. This work provides the first systematic comparison of commonly used chat assistants for fact-checking behavior, offering a foundation for evaluating AI systems in high-stakes information environments.
When the Model Said ‘No Comment’, We Knew Helpfulness Was Dead, Honesty Was Alive, and Safety Was Terrified
Gautam Siddharth Kashyap | Mark Dras | Usman Naseem
Gautam Siddharth Kashyap | Mark Dras | Usman Naseem
Ensuring that Large Language Models (LLMs) are in accordance with human values—being helpful, harmless, and honest (HHH)—is important for safe deployment. Existing works use Supervised Fine-Tuning (SFT) and Mixture-of-Experts (MoE) to align LLMs. However, these works face challenges in multi-objective settings, with SFT leading to interference between conflicting objectives and MoEs suffering from miscalibrated routing. We term this failure mode Axis Collapse, marked by (1) disjoint feature spaces causing catastrophic forgetting, and (2) unreliable inference from misrouted experts. To resolve this, we propose AlignX, a two-stage framework. Stage 1 uses prompt-injected fine-tuning to extract axis-specific task features, mitigating catastrophic forgetting. Stage 2 deploys a MoCaE module that calibrates expert routing using fractal and natural geometry, improving inference reliability. AlignX achieves significant gains on Alpaca (Helpfulness), BeaverTails (Harmlessness), and TruthfulQA (Honesty), with +171.5% win rate, +110.1% in truthfulness-informativeness, and 4.3% fewer safety violations. It also reduces latency and memory usage by over 35% compared to prior MoEs. Results across four LLMs validate its generalizability. Code and data are available at: https://github.com/gskgautam/AlignX
NeuronMoE: Efficient Cross-Lingual Extension via Neuron-Guided Mixture-of-Experts
Rongzhi Li | Hitomi Yanaka
Rongzhi Li | Hitomi Yanaka
From Emotion to Expression: Theoretical Foundations and Resources for Fear Speech
Vigneshwaran Shankaran | Gabriella Lapesa | Claudia Wagner
Vigneshwaran Shankaran | Gabriella Lapesa | Claudia Wagner
Few forces rival fear in their ability to mobilize societies, distort communication, and reshape collective behavior. In computational linguistics, fear is primarily studied as an emotion, but not as a distinct form of speech. Fear speech content is widespread and growing, and often outperforms hate-speech content in reach and engagement because it appears "civiler" and evades moderation. Yet the computational study of fear speech remains fragmented and under-resourced. This can be understood by recognizing that fear speech is a phenomenon shaped by contributions from multiple disciplines. In this paper, we bridge cross-disciplinary perspectives by comparing theories of fear from Psychology, Political science, Communication science, and Linguistics. Building on this, we review existing definitions. We follow up with a survey of datasets from related research areas and propose a taxonomy that consolidates different dimensions of fear for studying fear speech. By reviewing current datasets and defining core concepts, our work offers both theoretical and practical guidance for creating datasets and advancing fear speech research.
Subword tokenization methods, such as Byte-Pair Encoding (BPE), significantly impact the performance and efficiency of large language models (LLMs). The standard approach involves training a general-purpose tokenizer that uniformly processes all textual data during both training and inference. However, the use of a generic set of tokens can incur inefficiencies when applying the model to specific domains or languages. To address this limitation, we propose a post-training adaptation strategy that selectively replaces low-utility tokens with more relevant ones based on their frequency in an adaptation corpus. Our algorithm identifies the token inventory that most effectively encodes the adaptation corpus for a given target vocabulary size. Extensive experiments on generation and classification tasks across multiple languages demonstrate that our adapted tokenizers compress test corpora more effectively than baselines using the same vocabulary size. This method serves as a lightweight adaptation mechanism, akin to a vocabulary fine-tuning process, enabling optimized tokenization for specific domains or tasks. Our code and data are available at https://github.com/vijini/Adapt-BPE.git.
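A simplified sketch of the replacement step (the paper's actual algorithm optimizes how well the resulting inventory encodes the adaptation corpus at a fixed vocabulary size; the greedy frequency-based swap below only illustrates the idea):

```python
from collections import Counter

def adapt_vocabulary(vocab, adaptation_tokens, n_swap):
    # Drop the n_swap vocabulary tokens that are least used when encoding the
    # adaptation corpus and replace them with the n_swap most frequent corpus
    # tokens that are not yet in the vocabulary.
    corpus_freq = Counter(adaptation_tokens)
    ranked = sorted(vocab, key=lambda tok: corpus_freq.get(tok, 0))
    dropped = set(ranked[:n_swap])
    added = [tok for tok, _ in corpus_freq.most_common() if tok not in vocab][:n_swap]
    return (set(vocab) - dropped) | set(added)

vocab = {"the", "ing", "zx", "qq"}
corpus = ["the", "gene", "the", "prot", "gene", "ing", "gene"]
print(adapt_vocabulary(vocab, corpus, n_swap=2))  # swaps unused "zx"/"qq" for "gene"/"prot"
```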
Reassessing Active Learning Adoption in Contemporary NLP: A Community Survey
Julia Romberg | Christopher Schröder | Julius Gonsior | Katrin Tomanek | Fredrik Olsson
Julia Romberg | Christopher Schröder | Julius Gonsior | Katrin Tomanek | Fredrik Olsson
Supervised learning relies on data annotation which usually is time-consuming and therefore expensive. A longstanding strategy to reduce annotation costs is active learning, an iterative process, in which a human annotates only data instances deemed informative by a model. Research in active learning has made considerable progress, especially with the rise of large language models (LLMs). However, we still know little about how these remarkable advances have translated into real-world applications, or contributed to removing key barriers to active learning adoption. To fill in this gap, we conduct an online survey in the NLP community to collect previously intangible insights on current implementation practices, common obstacles in application, and future prospects in active learning. We also reassess the perceived relevance of data annotation and active learning as fundamental assumptions. Our findings show that data annotation is expected to remain important and active learning to stay highly relevant while benefiting from LLMs. Consistent with a community survey from over 15 years ago, three key challenges yet persist—setup complexity, uncertain cost reduction, and tooling—for which we propose alleviation strategies. We publish an anonymized version of the dataset.
Beyond "Not Novel Enough": Enriching Scholarly Critique with LLM-Assisted Feedback
Osama Mohammed Afzal | Preslav Nakov | Tom Hope | Iryna Gurevych
Osama Mohammed Afzal | Preslav Nakov | Tom Hope | Iryna Gurevych
Novelty assessment is a central yet understudied aspect of peer review, particularly in high-volume fields like NLP where reviewer capacity is increasingly strained. We present a structured approach for automated novelty evaluation that models expert reviewer behavior through three stages: (i) content extraction from submissions, (ii) retrieval and synthesis of related work, and (iii) structured comparison for evidence-based assessment. Our method is informed by analysis of human-written novelty reviews and captures key patterns such as independent claim verification and contextual reasoning. Evaluated on 182 ICLR 2025 submissions with human-annotated reviewer novelty assessments, the approach achieves 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions, substantially outperforming existing LLM-based baselines. It produces detailed, literature-aware analysis and improves consistency over ad hoc reviewer judgments. These results highlight the potential for structured LLM-assisted approaches to support more rigorous and transparent peer review without displacing human expertise. The data and the code are available at https://ukplab.github.io/eacl2026-assessing-paper-novelty/
AfriVox: Probing Multilingual and Accent Robustness of Speech LLMs
Busayo Awobade | Mardhiyah Sanni | Tassallah Abdullahi | Chibuzor Okocha | Kelechi Ezema | Devendra Deepak Kayande | Lukman Enegi Ismaila | Tobi Olatunji | Gloria Ashiya Katuka
Busayo Awobade | Mardhiyah Sanni | Tassallah Abdullahi | Chibuzor Okocha | Kelechi Ezema | Devendra Deepak Kayande | Lukman Enegi Ismaila | Tobi Olatunji | Gloria Ashiya Katuka
Recent advances in multimodal and speech-native large language models (LLMs) have delivered impressive speech recognition, translation, understanding, and question-answering capabilities for high-resource languages. However, African languages and non-native French or English accents remain dramatically underrepresented in benchmarks, limiting the understanding and applicability of leading LLMs for millions of francophone and anglophone users in low-resource settings. We present AfriVox, an open-source benchmark (including novel domain-specific and unscripted datasets) across 20 African languages, African-accented French, Arabic, and 100+ African English accents, contrasting leading multimodal speech LLMs with traditional unimodal automatic speech transcription (ASR) and translation (AST) models. Our analysis reveals significant language coverage variation, surprising LLM translation performance gains (e.g. Gemini), robustness concerns with unscripted speech, and substantial performance disparities for "supported" African languages. We profile the strengths, limitations, and language support of each model, and conduct the first targeted fine-tuning of a modern speech LLM (Qwen2.5-Omni) for three Nigerian languages, exceeding SOTA, and achieving up to 54% relative WER reduction and significant BLEU gains, offering practical guidance for implementers seeking to serve local language users.
Historical language models play a crucial role in the study of languages, and can benefit tasks such as named-entity recognition (NER), part-of-speech (PoS) tagging, and post-OCR correction, among others. Despite their relevance, most efforts have been concentrated on English. To the best of our knowledge, no such model exists for historical Portuguese. In this work, we introduce PortOldBERT, the first historical Portuguese encoder language model. We demonstrate its usefulness by comparing PortOldBERT’s performance with Albertina, the encoder on which it is based, across multiple tasks—pseudo-perplexity, NER, PoS tagging, word error rate (WER) prediction, and OCR error detection—and for different historical periods. PortOldBERT consistently outperforms Albertina on historical data, demonstrating that it effectively integrates historical linguistic contexts while retaining the ability to process contemporary text.
ReMedQA: Are We Done With Medical Multiple-Choice Benchmarks?
Alessio Cocchieri | Luca Ragazzi | Giuseppe Tagliavini | Gianluca Moro
Alessio Cocchieri | Luca Ragazzi | Giuseppe Tagliavini | Gianluca Moro
Medical multiple-choice question answering (MCQA) benchmarks show that models achieve near-human accuracy, with some benchmarks approaching saturation, leading to claims of clinical readiness. Yet a single accuracy score is a poor proxy for competence: models that change answers under minor input perturbations cannot be considered reliable. We argue that reliability underpins accuracy: only consistent predictions make correctness meaningful. We release ReMedQA, a new benchmark that augments three standard medical MCQA datasets with open-ended answers and systematically perturbed options. Building on this, we introduce ReAcc and ReCon, two reliability metrics: ReAcc measures the proportion of questions answered correctly across all variations, while ReCon measures the proportion answered consistently regardless of correctness. Our evaluation shows that high MCQA accuracy masks low reliability: models remain sensitive to format and perturbation changes, and domain specialization offers no robustness gain. MCQA underestimates smaller models while inflating large ones that exploit structural cues, with some exceeding 50% accuracy even when the original questions are hidden. This shows that, despite near-saturated accuracy, we are not yet done with medical MCQA benchmarks.
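Under one natural reading of the two metrics (the exact aggregation used in ReMedQA may differ), ReAcc credits a question only if every perturbed variant is answered correctly, while ReCon credits it if every variant receives the same answer, correct or not:

```python
def reliability_metrics(predictions, gold):
    # predictions: question_id -> list of answers given across all perturbed
    # variants of that question; gold: question_id -> correct answer.
    n = len(predictions)
    reacc = sum(all(p == gold[q] for p in preds) for q, preds in predictions.items()) / n
    recon = sum(len(set(preds)) == 1 for preds in predictions.values()) / n
    return reacc, recon

preds = {"q1": ["A", "A", "A"], "q2": ["B", "C", "B"], "q3": ["D", "D", "D"]}
gold = {"q1": "A", "q2": "B", "q3": "C"}
print(reliability_metrics(preds, gold))  # ReAcc ~ 0.33, ReCon ~ 0.67
```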
Can Activation Steering Generalize Across Languages? A Study on Syllogistic Reasoning in Language Models
Gabriele Maraia | Leonardo Ranaldi | Marco Valentino | Fabio Massimo Zanzotto
Gabriele Maraia | Leonardo Ranaldi | Marco Valentino | Fabio Massimo Zanzotto
Large Language Models (LLMs) often struggle with formal logical reasoning, frequently conflating content plausibility with logical validity. This well-known content effect undermines their capacity to act as reliable deductive reasoners, particularly in multilingual contexts where both linguistic variability and world knowledge may deepen biases. Prior work shows that prompting and tuning interventions can alleviate these issues only partially, leaving models vulnerable to semantic interference. While previous studies have explored activation steering and other test-time interventions, this work has focused predominantly on English. To make reasoning more consistent, robust, and transferable across languages, we investigate the use of activation steering—an inference-time intervention that modulates internal representations towards a cross-lingual reasoning space. Our experiments demonstrate that steering techniques constructed for English-based syllogisms generalise effectively to multilingual datasets, yielding higher formal reasoning accuracy (up to +36%) while minimally affecting language modelling performance. Moreover, steering supports partial transfer to out-of-distribution tasks, highlighting its potential as a scalable mechanism for cross-lingual transferable reasoning. These findings advance the prospect of developing LLMs that can serve as reliable soft reasoners across language landscapes.
SPARTA: Evaluating Reasoning Segmentation Robustness through Black-Box Adversarial Paraphrasing in Text Autoencoder Latent Space
Viktoriia Zinkovich | Anton Antonov | Andrei Spiridonov | Denis Shepelev | Andrey Moskalenko | Daria Pugacheva | Elena Tutubalina | Andrey Kuznetsov | Vlad Shakhuro
Viktoriia Zinkovich | Anton Antonov | Andrei Spiridonov | Denis Shepelev | Andrey Moskalenko | Daria Pugacheva | Elena Tutubalina | Andrey Kuznetsov | Vlad Shakhuro
Multimodal large language models (MLLMs) have shown impressive capabilities in vision-language tasks such as reasoning segmentation, where models generate segmentation masks based on textual queries. While prior work has primarily focused on perturbing image inputs, semantically equivalent textual paraphrases—crucial in real-world applications where users express the same intent in varied ways—remain underexplored. To address this gap, we introduce a novel adversarial paraphrasing task: generating grammatically correct paraphrases that preserve the original query meaning while degrading segmentation performance. To evaluate the quality of adversarial paraphrases, we develop a comprehensive automatic evaluation protocol validated with human studies. Furthermore, we introduce SPARTA—a black-box, sentence-level optimization method that operates in the low-dimensional semantic latent space of a text autoencoder, guided by reinforcement learning. SPARTA achieves significantly higher success rates, outperforming prior methods by up to 2x on both the ReasonSeg and LLMSeg-40k datasets. We use SPARTA and competitive baselines to assess the robustness of advanced reasoning segmentation models. We reveal that they remain vulnerable to adversarial paraphrasing—even under strict semantic and grammatical constraints. All code and data will be released publicly upon acceptance.
Knowledge Augmentation Enhances Token Classification for Recipe Understanding
Nuhu Ibrahim | Robert Stevens | Riza Batista-Navarro
Nuhu Ibrahim | Robert Stevens | Riza Batista-Navarro
In this work, we propose an entity type-specific and knowledge-augmented token classification framework designed to improve encoder models’ performance on recipe texts. Our empirical analysis shows that this approach achieves state-of-the-art (SOTA) results on 5 out of 7 benchmark recipe datasets, significantly outperforming traditional token classification methods. We introduce a novel methodology leveraging curated domain-specific knowledge contexts to guide encoder models such as BERT and RoBERTa, which we refer to as RecipeBERT-KA and RecipeRoBERTa-KA. Additionally, we release a newly reprocessed entity type-specific and knowledge-enriched dataset that merges seven widely used food datasets, making it the largest annotated food-related dataset to date. Comparative analysis with SOTA large language models (GPT-4o, Mistral-7B, LLaMA 3-13B and LLaMA 3-70B) highlights the practical advantages of our smaller and specialised models. Finally, we analyse the impact of the different knowledge contexts, our models’ potential for transfer learning, the effect of combining the datasets and scenarios where traditional token classification may still perform competitively, offering nuanced insight into method selection.
Argumentation and Judgement Factors: LLM-based Discovery and Application in Insurance Disputes
Basit Ali | Anubhav Sinha | Nitin Ramrakhiyani | Sachin Pawar | Girish Keshav Palshikar | Manoj Apte
In this work, we focus on discovery of legal factors for a specific case type under consideration (e.g., vehicle insurance disputes). We refer to these legal factors more explicitly as "Argumentation and Judgement Factors" (AJFs). AJFs encode specific legal knowledge that is important for legal argumentation and judicial decision making. We propose a multi-step approach for discovering a list of AJFs for a given case type using a set of relevant legal documents (e.g., past judgements, relevant acts) and Symbolic Knowledge Distillation (SKD) from a Large Language Model (LLM). We propose a novel geneRatE-CRitic-reviEW (RECREW) prompting strategy for effective SKD. We construct and evaluate the discovered list of AJFs on two different types of cases (auto-insurance and life insurance) and show their utility in a dispute resolution application.
ViGoEmotions: A Benchmark Dataset For Fine-grained Emotion Detection on Vietnamese Texts
Tran Quang Hung | Pham Tien Nam | Son T. Luu | Kiet Van Nguyen
Emotion classification plays a significant role in emotion prediction and harmful content detection. Recent advancements in NLP, particularly through large language models (LLMs), have greatly improved outcomes in this field. This study introduces ViGoEmotions - a Vietnamese emotion corpus comprising 20,664 social media comments in which each comment is classified into 27 fine-grained distinct emotions. To evaluate the quality of the dataset and its impact on emotion classification, eight pre-trained Transformer-based models were evaluated under three preprocessing strategies: preserving original emojis with rule-based normalization, converting emojis into textual descriptions, and applying ViSoLex, a model-based lexical normalization system. Results show that converting emojis into text often improves the performance of several BERT-based baselines, while preserving emojis yields the best results for ViSoBERT and CafeBERT. In contrast, removing emojis generally leads to lower performance. ViSoBERT achieved the highest Macro F1-score of 61.50% and Weighted F1-score of 63.26%. Strong performance was also observed from CafeBERT and PhoBERT. These findings highlight that while the proposed corpus can support diverse architectures effectively, preprocessing strategies and annotation quality remain key factors influencing downstream performance.
PTEB: Towards Robust Text Embedding Evaluation via Stochastic Paraphrasing at Evaluation Time with LLMs
Manuel Frank | Haithem Afli
Current evaluations of sentence embedding models typically rely on static test beds such as the Massive Text Embedding Benchmark (MTEB). While invaluable, repeated tuning on a fixed suite can inflate reported performance and obscure real-world robustness. We introduce the Paraphrasing Text Embedding Benchmark (PTEB), a dynamic protocol that stochastically generates meaning-preserving paraphrases at evaluation time and aggregates results across multiple runs. Using a cost-efficient LLM-based method grounded in semantic textual similarity gold ratings, we show that LLMs generate token-diverse but semantically preserving paraphrases. Across 7 MTEB tasks, we validate our hypothesis that the performance of sentence encoders is sensitive to changes in token space even when semantics remain fixed. We also observe that smaller models are not disproportionately affected relative to larger ones. Our results are statistically robust over multiple runs, and we extend our experiments to 3 multilingual datasets covering 10 languages. More generally, we aim to propose a new evaluation paradigm in NLP that relies less on static, pre-defined benchmarks and shifts towards dynamic, stochastic evaluation leveraging eval-time compute. We make the code to run PTEB publicly available.
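A rough sketch of the protocol described above, with the LLM paraphraser and the embedding model replaced by trivial stand-ins (a word-swap function and a character n-gram tf-idf vectorizer); only the paraphrase-then-aggregate structure is meant to be illustrative.

```python
# PTEB-style protocol: score an STS task over stochastically paraphrased inputs
# and aggregate across runs. Paraphraser and embedder are trivial stand-ins.
import random
import numpy as np
from scipy.stats import spearmanr
from sklearn.feature_extraction.text import TfidfVectorizer

SWAPS = {"quick": "fast", "car": "automobile", "movie": "film", "long": "lengthy"}

def paraphrase(sentence, rng):
    """Stand-in for an LLM paraphraser: random meaning-preserving word swaps."""
    return " ".join(SWAPS.get(w, w) if rng.random() < 0.5 else w for w in sentence.split())

def sts_score(pairs, gold):
    """Spearman correlation between embedding cosine similarity and gold ratings."""
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit([s for p in pairs for s in p])
    a = vec.transform([p[0] for p in pairs]).toarray()
    b = vec.transform([p[1] for p in pairs]).toarray()
    cos = (a * b).sum(1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-9)
    return spearmanr(cos, gold).correlation

pairs = [("a quick car", "a fast automobile"),
         ("the movie was long", "the film ran long"),
         ("he cooked dinner", "she read a book"),
         ("rain fell all day", "it rained the whole day")]
gold = [4.2, 4.5, 0.8, 4.8]

runs = []
for seed in range(5):                       # multiple stochastic evaluation runs
    rng = random.Random(seed)
    paraphrased = [(paraphrase(x, rng), paraphrase(y, rng)) for x, y in pairs]
    runs.append(sts_score(paraphrased, gold))
print(f"PTEB-style score: {np.mean(runs):.3f} +/- {np.std(runs):.3f}")
```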
DETECT: Determining Ease and Textual Clarity of German Text Simplifications
Maria Korobeynikova | Alessia Battisti | Lukas Fischer | Yingqiang Gao
Current evaluation of German automatic text simplification (ATS) relies on general-purpose metrics such as SARI, BLEU, and BERTScore, which insufficiently capture simplification quality in terms of simplicity, meaning preservation, and fluency. While specialized metrics like LENS have been developed for English, corresponding efforts for German have lagged behind due to the absence of human-annotated corpora. To close this gap, we introduce DETECT, the first German-specific metric that holistically evaluates ATS quality across all three dimensions of simplicity, meaning preservation, and fluency, and is trained entirely on synthetic large language model (LLM) responses. Our approach adapts the LENS framework to German and extends it with (i) a pipeline for generating synthetic quality scores via LLMs, enabling dataset creation without human annotation, and (ii) an LLM-based refinement step for aligning grading criteria with simplification requirements. To the best of our knowledge, we also construct the largest German human evaluation dataset for text simplification to validate our metric directly. Experimental results show that DETECT achieves substantially higher correlations with human judgments than widely used ATS metrics, with particularly strong gains in meaning preservation and fluency. Beyond ATS, our findings highlight both the potential and the limitations of LLMs for automatic evaluation and provide transferable guidelines for general language accessibility tasks.
MathEDU: Feedback Generation on Problem-Solving Processes for Mathematical Learning Support
Wei-Ling Hsu | Yu-Chien Tang | An-Zi Yen
The increasing reliance on Large Language Models (LLMs) across various domains extends to education, where students progressively use generative AI as a tool for learning. While prior work has examined LLMs’ mathematical ability, their reliability in grading authentic student problem-solving processes and delivering effective feedback remains underexplored. This study introduces MathEDU, a dataset consisting of student problem-solving processes in mathematics and corresponding teacher-written feedback. We systematically evaluate the reliability of various models across three hierarchical tasks: answer correctness classification, error identification, and feedback generation. Experimental results show that fine-tuning strategies effectively improve performance in classifying correctness and locating erroneous steps. However, the generated feedback across models shows a considerable gap from teacher-written feedback. Critically, the generated feedback is often verbose and fails to provide targeted explanations for the student’s underlying misconceptions. This emphasizes the urgent need for trustworthy and pedagogy-aware AI feedback in education.
Test-Time Scaling of Reasoning Models for Machine Translation
Zihao Li | Shaoxiong Ji | Jörg Tiedemann
Test-time scaling (TTS) has enhanced the performance of Reasoning Models (RMs) on various tasks such as math and coding, yet its efficacy in machine translation (MT) remains underexplored. This paper investigates whether increased inference-time computation improves translation quality. We evaluate 12 RMs across a diverse suite of MT benchmarks spanning multiple domains, examining three scenarios: direct translation, forced-reasoning extrapolation, and post-editing. Our findings show that for general-purpose RMs, TTS provides limited and inconsistent benefits for direct translation, with performance quickly plateauing. However, the effectiveness of TTS is unlocked by domain-specific fine-tuning, which aligns a model’s reasoning process with task requirements, leading to consistent improvements up to an optimal, self-determined reasoning depth. We also find that forcing a model to reason beyond its natural stopping point consistently degrades translation quality. In contrast, TTS proves highly effective in a post-editing context, reliably turning self-correction into a beneficial process. These results indicate that the value of inference-time computation in MT lies not in enhancing single-pass translation with general models, but in targeted applications like multi-step, self-correction workflows and in conjunction with task-specialized models.
How Good Are LLMs at Processing Tool Outputs?
Kiran Kate | Yara Rizk | Poulami Ghosh | Ashu Gulati | Tathagata Chakraborti | Zidane Wright | Mayank Agarwal
Most realistic task automation problems require large language models (LLMs) to call tools, which often return complex JSON responses. These responses must be further processed to derive the information necessary for task completion. The ability of LLMs to do so is under-studied. In this paper, we study the tool response processing task and LLMs’ abilities to process structured (JSON) responses. We created a dataset for this task, and evaluated 15 open and closed weight models using multiple prompting approaches. Our results show that JSON processing remains a difficult task even for frontier models across multiple prompting strategies. The optimal response processing strategy depends on both the nature and size of the tool outputs, as well as the complexity of the required reasoning. Variations in processing approaches can lead to performance differences ranging from 3% to 50%.
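As one illustration of the response-processing step studied here, the sketch below flattens a nested JSON tool response into path-value lines before prompting; the payload, field names, and question are invented for the example.

```python
# Flatten a nested JSON tool response into "path: value" lines before prompting.
import json

def flatten(obj, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf of a JSON object."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            yield from flatten(value, f"{prefix}[{i}]")
    else:
        yield prefix, obj

response = json.loads("""{
  "flights": [
    {"id": "A12", "price": {"amount": 420, "currency": "USD"}, "stops": 1},
    {"id": "B07", "price": {"amount": 380, "currency": "USD"}, "stops": 2}
  ],
  "meta": {"returned": 2}
}""")

context = "\n".join(f"{path}: {value}" for path, value in flatten(response))
prompt = f"Tool output:\n{context}\n\nQuestion: which flight is cheapest, and how many stops does it have?"
print(prompt)
```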
Tug-of-war between idioms’ figurative and literal interpretations in LLMs
Soyoung Oh | Xinting Huang | Mathis Pink | Michael Hahn | Vera Demberg
Idioms present a unique challenge for language models due to their non-compositional figurative interpretations, which often strongly diverge from the idiom’s literal interpretation. In this paper, we employ causal tracing to systematically analyze how pretrained causal transformers deal with this ambiguity. We localize three mechanisms: (i) Early sublayers and specific attention heads retrieve the figurative interpretation, while suppressing the literal interpretation. (ii) When disambiguating context precedes the idiom, the model leverages it from the earliest layer, and later layers refine the interpretation if the context conflicts with the retrieved interpretation. (iii) Selective, competing pathways then carry both interpretations: an intermediate pathway that prioritizes the figurative interpretation and a parallel direct route that favors the literal interpretation, ensuring that both readings remain available. Our findings provide mechanistic evidence for idiom comprehension in autoregressive transformers.
Do LLM hallucination detectors suffer from low-resource effect?
Debtanu Datta | Mohan Kishore Chilukuri | Yash Kumar | Saptarshi Ghosh | Muhammad Bilal Zafar
LLMs, while outperforming humans in a wide range of tasks, can still fail in unanticipated ways. We focus on two pervasive failure modes: (i) hallucinations, where models produce incorrect information about the world, and (ii) the low-resource effect, where the models show impressive performance in high-resource languages like English but the performance degrades significantly in low-resource languages like Bengali. We study the intersection of these issues and ask: do hallucination detectors suffer from the low-resource effect? We conduct experiments on five tasks across three domains (factual recall, STEM, and Humanities). Experiments with four LLMs and three hallucination detectors reveal a curious finding: As expected, the task accuracies in low-resource languages experience large drops (compared to English). However, the drop in detectors’ accuracy is often several times smaller than the drop in task accuracy. Our findings suggest that even in low-resource languages, the internal mechanisms of LLMs might encode signals about their uncertainty. Further, the detectors are robust within language (even for non-English) and in multilingual setups, but not in cross-lingual settings without in-language supervision.
Coupling Local Context and Global Semantic Prototypes via a Hierarchical Architecture for Rhetorical Roles Labeling
Anas Belfathi | Nicolas Hernandez | Monceaux Laura | Warren Bonnard | Mary Catherine Lavissière | Christine Jacquin | Richard Dufour
Rhetorical Role Labeling (RRL) identifies the functional role of each sentence in a document, a key task for discourse understanding in domains such as law and medicine. While hierarchical models capture local dependencies effectively, they are limited in modeling global, corpus-level features. To address this limitation, we propose two prototype-based methods that integrate local context with global representations. Prototype-Based Regularization (PBR) learns soft prototypes through a distance-based auxiliary loss to structure the latent space, while Prototype-Conditioned Modulation (PCM) constructs corpus-level prototypes and injects them during training and inference. Given the scarcity of RRL resources, we introduce SCOTUS-Law, the first dataset of U.S. Supreme Court opinions annotated with rhetorical roles at three levels of granularity: category, rhetorical function, and step. Experiments on legal, medical, and scientific benchmarks show consistent improvements over strong baselines, with ∼4 Macro-F1 gains on low-frequency roles. We further analyze the implications in the era of Large Language Models and complement our findings with expert evaluation.
Guided by the Plan: Enhancing Faithful Autoregressive Text-to-Audio Generation with Guided Decoding
Juncheng Wang | Zhe Hu | Chao Xu | Siyue Ren | Yuxiang Feng | Yang Liu | Baigui Sun | Shujun Wang
Autoregressive (AR) models excel at generating temporally coherent audio by producing tokens sequentially, yet they often falter in faithfully following complex textual prompts—especially those describing complex sound events. We uncover a surprising capability in AR audio generators: their early prefix tokens implicitly encode global semantic attributes of the final output, such as event count and sound-object category, revealing a form of implicit planning. Building on this insight, we propose Plan-Critic, a lightweight auxiliary model trained with a Generalized Advantage Estimation (GAE)-inspired objective to predict final instruction-following quality from partial generations. At inference time, Plan-Critic enables guided exploration: it evaluates candidate prefixes early, prunes low-fidelity trajectories, and reallocates computation to high-potential planning seeds. Our Plan-Critic-guided sampling achieves up to a 10 points improvement in CLAP score over the AR baseline—establishing a new state of the art in AR text-to-audio generation—while maintaining computational parity with standard best-of-N decoding. This work bridges the gap between causal generation and global semantic alignment, demonstrating that even strictly autoregressive models can plan ahead.
Safe-Unsafe Concept Separation Emerges from a Single Direction in Language Models Activation Space
Andrea Ermellino | Lorenzo Malandri | Fabio Mercorio | Antonio Serino
Ensuring the safety of Large Language Models (LLMs) is a critical alignment challenge. Existing approaches often rely on invasive fine-tuning or external generation-based checks, which can be opaque and resource-inefficient. In this work, we investigate the geometry of safety concepts within pretrained representations, proposing a mechanistic methodology that identifies the layer where safe and unsafe concepts are maximally separable within a pretrained model’s representation space. By leveraging the intrinsic activation space of the optimal layer, we show that safety enforcement can be achieved via a simple linear classifier, avoiding the need for weight modification. We validate our framework across multiple domains (regulation, law, finance, cybersecurity, education, code, human resources, and social media), diverse tasks (safety classification, prompt injection, and toxicity detection), and 16 non-English languages on both encoder and decoder architectures. Our results show that: (i) the separation between safe and unsafe concepts emerges from a single layer direction in the activation space, and (ii) monitoring internal representations provides a significantly more robust safeguarding mechanism compared to traditional evaluative or generative guardrail paradigms.
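A minimal sketch of the probing idea, assuming activations from an arbitrary layer of a small pretrained LM (GPT-2 here) and toy safe/unsafe prompts; the paper's layer-selection procedure and data are not reproduced.

```python
# Linear safety probe on frozen LM activations; no weight modification.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 8  # assumed layer of maximal safe/unsafe separability

def activations(texts):
    """Last-token hidden states at the chosen layer, one vector per text."""
    feats = []
    for t in texts:
        ids = tok(t, return_tensors="pt")
        with torch.no_grad():
            out = lm(**ids, output_hidden_states=True)
        feats.append(out.hidden_states[LAYER][0, -1].numpy())
    return feats

safe = ["How do I bake sourdough bread?", "Summarise this quarterly report."]
unsafe = ["Explain how to pick a lock to break into a house.",
          "Write a phishing email targeting bank customers."]
X = activations(safe + unsafe)
y = [0] * len(safe) + [1] * len(unsafe)

probe = LogisticRegression(max_iter=1000).fit(X, y)   # the linear classifier
test = ["Draft a polite meeting reminder."]
print("unsafe probability:", probe.predict_proba(activations(test))[0, 1])
```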
PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark
Robert Belanec | Branislav Pecher | Ivan Srba | Maria Bielikova
Despite the state-of-the-art performance that Large Language Models (LLMs) achieve on many tasks, their massive scale often leads to high computational and environmental costs, limiting their accessibility. Parameter-Efficient Fine-Tuning (PEFT) methods address this challenge by reducing the number of trainable parameters while maintaining strong downstream performance. Despite the advances in PEFT methods, current evaluations remain limited (in terms of evaluated models and datasets) and difficult to reproduce. To bridge this gap, we introduce PEFT-Bench, a unified end-to-end benchmark for evaluating diverse PEFT methods on autoregressive LLMs. We demonstrate its usage across 27 NLP datasets and 7 PEFT methods. To account for different PEFT training and inference factors, we also introduce the PEFT Soft Cost Penalties (PSCP) metric, which takes trainable parameters, inference speed, and training memory usage into account.
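The abstract does not give the PSCP formula, so the following is only an assumed, illustrative penalty in the same spirit: downstream accuracy discounted by normalised trainable-parameter count, inference latency, and training memory.

```python
# Illustrative soft cost penalty (NOT the paper's PSCP formula).
def soft_cost_penalised_score(accuracy, trainable_params, latency_ms, train_mem_gb,
                              ref_params=1e8, ref_latency_ms=100.0, ref_mem_gb=40.0,
                              weights=(0.4, 0.3, 0.3)):
    """Accuracy discounted by cost factors, each capped at a reference budget."""
    costs = (min(trainable_params / ref_params, 1.0),
             min(latency_ms / ref_latency_ms, 1.0),
             min(train_mem_gb / ref_mem_gb, 1.0))
    penalty = sum(w * c for w, c in zip(weights, costs))
    return accuracy * (1.0 - 0.5 * penalty)   # assumed blending factor

# A LoRA-style method: few trainable parameters, modest overhead.
print(soft_cost_penalised_score(0.82, trainable_params=4e6, latency_ms=105, train_mem_gb=18))
# Full fine-tuning: similar accuracy, far higher cost, hence a lower adjusted score.
print(soft_cost_penalised_score(0.83, trainable_params=7e9, latency_ms=100, train_mem_gb=80))
```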
Decoding the Market’s Pulse: Context-Enriched Agentic Retrieval Augmented Generation for Predicting Post-Earnings Price Shocks
Chenhui Li | Weihai Lu
Accurately forecasting large stock price movements after corporate earnings announcements is a longstanding challenge. Existing methods–sentiment lexicons, fine-tuned encoders, and standalone LLMs–often **lack temporal-causal reasoning** and are prone to **narrative bias**, echoing overly optimistic managerial tone. We introduce **Context-Enriched Agentic RAG (CARAG)**, a retrieval-augmented framework that deploys a team of cooperative LLM agents, each specializing in a distinct analytical task: evaluating historical performance, assessing the credibility of guidance, or benchmarking against peers. Agents retrieve structured evidence from a Causal-Temporal Knowledge Graph (CTKG) built from financial statements and earnings calls, enabling grounded, context-rich reasoning. This design mitigates LLM hallucinations and produces more objective predictions. Without task-specific training, our system achieves state-of-the-art zero-shot performance across NASDAQ, NYSE, and MAEC datasets, outperforming both larger LLMs and fine-tuned models in macro-F1, MCC, and Sharpe ratio, beating market benchmarks (S&P 500 and Nasdaq) for the same forecasting horizon. Code, datasets, prompts, and implementation details are included in the supplementary material to ensure full reproducibility.
LAILA: A Large Trait-Based Dataset for Arabic Automated Essay Scoring
May Bashendy | Walid Massoud | Sohaila Eltanbouly | Salam Albatarni | Marwan Sayed | Abrar Abir | Houda Bouamor | Tamer Elsayed
Automated Essay Scoring (AES) has gained increasing attention in recent years, yet research on Arabic AES remains limited due to the lack of publicly available datasets. To address this, we introduce LAILA, the largest publicly available Arabic AES dataset to date, comprising 7,859 essays annotated with holistic and trait-specific scores on seven dimensions: relevance, organization, vocabulary, style, development, mechanics, and grammar. We detail the dataset design, collection, and annotations, and provide benchmark results using state-of-the-art Arabic and English models in prompt-specific and cross-prompt settings. LAILA fills a critical need in Arabic AES research, supporting the development of robust scoring systems.
Live API-Bench: 2500+ Live APIs for Testing Multi-Step Tool Calling
Benjamin Elder | Anupama Murthi | Jungkoo Kang | Ankita Naik | Kinjal Basu | Kiran Kate | Danish Contractor
Large language models (LLMs) increasingly rely on external tools and APIs to execute complex tasks specified in natural language. Evaluating such tool-calling capabilities in realistic enterprise settings is challenging: APIs are often proprietary, heterogeneous, and difficult to share, limiting reproducible benchmarks. To address this, we introduce Live API Bench, a comprehensive benchmark constructed by transforming NL2SQL datasets into interactive API environments. Our pipeline converts SQL queries from BIRD-SQL into executable API sequences across three formulations—SLOT, SEL, and REST—covering minimal general-purpose operations, domain-specific multi-step tasks, and function-oriented RESTful interactions, respectively. The benchmark spans 11 databases with over 2,500 invocable tools, paired with human-authored queries, ground-truth API sequences, and verified final answers. Live API Bench enables systematic evaluation of core challenges in tool use, including error handling, sequential reasoning, parameter generation, response parsing, and robustness across diverse domains. We evaluate 10 LLMs and 4 ReACT agents, observing low task completion rates (7–47%), which improve modestly to 50% under interactive agent settings, highlighting substantial scope for improving LLM tool-calling performance. We release all code and data associated with this paper.
MALicious INTent Dataset and Inoculating LLMs for Enhanced Disinformation Detection
Arkadiusz Modzelewski | Witold Sosnowski | Eleni Papadopulos | Elisa Sartori | Tiziano Labruna | Giovanni Da San Martino | Adam Wierzbicki
The intentional creation and spread of disinformation poses a significant threat to public discourse. However, existing English datasets and research rarely address the intentionality behind the disinformation. This work presents MALINT, the first human-annotated English corpus developed in collaboration with expert fact-checkers to capture disinformation and its malicious intent. We utilize our novel corpus to benchmark 12 language models, including small language models (SLMs) such as BERT and large language models (LLMs) like Llama 3.3, on binary and multilabel intent classification tasks. Moreover, inspired by inoculation theory from psychology and communication studies, we investigate whether incorporating knowledge of malicious intent can improve disinformation detection. To this end, we propose intent-based inoculation, an intent-augmented reasoning for LLMs that integrates intent analysis to mitigate the persuasive impact of disinformation. Analysis on six disinformation datasets, five LLMs, and seven languages shows that intent-augmented reasoning improves zero-shot disinformation detection. To support research in intent-aware disinformation detection, we release the MALINT dataset with annotations from each annotation step.
When Meanings Meet: Investigating the Emergence and Quality of Shared Concept Spaces during Multilingual Language Model Training
Felicia Körner | Max Müller-Eberstein | Anna Korhonen | Barbara Plank
Training Large Language Models (LLMs) with high multilingual coverage is becoming increasingly important — especially when monolingual resources are scarce. Recent studies have found that LLMs process multilingual inputs in shared concept spaces, thought to support generalization and cross-lingual transfer. However, these prior studies often do not use causal methods, lack deeper error analysis or focus on the final model only, leaving open how these spaces emerge *during training*. We investigate the development of language-agnostic concept spaces during pretraining of EuroLLM through the causal interpretability method of activation patching. We isolate cross-lingual concept representations, then inject them into a translation prompt to investigate how consistently translations can be altered, independently of the language. We find that *shared concept spaces emerge early and continue to refine*, but that *alignment with them is language-dependent*. Furthermore, in contrast to prior work, our fine-grained manual analysis reveals that some apparent gains in translation quality reflect shifts in behavior — like selecting senses for polysemous words or translating instead of copying cross-lingual homographs — rather than improved translation ability. Our findings offer new insight into the training dynamics of cross-lingual alignment and the conditions under which causal interpretability methods offer meaningful insights in multilingual contexts.
Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models
Qiao Liang | Yanjiang Liu | Weixiang Zhou | Ben He | Yaojie Lu | Hongyu Lin | Jia Zheng | Xianpei Han | Le Sun | Yingfei Sun
Does the prior knowledge of the vision encoder constrain the capability boundary of Multi-modal Large Language Models (MLLMs)? While most existing research treats MLLMs as unified systems optimized through end-to-end training, the impact of vision encoder’s prior knowledge is seldom investigated. In this work, we introduce a novel metric Ranke to quantify the effect of prior knowledge of the vision encoder on MLLM performance. Our analysis reveals a positive correlation between prior knowledge and MLLM performance. Moreover, we find that domain-specific fine-tuning using solely end-to-end visual question answering (VQA) data is insufficient, particularly for entities with low inherent visual prior knowledge. To address this issue, we propose VisPRE (Vision Prior Remediation), a two-stage training framework that explicitly incorporates prior knowledge at the vision encoder level. Experimental results demonstrate that augmenting vision encoder’s prior knowledge substantially boosts the visual understanding capabilities of MLLMs, offering a novel and effective strategy for improving performance, especially in scenarios involving uncommon visual entities.
Classifying and Addressing the Diversity of Errors in Retrieval-Augmented Generation Systems
Kin Kwan Leung | Mouloud Belbahri | Yi Sui | Alex Labach | Xueying Zhang | Stephen Anthony Rose | Jesse C. Cresswell
Retrieval-augmented generation (RAG) is a prevalent approach for building LLM-based question-answering systems that can take advantage of external knowledge databases. Due to the complexity of real-world RAG systems, there are many potential causes for erroneous outputs. Understanding the range of errors that can occur in practice is crucial for robust deployment. We present a new taxonomy of the error types that can occur in realistic RAG systems, examples of each, and practical advice for addressing them. Additionally, we curate a dataset of erroneous RAG responses annotated by error types. We then propose an auto-evaluation method aligned with our taxonomy that can be used in practice to track and address errors during development. Code and data are available at https://github.com/layer6ai-labs/rag-error-classification.
Helios: A Foundational Language Model for Smart Energy Knowledge Reasoning and Application
Haoyu Jiang | Fanjie Zeng | Boan Qu | Xiaojie Lin | Wei Zhong
In the global drive toward carbon neutrality, deeply coordinated smart energy systems underpin industrial transformation, yet the interdisciplinary, fragmented, and fast-evolving expertise they require prevents general-purpose LLMs, which lack domain knowledge and physical-constraint awareness, from delivering precise, engineering-aligned inference and generation. To address these challenges, we introduce Helios, the first large language model tailored to the smart energy domain, together with a comprehensive suite of resources to advance LLM research in this field. Specifically, we develop Enersys, a multi-agent collaborative framework for end-to-end dataset construction, through which we produce: (1) the first smart energy knowledge base, EnerBase, to enrich the model’s foundational expertise; (2) the first instruction fine-tuning dataset, EnerInstruct, to strengthen performance on domain-specific downstream tasks; and (3) the first RLHF dataset, EnerReinforce, to align the model with human preferences and industry standards. Leveraging these resources, Helios undergoes large-scale pretraining, SFT, and RLHF. We also release EnerBench, the first benchmark for evaluating LLMs in smart energy scenarios, and demonstrate that our approach significantly enhances domain knowledge mastery, task execution accuracy, and alignment with human preferences.
AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders
Georgii Aparin | Tasnima Sadekova | Alexey Rukhovich | Assel Yermekova | Laida Kushnareva | Vadim Popov | Kristian Kuznetsov | Irina Piontkovskaya
Sparse Autoencoders (SAEs) are powerful tools for interpreting neural representations, yet their use in audio remains underexplored. We train SAEs across all encoder layers of Whisper and HuBERT, provide an extensive evaluation of their stability, interpretability, and show their practical utility. Over 50% of the features remain consistent across random seeds, and reconstruction quality is preserved. SAE features capture general acoustic and semantic information as well as specific events, including environmental noises and paralinguistic sounds (e.g. laughter, whispering) and disentangle them effectively, requiring removal of only 19-27% of features to erase a concept. Feature steering reduces Whisper’s false speech detections by 70% with negligible WER increase, demonstrating real-world applicability. Finally, we find SAE features correlated with human EEG activity during speech perception, indicating alignment with human neural processing. The code and checkpoints are available at https://github.com/audiosae/audiosae_demo.
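A minimal sparse-autoencoder training sketch of the kind described above; the random tensor stands in for activations from a Whisper or HuBERT encoder layer, and the dimensions and L1 coefficient are assumptions.

```python
# Minimal SAE on (stand-in) encoder-layer activations with an L1 sparsity penalty.
import torch
import torch.nn as nn

D_MODEL, D_SAE, L1 = 768, 8192, 1e-3

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_sae):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))      # sparse, non-negative feature codes
        return self.dec(feats), feats

sae = SparseAutoencoder(D_MODEL, D_SAE)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

activations = torch.randn(4096, D_MODEL)     # placeholder for real layer activations
for step in range(100):
    batch = activations[torch.randint(0, activations.size(0), (256,))]
    recon, feats = sae(batch)
    loss = ((recon - batch) ** 2).mean() + L1 * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final loss {loss.item():.4f}, "
      f"active features per input {(feats > 0).float().sum(1).mean().item():.1f}")
```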
Vision-Language Models Align with Human Neural Representations in Concept Processing
Anna Bavaresco | Marianne De Heer Kloots | Sandro Pezzelle | Raquel Fernández
Recent studies suggest that transformer-based vision-language models (VLMs) capture the multimodality of concept processing in the human brain. However, a systematic evaluation exploring different types of VLM architectures and the role played by visual and textual context is still lacking. Here, we analyse multiple VLMs employing different strategies to integrate visual and textual modalities, along with language-only counterparts. We measure the alignment between concept representations by models and existing (fMRI) brain responses to concept words presented in two experimental conditions, where either visual (pictures) or textual (sentences) context is provided. Our results reveal that VLMs outperform the language-only counterparts in both experimental conditions. However, controlled ablation studies show that only for some VLMs, such as LXMERT and IDEFICS2, brain alignment stems from genuinely learning more human-like concepts during _pretraining_, while others are highly sensitive to the context provided at _inference_. Additionally, we find that vision-language encoders are more brain-aligned than more recent, generative VLMs. Altogether, our study shows that VLMs align with human neural representations in concept processing, while highlighting differences among architectures. We open-source code and materials to reproduce our experiments at: [https://github.com/dmg-illc/vl-concept-processing](https://github.com/dmg-illc/vl-concept-processing).
FAID: Fine-grained AI-generated Text Detection using Multi-task Auxiliary and Multi-level Contrastive Learning
Minh Ngoc Ta | Dong Cao Van | Duc-Anh Hoang | Minh Le-Anh | Truong Nguyen | My Anh Tran Nguyen | Yuxia Wang | Preslav Nakov | Dinh Viet Sang
The growing collaboration between humans and AI models in generative tasks has introduced new challenges in distinguishing between *human-written*, *LLM-generated*, and *human-LLM collaborative* texts. In this work, we collect a multilingual, multi-domain, multi-generator dataset *FAIDSet*. We further introduce a fine-grained detection framework *FAID* to classify text into these three categories, and also to identify the underlying LLM family of the generator. Unlike existing binary classifiers, FAID is built to capture both authorship and model-specific characteristics. Our method combines multi-level contrastive learning with multi-task auxiliary classification to learn subtle stylistic cues. By modeling LLM families as distinct stylistic entities, we incorporate an adaptation to address distributional shifts without retraining for unseen data. Our experimental results demonstrate that FAID outperforms several baselines, particularly enhancing the generalization accuracy on unseen domains and new LLMs, thus offering a potential solution for improving transparency and accountability in AI-assisted writing. Our data and code are available at [https://github.com/mbzuai-nlp/FAID](https://github.com/mbzuai-nlp/FAID)
BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data
Jaap Jumelet | Abdellah Fourtassi | Akari Haga | Bastian Bunzeck | Bhargav Shandilya | Diana Galvan-Sosa | Faiz Ghifari Haznitrama | Francesca Padovani | Francois Meyer | Hai Hu | Julen Etxaniz | Laurent Prevot | Linyang He | María Grandury | Mila Marcheva | Negar Foroutan | Nikitas Theodoropoulos | Pouya Sadeghi | Siyuan Song | Suchir Salhan | Susana Zhou | Yurii Paniv | Ziyin Zhang | Arianna Bisazza | Alex Warstadt | Leshem Choshen
We present BabyBabelLM, a multilingual collection of datasets modeling the language a person observes from birth until they acquire a native language. We curate developmentally plausible pretraining data aiming to cover the equivalent of 100M English words of content in each of 45 languages. We compile evaluation suites and train baseline models in each language. BabyBabelLM aims to facilitate multilingual pretraining and cognitive modeling.
Personality Editing for Language Models through Adjusting Self-Referential Queries
Seojin Hwang | Yumin Kim | Byeongjeong Kim | Donghoon Shin | Hwanhee Lee
Large Language Models (LLMs) are integral to applications such as conversational agents and content creation, where precise control over a model’s personality is essential for maintaining tone, consistency, and user engagement. However, prevailing prompt-based or fine-tuning approaches either lack robustness or demand large-scale training data, making them costly and impractical. In this paper, we present PALETTE (Personality Adjustment by LLM SElf-TargeTed quEries), a novel method for personality editing in LLMs. Our approach introduces adjustment queries, where self-referential statements grounded in psychological constructs are treated analogously to factual knowledge, enabling direct editing of personality-related responses. Unlike fine-tuning, PALETTE requires only 12 editing samples to achieve substantial improvements in personality alignment across personality dimensions. Experimental results from both automatic and human evaluations demonstrate that our method enables more stable and well-balanced personality control in LLMs.
Large language models (LLMs) are increasingly adopted for handling structured data, including tabular and relational inputs, despite mostly being pretrained on unstructured text. This raises a key question: how effectively do pretrained representations from language-focused LLMs transfer to tasks involving structured inputs? We address this through controlled experiments using two small open-source LLMs, systematically re-initializing subsets of layers with random weights before fine-tuning on structured datasets and comparing the results with those on unstructured datasets. Our analyses show that, for structured data, most pretrained depth contributes little, with performance often saturating after the first few layers, whereas unstructured tasks benefit more consistently from deeper pretrained representations. Pretraining remains useful mainly in low-resource settings, with its impact diminishing as more training data becomes available.
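A sketch of the controlled setup described above, assuming GPT-2 as the small LM and four re-initialised top blocks; the actual models, layer subsets, and fine-tuning data are not specified here.

```python
# Re-initialise the top-K transformer blocks with random weights before fine-tuning,
# discarding their pretrained representations for the ablation.
import torch.nn as nn
from transformers import AutoModelForCausalLM

def reinit(module):
    """Random re-initialisation for any parametrised submodule in a block."""
    for name, param in module.named_parameters(recurse=False):
        if param.dim() >= 2:
            nn.init.normal_(param, std=0.02)   # weight matrices (attention, MLP)
        elif "bias" in name:
            nn.init.zeros_(param)
        else:
            nn.init.ones_(param)               # LayerNorm scales

model = AutoModelForCausalLM.from_pretrained("gpt2")
K = 4                                          # assumed number of top blocks to reset
for block in model.transformer.h[-K:]:
    block.apply(reinit)

# ... fine-tune `model` on a structured (tabular/relational) dataset as usual, then
# compare against the fully pretrained and fully re-initialised variants.
```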
Finding Culture-Sensitive Neurons in Vision-Language Models
Xiutian Zhao | Rochelle Choenni | Rohit Saxena | Ivan Titov
Despite their impressive performance, vision-language models (VLMs) still struggle on culturally situated inputs. To understand how VLMs process culturally grounded information, we study the presence of culture-sensitive neurons, i.e., neurons whose activations show preferential sensitivity to inputs associated with particular cultural contexts. We examine whether such neurons are important for culturally diverse visual question answering and where they are located. Using the CVQA benchmark, we identify culture-selective neurons and perform diagnostic tests by deactivating the neurons flagged by various identification methods. Experiments on three VLMs across 25 cultural groups demonstrate the existence of neurons whose ablation disproportionately harms performance on questions about the corresponding cultures, while having limited effects on others. Moreover, we introduce a new margin-based selector—Contrastive Activation Margin (ConAct)—and show that it outperforms probability- and entropy-based methods in identifying neurons associated with cultural selectivity. Finally, our layer-wise analyses reveal that such neurons are not uniformly distributed: they cluster in specific decoder layers in a model-dependent way.
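The exact ConAct definition is not given in the abstract, so the sketch below implements an assumed margin-style selector: for each neuron, the gap between its mean activation on inputs from a target culture and its strongest mean activation on any other culture, with random activations as placeholders.

```python
# Assumed margin-style selector for culture-sensitive neurons (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_cultures, n_inputs, n_neurons = 25, 40, 2048
acts = rng.normal(size=(n_cultures, n_inputs, n_neurons))   # per-culture activations

def contrastive_activation_margin(acts, target, top_k=50):
    means = acts.mean(axis=1)                       # (cultures, neurons)
    others = np.delete(means, target, axis=0)
    margin = means[target] - others.max(axis=0)     # gap to the strongest competitor
    return np.argsort(margin)[::-1][:top_k], margin

neurons, margin = contrastive_activation_margin(acts, target=3)
print("top culture-sensitive neurons for culture 3:", neurons[:10])
# Diagnostic test in the spirit of the paper: ablate (zero out) these neurons and
# compare the drop on culture-3 questions vs. questions about other cultures.
```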
Polyglots or Multitudes? Multilingual LLM Answers to Value-laden Multiple-Choice Questions
Léo Labat | Etienne Ollion | François Yvon
Multiple-Choice Questions (MCQs) are often used to assess knowledge, reasoning abilities, and even values encoded in large language models (LLMs). While the effect of multilingualism has been studied on LLM factual recall, this paper seeks to investigate the less explored question of language-induced variation in value-laden MCQ responses. Are multilingual LLMs consistent in their responses across languages, *i.e.* behave like theoretical *polyglots*, or do they answer value-laden MCQs depending on the language of the question, like a *multitude* of monolingual models expressing different values through a single model? We release a new corpus, the Multilingual European Value Survey (**MEVS**), which, unlike prior work relying on machine translation or ad hoc prompts, solely comprises human-translated survey questions aligned in 8 European languages. We administer a subset of those questions to over thirty multilingual LLMs of various sizes, manufacturers and alignment-fine-tuning status under comprehensive, controlled prompt variations including answer order, symbol type, and tail character. Our results show that while larger, instruction-tuned models display higher overall consistency, the robustness of their responses varies greatly across questions, with certain MCQs eliciting total agreement *within and across* models while others leave LLM answers split. Language-specific behavior seems to arise in all consistent, instruction-fine-tuned models, but only on certain questions, warranting a further study of the selective effect of preference fine-tuning.
ABCD-LINK: Annotation Bootstrapping for Cross-Document Fine-Grained Links
Serwar Basch | Ilia Kuznetsov | Tom Hope | Iryna Gurevych
Understanding fine-grained links between documents is crucial for many applications, yet progress is limited by the lack of efficient methods for data curation. To address this limitation, we introduce a domain-agnostic framework for bootstrapping sentence-level cross-document links from scratch. Our approach (1) generates and validates semi-synthetic datasets of linked documents, (2) uses these datasets to benchmark and shortlist the best-performing linking approaches, and (3) applies the shortlisted methods in large-scale human-in-the-loop annotation of natural text pairs. We apply the framework in two distinct domains – peer review and news – and show that combining retrieval models with LLMs achieves a 73% human approval rate for suggested links, more than doubling the acceptance of strong retrievers alone. Our framework allows users to produce novel datasets that enable systematic study of cross-document understanding, supporting downstream tasks such as media framing analysis and peer review assessment. All code, data, and annotation protocols are released to facilitate future research.
Decision-Making with Deliberation: Meta-reviewing as a Document-grounded Dialogue
Sukannya Purkayastha | Nils Dycke | Anne Lauscher | Iryna Gurevych
Meta-reviewing is a pivotal stage in the peer-review process, serving as the final step in determining whether a paper is recommended for acceptance. Prior research on meta-reviewing has treated this as a summarization problem over review reports. However, complementary to this perspective, meta-reviewing is a decision-making process that requires weighing reviewer arguments and placing them within a broader context. Prior research has demonstrated that decision-makers can be effectively assisted in such scenarios via dialogue agents. In line with this framing, we explore the practical challenges of realizing dialogue agents that can effectively assist meta-reviewers. Concretely, we first address the issue of data scarcity for training dialogue agents by generating synthetic data using Large Language Models (LLMs) based on a self-refinement strategy to improve the relevance of these dialogues to expert domains. Our experiments demonstrate that this method produces higher-quality synthetic data and can serve as a valuable resource towards training meta-reviewing assistants. Subsequently, we utilize this data to train dialogue agents tailored for meta-reviewing and find that these agents outperform *off-the-shelf* LLM-based assistants for this task. Finally, we apply our agents in real-world meta-reviewing scenarios and confirm their effectiveness in enhancing the efficiency of meta-reviewing.
HalluZig: Hallucination Detection using Zigzag Persistence
Shreyas N. Samaga | Gilberto Gonzalez Arroyo | Tamal K. Dey
The factual reliability of Large Language Models (LLMs) remains a critical barrier to their adoption in high-stakes domains due to their propensity to hallucinate. Current detection methods often rely on surface-level signals from the model’s output, overlooking the failures that occur within the model’s internal reasoning process. In this paper, we introduce a new paradigm for hallucination detection by analyzing the dynamic topology of the evolution of the model’s layer-wise attention. We model the sequence of attention matrices as a zigzag graph filtration and use zigzag persistence, a tool from Topological Data Analysis, to extract a topological signature. Our core hypothesis is that factual and hallucinated generations exhibit distinct topological signatures. We validate our framework, HalluZig, on multiple benchmarks, demonstrating that it outperforms strong baselines. Furthermore, our analysis reveals that these topological signatures are generalizable across different models and can enable early detection of hallucinations.
Large language models (LLMs) have demonstrated strong performance in a wide range of language tasks without requiring task-specific fine-tuning. However, they remain prone to hallucinations and inconsistencies, and often struggle with complex reasoning, in part due to the limitations of autoregressive generation. We propose to address some of these issues, particularly for structured prediction, by combining LLMs with combinatorial inference to marry the predictive power of LLMs with the structural consistency provided by inference methods. We perform exhaustive experiments in an effort to understand which prompting strategies can best estimate confidence values for downstream symbolic inference, and find that, independent of prompting strategy, incorporating symbolic inference yields more consistent and accurate predictions than prompting alone. Finally, we show that calibration and fine-tuning with structured learning objectives further increase performance on challenging tasks, highlighting that structured learning remains valuable in the era of LLMs.
Breach in the Shield: Unveiling the Vulnerabilities of Large Language Models
Runpeng Dai | Run Yang | Fan Zhou | Hongtu Zhu
Large Language Models and Vision-Language Models have achieved impressive performance across a wide range of tasks, yet they remain vulnerable to carefully crafted perturbations. In this study, we seek to pinpoint the sources of this fragility by identifying parameters and input dimensions (pixels or token embeddings) that are susceptible to such perturbations. To this end, we propose a stability measure called FI, First order local Influence, which is rooted in information geometry and quantifies the sensitivity of individual parameters and input dimensions. Our extensive analysis across LLMs and VLMs (from 1.5B to 13B parameters) reveals that: (I) A small subset of parameters or input dimensions with high FI values disproportionately contribute to model brittleness. (II) Mitigating the influence of these vulnerable parameters during model merging leads to improved performance.
Martingale Foresight Sampling: A Principled Approach to Inference-Time LLM Decoding
Huayu Li | ZhengXiao He | Siyuan Tian | Jinghao Wen | Ao Li
Standard autoregressive decoding in large language models (LLMs) is inherently short-sighted, often failing to find globally optimal reasoning paths due to its token-by-token generation process. While inference-time strategies like foresight sampling attempt to mitigate this by simulating future steps, they typically rely on ad-hoc heuristics for valuing paths and pruning the search space. This paper introduces Martingale Foresight Sampling (MFS), a principled framework that reformulates LLM decoding as a problem of identifying an optimal stochastic process. By modeling the quality of a reasoning path as a stochastic process, we leverage Martingale theory to design a theoretically-grounded algorithm. Our approach replaces heuristic mechanisms with principles from probability theory: step valuation is derived from the Doob Decomposition Theorem to measure a path’s predictable advantage, path selection uses Optional Stopping Theory for principled pruning of suboptimal candidates, and an adaptive stopping rule based on the Martingale Convergence Theorem terminates exploration once a path’s quality has provably converged. Experiments on six reasoning benchmarks demonstrate that MFS surpasses state-of-the-art methods in accuracy while significantly improving computational efficiency. Code will be released at https://github.com/miraclehetech/EACL2026-Martingale-Foresight-Sampling.
Is This LLM Library Learning? Evaluation Must Account For Compute and Behaviour
Ian Berlot-Attwell | Tobias Sesterhenn | Frank Rudzicz | Xujie Si
The in-context learning (ICL) coding, reasoning, and tool-using ability of LLMs has spurred interest in library learning (i.e., the creation and exploitation of reusable and composable functions, tools, or lemmas). Such systems often promise improved task performance and computational efficiency by caching reasoning (i.e., storing generated tools) - all without finetuning. However, we find strong reasons to be skeptical. Specifically, we identify a serious evaluation flaw present in a large number of ICL library learning works: these works do not correct for the difference in computational cost between baseline and library learning systems. Studying three separately published ICL library learning systems, we find that all of them fail to consistently outperform the simple baseline of prompting the model - improvements in task accuracy often vanish or reverse once computational cost is accounted for. Furthermore, we perform an in-depth examination of one such system, LEGO-Prover, which purports to learn reusable lemmas for mathematical reasoning. We find no evidence of the direct reuse of learned lemmas, and find evidence against the soft reuse of learned lemmas (i.e., reuse by modifying relevant examples). Our findings suggest that a serious re-examination of the effectiveness of ICL LLM-based library learning is required, as are much stronger standards for evaluation. An equal computational budget must be used for baselines, alongside behavioural analysis.
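A toy illustration of the compute-matched comparison argued for above: a plain resampling baseline is granted the same per-task token budget as the library-learning system, and both are scored on solve rate; the success probabilities and token counts are invented.

```python
# Compute-matched evaluation sketch: both systems spend the same token budget per task.
import random

def solve_attempt(task, use_library, rng):
    """Stand-in for one LLM attempt; returns (solved, tokens_used)."""
    p = 0.35 if use_library else 0.30
    return rng.random() < p, rng.randint(800, 1200) if use_library else rng.randint(400, 600)

def evaluate(tasks, use_library, budget_per_task, rng):
    solved = 0
    for task in tasks:
        spent = 0
        while spent < budget_per_task:
            ok, tokens = solve_attempt(task, use_library, rng)
            spent += tokens
            if ok:
                solved += 1
                break
    return solved / len(tasks)

rng = random.Random(0)
tasks = list(range(200))
budget = 3000  # tokens per task, identical for both systems
print("library learning:", evaluate(tasks, True, budget, rng))
print("matched baseline:", evaluate(tasks, False, budget, rng))
```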
Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA
Jongwoo Park | Kanchana Ranasinghe | Kumara Kahatapitiya | Wonjeong Ryu | Donghyun Kim | Michael S Ryoo
Long-form videos that span across wide temporal intervals are highly information redundant and contain multiple distinct events or entities that are often loosely related. Therefore, when performing long-form video question answering (LVQA), all information necessary to generate a correct response can often be contained within a small subset of frames. Recent literature leverages large language models (LLMs) in LVQA benchmarks, achieving exceptional performance, while relying on vision language models (VLMs) to convert all visual content within videos into natural language. Such VLMs often independently caption a large number of frames uniformly sampled from long videos, which is not efficient and can mostly be redundant. Motivated by this inefficiency, we propose LVNet, a modular and training-free framework featuring a novel Hierarchical Keyframe Selector (HKS) that efficiently selects a minimal set of informative frames tailored to each question. LVNet’s modularity allows easy integration with existing approaches for more efficient LVQA. We achieve state-of-the-art performance among similarly configured models across four benchmark LVQA datasets: EgoSchema, NExT-QA, IntentQA, and VideoMME. The code can be found at https://github.com/jongwoopark7978/LVNet
A Unified View on Emotion Representation in Large Language Models
Aishwarya Maheswaran | Maunendra Sankar Desarkar
Interest in leveraging Large Language Models (LLMs) for emotional support systems motivates the need to understand how these models comprehend and represent emotions internally. While recent works show the presence of emotion concepts in the hidden state representations, it’s unclear if the model has a robust representation that is consistent across different datasets. In this paper, we present a unified view to understand emotion representation in LLMs, experimenting with diverse datasets and prompts. We then evaluate the reasoning ability of the models on a complex emotion identification task. We find that LLMs have a common emotion representation in the later layers of the model, and the vectors capturing the direction of emotions extracted from these representations can be interchanged among datasets with minimal impact on performance. Our analysis of reasoning with Chain of Thought (CoT) prompting shows the limits of emotion comprehension. Therefore, despite LLMs implicitly having emotion representations, they are not equally skilled at reasoning with them in complex scenarios. This motivates the need for further research to find new approaches.
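A sketch of the direction-extraction and cross-dataset transfer test described above, assuming hidden states have already been pulled from a later layer of an LLM; random arrays stand in for those representations.

```python
# Extract an "emotion direction" from later-layer hidden states and reuse it across datasets.
import numpy as np

rng = np.random.default_rng(1)
d = 4096
# Placeholders: last-token hidden states for texts labelled "joy" vs. all others,
# from dataset A and dataset B.
joy_A, other_A = rng.normal(size=(100, d)), rng.normal(size=(300, d))
joy_B, other_B = rng.normal(size=(80, d)), rng.normal(size=(240, d))

def emotion_direction(pos, neg):
    v = pos.mean(0) - neg.mean(0)
    return v / np.linalg.norm(v)

def accuracy(direction, pos, neg):
    scores = np.concatenate([pos @ direction, neg @ direction])
    labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    preds = scores > np.median(scores)          # crude threshold for the sketch
    return (preds == labels).mean()

dir_A = emotion_direction(joy_A, other_A)
print("dataset A direction on dataset A:", accuracy(dir_A, joy_A, other_A))
print("dataset A direction on dataset B:", accuracy(dir_A, joy_B, other_B))  # transfer test
```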
TRACE: A Framework for Analyzing and Enhancing Stepwise Reasoning in Vision-Language Models
Shima Imani | Seungwhan Moon | Lambert Mathias | Lu Zhang | Babak Damavandi
Reliable mathematical and scientific reasoning remains an open challenge for large vision–language models (VLMs). Standard final-answer evaluation often masks reasoning errors, allowing silent failures to persist. To address this gap, we introduce TRACE (Transparent Reasoning And Consistency Evaluation), a framework for analyzing, diagnosing, and improving reasoning in VLMs. At its core, TRACE leverages Auxiliary Reasoning Sets (ARS), compact sub-question–answer pairs that decompose complex problems, evaluate intermediate steps through consistency-based metrics, and expose failures overlooked by standard evaluation. Our experiments show that consistency across ARS is linked to final-answer correctness and helps pinpoint the reasoning steps where failures arise, offering actionable signals for model improvement.
ARC: Argument Representation and Coverage Analysis for Zero-Shot Long Document Summarization with Instruction Following LLMs
Mohamed Elaraby | Diane Litman
We introduce Argument Representation Coverage (ARC), a bottom-up evaluation framework that assesses how well summaries preserve structured salient arguments, a crucial issue in summarizing high-stakes domains such as law. ARC provides an interpretable lens by distinguishing between different information types to be covered and by separating omissions from factual errors. Using ARC, we evaluate summaries from eight open-weight LLMs in two domains where argument roles are central: long legal opinions and scientific articles. Our results show that while LLMs capture some salient roles, they frequently omit critical information, particularly when arguments are sparsely distributed across the input. Moreover, ARC uncovers systematic patterns—showing how context window positional bias and role-specific preferences shape argument coverage—providing actionable guidance for developing more complete and reliable summarization strategies.
AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation
Potsawee Manakul | Woody Haosheng Gan | Michael J Ryan | Ali Sartaz Khan | Warit Sirichotedumrong | Kunat Pipatanakul | William Barr Held | Diyi Yang
Current speech evaluation suffers from two critical limitations: the need and difficulty of designing specialized systems targeting individual audio characteristics, and poor correlation between automatic evaluation methods and human preferences. This work presents a systematic study of Large Audio Model (LAM) as a Judge, AudioJudge, investigating whether it can provide a unified evaluation framework that addresses both challenges. We systematically explore AudioJudge across audio characteristic detection tasks, including pronunciation, speaking rate, speaker identification and speech quality, and system-level human preference simulation for automated benchmarking. We investigate different prompt engineering strategies, finding that audio concatenation combined with in-context learning significantly improves performance across both audio characteristic detection and human preference simulation tasks. We further introduce a multi-aspect ensemble AudioJudge to enable general-purpose multi-aspect audio evaluation. This method decomposes speech assessment into specialized judges for lexical content, speech quality, and paralinguistic features, achieving up to 0.91 Spearman correlation with human preferences on our system ranking benchmark. Robustness analysis reveals that while LAMs maintain strong performance under acoustic noise, they exhibit significant verbosity and positional biases that require careful mitigation.
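A sketch of the multi-aspect ensemble idea: specialised judges score each system on one aspect, the aspect scores are combined into a system-level score, and the resulting ranking is correlated with human preference; the scores and aspect weights below are invented.

```python
# Multi-aspect judge ensemble: combine per-aspect scores, then correlate with humans.
import numpy as np
from scipy.stats import spearmanr

systems = ["sys_a", "sys_b", "sys_c", "sys_d", "sys_e"]
judge_scores = {                       # per-aspect scores from specialised judges
    "lexical":        [4.1, 3.2, 4.6, 2.8, 3.9],
    "speech_quality": [3.8, 3.5, 4.2, 3.0, 3.6],
    "paralinguistic": [3.5, 3.1, 4.4, 2.9, 3.3],
}
weights = {"lexical": 0.4, "speech_quality": 0.3, "paralinguistic": 0.3}  # assumed weights

ensemble = np.zeros(len(systems))
for aspect, scores in judge_scores.items():
    ensemble += weights[aspect] * np.asarray(scores)

human_preference = [3.9, 3.0, 4.5, 2.7, 3.4]   # invented human system-level ratings
rho = spearmanr(ensemble, human_preference).correlation
print(dict(zip(systems, ensemble.round(2))), f"Spearman vs. humans: {rho:.2f}")
```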
Large Language Models (LLMs) are highly effective at adapting to users’ styles, preferences, and contextual signals, a property that underlies much of their practical usefulness but can also manifest as sycophancy, i.e., alignment with user-implied beliefs even when these contradict factual accuracy or rational reasoning. Prior work treats sycophancy as a surface-level artefact addressed via inference-time or post-hoc methods. We argue that it is a policy-level failure arising from missing agentic control over agreement under pressure. To make sycophancy amenable to explicit control, we propose learning agentic policies that model LLM behaviour as a decision-making problem. Our approach equips a single model with an explicit action space that includes answering directly, countering misleading signals, or asking for clarification. The policy is trained to optimise a multi-objective reward that balances task success, sycophancy resistance, and behavioural consistency via a control mechanism that operates through agentic behaviour. We evaluate the method on different benchmarks, showing that the approach reduces sycophancy, improves performance, and generalises robustly across languages. These findings suggest that mitigating sycophancy requires moving beyond compliance-oriented generation towards agentic control of agreement.
ToxiPrompt: A Two-Stage Red-Teaming Approach for Balancing Adversarial Prompt Diversity and Response Toxicity
Seungho Lee | Kyumin Lee
While large language models (LLMs) offer great promise, they also pose concrete safety risks. To audit and mitigate these risks, researchers have developed automated red-teaming methods, which generate adversarial prompts to elicit unsafe behavior from target LLMs during evaluation. Recent automated red-teaming methods for LLMs face a persistent trade-off: techniques that increase prompt diversity often reduce the level of toxicity elicited from the target LLMs, while toxicity-maximizing methods tend to collapse diversity. To address these limitations, we propose ToxiPrompt, a two-stage framework that explicitly separates exploration (diversity) from exploitation (toxicity) and reunifies them under a single selection criterion that balances diversity and toxicity. Experimental results show that ToxiPrompt outperforms four state-of-the-art baselines in both adversarial prompt diversity and the level of toxicity elicited from target LLMs, improving the harmonic mean of toxicity and diversity by 14.6% over the best baseline. The approach also performs well for multiple instruction-tuned target LLMs (Llama-2/3, Qwen, Mistral) without re-tuning, achieving up to a 55% improvement in harmonic mean over the best baseline. Our code is available at https://github.com/seungho715/ToxiPrompt
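As a rough illustration of the kind of single selection criterion described above, the sketch below scores candidate prompts by the harmonic mean of a toxicity score and a diversity score. The candidate names and scores are purely hypothetical, and the actual ToxiPrompt criterion may be defined differently.

```python
# Illustrative sketch only: rank candidate adversarial prompts by the harmonic
# mean of (assumed) toxicity and diversity scores, both normalized to [0, 1].
def harmonic_mean(toxicity, diversity, eps=1e-8):
    return 2 * toxicity * diversity / (toxicity + diversity + eps)

# Hypothetical candidates as (prompt, toxicity, diversity) triples.
candidates = [("prompt A", 0.9, 0.2), ("prompt B", 0.6, 0.7), ("prompt C", 0.3, 0.9)]
best = max(candidates, key=lambda c: harmonic_mean(c[1], c[2]))
print(best)  # ('prompt B', 0.6, 0.7): the candidate that balances both objectives
```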
AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages
Kosei Uemura | Miaoran Zhang | David Ifeoluwa Adelani
Text embeddings are an essential building block of several NLP tasks, such as retrieval-augmented generation, which is crucial for preventing hallucinations in LLMs. Despite the recent release of the massively multilingual MTEB (MMTEB), African languages remain underrepresented, with existing tasks often repurposed from translation benchmarks such as FLORES clustering or SIB-200. In this paper, we introduce AfriMTEB, a regional expansion of MMTEB covering 59 languages, 14 tasks, and 38 datasets, including six newly added datasets. Unlike many MMTEB datasets that include fewer than five languages, the new additions span 14 to 56 African languages and introduce entirely new tasks, such as hate speech detection, intent detection, and emotion classification, which were not previously covered. Complementing this, we present AfriE5, an adaptation of the instruction-tuned mE5 model to African languages through cross-lingual contrastive distillation. Our evaluation shows that AfriE5 achieves state-of-the-art performance, outperforming strong baselines such as Gemini-Embeddings and mE5.
Better Generalizing to Unseen Concepts: An Evaluation Framework and An LLM-Based Auto-Labeled Pipeline for Biomedical Concept Recognition
Shanshan Liu | Noriki Nishida | Fei Cheng | Narumi Tokunaga | Rumana Ferdous Munne | Yuki Yamagata | Kouji Kozaki | Takehito Utsuro | Yuji Matsumoto
Generalization to unseen concepts is a central challenge in Mention-agnostic Biomedical Concept Recognition (MA-BCR) due to the scarcity of human annotations. This work makes two key contributions to systematically address this issue. First, we propose an evaluation framework built on hierarchical concept indices and novel metrics to measure generalization. Second, we explore LLM-based Auto-Labeled Data (ALD) as a scalable resource, creating a task-specific pipeline for its generation. Our research shows that while LLM-generated ALD cannot fully substitute for manual annotations, it is a valuable resource for improving generalization, providing models with the broader coverage and structural knowledge needed to recognize unseen concepts. Code and datasets are available at https://github.com/bio-ie-tool/hi-ald.
A Representation Sharpening Framework for Zero Shot Dense Retrieval
Dhananjay Ashok | Suraj Nair | Mutasem Al-Darabsah | Choon Hui Teo | Tarun Agarwal | Jonathan May
Zero-shot dense retrieval is a challenging setting where a document corpus is provided without relevant queries, necessitating a reliance on pretrained dense retrievers (DRs). However, since these DRs are not trained on the target corpus, they struggle to represent semantic differences between similar documents. To address this failing, we introduce a training-free representation sharpening framework that augments a document’s representation with information that helps differentiate it from similar documents in the corpus. On over twenty datasets spanning multiple languages, the representation sharpening framework proves consistently superior to traditional retrieval, setting a new state-of-the-art on the BRIGHT benchmark. We show that representation sharpening is compatible with prior approaches to zero-shot dense retrieval and consistently improves their performance. Finally, we address the performance-cost tradeoff presented by our framework and devise an indexing-time approximation that preserves the majority of our performance gains over traditional retrieval, yet suffers no additional inference-time cost.
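The abstract does not specify the sharpening operation itself. As one plausible, purely hypothetical instantiation of differentiating a document from its close neighbours in the corpus, the sketch below pushes each embedding away from the centroid of its k most similar documents; the function, alpha, and k are illustrative assumptions, not the paper's method.

```python
import numpy as np

# Hypothetical sketch of "sharpening" document embeddings: move each document
# away from the centroid of its nearest neighbours so near-duplicates separate.
def sharpen(doc_embs, k=5, alpha=0.3):
    normed = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)                  # exclude self-similarity
    sharpened = doc_embs.copy()
    for i in range(len(doc_embs)):
        neighbours = np.argsort(sims[i])[-k:]        # k most similar documents
        centroid = doc_embs[neighbours].mean(axis=0)
        sharpened[i] = doc_embs[i] - alpha * centroid
    return sharpened / np.linalg.norm(sharpened, axis=1, keepdims=True)

# Toy usage on random embeddings standing in for a pretrained dense retriever.
embs = np.random.randn(100, 64).astype(np.float32)
sharp = sharpen(embs, k=5, alpha=0.3)
```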
Spotlight Your Instructions: Instruction-following with Dynamic Attention Steering
Praveen Venkateswaran | Danish Contractor
In many real-world applications, users rely on natural language instructions to guide large language models (LLMs) across a wide range of tasks. These instructions are often complex, diverse, and subject to frequent change. However, LLMs do not always attend to these instructions reliably, and users lack simple mechanisms to emphasize their importance beyond modifying prompt wording or structure. To address this, we present an inference-time method that enables users to emphasize specific parts of their prompt by steering the model’s attention toward them, aligning the model’s perceived importance of different prompt tokens with user intent. Unlike prior approaches that are limited to static instructions, require significant offline profiling, or rely on fixed biases, we dynamically update the proportion of model attention given to the user-specified parts, ensuring improved instruction following without performance degradation. We demonstrate that our approach improves instruction following across a variety of tasks involving multiple instructions and generalizes across models of varying scales.
End-to-end form filling refers to automatically populating the fields in a document-style form with the appropriate information derived from external data. Although prevalent and useful, no formal benchmark exists for evaluating systems’ form completion accuracy. Existing datasets focus on parsing, extraction, and web form interaction, rather than end-to-end completion of document-style forms. We propose FormGym, a benchmark formulation of the end-to-end form filling task that evaluates form completion accuracy. We construct FormGym by repurposing three existing datasets and adding one new dataset to achieve more challenging, diverse, and realistic test cases. Our studies show that baseline vision language agents (VLAs) perform poorly on FormGym in every scenario, primarily due to poor field localization. GUI agents perform better but suffer from high latency and costs. We therefore also introduce FieldFinder, a field localization tool that enables zero-shot VLAs to find and accurately place text in input fields. We find that VLAs augmented with FieldFinder achieve better performance than the baselines across all models.
NarraBench: A Comprehensive Framework for Narrative Benchmarking
Sil Hamilton | Matthew Wilkens | Andrew Piper
We present NarraBench, a theory-informed taxonomy of narrative-understanding tasks, as well as an associated survey of 78 existing benchmarks in the area. We find significant need for new evaluations covering aspects of narrative understanding that are either overlooked in current work or are poorly aligned with existing metrics. Specifically, we estimate that only 27% of narrative tasks are well captured by existing benchmarks, and we note that some areas – including narrative events, style, perspective, and revelation – are nearly absent from current evaluations. We also note the need for increased development of benchmarks capable of assessing constitutively subjective and perspectival aspects of narrative, that is, aspects for which there is generally no single correct answer. Our taxonomy, survey, and methodology are of value to NLP researchers seeking to test LLM narrative understanding.
FaithLM: Towards Faithful Explanations for Large Language Models
Yu-Neng Chuang | Guanchu Wang | Chia-Yuan Chang | Ruixiang Tang | Shaochen Zhong | Fan Yang | Andrew Wen | Mengnan Du | Xuanting Cai | Vladimir Braverman | Xia Hu
Large language models (LLMs) increasingly produce natural language explanations, yet these explanations often lack faithfulness, and they do not reliably reflect the evidence the model uses to decide. We introduce FaithLM, a model-agnostic framework that evaluates and improves the faithfulness of LLM explanations without token masking or task-specific heuristics. FaithLM formalizes explanation faithfulness as an intervention property: a faithful explanation should yield a prediction shift when its content is contradicted. Theoretical analysis shows that the resulting contrary-hint score is a sound and discriminative estimator of faithfulness. Building on this principle, FaithLM iteratively refines both the elicitation prompt and the explanation to maximize the measured score. Experiments on three multi-domain datasets and multiple LLM backbones demonstrate that FaithLM consistently increases faithfulness and produces explanations more aligned with human rationales than strong self-explanation baselines. These findings highlight that intervention-based evaluation, coupled with iterative optimization, provides a principled route toward faithful and reliable LLM explanations.
Is Information Density Uniform when Utterances are Grounded on Perception and Discourse?
Matteo Gay | Coleman Haley | Mario Giulianelli | Edoardo Ponti
The Uniform Information Density (UID) hypothesis posits that speakers are subject to a communicative pressure to distribute information evenly within utterances, minimising surprisal variance. While this hypothesis has been tested empirically, prior studies are limited exclusively to text-only inputs, abstracting away from the perceptual context in which utterances are produced. In this work, we present the first computational study of UID in visually grounded settings. We estimate surprisal using multilingual vision-and-language models over image–caption data in 30 languages and visual storytelling data in 13 languages, together spanning 11 families. We find that grounding on perception consistently smooths the distribution of information, increasing both global and local uniformity across typologically diverse languages compared to text-only settings. In visual narratives, grounding in both image and discourse contexts has additional effects, with the strongest surprisal reductions occurring at the onset of discourse units. Overall, this study takes a first step towards modelling the temporal dynamics of information flow in ecologically plausible, multimodal language use, and finds that grounded language exhibits greater information uniformity, supporting a context-sensitive formulation of UID.
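For readers unfamiliar with how uniformity is typically quantified, the sketch below computes two common operationalizations from a sequence of per-word surprisal values: global non-uniformity as surprisal variance and local non-uniformity as the mean squared difference between adjacent words. These are generic formulations; the paper's exact measures may differ.

```python
# Generic sketch of two UID operationalizations (not necessarily the paper's):
# lower values indicate a more uniform distribution of information.
def uid_scores(surprisals):
    n = len(surprisals)
    mean = sum(surprisals) / n
    global_nonuniformity = sum((s - mean) ** 2 for s in surprisals) / n
    local_nonuniformity = (
        sum((a - b) ** 2 for a, b in zip(surprisals, surprisals[1:])) / (n - 1)
        if n > 1 else 0.0
    )
    return global_nonuniformity, local_nonuniformity

print(uid_scores([3.1, 2.9, 3.0, 3.2]))   # relatively uniform utterance
print(uid_scores([0.5, 6.0, 0.4, 5.8]))   # spiky, far less uniform utterance
```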
KAD: A Framework for Proxy-based Test-time Alignment with Knapsack Approximation Deferral
Ayoub Hammal | Pierre Zweigenbaum | Caio Corro
Several previous works have concluded that the largest part of the generation capabilities of large language models (LLMs) is learned (early) during pre-training. However, LLMs still require further alignment to adhere to downstream task requirements and stylistic preferences, among other desired properties. As LLMs continue to scale in size, the computational cost of alignment procedures increases prohibitively. In this work, we propose a novel approach to circumvent these costs via proxy-based test-time alignment, i.e., using guidance from a small aligned model. Our approach can be described as a token-specific cascading method, where the token-specific deferral rule is reduced to a 0-1 knapsack problem. In this setting, we derive primal and dual approximations of the optimal deferral decision. We experimentally show the benefits of our method in both task performance and speculative decoding speed.
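As a rough, self-contained illustration of casting a deferral decision as a 0-1 knapsack (the paper's primal and dual approximations are not reproduced here), the sketch below selects which token positions to defer to the larger model under a total cost budget. The gains, costs, and budget are hypothetical.

```python
# Illustrative sketch (not the paper's code): choose which tokens to defer to a
# large model by solving a 0-1 knapsack. `gains[i]` is an assumed estimate of
# how much deferral improves token i; `costs[i]` is its integer deferral cost.
def knapsack_deferral(gains, costs, budget):
    """Return the token indices to defer under a total cost budget."""
    n = len(gains)
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        g, c = gains[i - 1], costs[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]
            if c <= b and best[i - 1][b - c] + g > best[i][b]:
                best[i][b] = best[i - 1][b - c] + g
    chosen, b = [], budget            # backtrack to recover deferred tokens
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= costs[i - 1]
    return sorted(chosen)

print(knapsack_deferral(gains=[0.9, 0.2, 0.7], costs=[2, 1, 2], budget=3))  # -> [0, 1]
```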
When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM Evaluation
Abeer Badawi | Elahe Rahimi | Md Tahmid Rahman Laskar | Sheri Grach | Lindsay Bertrand | Lames Danok | Prathiba Dhanesh | Jimmy Huang | Frank Rudzicz | Elham Dolatabadi
Evaluating Large Language Models (LLMs) for mental health support poses unique challenges for reliable evaluation due to the emotionally and cognitively complex nature of therapeutic dialogue. Existing benchmarks are limited in scale, authenticity, and reliability, often relying on synthetic or social media data, and lack frameworks to assess when automated judges can be trusted. To address the need for large-scale authentic dialogue datasets and judge-reliability assessment, we introduce two benchmarks that provide a framework for generation and evaluation in this domain. MentalBench-100k consolidates 10,000 authentic single-session therapeutic conversations from datasets covering three real-world scenarios, each paired with nine LLM-generated responses, yielding 100,000 response pairs. MentalAlign-70k reframes evaluation by comparing four high-performing LLM judges with human experts across 70,000 ratings on seven attributes, grouped into the Cognitive Support Score (CSS) and the Affective Resonance Score (ARS). We then employ the Affective–Cognitive Agreement Framework, a statistical methodology using intraclass correlation coefficients (ICC) with confidence intervals to quantify agreement, consistency, and bias between LLM judges and human experts. Our analysis reveals systematic inflation by LLM judges, strong reliability for cognitive attributes such as guidance and informativeness, reduced precision for empathy, and some unreliability in safety and relevance. Our contributions establish new methodological and empirical foundations for the reliable and large-scale evaluation of LLMs in mental health contexts.
DocPolarBERT: A Pre-trained Model for Document Understanding with Relative Polar Coordinate Encoding of Layout Structures
Benno Uthayasooriyar | Antoine Ly | Franck Vermet | Caio Corro
We propose a novel self-attention mechanism for document understanding that takes text block positions into account in a relative polar coordinate system rather than a Cartesian one. Based on this mechanism, we build DocPolarBERT, a layout-aware BERT model for document understanding that eliminates the need for absolute 2D positional embeddings. Despite being pre-trained on a dataset more than six times smaller than the widely used IIT-CDIP corpus, DocPolarBERT achieves state-of-the-art results. These results demonstrate that a carefully designed attention mechanism can compensate for reduced pre-training data, offering an efficient and effective alternative for document understanding.
IDEAlign: Comparing Ideas of Large Language Models to Domain Experts
HyunJi Nam | Lucía Langlois | Jim Malamut | Mei Tan | Dorottya Demszky
Large language models (LLMs) are increasingly used to produce open-ended, interpretive annotations, yet there is no validated, scalable measure of ***idea-level similarity*** to expert annotations. We (i) introduce the content evaluation of LLM annotations as a core, understudied task, (ii) propose IDEAlign for capturing expert similarity judgments via the odd-one-out tasks, and (iii) benchmark various similarity methods, such as text embeddings, topic models, and LLM-as-a-judge, against these human ratings. Applying this approach to two real-world educational datasets (interpreting math reasoning and feedback generation), we find that most metrics fail to capture the nuanced dimensions of similarity meaningful to experts. LLM-as-a-judge performs best (11–18% improvement over other methods) but still falls short of expert alignment, making it useful as a triage filter rather than a substitute for human review. Our work demonstrates the difficulty of evaluating open-ended LLM annotations at scale, and positions IDEAlign as a reusable protocol for benchmarking on this task, thereby informing responsible deployment of LLMs.
Amory: Building Coherent Narrative-Driven Agent Memory through Agentic Reasoning
Yue Zhou | Xiaobo Guo | Belhassen Bayar | Srinivasan H. Sengamedu
Long-term conversational agents face a fundamental scalability challenge as interactions extend over time: repeatedly processing entire conversation histories becomes computationally prohibitive. Current approaches attempt to solve this through memory frameworks that predominantly fragment conversations into isolated embeddings or graph representations and retrieve relevant ones in a RAG style. While computationally efficient, these methods often treat memory formation minimally and fail to capture the subtlety and coherence of human memory. We introduce Amory, a working memory framework that actively constructs structured memory representations through agentic reasoning performed offline. Amory organizes conversational fragments into episodic narratives, consolidates memories with momentum, and semanticizes peripheral facts into semantic memory. At retrieval time, the system employs coherence-driven reasoning over narrative structures. Evaluated on the LOCOMO benchmark for long-term reasoning, Amory achieves considerable improvements over the previous state-of-the-art, with performance comparable to full-context reasoning while reducing response time by 50%. Analysis shows that momentum-aware consolidation significantly enhances response quality, while coherence-driven retrieval provides superior memory coverage compared to embedding-based approaches.
It’s All About the Confidence: An Unsupervised Approach for Multilingual Historical Entity Linking using Large Language Models
Cristian Santini | Marieke van Erp | Mehwish Alam
Despite recent advancements in NLP with the advent of Large Language Models (LLMs), Entity Linking (EL) for historical texts remains challenging due to linguistic variation, noisy inputs, and evolving semantic conventions. Existing solutions either require substantial training data or rely on domain-specific rules that limit scalability. In this paper, we present MHEL-LLaMo (Multilingual Historical Entity Linking with Large Language MOdels), an unsupervised ensemble approach combining a Small Language Model (SLM) and an LLM. MHEL-LLaMo leverages a multilingual bi-encoder (BELA) for candidate retrieval and an instruction-tuned LLM for NIL prediction and candidate selection via prompt chaining. Our system uses the SLM’s confidence scores to discriminate between easy and hard samples, applying the LLM only to hard cases. This strategy reduces computational costs while preventing hallucinations on straightforward cases. We evaluate MHEL-LLaMo on four established benchmarks in six European languages (English, Finnish, French, German, Italian, and Swedish) from the 19th and 20th centuries. Results demonstrate that MHEL-LLaMo outperforms state-of-the-art models without requiring fine-tuning, offering a scalable solution for low-resource historical EL. Our error analysis reveals that 41% of false predictions exhibit semantic proximity to ground-truth entities, highlighting the LLM’s accurate disambiguation of historical references.
SoS: Analysis of Surface over Semantics in Multilingual Text-To-Image Generation
Carolin Holtermann | Florian Schneider | Anne Lauscher
Text-to-image (T2I) models are increasingly employed by users worldwide. However, prior research has pointed to the high sensitivity of T2I models to the input language: when faced with languages other than English (i.e., different surface forms of the same prompt), T2I models often produce culturally stereotypical depictions, prioritizing the surface form over the prompt’s semantics. Yet a comprehensive analysis of this behavior, which we dub Surface-over-Semantics (SoS), is missing. We present the first analysis of T2I models’ SoS tendencies. To this end, we create a set of prompts covering 171 cultural identities, translated into 14 languages, and use it to prompt seven T2I models. To quantify SoS tendencies across models, languages, and cultures, we introduce a novel measure and analyze how the tendencies we identify manifest visually. We show that all but one model exhibit a strong surface-level tendency in at least two languages, with this effect intensifying across the layers of T2I text encoders. Moreover, these surface tendencies frequently correlate with stereotypical visual depictions.
Gender and Politeness Perception: A Novel Approach for Exploring Annotations Disagreement
Ahmad Aljanaideh
Politeness is an important social phenomenon which influences the flow of conversations. Several studies have proposed models to discover and analyze linguistic cues associated with (im)polite language. However, no prior work has computationally studied how politeness perception interacts with other social dimensions such as gender. We propose a model for the automatic discovery of linguistic patterns which correlate with disagreement in politeness annotations, specifically focusing on gender differences. The model discovers fine-grained context patterns of words which correlate with disagreement in politeness annotations between men and women annotators. We apply the proposed model to emails annotated for politeness. Results show that women rate emails which contain formal cues (e.g., “To whom it may concern”) as more polite than men do, while men rate emails exhibiting informal language cues (e.g., “haven’t seen my new swing”) as more polite than women do. Our findings highlight the importance of studying politeness through multiple demographic perspectives.
TempViz: On the Evaluation of Temporal Knowledge in Text-to-Image Models
Carolin Holtermann | Nina Krebs | Anne Lauscher
Time alters the visual appearance of entities in our world, like objects, places, and animals. Thus, for accurately generating contextually relevant images, knowledge and reasoning about time can be crucial (e.g., for generating a landscape in spring vs. in winter). Yet, although substantial work exists on understanding and improving temporal knowledge in natural language processing, research on how temporal phenomena appear and are handled in text-to-image (T2I) models remains scarce. We address this gap with TempViz, the first dataset to holistically evaluate temporal knowledge in image generation, consisting of 7.9k prompts and more than 600 reference images. Using TempViz, we study the capabilities of five T2I models across five temporal knowledge categories. Human evaluation shows that temporal competence is generally weak, with no model exceeding 75% accuracy across categories. Towards larger-scale studies, we also examine automated evaluation methods, comparing several established approaches against human judgments. However, none of these approaches provides a reliable assessment of temporal cues, further indicating the pressing need for future research on temporal knowledge in T2I.
ToxiGAN: Toxic Data Augmentation via LLM-Guided Directional Adversarial Generation
Peiran Li | Jan Fillies | Adrian Paschke
Augmenting toxic language data in a controllable and class-specific manner is crucial for improving robustness in toxicity classification, yet remains challenging due to limited supervision and distributional skew. We propose ToxiGAN, a class-aware text augmentation framework that combines adversarial generation with semantic guidance from large language models (LLMs). To address common issues in GAN-based augmentation such as mode collapse and semantic drift, ToxiGAN introduces a two-step directional training strategy and leverages LLM-generated neutral texts as semantic ballast. Unlike prior work that treats LLMs as static generators, our approach dynamically selects neutral exemplars to provide balanced guidance. Toxic samples are explicitly optimized to diverge from these exemplars, reinforcing class-specific contrastive signals. Experiments on four hate speech benchmarks show that ToxiGAN achieves the strongest average performance in both macro-F1 and hate-F1, consistently outperforming traditional and LLM-based augmentation methods. Ablation and sensitivity analyses further confirm the benefits of semantic ballast and directional training in enhancing classifier robustness.
Text Classification Under Class Distribution Shift: A Survey
Adriana Valentina Costache | Silviu-Florin Gheorghe | Eduard Poesina | Paul Irofti | Radu Tudor Ionescu
The basic underlying assumption of machine learning (ML) models is that the training and test data are sampled from the same distribution. However, in daily practice, this assumption is often broken, i.e. the distribution of the test data changes over time, which hinders the application of conventional ML models. One domain where the distribution shift naturally occurs is text classification, since people always find new topics to discuss. To this end, we survey research articles studying open-set text classification and related tasks. We divide the methods in this area based on the constraints that define the kind of distribution shift and the corresponding problem formulation, i.e. learning with the Universum, zero-shot learning, and open-set learning. We next discuss the predominant mitigation approaches for each problem setup. We further identify several future work directions, aiming to push the boundaries beyond the state of the art. Finally, we explain how continual learning can solve many of the issues caused by the shifting class distribution. We maintain a list of relevant papers at https://github.com/Eduard6421/Open-Set-Survey.
Reasoning’s Razor: Reasoning Improves Accuracy but Hurts Recall at Critical Operating Points in Safety and Hallucination Detection
Atoosa Chegini | Hamid Kazemi | Garrett Souza | Maria Safi | Yang Song | Samy Bengio | Sinead Williamson | Mehrdad Farajtabar
Reasoning has become a central paradigm for large language models (LLMs), consistently boosting accuracy across diverse benchmarks. Yet its suitability for precision-sensitive use remains unclear. We present the first systematic study of reasoning for classification tasks under strict low false positive rate (FPR) regimes. Our analysis covers two tasks—safety detection and hallucination detection—evaluated in both fine-tuned and zero-shot settings, using standard LLMs and Large Reasoning Models (LRMs). Our results reveal a clear trade-off: Think On (reasoning-augmented) generation improves overall accuracy, but performs poorly at the low-FPR thresholds essential for practical use. In contrast, Think Off (no reasoning during inference) dominates in these precision-sensitive regimes, with Think On surpassing only when higher FPRs are acceptable. In addition, we find token-based scoring substantially outperforms self-verbalized confidence for precision-sensitive deployments. Finally, a simple ensemble of the two modes recovers the strengths of each. Taken together, our findings position reasoning as a double-edged tool: beneficial for average accuracy, but often ill-suited for applications requiring strict precision.
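To make the "critical operating point" concrete, the sketch below shows the generic procedure of fixing a decision threshold so that the false positive rate stays at or below a target on validation data, then reading off recall at that threshold. The scores, labels, and helper function are illustrative and not taken from the paper.

```python
# Generic sketch of evaluating a detector at a strict low-FPR operating point.
def recall_at_fpr(scores, labels, target_fpr=0.01):
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    # Highest threshold that admits at most target_fpr false positives
    # (assumes no ties at the threshold value).
    allowed_fp = int(target_fpr * len(negatives))
    threshold = negatives[allowed_fp] if allowed_fp < len(negatives) else float("-inf")
    positives = [s for s, y in zip(scores, labels) if y == 1]
    recall = sum(s > threshold for s in positives) / len(positives)
    return threshold, recall

# Toy detector scores and gold labels (1 = unsafe / hallucinated, 0 = benign).
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1,    1,   1,   0,   1,   0,   0,   0]
print(recall_at_fpr(scores, labels, target_fpr=0.0))  # (0.7, 0.75)
```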
Instructional Agents: Reducing Teaching Faculty Workload through Multi-Agent Instructional Design
Huaiyuan Yao | Wanpeng Xu | Justin Turnau | Nadia Kellam | Hua Wei
Preparing high-quality instructional materials remains a labor-intensive process that often requires extensive coordination among teaching faculty, instructional designers, and teaching assistants. In this work, we present Instructional Agents, a multi-agent large language model (LLM) framework designed to automate end-to-end course material generation, including syllabus creation, lecture scripts, LaTeX-based slides, and assessments. Unlike existing AI-assisted educational tools that focus on isolated tasks, Instructional Agents simulates role-based collaboration among educational agents to produce cohesive and pedagogically aligned content. The system operates in four modes: Autonomous, Catalog-Guided, Feedback-Guided, and Full Co-Pilot mode, enabling flexible control over the degree of human involvement. We evaluate Instructional Agents across five university-level computer science courses and show that it produces high-quality instructional materials while significantly reducing development time and human workload. By supporting institutions with limited instructional design capacity, Instructional Agents provides a scalable and cost-effective framework to democratize access to high-quality education, particularly in underserved or resource-constrained settings.
Rethinking Reading Order: Toward Generalizable Document Understanding with LLM-based Relation Modeling
Weishi Wang | Hengchang Hu | Daniel Dahlmeier
Document understanding requires modeling both structural and semantic relationships between the layout elements within the document, with human-perceived reading order (RO) playing a crucial yet often neglected role compared to heuristic OCR sequences used by most existing models. Previous approaches depend on costly, inconsistent human annotations, limiting scalability and generalization. To bridge the gap, we propose a cost-effective paradigm that leverages large language models (LLMs) to infer global RO and inter-element layout relations without human supervision. By explicitly incorporating RO as structural guidance, our method captures hierarchical, document-level dependencies beyond local adjacency. Experiments on Semantic Entity Recognition, Entity Linking, and Document Question Answering show consistent improvements over baseline methods. Notably, LLM-inferred RO, even when differing from ground-truth adjacency, provides richer global structural priors and yields superior downstream performance. These results and findings demonstrate the scalability and significance of RO-aware modeling, advancing both LLMs and lightweight layout-aware models for robust document understanding. Code, data, and more details will be made publicly available after corporate review, in accordance with SAP’s corporate open-source policy.
Validating Automatic Evaluation of Controllable Counterspeech Generation: Rankings Matter More Than Scores
Yi Zheng | Björn Ross | Walid Magdy
Counterspeech generation has emerged as a promising approach to combat online hate speech, with recent work focusing on controlling attributes used in counterspeech, such as strategies or intents. While these attributes are often evaluated automatically using classifiers, a key goal of this evaluation is to compare the performance of different generation models. However, the validity of such evaluation results is questionable when the classifiers themselves have only modest performance. This paper examines the automatic evaluation of counterspeech attributes using a multi-attribute counterspeech dataset containing 2,728 samples. We investigate when automatic evaluation can be trusted for model comparison and address the limitations of current evaluation methodologies. We make concrete recommendations for how to perform classifier validation before model evaluation. Our classifier validation results demonstrate that even limited classifiers can produce trustworthy model rankings. Therefore, we argue that when comparing counterspeech generation models, a classifier’s ability to rank generation models is a more direct measure of its practical utility than traditional classification metrics, e.g., accuracy and F1.
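The ranking-based validation advocated above can be illustrated with a rank correlation between classifier-derived and human-derived per-model scores. The sketch below uses a small hand-rolled Kendall's tau; the per-model scores are hypothetical.

```python
# Illustrative check of whether an (imperfect) attribute classifier ranks
# counterspeech generation models the same way human judgments do.
def kendall_tau(scores_a, scores_b):
    n = len(scores_a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical per-model attribute scores from the classifier vs. human raters.
classifier_scores = [0.62, 0.55, 0.71, 0.48]
human_scores      = [0.70, 0.58, 0.75, 0.50]
print(kendall_tau(classifier_scores, human_scores))  # 1.0: identical ranking
```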
Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking
Rheeya Uppaal | Phu Mon Htut | Min Bai | Nikolaos Pappas | Zheng Qi | Sandesh Swamy
Reasoning-augmented vision language models (VLMs) generate explicit chains of thought that promise greater capability and transparency but also introduce new failure modes: models may reach correct answers via visually unfaithful intermediate steps, or reason faithfully yet fail on the final prediction. Standard evaluations that only measure final-answer accuracy cannot distinguish these behaviors. We introduce the visual faithfulness of reasoning chains as a distinct evaluation dimension, focusing on whether the perception steps of a reasoning chain are grounded in the image. We propose a training- and reference-free framework that decomposes chains into perception versus reasoning steps and uses off-the-shelf VLM judges for step-level faithfulness, additionally verifying this approach through a human meta-evaluation. Building on this metric, we present a lightweight self-reflection procedure that detects and locally regenerates unfaithful perception steps without any training. Across multiple reasoning-trained VLMs and perception-heavy benchmarks, our method reduces the Unfaithful Perception Rate while preserving final-answer accuracy, improving the reliability of multimodal reasoning.
Automating Android Build Repair: Bridging the Reasoning-Execution Gap in LLM Agents with Domain-Specific Tools
Ha Min Son | Huan Ren | Xin Liu | Zhe Zhao
Android is the largest mobile platform, yet automatically building applications remains a practical challenge. While Large Language Models (LLMs) show promise for code repair, their use for fixing Android build errors remains underexplored. To address this gap, we first introduce AndroidBuildBench, a benchmark of 1,019 build failures curated from the commit histories of 43 open-source Android projects. Each problem is paired with a verified solution from a subsequent commit, ensuring that fixes are feasible. Second, we propose GradleFixer, an LLM agent with domain-specific tools for inspecting and manipulating the Gradle build environment. GradleFixer achieves a resolve rate of 81.4% (pass@1), significantly outperforming a state-of-the-art coding agent that relies on a general-purpose shell. GradleFixer’s success suggests that while LLMs possess the high-level knowledge to solve these failures, they struggle to translate this knowledge into effective low-level actions using a general-purpose shell. We demonstrate the effectiveness of a strategy we term *Tool Bridging*, which replaces general-purpose shell commands with domain-aware abstractions. We hypothesize this approach works through two mechanisms: 1) it provides tools in an API-like format that LLMs use more reliably, and 2) it constrains the action space to relevant operations. This approach bridges the gap between the model’s high-level reasoning and effective low-level execution.
MetaLead: A Comprehensive Human-Curated Leaderboard Dataset for Transparent Reporting of Machine Learning Experiments
Roelien C. Timmer | Necva Bölücü | Stephen Wan
Leaderboards are crucial in the machine learning (ML) domain for benchmarking and tracking progress. However, creating leaderboards traditionally demands significant manual effort. In recent years, efforts have been made to automate leaderboard generation, but existing datasets for this purpose are limited: they capture only the best results from each paper and provide little metadata. We present MetaLead, a fully human-annotated ML leaderboard dataset that captures all experimental results for result transparency and contains extra metadata, such as the experiment type of each result (baseline, proposed method, or variation of the proposed method) to enable experiment-type-guided comparisons, and explicitly separates training and test datasets for cross-domain assessment. This enriched structure makes MetaLead a powerful resource for more transparent and nuanced evaluations across ML research. MetaLead dataset and code repository: https://anonymous.4open.science/r/metalead-7CA3
Enhancing the Safety of Medical Vision-Language Models by Synthetic Demonstrations
Zhiyu Xue | Reza Abbasi-Asl | Ramtin Pedarsani
Generative medical vision-language models (Med-VLMs) are primarily designed to generate complex textual information (e.g., diagnostic reports) from multimodal inputs including a vision modality (e.g., medical images) and a language modality (e.g., clinical queries). However, their security vulnerabilities remain underexplored. Med-VLMs should be capable of rejecting harmful queries, such as “Provide detailed instructions for using this CT scan for insurance fraud.” At the same time, addressing security concerns introduces the risk of over-defense, where safety-enhancing mechanisms may degrade general performance, causing Med-VLMs to reject benign clinical queries. In this paper, we propose a novel inference-time defense strategy to mitigate harmful queries, enabling defense against visual and textual jailbreak attacks. Using diverse medical imaging datasets collected from nine modalities, we demonstrate that our defense strategy based on synthetic clinical demonstrations enhances model safety without significantly compromising performance. Additionally, we find that increasing the demonstration budget alleviates the over-defense issue. We then introduce a mixed demonstration strategy as a trade-off solution for balancing security and performance under few-shot demonstration budget constraints. Warning: This paper contains content that may be deemed harmful.
HateXScore: A Metric Suite for Evaluating Reasoning Quality in Hate Speech Explanations
Yujia Hu | Roy Ka-Wei Lee
Hateful speech detection is a key component of content moderation, yet current evaluation frameworks rarely assess why a text is deemed hateful. We introduce HateXScore, a four-component metric suite designed to evaluate the reasoning quality of model explanations. It assesses (i) conclusion explicitness, (ii) faithfulness and causal grounding of quoted spans, (iii) protected group identification (policy-configurable), and (iv) logical consistency among these elements. Evaluated on six diverse hate speech datasets, HateXScore reveals interpretability failures and annotation inconsistencies that are invisible to standard metrics like Accuracy or F1. Moreover, human evaluation shows strong agreement with HateXScore, validating it as a practical tool for trustworthy and transparent moderation. Disclaimer: This paper contains sensitive content that may be disturbing to some readers.
Measuring Mechanistic Independence: Can Bias Be Removed Without Erasing Demographics?
Zhengyang Shan | Aaron Mueller
We investigate how independent demographic bias mechanisms are from general demographic recognition in language models. Using a multi-task evaluation setup where demographics are associated with names, professions, and education levels, we measure whether models can be debiased while preserving demographic detection capabilities. We compare attribution-based and correlation-based methods for locating bias features. We find that targeted sparse autoencoder feature ablations in Gemma-2-9B reduce bias without degrading recognition performance: attribution-based ablations mitigate race and gender profession stereotypes while preserving name recognition accuracy, whereas correlation-based ablations are more effective for education bias. Qualitative analysis further reveals that removing attribution features in education tasks induces “prior collapse”, thus increasing overall bias. This highlights the need for dimension-specific interventions. Overall, our results show that demographic bias arises from task-specific mechanisms rather than absolute demographic markers, and that mechanistic inference-time interventions can enable surgical debiasing without compromising core model capabilities.
A Survey on LLM-based Conversational User Simulation
Bo Ni | Yu Wang | Leyao Wang | Branislav Kveton | Franck Dernoncourt | Yu Xia | Hongjie Chen | Reuben Luera | Samyadeep Basu | Subhojyoti Mukherjee | Puneet Mathur | Nesreen K. Ahmed | Junda Wu | Li Li | Huixin Zhang | Ruiyi Zhang | Tong Yu | Sungchul Kim | Jiuxiang Gu | Zhengzhong Tu | Alexa Siu | Zichao Wang | Seunghyun Yoon | Nedim Lipka | Namyong Park | Zihao Lin | Trung Bui | Yue Zhao | Tyler Derr | Ryan A. Rossi
User simulation has long played a vital role in computer science due to its potential to support a wide range of applications. Language, as the primary medium of human communication, forms the foundation of social interaction and behavior. Consequently, simulating conversational behavior has become a key area of study. Recent advancements in large language models (LLMs) have significantly catalyzed progress in this domain by enabling high-fidelity generation of synthetic user conversation. In this paper, we survey recent advancements in LLM-based conversational user simulation. We introduce a novel taxonomy covering user granularity and simulation objectives. Additionally, we systematically analyze core techniques and evaluation methodologies. We aim to keep the research community informed of the latest advancements in conversational user simulation and to further facilitate future research by identifying open challenges and organizing existing work under a unified framework.
Prompt-driven Detection of Offensive Urdu Language using Large Language Models
Iffat Maab | Usman Haider | Junichi Yamagishi
Offensive language detection poses a significant challenge in modern social spaces, necessitating advanced solutions. Online media platforms have been known to escalate acts of violence and broader conflicts, and thus an automated system to help counter offensive content is essential. Traditional NLP models have typically dominated the field of hate speech detection, but they require careful model design and extensive tuning. Moreover, a notable resource gap exists for offensive language detection, particularly for languages transcribed in non-native scripts, such as Roman Urdu and Urdu. This study explores the potential of pre-trained LLMs with prompt-based methods across different transcriptions of the Urdu language, particularly their efficacy in detecting offensive content in diverse linguistic contexts. Our study employs state-of-the-art open-source LLMs, including advanced variants of Llama, Qwen, Lughaat, and the proprietary GPT-4, which are evaluated through prompting strategies in different under-resourced languages. Our findings show that pre-trained LLMs achieve performance comparable to traditional fine-tuned benchmarks in detecting hateful and offensive content.
Zer0-Jack: A memory-efficient gradient-based jailbreaking method for black box Multi-modal Large Language Models
Tiejin Chen | Kaishen Wang | Hua Wei
Multi-modal large language models (MLLMs) have recently shown impressive capabilities but are also highly vulnerable to jailbreak attacks. While white-box methods can generate adversarial visual inputs via gradient-based optimization, such approaches fail in realistic black-box settings where model parameters are inaccessible. Zeroth-order (ZO) optimization offers a natural path for black-box attacks by estimating gradients from queries, yet its application to MLLMs is challenging due to sequence-conditioned objectives, limited feedback, and massive model scales. To address these issues, we propose Zer0-Jack, the first direct black-box jailbreak framework for MLLMs based on ZO optimization. Zer0-Jack focuses on generating malicious images and introduces a patch-wise block coordinate descent strategy that stabilizes gradient estimation and reduces query complexity, enabling efficient optimization on billion-scale models. Experiments show that Zer0-Jack achieves 98.2% success on MiniGPT-4 and 95% on the Harmful Behaviors Multi-modal dataset, while directly jailbreaking commercial models such as GPT-4o. These results demonstrate that ZO optimization can be effectively adapted to jailbreak large-scale multi-modal LLMs. Codes are provided here.
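For readers less familiar with zeroth-order optimization, the sketch below shows a generic two-point gradient estimate applied to a single image patch, in the spirit of the patch-wise block coordinate descent described above. It is not Zer0-Jack's implementation; the loss function, step sizes, and patch layout are illustrative assumptions.

```python
import numpy as np

# Generic sketch of patch-wise zeroth-order optimization: estimate the gradient
# of a black-box loss for one image patch via two-point finite differences,
# then take a signed step. `loss_fn` stands in for a query to the target model.
def zo_patch_step(image, patch_slice, loss_fn, mu=1e-2, lr=1e-1, n_samples=8):
    patch = image[patch_slice]
    grad_est = np.zeros_like(patch)
    for _ in range(n_samples):
        u = np.random.randn(*patch.shape)               # random probe direction
        plus, minus = image.copy(), image.copy()
        plus[patch_slice] = patch + mu * u
        minus[patch_slice] = patch - mu * u
        diff = loss_fn(plus) - loss_fn(minus)
        grad_est += (diff / (2 * mu)) * u
    grad_est /= n_samples
    updated = image.copy()
    updated[patch_slice] = np.clip(patch - lr * np.sign(grad_est), 0.0, 1.0)
    return updated

# Toy usage: the "loss" simply measures distance to an arbitrary target image.
target = np.random.rand(8, 8)
img = np.random.rand(8, 8)
img = zo_patch_step(img, np.s_[0:4, 0:4], lambda x: float(np.sum((x - target) ** 2)))
```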
RAGPPI: Retrieval-Augmented Generation Benchmark for Protein–Protein Interactions in Drug Discovery
Youngseung Jeon | Ziwen Li | Thomas Li | JiaSyuan Chang | Morteza Ziyadi | Xiang Anthony Chen
Retrieving the biological impacts of protein–protein interactions (PPIs) is essential for target identification (Target ID) in drug development. Given the vast number of proteins involved, this process remains time-consuming and challenging. Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) frameworks have supported Target ID; however, no benchmark currently exists for identifying the biological impacts of PPIs. To bridge this gap, we introduce the RAG Benchmark for PPIs (RAGPPI), a factual question-answer benchmark of 4,420 question-answer pairs that focus on the potential biological impacts of PPIs. Through interviews with experts, we identified criteria for a benchmark dataset, such as the type of QA and its source. We built a gold-standard dataset (500 QA pairs) through expert-driven data annotation. We developed an ensemble auto-evaluation LLM that incorporates expert labeling characteristics, average fact–abstract similarity (F1), and low-similarity fact counts (F2), enabling the construction of a silver-standard dataset (3,720 QA pairs). We are committed to maintaining RAGPPI as a resource to support the research community in advancing RAG systems for drug discovery QA solutions.
Don’t Generate, Classify! Low-Latency Prompt Optimization with Structured Complementary Prompt
Hee-Soo Kim | Jun-Young Kim | Jeong-Hwan Lee | Seong-Jin Park | Kang-Min Kim
Large language models (LLMs) have demonstrated strong performance across diverse natural language processing tasks. However, their performance varies significantly across different prompts, requiring careful engineering for consistent results. Manual prompt engineering requires substantial human effort and suffers from limited reproducibility. In contrast, automatic prompt optimization methods reduce manual effort but often depend on costly autoregressive generation, resulting in substantial latency overheads. To address these limitations, we present low-latency prompt optimization (LLPO), a novel framework that reframes prompt engineering as a classification problem. LLPO classifies structured prompt fields from user input through multi-task classification and populates a predefined template to generate an optimized system prompt with minimal latency. In LLM-based automatic evaluations across four question-answering benchmarks, LLPO improves answer quality by up to 26.5% in ∆win rate compared to prior automatic prompt optimization methods, while reducing latency by up to 1,956 times. Human evaluation shows that LLPO receives the highest proportion of top-ranked responses. Furthermore, we analyze the contribution of each structured prompt field to performance, highlighting the robustness of our framework.
CHROMIC: Chronological Reasoning Across Multi-Panel Comics
Bingxuan Hou | Jiayi Lin | Chenyang Zhang | Dapeng Yin | Shuyue Zhu | Qingqing Hong | Mengna Gao | Junli Wang
Large-scale vision–language models (LVLMs) have achieved remarkable progress on various reasoning tasks. However, most studies focus on natural photographic images and pay limited attention to multi-panel visual narratives such as comics. This leaves a clear gap in our understanding of how well LVLMs perform chronological reasoning across comic panels. To address this, we introduce **ChrOMIC**, a new benchmark dataset for **chro**nological reasoning in multi-panel **comic**s. It covers six types of reasoning questions and spans both Western and Japanese comic styles. To ensure high-quality annotations, we customized a human–AI collaborative annotation process tailored to the characteristics of the two comic styles. We further introduce three core tasks: Description Reordering and Panel Reordering, which jointly assess models’ ability to understand chronological order in panel sequences, and Multiple-Choice Question Answering (MCQA), which evaluates narrative-level reasoning. We evaluate a range of open-source and commercial LVLMs on ChrOMIC, and find that even the leading models struggle with panel-based chronological reasoning. Further analysis reveals key limitations, including weak visual action understanding and frequent hallucinations in fine-grained visual interpretation.
GAST: Gradient-aligned Sparse Tuning of Large Language Models with Data-layer Selection
Kai Yao | Zhenghan Song | Kaixin Wu | Mingjie Zhong | Danzhao Cheng | Zhaorui Tan | Yixin Ji | Penglei Gao
Parameter-Efficient Fine-Tuning (PEFT) has become a key strategy for adapting large language models, with recent advances in sparse tuning reducing overhead by selectively updating key parameters or subsets of data. Existing approaches generally focus on two distinct paradigms: layer-selective methods aiming to fine-tune critical layers to minimize computational load, and data-selective methods aiming to select effective training subsets to boost training. However, current methods typically overlook the fact that different data points contribute varying degrees to distinct model layers, and they often discard potentially valuable information from data perceived as of low quality. To address these limitations, we propose Gradient-aligned Sparse Tuning (GAST), an innovative method that simultaneously performs selective fine-tuning at both data and layer dimensions as integral components of a unified optimization strategy. GAST specifically targets redundancy in information by employing a layer-sparse strategy that adaptively selects the most impactful data points for each layer, providing a more comprehensive and sophisticated solution than approaches restricted to a single dimension. Experiments demonstrate that GAST consistently outperforms baseline methods, establishing a promising direction for future research in PEFT strategies.
BeDiscovER: The Benchmark of Discourse Understanding in the Era of Reasoning Language Models
Chuyuan Li | Giuseppe Carenini
We introduce BeDiscovER (Benchmark of Discourse Understanding in the Era of Reasoning Language Models), an up-to-date, comprehensive suite for evaluating the discourse-level knowledge of modern LLMs. BeDiscovER compiles 5 publicly available discourse tasks across the discourse lexicon, (multi-)sentential, and document levels, comprising 52 individual datasets in total. It covers both extensively studied tasks such as discourse parsing and temporal relation extraction, as well as novel challenges such as discourse particle disambiguation (e.g., just), and also aggregates a shared task on Discourse Relation Parsing and Treebanking for multilingual and multi-framework discourse relation classification. We evaluate open-source LLMs (the Qwen3 series and DeepSeek-R1) and the frontier reasoning model GPT-5-mini on BeDiscovER, and find that state-of-the-art models exhibit strong performance on the arithmetic aspects of temporal reasoning, but struggle with long-dependency reasoning and some subtle semantic and discourse phenomena, such as rhetorical relation classification.
Confidence-Calibrated Small-Large Language Model Collaboration for Cost-Efficient Reasoning
Chuang Zhang | Zizhen Zhu | Yihao Wei | Bing Tian | Junyi Liu | Henan Wang | Wang Xavier | Yaxiao Liu
Large language models (LLMs) demonstrate superior reasoning capabilities compared to small language models (SLMs), but incur substantially higher costs. We propose COllaborative REAsoner (COREA), a system that cascades an SLM with an LLM to achieve a balance between accuracy and cost in complex reasoning tasks. COREA first attempts to answer questions using the SLM, which outputs both an answer and a verbalized confidence score. Questions with confidence below a predefined threshold are deferred to the LLM for more accurate resolution. We introduce a reinforcement learning-based training algorithm that aligns the SLM’s confidence through an additional confidence calibration reward. Extensive experiments demonstrate that our method jointly improves the SLM’s reasoning ability and confidence calibration across diverse datasets and model backbones. Compared to using the LLM alone, COREA reduces cost by 21.5% and 16.8% on out-of-domain math and non-math datasets, respectively, with only an absolute pass@1 drop within 2%.
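A minimal sketch of the confidence-gated cascade described above is given below: the small model returns an answer with a verbalized confidence, and low-confidence questions are deferred to the large model. Both model callables, the threshold, and the toy answers are hypothetical stand-ins; COREA's reinforcement-learning-based calibration is not reproduced.

```python
# Illustrative SLM -> LLM cascade gated on the small model's confidence score.
def cascade_answer(question, small_model, large_model, threshold=0.7):
    answer, confidence = small_model(question)
    if confidence >= threshold:
        return answer, "slm"              # cheap path: keep the SLM's answer
    return large_model(question), "llm"   # defer low-confidence questions

# Toy usage with stand-in models.
slm = lambda q: ("42", 0.55)
llm = lambda q: "42 (verified)"
print(cascade_answer("What is 6 x 7?", slm, llm))  # ('42 (verified)', 'llm')
```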
Chat-Ghosting: Methods for Auto-Completion in Dialog Systems
Anubhab Mandal | Sandeep Mishra | Bishal Santra | Tushar Abhishek | Pawan Goyal | Manish Gupta
Ghosting, the ability to predict a user’s intended text input for inline query auto-completion, is an invaluable feature for modern search engines and chat interfaces, greatly enhancing user experience. By suggesting completions to incomplete queries (or prefixes), ghosting aids users with slow typing speeds, disabilities, or limited language proficiency. Ghosting is a challenging problem and has become more important with the ubiquitousness of chat-based systems like ChatGPT, Copilot, etc. Despite the increasing prominence of chat-based systems utilizing ghosting, this challenging problem of Chat-Ghosting has received little attention from the NLP/ML research community. There is a lack of standardized benchmarks and relative performance analysis of deep learning and non-deep learning methods. We address this through an open and thorough study of this problem using four publicly available dialog datasets: two human-human (DailyDialog and DSTC7-Ubuntu) and two human-bot (Open Assistant and ShareGPT). We experiment with various existing query auto-completion methods (using tries), n-gram methods and deep learning methods, with and without dialog context. We also propose a novel entropy-based dynamic early stopping strategy. Our analysis finds that statistical n-gram models and tries outperform deep learning based models in terms of both model performance and inference efficiency for seen prefixes. For unseen queries, neural models like T5 and Phi-2 lead to better results. Adding conversational context leads to significant improvements in ghosting quality, especially for Open-Assistant and ShareGPT. We make code and data publicly available at https://github.com/blitzprecision/Chat-Ghosting.
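As a concrete illustration of two components the study compares, the sketch below builds a character-level trie over previously seen queries and extends a ghosted completion greedily, stopping when the next-character distribution becomes too uncertain. This is a simple character-level analogue of the entropy-based dynamic early stopping idea; the data structure, threshold, and toy queries are illustrative rather than the paper's implementation.

```python
import math
from collections import defaultdict

# Illustrative trie-based ghosting with an entropy-based stopping rule.
class TrieNode:
    def __init__(self):
        self.children = defaultdict(TrieNode)
        self.counts = defaultdict(int)  # next-character frequencies

def build_trie(queries):
    root = TrieNode()
    for q in queries:
        node = root
        for ch in q:
            node.counts[ch] += 1
            node = node.children[ch]
    return root

def ghost(root, prefix, max_len=20, entropy_threshold=1.0):
    node = root
    for ch in prefix:                      # walk down to the typed prefix
        if ch not in node.children:
            return ""                      # unseen prefix: no suggestion
        node = node.children[ch]
    completion = []
    for _ in range(max_len):
        total = sum(node.counts.values())
        if total == 0:
            break
        probs = [c / total for c in node.counts.values()]
        entropy = -sum(p * math.log2(p) for p in probs)
        if entropy > entropy_threshold:    # too uncertain: stop ghosting here
            break
        ch = max(node.counts, key=node.counts.get)
        completion.append(ch)
        node = node.children[ch]
    return "".join(completion)

trie = build_trie(["how are you", "how are things", "how do i reset my password"])
print(ghost(trie, "how are y"))  # -> "ou"
```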
Attribution-Guided Multi-Object Hallucination and Bias Detection in Vision-Language Models
Sirat Samyoun | Yingtai Xiao | Jian Du
Sirat Samyoun | Yingtai Xiao | Jian Du
Vision-Language Models excel in multi-modal tasks but often hallucinate objects or exhibit linguistic bias by over-repeating object names, especially in complex multi-object scenes. Existing methods struggle with multi-object grounding because language priors frequently dominate visual evidence, causing hallucinated or biased objects to produce attention distributions or similarity scores nearly indistinguishable from those of real objects. We introduce SHAPLENS, a Shapley value–based attribution framework using Kernel SHAP and multi-layer fusion to detect hallucinated and biased objects. Evaluated on ADE and COCO datasets across four leading VLMs, SHAPLENS improves hallucination detection accuracy by 8–12% and F1 by 10–14% over the best baselines. It also achieves up to 6% higher bias detection performance across three distinct bias types on a curated HQH benchmark and exhibits minimal degradation (<0.03%) across partial and perturbed contexts.
Word Surprisal Correlates with Sentential Contradiction in LLMs
Ning Shi | Bradley Hauer | David Basil | John Zhang | Grzegorz Kondrak
Ning Shi | Bradley Hauer | David Basil | John Zhang | Grzegorz Kondrak
Large language models (LLMs) continue to achieve impressive performance on reasoning benchmarks, yet it remains unclear how their predictions capture semantic consistency between sentences. We investigate the important open question of whether word-level surprisal correlates with sentence-level contradiction between a premise and a hypothesis. Specifically, we compute surprisal for hypothesis words across a diverse set of experimental variants, and analyze its association with contradiction labels over multiple datasets and open-source LLMs. Because modern LLMs operate on subword tokens and cannot directly produce reliable surprisal estimates, we introduce a token-to-word decoding algorithm that extends theoretically grounded probability estimation to open-vocabulary settings. Experiments show a consistent and statistically significant positive correlation between surprisal and contradiction across models and domains. Our analysis also provides new insights into the capabilities and limitations of current LLMs. Together, our findings suggest that surprisal can localize sentence-level inconsistency at the word level, establishing a quantitative link between lexical uncertainty and sentential semantics. We plan to release our code and data upon publication.
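As background for the token-to-word issue, a common baseline is to approximate a word's surprisal by summing the surprisals (negative log probabilities) of its subword pieces under a causal LM. The sketch below does this with GPT-2 via Hugging Face transformers; it shows the baseline computation only, not the paper's token-to-word decoding algorithm.

```python
# Baseline word-surprisal computation: sum subword surprisals per word.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def word_surprisals(sentence: str):
    """Return (word, surprisal-in-nats) pairs for a whitespace-tokenized sentence."""
    words = sentence.split()
    surprisals, ids_so_far = [], [tok.bos_token_id]
    for word in words:
        # GPT-2 BPE expects a leading space for every word after the first.
        piece_ids = tok.encode(" " + word) if ids_so_far[1:] else tok.encode(word)
        total = 0.0
        for piece in piece_ids:
            with torch.no_grad():
                logits = model(torch.tensor([ids_so_far])).logits[0, -1]
            logprobs = torch.log_softmax(logits, dim=-1)
            total += -logprobs[piece].item()  # surprisal of this subword piece
            ids_so_far.append(piece)
        surprisals.append((word, total))
    return surprisals

print(word_surprisals("The cat sat on the mat"))
```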
ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models
Sharanya Dasgupta | Arkaprabha Basu | Sujoy Nath | Swagatam Das
Sharanya Dasgupta | Arkaprabha Basu | Sujoy Nath | Swagatam Das
Human cognition, driven by complex neurochemical processes, oscillates between imagination and reality and learns to self-correct whenever such subtle drifts lead to hallucinations or unsafe associations. In recent years, Large Language Models (LLMs) have demonstrated remarkable performance in a wide range of tasks. However, they still lack this human-like capacity to balance factuality and safety. Drawing on this resemblance, we argue that both factual and safety failures in LLMs arise from a common underlying issue, "representational misalignment" in their latent activation space. We hypothesize that an external network, trained to understand the fluctuations, can selectively intervene in the model to regulate falsehood into truthfulness and unsafe output into safe output without fine-tuning the LLM’s parameters. Reflecting the hypothesis, we propose ARREST (Adversarial Resilient Regulation Enhancing Safety and Truth), a unified framework that identifies and corrects drifted features, engaging both soft and hard refusals in addition to factual corrections. Our empirical results show that ARREST not only regulates misalignment but is also more versatile compared to the Reinforcement Learning from Human Feedback (RLHF)-aligned models in generating soft refusals due to adversarial training. We make our codebase available at https://github.com/sharanya-dasgupta001/ARREST.
Re2-DocRED: Revisiting Revisited-DocRED for Joint Entity and Relation Extraction
Chen Kim Heng | Shao Wen Tong | Julian Wong Wei Sheng
Chen Kim Heng | Shao Wen Tong | Julian Wong Wei Sheng
Document-level Joint Entity and Relation Extraction (JERE) benchmarks such as DocRED, Re-DocRED, and DocGNRE suffer from pervasive False Negatives (FN), undermining training and evaluation. In this paper, we introduce SiftingLogic – a training-free annotation pipeline that leverages LLMs with user-specifiable reasoning, enriched inverse/co-occurring relation schemas, and novel entity-level constraints to systematically address FN gaps. Applying SiftingLogic and our enriched schema of inverse and co-occurring relations, we add 29,580 verified triplets to Re-DocRED (train/dev, +27%) and over 9,700 verified triplets to DocGNRE test (+49.89%), yielding the enhanced Re2-DocRED dataset. Beyond English datasets, we also apply our SiftingLogic to the REDFM Mandarin test set, resulting in a significant increase in triplets from 663 to 1,391 (+109.8%), demonstrating our pipeline’s generalisability across languages and datasets. Our experiments show that recall scores of models trained on existing public datasets drop notably on our revised splits, whereas our enriched training set mitigates this, underscoring persistent FN gaps and motivating our proposed SiftingLogic and Re2-DocRED. To facilitate further research and reproducibility of our work, the Re2-DocRED dataset is released at https://github.com/klassessg/re2-docred.
Where Do LLMs Compose Meaning? A Layerwise Analysis of Compositional Robustness
Nura Aljaafari | Danilo Carvalho | Andre Freitas
Nura Aljaafari | Danilo Carvalho | Andre Freitas
Understanding how large language models (LLMs) process compositional linguistic structures is integral to enhancing their reliability and interpretability. We present Constituent-Aware Pooling (CAP), a methodology grounded in compositionality, mechanistic interpretability, and information theory that intervenes in model activations by pooling token representations into linguistic constituents at various layers. Experiments across eight models (124M-8B parameters) on inverse definition modelling, hypernym and synonym prediction reveal that semantic composition is not localised to specific layers but distributed across network depth. Performance degrades substantially under constituent-based pooling, particularly in early and middle layers, with larger models showing greater sensitivity. We propose an information-theoretic interpretation: transformers’ training objectives incentivise deferred integration to maximise token-level throughput, resulting in fragmented rather than localised composition. These findings highlight fundamental architectural and training constraints requiring specialised approaches to encourage robust compositional processing.
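The core intervention behind constituent-aware pooling can be illustrated with a short sketch: token activations from a chosen layer are mean-pooled within each constituent span. The span boundaries and the choice of mean pooling here are illustrative assumptions; the paper's full CAP methodology is more involved.

```python
# Rough sketch of constituent-level pooling of token activations at one layer.
import torch

def constituent_pool(hidden_states: torch.Tensor, spans):
    """Pool token representations into constituent representations.

    hidden_states: (seq_len, d_model) activations from a chosen layer.
    spans: list of (start, end) token indices, end exclusive, one per constituent.
    """
    return torch.stack([hidden_states[s:e].mean(dim=0) for s, e in spans])

# Example: a 6-token sentence split into two constituents.
h = torch.randn(6, 768)
print(constituent_pool(h, [(0, 3), (3, 6)]).shape)  # torch.Size([2, 768])
```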
BLEnD-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models
Bryan Chen Zhengyu Tan | Weihua Zheng | Zhengyuan Liu | Nancy F. Chen | Hwaran Lee | Kenny Tsu Wei Choo | Roy Ka-Wei Lee
Bryan Chen Zhengyu Tan | Weihua Zheng | Zhengyuan Liu | Nancy F. Chen | Hwaran Lee | Kenny Tsu Wei Choo | Roy Ka-Wei Lee
As vision-language models (VLMs) are deployed globally, their ability to understand culturally situated knowledge becomes essential. Yet, existing evaluations largely assess static recall or isolated visual grounding, leaving unanswered whether VLMs possess robust and transferable cultural understanding. We introduce BLEnD-Vis, a multimodal, multicultural benchmark designed to evaluate the robustness of everyday cultural knowledge in VLMs across linguistic rephrasings and visual modalities. Building on the BLEnD dataset, BLEnD-Vis constructs 313 culturally grounded question templates spanning 16 regions and generates three aligned multiple-choice formats: (i) a text-only baseline querying from Region → Entity, (ii) an inverted text-only variant (Entity → Region), and (iii) a VQA-style version of (ii) with generated images. The resulting benchmark comprises 4,916 images and over 21,000 multiple-choice question (MCQ) instances, validated through human annotation. BLEnD-Vis reveals significant fragility in current VLM cultural knowledge; models exhibit performance drops under linguistic rephrasing. While visual cues often aid performance, low cross-modal consistency highlights the challenges of robustly integrating textual and visual understanding, particularly in lower-resource regions. BLEnD-Vis thus provides a crucial testbed for systematically analysing cultural robustness and multimodal grounding, exposing limitations and guiding the development of more culturally competent VLMs. Code is available at https://github.com/Social-AI-Studio/BLEnD-Vis.
Document-Level Zero-Shot Relation Extraction with Entity Side Information
Mohan Raj | Lay-Ki Soon | Huey Fang Ong | Bhawani Selvaretnam
Mohan Raj | Lay-Ki Soon | Huey Fang Ong | Bhawani Selvaretnam
Document-Level Zero-Shot Relation Extraction (DocZSRE) aims to predict unseen relation labels in text documents without prior training on specific relations. Existing approaches rely on Large Language Models (LLMs) to generate synthetic data for unseen labels, which poses challenges for low-resource languages like Malaysian English. These challenges include the incorporation of local linguistic nuances and the risk of factual inaccuracies in LLM-generated data. This paper introduces Document-Level Zero-Shot Relation Extraction with Entity Side Information (DocZSRE-SI) to address limitations in the existing DocZSRE approach. The DocZSRE-SI framework leverages Entity Side Information, such as Entity Mention Descriptions and Entity Mention Hypernyms, to perform ZSRE without depending on LLM-generated synthetic data. The proposed low-complexity model achieves an average improvement of 11.6% in the macro F1-Score compared to baseline models and existing benchmarks. By utilising Entity Side Information, DocZSRE-SI offers a robust and efficient alternative to error-prone, LLM-based methods, demonstrating significant advancements in handling low-resource languages and linguistic diversity in relation extraction tasks. This research provides a scalable and reliable solution for ZSRE, particularly in contexts like Malaysian English news articles, where traditional LLM-based approaches fall short.
Steering Large Language Models for Machine Translation Personalization
Daniel Scalena | Gabriele Sarti | Arianna Bisazza | Elisabetta Fersini | Malvina Nissim
Daniel Scalena | Gabriele Sarti | Arianna Bisazza | Elisabetta Fersini | Malvina Nissim
Large language models have simplified the production of personalized translations reflecting predefined stylistic constraints. However, these systems still struggle when stylistic requirements are implicitly represented by a set of examples, such as texts produced by a specific human translator. In this work, we explore various strategies for personalizing automatically generated translations when few examples are available, with a focus on the challenging domain of literary translation. We begin by determining the feasibility of the task and how style information is encoded within model representations. Then, we evaluate various prompting strategies and inference-time interventions for steering model generations towards a personalized style, with a particular focus on contrastive steering with sparse autoencoder (SAE) latents to identify salient personalization properties. We demonstrate that contrastive SAE steering yields robust style conditioning and translation quality, resulting in higher inference-time computational efficiency than prompting approaches. We further examine the impact of steering on model activations, finding that layers encoding personalization properties are impacted similarly by prompting and SAE steering, suggesting a similar mechanism at play.
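The steering mechanism described above can be conveyed with a simplified sketch: derive a direction from the contrast between personalized and generic examples and add it to the hidden states at one layer during generation. Note that the paper steers with sparse autoencoder (SAE) latents specifically; the difference-of-means direction below is only a generic stand-in, and the layer choice and scaling factor are assumptions.

```python
# Simplified activation-steering sketch (difference-of-means direction standing in
# for the contrastively selected SAE latent direction used in the paper).
import torch

def contrastive_direction(acts_personal: torch.Tensor, acts_generic: torch.Tensor) -> torch.Tensor:
    """Steering direction from mean activations of personalized vs. generic examples.

    Both inputs: (num_examples, d_model) activations collected at one layer.
    """
    return acts_personal.mean(dim=0) - acts_generic.mean(dim=0)

def steer(hidden: torch.Tensor, direction: torch.Tensor, alpha: float = 4.0) -> torch.Tensor:
    """Add the scaled steering direction to every token's hidden state at that layer."""
    return hidden + alpha * direction

# Toy usage with random activations standing in for real model states.
d = contrastive_direction(torch.randn(32, 4096), torch.randn(32, 4096))
print(steer(torch.randn(10, 4096), d).shape)  # torch.Size([10, 4096])
```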
Taxation Perspectives from Large Language Models: A Case Study on Additional Tax Penalties
Eunkyung Choi | Young Jin Suh | Siun Lee | Hongseok Oh | Juheon Kang | Won Hur | Hun Park | Wonseok Hwang
Eunkyung Choi | Young Jin Suh | Siun Lee | Hongseok Oh | Juheon Kang | Won Hur | Hun Park | Wonseok Hwang
How capable are large language models (LLMs) in the domain of taxation? Although numerous studies have explored the legal domain, research dedicated to taxation remains scarce. Moreover, the datasets used in these studies are either simplified, failing to reflect the real-world complexities, or not released as open-source. To address this gap, we introduce PLAT, a new benchmark designed to assess the ability of LLMs to predict the legitimacy of additional tax penalties. PLAT comprises 300 examples: (1) 100 binary-choice questions, (2) 100 multiple-choice questions, and (3) 100 essay-type questions, all derived from 100 Korean court precedents. PLAT is constructed to evaluate not only LLMs’ understanding of tax law but also their performance in legal cases that require complex reasoning beyond straightforward application of statutes. Our systematic experiments with multiple LLMs reveal that (1) their baseline capabilities are limited, especially in cases involving conflicting issues that require a comprehensive understanding (not only of the statutes but also of the taxpayer’s circumstances), and (2) LLMs struggle particularly with the “AC” stages of “IRAC” even for advanced reasoning models like o3, which actively employ inference-time scaling.
Beyond Memorization: A Rigorous Evaluation Framework for Medical Knowledge Editing
Shigeng Chen | Linhao Luo | Zhangchi Qiu | Yanan Cao | Carl Yang | Shirui Pan
Shigeng Chen | Linhao Luo | Zhangchi Qiu | Yanan Cao | Carl Yang | Shirui Pan
Knowledge editing (KE) has recently emerged as a promising technique to update specific facts in large language models (LLMs) without full retraining. While existing KE methods show promising results on general-domain benchmarks, their effectiveness in the medical domain remains largely unexplored. Medical knowledge editing poses unique challenges, requiring models not only to memorize new facts but also to internalize and generalize them for reliable and interpretable clinical decision-making. In this work, we propose MedEditBench, a rigorous evaluation framework for assessing medical knowledge editing. Our preliminary results reveal that the current KE paradigm, which directly edits simple answers into the LLM, often leads to superficial updates with poor generalization. To address this, we introduce Self-Generated Rationale Editing (SGR-Edit), which leverages model-generated rationales as editing targets, enabling deeper knowledge integration. Extensive experiments across diverse LLMs and KE methods demonstrate that SGR-Edit consistently improves editing efficacy and generalization. Furthermore, we examine the impact of sequential edits on in-domain medical knowledge, external-domain knowledge, and general model capabilities, offering practical insights for deploying KE in real-world medical applications.
Unlocking Latent Discourse Translation in LLMs Through Quality-Aware Decoding
Wafaa Mohammed | Vlad Niculae | Chrysoula Zerva
Wafaa Mohammed | Vlad Niculae | Chrysoula Zerva
Large language models (LLMs) have emerged as strong contenders in machine translation. Yet, they still struggle to adequately handle discourse phenomena, such as pronoun resolution and lexical cohesion at the document level. In this study, we thoroughly investigate the discourse phenomena performance of LLMs in context-aware translation. We demonstrate that discourse knowledge is encoded within LLMs and propose the use of quality-aware decoding (QAD), specifically minimum Bayes risk decoding, to effectively extract this knowledge, showcasing its superiority over other decoding approaches through comprehensive analysis. Furthermore, we illustrate that QAD enhances the semantic richness of translations and aligns them more closely with human preferences.
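Quality-aware decoding here refers to minimum Bayes risk (MBR) decoding: sample several candidate translations, score each against all others with a utility metric, and return the candidate with the highest expected utility. The sketch below shows the generic MBR selection loop; the toy token-overlap utility is only a placeholder for the learned quality metrics typically used in practice.

```python
# Generic minimum Bayes risk (MBR) selection over sampled candidates.
def mbr_select(candidates, utility):
    """Pick the candidate with the highest average utility against the others."""
    best, best_score = None, float("-inf")
    for i, hyp in enumerate(candidates):
        others = [ref for j, ref in enumerate(candidates) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / max(len(others), 1)
        if score > best_score:
            best, best_score = hyp, score
    return best

def token_f1(hyp: str, ref: str) -> float:
    """Toy utility: token-overlap F1 (a real system would use a learned quality metric)."""
    h, r = set(hyp.split()), set(ref.split())
    common = len(h & r)
    if not common:
        return 0.0
    p, rec = common / len(h), common / len(r)
    return 2 * p * rec / (p + rec)

samples = ["she gave him the book", "she gave the book to him", "he gave her a book"]
print(mbr_select(samples, token_f1))
```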
Cross-lingual and Word-Independent Methods for Quantifying Degree of Grammaticalization
Ryo Nagata | Daichi Mochihashi | Misato Ido | Yusuke Kubota | Naoki Otani | Yoshifumi Kawasaki | Hiroya Takamura
Ryo Nagata | Daichi Mochihashi | Misato Ido | Yusuke Kubota | Naoki Otani | Yoshifumi Kawasaki | Hiroya Takamura
Grammaticalization denotes a diachronic change in grammatical category from content word to function word. One of the intensively explored directions in this area is to quantify the degree of grammaticalization. There have been a limited number of automated methods for this task, and the existing best-performing method is heavily language- and word-dependent. In this paper, we explore three methods for quantifying the degree of grammaticalization, which are applicable to a wider variety of words and languages. The difficulty here is that training data is not available for the present task. We overcome this difficulty by using Positive-Unlabeled learning (PU-learning) or Cross-Validation-like learning (hereafter, CV-learning). Experiments show that the CV-learning-based method achieves moderate to high correlations with human judgments on English deverbal prepositions and on Japanese nouns undergoing grammaticalization. With this method, we further explore words that may be undergoing grammaticalization and counterexamples to the unidirectionality hypothesis.
Knowing the Facts but Choosing the Shortcut: Understanding How Large Language Models Compare Entities
Hans Hergen Lehmann | Jae Hee Lee | Steven Schockaert | Stefan Wermter
Hans Hergen Lehmann | Jae Hee Lee | Steven Schockaert | Stefan Wermter
Large Language Models (LLMs) are increasingly used for knowledge-based reasoning tasks, yet understanding when they rely on genuine knowledge versus superficial heuristics remains challenging. We investigate this question through entity comparison tasks by asking models to compare entities along numerical attributes (e.g., “Which river is longer, the Danube or the Nile?”), which offer clear ground truth for systematic analysis. Despite having sufficient numerical knowledge to answer correctly, LLMs frequently make predictions which contradict this knowledge. We identify three heuristic biases that strongly influence model predictions: entity popularity, mention order, and semantic co-occurrence. For smaller models, a simple logistic regression using only these surface cues predicts model choices more accurately than the model’s own numerical predictions, suggesting heuristics largely override principled reasoning. Crucially, we find that larger models (32B parameters) selectively rely on numerical knowledge when it is more reliable, while smaller models (7-8B parameters) show no such discrimination, which explains why larger models outperform smaller ones even when the smaller models possess more accurate knowledge. Chain-of-thought prompting steers models of all sizes towards using the numerical features.
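The "surface cues predict model choices" analysis can be reproduced in outline with scikit-learn: fit a logistic regression on heuristic features and compare its predictions of the model's choices against the model's own accuracy. The three feature columns and the toy data below are placeholders for illustration, not the paper's feature definitions.

```python
# Sketch: predicting a model's entity-comparison choices from surface cues only.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [popularity_gap, first_mentioned_is_A, cooccurrence_score]
# Label: 1 if the model picked entity A, else 0. Toy values for illustration only.
X = np.array([
    [ 2.1, 1, 0.8],
    [-1.3, 0, 0.2],
    [ 0.5, 1, 0.6],
    [-2.0, 0, 0.1],
])
y = np.array([1, 0, 1, 0])

clf = LogisticRegression().fit(X, y)
print(clf.coef_)                    # weight assigned to each heuristic cue
print(clf.predict([[1.0, 1, 0.7]])) # predicted choice for a new comparison
```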
Calibrating Beyond English: Language Diversity for Better Quantized Multilingual LLMs
Everlyn Asiko Chimoto | Mostafa Elhoushi | Bruce Bassett
Everlyn Asiko Chimoto | Mostafa Elhoushi | Bruce Bassett
Quantization is an effective technique for reducing the storage footprint and computational costs of Large Language Models (LLMs), but it often results in performance degradation. Existing post-training quantization methods typically use small, English-only calibration sets; however, their impact on multilingual models remains underexplored. We systematically evaluate eight calibration settings (five single-language and three multilingual mixes) across two quantizers (GPTQ, AWQ) on data from 10 different languages. Our findings reveal a consistent trend: non-English and multilingual calibration sets significantly improve perplexity compared to English-only baselines. Specifically, we observe notable average perplexity gains across both quantizers on Llama3.1 8B and Qwen2.5 7B, with multilingual mixes achieving the largest overall reductions of up to 3.52 perplexity points. Furthermore, our analysis indicates that tailoring calibration sets to the evaluation language yields the largest improvements for individual languages, underscoring the importance of linguistic alignment. We also identify specific failure cases where certain language-quantizer combinations degrade performance, which we trace to differences in activation range distributions across languages. These results highlight that static, one-size-fits-all calibration is suboptimal, and that tailoring calibration data, both in language and diversity, plays a crucial role in robustly quantizing multilingual LLMs.
LaCoMSA: Language-Consistency Multilingual Self-Alignment with Latent Representation Rewarding
Khanh-Tung Tran | Barry O'Sullivan | Hoang D. Nguyen
Khanh-Tung Tran | Barry O'Sullivan | Hoang D. Nguyen
Large Language Models (LLMs) have achieved impressive performance yet remain inconsistent across languages, often defaulting to high-resource outputs such as English. Existing multilingual alignment methods mitigate these issues through preference optimization but rely on external supervision, such as translation systems or English-biased signals. We propose Multilingual Self-Alignment (MSA), a targeted preference optimization framework that leverages an LLM’s own latent representations as intrinsic supervision signals, rewarding lower-resource language outputs based on their alignment with high-resource (English) counterparts in the "semantic hub". We further introduce Language-Consistency MSA (LaCoMSA), which augments MSA with a final-layer language-consistency factor to prevent off-target generation. Integrated with Direct Preference Optimization, LaCoMSA improves the multilingual win rates of a Llama 3 8B-based model by up to 6.8% absolute (55.0% relative) on X-AlpacaEval and achieves consistent gains across benchmarks and models. Our findings demonstrate that LaCoMSA can serve as an effective and scalable mechanism, opening a new avenue toward multilingual self-alignment.
Can you map it to English? The Role of Cross-Lingual Alignment in the Multilingual Performance of LLMs
Kartik Ravisankar | HyoJung Han | Sarah Wiegreffe | Marine Carpuat
Kartik Ravisankar | HyoJung Han | Sarah Wiegreffe | Marine Carpuat
Large language models (LLMs) can answer prompts in many languages, despite being trained predominantly on English; yet, the mechanisms driving this generalization remain poorly understood. This work asks: How does an LLM’s ability to align representations of non-English inputs to English impact its performance on natural language understanding (NLU) tasks? We study the role of representation alignment in instance-level task decisions, complementing prior analyses conducted both at the language level and task-independently. We introduce the Discriminative Alignment Index (DALI) to quantify instance-level alignment across 24 languages other than English and three distinct NLU tasks. Results show that incorrect NLU predictions are strongly associated with lower representation alignment with English in the model’s middle layers. Through activation patching, we show that incorrect predictions in languages other than English can be fixed by patching their parallel English activations in the middle layers, thereby demonstrating the causal role of representation (mis)alignment in cross-lingual correctness.
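A crude proxy for the kind of instance-level alignment studied here is the cosine similarity between mean-pooled mid-layer representations of a non-English input and its English translation. The sketch below shows only that proxy; DALI itself is a more refined, discriminative index, so the code is illustrative rather than a reimplementation.

```python
# Simple mid-layer alignment proxy between a non-English input and its English parallel.
import torch

def midlayer_alignment(hidden_src: torch.Tensor, hidden_en: torch.Tensor) -> float:
    """Cosine similarity of mean-pooled activations from one middle layer.

    hidden_src / hidden_en: (num_tokens, d_model) activations for the non-English
    input and its English translation, extracted from the same layer.
    """
    u, v = hidden_src.mean(dim=0), hidden_en.mean(dim=0)
    return float(torch.nn.functional.cosine_similarity(u, v, dim=0))

# Example with random activations standing in for real model states.
print(midlayer_alignment(torch.randn(12, 4096), torch.randn(9, 4096)))
```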
Recursive numeral systems are highly regular and easy to process
Ponrawee Prasertsom | Andrea Silvi | Jennifer Culbertson | Devdatt Dubhashi | Moa Johansson | Kenny Smith
Ponrawee Prasertsom | Andrea Silvi | Jennifer Culbertson | Devdatt Dubhashi | Moa Johansson | Kenny Smith
Much recent work has shown how cross-linguistic variation is constrained by competing pressures from efficient communication. However, little attention has been paid to the role of the systematicity of forms (*regularity*), a key property of natural language. Here, we demonstrate the importance of regularity in explaining the shape of linguistic systems by looking at recursive numeral systems. Previous work has argued that these systems optimise the trade-off between lexicon size and average morphosyntactic complexity (Denić and Szymanik, 2024). However, showing that *only* natural-language-like systems optimise this trade-off has proven elusive, and existing solutions rely on ad-hoc constraints to rule out unnatural systems (Yang and Regier, 2025). Drawing on the Minimum Description Length (MDL) approach, we argue that recursive numeral systems are better viewed as efficient with regard to their regularity and processing complexity. We show that our MDL-based measures of regularity and processing complexity better capture the key differences between attested, natural systems and theoretically possible ones, including “optimal” recursive numeral systems from previous work, and that the ad-hoc constraints naturally follow from regularity. Our approach highlights the need to incorporate regularity across sets of forms in studies attempting to measure efficiency in language.
Bringing Emerging Architectures to Sequence Labeling in NLP
Ana Ezquerro | Carlos Gómez-Rodríguez | David Vilares
Ana Ezquerro | Carlos Gómez-Rodríguez | David Vilares
Pretrained Transformer encoders are the dominant approach to sequence labeling. While some alternative architectures, such as xLSTMs, structured state-space models, diffusion models, and adversarial learning, have shown promise in language modeling, few have been applied to sequence labeling, and mostly on flat or simplified tasks. We study how these architectures adapt across tagging tasks that vary in structural complexity, label space, and token dependencies, with evaluation spanning multiple languages. We find that the strong performance previously observed in simpler settings does not always generalize well across languages or datasets, nor does it extend to more complex structured tasks.
SEMIROUTER: Sparse-Data Enhanced Routing for Adaptive Multi-LLM System
Zijie Wang | Xinyu Yan | Che Wang | Zeng Zihao | Lei Xiao | Wei Yang Bryan Lim
Zijie Wang | Xinyu Yan | Che Wang | Zeng Zihao | Lei Xiao | Wei Yang Bryan Lim
Large Language Models (LLMs) exhibit remarkable capabilities, but no single model optimally balances serving quality and deployment cost across diverse tasks. Multi-LLM systems address this challenge through intelligent routing mechanisms that dynamically allocate queries to the most appropriate model. However, existing routing methods suffer from two fundamental limitations: (i) dependence on extensive full-response datasets for training, and (ii) poor scalability when incorporating new models, typically necessitating retraining from scratch. In this paper, we propose SemiRouter, a novel LLM routing framework designed for data-sparse and evolving model environments. Our approach combines a data-efficient training methodology with an adaptive architecture that enables seamless integration of new models under limited supervision. As an extension, we also consider energy footprint as a potential deployment cost in our routing decision. Empirical evaluations demonstrate that our method improves data efficiency, adaptability, and routing accuracy compared to existing approaches, providing a scalable solution for dynamic multi-LLM deployment.
DITTO: A Spoofing Attack Framework on Watermarked LLMs via Knowledge Distillation
Hyeseon An | Shinwoo Park | Suyeon Woo | Yo-Sub Han
Hyeseon An | Shinwoo Park | Suyeon Woo | Yo-Sub Han
The promise of LLM watermarking rests on a core assumption that a specific watermark proves authorship by a specific model. We demonstrate that this assumption is dangerously flawed. We introduce the threat of watermark spoofing, a sophisticated attack that allows a malicious model to generate text containing the authentic-looking watermark of a trusted, victim model. This enables the seamless misattribution of harmful content, such as disinformation, to reputable sources. The key to our attack is repurposing watermark radioactivity, the unintended inheritance of data patterns during fine-tuning, from a discoverable trait into an attack vector. By distilling knowledge from a watermarked teacher model, our framework allows an attacker to steal and replicate the watermarking signal of the victim model. This work reveals a critical security gap in text authorship verification and calls for a paradigm shift towards technologies capable of distinguishing authentic watermarks from expertly imitated ones. Our code is available at https://github.com/hsannn/ditto.
Boundary-Aware LLM Augmentation for Low-Resource Event Argument Extraction
Zhaoyue Sun | Gabriele Pergola | Yulan He
Zhaoyue Sun | Gabriele Pergola | Yulan He
Event argument extraction (EAE) is a crucial task in information extraction. However, its performance heavily depends on expensive annotated data, making data scarcity a persistent challenge. Data augmentation serves as an effective approach to improving model performance in low-resource settings, yet research on applying LLMs for EAE augmentation remains preliminary. In this study, we pay attention to the boundary sensitivity of EAE and investigate four LLM-based augmentation strategies: argument replacement, adjunction rewriting, their combination, and annotation generation. We conduct comprehensive experiments across four benchmark datasets, employing GPT-4o-Mini and DeepSeek-R1-7B as data generators. Our results show that boundary-aware augmentation consistently leads to greater performance improvements over boundary-agnostic methods. In addition to performance gains, we provide a detailed analysis of augmentation quality from multiple perspectives, including uncertainty reduction, error types, data quality, and data scale. This work offers both empirical evidence and practical guidance for leveraging LLMs to enhance event argument extraction under low-resource conditions.
CASE – Condition-Aware Sentence Embeddings for Conditional Semantic Textual Similarity Measurement
Gaifan Zhang | Yi Zhou | Danushka Bollegala
Gaifan Zhang | Yi Zhou | Danushka Bollegala
The meaning conveyed by a sentence often depends on the context in which it appears. Despite the progress of sentence embedding methods, it remains unclear how best to modify a sentence embedding conditioned on its context. To address this problem, we propose Condition-Aware Sentence Embeddings (CASE), an efficient and accurate method to create an embedding for a sentence under a given condition. First, CASE creates an embedding for the condition using a Large Language Model (LLM) encoder, where the sentence influences the attention scores computed for the tokens in the condition during pooling. Next, a supervised method is learnt to align the LLM-based text embeddings with the Conditional Semantic Textual Similarity (C-STS) task. We find that subtracting the condition embedding consistently improves the C-STS performance of LLM-based text embeddings and improves the isotropy of the embedding space. Moreover, our supervised projection method significantly improves the performance of LLM-based embeddings despite requiring a small number of embedding dimensions.
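One simple reading of the condition-subtraction idea is sketched below: remove the condition embedding from each sentence embedding before scoring similarity. This is only an illustration; CASE's actual condition embedding is computed with an LLM encoder whose attention is influenced by the sentence, and a learned projection is applied afterwards.

```python
# Illustrative condition-subtraction similarity (not the full CASE method).
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def conditional_similarity(embed, s1: str, s2: str, condition: str) -> float:
    """Compare two sentences under a condition by subtracting the condition
    embedding from each sentence embedding before scoring similarity.

    `embed` is any text-embedding callable returning a 1-D numpy array.
    """
    c = embed(condition)
    return cosine(embed(s1) - c, embed(s2) - c)
```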
Evaluation and LLM-Guided Learning of ICD Coding Rationales
Mingyang Li | Viktor Schlegel | Tingting Mu | Wuraola Oyewusi | Kai Kang | Goran Nenadic
Mingyang Li | Viktor Schlegel | Tingting Mu | Wuraola Oyewusi | Kai Kang | Goran Nenadic
ICD coding is the process of mapping unstructured text from Electronic Health Records (EHRs) to standardised codes defined by the International Classification of Diseases (ICD) system. In order to promote trust and transparency, existing explorations on the explainability of ICD coding models primarily rely on attention-based rationales and qualitative assessments conducted by physicians, yet lack a systematic evaluation across diverse types of rationales using consistent criteria and high-quality rationale-annotated datasets specifically designed for the ICD coding task. Moreover, dedicated methods explicitly trained to generate plausible rationales remain scarce. In this work, we present evaluations of the explainability of rationales in ICD coding, focusing on two fundamental dimensions: faithfulness and plausibility—in short how rationales influence model decisions and how convincing humans find them. For plausibility, we construct a novel, multi-granular rationale-annotated ICD coding dataset, based on the MIMIC-IV database and the updated ICD-10 coding system. We conduct a comprehensive evaluation across three types of ICD coding rationales: entity-level mentions automatically constructed via entity linking, LLM-generated rationales, and rationales based on attention scores of ICD coding models. Building upon the strong plausibility exhibited by LLM-generated rationales, we further leverage them as distant supervision signals to develop rationale learning methods. Additionally, by prompting the LLM with few-shot human-annotated examples from our dataset, we achieve notable improvements in the plausibility of rationale generation in both the teacher LLM and the student rationale learning models.
Evaluating the Effect of Retrieval Augmentation on Social Biases
Tianhui Zhang | Yi Zhou | Danushka Bollegala
Tianhui Zhang | Yi Zhou | Danushka Bollegala
Retrieval Augmented Generation (RAG) is a popular method for injecting up-to-date information into Large Language Model (LLM)-based Natural Language Generation (NLG) systems. While RAG can enhance factual accuracy, its effect on the social biases inherent in LLMs is not well understood. This paper systematically investigates how RAG modulates social biases across three languages (English, Japanese, and Chinese) and four categories (gender, race, age, and religion). By evaluating various generator LLMs on the BBQ benchmark, we analyse how document collections with controlled stereotypical content affect RAG outputs. We find that biases present in the retrieved documents are often significantly amplified in the generated texts, even when the base LLM itself has a low level of intrinsic bias. These findings raise concerns about the social fairness of RAG systems, underscoring the urgent need for careful bias evaluation before real-world deployment.
Persuasion at Play: Understanding Misinformation Dynamics in Demographic-Aware Human-LLM Interactions
Angana Borah | Rada Mihalcea | Veronica Perez-Rosas
Angana Borah | Rada Mihalcea | Veronica Perez-Rosas
Existing challenges in misinformation exposure and susceptibility vary across demographics, as some populations are more vulnerable to misinformation than others. Large language models (LLMs) introduce new dimensions to these challenges through their ability to generate persuasive content at scale and to reinforce existing biases. Our study introduces PANDORA, a framework that investigates the bidirectional persuasion dynamics between LLMs and humans when exposed to misinformative content. We use a multi-agent LLM framework to analyze the spread of misinformation under persuasion among demographic-oriented LLM agents. Our findings show that demographic factors influence LLM susceptibility, with up to 15 percentage point differences in misinformation correctness across groups. Multi-agent LLMs also exhibit echo chamber behavior, aligning with human-like group polarization patterns. This work thus highlights demographic divides in misinformation dynamics and offers insights for future interventions.
Entropy-Gated Branching for Efficient Test-Time Reasoning
Xianzhi Li | Ethan Callanan | Abdellah Ghassel | Xiaodan Zhu
Xianzhi Li | Ethan Callanan | Abdellah Ghassel | Xiaodan Zhu
Test-time compute methods can significantly improve the reasoning capabilities and problem-solving accuracy of large language models. However, these approaches require substantially more computational resources, with most computation wasted on exploring low-diversity branches where the model already exhibits high confidence. We observe that a small subset of uncertain reasoning steps has a disproportionately large impact on final prediction accuracy, and branching at these points tends to yield higher-quality and more diverse candidate reasoning steps. Therefore, we introduce Entropy-Gated Branching: a novel inference technique that dynamically allocates computational resources by selectively expanding prediction sequences only at points of high uncertainty. Our method leverages entropy as a gating mechanism to identify when branching is most beneficial, coupled with an external feedback model to rank and prune candidate branches. Empirical results on mathematical and financial reasoning benchmarks show that this strategy improves accuracy by 22.6% over standard inference, while running 31%-75% faster than test-time beam search on math benchmarks at higher performance. Our results show that dynamic resource allocation during inference can substantially improve both efficiency and effectiveness, offering a more scalable pathway to enhanced LLM reasoning capabilities. We release our code and tools at https://github.com/JXL884/entropy_gated_branching.
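The gating idea can be sketched as: at each step, compute the entropy of the next-token (or next-step) distribution and branch into multiple continuations only when the entropy exceeds a threshold, otherwise follow the single most likely continuation. The `step_fn` interface, threshold, and branching width below are assumptions, and the external feedback model used for ranking and pruning is omitted.

```python
# Sketch of entropy-gated branching (ranking/pruning by a feedback model omitted).
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_gated_branch(step_fn, prefix, threshold=1.5, width=3, max_steps=8):
    """Expand multiple candidate continuations only where the model is uncertain.

    step_fn(text, k) -> list of (token, prob), sorted by probability descending:
    a hypothetical interface returning the top-k next tokens for a partial solution.
    Returns the set of partial sequences explored after `max_steps` steps.
    """
    frontier = [prefix]
    for _ in range(max_steps):
        next_frontier = []
        for seq in frontier:
            top = step_fn(seq, width)
            if entropy([p for _, p in top]) > threshold:
                # Uncertain step: branch into several candidate continuations.
                next_frontier.extend(seq + tok for tok, _ in top)
            else:
                # Confident step: follow the single most likely continuation.
                next_frontier.append(seq + top[0][0])
        frontier = next_frontier
    return frontier
```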
Decomposition-Enhanced Training for Post-Hoc Attributions in Language Models
Sriram Balasubramanian | Samyadeep Basu | Koustava Goswami | Ryan A. Rossi | Varun Manjunatha | Roshan Santhosh | Ruiyi Zhang | Soheil Feizi | Nedim Lipka
Sriram Balasubramanian | Samyadeep Basu | Koustava Goswami | Ryan A. Rossi | Varun Manjunatha | Roshan Santhosh | Ruiyi Zhang | Soheil Feizi | Nedim Lipka
Large language models (LLMs) are increasingly used for long-document question answering, where reliable attribution to sources is critical for trust. Existing post-hoc attribution methods work well for extractive QA but struggle in multi-hop, abstractive, and semi-extractive settings, where answers synthesize information across passages. To address these challenges, we argue that post-hoc attribution can be reframed as a reasoning problem, where answers are decomposed into constituent units, each tied to specific context. We first show that prompting models to generate such decompositions alongside attributions improves performance. Building on this, we introduce DecompTune, a post-training method that teaches models to produce answer decompositions as intermediate reasoning steps. We curate a diverse dataset of complex QA tasks, annotated with decompositions by a strong LLM, and post-train Qwen-2.5 (7B and 14B) using a two-stage SFT + GRPO pipeline with task-specific curated rewards. Across extensive experiments and ablations, DecompTune substantially improves attribution quality, outperforming prior methods and matching or exceeding state-of-the-art frontier models.
INSURE-Dial: A Phase-Aware Conversational Dataset Benchmark for Compliance Verification and Phase Detection
Shubham Kulkarni | Alexander Lyzhov | Preetam Joshi | Shiva Chaitanya
Shubham Kulkarni | Alexander Lyzhov | Preetam Joshi | Shiva Chaitanya
Administrative phone tasks drain roughly $1 trillion annually from U.S. healthcare, with over 500 million insurance–benefit verification calls manually handled in 2024. We introduce INSURE-Dial, to our knowledge the first public benchmark for developing and assessing compliance-aware voice agents for phase-aware call auditing with span-based compliance verification. The corpus includes 50 de-identified, AI-initiated calls with live insurance representatives (mean 71 turns/call) and 1,000 synthetically generated calls that mirror the same workflow. All calls are annotated with a phase-structured JSON schema covering IVR navigation, patient identification, coverage status, medication checks (up to two drugs), and agent identification (CRN), and each phase is labeled for Information and Procedural compliance under explicit ask/answer logic. We define two novel evaluation tasks: (1) Phase Boundary Detection (span segmentation under phase-specific acceptance rules) and (2) Compliance Verification (IC/PC decisions given fixed spans). Per-phase scores are strong across small, low-latency baselines, but end-to-end reliability is constrained by span-boundary errors. On real calls, full-call exact segmentation is low, showing a gap between conversational fluency and audit-grade evidence.
NLP for Social Good: A Survey and Outlook of Challenges, Opportunities and Responsible Deployment
Antonia Karamolegkou | Angana Borah | Eunjung Cho | Sagnik Ray Choudhury | Martina Galletti | Pranav Gupta | Oana Ignat | Priyanka Kargupta | Neema Kotonya | Hemank Lamba | Sun-Joo Lee | Arushi Mangla | Ishani Mondal | Fatima Zahra Moudakir | Deniz Nazar | Poli Nemkova | Dina Pisarevskaya | Naquee Rizwan | Nazanin Sabri | Keenan Samway | Dominik Stammbach | Anna Steinberg Schulten | David Tomás | Steven R Wilson | Bowen Yi | Jessica H Zhu | Arkaitz Zubiaga | Anders Søgaard | Alexander Fraser | Zhijing Jin | Rada Mihalcea | Joel R. Tetreault | Daryna Dementieva
Antonia Karamolegkou | Angana Borah | Eunjung Cho | Sagnik Ray Choudhury | Martina Galletti | Pranav Gupta | Oana Ignat | Priyanka Kargupta | Neema Kotonya | Hemank Lamba | Sun-Joo Lee | Arushi Mangla | Ishani Mondal | Fatima Zahra Moudakir | Deniz Nazar | Poli Nemkova | Dina Pisarevskaya | Naquee Rizwan | Nazanin Sabri | Keenan Samway | Dominik Stammbach | Anna Steinberg Schulten | David Tomás | Steven R Wilson | Bowen Yi | Jessica H Zhu | Arkaitz Zubiaga | Anders Søgaard | Alexander Fraser | Zhijing Jin | Rada Mihalcea | Joel R. Tetreault | Daryna Dementieva
Natural language processing (NLP) now shapes many aspects of our world, yet its potential for positive social impact is underexplored. This paper surveys work in “NLP for Social Good” (NLP4SG) across nine domains relevant to global development and risk agendas, summarizing principal tasks and challenges. We analyze ACL Anthology trends, finding that inclusion and AI harms attract the most research, while domains such as poverty, peacebuilding, and environmental protection remain underexplored. Guided by our review, we outline opportunities for responsible and equitable NLP and conclude with a call for cross-disciplinary partnerships and human-centered approaches to ensure that future NLP technologies advance the public good.
From Delegates to Trustees: How Optimizing for Long-Term Interests Shapes Bias and Alignment in LLMs
Suyash Fulay | Jocelyn Zhu | Michiel A. Bakker
Suyash Fulay | Jocelyn Zhu | Michiel A. Bakker
Large language models (LLMs) have shown promising accuracy in predicting survey responses and policy preferences, which has increased interest in their potential to represent human interests in various domains. Most existing research has focused on “behavioral cloning”, effectively evaluating how well models reproduce individuals’ expressed preferences. Drawing on theories of political representation, we highlight an underexplored design trade-off: whether AI systems should act as delegates, mirroring expressed preferences, or as trustees, exercising judgment about what best serves an individual’s interests. This trade-off is closely related to issues of LLM sycophancy, where models can encourage behavior or validate beliefs that may be aligned with a user’s short-term preferences, but is detrimental to their long-term interests. Through a series of experiments simulating votes on various policy issues in the U.S. context, we apply a temporal utility framework that weighs short and long-term interests (simulating a trustee role) and compare voting outcomes to behavior-cloning models (simulating a delegate). We find that trustee-style predictions weighted toward long-term interests produce policy decisions that align more closely with expert consensus on well-understood issues, but also show greater bias toward models’ default stances on topics lacking clear agreement. These findings reveal a fundamental trade-off in designing AI systems to represent human interests. Delegate models better preserve user autonomy but may diverge from well-supported policy positions, while trustee models can promote welfare on well-understood issues yet risk paternalism and bias.
Investigating Language and Retrieval Bias in Multilingual Previously Fact-Checked Claim Detection
Ivan Vykopal | Antonia Karamolegkou | Jaroslav Kopčan | Qiwei Peng | Tomáš Javůrek | Michal Gregor | Marian Simko
Ivan Vykopal | Antonia Karamolegkou | Jaroslav Kopčan | Qiwei Peng | Tomáš Javůrek | Michal Gregor | Marian Simko
Multilingual Large Language Models (LLMs) offer powerful capabilities for cross-lingual fact-checking. However, these models often exhibit language bias, performing disproportionately better on high-resource languages such as English than on low-resource counterparts. We also present and inspect a novel concept, retrieval bias, in which information retrieval systems tend to favor certain information over other information, skewing the retrieval process. In this paper, we study language and retrieval bias in the context of Previously Fact-Checked Claim Detection (PFCD). We evaluate six open-source multilingual LLMs across 20 languages using a fully multilingual prompting strategy, leveraging the AMC-16K dataset. By translating task prompts into each language, we uncover disparities in monolingual and cross-lingual performance and identify key trends based on model family, size, and prompting strategy. Our findings highlight persistent bias in LLM behavior and offer recommendations for improving equity in multilingual fact-checking. To investigate retrieval bias, we employ multilingual embedding models and examine the frequency of retrieved claims. Our analysis reveals that certain claims are retrieved disproportionately across different posts, leading to inflated retrieval performance for popular claims while under-representing less common ones.
FFE-Hallu: Hallucinations in Fixed Figurative Expressions: A Benchmark of Idioms and Proverbs in the Persian Language
Faezeh Hosseini | Mohammadali Yousefzadeh | Yadollah Yaghoobzadeh
Faezeh Hosseini | Mohammadali Yousefzadeh | Yadollah Yaghoobzadeh
Figurative language, especially fixed figurative expressions (FFEs) such as idioms and proverbs, poses unique challenges for large language models (LLMs). Unlike literal phrases, FFEs are culturally grounded and often non-compositional, making them vulnerable to figurative hallucination, the generation or acceptance of plausible-sounding but culturally invalid expressions. We introduce FFE-Hallu, the first comprehensive benchmark for evaluating LLMs’ ability to generate, detect, and translate FFEs in Persian, a linguistically rich but underrepresented language. FFE-Hallu includes 600 carefully curated examples spanning three tasks: FFE generation from meaning, detection of fabricated FFEs (across four controlled categories), and FFE-to-FFE translation from English to Persian. Our evaluation of six state-of-the-art multilingual LLMs reveals persistent weaknesses in both cultural grounding and figurative competence. While models like GPT-4.1 display relative strength in rejecting fabricated FFEs and retrieving authentic ones, most systems struggle to reliably distinguish real FFEs from high-quality fabrications and often hallucinate in translation. This work shows that LLMs still have important gaps in understanding and using figurative language, and that specialized benchmarks like FFE-Hallu are needed.
MEVER: Multi-Modal and Explainable Claim Verification with Graph-based Evidence Retrieval
Delvin Ce Zhang | Suhan Cui | Zhelin Chu | Xianren Zhang | Dongwon Lee
Delvin Ce Zhang | Suhan Cui | Zhelin Chu | Xianren Zhang | Dongwon Lee
Verifying the truthfulness of claims usually requires joint multi-modal reasoning over both textual and visual evidence, such as analyzing both textual caption and chart image for claim verification. In addition, to make the reasoning process transparent, a textual explanation is necessary to justify the verification result. However, most claim verification works mainly focus on the reasoning over textual evidence only or ignore the explainability, resulting in inaccurate and unconvincing verification. To address this problem, we propose a novel model that jointly achieves evidence retrieval, multi-modal claim verification, and explanation generation. For evidence retrieval, we construct a two-layer multi-modal graph for claims and evidence, where we design image-to-text and text-to-image reasoning for multi-modal retrieval. For claim verification, we propose token- and evidence-level fusion to integrate claim and evidence embeddings for multi-modal verification. For explanation generation, we introduce multi-modal Fusion-in-Decoder for explainability. Finally, since almost all existing datasets are in the general domain, we create a scientific dataset, AIChartClaim, in the AI domain to complement the claim verification community. Experiments show the strength of our model.
DuwatBench: Bridging Language and Visual Heritage through an Arabic Calligraphy Benchmark for Multimodal Understanding
Shubham Patle | Sara Ghaboura | Hania Tariq | Mohammad Usman Khan | Omkar Thawakar | Rao Muhammad Anwer | Salman Khan
Shubham Patle | Sara Ghaboura | Hania Tariq | Mohammad Usman Khan | Omkar Thawakar | Rao Muhammad Anwer | Salman Khan
Arabic calligraphy represents one of the richest visual traditions of the Arabic language, blending linguistic meaning with artistic form. Although multimodal models have advanced across languages, their ability to process Arabic script, especially in artistic and stylized calligraphic forms, remains largely unexplored. To address this gap, we present DuwatBench, a benchmark of 1,272 curated samples containing about 1,475 unique words across 6 classical and modern calligraphic styles, each paired with sentence-level detection annotations. The dataset reflects real-world challenges in Arabic writing, such as complex stroke patterns, dense ligatures, and stylistic variations that often challenge standard text recognition systems. Using DuwatBench, we evaluate 13 leading Arabic and multilingual multimodal models and show that while they perform well on clean text, they struggle with calligraphic variation, artistic distortions, and precise visual–text alignment. By publicly releasing DuwatBench and its annotations, we aim to advance culturally grounded multimodal research, foster fair inclusion of Arabic language and visual heritage in AI systems, and support continued progress in this area. Our dataset and code are publicly available.
ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders
Ofer Meshi | Krisztian Balog | Sally Goldman | Avi Caciularu | Guy Tennenholtz | Jihwan Jeong | Amir Globerson | Craig Boutilier
Ofer Meshi | Krisztian Balog | Sally Goldman | Avi Caciularu | Guy Tennenholtz | Jihwan Jeong | Amir Globerson | Craig Boutilier
The promise of *LLM-based user simulators* to improve conversational AI is hindered by a critical "realism gap," leading to systems that are optimized for simulated interactions, but may fail to perform well in the real world. We introduce *ConvApparel*, a new dataset of human-AI conversations designed to address this gap. Its unique dual-agent data collection protocol, using both "good" and "bad" recommenders, enables counterfactual validation by capturing a wide spectrum of user experiences, enriched with first-person annotations of user satisfaction. We propose a comprehensive validation framework that combines *statistical alignment*, a *human-likeness score*, and *counterfactual validation* to test for generalization. Our experiments reveal a significant realism gap across all simulators. However, the framework also shows that data-driven simulators outperform a prompted baseline, particularly in counterfactual validation where they adapt more realistically to unseen behaviors, suggesting they embody more robust, if imperfect, user models.
Detecting Latin in Historical Books with Large Language Models: A Multimodal Benchmark
Yu Wu | Ke Shu | Jonas Fischer | Lidia Pivovarova | David Rosson | Eetu Mäkelä | Mikko Tolonen
Yu Wu | Ke Shu | Jonas Fischer | Lidia Pivovarova | David Rosson | Eetu Mäkelä | Mikko Tolonen
This paper presents a novel task of extracting low-resourced and noisy Latin fragments from mixed-language historical documents with varied layouts. We benchmark and evaluate the performance of large foundation models against a multimodal dataset of 724 annotated pages. The results demonstrate that reliable Latin detection with contemporary zero-shot models is achievable, yet these models lack a functional comprehension of Latin. This study establishes a comprehensive baseline for processing Latin within mixed-language corpora, supporting quantitative analysis in intellectual history and historical linguistics. Both the dataset and code are available at https://github.com/COMHIS/EACL26-detect-latin.
Persistent Personas? Role-Playing, Instruction Following, and Safety in Extended Interactions
Pedro Henrique Luz de Araujo | Michael A. Hedderich | Ali Modarressi | Hinrich Schuetze | Benjamin Roth
Pedro Henrique Luz de Araujo | Michael A. Hedderich | Ali Modarressi | Hinrich Schuetze | Benjamin Roth
Persona-assigned large language models (LLMs) are used in domains such as education, healthcare, and sociodemographic simulation. Yet, they are typically evaluated only in short, single-round settings that do not reflect real-world usage. We introduce an evaluation protocol that combines long persona dialogues (over 100 rounds) and evaluation datasets to create dialogue-conditioned benchmarks that can robustly measure long-context effects. We then investigate the effects of dialogue length on persona fidelity, instruction-following, and safety of seven state-of-the-art open- and closed-weight LLMs. We find that persona fidelity degrades over the course of dialogues, especially in goal-oriented conversations, where models must sustain both persona fidelity and instruction following. We identify a trade-off between fidelity and instruction following, with non-persona baselines initially outperforming persona-assigned models; as dialogues progress and fidelity fades, persona responses become increasingly similar to baseline responses. Our findings highlight the fragility of persona applications in extended interactions and our work provides a protocol to systematically measure such failures.
CliniBench: A Clinical Outcome Prediction Benchmark for Generative and Encoder-Based Language Models
Paul Grundmann | Jan Frick | Dennis Fast | Thomas Steffek | Felix Gers | Wolfgang Nejdl | Alexander Löser
Paul Grundmann | Jan Frick | Dennis Fast | Thomas Steffek | Felix Gers | Wolfgang Nejdl | Alexander Löser
With their growing capabilities, generative large language models (LLMs) are being increasingly investigated for complex medical tasks. However, their effectiveness in real-world clinical applications remains underexplored. To address this, we present CliniBench, the first benchmark that enables comparability of well-studied encoder-based classifiers and generative LLMs for discharge diagnosis prediction from admission notes in the MIMIC-IV dataset. Our extensive study compares 12 generative LLMs and 3 encoder-based classifiers and demonstrates that encoder-based classifiers consistently outperform generative models in diagnosis prediction. We assess several retrieval augmentation strategies for in-context learning from similar patients and find that they provide notable performance improvements for generative LLMs.
DIVINE : Coordinating Multimodal Disentangled Representations for Oro-Facial Neurological Disorder Assessment
Mohd Mujtaba Akhtar | Girish | Muskaan Singh
Mohd Mujtaba Akhtar | Girish | Muskaan Singh
In this study, we present a multimodal framework for predicting neuro-facial disorders by capturing both vocal and facial cues. We hypothesize that explicitly disentangling shared and modality-specific representations within multimodal foundation model embeddings can enhance clinical interpretability and generalization. To validate this hypothesis, we propose DIVINE, a fully disentangled multimodal framework that operates on representations extracted from state-of-the-art (SOTA) audio and video foundation models, incorporating hierarchical variational bottlenecks, sparse gated fusion, and learnable symptom tokens. DIVINE operates in a multitask learning setup to jointly predict diagnostic categories (Healthy Control, ALS, Stroke) and severity levels (Mild, Moderate, Severe). The model is trained using synchronized audio and video inputs and evaluated on the Toronto NeuroFace dataset under full (audio-video) as well as single-modality (audio-only and video-only) test conditions. Our proposed approach, DIVINE, achieves SOTA results, with the DeepSeek-VL2 and TRILLsson combination reaching 98.26% accuracy and 97.51% F1-score. Under modality-constrained scenarios, the framework performs well, showing strong generalization when tested with video-only or audio-only inputs. It consistently yields superior performance compared to unimodal models and baseline fusion techniques. To the best of our knowledge, DIVINE is the first fully disentangled multimodal framework to jointly perform categorical diagnosis and severity estimation for oro-facial neurological disorders using synchronized speech and facial video.
Biasless Language Models Learn Unnaturally: How LLMs Fail to Distinguish the Possible from the Impossible
Imry Ziv | Nur Lan | Emmanuel Chemla
Are large language models (LLMs) sensitive to the distinction between humanly possible and impossible languages? This question was recently used in a broader debate on whether LLMs and humans share the same innate learning biases. Previous work has answered it in the affirmative by comparing LLM learning curves on existing language datasets and on "impossible" datasets derived from them via various perturbation functions. Using the same methodology, we examine this claim on a wider set of languages and impossible perturbations. We find that in most cases, GPT-2 learns each language and its impossible counterpart equally easily, in contrast to previous findings. We also apply a more lenient condition by testing whether GPT-2 provides any kind of separation between the whole set of natural languages and the whole set of impossible languages, based on cross-linguistic variance in metrics derived from the learning curves. Taken together, these perspectives show that GPT-2 provides no systematic separation between the possible and the impossible.
Bridging Attribution and Open-Set Detection using Graph-Augmented Instance Learning in Synthetic Speech
Mohd Mujtaba Akhtar | Girish | Farhan Sheth | Muskaan Singh
We propose a unified framework for not only attributing synthetic speech to its source but also for detecting speech generated by synthesizers that were not encountered during training. This requires methods that move beyond simple detection to support both detailed forensic analysis and open-set generalization. To address this, we introduce SIGNAL, a hybrid framework that combines speech foundation models (SFMs) with graph-based modeling and open-set-aware inference. Our framework integrates Graph Neural Networks (GNNs) and a k-Nearest Neighbor (KNN) classifier, allowing it to capture meaningful relationships between utterances and recognize speech that doesn’t belong to any known generator. It constructs a query-conditioned graph over generator class prototypes, enabling the GNN to reason over relationships among candidate generators, while the KNN branch supports open-set detection via confidence-based thresholding. We evaluate SIGNAL using the DiffSSD dataset, which offers a diverse mix of real speech and synthetic audio from both open-source and commercial diffusion-based TTS systems. To further assess generalization, we also test on the SingFake benchmark. Our results show that SIGNAL consistently improves performance across both tasks, with Mamba-based embeddings delivering especially strong results. To the best of our knowledge, this is the first study to unify graph-based learning and open-set detection for tracing synthetic speech back to its origin.
Detecting Non-Membership in LLM Training Data via Rank Correlations
Pranav Shetty | Mirazul Haque | Zhiqiang Ma | Xiaomo Liu
As large language models (LLMs) are trained on increasingly vast and opaque text corpora, determining which data contributed to training has become essential for copyright enforcement, compliance auditing, and user trust. While prior work focuses on detecting whether a dataset was used in training (membership inference), the complementary problem of verifying that a dataset was not used has received little attention. We address this gap by introducing PRISM, a test that detects dataset-level non-membership using only grey-box access to model logits. Our key insight is that two models that have not seen a dataset exhibit higher rank correlation in their normalized token log probabilities than when one model has been trained on that data. Using this observation, we construct a correlation-based test that detects non-membership. Empirically, PRISM reliably rules out membership in training data across all datasets tested while avoiding false positives, thus offering a framework for verifying that specific datasets were excluded from LLM training.
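The rank-correlation idea can be sketched in a few lines. The snippet below is an illustration only, not the paper's implementation: it assumes per-token normalized log-probabilities from two reference models are already available and computes their Spearman correlation, which a calibrated threshold would then turn into a non-membership decision.

```python
# Hedged sketch of a rank-correlation non-membership check in the spirit of
# PRISM (details are assumptions, not the authors' implementation).
# Idea: two models that have NOT seen a dataset should rank tokens by
# normalized log-probability more similarly than a trained/untrained pair.

from statistics import mean

def rankdata(values):
    """Average ranks (1-based), with ties sharing the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    rx, ry = rankdata(x), rankdata(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

def non_membership_score(logprobs_model_a, logprobs_model_b):
    """Higher correlation -> more consistent with 'neither model trained on this data'."""
    return spearman(logprobs_model_a, logprobs_model_b)

if __name__ == "__main__":
    # Toy per-token log-probabilities for the same document under two models.
    a = [-2.1, -0.4, -3.7, -1.2, -0.9]
    b = [-2.3, -0.5, -3.1, -1.4, -1.0]
    print(f"rank correlation = {non_membership_score(a, b):.3f}")
    # A threshold (e.g. calibrated on held-out non-member data) would turn this
    # score into a non-membership decision.
```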
Taming Object Hallucinations with Verified Atomic Confidence Estimation
Jiarui Liu | Weihao Xuan | Zhijing Jin | Mona T. Diab
Multimodal Large Language Models (MLLMs) often suffer from hallucinations, particularly errors in object existence, attributes, or relations, which undermine their reliability. We introduce TACO (Verified Atomic Confidence Estimation), a simple framework that mitigates hallucinations through self-verification and confidence calibration without relying on external vision experts. TACO decomposes responses into atomic queries, paraphrases them to reduce sensitivity to wording, and estimates confidence using self-consistency (black-box) or self-confidence (gray-box) aggregation, before refining answers with a language model. Experiments on five benchmarks (POPE, MME, HallusionBench, AMBER, and MM-Hal Bench) with two MLLMs (LLaVA-1.5-7B and CogVLM2) show that TACO consistently outperforms direct prompting and Visual Contrastive Decoding, reduces systematic biases, and improves confidence calibration, demonstrating its effectiveness in enhancing the faithfulness of MLLMs.
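A toy rendering of the self-consistency branch described above: each atomic claim is re-asked several times and the agreement rate serves as its confidence. The `ask` stub and the claims are placeholders, not the authors' pipeline.

```python
# Hedged sketch of atomic self-consistency confidence: each atomic claim is
# verified several times and confidence is the agreement rate. The `ask`
# stub stands in for re-querying the MLLM; nothing here is the authors' code.

import random

def atomic_confidence(claims, ask, n_samples=8):
    """claims: list of atomic yes/no questions about the image.
    ask(question) -> True/False (stochastic). Returns per-claim confidence."""
    conf = {}
    for q in claims:
        votes = [ask(q) for _ in range(n_samples)]
        yes_rate = sum(votes) / n_samples
        conf[q] = max(yes_rate, 1 - yes_rate)  # confidence in the majority answer
    return conf

if __name__ == "__main__":
    random.seed(0)
    # Toy stochastic verifier: "dog" claims are verified reliably, "frisbee" not.
    ask = lambda q: random.random() < (0.9 if "dog" in q else 0.55)
    claims = ["Is there a dog in the image?", "Is the dog holding a frisbee?"]
    for q, c in atomic_confidence(claims, ask).items():
        print(f"{c:.2f}  {q}")
    # Low-confidence claims would then be revised or removed during refinement.
```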
DART: Leveraging Multi-Agent Disagreement for Tool Recruitment in Multimodal Reasoning
Nithin Sivakumaran | Justin Chen | David Wan | Yue Zhang | Jaehong Yoon | Elias Stengel-Eskin | Mohit Bansal
Specialized visual tools can augment large language models or vision language models with expert knowledge (e.g., grounding, spatial reasoning, medical knowledge, etc.), but knowing which tools to call (and when to call them) can be challenging. We introduce DART, a multi-agent framework that uses disagreements between multiple debating visual agents to identify useful visual tools (e.g., object detection, OCR, spatial reasoning, etc.) that can resolve inter-agent disagreement. These tools allow for fruitful multi-agent discussion by introducing new information, and by providing tool-aligned agreement scores that highlight agents in agreement with expert tools, thereby facilitating discussion. We utilize an aggregator agent to select the best answer by providing the agent outputs and tool information. We test DART on four diverse benchmarks and show that our approach improves over multi-agent debate as well as over single agent tool-calling frameworks, beating the next-strongest baseline (multi-agent debate with a judge model) by 3.4% and 2.4% on A-OKVQA and MMMU respectively. We also find that DART adapts well to new tools in applied domains, with a 1.3% improvement on the M3D medical dataset over other strong tool-calling, single agent, and multi-agent baselines. Additionally, we measure text overlap across rounds to highlight the rich discussion in DART compared to existing multi-agent methods. Finally, we study the distribution of expert tool calls to ensure that every tool is being reliably used to help resolve disagreement. Code: https://github.com/nsivaku/dart.
ToolDreamer: Instilling LLM Reasoning Into Tool Retrievers
Saptarshi Sengupta | Zhengyu Zhou | Jun Araki | Xingbo Wang | Bingqing Wang | Suhang Wang | Zhe Feng
Tool calling has become increasingly popular for Large Language Models (LLMs). However, for large tool sets, the resulting tokens would exceed the LLM’s context window limit, making it impossible to include every tool. Hence, an external retriever is used to provide LLMs with the most relevant tools for a query. Existing retrieval models rank tools based on the similarity between a user query and a tool description (TD). This leads to suboptimal retrieval, as user requests are often poorly aligned with the language of TDs. To remedy the issue, we propose ToolDreamer, a framework that conditions retriever models to fetch tools based on hypothetical (synthetic) TDs generated using an LLM, i.e., descriptions of tools that the LLM feels will be potentially useful for the query. The framework enables a more natural alignment between queries and tools within the language space of TDs. We apply ToolDreamer on the ToolRet dataset and show that our method improves the performance of sparse and dense retrievers with and without training, showcasing its flexibility. With our proposed framework, we aim to offload a portion of the reasoning burden to the retriever so that the LLM may effectively handle a large collection of tools without inundating its context window.
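The core retrieval idea, matching a hypothetical tool description rather than the raw query, can be illustrated with a small sketch. Everything below (the bag-of-words similarity, the stub standing in for the LLM, the toy tool set) is an assumption for illustration, not the paper's code.

```python
# Hedged sketch of retrieval via hypothetical tool descriptions: the query is
# first rewritten into the language of tool descriptions (here by a stub
# standing in for an LLM), and retrieval then matches description-to-description.

from collections import Counter
from math import sqrt

def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; a stand-in for a dense/sparse retriever."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    return dot / (sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values())) or 1.0)

def hypothesize_tool_description(query: str) -> str:
    """Placeholder for an LLM call that imagines a useful tool's description."""
    return f"a tool that can {query.lower()} given structured input and returns a result"

def retrieve(query: str, tool_descriptions: dict, top_k: int = 2):
    hypo = hypothesize_tool_description(query)
    ranked = sorted(tool_descriptions, key=lambda t: cosine(hypo, tool_descriptions[t]), reverse=True)
    return ranked[:top_k]

if __name__ == "__main__":
    tools = {
        "currency_convert": "a tool that can convert an amount between currencies and returns a result",
        "weather_lookup": "a tool that can report current weather for a city",
        "unit_convert": "a tool that can convert an amount between measurement units and returns a result",
    }
    print(retrieve("convert 30 dollars to euros", tools))
```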
An Empirical Study of Speculative Decoding for Small Language Models
Luca Mainardi | Selcuk Sandikci | Joaquin Vanschoren
Speculative decoding has emerged as a promising approach to accelerate Large Language Model inference. However, existing research has predominantly focused on 7B-70B parameter models, leaving a critical knowledge gap for small language models (1-2B parameters) that are increasingly important for edge computing and agentic AI systems. This paper presents the first comprehensive empirical study of speculative decoding techniques for small language models. We evaluate five distinct method categories across three representative model families and reveal that drafting overhead, rather than draft quality, becomes the primary bottleneck fundamentally limiting acceleration of small models. We demonstrate that traditional independent drafting fails completely due to the suboptimal architecture of available drafters, while self-drafting methods achieve meaningful acceleration only when employing sufficiently efficient draft modules. In contrast, retrieval-based methods with negligible computational overhead yield consistent gains. Based on these insights, we establish practical guidelines for effective small model acceleration.
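For readers unfamiliar with the mechanism being benchmarked, here is a minimal draft-and-verify loop with greedy acceptance, using toy integer "models"; it shows where drafting overhead enters and is not tied to any specific method evaluated in the paper.

```python
# Minimal sketch of draft-and-verify speculative decoding with greedy
# acceptance; the toy "models" below are placeholders, not real LMs.

def greedy_speculative_decode(target_next, draft_next, prompt, k=4, max_new=12):
    """target_next / draft_next: map a token sequence to the next token (greedy).
    The draft proposes k tokens; the target keeps the longest agreeing prefix,
    then appends its own correction, so the output equals pure target decoding."""
    seq = list(prompt)
    generated = 0
    while generated < max_new:
        # 1) Draft proposes a block of k tokens autoregressively.
        draft_seq = list(seq)
        proposals = []
        for _ in range(k):
            t = draft_next(draft_seq)
            proposals.append(t)
            draft_seq.append(t)
        # 2) Target verifies each proposal; stop at the first disagreement.
        accepted = 0
        for t in proposals:
            if generated + accepted >= max_new:
                break
            if target_next(seq + proposals[:accepted]) == t:
                accepted += 1
            else:
                break
        seq.extend(proposals[:accepted])
        generated += accepted
        # 3) On disagreement (or an empty block), the target emits one token,
        #    which is the step where drafting overhead must pay for itself.
        if generated < max_new:
            seq.append(target_next(seq))
            generated += 1
    return seq

if __name__ == "__main__":
    # Toy integer "language": the target repeats a cycle, the draft is mostly right.
    target = lambda s: (s[-1] + 1) % 5
    draft = lambda s: (s[-1] + 1) % 5 if len(s) % 7 else 0  # occasional errors
    print(greedy_speculative_decode(target, draft, prompt=[0], k=4))
```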
Lost in Formatting: How Output Formats Skew LLM Performance on Information Extraction
Rishi Ravikumar | Nuhu Ibrahim | Riza Batista-Navarro
We investigate how the choice of output format influences the performance of fine-tuned large language models on information extraction tasks. Based on over 280 experiments spanning multiple benchmarks, models and formats, we find that output formatting is a critical yet largely overlooked hyperparameter. Remarkably, in some cases, changing only the output format shifts F1 scores by over 40% despite using the same model. We further observe that no single format consistently dominates across settings, and the optimal choice depends on factors like model family and dataset characteristics. Overall, these results demonstrate that informationally equivalent output formats can produce substantial performance variation, highlighting the need to treat output formatting as a key factor in building accurate and reliable information extraction systems.
Policy-gradient reinforcement learning (PGRL) forms the backbone of current methods used to enhance alignment and reasoning in Large Language Models (LLMs). However, these methods are incompatible with diffusion-based language models (dLLMs). Most attempts to apply PGRL to dLLMs are either not scalable or use unprincipled approximations. This work introduces PADRE, a framework that uses a novel pseudo-likelihood-based objective for alignment of dLLMs. Our objective has the same optima as PGRL-based optimization, but does not need to evaluate exact likelihood from dLLMs. Experiments on various coding and mathematical reasoning benchmarks show that our method matches or surpasses the performance of recent dLLM training baselines such as diffu-GRPO/d1. Our approach provides a stable and practical alternative for RL-based fine-tuning of reasoning-focused dLLMs.
RoSE: Round-robin Synthetic Data Evaluation for Selecting LLM Generators without Human Test Sets
Jan Cegin | Branislav Pecher | Ivan Srba | Jakub Simko
LLMs are powerful generators of synthetic data, which are used for training smaller, specific models. This is especially valuable for low-resource languages, where human-labelled data is scarce but LLMs can still produce high-quality text. However, LLMs differ in how useful their outputs are for training. Selecting the best LLM as a generator is challenging because extrinsic evaluation requires costly human annotations (which are often unavailable for low-resource languages), while intrinsic metrics correlate poorly with downstream performance. We introduce Round-robin Synthetic data Evaluation (RoSE), a proxy metric for selecting the best LLM generator without human test sets. RoSE trains a small model on the outputs of a candidate generator (LLM) and then evaluates it on generated synthetic examples from all other candidate LLMs. The final RoSE score is the mean performance of this small model. Across six LLMs, eleven languages, and three tasks (sentiment, topic, intent), RoSE identifies the optimal generator more often than any other intrinsic heuristic. RoSE outperforms intrinsic heuristics and comes within 0.76 percentage points of the optimal generator baseline. This result is measured in terms of downstream performance, obtained by training a small model on the chosen generator’s outputs (optimal vs. proxy-metric–selected) and evaluating it on human-labelled test data. Additionally, RoSE is the only metric to achieve a positive correlation with performance on human test data.
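The round-robin scoring itself is simple to state in code. The sketch below assumes placeholder train/evaluate functions and toy data; it mirrors the described procedure (train on one generator's output, evaluate on all the others, average) without reproducing the authors' setup.

```python
# Hedged sketch of a RoSE-style round-robin score (names and the toy
# train/evaluate stubs are assumptions, not the authors' code).

def rose_scores(generators, train_fn, eval_fn):
    """generators: dict name -> synthetic dataset.
    train_fn(dataset) -> small model; eval_fn(model, dataset) -> accuracy.
    RoSE score of a generator = mean accuracy of a model trained on its data,
    evaluated on every *other* generator's synthetic data (no human test set)."""
    scores = {}
    for name, data in generators.items():
        model = train_fn(data)
        others = [eval_fn(model, d) for other, d in generators.items() if other != name]
        scores[name] = sum(others) / len(others)
    return scores  # pick the generator with the highest RoSE score

if __name__ == "__main__":
    # Toy stand-ins: "datasets" are label distributions; the "model" just
    # memorizes its majority label, and accuracy is that label's frequency.
    generators = {
        "llm_a": {"pos": 0.7, "neg": 0.3},
        "llm_b": {"pos": 0.5, "neg": 0.5},
        "llm_c": {"pos": 0.9, "neg": 0.1},
    }
    train = lambda data: max(data, key=data.get)
    evaluate = lambda label, data: data[label]
    scores = rose_scores(generators, train, evaluate)
    best = max(scores, key=scores.get)
    print(scores, "->", best)
```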
RotBench: Evaluating Multi-modal Large Language Models on Identifying Image Rotation
Tianyi Niu | Jaemin Cho | Elias Stengel-Eskin | Mohit Bansal
We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench, a 350-image manually-filtered benchmark comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information—including captions, depth maps, and more—or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models are able to reliably identify right-side-up (0°) images, while certain models are able to identify upside-down (180°) images. None can reliably distinguish between 90° and 270° rotated images. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models’ ability to distinguish 90° and 270° rotations, despite substantially improving the identification of 180° images. Together, these results reveal a significant gap between MLLMs’ spatial reasoning capabilities and human perception in identifying rotation.
Multilingual Amnesia: On the Transferability of Unlearning in Multilingual LLMs
Alireza Dehghanpour Farashah | Aditi Khandelwal | Marylou Fauchard | Zhuan Shi | Negar Rostamzadeh | Golnoosh Farnadi
As multilingual large language models become more widely used, ensuring their safety and fairness across diverse linguistic contexts presents unique challenges. While existing research on machine unlearning has mainly focused on monolingual settings, typically English, multilingual environments introduce additional complexities due to cross-lingual knowledge transfer and biases embedded in both pretraining and fine-tuning data. In this work, we address the problem of multilingual unlearning using the Aya-Expanse 8B model under two settings: (1) data unlearning and (2) concept unlearning. We extend benchmarks for factual knowledge and stereotypes into ten languages through translation—English, French, Arabic, Japanese, Russian, Farsi, Korean, Hindi, Hebrew, and Indonesian—spanning five language families and varying resource levels. Our experiments show that unlearning in high-resource languages tends to be more stable, with asymmetric transfer observed between typologically related languages. Moreover, analysis of linguistic distances reveals that syntactic similarity is the most predictive factor of cross-lingual unlearning effects.
Beyond Math: Stories as a Testbed for Memorization-Constrained Reasoning in LLMs
Yuxuan Jiang | Francis Ferraro
Memorization has been shown to greatly inflate Large Language Models’ (LLMs) performance on domains such as math and logic, where success should primarily rely on applying generalizable reasoning rules. In many real-world applications, however, memorization is not meant to be eliminated but selectively constrained—for example, in story understanding, where background knowledge must be integrated with narrative context. Drawing on the cognitive science distinction between “verbatim” (exact recall) and “gist” (semantic abstraction) memorization, we propose a two-tier framework for analyzing how LLMs reason under different degrees of memory access. The Inductive (prompt-guided) Setting softly steers models to reason through selective, context-relevant recall, while the Restrictive Setting imposes stronger constraints by limiting verbatim memory access. Evaluating GPT-4o, LLaMA3.3-70B, and DeepSeek V3 on six character-centric story understanding benchmarks, we find up to a 45.2% accuracy drop under the Restrictive Setting, revealing strong dependence on surface recall. By contrast, the Inductive Setting maintains performance, indicating that prompting can align LLMs toward memorization-constrained reasoning.
Neural Breadcrumbs: Membership Inference Attacks on LLMs Through Hidden State and Attention Pattern Analysis
Disha Makhija | Manoj Ghuhan Arivazhagan | Vinayshekhar Bannihatti Kumar | Rashmi Gangadharaiah
Membership inference attacks (MIAs) reveal whether specific data was used to train machine learning models, serving as important tools for privacy auditing and compliance assessment. Recent studies have reported that MIAs perform only marginally better than random guessing against large language models, suggesting that modern pre-training approaches with massive datasets may be free from privacy leakage risks. Our work offers a complementary perspective to these findings by exploring how examining LLMs’ internal representations, rather than just their outputs, may provide additional insights into potential membership inference signals. Our framework, memTrace, follows what we call neural breadcrumbs, extracting informative signals from transformer hidden states and attention patterns as they process candidate sequences. By analyzing layer-wise representation dynamics, attention distribution characteristics, and cross-layer transition patterns, we detect potential memorization fingerprints that traditional loss-based approaches may not capture. This approach yields strong membership detection across several model families, achieving average AUC scores of 0.85 on popular MIA benchmarks. Our findings suggest that internal model behaviors can reveal aspects of training data exposure even when output-based signals appear protected, highlighting the need for further research into membership privacy and the development of more robust privacy-preserving training techniques for large language models.
Chat-TS: Enhancing Multi-Modal Reasoning Over Time-Series and Natural Language Data
Paul Quinlan | Qingguo Li | Xiaodan Zhu
Large language models are being rapidly applied across many fields such as healthcare, finance, transportation, and energy, among many others. These applications often involve analyzing time-series data alongside contextual information in the form of natural language to support informed decisions. However, current time-series models are limited in their ability to perform reasoning that involves both time-series and their textual content. In this work, we address this gap by introducing Chat-TS, a large language model (LLM) based framework designed to support reasoning over time series and textual data. Unlike traditional models, Chat-TS integrates time-series tokens into LLMs’ vocabulary, enhancing its reasoning ability over both modalities without compromising core natural language capabilities. To support learning and evaluation, we contribute new datasets: the TS Instruct Training Dataset (pairing diverse time-series data with relevant text instructions and responses for instruction tuning), the TS Instruct Question and Answer (QA) Gold Dataset (multiple-choice questions to evaluate multimodal reasoning), and a TS Instruct Quantitative Probing Set (a small subset of TS Instruct QA reasoning tasks alongside math and decision-making questions for LLM evaluation). We design a training strategy to preserve the inherent reasoning capabilities of LLMs while augmenting them for time-series reasoning. Experiments show that Chat-TS achieves state-of-the-art performance in multimodal reasoning tasks by maintaining strong natural language proficiency while improving time-series reasoning.
Beyond Names: How Grammatical Gender Markers Bias LLM-based Educational Recommendations
Luca Benedetto | Antonia Donvito | Alberto Lucchetti | Andrea Cappelli | Paula Buttery
This paper investigates gender biases exhibited by LLM-based virtual assistants when providing educational recommendations, focusing on minimal gender indicators. Experimenting on Italian, a language with grammatical gender, we demonstrate that simply changing noun and adjective endings (e.g., from masculine "-o" to feminine "-a") significantly shifts recommendations. More specifically, we find that LLMs i) recommend STEM disciplines less for prompts with feminine grammatical gender and ii) narrow down the set of disciplines recommended to prompts with masculine grammatical gender; these effects persist across multiple commercial LLMs (from OpenAI, Anthropic, and Google). We show that grammatical gender cues alone trigger substantial distributional shifts in educational recommendations, and up to 76% of the bias exhibited when using prompts with proper names is already present with grammatical gender markers alone. Our findings highlight the need for robust bias evaluation and mitigation strategies before deploying LLM-based virtual assistants in student-facing contexts and the risks of using general-purpose LLMs for educational applications, especially in languages with grammatical gender.
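A minimal illustration of the prompt-pair setup: only an adjective ending differs between the two prompts, and the shift between recommendation distributions is summarized with total variation distance. The prompts and distributions below are invented for illustration, not taken from the paper.

```python
# Toy illustration: two prompts that differ only in a grammatical gender
# marker ("bravo" vs. "brava"), and a summary of how far apart the resulting
# recommendation distributions are. Prompts and numbers are invented.

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

prompt_masc = "Sono bravo in matematica e mi piace programmare. Cosa dovrei studiare?"
prompt_fem = "Sono brava in matematica e mi piace programmare. Cosa dovrei studiare?"

# Hypothetical discipline distributions aggregated over many sampled replies.
recs_masc = {"ingegneria": 0.45, "informatica": 0.30, "economia": 0.15, "lettere": 0.10}
recs_fem = {"ingegneria": 0.20, "informatica": 0.15, "economia": 0.25, "lettere": 0.40}

print(prompt_masc)
print(prompt_fem)
print(f"TVD between recommendation distributions: {total_variation(recs_masc, recs_fem):.2f}")
```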
ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images
Mathieu Sibue | Andrés Muñoz Garza | Samuel Mensah | Pranav Shetty | Zhiqiang Ma | Xiaomo Liu | Manuela Veloso
Enterprise documents, such as forms and reports, embed critical information for downstream applications like data archiving, automated workflows, and analytics. Although generalist Vision Language Models (VLMs) perform well on established document understanding benchmarks, their ability to conduct holistic, fine-grained structured extraction across diverse document types and flexible schemas is not well studied. Existing Key Entity Extraction (KEE), Relation Extraction (RE), and Visual Question Answering (VQA) datasets are limited by narrow entity ontologies, simple queries, or homogeneous document types, often overlooking the need for adaptable and structured extraction. To address these gaps, we introduce ExStrucTiny, a new benchmark dataset for structured Information Extraction (IE) from document images, unifying aspects of KEE, RE, and VQA. Built through a novel pipeline combining manual and synthetic human-validated samples, ExStrucTiny covers more varied document types and extraction scenarios. We analyze open and closed VLMs on this benchmark, highlighting challenges such as schema adaptation, query under-specification, and answer localization. We hope our work provides a bedrock for improving generalist models for structured IE in documents.
What’s Missing in Vision-Language Models? Probing Their Struggles with Causal Order Reasoning
Zhaotian Weng | Haoxuan Li | Xin Eric Wang | Kuan-Hao Huang | Jieyu Zhao
Despite the impressive performance of vision-language models (VLMs) on downstream tasks, their ability to understand and reason about causal relationships in visual inputs remains unclear. Robust causal reasoning is fundamental to solving complex high-level reasoning tasks, yet existing benchmarks often include a mixture of reasoning questions, and VLMs can frequently exploit object recognition and activity identification as shortcuts to arrive at the correct answers, making it challenging to truly assess their causal reasoning abilities. To bridge this gap, we introduce VQA-Causal and VCR-Causal, two new benchmarks specifically designed to isolate and rigorously evaluate VLMs’ causal reasoning abilities. Our findings reveal that while VLMs excel in object and activity recognition, they perform poorly on causal reasoning tasks, often only marginally surpassing random guessing. Further analysis suggests that this limitation stems from a severe lack of causal expressions in widely used training datasets, where causal relationships are rarely explicitly conveyed. We additionally explore fine-tuning strategies with hard negative cases, showing that targeted fine-tuning can improve models’ causal reasoning while maintaining generalization and downstream performance. Our study highlights a key gap in current VLMs and lays the groundwork for future work on causal understanding. We will release the code upon acceptance.
KidsArtBench: Multi-Dimensional Children’s Art Evaluation with Attribute-Aware MLLMs
Mingrui Ye | Chanjin Zheng | Zengyi Yu | Chenyu Xiang | Zhixue Zhao | Zheng Yuan | Helen Yannakoudakis
Multimodal Large Language Models (MLLMs) show progress across many visual–language tasks; however, their capacity to evaluate artistic expression remains limited: aesthetic concepts are inherently abstract and open-ended, and multimodal artwork annotations are scarce. We introduce KidsArtBench, a new benchmark of over 1k children’s artworks (ages 5-15) annotated by 12 expert educators across 9 rubric-aligned dimensions, together with expert comments for feedback. Unlike prior aesthetic datasets that provide single scalar scores on adult imagery, KidsArtBench targets children’s artwork and pairs multi-dimensional annotations with comment supervision to enable both ordinal assessment and formative feedback. Building on this resource, we propose an attribute-specific multi-LoRA approach – where each attribute corresponds to a distinct evaluation dimension (e.g., Realism, Imagination) in the scoring rubric – with Regression-Aware Fine-Tuning (RAFT) to align predictions with ordinal scales. On Qwen2.5-VL-7B, our method increases correlation from 0.468 to 0.653, with the largest gains on perceptual dimensions and narrowed gaps on higher-order attributes. Our results show that educator-aligned supervision and attribute-aware training yield pedagogically meaningful evaluations and establish a rigorous testbed for sustained progress in educational AI. We release data and code with ethics documentation.
Steering Safely or Off a Cliff? Rethinking Specificity and Robustness in Inference-Time Interventions
Navita Goyal | Hal Daumé III
Model steering, which involves intervening on hidden representations at inference time, has emerged as a lightweight alternative to finetuning for precisely controlling large language models. While steering efficacy has been widely studied, evaluations of whether interventions alter *only* the intended property remain limited, especially with respect to unintended changes in behaviors related to the target property. We call this notion specificity. We propose a framework that distinguishes three dimensions of specificity: general (preserving fluency and unrelated abilities), control (preserving related control properties), and robustness (preserving control properties under distribution shifts). We study two safety-critical use cases: steering models to reduce overrefusal and faithfulness hallucinations, and show that while steering achieves high efficacy and largely maintains general and control specificity, it consistently fails to preserve robustness specificity. In the case of overrefusal steering, for example, all steering methods reduce overrefusal without harming general abilities and refusal on harmful queries; however, they substantially increase vulnerability to jailbreaks. Our work provides the first systematic evaluation of specificity in model steering, showing that standard efficacy and specificity checks are insufficient, because without robustness evaluation, steering methods may appear reliable even when they compromise model safety.
Tracing Multilingual Knowledge Acquisition Dynamics in Domain Adaptation: A Case Study of Biomedical Adaptation
Xin Zhao | Naoki Yoshinaga | Yuma Tsuta | Akiko Aizawa
Multilingual domain adaptation (ML-DA) enables large language models (LLMs) to acquire domain knowledge across languages. Despite many methods, how domain knowledge is acquired within a language and transferred across languages remains unclear, leading to suboptimal performance, particularly in low-resource settings. This work examines the learning dynamics of LLMs during ML-DA. Because prior ML-DA studies often train and evaluate on datasets with mismatched knowledge coverage, we propose AdaXEval, an adaptive evaluation method that constructs multiple-choice QA datasets from the same bilingual domain corpus used for training, thereby enabling direct analysis of multilingual knowledge acquisition. Through continual training of LLMs with diverse data recipes, we track how LLMs acquire domain facts and pinpoint the loss shielding mechanism behind knowledge memorization and generalization in domain adaptation. Our experiments on multilingual LLMs reveal that cross-lingual transfer remains challenging. The code is released.
Contextual morphologically-guided tokenization for Latin encoder models
Marisa Hudspeth | Patrick J. Burns | Brendan O'Connor
Tokenization is a critical component of language model pretraining, yet standard tokenization methods often prioritize information-theoretic goals like high compression and low fertility rather than linguistic goals like morphological alignment. In fact, they have been shown to be suboptimal for morphologically rich languages, where tokenization quality directly impacts downstream performance. In this work, we investigate morphologically-aware tokenization for Latin, a morphologically rich language that is medium-resource in terms of pretraining data, but high-resource in terms of curated lexical resources – a distinction that is often overlooked but critical in discussions of low-resource language modeling. We find that morphologically-guided tokenization improves overall performance on four downstream tasks. Performance gains are most pronounced for out-of-domain texts, highlighting our models’ improved generalization ability. Our findings demonstrate the utility of linguistic resources to improve language modeling for morphologically complex languages. For low-resource languages that lack large-scale pretraining data, the development and incorporation of linguistic resources can serve as a feasible alternative to improve LM performance.
Beyond Random Sampling: Efficient Language Model Pretraining via Curriculum Learning
Yang Zhang | Amr Mohamed | Hadi Abdine | Guokan Shang | Michalis Vazirgiannis
Curriculum learning—organizing training data from easy to hard—has improved efficiency across machine learning domains, yet remains underexplored for language model pretraining. We present the first systematic investigation of curriculum learning in LLM pretraining, with over 200 models trained on up to 100B tokens across three strategies: vanilla curriculum learning, pacing-based sampling, and interleaved curricula, guided by six difficulty metrics spanning linguistic and information-theoretic properties. We evaluate performance on eight benchmarks under three realistic scenarios: limited data, unlimited data, and continual training. Our experiments show that curriculum learning consistently accelerates convergence in early and mid-training phases, reducing training steps by 18-45% to reach baseline performance. When applied as a warmup strategy before standard random sampling, curriculum learning yields sustained improvements up to 3.5%. We identify compression ratio, lexical diversity (MTLD), and readability (Flesch Reading Ease) as the most effective difficulty signals. Our findings demonstrate that data ordering—orthogonal to existing data selection methods—provides a practical mechanism for more efficient LLM pretraining.
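As a concrete illustration of one reported difficulty signal, the sketch below orders documents by zlib compression ratio; treating more compressible text as "easier" is an assumption made for this toy example, not the authors' exact recipe.

```python
# Minimal sketch of easy-to-hard data ordering using compression ratio as the
# difficulty signal (one of the metrics the paper reports as effective);
# an illustration, not the authors' pipeline.

import zlib

def compression_ratio(text: str) -> float:
    """Raw bytes / compressed bytes; more compressible (repetitive) text
    scores higher and is treated as 'easier' here (an assumption)."""
    raw = text.encode("utf-8")
    return len(raw) / max(1, len(zlib.compress(raw)))

def curriculum_order(documents):
    """Return documents sorted easy -> hard under the metric above."""
    return sorted(documents, key=compression_ratio, reverse=True)

if __name__ == "__main__":
    docs = [
        "the cat sat on the mat. the cat sat on the mat.",
        "Quantum chromodynamics couples quarks via gluon exchange.",
        "hello hello hello hello hello",
    ]
    for d in curriculum_order(docs):
        print(f"{compression_ratio(d):.2f}  {d[:40]}")
```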
ObjChangeVR: Object State Change Reasoning from Continuous Egocentric Views in VR Environments
Shiyi Ding | Shaoen Wu | Ying Chen
Recent advances in multimodal large language models (MLLMs) offer a promising approach for natural language-based scene change queries in virtual reality (VR). Prior work on applying MLLMs for object state understanding has focused on egocentric videos that capture the camera wearer’s interactions with objects. However, object state changes may occur in the background without direct user interaction, lacking explicit motion cues and making them difficult to detect. Moreover, no benchmark exists for evaluating this challenging scenario. To address these challenges, we introduce ObjChangeVR-Dataset, specifically for benchmarking the question-answering task of object state change. We also propose ObjChangeVR, a framework that combines viewpoint-aware and temporal-based retrieval to identify relevant frames, along with cross-view reasoning that reconciles inconsistent evidence from multiple viewpoints. Extensive experiments demonstrate that ObjChangeVR significantly outperforms baseline approaches across multiple MLLMs.
Tracking the Limits of Knowledge Propagation: How LLMs Fail at Multi-Step Reasoning with Conflicting Knowledge
Yiyang Feng | Zeming Chen | Haotian Wu | Jiawei Zhou | Antoine Bosselut
A common solution for mitigating outdated or incorrect information in Large Language Models (LLMs) is to provide updated facts in-context or through knowledge editing. However, these methods introduce knowledge conflicts when the knowledge update fails to overwrite the model’s parametric knowledge, which propagate to faulty reasoning. Current benchmarks for this problem, however, largely focus only on single knowledge updates and fact recall without evaluating how these updates affect downstream reasoning. In this work, we introduce Tʀᴀᴄᴋ (*Testing Reasoning Amid Conflicting Knowledge*), a new benchmark for studying how LLMs propagate new knowledge through multi-step reasoning when it conflicts with the model’s initial parametric knowledge. Spanning three reasoning-intensive scenarios (WIKI, CODE, and MATH), Tʀᴀᴄᴋ introduces multiple, realistic conflicts to mirror real-world complexity. Our results on Tʀᴀᴄᴋ reveal that providing updated facts to models for reasoning can worsen performance compared to providing no updated facts to the model, and that this degradation worsens as more updated facts are provided. We show that this failure stems both from an inability to faithfully integrate updated facts and from flawed reasoning even when knowledge is integrated. Tʀᴀᴄᴋ provides a rigorous new benchmark to measure and guide future progress on propagating conflicting knowledge in multi-step reasoning.
Do Audio LLMs Really LISTEN, or Just Transcribe? Measuring Lexical vs. Acoustic Emotion Cues Reliance
Jingyi Chen | Zhimeng Guo | Jiyun Chun | Pichao Wang | Andrew Perrault | Micha Elsner
Understanding emotion from speech requires sensitivity to both lexical and acoustic cues. However, it remains unclear whether large audio language models (LALMs) genuinely process acoustic information or rely primarily on lexical contents. We present LISTEN (Lexical vs. Acoustic Speech Test for Emotion in Narratives), a controlled benchmark designed to disentangle lexical reliance from acoustic sensitivity in emotion understanding. Across evaluations of six state-of-the-art LALMs, we observe a consistent lexical dominance. Models predict “neutral” when lexical cues are neutral or absent, show limited gains under cue alignment, and fail to classify distinct emotions under cue conflict. In paralinguistic settings, performance approaches chance. These results indicate that current LALMs largely “transcribe” rather than “listen,” relying heavily on lexical semantics while underutilizing acoustic cues. LISTEN offers a principled framework for assessing emotion understanding in multimodal models.
CSPB: Conversational Speech Processing Benchmark for Self-supervised Speech Models
Zili Huang | Matthew Maciejewski | Leibny Paola Garcia Perera | Shinji Watanabe | Sanjeev Khudanpur
Recent advances in self-supervised learning (SSL) have led to powerful speech representation models, yet their robustness in real-world conversational settings remains largely untested. Most existing benchmarks focus on clean, single-speaker, single-channel audio, failing to reflect the complexities of natural human interaction—where background noise, reverberation, and overlapping speech are the norm. To bridge these critical gaps, we present the Conversational Speech Processing Benchmark (CSPB), a new benchmark designed to assess the robustness of SSL speech models in realistic conversational scenarios. CSPB is constructed from four multi-party datasets—AMI, AliMeeting, MMCSG, and DiPCo—and supports both single-channel and multi-channel evaluation. By releasing CSPB as an open-source toolkit, we aim to establish a unified framework for evaluating and advancing robust, spatially-aware self-supervised speech models.
Multi-Token Completion for Text Anonymization
Pulkit Madaan | Krithika Ramesh | Lisa Bauer | Charith Peris | Anjalie Field
Text anonymization is a critical task for enabling research and development in high-stakes domains containing private data, like medicine, law, and social services. While much research has focused on redacting sensitive content from text, substantially less work has focused on what to replace redacted content with, which can enhance privacy and becomes increasingly important with greater levels of redaction. In this work, we formulate predicting replacements for sensitive spans as a research task with principled use-inspired evaluation criteria. We further propose a multi-token completion method for accomplishing this task that is designed to preserve consistency with low compute requirements, thus facilitating practitioners to anonymize data locally before sharing it externally. Human and automated annotations demonstrate that our approach produces more realistic text and better preserves utility than alternative infilling methods and differentially private mechanisms across multiple domains without retraining. Overall, our work explores the under-studied task of what to replace redacted content with and contributes grounded evaluations capturing utility, facilitating future work.
MERLIN: Multi-Stage Curriculum Alignment for Multilingual Encoder-LLM Integration in Cross-Lingual Reasoning
Kosei Uemura | David Guzmán | Quang Phuoc Nguyen | Jesujoba Oluwadara Alabi | En-Shiun Annie Lee | David Ifeoluwa Adelani
Large language models (LLMs) excel in English but still struggle with complex reasoning in many low-resource languages (LRLs). Existing methods align LLMs with multilingual encoders, such as LangBridge and MindMerger, raising the accuracy for mid- and high-resource languages, yet a large performance gap remains for LRLs. We present MERLIN, a model-stacking framework that iteratively refines the model in two stages following a curriculum strategy, moving from general data (bilingual bitext) to specific data (task-specific examples), and adapts only a small set of DoRA weights. On the AfriMGSM benchmark, MERLIN improves exact-match accuracy by +12.9 pp over MindMerger and outperforms GPT-4o-mini by 15.2 pp. It also yields consistent gains on MGSM and MSVAMP (+0.9 and +2.8 pp), demonstrating effectiveness across both low- and high-resource settings.
Now You Hear Me: Audio Narrative Attacks Against Large Audio–Language Models
Ye Yu | Haibo Jin | Yaoning Yu | Jun Zhuang | Haohan Wang
Large audio-language models increasingly operate on raw speech inputs, enabling more seamless integration across domains such as voice assistants, education, and clinical triage. This transition, however, introduces a distinct class of vulnerabilities that remain largely uncharacterized. We examine the security implications of this modality shift by designing a text-to-audio jailbreak that embeds disallowed directives within a narrative-style audio stream. The attack leverages an advanced instruction-following text-to-speech (TTS) model to exploit structural and acoustic properties, thereby circumventing safety mechanisms primarily calibrated for text. When delivered through synthetic speech, the narrative format elicits restricted outputs from state-of-the-art models, including Gemini 2.0 Flash, achieving a 98.26% success rate that substantially exceeds text-only baselines. These results highlight the need for safety frameworks that jointly reason over linguistic and paralinguistic representations, particularly as speech-based interfaces become more prevalent.
Evaluating Adversarial Robustness of Concept Representations in Sparse Autoencoders
Aaron J. Li | Suraj Srinivas | Usha Bhalla | Himabindu Lakkaraju
Sparse autoencoders (SAEs) are commonly used to interpret the internal activations of large language models (LLMs) by mapping them to human-interpretable concept representations. While existing evaluations of SAEs focus on metrics such as the reconstruction-sparsity tradeoff, human (auto-)interpretability, and feature disentanglement, they overlook a critical aspect: the robustness of concept representations to input perturbations. We argue that robustness must be a fundamental consideration for concept representations, reflecting the fidelity of concept labeling. To this end, we formulate robustness quantification as input-space optimization problems and develop a comprehensive evaluation framework featuring realistic scenarios in which adversarial perturbations are crafted to manipulate SAE representations. Empirically, we find that tiny adversarial input perturbations can effectively manipulate concept-based interpretations in most scenarios without notably affecting the base LLM’s activations. Overall, our results suggest that SAE concept representations are fragile and without further denoising or postprocessing they might be ill-suited for applications in model monitoring and oversight.
Mary, the Cheeseburger-Eating Vegetarian: Do LLMs Recognize Incoherence in Narratives?
Karin De Langis | Püren Öncel | Ryan Peters | Andrew Elfenbein | Laura Kristen Allen | Andreas Schramm | Dongyeop Kang
Leveraging a dataset of paired narratives, we investigate the extent to which large language models (LLMs) can reliably separate incoherent and coherent stories. A probing study finds that LLMs’ internal representations can reliably identify incoherent events in narratives. However, this separation disappears by the narrative’s end, and weakens when the differences between coherent and incoherent stories are more subtle. When asked to rate the overall coherence of narratives after reading, LLMs generate responses that fail to satisfactorily separate the coherent and incoherent narratives. The reasoning models we tested do not eliminate these deficits, indicating that thought strings may not be able to fully address the discrepancy between model internal state and behavior. Additionally, we find that LLMs appear to be more sensitive to incoherence resulting from an event that violates the setting (e.g., a rainy day in the desert) than to incoherence arising from a character violating an established trait (e.g., Mary, a vegetarian, later orders a cheeseburger), suggesting that LLMs may rely more on prototypical world knowledge than building coherence through a meaning-based world model of the narrative setting. Together, our results indicate that LLMs lack robustness in their ability to recognize incoherence in narratives.
Strong Memory, Weak Control: An Empirical Study of Executive Functioning in LLMs
Karin De Langis | Jong Inn Park | Khanh Chi Le | Andreas Schramm | Andrew Elfenbein | Michael C. Mensink | Dongyeop Kang
Working memory, or the ability to hold and manipulate information in the mind, is a critical component of human intelligence and executive functioning. It is correlated with performance on various cognitive tasks, including measures of fluid intelligence, which encompasses reasoning and problem solving. We use a comprehensive set of classic working memory tasks to estimate the working memory capacity of large language models (LLMs). We find that in most cases, LLMs exceed normative human scores. However, we do not find that the increased capacity of working memory is associated with higher performance on other executive functioning tasks or problem solving benchmarks. These results suggest that LLMs may have deficits in attentional control and cognitive flexibility, which result in difficulties with inhibiting automatic responses and adapting to shifting information. Our findings suggest that reasoning models, although they often do not currently fully compensate for these deficits, may have the potential to do so in the future.
Language models (LMs) have been reported to implicitly encode character-level information, despite such information not being explicitly provided during training. However, the mechanisms underlying this phenomenon remain largely unexplored. To reveal the mechanisms, we analyze how models acquire character-level knowledge by comparing LMs trained under controlled settings, such as specifying the pre-training dataset or tokenizer, with those trained under standard settings. We categorize the contributing factors into those arising from tokenization and those independent of tokenization. Our analysis reveals that merge rules and orthographic constraints constitute primary factors arising from tokenization, whereas semantic associations of substrings and syntactic information function as key factors independent of tokenization.
Analysing the role of lexical and temporal information in turn-taking through predictability
Sean Leishman | Sarenne Wallbridge | Peter Bell
Turn-taking is a fundamental component of human communication and is signalled through complex cues distributed across lexical, temporal, and prosodic information. Full-duplex models of spoken dialogue integrate these information sources to produce impressive turn-taking behaviour. Yet, existing evaluations of their turn-taking capabilities do not address which information sources drive predictions. We present a systematic analysis of the role of lexical-temporal features on the predictability of turn structure by examining PairwiseTurnGPT, a full-duplex model of spoken dialogue transcripts. Through PCA, mixed-effects modelling, and temporal surprisal analysis, we reveal context-dependent patterns: linguistic fluency paradoxically creates overconfidence at intermediate completion points, while turn-shift overlap dominates boundary detection. Our findings uncover where lexical-temporal information suffices and where additional cues become necessary, establishing a deeper understanding of how turn-taking cues are distributed and how to evaluate dialogue systems.
Beyond Length: Context-Aware Expansion and Independence as Developmentally Sensitive Evaluation in Child Utterances
Jiyun Chun | Eric Fosler-Lussier | Michael White | Andrew Perrault
Evaluating the quality of children’s utterances in adult-child dialogue remains challenging due to insufficient context-sensitive metrics. Common proxies such as Mean Length of Utterance (MLU), lexical diversity (vocd-D), and readability indices (Flesch-Kincaid Grade Level, Gunning Fog Index) are dominated by length and ignore conversational context, missing aspects of response quality such as reasoning depth, topic maintenance, and discourse planning. We introduce an LLM-as-a-judge framework that first classifies the Previous Adult Utterance Type and then scores the child’s response along two axes: Expansion (contextual elaboration and inferential depth) and Independence (the child’s contribution to advancing the discourse). These axes reflect fundamental dimensions in child language development, where Expansion captures elaboration, clause combining, and causal and contrastive connectives. Independence captures initiative, topic control, decreasing reliance on adult scaffolding through growing self-regulation, and audience design. We establish developmental validity by showing age-related patterns and demonstrate predictive value by improving age estimation over common baselines. We further confirm semantic sensitivity by detecting differences tied to discourse relations. Our metrics align with human judgments, enabling large-scale evaluation. This shifts child utterance assessment from simply measuring length to evaluating how meaningfully the child’s speech contributes to and advances the conversation within its context.
Translation via Annotation: A Computational Study of Translating Classical Chinese into Japanese
Zilong Li | Jie Cao
Ancient people translated classical Chinese into Japanese using a system of annotations placed around characters. We abstract this process as sequence tagging tasks and fit them into modern language technologies. The research on this annotation and translation system faces a low-resource problem. We alleviate this problem by introducing an LLM-based annotation pipeline and constructing a new dataset from digitized open-source translation data. We show that in the low-resource setting, introducing auxiliary Chinese NLP tasks enhances the training of sequence tagging tasks. We also evaluate the performance of Large Language Models (LLMs) on this task. While they achieve high scores on direct machine translation, our method could serve as a supplement to LLMs to improve the quality of character annotations.
Extending Audio Context for Long-Form Understanding in Large Audio-Language Models
Yuatyong Chaichana | Pittawat Taveekitworachai | Warit Sirichotedumrong | Potsawee Manakul | Kunat Pipatanakul
Large Audio-Language Models (LALMs) are often constrained by short audio context windows, even when their text backbones support long contexts, limiting long-form audio understanding. Prior work has introduced context-extension methods (e.g., YaRN) for unimodal LLMs, yet their application to LALMs remains unexplored. First, building on RoPE-based context extension, we introduce Partial YaRN, a training-free, modality-decoupled extension method that modifies only audio token positions, leaving text positions intact to preserve the base LLM’s text capabilities. Second, we propose Virtual Longform Audio Training (VLAT), a training strategy that extends Partial YaRN into a training-time positional augmentation. VLAT simulates diverse audio lengths during training, enabling generalization to inputs far longer than those seen in training. Our experiments on SALMONN and Qwen2-Audio confirm that Partial YaRN outperforms the original models across a wide range of settings, and VLAT provides substantial performance improvement on long audio of unseen lengths.
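The "modify only audio positions" idea can be loosely illustrated by compressing audio position indices while leaving text positions untouched. Note that YaRN itself rescales RoPE frequencies rather than raw position ids, so the sketch below is only a conceptual analogue under stated assumptions.

```python
# Loose illustration of the "modify only audio positions" idea: audio token
# positions are compressed into the originally supported audio span while
# text token positions are left as ordinary integers. Real YaRN operates on
# RoPE frequencies rather than raw position ids; this is only a sketch.

def partial_positions(token_modalities, trained_audio_span=750):
    """token_modalities: list of 'audio' / 'text' flags in sequence order.
    Audio positions are linearly rescaled if the audio run is too long;
    text positions simply continue the running counter unchanged."""
    n_audio = sum(1 for m in token_modalities if m == "audio")
    scale = min(1.0, trained_audio_span / max(1, n_audio))
    positions, pos, audio_idx = [], 0.0, 0
    for m in token_modalities:
        if m == "audio":
            positions.append(pos + audio_idx * scale)
            audio_idx += 1
        else:
            # close out the (possibly compressed) audio block, then count normally
            pos += audio_idx * scale
            audio_idx = 0
            positions.append(pos)
            pos += 1.0
    return positions

if __name__ == "__main__":
    mods = ["text"] * 3 + ["audio"] * 1500 + ["text"] * 4
    p = partial_positions(mods)
    print(p[:5], "...", p[-5:])  # audio occupies ~750 positions, text positions unchanged
```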
HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token
Sai Akhil Kogilathota | Sripadha Vallabha E G | Luzhe Sun | Jiawei Zhou
Hallucinations remain a persistent challenge for vision–language models (VLMs), which often describe nonexistent objects or fabricate facts. Existing detection methods typically operate after text generation, making intervention both costly and untimely. We investigate whether hallucination risk can instead be predicted before any token is generated by probing a model’s internal representations in a single forward pass. Across a diverse set of vision–language tasks and eight modern VLMs, including Llama-3.2-Vision, Gemma-3, Phi-4-VL, and Qwen2.5-VL, we examine three families of internal representations: (i) visual-only features without multimodal fusion, (ii) vision token representations within the text decoder, and (iii) query-token representations that integrate visual and textual information before generation. Probes trained on these representations achieve strong hallucination-detection performance without decoding, reaching up to 0.93 AUROC on Gemma-3-12B, Phi-4-VL 5.6B, and Molmo 7B. Late query-token states are the most predictive for most models, while visual or mid-layer features dominate in a few architectures (e.g., ∼0.79 AUROC for Qwen2.5-VL-7B using visual-only features). These results demonstrate that (1) hallucination risk is detectable pre-generation, (2) the most informative layer and modality vary across architectures, and (3) lightweight probes have the potential to enable early abstention, selective routing, and adaptive decoding to improve both safety and efficiency.
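A pre-generation probe of this kind reduces to a linear classifier over a captured hidden state. The sketch below uses random stand-in features and scikit-learn's logistic regression to show the shape of the pipeline; it is not the authors' probe.

```python
# Hedged sketch of a pre-generation hallucination probe: a linear classifier
# on a single hidden state captured before decoding starts. The random
# features below stand in for real VLM query-token representations.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Stand-in data: 1,000 examples with 512-dim "last query-token" states and a
# binary label for whether the eventual answer hallucinated.
hidden = rng.normal(size=(1000, 512))
labels = rng.integers(0, 2, size=1000)
# Inject a weak signal so the probe has something to find in this toy setup.
hidden[labels == 1, :8] += 0.5

train, test = slice(0, 800), slice(800, 1000)
probe = LogisticRegression(max_iter=1000).fit(hidden[train], labels[train])
scores = probe.predict_proba(hidden[test])[:, 1]
print(f"AUROC = {roc_auc_score(labels[test], scores):.3f}")
# In practice the probe's score could gate abstention or adaptive decoding
# before a single token is generated.
```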
Nanda Family: Open-Weights Generative Large Language Models for Hindi
Aaryamonvikram Singh | Debopriyo Banerjee | Dhruv Sahnan | Monojit Choudhury | Shivam Chauhan | Rocktim Jyoti Das | Xudong Han | Haonan Li | Alok Anil Jadhav | Utkarsh Agarwal | Mukund Choudhary | Fajri Koto | Junaid Hamid Bhat | Awantika Shukla | Samujjwal Ghosh | Samta Kamboj | Onkar Pandit | Lalit Pradhan | Rahul Pal | Sunil Kumar Sahu | Parvez Mullah | Ali El Filali | Zainul Abedien Ahmed Quraishi | Neha Sengupta | Gokulakrishnan Ramakrishnan | Rituraj Joshi | Gurpreet Gosal | Avraham Sheinin | Natalia Vassilieva | Preslav Nakov
Large language models remain predominantly English-centric, which limits their utility for underrepresented languages. We help bridge this gap for Hindi with Llama-3-Nanda-10B-Chat (aka Nanda-10B) and Llama-3.1-Nanda-87B-Chat (aka Nanda-87B), forming the Nanda family of open-weight bilingual models (https://github.com/MBZUAI-IFM/Nanda-Family). Our approach integrates: (i) a tokenizer extending Llama’s vocabulary with 20% Hindi-specific tokens, thus halving Hindi tokenization fertility while preserving English efficiency, (ii) Hindi-first parameter-efficient continual pretraining using Llama Pro on a 65B-token corpus spanning Devanagari script, code-mixed, and Romanized Hindi, and (iii) bilingual instruction and safety alignment on a large culturally grounded dataset. The resulting Nanda models outperform open-weight LLMs of comparable size: Nanda-87B yields high generative quality, and Nanda-10B shows competitive general-purpose performance. Nanda-87B demonstrates state-of-the-art performance on summarization, translation, transliteration, and instruction following. Moreover, both models achieve state-of-the-art performance in safety and in cultural knowledge. Our results demonstrate that careful tokenizer design, data curation, and continual pretraining can yield capable and safe LLMs for resource-poor languages without compromising English performance.
Wugnectives: Novel Entity Inferences of Language Models from Discourse Connectives
Daniel Brubaker | William Sheffield | Junyi Jessy Li | Kanishka Misra
The role of world knowledge has been particularly crucial for predicting the discourse connective that marks the discourse relation between two arguments, with language models (LMs) being generally successful at this task. We flip this premise in our work, and instead study the inverse problem of understanding whether discourse connectives can inform LMs about the world. To this end, we present Wugnectives, a dataset of 8,880 stimuli that evaluates LMs’ inferences about novel entities in contexts where connectives link the entities to particular attributes. Investigating 17 different LMs across various scales and training regimens, we found that tuning an LM to show reasoning behavior yields noteworthy improvements on most connectives. At the same time, there was a large variation in LMs’ overall performance across connective type, with all models systematically struggling on connectives that express a concessive meaning. Our findings pave the way for more nuanced investigations into the functional role of language cues as captured by LMs. We release Wugnectives at https://github.com/kanishkamisra/wugnectives
Can LLMs reason over extended multilingual contexts? Towards long-context evaluation beyond retrieval over haystacks
Amey Hengle | Prasoon Bajpai | Soham Dan | Tanmoy Chakraborty
Existing multilingual long-context benchmarks, often based on the popular needle-in-a-haystack test, primarily evaluate a model’s ability to locate specific information buried within irrelevant texts. However, such a retrieval-centric approach is myopic and inherently limited, as successful recall alone does not indicate a model’s capacity to reason over extended contexts. Moreover, these benchmarks are susceptible to data leakage and short-circuiting, and risk making the evaluation a priori identifiable. To address these limitations, we introduce MLRBench, a new synthetic benchmark for multilingual long-context reasoning. Unlike existing benchmarks, MLRBench goes beyond surface-level retrieval by including bAbI-style tasks that test multi-hop inference, aggregation, and epistemic reasoning. Spanning seven languages, we design MLRBench to be parallel, resistant to leakage, and scalable to arbitrary context lengths. Our extensive experiments with an open-weight large language model (LLM) reveal a pronounced gap between high- and low-resource languages, particularly for tasks requiring the model to aggregate multiple facts or predict the absence of information. We also find that, in multilingual settings, LLMs effectively utilize less than 30% of their claimed context length. Although off-the-shelf Retrieval-Augmented Generation helps alleviate this to a certain extent, it does not solve the long-context problem. We open-source MLRBench to enable future research in the improved evaluation and training of multilingual LLMs.
Knowing When to Abstain: Medical LLMs Under Clinical Uncertainty
Sravanthi Machcha | Sushrita Yerra | Sahil Gupta | Aishwarya Sahoo | Sharmin Sultana | Hong Yu | Zonghai Yao
Current evaluation of large language models (LLMs) overwhelmingly prioritizes accuracy; however, in real-world and safety-critical applications, the ability to abstain when uncertain is equally vital for trustworthy deployment. We introduce a unified benchmark and evaluation protocol for abstention in medical multiple-choice question answering (MCQA), integrating conformal prediction, adversarial question perturbations, and explicit abstention options. Our systematic evaluation of both open- and closed-source LLMs reveals that even state-of-the-art, high-accuracy models often fail to abstain when uncertain. Notably, providing explicit abstention options consistently increases expressed model uncertainty and promotes safer abstention, far more than input perturbations do, while scaling model size or advanced prompting brings little improvement. These findings highlight the central role of abstention mechanisms for trustworthy LLM deployment and offer practical guidance for improving safety in high-stakes applications.
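The benchmark combines conformal prediction with explicit abstention options. Below is a minimal split-conformal sketch under assumed inputs (per-option model probabilities and gold labels for a calibration set): the model abstains whenever the conformal prediction set contains more than one option. The 90% coverage target and all names are illustrative, not the paper's exact protocol.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal prediction: nonconformity = 1 - probability of the true option."""
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, q_level, method="higher")

def predict_or_abstain(test_probs, qhat):
    pred_set = np.where(1.0 - test_probs <= qhat)[0]   # options kept by the conformal test
    return int(pred_set[0]) if len(pred_set) == 1 else "ABSTAIN"

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(4), size=500)        # 4-option MCQA probabilities
cal_labels = rng.integers(0, 4, size=500)
qhat = conformal_threshold(cal_probs, cal_labels)
print(predict_or_abstain(rng.dirichlet(np.ones(4)), qhat))
```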
MedQA-CS: Objective Structured Clinical Examination (OSCE)-Style Benchmark for Evaluating LLM Clinical Skills
Zonghai Yao | Zihao Zhang | Chaolong Tang | Xingyu Bian | Youxia Zhao | Zhichao Yang | Junda Wang | Huixue Zhou | Won Seok Jang | Feiyun Ouyang | Hong Yu
Artificial intelligence (AI) and large language models (LLMs) in healthcare require advanced clinical skills (CS), yet current benchmarks fail to evaluate these comprehensively. We introduce MedQA-CS, an AI-SCE framework inspired by medical education’s Objective Structured Clinical Examinations (OSCEs), to address this gap. MedQA-CS evaluates LLMs through two instruction-following tasks—LLM-as-medical-student and LLM-as-CS-examiner—designed to reflect real clinical scenarios. Our contributions include developing MedQA-CS, a comprehensive evaluation framework with publicly available data and expert annotations, and providing the quantitative and qualitative assessment of LLMs as reliable judges in CS evaluation. Our experiments show that MedQA-CS is a more challenging benchmark for evaluating clinical skills than traditional multiple-choice QA benchmarks (e.g., MedQA). Combined with existing benchmarks, MedQA-CS enables a more comprehensive evaluation of LLMs’ clinical capabilities for both open- and closed-source LLMs.
Continual-learning for Modelling Low-Resource Languages from Large Language Models
Santosh Srinath K | Mudit Somani | Varun Reddy Padala | Prajna Upadhyay | Abhijit Das
Modelling a language model for a multilingual scenario involves several potential challenges, among which catastrophic forgetting is the most significant. For example, small language models (SLMs) built for low-resource languages by adapting large language models (LLMs) are prone to catastrophic forgetting. This work proposes a continual learning strategy that combines parts-of-speech (POS)-based code-switching with a replay adapter strategy to mitigate catastrophic forgetting while training an SLM from an LLM. Experiments conducted on vision-language tasks such as visual question answering, as well as on language modelling, demonstrate the success of the proposed architecture.
Language-Grounded Multi-Domain Image Translation via Semantic Difference Guidance
Jongwon Ryu | Joonhyung Park | Jaeho Han | Yeong-Seok Kim | Hye-Rin Kim | Sunjae Yoon | Junyeong Kim
Multi-domain image-to-image translation requires grounding semantic differences expressed in natural language prompts into corresponding visual transformations, while preserving unrelated structural and semantic content. Existing methods struggle to maintain structural integrity and provide fine-grained, attribute-specific control, especially when multiple domains are involved. We propose LACE (Language-grounded Attribute-Controllable Translation), built on two components: (1) a GLIP-Adapter that fuses global semantics with local structural features to preserve consistency, and (2) a Multi-Domain Control Guidance mechanism that explicitly grounds the semantic delta between source and target prompts into per-attribute translation vectors, aligning linguistic semantics with domain-level visual changes. Together, these modules enable compositional multi-domain control with independent strength modulation for each attribute. Experiments on CelebA(Dialog) and BDD100K demonstrate that LACE achieves high visual fidelity, structural preservation, and interpretable domain-specific control, surpassing prior baselines. This positions LACE as a cross-modal content generation framework bridging language semantics and controllable visual translation. Code will be publicly available.
LLMs as Cultural Archives: Cultural Commonsense Knowledge Graph Extraction
Junior Cedric Tonga | Chen Cecilia Liu | Iryna Gurevych | Fajri Koto
Large language models (LLMs) encode rich cultural knowledge learned from diverse web-scale data, offering an unprecedented opportunity to model cultural commonsense at scale. Yet this knowledge remains mostly implicit and unstructured, limiting its interpretability and use. We present an iterative, prompt-based framework for constructing a Cultural Commonsense Knowledge Graph (CCKG) that treats LLMs as cultural archives, systematically eliciting culture-specific entities, relations, and practices and composing them into multi-step inferential chains across languages. We evaluate CCKG on five countries with human judgments of cultural relevance, correctness, and path coherence. We find that the cultural knowledge graphs are better realized in English, even when the target culture is non-English (e.g., Chinese, Indonesian, Arabic), indicating uneven cultural encoding in current LLMs. Augmenting smaller LLMs with CCKG improves performance on cultural reasoning and story generation, with the largest gains from English chains. Our results show both the promise and limits of LLMs as cultural technologies and that chain-structured cultural knowledge is a practical substrate for culturally grounded NLP.
Nahw: A Comprehensive Benchmark of Arabic Grammar Understanding, Error Detection, Correction, and Explanation
Hamdy Mubarak | Majd Hawasly | Abubakr Mohamed
Grammar comprehension is a critical capability for large language models (LLMs) to achieve fluency in a target language. In low-resource settings, as is the case with Arabic, limited availability of high-quality data can lead to significant gaps in grammatical understanding, making systematic evaluation essential. We introduce Nahw, a comprehensive benchmark for Arabic grammar that covers both theoretical knowledge and practical applications, including grammatical error detection, correction, and explanation. We evaluate a range of LLMs on these tasks and find that many models still exhibit substantial deficiencies in Arabic grammar comprehension, with GPT-4o achieving a score of 67% on average over all tasks, while the best-performing Arabic model in our experiments (ALLaM-7B) achieves 42%. Our experiments also demonstrate that while fine-tuning with synthetic data can improve performance, it does not match the effectiveness of training on natural, high-quality data.
Rethinking Creativity Evaluation: A Critical Analysis of Existing Creativity Evaluations
Li-Chun Lu | Miri Liu | Pin Chun Lu | Yufei Tian | Shao-Hua Sun | Nanyun Peng
We examine, analyze, and compare four representative creativity measures—perplexity, LLM-as-a-Judge, the Creativity Index (CI; measuring n-gram overlap with web corpora), and syntactic templates (detecting repetition of common part-of-speech patterns)—across diverse creative domains, such as creative writing, unconventional problem-solving, and research ideation. For each domain, we compile datasets with human-aligned creative and uncreative examples and evaluate each metric’s ability to discriminate between the two sets. Our analyses reveal limited consistency both across domains and metrics, as metrics that distinguish creativity in one domain fail in others (e.g., CI correctly distinguishes in creative writing but fails in problem-solving), and different metrics often disagree on the same data points (e.g., CI suggests one set is more creative, while perplexity indicates the other). We highlight key limitations, such as perplexity reflecting fluency rather than novelty; LLM-as-a-Judge producing inconsistent judgments under minor prompt variations and exhibiting bias towards particular labels; CI primarily measuring lexical diversity, with high sensitivity to implementation choices; and syntactic templates being ineffective in settings dominated by formulaic language. Our findings underscore the need for more robust, generalizable evaluation frameworks that better align with human judgments of creativity. We release the datasets and evaluation code: https://github.com/lichun-19/creative_eval.
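For readers unfamiliar with CI-style metrics, the toy function below shows what an n-gram-overlap measure captures: the fraction of a text's n-grams that also appear in a reference corpus. The real Creativity Index matches against large web corpora; the tiny reference string here is purely illustrative.

```python
def ngram_coverage(text: str, reference: str, n: int = 3) -> float:
    """Fraction of the text's n-grams that also occur in the reference corpus.
    Lower coverage suggests more novel wording under a CI-style view."""
    def ngrams(s):
        toks = s.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    text_ngrams = ngrams(text)
    if not text_ngrams:
        return 0.0
    return len(text_ngrams & ngrams(reference)) / len(text_ngrams)

reference = "the quick brown fox jumps over the lazy dog"
print(ngram_coverage("the quick brown cat naps", reference))  # 1 of 3 trigrams overlaps
```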
TReX: Tokenizer Regression for Optimal Data Mixture
Inho Won | Hangyeol Yoo | Minkyung Cho | Jungyeul Park | Hoyun Song | KyungTae Lim
CONGRAD: Conflicting Gradient Filtering for Multilingual Preference Alignment
Jiangnan Li | Thuy-Trang Vu | Christian Herold | Amirhossein Tebbifakhr | Shahram Khadivi | Gholamreza Haffari
Naive joint training of large language models (LLMs) for multilingual preference alignment can suffer from negative interference. This is a known issue in multilingual training, where conflicting objectives degrade overall performance. However, the impact of this phenomenon in the context of multilingual preference alignment remains largely underexplored. To address this issue, we propose ConGrad, an effective and scalable filtering method that mitigates this interference by identifying and selecting preference samples that exhibit high cross-lingual affinity. Based on principles of multi-objective optimization, our approach computes an aggregated, cross-lingually beneficial gradient direction and uses this to filter for samples whose individual gradients align with this consensus direction. To ensure scalability for LLMs, we incorporate a sublinear gradient compression strategy that reduces memory overhead during gradient accumulation. We integrate ConGrad into a self-rewarding framework and evaluate on LLaMA3-8B and Gemma2-2B across 10 languages. Results show that ConGrad consistently outperforms strong baselines in both seen and unseen languages, with minimal alignment tax.
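A minimal sketch of the filtering idea, operating on already-computed per-sample gradient vectors (random placeholders here): form an aggregated consensus direction and keep the samples whose own gradients align with it. Averaging as the aggregation rule and the fixed keep ratio are assumptions; the paper additionally compresses gradients sublinearly for scalability.

```python
import numpy as np

def filter_by_gradient_alignment(per_sample_grads: np.ndarray, keep_ratio: float = 0.5):
    """per_sample_grads: (n_samples, dim) flattened per-sample gradients.
    Keep the samples most aligned with the mean (consensus) direction."""
    consensus = per_sample_grads.mean(axis=0)
    consensus /= np.linalg.norm(consensus) + 1e-8
    alignment = per_sample_grads @ consensus          # dot product with consensus direction
    k = int(len(alignment) * keep_ratio)
    keep_idx = np.argsort(-alignment)[:k]             # highest cross-lingual affinity first
    return keep_idx

grads = np.random.default_rng(0).normal(size=(100, 512))
print(filter_by_gradient_alignment(grads)[:10])
```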
Activation-Space Personality Steering: Hybrid Layer Selection for Stable Trait Control in LLMs
Pranav Bhandari | Nicolas Fay | Sanjeevan Selvaganapathy | Amitava Datta | Usman Naseem | Mehwish Nasim
Large Language Models exhibit implicit personalities in their generation, but reliably controlling or aligning these traits to meet specific needs remains an open challenge. Effective mechanisms for manipulating model behaviour during generation are a critical gap in the literature, and personality-aware LLMs offer a promising direction towards this objective. However, the relationship between these psychological constructs and their representations within LLMs remains underexplored, as does the use of these representations to steer model behaviour. We propose a novel pipeline that extracts hidden-state activations from transformer layers using the Big Five Personality Traits (Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism), a comprehensive and empirically validated framework for modelling human personality, applies low-rank subspace discovery methods, and identifies trait-specific optimal layers across different model architectures for robust injection. The resulting personality-aligned directions are then operationalised through a flexible steering framework with dynamic layer selection, enabling precise control of trait expression in LLM outputs. Our findings reveal that personality traits occupy a low-rank shared subspace, and that these latent structures can be transformed into actionable mechanisms for effective steering through careful perturbations without impacting fluency, variance, or general capabilities, helping to bridge the gap between psychological theory and practical model alignment.
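A heavily simplified sketch of the general recipe (difference-of-means trait direction plus additive steering of hidden states); the actual pipeline uses low-rank subspace discovery and per-trait layer selection, and the arrays below are random stand-ins for layer activations.

```python
import numpy as np

def trait_direction(acts_high: np.ndarray, acts_low: np.ndarray) -> np.ndarray:
    """Direction separating activations of high-trait vs. low-trait generations."""
    d = acts_high.mean(axis=0) - acts_low.mean(axis=0)
    return d / (np.linalg.norm(d) + 1e-8)

def steer(hidden_states: np.ndarray, direction: np.ndarray, alpha: float = 4.0):
    """Add the scaled trait direction to every token's hidden state."""
    return hidden_states + alpha * direction

rng = np.random.default_rng(0)
acts_extraverted = rng.normal(1.0, 1.0, (200, 768))   # placeholder layer activations
acts_introverted = rng.normal(0.0, 1.0, (200, 768))
direction = trait_direction(acts_extraverted, acts_introverted)
steered = steer(rng.normal(size=(16, 768)), direction)
print(steered.shape)
```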
Speculative Decoding Speed-of-Light: Optimal Lower Bounds via Branching Random Walks
Sergey Pankratov | Dan Alistarh
KG-CRAFT: Knowledge Graph-based Contrastive Reasoning with LLMs for Enhancing Automated Fact-checking
Vítor Lourenço | Aline Paes | Tillman Weyde | Audrey Depeige | Mohnish Dubey
Claim verification is a core module in automated fact-checking systems, tasked with determining claim veracity using retrieved evidence. This work presents KG-CRAFT, a novel knowledge graph-based contrastive reasoning method that enhances automatic claim verification by LLMs. Our approach first constructs a knowledge graph from claims and associated reports, then formulates contextually relevant contrastive questions based on the knowledge graph structure. These questions guide the distillation of evidence-based reports, which are synthesised into a concise summary for veracity assessment. Extensive evaluations on two real-world datasets (LIAR-RAW and RAWFC) demonstrate that our method achieves a new state-of-the-art in predictive performance. Comprehensive analyses validate in detail the effectiveness of our knowledge graph-based contrastive reasoning approach in improving LLMs’ fact-checking capabilities.
SciRAG: Adaptive, Citation-Aware, and Outline-Guided Retrieval and Synthesis for Scientific Literature
Hang Ding | Yilun Zhao | Tiansheng Hu | Manasi Patwardhan | Arman Cohan
The accelerating growth of scientific publications has intensified the need for scalable, trustworthy systems to synthesize knowledge across diverse literature. While recent retrieval-augmented generation (RAG) methods have improved access to scientific information, they often overlook citation graph structure, adapt poorly to complex queries, and yield fragmented, hard-to-verify syntheses. We introduce SciRAG, an open-source framework for scientific literature exploration that addresses these gaps through three key innovations: (1) adaptive retrieval that flexibly alternates between sequential and parallel evidence gathering; (2) citation-aware symbolic reasoning that leverages citation graphs to organize and filter supporting documents; and (3) outline-guided synthesis that plans, critiques, and refines answers to ensure coherence and transparent attribution. Extensive experiments across multiple benchmarks such as QASA and ScholarQA demonstrate that SciRAG outperforms prior systems in factual accuracy and synthesis quality, establishing a new foundation for reliable, large-scale scientific knowledge aggregation.
Unintended Memorization of Sensitive Information in Fine-Tuned Language Models
Marton Szep | Jorge Marin Ruiz | Georgios Kaissis | Paulina Seidl | Rüdiger von Eisenhart-Rothe | Florian Hinterwimmer | Daniel Rueckert
Fine-tuning Large Language Models (LLMs) on sensitive datasets carries a substantial risk of unintended memorization and leakage of Personally Identifiable Information (PII), which can violate privacy regulations and compromise individual safety. In this work, we systematically investigate a critical and underexplored vulnerability: the exposure of PII that appears only in model inputs, not in training targets. Using both synthetic and real-world datasets, we design controlled extraction probes to quantify unintended PII memorization and study how factors such as language, PII frequency, task type, and model size influence memorization behavior. We further benchmark four privacy-preserving approaches including differential privacy, machine unlearning, regularization, and preference alignment, evaluating their trade-offs between privacy and task performance. Our results show that post-training methods generally provide more consistent privacy–utility trade-offs, while differential privacy achieves strong reduction in leakage in specific settings, although it can introduce training instability. These findings highlight the persistent challenge of memorization in fine-tuned LLMs and emphasize the need for robust, scalable privacy-preserving techniques.
The Pluralistic Moral Gap: Understanding Moral Judgment and Value Differences between Humans and Large Language Models
Giuseppe Russo | Debora Nozza | Paul Röttger | Dirk Hovy
People increasingly rely on Large Language Models (LLMs) for moral advice, which may influence humans’ decisions. Yet, little is known about how closely LLMs align with human moral judgments. To address this, we introduce the Moral Dilemma Dataset, a benchmark of 1,618 real-world moral dilemmas paired with a distribution of human moral judgments consisting of a binary evaluation and a free-text rationale. We treat this problem as a pluralistic distributional alignment task, comparing the distributions of LLM and human judgments across dilemmas. We find that models reproduce human judgments only under high consensus; alignment deteriorates sharply when human disagreement increases. In parallel, using a 60-value taxonomy built from 3,783 value expressions extracted from rationales, we show that LLMs rely on a narrower set of moral values than humans. These findings reveal a pluralistic moral gap–a mismatch in both the distribution and diversity of values expressed. To close this gap, we introduce Dynamic Moral Profiling (DMP), a Dirichlet-based sampling method that conditions model outputs on human-derived value profiles. DMP improves alignment by 64.3% and enhances value diversity, offering a step toward more pluralistic and human-aligned moral guidance.
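A minimal sketch of the Dirichlet-based conditioning idea behind DMP, assuming one has per-dilemma counts of human value expressions; the value taxonomy, counts, and prompt format below are invented placeholders, not the paper's procedure.

```python
import numpy as np

values = ["honesty", "loyalty", "fairness", "care", "autonomy"]
human_value_counts = np.array([12, 3, 7, 9, 2])        # hypothetical counts from rationales

# Dirichlet distribution over value weights; sample a profile to condition the model on
rng = np.random.default_rng(0)
profile = rng.dirichlet(human_value_counts + 1.0)

weighted = ", ".join(f"{v} ({w:.2f})" for v, w in zip(values, profile))
prompt = f"Considering the following value profile: {weighted}, judge the dilemma: ..."
print(prompt)
```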
CoReTab: Improving Multimodal Table Understanding with Code-driven Reasoning
Van-Quang Nguyen | Takayuki Okatani
Existing datasets for multimodal table understanding, such as MMTab, primarily provide short factual answers without explicit multi-step reasoning supervision. Models trained on these datasets often generate brief responses that offer insufficient accuracy and limited insight into how the models arrive at the final answer. We introduce CoReTab, a code-driven reasoning framework that produces scalable, interpretable, and automatically verifiable annotations by coupling multi-step reasoning with executable Python code. Using the CoReTab framework, we curate a dataset of 115K verified samples averaging 529 tokens per response and fine-tune open-source MLLMs through a three-stage pipeline. We evaluate the resulting model trained on CoReTab across 17 MMTab benchmarks spanning table question answering, fact verification, and table structure understanding. Our model achieves significant gains of +6.2%, +5.7%, and +25.6%, respectively, over MMTab-trained baselines, while producing transparent and verifiable reasoning traces. These results establish CoReTab as a robust and generalizable supervision framework for improving multi-step reasoning in multimodal table understanding.
Explaining Generalization of AI-Generated Text Detectors Through Linguistic Analysis
Yuxi Xia | Kinga Stańczak | Benjamin Roth
AI-text detectors achieve high accuracy on in-domain benchmarks, but often struggle to generalize across different generation conditions such as unseen prompts, model families, or domains. While prior work has reported these generalization gaps, there are limited insights about the underlying causes. In this work, we present a systematic study aimed at explaining generalization behavior through linguistic analysis. We construct a comprehensive benchmark that spans 6 prompting strategies, 7 large language models (LLMs), and 4 domain datasets, resulting in a diverse set of human- and AI-generated texts. Using this dataset, we fine-tune classification-based detectors on various generation settings and evaluate their cross-prompt, cross-model, and cross-dataset generalization. To explain the performance variance, we compute correlations between generalization accuracies and feature shifts of 80 linguistic features between training and test conditions. Our analysis reveals that generalization performance for specific detectors and evaluation conditions is significantly associated with linguistic features such as tense usage and pronoun frequency.
Elections go bananas: A First Large-scale Multilingual Study of Pluralia Tantum using LLMs
Elena Spaziani | Kamyar Zeinalipour | Pierluigi Cassotti | Nina Tahmasebi
In this paper, we study the expansion of pluralia tantum, i.e., defective nouns which lack a singular form, like scissors. We base our work on an annotation framework specifically developed for the study of lexicalization of pluralia tantum, namely Lexicalization profiles. On a corresponding hand-annotated test set, we show that the OpenAI and DeepSeek models serve as useful annotators for semantic, syntactic and sense categories, with accuracy ranging from 51% to 89%, averaged across all feature groups and languages. Next, we turn to a large-scale investigation of pluralia tantum. Using dictionaries, we extract candidate words for Italian, Russian and English and keep those for which the changing ratio of singular and plural form is evident in a corresponding reference corpus. We use an LLM to annotate each instance from the reference corpora according to the annotation framework. We show that the large number of automatically annotated sentences for each feature can be used to perform in-depth linguistic analysis. Focusing on the correlation between an annotated feature and the grammatical form (singular vs. plural), we note patterns of morpho-semantic change.
CacheNotes: Task-Aware Key-Value Cache Compression for Reasoning-Intensive Knowledge Tasks
Giulio Corallo | Orion Weller | Fabio Petroni | Paolo Papotti
Integrating external knowledge into Large Language Models (LLMs) is crucial for many real-world applications, yet current methods like Retrieval-Augmented Generation (RAG) face limitations with broad, multi-source queries, while long-context models are computationally prohibitive. We introduce CacheNotes: Task-Aware Key-Value Cache Compression. Given a task description and a corpus, CacheNotes first generates a sequence of Compression-Planning-Tokens (CPTs), an offline task-focused distillation pass that identifies and organizes key information from the corpus. These CPTs are then used to guide a one-time compression of the corpus into a compact, reusable KV cache, which is then used alone at inference time to efficiently answer diverse, reasoning-intensive queries, eliminating repeated retrieval or context expansion. Experiments on LongBench show that, on Question-Answering tasks at a 20× compression, CacheNotes outperforms RAG by over 8 F1 points and reduces latency by over 4×. On RULER, it surpasses previous query-agnostic compression methods by 55 points, narrowing the gap to query-aware compression approaches. Additional results on real-world enterprise and synthetic datasets demonstrate its strong performance on multi-hop and broad-coverage queries.
Beyond Blind Following: Evaluating Robustness of LLM Agents under Imperfect Guidance
Yao Fu | Ran Qiu | Xinhe Wang | Jacob Sansom | Sathvika Ayyappa Prabhu | Huijie Tang | Jaekyeom Kim | Sungryull Sohn | Honglak Lee
Large language models (LLMs) have shown strong capabilities as task-solving agents across interactive domains. However, in complex environments, these agents may need to rely on auxiliary guidance to reduce the search space or make up for limited domain-specific knowledge. Such guidance includes human-provided manuals and demonstrations, retrieved examples from memory or external tools, high-level heuristics, and agent-acquired knowledge from prior interactions. However, this guidance may be imperfect. For example, due to changes in the environment, ambiguous or simplified language, or retrieval errors from external sources, guidance can be incomplete, outdated, or contextually mismatched, potentially causing errors or failures during task execution. To address this, we introduce MIRAGE, a benchmark for MeasurIng Robustness of LLM Agents under Imperfect GuidancE. MIRAGE includes procedurally generated environments in navigation, cooking, and gaming, where both the environment and the auxiliary guidance vary in fidelity and relevance. We further extend MIRAGE to realistic web tasks via WebArena, using noisy or underspecified instructions extracted from demonstrations. Our findings reveal critical failure modes in current LLM agents and motivate future work on improving their robustness under imperfect guidance.
How Do LLMs Generate Contrastive Sentiments? A Mechanistic Perspective
Van Bach Nguyen | Jörg Schlötterer | Christin Seifert
This paper presents a mechanistic investigation of how large language models (LLMs) generate contrastive sentiments. We define this task as transforming the sentiment of a given text (e.g., from positive to negative) while making minimal changes to its content. We identify two core mechanisms: (1) a preservation mechanism that maintains the sentiment of the input text, primarily mediated by specific attention heads, and (2) a sentiment transformation mechanism, which integrates a representation of the target sentiment label with the original valenced words using a circuit containing both MLP and attention layers. Building on these findings, we propose and validate a novel mechanistic intervention. By modifying key attention heads, we steer the LLM toward more effective contrastive generation, increasing the sentiment flip rate without sacrificing the minimality of changes. Our work not only deepens the understanding of the mechanisms underlying contrastive sentiment generation in LLMs, but also introduces a promising new direction to steer LLM behavior via targeted, mechanistic interventions.
Continual Neural Topic Model
Charu Karakkaparambil James | Waleed Mustafa | Marcio Monteiro | Marius Kloft | Sophie Fellenz
In continual learning, our aim is to learn a new task without forgetting what was learned previously. In topic models, this translates to learning new topic models without forgetting previously learned topics. Previous work either considered Dynamic Topic Models (DTMs), which learn the evolution of topics based on the entire training corpus at once, or Online Topic Models, which are updated continuously based on new data but do not have long-term memory. To fill this gap, we propose the Continual Neural Topic Model (CoNTM), which continuously learns topic models at subsequent time steps without forgetting what was previously learned. This is achieved using a global prior distribution that is continuously updated. In our experiments, CoNTM consistently outperformed the dynamic topic model in terms of topic quality and predictive perplexity while being able to capture topic changes online. The analysis reveals that CoNTM can learn more diverse topics and better capture temporal changes than existing methods.
MAQuA: Multi-outcome Adaptive Question-Asking for Mental Health using Item Response Theory
Vasudha Varadarajan | Hui Xu | Rebecca Astrid Böhme | Mariam Marlen Mirström | Sverker Sikström | H. Andrew Schwartz
Recent advances in LLMs offer new opportunities for scalable, interactive mental health assessment, but excessive querying burdens users and is inefficient for real-world screening across transdiagnostic symptom profiles. We introduce MAQuA, a multi-outcome modeling and adaptive question-asking framework for simultaneous, multidimensional mental health screening. Combining multi-outcome modeling on language responses with item response theory (IRT) and factor analysis, MAQuA selects the questions with the most informative responses across multiple dimensions at each turn to optimize diagnostic information, improving accuracy and potentially reducing response burden. Empirical results on a novel dataset reveal that MAQuA reduces the number of assessment questions required for score stabilization by 50–87% compared to random ordering (e.g., achieving stable depression scores with 71% fewer questions and eating disorder scores with 85% fewer questions). MAQuA demonstrates robust performance across both internalizing (depression, anxiety) and externalizing (substance use, eating disorder) domains, with early stopping strategies further reducing patient time and burden. These findings position MAQuA as a powerful and efficient tool for scalable, nuanced, and interactive mental health screening, advancing the integration of LLM-based agents into real-world clinical workflows.
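One standard way to realize IRT-driven adaptive selection is to pick the item with maximal Fisher information under a two-parameter logistic (2PL) model, sketched below with invented item parameters; MAQuA additionally couples this with multi-outcome modeling of language responses.

```python
import numpy as np

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability/severity estimate theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

# hypothetical item bank: discrimination a, difficulty b
a = np.array([1.2, 0.8, 2.0, 1.5])
b = np.array([-1.0, 0.0, 0.5, 2.0])

theta_hat = 0.3                                  # current latent score estimate
info = item_information(theta_hat, a, b)
next_item = int(np.argmax(info))                 # ask the most informative question next
print(next_item, info)
```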
Principled Self-Correction in Discrete Diffusion: A UCB-Guided Framework for Text Generation
Masaki Asada | Makoto Miwa
Inspired by their success in image synthesis, diffusion models offer a flexible, iterative alternative to rigid left-to-right text generation. However, a fundamental training-inference discrepancy hinders their performance: models are trained on corrupted ground-truth tokens, but at inference time they must denoise inputs corrupted from their own predictions. To bridge this gap, we propose a unified framework. First, Deeper Self-Prediction (DSP) is a multi-step training objective that teaches robust self-correction by forcing the model to denoise its own intermediate outputs. Second, UCB-guided Decoding is a principled inference algorithm that frames token re-masking as a multi-armed bandit problem, using the Upper Confidence Bound (UCB) to balance exploration and exploitation. Experiments on text generation tasks demonstrate consistent improvements over existing diffusion baselines. The framework achieves higher faithfulness and coherence according to both automatic metrics and LLM-as-a-Judge evaluations.
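As a rough illustration of how a UCB rule can decide which token positions to re-mask (the per-position reward statistics and exploration constant below are placeholders; the paper defines its own reward inside the diffusion process):

```python
import numpy as np

def ucb_scores(mean_reward, pull_counts, t, c=1.0):
    """Classic UCB1: exploit positions with high estimated reward,
    explore positions that have been re-masked only a few times."""
    return mean_reward + c * np.sqrt(np.log(t) / np.maximum(pull_counts, 1))

mean_reward = np.array([0.2, 0.5, 0.1, 0.4])   # e.g., quality gain observed when re-masking a position
pull_counts = np.array([5, 1, 3, 2])           # how often each position was re-masked so far
position_to_remask = int(np.argmax(ucb_scores(mean_reward, pull_counts, t=11)))
print(position_to_remask)
```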
ConLID: Supervised Contrastive Learning for Low-Resource Language Identification
Negar Foroutan | Jakhongir Saydaliev | Grace Kim | Antoine Bosselut
Language identification (LID) is a critical step in curating multilingual LLM pretraining corpora from web crawls. While many studies on LID model training focus on collecting diverse training data to improve performance, low-resource languages – often limited to single-domain data, such as the Bible – continue to perform poorly. To resolve these class imbalance and bias issues, we propose a novel supervised contrastive learning (SCL) approach to learn domain-invariant representations for low-resource languages. We show that our approach improves LID performance on out-of-domain data for low-resource languages by 3.2 percentage points, while maintaining its performance for the high-resource languages.
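A minimal sketch of a supervised contrastive objective of the kind described (following the standard SupCon formulation, with random placeholder embeddings standing in for LID representations):

```python
import numpy as np

def supcon_loss(z, labels, tau=0.1):
    """Supervised contrastive loss: pull same-language embeddings together,
    push different-language embeddings apart. z: (n, d) L2-normalized embeddings."""
    n = len(z)
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                                    # exclude self-similarity
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    same = (labels[:, None] == labels[None, :]) & ~np.eye(n, dtype=bool)
    per_anchor = -np.where(same, log_prob, 0.0).sum(axis=1) / np.maximum(same.sum(axis=1), 1)
    return per_anchor.mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
z /= np.linalg.norm(z, axis=1, keepdims=True)
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])        # language ids
print(supcon_loss(z, labels))
```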
Emotionally Charged, Logically Blurred: AI-driven Emotional Framing Impairs Human Fallacy Detection
Yanran Chen | Lynn Greschner | Roman Klinger | Michael Klenk | Steffen Eger
Logical fallacies are common in public communication and can mislead audiences; fallacious arguments may still appear convincing despite lacking soundness, because convincingness is inherently subjective. We present the first computational study of how emotional framing interacts with fallacies and convincingness, using large language models (LLMs) to systematically change emotional appeals in fallacious arguments. We benchmark eight LLMs on injecting emotional appeal into fallacious arguments while preserving their logical structures, then use the best models to generate stimuli for a human study. Our results show that LLM-driven emotional framing reduces human fallacy detection in F1 by 14.5% on average. Humans perform better in fallacy detection when perceiving enjoyment than when perceiving fear or sadness, and these three emotions also correlate with significantly higher convincingness compared to neutral or other emotion states. Our work has implications for AI-driven emotional manipulation in the context of fallacious argumentation.
Aligning Text, Code, and Vision: A Multi-Objective Reinforcement Learning Framework for Text-to-Visualization
Mizanur Rahman | Mohammed Saidul Islam | Md Tahmid Rahman Laskar | Shafiq Joty | Enamul Hoque
Text-to-Visualization (Text2Vis) systems translate natural language queries over tabular data into concise answers and executable visualizations. While closed-source LLMs generate functional code, the resulting charts often lack semantic alignment and clarity—qualities that can only be assessed post-execution. Open-source models struggle even more, frequently producing non-executable or visually poor outputs. Although supervised fine-tuning can improve code executability, it fails to enhance overall visualization quality, as traditional SFT loss cannot capture post-execution feedback. To address this gap, we propose RL-Text2Vis, the first reinforcement learning framework for Text2Vis generation. Built on Group Relative Policy Optimization (GRPO), our method uses a novel multi-objective reward that jointly optimizes textual accuracy, code validity, and visualization quality using post-execution feedback. By training Qwen2.5 models (7B and 14B), RL-Text2Vis achieves a 22% relative improvement in chart quality over GPT-4o on the Text2Vis benchmark and boosts code execution success from 78% to 97% relative to its zero-shot baseline. Our models significantly outperform strong zero-shot and supervised baselines and also demonstrate robust generalization to out-of-domain datasets like VIS-Eval and NVBench. These results establish GRPO as an effective strategy for structured, multimodal reasoning in visualization generation. We release our code at https://github.com/vis-nlp/RL-Text2Vis.
Offline Preference Optimization via Maximum Marginal Likelihood Estimation
Saeed Najafi | Alona Fyshe
Aligning Large Language Models (LLMs) with human preferences is crucial, but standard methods like Reinforcement Learning from Human Feedback (RLHF) are often complex and unstable. In this work, we propose a new, simpler approach that recasts alignment through the lens of Maximum Marginal Likelihood (MML) estimation. Our new MML-based Preference Optimization (MMPO) maximizes the marginal log-likelihood of a preferred text output, using the preference pair as samples for approximation, and forgoes the need for both an explicit reward model and entropy maximization. We theoretically demonstrate that MMPO implicitly performs preference optimization, producing a weighted gradient that naturally up-weights chosen responses over rejected ones. Across models ranging from 135M to 8B parameters, we empirically show that MMPO: 1) is more stable with respect to the hyperparameter compared to alternative baselines, and 2) achieves competitive or superior preference alignment while better preserving the base model’s general language capabilities. Through a series of ablation experiments, we show that this improved performance is indeed attributable to MMPO’s implicit preference optimization within the gradient updates.
The Relevance of Value Systems for Offensive Language Detection
Michael Wiegand | Elisabeth Eder | Josef Ruppenhofer
We examine to what extent a person’s value system affects their perception of offensiveness. For instance, a scholar is likely to be offended by being accused of reporting unverified claims, whereas many non-scholars would not feel that way. Thus, we move away from the assumption that offensiveness can be defined through a universal perspective. Ultimately, such research aims to support personalized approaches to content moderation. Our main contribution is the introduction of a dataset consisting of neutrally-phrased sentences on controversial topics, evaluated by individuals from 4 different value systems. This allows us to identify offensiveness patterns across value systems and conduct classification experiments.
Instruction Tuning with and without Context: Behavioral Shifts and Downstream Impact
Hyunji Lee | Seunghyun Yoon | Yunjae Won | Hanseok Oh | Geewook Kim | Trung Bui | Franck Dernoncourt | Elias Stengel-Eskin | Mohit Bansal | Minjoon Seo
Instruction tuning is a widely used approach to improve the instruction-following ability of large language models (LLMs). Instruction-tuning datasets typically include a mixture of context-augmented and context-free examples, yet prior work has largely combined these data types without examining their distinct effects. In this paper, we investigate how training LLMs with or without context affects model behavior and downstream performance. First, in the text domain, we show that LLMs trained with context attend more strongly to the provided knowledge, achieving better grounding. We also observe that context-augmented training shifts how LLMs use knowledge: models store and leverage less parametric knowledge and instead depend more on the provided context. Second, we observe that using an LLM trained with context-augmented data as the backbone for vision-language models reduces hallucination and improves grounding in the visual domain. Finally, we explore practical strategies for real-world deployments where context availability varies. We show that maintaining separate context-augmented and context-free models and routing inputs between them yields more robust overall performance than training a single mixed model, as it better preserves their complementary strengths.
RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models
Aashiq Muhamed | Leonardo F. R. Ribeiro | Markus Dreyer | Virginia Smith | Mona T. Diab
The ability of language models in RAG systems to selectively refuse to answer based on flawed context is critical for safety, yet remains a significant failure point. Our large-scale study reveals that even frontier models struggle in this setting, with refusal accuracy dropping below 50% on multi-document tasks, while exhibiting dangerous over-confidence or over-caution. Static benchmarks fail to reliably evaluate this capability, as models exploit dataset-specific artifacts and memorize test instances. We introduce RefusalBench, a generative methodology that programmatically creates diagnostic test cases through controlled linguistic perturbation. Our framework employs 176 distinct perturbation strategies across six categories of informational uncertainty and three intensity levels. Evaluation of over 30 models uncovers systematic failure patterns: refusal comprises separable detection and categorization skills, and neither scale nor extended reasoning improves performance. We find that selective refusal is a trainable, alignment-sensitive capability, offering a clear path for improvement. We release two benchmarks—RefusalBench-NQ (single-document) and RefusalBench-GaRAGe (multi-document), and our complete generation framework to enable continued, dynamic evaluation of this critical capability.
Query Decomposition for RAG: Balancing Exploration-Exploitation
Roxana Petcu | Kenton Murray | Daniel Khashabi | Evangelos Kanoulas | Maarten de Rijke | Dawn Lawrie | Kevin Duh
Retrieval-augmented generation (RAG) systems address complex user requests by decomposing them into subqueries, retrieving potentially relevant documents for each, and then aggregating them to generate an answer. Efficiently selecting informative documents requires balancing a key trade-off: (i) retrieving broadly enough to capture all the relevant material, and (ii) limiting retrieval to avoid excessive noise and computational cost. We formulate query decomposition and document retrieval in an exploitation-exploration setting, where retrieving one document at a time builds a belief about the utility of a given sub-query and informs the decision to continue exploiting or exploring an alternative. We experiment with a variety of bandit learning methods and demonstrate their effectiveness in dynamically selecting the most informative sub-queries. Our main finding is that estimating document relevance using rank information and human judgments yields a 35% gain in document-level precision, 15% increase in α-nDCG, and better performance on the downstream task of long-form generation. Code is available on GitHub.
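One simple bandit instantiation of the explore/exploit decision is Beta-Bernoulli Thompson sampling over sub-queries, sketched below with a made-up relevance signal; the paper evaluates a variety of bandit methods and derives rewards from rank information and human judgments.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subqueries, budget = 3, 30
alpha, beta = np.ones(n_subqueries), np.ones(n_subqueries)       # Beta(1,1) priors per sub-query
true_relevance = np.array([0.7, 0.3, 0.5])                        # hidden per-sub-query utility

for _ in range(budget):
    arm = int(np.argmax(rng.beta(alpha, beta)))                   # sample a belief, pick the best sub-query
    reward = rng.random() < true_relevance[arm]                   # was the newly retrieved doc relevant?
    alpha[arm] += reward
    beta[arm] += 1 - reward

print("posterior means:", alpha / (alpha + beta))
```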
Do Images Speak Louder than Words? Investigating the Effect of Textual Misinformation in VLMs
Chi Zhang | Wenxuan Ding | Jiale Liu | Mingrui Wu | Qingyun Wu | Ray Mooney
Vision-Language Models (VLMs) have shown strong multimodal reasoning capability on Visual-Question-Answering (VQA) benchmarks. However, their robustness against textual misinformation remains under-explored. While existing research has extensively studied the effect of misinformation in text-only domains, it is not clear how VLMs arbitrate between contradictory information from different modalities. To bridge the gap, we first propose the ConText-VQA (i.e. Conflicting Text) dataset, consisting of image-question pairs together with systematically generated persuasive prompts that deliberately conflict with visual evidence. Then, a thorough testing framework is designed and executed to benchmark the susceptibility of various models to these conflicting textual inputs. Comprehensive experiments over 11 state-of-the-art VLMs reveal that these models are indeed vulnerable to misleading prompts, often overriding clear visual evidence in favor of the conflicting text, and show an average performance drop of over 48.2% after only one round of persuasive conversation. Our findings highlight a critical limitation in current VLMs and underscore the need for improved robustness against textual manipulation.
Sycophancy Hides Linearly in the Attention Heads
Rifo Ahmad Genadi | Munachiso Samuel Nwadike | Nurdaulet Mukhituly | Tatsuya Hiraoka | Hilal AlQuabeh | Kentaro Inui
We find that correct-to-incorrect sycophancy signals are most linearly accessible within multi-head attention activations. Motivated by the linear representation hypothesis, we train linear probes across the residual stream, multilayer perceptron (MLP), and attention layers to analyze where these signals emerge. Although separability appears in the residual stream and MLPs, steering using these probes is most effective in a sparse subset of middle-layer attention heads. Using TruthfulQA as the base dataset, we find that probes trained on it transfer effectively to other factual QA benchmarks. Furthermore, comparing our discovered direction to previously identified “truthful” directions reveals limited overlap, suggesting that factual accuracy, and deference resistance, arise from related but distinct mechanisms. Attention-pattern analysis further indicates that the influential heads attend disproportionately to expressions of user doubt, contributing to sycophantic shifts. Overall, these findings suggest that sycophancy can be mitigated through simple, targeted linear interventions that exploit the internal geometry of attention activations. Code will be released upon publication.
AICD Bench: A Challenging Benchmark for AI-Generated Code Detection
Daniil Orel | Dilshod Azizov | Indraneil Paul | Yuxia Wang | Iryna Gurevych | Preslav Nakov
Large language models (LLMs) are increasingly capable of generating functional source code, raising concerns about authorship, accountability, and security. While detecting AI-generated code is critical, existing datasets and benchmarks are narrow, typically limited to binary human–machine classification under in-distribution settings. To bridge this gap, we introduce AICD Bench, the most comprehensive benchmark for AI-generated code detection. It spans 2M examples, 77 models across 11 families, and 9 programming languages, including recent reasoning models. Beyond scale, AICD Bench introduces three realistic detection tasks: (i) Robust Binary Classification under distribution shifts in language and domain, (ii) Model Family Attribution, grouping generators by architectural lineage, and (iii) Fine-Grained Human–Machine Classification across human, machine, hybrid, and adversarial code. Extensive evaluation on neural and classical detectors shows that performance remains far below practical usability, particularly under distribution shift and for hybrid or adversarial code. We release AICD Bench as a unified, challenging evaluation suite to drive the next generation of robust approaches for AI-generated code detection. The data and the code are available at https://huggingface.co/AICD-bench.
Safeguarding Language Models via Self-Destruct Trapdoor
Shahar Katz | Bar Alon | Ariel Shaulov | Lior Wolf | Mahmood Sharif
The potential misuse and misalignment of language models (LMs) are a central safety concern. This work presents Self-Destruct, a novel mechanism to restrict specific behaviors in LMs by leveraging overlooked properties of the underlying hardware. We observe that LM frameworks use limited-precision formats (e.g., BF16), which are vulnerable to overflow errors during matrix multiplications. Exploiting this property, Self-Destruct replaces selected weights in pre-trained LM layers with values that act as traps, triggering a system error only when the model engages in targeted behaviors, such as harmful text generation, while leaving normal functionality unaffected. Unlike post-hoc filters, this safeguard is embedded directly within the model, introduces neither inference overhead nor auxiliary models, and requires only a set of examples for calibration. Extensive experiments with five LM families demonstrate that Self-Destruct provides competitive protection against jailbreak attacks while preserving accuracy on standard benchmarks. In addition, we show that Self-Destruct is versatile, helping to mitigate biased text generation and enabling model fingerprinting, highlighting the potential of hardware-aware safeguards as an efficient, low-overhead complement to existing LM defenses.
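The hardware property being exploited is easy to demonstrate in isolation: BF16 shares float32's exponent range, so values past roughly 3.4e38 overflow to infinity during matrix multiplication. The snippet below illustrates only this overflow behaviour with arbitrary constants, not the trapdoor construction itself.

```python
import torch

# bfloat16 shares float32's exponent range; values past ~3.4e38 overflow to inf
x = torch.full((2, 2), 2.0e19, dtype=torch.bfloat16)
w = torch.full((2, 2), 2.0e19, dtype=torch.bfloat16)
print(x @ w)                        # partial products of ~4e38 exceed the representable range -> inf
print(torch.isinf(x @ w).any())     # True: a downstream check could halt generation here
```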
Rethinking Hallucinations: Correctness, Consistency, and Prompt Multiplicity
Prakhar Ganesh | Reza Shokri | Golnoosh Farnadi
Large language models (LLMs) are known to "hallucinate" by generating false or misleading outputs. Hallucinations pose various harms, from erosion of trust to widespread misinformation. Existing hallucination evaluation, however, focuses only on correctness and often overlooks consistency, necessary to distinguish and address these harms. To bridge this gap, we introduce prompt multiplicity, a framework for quantifying consistency in LLM evaluations. Our analysis reveals significant multiplicity (over 50% inconsistency in benchmarks like Med-HALT), suggesting that hallucination-related harms have been severely misunderstood. Furthermore, we study the role of consistency in hallucination detection and mitigation. We find that: (a) detection techniques detect consistency, not correctness, and (b) mitigation techniques like RAG, while beneficial, can introduce additional inconsistencies. By integrating prompt multiplicity into hallucination evaluation, we provide an improved framework of potential harms and uncover critical limitations in current detection and mitigation strategies.
Hype or not? Formalizing Automatic Promotional Language Detection in Biomedical Research
Bojan Batalo | Erica K. Shimomoto | Dipesh Satav | Neil Millar
In science, promotional language (’hype’) is increasing and can undermine objective evaluation of evidence, impede research development, and erode trust in science. In this paper, we introduce the task of automatic detection of hype, which we define as hyperbolic or subjective language that authors use to glamorize, promote, embellish, or exaggerate aspects of their research. We propose formalized guidelines for identifying hype language and apply them to annotate a portion of the National Institutes of Health (NIH) grant application corpus. We then evaluate traditional text classifiers and language models on this task, comparing their performance with a human baseline. Our experiments show that formalizing annotation guidelines can help humans reliably annotate candidate hype adjectives and that using our annotated dataset to train machine learning models yields promising results. Our findings highlight the linguistic complexity of the task and the potential need for domain knowledge. While some linguistic works address hype detection, to the best of our knowledge, we are the first to approach it as a natural language processing task. Our annotation guidelines and dataset are available at https://github.com/hype-busters/eacl2026-hype-dataset.
H3Fusion: Helpful, Harmless, Honest Fusion of Aligned LLMs
Selim Furkan Tekin | Fatih Ilhan | Sihao Hu | Tiansheng Huang | Yichang Xu | Zachary Yahn | Ling Liu
The alignment of pre-trained LLMs continues to draw significant attention from both industry and academia, aiming to ensure responses that are helpful, harmless, and honest. However, identifying a point in the model’s representation subspace that simultaneously satisfies all these properties remains challenging. H3Fusion addresses this challenge by introducing a mixture-of-experts (MoE)-based fusion mechanism that models alignment as a controllable drift within the subspace, guided by a drift-regularization loss to balance competing alignment dimensions. Furthermore, we formulate alignment as a dual objective that constrains the distance between generated embeddings and alignment embeddings, and introduce a gating loss that channels activations to the contributing experts. Extensive evaluations on three benchmark datasets show that H3Fusion is more helpful, less harmful, and more honest in three aspects: it outperforms each individually aligned model by 11.37%, and provides stronger robustness compared to the state-of-the-art LLM ensemble approaches by 13.77% and model-merging approaches by 6.18%. Code is available at https://github.com/git-disl/h3fusion.
Revisiting Generalization Across Difficulty Levels: It’s Not So Easy
Yeganeh Kordi | Nihal V. Nayak | Max Zuo | Ilana Nguyen | Stephen Bach
We investigate how well large language models (LLMs) generalize across different task difficulties, a key question for effective data curation and evaluation. Existing research is mixed regarding whether training on easier or harder data leads to better results, and whether those gains come on easier or harder test data. We address this question by conducting a systematic evaluation of LLMs’ generalization across models, datasets, and fine-grained groups of example difficulty. We rank examples in six datasets using the outputs of thousands of different LLMs and Item Response Theory (IRT), a well-established difficulty metric in educational testing. Unlike prior work, our difficulty ratings are therefore determined solely by the abilities of many different LLMs, excluding human opinions of difficulty. With a more objective, larger-scale, and finer-grained analysis, we show that cross-difficulty generalization is often limited; training on either easy or hard data cannot achieve consistent improvements across the full range of difficulties. These results show the importance of having a range of difficulties in both training and evaluation data for LLMs, and that taking shortcuts with respect to difficulty is risky.
BLUR: A Bi-Level Optimization Approach for LLM Unlearning
Hadi Reisizadeh | Jinghan Jia | Zhiqi Bu | Bhanukiran Vinzamuri | Anil Ramakrishna | Kai-Wei Chang | Volkan Cevher | Sijia Liu | Mingyi Hong
Enabling large language models (LLMs) to unlearn knowledge and capabilities acquired during training has proven vital for ensuring compliance with data regulations and promoting ethical practices in generative AI. Although there is growing interest in developing various unlearning algorithms, it remains unclear how to best formulate the unlearning problem. The most popular formulation uses a weighted sum of forget and retain loss, but it often leads to performance degradation due to the inherent trade-off between forget and retain losses. In this work, we argue that it is important to model the hierarchical structure of the unlearning problem, where the forget problem (which unlearns certain knowledge and/or capabilities) takes priority over the retain problem (which preserves model utility). This hierarchical structure naturally leads to a bi-level optimization formulation where the lower-level objective focuses on minimizing the forget loss, while the upper-level objective aims to maintain the model’s utility. Based on this new formulation, we propose a novel algorithm, termed Bi-Level UnleaRning (BLUR), which not only possesses strong theoretical guarantees but, more importantly, delivers superior performance. In particular, our extensive experiments demonstrate that BLUR consistently outperforms all the state-of-the-art algorithms across various unlearning tasks, models, and metrics.
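To make the hierarchy concrete, here is a toy sketch of alternating bi-level updates on synthetic quadratic losses: the lower level repeatedly descends the forget loss, and the upper level takes a small corrective step on the retain loss. The losses, step sizes, and update schedule are stand-ins, not BLUR's actual algorithm or guarantees.

```python
# Toy illustration of a bi-level unlearning loop on quadratic surrogates:
# the lower level drives the forget loss down, the upper level then nudges
# parameters to preserve utility (retain loss). Both losses are synthetic
# stand-ins, not the losses used in the paper.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=5)
theta_forget_opt = np.zeros(5)         # parameters that "forget" the target data
theta_retain_opt = rng.normal(size=5)  # parameters that preserve utility

forget_loss = lambda t: 0.5 * np.sum((t - theta_forget_opt) ** 2)
retain_loss = lambda t: 0.5 * np.sum((t - theta_retain_opt) ** 2)

for outer_step in range(50):
    # Lower level: several gradient steps on the forget loss (takes priority).
    for _ in range(5):
        theta -= 0.1 * (theta - theta_forget_opt)
    # Upper level: a small step on the retain loss to maintain utility.
    theta -= 0.01 * (theta - theta_retain_opt)

print(f"forget loss: {forget_loss(theta):.4f}, retain loss: {retain_loss(theta):.4f}")
```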
DeepInsert: Early Layer Bypass for Efficient and Performant Multimodal Understanding
Moulik Choraria | Xinbo Wu | Akhil Bhimaraju | Nitesh Sekhar | Yue Wu | Xu Zhang | Prateek Singhal | Lav R. Varshney
Hyperscaling of data and parameter count in LLMs is yielding diminishing improvement when weighed against training costs, underlining a growing need for more efficient finetuning and inference without sacrificing performance. This is especially so for multimodal language models (MLMs), where the overhead of processing multimodal tokens can limit their practical viability. In parallel, recent work has uncovered implicit cross-modal alignment in the deeper layers of large MLMs, deepening our understanding of how MLMs process and encode information. Motivated by this, and by our observation that MLMs naturally defer most cross-modal token interactions to deeper layers of the model, we propose a simple modification. Instead of concatenating multimodal tokens with the language prompt at the start, we insert them directly into the middle of the model, allowing them to entirely bypass the early layers. Our results with diverse modalities ((i) LLaVA & BLIP for vision, (ii) LTU for audio, and (iii) MoLCA for molecular data) and model sizes ranging from 350M to 13B parameters indicate that our method reduces both training and inference costs, while at least preserving, if not surpassing, the performance of existing baselines.
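A minimal PyTorch sketch of the insertion idea, assuming a plain stack of encoder blocks: multimodal tokens join the residual stream only at a chosen middle layer instead of being concatenated at the input. The block type, dimensions, and insertion index are illustrative, not the paper's architecture.

```python
# Toy sketch: text tokens pass through the early blocks alone; multimodal
# tokens are inserted only at a middle layer, bypassing the early layers.
# Dimensions and the block definition are illustrative, not the paper's model.
import torch
import torch.nn as nn

d_model, n_layers, insert_at = 64, 8, 4
blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
    for _ in range(n_layers)
)

text_tokens = torch.randn(1, 16, d_model)   # embedded language prompt
mm_tokens = torch.randn(1, 32, d_model)     # projected image/audio tokens

h = text_tokens
for i, block in enumerate(blocks):
    if i == insert_at:                      # bypass: join the stream mid-model
        h = torch.cat([mm_tokens, h], dim=1)
    h = block(h)

print(h.shape)  # (1, 48, 64): multimodal tokens skipped the first 4 layers
```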
Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory
Mirac Suzgun | Mert Yuksekgonul | Federico Bianchi | Dan Jurafsky | James Zou
Despite their impressive performance on complex tasks, current language models (LMs) typically operate in a vacuum: Each input query is processed separately, without retaining insights from previous attempts. Here, we present Dynamic Cheatsheet (DC), a lightweight framework that endows a black-box LM with a persistent, evolving memory. Rather than repeatedly re-discovering or re-committing the same solutions and mistakes, DC enables models to store and reuse accumulated strategies, code snippets, and general problem-solving insights at inference time. This test-time learning enhances performance substantially across a range of tasks without needing explicit ground-truth labels or human feedback. Leveraging DC, Claude 3.5 Sonnet’s accuracy more than doubled on AIME math exams once it began retaining algebraic insights across questions. Similarly, GPT-4o’s success rate on the Game of 24 puzzle increased from about 10% to 99% after the model discovered and reused a Python-based solution. In tasks prone to arithmetic mistakes, such as balancing equations, DC enabled GPT-4o and Claude to reach near-perfect accuracy by recalling previously validated code, whereas their baselines stagnated around 50%. Beyond arithmetic challenges, DC yields notable accuracy gains on knowledge-demanding tasks. Claude achieved a 9% improvement in GPQA-Diamond and an 8% boost on MMLU-Pro Engineering and Physics problems. Crucially, DC’s memory is self-curated, focusing on concise, transferable snippets rather than entire transcripts, thereby facilitating meta-learning and avoiding context ballooning. Unlike fine-tuning or static retrieval methods, DC adapts LMs’ problem-solving skills on the fly, without modifying their underlying parameters, and offers a practical approach for continuously refining responses and cutting routine errors. Overall, our findings present DC as a promising approach for augmenting LMs with persistent memory, bridging the divide between isolated inference events and the cumulative, experience-driven learning characteristic of human cognition.
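A minimal sketch of a test-time memory loop in the spirit of DC: answer with the current cheatsheet in context, then ask the model to curate a short, reusable note. `call_lm` and the prompt wording are hypothetical stand-ins for a black-box LM API, not the authors' prompts.

```python
# Minimal sketch of a test-time "cheatsheet" loop: answer with the current
# memory in context, then ask the model to curate a short, reusable note.
# `call_lm` is a hypothetical stub for any black-box LM API.
def call_lm(prompt: str) -> str:
    return "stubbed model output"  # placeholder for a real API call

def solve_with_cheatsheet(queries, memory=""):
    for q in queries:
        answer = call_lm(f"Cheatsheet:\n{memory}\n\nQuestion: {q}\nAnswer:")
        # Self-curation: keep only concise, transferable snippets, not transcripts.
        memory = call_lm(
            "Update this cheatsheet with any reusable strategy or code snippet "
            f"from the exchange below, keeping it short.\n\nCheatsheet:\n{memory}\n"
            f"Question: {q}\nAnswer: {answer}\nUpdated cheatsheet:"
        )
        yield q, answer, memory

for q, a, m in solve_with_cheatsheet(["Solve 24 with 4, 7, 8, 8", "Balance: H2 + O2 -> H2O"]):
    print(q, "->", a)
```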
Evidential Semantic Entropy for LLM Uncertainty Quantification
Lucie Kunitomo-Jacquin | Edison Marrese-Taylor | Ken Fukuda | Masahiro Hamasaki
Quantifying uncertainty in large language models (LLMs) is crucial for applications where safety is a concern, as it helps identify factually incorrect LLM answers, commonly referred to as hallucinations. Recently, advancements have been made in quantifying uncertainty, specifically by incorporating the semantics of sampled answers to estimate entropy. These methods typically rely on a normalized probability that is calculated using a limited number of sampled answers. However, we note that these estimation methods fail to account for the semantics of answers that could be generated but are not observed in the sample. This is a significant oversight, since a heavier tail of unobserved answer probabilities indicates a higher level of overall uncertainty. To alleviate this issue, we propose Evidential Semantic Entropy (EVSE), which leverages evidence theory to represent both total ignorance arising from unobserved answers and partial ignorance stemming from the semantic relationships among the observed answers. Experiments show that EVSE significantly improves uncertainty quantification performance. Our code is available at: https://github.com/lucieK-J/EvidentialSemanticEntropy.git.
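The sketch below is not the paper's evidential formulation; it only illustrates why unobserved answers matter, by clustering sampled answers and reserving a Good-Turing-style probability mass for never-seen answers before computing entropy.

```python
# Simplified illustration (not the paper's EVSE formulation): treat each
# sampled answer as belonging to a semantic cluster, reserve probability mass
# for unobserved answers via a Good-Turing-style estimate, and compute entropy
# including that reserved mass.
import math
from collections import Counter

def semantic_entropy_with_unobserved(cluster_ids):
    n = len(cluster_ids)
    counts = Counter(cluster_ids)
    singletons = sum(1 for c in counts.values() if c == 1)
    p_unseen = singletons / n                        # mass reserved for unseen answers
    probs = [(c / n) * (1 - p_unseen) for c in counts.values()]
    if p_unseen > 0:
        probs.append(p_unseen)
    return -sum(p * math.log(p) for p in probs if p > 0)

# Ten samples collapsing into two clusters vs. a heavier tail of rare answers:
print(semantic_entropy_with_unobserved(["a"] * 7 + ["b"] * 3))
print(semantic_entropy_with_unobserved(list("aaaabbbcde")))  # more unseen mass, higher entropy
```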
SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases
Laya Iyer | Angelina Wang | Sanmi Koyejo
Advances in large language models (LLMs) have enabled significant capabilities in audio processing, resulting in state-of-the-art models now known as Large Audio Language Models (LALMs). However, minimal work has been done to measure audio understanding beyond automatic speech recognition (ASR). This paper closes that gap by proposing a benchmark suite, SCENEBench (Spatial, Cross-lingual, Environmental, Non-speech Evaluation), that targets a broad form of audio comprehension across four real-world categories: background sound understanding, noise localization, cross-linguistic speech understanding, and vocal characterizer recognition. In addition to performance, we also measure model latency. The purpose of this benchmark suite is to assess audio beyond just what words are said, focusing instead on how they are said and on the non-speech components of the audio. To strengthen ecological validity, we include a small human-recorded evaluation split per category. Based on the needs articulated by audio understanding use cases in accessibility technology and industrial noise monitoring, this benchmark reveals critical gaps in current LALMs. Performance varies widely across tasks, with some tasks falling far below random chance and others reaching high accuracy. We also provide a structured error taxonomy to characterize standard failure modes across tasks. These results provide direction for targeted improvements in model capabilities.
Incentivizing Strong Reasoning from Weak Supervision
Yige Yuan | Teng Xiao | Shuchang Tao | Xue Wang | Jinyang Gao | Bolin Ding | Bingbing Xu
Large language models (LLMs) have demonstrated impressive performance on reasoning-intensive tasks, but enhancing their reasoning abilities typically relies on either reinforcement learning (RL) with verifiable signals or supervised fine-tuning (SFT) with high-quality long chain-of-thought (CoT) demonstrations, both of which are expensive. In this paper, we study a novel problem of incentivizing the reasoning capacity of LLMs without expensive high-quality demonstrations and reinforcement learning. We investigate whether the reasoning capabilities of LLMs can be effectively incentivized via supervision from significantly weaker models. We further analyze when and why such weak supervision succeeds in eliciting reasoning abilities in stronger models. Our findings show that supervision from significantly weaker reasoners can substantially improve student reasoning performance, recovering close to 94% of the gains of expensive RL at a fraction of the cost. Experiments across diverse benchmarks and model architectures demonstrate that weak reasoners can effectively incentivize reasoning in stronger student models, consistently improving performance across a wide range of reasoning tasks. Our results suggest that this simple weak-to-strong paradigm is a promising and generalizable alternative to costly methods for incentivizing strong reasoning capabilities at inference-time in LLMs. Code is at https://github.com/W2SR-ARR/Code.
DivMerge: A divergence-based model merging method for multi-tasking
Brahim Touayouch | Loïc Fosse | Géraldine Damnati | Gwénolé Lecorvé
Merging fine-tuned models is a promising alternative to costly multi-task training, but task interference remains a challenge, especially as the number of tasks grows. We present DivMerge, a reference-free method that merges models trained on different tasks by minimizing the Jensen-Shannon divergence between their outputs and those of the merged model, automatically balancing task importance. In addition to exhibiting strong theoretical properties, DivMerge consistently outperforms prior work in experiments on classification and generative tasks with autoregressive models, and remains robust when scaling to more tasks.
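As a toy illustration of reference-free, divergence-guided merging (synthetic linear models and a grid search rather than the paper's optimization), the sketch below picks the interpolation weight whose merged outputs minimize the summed Jensen-Shannon divergence to each task model's outputs.

```python
# Toy, reference-free merging sketch in the spirit of divergence-based merging:
# choose the interpolation weight whose merged outputs minimize the average
# Jensen-Shannon divergence to each task model's outputs. The linear "models"
# and data below are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
W_a, W_b = rng.normal(size=(8, 3)), rng.normal(size=(8, 3))  # two task models

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def js_divergence(p, q, eps=1e-12):
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * (np.log(a + eps) - np.log(b + eps)), axis=1)
    return float(np.mean(0.5 * kl(p, m) + 0.5 * kl(q, m)))

p_a, p_b = softmax(X @ W_a), softmax(X @ W_b)
best = min(
    ((lam, js_divergence(softmax(X @ (lam * W_a + (1 - lam) * W_b)), p_a)
           + js_divergence(softmax(X @ (lam * W_a + (1 - lam) * W_b)), p_b))
     for lam in np.linspace(0, 1, 21)),
    key=lambda t: t[1],
)
print(f"best merging weight: {best[0]:.2f} (total JSD {best[1]:.4f})")
```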
A Reinforcement Learning Framework for Robust and Secure LLM Watermarking
Li An | Yujian Liu | Yepeng Liu | Yuheng Bu | Yang Zhang | Shiyu Chang
Watermarking has emerged as a promising solution for tracing and authenticating text generated by large language models (LLMs). A common approach to LLM watermarking is to construct a green/red token list and assign higher or lower generation probabilities to the corresponding tokens, respectively. However, most existing watermarking algorithms rely on heuristic green/red token list designs, as directly optimizing the list design with techniques such as reinforcement learning (RL) comes with several challenges. First, desirable watermarking involves multiple criteria, i.e., detectability, text quality, robustness against removal attacks, and security against spoofing attacks. Directly optimizing for these criteria introduces many partially conflicting reward terms, leading to an unstable convergence process. Second, the vast action space of green/red token list choices is susceptible to reward hacking. In this paper, we propose an end-to-end RL framework for robust and secure LLM watermarking. Our approach adopts an anchoring mechanism for reward terms to ensure stable training and introduces additional regularization terms to prevent reward hacking. Experiments on standard benchmarks with two backbone LLMs show that our method achieves a state-of-the-art trade-off across all criteria, with notable improvements in resistance to spoofing attacks without degrading other criteria.
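For readers unfamiliar with the green/red-list scheme the abstract builds on, the sketch below implements the standard heuristic baseline: hash the previous token to select a green half of the vocabulary, boost those logits, and detect via a z-score on the green-token fraction. The RL-optimized lists proposed in the paper are not reproduced here.

```python
# Toy sketch of the heuristic green/red-list watermark: key the "green" half of
# the vocabulary on the previous token, softly boost green logits at generation
# time, and detect by counting how many generated tokens fall in their green
# list. This is the baseline scheme, not the paper's RL-learned design.
import numpy as np

V, delta, rng = 1000, 2.0, np.random.default_rng(0)

def green_list(prev_token):
    g = np.random.default_rng(prev_token)        # keyed by the previous token
    return g.permutation(V)[: V // 2]

def sample_watermarked(n_tokens=200):
    tokens, prev = [], 7
    for _ in range(n_tokens):
        logits = rng.normal(size=V)
        logits[green_list(prev)] += delta        # softly prefer green tokens
        p = np.exp(logits - logits.max()); p /= p.sum()
        prev = int(rng.choice(V, p=p))
        tokens.append(prev)
    return tokens

def detect(tokens):
    greens = sum(t in set(green_list(p)) for p, t in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (greens - 0.5 * n) / np.sqrt(0.25 * n)  # z-score vs. 50% expected by chance

print("detection z-score:", round(detect(sample_watermarked()), 2))
```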
Agent-Testing Agent: A Meta-Agent for Automated Testing and Evaluation of Conversational AI Agents
Sameer Komoravolu | Khalil Mrini
LLM agents are increasingly deployed to plan, retrieve, and write with tools, yet evaluation still leans on static benchmarks and small human studies. We present the Agent-Testing Agent (ATA), a meta-agent that combines static code analysis, developer interrogation, literature mining, and persona-driven adversarial test generation whose difficulty adapts via judge feedback. Each dialogue is scored with an LLM-as-a-Judge (LAAJ) rubric and used to steer subsequent tests toward the agent’s weakest capabilities. On a travel planner and a Wikipedia writer, the ATA surfaces more diverse and severe failures than expert annotators while matching severity, and finishes in 20–30 minutes versus ten-annotator rounds that took days. Ablating code analysis and web search increases variance and miscalibration, underscoring the value of evidence-grounded test generation. The ATA outputs quantitative metrics and qualitative bug reports for developers. We release the full open-source implementation.
User-Centric Evidence Ranking for Attribution and Fact Verification
Guy Alt | Eran Hirsch | Serwar Basch | Ido Dagan | Oren Glickman
Attribution and fact verification are critical challenges in natural language processing for assessing information reliability. While automated systems and Large Language Models (LLMs) aim to retrieve and select concise evidence to support or refute claims, they often present users with either insufficient or overly redundant information, leading to inefficient and error-prone verification. To address this, we propose Evidence Ranking, a novel task that prioritizes presenting sufficient information as early as possible in a ranked list. This minimizes user reading effort while still making all available evidence accessible for sequential verification. We compare two approaches for the new ranking task: one-shot ranking and incremental ranking. We introduce a new evaluation framework, inspired by information retrieval metrics, and construct a unified benchmark by aggregating existing fact verification datasets. Extensive experiments with diverse models show that incremental ranking strategies better capture complementary evidence and that LLM-based methods outperform shallower baselines, while still facing challenges in balancing sufficiency and redundancy. Compared to evidence selection, we conduct a controlled user study and demonstrate that evidence ranking both reduces reading effort and improves verification. This work provides a foundational step toward more interpretable, efficient, and user-aligned information verification systems.
Beyond Understanding: Evaluating the Pragmatic Gap in LLMs’ Cultural Processing of Figurative Language
Mena Attia | Aashiq Muhamed | Mai Alkhamissi | Thamar Solorio | Mona T. Diab
We present a comprehensive evaluation of large language models’ (LLMs) ability to process culturally grounded language, specifically to understand and pragmatically use figurative expressions that encode local knowledge and social nuance. Using figurative language as a proxy for cultural nuance and local knowledge, we design evaluation tasks for contextual understanding, pragmatic use, and connotation interpretation across Arabic and English. We evaluate 22 open- and closed-source LLMs on Egyptian Arabic idioms, multidialectal Arabic proverbs, and English proverbs. Results show a consistent hierarchy: accuracy on Arabic proverbs is 4.29% lower than on English proverbs, and performance on Egyptian idioms is 10.28% lower than on Arabic proverbs. On the pragmatic use task, accuracy drops by 14.07% relative to understanding, though providing idioms’ contextual sentences improves accuracy by 10.66%. Models also struggle with connotative meaning, reaching at most 85.58% agreement with human annotators on idioms with full inter-annotator agreement. Figurative language thus serves as an effective diagnostic for cultural reasoning, revealing that while LLMs often interpret figurative meaning, they still face major challenges in using it appropriately. To support future research, we release Kinayat, the first dataset of Egyptian Arabic idioms designed for both figurative understanding and pragmatic use evaluation.
VietMix: A Naturally-Occurring Parallel Corpus and Augmentation Framework for Vietnamese-English Code-Mixed Machine Translation
Hieu Tran | Phuong-Anh Nguyen-Le | Huy Nghiem | Quang-Nhan Nguyen | Wei Ai | Marine Carpuat
Machine translation (MT) systems universally degrade when faced with code-mixed text. This problem is more acute for low-resource languages that lack dedicated parallel corpora. This work directly addresses this gap for Vietnamese-English, a language context characterized by challenges including orthographic ambiguity and the frequent omission of diacritics in informal text. We introduce VietMix, the first expert-translated, naturally occurring parallel corpus of Vietnamese-English code-mixed text. We establish VietMix’s utility by developing a data augmentation pipeline that leverages iterative fine-tuning and targeted filtering. Experiments show that models augmented with our data outperform strong back-translation baselines by up to +3.5 xCOMET points and improve zero-shot models by up to +11.9 points. Our work delivers a foundational resource for a challenging language pair and provides a validated, transferable framework for building and augmenting corpora in other low-resource settings.
Do You See Me : A Multidimensional Benchmark for Evaluating Visual Perception in Multimodal LLMs
Aditya Sanjiv Kanade | Tanuja Ganu
Multimodal Large Language Models (MLLMs) show reasoning promise, yet their visual perception is a critical bottleneck. Paradoxically, MLLMs sometimes produce correct answers while misinterpreting crucial visual elements, masking these underlying perception failures. Our preliminary analysis on a joint perception-reasoning dataset revealed that 29% of correct reasoning answers from a leading MLLM contained perception errors. To systematically study visual perception abilities of MLLMs, we introduce Do You See Me, a scalable, programmatically generated benchmark with 1758 images and 2612 questions across seven core subtasks spanning 2D and 3D variants (twelve total tasks), providing parametric control over difficulty levels. The benchmark tasks are inspired by human psychology. Our evaluation of eleven leading MLLMs reveals a stark deficit: humans achieve 95.83% accuracy, while top MLLMs average below 50%. This performance gap widens drastically as task complexity increases. Further diagnostics show: (1) supervised finetuning offers only modest gains (11%), (2) models tend to exploit task “shortcuts” like MCQ formats over detailed visual analysis, and (3) Chain-of-Thought prompting can degrade complex visual tasks by verbalizing images into lossy text. These findings expose the foundational perception limits in current MLLMs and highlight the need for robust visual perception improvements in MLLMs. The benchmark dataset, source code and evaluation scripts are available at https://github.com/microsoft/Do-You-See-Me.
An Empirical Study of Collective Behaviors and Social Dynamics in Large Language Model Agents
Farnoosh Hashemi | Michael Macy
Large Language Models (LLMs) increasingly mediate our social, cultural, and political interactions. While they can simulate some aspects of human behavior and decision-making, it is still underexplored whether repeated interactions with other agents amplify their biases or lead to exclusionary behaviors. To this end, we study Chirper.ai—an LLM-driven social media platform—analyzing 7M posts and interactions among 32K LLM agents over a year. We start with homophily and social influence among LLMs, finding that, much like human networks, their social networks exhibit these fundamental phenomena. Next, we study the toxic language of LLMs, its linguistic features, and the associated interaction patterns, finding that LLMs show different structural patterns in toxic posting than humans. After studying the ideological leaning of LLM posts and the polarization in their community, we focus on how to prevent potentially harmful activities. We present a simple yet effective method, called Chain of Social Thought (CoST), that reminds LLM agents to avoid harmful posting.
Detecting Subtle Biases: An Ethical Lens on Underexplored Areas in AI Language Models Biases
Shayan Bali | Farhan Farsi | Mohammad Hosseini | Adel Khorramrouz | Ehsaneddin Asgari
Large Language Models (LLMs) are increasingly embedded in the daily lives of individuals across diverse social classes. This widespread integration raises urgent concerns about the subtle, implicit biases these models may contain. In this work, we investigate such biases through the lens of ethical reasoning, analyzing model responses to scenarios in a new dataset we propose comprising 1,016 scenarios, systematically categorized into ethical, unethical, and neutral types. Our study focuses on dimensions that are socially influential but less explored, including (i) residency status, (ii) political ideology, (iii) fitness status, (iv) educational attainment, and (v) attitudes toward AI. To assess LLMs’ behavior, we propose a baseline and employ one statistical test and one metric: a permutation test that reveals the presence of bias by comparing the probability distributions of ethical/unethical scenarios with the probability distribution of neutral scenarios for each demographic group, and a tendency measurement that captures the magnitude of bias via the relative difference between the probability distributions of ethical and unethical scenarios. Our evaluations of 12 prominent LLMs reveal persistent and nuanced biases across these attributes, with the Llama models exhibiting the most pronounced biases. These findings highlight the need for refined ethical benchmarks and bias-mitigation tools in LLMs.
HarfoSokhan: A Comprehensive Parallel Dataset for Transitions between Persian Colloquial and Formal Variations
Hamid Jahad Sarvestani | Vida Ramezanian | Saee Saadat | Neda Taghizadeh Serajeh | Maryam Sadat Razavi Taheri | Shohreh Kasaei | MohammadAmin Fazli | Ehsaneddin Asgari
A wide array of NLP/NLU models have been developed for the Persian language and have shown promising results. However, the performance of such models drops significantly when applied to the colloquial form of Persian. This challenge arises from the substantial differences between colloquial and formal Persian and from the lack of parallel data for making models robust to colloquial input or for transforming it into formal Persian. In addressing this gap, our research is dedicated to the development of the HarfoSokhan dataset, a large-scale colloquial-to-formal Persian parallel dataset of 6M sentence pairs. Our proposed dataset is a critical resource for training models that can effectively bridge the linguistic variations between colloquial and formal Persian. To illustrate the utility of our dataset, we used it to train a GPT-2 model, which exhibited remarkable proficiency in colloquial-to-formal text style transfer, outperforming both OpenAI’s GPT-3.5-turbo model and a leading rule-based system in this task. This conclusion is supported by our proposed ranking-based human evaluation. The results underscore the significance of the HarfoSokhan dataset in enhancing the performance of natural language processing models on the challenging task of colloquial-to-formal Persian conversion.
Compressing Language Models for Specialized Domains
Miles Williams | George Chrysostomou | Vitor Amancio Jeronymo | Nikolaos Aletras
Language models (LMs) excel at tasks across diverse domains, yet require substantial computational resources during inference. Compression techniques such as pruning and quantization offer a practical path towards efficient LM deployment, exemplified by their ability to preserve performance on general-purpose benchmarks. However, general-purpose LM compression methods can negatively affect performance in specialized domains (e.g. biomedical or legal). Recent work has sought to address this issue, but requires a computationally expensive full-parameter fine-tuning pipeline. To this end, we propose MixCal, a novel calibration method designed to improve the in-domain performance of compressed LMs in a post-training setting. Through extensive experimentation, we demonstrate that MixCal substantially outperforms existing approaches on domain-specific tasks while preserving general performance. Notably, these performance gains are achieved while also reducing the computational cost of LM compression.
GRAVITY: A Framework for Personalized Text Generation via Profile-Grounded Synthetic Preferences
Priyanka Dey | Daniele Rosa | Wenqing Zheng | Daniel Barcklow | Jieyu Zhao | Emilio Ferrara
Personalization in LLMs often relies on costly human feedback or interaction logs, limiting scalability and neglecting deeper user attributes. We introduce GRAVITY (Generative Response with Aligned Values, Interests, and Traits of You), a framework for generating synthetic, profile-grounded preference data that captures users’ interests, values, beliefs, and personality traits. By integrating demographic, cultural, and psychological frameworks—including Hofstede’s cultural dimensions, Schwartz’s basic values, the World Values Survey, and Big Five OCEAN traits—GRAVITY synthesizes chosen/rejected preference pairs to guide personalized content generation. We evaluate GRAVITY on book descriptions for 400 Amazon users, comparing it to prompt-based conditioning, standard fine-tuning, and naive synthetic pair generation. Profile-grounded synthetic data consistently improves generation, especially across multiple cultures (USA, Brazil, Japan, India), achieving over 4% higher preference gains across baselines, with user studies showing that GRAVITY outputs are preferred over 86% of the time. Our results show that scenario-grounded synthetic data can capture richer user variation, reduce reliance on costly annotation, and produce more engaging, user-centered content, offering a scalable path for LLM personalization. Code and datasets will be released upon acceptance.
Multimodal Conversation Structure Understanding
Kent K. Chang | Mackenzie Hanh Cramer | Anna Ho | Ti Ti Nguyen | Yilin Yuan | David Bamman
While multimodal large language models (LLMs) excel at dialogue, whether they can adequately parse the structure of conversation—conversational roles and threading—remains underexplored. In this work, we introduce a suite of tasks and release TV-MMPC, a new annotated dataset, for multimodal conversation structure understanding. Our evaluation reveals that while all multimodal LLMs outperform our heuristic baseline, even the best-performing model we consider experiences a substantial drop in performance when character identities of the conversation are anonymized. Beyond evaluation, we carry out a sociolinguistic analysis of 350,842 utterances in TVQA. We find that while female characters initiate conversations at rates in proportion to their speaking time, they are 1.2 times more likely than men to be cast as an addressee or side-participant, and the presence of side-participants shifts the conversational register from personal to social.
A Review of Incorporating Psychological Theories in LLMs
Zizhou Liu | Ziwei Gong | Lin Ai | Zheng Hui | Run Chen | Colin Wayne Leach | Michelle R. Greene | Julia Hirschberg
Psychological insights have long shaped pivotal NLP breakthroughs, from attention mechanisms to reinforcement learning and social modeling. As Large Language Models (LLMs) develop, there is a rising consensus that psychology is essential for capturing human-like cognition, behavior, and interaction. This paper reviews how psychological theories can inform and enhance stages of LLM development. Our review integrates insights from six subfields of psychology: cognitive, developmental, behavioral, social, and personality psychology, as well as psycholinguistics. With stage-wise analysis, we highlight current trends and gaps in how psychological theories are applied. By examining both cross-domain connections and points of tension, we aim to bridge disciplinary divides and promote more thoughtful integration of psychology into NLP research.
How Robust Are Router-LLMs? Analysis of the Fragility of LLM Routing Capabilities
Aly M. Kassem | Bernhard Schölkopf | Zhijing Jin
Large language model (LLM) routing has emerged as a crucial strategy for balancing computational costs with performance by dynamically assigning queries to the most appropriate model based on query complexity. Despite recent advances showing that preference-data-based routers can outperform traditional methods, current evaluation benchmarks remain limited—they largely focus on general model capabilities while overlooking task-specific behaviors and critical concerns such as privacy, safety, and potential backdoor vulnerabilities introduced through preference data. In response, we propose the DSC (Diverse, Simple, and Categorized) benchmark, an evaluation framework that categorizes router performance across a broad spectrum of query types—including coding, translation, mathematics, human instructions, general knowledge, and LLM jailbreaking—and integrates privacy and safety assessments to reveal hidden risks. Our experiments on three preference-based routers and two commercial counterparts demonstrate that while these systems improve efficiency, they often make suboptimal, category-driven decisions; for instance, a BERT-based router directs all coding and mathematics queries to the most powerful LLM—even when simpler models would suffice—while routing jailbreaking attempts to weaker models, thereby elevating safety risks.
NG-Router: Graph-Supervised Multi-Agent Collaboration for Nutrition Question Answering
Kaiwen Shi | Zheyuan Zhang | Zhengqing Yuan | Keerthiram Murugesan | Vincent Galassi | Chuxu Zhang | Yanfang Ye
Diet plays a central role in human health, and Nutrition Question Answering (QA) offers a promising path toward personalized dietary guidance and the prevention of diet-related chronic diseases. However, existing methods face two fundamental challenges: the limited reasoning capacity of single-agent systems and the complexity of designing effective multi-agent architectures, as well as contextual overload that hinders accurate decision-making. We introduce Nutritional-Graph Router (NG-Router), a novel framework that formulates nutritional QA as a supervised, knowledge-graph–guided multi-agent collaboration problem. NG-Router integrates agent nodes into heterogeneous knowledge graphs and employs a graph neural network to learn task-aware routing distributions over agents, leveraging soft supervision derived from empirical agent performance. To further address contextual overload, we propose a gradient-based subgraph retrieval mechanism that identifies salient evidence during training, thereby enhancing multi-hop and relational reasoning. Extensive experiments across multiple benchmarks and backbone models demonstrate that NG-Router consistently outperforms both single-agent and ensemble baselines, offering a principled approach to domain-aware multi-agent reasoning for complex nutritional health tasks.
Verification-Aware Planning for Multi-Agent Systems
Tianyang Xu | Dan Zhang | Kushan Mitra | Estevam Hruschka
Large language model (LLM) agents are increasingly deployed to tackle complex tasks, often necessitating collaboration among multiple specialized agents. However, multi-agent collaboration introduces new challenges in planning, coordination, and verification. Execution failures frequently arise not from flawed reasoning alone, but from subtle misalignments in task interpretation, output format, or inter-agent handoffs. To address these challenges, we present VeriMAP, a framework for multi-agent collaboration with verification-aware planning. The VeriMAP planner decomposes tasks, models subtask dependencies, and encodes planner-defined passing criteria as subtask verification functions (VFs) in Python and natural language. We evaluate VeriMAP on diverse datasets, demonstrating that it outperforms both single- and multi-agent baselines while enhancing system robustness and interpretability. Our analysis highlights how verification-aware planning enables reliable coordination and iterative refinement in multi-agent systems, without relying on external labels or annotations.
Zero-Shot Open-Schema Entity Structure Discovery
Xueqiang Xu | Jinfeng Xiao | James Barry | Mohab Elkaref | Jiaru Zou | Pengcheng Jiang | Yunyi Zhang | Maxwell J Giammona | Geeth De Mel | Jiawei Han
Entity structure extraction, which aims to extract entities and their associated attribute–value structures from text, is an essential task for text understanding and knowledge graph construction. Existing methods based on large language models (LLMs) typically rely heavily on predefined entity attribute schemas or annotated datasets, often leading to incomplete extraction results. To address these challenges, we introduce ZOES, a novel approach to entity structure extraction that does not require any schema or annotated samples. ZOES operates via a principled mechanism of enrichment, refinement, and unification, based on the insight that an entity and its associated structure are mutually reinforcing. Experiments demonstrate that ZOES consistently enhances LLMs’ ability to extract more complete entity structures across three different domains, showcasing both the effectiveness and generalizability of the method. These findings suggest that such an enrichment, refinement, and unification mechanism may serve as a principled approach to improving the quality of LLM-based entity structure discovery in various scenarios.
Beyond Semantics: How Temporal Biases Shape Retrieval in Transformer and State-Space Models
Anooshka Bajaj | Deven Mahesh Mistry | Sahaj Singh Maini | Yash Aggarwal | Zoran Tiganj
In-context learning depends not only on what appears in the prompt but also on when it appears. To isolate this temporal component from semantic confounds, we construct prompts with repeated anchor tokens and average the model’s predictions over hundreds of random permutations of the intervening context. This approach ensures that any observed position-dependent effects are driven purely by temporal structure rather than token identity or local semantics. Across four transformer LLMs and three state-space/recurrent models, we observe a robust serial recall signature: models allocate disproportionate probability mass to the tokens that previously followed the anchor, but the strength of this signal is modulated by serial position, yielding model-specific primacy/recency profiles. We then introduce an overlapping-episode probe in which only a short cue from one episode is re-presented; retrieval is reliably weakest for episodes embedded in the middle of the prompt, consistent with "lost-in-the-middle" behavior. Mechanistically, ablating high-induction-score attention heads in transformers reduces serial recall and episodic separation. For state-space models, ablating a small fraction of high-attribution channels produces analogous degradations, suggesting a sparse subspace supporting induction-style copying. Together, these results clarify how temporal biases shape retrieval across architectures and provide controlled probes for studying long-context behavior.
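A sketch of how such a permutation-averaged probe can be constructed; `next_token_probs` is a stub standing in for a real model query, and the anchor/target layout is illustrative rather than the authors' exact setup.

```python
# Sketch of the controlled probe: plant an anchor token (followed by a fixed
# target token) at several positions, shuffle the intervening filler tokens
# many times, and average the model's next-token distribution after a final
# anchor. `next_token_probs` is a placeholder for a real LM call.
import random
import numpy as np

VOCAB = list(range(100))
ANCHOR, TARGET = 0, 1

def next_token_probs(tokens):
    # Placeholder: a real probe would query an LM for P(next token | tokens).
    return np.ones(len(VOCAB)) / len(VOCAB)

def build_prompt(filler, anchor_positions, seq_len=64):
    tokens = filler[:seq_len]
    for pos in anchor_positions:
        tokens[pos] = ANCHOR
        if pos + 1 < seq_len:
            tokens[pos + 1] = TARGET   # the token that previously followed the anchor
    return tokens

def averaged_recall(n_perms=300, anchor_positions=(5, 30)):
    probs = np.zeros(len(VOCAB))
    filler = [t for t in VOCAB if t not in (ANCHOR, TARGET)]
    for _ in range(n_perms):
        random.shuffle(filler)                     # isolate temporal structure from content
        prompt = build_prompt(list(filler), anchor_positions) + [ANCHOR]
        probs += next_token_probs(prompt)
    return probs / n_perms

print("P(target | anchor):", averaged_recall()[TARGET])
```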
Diagnosing Vision Language Models’ Perception by Leveraging Human Methods for Color Vision Deficiencies
Kazuki Hayashi | Shintaro Ozaki | Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe
Large-scale Vision-Language Models (LVLMs) are being deployed in real-world settings that require visual inference. As capabilities improve, applications in navigation, education, and accessibility are becoming practical. These settings require accommodation of perceptual variation rather than assuming a uniform visual experience. Color perception illustrates this requirement: it is central to visual understanding yet varies across individuals due to Color Vision Deficiencies, an aspect largely ignored in multimodal AI. In this work, we examine whether LVLMs can account for variation in color perception using the Ishihara Test. We evaluate model behavior through generation, confidence, and internal representation, using Ishihara plates as controlled stimuli that expose perceptual differences. Although models possess factual knowledge about color vision deficiencies and can describe the test, they fail to reproduce the perceptual outcomes experienced by affected individuals and instead default to normative color perception. These results indicate that current systems lack mechanisms for representing alternative perceptual experiences, raising concerns for accessibility and inclusive deployment in multimodal settings.
Tokenizer-Aware Cross-Lingual Adaptation of Decoder-Only LLMs through Embedding Relearning and Swapping
Fan Jiang | Honglin Yu | Grace Y Chung | Trevor Cohn
Extending Large Language Models (LLMs) to new languages is challenging, with most proposed methods suffering from high computational cost and catastrophic forgetting of original model capabilities. Embedding relearning (CITATION), a technique that creates new tokenizers and tunes embeddings on fixed model weights for target-language adaptation, is both lightweight and performant. However, it has only been shown to work for older-generation encoder-only models and for high-resource languages. In this paper, we extend this framework to decoder-only LLMs, focusing on joint adaptation to many languages, including low-resource ones. We experiment with three language groups of over 100 languages each. We adapt a pre-trained LLM by switching to a customized tokenizer and relearning the embedding layer. Across 96 diverse languages spanning both classification and generation tasks, we show that embedding relearning improves models by up to 20%, being highly competitive with full-weight-updating baselines while being vastly more computationally efficient and mitigating catastrophic forgetting. This translates into better results when transferring the improved multilingual performance to tasks that build on core English abilities (e.g., multilingual math reasoning), compared to various baselines. Further analysis reveals the critical role of customizing tokenizers in achieving effective language transfer, particularly for non-Latin-script languages.
Active Generalized Category Discovery with Diverse LLM Feedback
Henry Peng Zou | Siffi Singh | Yi Nian | Jianfeng He | Jason Cai | Saab Mansour | Hang Su
Generalized Category Discovery (GCD) is a practical and challenging open-world task that aims to recognize both known and novel categories in unlabeled data using limited labeled data from known categories. Due to the lack of supervision, previous GCD methods face significant challenges, such as difficulty in rectifying errors for confusing instances, and inability to effectively uncover and leverage the semantic meanings of discovered clusters. Therefore, additional annotations are usually required for real-world applicability. However, human annotation is extremely costly and inefficient. To address these issues, we propose DeLFGCD, a unified framework for generalized category discovery that actively learns from diverse and collaborative LLM feedback. Our approach leverages three different types of LLM feedback to: (1) improve instance-level contrastive features, (2) generate category descriptions, and (3) align uncertain instances with LLM-selected category descriptions. Extensive experiments demonstrate the superior performance of DeLFGCD over state-of-the-art models across diverse datasets, metrics, and supervision settings.
RAFFLES: Reasoning-based Attribution of Faults for LLM Systems
Chenyang Zhu | Spencer Hong | Jingyu Wu | Kushal Chawla | Yuhui Tang | Youbing Yin | Nathan Wolfe | Erin Babinsky | Daben Liu
The advent of complex, interconnected long-horizon LLM systems has made it incredibly tricky to identify where and when these systems break down. Existing evaluation capabilities are limited in that they often focus on simple metrics and end-to-end outcomes, and depend on the perspectives of humans. In order to match the increasing complexity of these many-component systems, evaluation frameworks must also be able to reason, probe, iterate, and understand the nuanced logic passing through these systems. In this paper, we present RAFFLES, an offline evaluation architecture that incorporates iterative reasoning. Specifically, RAFFLES operates as an iterative, multi-component pipeline, using a central Judge to systematically identify faults and a set of specialized Evaluators to assess the quality of the candidate faults as well as the rationales of the Judge. We evaluated RAFFLES on several benchmarks: the Who&When dataset to identify step-level faults in multi-agent systems and the ReasonEval datasets to diagnose step-level mathematical reasoning errors. RAFFLES outperforms strong baselines, achieving an accuracy of over 20% and 50% on the Who&When Hand-Crafted and Algorithmically-Generated datasets, respectively, and over 80% on the ReasonEval datasets. These results demonstrate a key step towards introducing automated fault detection for autonomous systems over labor-intensive manual review.
Jailbreaks as Inference-Time Alignment: A Framework for Understanding Safety Failures in LLMs
James Beetham | Souradip Chakraborty | Mengdi Wang | Furong Huang | Amrit Singh Bedi | Mubarak Shah
Large language models (LLMs) are safety-aligned to prevent harmful response generation, yet still remain vulnerable to jailbreak attacks. While prior works have focused on improving jailbreak attack effectiveness, they offer little explanation for why safety alignment fails. We address this gap by framing jailbreaks as inference-time alignment, connecting attack design and safety alignment within a unified optimization framework. This framing allows us to extend best-of-N inference-time alignment to the adversarial setting, called LIAR (Leveraging Inference-time Alignment to jailbReak), and derive suboptimality bounds that show LIAR provably approaches an optimal jailbreak as compute scales. Interestingly, our framework allows us to develop the notion of a Safety-Net, a measure of how vulnerable an LLM is to jailbreaks, which helps to explain why safety alignment can fail. Empirically, LIAR produces natural, hard-to-detect prompts that achieve a competitive attack success rate while running 10 to 100x faster than prior suffix-based jailbreaks.
Over-Searching in Retrieval-Augmented Large Language Models
Roy Xie | Deepak Gopinath | David Qiu | Dong Lin | Haitian Sun | Saloni Potdar | Bhuwan Dhingra
Search-augmented large language models (LLMs) excel at knowledge-intensive tasks by integrating external retrieval. However, they often over-search – unnecessarily invoking the search tool even when it does not improve response quality – which leads to computational inefficiency and hallucinations from incorporating irrelevant context. In this work, we conduct a systematic evaluation of over-searching across multiple dimensions, including query types, model categories, retrieval conditions, and multi-turn conversations. Our findings show: (i) search generally improves answer accuracy on answerable queries but harms abstention on unanswerable ones; (ii) over-searching is more pronounced in complex reasoning models and deep research systems, is exacerbated by noisy retrieval, and compounds across turns in multi-turn conversations; and (iii) the composition of retrieved evidence is crucial, as the presence of negative evidence improves abstention. To quantify over-searching, we introduce Tokens Per Correctness (TPC), an evaluation metric that captures the performance-cost trade-off for search-augmented LLMs. Lastly, we investigate mitigation approaches at both the query and retrieval levels and release the OverSearchQA benchmark to foster continued research into efficient search-augmented LLMs.
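One plausible formalization of TPC, under the assumption that it divides the total token budget (generated plus retrieved) by the number of correct answers; the paper's exact definition may differ, so treat the field names below as hypothetical.

```python
# A plausible (assumed) formalization of Tokens Per Correctness: total tokens
# spent across a query set, generation plus retrieved context, divided by the
# number of correct answers. Lower is better.
def tokens_per_correctness(records):
    total_tokens = sum(r["generated_tokens"] + r["retrieved_tokens"] for r in records)
    n_correct = sum(r["correct"] for r in records)
    return float("inf") if n_correct == 0 else total_tokens / n_correct

records = [
    {"generated_tokens": 120, "retrieved_tokens": 800, "correct": True},
    {"generated_tokens": 95,  "retrieved_tokens": 0,   "correct": True},
    {"generated_tokens": 240, "retrieved_tokens": 1600, "correct": False},
]
print(tokens_per_correctness(records))  # fewer tokens per correct answer is better
```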
LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing
Daniel Fein | Sebastian Russo | Violet Xiang | Kabir Jolly | Rafael Rafailov | Nick Haber
Evaluating creative writing generated by large language models (LLMs) remains challenging because open-ended narratives lack ground truths. Without performant automated evaluation methods, off-the-shelf (OTS) language models are employed as zero-shot judges, yet their reliability is unclear in this context. To address this gap, we introduce LitBench, a large-scale benchmark for creative writing evaluation, featuring a training corpus of 43,827 story pairs and a 2,480-pair test set curated from Reddit. Using LitBench, we benchmark existing LLM judges and train specialized reward models. Our analysis reveals that the strongest OTS judge, Claude-3.7-Sonnet, achieves only 73% agreement with human preferences. In contrast, our trained Bradley-Terry and generative reward models both reach 78% accuracy, outperforming all OTS judges. An online human study further validates our models, showing their rankings of newly generated stories align more closely with human preferences. Our work provides the first reliable benchmark and specialized reward models for creative writing, establishing a crucial foundation for the future development of more capable verifiers.
H-Mem: Hybrid Multi-Dimensional Memory Management for Long-Context Conversational Agents
Zihe Ye | Jingyuan Huang | Weixin Chen | Yongfeng Zhang
Long-context conversational agents require robust memory, but existing frameworks struggle to organize information effectively across dimensions like time and topic, leading to poor retrieval. To address this, we introduce H-Mem, a novel Hybrid Multi-Dimensional Memory architecture. H-Mem stores conversational facts in two parallel, hierarchical data structures: a temporal tree that organizes information chronologically and a semantic tree that organizes it conceptually. This dual-tree design enables a hybrid retrieval mechanism managed by an intelligent Mode Controller. Based on the query, the controller dynamically chooses between a sequential search using semantic anchors and an intersective search combining both hierarchies. Our experiments on long-context QA datasets demonstrate that H-Mem provides a more flexible approach to memory management, leading to significant improvements of over 8.4% compared to other state-of-the-art systems.
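An illustrative, dict-based sketch of a dual-index memory with a simple mode controller (not the paper's hierarchical trees or retrieval logic): the same fact is filed under a temporal key and a semantic key, and queries use whichever index they constrain, intersecting when both apply.

```python
# Illustrative sketch (not the paper's implementation) of a dual-index memory:
# each fact is filed under a temporal path and a semantic path, and a simple
# mode controller decides which index, or their intersection, to search.
from collections import defaultdict

temporal = defaultdict(list)   # e.g. "2025-06" -> facts
semantic = defaultdict(list)   # e.g. "health/allergies" -> facts

def store(fact, month, topic):
    temporal[month].append(fact)
    semantic[topic].append(fact)

def retrieve(query, month=None, topic=None):
    # Mode controller: use whichever hierarchy the query constrains; intersect if both.
    if month and topic:
        return [f for f in temporal[month] if f in set(semantic[topic])]
    if month:
        return temporal[month]
    if topic:
        return semantic[topic]
    return [f for facts in semantic.values() for f in facts]

store("User is allergic to peanuts", "2025-05", "health/allergies")
store("User started running in June", "2025-06", "health/exercise")
print(retrieve("What allergies do I have?", topic="health/allergies"))
print(retrieve("What happened in June?", month="2025-06"))
```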
“Yuki Gets Sushi, David Gets Steak?”: Uncovering Gender and Racial Biases in LLM-Based Meal Recommendations
Xuefeng Wei | Xuan Zhou | Yusuke Sakai | Taro Watanabe
Group bias in Large Language Models (LLMs) is a well-documented issue, but its impact in high-stakes domains such as personalized nutritional advice remains underexplored. This study introduces the USChainMains dataset to systematically evaluate LLMs, prompting them with names associated with specific racial and gender groups and rigorously quantifying the healthfulness of the generated meal recommendations against established dietary standards. The findings demonstrate that LLMs systematically recommend meals with significantly higher levels of adverse nutrients for names associated with Black, Hispanic, or male individuals, thereby reflecting and potentially reinforcing detrimental dietary stereotypes. Furthermore, our analysis of two common mitigation strategies reveals their limitations. While model scaling improves overall recommendation healthfulness, it is insufficient to eliminate the healthfulness gap between demographic groups. Notably, while augmented reasoning was effective in mitigating gender bias, it did not mitigate racial disparities. This work underscores the necessity of developing more nuanced, group-aware debiasing techniques to ensure AI-driven systems advance, rather than hinder, health equity.
Happiness is Sharing a Vocabulary: A Study of Transliteration Methods
Haeji Jung | Jinju Kim | Kyungjin Kim | Youjeong Roh | David R. Mortensen
Transliteration has emerged as a promising means to bridge the gap between various languages in multilingual NLP, showing promising results especially for languages using non-Latin scripts. We investigate the degree to which shared script, overlapping token vocabularies, and shared phonology contribute to performance of multilingual models. To this end, we conduct controlled experiments using three kinds of transliteration (romanization, phonemic transcription, and substitution ciphers) as well as orthography. We evaluate each model on three downstream tasks—named entity recognition (NER), part-of-speech tagging (POS) and natural language inference (NLI)—and find that romanization significantly outperforms other input types in 7 out of 8 evaluation settings, largely consistent with our hypothesis that it is the most effective approach. We further analyze how each factor contributed to the success, and suggest that having longer (subword) tokens shared with pre-trained languages leads to better utilization of the model.
SCALAR: Scientific Citation-based Live Assessment of Long-context Academic Reasoning
Renxi Wang | Honglin Mu | Liqun Ma | Lizhi Lin | Yunlong Feng | Timothy Baldwin | Xudong Han | Haonan Li
Long-context understanding has emerged as a critical capability for large language models (LLMs). However, evaluating this ability remains challenging. We present SCALAR, a benchmark designed to assess citation-grounded long-context reasoning in academic writing. SCALAR leverages academic papers and their citation structure to automatically generate high-quality ground-truth labels without human annotation. It features controllable difficulty levels and a dynamic updating mechanism that mitigates data contamination. The benchmark includes two tasks: a multiple-choice QA format and a cloze-style citation prediction. We evaluate a range of state-of-the-art LLMs and find that the multiple-choice task effectively distinguishes model capabilities—while human experts achieve over 90% accuracy, most models struggle. The cloze-style task is even more challenging, with no model exceeding 40% accuracy. SCALAR provides a domain-grounded, continuously updating framework for tracking progress in citation-based long-context understanding. Code and data will be publicly released.
Look Before You Leap: A Lookahead Reasoning Quality Gate for Speculative Decoding
Hiroaki Kingetsu | Kaoru Yokoo | Kenji Fukumizu | Manohar Kaul
We present a lookahead quality gate (verifier) for speculative decoding for reasoning or chain-of-thought language models. The gate accepts the longest reliable prefix of each k-token lookahead (block-wise) draft. Unlike token-level likelihood search, which is myopic and often rewards verbosity, or tree-level sampling methods that trade accuracy for latency, our approach works at an intermediate granularity. It uses only the base model’s hidden states to compute a geometry-based quality score for each prefix, then accepts the longest prefix whose score exceeds a quantile-calibrated threshold estimated from unlabeled prompts. The method integrates seamlessly with speculative/blockwise decoding and adds minimal runtime overhead, requiring no auxiliary heads, reward models, or finetuning. On math and science benchmarks, it improves accuracy over sampling baselines while achieving 2.6-7.9× faster generation.
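A sketch of the acceptance rule under stated assumptions: `prefix_score` is a placeholder for the hidden-state geometry score, the threshold is calibrated as a quantile of scores gathered on unlabeled prompts, and the longest passing prefix of each draft block is then accepted.

```python
# Sketch of the acceptance rule: score every prefix of a k-token draft block,
# calibrate a threshold as a quantile of scores collected on unlabeled prompts,
# and accept the longest prefix above the threshold. The scoring function here
# is a stand-in for the paper's geometry-based quality score.
import numpy as np

rng = np.random.default_rng(0)

def prefix_score(hidden_states, k):
    # Placeholder geometry score for the length-k prefix (assumed, not the paper's).
    return float(np.linalg.norm(hidden_states[:k].mean(axis=0)))

def calibrate_threshold(calibration_scores, q=0.2):
    return float(np.quantile(calibration_scores, q))  # quantile over unlabeled prompts

def accept_longest_prefix(hidden_states, threshold):
    accepted = 0
    for k in range(1, hidden_states.shape[0] + 1):
        if prefix_score(hidden_states, k) >= threshold:
            accepted = k
    return accepted  # number of draft tokens to keep this round

calib = [prefix_score(rng.normal(size=(8, 16)), k) for _ in range(100) for k in range(1, 9)]
tau = calibrate_threshold(calib)
draft_hidden = rng.normal(size=(8, 16))   # hidden states for an 8-token draft block
print("accepted prefix length:", accept_longest_prefix(draft_hidden, tau))
```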
FanarGuard: A Culturally-Aware Moderation Filter for Arabic Language Models
Masoomali Fatehkia | Enes Altinisik | Husrev Taha Sencar
Content moderation filters are a critical safeguard against alignment failures in language models. Yet most existing filters focus narrowly on general safety and overlook cultural context. In this work, we introduce FanarGuard, a bilingual moderation filter that evaluates both safety and cultural alignment in Arabic and English. We construct a dataset of over 468K prompt and response pairs, drawn from synthetic and public datasets, scored by a panel of LLM judges on harmlessness and cultural awareness, and use it to train two filter variants.To rigorously evaluate cultural alignment, we further develop the first benchmark targeting Arabic cultural contexts, comprising over 1K norm-sensitive prompts with LLM-generated responses annotated by human raters. Results show that FanarGuard achieves stronger agreement with human annotations than inter-annotator reliability, while matching the performance of state-of-the-art filters on safety benchmarks. These findings highlight the importance of integrating cultural awareness into moderation and establish FanarGuard as a practical step toward more context-sensitive safeguards.
BILLY: Steering Large Language Models via Merging Persona Vectors for Creative Generation
Tsung-Min Pai | Jui-I Wang | Li-Chun Lu | Shao-Hua Sun | Hung-yi Lee | Kai-Wei Chang
Tsung-Min Pai | Jui-I Wang | Li-Chun Lu | Shao-Hua Sun | Hung-yi Lee | Kai-Wei Chang
Multi-LLM systems enhance the creativity of large language models by simulating human collective intelligence but suffer from significant drawbacks, such as high computational costs and inference latency. To address these limitations, we propose BILLY (BlendIng persona vectors for Large Language model creativitY), a training-free framework that captures the benefits of multi-LLM collaboration, i.e., inducing diverse perspectives and specialized expertise, within a single model. BILLY operates by extracting and blending multiple distinct persona vectors directly in the model’s activation space. We steer the model’s generation process with this merged vector during inference, enabling multi-perspective output without explicit multi-LLM communication. Our experiments across creativity-oriented benchmarks demonstrate that BILLY surpasses single model prompting and traditional multi-LLM approaches, while substantially reducing inference time and computational costs. Our analyses further reveal that distinct persona vectors can be blended to achieve both effective control over complementary aspects of generation and greater interpretability.
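A minimal sketch of activation-space steering with a blended persona vector is shown below, assuming persona vectors are already available; the stand-in model (GPT-2), layer index, blend weights, and steering strength are illustrative choices, not the authors' configuration.

```python
# Illustrative sketch of steering generation with a merged persona vector
# added to one layer's residual stream (hypothetical vectors and scale).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

hidden = model.config.hidden_size
personas = torch.randn(3, hidden)        # placeholders for extracted persona vectors
weights = torch.tensor([0.5, 0.3, 0.2])  # illustrative blend weights
steering = (weights[:, None] * personas).sum(dim=0)

def add_steering(module, inputs, output):
    # Add the merged persona vector to the block's hidden states.
    hs = output[0] if isinstance(output, tuple) else output
    hs = hs + 1.0 * steering  # 1.0: illustrative steering strength
    return (hs,) + output[1:] if isinstance(output, tuple) else hs

handle = model.transformer.h[6].register_forward_hook(add_steering)  # GPT-2-specific path
ids = tok("Write a short story about the sea.", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=40, do_sample=True)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```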
Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story
Pedashenko Vladislav | Laida Kushnareva | Yana Khassan Nibal | Eduard Tulchinskii | Kristian Kuznetsov | Vladislav Zharchinskii | Yury Maximov | Irina Piontkovskaya
Pedashenko Vladislav | Laida Kushnareva | Yana Khassan Nibal | Eduard Tulchinskii | Kristian Kuznetsov | Vladislav Zharchinskii | Yury Maximov | Irina Piontkovskaya
Intrinsic dimension (ID) is an important tool in modern LLM analysis, informing studies of training dynamics, scaling behavior, and dataset structure, yet its textual determinants remain underexplored. We provide the first comprehensive study grounding ID in interpretable text properties through cross-encoder analysis, linguistic features, and sparse autoencoders (SAEs). In this work, we establish three key findings. First, ID is complementary to entropy-based metrics: after controlling for length, the two are uncorrelated, with ID capturing geometric complexity orthogonal to prediction quality. Second, ID exhibits robust genre stratification: scientific prose shows low ID (∼ 8), encyclopedic content medium ID (∼ 9), and creative/opinion writing high ID (∼ 10.5) across all models tested. This reveals that contemporary LLMs find scientific text "representationally simple" while fiction requires additional degrees of freedom. Third, using SAEs, we identify causal features: scientific signals (formal tone, report templates, statistics) reduce ID; humanized signals (personalization, emotion, narrative) increase it. Steering experiments confirm these effects are causal. Thus, for contemporary models, scientific writing appears comparatively "easy", whereas fiction, opinion, and affect add representational degrees of freedom. Our multi-faceted analysis provides practical guidance for the proper use of ID and the sound interpretation of ID-based results.
Image Corruption-Inspired Membership Inference Attacks against Large Vision-Language Models
Zongyu Wu | Minhua Lin | Zhiwei Zhang | Fali Wang | Xianren Zhang | Xiang Zhang | Suhang Wang
Zongyu Wu | Minhua Lin | Zhiwei Zhang | Fali Wang | Xianren Zhang | Xiang Zhang | Suhang Wang
Large vision-language models (LVLMs) have demonstrated outstanding performance in many downstream tasks. However, LVLMs are trained on large-scale datasets, which can pose privacy risks if training images contain sensitive information. Therefore, it is important to detect whether an image is used to train the LVLM. Recent studies have investigated membership inference attacks (MIAs) against LVLMs, including detecting image-text pairs and single-modality content. In this work, we focus on detecting whether a target image is used to train the target LVLM. We design simple yet effective Image Corruption-Inspired Membership Inference Attacks (ICIMIA) against LVLMs, which are inspired by LVLM’s different sensitivity to image corruption for member and non-member images. We first present an MIA method for the white-box setting, where we can obtain the embeddings of the image through the vision part of the target LVLM. The attacks are based on the embedding similarity between the image and its corrupted version. We further explore a more practical scenario where we have no knowledge about the target LVLM and can only query it with an image and a textual instruction. We then conduct the attack by utilizing the similarity of the output text embeddings. Experiments on existing datasets validate the effectiveness of our proposed methods under these two different settings.
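The corruption-similarity idea can be sketched as follows, with a CLIP vision encoder standing in for the target LVLM's vision tower; the corruption type, decision direction, and threshold are assumptions made for illustration and would be tuned in practice.

```python
# Toy sketch of the white-box setting: compare embeddings of an image and
# its corrupted version, and use the similarity as a membership signal.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(image):
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        return model.get_image_features(**inputs).squeeze(0)

def corrupt(image, sigma=25.0):
    """Additive Gaussian noise as one simple image corruption."""
    arr = np.asarray(image).astype(np.float32)
    noisy = np.clip(arr + np.random.normal(0, sigma, arr.shape), 0, 255)
    return Image.fromarray(noisy.astype(np.uint8))

def membership_score(image):
    """Cosine similarity between the clean and corrupted embeddings; the
    sign of the decision rule and its threshold are left to calibration."""
    e1, e2 = embed(image), embed(corrupt(image))
    return torch.nn.functional.cosine_similarity(e1, e2, dim=0).item()

# Usage (hypothetical file): membership_score(Image.open("photo.jpg"))
```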
Language Lives in Sparse Dimensions: Toward Interpretable and Efficient Multilingual Control for Large Language Models
Chengzhi Zhong | Fei Cheng | Qianying Liu | Yugo Murawaki | Chenhui Chu | Sadao Kurohashi
Chengzhi Zhong | Fei Cheng | Qianying Liu | Yugo Murawaki | Chenhui Chu | Sadao Kurohashi
Large language models exhibit strong multilingual capabilities despite limited exposure to non-English data. Prior studies show that English-centric large language models map multilingual content into English-aligned representations at intermediate layers and then project them back into target-language token spaces in the final layer. From this observation, we hypothesize that this cross-lingual transition is governed by a small and sparse set of dimensions, which occur at consistent indices from the intermediate to the final layers. Building on this insight, we introduce a simple, training-free method to identify and manipulate these dimensions, requiring only as few as 50 sentences of either parallel or monolingual data. Experiments on a multilingual generation control task reveal the interpretability of these dimensions, demonstrating that interventions in these dimensions can switch the output language while preserving semantic content, and that our method surpasses prior neuron-based approaches at a substantially lower cost.
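A rough sketch of how such sparse dimensions might be located from a handful of sentences is given below, using mean hidden-state differences between two languages; the stand-in model, layer, and top-k size are illustrative, and the paper's actual selection criterion may differ.

```python
# Locate candidate "language dimensions" as the coordinates with the largest
# mean hidden-state difference between sentences in two languages.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

def mean_hidden(sentences, layer=-1):
    vecs = []
    for s in sentences:
        ids = tok(s, return_tensors="pt")
        with torch.no_grad():
            hs = model(**ids).hidden_states[layer]  # [1, seq_len, hidden]
        vecs.append(hs.mean(dim=1).squeeze(0))
    return torch.stack(vecs).mean(dim=0)

english = ["The weather is nice today.", "I like reading books."]
german = ["Das Wetter ist heute schön.", "Ich lese gern Bücher."]

delta = mean_hidden(german) - mean_hidden(english)
top_dims = torch.topk(delta.abs(), k=50).indices  # sparse candidate dimensions
print(top_dims.tolist()[:10])
# An intervention would then add (or subtract) delta at these indices of the
# hidden states during generation to push the output toward one language.
```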
Engagement Undermines Safety: How Stereotypes and Toxicity Shape Humor in Language Models
Atharvan Dogra | Soumya Suvra Ghosal | Ameet Deshpande | Ashwin Kalyan | Dinesh Manocha
Atharvan Dogra | Soumya Suvra Ghosal | Ameet Deshpande | Ashwin Kalyan | Dinesh Manocha
Large language models are increasingly used for creative writing and engagement content, raising safety concerns about their outputs. Using humor generation as a testbed, this work evaluates how funniness optimization in modern LLM pipelines couples with harmful content by jointly measuring humor, stereotypicality, and toxicity. We further supplement this by analyzing incongruity signals through information-theoretic metrics. Across six models, we observe that even for fixed neutral setups, harmful outputs receive higher humor scores, indicating a bias amplification loop between generators and evaluators. Information-theoretic analyses show that harmful cues widen predictive uncertainty and, surprisingly, can even make harmful punchlines more expected for some models, suggesting intrinsic structural embedding in learned humor distributions. Experiments and human evaluation on an additional satire-generation task with human-perceived funniness judgments show that LLM funniness relies on increased stereotypicality and toxicity, including for closed models. Quantitatively, stereotypical/toxic jokes gain 10%–21% in mean humor score, stereotypical jokes appear 11% to 28% more often among the jokes marked funny by an LLM-based metric, and up to 10% more often in generations perceived as funny by humans.
Are All Prompt Components Value-Neutral? Understanding the Heterogeneous Adversarial Robustness of Dissected Prompt in LLMs
Yujia Zheng | Tianhao Li | Haotian Huang | Tianyu Zeng | Jingyu Lu | Chuangxin Chu | Yuekai Huang | Ziyou Jiang | Qian Xiong | Yuyao Ge | Mingyang Li
Yujia Zheng | Tianhao Li | Haotian Huang | Tianyu Zeng | Jingyu Lu | Chuangxin Chu | Yuekai Huang | Ziyou Jiang | Qian Xiong | Yuyao Ge | Mingyang Li
Prompt-based adversarial attacks are a key tool for assessing the robustness of large language models (LLMs). Yet, existing studies typically treat prompts as flat text, overlooking their internal structure, even though different components within a prompt contribute unequally to robustness. This work introduces PromptAnatomy, a framework that decomposes prompts into functional components, and ComPerturb, a controlled perturbation method that selectively modifies these components to expose component-wise vulnerabilities while ensuring linguistic plausibility via perplexity-based filtering. Using this framework, four instruction-tuning datasets are structurally annotated and validated by human reviewers. Experiments across five advanced LLMs show that ComPerturb achieves state-of-the-art attack success rates, while ablation analyses confirm the complementary effects of prompt dissection and perplexity filtering. These results highlight the importance of structural awareness in evaluating and improving the adversarial robustness of LLMs.
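The perplexity-based plausibility filter can be sketched with a small causal LM as below; the perplexity budget is an assumed hyperparameter, and the component-level prompt dissection itself is not reproduced here.

```python
# Keep only perturbed prompts whose perplexity under a reference LM stays
# below a chosen budget (budget value illustrative).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def perplexity(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean token cross-entropy
    return math.exp(loss.item())

def filter_perturbations(candidates, max_ppl=80.0):
    return [c for c in candidates if perplexity(c) <= max_ppl]

print(filter_perturbations([
    "Summarize the following article in two sentences.",
    "Summarize following the article two in sentences.",
]))
```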
A Regex Minimization Benchmark: A PSPACE-Complete Challenge for Language Models
Hyundong Jin | Joonghyuk Hahn | Yo-Sub Han
Hyundong Jin | Joonghyuk Hahn | Yo-Sub Han
Language models (LMs) have shown impressive reasoning capabilities across various domains. A fundamental question is the extent of their reasoning power. While recent studies show that LMs can solve NP-complete problems, their ability to handle PSPACE-complete problems remains underexplored. We investigate regex minimization as a PSPACE-complete challenge for LMs to address this issue. Regexes, formal expressions for regular languages widely used in NLP, software engineering (SE), and programming languages (PL), are supported by several efficient, theoretically grounded tools for their manipulation. Inspired by this, we introduce the first benchmark for regex minimization containing over a million regexes paired with their minimal equivalents. Through extensive experiments with two LMs trained on our dataset and five open-source large language models (LLMs), we analyze how well LMs perform on PSPACE-complete problems, highlighting their capabilities of generalization and limitations in reasoning. To the best of our knowledge, this is the first study to systematically evaluate LM reasoning in regex minimization and establish a foundation for solving PSPACE-complete problems with LMs. Our code is available at https://github.com/hyundong98/RegexPSPACE.
Teaching Small Language Models to Learn Logic through Meta-Learning
Leonardo Bertolazzi | Manuel Vargas Guzmán | Raffaella Bernardi | Maciej Malicki | Jakub Szymanik
Leonardo Bertolazzi | Manuel Vargas Guzmán | Raffaella Bernardi | Maciej Malicki | Jakub Szymanik
Large language models (LLMs) are increasingly evaluated on reasoning tasks, yet their logical abilities remain contested. To address this, we study LLMs’ reasoning in a well-defined fragment of logic: syllogistic reasoning. We cast the problem as premise selection and construct controlled datasets to isolate logical competence. Beyond evaluation, an open challenge is enabling LLMs to acquire abstract inference patterns that generalize to novel structures. We propose to apply few-shot meta-learning to this domain, thereby encouraging models to extract rules across tasks rather than memorize patterns within tasks. Although meta-learning has been little explored in the context of logic learnability, our experiments show that it is effective: small models (1.5B–7B) fine-tuned with meta-learning demonstrate strong gains in generalization, with especially pronounced benefits in low-data regimes. These meta-learned models outperform GPT-4o and o3-mini on our syllogistic reasoning task.
COMPACT: Building Compliance Paralegals via Clause Graph Reasoning over Contracts
Ayush Singh | Dishank Aggarwal | Pranav Bhagat | Ainulla Khan | Sameer Malik | Amar Prakash Azad
Ayush Singh | Dishank Aggarwal | Pranav Bhagat | Ainulla Khan | Sameer Malik | Amar Prakash Azad
Contract compliance verification requires reasoning about cross-clause dependencies where obligations, exceptions, and conditions interact across multiple provisions, yet existing legal NLP benchmarks like ContractNLI and CUAD focus exclusively on isolated single-clause tasks. We introduce COMPACT (COMpliance PAralegals via Clause graph reasoning over conTracts), a framework that models cross-clause dependencies through structured clause graphs. Our approach extracts deontic-temporal entities from clauses and constructs typed relationship graphs capturing definitional dependencies, exception hierarchies, and temporal sequences. From these graphs, we introduce ACE (Assessing Compliance in Enterprise), a benchmark containing 4,700 carefully constructed compliance scenarios derived from 633 real-world contracts covering 26 types of agreements. Each scenario requires multi-hop reasoning across multiple clauses, and undergoes independent LLM-based validation to ensure quality. Evaluation reveals that multi-clause reasoning poses a fundamental challenge for state-of-the-art models (34-57% base accuracy), while training on ACE yields substantial improvements on compliance tasks (+22–43 percentage points) and also enhances general legal reasoning performance on other benchmarks (PrivaCI-Bench, ContractNLI).
Surprisal and Metaphor Novelty Judgments: Moderate Correlations and Divergent Scaling Effects Revealed by Corpus-Based and Synthetic Datasets
Omar Momen | Emilie Sitter | Berenike Herrmann | Sina Zarrieß
Omar Momen | Emilie Sitter | Berenike Herrmann | Sina Zarrieß
Novel metaphor comprehension involves complex semantic processes and linguistic creativity, making it an interesting task for studying language models (LMs). This study investigates whether surprisal, a probabilistic measure of predictability in LMs, correlates with annotations of metaphor novelty in different datasets. We analyse the surprisal of metaphoric words in corpus-based and synthetic metaphor datasets using 16 causal LM variants. We propose a cloze-style surprisal method that conditions on full-sentence context. Results show that LM surprisal yields significant moderate correlations with scores/labels of metaphor novelty. We further identify divergent scaling patterns: on corpus-based data, correlation strength decreases with model size (inverse scaling effect), whereas on synthetic data it increases (quality–power hypothesis). We conclude that while surprisal can partially account for annotations of metaphor novelty, it remains limited as a metric of linguistic creativity. Code and data are publicly available: https://github.com/OmarMomen14/surprisal-metaphor-novelty
Repairing Regex Vulnerabilities via Localization-Guided Instructions
Sicheol Sung | Joonghyuk Hahn | Yo-Sub Han
Sicheol Sung | Joonghyuk Hahn | Yo-Sub Han
Regular expressions (regexes) are foundational to modern computing for critical tasks like input validation and data parsing, yet their ubiquity exposes systems to regular expression denial of service (ReDoS), a vulnerability requiring automated repair methods. Current approaches, however, are hampered by a trade-off. Symbolic, rule-based systems are precise but fail to repair unseen or complex vulnerability patterns. Conversely, large language models (LLMs) possess the necessary generalizability but are unreliable for tasks demanding strict syntactic and semantic correctness. We resolve this impasse by introducing a hybrid framework, localized regex repair (LRR), designed to harness LLM generalization while enforcing reliability. Our core insight is to decouple problem identification from the repair process. First, a deterministic, symbolic module localizes the precise vulnerable subpattern, creating a constrained and tractable problem space. Then, the LLM is invoked to generate a semantically equivalent fix for this isolated segment. This combined architecture successfully resolves complex repair cases intractable for rule-based repair while avoiding the semantic errors of LLM-only approaches. Our work provides a validated methodology for solving such problems in automated repair, improving the repair rate by 15.4%p over the state-of-the-art.
Do Psychometric Tests Work for Large Language Models? Evaluation of Tests on Sexism, Racism, and Morality
Jana Jung | Marlene Lutz | Indira Sen | Markus Strohmaier
Jana Jung | Marlene Lutz | Indira Sen | Markus Strohmaier
Psychometric tests are increasingly used to assess psychological constructs in large language models (LLMs). However, it remains unclear whether these tests – originally developed for humans – yield meaningful results when applied to LLMs. In this study, we systematically evaluate the reliability and validity of human psychometric tests on 17 LLMs for three constructs: sexism, racism, and morality. We find moderate reliability across multiple item and prompt variations. Validity is evaluated through both convergent (i.e., testing theory-based inter-test correlations) and ecological approaches (i.e., testing the alignment between test scores and behavior in real-world downstream tasks). Crucially, we find that psychometric test scores do not align with, and in some cases even negatively correlate with, model behavior in downstream tasks, indicating low ecological validity. Our results highlight that systematic evaluations of psychometric tests on LLMs are essential before interpreting their scores. Our findings also suggest that psychometric tests designed for humans cannot be applied directly to LLMs without adaptation.
ReFACT: A Benchmark for Scientific Confabulation Detection with Positional Error Annotations
Yindong Wang | Martin Preiß | Margarita Bugueño | Jan Vincent Hoffbauer | Abdullatif Ghajar | Tolga Buz | Gerard de Melo
Yindong Wang | Martin Preiß | Margarita Bugueño | Jan Vincent Hoffbauer | Abdullatif Ghajar | Tolga Buz | Gerard de Melo
The mechanisms underlying scientific confabulation in Large Language Models (LLMs) remain poorly understood. We introduce ReFACT, a benchmark of 1,001 expert-annotated question-answer pairs with span-level error annotations derived from Reddit’s r/AskScience. Evaluating 9 state-of-the-art LLMs reveals two critical limitations. First, models exhibit a dominant salient distractor failure mode: 61% of incorrect span predictions are semantically unrelated to actual errors. Crucially, this pattern persists across all model scales (1B to 70B), indicating a fundamental semantic grounding deficit that scaling alone fails to resolve. Second, we find that comparative judgment is paradoxically harder than independent detection: even GPT-4o’s F1 score drops from 0.67 to 0.53 when comparing answers side-by-side. These findings directly challenge the reliability of LLM-as-Judge paradigms for scientific factuality. Code and data are released at https://github.com/ddz5431/ReFACT.
Cosine Similarity as Logits?: A Scalable Knowledge Probe Using Embedding Vectors from Generative Language Models
Tomoyuki Jinno | Kazuki Hayashi | Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe
Tomoyuki Jinno | Kazuki Hayashi | Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe
Generating Multi-Aspect Queries for Conversational Search
Zahra Abbasiantaeb | Simon Lupart | Mohammad Aliannejadi
Zahra Abbasiantaeb | Simon Lupart | Mohammad Aliannejadi
Conversational information seeking (CIS) systems aim to model the user’s information need within the conversational context and retrieve the relevant information. One major approach to modeling the conversational context aims to rewrite the user utterance in the conversation to represent the information need independently. In this work, we hypothesize that breaking down the information of an utterance into multiple queries covering different aspects of the information need can lead to more effective retrieval performance. This is more evident in more complex utterances that require gathering evidence from various information sources, where a single query rewrite or query representation cannot capture the complexity of the utterance. We propose MQ4CS, a multi-aspect query generation and retrieval framework, which uses Large Language Models (LLMs) to break the user utterance into multiple queries. This approach improves retrieval performance, as most utterances benefit from more than one rewritten query. We evaluate MQ4CS on six widely used CIS datasets, showing it outperforms state-of-the-art query rewriting methods. Using MQ4CS, we also construct MASQ, which includes multiple-aspect queries for the six datasets. Fine-tuning the model on MASQ yields significant improvements. We make our code and dataset publicly available.
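One common way to merge the per-aspect retrieval runs produced by multiple generated queries is reciprocal rank fusion, sketched below; the paper's own aggregation strategy may differ, and the LLM-based query generation step is not shown.

```python
# Fuse one ranked list per generated aspect query into a single ranking.
from collections import defaultdict

def reciprocal_rank_fusion(runs, k=60):
    """runs[i] is a ranked list of document ids for the i-th aspect query;
    standard RRF scoring with constant k."""
    scores = defaultdict(float)
    for run in runs:
        for rank, doc_id in enumerate(run, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

print(reciprocal_rank_fusion([["d3", "d1", "d7"], ["d1", "d9", "d3"]]))
```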
Navigating the Infinite Dynamic Web Space: Effective In-Context Exploration via Cognitive Multi-Agent Collaboration
Guozhao Mo | Yanjiang Liu | Yafei Shi | Jiawei Chen | Yang Li | Yaojie Lu | Hongyu Lin | Ben He | Le Sun | Bo Zheng | Xianpei Han
Guozhao Mo | Yanjiang Liu | Yafei Shi | Jiawei Chen | Yang Li | Yaojie Lu | Hongyu Lin | Ben He | Le Sun | Bo Zheng | Xianpei Han
Dynamic web navigation is challenging due to the infinite decision space and the constantly changing nature of cyberspace. Existing methods rely on greedy strategies or value estimation, struggle to achieve effective backtracking, and are heavily dependent on proprietary models. In this paper, we propose HintNavigator, a cognitive multi-agent collaboration framework that enhances cyberspace exploration capability through In-Context Exploration (ICE). Inspired by the human cognitive planning process, we categorize the interaction history into Declarative History (environment observations) and Procedural History (action trajectories) to enhance historical reflection capability. These dual-history streams are dynamically integrated through specialized cognitive agents, enabling effective self-directed backtracking guided by working memory consolidation. Experiments show that HintNavigator achieves state-of-the-art performance among open-source LLM agents, surpassing the proprietary model Claude-3.5 Sonnet on the WebArena benchmark.
TimeMachine-bench: A Benchmark for Evaluating Model Capabilities in Repository-Level Migration Tasks
Ryo Fujii | Makoto Morishita | Kazuki Yano | Jun Suzuki
Ryo Fujii | Makoto Morishita | Kazuki Yano | Jun Suzuki
With the advancement of automated software engineering, research focus is increasingly shifting toward practical tasks reflecting the day-to-day work of software engineers. Among these tasks, software migration, a critical process of adapting code to evolving environments, has been largely overlooked. In this study, we introduce TimeMachine-bench, a benchmark designed to evaluate software migration in real-world Python projects. Our benchmark consists of GitHub repositories whose tests begin to fail in response to dependency updates. The construction process is fully automated, enabling live updates of the benchmark. Furthermore, we curated a human-verified subset to ensure problem solvability. We evaluated agent-based baselines built on top of 11 models, including both strong open-weight and state-of-the-art LLMs, on this verified subset. Our results indicated that, while LLMs show some promise for migration tasks, they continue to face substantial reliability challenges, including spurious solutions that exploit low test coverage and unnecessary edits stemming from suboptimal tool-use strategies. Our dataset and implementation are available at https://github.com/tohoku-nlp/timemachine-bench.
As language models continue to rapidly improve, we can expect their actions and reasoning to become difficult or impossible for weaker agents and humans to follow, undermining interpretability and oversight. With an eye on long-term futures, we pursue methods that encourage models to produce solutions that remain intelligible to weaker collaborators. We formalize intelligibility as handoff robustness: a strong model’s solution is intelligible to a weaker model if randomly handing off control to the weaker model along the solution path does not cause failure. Building on this criterion, we introduce tandem training for language models, a reinforcement learning (RL) paradigm in which rollout tokens are intermittently and randomly sampled from a frozen weak model rather than the strong model being trained. Because rollouts succeed only when the strong model’s actions and reasoning process can be continued by the weak model—when the two can co-construct a successful solution—optimizing standard RL objectives with tandem training implicitly incentivizes both correctness and intelligibility. In the GSM8K math reasoning task, tandem training reliably teaches models to abandon jargon and adapt their language to weaker partners while keeping task accuracy high. Our results demonstrate a promising route to building AI systems that remain auditable by weaker agents, with implications for human–AI collaboration and multi-agent communication.
Can MLLMs Find Their Way in a City? Exploring Emergent Navigation from Web-Scale Knowledge
Dwip Dalal | Utkarsh Mishra | Narendra Ahuja | Nebojsa Jojic
Dwip Dalal | Utkarsh Mishra | Narendra Ahuja | Nebojsa Jojic
Leveraging multimodal large language models (MLLMs) to develop embodied agents offers significant promise for addressing complex real-world tasks. However, current evaluation benchmarks remain predominantly language-centric or heavily reliant on simulated environments, rarely probing the nuanced, knowledge-intensive reasoning essential for practical, real-world scenarios. To bridge this critical gap, we introduce the task of Sparsely Grounded Visual Navigation, explicitly designed to evaluate the sequential decision-making abilities of MLLMs in challenging, knowledge-intensive real-world environments. We operationalize this task with a comprehensive benchmark encompassing four diverse global cities, specifically constructed to assess raw MLLM-driven agents in city navigation. Agents are required to rely solely on visual inputs and internal multimodal reasoning to sequentially navigate 50+ decision points without additional environmental annotations or specialized architectural modifications. Crucially, agents must autonomously achieve localization through interpreting city-specific cues and recognizing landmarks, perform spatial reasoning, and strategically plan and execute routes to their destinations. Through extensive evaluations, we demonstrate that current state-of-the-art MLLMs, reasoning techniques (e.g., GEPA, chain-of-thought, reflection), and the competitive baseline PReP significantly underperform in this challenging setting. To address this, we propose Verbalization of Path (VoP), which explicitly grounds the agent’s internal reasoning by probing city-scale cognitive maps (key landmarks and directions toward the destination) from the MLLM, substantially enhancing navigation success. Project Webpage: https://dwipddalal.github.io/AgentNav/
Wikontic: Constructing Wikidata-Aligned, Ontology-Aware Knowledge Graphs with Large Language Models
Alla Chepurova | Aydar Bulatov | Mikhail Burtsev | Yuri Kuratov
Alla Chepurova | Aydar Bulatov | Mikhail Burtsev | Yuri Kuratov
Knowledge graphs (KGs) provide structured, verifiable grounding for large language models (LLMs), but current LLM-based systems commonly use KGs as auxiliary structures for text retrieval, leaving their intrinsic quality underexplored. In this work, we propose Wikontic, a multi-stage pipeline that constructs KGs from open-domain texts by extracting candidate triplets with qualifiers, enforcing Wikidata-based type and relation constraints, and normalizing entities to reduce duplication. The resulting KGs are compact, ontology-consistent, and well-connected; on MuSiQue, the correct answer entity appears in 96% of generated triplets. On HotpotQA, our triplets-only setup achieves 76.0 F1, and on MuSiQue 59.8 F1, matching or surpassing several retrieval-augmented generation baselines that still require textual context. In addition, Wikontic attains state-of-the-art information-retention performance on the MINE-1 benchmark (86%), outperforming prior KG construction methods. Wikontic is also efficient at build time: KG construction uses less than 1,000 output tokens, about 3× fewer than AriGraph and <1/20 of GraphRAG. The proposed pipeline improves the quality of the generated KG and offers a scalable solution for leveraging structured knowledge in LLMs. Wikontic is available at https://github.com/screemix/Wikontic.
CAIRE: Cultural Attribution of Images with Retrieval
Arnav Yayavaram | Siddharth Yayavaram | Simran Khanuja | Michael Saxon | Graham Neubig
Arnav Yayavaram | Siddharth Yayavaram | Simran Khanuja | Michael Saxon | Graham Neubig
As text-to-image models become increasingly prevalent, ensuring their equitable performance across diverse cultural contexts is critical. Efforts to mitigate cross-cultural biases have been hampered by trade-offs, including a loss in performance, factual inaccuracies, or offensive outputs. Despite widespread recognition of these challenges, an inability to reliably measure these biases has stalled progress. To address this gap, we introduce CAIRE (https://github.com/siddharthyayavaram/CAIRE), an evaluation metric that assesses the degree of cultural relevance of an image, given a user-defined set of labels. Our framework grounds entities and concepts in the image to a knowledge base and uses factual information to give independent graded judgments for each culture label. On a manually curated dataset of culturally salient but rare items built using language models, CAIRE surpasses all baselines by 22% F1 points. Additionally, we construct two datasets for culturally universal concepts, one comprising T2I-generated outputs and another retrieved from naturally occurring data. CAIRE achieves Pearson’s correlations of 0.56 and 0.66 with human ratings on these sets, based on a 5-point Likert scale of cultural relevance. This demonstrates its strong alignment with human judgment across diverse image sources.
What Does Infect Mean to Cardio? Investigating the Role of Clinical Specialty Data in Medical LLMs
Xinlan Yan | Di Wu | Yibin Lei | Christof Monz | Iacer Calixto
Xinlan Yan | Di Wu | Yibin Lei | Christof Monz | Iacer Calixto
In this paper, we introduce S-MedQA, an English medical question-answering (QA) dataset designed for benchmarking large language models (LLMs) in fine-grained clinical specialties. S-MedQA consists of over 24k examples, covering 15 medical specialties, with QA pairs that can have multiple specialty annotations, such as when a question is cross-disciplinary. The dataset is constructed using both machine and expert verification to maximize data availability and reliability. We use S-MedQA to investigate the role of clinical specialties in the knowledge-intensive scenario of medical QA. Our results show that training on data from a clinical specialty does not necessarily lead to the best performance on that specialty. Additionally, regardless of the specialty the LLM was fine-tuned on, token probabilities of clinically relevant terms consistently increase across all specialties. Based on these findings, we hypothesize that improvement gains, at least in our settings, are derived primarily from domain shifting (e.g., general to medical) rather than from injecting specialty-specific knowledge. This suggests a need to rethink the role of fine-tuning data in the medical domain. To encourage further advancements in the clinical NLP field, we release S-MedQA along with all the code required to reproduce our experiments for the research community.
Redefining Retrieval Evaluation in the Era of LLMs
Giovanni Trappolini | Florin Cuconasu | Simone Filice | Yoelle Maarek | Fabrizio Silvestri
Giovanni Trappolini | Florin Cuconasu | Simone Filice | Yoelle Maarek | Fabrizio Silvestri
Traditional Information Retrieval (IR) metrics, such as nDCG, MAP, and MRR, assume that human users sequentially examine documents with diminishing attention to lower ranks. This assumption breaks down in Retrieval Augmented Generation (RAG) systems, where search results are consumed by Large Language Models (LLMs), which, unlike humans, process all retrieved documents as a whole rather than sequentially. Additionally, traditional IR metrics do not account for related but irrelevant documents that actively degrade generation quality, rather than merely being ignored. Due to these two major misalignments, namely human vs. machine position discount and human relevance vs. machine utility, classical IR metrics do not accurately predict RAG performance. We introduce a utility-based annotation schema that quantifies both the positive contribution of relevant passages and the negative impact of distracting ones. Building on this foundation, we propose UDCG (Utility and Distraction-aware Cumulative Gain), a metric using an LLM-oriented positional discount to directly optimize the correlation with the end-to-end answer accuracy. Experiments on five datasets and six LLMs demonstrate that UDCG improves correlation by up to 36% compared to traditional metrics. Our work provides a critical step toward aligning IR evaluation with LLM consumers and enables more reliable assessment of RAG components.
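One plausible instantiation of such a utility- and distraction-aware cumulative gain is sketched below; the exact utility scale and positional discount used in the paper may differ.

```python
# Relevant passages carry positive utility, distracting ones negative utility,
# and a flat (or otherwise LLM-oriented) discount replaces the steep rank-based
# discount of nDCG, since an LLM consumes the whole retrieved context.
def udcg(utilities, discounts=None):
    """utilities[i]: annotated utility of the passage at rank i (negative for
    distractors); discounts[i]: its positional weight."""
    if discounts is None:
        discounts = [1.0] * len(utilities)
    return sum(u * d for u, d in zip(utilities, discounts))

# One helpful passage, one distractor, one neutral passage.
print(udcg([1.0, -0.6, 0.0]))  # 0.4
```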
Debate, Deliberate, Decide (D3): A Cost-Aware Adversarial Framework for Reliable and Interpretable LLM Evaluation
Abir Harrasse | Chaithanya Bandi | Hari Bandi
Abir Harrasse | Chaithanya Bandi | Hari Bandi
The evaluation of Large Language Models (LLMs) remains challenging due to inconsistency, bias, and the absence of transparent decision criteria in automated judging. We present Debate, Deliberate, Decide (D3), a cost-aware, adversarial multi-agent framework that orchestrates structured debate among role-specialized agents (advocates, a judge, and an optional jury) to produce reliable and interpretable evaluations. D3 instantiates two complementary protocols: (1) Multi-Advocate One-Round Evaluation (MORE), which elicits k parallel defenses per answer to amplify signal via diverse advocacy, and (2) Single-Advocate Multi-Round Evaluation (SAMRE) with budgeted stopping, which iteratively refines arguments under an explicit token budget and convergence checks. We develop a probabilistic model of score gaps that (i) characterizes reliability and convergence under iterative debate and (ii) explains the separation gains from parallel advocacy. Under mild assumptions, the posterior distribution of the round-r gap concentrates around the true difference and the probability of mis-ranking vanishes; moreover, aggregating across k advocates provably increases expected score separation. We complement theory with a rigorous experimental suite across MT-Bench, AlignBench, and AUTO-J, showing state-of-the-art agreement with human judgments (accuracy and Cohen’s κ), reduced positional and verbosity biases via anonymization and role diversification, and a favorable cost-accuracy frontier enabled by budgeted stopping. Ablations and qualitative analyses isolate the contributions of debate, aggregation, and anonymity. Together, these results establish D3 as a principled, practical recipe for reliable, interpretable, and cost-aware LLM evaluation.
IYKYK: Using language models to decode extremist cryptolects
Christine de Kock | Arij Riabi | Zeerak Talat | Michael Sejr Schlichtkrull | Pranava Madhyastha | Eduard Hovy
Christine de Kock | Arij Riabi | Zeerak Talat | Michael Sejr Schlichtkrull | Pranava Madhyastha | Eduard Hovy
Extremist groups develop complex in-group language, also referred to as cryptolects, to exclude or mislead outsiders. We investigate the ability of current language technologies to detect and interpret the cryptolects of two online extremist platforms. Evaluating eight models across six tasks, our results indicate that general purpose LLMs cannot consistently detect or decode extremist language. However, performance can be significantly improved by domain adaptation and specialised prompting techniques. These results provide important insights to inform the development and deployment of automated moderation technologies. We further develop and release novel labelled and unlabelled datasets, including 19.4M posts from extremist platforms and lexicons validated by human experts.
Stop Taking Tokenizers for Granted: They Are Core Design Decisions in Large Language Models
Sawsan Alqahtani | Mir Tafseer Nayeem | Md Tahmid Rahman Laskar | Tasnim Mohiuddin | M Saiful Bari
Sawsan Alqahtani | Mir Tafseer Nayeem | Md Tahmid Rahman Laskar | Tasnim Mohiuddin | M Saiful Bari
Tokenization underlies every large language model, yet it remains an under-theorized and inconsistently designed component. Common subword approaches such as Byte Pair Encoding (BPE) offer scalability but often misalign with linguistic structure, amplify bias, and waste capacity across languages and domains. This paper reframes tokenization as a core modeling decision rather than a preprocessing step. We argue for a context-aware framework that integrates tokenizer and model co-design, guided by linguistic, domain, and deployment considerations. Standardized evaluation and transparent reporting are essential to make tokenization choices accountable and comparable. Treating tokenization as a core design problem, not a technical afterthought, can yield language technologies that are fairer, more efficient, and more adaptable.
up
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)
Vera Demberg | Kentaro Inui | Lluís Marquez
Vera Demberg | Kentaro Inui | Lluís Marquez
Investigating the Multilingual Calibration Effects of Language Model Instruction Tuning
Jerry Huang | Peng Lu | Qiuhao Zeng | Yusuke Iwasawa | Yutaka Matsuo | Sarath Chandar | Edison Marrese-Taylor | Irene Li
Jerry Huang | Peng Lu | Qiuhao Zeng | Yusuke Iwasawa | Yutaka Matsuo | Sarath Chandar | Edison Marrese-Taylor | Irene Li
Ensuring that deep learning models are well-calibrated in terms of their predictive uncertainty is essential for maintaining their trustworthiness and reliability, yet despite increasing advances in foundation model research, the relationship between large language models (LLMs) and their calibration remains an open area of research. In this work, we examine a critical gap in the calibration of LLMs within multilingual settings, aiming to better understand how data scarcity can lead to different calibration effects and how commonly used techniques apply in these settings. Our analysis on two multilingual benchmarks, covering 29 and 42 languages respectively, reveals that even in low-resource languages, model confidence can increase significantly after instruction-tuning on high-resource language SFT datasets. However, improvements in accuracy are marginal or non-existent, resulting in mis-calibration and highlighting a critical shortcoming of standard SFT in multilingual settings. Furthermore, we find label smoothing to be a reasonable method to alleviate this concern, again without any need for low-resource SFT data, maintaining better calibration across all languages. Overall, this highlights the importance of multilingual considerations when both training and tuning LLMs in order to improve their reliability and fairness in downstream use.
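For reference, the sketch below shows label smoothing applied to a next-token cross-entropy objective together with a simple expected calibration error (ECE) computation; the smoothing value, vocabulary size, and binning are illustrative.

```python
import numpy as np
import torch

# Label smoothing on the next-token objective (smoothing value illustrative).
logits = torch.randn(4, 32000)           # [batch, vocab] next-token logits
targets = torch.randint(0, 32000, (4,))  # gold next tokens
plain = torch.nn.CrossEntropyLoss()(logits, targets)
smoothed = torch.nn.CrossEntropyLoss(label_smoothing=0.1)(logits, targets)
print(plain.item(), smoothed.item())

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: |accuracy - confidence| averaged with bin weights."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 0, 1, 1]))
```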
Language changes faster than dictionaries can be revised, yet automatic tools still struggle to spot the subtle, short-term shifts in meaning that precede a formal update. We present a language-independent pipeline that detects word-sense shifts in large, time-stamped web corpora. The method couples a robust re-implementation of the Adaptive Skip-Gram model, which induces multiple sense vectors per lemma without any external inventory, with a second stage that tracks each sense through time under three alternative frequency normalizations. Linear regression and the robust Mann-Kendall/Theil-Sen estimator then test whether a sense’s frequency slope deviates significantly from zero, producing a ranked list of headwords whose semantics are drifting. We evaluate the system on the English (12 B tokens) and Czech (1 B tokens) Timestamped corpora for May 2023-May 2025. Expert annotation of the top-100 candidates for each model variant shows that 50.7% of Czech and 25.7% of English headwords exhibit genuine sense shifts, despite web-scale noise.
Logic Haystacks: Probing LLMs’ Long-Context Logical Reasoning (Without Easily Identifiable Unrelated Padding)
Damien Sileo
Damien Sileo
Large language models demonstrate promising long-context processing capabilities, with recent models touting context windows close to one million tokens. However, the evaluations supporting these claims often involve simple retrieval tasks or synthetic tasks padded with irrelevant text, which models may easily detect and discard. In this work, we generate lengthy, simplified English text with first-order logic representations spanning up to 2048 sentences (~25k GPT-4 tokens). We formulate an evaluation task with evidence retrieval for contradiction detection. The long, homogeneous text is filled with distractors that are both hard to distinguish from relevant evidence and provably non-interfering. Our evaluation of evidence retrieval reveals that the effective context window is much smaller with such realistic distractors, already crumbling at 128 sentences.
When Does Auxiliary Modality Matter in Solving Geometric Problems? A Comprehensive Study of Textual, Formal, and Visual Modalities
Hyuk Namgoong | Jeesu Jung | Yerim Han | Sangkeun Jung
Hyuk Namgoong | Jeesu Jung | Yerim Han | Sangkeun Jung
Progressive Visual Refinement for Multi-modal Summarization
Ye Xiong | Hidetaka Kamigaito | Soichiro Murakami | Peinan Zhang | Hiroya Takamura | Manabu Okumura
Ye Xiong | Hidetaka Kamigaito | Soichiro Murakami | Peinan Zhang | Hiroya Takamura | Manabu Okumura
Multi-modal summarization (MMS) has emerged as a critical research area driven by the proliferation of multimedia content, focusing on generating condensed summaries by synthesizing complementary cross-modal information. Previous studies have demonstrated the effectiveness of heterogeneous fusion paradigms, particularly through visual-centric feature extraction mechanisms, in constructing cross-modal representations that yield substantial performance gains. Nevertheless, the comprehensive utilization of multimodal information, along with the intricate interdependencies among textual content, visual elements, and the summary generation process, has remained insufficiently explored. We propose the Patch-Refined Visual Information Network (PRVIN) to address the insufficient exploitation of visual information. The essential patch selector and patch refiner components in PRVIN work collaboratively to progressively identify and refine critical visual features. An additional vision-to-summary alignment mechanism is also introduced to enhance the semantic connections between multi-modal representations and summary outputs. Extensive experiments conducted on two public MMS benchmark datasets demonstrate the superiority of PRVIN while quantitatively validating the crucial role of comprehensive visual information utilization in MMS tasks.
Benchmarking Temporal Reasoning and Alignment Across Chinese Dynasties
Zhenglin Wang | Jialong Wu | Pengfei Li | Yong Jiang | Deyu Zhou
Zhenglin Wang | Jialong Wu | Pengfei Li | Yong Jiang | Deyu Zhou
Temporal reasoning is fundamental to human cognition and is crucial for various real-world applications. While recent advances in Large Language Models have demonstrated promising capabilities in temporal reasoning, existing benchmarks primarily rely on rule-based construction, lack contextual depth, and involve a limited range of temporal entities. To address these limitations, we introduce Chinese Time Reasoning (CTM), a benchmark designed to evaluate LLMs on temporal reasoning within the extensive scope of Chinese dynastic chronology. CTM emphasizes cross-entity relationships, pairwise temporal alignment, and contextualized and culturally-grounded reasoning, providing a comprehensive evaluation. Extensive experimental results reveal the challenges posed by CTM and highlight potential avenues for improvement.
Training in Step-by-Step Formal Reasoning Improves Pronominal Reasoning in Language Models
Vagrant Gautam
Vagrant Gautam
Large reasoning models are trained to solve problems by decomposing them into steps. While they show impressive progress on reasoning tasks, "reasoning" here is typically limited to formal reasoning, i.e., math, code, and logic. An open question is whether these abilities transfer to _pronominal reasoning_, where step-by-step thinking in non-reasoning models worsens performance, but code pre-training may help. I answer this question by evaluating six pairs of original and DeepSeek-distilled models (1.5B-70B parameters) on six challenging datasets for English pronoun resolution (identifying whom a pronoun refers to) and pronoun fidelity (learning and applying a pronoun mapping correctly). Performance improves statistically significantly on all datasets (31% relative increase), indicating that distilling step-by-step formal reasoning does in fact help with pronominal reasoning, in part by improving instruction-following. With a qualitative evaluation of 720 generations, I show that improvements occur across granular error types, and come from plausible-looking reasoning chains employing a variety of reasoning strategies. However, the gains put models just above random performance on these datasets, leaving plenty of room for improvement.
When Words Wear Masks: Detecting Malicious Intents and Hostile Impacts of Online Hate Speech
Priyansh Singhal | Piyush Joshi
Priyansh Singhal | Piyush Joshi
Hate speech on social media poses significant challenges for content moderation and user safety. While various datasets exist for hate speech detection, existing approaches treat hate speech as a monolithic phenomenon, detecting hateful content by using simple categorical labels such as hate, offensive, or toxic. This approach fails to distinguish between the speaker’s underlying motivations and the content’s potential societal consequences. This paper introduces I2-Hate, a novel dataset with a dual taxonomy that separately captures Intent (why the speaker produced hate speech) and Impact (what harm it may cause to individuals and communities) of online hateful posts. This dual-taxonomy approach enables moderation systems to differentiate hateful content based on underlying motivation and potential harm, supporting more nuanced intervention strategies. We release the I2-Hate dataset and code publicly.
Lost in Activations: A Neuron-level Analysis of Encoders for Cross-Lingual Emotion Detection
Pranaydeep Singh | Orphee De Clercq | Els Lefever
Pranaydeep Singh | Orphee De Clercq | Els Lefever
The rapid advancement of multilingual pre-trained transformers has fueled significant progress in natural language understanding across diverse languages. Yet, their inner workings remain opaque, especially with regard to how individual neurons encode and generalize semantic and affective features across languages. This paper presents an interpretability study of a fine-tuned XLM-R model for multilingual emotion classification. Using neuron-level activation analysis, we investigate the variance of neurons across labels, cross-lingual alignment of activations, and the existence of “polyglot” versus language-specific neurons. Our results reveal that while certain neurons consistently encode emotion-related concepts across languages, others show strong monolingual specialization.
When learning to code, students often develop misconceptions about various programming language concepts. These can not only lead to bugs or inefficient code, but also slow down the learning of related concepts. In this paper, we introduce McMining, the task of mining programming misconceptions from samples of code from a student. To enable the training and evaluation of McMining systems, we develop an extensible benchmark dataset of misconceptions, together with a large set of code samples where these misconceptions are manifested. We then introduce two LLM-based McMiner approaches and, through extensive evaluations, show that models from the Gemini, Claude, and GPT families are effective at discovering misconceptions in student code.
Surprisal from Larger Transformer-based Language Models Predicts fMRI Data More Poorly
Yi-Chien Lin | William Schuler
Yi-Chien Lin | William Schuler
There has been considerable interest in using surprisal from Transformer-based language models (LMs) as predictors of human sentence processing difficulty. Recent work has observed an inverse scaling relationship between Transformers’ per-word estimated probability and the predictive power of their surprisal estimates on reading times, showing that LMs with more parameters and trained on more data are less predictive of human reading times. However, these studies focused on predicting latency-based measures. Tests on brain imaging data have not shown a trend in any direction when using a relatively small set of LMs, leaving open the possibility that the inverse scaling phenomenon is constrained to latency data. This study therefore conducted a more comprehensive evaluation using surprisal estimates from 17 pre-trained LMs across three different LM families on two functional magnetic resonance imaging (fMRI) datasets. Results show that the inverse scaling relationship between models’ per-word estimated probability and model fit on both datasets still obtains, resolving the inconclusive results of previous work and indicating that this trend is not specific to latency-based measures.
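The per-token surprisal estimates compared against brain responses in studies of this kind are typically computed as below; GPT-2 stands in here for the 17 LMs evaluated, and word-level surprisal would additionally sum over subword pieces.

```python
# Per-token surprisal (-log2 p) from a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def surprisals(sentence):
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    bits = -token_lp[0] / torch.log(torch.tensor(2.0))
    return list(zip(tok.convert_ids_to_tokens(ids[0, 1:].tolist()), bits.tolist()))

print(surprisals("The cat sat on the mat."))
```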
WebRollback: Enhancing Web Agents with Explicit Rollback Mechanisms
Zhisong Zhang | Tianqing Fang | Kaixin Ma | Wenhao Yu | Hongming Zhang | Haitao Mi | Dong Yu
Zhisong Zhang | Tianqing Fang | Kaixin Ma | Wenhao Yu | Hongming Zhang | Haitao Mi | Dong Yu
With recent advancements in large language models, web agents have been greatly improved. However, dealing with complex and dynamic web environments requires more advanced planning and search abilities. Previous studies usually adopt a greedy one-way search strategy, which may struggle to recover from erroneous states. In this work, we enhance web agents with an explicit rollback mechanism, enabling the agent to revert to a previous state in its navigation trajectory. This mechanism gives models the flexibility to directly control the search process, leading to an effective and efficient web navigation method. We conduct experiments on two live web navigation benchmarks with zero-shot and fine-tuning settings. The results demonstrate the effectiveness of our proposed approach.
Strongly human-correlated evaluation metrics serve as an essential compass for the development and improvement of generation models and must be highly reliable and robust. Recent embedding-based neural text evaluation metrics, such as COMET for translation tasks, are widely used in both research and development fields. However, there is no guarantee that they yield reliable evaluation results due to the black-box nature of neural networks. To raise concerns about the reliability and safety of such metrics, we propose a method for finding a single adversarial text in the discrete space that is consistently evaluated as high-quality, regardless of the test cases, to identify the vulnerabilities in evaluation metrics. The single hub text found with our method achieved 79.1 COMET% and 67.8 COMET% in the WMT’24 English-to-Japanese (En–Ja) and English-to-German (En–De) translation tasks, respectively, outperforming translations generated individually for each source sentence by using M2M100, a general translation model. Furthermore, we also confirmed that the hub text found with our method generalizes across multiple language pairs such as Ja–En and De–En.
To Paraphrase or Not: Efficient Comment Detoxification with Unsupervised Detoxifiability Discrimination
Jing Ke | Zheyong Xie | Shaosheng Cao | Tong Xu | Enhong Chen
Jing Ke | Zheyong Xie | Shaosheng Cao | Tong Xu | Enhong Chen
Mitigating toxic content is critical for maintaining a healthy social platform, yet existing detoxification systems face significant limitations: overcorrection from uniformly processing all toxic comments, and parallel data scarcity in paraphrasing model training. To tackle these challenges, we propose Detoxifiability-Aware Detoxification (DID), a novel paradigm that adaptively conducts filtering or paraphrasing for each toxic comment based on its detoxifiability, namely whether it can be paraphrased into a benign comment in essence. Specifically, DID integrates three core modules: (1) an unsupervised detoxifiability discriminator, (2) a semantic purification module that extracts harmful intents and then performs targeted paraphrasing only on detoxifiable comments, and (3) a feedback-adaptive refinement loop that processes remaining harmful contents only when they are detoxifiable. Experimental results demonstrate that DID significantly outperforms existing approaches on academic data and an industrial platform, establishing a novel and practical modeling paradigm for comment detoxification.
Evaluating the naturalness of dialogue in language models (LMs) is not trivial: notions of *naturalness* vary, and scalable quantitative metrics remain limited. This study leverages the linguistic notion of *at-issueness* to assess dialogue naturalness and introduces a new method: Divide, Generate, Recombine, and Compare (DGRC). DGRC (i) divides a dialogue as a prompt, (ii) generates continuations for subparts using LMs, (iii) recombines the dialogue and continuations, and (iv) compares the likelihoods of the recombined sequences. This approach mitigates bias in linguistic analyses of LMs and enables systematic testing of discourse-sensitive behavior. Applying DGRC, we find that LMs prefer to continue dialogue on at-issue content, with this effect enhanced in instruct-tuned models. They also reduce their at-issue preference when relevant cues (e.g., "Hey, wait a minute") are present. Although instruct-tuning does not further amplify this modulation, the pattern reflects a hallmark of successful dialogue dynamics.
Exploring Cross-Lingual Voice Conversion Methods for Anonymizing Low-Resource Text-to-Speech
Shenran Wang | Aidan Pine | Mengzhe Geng
Shenran Wang | Aidan Pine | Mengzhe Geng
We describe and compare multiple approaches for using voice conversion techniques to mask speaker identities in low-resource text-to-speech. We build and evaluate speaker-anonymized text-to-speech systems for two Canadian Indigenous languages, nêhiyawêwin and SENĆOŦEN, and show that cross-lingual speaker transfer via multilingual training with English data produces the most consistent results across both languages. Our research also underscores the need for better evaluation metrics tailored to cross-lingual voice conversion. Our code can be found at https://github.com/EveryVoiceTTS/Speaker_Anonymization_StyleTTS2
Korean Canonical Legal Benchmark: Toward Knowledge-Independent Evaluation of LLMs’ Legal Reasoning Capabilities
Hongseok Oh | Wonseok Hwang | Kyoung-Woon On
Hongseok Oh | Wonseok Hwang | Kyoung-Woon On
We introduce the Korean Canonical Legal Benchmark (KCL), a benchmark designed to assess language models’ legal reasoning capabilities independently of domain-specific knowledge. KCL provides question-level supporting precedents, enabling a more faithful disentanglement of reasoning ability from parameterized knowledge. KCL consists of two components: (1) KCL-MCQA, multiple-choice problems of 283 questions with 1,103 aligned precedents, and (2) KCL-Essay, open-ended generation problems of 169 questions with 550 aligned precedents and 2,739 instance-level rubrics for automated evaluation. Our systematic evaluation of 30+ models shows large remaining gaps, particularly in KCL-Essay, and that reasoning-specialized models consistently outperform their general-purpose counterparts. We release all resources, including the benchmark dataset and evaluation code, at https://github.com/lbox-kr/kcl.
Task-Level Instructions Induction for Audio Question Answering from Few Examples
Po-Chun Chen | Hen-Hsen Huang | Hsin-Hsi Chen
Po-Chun Chen | Hen-Hsen Huang | Hsin-Hsi Chen
Large audio-language models (LALMs) benefit from Chain-of-Thought (CoT) prompting for audio question answering (AQA), but acquiring audio CoT examples is particularly challenging as it requires sequential listening and careful integration of acoustic and linguistic information. Surprisingly, our experiments reveal that standard few-shot prompting yields inconsistent results compared to zero-shot CoT, with several models showing degraded accuracy. Moreover, few-shot prompting incurs substantially higher inference costs by processing multiple audio demonstrations per inference. We propose Audio-Induct, which induces reusable textual task instructions from few audio examples once per task, requiring no additional demonstrations at inference. Evaluated on 9 LALMs across two benchmarks, Audio-Induct outperforms state-of-the-art prompting methods while maintaining low inference costs. Inducted Task Instructions transfer effectively across models, enabling scalable deployment.
Optical Character Recognition for the International Phonetic Alphabet
Shu Okabe | Dejvi Zelo | Alexander Fraser
Shu Okabe | Dejvi Zelo | Alexander Fraser
As grammar books are increasingly used as additional reference resources, especially for very low-resource languages, a significant portion of them comes from scans and therefore relies on the quality of the Optical Character Recognition (OCR) tool. We focus here on a particular script used in linguistics to transcribe sounds: the International Phonetic Alphabet (IPA). We consider two data sources: actual grammar book PDFs for two languages under documentation, Japhug and Kagayanen, and a synthetically generated dataset based on Wiktionary. We compare two neural OCR frameworks, Tesseract and Calamari, and a recent large vision-language model, Qwen2.5-VL-7B, all three in an off-the-shelf setting and with fine-tuning. While their zero-shot performance is relatively poor for IPA characters in general due to character set mismatch, fine-tuning with the synthetic dataset leads to notable improvements.
How DDAIR you? Disambiguated Data Augmentation for Intent Recognition
Galo Castillo-López | Alexis Lombard | Nasredine Semmar | Gaël de Chalendar
Galo Castillo-López | Alexis Lombard | Nasredine Semmar | Gaël de Chalendar
Large Language Models (LLMs) are effective for data augmentation in classification tasks like intent detection. In some cases, they inadvertently produce examples that are ambiguous with regard to untargeted classes. We present DDAIR (Disambiguated Data Augmentation for Intent Recognition) to mitigate this problem. We use Sentence Transformers to detect ambiguous class-guided augmented examples generated by LLMs for intent recognition in low-resource scenarios. We identify synthetic examples that are semantically more similar to another intent than to their target one. We also provide an iterative re-generation method to mitigate such ambiguities. Our findings show that sentence embeddings effectively help to (re)generate less ambiguous examples, and suggest promising potential to improve classification performance in scenarios where intents are loosely or broadly defined.
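A minimal sketch of the ambiguity check described above, under the assumption that some sentence encoder (for example, a Sentence Transformer) has already produced an embedding for each synthetic example and a centroid per intent; the function names and the random vectors are illustrative, not the authors' implementation.

```python
# Flag an LLM-generated example as ambiguous if its embedding is closer to
# another intent's centroid than to the centroid of its target intent.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_ambiguous(example_emb: np.ndarray,
                 target_intent: str,
                 intent_centroids: dict) -> bool:
    sims = {intent: cosine(example_emb, c) for intent, c in intent_centroids.items()}
    return max(sims, key=sims.get) != target_intent   # closer to an untargeted class

# Toy usage with random vectors standing in for real sentence embeddings.
rng = np.random.default_rng(0)
centroids = {"book_flight": rng.normal(size=384), "book_hotel": rng.normal(size=384)}
print(is_ambiguous(rng.normal(size=384), "book_flight", centroids))
```

Examples flagged this way would then be re-generated, which is the iterative loop the abstract refers to.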
Measuring Linguistic Competence of LLMs on Indigenous Languages of the Americas
Justin Vasselli | Arturo Mp | Frederikus Hudi | Haruki Sakajo | Taro Watanabe
Justin Vasselli | Arturo Mp | Frederikus Hudi | Haruki Sakajo | Taro Watanabe
This paper presents an evaluation framework for probing large language models’ linguistic knowledge of Indigenous languages of the Americas using zero- and few-shot prompting. The framework consists of three tasks: (1) language identification, (2) cloze completion of Spanish sentences supported by Indigenous-language translations, and (3) grammatical feature classification. We evaluate models from five major families (GPT, Gemini, DeepSeek, Qwen, and LLaMA) on 13 Indigenous languages, including Bribri, Guarani, and Nahuatl. The results show substantial variation across both languages and model families. While a small number of model-language combinations demonstrate consistently stronger performance across tasks, many others perform near chance, highlighting persistent gaps in current models’ abilities on Indigenous languages.
Morpheme Matters: Morpheme-Based Subword Tokenization for Korean Language Models
DongHyeok Lee | Jeongyeon Park | Kyungbeen Cho | Jae Sung Lee
DongHyeok Lee | Jeongyeon Park | Kyungbeen Cho | Jae Sung Lee
Tokenization plays a crucial role in the performance of language models. However, most existing tokenizers rely on frequency-based segmentation, which fails to capture the morphological structure of languages and often leads to inefficient token representations. In this study, we propose a novel tokenization method that emphasizes the importance of Korean morphological structures in eojeol (Korean spacing unit). This method is designed to accommodate both inter-eojeol segmentation and intra-eojeol segmentation, enabling the selection of subwords based on morphemes. We pretrained a language model using the proposed method and evaluated its performance on Korean benchmark tasks. Experimental results demonstrate that the proposed method generally outperforms existing approaches. Notably, it produces significantly fewer tokens per input sequence, indicating its effectiveness and efficiency for Korean language modeling. The code is available at https://github.com/Dohy-Lee/mob.
Communication Enables Cooperation in LLM Agents: A Comparison with Curriculum-Based Approaches
Hachem Madmoun | Salem Lahlou
Hachem Madmoun | Salem Lahlou
Eliciting cooperation in multi-agent LLM systems is critical for AI alignment. We investigate two approaches: direct communication and curriculum learning. In a 4-player Stag Hunt, a one-word "cheap talk" channel increases cooperation from 0% to 48.3%, demonstrating communication as a robust coordination mechanism. In contrast, we find that curriculum learning is highly sensitive to design choices: our pedagogical curriculum through progressively complex games reduced agent payoffs by 27.4% in an Iterated Public Goods Game with Punishment. Qualitative analysis reveals that curricula emphasizing defection-equilibrium games can induce "learned pessimism" in agents. These findings suggest that for coordination problems, simple communication protocols may be more reliable than experience-based training, and that curriculum design for social dilemmas requires careful attention to the strategic lessons embedded in game sequences.
Mind Your Special Tokens! On the Importance of Dedicated Sequence-End Tokens in Vision-Language Embedding Models
Elio Musacchio | Giovanni Semeraro | Goran Glavaš
Elio Musacchio | Giovanni Semeraro | Goran Glavaš
Large Vision-Language Models (LVLMs), trained by aligning visual encoders to LLMs on extensive vision-language data, demonstrate impressive performance across a broad variety of tasks that require understanding of both visual and textual inputs. Acknowledging this, recent work proposed to post-hoc convert generative LVLMs into vision-language encoders (VLEs) via supervised contrastive learning objectives. This type of training enables LVLMs to produce better representations, i.e., embeddings for image and text input, used in retrieval and (semantic) similarity tasks. Having observed that such VLEs (i.e., LVLMs turned into encoders) commonly employ last-token pooling in downstream tasks, without using special sequence-end tokens, in this focused contribution, we study the effect of pooling strategies on VLEs’ downstream performance. We empirically show that, in contrast to mean pooling, last-token pooling (without special sequence-end tokens) makes VLEs highly sensitive to end-of-input artifacts in fine-tuning and inference data, e.g., whether input sequences end with punctuation or newline characters. Finally, we show that introducing the special end-of-sequence token removes this sensitivity and makes VLEs robust to formatting artifacts of input text.
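The pooling strategies being contrasted can be written down in a few lines. The sketch below is illustrative only (random matrices stand in for real hidden states); it shows why last-token pooling without a dedicated sequence-end token is exposed to whatever token happens to come last.

```python
# H is a (seq_len, hidden_dim) matrix of text-side hidden states.
import numpy as np

def mean_pool(H: np.ndarray) -> np.ndarray:
    return H.mean(axis=0)

def last_token_pool(H: np.ndarray) -> np.ndarray:
    # Without a special sequence-end token this is the state of whatever
    # token happens to be last (a period, a newline, ...), which is the
    # formatting sensitivity described above.
    return H[-1]

rng = np.random.default_rng(0)
H_plain = rng.normal(size=(12, 16))                        # "... final word"
H_punct = np.vstack([H_plain, rng.normal(size=(1, 16))])   # "... final word."

d_mean = np.linalg.norm(mean_pool(H_plain) - mean_pool(H_punct))
d_last = np.linalg.norm(last_token_pool(H_plain) - last_token_pool(H_punct))
print(d_mean, d_last)   # with a real encoder, the last-token gap tends to dominate
```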
The Reasoning Lingua Franca: A Double-Edged Sword for Multilingual AI
Alan Saji | Raj Dabre | Anoop Kunchukuttan | Ratish Puduppully
Alan Saji | Raj Dabre | Anoop Kunchukuttan | Ratish Puduppully
Large Reasoning Models (LRMs) achieve strong performance on mathematical, scientific, and other question-answering tasks, but their multilingual reasoning abilities remain underexplored. When presented with non-English questions, LRMs often default to reasoning in English, raising concerns about interpretability and the handling of linguistic and cultural nuances. We systematically compare an LRM’s reasoning in English versus the language of the question. Our evaluation spans two tasks: MGSM and GPQA Diamond. Beyond measuring answer accuracy, we also analyze cognitive attributes in the reasoning traces. We find that English reasoning traces exhibit a substantially higher presence of these cognitive behaviors, and that reasoning in English generally yields higher final-answer accuracy, with the performance gap increasing as tasks become more complex. However, this English-centric strategy is susceptible to a key failure mode: getting “Lost in Translation”, where translation steps lead to errors that would have been avoided by reasoning in the language of the question.
When Flores Bloomz Wrong: Cross-Direction Contamination in Machine Translation Evaluation
David Tan | Pinzhen Chen | Josef Van Genabith | Koel Dutta Chowdhury
David Tan | Pinzhen Chen | Josef Van Genabith | Koel Dutta Chowdhury
Large language models (LLMs) can be benchmark-contaminated, resulting in inflated scores that mask memorization as generalization, and in multilingual settings, this memorization can even transfer to "uncontaminated" languages. Using the FLORES-200 translation benchmark as a diagnostic, we study two 7-8B instruction-tuned multilingual LLMs: Bloomz, which was trained on FLORES, and Llama as an uncontaminated control. We confirm Bloomz’s FLORES contamination and demonstrate that machine translation contamination can be cross-directional, artificially boosting performance in unseen translation directions due to target-side memorization. Further analysis shows that recall of memorized references often persists despite various source-side perturbation efforts like paraphrasing and named entity replacement. However, replacing named entities leads to a consistent decrease in BLEU, suggesting an effective probing method for memorization in contaminated models.
SAFARI: A Community-Engaged Approach and Dataset of Stereotype Resources in the Sub-Saharan African Context
Aishwarya Verma | Laud Ammah | Olivia Nercy Ndlovu Lucas | Andrew Zaldivar | Vinodkumar Prabhakaran | Sunipa Dev
Aishwarya Verma | Laud Ammah | Olivia Nercy Ndlovu Lucas | Andrew Zaldivar | Vinodkumar Prabhakaran | Sunipa Dev
Stereotype repositories are critical to assess generative AI model safety, but currently lack adequate global coverage. It is imperative to prioritize targeted expansion, strategically addressing existing deficits, over merely increasing data volume. This work introduces a multilingual stereotype resource covering four sub-Saharan African countries that are severely underrepresented in NLP resources: Ghana, Kenya, Nigeria, and South Africa. By utilizing socioculturally-situated, community-engaged methods, including telephonic surveys moderated in native languages, we establish a reproducible methodology that is sensitive to the region’s complex linguistic diversity and traditional orality. By deliberately balancing the sample across diverse ethnic and demographic backgrounds, we ensure broad coverage, resulting in a dataset of 3,534 stereotypes in English and 3,206 stereotypes across 15 native languages.
STREAM-ZH: Simplified Topic Retrieval Exploration and Analysis Module for Chinese Language
Hongyi Li | Jianjun Lian | Anton Frederik Thielmann | Andre Python
Hongyi Li | Jianjun Lian | Anton Frederik Thielmann | Andre Python
We introduce Simplified Topic Retrieval Exploration and Analysis Module for Chinese language (STREAM-ZH), the first topic modeling package to fully support the Chinese language across a broad range of topic models, evaluation metrics, and preprocessing workflows. Tailored to both simplified and traditional Chinese, our package extends the STREAM topic modeling framework with a curated collection of preprocessed textual datasets in Chinese, from which we assess the performance of classical, neural, and clustering topic models using commonly-used intruder, diversity, and coherence metrics. The results of a benchmark analysis provide evidence that, within our framework, topic models can generate coherent and diverse topics from Chinese-language datasets, outperforming those generated by topic models using English-translated textual input. Our framework facilitates multilingual accessibility and research in topic modeling applied to Chinese textual data. The code is available at the following link: https://github.com/AnFreTh/STREAM
MiSCHiEF: A Benchmark in Minimal-Pairs of Safety and Culture for Holistic Evaluation of Fine-Grained Image-Caption Alignment
Sagarika Banerjee | Tangatar Madi | Advait Swaminathan | Jolie Nguyen | Shivank Garg | Kevin Zhu | Vasu Sharma
Sagarika Banerjee | Tangatar Madi | Advait Swaminathan | Jolie Nguyen | Shivank Garg | Kevin Zhu | Vasu Sharma
Fine-grained image-caption alignment is crucial for vision-language models (VLMs), especially in socially critical contexts such as identifying real-world risk scenarios or distinguishing cultural proxies, where correct interpretation hinges on subtle visual or linguistic clues and where minor misinterpretations can lead to significant real-world consequences. We present MiSCHiEF, a set of two benchmarking datasets (MiC and MiS) based on a contrastive pair design in the domains of safety and culture, and evaluate four VLMs on tasks requiring fine-grained differentiation of paired images and captions. In both datasets, each sample contains two minimally differing captions and corresponding minimally differing images. In MiS, the image-caption pairs depict a safe and an unsafe scenario, while in MiC, they depict cultural proxies in two distinct cultural contexts. We find that models generally perform better at confirming the correct image-caption pair than rejecting incorrect ones. Additionally, models achieve higher accuracy when selecting the correct caption from two highly similar captions for a given image, compared to the converse task. The results, overall, highlight persistent modality misalignment challenges in current VLMs, underscoring the difficulty of precise cross-modal grounding required for applications with subtle semantic and visual distinctions.
Exploring Generative Process Reward Modeling for Semi-Structured Data: A Case Study of Table Question Answering
Lei Tang | Wei Zhou | Mohsen Mesgar
Lei Tang | Wei Zhou | Mohsen Mesgar
Process reward models (PRMs) improve complex reasoning in large language models (LLMs) by grading candidate solutions step-by-step and selecting answers via aggregated step scores. While effective in domains such as mathematics, their applicability to tasks involving semi-structured data, such as table question answering (TQA), remains unexplored. TQA poses unique challenges for PRMs, including abundant irrelevant information, loosely connected reasoning steps, and domain-specific reasoning. This work presents the first systematic study of PRMs for TQA. We evaluate state-of-the-art generative PRMs on TQA from both answer and step perspectives. Results show that PRMs that combine textual and code verification can aid solution selection but struggle to generalize to out-of-domain data. Analysis reveals a weak correlation between performance in step-level verification and answer accuracy, possibly stemming from weak step dependencies and loose causal links. Our findings highlight limitations of current PRMs on TQA and offer valuable insights for building more robust, process-aware verifiers.
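For readers unfamiliar with how PRM scores are turned into an answer, the sketch below shows a generic best-of-n selection over candidate solutions with per-step scores; minimum and product aggregation are common choices, and the exact setup used in the paper may differ.

```python
# Generic PRM-guided answer selection: aggregate each candidate's step
# scores and return the answer of the highest-scoring candidate.
from math import prod

def select_answer(candidates: list, agg: str = "min") -> str:
    """candidates: [{'answer': str, 'step_scores': [float, ...]}, ...]"""
    def score(c):
        s = c["step_scores"]
        return min(s) if agg == "min" else prod(s)
    return max(candidates, key=score)["answer"]

candidates = [
    {"answer": "42", "step_scores": [0.9, 0.8, 0.7]},
    {"answer": "41", "step_scores": [0.95, 0.4, 0.9]},  # one weak reasoning step
]
print(select_answer(candidates))  # -> "42" under min-aggregation
```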
Out of Distribution, Out of Luck: Process Rewards Misguide Reasoning Models
Alexey Dontsov | Anton Korznikov | Andrey V. Galichin | Elena Tutubalina
Alexey Dontsov | Anton Korznikov | Andrey V. Galichin | Elena Tutubalina
Process Reward Models (PRMs) have emerged as a promising approach for guiding large language models (LLMs) through multi-step reasoning by providing step-level feedback during inference. However, our evaluation across 7 LLMs reveals a failure mode: while PRMs improve performance for instruct mathematical models, they fail to enhance and sometimes degrade reasoning model performance. Through systematic analysis with linear probes, we identify distinct reward prediction patterns that differentiate reasoning from non-reasoning model outputs. To understand this mechanism, we train Sparse Autoencoders on the Qwen2.5-Math-PRM and analyze reasoning features. Our analysis reveals that 80% of these features respond to formatting artifacts (whitespace patterns, Unicode tokens, punctuation) rather than mathematical content. Reasoning model outputs exhibit distinct metacognitive patterns absent from standard mathematical solutions. This explains why they lead to unreliable reward estimation. Our findings expose a fundamental limitation in applying existing reward models to reasoning systems and provide mechanistic insights into this failure mode. We release our trained SAEs to facilitate future research into reward model interpretability.
Learning to Ideate for Machine Learning Engineering Agents
Yunxiang Zhang | Kang Zhou | Zhichao Xu | Kiran Ramnath | Yun Zhou | Sangmin Woo | Haibo Ding | Lin Lee Cheong
Yunxiang Zhang | Kang Zhou | Zhichao Xu | Kiran Ramnath | Yun Zhou | Sangmin Woo | Haibo Ding | Lin Lee Cheong
Existing machine learning engineering (MLE) agents struggle to iteratively optimize their implemented algorithms for effectiveness. To address this, we introduce MLE-Ideator, a dual-agent framework that separates ideation from implementation. In our system, an implementation agent can request strategic help from a dedicated Ideator. We show this approach is effective in two ways. First, in a training-free setup, our framework significantly outperforms implementation-only agent baselines on MLE-Bench. Second, we demonstrate that the Ideator can be trained with reinforcement learning (RL) to generate more effective ideas. With only 1K training samples from 10 MLE tasks, our RL-trained Qwen3-8B Ideator achieves an 11.5% relative improvement compared to its untrained counterpart and surpasses Claude Sonnet 3.5. These results highlight a promising path toward training strategic AI systems for scientific discovery.
Infinity-MoE: Generalizing Mixture of Experts to Infinite Experts
Shota Takashiro | Takeshi Kojima | Shohei Taniguchi | Yusuke Iwasawa | Yutaka Matsuo
Shota Takashiro | Takeshi Kojima | Shohei Taniguchi | Yusuke Iwasawa | Yutaka Matsuo
The Mixture of Experts (MoE) selects a few feed-forward networks (FFNs) per token, achieving an effective trade-off between computational cost and performance. In conventional MoE, each expert is treated as entirely independent, and experts are combined in a discrete space. As a result, when the number of experts increases, it becomes difficult to train each expert effectively. To stabilize training while increasing the number of experts, we propose ∞-MoE that selects a portion of the parameters of large FFNs based on continuous values sampled for each token. By considering experts in a continuous space, this approach allows for an infinite number of experts while maintaining computational efficiency. Experiments show that a GPT-2 Small-based ∞-MoE model, with 129M active and 186M total parameters, achieves comparable performance to a dense GPT-2 Medium with 350M parameters. Adjusting the number of sampled experts at inference time allows for a flexible trade-off between accuracy and speed, with an improvement of up to 2.5% in accuracy over conventional MoE.
Beyond Tokens: Concept-Level Training Objectives for LLMs
Laya Iyer | Pranav Somani | Alice Guo | Dan Jurafsky | Chen Shani
Laya Iyer | Pranav Somani | Alice Guo | Dan Jurafsky | Chen Shani
The next-token prediction (NTP) objective has been foundational in the development of modern large language models (LLMs), driving advances in fluency and generalization. However, NTP operates at the token level, treating deviations from a single reference continuation as errors even when alternative continuations are equally plausible or semantically equivalent. As a result, token-level loss can penalize valid abstractions, paraphrases, or conceptually correct reasoning paths, biasing models toward surface form rather than underlying meaning. This mismatch between the training signal and semantic correctness motivates learning objectives that operate over higher-level representations. We propose a shift from token-level to concept-level prediction, where concepts group multiple surface forms of the same idea (e.g., "mom," "mommy," "mother" → MOTHER). We introduce various methods for integrating conceptual supervision into LLM training and show that concept-aware models achieve lower perplexity, improved robustness under domain shift, and stronger performance than NTP-based models on diverse NLP benchmarks. This suggests concept-level supervision as an improved training signal that better aligns LLMs with human semantic abstractions.
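One way to make the idea of concept-level supervision concrete is to pool token logits into concept logits and add a concept-level cross-entropy to the usual NTP loss. The sketch below is only one plausible instantiation under stated assumptions (a fixed token-to-concept mapping, max-pooling of logits, a weighting factor `lam`); the paper introduces several methods, and its exact objectives may differ.

```python
import torch
import torch.nn.functional as F

def concept_aware_loss(logits, targets, token_to_concept, n_concepts, lam=0.5):
    """logits: (batch, vocab); targets: (batch,) gold token ids;
    token_to_concept: (vocab,) concept id per token, e.g. "mom", "mommy",
    "mother" all mapping to the same MOTHER concept."""
    ntp = F.cross_entropy(logits, targets)                      # token-level term
    # Pool token logits into concept logits (max over each concept's tokens).
    concept_logits = torch.stack(
        [logits[:, token_to_concept == c].max(dim=1).values for c in range(n_concepts)],
        dim=1)
    concept = F.cross_entropy(concept_logits, token_to_concept[targets])
    return ntp + lam * concept                                  # combined objective

# Toy usage: an 8-token vocabulary grouped into 3 concepts.
token_to_concept = torch.tensor([0, 0, 1, 1, 1, 2, 2, 2])
loss = concept_aware_loss(torch.randn(4, 8), torch.randint(0, 8, (4,)), token_to_concept, 3)
print(loss.item())
```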
Persuasion Tokens for Editing Factual Knowledge in LLMs
Paul Youssef | Jörg Schlötterer | Christin Seifert
Paul Youssef | Jörg Schlötterer | Christin Seifert
In-context knowledge editing (IKE) is a promising technique for updating Large Language Models (LLMs) with new information. However, IKE relies on lengthy, fact-specific demonstrations which are costly to create and consume significant context window space. In this paper, we introduce persuasion tokens (P-Tokens) – special tokens trained to replicate the effect of IKE demonstrations, enabling efficient knowledge editing without requiring fact-specific demonstrations. We evaluate P-Tokens across two editing datasets and three LLMs, demonstrating performance comparable to, and often exceeding, IKE. We further find that editing performance is robust to distractors, with only small negative effects on neighboring facts, and that increasing the number of P-Tokens improves performance. Our work addresses key limitations of IKE and provides a more practical and scalable alternative for editing LLMs.
Language Family Matters: Evaluating SpeechLLMs Across Linguistic Boundaries
Yuchen Zhang | Ravi Shekhar | Haralambos Mouratidis
Yuchen Zhang | Ravi Shekhar | Haralambos Mouratidis
Large Language Model (LLM)-powered Automatic Speech Recognition (ASR) systems achieve strong performance with limited resources by linking a frozen speech encoder to a pretrained LLM via a lightweight connector. Prior work trains a separate connector per language, overlooking linguistic relatedness. We propose an efficient and novel connector-sharing strategy based on linguistic family membership, enabling one connector per family, and empirically validate its effectiveness across two multilingual LLMs and two real-world corpora spanning curated and crowd-sourced speech. Our results show that family-based connectors reduce parameter count while improving generalization across domains, offering a practical and scalable strategy for multilingual ASR deployment.
When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation
Xunyi Jiang | Dingyi Chang | Julian McAuley | Xin Xu
Xunyi Jiang | Dingyi Chang | Julian McAuley | Xin Xu
The rapid evolution of large language models (LLMs) and the real world has outpaced the static nature of widely used evaluation benchmarks, raising concerns about their reliability for evaluating LLM factuality. While a substantial body of work continues to rely on these popular but aging benchmarks, their temporal misalignment with real-world facts and modern LLMs, and the effects of this misalignment on LLM factuality evaluation, remain underexplored. Therefore, in this work, we present a systematic investigation of this issue by examining five popular factuality benchmarks and eight LLMs released across different years. An up-to-date fact retrieval pipeline and three metrics are tailored to quantify benchmark aging and its impact on LLM factuality evaluation. Experimental results and analysis illustrate that a considerable portion of samples in the widely used factuality benchmarks are outdated, leading to unreliable assessments of LLM factuality. We hope our work can provide a testbed to assess the reliability of a benchmark for LLM factuality evaluation and inspire more research on the benchmark aging issue.
On the Additive Compositionality of Task Vectors in Vision–Language Models
Yuting Shi | Houjing Wei | Naoya Inoue
Yuting Shi | Houjing Wei | Naoya Inoue
In-context learning (ICL) in large language models (LLMs) has been shown to operate through task vectors—the representation that summarizes the mapping induced by in-context demonstrations and can be composed by simple arithmetic operations. While this phenomenon is well studied in LLMs, its extension to vision-language models (VLMs) remains underexplored. In this work, we systematically examine the additive compositionality of in-context task vectors in VLMs, extracted from text-side hidden representations. Specifically, we construct compositional visual reasoning tasks with clearly defined subtasks and extract task vectors from few-shot demonstrations. Empirical experiments show that the vector for a complex task can be approximated by adding the vectors of its constituent subtasks. Beyond this, we analyze token-level contextual embeddings and show that additive composition arises because complex-task representations emerge as the superposition of atomic subtask components, preserving semantic structure within the model’s activation space.
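The additivity claim can be checked with simple vector arithmetic: extract the vector for each subtask and for the complex task, then compare the complex-task vector against the sum. The snippet below uses random placeholders for the task vectors; in the paper they are hidden representations extracted from few-shot demonstrations.

```python
# Check whether a complex task's vector is (approximately) the sum of its
# subtask vectors, via cosine similarity.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
t_color, t_count = rng.normal(size=4096), rng.normal(size=4096)       # subtask vectors
t_color_and_count = t_color + t_count + 0.1 * rng.normal(size=4096)   # toy "complex" vector

print(cosine(t_color_and_count, t_color + t_count))   # close to 1.0 if additive
```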
Funny or Persuasive, but Not Both: Evaluating Fine-Grained Multi-Concept Control in LLMs
Arya Labroo | Ivaxi Sheth | Vyas Raina | Amaani Ahmed | Mario Fritz
Arya Labroo | Ivaxi Sheth | Vyas Raina | Amaani Ahmed | Mario Fritz
Large Language Models (LLMs) offer strong generative capabilities, but many applications require explicit and fine-grained control over specific textual concepts, such as humor, persuasiveness, or formality. Prior approaches in prompting and representation engineering can provide coarse or single-attribute control, but systematic evaluation of multi-attribute settings remains limited. We introduce an evaluation framework for fine-grained controllability for both single- and dual-concept scenarios, focusing on linguistically distinct concept pairs (e.g., persuasiveness vs. humor). Surprisingly, across multiple LLMs and generative tasks, we find that performance often drops in the dual-concept setting, even though the chosen concepts should in principle be separable. This reveals a fundamental limitation of naive prompting-based control: models struggle with compositionality even when concepts are intuitively independent. Our framework provides systematic evidence of this gap and offers a principled approach for measuring the ability of future methods for multi-concept control.
Common Sense or Ableism? Rethinking Commonsense Reasoning Through the Lens of Disability
Karina H Halevy | Kimi Wenzel | Seyun Kim | Kyle Dean Bauer | Bruno Neira | Mona T. Diab | Maarten Sap
Karina H Halevy | Kimi Wenzel | Seyun Kim | Kyle Dean Bauer | Bruno Neira | Mona T. Diab | Maarten Sap
Machine Translation Evaluation Eng-Thai MQM Ranking Dataset
Phichet Phuangrot | Natdanai Trintawat | Kanawat Vilasri | Yanapat Patcharawiwatpong | Pachara Boonsarngsuk | Nat Pavasant | Ekapol Chuangsuwanich
Phichet Phuangrot | Natdanai Trintawat | Kanawat Vilasri | Yanapat Patcharawiwatpong | Pachara Boonsarngsuk | Nat Pavasant | Ekapol Chuangsuwanich
We introduce MEET-MR (Machine Translation English–Thai MQM and Ranking Dataset), a comprehensive benchmark for evaluating English–Thai machine translation systems. The dataset is constructed using the Multidimensional Quality Metrics (MQM) annotation framework, providing fine-grained human judgements of translation quality. In addition, MEET-MR includes human preference rankings and reference translations, enabling both absolute and relative assessment of translation quality. The dataset covers nine diverse domains providing linguistic and contextual diversity. By combining high-quality reference translations, objective MQM error annotations, and subjective preference rankings, MEET-MR serves as a valuable resource for studying translation quality estimation, model alignment with human evaluation, and cross-domain performance in English–Thai machine translation. MEET-MR is publicly available at https://huggingface.co/datasets/Chula-AI/MEET-MR
Evaluation of Deontic Conditional Reasoning in Large Language Models: The Case of Wason’s Selection Task
Hirohiko Abe | Kentaro Ozeki | Risako Ando | Takanobu Morishita | Koji Mineshima | Mitsuhiro Okada
Hirohiko Abe | Kentaro Ozeki | Risako Ando | Takanobu Morishita | Koji Mineshima | Mitsuhiro Okada
As large language models (LLMs) advance in linguistic competence, their reasoning abilities are gaining increasing attention. In humans, reasoning often performs well in domain-specific settings, particularly in normative rather than purely formal contexts. Although prior studies have compared LLM and human reasoning, the domain specificity of LLM reasoning remains underexplored. In this study, we introduce a new Wason Selection Task dataset that explicitly encodes deontic modality to systematically distinguish deontic from descriptive conditionals, and use it to examine LLMs’ conditional reasoning under deontic rules. We further analyze whether observed error patterns are better explained by confirmation bias (a tendency to seek rule-supporting evidence) or by matching bias (a tendency to ignore negation and select items that lexically match elements of the rule). Results show that, like humans, LLMs reason better with deontic rules and display matching-bias-like errors. Together, these findings suggest that the performance of LLMs varies systematically across rule types and that their error patterns can parallel well-known human biases in this paradigm.
Confidence Leaps in LLM Reasoning: Early Stopping and Cross-Model Transfer
Pavel Tikhonov | Ivan Oseledets | Elena Tutubalina
Pavel Tikhonov | Ivan Oseledets | Elena Tutubalina
We challenge the common assumption that Large Language Models (LLMs) build confidence gradually during reasoning. Instead, we find that conviction is often reached in a discrete "moment of insight", characterized by a sudden and sharp increase in an answer’s probability, a phenomenon we term a "confidence leap". Leveraging this discovery, we introduce a training-free, model-agnostic early-stopping heuristic that halts generation upon detecting such a leap, significantly reducing the generation length without sacrificing accuracy. We also demonstrate that the reasoning text leading up to this leap is semantically potent and transferable: feeding this partial reasoning to a different model family substantially boosts its performance. This suggests that the "confidence leap" marks a shared, interpretable reasoning milestone, not just a model-specific statistical artifact.
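A minimal version of the early-stopping heuristic can be phrased as watching the probability assigned to the current best answer after each reasoning step and halting on the first large single-step jump. The probability trace and jump threshold below are placeholders; the paper's exact detection rule may differ.

```python
# Training-free early stopping on a "confidence leap".
def stop_step(answer_probs, jump=0.3):
    """Return the index of the first step whose answer probability jumps by
    at least `jump` over the previous step, or None if no leap occurs."""
    for t in range(1, len(answer_probs)):
        if answer_probs[t] - answer_probs[t - 1] >= jump:
            return t
    return None

probs = [0.08, 0.10, 0.11, 0.12, 0.55, 0.58, 0.60]   # toy per-step answer probabilities
print(stop_step(probs))   # -> 4: generation could be halted here
```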
Exploring Fine-Tuning for In-Context Retrieval and Efficient KV-Caching in Long-Context Language Models
Francesco Maria Molfese | Momchil Hardalov | Rexhina Blloshmi | Bill Byrne | Adrià de Gispert
Francesco Maria Molfese | Momchil Hardalov | Rexhina Blloshmi | Bill Byrne | Adrià de Gispert
With context windows of millions of tokens, Long-Context Language Models (LCLMs) can encode entire document collections, offering a strong alternative to conventional retrieval-augmented generation (RAG). However, it remains unclear whether fine-tuning strategies can improve long-context performance and translate to greater robustness under KV-cache compression techniques. In this work, we investigate which training strategies most effectively enhance LCLMs’ ability to identify and use relevant information, as well as enhance their robustness under KV-cache compression. Our experiments show substantial in-domain improvements, achieving gains of up to +20 points over the base model. However, out-of-domain generalization remains task dependent with large variance – LCLMs excel on finance questions (+9 points), while RAG shows stronger performance on multiple-choice questions (+6 points) over the baseline models. Finally, we show that our fine-tuning approaches bring moderate improvements in robustness under KV-cache compression, with gains varying across tasks.
Post-ASR Correction in Hindi: Comparing Language Models and Large Language Models in Low-Resource Scenarios
Rishabh Kumar | Amrith Krishna | Ganesh Ramakrishnan | Preethi Jyothi
Rishabh Kumar | Amrith Krishna | Ganesh Ramakrishnan | Preethi Jyothi
Automatic Speech Recognition (ASR) systems for low-resource languages like Hindi often produce erroneous transcripts due to limited annotated data and linguistic complexity. **Post-ASR correction** using language models (LMs) and large language models (LLMs) offers a promising approach to improve transcription quality. In this work, we compare fine-tuned LMs (mT5, ByT5), fine-tuned LLMs (Nanda 10B), and instruction-tuned LLMs (GPT-4o-mini, LLaMA variants) for post-ASR correction in Hindi. Our findings reveal that **smaller, fine-tuned models** consistently **outperform larger LLMs** in both fine-tuning and in-context learning (ICL) settings. We observe a **U-shaped inverse scaling** trend under zero-shot ICL, where mid-sized LLMs degrade performance before marginal recovery at extreme scales, yet still fall short of fine-tuned models. **ByT5 is more effective for character-level corrections** such as transliteration and word segmentation, while **mT5 handles broader semantic inconsistencies**. We also identify performance drops in out-of-domain settings and propose **mitigation strategies** to preserve domain fidelity. In particular, we observe similar trends in **Marathi and Telugu**, indicating the broader applicability of our findings across low-resource Indian languages.
CHiRPE: A Step Towards Real-World Clinical NLP with Clinician-Oriented Model Explanations
Stephanie Fong | Zimu Wang | Guilherme C Oliveira | Xiangyu Zhao | Yiwen Jiang | Jiahe Liu | Beau-Luke Colton | Scott W. Woods | Martha Shenton | Barnaby Nelson | Zongyuan Ge | Dominic Dwyer
Stephanie Fong | Zimu Wang | Guilherme C Oliveira | Xiangyu Zhao | Yiwen Jiang | Jiahe Liu | Beau-Luke Colton | Scott W. Woods | Martha Shenton | Barnaby Nelson | Zongyuan Ge | Dominic Dwyer
The medical adoption of NLP tools requires interpretability by end users, yet traditional explainable AI (XAI) methods are misaligned with clinical reasoning and lack clinician input. We introduce CHiRPE (Clinical High-Risk Prediction with Explainability), an NLP pipeline that takes transcribed semi-structured clinical interviews to: (i) predict psychosis risk; and (ii) generate novel SHAP explanation formats co-developed with clinicians. Trained on 944 semi-structured interview transcripts across 24 international clinics of the AMP-SCZ study, the CHiRPE pipeline integrates symptom-domain mapping, LLM summarisation, and BERT classification. CHiRPE achieved over 90% accuracy across three BERT variants and outperformed baseline models. Explanation formats were evaluated by 28 clinical experts who indicated a strong preference for our novel concept-guided explanations, especially hybrid graph-and-text summary formats. CHiRPE demonstrates that clinically-guided model development produces both accurate and interpretable results. Our next step is focused on real-world testing across our 24 international sites.
Although state-of-the-art LLMs can solve math problems, we find that they make errors on numerical comparisons with mixed notation: “Which is larger, 5.7 × 10² or 580?” This raises a fundamental question: Do LLMs even know how big these numbers are? We probe the hidden states of several smaller open-source LLMs. A single linear projection of an appropriate hidden layer encodes the *log-magnitudes* of both kinds of numerals, allowing us to recover the numbers with relative error of about 2.3% (on restricted synthetic text) or 19.06% (on scientific papers). Furthermore, the hidden state after reading a *pair* of numerals encodes their *ranking*, with a linear classifier achieving over 90% accuracy. Yet surprisingly, when explicitly asked to rank the same pairs of numerals, these LLMs achieve only 50-70% accuracy, with worse performance for models whose probes are less effective. Finally, we show that incorporating the classifier probe’s log-loss as an auxiliary objective during finetuning brings an additional 3.22% improvement in verbalized accuracy over base models, demonstrating that improving models’ internal magnitude representations can enhance their numerical reasoning capabilities.
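The linear magnitude probe is, conceptually, a least-squares regression from hidden states to log-magnitudes. The sketch below plants a synthetic magnitude signal in random vectors purely to make the code self-contained; in the paper the features are hidden states of a chosen LM layer.

```python
# Fit a single linear projection from hidden states to log10(value) and
# report the median relative error of the recovered numbers.
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 64
values = 10 ** rng.uniform(0, 6, size=n)      # numbers between 1 and 1e6
H = rng.normal(size=(n, d))                   # stand-in for LM hidden states
H[:, 0] += np.log10(values)                   # synthetic magnitude signal

X = np.hstack([H, np.ones((n, 1))])           # add a bias column
w, *_ = np.linalg.lstsq(X, np.log10(values), rcond=None)
pred = 10 ** (X @ w)
print(f"median relative error: {np.median(np.abs(pred - values) / values):.2%}")
```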
On the Mathematical Relationship Between Layer Normalization and Dynamic Activation Functions
Felix Stollenwerk
Felix Stollenwerk
Layer normalization (LN) is an essential component of modern neural networks. While many alternative techniques have been proposed, none of them have succeeded in replacing LN so far. The latest suggestion in this line of research is a dynamic activation function called Dynamic Tanh (DyT). Although it is empirically well-motivated and appealing from a practical point of view, it lacks a theoretical foundation. In this work, we shed light on the mathematical relationship between LN and dynamic activation functions. In particular, we derive DyT from the LN variant RMSNorm, and show that a well-defined decoupling in derivative space as well as an approximation are needed to do so. By applying the same decoupling procedure directly in function space, we are able to omit the approximation and obtain the exact element-wise counterpart of RMSNorm, which we call Dynamic Inverse Square Root Unit (DyISRU). We demonstrate numerically that DyISRU reproduces the normalization effect on outliers more accurately than DyT does.
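For reference, the two operations the abstract relates are commonly defined as below. This is a restatement of the standard RMSNorm and DyT forms, not the paper's derivation, and the exact form of DyISRU is given in the paper itself.

```latex
% RMSNorm normalizes by a per-vector statistic; DyT replaces it with a
% purely element-wise squashing governed by a learnable scalar alpha.
\[
\mathrm{RMSNorm}(x)_i = \gamma_i \,\frac{x_i}{\sqrt{\tfrac{1}{d}\sum_{j=1}^{d} x_j^{2} + \epsilon}},
\qquad
\mathrm{DyT}(x)_i = \gamma_i \tanh(\alpha x_i) + \beta_i ,
\]
% where gamma and beta are learnable element-wise parameters and alpha is a
% learnable scalar.
```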
Semantic Token Clustering for Efficient Uncertainty Quantification in Large Language Models
Qi Cao | Andrew Gambardella | Takeshi Kojima | Yutaka Matsuo | Yusuke Iwasawa
Qi Cao | Andrew Gambardella | Takeshi Kojima | Yutaka Matsuo | Yusuke Iwasawa
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks. However, their limited truthfulness and tendency toward overconfidence constrain their reliability in factual tasks. Uncertainty quantification offers a promising approach to identifying potentially unreliable outputs from LLMs. Yet most existing methods rely on repeated sampling or auxiliary models, which substantially increase computational overhead. To address these limitations, we propose an efficient uncertainty quantification method that leverages semantic information inherently encoded in LLMs. Specifically, we group tokens into semantically consistent clusters based on embedding clustering and prefix matching, and compute a cluster-based score at each decoding step to represent uncertainty. Our approach requires only a single generation and does not depend on any auxiliary models. Experiments on multiple datasets and models demonstrate that our method achieves performance comparable to existing baselines while substantially reducing computational overhead.
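One way to picture the cluster-based score: at a decoding step, take the top-k candidate tokens, greedily merge candidates whose embeddings are similar, sum probabilities within each cluster, and treat one minus the largest cluster mass as the uncertainty. The greedy merging, the similarity threshold, and the omission of the prefix-matching rule below are all simplifications of the method described above, not the authors' code.

```python
import numpy as np

def cluster_uncertainty(probs, embs, sim_thresh=0.8):
    """probs: (k,) candidate-token probabilities at one decoding step;
    embs: (k, d) embeddings of the same candidate tokens."""
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    clusters = []                                  # list of [representative, mass]
    for p, e in sorted(zip(probs, embs), key=lambda t: -t[0]):
        for c in clusters:
            if c[0] @ e >= sim_thresh:             # same semantic cluster
                c[1] += p
                break
        else:
            clusters.append([e, p])
    return 1.0 - max(c[1] for c in clusters)       # low if one meaning dominates

rng = np.random.default_rng(0)
print(cluster_uncertainty(np.array([0.5, 0.3, 0.2]), rng.normal(size=(3, 8))))
```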
Becoming Experienced Judges: Selective Test-Time Learning for Evaluators
Seungyeon Jwa | Daechul Ahn | Reokyoung Kim | Dongyeop Kang | Jonghyun Choi
Seungyeon Jwa | Daechul Ahn | Reokyoung Kim | Dongyeop Kang | Jonghyun Choi
Automatic evaluation with large language models, commonly known as LLM-as-a-judge, is now standard across reasoning and alignment tasks. Despite evaluating many samples in deployment, these evaluators typically (i) treat each case independently, missing the opportunity to accumulate and reuse evaluation insights across cases, and (ii) rely on a single fixed prompt for all cases, neglecting the need for sample-specific evaluation criteria. We introduce **Learning While Evaluating** (LWE), a framework that allows evaluators to improve *sequentially* at inference time without requiring additional training or external signals. LWE maintains an evolving *meta-prompt* that (i) stores evaluation insights derived from self-generated feedback during sequential testing, and (ii) allows evaluators to leverage these insights to generate sample-specific evaluation instructions for accurate and consistent judgments. Furthermore, we propose ***Selective* LWE**, which updates the meta-prompt only on self-inconsistent cases, focusing computation where it matters most. This selective approach retains the benefits of sequential learning while being far more cost-effective. Across two pairwise comparison benchmarks, *Selective* LWE outperforms strong baselines, empirically demonstrating that evaluators can improve during sequential testing with a simple selective update—learning most from the cases they struggle with.
Statistical Foundations of DIME: Risk Estimation for Practical Index Selection
Giulio D'Erasmo | Cesare Campagnano | Antonio Mallia | Pierpaolo Brutti | Nicola Tonellotto | Fabrizio Silvestri
Giulio D'Erasmo | Cesare Campagnano | Antonio Mallia | Pierpaolo Brutti | Nicola Tonellotto | Fabrizio Silvestri
High-dimensional dense embeddings have become central to modern Information Retrieval, but many dimensions are noisy or redundant. The recently proposed DIME (Dimension IMportance Estimation) provides query-dependent scores to identify informative components of embeddings. However, DIME relies on a costly grid search to select, a priori, a single dimensionality for all query embeddings in the corpus. Our work provides a statistically grounded criterion that directly identifies the optimal set of dimensions for each query at inference time. Experiments confirm that this approach improves retrieval effectiveness and reduces embedding size by an average of 50% across different models and datasets at inference time.
up
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Danilo Croce | Jochen Leidner | Nafise Sadat Moosavi
Danilo Croce | Jochen Leidner | Nafise Sadat Moosavi
Stakeholder Suite: A Unified AI Framework for Mapping Actors, Topics and Arguments in Public Debates
Mohamed Chenene | Jeanne Rouhier | Jean Daniélou | Mihir Sarkar | Elena Cabrio
Mohamed Chenene | Jeanne Rouhier | Jean Daniélou | Mihir Sarkar | Elena Cabrio
Public debates surrounding infrastructure and energy projects involve complex networks of stakeholders, arguments, and evolving narratives. Understanding these dynamics is crucial for anticipating controversies and informing engagement strategies, yet existing tools in media intelligence largely rely on descriptive analytics with limited transparency. This paper presents **Stakeholder Suite**, a framework deployed in operational contexts for mapping actors, topics, and arguments within public debates. The system combines actor detection, topic modeling, argument extraction and stance classification in a unified pipeline. Tested on multiple energy infrastructure projects as a case study, the approach delivers fine-grained, source-grounded insights while remaining adaptable to diverse domains. The framework achieves strong retrieval precision and stance accuracy, producing arguments judged relevant in 75% of pilot use cases. Beyond quantitative metrics, the tool has proven effective for operational use: helping project teams visualize networks of influence, identify emerging controversies, and support evidence-based decision-making.
DeepPavlov Strikes Back: A Toolkit for Improving LLM Reliability and Trustworthiness
Evgenii Nikolaev | Timur Ionov | Anna Korzanova | Vasily Konovalov | Maksim Savkin
Evgenii Nikolaev | Timur Ionov | Anna Korzanova | Vasily Konovalov | Maksim Savkin
This paper introduces DeepPavlov 1.1, a new version of an open-source library for natural language processing (NLP). DeepPavlov 1.1 supports both traditional NLP tasks (such as named entity recognition and sentiment classification) and new tasks needed to enhance LLMs’ truthfulness and reliability. These tools include: a hallucination detection model, an evergreen question classifier, and a toxicity classifier. The library is easy to use, flexible, and works with many languages. It is designed to help researchers and developers build better, safer AI systems that use language. It is publicly available under the Apache 2.0 license and includes access to an interactive online demo.
PropGenie: A Multi-Agent Conversational Framework for Real Estate Assistance
Chang Shen | Shaozu Yuan | Kuizong Wu | Long Xu | Meng Chen
Chang Shen | Shaozu Yuan | Kuizong Wu | Long Xu | Meng Chen
In this paper, we present PropGenie, a novel multi-agent framework based on large language models (LLMs) to deliver comprehensive real estate assistance in real-world scenarios. PropGenie coordinates eight specialized sub-agents, each tailored for distinct tasks, including search and recommendation, question answering, financial calculations, and task execution. To enhance response accuracy and reliability, the system integrates diverse knowledge sources and advanced computational tools, leveraging structured, unstructured, and multimodal retrieval-augmented generation techniques. Experiments on real user queries show that PropGenie outperforms both a general-purpose LLM (OpenAI’s o3-mini-high) and a domain-specific chatbot (Realty AI’s Madison) in real estate scenarios. We hope that PropGenie serves as a valuable reference for future research in broader AI-driven applications.
Pro-QuEST: A Prompt-chain based Quiz Engine for testing Specialized Technical Product Knowledge
Sujatha Das Gollapalli | Mouad Hakam | Mingzhe Du | See-Kiong Ng | Mohammed Hamzeh
Sujatha Das Gollapalli | Mouad Hakam | Mingzhe Du | See-Kiong Ng | Mohammed Hamzeh
In today’s rapidly evolving large language model (LLM) landscape, technology companies such as Cisco face the difficult challenge of selecting the most suitable model for downstream tasks that demand deep, domain-specific product knowledge. Specialized benchmarks not only inform this decision making but also can be leveraged to rapidly create quizzes that can effectively train engineering and marketing personnel on novel product offerings in a continually growing Cisco product space. We present Pro-QuEST, our Prompt-chain based Quiz Engine using state-of-the-art LLMs for generating multiple-choice questions on Specialized Technical products. In Pro-QuEST, we first identify key terms and topics from a given professional certification textbook or product guide, and generate a series of multiple-choice questions using domain-knowledge guided prompts. We show LLM benchmarking results with the question benchmarks generated by Pro-QuEST using a range of the latest open-source and proprietary LLMs, and compare them with expert-created exams and review questions to derive insights on their composition and difficulty. Our experiments indicate that though there is room for improvement in Pro-QuEST to generate questions of the complexity levels seen in expert-designed certification exams, question-type based prompts provide a promising direction to address this limitation. In sample user studies with Cisco personnel, Pro-QuEST was received with high optimism for its practical usefulness in quickly compiling quizzes for self-assessment on knowledge of novel products in the rapidly changing tech sector.
elfen: A Python Package for Efficient Linguistic Feature Extraction for Natural Language Datasets
Maximilian Maurer
Maximilian Maurer
A detailed understanding of the basic properties of text collections produced by humans or generated synthetically is vital for all steps of the natural language processing system life cycle, from training to evaluating model performance and synthetic texts. To facilitate the analysis of these properties, we introduce elfen, a Python library for efficient linguistic feature extraction for text datasets. It includes the largest set of item-level linguistic features in eleven feature areas: surface-level, POS, lexical richness, readability, named entity, semantic, information-theoretic, emotion, psycholinguistic, dependency, and morphological features. Building on top of popular NLP and modern dataframe libraries, elfen enables feature extraction in various languages (80 at the moment) on thousands of items, even given limited computing resources. We show how using elfen enables linguistically informed data selection, outlier detection, and text collection comparison. We release elfen as an open-source PyPI package, accompanied by extensive documentation, including tutorials. We host the code at https://github.com/mmmaurer/elfen/, make it available through the GESIS Methods Hub at https://methodshub.gesis.org/library/methods/elfen/, and provide documentation and tutorials at https://elfen.readthedocs.io/en/latest/.
DELTA: A Toolkit for Measuring Linguistic Diversity in Dependency-Parsed Corpora
Louis Estève | Kaja Dobrovoljc
Louis Estève | Kaja Dobrovoljc
Despite growing interest in measuring linguistic diversity on the one hand and the increasing availability of cross-linguistically comparable parsed corpora on the other, tools for systematically measuring the diversity of specific linguistic phenomena on such data remain limited. To address this gap, we present DELTA, an open-source framework that integrates dependency tree querying with diversity computation, enabling systematic measurement across multiple linguistic levels (e.g., lexis, morphology, syntax) and multiple diversity dimensions (variety, balance, disparity). The pipeline processes CoNLL-U formatted corpora through configurable workflows, treating the format as a general-purpose tabular structure independent of specific annotation conventions. We validate DELTA on the Parallel Universal Dependencies multilingual dataset, demonstrating its capacity for corpus profiling and cross-corpus diversity comparison.
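The three diversity dimensions named above have standard generic formulations, sketched below from a frequency table and a pairwise distance function. This is not DELTA's implementation (its CoNLL-U querying layer and configuration are not reproduced), only an illustration of what variety, balance, and disparity measure.

```python
import numpy as np

def diversity(counts, dist):
    """counts: {type: frequency}; dist: {(type_a, type_b): distance}.
    Assumes at least two types."""
    types = list(counts)
    p = np.array([counts[t] for t in types], dtype=float)
    p /= p.sum()
    variety = len(types)                                           # number of distinct types
    balance = float(-(p * np.log(p)).sum() / np.log(len(types)))   # normalized entropy
    pairs = [(a, b) for i, a in enumerate(types) for b in types[i + 1:]]
    disparity = float(np.mean([dist[(a, b)] for a, b in pairs]))   # mean pairwise distance
    return variety, balance, disparity

counts = {"nsubj": 40, "obj": 30, "obl": 10}   # toy dependency-relation counts
dist = {("nsubj", "obj"): 0.3, ("nsubj", "obl"): 0.7, ("obj", "obl"): 0.5}
print(diversity(counts, dist))
```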
CLARIESG: An End-to-End System for ESG Analysis over Complex Tables in Corporate Reports
Marta Santacroce | Michele Luca Contalbo | Sara Pederzoli | Riccardo Benassi | Venturelli Valeria | Matteo Paganelli | Francesco Guerra
Marta Santacroce | Michele Luca Contalbo | Sara Pederzoli | Riccardo Benassi | Venturelli Valeria | Matteo Paganelli | Francesco Guerra
Sustainability reports contain rich Environmental, Social and Governance (ESG) information, but their heterogeneous layouts and complex multi-table structures pose major challenges for LLMs, especially for unit normalization, cross-document reasoning, and precise numerical computation. We present CLARIESG, an end-to-end system that couples robust table extraction with a structured prompting framework for multi-table filtering, normalization, and program-of-thought reasoning. On ESG-focused multi-table benchmarks, CLARIESG consistently outperforms standard prompting and provides transparent, auditable reasoning, supporting more reliable ESG analysis and greenwashing detection in real-world settings.
Fact Finder - Enhancing Domain Expertise of Large Language Models by Incorporating Knowledge Graphs
Daniel Steinigen | Roman Teucher | Timm Heine Ruland | Max Rudat | Nicolas Flores-Herr | Peter Fischer | Nikola Milosevic | Christopher Schymura | Angelo Ziletti
Daniel Steinigen | Roman Teucher | Timm Heine Ruland | Max Rudat | Nicolas Flores-Herr | Peter Fischer | Nikola Milosevic | Christopher Schymura | Angelo Ziletti
Recent advancements in Large Language Models (LLMs) have showcased their proficiency in answering natural language queries. However, their effectiveness is hindered by limited domain-specific knowledge, raising concerns about the reliability of their responses. We introduce a hybrid system that augments LLMs with domain-specific knowledge graphs (KGs), thereby aiming to enhance factual correctness using a KG-based retrieval approach. We focus on a medical KG to demonstrate our methodology, which includes (1) pre-processing, (2) Cypher query generation, (3) Cypher query processing, (4) KG retrieval, and (5) LLM-enhanced response generation. We evaluate our system on a curated dataset of 69 samples, achieving a precision of 78% in retrieving correct KG nodes. Our findings indicate that the hybrid system surpasses a standalone LLM in accuracy and completeness, as verified by an LLM-as-a-Judge evaluation method. This positions the system as a promising tool for applications that demand factual correctness and completeness, such as target identification — a critical process in pinpointing biological entities for disease treatment or crop enhancement. Moreover, its intuitive search interface and ability to provide accurate responses within seconds make it well-suited for time-sensitive, precision-focused research contexts. We publish the source code together with the dataset and the prompt templates used.
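Steps (2)-(4) of the pipeline can be pictured as: an LLM drafts a Cypher query for the user's question, the query runs against the knowledge graph, and the retrieved nodes ground the final answer. Everything in the sketch below is hypothetical: the node labels and relationship names, the `generate_cypher` stand-in for the LLM step, and the connection details; it is not the released Fact Finder code.

```python
from neo4j import GraphDatabase

def generate_cypher(question: str) -> str:
    # Stand-in for the LLM-based Cypher generation step (2); a real system
    # would prompt an LLM with the KG schema and the question.
    return ("MATCH (d:Disease {name: $disease})<-[:TREATS]-(dr:Drug) "
            "RETURN dr.name AS drug")

question = "Which drugs treat hypertension?"
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:              # steps (3)-(4): run query, retrieve nodes
    records = session.run(generate_cypher(question), disease="hypertension")
    facts = [r["drug"] for r in records]
print(facts)   # step (5): these facts would ground the LLM's final response
```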
While mechanistic interpretability has developed powerful tools to analyze the internal workings of Large Language Models (LLMs), their complexity has created an accessibility gap, limiting their use to specialists. We address this challenge by designing, building, and evaluating ELIA (Explainable Language Interpretability Analysis), an interactive web application that simplifies the outcomes of various language model component analyses for a broader audience. The system integrates three key techniques – Attribution Analysis, Function Vector Analysis, and Circuit Tracing – and introduces a novel methodology: using a vision-language model to automatically generate natural language explanations (NLEs) for the complex visualizations produced by these methods. The effectiveness of this approach was empirically validated through a mixed-methods user study, which revealed a clear preference for interactive, explorable interfaces over simpler, static visualizations. A key finding was that the AI-powered explanations successfully bridged the knowledge gap for non-experts; a statistical analysis showed no significant correlation between a user’s prior LLM experience and their comprehension scores, indicating that the system effectively leveled the playing field. We conclude that an AI system can indeed simplify complex model analyses, but its true power is unlocked when paired with thoughtful, user-centered design that prioritizes interactivity, specificity, and narrative guidance.
IntelliCode: A Multi-Agent LLM Tutoring System with Centralized Learner Modeling
Jones David | Shreya Ghosh
Jones David | Shreya Ghosh
LLM-based tutors are typically single-turn assistants that lack persistent representations of learner knowledge, making it difficult to provide principled, transparent, and long-term pedagogical support. We introduce IntelliCode, a multi-agent LLM tutoring system built around a centralized, versioned learner state that integrates mastery estimates, misconceptions, review schedules, and engagement signals. A StateGraph Orchestrator coordinates six specialized agents: skill assessment, learner profiling, graduated hinting, curriculum selection, spaced repetition, and engagement monitoring, each operating as a pure transformation over the shared state under a single-writer policy. This architecture enables auditable mastery updates, proficiency-aware hints, dependency-aware curriculum adaptation, and safety-aligned prompting. The demo showcases an end-to-end tutoring workflow: a learner attempts a DSA problem, receives a conceptual hint when stuck, submits a corrected solution, and immediately sees mastery updates and a personalized review interval. We report validation results with simulated learners, showing stable state updates, improved task success with graduated hints, and diverse curriculum coverage. IntelliCode demonstrates how persistent learner modeling, orchestrated multi-agent reasoning, and principled instructional design can be combined to produce transparent and reliable LLM-driven tutoring.
FiMMIA: scaling semantic perturbation-based membership inference across modalities
Anton Emelyanov | Sergei Kudriashov | Alena Fenogenova
Anton Emelyanov | Sergei Kudriashov | Alena Fenogenova
Membership Inference Attacks (MIAs) aim to determine whether a specific data point was included in the training set of a target model. Although numerous methods have been developed for detecting data contamination in large language models (LLMs), their performance on multimodal LLMs (MLLMs) falls short due to the instabilities introduced through multimodal component adaptation and possible distribution shifts across multiple inputs. In this work, we investigate multimodal membership inference and address two issues: first, by identifying distribution shifts in the existing datasets, and second, by releasing an extended baseline pipeline to detect them. We also generalize the perturbation-based membership inference methods to MLLMs and release FiMMIA — a modular Framework for Multimodal MIA. We propose to train a neural network to analyze the target model’s behavior on perturbed inputs, capturing interactions between semantic domains and loss values on members and non-members in the local neighborhood of each sample. Comprehensive evaluations on various fine-tuned multimodal models demonstrate the effectiveness of our perturbation-based membership inference attacks in multimodal settings.
A Browser-based Open Source Assistant for Multimodal Content Verification
Rosanna Milner | Michael Foster | Olesya Razuvayevskaya | Valentin Porcellini | Denis Teyssou | Ian Roberts | Kalina Bontcheva
Rosanna Milner | Michael Foster | Olesya Razuvayevskaya | Valentin Porcellini | Denis Teyssou | Ian Roberts | Kalina Bontcheva
Disinformation and advanced generative AI content pose a significant challenge for journalists and fact-checkers who must rapidly verify digital media. While many NLP models exist for detecting signals like persuasion techniques, subjectivity, and AI-generated text, they often remain inaccessible to non-expert users and are not integrated into their daily workflows as a unified framework. This paper demonstrates the Verification Assistant, a browser-based tool designed to bridge this gap. The Verification Assistant, a core component of the widely adopted Verification Plugin (140,000+ users), allows users to submit URLs or media files to a unified interface. It automatically extracts content and routes it to a suite of backend NLP classifiers, presenting actionable credibility signals, AI-generation likelihood, and other verification advice in an easy-to-digest format. This paper will showcase the tool’s architecture, its integration of multiple NLP services, and its real-world application for detecting disinformation.
Infherno: End-to-end Agent-based FHIR Resource Synthesis from Free-form Clinical Notes
Johann Frei | Nils Feldhus | Lisa Raithel | Roland Roller | Alexander Meyer | Frank Kramer
Johann Frei | Nils Feldhus | Lisa Raithel | Roland Roller | Alexander Meyer | Frank Kramer
For clinical data integration and healthcare services, the HL7 FHIR standard has established itself as a desirable format for interoperability between complex health data. Previous attempts at automating the translation from free-form clinical notes into structured FHIR resources address narrowly defined tasks and rely on modular approaches or LLMs with instruction tuning and constrained decoding. As those solutions frequently suffer from limited generalizability and structural non-conformity, we propose an end-to-end framework powered by LLM agents, code execution, and healthcare terminology database tools to address these issues. Our solution, called Infherno, is designed to adhere to the FHIR document schema and competes well with a human baseline in predicting FHIR resources from unstructured text. The implementation features a front end for custom and synthetic data and both local and proprietary models, supporting clinical data integration processes and interoperability across institutions. Gemini 2.5-Pro excels in our evaluation on synthetic and clinical datasets, yet ambiguity and feasibility of collecting ground-truth data remain open problems.
BOOM: Beyond Only One Modality KIT’s Multimodal Multilingual Lecture Companion
Sai Koneru | Fabian Retkowski | Christian Huber | Lukas Hilgert | Seymanur Akti | Enes Yavuz Ugan | Alexander Waibel | Jan Niehues
Sai Koneru | Fabian Retkowski | Christian Huber | Lukas Hilgert | Seymanur Akti | Enes Yavuz Ugan | Alexander Waibel | Jan Niehues
The globalization of education and rapid growth of online learning have made localizing educational content a critical challenge. Lecture materials are inherently multimodal, combining spoken audio with visual slides, which requires systems capable of processing multiple input modalities. To provide an accessible and complete learning experience, translations must preserve all modalities: text for reading, slides for visual understanding, and speech for auditory learning. We present BOOM, a multimodal multilingual lecture companion that jointly translates lecture audio and slides to produce synchronized outputs across three modalities: translated text, localized slides with preserved visual elements, and synthesized speech. This end-to-end approach enables students to access lectures in their native language while aiming to preserve the original content in its entirety. Our experiments demonstrate that slide-aware transcripts also yield cascading benefits for downstream tasks such as summarization and question answering. We release our Slide Translation code at https://github.com/saikoneru/image-translator and integrate it into the Lecture Translator at https://gitlab.kit.edu/kit/isl-ai4lt/lt-middleware/ltpipeline (all released code and models are licensed under the MIT License).
PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models
Robert Belanec | Ivan Srba | Maria Bielikova
Robert Belanec | Ivan Srba | Maria Bielikova
Parameter-Efficient Fine-Tuning (PEFT) methods address the increasing size of Large Language Models (LLMs). Currently, many newly introduced PEFT methods are challenging to replicate, deploy, or compare with one another. To address this, we introduce PEFT-Factory, a unified framework for efficiently fine-tuning LLMs using both off-the-shelf and custom PEFT methods. While its modular design supports extensibility, it natively provides a representative set of 19 PEFT methods, 27 classification and text generation datasets addressing 12 tasks, and both standard and PEFT-specific evaluation metrics. As a result, PEFT-Factory provides a ready-to-use, controlled, and stable environment, improving replicability and benchmarking of PEFT methods. PEFT-Factory is a downstream framework that originates from the popular LLaMA-Factory, and is publicly available at https://github.com/kinit-sk/PEFT-Factory.
Similar, but why? A Toolkit for Explaining Text Similarity
Juri Opitz | Andrianos Michail | Lucas Moeller | Sebastian Padó | Simon Clematide
Juri Opitz | Andrianos Michail | Lucas Moeller | Sebastian Padó | Simon Clematide
Explaining text similarity and developing interpretable models are emerging research challenges (Opitz et al., 2025). We release XPLAINSIM, a Python package that unifies three complementary approaches for explaining textual similarity in an easily accessible way: 1. a token attribution method that explains how individual word interactions contribute to the predicted similarity of any embedding model; 2. a method for inferring structured neural embedding spaces that capture explainable aspects of text, and 3. a symbolic approach that explains textual similarity transparently through parsed meaning representations. We demonstrate the value of our package through intuitive examples and three focused empirical research studies. The first study evaluates interpretability methods for constructing cross-lingual token alignments. The second investigates how modern information retrieval methods handle stop words. The third sheds more light on a long-standing question in computational linguistics: the distinction between relatedness and similarity. XPLAINSIM is available at https://github.com/flipz357/XPLAINSIM.
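For intuition only, a crude token-level similarity attribution can be sketched as pairwise interactions between contextual token embeddings; this is a simplified stand-in under assumed inputs, not the attribution method implemented in XPLAINSIM.

```python
# Simplified token-interaction attribution sketch (not XPLAINSIM's method).
import numpy as np

def token_interaction_matrix(tok_embs_a, tok_embs_b):
    """tok_embs_*: [tokens, dim] contextual embeddings of the two texts;
    entry (i, j) approximates how much token pair (i, j) contributes."""
    a = tok_embs_a / np.linalg.norm(tok_embs_a, axis=1, keepdims=True)
    b = tok_embs_b / np.linalg.norm(tok_embs_b, axis=1, keepdims=True)
    return a @ b.T

def top_pairs(matrix, tokens_a, tokens_b, k=5):
    """Report the k strongest token-pair interactions as an explanation."""
    flat = np.argsort(-matrix, axis=None)[:k]
    idx = np.column_stack(np.unravel_index(flat, matrix.shape))
    return [(tokens_a[i], tokens_b[j], float(matrix[i, j])) for i, j in idx]
```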
AlignFix: A Tool for Parallel Corpora Augmentation and Refinement
Samuel Frontull | Simon Haller-Seeber
Samuel Frontull | Simon Haller-Seeber
High-quality datasets are crucial for training effective state-of-the-art machine translation systems. However, due to the data-intensive nature of these systems, they have to be trained on large amounts of text that can easily go beyond the scope of full human inspection. This makes the presence of noise that can degrade overall system performance a frequent and significant issue. While various approaches have been developed to identify and select only the highest-quality training examples, this is undesirable in scenarios where resources are limited. For this reason, we introduce AlignFix, an open-source tool for augmenting data and for identifying and correcting errors in parallel corpora. Leveraging word alignments, AlignFix extracts consistent phrase pairs, enabling targeted replacements that can improve the dataset quality. Besides targeted replacements, the tool enables contextual augmentation by duplicating sentences and allowing users to substitute words with alternatives of their choice. The tool maintains and updates the underlying word alignments, thereby avoiding costly recomputation. AlignFix runs locally in the browser, requires no installation, and ensures that all data remains entirely on the client side. It is released under the Apache 2.0 license, encouraging broad adoption, reuse, and further development. A live demo is available at https://ifi-alignfix.uibk.ac.at.
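The phrase-pair extraction step can be sketched with the classic consistency criterion from phrase-based MT (a simplified illustration under assumed inputs, not the AlignFix code): a source span and its aligned target span form a pair only if no alignment link crosses the span boundaries.

```python
# Simplified consistent phrase-pair extraction from word alignments.
def extract_consistent_phrases(src_len, alignment, max_len=4):
    """alignment: set of (src_idx, tgt_idx) word-alignment links."""
    pairs = []
    for s1 in range(src_len):
        for s2 in range(s1, min(s1 + max_len, src_len)):
            linked_t = [t for (s, t) in alignment if s1 <= s <= s2]
            if not linked_t:
                continue
            t1, t2 = min(linked_t), max(linked_t)
            # Consistency: every target word in [t1, t2] aligns inside [s1, s2].
            if all(s1 <= s <= s2 for (s, t) in alignment if t1 <= t <= t2) \
                    and (t2 - t1) < max_len:
                pairs.append(((s1, s2), (t1, t2)))
    return pairs

# Example: "the house" <-> "das Haus" with links {(0, 0), (1, 1)}
print(extract_consistent_phrases(2, {(0, 0), (1, 1)}))
```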
PromptLab: A Collaborative Platform for Prompt Engineering and Dataset Curation
Maged S. Al-shaibani | Zaid Alyafeai | Dania Refai | Nawaf Alomari | Ahmed Ashraf | Mais Alheraki | Mustafa Alturki | Hamzah Luqman | Irfan Ahmad
Maged S. Al-shaibani | Zaid Alyafeai | Dania Refai | Nawaf Alomari | Ahmed Ashraf | Mais Alheraki | Mustafa Alturki | Hamzah Luqman | Irfan Ahmad
PromptLab is a web-based platform for collaborative prompt engineering across diverse natural language processing tasks and datasets. The platform addresses primary challenges in prompt development, including template creation, collaborative review, and quality assurance, through a comprehensive workflow that supports both individual researchers and team-based projects. PromptLab integrates with HuggingFace, provides AI-assisted prompt generation via OpenRouter (https://openrouter.ai/), and supports real-time validation with multiple Large Language Models (LLMs). The platform features a flexible templating system using Jinja2, role-based project management, and peer review processes, and supports programmatic access through RESTful APIs. To ensure data privacy and support sensitive research environments, PromptLab includes an easy CI/CD pipeline for self-hosted deployments and institutional control. We demonstrate the platform’s effectiveness through two evaluations: a controlled comparison study with six researchers across five benchmark datasets and 13 models with 90 prompts; and a comprehensive case study in instruction tuning research, where over 350 prompts across 80+ datasets have been developed and validated by multiple team members. The platform is available at https://promptlab.up.railway.app and the source code is available on GitHub at https://github.com/KFUPM-JRCAI/PromptLab.
LLM BiasScope: A Real-Time Bias Analysis Platform for Comparative LLM Evaluation
Himel Ghosh | Nick Elias Werner
Himel Ghosh | Nick Elias Werner
As large language models (LLMs) are deployed widely, detecting and understanding bias in their outputs is critical. We present LLM BiasScope, a web application for side-by-side comparison of LLM outputs with real-time bias analysis. The system supports multiple providers (Google Gemini, DeepSeek, MiniMax, Mistral, Meituan, Meta Llama) and enables researchers and practitioners to compare models on the same prompts while analyzing bias patterns. LLM BiasScope uses a two-stage bias detection pipeline: sentence-level bias detection followed by bias type classification for biased sentences. The analysis runs automatically on both user prompts and model responses, providing statistics, visualizations, and detailed breakdowns of bias types. The interface displays two models side-by-side with synchronized streaming responses, per-model bias summaries, and a comparison view highlighting differences in bias distributions. The system is built on Next.js with React, integrates Hugging Face inference endpoints for bias detection, and uses the Vercel AI SDK for multi-provider LLM access. Features include real-time streaming, export to JSON/PDF, and interactive visualizations (bar charts, radar charts) for bias analysis. LLM BiasScope is available as an open-source web application, providing a practical tool for bias evaluation and comparative analysis of LLM behaviour.
InkSight: Towards AI-Aided Historical Manuscript Analysis
Andrey Sakhovskiy | Ivan Ulitin | Emilia Bojarskaja | Vladimir Kokh | Ruslan Murtazin | Maxim Novopoltsev | Semen Budennyy
Andrey Sakhovskiy | Ivan Ulitin | Emilia Bojarskaja | Vladimir Kokh | Ruslan Murtazin | Maxim Novopoltsev | Semen Budennyy
Large-scale scientific research on historical documents — particularly medieval Arabic manuscripts — remains challenging due to the need for advanced paleographic and linguistic training, the large volume of hand-written materials, and the absence of assisting software. In this paper, we propose InkSight, the first end-to-end Arabic manuscript analysis tool for manuscript-based analytics and research hypothesis testing. InkSight integrates three key components: (i) an Optical Character Recognition (OCR) module utilizing a Large Visual Language Model (LVLM); (ii) a lightweight document indexing and information retrieval module that enables query-based evidence retrieval from book-length manuscripts; and (iii) a flexible Large Language Model (LLM) prompting interface factually grounded to the given manuscript via Retrieval-Augmented Generation (RAG). Empirical evaluation on the existing KITAB OCR benchmark and our in-house dataset of ancient Arabic manuscripts has revealed that historical research can be effectively supported using smaller fine-tuned LVLMs without relying on larger proprietary models. The live web demo for InkSight is available freely at https://inksight.ru and the source code is publicly available on GitHub at https://github.com/ds-hub-sochi/InkSight-tool.
promptolution: A Unified, Modular Framework for Prompt Optimization
Tom Zehle | Timo Heiß | Moritz Schlager | Matthias Aßenmacher | Matthias Feurer
Tom Zehle | Timo Heiß | Moritz Schlager | Matthias Aßenmacher | Matthias Feurer
Prompt optimization has become crucial for enhancing the performance of large language models (LLMs) across a broad range of tasks. Although many research papers demonstrate its effectiveness, practical adoption is hindered because existing implementations are often tied to unmaintained, isolated research codebases or require invasive integration into application frameworks. To address this, we introduce promptolution, a unified, modular open-source framework that provides all components required for prompt optimization within a single extensible system for both practitioners and researchers. It integrates multiple contemporary discrete prompt optimizers, supports systematic and reproducible benchmarking, and returns framework-agnostic prompt strings, enabling seamless integration into existing LLM pipelines while remaining agnostic to the underlying model implementation.
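A discrete prompt-optimization loop of the kind such frameworks wrap can be sketched as follows; score_fn and mutate_fn are hypothetical user-supplied callables, and this is not promptolution's actual API.

```python
# Generic evolutionary prompt-optimization loop (illustrative sketch only).
def optimize_prompt(seed_prompts, score_fn, mutate_fn, rounds=5, keep=4, children=3):
    """score_fn(prompt) evaluates a prompt on a dev set; mutate_fn(prompt)
    asks an LLM for an edited / paraphrased variant."""
    population = list(seed_prompts)
    for _ in range(rounds):
        survivors = sorted(population, key=score_fn, reverse=True)[:keep]
        population = list(survivors)
        for prompt in survivors:
            population += [mutate_fn(prompt) for _ in range(children)]
    return max(population, key=score_fn)   # plain, framework-agnostic string
```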
T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground
Dmitrii Stoianov | Danil Taranets | Olga Tsymboi | Ramil Latypov | Almaz Dautov | Vladislav Kruglikov | Surkov Nikita | German Abramov | Pavel Gein | Dmitry Abulkhanov | Mikhail Gashkov | Viktor Zelenkovskiy | Artem Batalov | Aleksandr Medvedev | Anatolii Potapov
Dmitrii Stoianov | Danil Taranets | Olga Tsymboi | Ramil Latypov | Almaz Dautov | Vladislav Kruglikov | Surkov Nikita | German Abramov | Pavel Gein | Dmitry Abulkhanov | Mikhail Gashkov | Viktor Zelenkovskiy | Artem Batalov | Aleksandr Medvedev | Anatolii Potapov
We introduce T-pro 2.0, an open-weight Russian LLM for hybrid reasoning and efficient inference. The model supports direct answering and reasoning-trace generation, using a Cyrillic-dense tokenizer and an adapted EAGLE speculative-decoding pipeline to reduce latency. To enable reproducible and extensible research, we release the model weights, the T-Wix 500k instruction corpus, the T-Math reasoning benchmark, and the EAGLE weights on HuggingFace. These resources allow users to study Russian-language reasoning and to extend or adapt both the model and the inference pipeline. A public web demo exposes reasoning and non-reasoning modes and illustrates the speedups achieved by our inference stack across domains. T-pro 2.0 thus serves as an accessible open system for building and evaluating efficient, practical Russian LLM applications. Demo: https://t-pro2eagle.streamlit.app/ ; resources: https://huggingface.co/collections/t-tech/t-pro-20
SDialog: A Python Toolkit for End-to-End Agent Building, User Simulation, Dialog Generation, and Evaluation
Sergio Burdisso | Séverin Baroudi | Yanis Labrak | David Grünert | Pawel Cyrta | Yiyang Chen | Srikanth Madikeri | Esaú Villatoro-tello | Ricard Marxer | Petr Motlicek
Sergio Burdisso | Séverin Baroudi | Yanis Labrak | David Grünert | Pawel Cyrta | Yiyang Chen | Srikanth Madikeri | Esaú Villatoro-tello | Ricard Marxer | Petr Motlicek
We present SDialog, an MIT-licensed open-source Python toolkit for end-to-end development, simulation, evaluation, and analysis of LLM-based conversational agents. Built around a standardized Dialog representation, SDialog unifies persona-driven multi-agent simulation with composable orchestration for controlled synthetic dialog generation; multi-layer evaluation combining linguistic metrics, LLM-as-a-judge assessments, and functional correctness validators; mechanistic interpretability tools for activation inspection and causal behavior steering via feature ablation and induction; and audio rendering with full acoustic simulation, including 3D room modeling and microphone effects. The toolkit integrates with major LLM backends under a consistent API, enabling mixed-backend and reproducible experiments. By bridging agent construction, user simulation, dialog generation, evaluation, and interpretability within a single coherent workflow, SDialog enables more controlled, transparent, and systematic research on conversational systems.
Agentic AI for Human Resources: LLM-Driven Candidate Assessment
Kamer Ali Yuksel | Abdul Basit Anees | Ashraf Hatim Elneima | Sanjika Hewavitharana | Mohamed Al-Badrashiny | Hassan Sawaf
Kamer Ali Yuksel | Abdul Basit Anees | Ashraf Hatim Elneima | Sanjika Hewavitharana | Mohamed Al-Badrashiny | Hassan Sawaf
In this work, we present a modular and interpretable framework that uses Large Language Models (LLMs) to automate candidate assessment in recruitment. The system integrates diverse sources—including job descriptions, CVs, interview transcripts, and HR feedback—to generate structured evaluation reports that mirror expert judgment. Unlike traditional ATS tools that rely on keyword matching or shallow scoring, our approach employs role-specific, LLM-generated rubrics and a multi-agent architecture to perform fine-grained, criteria-driven evaluations. The framework outputs detailed assessment reports, candidate comparisons, and ranked recommendations that are transparent, auditable, and suitable for real-world hiring workflows. Beyond rubric-based analysis, we introduce an LLM-Driven Active Listwise Tournament mechanism for candidate ranking. Instead of noisy pairwise comparisons or inconsistent independent scoring, the LLM ranks small candidate subsets (“mini-tournaments”), and these listwise permutations are aggregated using a Plackett–Luce model. An active-learning loop selects the most informative subsets, producing globally coherent and sample-efficient rankings. This adaptation of listwise LLM preference modeling—previously explored in financial asset ranking—provides a principled and highly interpretable methodology for large-scale candidate ranking in talent acquisition.
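The aggregation step can be made concrete with a small Plackett–Luce fit over the collected mini-tournament orderings; this gradient-ascent sketch is illustrative and not the paper's implementation.

```python
# Plackett-Luce aggregation of listwise rankings (illustrative sketch).
import numpy as np
from scipy.special import logsumexp

def fit_plackett_luce(rankings, n_items, lr=0.1, steps=500):
    """rankings: list of orderings (candidate indices, best first) from
    LLM mini-tournaments; returns per-candidate utilities theta."""
    theta = np.zeros(n_items)
    for _ in range(steps):
        grad = np.zeros(n_items)
        for order in rankings:
            for i, winner in enumerate(order):
                rest = order[i:]
                probs = np.exp(theta[rest] - logsumexp(theta[rest]))
                grad[winner] += 1.0      # observed "win" at this position
                grad[rest] -= probs      # expected wins under the current model
        theta += lr * grad / max(len(rankings), 1)
    return theta   # sort candidates by theta for a globally coherent ranking
```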
We introduce Trove, an easy-to-use open-source retrieval toolkit that simplifies research experiments without sacrificing flexibility or speed. For the first time, we introduce efficient data management features that load and process (filter, select, transform, and combine) retrieval datasets on the fly, with just a few lines of code. This gives users the flexibility to easily experiment with different dataset configurations without the need to compute and store multiple copies of large datasets. Trove is highly customizable: in addition to many built-in options, it allows users to freely modify existing components or replace them entirely with user-defined objects. It also provides a low-code and unified pipeline for evaluation and hard negative mining, which supports multi-node execution without any code changes. Trove’s data management features reduce memory consumption by a factor of 2.6. Moreover, Trove’s easy-to-use inference pipeline incurs no overhead, and inference times decrease linearly with the number of available nodes. Most importantly, we demonstrate how Trove simplifies retrieval experiments and allows for arbitrary customizations, thus facilitating exploratory research.
ClinicalTrialsHub: Bridging Registries and Literature for Comprehensive Clinical Trial Access
Jiwoo Park | Ruoqi Liu | Avani Jagdale | Andrew Srisuwananukorn | Jing Zhao | Ping Zhang | Sachin Kumar
Jiwoo Park | Ruoqi Liu | Avani Jagdale | Andrew Srisuwananukorn | Jing Zhao | Ping Zhang | Sachin Kumar
We present ClinicalTrialsHub, an interactive search-focused platform that consolidates all data from ClinicalTrials.gov and augments it by automatically extracting and structuring trial-relevant information from PubMed research articles. Our system effectively increases access to structured clinical trial data by 83.8% compared to relying on ClinicalTrials.gov alone, with potential to make access easier for patients, clinicians, researchers, and policymakers, advancing evidence-based medicine. ClinicalTrialsHub uses large language models such as GPT-5.1 and Gemini-3-Pro to enhance accessibility. The platform automatically parses full-text research articles to extract structured trial information, translates user queries into structured database searches, and provides an attributed question-answering system that generates evidence-grounded answers linked to specific source sentences. We demonstrate its utility through a user study involving clinicians, clinical researchers, and PhD students in pharmaceutical sciences and nursing, and a systematic automatic evaluation of its information extraction and question answering capabilities.
Large language models (LLMs) have expanded the potential for AI-assisted scientific claim verification, yet existing systems often exhibit unverifiable attributions, shallow evidence mapping, and hallucinated citations. We present SciTrue, a claim verification system providing source-level accountability and evidence traceability. SciTrue links each claim component to explicit, verifiable scientific sources, enabling users to inspect and challenge model inferences, addressing limitations of both general-purpose and search-augmented LLMs. In a human evaluation of 300 attributions, SciTrue achieves high fidelity in summary traceability, attribution accuracy, and context alignment, substantially outperforming RAG-based baselines such as GPT-4o-search-preview and Perplexity Sonar Pro. These results underscore the importance of principled attribution and context-aware reasoning in AI-assisted scientific verification. A system demo is available at .
PUCP-Metrix: An Open-source and Comprehensive Toolkit for Linguistic Analysis of Spanish Texts
Javier Alonso Villegas Luis | Marco Antonio Sobrevilla Cabezudo
Javier Alonso Villegas Luis | Marco Antonio Sobrevilla Cabezudo
Linguistic features remain essential for interpretability and tasks that involve style, structure, and readability, but existing Spanish tools offer limited coverage. We present PUCP-Metrix, an open-source and comprehensive toolkit for linguistic analysis of Spanish texts. PUCP-Metrix includes 182 linguistic metrics spanning lexical diversity, syntactic and semantic complexity, cohesion, psycholinguistics, and readability. It enables fine-grained, interpretable text analysis. We evaluate its usefulness on Automated Readability Assessment and Machine-Generated Text Detection, showing competitive performance compared to an existing repository and strong neural baselines. PUCP-Metrix offers a comprehensive and extensible resource for Spanish, supporting diverse NLP applications.
Integrity Shield: A System for Ethical AI Use & Authorship Transparency in Assessments
Ashish Raj Shekhar | Shiven Agarwal | Priyanuj Bordoloi | Yash Shah | Tejas Anvekar | Vivek Gupta
Ashish Raj Shekhar | Shiven Agarwal | Priyanuj Bordoloi | Yash Shah | Tejas Anvekar | Vivek Gupta
Multi-Modal Large Language Models (MLLMs) can now solve entire exams directly from uploaded PDF assessments, raising urgent concerns about academic integrity and the reliability of grades and credentials. Existing watermarking techniques either operate at the token level or assume control over the model’s decoding process, making them ineffective when students query proprietary black-box systems using instructor-provided documents. We present INTEGRITYSHIELD, a document-layer watermarking system that embeds schema-aware, item-level watermarks into assessment PDFs while keeping their human-visible appearance unchanged. These watermarks consistently prevent MLLMs from answering shielded exam PDFs and encode stable, item-level signatures that can be reliably recovered from model or student responses. Across 30 question papers spanning STEM, humanities, and medical reasoning, INTEGRITYSHIELD achieves exceptionally high prevention (91-94% exam-level blocking) and strong detection reliability (89-93% signature retrieval) across four commercial MLLMs. Our demo showcases an interactive interface where instructors upload an exam, preview watermark behavior, and inspect pre/post AI performance and authorship evidence.
Using a Human-AI Teaming Approach to Create and Curate Scientific Datasets with the SciLire System
Necva Bölücü | Jessica Irons | Changhyun Lee | Brian Jin | Maciej Rybinski | Huichen Yang | Andreas Duenser | Stephen Wan
Necva Bölücü | Jessica Irons | Changhyun Lee | Brian Jin | Maciej Rybinski | Huichen Yang | Andreas Duenser | Stephen Wan
The rapid growth of scientific literature has made manual extraction of structured knowledge increasingly impractical. To address this challenge, we introduce SCILIRE, a system for creating datasets from scientific literature. SCILIRE has been designed around Human-AI teaming principles centred on workflows for verifying and curating data. It facilitates an iterative workflow in which researchers can review and correct AI outputs. Furthermore, this interaction is used as a feedback signal to improve future LLM-based inference. We evaluate our design using a combination of intrinsic benchmarking outcomes together with real-world case studies across multiple domains. The results demonstrate that SCILIRE improves extraction fidelity and facilitates efficient dataset creation.
xLM: A Python Package for Non-Autoregressive Language Models
Dhruvesh Patel | Durga Prasad Maram | Sai Sreenivas Chintha | Benjamin Rozonoyer | Andrew McCallum
Dhruvesh Patel | Durga Prasad Maram | Sai Sreenivas Chintha | Benjamin Rozonoyer | Andrew McCallum
In recent years, there has been a resurgence of interest in non-autoregressive text generation in the context of general language modeling. Unlike the well-established autoregressive language modeling paradigm, which has a plethora of standard training and inference libraries, implementations of non-autoregressive language modeling have largely been bespoke, making it difficult to perform systematic comparisons of different methods. Moreover, each non-autoregressive language model typically requires its own data collation, loss, and prediction logic, making it challenging to reuse common components. In this work, we present the xLM Python package, which is designed to make implementing small non-autoregressive language models faster, with a secondary goal of providing a suite of small pre-trained models (through a companion package) that can be used by the research community.
AITutor-EvalKit: Exploring the Capabilities of AI Tutors
Numaan Naeem | Kaushal Kumar Maurya | Kseniia Petukhova | Ekaterina Kochmar
Numaan Naeem | Kaushal Kumar Maurya | Kseniia Petukhova | Ekaterina Kochmar
We present AITutor-EvalKit, an application that uses language technology to evaluate the pedagogical quality of AI tutors and provides software for demonstration and evaluation, as well as for model inspection and data visualization. This tool is aimed at education stakeholders as well as the *ACL community at large, as it supports learning and can also be used to collect user feedback and annotations.
Robust and comprehensive evaluation of large language models (LLMs) is essential for identifying effective LLM system configurations and mitigating risks associated with deploying LLMs in sensitive domains. However, traditional statistical metrics are poorly suited to open-ended generation tasks, leading to growing reliance on LLM-based evaluation methods. These methods, while often more flexible, introduce additional complexity: they depend on carefully chosen models, prompts, parameters, and evaluation strategies, making the evaluation process prone to misconfiguration and bias. In this work, we present EvalSense, a flexible, extensible framework for constructing domain-specific evaluation suites for LLMs. EvalSense provides out-of-the-box support for a broad range of model providers and evaluation strategies, and assists users in selecting and deploying suitable evaluation methods for their specific use-cases. This is achieved through two unique components: (1) an interactive guide aiding users in evaluation method selection and (2) automated meta-evaluation tools that assess the reliability of different evaluation approaches using perturbed data. We demonstrate the effectiveness of EvalSense in a case study involving the generation of clinical notes from unstructured doctor-patient dialogues, using a popular open dataset. All code, documentation, and assets associated with EvalSense are open-source and publicly available at https://github.com/nhsengland/evalsense.
AI for Climate Finance: Agentic Retrieval and Multi-Step Reasoning for Early Warning System Investments
Saeid Vaghefi | Aymane Hachcham | Veronica Grasso | Nakiete Msemo | Chiara Colesanti Senni | Markus Leippold
Saeid Vaghefi | Aymane Hachcham | Veronica Grasso | Nakiete Msemo | Chiara Colesanti Senni | Markus Leippold
Tracking financial investments in climate adaptation is complex and expertise-intensive, particularly for Early Warning Systems (EWS), where multilateral development bank (MDB) and fund reports lack standardized financial reporting and appear as heterogeneous PDFs with complex tables and inconsistent layouts. We introduce an agent-based Retrieval-Augmented Generation (RAG) system that uses hybrid retrieval and internal chain-of-thought (CoT) reasoning to extract relevant financial data, classify EWS investments, and allocate budgets with grounding evidence spans. While these components are individually established, our contribution is their integration into a domain-specific workflow tailored to heterogeneous MDB reports and numerically grounded EWS budget allocation. On a manually annotated CREWS Fund corpus, our system outperforms four alternatives (zero-shot classifier, few-shot “zero rule” classifier, fine-tuned transformer-based classifier, and few-shot CoT+ICL classifier) on multi-label classification and budget allocation, achieving 87% accuracy, 89% precision, and 83% recall. We further benchmark against the Gemini 2.5 Flash AI Assistant on an expert-annotated MDB evidence set co-curated with the World Meteorological Organization (WMO), enabling a comparative analysis of glass-box agents versus black-box assistants in transparency and performance. The system is publicly deployed and accessible at https://ews-front.vercel.app/ (see Appendix A for demonstration details and Appendix B for dataset statistics and splits). We will open-source all code, LLM generations, and human annotations to support further work on AI-assisted climate finance.
RAGVUE: A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation
Keerthana Murugaraj | Salima Lamsiyah | Martin Theobald
Keerthana Murugaraj | Salima Lamsiyah | Martin Theobald
Evaluating Retrieval-Augmented Generation (RAG) systems remains a challenging task: existing metrics often collapse heterogeneous behaviors into single scores and provide little insight into whether errors arise from retrieval, reasoning, or grounding. In this paper, we introduce RAGVUE, a diagnostic and explainable framework for automated, reference-free evaluation of RAG pipelines. RAGVUE decomposes RAG behavior into retrieval quality, answer relevance and completeness, strict claim-level faithfulness, and judge calibration. Each metric includes a structured explanation, making the evaluation process transparent. Our framework supports both manual metric selection and fully automated agentic evaluation. It also provides a Python API, CLI, and a local Streamlit interface for interactive usage. In comparative experiments, RAGVUE surfaces fine-grained failures that existing tools such as RAGAS often overlook. We showcase the full RAGVUE workflow and illustrate how it can be integrated into research pipelines and practical RAG development. The source code and detailed instructions on usage are publicly available on GitHub.
SmartMatch: Real-Time Semantic Retrieval for Translation Memory Systems
Ernesto Luis Estevanell Valladares | Salima Lamsiyah | Alicia Picazo-Izquierdo | Tharindu Ranasinghe | Ruslan Mitkov | Rafael Munoz
Ernesto Luis Estevanell Valladares | Salima Lamsiyah | Alicia Picazo-Izquierdo | Tharindu Ranasinghe | Ruslan Mitkov | Rafael Munoz
QSTN: A Modular Framework for Robust Questionnaire Inference with Large Language Models
Maximilian Kreutner | Jens Rupprecht | Georg Ahnert | Ahmed Salem | Markus Strohmaier
Maximilian Kreutner | Jens Rupprecht | Georg Ahnert | Ahmed Salem | Markus Strohmaier
We introduce QSTN, an open-source Python framework for systematically generating responses from questionnaire-style prompts to support in-silico surveys and annotation tasks with large language models (LLMs). QSTN enables robust evaluation of questionnaire presentation, prompt perturbations, and response generation methods. Our extensive evaluation (>40 million survey responses) shows that question structure and response generation methods have a significant impact on the alignment of generated survey responses with human answers. We also find that answers can be obtained at a fraction of the compute cost by changing the presentation method. In addition, we offer a no-code user interface that allows researchers to set up robust experiments with LLMs without coding knowledge. We hope that QSTN will support the reproducibility and reliability of LLM-based research in the future.
A Virtual Assistant for Architectural Design in a VR Environment
Ander Salaberria | Oier Ijurco | Markel Ferro | Jiayuan Wang | Iñigo Vilá Muñoz | Roberto de Ioris | Jeremy Barnes | Oier Lopez De Lacalle
Ander Salaberria | Oier Ijurco | Markel Ferro | Jiayuan Wang | Iñigo Vilá Muñoz | Roberto de Ioris | Jeremy Barnes | Oier Lopez De Lacalle
Architectural design relies on 3D modeling procedures, generally carried out in Building Information Modeling (BIM) formats. In this setting, architects and designers collaborate on building designs, iterating over many possible versions until a final design is agreed upon. However, this iteration is complicated by the fact that any changes need to be made by manually introducing changes to the complex BIM files, which lengthens the design process and makes it difficult to quickly prototype changes. To speed up prototyping, we propose VR-Arch, a virtual assistant that allows users to interact with the BIM file in a virtual reality (VR) environment. This framework enables users to 1) make changes directly in the VR environment, 2) make complex queries about the BIM, and 3) combine these to perform more complex actions. All of this is done via voice commands and processed through a ReAct-based agentic system that selects appropriate tools depending on the query context. This multi-tool approach enables real-time, contextualized interaction through natural language, allowing for a faster and more natural prototyping experience.
ARGSBASE: A Multi-Agent Interface for Structured Human–AI Deliberation
Frieso Turkstra | Sara Nabhani | Khalid Al Khatib
Frieso Turkstra | Sara Nabhani | Khalid Al Khatib
We present a new deliberation interface that enables users to engage with multiple large language models (LLMs), coordinated by a moderator agent that assigns roles, manages turn-taking, and ensures structured interaction. Grounded in argumentation theory, the system fosters critical thinking through user–LLM dialogues, real-time summaries of agreements and open questions, and argument maps. Rather than treating LLMs as mere answer providers, our tool positions them as reasoning partners, supporting epistemically responsible human–AI collaboration. It exemplifies hybrid argumentation and aligns with recent calls for “reasonable parrots,” where LLM agents interact with users guided by argumentative principles such as relevance, responsibility, and freedom. A user study shows that participants found the tool easy to use, perspective-enhancing, and promising for research, while suggesting areas for improvement. We make the deliberation interface accessible for testing and provide a recorded demonstration.
Simultaneous Speech-to-Text Translation Web Application for Estonian
Bohdan Podziubanchuk | Aivo Olev | Jiaming Kong | Tanel Alumäe
Bohdan Podziubanchuk | Aivo Olev | Jiaming Kong | Tanel Alumäe
This paper presents a new open-source web application for simultaneous speech-to-text translation. The system translates live Estonian speech into English, Russian, and Ukrainian text, and also supports English-to-Estonian translation. Our solution uses a cascaded architecture that combines streaming speech recognition with a recently proposed LLM-based simultaneous translation model. The LLM treats translation as a conversation, processing input in small five-word chunks. Our streaming speech recognition achieves a word error rate of 10.2% and a BLEU score of 26.1 for Estonian-to-English, significantly outperforming existing streaming solutions. The application is designed for real-world use, featuring a latency of only 3–6 seconds. The application is available at https://est2eng.vercel.app.
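The conversation-style chunking can be pictured with a small loop like the one below; the llm_chat callable and message format are assumptions for illustration, not the deployed system's interface.

```python
# Illustrative chunked, conversation-style simultaneous translation loop.
def simultaneous_translate(asr_words, llm_chat, chunk_size=5):
    """asr_words: iterator of recognized words; llm_chat(history) returns the
    model's continuation of the target-language translation."""
    history, buffer = [], []
    for word in asr_words:
        buffer.append(word)
        if len(buffer) == chunk_size:
            history.append({"role": "user", "content": " ".join(buffer)})
            partial = llm_chat(history)
            history.append({"role": "assistant", "content": partial})
            yield partial            # emit translation incrementally
            buffer = []
    if buffer:                       # flush the final, shorter chunk
        history.append({"role": "user", "content": " ".join(buffer)})
        yield llm_chat(history)
```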
The AI Committee: A Multi-Agent Framework for Automated Validation and Remediation of Web-Sourced Data
Sunith Vallabhaneni | Thomas Berkane | Maimuna S. Majumder
Sunith Vallabhaneni | Thomas Berkane | Maimuna S. Majumder
Many research areas rely on data from the web to gain insights and test their methods. However, collecting comprehensive research datasets often demands manually reviewing many web pages to identify and record relevant data points, which is labor-intensive and susceptible to error. While the emergence of large language model (LLM)-powered web agents has begun to automate parts of this process, they often struggle to ensure the validity of the data they collect. Indeed, these agents exhibit several recurring failure modes—including hallucinating or omitting values, misinterpreting page semantics, and failing to detect invalid information—which are subtle and difficult to detect and correct manually. To address this, we introduce the AI Committee, a novel model-agnostic multi-agent system that automates the process of validating and remediating web-sourced datasets. Each agent is specialized in a distinct task in the data quality assurance pipeline, from source scrutiny and fact-checking to data remediation and integrity validation. The AI Committee leverages various LLM capabilities—including in-context learning for dataset adaptation, chain-of-thought reasoning for complex semantic validation, and a self-correction loop for data remediation—all without task-specific training. We demonstrate the effectiveness of our system by applying it to three real-world datasets, showing that it generalizes across LLMs and significantly outperforms baseline approaches, achieving data completeness up to 73.3% and precision up to 97.3%. We additionally conduct an ablation study demonstrating the contribution of each agent to the Committee’s performance. This work is released as an open-source tool for the research community.
entity-linkings: A Unified Library for Entity Linking
Yuya Sawada | Tsuyoshi Fujita | Yusuke Sakai | Taro Watanabe
Yuya Sawada | Tsuyoshi Fujita | Yusuke Sakai | Taro Watanabe
Entity linking (EL) aims to disambiguate named entities in text by mapping them to the appropriate entities in a knowledge base. However, it is difficult to use some EL methods, as they sometimes have issues in reproducibility due to limited maintenance or the lack of official resources. To address this, we introduce entity-linkings, a unified library for using and developing entity linking systems through a unified interface. Our library flexibly integrates various candidate retrievers and re-ranking models, making it easy to compare and use any entity linking methods within a unified framework. In addition, it is designed with a strong emphasis on API usability, making it highly extensible, and it supports both command-line tools and APIs. Our code is available on GitHub and is also distributed via PyPI under the MIT license. The video is available on YouTube.
ESG-KG: A Multi-modal Knowledge Graph System for Automated Compliance Assessment
Li-Yang Chang | Chih-Ming Chen | Hen-Hsen Huang | Ming-Feng Tsai | An-Zi Yen | Chuan-Ju Wang
Li-Yang Chang | Chih-Ming Chen | Hen-Hsen Huang | Ming-Feng Tsai | An-Zi Yen | Chuan-Ju Wang
Our system is built upon a multi-modal information extraction pipeline designed to process and interpret corporate sustainability reports. This integrated framework systematically handles diverse data formats—including text, tables, figures, and infographics—to extract, structure, and evaluate ESG-related content. The extracted multi-modal data is subsequently formalized into a structured knowledge graph (KG), which serves as both a semantic framework for representing entities, relationships, and metrics relevant to ESG domains, and as the foundational infrastructure for the automated compliance system. This KG enables high-precision retrieval of information across multiple source formats and reporting modalities. The trustworthy, context-rich representations provided by the knowledge graph establish a verifiable evidence base, creating a critical foundation for reliable retrieval-augmented generation (RAG) and for subsequent LLM-based scoring and analysis in the automated ESG compliance system.
BanSuite: A Unified Toolkit and Software Platform for Low-Resource NLP in Bangla
Md. Abu Sayed | Faisal Ahamed Khan | Jannatul Ferdous Tuli | Nabeel Mohammed | Mohammad Ruhul Amin | Mohammad Mamun Or Rashid
Md. Abu Sayed | Faisal Ahamed Khan | Jannatul Ferdous Tuli | Nabeel Mohammed | Mohammad Ruhul Amin | Mohammad Mamun Or Rashid
Bangla is one of the world’s most widely spoken languages, yet it remains significantly under-resourced in natural language processing (NLP). Existing efforts have focused on isolated tasks such as Part-of-Speech (POS) tagging and Named Entity Recognition (NER), but comprehensive, integrated systems for core NLP tasks including Shallow Parsing and Dependency Parsing are largely absent. To address this gap, we present BanSuite, a unified Bangla NLP ecosystem developed under the EBLICT project. BanSuite combines a large-scale, manually annotated Bangla Treebank with high-quality pretrained models for POS tagging, NER, shallow parsing, and dependency parsing, achieving strong in-domain baseline performance (POS: 90.16 F1, NER: 90.11 F1, SP: 86.92 F1, DP: 90.27 UAS). The system is accessible through a Python toolkit (Bkit) and a Web Application, providing both researchers and non-technical users with robust NLP functionalities, including tokenization, normalization, lemmatization, and syntactic parsing. In benchmarking against existing Bangla NLP tools and multilingual Large Language Models (LLMs), BanSuite demonstrates superior task performance while maintaining high efficiency in resource usage. By offering the first comprehensive, open, and integrated NLP platform for Bangla, BanSuite lays a scalable foundation for research, application development, and further advancement of low-resource language technologies. A demonstration video illustrating the system’s functionality is available at https://youtu.be/3pcfiUQfCoA.
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
Selene Baez Santamaria | Sai Ashish Somayajula | Atsuki Yamaguchi
Selene Baez Santamaria | Sai Ashish Somayajula | Atsuki Yamaguchi
EACL 2026 Student Research Workshop: Mentorship Program Report
Selene Báez Santamaría | Sai Ashish Somayajula | Atsuki Yamaguchi
Selene Báez Santamaría | Sai Ashish Somayajula | Atsuki Yamaguchi
This report provides a summary and analysis of the EACL 2026 Student Research Workshop (SRW) Mentorship Program, using structured exit surveys collected from mentors and mentees. Following the spirit of recent ACL Program Chairs’ Reports, this document aims to increase transparency, record lessons learned, and offer actionable guidance for future SRW organizers. The analysis evaluates overall satisfaction, identifies systematic strengths and weaknesses of the mentorship process, and offers recommendations to improve the alignment of expectations and program logistics. We hope that the publication of these findings serves to clarify the organization of mentorship at *ACL venues, provide empirical data for future chairs, and contribute context for meta-research regarding early-career support within the NLP community.
Mask What Matters: Mitigating Object Hallucinations in Multimodal Large Language Models with Object-Aligned Visual Contrastive Decoding
Boqi Chen | Xudong Liu | Jianing Qiu
Boqi Chen | Xudong Liu | Jianing Qiu
We study object hallucination in Multimodal Large Language Models (MLLMs) and improve visual contrastive decoding (VCD) by constructing an object-aligned auxiliary view. We leverage object-centric attention in self-supervised Vision Transformers. In particular, we remove the most salient visual evidence to construct an auxiliary view that disrupts unsupported tokens and produces a stronger contrast signal. Our method is prompt-agnostic, model-agnostic, and can be seamlessly plugged into the existing VCD pipeline with little computation overhead, i.e., a single cacheable forward pass. Empirically, our method demonstrates consistent gains on two popular object hallucination benchmarks across two MLLMs.
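The contrastive step follows the usual visual contrastive decoding adjustment, contrasting logits on the original image with logits on the object-masked auxiliary view; the sketch below is a generic formulation, not the authors' code.

```python
# Generic visual-contrastive-decoding adjustment (illustrative sketch).
def vcd_logits(logits_original, logits_masked, alpha=1.0):
    """Next-token logits from the same MLLM on the original image and on the
    auxiliary image with the most salient object evidence removed
    (works on torch tensors or numpy arrays)."""
    # (1 + alpha) * logit(y | v) - alpha * logit(y | v_masked)
    return (1 + alpha) * logits_original - alpha * logits_masked
```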
Domain Adaptation of Image Encoder for Multimodal Manga Translation
Kota Manabe | Tomoyuki Kajiwara | Takashi Ninomiya | Isao Goto | Shonosuke Ishiwatari | Hiroshi Noji
Kota Manabe | Tomoyuki Kajiwara | Takashi Ninomiya | Isao Goto | Shonosuke Ishiwatari | Hiroshi Noji
The objective of this paper is to enhance machine translation for manga (Japanese comics) by developing and employing an image encoder that is capable of more accurately comprehending its visual context. Conventional manga machine translation systems have faced the challenge of lacking sufficient manga comprehension capabilities when utilizing image information. To address this issue, we propose a domain-adapted image encoder training method for manga. The proposed method involves training encoders to acquire visual features that consider the structural and sequential characteristics of the manga. This approach draws upon a technique that has proven to be highly effective in training language models. The image encoders trained by the proposed methods are used as visual processors in a multimodal machine translation model, and they are evaluated in a Japanese-English translation task. The experimental results demonstrate that the proposed method enhances the performance metrics for translation evaluation, such as BLEU and xCOMET, in comparison to the conventional method.
Do Multi-Agents Solve Better Than Single? Evaluating Agentic Frameworks for Diagram-Grounded Geometry Problem Solving and Reasoning
Mahbub E Sobhani | Md. Faiyaz Abdullah Sayeedi | Mohammad Nehad Alam | Proma Hossain Progga | Swakkhar Shatabda
Mahbub E Sobhani | Md. Faiyaz Abdullah Sayeedi | Mohammad Nehad Alam | Proma Hossain Progga | Swakkhar Shatabda
Diagram-grounded geometry problem solving is a critical benchmark for multimodal large language models (MLLMs), yet the benefits of multi-agent design over single-agent remain unclear. We systematically compare single-agent and multi-agent pipelines on four visual math benchmarks: Geometry3K, MathVerse, OlympiadBench, and We-Math. For open-source models, multi-agent consistently improves performance. For example, Qwen-2.5-VL (7B) gains +6.8 points and Qwen-2.5-VL (32B) gains +3.3 on Geometry3K, and both Qwen-2.5-VL variants see further gains on OlympiadBench and We-Math. In contrast, the closed-source Gemini-2.0-Flash generally performs better in single-agent mode on classic benchmarks, while multi-agent yields only modest improvements on the newer We-Math dataset. These findings show that multi-agent pipelines provide clear benefits for open-source models and can assist strong proprietary systems on newer, less familiar benchmarks, but agentic decomposition is not universally optimal. All code, data, and reasoning files are available at https://github.com/faiyazabdullah/Interpreter-Solver
Luth: Efficient French Specialization for Small Language Models and Cross-Lingual Transfer
Maxence Lasbordes | Sinoué Gad
Maxence Lasbordes | Sinoué Gad
The landscape of Large Language Models remains predominantly English-centric, resulting in a significant performance gap for other major languages, such as French, especially in the context of Small Language Models (SLMs). Existing multilingual models demonstrate considerably lower performance in French compared to English, and research on efficient adaptation methods for French remains limited. To address this, we introduce Luth, a family of French-specialized SLMs: through targeted post-training on curated, high-quality French data, our models outperform all open-source counterparts of comparable size on multiple French benchmarks while retaining their original English capabilities. We further show that strategic model merging enhances performance in both languages, establishing Luth as a new state of the art for French SLMs and a robust baseline for future French-language research.
Machine Translation for Low-Resource Languages through Monolingual Data and LLM: A Case Study of English-to-Basque
Nam Luu | Aitor Soroa | German Rigau | Ondřej Bojar
Nam Luu | Aitor Soroa | German Rigau | Ondřej Bojar
Developing a machine translation (MT) system requires a considerable amount of high-quality parallel data, which is often limited for low-resource languages. This paper explores the use of synthetic data for training an LLM-based MT system in the English-to-Basque direction. Using Basque monolingual corpora as a starting point, we apply back-translation to generate parallel corpora, taking advantage of the fact that current LLMs do not translate well from English to Basque, but they yield an acceptable performance in the reverse direction. We conduct experiments in a multi-stage approach, from a simple Supervised Fine-tuning (SFT) step, to preference learning with the Direct Preference Optimization (DPO) technique. We then evaluate the approach with both automatic metrics and manual assessment. Experimental results suggest that for this task, SFT brings a clear improvement in translation quality, while DPO only yields marginal enhancement.
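The back-translation step amounts to pairing each Basque monolingual sentence with a synthetic English source produced by the stronger Basque-to-English direction; a minimal sketch with a hypothetical translate helper follows.

```python
# Minimal back-translation sketch (hypothetical translate_eu_to_en helper).
def build_backtranslated_corpus(basque_sentences, translate_eu_to_en):
    pairs = []
    for eu in basque_sentences:
        en = translate_eu_to_en(eu)                   # synthetic English source
        pairs.append({"source": en, "target": eu})    # human-written Basque target
    return pairs   # used for SFT in the English-to-Basque direction
```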
Rethinking the Evaluation of Alignment Methods: Insights into Diversity, Generalisation, and Safety
Denis Janiak | Julia Moska | Dawid Motyka | Karolina Seweryn | Paweł Walkowiak | Bartosz Żuk | Arkadiusz Janz
Denis Janiak | Julia Moska | Dawid Motyka | Karolina Seweryn | Paweł Walkowiak | Bartosz Żuk | Arkadiusz Janz
Large language models (LLMs) require careful alignment to balance competing objectives: factuality, safety, conciseness, proactivity, and diversity. Existing studies focus on individual techniques or specific dimensions, lacking a holistic assessment of the inherent trade-offs. We propose a unified evaluation framework that compares LLM alignment methods (PPO, DPO, ORPO, KTO) across these five axes, using both in-distribution and out-of-distribution datasets. Leveraging a specialized LLM-as-Judge prompt, validated through human studies, we reveal that DPO and KTO excel in factual accuracy, PPO and DPO lead in safety, and PPO best balances conciseness with proactivity. Our findings provide insights into trade-offs of common alignment methods, guiding the development of more balanced and reliable LLMs.
Modality Matching Matters: Calibrating Language Distances for Cross-Lingual Transfer in URIEL+
York Hay Ng | Aditya Khan | Xiang Lu | Matteo Salloum | Michael Zhou | Phuong Hanh Hoang | A. Seza Doğruöz | En-Shiun Annie Lee
York Hay Ng | Aditya Khan | Xiang Lu | Matteo Salloum | Michael Zhou | Phuong Hanh Hoang | A. Seza Doğruöz | En-Shiun Annie Lee
Existing linguistic knowledge bases such as URIEL+ provide valuable geographic, genetic and typological distances for cross-lingual transfer but suffer from two key limitations. First, their one-size-fits-all vector representations are ill-suited to the diverse structures of linguistic data. Second, they lack a principled method for aggregating these signals into a single, comprehensive score. In this paper, we address these gaps by introducing a framework for type-matched language distances. We propose novel, structure-aware representations for each distance type: speaker-weighted distributions for geography, hyperbolic embeddings for genealogy, and a latent variables model for typology. We unify these signals into a robust, task-agnostic composite distance. Across multiple zero-shot transfer benchmarks, we demonstrate that our representations significantly improve transfer performance when the distance type is relevant to the task, while our composite distance yields gains in most tasks.
Is He Extroverted? Identifying Missing Relevant Personas for Faithful User Simulation
Weiwen Su | Yuhan Zhou | Zihan Wang | Naoki Yoshinaga | Masashi Toyoda
Weiwen Su | Yuhan Zhou | Zihan Wang | Naoki Yoshinaga | Masashi Toyoda
Existing user simulation approaches focus on generating user-like responses in dialogue. They often assume that the provided persona is sufficient for producing such responses, without verifying whether critical personas are supplied. This raises concerns about the validity of simulation results. To address this issue, we study the task of identifying persona dimensions (e.g., “whether the user is price-sensitive”) that are relevant but missing in simulating a user’s reply for a given dialogue context. We introduce PICQ-drama (constructed from TVShowGuess), a benchmark of context-aware choice questions, annotated with missing persona dimensions whose absence leads to ambiguous user choices. We further design diverse evaluation criteria for missing persona identification. Benchmarking leading LLMs on our PICQ-drama dataset demonstrates the feasibility of this task. Evaluation across diverse criteria, along with further analyses, reveals cognitive differences between LLMs and humans and highlights the distinct roles of different persona categories in shaping responses.
Quality-Aware Adversarial Ensemble for Singer Identification in 1960s Tamil Film Music
Sathiyakugan Balakrishnan | Uthayasanker Thayasivam
Sathiyakugan Balakrishnan | Uthayasanker Thayasivam
1960s Tamil cinema’s musical heritage lacks adequate metadata identifying playback singers in archival recordings. We present a quality-aware adversarial ensemble approach addressing two critical challenges: (1) variable audio degradation requiring adaptive model selection, and (2) instrumentation leakage confounding singer-specific features. We curate 348 annotated clips (12 hours) spanning 48 singers from 179 films. Our methodology introduces: a reliability estimation network dynamically gating five complementary pre-trained speaker models (Wav2Vec2, ECAPA-TDNN, WeSpeaker, CAM++, ERes2NetV2) based on degradation characteristics; adversarial training disentangling singer identity from accompaniment style; and uncertainty-calibrated predictions for human-in-the-loop workflows. On a held-out test set of 52 clips, we achieve 96.2% accuracy (95% CI: [87.5%, 99.2%]) and 2.0% EER (95% CI: [1.2%, 3.1%]), representing 7.7% absolute improvement over the best single model and 2.0% over static ensemble fusion. Ablations show quality-aware gating contributes 2.0% and adversarial disentanglement 2.0% beyond standard ensembles. We publicly release the dataset and code with fixed splits.
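Quality-aware gating can be sketched as a small network that maps degradation features of a clip to weights over the five backbone models' singer logits; shapes and module names below are assumptions, not the released code.

```python
# Illustrative quality-aware gating over an ensemble of speaker models.
import torch
import torch.nn as nn

class QualityGate(nn.Module):
    def __init__(self, n_quality_feats, n_models=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_quality_feats, 32), nn.ReLU(), nn.Linear(32, n_models)
        )

    def forward(self, quality_feats, model_logits):
        """quality_feats: [batch, n_quality_feats] degradation descriptors;
        model_logits: [batch, n_models, n_singers] per-backbone predictions."""
        weights = torch.softmax(self.net(quality_feats), dim=-1)   # [batch, n_models]
        return (weights.unsqueeze(-1) * model_logits).sum(dim=1)   # fused logits
```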
Thesis Proposal: Efficient KV Cache Reuse for Multi-Document Retrieval-Augmented Generation
Zhipeng Zhang | Dmitry Ilvovsky
Zhipeng Zhang | Dmitry Ilvovsky
Retrieval-Augmented Generation (RAG) systems face efficiency bottlenecks in the prefill stage due to the attention mechanism, and the traditional KV cache only accelerates decoding. In this context, reusing document-level KV cache computed for retrieved documents in previous sessions during the prefill stage appears to be a natural way to amortize computation, but it raises serious correctness challenges due to position and context misalignment across queries and sessions. This research proposes a multi-document KV cache reuse framework for RAG workloads across queries and sessions to resolve position misalignment and context misalignment, preserving accuracy while eliminating document-specific quadratic complexity in prefill. Theoretical analysis will establish conditions under which multi-document KV cache reuse remains stable and close to full recomputation, providing principled guarantees for both efficiency and accuracy. These results will enable deployment in existing RAG pipelines without architectural changes or model retraining. Crucially, to ensure robustness in real-world deployments, validation will extend beyond standard benchmarks to include noise-robustness tests and domain-specific workloads (e.g., legal). The research aims to empirically confirm these guarantees and demonstrate that substantial prefill speedups can be achieved without materially degrading task-level performance.
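To make the reuse idea in this proposal concrete, here is a toy sketch of per-document KV caching across sessions. The tensor layout, `prefill_fn`, and `doc_token_map` are assumptions for illustration; the naive concatenation deliberately leaves open the position and context realignment problem that the proposal targets.

```python
import torch

# Toy per-layer KV cache keyed by document id. Shapes follow the usual
# (batch, heads, seq_len, head_dim) layout; everything here is a stand-in.
doc_cache: dict[str, tuple[torch.Tensor, torch.Tensor]] = {}

def get_or_prefill(doc_id: str, doc_tokens: torch.Tensor, prefill_fn):
    """Reuse a document's KV cache if it was computed in an earlier session."""
    if doc_id not in doc_cache:
        k, v = prefill_fn(doc_tokens)       # the expensive quadratic prefill
        doc_cache[doc_id] = (k, v)
    return doc_cache[doc_id]

def assemble_prompt_cache(doc_ids, doc_token_map, prefill_fn):
    """Concatenate cached per-document KV along the sequence axis.

    Naive concatenation is only correct up to position and context
    misalignment: each cache was built as if its document started at
    position 0, so positional encodings (e.g. RoPE phases) and
    cross-document attention must be realigned, which is exactly the
    correctness problem the proposal studies."""
    ks, vs = [], []
    for doc_id in doc_ids:
        k, v = get_or_prefill(doc_id, doc_token_map[doc_id], prefill_fn)
        ks.append(k)
        vs.append(v)
    return torch.cat(ks, dim=2), torch.cat(vs, dim=2)
```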
Thesis proposal: COGNILENS: Analyzing Cognitive Decline in Language Models for Alzheimer’s Monitoring
Jonathan Guerne
Jonathan Guerne
This research proposal describes a cross-disciplinary project aimed at developing Digital Twins (DTs) of Alzheimer’s Disease (AD) using Language Models (LMs). By mimicking the functional deficits observed in individuals with AD, these DTs will serve as tools for early detection and understanding of disease progression. Several approaches to altering the LM will be explored, and the resulting effects on brain score — an evaluation of the correlation between brain activity and the LM’s internal activations — will be studied. Detection models will be trained based on each approach; these models will be compared against one another and against the state of the art. Two converging lines of evidence motivate this work: LMs achieve high accuracy in classifying AD from speech transcripts, and their internal representations correlate significantly with human brain activity during language processing. If successful, this project could lead to significant advancements in the early detection and monitoring of AD, ultimately improving patient outcomes.
Beyond One-Step Distillation: Bridging the Capacity Gap in Small Language Models via Multi-Step Knowledge Transfer
Gaeun Yim | Nayoung Ko | Manasa Bharadwaj
Gaeun Yim | Nayoung Ko | Manasa Bharadwaj
Large Language Models (LLMs) excel across diverse NLP tasks but remain too large for efficient on-device deployment. Although knowledge distillation offers a promising compression strategy, direct one-step distillation from a large teacher to a small student often leads to substantial performance loss due to the capacity gap. In this work, we revisit multi-step knowledge distillation (MSKD) as an effective remedy, exploring how staged, size-aware transfer paths can better preserve teacher knowledge across students of varying scales. Through extensive experiments with GPT-2 and OPT, we demonstrate that MSKD consistently improves ROUGE-L and perplexity over single-step approaches without requiring specialized fine-tuning. Our results establish multi-step transfer as a simple yet powerful framework for progressively compressing LLMs into efficient, high-performing Small Language Models (SLMs).
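The staged transfer described in this abstract can be illustrated with a short distillation sketch. The toy MLPs below stand in for the GPT-2 and OPT checkpoints used in the paper, and the sizes, temperature, and optimizer settings are assumptions; the point is only the chaining of teacher-to-student stages with a softened-logit KL loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T: float = 2.0):
    """Standard softened-logit distillation loss (Hinton-style KL)."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

def distill_one_step(teacher, student, loader, epochs=1, lr=1e-4):
    """One teacher->student stage; chaining these gives multi-step KD."""
    teacher.eval()
    opt = torch.optim.AdamW(student.parameters(), lr=lr)
    for _ in range(epochs):
        for x in loader:
            with torch.no_grad():
                t_logits = teacher(x)
            loss = kd_loss(student(x), t_logits)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student

# Multi-step transfer: large -> medium -> small (toy MLPs as stand-ins).
sizes = [512, 256, 128]
models = [nn.Sequential(nn.Linear(32, s), nn.ReLU(), nn.Linear(s, 10)) for s in sizes]
loader = [torch.randn(8, 32) for _ in range(4)]
for bigger, smaller in zip(models, models[1:]):
    distill_one_step(bigger, smaller, loader)
```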
Thesis proposal: Are We Losing Textual Diversity to Natural Language Processing?
Josef Jon | Ondřej Bojar
Josef Jon | Ondřej Bojar
This thesis argues that the currently widely used Natural Language Processing algorithms may have various limitations related to the properties of the texts they handle and produce. With the rapid and wide adoption of these tools, we must ask what these limitations are and what the possible implications of integrating such tools into our daily lives might be. As a testbed, we have chosen the task of Neural Machine Translation (NMT). Nevertheless, we aim for general insights and outcomes, applicable to current Large Language Models (LLMs). We ask whether the algorithms used in NMT have inherent inductive biases that are beneficial for most types of inputs but might harm the processing of untypical texts, thereby contributing to a cycle of monotonous, repetitive language – whether generated by machines or humans. To explore this hypothesis, we define a set of measures to quantify text diversity based on its statistical properties, like uniformity or rhythmicity of word-level surprisal, on multiple scales (sentence, discourse, language). We conduct a series of experiments to investigate whether NMT systems struggle with maintaining the diversity of such texts, potentially reducing the richness of the generated language, compared to human translators. We further analyze potential origins of these limitations within existing training objectives and decoding strategies. Ultimately, our goal is to propose and validate alternative approaches (e.g., loss functions, decoding algorithms) that maintain the diversity and complexity of language and that allow for better global planning of the output generation, enabling the models to better reflect the ambiguities inherent in human communication.
Constructing a Dataset for Hallucination Detection in Japanese Summarization with Fine-grained Faithfulness Labels
Hikari Tanaka | Atsushi Keyaki | Mamoru Komachi
Hikari Tanaka | Atsushi Keyaki | Mamoru Komachi
Large language models (LLMs) can generate fluent text, but the quality of generated content crucially depends on its consistency with the given input. This aspect is commonly referred to as faithfulness, which concerns whether the output is properly grounded in the input context. A major challenge related to faithfulness is that generated content may include information not supported by the input or may contradict it. This phenomenon is often referred to as hallucination, and increasing attention has been paid to automatic hallucination detection, which determines whether an LLM’s output is hallucinated. To evaluate the performance of hallucination detection systems, researchers use evaluation datasets with labels indicating the presence or absence of hallucinations. While such datasets have been developed for English and Chinese, Japanese evaluation resources for hallucination detection remain limited. Therefore, we constructed a Japanese evaluation dataset for hallucination detection in summarization by manually annotating sentence-level faithfulness labels in LLM-generated summaries of Japanese documents. We annotate 390 summaries (1,938 sentences) generated by three LLMs with sentence-level multi-label annotations for faithfulness with respect to the input document. The taxonomy extends a prior classification scheme and captures distinct patterns of model errors, enabling both binary hallucination detection and fine-grained error-type analysis of Japanese LLM summarization.
Comparing Text Compression Capabilities of Large Language Models with Traditional Compression Algorithms
Mehran Haddadi | William John Teahan
Mehran Haddadi | William John Teahan
This work evaluates the non-English and unstructured text compression performance of Large Language Models (LLMs) by comparing them with traditional baselines on datasets from the eight most widely spoken languages. Experimental results show that the evaluated LLM (LLaMA-3.2-1B) was considerably outperformed by the baselines, particularly on non-English datasets, where its performance relative to the best baseline was more than three times worse than on English datasets on average. It also compressed unstructured English data up to more than twice as poorly as plain English data. Traditional methods, however, remained largely dataset-agnostic. Surprisingly, the LLM achieved worse compression ratios on some datasets than others despite modeling them more accurately. Overall, the outcomes and substantially higher compression time and resource consumption indicate that current LLMs are highly impractical for the compression task, where traditional methods continue to excel. Code is available at: https://github.com/mehranhaddadi13/llm_compress.
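The comparison in this paper rests on a standard equivalence: an arithmetic coder driven by a language model approaches the model's total negative log2-likelihood of the text. The sketch below contrasts that idealized bits-per-character figure with zlib; `gpt2` is only a small stand-in for the 1B model evaluated in the paper, and the text sample is arbitrary.

```python
import math
import zlib

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def zlib_bits_per_char(text: str) -> float:
    raw = text.encode("utf-8")
    return 8 * len(zlib.compress(raw, level=9)) / len(text)

def lm_bits_per_char(text: str, model_name: str = "gpt2") -> float:
    """Idealized LM compression cost: total negative log2-likelihood of the
    token sequence divided by character count (what an arithmetic coder
    driven by the LM would approach). gpt2 is a stand-in model."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    nats_total = out.loss.item() * (ids.shape[1] - 1)  # loss is mean NLL per predicted token
    return nats_total / math.log(2) / len(text)

sample = "the quick brown fox jumps over the lazy dog " * 20
print(f"zlib: {zlib_bits_per_char(sample):.2f} bits/char")
print(f"LM  : {lm_bits_per_char(sample):.2f} bits/char")
```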
Comprehensive Comparison of RAG Methods Across Multi-Domain Conversational QA
Klejda Alushi | Jan Strich | Chris Biemann | Martin Semmann
Klejda Alushi | Jan Strich | Chris Biemann | Martin Semmann
Conversational question answering increasingly relies on retrieval-augmented generation (RAG) to ground large language models (LLMs) in external knowledge. Yet, most existing studies evaluate RAG methods in isolation and primarily focus on single-turn settings. This paper addresses the lack of a systematic comparison of RAG methods for multi-turn conversational QA, where dialogue history, coreference, and shifting user intent substantially complicate retrieval. We present a comprehensive empirical study of vanilla and advanced RAG methods across eight diverse conversational QA datasets spanning multiple domains. Using a unified experimental setup, we evaluate retrieval quality and answer generation using generator and retrieval metrics, and analyze how performance evolves across conversation turns. Our results show that robust yet straightforward methods, such as reranking, hybrid BM25, and HyDE, consistently outperform vanilla RAG. In contrast, several advanced techniques fail to yield gains and can even degrade performance below the No-RAG baseline. We further demonstrate that dataset characteristics and dialogue length strongly influence retrieval effectiveness, explaining why no single RAG strategy dominates across settings. Overall, our findings indicate that effective conversational RAG depends less on method complexity than on alignment between the retrieval strategy and the dataset structure. We publish the code used.[GitHub Repository]
LEMUR: Robust Fine-Tuning for Multilingual Embedding Models for Retrieval
Narges Baba Ahmadi | Jan Strich | Martin Semmann | Chris Biemann
Narges Baba Ahmadi | Jan Strich | Martin Semmann | Chris Biemann
Large language models (LLMs) are increasingly used to access legal information. Yet, their deployment in multilingual legal settings is constrained by unreliable retrieval and the lack of domain-adapted, open-embedding models. In particular, existing multilingual legal corpora are not designed for semantic retrieval, and PDF-based legislative sources introduce substantial noise due to imperfect text extraction. To address these challenges, we introduce LEMUR, a large-scale multilingual corpus of EU environmental legislation constructed from 24,953 official EUR-Lex PDF documents covering 25 languages. We further propose the Lexical Content Score (LCS), a language-agnostic metric that quantifies the fidelity of PDF-to-text conversion by measuring lexical consistency against authoritative HTML versions. Building on LEMUR, we fine-tune three state-of-the-art multilingual embedding models using contrastive objectives in both monolingual and bilingual settings, reflecting realistic legal-retrieval scenarios. Experiments across low- and high-resource languages demonstrate that legal-domain fine-tuning consistently improves Top-k retrieval accuracy relative to strong baselines, with particularly pronounced gains for low-resource languages. Cross-lingual evaluations show that these improvements transfer to unseen languages, indicating that fine-tuning primarily enhances language-independent, content-level legal representations rather than language-specific cues. We publish code[GitHub Repository] and data[Hugging Face Dataset].
Trainable, Multiword-aware Linguistic Tokenization Using Modern Neural Networks
Clara Boesenberg | Kilian Evang
Clara Boesenberg | Kilian Evang
We revisit MWE-aware linguistic tokenization as a character-level and token-level sequence labeling problem and present a systematic evaluation on English, German, Italian, and Dutch data. We compare a standard tokenizer trained without MWE-awareness as a baseline (UDPipe), a character-level SRN+CRF model (Elephant), and transformer-based MaChAmp models trained either directly on gold character labels or as token-level postprocessors on top of UDPipe. Our results show that the two-stage pipeline – UDPipe pretokenization followed by MaChAmp postprocessing – consistently yields the best accuracy. Our analysis of error patterns highlights how different architectures trade off over- and undersegmentation. These findings provide practical guidance for building MWE-aware tokenizers and suggest that postprocessing pipelines with transformers are a strong and general strategy for non-standard tokenization.
Different Time, Different Language: Revisiting the Bias Against Non-Native Speakers in GPT Detectors
Adnan Al Ali | Jindřich Helcl | Jindřich Libovický
Adnan Al Ali | Jindřich Helcl | Jindřich Libovický
LLM-based assistants have been widely popularised after the release of ChatGPT. Concerns have been raised about their misuse in academia, given the difficulty of distinguishing between human-written and generated text. To combat this, automated techniques have been developed and shown to be effective, to some extent. However, prior work suggests that these methods often falsely flag essays from non-native speakers as generated, due to their low perplexity extracted from an LLM, which is supposedly a key feature of the detectors. We revisit these statements two years later, specifically in the Czech language setting. We show that the perplexity of texts from non-native speakers of Czech is not lower than that of native speakers. We further examine detectors from three separate families and find no systematic bias against non-native speakers. Finally, we demonstrate that contemporary detectors operate effectively without relying on perplexity.
Call, Reward, Repeat: Advancing Dialog State Tracking with GRPO and Function Calling
Timur Ionov | Anna Marshalova | Valentin Malykh
Timur Ionov | Anna Marshalova | Valentin Malykh
Recent advancements in Large Language Models (LLMs) have notably enhanced task-oriented dialogue systems, particularly in Dialogue State Tracking (DST), owing to their generative capabilities and strong generalization. Although recent approaches such as LDST and FnCTOD significantly improved cross-domain DST performance via supervised fine-tuning (SFT), these methods typically require substantial amounts of domain-specific data. In this paper, we address this limitation by employing Group Relative Policy Optimization (GRPO), a critic-free reinforcement learning method that efficiently guides LLMs toward improved DST accuracy even under low-resource conditions. Our results on established DST benchmarks, including MultiWOZ 2.1 and 2.4, demonstrate that the RL approach achieves superior performance to existing methods while using significantly less out-of-domain training data. In addition, we find that models pretrained specifically for tool-use tasks can be a better starting point, especially at small scales.
Generalising LLM Routing using Past Performance Retrieval: A Few-Shot Router is Sufficient
Clovis Varangot-Reille | Christophe Bouvard | Antoine Gourru
Clovis Varangot-Reille | Christophe Bouvard | Antoine Gourru
We study model routing for Large Language Model (LLM)-based systems. A model, called the router, dynamically chooses which LLM should handle a given input/query. We challenge the assumption that complex routers are necessary for generalising to new candidate LLMs. We introduce ContextualRouter, a simple meta-evaluation framework that predicts per-model performance for new queries by retrieving similar past queries and reweighting model scores with lightweight attention. During inference, the router balances estimated performance and cost by adjusting a tunable cost penalty parameter. This allows the router to adapt dynamically to the addition or removal of LLMs without the need for retraining. Across five routing benchmarks (SPROUT, RouterBench, LiveBench, BigGenBench, and EmbedLLM), ContextualRouter matches the quality–cost trade-offs of other generalisable routers. Surprisingly, a simpler non-parametric baseline, k-nearest-neighbour averaging, performs comparably or better, achieving strong performance estimation, high NDCG, and substantial cost savings. Retrieval-based routers remain robust to the choice of k, embedding size, data sparsity, and retrieval degradation, and they generalise to unseen queries and models with as little as 1% of the historical data. These results suggest that effective retrieval alone enables generalisable LLM routing.
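The k-nearest-neighbour baseline highlighted in this abstract is simple enough to sketch directly. The following is an illustrative implementation, not the paper's code; the array shapes, cosine retrieval, and the particular cost-penalty form are assumptions.

```python
import numpy as np

def knn_route(query_emb, hist_embs, hist_scores, hist_costs,
              k: int = 16, cost_penalty: float = 0.1) -> int:
    """k-nearest-neighbour performance estimation for LLM routing.

    hist_embs:   (N, d) embeddings of past queries
    hist_scores: (N, M) observed per-model quality on those queries
    hist_costs:  (M,)   per-model cost (e.g. price per 1k tokens)
    Returns the index of the model with the best estimated quality
    minus a tunable cost penalty."""
    q = query_emb / np.linalg.norm(query_emb)
    H = hist_embs / np.linalg.norm(hist_embs, axis=1, keepdims=True)
    sims = H @ q                                # cosine similarity to history
    nearest = np.argsort(-sims)[:k]             # indices of the k closest queries
    est_quality = hist_scores[nearest].mean(axis=0)   # (M,) averaged past quality
    utility = est_quality - cost_penalty * hist_costs
    return int(np.argmax(utility))
```

Raising `cost_penalty` trades quality for cost, which is how a single router can be swept along a quality-cost curve without retraining.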
CAPID: Context-Aware PII Detection for Question-Answering Systems
Mariia Ponomarenko | Sepideh Abedini | Masoumeh Shafieinejad | D. B. Emerson | Shubhankar Mohapatra | Xi He
Mariia Ponomarenko | Sepideh Abedini | Masoumeh Shafieinejad | D. B. Emerson | Shubhankar Mohapatra | Xi He
Detecting personally identifiable information (PII) in user queries is critical for ensuring privacy in question-answering systems. Current approaches mainly redact all PII, disregarding the fact that some of them may be contextually relevant to the user’s question, resulting in a degradation of response quality. Large language models (LLMs) might be able to help determine which PII are relevant, but due to their closed source nature and lack of privacy guarantees, they are unsuitable for sensitive data processing. To achieve privacy-preserving PII detection, we propose CAPID, a practical approach that fine-tunes a locally owned small language model (SLM) that filters sensitive information before it is passed to LLMs for QA. However, existing datasets do not capture the context-dependent relevance of PII needed to train such a model effectively. To fill this gap, we propose a synthetic data generation pipeline that leverages LLMs to produce a diverse, domain-rich dataset spanning multiple PII types and relevance levels. Using this dataset, we fine-tune an SLM to detect PII spans, classify their types, and estimate contextual relevance. Our experiments show that relevance-aware PII detection with a fine-tuned SLM substantially outperforms existing baselines in span, relevance and type accuracy while preserving significantly higher downstream utility under anonymization.
Exploring the Semantic Space of Second Language Learners
Trisha Godara | Rui He | Wolfram Hinzen | Yan Cong
Trisha Godara | Rui He | Wolfram Hinzen | Yan Cong
While the semantic space has been examined as a way to computationally represent the language meaning-grammar interface, minimal research has been done comparing the semantic spaces of first and second language learners. We investigated the semantic space of university-level students learning French by extracting semantic features from narrative text over various time points across a 21-month period. After using machine learning models to distinguish native speakers’ semantic features from second language learners’, we used interpretability techniques to identify the most informative features per model. Through this, we discovered a variety of embedding similarity features to be decisive in language learning. We compared both groups to determine how the features differed per group and whether there was any change over time. The findings demonstrated that the second language learners on average had higher semantic similarity scores than the native speakers at the token level. The similarity decreased over time but did not reach native-level values. Similarly, average surprisal was higher in the second language learner group, which steadily decreased over the course of the data collection period. These results provide insight into personalized education with more precise and effective computational indices tracking learners’ progress.
Kahaani: A Multimodal Co-Creative Storytelling System
Samee Arif | Muhammad Saad Haroon | Aamina Jamal Khan | Taimoor Arif | Agha Ali Raza | Awais Athar
Samee Arif | Muhammad Saad Haroon | Aamina Jamal Khan | Taimoor Arif | Agha Ali Raza | Awais Athar
This paper introduces Kahaani, a multimodal, co-creative storytelling system for children that leverages Generative Artificial Intelligence to address the challenge of sustaining engagement in educational narrative experiences. Here we define co-creative as a collaborative creative process in which both the child and Kahaani contribute to the generation of the story. The system combines Large Language Model (LLM), Text-to-Speech (TTS), Text-to-Music (TTM), and Text-to-Video (TTV) generation to produce a rich, immersive, and accessible storytelling experience. The system grounds the co-creation process in two classical storytelling frameworks, Freytag’s Pyramid and Propp’s Narrative Functions. The main goals of Kahaani are: (1) to help children improve their English skills, (2) to teach important life lessons through story morals, and (3) to help them understand how stories are structured, all in a fun and engaging way. We present evaluations for each AI component used, along with a user study involving three parent–child pairs to assess the overall experience and educational value of the system.
A Benchmark and Evaluation of Automated Language of Study Extraction from Computational Linguistics Publications
Henry Gagnier | Ashwin Kirubakaran
Henry Gagnier | Ashwin Kirubakaran
Language of study is an aspect of computational linguistics papers that is useful for analyses of trends and diversity in computational linguistics. This study introduces the first benchmark and evaluation of automated language of study extraction from computational linguistics publications. The annotated benchmark contains 431 publications from the ACL Anthology, covering 62 languages of study. SciBERT and four large language models (LLMs), GPT-4o mini, Gemini 2.5 Flash, Claude 3.5 Haiku, and DeepSeek 3.2, were evaluated on the benchmark using different parts of the ACL Anthology papers. GPT-4o mini achieved the best exact match and Jaccard agreement scores of 0.646 and 0.687, respectively, which is slightly below the level of human annotator agreement. Gemini 2.5 Flash achieved the best micro F1 of 0.633. Models using the abstract for extraction were competitive with models using the full text, showing that accuracy can be achieved in language of study extraction without high computational costs. These findings demonstrate that LLMs are able to accurately identify the languages of study in computational linguistics papers, potentially reducing the time and cost of analyses in computational linguistics.
Who Plays Which Role? Protagonist Detection and Classification in Moral Discourse
Mirko Sommer | Maria Becker
Mirko Sommer | Maria Becker
Protagonists play a central role in moral discourse by structuring responsibility and authority, yet computational work has largely focused on moral values rather than the actors involved. We address this gap by studying phrase-level protagonist detection and classification in the Moralization Corpus (Becker et al., 2025), a dataset of moral arguments across different text genres. We decompose the task into identifying protagonist mentions and classifying them by what kind of actor they are (e.g., individual or institution) and what function they serve in the moral argument. We compare fine-tuned lightweight models, state-of-the-art NER models, and prompting-based large language models. We further establish human baselines and analyze the impact of contextual information on human and model decisions. Our results show that fine-tuned NER models achieve competitive detection performance at substantially lower cost than prompted large language models, and that role classification benefits more strongly from contextualized prompting. Across tasks, top-performing models reach or exceed human-level performance, highlighting the value of task decomposition for modeling protagonists in moral discourse. We release our code, predictions, and supplementary material in our project repository.
Music is a universal cultural practice that influences emotion, ritual and creativity, and it is now represented in many digital modalities: audio recordings, symbolic encodings (MIDI, MusicXML, ABC), visual scores and lyrics. Multimodal Large Language Models (MLLMs) have the ambition to process "everything", including music, and therefore promise to support musical analysis, creation and education. Despite this promise, systematic methods for evaluating whether an MLLM understands music are missing. Existing music-focused benchmarks are fragmented, largely single-modality, Western-centric, and often do not require actual perception of the musical content; methodological details such as prompt design and answer-extraction are frequently omitted or not discussed, and some evaluations rely on proprietary LLMs, hindering reproducibility and raising concerns about test-data leakage. To fill this gap, this dissertation proposes to design a musically multimodal benchmark built on a transparent, fully open evaluation pipeline. The benchmark will present closed-question-answer items across four musical modalities, employ carefully engineered distractor options to enforce genuine perceptual engagement, and follow rigorously documented prompt-selection and answer-extraction procedures. It will further incorporate culturally diverse musical material beyond the dominant Western canon. Guided by three research questions, namely (1) how to devise robust, reproducible evaluation procedures, (2) how current MLLMs perform across modalities, and (3) how model scores relate to human musical abilities, the benchmark will enable precise diagnosis of model limitations, inform the development of more musically aware AI systems, and provide a principled basis for assessing practical usefulness to musicians and other stakeholders in the creative industry.
Communication as a Complex System: Modeling the Feedback Dynamics of Trust and Credibility
Swaptik Chowdhury | Samuel D. Allen | Jung Hee Hyun
Swaptik Chowdhury | Samuel D. Allen | Jung Hee Hyun
This study examines how credibility, trust, and bias interact within complex communication systems that shape public understanding of scientific information. It addresses two questions: 1. What are the primary factors that influence the public’s comprehension of scientific findings? 2. How do the factors influencing public understanding of climate change science interact within a complex system? A scoping literature review synthesized disparate communication models from media studies, science communication, psychology, and information science to identify a shared set of system variables. The identified variables were organized into source-, message-, channel-, and receiver-related factors and used to develop a causal loop diagram (CLD) showing how credibility, trust, and information processing co-evolve through reinforcing and balancing feedback. The resulting diagram illustrates two major loops: one centered on trust in information sources, which can foster social cohesion or accelerate truth decay, and another linking individual trust dynamics to broader patterns of polarization and unity. By clarifying how well-established constructs interact to produce dynamic communication outcomes, the framework is useful for scholars developing integrative theory and for policymakers and practitioners designing interventions in misinformation-prone environments. The CLD also provides a foundation for future system dynamics modeling to examine how interventions in transparency, media literacy, or platform governance may influence public trust over time.
The Clinical Fingerprint: Comparing the Rhetorical Integrity and Epistemic Safety of Human Physicians and Large Language Models
Bayram Ayadi
Bayram Ayadi
While Large Language Models demonstrate expert proficiency on medical benchmarks, the clinical encounter requires more than factual retrieval. It demands a sophisticated rhetorical performance of care that balances authority with epistemic humility. This paper investigates the Clinical Fingerprint by comparing the structural and ethical integrity of advice generated by human physicians and various language models.Our findings reveal a fundamental divergence in how clinical information is prioritized and delivered. We show that whereas physicians utilize efficient, action-oriented structures to provide clear guidance, generic models often bury critical advice under layers of complex linguistic recursion. This creates a significant cognitive load for patients and risks a dangerous safety cliff where models adopt an unearned authoritative tone. Such models frequently mimic the confidence of a doctor while providing contradictory advice, particularly in complex cases involving multiple symptoms. By identifying these rhetorical gaps, our work emphasizes that domain-specific fine-tuning is an ethical necessity to ensure that AI assistants maintain the necessary humility and logical cohesion required for safe medical practice.
Acceleration of Backpropagation in Linear Layers of Transformer Models Based on Gradient Structure
Dmitrii Topchii | Alexander Panchenko | Viktoriia A. Chekalina
Dmitrii Topchii | Alexander Panchenko | Viktoriia A. Chekalina
Fine-tuning Transformer models is often dominated by the backward computation in linear layers. In many NLP tasks, input sequences are short and padded to a fixed context length, inducing structured sparsity in the output gradients. We propose Sparsity-Exploiting Backward Pass (SEBP), a heuristic method that reduces backward computation by exploiting this sparsity with negligible memory overhead. We show that, for short input sequences, the output gradients of BERT-based and LLaMA models exhibit pronounced sparsity, allowing for optimisation in the backward computation. We optimized the autograd function in the linear layers, significantly reducing the number of FLOPs during the backward pass. Our method achieves a backward pass speedup of approximately 2.15x for BERT-base on GLUE tasks and 1.99x for a 3B LLaMA model on reasoning benchmarks, while maintaining memory usage nearly identical to regular PyTorch fine-tuning. Crucially, this speedup comes at no cost to performance. We show that our method matches standard convergence rates, offering a memory-efficient way to accelerate LLM fine-tuning.
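The core observation here is that padded positions contribute all-zero rows to the output gradient, so the backward matmuls can be restricted to the rows that actually carry gradient without changing the result. Below is a minimal, illustrative custom autograd function in that spirit; it is not the paper's implementation, and the row-selection heuristic is an assumption.

```python
import torch

class SparseBackwardLinear(torch.autograd.Function):
    """Linear layer whose backward pass skips positions with all-zero
    output gradients (e.g. padded timesteps). Illustrative sketch only."""

    @staticmethod
    def forward(ctx, x, weight, bias):
        # x: (tokens, in_features), weight: (out_features, in_features)
        ctx.save_for_backward(x, weight)
        return x @ weight.t() + bias

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        keep = grad_out.abs().sum(dim=1) > 0     # rows that actually carry gradient
        g = grad_out[keep]
        grad_x = torch.zeros_like(x)
        grad_x[keep] = g @ weight                # (kept, in_features)
        grad_w = g.t() @ x[keep]                 # (out_features, in_features)
        grad_b = g.sum(dim=0)
        return grad_x, grad_w, grad_b

# Usage: out = SparseBackwardLinear.apply(x, weight, bias)
```

Since zero rows contribute nothing to the weight and bias gradients, the restricted computation is mathematically equivalent to the dense one; the savings grow with the fraction of padded positions.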
Chronocept: Instilling a Sense of Time in Machines
Krish Goel | Sanskar Pandey | Mahadevan Ks | Harsh Kumar | Vishesh Khadaria
Krish Goel | Sanskar Pandey | Mahadevan Ks | Harsh Kumar | Vishesh Khadaria
Human cognition is deeply intertwined with a sense of time, known as Chronoception. This sense allows us to judge how long facts remain valid and when knowledge becomes outdated. Despite progress in vision, language, and motor control, AI still struggles to reason about temporal validity. We introduce Chronocept, the first benchmark to model temporal validity as a continuous probability distribution over time. Using skew-normal curves fitted along semantically decomposed temporal axes, Chronocept captures nuanced patterns of emergence, decay, and peak relevance. It includes two datasets: Benchmark I (atomic facts) and Benchmark II (multi-sentence passages). Annotations show strong inter-annotator agreement (84% and 89%). Our baselines predict curve parameters - location, scale, and skewness - enabling interpretable, generalizable learning and outperforming classification-based approaches. Chronocept fills a foundational gap in AI’s temporal reasoning, supporting applications in knowledge grounding, fact-checking, retrieval-augmented generation (RAG), and proactive agents. Code and data are publicly available.
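Since Chronocept models temporal validity with a skew-normal curve parameterized by location, scale, and skewness, a tiny sketch makes the representation concrete. The parameter values and time axis below are invented for illustration only.

```python
import numpy as np
from scipy.stats import skewnorm

# Temporal validity as a skew-normal density over time (days relative to
# the statement). Parameter values here are made up for illustration.
loc, scale, skewness = 5.0, 10.0, 4.0       # location, scale, skewness (a)

t = np.linspace(-30, 90, 500)
validity = skewnorm.pdf(t, a=skewness, loc=loc, scale=scale)

peak_relevance_day = t[np.argmax(validity)]
print(f"peak relevance around day {peak_relevance_day:.1f}")

# A model that predicts (loc, scale, skewness) for a statement can be scored
# by comparing its predicted curve against the annotated one, e.g. via mean
# squared error on the densities.
```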
When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models
Zafir Shamsi | Nikhil Chekuru | Zachary Guzman | Shivank Garg
Zafir Shamsi | Nikhil Chekuru | Zachary Guzman | Shivank Garg
Large Language Models (LLMs) are increasingly integrated into high-stakes applications, making robust safety guarantees a central practical and commercial concern. Existing safety evaluations predominantly rely on fixed collections of harmful prompts, implicitly assuming non-adaptive adversaries and thereby overlooking realistic attack scenarios in which inputs are iteratively refined to evade safeguards. In this work, we examine the vulnerability of contemporary language models to automated, adversarial prompt refinement. We repurpose black-box prompt optimization techniques, originally designed to improve performance on benign tasks, to systematically search for safety failures. Using DSPy, we apply three such optimizers to prompts drawn from HarmfulQA and JailbreakBench, explicitly optimizing toward a continuous danger score in the range 0 to 1 provided by an independent evaluator model (GPT-5.1). Our results demonstrate a substantial reduction in effective safety safeguards, with the effects being especially pronounced for open-source small language models. For example, the average danger score of Qwen 3 8B increases from 0.09 in its baseline setting to 0.79 after optimization. These findings suggest that static benchmarks may underestimate residual risk, indicating that automated, adaptive red-teaming is a necessary component of robust safety evaluation.
GraphRAG-Rad: Concept-Aware Radiology Report Generation via Latent Visual-Semantic Retrieval
Faezeh Safari | Hang Dong | Zeyu Fu | Aline Villavicencio
Faezeh Safari | Hang Dong | Zeyu Fu | Aline Villavicencio
Radiology report generation involves translating visual signals from pixels into precise clinical language. Existing encoder-decoder models often suffer from hallucinations, generating plausible but incorrect medical findings. We propose GraphRAG-Rad, a novel architecture that integrates biomedical knowledge through a novel Latent Visual-Semantic Retrieval (VSR). Unlike traditional Retrieval-Augmented Generation (RAG) methods that rely on textual queries, our approach aligns visual embeddings with the latent space of the Knowledge Graph, PrimeKG. The retrieved sub-graph guides the Visual Encoder and the Multi-Hop Reasoning Module. The reasoning module simulates clinical deduction paths (Ground-Glass Opacity → Viral Pneumonia → COVID-19) before it combines the information with visual features in a Graph-Gated Cross-Modal Decoder. Experiments on the COV-CTR dataset demonstrate that GraphRAG-Rad achieves competitive performance with strong results across multiple metrics. Furthermore, ablation studies show that integrating latent retrieval and reasoning improves performance significantly compared to a visual-only baseline. Qualitative analysis further reveals interpretable attention maps. These maps explicitly link visual regions to symbolic medical concepts, effectively bridging the modality gap between vision and language.
Token Pruning for Improving Graph-Generating State Space Model Performance
Monish Beegamudre | Jack Zheng | Margaret Capetz
Monish Beegamudre | Jack Zheng | Margaret Capetz
State Space Models (SSMs) have recently emerged as efficient alternatives to Transformers for sequence modeling, yet extending them to two-dimensional vision tasks remains challenging. The Graph-Generating State Space Model (GG-SSM) addresses this challenge by constructing an adaptive graph, achieving competitive performance on vision benchmarks. However, state propagation over the resulting graph introduces substantial inference overhead, limiting scalability to high-resolution inputs. In this work, we introduce a leaf-guided computation pruning strategy that accelerates GG-SSM inference without modifying the underlying graph topology. Rather than removing nodes or edges, our approach selectively scales or bypasses secondary refinement computations associated with high-dissimilarity leaf nodes, while preserving the low-weight MST backbone. Experiments on multiple long-term time series forecasting benchmarks demonstrate consistent throughput improvements with controlled accuracy degradation across a range of pruning ratios. These results indicate that structure-aware computation pruning is an effective mechanism for improving the scalability of graph-based state space models.
Scale Is All You Need: Analyzing Modality Interaction and Speaker Intent Without Fine-Tuning
Animesh Gurjar | Nikhil Krishnaswamy
Animesh Gurjar | Nikhil Krishnaswamy
Understanding sarcasm requires integrating cues from language, voice, and facial expression. Recent work has achieved impressive results using large multimodal Transformers, but such models are computationally expensive and often obscure how each modality contributes to the final prediction. This paper introduces a lightweight, interpretable framework for multimodal sarcasm detection that combines frozen text, audio, and visual embeddings from pretrained encoders through compact fusion heads. Using the MUStARD++Balanced dataset, we show that early fusion of textual and acoustic features improves over the best unimodal baseline. Character-specific evaluation further shows that sarcasm expressed through overt prosodic and visual cues is substantially easier to detect than monotone, context-dependent sarcasm. Additionally, we evaluate generalization to different characters through leave-one-speaker-out (LOSO) experiments and run ablation-style transfer experiments on two speakers with similar sarcasm distributions. These findings demonstrate that effective multimodal sarcasm understanding can emerge from frozen, resource-efficient representations without large-scale fine-tuning, emphasizing the importance of modality interaction and delivery style rather than model scale.
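The "compact fusion head over frozen embeddings" described above is easy to illustrate. The sketch below is a generic early-fusion classifier; the embedding dimensions, dropout rate, and hidden size are assumptions, and the pretrained encoders themselves would stay frozen outside this module.

```python
import torch
import torch.nn as nn

class EarlyFusionHead(nn.Module):
    """Compact fusion head over frozen unimodal embeddings.

    Dimensions are placeholders (e.g. 768 for a text encoder, 768 for an
    audio encoder, 512 for a visual encoder)."""

    def __init__(self, d_text=768, d_audio=768, d_vision=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_text + d_audio + d_vision, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, 1),            # sarcastic vs. not
        )

    def forward(self, text_emb, audio_emb, vision_emb):
        fused = torch.cat([text_emb, audio_emb, vision_emb], dim=-1)
        return self.mlp(fused).squeeze(-1)

head = EarlyFusionHead()
logits = head(torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 512))
```

Because only the small head is trained, the approach stays cheap and keeps the contribution of each modality easy to ablate.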
Plasticity vs. Rigidity: The Impact of Low-Rank Adapters on Reasoning on a Micro-Budget
Zohaib Khan | Omer Tafveez | Zoha Hayat Bhatti
Zohaib Khan | Omer Tafveez | Zoha Hayat Bhatti
In-Image Machine Translation. A Preliminary Modular Approach
Sergio Gomez Gonzalez | Miguel Domingo | Francisco Casacuberta
Sergio Gomez Gonzalez | Miguel Domingo | Francisco Casacuberta
In-image machine translation is a sub-task of Image-Based Machine Translation that aims to substitute text embedded in images with its translation into another language. In the current work, we define a simple task with a synthetic dataset based on rendering parallel text over a plain background. Furthermore, we experiment with different optical character recognition, machine translation and image synthesis models to include in our ensemble. Then, we present our cascade approach as a pipeline that obtains the transcript of the original image, translates it, and generates a new image (image synthesis) similar to the original one. Finally, we compare the performance of our approach with several current state-of-the-art models, including an end-to-end approach, demonstrating its competitiveness.
Text-to-Text Automatic Story Generation: A Survey
Yuan Ma | Hanna Suominen | Patrik Haslum | Richard Susilo
Yuan Ma | Hanna Suominen | Patrik Haslum | Richard Susilo
Automatic story generation aims to produce coherent, engaging, and contextually consistent narratives with minimal or no human involvement, thereby advancing research in computational creativity and applications in human language technologies. The emergence of large language models has progressed the task, enabling systems to generate multi-thousand-word stories under diverse constraints. Despite these advances, maintaining narrative coherence, character consistency, storyline diversity, and plot controllability in generating stories is still challenging. In this survey, we conduct a systematic review of research published over the past four years to examine the major trends and key limitations in story generation methods, model architectures, datasets, and evaluation methodologies. Based on this analysis of 57 included papers, we propose developing new evaluation metrics and creating more suitable datasets, together with ongoing improvement of narrative coherence and consistency, as well as their exploration in practical applications of story generation, as actions to support continued progress in automatic story generation.
Probabilistic Bilingual Subword Segmentation with Latent Subword Alignment
Shoto Nishida | Daiki Matsui | Takashi Ninomiya | Isao Goto | Akihiro Tamura
Shoto Nishida | Daiki Matsui | Takashi Ninomiya | Isao Goto | Akihiro Tamura
This study proposes a method for learning subword correspondences in parallel sentence pairs using the EM algorithm. Conventional neural machine translation typically employs subword segmentation models trained separately for the source and target languages. However, since existing methods do not consider parallel relationships, inconsistencies in word segmentation between source and target languages may hinder translation model training. Our approach leverages direct modeling of subword correspondences in parallel corpora, thereby improving segmentation consistency across languages. Experiments across multiple machine translation tasks confirm that our proposed method improves translation accuracy for many tasks.
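For readers unfamiliar with EM-learned correspondences, the classic IBM Model 1 recursion over already-segmented subword sequences gives the flavor of the approach; the sketch below is that textbook stand-in, not the paper's actual model, and the toy German-English pairs are invented.

```python
from collections import defaultdict

def ibm1_em(parallel_pairs, iterations: int = 10):
    """IBM Model 1-style EM over already-segmented subword sequences.

    parallel_pairs: list of (source_subwords, target_subwords) pairs.
    Returns t[(tgt, src)] translation probabilities."""
    src_vocab = {s for src, _ in parallel_pairs for s in src}
    t = defaultdict(lambda: 1.0 / max(len(src_vocab), 1))   # uniform init
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for src, tgt in parallel_pairs:                      # E-step
            for f in tgt:
                z = sum(t[(f, e)] for e in src)
                for e in src:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():                      # M-step
            t[(f, e)] = c / total[e]
    return t

pairs = [(["un", "related"], ["nicht", "verwandt"]),
         (["related"], ["verwandt"])]
probs = ibm1_em(pairs)
```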
Indian languages represent a highly multilingual and low-resource speech ecosystem, where the scarcity of high-quality parallel speech corpora significantly limits the development of speech-to-speech translation systems. Most existing approaches rely on cascaded pipelines that combine automatic speech recognition (ASR), machine translation (MT), and text-to-speech synthesis (TTS). While effective, these cascaded systems often suffer from cumulative error propagation, increased latency, and higher computational complexity, particularly in low-resource Indian languages. To address these challenges, my doctoral work proposes a novel sequence-to-sequence direct speech translation framework capable of translating speech from one Indian language to another without relying on intermediate text representations. Recent advances in deep learning indicate that direct speech translation architectures can surpass conventional cascaded systems in both efficiency and translation quality, motivating the design of our fully end-to-end solution. We aim to release an initial dataset comprising at least 120,000 real speech samples within a 6–12 month timeframe.
Towards Singable Lyrics Translation Using Large Language Models
Liu Hanze | Yusuke Sakai | Taro Watanabe
Liu Hanze | Yusuke Sakai | Taro Watanabe
Lyrics translation must account for rhythm, rhyme, and singability in the translated lyrics. In this study, we focus on singability and investigate effective prompting methods for translating singable lyrics, including verification-guided and multi-round prompting, applied to large language models. First, we curate a multilingual lyrics translation dataset covering a total of six language directions across Chinese, Japanese, and English. Next, we evaluate seven prompting strategies, with instruction complexity increasing incrementally. The results show that multi-prompt strategies improve singability-related aspects, such as rhythmic alignment and phonological naturalness, compared to naive translation. Furthermore, human evaluations using songs created from translated lyrics suggest that moderately complex prompting strategies improve singable naturalness, while more complex strategies contribute to greater stability in perceived quality.
Evaluating the Impact of SAE-based Language Steering on LLM Performance
Sebastian Zwirner | Wentao Hu | Koshiro Aoki | Daisuke Kawahara
Sebastian Zwirner | Wentao Hu | Koshiro Aoki | Daisuke Kawahara
Recent advances in Sparse Autoencoders (SAEs) have revealed interpretable features within large language models (LLMs), including features that are specific to individual languages. In prior work, these features have been used to steer a model’s output language. However, the impact of SAE-based language steering on output quality and task performance, as well as its relationship to simpler prompting-based approaches, remains unclear. In this work, we study the effects of language steering using SAE features across multiple tasks and models. We apply language-specific SAE feature steering to three LLMs from two model families and evaluate it on a translation task and a multilingual question-answering task. We compare SAE-based steering against prompting and language neuron-based steering, and examine a combined prompting-and-steering approach. On the translation task, SAE feature steering achieves an average target-language accuracy of 92% across models and languages, consistently outperforming language neuron-based steering, but slightly underperforming prompting in language accuracy and output quality. In contrast, on the multilingual question-answering task, SAE-based steering enables stronger language control than prompting, and combining steering with prompting yields the best overall language control and task performance. These findings demonstrate the potential of SAE features as a tool for controllable multilingual generation.
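Activation steering of this kind typically amounts to adding a (scaled) SAE decoder direction to a layer's residual stream during generation. The hook below is a generic, hedged sketch of that mechanism; the layer index, steering strength, and the decoder vector are all placeholders, not the paper's configuration.

```python
import torch

def make_steering_hook(direction: torch.Tensor, strength: float = 4.0):
    """Add a unit-norm language-specific direction to a layer's residual-stream
    output. The direction would come from a trained sparse autoencoder's
    decoder; here it is just a placeholder tensor."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * direction.to(hidden.dtype).to(hidden.device)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return hook

# Usage sketch (model structure and layer index are assumptions):
# handle = model.model.layers[20].register_forward_hook(
#     make_steering_hook(sae_decoder_row_for_target_language))
# ... generate ...
# handle.remove()
```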
Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework
Grzegorz Statkiewicz | Alicja Dobrzeniecka | Karolina Seweryn | Aleksandra Krasnodębska | Karolina Piosek | Katarzyna Bogusz | Sebastian Cygert | Wojciech Kusa
Grzegorz Statkiewicz | Alicja Dobrzeniecka | Karolina Seweryn | Aleksandra Krasnodębska | Karolina Piosek | Katarzyna Bogusz | Sebastian Cygert | Wojciech Kusa
Most vision-language models (VLMs) are trained on English-centric data, limiting their performance in other languages and cultural contexts. This restricts their usability for non-English-speaking users and hinders the development of multimodal systems that reflect diverse linguistic and cultural realities. In this work, we reproduce and adapt the LLaVA-Next methodology to create a set of Polish VLMs. We rely on a fully automated pipeline for translating and filtering existing multimodal datasets, and complement this with synthetic Polish data for OCR and culturally specific tasks. Despite relying almost entirely on automatic translation and minimal manual intervention, our approach yields strong results: we observe a +9.5 pp improvement over LLaVA-1.6-Vicuna-13B on a Polish-adapted MMBench, along with higher-quality captions in generative evaluations, as measured by human annotators in terms of linguistic correctness. These findings highlight that large-scale automated translation, combined with lightweight filtering, can effectively bootstrap high-quality multimodal models for low-resource languages. Some challenges remain, particularly in cultural coverage and evaluation. To facilitate further research, we release our models, code, and datasets.
A Computational Forensic Linguistic Analysis of Narrative and Question-Answer Structures in Italian Police Interrogation Transcripts
Romane Werner | Thomas François | Sonja Bitzer
Romane Werner | Thomas François | Sonja Bitzer
Police interrogation transcripts are key evidential documents, yet their linguistic form is rarely systematically analyzed, despite directly shaping judicial interpretation. This study presents the first computational forensic linguistic profiling of Italian police transcripts, focusing on the two transcription formats used in practice: narrative monologues and question-answer (Q-A) transcripts. Using automated extraction of 147 linguistic features, we analyze 50 authentic transcripts against a multi-genre Italian reference corpus to support more transparent evaluation of police transcripts by clarifying how transcription formats systematically shape evidential interpretation in judicial contexts. Narrative monologues exhibit deeper syntactic embedding, higher past-tense usage, and more first-person singular verbs, supporting coherent and temporally ordered recounting of events. Q-A transcripts, by contrast, show longer subordinate chains, more clausal complements, and higher pronoun frequency, reflecting interactive turn-taking and procedural dynamics. Rather than aiming at predictive classification, the study reveals the linguistic mechanisms shaping transcription formats and demonstrates that structurally and legally informed features reliably distinguish them. Computational models reliably capture genre-specific cues, offering scalable, empirically grounded insights into transcription practices and evidential reliability.
Thesis Proposal: A Multi-Agent System for Ontology-Based Perspective-Aware Knowledge Extraction
Luiz Do Valle Miranda | Grzegorz J. Nalepa
Luiz Do Valle Miranda | Grzegorz J. Nalepa
This thesis investigates how polyvocal ontologies and Large Language Model (LLM) based Multi-Agent Systems (MAS) can operationalize perspective-aware knowledge extraction, preserving conflicting stakeholder interpretations as epistemically separable, queryable Knowledge Graphs (KGs). Current AI systems consolidate multiple perspectives into singular, decontextualized schemas, introducing representational bias and information loss. We propose a systematic framework addressing three interconnected research questions: (1) how to generate polyvocal ontology design patterns for high-stakes domains; (2) how to architect LLM-based MAS that extract perspective-conditioned facts while maintaining schema coherence and provenance traceability; and (3) whether such extractions achieve semantic diversity without sacrificing KG integrity. Evaluation is proposed on medical datasets, conducted with domain experts, to demonstrate the feasibility of perspective-aware extraction as a principled alternative to consensus-oriented KGs. Expected contributions include polyvocal ontology patterns, an ontology-orchestrated MAS extraction framework with auditable provenance, and empirical validation.
Fake News Detection Strategies under Dataset Bias: Using Large-scale Coarse-grained Labels
Yuki Kishi | Yuji Arima | Hitoshi Iyatomi
Yuki Kishi | Yuji Arima | Hitoshi Iyatomi
The spread of misinformation has prompted extensive research on machine-learning–based fake news detection. However, existing datasets differ substantially in content distributions and annotation policies, complicating fair evaluation and generalization assessment. We refer to these structural differences as dataset bias. In this study, we quantitatively analyze dataset bias across multiple public fake news datasets (Kaggle, FNN, ISOT, and NELA-GT-2019/2020) with different annotation granularities, including article-level and publisher-level labels. Using document embedding–based similarity analysis and article category distributions, we examine how such biases affect detection performance under in-dataset and cross-dataset evaluation settings. Furthermore, to leverage large-scale but coarse-grained publisher-level data, we compare proxy-label training with a semi-supervised learning (SSL) approach based on Virtual Adversarial Training (VAT). Our results show that detection performance strongly depends on dataset-specific biases, and that proxy-label training and SSL exhibit complementary, and sometimes opposite, strengths depending on whether the evaluation emphasizes in-dataset performance or cross-dataset generalization. These findings highlight the importance of appropriate training strategies and evaluation protocols when using heterogeneous fake news datasets.
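Virtual Adversarial Training is the key semi-supervised ingredient mentioned above: it penalizes prediction changes under a small adversarial perturbation and needs no labels, which is what makes coarse publisher-level corpora usable. The sketch below follows the standard Miyato et al. formulation; the hyperparameters are generic defaults, and in a text setting the perturbation would usually be applied to embeddings rather than raw inputs.

```python
import torch
import torch.nn.functional as F

def vat_loss(model, x, xi=1e-6, eps=2.0, n_power=1):
    """Virtual Adversarial Training loss: KL divergence between predictions
    on x and on x plus a small adversarial perturbation found by power
    iteration. x is a batched float tensor (e.g. embeddings)."""
    with torch.no_grad():
        p = F.softmax(model(x), dim=-1)
    d = torch.randn_like(x)
    for _ in range(n_power):
        norm = d.flatten(1).norm(dim=1).view(-1, *([1] * (x.dim() - 1)))
        d = xi * d / (norm + 1e-12)
        d.requires_grad_(True)
        kl = F.kl_div(F.log_softmax(model(x + d), dim=-1), p, reduction="batchmean")
        (grad,) = torch.autograd.grad(kl, d)
        d = grad.detach()
    norm = d.flatten(1).norm(dim=1).view(-1, *([1] * (x.dim() - 1)))
    r_adv = eps * d / (norm + 1e-12)
    return F.kl_div(F.log_softmax(model(x + r_adv), dim=-1), p, reduction="batchmean")
```

Adding this term, weighted by a coefficient, to the supervised loss on the labeled subset gives the SSL objective in its usual form.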
DRAGOn: Designing RAG On Periodically Updated Corpus
Fedor Chernogorskii | Sergei Averkiev | Liliya Kudraleeva | Zaven Martirosian | Maria Tikhonova | Valentin Malykh | Alena Fenogenova
Fedor Chernogorskii | Sergei Averkiev | Liliya Kudraleeva | Zaven Martirosian | Maria Tikhonova | Valentin Malykh | Alena Fenogenova
This paper introduces DRAGOn, a method for designing a RAG benchmark on a regularly updated corpus. It features recent reference datasets, a question generation framework, an automatic evaluation pipeline, and a public leaderboard. Specified reference datasets allow for uniform comparison of RAG systems, while newly generated dataset versions mitigate data leakage and ensure that all models are evaluated on unseen, comparable data. The pipeline for automatic question generation extracts the Knowledge Graph from the text corpus and produces multiple question-answer pairs utilizing modern LLM capabilities. A set of diverse LLM-as-Judge metrics is provided for a comprehensive model evaluation. We used Russian news outlets to form the datasets and demonstrate our methodology. We launch a public leaderboard to track the development of RAG systems and encourage community participation.
Training a language model for low-resource languages is challenging due to data scarcity and computational cost. Tokenizer transfer offers a way to adapt a pre-trained model to a new tokenizer without full retraining, improving efficiency and cross-lingual applicability. To the best of our knowledge, we present the first controlled evaluation of tokenizer transfer methods, Orthogonal Mapping Pursuit (OMP) and Fast Vocabulary Transfer (FVT), on monolingually pretrained base models trained on language-specific corpora, across six languages and multiple finetuning regimes. Using the Goldfish model family, we use byte-normalized log-perplexity and MultiBlimp accuracy to evaluate target-language adaptability, source-language retention, and the interaction between transfer and monolingual or mixed finetuning. OMP with monolingual target finetuning yields the best target-language scores (lower log-perplexity and higher MultiBlimp) among our evaluated conditions, compared with (i) a model trained only on the source language, (ii) a model trained on a smaller amount of target-language data, and (iii) the source language model adapted via standard finetuning on the target data. The results suggest tokenizer transfer is a compute-efficient alternative for low-resource LM training: train a monolingual tokenizer for the target language, transfer it to a larger pre-trained model, and fine-tune using the target data.
Learning Nested Named Entity Recognition from Flat Annotations
Igor Rozhkov | Natalia V Loukachevitch
Igor Rozhkov | Natalia V Loukachevitch
Nested named entity recognition identifies entities contained within other entities, but requires expensive multi-level annotation. While flat NER corpora exist abundantly, nested resources remain scarce. We investigate whether models can learn nested structure from flat annotations alone, evaluating four approaches: string inclusions (substring matching), entity corruption (pseudo-nested data), flat neutralization (reducing false negative signal), and a hybrid fine-tuned + LLM pipeline. On NEREL, a Russian benchmark with 29 entity types where 21% of entities are nested, our best combined method achieves 26.37% inner F1, closing 40% of the gap to full nested supervision. Code is available at https://github.com/fulstock/Learning-from-Flat-Annotations.
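The string-inclusion idea in this abstract, deriving candidate inner entities from flat annotations by substring matching, is straightforward to sketch. The example below is illustrative only: the gazetteer, label names, and the simple filtering rule are assumptions, and the paper's full pipeline (entity corruption, flat neutralization, LLM post-processing) is not shown.

```python
def string_inclusion_spans(text, flat_spans, gazetteer):
    """Derive candidate inner entities from flat annotations: any gazetteer
    entry found strictly inside a flat entity span becomes a pseudo-nested
    annotation. gazetteer entries are (surface_form, label) pairs."""
    nested = []
    for start, end, label in flat_spans:
        outer = text[start:end]
        for surface, inner_label in gazetteer:
            pos = outer.find(surface)
            if pos != -1 and len(surface) < len(outer):
                nested.append((start + pos, start + pos + len(surface), inner_label))
    return nested

text = "Moscow State University announced a new program."
flat = [(0, 23, "ORGANIZATION")]
gaz = [("Moscow", "CITY")]
print(string_inclusion_spans(text, flat, gaz))   # [(0, 6, 'CITY')]
```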
Analysing LLM Persona Generation and Fairness Interpretation in Polarised Geopolitical Contexts
Maida Aizaz | Quang Minh Nguyen
Maida Aizaz | Quang Minh Nguyen
Large language models (LLMs) are increasingly utilised for social simulation and persona generation, necessitating an understanding of how they represent geopolitical identities. In this paper, we analyse personas generated for Palestinian and Israeli identities by five popular LLMs across 640 experimental conditions, varying context (war vs non-war) and assigned roles. We observe significant distributional patterns in the generated attributes: Palestinian profiles in war contexts are frequently associated with lower socioeconomic status and survival-oriented roles, whereas Israeli profiles predominantly retain middle-class status and specialised professional attributes. When prompted with explicit instructions to avoid harmful assumptions, models exhibit diverse distributional changes, e.g., marked increases in non-binary gender inferences or a convergence toward generic occupational roles (e.g., "student"), while the underlying socioeconomic distinctions often remain. Furthermore, analysis of reasoning traces reveals an interesting dynamic between model reasoning and generation: while rationales consistently mention fairness-related concepts, the final generated personas follow the aforementioned diverse distributional changes. These findings illustrate how models interpret geopolitical contexts, while suggesting that they process fairness and adjust in varied ways; there is no consistent, direct translation of fairness concepts into representative outcomes.
Beyond Bias Scores: Unmasking Vacuous Neutrality in Small Language Models
Sumanth Manduru | Carlotta Domeniconi
Sumanth Manduru | Carlotta Domeniconi
The rapid adoption of Small Language Models (SLMs) for resource constrained applications has outpaced our understanding of their ethical and fairness implications. To address this gap, we introduce the Vacuous Neutrality Framework (VaNeu), a multi-dimensional evaluation paradigm designed to assess SLM fairness prior to deployment. The framework examines model robustness across four stages - biases, utility, ambiguity handling, and positional bias over diverse social bias categories. To the best of our knowledge, this work presents the first large-scale audit of SLMs in the 0.5–5B parameter range, an overlooked “middle tier” between BERT-class encoders and flagship LLMs. We evaluate nine widely used SLMs spanning four model families under both ambiguous and disambiguated contexts. Our findings show that models demonstrating low bias in early stages often fail subsequent evaluations, revealing hidden vulnerabilities and unreliable reasoning. These results underscore the need for a more comprehensive understanding of fairness and reliability in SLMs, and position the proposed framework as a principled tool for responsible deployment in socially sensitive settings. The code is available at: https://github.com/smanduru10/Vacuous-Neutrality-Framework.git.
From Detection to Explanation: Modeling Fine-Grained Emotional Social Influence Techniques with LLMs and Human Preferences
Maciej Markiewicz | Wiktoria Mieleszczenko-Kowszewicz | Beata Bajcar | Tomasz Adamczyk | Aleksander Szczęsny | Jolanta Babiak | Przemyslaw Kazienko
Maciej Markiewicz | Wiktoria Mieleszczenko-Kowszewicz | Beata Bajcar | Tomasz Adamczyk | Aleksander Szczęsny | Jolanta Babiak | Przemyslaw Kazienko
This paper investigates the capabilities of LLMs to detect and explain fine-grained emotional social influence techniques in textual dialogues, as well as human preferences for technique explanations. We present findings from our two studies. In Study 1, a dataset of 238 Polish dialogues is introduced, each annotated with detailed span-level labels. On this data, we evaluate the performance of LLMs on two tasks: detecting 11 emotional social influence techniques and identifying text spans corresponding to specific techniques. The results indicate that current LLMs demonstrate limited effectiveness in accurately detecting fine-grained emotional social influence. In Study 2, we examine various LLM-generated explanations through human pairwise preferences and four criteria: comprehensibility, cognitive coherence, completeness, and soundness, with the latter two emerging as the most influential on general human preference. All data, including human annotations, are publicly available as the EmoSocInflu dataset (https://github.com/social-influence/emo-soc-influ). Our findings highlight a critical need for further advancement in the field. As LLM-supported manipulation grows, it is essential to promote public understanding of social influence mechanisms, enabling individuals to critically recognize and interpret the subtle forms of manipulation that shape public opinion.
How Do Lexical Senses Correspond Between Spoken German and German Sign Language?
Melis Çelikkol | Wei Zhao
Melis Çelikkol | Wei Zhao
Sign language lexicographers construct bilingual dictionaries by establishing word-to-sign mappings, where polysemous and homonymous words corresponding to different signs across contexts are often underrepresented. A usage-based approach examining how word senses map to signs can identify such novel mappings absent from current dictionaries, enriching lexicographic resources. We address this by analyzing German and German Sign Language (Deutsche Gebärdensprache, DGS), manually annotating 1,404 word use–to–sign ID mappings derived from 32 words from the German Word Usage Graph (D-WUG) and 49 signs from the Digital Dictionary of German Sign Language (DW-DGS). We identify three correspondence types: Type 1 (one-to-many), Type 2 (many-to-one), and Type 3 (one-to-one), plus No Match cases. We evaluate computational methods: Exact Match (EM) and Semantic Similarity (SS) using SBERT embeddings. SS substantially outperforms EM overall (88.52% vs. 71.31%), with dramatic gains for Type 1 (+52.1 pp). Our work establishes the first annotated dataset for cross-modal sense correspondence and reveals which correspondence patterns are computationally tractable. Our code and dataset are made publicly available.
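An illustrative sketch of Semantic Similarity (SS) mapping with SBERT embeddings, as contrasted with Exact Match above: a word-use sentence is compared against candidate sign glosses by cosine similarity and the closest gloss wins. The model name, example sentence, and glosses are placeholders, not the paper's actual resources.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical German word-use sentence and candidate sign descriptions.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

word_use = "Die Maus lief über den Küchenboden."          # animal sense
sign_glosses = {
    "MAUS1 (animal)": "kleines Nagetier mit langem Schwanz",
    "MAUS2 (computer)": "Eingabegerät zur Steuerung des Cursors",
}

use_emb = model.encode(word_use, convert_to_tensor=True)
gloss_embs = model.encode(list(sign_glosses.values()), convert_to_tensor=True)
scores = util.cos_sim(use_emb, gloss_embs)[0]

# Pick the sign whose gloss is semantically closest to the usage example.
best = max(zip(sign_glosses, scores), key=lambda kv: float(kv[1]))
print(best)
```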
Evaluating Cost-Efficiency of LLMs in a RAG Setup on Polish Wikipedia: Quality vs. Energy Consumption
Patrycja Smits | Tomasz Walkowiak
Patrycja Smits | Tomasz Walkowiak
Retrieval-augmented generation has become the dominant paradigm for deploying large language models in knowledge-intensive applications, yet practitioners lack guidance on model selection when both quality and costs matter. We evaluate language models from 4B to 70B parameters, including the PLLuM and Bielik families of Polish LLMs, within a Polish Wikipedia-based RAG pipeline. Quality assessment uses GPT-4o pairwise comparison across 1,000 PolQA questions with bias mitigation and Bradley-Terry ranking, while energy measurements capture inference costs on NVIDIA H100 hardware. Our findings challenge conventional scaling assumptions: parameter scaling beyond 12B offers minimal quality gains, with mid-size PLLuM-12 matching 70B performance while reducing energy consumption by 83%.
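A small sketch of Bradley-Terry ranking from pairwise comparison outcomes, as used for the GPT-4o judgments above. The win matrix is hypothetical and the iterative MM update shown is a generic textbook implementation, not the authors' code.

```python
import numpy as np

def bradley_terry(wins, iters=200):
    """wins[i][j] = number of times model i beat model j in pairwise
    comparisons. Returns normalized strength scores via the standard
    minorization-maximization (MM) updates."""
    n = wins.shape[0]
    p = np.ones(n)
    comparisons = wins + wins.T                 # n_ij: total duels between i and j
    total_wins = wins.sum(axis=1)               # W_i: total wins of model i
    for _ in range(iters):
        denom = np.array([
            sum(comparisons[i, j] / (p[i] + p[j]) for j in range(n) if j != i)
            for i in range(n)
        ])
        p = total_wins / denom
        p /= p.sum()                            # normalize for identifiability
    return p

# Hypothetical win matrix for three models judged by pairwise comparison.
wins = np.array([[0, 30, 45],
                 [20, 0, 40],
                 [5, 10, 0]], dtype=float)
print(bradley_terry(wins))
```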
This thesis proposal addresses methodological gaps in applying NLP to social science by shifting from categorical classification to comparative scaling of grounded constructs. We first extend predictive capacity on existing specialized political datasets with prompt optimization and distillation approaches. We then develop an active learning framework for efficient comparative annotation to scale latent dimensions from large corpora. Finally, we apply this pipeline to measure benevolent sexism in Slovenian media and migration threat perception in parliamentary discourse. This work establishes a scalable workflow for moving NLP from ad-hoc classification to theoretically grounded comparative measurement.
Policy-gradient reinforcement learning (RL) is widely used to improve language model reasoning, but existing methods are not compatible with diffusion language models. The primary reason for this is the difficulty of likelihood estimation with such models. We propose EMBR, a scalable off-policy framework that reformulates KL-regularized RL as an energy-based distribution matching problem. By aligning policy updates with reward signals through energy matching, EMBR avoids the overhead of on-policy learning and the variance of importance weighting. We further derive a principled upper bound for the energy matching objective which can be used to fine-tune dLLMs. Experiments on multiple benchmarks in both online and offline settings show that EMBR matches or surpasses the performance of diffu-GRPO and related baselines in the online case, and of DPO in the offline case. Our approach provides a practical alternative for post-training of diffusion LMs.
Thesis Proposal: Stability-Aware, Evidence-Grounded Knowledge Graph for Substance Use Disorders and Social Determinants of Health
Gautham Vijay Kumar
Gautham Vijay Kumar
Clinical Natural Language Processing (NLP) integrates large language models (LLMs) to extract biomedical insights from unstructured clinical text. Most named entity recognition (NER) and relation extraction (RE) datasets rely on manual annotation, which is costly and difficult to scale. Many biomedical knowledge graphs (KGs) suffer from underspecified relations and conflated causal and correlational claims, and their edges lack evidence for reasoning. This dissertation presents a semantic stability framework for constructing explainable KGs, highlighting stable extraction as fundamental for scalable NER and RE, and essential for graph structure. We applied this framework to Substance Use Disorders (SUD) and Social Determinants of Health (SDOH), using a PubMed corpus and an NER and RE annotation guide. Multiple LLMs perform extraction under shared semantic constraints, with disagreements resolved through Human-in-the-Loop (HITL) validation. We define semantic stability through NER and RE metrics, using stabilized gold data for model training and evaluation. We then develop a claim-centered KG, where edges represent evidence, provenance, relation type, directionality, polarity, and stability indicators. This benchmark and pipeline support multi-hop reasoning, triadic SUD–SDOH–SUD mediation patterns, and feedback loop analysis. This will advance etiological inquiries and data-driven health policy analysis.
Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation
Julia Belikova | Danila Rozhevskii | Dennis Svirin | Konstantin Polev | Alexander Panchenko
Julia Belikova | Danila Rozhevskii | Dennis Svirin | Konstantin Polev | Alexander Panchenko
Efficient long-context processing remains a crucial challenge for contemporary large language models (LLMs), especially in resource-constrained environments. Soft compression architectures promise to extend effective context length by replacing long token sequences with smaller sets of learned compressed tokens. Yet, the limits of compressibility – and when compression begins to erase task-relevant content – remain underexplored. In this paper, we define token overflow as a regime in which compressed representations no longer contain sufficient information to answer a given query, and propose a methodology to characterize and detect it. In the xRAG soft-compression setting, we find that query-agnostic saturation statistics reliably separate compressed from uncompressed token representations, providing a practical tool for identifying compressed tokens but showing limited overflow detection capability. Lightweight probing classifiers over both query and context xRAG representations detect overflow with 0.72 AUC-ROC on average on HotpotQA, SQuADv2, and TriviaQA datasets, demonstrating that incorporating query information improves detection performance. These results advance from query-independent diagnostics to query-aware detectors, enabling low-cost pre-LLM gating to mitigate compression-induced errors.
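A minimal sketch of a lightweight probing classifier over concatenated query and context representations, evaluated with AUC-ROC as above. The synthetic vectors and labels below are placeholders standing in for real xRAG representations and overflow annotations; the point is only the probe-and-score pattern.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins: one compressed context vector and one query vector
# per example, plus a binary label for whether the compressed context still
# suffices to answer the query (overflow = 1 means it does not).
rng = np.random.default_rng(0)
ctx = rng.normal(size=(2000, 64))
qry = rng.normal(size=(2000, 64))
overflow = (np.linalg.norm(ctx, axis=1) + 0.5 * rng.normal(size=2000) > 8).astype(int)

X = np.concatenate([ctx, qry], axis=1)          # query-aware probe input
X_tr, X_te, y_tr, y_te = train_test_split(X, overflow, test_size=0.25, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUC-ROC:", roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]))
```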
lrnnx: A library for Linear RNNs
Karan Bania | Soham Kalburgi | Manit Tanwar | Dhruthi | Aditya Nagarsekar | Harshvardhan Mestha | Naman Chibber | Raj Deshmukh | Anish Sathyanarayanan | Aarush Rathore | Pratham Chheda
Karan Bania | Soham Kalburgi | Manit Tanwar | Dhruthi | Aditya Nagarsekar | Harshvardhan Mestha | Naman Chibber | Raj Deshmukh | Anish Sathyanarayanan | Aarush Rathore | Pratham Chheda
Automatic Generation of a Compositional QA Benchmark for Geospatial Reasoning under Spatial and Entity Constraints
Tetsuhisa Suizu | Shohei Higashiyama | Hiroyuki Shindo | Hiroki Ouchi | Sakriani Sakti
Tetsuhisa Suizu | Shohei Higashiyama | Hiroyuki Shindo | Hiroki Ouchi | Sakriani Sakti
Despite their recent success, the geospatial reasoning capabilities of large language models (LLMs)—which require understanding spatial relationships among real-world geo-entities—remain underexplored. We propose an automatic method for constructing compositional geographic question answering datasets that jointly consider spatial and entity constraints. The generated dataset serves as a principled benchmark for evaluating how LLMs coordinate spatial computation with entity-level understanding under diverse compositional settings. We evaluate two state-of-the-art LLMs, GPT-5.2 and Gemini 3 Flash, on our dataset. Experimental results show that while the models perform relatively well on questions involving rich entity grounding, their accuracy drops substantially on questions requiring precise quantitative spatial reasoning, such as distance estimation and containment judgment. Our dataset is publicly available for research and reproduction.
Thesis Proposal: Comparing Human and Model Perception of Writing Style under Controlled Perturbations
Ewelina Paulina Księżniak
Ewelina Paulina Księżniak
Writing style functions both as a vehicle of expression and as a marker of authorial identity. Stylometric methods enable automatic recognition of authors based on linguistic regularities, while recent advances in adversarial learning demonstrate how data can be intentionally modified to prevent models from learning usable representations. Yet it remains unclear whether such perturbations, designed to disrupt machine learning processes, also influence human perception of style. This thesis investigates how humans and models perceive writing style under controlled perturbations and whether manipulations that reduce algorithmic recognition likewise obscure stylistic identity for human readers. The study combines computational and behavioral approaches: constructing semantically controlled yet stylistically diverse text datasets, and conducting human evaluation experiments to compare recognition accuracy between models and readers. The results are expected to clarify how linguistic cues contribute differently to human and algorithmic perception of style and to inform broader applications in authorship analysis, privacy-preserving text transformation, and creative expression. By situating writing style as a dimension of information quality, the research contributes to understanding how authenticity, anonymity, and expressivity interact in digital communication.
Bring the Apple, Not the Sofa: Impact of Irrelevant Context in Embodied AI Commands on VLA Models
Andrey Moskalenko | Daria Pugacheva | Denis Shepelev | Andrey Kuznetsov | Vlad Shakhuro | Elena Tutubalina
Andrey Moskalenko | Daria Pugacheva | Denis Shepelev | Andrey Kuznetsov | Vlad Shakhuro | Elena Tutubalina
Vision Language Action (VLA) models are widely used in Embodied AI, enabling robots to interpret and execute language instructions. However, their robustness to natural language variability in real-world scenarios has not been thoroughly investigated. In this work, we present a novel systematic study of the robustness of state-of-the-art VLA models under linguistic perturbations. Specifically, we evaluate model performance under two types of instruction noise: (1) human-generated paraphrasing and (2) the addition of irrelevant context. We further categorize irrelevant contexts into two groups according to their length and their semantic and lexical proximity to robot commands. In this study, we observe consistent performance degradation as context size expands. We also demonstrate that the model can exhibit relative robustness to random context, with a performance drop within 10%, while semantically and lexically similar context of the same length can trigger a quality decline of around 50%. Human paraphrases of instructions lead to a drop of nearly 20%. Our results highlight a critical gap in the safety and efficiency of modern VLA models for real-world deployment.
An Evaluation of Classifiers for Mapping Generative LLM Responses to Answer Options of Multiple-choice Questionnaires
Alisea Stroligo | Anna Shamray | Julian Schelb | Andreas Spitz
Alisea Stroligo | Anna Shamray | Julian Schelb | Andreas Spitz
The use of large language models (LLMs) for generating responses to multiple-choice style questionnaires that were originally intended to be answered by humans is often a helpful or even necessary task, for example in persona simulation or during LLM alignment. Although the input and output versatility of generative LLMs is beneficial when adapting such questionnaires to machine use, it can be detrimental when mapping the generated text back to a closed set of possible answer options for evaluation or scoring. In this paper, we investigate the performance of smaller models for the classification of LLM outputs into the available answer options of multiple-choice questionnaires. We consider fine-tuned encoder-transformers as well as a rule-based approach on three datasets with differing answer option complexity. Surprisingly, we find that the best-performing neural approach still underperforms in comparison to our rule-based baseline, indicating that simple pattern-matching of answer options against LLM outputs might still be the most competitive solution for cleaning LLM responses to multiple-choice questionnaires.
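An illustrative sketch of the kind of rule-based pattern matching the baseline above suggests, mapping a free-form LLM response onto a closed option set. The options and matching rules here are hypothetical, not the paper's exact rules.

```python
import re

def match_option(llm_output, options):
    """Map a free-form LLM response to one of the closed answer options via
    normalized pattern matching; returns None when no single option matches."""
    text = llm_output.lower()
    hits = []
    for option in options:
        pattern = r"\b" + re.escape(option.lower()) + r"\b"
        if re.search(pattern, text):
            hits.append(option)
    return hits[0] if len(hits) == 1 else None

options = ["Strongly agree", "Agree", "Disagree", "Strongly disagree"]
print(match_option("I would say I agree with this statement.", options))
# -> "Agree"  (real rule sets typically also check longer options first so
#    that "strongly agree" is not double-counted as a plain "agree")
```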
Research on bot detection in social media exhibits imbalance in several areas — across platforms, languages, and detection levels. Addressing these gaps, this study focuses on comment-level bot detection within Polish Reddit communities. We describe in detail the construction of a comprehensive dataset (~40,000 comments, 58% bot-comment prevalence), which provides labels for the subsequent model training. Polish Reddit is inherently multilingual; we therefore take advantage of linguistic signals, treating the language composition of a comment as a feature in its own right. We develop novel platform-specific, language-specific, and culturally informed features, and train comment-level classifiers from multiple model families on the manually annotated dataset. The resulting models achieve strong performance and temporal generalization to 2025 data. We analyze the importance and direction of these novel features across models and report that our 'cross-level' interaction features, 'Bottiquette' compliance signals, formatting markers, language indicators, repetition and randomness measures — especially the entropy of non-alphabetic characters — rank among the most decisive features. Finally, we complement our quantitative findings with a qualitative characterization of the Polish Reddit bot ecosystem. Overall, this study provides an important baseline for an underexplored setting and contributes to an open discussion on how to approach detection where data is linguistically mixed.
What the Router Sees Matters: Funnel Pooling for Fast, Content Driven Expert Routing
Josef Pichlmeier | Sebastian Nicolas Mueller | Jakob Sturm | Josef Dräxl | Andre Luckow
Josef Pichlmeier | Sebastian Nicolas Mueller | Jakob Sturm | Josef Dräxl | Andre Luckow
Modern large language model (LLM) systems frequently route inputs to specialized experts to improve accuracy, efficiency, and robustness. Routers determine which expert to activate based on the input, typically represented as a single vector. The construction of this vector limits the distinctions the router can make. Prior work rarely isolates how this vector representation affects routing behavior. We isolate the role of the representation by holding the routing pipeline fixed and varying only how this representation is formed in multilingual settings. We find that representation choice systematically reshapes the available routing partitions. In multilingual routing settings, the router's single-vector input often encodes only shallow features (language/format), resulting in domains that are organized by these features rather than by topic. To mitigate this, we introduce Funnel pooling, a lightweight trainable in-model readout that constructs the routing vector directly from token-level hidden states and does not require a separate embedding encoder. Funnel pooling reduces language- and source-dataset-driven clustering and results in more topic-aligned domains. Despite this shift, downstream routing performance remains competitive while introducing only minor inference overhead.
TimeRes: A Turkish Benchmark For Evaluating Temporal Understanding of Large Language Models
Habib Yağız Demir | Ümit Atlamaz | Susan Üsküdarlı
Habib Yağız Demir | Ümit Atlamaz | Susan Üsküdarlı
Temporal information is an essential part of communication, and understanding language requires processing it effectively. Despite recent advances, Large Language Models (LLMs) still struggle with temporal understanding. Existing benchmarks primarily focus on English and underexplore how linguistic structure contributes to temporal meaning. As a result, temporal understanding in languages other than English remains largely understudied. In this paper, we introduce TimeRes, a Turkish benchmark for evaluating temporal understanding of LLMs. TimeRes aims to investigate comprehension of Reichenbach’s temporal points and reported speech through date arithmetic. Our dataset includes 4,600 questions across 4 tasks at two levels of complexity, and presents a paired question formulation to distinguish temporal discourse understanding from temporal arithmetic capabilities. We evaluated six LLMs, and demonstrated that models struggle to resolve reported speech and fail to generalize across word order variations.
Hospitality-VQA: Decision-Oriented Informativeness Evaluation for Vision–Language Models
Jeongwoo Lee | Baek Duhyeong | Eungyeol Han | Soyeon Shin | Gukin Han | Seungduk Kim | Jaehyun Jeon | Taewoo Jeong
Jeongwoo Lee | Baek Duhyeong | Eungyeol Han | Soyeon Shin | Gukin Han | Seungduk Kim | Jaehyun Jeon | Taewoo Jeong
Colorism in Multimodal AI: An Empirical Exploration of Socioeconomic Linguistic Bias in Text-to-Image Generation
Raj Gaurav Maurya | Vaibhav Shukla | Sreedath Panat
Raj Gaurav Maurya | Vaibhav Shukla | Sreedath Panat
The recent rapid real-world adoption of vision-language models (VLMs) raises concerns about how social biases encoded in language may propagate into visual generation. In this work, we examine whether socioeconomic stereotypes, expressed through occupation and income-related linguistic cues in prompts, systematically influence skin-tone representations in text-to-image (T2I) generation, with a focus on colorism as a visual marker of social inequality. We first benchmark 3 small VLMs and 60 human annotators on the Monk Skin Tone (MST) scale using the MST-E dataset. We then conduct a large-scale T2I generation study in which we systematically vary the linguistic framing of income in prompts describing 210 occupations, producing over 2,500 portraits across 3 large VLMs. The skin-tone audit of the portraits by the best-performing annotator (GPT-5 mini) reveals strong color bias: high-income prompts consistently produce lighter-skinned faces, with prompt constraints only modestly attenuating this effect. Bias magnitude varies across generators, with GPT-5 Image-mini and Gemini-2.5 Flash-Image exhibiting more pronounced shifts in MST than Grok-2 Image. Our findings indicate that VLMs encode and amplify ethnoracialized socioeconomic stereotypes in language-conditioned image generation, underscoring the need for cross-modal fairness audits and human-centered evaluations.
Active Learning for Corpus Refinement: Cost-Effective Preprocessing to Improve Validity of Applied Quantitative Text Analysis
Jakob Steglich | Stephan Poppe
Jakob Steglich | Stephan Poppe
Quantitative text analysis relies on high-quality corpora, but keyword-based collection often retrieves irrelevant material, undermining validity. We show that active learning with a transformer-based classifier can iteratively refine corpora by excluding irrelevant documents, prompting researchers to clarify inclusion criteria and address edge cases. Applied to German newspaper articles on depression and schizophrenia, this approach improves construct validity and reduces labeling effort. The document relevance classifiers reached an F1-score of 0.8 with just 100–150 labeled snippets, with further gains from tuning, outperforming both random sampling and a weakly supervised sampling baseline. Filtering non-medical articles further had little effect on downstream depression stigmatization measures but increased schizophrenia stigmatization. Active learning thus enables efficient corpus validation and clearer concept boundaries with minimal preprocessing. The source code is publicly available at https://github.com/jakobstgl/active-learning-corpus-refinement.
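A minimal sketch of one active-learning round with uncertainty sampling for corpus relevance filtering, in the spirit of the approach above. The linear classifier and example snippets are stand-ins for illustration only; the paper fine-tunes a transformer classifier.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def active_learning_round(labeled_texts, labels, unlabeled_texts, budget=20):
    """Train a relevance classifier on the labeled pool and return the indices
    of the most uncertain unlabeled snippets to send to the annotator next."""
    vec = TfidfVectorizer(min_df=1)
    X_lab = vec.fit_transform(labeled_texts)
    clf = LogisticRegression(max_iter=1000).fit(X_lab, labels)

    X_unlab = vec.transform(unlabeled_texts)
    proba = clf.predict_proba(X_unlab)[:, 1]
    uncertainty = -np.abs(proba - 0.5)              # closest to the boundary first
    query_idx = np.argsort(uncertainty)[-budget:]   # annotate these next
    return query_idx, clf

# Hypothetical toy pool: 1 = relevant (medical), 0 = irrelevant.
labeled = ["clinical study on depression treatment", "movie review about a thriller"]
labels = [1, 0]
unlabeled = ["new antidepressant trial results",
             "celebrity interview on a depression-themed film",
             "stock market update"]
idx, _ = active_learning_round(labeled, labels, unlabeled, budget=1)
print([unlabeled[i] for i in idx])   # the most ambiguous snippet goes to the annotator
```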
The ability of AI systems to not only answer complex natural language questions, but also transparently justify their reasoning, is crucial for building trust and enabling effective human-AI collaboration. In domains requiring multi-hop reasoning, answers must often be constructed by combining multiple relevant sentences from a knowledge base to build an inferential path from the question toward the answer. We tackle this challenge by exploring a neuro-symbolic approach to reasoning through the generation of entailment trees – structured, step-by-step proof trees – using Large Language Models (LLMs). These trees provide interpretable justifications for the inference process. Using the EntailmentBank (CITATION) dataset, we evaluated a diverse set of prompting strategies across multiple models and propose an inference-guided prompting approach that performs well. We also fine-tuned LLMs specifically for proof generation, applying several data augmentation, curriculum learning, and reinforcement-guided optimization strategies. Our results show that the fine-tuned model outperforms all prompting strategies, achieving superior performance across multiple structural and semantic metrics. We also provide a detailed evaluation of which training strategies are helpful towards proof generation. Our findings highlight the importance of proof tree generation as a benchmark for evaluating structured reasoning in LLMs.
up
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)
Yevgen Matusevych | Gülşen Eryiğit | Nikolaos Aletras
Yevgen Matusevych | Gülşen Eryiğit | Nikolaos Aletras
Iterative Structured Pruning for Large Language Models with Multi-Domain Calibration
Guangxin Wu | Hao Zhang | Zhang Zhibin | Jiafeng Guo | Xueqi Cheng
Guangxin Wu | Hao Zhang | Zhang Zhibin | Jiafeng Guo | Xueqi Cheng
Large Language Models (LLMs) have achieved remarkable success across a wide spectrum of natural language processing tasks. However, their ever-growing scale introduces significant barriers to real-world deployment, including substantial computational overhead, memory footprint, and inference latency. While model pruning presents a viable solution to these challenges, existing unstructured pruning techniques often yield irregular sparsity patterns that necessitate specialized hardware or software support. In this work, we explore structured pruning, which eliminates entire architectural components and maintains compatibility with standard hardware accelerators. We introduce a novel structured pruning framework that leverages a hybrid multi-domain calibration set and an iterative calibration strategy to effectively identify and remove redundant channels. Extensive experiments on various models across diverse downstream tasks show that our approach achieves significant compression with minimal performance degradation.
SCRIPTMIND: Crime Script Inference and Cognitive Evaluation for LLM-based Social Engineering Scam Detection System
Heedou Kim | Changsik Kim | Sanghwa Shin | Jaewoo Kang
Heedou Kim | Changsik Kim | Sanghwa Shin | Jaewoo Kang
Social engineering scams increasingly employ personalized, multi-turn deception, exposing the limits of traditional detection methods. While Large Language Models (LLMs) show promise in identifying deception, their cognitive assistance potential remains underexplored. We propose ScriptMind, an integrated framework for LLM-based scam detection that bridges automated reasoning and human cognition. It comprises three components: the Crime Script Inference Task (CSIT) for scam reasoning, the Crime Script–Aware Inference Dataset (CSID) for fine-tuning small LLMs, and the Cognitive Simulation-based Evaluation of Social Engineering Defense (CSED) for assessing real-time cognitive impact. Using 571 Korean scam cases, we built 22,712 structured scammer-sequence training instances. Experimental results show that the 11B small LLM fine-tuned with ScriptMind outperformed GPT-4o by 13%, achieving superior performance over commercial models in detection accuracy, false-positive reduction, scammer utterance prediction, and rationale quality. Moreover, in phone scam simulation experiments, it significantly enhanced and sustained users’ suspicion levels, improving their cognitive awareness of scams. ScriptMind represents a step toward human-centered, cognitively adaptive LLMs for scam defense.
From Paper to Structured JSON: An Agentic AI Workflow for Compliant BMR Digital Transformation
Bhavik Agarwal | Nidhi Bendre | Viktoria Rojkova
Bhavik Agarwal | Nidhi Bendre | Viktoria Rojkova
An agentic AI workflow converts noisy pharmaceutical batch records into validated JSON using hybrid OCR, vision–language models, and schema-guided LLMs, cutting QA review from hours to minutes while preserving GMP-critical structure.
Compact Multimodal Language Models as Robust OCR Alternatives for Noisy Textual Clinical Reports
Nikita Neveditsin | Pawan Lingras | Salil Patil | Swarup Patil | Vijay Kumar Mago
Nikita Neveditsin | Pawan Lingras | Salil Patil | Swarup Patil | Vijay Kumar Mago
Digitization of medical records often relies on smartphone photographs of printed reports, producing images degraded by blur, shadows, and other noise. Conventional OCR systems, optimized for clean scans, perform poorly under such real-world conditions. This study evaluates compact multimodal language models as privacy-preserving alternatives for transcribing noisy clinical documents. Using obstetric ultrasound reports written in regionally inflected medical English common to Indian healthcare settings, we compare eight systems in terms of transcription accuracy, noise sensitivity, numeric accuracy, and computational efficiency. Compact multimodal models consistently outperform both classical and neural OCR pipelines. Despite higher computational costs, their robustness and linguistic adaptability position them as viable candidates for on-premises healthcare digitization.
PersonaTrace: Synthesizing Realistic Digital Footprints with LLM Agents
Minjia Wang | Yunfeng Wang | Xiao Ma | Dexin Lv | Qifan Guo | Lynn Zheng | Benliang Wang | Lei Wang | Jiannan Li | Yongwei Xing | Junzhe Xu | Zheng Sun
Minjia Wang | Yunfeng Wang | Xiao Ma | Dexin Lv | Qifan Guo | Lynn Zheng | Benliang Wang | Lei Wang | Jiannan Li | Yongwei Xing | Junzhe Xu | Zheng Sun
Digital footprints—records of individuals’ interactions with digital systems—are essential for studying behavior, developing personalized applications, and training machine learning models. However, research in this area is often hindered by the scarcity of diverse and accessible data. To address this limitation, we propose a novel method for synthesizing realistic digital footprints using large language model (LLM) agents. Starting from a structured user profile, our approach generates diverse and plausible sequences of user events, ultimately producing corresponding digital artifacts such as emails, messages, calendar entries, reminders, etc. Intrinsic evaluation results demonstrate that the generated dataset is more diverse and realistic than existing baselines. Moreover, models fine-tuned on our synthetic data outperform those trained on other synthetic datasets when evaluated on real-world out-of-distribution tasks.
Evaluating the Pre-Consultation Ability of LLMs using Diagnostic Guidelines
Jean Seo | Gibaeg Kim | Kihun Shin | Seungseop Lim | Hyunkyung Lee | Wooseok Han | Jongwon Lee | Eunho Yang
Jean Seo | Gibaeg Kim | Kihun Shin | Seungseop Lim | Hyunkyung Lee | Wooseok Han | Jongwon Lee | Eunho Yang
We introduce EPAG, a benchmark dataset and framework designed for evaluating the pre-consultation ability of LLMs using diagnostic guidelines. LLMs are evaluated directly through HPI-diagnostic guideline comparison and indirectly through disease diagnosis. In our experiments, we observe that small open-source models fine-tuned with a well-curated, task-specific dataset can outperform frontier LLMs in pre-consultation. Additionally, we find that an increased amount of HPI (History of Present Illness) does not necessarily lead to improved diagnostic performance. Further experiments reveal that the language of pre-consultation influences the characteristics of the dialogue. By open-sourcing our dataset and evaluation pipeline on https://github.com/seemdog/EPAG, we aim to contribute to the evaluation and further development of LLM applications in real-world clinical settings.
SELENE: Selective and Evidence-Weighted LLM Debating for Efficient and Reliable Reasoning
Akshay Verma | Swapnil Gupta | Deepak Gupta | Prateek Sircar | Siddharth Pillai
Akshay Verma | Swapnil Gupta | Deepak Gupta | Prateek Sircar | Siddharth Pillai
Multi-Agent Debate (MAD) frameworks improve factual reliability in large language models (LLMs) by allowing agents to critique and refine one another’s reasoning. Yet, existing MAD systems are computationally expensive and prone to degradation under prolonged debates due to redundant exchanges and unstable judging. We propose a lightweight, industry-deployable alternative that unifies Selective Debate Initiation (SDI) with Evidence Weighted Self-Consistency (EWSC) for adaptive, debate-on-demand reasoning. SDI dynamically predicts when debate is necessary by detecting confidence-likelihood misalignment and semantic disagreement, skipping well-aligned queries to conserve computation. EWSC replaces a single-judge verdict with a variance-aware, evidence-weighted aggregation across paraphrased evaluations, yielding more stable factual judgments. Combined, SDI and EWSC reduce token consumption by nearly 50% while improving both accuracy and calibration. Evaluated on BoolQ, CosmosQA, and an internal QnA benchmark, our framework achieves higher factual robustness and efficiency, demonstrating that scalable, epistemically reliable multi-agent reasoning is practical for real-world LLM deployments.
SymPyBench: A Dynamic Benchmark for Scientific Reasoning with Executable Python Code
Shima Imani | Seungwhan Moon | Adel Ahmadyan | Lu Zhang | Ahmed Kirmani | Babak Damavandi
Shima Imani | Seungwhan Moon | Adel Ahmadyan | Lu Zhang | Ahmed Kirmani | Babak Damavandi
We introduce SymPyBench, a large-scale synthetic benchmark of 15K university-level physics problems (90/10% train/test split). Each problem is fully parameterized, supporting an effectively infinite range of input configurations, and is accompanied by structured, step-by-step reasoning and executable Python code that produces the ground-truth solution for any parameter set. The benchmark contains three question types: MC-Symbolic (multiple-choice with symbolic options), MC-Numerical (multiple-choice with numerical options), and free-form (open-ended responses). These diverse formats test complementary reasoning skills. In addition to standard accuracy, we introduce three new metrics: Consistency Score, Failure Rate, and Confusion Rate, that quantify variability and uncertainty across problem variants. Experiments with state-of-the-art instruction-tuned language models reveal both strengths and limitations in scientific reasoning, positioning SymPyBench as a foundation for developing more robust and interpretable reasoning systems.
KV Pareto: Systems-Level Optimization of KV Cache and Model Compression for Long Context Inference
Sai Gokhale | Devleena Das | Rajeev Patwari | Ashish Sirasao | Elliott Delaye
Sai Gokhale | Devleena Das | Rajeev Patwari | Ashish Sirasao | Elliott Delaye
Long-context Large Language Models (LLMs) face significant memory bottlenecks during inference due to the linear growth of the key-value (KV) cache with sequence length. While individual optimization techniques like KV cache quantization, chunked prefill, and model weight quantization have shown promise, their joint effects and optimal configurations for edge deployment remain underexplored. We introduce KV Pareto, a systems-level framework that systematically maps the trade-off frontier between total memory consumption and task accuracy across these three complementary optimization techniques. Our framework evaluates multiple LLM architectures (Qwen, Llama, Mistral) with varying KV quantization schemes (int2/4/8, mixed-precision), granularities (per-token, per-tensor, per-block), and 4-bit weight quantization via AWQ. It identifies model-specific Pareto-optimal configurations that achieve 68-78% total memory reduction with minimal (1-3%) accuracy degradation on long-context tasks. We further verify the selected frontiers on the Needle-in-a-Haystack, GSM8k, and MMLU benchmarks, as well as extended context lengths of up to 128k, demonstrating the practical need for joint optimization in efficient LLM inference.
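A back-of-the-envelope sketch of why KV cache memory grows linearly with sequence length and how element precision scales it. The layer and head counts below are illustrative numbers in the ballpark of an 8B model with grouped-query attention, not the exact configuration of any model evaluated in the paper.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    """KV cache size: 2 (keys + values) * layers * kv_heads * head_dim
    * sequence length * batch size * bytes per element."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative configuration (hypothetical): 32 layers, 8 KV heads, dim 128.
fp16 = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=128_000,
                      batch=1, bytes_per_elem=2)
int4 = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=128_000,
                      batch=1, bytes_per_elem=0.5)
print(f"fp16 KV cache: {fp16 / 2**30:.1f} GiB, int4: {int4 / 2**30:.1f} GiB")
# At 128k tokens this is roughly 15.6 GiB in fp16 vs ~3.9 GiB in int4.
```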
We present MizanQA, a benchmark for assessing LLMs on Moroccan legal MCQs, many with multiple correct answers. Covering 1,776 expert-verified questions in Modern Standard Arabic enriched with Moroccan idioms, the dataset reflects influences from Maliki jurisprudence, customary law, and French legal traditions. Unlike single-answer settings, MizanQA features variable option counts, creating added difficulty. We evaluate multilingual and Arabic-centric models in zero-shot, native-Arabic prompts, measuring accuracy, a precision-penalized F1-like score, and calibration errors. Results show large performance gaps and miscalibration, particularly under stricter penalties. By scoping this benchmark to parametric knowledge only, we provide a baseline for future retrieval-augmented and rationale-focused setups.
Router-Suggest: Dynamic Routing for Multimodal Auto-Completion in Visually-Grounded Dialogs
Sandeep Mishra | Devichand Budagam | Anubhab Mandal | Bishal Santra | Pawan Goyal | Manish Gupta
Sandeep Mishra | Devichand Budagam | Anubhab Mandal | Bishal Santra | Pawan Goyal | Manish Gupta
Real-time multimodal auto-completion is essential for digital assistants, chatbots, design tools, and healthcare consultations, where user inputs rely on shared visual context. We introduce Multimodal Auto-Completion (MAC), a task that predicts upcoming characters in live chats using partially typed text and visual cues. Unlike traditional text-only auto-completion (TAC), MAC grounds predictions in multimodal context to better capture user intent. To enable this task, we adapt MMDialog and ImageChat to create benchmark datasets. We evaluate leading vision-language models (VLMs) against strong textual baselines, highlighting trade-offs in accuracy and efficiency. We present Router-Suggest, a router framework that dynamically selects between textual models and VLMs based on dialog context, along with a lightweight variant for resource-constrained environments. Router-Suggest achieves a 2.3x to 10x speedup over the best-performing VLM. A user study shows that VLMs significantly excel over textual models on user satisfaction, notably saving user typing effort and improving the quality of completions in multi-turn conversations. These findings underscore the need for multimodal context in auto-completions, leading to smarter, user-aware assistants.
Beyond Unified Models: A Service-Oriented Approach to Low Latency, Context Aware Phonemization for Real Time TTS
Mahta Fetrat Qharabagh | Donya Navabi | Zahra Dehghanian | Morteza Abolghasemi | Hamid R. Rabiee
Mahta Fetrat Qharabagh | Donya Navabi | Zahra Dehghanian | Morteza Abolghasemi | Hamid R. Rabiee
Lightweight, real-time text-to-speech systems are crucial for accessibility. However, the most efficient TTS models often rely on lightweight phonemizers that struggle with context-dependent challenges. In contrast, more advanced phonemizers with a deeper linguistic understanding typically incur high computational costs, which prevents real-time performance. This paper examines the trade-off between phonemization quality and inference speed in G2P-aided TTS systems, introducing a practical framework to bridge this gap. We propose lightweight strategies for context-aware phonemization and a service-oriented TTS architecture that executes these modules as independent services. This design decouples heavy context-aware components from the core TTS engine, effectively breaking the latency barrier and enabling real-time use of high-quality phonemization models. Experimental results confirm that the proposed system improves pronunciation soundness and linguistic accuracy while maintaining real-time responsiveness, making it well-suited for offline and end-device TTS applications.
Retrieval Enhancements for RAG: Insights from a Deployed Customer Support Chatbot
Daniel González Juclà | Mohit Tuteja | Marcos Esteve Casademunt | Keshav Unnikrishnan | Yasir Usmani | Arvind Roshaan
Daniel González Juclà | Mohit Tuteja | Marcos Esteve Casademunt | Keshav Unnikrishnan | Yasir Usmani | Arvind Roshaan
Retrieval-Augmented Generation (RAG) systems depend critically on retrieval quality to enable accurate, contextually relevant LLM responses. While LLMs excel at synthesis, their RAG performance is bottlenecked by document relevance. We evaluate advanced retrieval techniques including embedding model comparison, Reciprocal Rank Fusion (RRF), embedding concatenation and list-wise and adaptive LLM-based re-ranking, demonstrating that zero-shot LLMs outperform traditional cross-encoders in identifying high-relevance passages. We also explore context-aware embeddings, diverse chunking strategies, and model fine-tuning. All methods are rigorously evaluated on a proprietary dataset powering our deployed production chatbot, with validation on three public benchmarks: FiQA, HotpotQA, and SciDocs. Results show consistent gains in Recall@10, closing the gap with Recall@50 and yielding actionable pipeline recommendations. By prioritizing retrieval enhancements, we significantly elevate downstream LLM response quality in real-world, customer-facing applications.
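A minimal sketch of Reciprocal Rank Fusion (RRF), one of the retrieval-combination techniques evaluated above: a document's fused score is the sum over rankers of 1/(k + rank). The document ids and the constant k=60 are the usual textbook defaults, not necessarily the paper's settings.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked lists of doc ids (best first).
    RRF score of a document d is sum_i 1 / (k + rank_i(d)), with 1-based
    ranks; documents absent from a ranking contribute nothing for it."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical dense and sparse retrieval rankings for the same query.
dense = ["d3", "d1", "d7", "d2"]
sparse = ["d1", "d2", "d9", "d3"]
print(reciprocal_rank_fusion([dense, sparse]))
```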
Scaling Intent Understanding: A Framework for Classification with Clarification using Lightweight LLMs
Subhadip Nandi | Tanishka Agarwal | Anshika Singh | Priyanka Bhatt
Subhadip Nandi | Tanishka Agarwal | Anshika Singh | Priyanka Bhatt
Despite extensive research in intent classification, most task-oriented dialogue systems still rigidly assign intents to user utterances without addressing ambiguity, often leading to misrouted requests, irrelevant responses, and user frustration. Proprietary large-language models (LLMs) can generate effective clarifying questions but are too costly for large-scale deployment. Smaller open-source LLMs are more economical, but struggle to ask appropriate clarifying questions. This paper introduces a domain-agnostic framework that equips lightweight, production-ready open-source LLMs with the ability to perform intent classification alongside precise ambiguity resolution via clarifying questions. We validate our framework on both proprietary and public intent classification datasets, demonstrating its ability to perform intent classification as well as generate clarification questions in case of ambiguity. To compare models trained with our framework against external baselines, we also propose an evaluation methodology that jointly assesses the accuracy of intent classification and the timing and quality of clarifying questions. Our instruction-tuned models achieve performance comparable to leading proprietary LLMs while offering an 8X reduction in inference cost, enabling broader, cost-efficient deployment. When deployed in the customer-care system of an e-commerce enterprise, our model reduced the misrouting rate by 8%, resulting in a significant improvement in automation rates, which potentially translates into dollar savings by reducing escalations to human agents.
Beyond IVR: Benchmarking Customer Support LLM Agents for Business-Adherence
Sumanth Balaji | Piyush Mishra | Aashraya Sachdeva | Suraj Agrawal
Sumanth Balaji | Piyush Mishra | Aashraya Sachdeva | Suraj Agrawal
Traditional customer support systems, such as Interactive Voice Response (IVR), rely on rigid scripts and lack the flexibility required for handling complex, policy-driven tasks. While large language model (LLM) agents offer a promising alternative, evaluating their ability to act in accordance with business rules and real-world support workflows remains an open challenge. Existing benchmarks primarily focus on tool usage or task completion, overlooking an agent’s capacity to adhere to multi-step policies, navigate task dependencies, and remain robust to unpredictable user or environment behavior. In this work, we introduce JourneyBench, a benchmark designed to assess policy-aware agents in customer support. JourneyBench leverages graph representations to generate diverse, realistic support scenarios and proposes the User Journey Coverage Score, a novel metric to measure policy adherence. We evaluate multiple state-of-the-art LLMs using two agent designs: a Static-Prompt Agent (SPA) and a Dynamic-Prompt Agent (DPA) that explicitly models policy control. Across 703 conversations in three domains, we show that DPA significantly boosts policy adherence, even allowing smaller models like GPT-4o-mini to outperform more capable ones like GPT-4o. Our findings demonstrate the importance of structured orchestration and establish JourneyBench as a critical resource to advance AI-driven customer support beyond IVR-era limitations.
HotelQuEST: Balancing Quality and Efficiency in Agentic Search
Guy Hadad | Shadi Iskander | Sofia Tolmach | Oren Kalinsky | Haggai Roitman | Ran Levy
Guy Hadad | Shadi Iskander | Sofia Tolmach | Oren Kalinsky | Haggai Roitman | Ran Levy
Agentic search has emerged as a promising paradigm for adaptive retrieval systems powered by large language models (LLMs). However, existing benchmarks primarily focus on quality, overlooking efficiency factors that are critical for real-world deployment. Moreover, real-world user queries often contain underspecified preferences, a challenge that remains largely underexplored in current agentic search evaluation. As a result, many agentic search systems remain impractical despite their impressive performance. In this work, we introduce HotelQuEST, a benchmark comprising 214 hotel search queries that range from simple factual requests to complex queries, enabling evaluation across the full spectrum of query difficulty. We further address the challenge of evaluating underspecified user preferences by collecting clarifications that make annotators’ implicit preferences explicit for evaluation. We find that LLM-based agents achieve higher accuracy than traditional retrievers, but at substantially higher costs due to redundant tool calls and suboptimal routing that fails to match query complexity to model capability. Our analysis exposes inefficiencies in current agentic search systems and demonstrates substantial potential for cost-aware optimization.
TASER: Table Agents for Schema-guided Extraction and Recommendation
Nicole Cho | Kirsty Fielding | William Watson | Sumitra Ganesh | Manuela Veloso
Nicole Cho | Kirsty Fielding | William Watson | Sumitra Ganesh | Manuela Veloso
Real-world financial filings report critical information about an entity’s investment holdings, essential for assessing that entity’s risk, profitability, and relationship profile. Yet, these details are often buried in messy, multi-page, fragmented tables that are difficult to parse, hindering downstream QA and data normalization. Specifically, 99.4% of the tables in our financial table dataset lack bounding boxes, with the largest table spanning 44 pages. To address this, we present TASER (Table Agents for Schema-guided Extraction and Recommendation), a continuously learning, agentic table extraction system that converts highly unstructured, multi-page, heterogeneous tables into normalized, schema-conforming outputs. Guided by an initial portfolio schema, TASER executes table detection, classification, extraction, and recommendations in a single pipeline. Our Recommender Agent reviews unmatched outputs and proposes schema revisions, enabling TASER to outperform vision-based table detection models such as Table Transformer by 10.1%. Within this continuous learning process, larger batch sizes yield a 104.3% increase in useful schema recommendations and a 9.8% increase in total extractions. To train TASER, we manually labeled 22,584 pages and 3,213 tables covering 731.7 billion in holdings, culminating in TASERTab to facilitate research on real-world financial tables and structured outputs. Our results highlight the promise of continuously learning agents for robust extractions from complex tabular data.
TAGQuant: Token-Aware Clustering for Group-Wise Quantization
Jaeseong Lee | Seung-won Hwang | Aurick Qiao | Zhewei Yao | Yuxiong He
Jaeseong Lee | Seung-won Hwang | Aurick Qiao | Zhewei Yao | Yuxiong He
Grouping, e.g., grouping channels, which is widely used in current integer-based quantization, has become essential for the emerging MXFP4 format. Ideally, each group should contain channels with similar quantization scales. To guide such groups, existing work clusters the channels using a scalar proxy, ignoring the token dimension, which we find suboptimal. In this paper, we propose TAGQuant, a simple yet powerful enhancement for such “group-wise” quantization. By strategically shuffling channels to group those with similar token-wise activation distributions, TAGQuant ensures better clustering of large- and small-range values. This shuffle operation is hardware-efficient and is seamlessly integrated into the quantization process with only 0.01x latency overhead. TAGQuant reduces relative GSM8K error in both INT4 and MXFP4 formats, by up to 86% in Llama-3.1-8B-Instruct compared to baselines, validating the effectiveness of our channel shuffling approach for group-wise quantization. Code is publicly available.
Beyond Grid Search: Leveraging Bayesian Optimization for Accelerating RAG Pipeline Optimization
Anum Afzal | Xueru Zheng | Florian Matthes
Anum Afzal | Xueru Zheng | Florian Matthes
Finding optimal configurations for Retrieval-Augmented Generation (RAG) pipelines via grid search is computationally prohibitive, limiting real-world scalability. We investigate Bayesian Optimization (BO) as an efficient alternative, systematically comparing seven BO strategies combining four surrogate models and two multi-fidelity methods across FiQA, SciFact, and HotpotQA datasets. Our framework explores both global pipeline and local component-wise optimization, targeting final RAG performance and resource efficiency. Our results show that BO reduces optimization time by up to 84% compared to grid search while maintaining comparable accuracy, with local optimization offering the most practical balance for deployment. Notably, performance gains plateau with larger evaluation budgets, suggesting that moderate resource investments suffice for effective RAG tuning. We provide actionable guidelines that empower industry practitioners to efficiently configure and deploy high-performing RAG systems under real-world constraints.
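An illustrative sketch of Bayesian optimization over RAG pipeline hyperparameters using Optuna's default TPE surrogate, as a contrast to exhaustive grid search. The search space and the evaluate_rag placeholder are hypothetical stand-ins, not the paper's framework or surrogate models.

```python
import optuna

def evaluate_rag(chunk_size, top_k, reranker):
    """Placeholder scoring function standing in for a real dev-set evaluation
    of a RAG pipeline built with the given configuration."""
    bonus = {"none": 0.0, "cross-encoder": 0.05, "llm": 0.08}[reranker]
    return 0.6 + bonus - abs(top_k - 8) * 0.01 - abs(chunk_size - 512) / 10_000

def objective(trial):
    chunk_size = trial.suggest_categorical("chunk_size", [128, 256, 512, 1024])
    top_k = trial.suggest_int("top_k", 1, 20)
    reranker = trial.suggest_categorical("reranker", ["none", "cross-encoder", "llm"])
    return evaluate_rag(chunk_size, top_k, reranker)

study = optuna.create_study(direction="maximize")   # TPE surrogate by default
study.optimize(objective, n_trials=30)               # vs. 4 * 20 * 3 = 240 grid cells
print(study.best_params)
```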
BornoDrishti: Leveraging Vision Encoders and Domain-Adaptive Learning for Bangla OCR on Diverse Documents
S M Jishanul Islam | Md Mehedi Hasan | Masbul Haider Ovi | Akm Shahariar Azad Rabby | Fuad Rahman
S M Jishanul Islam | Md Mehedi Hasan | Masbul Haider Ovi | Akm Shahariar Azad Rabby | Fuad Rahman
OCR for Bangla scripts remains a challenging problem, with existing solutions limited to single-domain processing. Current approaches lack a unified vision encoder that can understand diverse Bangla script variations, hindering practical deployment. We present BornoDrishti, the first unified OCR system based on the vision transformer that accurately recognizes both printed and handwritten Bangla scripts within a single model. Our approach introduces a novel domain objective that enables the model to learn domain-invariant representations while preserving script-specific features, eliminating the need for separate domain experts. BornoDrishti achieves competitive accuracy across both domains, setting state-of-the-art performance for printed scripts and demonstrating that a single unified model can match or exceed specialized uni-domain systems. We evaluate our model against state-of-the-art domain-specific and cross-domain OCR systems. This work establishes a foundation for advancing practical applications by using a unified multi-domain OCR system for complex Bangla scripts.
MobileCity: An Efficient Framework for Large-Scale Urban Behavior Simulation
Xiaotong Ye | Nicolas Bougie | Toshihiko Yamasaki | Narimawa Watanabe
Xiaotong Ye | Nicolas Bougie | Toshihiko Yamasaki | Narimawa Watanabe
Generative agents offer promising capabilities for simulating realistic urban behaviors. However, existing methods often rely on static profiles, oversimplified behavioral logic, and synchronous inference pipelines that hinder scalability. We present MobileCity, a lightweight generative-agent framework for city-scale simulation powered by cognitively-grounded generative agents. Each agent acts based on its needs, habits, and obligations, evolving over time. Agents are initialized from survey-based demographic data and navigate a realistic multimodal transportation network spanning multiple types of vehicles. To achieve scalability, we introduce asynchronous batched LLM inference during action selection and a low-token communication mechanism. Experiments with 4,000 agents demonstrate that MobileCity generates more human-like urban dynamics than baselines while maintaining high computational efficiency. Our code is publicly available at https://github.com/Tony-Yip/MobileCity.
Is Micro Domain-Adaptive Pre-Training Effective for Real-World Operations? Multi-Step Evaluation Reveals Potential and Bottlenecks
Masaya Tsunokake | Yuta Koreeda | Terufumi Morishita | Koichi Nagatsuka | Hikaru Tomonari | Yasuhiro Sogawa
Masaya Tsunokake | Yuta Koreeda | Terufumi Morishita | Koichi Nagatsuka | Hikaru Tomonari | Yasuhiro Sogawa
When applying LLMs to real-world enterprise operations, LLMs need to handle proprietary knowledge in small domains of specific operations (micro domains). A previous study shows micro domain-adaptive pre-training (mDAPT) with fewer documents is effective, similarly to DAPT in larger domains. However, it evaluates mDAPT only on multiple-choice questions; thus, its effectiveness for generative tasks in real-world operations remains unknown. We aim to reveal the potential and bottlenecks of mDAPT for generative tasks. To this end, we disentangle the answering process into three subtasks and evaluate the performance of each subtask: (1) eliciting facts relevant to questions from an LLM’s own knowledge, (2) reasoning over the facts to obtain conclusions, and (3) composing long-form answers based on the conclusions. We verified mDAPT on proprietary IT product knowledge for real-world questions in IT technical support operations. As a result, mDAPT resolved the elicitation task that the base model struggled with but did not resolve other subtasks. This clarifies mDAPT’s effectiveness in the knowledge aspect and its bottlenecks in other aspects. Further analysis empirically shows that resolving the elicitation and reasoning tasks ensures sufficient performance (over 90%), emphasizing the need to enhance reasoning capability.
Aircraft Maintenance Technicians (AMTs) spend up to 30% of work time searching manuals—a documented efficiency bottleneck in MRO operations where every procedure must be traceable to certified sources. We present a compliance-preserving retrieval system that adapts LLM reranking and semantic search to aviation MRO environments by operating alongside, rather than replacing, certified legacy viewers. The system constructs revision-robust embeddings from ATA chapter hierarchies and uses vision-language parsing to structure certified content, allowing technicians to preview ranked tasks and access verified procedures in existing viewers. Evaluation on 49k synthetic queries achieves >90% retrieval accuracy, while bilingual controlled studies with 10 licensed AMTs demonstrate 90.9% top-10 success rate and 95% reduction in lookup time—from 6-15 minutes to 18 seconds per task. These gains provide concrete evidence that semantic retrieval can operate within strict regulatory constraints and meaningfully reduce operational workload in real-world multilingual MRO workflows.
No Label? No Problem: Unsupervised Continual Learning for Adaptive Medical ASR
Meizhu Liu | Tao Sheng
Meizhu Liu | Tao Sheng
Automatic Speech Recognition (ASR) plays an important role in healthcare but faces unique challenges. Medical audio often contains specialized terminology, such as medication names, which existing ASR systems struggle to transcribe accurately. High error rates arise from pronunciation variability, the continual introduction of new terms, and the scarcity of high-quality labeled data—whose collection is costly and requires medical expertise. Although synthetic datasets partially alleviate this problem, they fail to capture the noise and variability of real-world recordings. Moreover, ASR models trained in controlled environments are highly sensitive to noise, leading to degraded performance in clinical settings. To address these limitations, we propose an unsupervised continual learning ASR framework that adapts to new data while preserving prior knowledge. This enables efficient domain adaptation without extensive retraining. Experiments on real-world medical audio demonstrate significant improvements over state-of-the-art baselines.
EduPulse: A Practical LLM-Enhanced Opinion Mining System for Vietnamese Student Feedback in Educational Platforms
Nguyen Xuan Phuc | Phi Nguyen Xuan | Vinh-Tiep Nguyen | Thìn Đặng Văn | Ngan Luu-Thuy Nguyen
Opinion mining from real-world student feedback presents significant practical challenges, such as handling linguistic noise (slang, teencode) and the need for scalable and maintainable systems, which are often overlooked in academic research. This paper introduces EduPulse, a practical opinion mining system designed specifically to analyze student feedback in Vietnamese. Our application performs four opinion analysis tasks: Sentiment Classification, Category-based Sentiment Classification, Suggestion Detection, and Opinion Summarization. We design a hybrid architecture that strategically balances performance, cost, and maintainability. This architecture leverages the robustness of Large Language Models (LLMs) for complex, noise-sensitive tasks such as sentiment classification and suggestion detection, while employing a specialized, lightweight neural model for high-throughput, low-cost processing. Our experiments show that applying the LLM-based approach achieves high robustness, justifying its operational cost by eliminating expensive retraining cycles. Furthermore, we demonstrate that our collaborative modular architecture significantly improves task performance (+7.6%) compared to traditional approaches, offering a practical design for industry-focused Natural Language Processing applications.
When Speed Meets Intelligence: Scalable Conversational NER in an Ever-evolving World
Karim Ghonim | Antonio Roberto | Davide Bernardi
Modern conversational AI systems require sophisticated Named Entity Recognition (NER) capabilities that can handle complex, contextual dialogue patterns. While Large Language Models (LLMs) excel at understanding conversational semantics, their inference latency and inability to efficiently incorporate emerging entities make them impractical for production deployment. Moreover, the scarcity of conversational NER data creates a critical bottleneck for developing effective models. We address these challenges through two main contributions. First, we introduce an automated pipeline for generating multilingual conversational NER datasets with minimal human validation, producing 4,082 English and 3,925 Spanish utterances. Second, we present a scalable framework that leverages LLMs as semantic filters combined with catalog-based entity grounding to label live traffic data, enabling knowledge distillation into faster, production-ready models. On internal conversational datasets, our teacher model demonstrates 39.55% relative F1-score improvement in English and 44.93% in Spanish compared to production systems. On public benchmarks, we achieve 97.12% F1-score on CoNLL-2003 and 83.09% on OntoNotes 5.0, outperforming prior state-of-the-art by 24.82 and 8.19 percentage points, respectively. Finally, student models distilled from our teacher model achieve 13.84% relative improvement on English conversational data, bridging the gap between LLM capabilities and real-world deployment constraints.
ReflectiveRAG: Rethinking Adaptivity in Retrieval-Augmented Generation
Akshay Verma | Swapnil Gupta | Siddharth Pillai | Prateek Sircar | Deepak Gupta
Retrieval-Augmented Generation (RAG) systems degrade sharply under extreme noise, where irrelevant or redundant passages dominate. Current methods (fixed top-k retrieval, cross-encoder reranking, or policy-based iteration) depend on static heuristics or costly reinforcement learning, failing to assess evidence sufficiency, detect subtle mismatches, or reduce redundancy, leading to hallucinations and poor grounding. We introduce ReflectiveRAG, a lightweight yet reasoning-driven architecture that enhances factual grounding through two complementary mechanisms: Self-Reflective Retrieval (SRR) and Contrastive Noise Removal (NR). SRR employs a small language model as a decision controller that iteratively evaluates evidence sufficiency, enabling adaptive query reformulation without fixed schedules or policy training. NR further refines retrieved content via embedding-based contrastive filtering, enforcing semantic sparsity and removing redundant or tangential passages. Evaluated on WebQuestions, HotpotQA (distractor setting), and InternalQA with 50M Common Crawl distractors, ReflectiveRAG achieves substantial gains over strong baselines, including DeepRAG, improving EM by +2.7 pp and F1 by +2.5 pp, while reducing evidence redundancy by 30.88% with only 18 ms additional latency. Ablation studies confirm that SRR and NR jointly drive both factual accuracy and efficiency, validating our central claim that retrieval reasoning and contrastive filtering can outperform large-scale policy optimization in RAG.
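To make the SRR/NR interplay described above more concrete, the following is a minimal Python sketch of an iterative retrieve-reflect loop followed by embedding-based redundancy filtering. The callables retrieve, embed, is_sufficient, reformulate, and answer, as well as the similarity threshold, are hypothetical placeholders for illustration, not the paper's actual components.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def remove_redundant(passages, embed, threshold=0.9):
    """Contrastive-style filtering: drop passages too similar to ones already kept."""
    kept, kept_vecs = [], []
    for passage in passages:
        vec = embed(passage)
        if all(cosine(vec, kv) < threshold for kv in kept_vecs):
            kept.append(passage)
            kept_vecs.append(vec)
    return kept

def reflective_rag(query, retrieve, embed, is_sufficient, reformulate, answer,
                   max_rounds=3):
    """Retrieve until a small controller judges the evidence sufficient,
    then generate the answer from the de-duplicated evidence."""
    evidence = []
    for _ in range(max_rounds):
        evidence = remove_redundant(evidence + retrieve(query), embed)
        if is_sufficient(query, evidence):      # small-LM decision controller
            break
        query = reformulate(query, evidence)    # adaptive query rewrite
    return answer(query, evidence)
```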
OCR or Not? Rethinking Document Information Extraction in the MLLMs Era with Real-World Large-Scale Datasets
Jiyuan Shen | Yuan Peiyue | Atin Ghosh | Yifan Mai | Daniel Dahlmeier
Multimodal Large Language Models (MLLMs) enhance the potential of natural language processing. However, their actual impact on document information extraction remains unclear. In particular, it is unclear whether an MLLM-only pipeline—while simpler—can truly match the performance of traditional OCR+MLLM setups. In this paper, we conduct a large-scale benchmarking study that evaluates various out-of-the-box MLLMs on business-document information extraction. To examine and explore failure modes, we propose an automated hierarchical error analysis framework that leverages large language models (LLMs) to diagnose error patterns systematically. Our findings suggest that OCR may not be necessary for powerful MLLMs, as image-only input can achieve comparable performance to OCR-enhanced approaches. Moreover, we demonstrate that carefully designed schemas, exemplars, and instructions can further enhance MLLM performance. We hope this work can offer practical guidance and valuable insight for advancing document information extraction.
PatentVision: A multimodal method for drafting patent applications
Ruo Yang | Sai Krishna Reddy Mudhiganti | Manali Sharma
Patent drafting is complex due to its need for detailed technical descriptions, legal compliance, and visual elements. Although Large Vision-Language Models (LVLMs) show promise across various tasks, their application in automating patent writing remains underexplored. In this paper, we present PatentVision, a multimodal framework that integrates textual and visual inputs—such as patent claims and drawings—to generate complete patent specifications. Built on advanced LVLMs, PatentVision enhances accuracy by combining fine-tuned vision-language models with domain-specific training tailored to patents. Experiments reveal it surpasses text-only methods, producing outputs with greater fidelity and alignment with human-written standards. Its incorporation of visual data allows it to better represent intricate design features and functional connections, leading to richer and more precise results. This study underscores the value of multimodal techniques in patent automation, providing a scalable tool to reduce manual workloads and improve consistency. PatentVision not only advances patent drafting but also lays groundwork for broader use of LVLMs in specialized areas, potentially transforming intellectual property management and innovation processes.
VideoMind: Thinking in Steps for Long Video Understanding
Shubhang Bhatnagar | Renxiong Wang | Kapil Krishnakumar | Adel Ahmadyan | Zhaojiang Lin | Lambert Mathias | Xin Luna Dong | Babak Damavandi | Narendra Ahuja | Seungwhan Moon
Multimodal Large Language Models (MLLMs) struggle with Long Video Understanding (LVU) due to their limited context window and the distributed nature of salient information across many redundant frames. To address this, we present VideoMind, a novel training-free framework for LVU designed to mimic a human reasoning process. The framework is orchestrated by an MLLM that breaks down a user’s query into a series of simpler, actionable sub-queries. For each sub-query, the MLLM reconfigures itself by invoking specialized ‘modes’ that are instantiations of the same MLLM, but with appropriately tailored context for the given sub-query to extract targeted evidence. After gathering this evidence, the model resumes its role as the orchestrator, which evaluates the results and decides if an answer is complete or if it must refine its strategy by engaging further modes with new context. Our specialized operational modes include: 1) a Multi-Scale Temporal Search mode to identify and summarize relevant video sub-snippets at varying time scales, and 2) a Single-Frame Visual Detail mode for precise spatial localization of objects. This dynamic allocation of computation yields state-of-the-art results on the Video-MME, LongVideo, and MLVU benchmarks, achieving 77.6% performance on Video-MME using Qwen 2.5 72B (4.8% enhancement) while also yielding a 5% improvement on Llama 4 Scout.
RegNLI: Detecting Online Product Misbranding through Legal and Linguistic Alignment
Diya Saha | Abhishek Bharadwaj Varanasi | Tirthankar Dasgupta | Manjira Sinha
Misbranding of health-related products poses significant risks to public safety and regulatory compliance. Existing approaches to claim verification largely rely on keyword matching or generic text classification, failing to capture the nuanced reasoning required to align product claims with legal statutes. In this work, we introduce RegNLI, a novel framework that formulates misbranding detection as an inference task between product claims and regulatory provisions. Leveraging a curated dataset of FDA warning letters, we construct structured representations of claims and statutes. Our model integrates a regulation-aware gating mechanism with a contrastive alignment objective to jointly optimize misbranding classification and statute mapping. Experiments on the FDA-Misbrand dataset demonstrate that RegNLI significantly outperforms strong baselines across accuracy, F1-score, and regulation alignment metrics, while providing interpretable attention patterns that highlight critical linguistic cues. This work establishes a foundation for compliance-aware NLP systems and opens new directions for integrating formal reasoning with neural architectures in regulatory domains.
CASPER: Bridging Discrete and Continuous Prompt Optimization through Feedback-Guided Gradient Descent
Aryan Jain | Pushpendu Ghosh | Promod Yenigalla
Workflow automation is critical for reducing manual efforts in industries, yet existing pipelines fail to handle generative tasks like summarization and extraction without pre-built tools, forcing human intervention. While LLM-based agents offer solutions, their creation depends heavily on prompt engineering—a resource-intensive process often yielding suboptimal results. Current automated approaches face a fundamental trade-off: discrete optimization produces overfitted prompts without convergence guarantees due to non-convex landscapes, while continuous gradient-based methods generate semantically incoherent prompts through embedding optimization. We propose CASPER, a framework bridging discrete and continuous prompt optimization through feedback-guided gradient descent in embedding space. CASPER employs a feedback module producing detailed error analyses that capture failure modes as optimization signals. These insights are projected with prompt tokens into embedding space to steer gradient descent. To preserve interpretability, we incorporate fluency regularization that penalizes incomprehensible tokens. We further accelerate convergence through synthetic data generation that oversamples failure cases, while also addressing data scarcity in industrial settings. We evaluate CASPER on WDC, DROP, and GSM8K with F1 improvements of 2.3%, 1.6%, and 2.3%, respectively, and on VQA and internal benchmarks with accuracy improvements of 1.1% and 3%, demonstrating cross-domain generalizability.
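As a rough illustration of "gradient descent in embedding space with fluency regularization", the toy numpy sketch below optimizes a soft prompt against a linear stand-in task while pulling it toward the nearest vocabulary embedding. The feedback module is omitted, and the task, vocabulary, and hyperparameters are invented for illustration only; none of this reflects CASPER's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, vocab_size = 8, 20
vocab_emb = rng.normal(size=(vocab_size, dim))    # embeddings of real tokens
W = rng.normal(size=(dim, dim))                   # toy linear "task"
W /= np.linalg.norm(W, 2)                         # normalise so plain GD is stable
target = rng.normal(size=dim)                     # desired task behaviour

prompt = rng.normal(size=dim)                     # soft prompt to optimise
lam, lr = 0.5, 0.05

for _ in range(300):
    # Fluency regulariser: stay close to an interpretable (real) token embedding.
    nearest = vocab_emb[np.argmin(np.linalg.norm(vocab_emb - prompt, axis=1))]
    task_grad = 2 * W.T @ (W @ prompt - target)   # gradient of the task loss
    fluency_grad = 2 * lam * (prompt - nearest)
    prompt -= lr * (task_grad + fluency_grad)

print("task residual:", np.linalg.norm(W @ prompt - target))
print("distance to nearest token:", np.linalg.norm(prompt - nearest))
```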
Adaptive Data Flywheel: Applying MAPE Control Loops to AI Agent Improvement
Aaditya Shukla | Sidney Knowles | Meenakshi Madugula | David Farris | Ryan Angilly | Santiago Pombo | Lu An | Anbang Xu | Abhinav Balasubramanian | Tan Yu | Jiaxiang Ren | Rama Akkiraju
Enterprise AI agents must continuously adapt to maintain accuracy, reduce latency, and remain aligned with user needs. We present a practical implementation of a data flywheel in NVInfo AI, NVIDIA’s Mixture-of-Experts (MoE) Knowledge Assistant serving over 30,000 employees. By operationalizing a MAPE-driven data flywheel, we built a closed-loop system that systematically addresses failures in retrieval-augmented generation (RAG) pipelines and enables continuous learning. Over a 3-month post-deployment period, we monitored feedback and collected 495 negative samples. Analysis revealed two major failure modes: routing errors (5.25%) and query rephrasal errors (3.2%). Using NVIDIA NeMo Microservices, we implemented targeted improvements through fine-tuning. For routing, we replaced a Llama 3.1 70B model with a fine-tuned 8B variant, achieving 96% accuracy, a 10× reduction in model size, and 70% latency improvement. For query rephrasal, fine-tuning yielded a 3.7% gain in accuracy and a 40% latency reduction. Our approach demonstrates how human-in-the-loop (HITL) feedback, when structured within a data flywheel, transforms enterprise AI agents into self-improving systems. Key learnings include approaches to ensure agent robustness despite limited user feedback, navigating privacy constraints, and executing staged rollouts in production. This work offers a repeatable blueprint for building robust, adaptive enterprise AI agents capable of learning from real-world usage at scale.
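For readers unfamiliar with MAPE-style control loops, the following is a schematic Python sketch of a monitor-analyze-plan-execute cycle applied to negative feedback samples, in the spirit of the flywheel described above. The field names, the classify_failure / build_finetune_job / deploy callables, and the sample threshold are illustrative assumptions, not details of the NVInfo AI system.

```python
from collections import Counter

def mape_cycle(feedback_log, classify_failure, build_finetune_job, deploy,
               min_samples=50):
    # Monitor: collect negatively rated interactions from production traffic.
    negatives = [f for f in feedback_log if f.get("rating") == "negative"]

    # Analyze: bucket failures into modes (e.g. routing vs. query-rephrasal errors).
    labeled = [(classify_failure(f), f) for f in negatives]
    modes = Counter(mode for mode, _ in labeled)

    # Plan: only failure modes with enough evidence become fine-tuning jobs.
    plans = [build_finetune_job(mode, [f for m, f in labeled if m == mode])
             for mode, count in modes.items() if count >= min_samples]

    # Execute: stage each improvement behind evaluation before full rollout.
    for plan in plans:
        deploy(plan)
    return modes
```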
Medical Summarization in Practice: Design, Deployment, and Analysis of a Clinical Summarization System for a German Hospital
Moiz Rauf | Sean Papay
Through the course of hospital treatment, a large number of electronic health records (EHRs) are created for a patient, detailing aspects of care history such as lab results, physician notes, and treatments administered. At the conclusion of treatment, this collection of EHRs must be summarized into a discharge summary, describing the course of care clearly and cohesively. In this paper, we present the design and development of a clinical summarization system integrated into a live German hospital workflow to help with the generation of discharge summaries. We first describe the system, its components, and its context of use within a hospital, before performing a number of experiments to gain insights into how best to use and evaluate our system. We investigate summarization performance across multiple input encoding strategies, compare expert judgments against automatic evaluation of summaries, and analyze the consistency of model summaries across multiple text generations. This work not only acts as a case study to demonstrate the feasibility of LLM integration into healthcare infrastructure, but also provides actionable insights into the use and evaluation of such systems.
Feedback-Aware Prompt Optimization Framework for Generating Job Postings
Suraj Maharjan | Ainur Yessenalina | Srinivasan H. Sengamedu
Job postings are critical for recruitment, yet large enterprises struggle with standardization and consistency, requiring significant time from hiring managers and recruiters. We present a feedback-aware prompt optimization framework that automates high-quality job posting generation through iterative human-in-the-loop refinement. Our system integrates multiple data sources: job metadata, competencies, organization’s compliance guidelines, and organization brand statement, while incorporating human feedback to continuously improve prompt quality through multi-LLM validation. We evaluated our approach using LLM-as-a-judge on 1,056 job postings and human evaluation on a smaller subset across three dimensions: Standardization, Compliance, and User Perception. Our results demonstrate high compliance rates and strong satisfaction scores in both automated and human evaluation, validating the effectiveness of our feedback-aware approach for enterprise job posting generation.
Enhancing User Safety: Context-Aware Detection of Offensive Query-Ad Pairs in Multimodal Search Advertising
Gaurav Kumar | Qiangjian Xi | Tanmaya Shekhar Dabral | Hooshang Ghasemi | Abishek Krishnamoorthy | Danqing Fu | Rui Min | Emilio Antunez | Zhongli Ding | Pradyumna Narayana
The proliferation of multi-modal online advertisements necessitates robust content moderation to ensure user safety, as offensive ad content can cause user distress and erode platform trust. This paper addresses the detection of content that becomes offensive only when a user’s search query is paired with a specific ad, a context-dependent challenge that simple moderation often misses. Key challenges include the nuanced, multi-modal nature of ads, severe data scarcity and class imbalance due to the rarity of offensive content, and the high cost of human labeling. To overcome these limitations, we introduce a novel, context-aware detection framework centered on a large-scale, Multi-modal Teacher-Student Knowledge Distillation architecture. A powerful Gemini encoder-only “teacher” model distills its knowledge into a lightweight student model suitable for low-latency deployment. We enhance robustness using a novel graph mining technique to find rare offensive examples for training. For evaluation, we developed a highly accurate Automated Evaluation Model (AEM)—a separate, larger Gemini model utilizing Chain-of-Thought (CoT) reasoning—to rigorously assess performance in a live A/B test. Our results demonstrate that the proposed framework reduces the serving of offensive query-ad pairs by more than 80% compared to the baseline, while maintaining the efficiency required for real-time advertising systems that operate at a scale of approximately 100 billion query-ad pairs per day. Disclaimer: This paper contains sentences and images that may be offensive. These examples are included solely for scientific analysis and do not reflect the views of the authors.
SAGE: An Agentic Explainer Framework for Interpreting SAE Features in Language Models
Jiaojiao Han | Wujiang Xu | Mingyu Jin | Mengnan Du
Large language models (LLMs) have achieved remarkable progress, yet their internal mechanisms remain largely opaque, posing a significant challenge to their safe and reliable deployment. Sparse autoencoders (SAEs) have emerged as a promising tool for decomposing LLM representations into more interpretable features, but explaining the features captured by SAEs remains a challenging task. In this work, we propose SAGE (SAE Agentic Explainer), an agent-based framework that recasts feature interpretation from a passive, single-pass generation task into an active, explanation-driven process. SAGE implements a rigorous methodology by systematically formulating multiple explanations for each feature, designing targeted experiments to test them, and iteratively refining explanations based on empirical activation feedback. Experiments on features from SAEs of diverse language models demonstrate that SAGE produces explanations with significantly higher generative and predictive accuracy compared to state-of-the-art baselines.
Adapting Vision-Language Models for E-commerce Understanding at Scale
Matteo Nulli | Orshulevich Vladimir | Tala Bazazo | Christian Herold | Michael Kozielski | Marcin Mazur | Szymon Tuzel | Cees G. M. Snoek | Seyyed Hadi Hashemi | Omar Javed | Yannick Versley | Shahram Khadivi
E-commerce product understanding inherently demands strong multimodal comprehension of text, images, and structured attributes. General-purpose Vision–Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is no documented, well-known strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data without sacrificing general performance. In this work, we show through a large-scale experimental study how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.
MedRiskEval: Medical Risk Evaluation Benchmark of Language Models, On the Importance of User Perspectives in Healthcare Settings
Jean-Philippe Corbeil | Minseon Kim | Maxime Griot | Sheela Agarwal | Alessandro Sordoni | Francois Beaulieu | Paul Vozila
As the performance of large language models (LLMs) continues to advance, their adoption in the medical domain is increasing. However, most existing risk evaluations have largely focused on general safety benchmarks. In medical applications, LLMs may be used by a wide range of users, from general users and patients to clinicians, with diverse levels of expertise, and the models’ outputs can have a direct impact on human health, which raises serious safety concerns. In this paper, we introduce MedRiskEval, a medical risk evaluation benchmark tailored to the medical domain. To fill the gap left by previous benchmarks that focused only on the clinician perspective, we introduce a new patient-oriented dataset called PatientSafetyBench containing 466 samples across 5 critical risk categories. Leveraging our new benchmark alongside existing datasets, we evaluate a variety of open- and closed-source LLMs. To the best of our knowledge, this work establishes an initial foundation for safer deployment of LLMs in healthcare.
Synthetic Doctor-Patient Dialogue Generation for Robust Medical ASR: A Scalable Pipeline for Vocabulary Expansion and Privacy Preservation
Kefei Liu | Meizhu Liu
Automatic Speech Recognition (ASR) is increasingly integral to healthcare services, where medical conversations present unique transcription challenges due to specialized terminology and frequent introduction of new terms. Existing ASR models, including widely used systems like Whisper, struggle with high word error rates (WER) on clinical vocabulary, especially medication names, primarily due to the scarcity of annotated audio-transcript data in the medical domain. This paper proposes and evaluates a novel synthetic data generation pipeline that produces comprehensive doctor-patient dialogues in both text and audio forms, specifically targeting a curated set of over 124,000 medical terms. The pipeline generated over 1 billion audio clips with ground-truth transcriptions. Fine-tuning ASR models with this synthetic corpus significantly reduced overall WER and improved transcription accuracy on medical terms, marking a significant advance in healthcare ASR accuracy. Data generation code, dataset, and training and evaluation scripts are released.
Lessons from the Field: An Adaptable Lifecycle Approach to Applied Dialogue Summarization
Kushal Chawla | Chenyang Zhu | Pengshan Cai | Sangwoo Cho | Scott Novotney | Ayushman Singh | Jonah Lewis | Keasha Safewright | Alfy Samuel | Erin Babinsky | Shi-Xiong Zhang | Sambit Sahu
Summarization of multi-party dialogues is a critical capability in industry, enhancing knowledge transfer and operational effectiveness across many domains. However, automatically generating high-quality summaries is challenging, as the ideal summary must satisfy a set of complex, multi-faceted requirements. While summarization has received immense attention in research, prior work has primarily utilized static datasets and benchmarks, a condition rare in practical scenarios where requirements inevitably evolve. In this work, we present an industry case study on developing an agentic system to summarize multi-party interactions. We share practical insights spanning the full development lifecycle to guide practitioners in building reliable, adaptable summarization systems, as well as to inform future research, covering: 1) robust methods for evaluation despite evolving requirements and task subjectivity, 2) component-wise optimization enabled by the task decomposition inherent in an agentic architecture, 3) the impact of upstream data bottlenecks, and 4) the realities of vendor lock-in due to the poor transferability of LLM prompts.
LingVarBench: Benchmarking LLMs on Entity Recognitions and Linguistic Verbalization Patterns in Phone-Call Transcripts
Seyedali Mohammadi | Manas Paldhe | Amit Chhabra | Youngseo Son | Vishal Seshagiri
We study structured entity extraction from phone-call transcripts in customer-support and healthcare settings, where annotation is costly, and data access is limited by privacy and consent. Existing methods degrade under disfluencies, interruptions, and speaker overlap, yet large real-call corpora are rarely shareable. We introduce LingVarBench, a benchmark and semantic synthetic data generation pipeline that generates linguistically varied training data via (1) LLM-sampled entity values, (2) curated linguistic verbalization patterns covering diverse disfluencies and entity-specific readout styles, and (3) a value–transcript consistency filter. Using this dataset, DSPy’s SIMBA automatically synthesizes and optimizes extraction prompts, reducing manual prompt engineering and targeting robustness to verbal variation. On real customer transcripts, prompts optimized solely on LingVarBench outperform zero-shot baselines and match or closely approach human-tuned prompts for structured entities such as ZIP code, date of birth, and name (F1 approximately 94-95 percent). For subjective questionnaire items, optimized prompts substantially improve over zero-shot performance and approach human-tuned prompts. LingVarBench offers a practical and cost-efficient path to deployment in a direct-answer setting, with real annotations later enabling additional refinement.
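The pipeline stages described above (value sampling, verbalization patterns, and a value-transcript consistency filter) can be pictured with the toy Python sketch below for a single entity type. The patterns, the digit read-out, and the consistency check are invented stand-ins for illustration, not LingVarBench's actual templates or filter.

```python
import random

PATTERNS = [
    "my zip is {spoken}",
    "uh, it's {spoken}, sorry, {spoken}",            # self-correction disfluency
    "the zip code... let me see... {spoken}",        # hesitation disfluency
]
DIGIT_WORDS = "zero one two three four five six seven eight nine".split()

def speak_digits(value):
    """Read a numeric string out digit by digit, as callers often do."""
    return " ".join(DIGIT_WORDS[int(ch)] for ch in value)

def consistent(value, transcript):
    """Value-transcript consistency filter: the spoken form must appear verbatim."""
    return speak_digits(value) in transcript

def make_samples(n=5, seed=0):
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        value = "".join(rng.choice("0123456789") for _ in range(5))
        transcript = rng.choice(PATTERNS).format(spoken=speak_digits(value))
        if consistent(value, transcript):            # keep only faithful samples
            samples.append({"entity": "zip_code", "value": value,
                            "transcript": transcript})
    return samples

if __name__ == "__main__":
    for sample in make_samples():
        print(sample)
```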
Improving Training Efficiency and Reducing Maintenance Costs via Language Specific Model Merging
Alphaeus Dmonte | Vidhi Gupta | Daniel J Perry | Mark Arehart
Fine-tuning a task-specific multilingual large language model (LLM) involves training the model on a multilingual dataset with examples in all the required languages. Updating one or more supported languages with additional data or adding support for a new language involves retraining the model, which can be computationally inefficient and creates a severe maintenance bottleneck. Recent research on merging multilingual multitask models has shown promise in terms of improved quality, but its computational and maintenance efficiency remains unstudied. In this work, we provide the first focused analysis of this merging strategy from an efficiency perspective, evaluating it across three independent tasks. We demonstrate significant efficiency gains while maintaining parity in terms of quality: this merging approach reduces the initial training time by up to 50%. We also demonstrate that updating an individual language and re-merging as part of model maintenance reduces training costs by more than 60%, compared to re-training the full multilingual model. We show this on both public and proprietary industry datasets confirming that the approach works well for industrial use cases in addition to academic settings already studied in previous work.
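The maintenance benefit described above comes from retraining one language's checkpoint and re-merging it without touching the others. Below is a minimal sketch of such a merge, written as simple parameter averaging over numpy weight dictionaries; the weighting scheme and checkpoint format are assumptions for illustration, not necessarily the merging strategy the paper studies.

```python
import numpy as np

def merge_checkpoints(checkpoints, weights=None):
    """checkpoints: list of {param_name: np.ndarray} with matching shapes."""
    if weights is None:
        weights = [1.0 / len(checkpoints)] * len(checkpoints)
    return {name: sum(w * ckpt[name] for w, ckpt in zip(weights, checkpoints))
            for name in checkpoints[0]}

# Updating one language only requires retraining that checkpoint and re-merging:
english = {"layer.weight": np.ones((2, 2))}
german = {"layer.weight": np.full((2, 2), 3.0)}
print(merge_checkpoints([english, german])["layer.weight"])  # element-wise average
```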
The Subtle Art of Defection: Understanding Uncooperative Behaviors in LLM based Multi-Agent Systems
Devang Kulshreshtha | Wanyu Du | Raghav Jain | Srikanth Doss | Hang Su | Sandesh Swamy | Yanjun Qi
This paper introduces a novel framework for simulating and analyzing how uncooperative behaviors can destabilize or collapse LLM-based multi-agent systems. Our framework includes two key components: (1) a game theory-based taxonomy of uncooperative agent behaviors, addressing a notable gap in the existing literature; and (2) a structured, multi-stage simulation pipeline that dynamically generates and refines uncooperative behaviors as agents’ states evolve. We evaluate the framework via a collaborative resource management setting, measuring system stability using metrics such as survival time and resource overuse rate. Empirically, our framework achieves ~96.7% accuracy in generating realistic uncooperative behaviors, validated by human evaluations. Our results reveal a striking contrast: cooperative agents maintain perfect system stability (100% survival over 12 rounds with 0% resource overuse), while any uncooperative behavior can trigger rapid system collapse within 1–7 rounds. We also evaluate LLM-based defense methods, finding they detect some uncooperative behaviors, but some behaviors remain largely undetectable. These gaps highlight how uncooperative agents degrade collective outcomes and underscore the need for more resilient multi-agent systems.
Tailoring Rumor Debunking to You: Diversifying Chinese Rumor-Debunking Passages with an LLM-Driven Simulated Feedback-Enhanced Framework
Xinle Pang | Danding Wang | Qiang Sheng | Yifan Sun | Beizhe Hu | Juan Cao
Social media platforms have become primary sources for news consumption due to their real-time and interactive nature, yet they have also facilitated the widespread proliferation of misinformation, negatively impacting public health, social cohesion, and market stability. While professional fact-checking is essential for debunking rumors, the process is time-consuming, necessitating automation to effectively combat fake news. Existing approaches, such as extractive methods, often lack coherence and context, whereas abstractive methods leveraging large language models (LLMs) can generate more readable and informative debunking passages. However, readability alone is insufficient for effective misinformation correction; user acceptance is critical. Recent advancements in LLMs offer new opportunities for personalized debunking, as these models can generate context-sensitive responses and adapt to user profiles. Building on this, we propose the MUlti-round Refinement and Simulated fEedback-enhanced framework (MURSE), which generates Chinese user-specific debunking passages by iteratively refining outputs based on simulated user feedback. Specifically, MURSE-generated user-specific debunking passages were preferred twice as often as general debunking passages in most cases, highlighting its potential to improve misinformation correction and foster positive dissemination chains.
Synthetic Data Fine-Tuning for Effective Team Formation in Enterprises
Guilherme Drummond Lima | Adriano Veloso
We evaluate the effectiveness of synthetic data fine-tuning for Semantic Search in a real-world Enterprise Team Formation problem scenario. In this problem, we aim to retrieve the best employee for a given task, given their information regarding abilities, experiences, and other aspects. We evaluate two synthetic data generation strategies: (1) augmenting real-world data with synthetic labels and (2) generating synthetic profiles for employees tailored to specific tasks. To measure the impact of these strategies, we fine-tune a pretrained text embedding model using LoRA and Rank Aggregation techniques. We evaluate the model performance against current SOTA algorithms on a human-curated dataset. Our experiments indicate that training a model that uses a combination of both Synthetic data generation strategies outperforms already established pre-trained models on the Team Formation task, improving the ranking metrics by an average of 30% in comparison to the best-performing pre-trained model.
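As one concrete way to picture the rank-aggregation step mentioned above, the sketch below combines several candidate rankings with a simple Borda count; the aggregation rule and the employee identifiers are illustrative choices, not necessarily what the paper uses.

```python
from collections import defaultdict

def borda_aggregate(rankings):
    """rankings: list of lists of employee ids, best candidate first."""
    scores = defaultdict(float)
    for ranking in rankings:
        size = len(ranking)
        for position, employee in enumerate(ranking):
            scores[employee] += size - position   # higher positions earn more points
    return sorted(scores, key=scores.get, reverse=True)

# Two retrieval strategies rank three employees for the same task:
print(borda_aggregate([["ana", "bo", "cy"], ["bo", "ana", "cy"]]))
```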
Assertion-Conditioned Compliance: A Provenance-Aware Vulnerability in Multi-Turn Tool-Calling Agents
Daud Waqas | Aaryamaan Golthi | Erika Hayashida | Huanzhi Mao
Multi-turn tool-calling LLMs — models capable of invoking external APIs or tools across several user turns — have emerged as a key feature in modern AI assistants, enabling extended dialogues from benign tasks to critical business, medical, and financial operations. Yet implementing multi-turn pipelines remains difficult for many safety-critical industries due to ongoing concerns regarding model resilience. While standardized benchmarks, such as the Berkeley Function-Calling Leaderboard (BFCL), have underpinned confidence concerning advanced function-calling models (like Salesforce’s xLAM V2), there is still a lack of visibility into multi-turn conversation-level robustness, especially given their exposure to real-world systems. In this paper, we introduce Assertion-Conditioned Compliance (A-CC), a novel evaluation paradigm for multi-turn function-calling dialogues. A-CC provides holistic metrics that evaluate a model’s behavior when confronted with misleading assertions originating from two distinct vectors: (1) user-sourced assertions (USAs), which measure sycophancy toward plausible but misinformed user beliefs, and (2) function-sourced assertions (FSAs), which measure compliance with plausible but contradictory system policies (e.g., stale hints from unmaintained tools). Our results show that models are highly vulnerable to both USA sycophancy and FSA policy conflicts, confirming A-CC as a critical, latent vulnerability in deployed agents.
PROBES: Performance and Relevance Observation for BEtter Search
Sejal Jain | Cyrus Andre DSouza | Jitenkumar Babubhai Rana | Aniket Joshi | Promod Yenigalla
High-quality search is essential for the success of online platforms, spanning e-commerce, social media, shopping-focused applications, and broader search systems such as content discovery and enterprise web search. To ensure optimal user experience and drive business growth, continuous evaluation and improvement of search systems is crucial. This paper introduces PROBES, a novel multi-task system powered by Large Language Models (LLMs) designed for end-to-end evaluation of semantic search systems. PROBES identifies context-aware relevance using a fine-grained scale (exact, substitute, complement, irrelevant) by leveraging the query category, feature-level intent, and category-aware feature importance, enabling more precise and consistent judgments than relying solely on raw query text. This allows PROBES to provide differentiated relevance assessment across a diverse range of query categories. PROBES then dives deeper to understand the reason behind irrelevant results (Precision issues) by checking product content conflicts and inaccuracies. It also analyzes Missed Recall by leveraging retrieval and relevance models to determine whether a missed recall was due to a selection issue or a ranking/retrieval system issue. To evaluate PROBES, we introduce a new metric, the Actionable Error Rate (AER), defined as the proportion of actionable errors over all flagged errors. We observe that PROBES operates at an AER of 76%, generating actionable insights across 100 product categories.
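Since the Actionable Error Rate is central to the evaluation above, here is the metric spelled out as a tiny Python function with a worked example; the numbers are illustrative, chosen only to reproduce a 0.76 rate.

```python
def actionable_error_rate(flagged):
    """AER: share of flagged errors that turn out to be actionable.
    flagged: list of booleans, True when a flagged error was actionable."""
    return sum(flagged) / len(flagged) if flagged else 0.0

# e.g. 19 actionable findings among 25 flagged errors -> AER = 0.76
print(actionable_error_rate([True] * 19 + [False] * 6))
```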
Aligning Paralinguistic Understanding and Generation in Speech LLMs via Multi-Task Reinforcement Learning
Minseok Kim | Jingxiang Chen | Seong-Gyun Leem | Yin Huang | Rashi Rungta | Zhicheng Ouyang | Haibin Wu | Surya Teja Appini | Ankur Bansal | Yang Bai | Yue Liu | Florian Metze | Ahmed A Aly | Anuj Kumar | Ariya Rastrow | Zhaojiang Lin
Speech large language models (LLMs) observe paralinguistic cues such as prosody, emotion, and non-verbal sounds—crucial for intent understanding. However, leveraging these cues faces challenges: limited training data, annotation difficulty, and models exploiting lexical shortcuts over paralinguistic signals. We propose multi-task reinforcement learning (RL) with chain-of-thought prompting that elicits explicit affective reasoning. To address data scarcity, we introduce a paralinguistics-aware speech LLM (PALLM) that jointly optimizes sentiment classification from audio and paralinguistics-aware response generation via a two-stage pipeline. Experiments demonstrate that our approach improves paralinguistics understanding over both supervised baselines and strong proprietary models (Gemini-2.5-Pro, GPT-4o-audio), by 8-12% on Expresso, IEMOCAP, and RAVDESS. The results show that modeling paralinguistic reasoning with multi-task RL is crucial for building emotionally intelligent speech LLMs.
IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages
Priyaranjan Pattnayak | Sanchari Chowdhuri
Safety alignment of large language models (LLMs) is mostly evaluated in English and in contract-bound settings, leaving multilingual vulnerabilities understudied. We introduce Indic Jailbreak Robustness (IJR), a judge-free benchmark for adversarial safety across 12 Indic and South Asian languages (~2.09B speakers), covering 45,216 prompts in JSON (contract-bound) and Free (naturalistic) tracks. IJR reveals three patterns. (1) Contracts inflate refusals but do not stop jailbreaks: in JSON, LLaMA and Sarvam exceed 0.92 JSR, and in Free all models reach ~1.0 with refusals collapsing. (2) English→Indic attacks transfer strongly, with format wrappers often outperforming instruction wrappers. (3) Orthography matters: romanized/mixed inputs reduce JSR under JSON, with correlations to romanization share and tokenization (ρ ≈ 0.28–0.32) indicating systematic effects. Human audits confirm detector reliability, and lite-to-full comparisons preserve conclusions. IJR offers a reproducible multilingual stress test revealing risks hidden by English-only, contract-focused evaluations, especially for South Asian users who frequently code-switch and romanize.
Synthesizing question answering data from financial documents: An End-to-End Multi-Agent Approach
Chetan Harsha | Karmvir Singh Phogat | Sridhar Dasaratha | Shashishekar Ramakrishna
Answering complex questions that require numerical reasoning over financial documents is challenging due to the diverse and scattered nature of relevant information. While large language models (LLMs) excel at financial reasoning, their enterprise deployment is often limited by cost and latency. Small language models (SLMs) present a cost-effective alternative but need to be fine-tuned with high-quality, domain-specific question-answer (QA) data. Acquiring such data requires manual expert annotation, presenting a bottleneck to the wider application of SLMs. This work introduces a modular, scalable end-to-end agentic pipeline that extracts and selects relevant content from unstructured financial documents and then generates QA pairs from the selected content for SLM fine-tuning. Compared to the same models trained on previous manually generated data for the task, one of the models trained on our pipeline-produced synthetic data achieved competitive in-distribution performance, and all tested models demonstrated superior generalization. The framework thus demonstrates considerable potential to accelerate the deployment of smaller, cost-effective models by reducing manual data creation efforts.
Toward Automatic Delegation Extraction in Japanese Law
Tsuyoshi Fujita | Yuya Sawada | Yusuke Sakai | Taro Watanabe
Legal systems have a hierarchical structure, and a higher-level law often authorizes a lower-level law to implement detailed provisions, which is called delegation. When interpreting legal texts with delegation, readers must repeatedly consult the lower-level laws that stipulate the detailed provisions, imposing a substantial workload. Therefore, it is necessary to develop a system that enables readers to instantly refer to relevant laws in delegation. However, manually annotating delegation is difficult because it requires extensive legal expertise, careful reading of numerous legal texts, and continuous adaptation to newly enacted laws. In this study, we focus on Japanese law and develop a two-stage pipeline system for automatic delegation annotation. First, we extract keywords that indicate delegation using a named entity recognition approach. Second, we identify the delegated provision corresponding to each keyword as an entity disambiguation task. In our experiments, the proposed system demonstrates sufficient performance to assist manual annotation in practice.
DIALECTIC: A Multi-Agent System for Startup Evaluation
Jae Yoon Bae | Simon Malberg | Joyce Ann Clarize Galang | Andre Retterath | Georg Groh
Venture capital (VC) investors face a large number of investment opportunities but invest in only a few of them, with even fewer ending up successful. Early-stage screening of opportunities is often limited by investor bandwidth, demanding tradeoffs between evaluation diligence and number of opportunities assessed. To ease this tradeoff, we introduce DIALECTIC, an LLM-based multi-agent system for startup evaluation. DIALECTIC first gathers factual knowledge about a startup and organizes these facts into a hierarchical question tree. It then synthesizes the facts into natural-language arguments for and against an investment and iteratively critiques and refines these arguments through a simulated debate, which surfaces only the most convincing arguments. Our system also produces numeric decision scores that allow investors to rank and thus efficiently prioritize opportunities. We evaluate DIALECTIC through backtesting on real investment opportunities aggregated from five VC funds, showing that DIALECTIC matches the precision of human VCs in predicting startup success.
Long-Context Long-Form Question Answering for Legal Domain
Anagha Kulkarni | Parin Rajesh Jhaveri | Prasha Shrestha | Yu Tong Han | Reza Amini | Behrouz Madahian
Legal documents have complex layouts involving multiple nested sections and lengthy footnotes, and they further use specialized linguistic devices like intricate syntax and domain-specific vocabulary to ensure precision and authority. These inherent characteristics of legal documents make question answering challenging, particularly when the answer to a question spans several pages (i.e., requires long context) and is required to be comprehensive (i.e., a long-form answer). In this paper, we address the challenges of long-context question answering in the context of long-form answers, given the idiosyncrasies of legal documents. We propose a question answering system that can (a) deconstruct domain-specific vocabulary for better retrieval from source documents, (b) parse complex document layouts while isolating sections and footnotes and linking them appropriately, and (c) generate comprehensive answers using precise domain-specific vocabulary. We also introduce a coverage metric that classifies performance into recall-based coverage categories, allowing human users to evaluate recall with ease. By leveraging the expertise of professionals from fields such as law and corporate tax, we curate a QA dataset. Through comprehensive experiments and ablation studies, we demonstrate the usability and merit of the proposed system.
ELO: Efficient Layer-Specific Optimization for Continual Pretraining of Multilingual LLMs
Hangyeol Yoo | ChangSu Choi | Minjun Kim | Seohyun Song | SeungWoo Song | Inho Won | Jongyoul Park | Cheoneum Park | KyungTae Lim
We propose an efficient layer-specific optimization (ELO) method designed to enhance continual pretraining (CP) for specific languages in multilingual large language models (MLLMs). This approach addresses the common challenges of high computational cost and degradation of source language performance associated with traditional CP. The ELO method consists of two main stages: (1) ELO Pretraining, where a small subset of specific layers, identified in our experiments as the critically important first and last layers, are detached from the original MLLM and trained with the target language. This significantly reduces not only the number of trainable parameters but also the total parameters computed during the forward pass, minimizing GPU memory consumption and accelerating the training process. (2) Layer Alignment, where the newly trained layers are reintegrated into the original model, followed by a brief full fine-tuning step on a small dataset to align the parameters. Experimental results demonstrate that the ELO method achieves a training speedup of up to 6.46 times compared to existing methods, while improving target language performance by up to 6.2% on qualitative benchmarks and effectively preserving source language (English) capabilities.
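The first ELO stage described above trains only the first and last layers while the rest of the model stays frozen. Below is a minimal PyTorch sketch of that freezing pattern on a toy stand-in model; the real method operates on a multilingual LLM and is followed by the layer-alignment stage, which is not shown here.

```python
import torch
import torch.nn as nn

class ToyLM(nn.Module):
    def __init__(self, vocab=100, dim=32, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_layers))
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        hidden = self.embed(tokens)
        for block in self.blocks:
            hidden = torch.relu(block(hidden))
        return self.head(hidden)

def freeze_all_but_boundary_layers(model):
    """Stage 1 of the sketch: only the first and last blocks remain trainable."""
    for param in model.parameters():
        param.requires_grad = False
    for block in (model.blocks[0], model.blocks[-1]):
        for param in block.parameters():
            param.requires_grad = True

model = ToyLM()
freeze_all_but_boundary_layers(model)
print([name for name, p in model.named_parameters() if p.requires_grad])
# -> only blocks.0.* and blocks.5.* parameters receive gradients during pretraining
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```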
MIRAGE: Metadata-guided Image Retrieval and Answer Generation for E-commerce Troubleshooting
Rishav Sahay | Lavanya Sita Tekumalla | Anoop Saladi
Existing multimodal systems typically associate text and available images based on embedding similarity or simple co-location, but such approaches often fail to ensure that the linked image accurately depicts the specific product or component mentioned in a troubleshooting instruction. We introduce MIRAGE, a metadata-first paradigm that treats structured metadata (not raw pixels) as a first-class modality for multimodal grounding. In MIRAGE, both text and images are projected through a shared semantic schema capturing product attributes, context, and visual aspects, enabling reasoning over interpretable attributes for troubleshooting rather than unstructured embeddings. MIRAGE comprises three complementary modules: M-Link for schema-guided image–text linking, M-Gen for metadata-conditioned multimodal generation, and M-Eval for consistency evaluation in the same structured space. Experiments on large-scale enterprise e-commerce troubleshooting data spanning 10 product types, 100K text chunks, and 35K images show that metadata-centric grounding achieves over 40% higher linking coverage of high-quality visual content and over 45% higher linking and response quality than embedding-based baselines. MIRAGE demonstrates the potential of structured metadata in enabling scalable, fine-grained grounding in multimodal troubleshooting systems.
CODMAS: A Dialectic Multi-Agent Collaborative Framework for Structured RTL Optimization
Che-Ming Chang | Prashanth Vijayaraghavan | Ashutosh Jadhav | Charles Mackin | Hsinyu Tsai | Vandana Mukherjee | Ehsan Degan
Optimizing Register Transfer Level (RTL) code is a critical step in Electronic Design Automation (EDA) for improving power, performance, and area (PPA). We present CODMAS (Collaborative Optimization via a Dialectic Multi-Agent System), a framework that combines structured dialectic reasoning with domain-aware code generation and deterministic evaluation to automate RTL optimization. At the core of CODMAS are two dialectic agents: the Articulator, inspired by rubber-duck debugging, which articulates stepwise transformation plans and exposes latent assumptions; and the Hypothesis Partner, which predicts outcomes and reconciles deviations between expected and actual behavior to guide targeted refinements. These agents direct a Domain-Specific Coding Agent (DCA) to generate architecture-aware Verilog edits and a Code Evaluation Agent (CEA) to verify syntax, functionality, and PPA metrics. We introduce RTLOPT, a benchmark of 120 Verilog triples (unoptimized, optimized, testbench) for pipelining and clock-gating transformations. Across proprietary and open LLMs, CODMAS achieves ~25% reduction in critical path delay for pipelining and ~22% power reduction for clock gating, while reducing functional and compilation failures compared to strong prompting and agentic baselines. These results demonstrate that structured multi-agent reasoning can significantly enhance automated RTL optimization and scale to more complex designs and broader optimization tasks.
D3: Dynamic Docid Decoding for Multi-Intent Generative Retrieval
Jaeyoung Kim | Dohyeon Lee | Soona Hong | Seung-won Hwang
Generative Retrieval (GR) maps queries to documents by generating discrete identifiers (DocIDs). However, offline DocID assignment and constrained decoding often prevent GR from capturing query-specific intent, especially when documents express multiple or unseen intents (i.e., intent misalignment). We introduce Dynamic Docid Decoding (D3), an inference-time mechanism that adaptively refines DocIDs through delayed, query-informed identifier expansion. D3 uses (a) verification to detect intent misalignment and (b) dynamic decoding to extend DocIDs with query-aligned tokens, even those absent from the pre-indexed vocabulary, enabling plug-and-play DocID expansion beyond the static vocabulary while adding minimal overhead. Experiments on NQ320k and MS-MARCO show that D3 consistently improves retrieval accuracy, especially on unseen and multi-intent documents, across various GR models, including a +2.4%p nDCG@10 gain on the state-of-the-art model.
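The verify-then-extend control flow described above can be sketched, in heavily simplified form, as the Python function below. The alignment scorer, the extension picker, and the threshold are hypothetical placeholders, and the actual mechanism operates inside constrained autoregressive decoding rather than as a post-hoc function.

```python
def dynamic_docid(query, document, static_docid, align_score, pick_extension,
                  threshold=0.5):
    """Return the pre-indexed DocID, or extend it with query-aligned tokens."""
    # Verification: does the offline DocID already reflect the query intent?
    if align_score(query, static_docid) >= threshold:
        return tuple(static_docid)
    # Dynamic decoding: append query-informed tokens, possibly ones that were
    # absent from the pre-indexed DocID vocabulary.
    return tuple(static_docid) + tuple(pick_extension(query, document))
```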
DisGraph-RP: Graph-Augmented Temporal Modeling with Aspect-Based Contrastive Encoding of Discharge Summary for Readmission Prediction
Sudeshna Jana | Tirthankar Dasgupta | Manjira Sinha | Pabitra Mitra
Predicting hospital readmissions is a critical clinical task with substantial implications for patient outcomes and healthcare cost management. We propose DisGraph-RP, a graph-augmented temporal modeling framework that integrates structured discourse-aware text representation with cross-admission relational reasoning. Our approach introduces a Section-Aware Contrastive Encoder that leverages section segmentation and aspect-based supervision to produce fine-grained representations of discharge summaries. These representations are then composed over time using a Graph-Based temporal module that encodes inter-visit dependencies through learned edge relations, enabling the model to capture disease progression, treatment history, and recurrent risk signals. Experiments on multiple real-world datasets demonstrate that DisGraph-RP achieves significant improvements over strong baselines, including transformer-based clinical models and prompting-based LLM approaches. Our findings highlight the importance of combining discourse-informed text encoding with temporal graph reasoning for robust clinical outcome prediction.
CareerPathKG: Knowledge Graph Integrated Framework for Career Intelligence
Ngoc-Quang Le | Duc Duong Hoang | Mai Vu Tran | Thi-Hai-Yen Vuong
The labor market is experiencing rapid and continual shifts in required skills and competencies, driven by technological advancement and evolving industry structures. Within this dynamic environment, candidates increasingly face challenges in orienting their career development, requiring them to continuously update their knowledge and capabilities to meet contemporary job requirements; this is especially true for new entrants to the labor market, who must cultivate a comprehensive understanding of current labor-market conditions. To address these issues, this study proposes an enterprise recruitment framework grounded in a career path knowledge graph, capturing occupations, skill requirements, and career transitions using standardized taxonomies enriched with job-posting data. The framework integrates transformer-based embeddings, large language models, and knowledge-graph reasoning to support efficient and reliable CV assessment, CV-JD matching, and career guidance.
A Hybrid Supervised-LLM Pipeline for Actionable Suggestion Mining in Unstructured Customer Reviews
Aakash Trivedi | Aniket Upadhyay | Pratik Narang | Dhruv Kumar | Praveen Kumar
Extracting actionable suggestions from customer reviews is essential for operational decision-making, yet these directives are often embedded within mixed-intent, unstructured text. Existing approaches either classify suggestion-bearing sentences or generate high-level summaries, but rarely isolate the precise improvement instructions businesses need. We evaluate a hybrid pipeline that combines a high-recall RoBERTa classifier (trained with a precision–recall surrogate to reduce unrecoverable false negatives) with a controlled, instruction-tuned LLM for suggestion extraction, categorization, clustering, and summarization. Across real-world hospitality and food datasets, the hybrid system outperforms prompt-only, rule-based, and classifier-only baselines in extraction accuracy and cluster coherence. Human evaluations further confirm that the resulting suggestions and summaries are clear, faithful, and interpretable. Overall, our results show that hybrid reasoning architectures achieve meaningful improvements in fine-grained actionable suggestion mining while highlighting challenges in domain adaptation and efficient local deployment.
ShopperBench: A Benchmark for Personalized Shopping with Persona-Guided Simulation
Yuan Ling | Chunqing Yuan | Shujing Dong | Yongjian Yang | Nataraj Mocherla | Ayush Goyal
Personalized shopping agents must adapt their decisions to different user personas, balancing efficiency, preference alignment, and goal success. Building upon the WebShop dataset and 𝜏2-Bench environment, ShopperBench introduces a persona-guided benchmark for evaluating such adaptive behaviors. ShopperBench augments shopping trajectories with persona-conditioned goals, reasoning rationales, and preference cues, capturing how diverse shopper types—from price-conscious planners to trend-seeking explorers—navigate product search and selection. We further design a baseline of ShopperAgents that operate under persona guidance to simulate realistic, goal-oriented shopping interactions. To evaluate these agents, we propose new metrics including Persona Fidelity, Persona-Query Alignment, and Path Consistency. Together, our ShopperBench provides a testbed for studying personalized and context-aware shopping intelligence, bridging the gap between human-centered e-commerce behavior and agent-based simulation.
ARQA: A Benchmark for Grounded Table–Text QA in Enterprise Annual Reports
Ruilong Wang | Simone Balloccu
Annual reports communicate corporate performance to stakeholders through dense tables and explanatory text, with rich grounding signals making automated reasoning challenging. Existing QA benchmarks focus on retrieval or single-modality reasoning and rarely require justification for answers with both textual and tabular evidence. We introduce ARQA (Annual Report QA), a benchmark of ~2.5K QA pairs spanning ten fiscal years of automotive enterprise annual reports and three reasoning families — Lookup, Arithmetic, and Insight. Data are produced via a planner–generator pipeline, deterministically verified and recomputed, and fully reviewed by domain experts. We evaluate state-of-the-art instruction-tuned language models on ARQA, showing strong factual retrieval but persistent weaknesses in grounded arithmetic and causal reasoning. We release ARQA and its evaluation toolkit to facilitate research on auditable, evidence-first reasoning over enterprise documents. (https://github.com/RuilongWang/ARQA-Benchmark/)
Do Clinical Question Answering Systems Really Need Specialised Medical Fine Tuning?
Sushant Kumar Ray | Gautam Siddharth Kashyap | Sahil Tripathi | Nipun Joshi | Vijay Govindarajan | Rafiq Ali | Jiechao Gao | Usman Naseem
Industry Clinical Question-Answering (CQA) systems increasingly rely on Large Language Models (LLMs), yet their deployment is often guided by the assumption that domain-specific fine-tuning is essential. Although specialised medical LLMs such as BioBERT, BioGPT, and PubMedBERT remain popular, they face practical limitations including narrow coverage, high retraining costs, and limited adaptability. Efforts based on Supervised Fine-Tuning (SFT) have attempted to address these assumptions but continue to reinforce what we term the SPECIALISATION FALLACY—the belief that specialised medical LLMs are inherently superior for CQA. To address this assumption, we introduce MEDASSESS-X, a deployment-oriented industry CQA framework that applies alignment at inference time rather than through SFT. MEDASSESS-X uses lightweight steering vectors to guide model activations toward medically consistent reasoning without updating model weights or requiring domain-specific retraining. This inference-time alignment layer stabilises CQA performance across both general-purpose and specialised medical LLMs, thereby resolving the SPECIALISATION FALLACY. Empirically, MEDASSESS-X delivers consistent gains across all LLM families, improving Accuracy by up to +6%, Factual Consistency by +7%, and reducing Safety Error Rate by as much as 50%.
SkiLLens: Recognising and Mapping Novel Skills from Millions of Job Ads Across Europe Using Language Models
Alessia De Santo | Lorenzo Malandri | Fabio Mercorio | Mario Mezzanzanica | Navid Nobani
In a rapidly evolving labor market, detecting and addressing emerging skill needs is essential for shaping responsive education and workforce policies. Online job advertisements (OJAs) provide a real-time view of changing demands, but require first retrieving skill mentions from unstructured text and then solving the entity linking problem of connecting them to standardized skill taxonomies. To harness this potential, we present a multilingual human-in-the-loop (HITL) pipeline that operates in two steps: candidate skills are extracted from national OJA corpora using country-specific word embeddings, capturing terms that reflect each country’s labor market. These candidates are linked to ESCO using an encoder-based system and refined through decoder-based large language models (LLMs) for accurate contextual alignment. Our approach is validated through both quantitative and qualitative evaluations, demonstrating that our method enables timely, multilingual monitoring of emerging skills, supporting agile policy-making and targeted training initiatives.
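A minimal sketch of the encoder-based linking step only (candidate skill to taxonomy entry via embedding similarity), under stated assumptions: the model name and the three-item skill list are illustrative stand-ins, not the paper's setup or the ESCO taxonomy.

```python
from sentence_transformers import SentenceTransformer, util

# Multilingual encoder used purely for illustration.
encoder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

esco_skills = ["data analysis", "welding", "prompt engineering"]   # stand-in for the ESCO skill list
skill_embs = encoder.encode(esco_skills, convert_to_tensor=True)

def link_candidate(candidate_skill, top_k=3):
    """Return the nearest taxonomy entries for a candidate skill mention."""
    emb = encoder.encode(candidate_skill, convert_to_tensor=True)
    hits = util.semantic_search(emb, skill_embs, top_k=top_k)[0]
    # In the full pipeline, a decoder LLM would then confirm/re-rank these candidates in context.
    return [(esco_skills[h["corpus_id"]], float(h["score"])) for h in hits]

print(link_candidate("analysing spreadsheets"))
```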
SYMDIREC: A Neuro-Symbolic Divide-Retrieve-Conquer Framework for Enhanced RTL Synthesis and Summarization
Prashanth Vijayaraghavan | Apoorva Nitsure | Luyao Shi | Charles Mackin | Ashutosh Jadhav | David Beymer | Ehsan Degan | Vandana Mukherjee
Register-Transfer Level (RTL) synthesis and summarization are central to hardware design automation but remain challenging for Large Language Models (LLMs) due to rigid HDL syntax, limited supervision, and weak alignment with natural language. Existing prompting and retrieval-augmented generation (RAG) methods have not incorporated symbolic planning, limiting their structural precision. We introduce SYMDIREC, a neuro-symbolic framework that decomposes RTL tasks into symbolic subgoals, retrieves relevant code via a fine-tuned retriever, and assembles verified outputs through LLM reasoning. Supporting both Verilog and VHDL without LLM fine-tuning, SYMDIREC achieves ~20% higher Pass@1 rates for synthesis and 15–20% ROUGE-L improvements for summarization over prompting and RAG baselines, demonstrating the benefits of symbolic guidance in RTL tasks.
Benchmarking and Mitigating the Impact of Noisy User Prompts in Medical VLMs via Cross-Modal Reflection
Zhiyu Xue | Reza Abbasi-Asl | Ramtin Pedarsani
Medical vision-language models (Med-VLMs) offer a new and effective paradigm for digital health in tasks such as disease diagnosis using clinical images and text. In these tasks, an important but underexplored research question is how Med-VLMs interpret and respond to user-provided clinical information, especially when the prompts are noisy. For a systematic evaluation, we construct Med-CP, a large-scale visual question answering (VQA) benchmark designed to comprehensively evaluate the influence of clinical prompts across diverse modalities, anatomical regions, and diagnostic tasks. Our experiments reveal that existing Med-VLMs tend to follow user-provided prompts blindly, regardless of whether they are accurate or not, raising concerns about their reliability in real-world interactions. To address this problem, we introduce a novel supervised fine-tuning (SFT) approach for Med-VLMs based on cross-modal reflection chain-of-thought (CoT) across medical images and text. In our SFT method, the Med-VLM is trained to produce reasoning paths for the analysis of the medical image and the user-provided prompt. Then, the final answer is determined by conducting a reflection on the visual and textual information. Experimental results demonstrate that our method considerably enhances the robustness against noisy user-provided prompts for both in-domain and out-of-domain evaluation scenarios.
Lightweight Domain-Specific Language Model for Real-Time Structuring of Medical Prescriptions
Jonathan Pattin Cottet | Véronique Eglin | Alex Aussem
Automated structuring of medical prescriptions is critical for downstream safety checks in pharmacies, yet remains challenging due to heterogeneous layouts, OCR noise, and dense clinical abbreviations in real-world documents. Existing language models either ignore layout information, rely on computationally expensive image-based architectures, or cannot operate under strict privacy and hardware constraints such as GDPR and HDS-certified environments.We present a lightweight (<10M parameters), privacy-preserving transformer specifically designed for Entity Extraction (EE) and Entity Linking (EL) in French medical prescriptions. The model uses only OCR text and normalized 2D word coordinates, enabling robust pseudonymisation and real-time CPU-level inference while preserving essential spatial cues. It is pretrained on a large corpus of pseudonymised OCR outputs using objectives tailored to prescription structure, including a novel Token-to-Line Alignment (TLA) task, and fine-tuned on the Rx-PAD dataset (Pattin Cottet et al., 2025).Empirical results show that our approach matches or surpasses larger document-understanding models and rivals multimodal LLMs on strict extraction metrics, while achieving sub-second latency suitable for operational deployment. The system is currently used in 230 pharmacies, demonstrating both scalability and practical relevance. These findings highlight the importance of specialized, domain-aware, lightweight models for safe, efficient, and legally compliant prescription verification.
Balanced Accuracy: The Right Metric for Evaluating LLM Judges - Explained through Youden’s J statistic
Stephane Collot | Colin Fraser | Justin Zhao | William F. Shen | Timon Willi | Ilias Leontiadis
Rigorous evaluation of large language models (LLMs) relies on comparing models by the prevalence of desirable or undesirable behaviors, such as task pass rates or policy violations. These prevalence estimates are produced by a classifier, either an LLM-as-a-judge or human annotators, making the choice of classifier central to trustworthy evaluation. Common metrics used for this choice, such as Accuracy, Precision, and F1, are sensitive to class imbalance and to arbitrary choices of positive class, and can favor judges that distort prevalence estimates. We show that Youden’s J statistic is theoretically aligned with choosing the best judge to compare models, and that Balanced Accuracy is an equivalent linear transformation of J. Through both analytical arguments and empirical examples and simulations, we demonstrate how selecting judges using Balanced Accuracy leads to better, more robust classifier selection.
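A worked numerical illustration of the stated relationship (standard definitions, not code from the paper): Youden's J is sensitivity plus specificity minus one, and Balanced Accuracy is their mean, so BA = (J + 1) / 2 and ranking judges by either metric picks the same judge.

```python
def balanced_accuracy(tp, fn, tn, fp):
    tpr = tp / (tp + fn)   # sensitivity (true positive rate)
    tnr = tn / (tn + fp)   # specificity (true negative rate)
    return (tpr + tnr) / 2

def youdens_j(tp, fn, tn, fp):
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return tpr + tnr - 1

# Balanced Accuracy is an affine transform of J: BA = (J + 1) / 2.
ba = balanced_accuracy(80, 20, 90, 10)   # 0.85
j = youdens_j(80, 20, 90, 10)            # 0.70
assert abs(ba - (j + 1) / 2) < 1e-12
```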
PharmaQA.IT: an Italian dataset for QA in the pharmaceutical domain
Kamyar Zeinalipour | Andrea Zugarini | Asya Zanollo | Leonardo Rigutini
The growing use of Large Language Models (LLMs) for medical Question Answering (QA) requires reliable, evidence-grounded benchmarks beyond English. In Italy, Riassunti delle Caratteristiche del Prodotto (RCP) issued by the Italian Medicines Agency (AIFA) are the main regulatory source on medicines, yet no QA dataset exists on these documents, limiting the development and evaluation of trustworthy Italian QA systems. We introduce PharmaQA.IT, an Italian extractive QA dataset built from RCPs in PharmaER.IT. Using a semi-automatic pipeline, we (i) select informative pages from 1,077 leaflets, (ii) prompt a multimodal LLM on page images with professional personas to generate candidate question–answer pairs, and (iii) validate and normalise them with expert revision. The final dataset contains 861 high-quality question–answer pairs on indications, contraindications, dosage, warnings, interactions, and pharmacological properties. We frame PharmaQA.IT as an extractive QA benchmark with structured JSON outputs and evaluate a range of open and proprietary LLMs. Results show that open models approach closed-source performance under a chunking-and-retrieval setup. PharmaQA.IT, together with all code, prompts, and evaluation scripts, is publicly available on Hugging Face to support research on trustworthy Italian biomedical QA.
DIRECT: Directional Relevance in Conversational Trajectories
Anshuman Mourya | Rajdeep Mukherjee | Prerna Jolly | Vinayak S Puranik | Sivaramakrishnan R Kaveri
Conversational Agents have become ubiquitous across application domains such as shopping assistants, medical diagnosis, and autonomous task planning. Users interacting with these agents often fail to understand how to start a conversation or what to ask next to obtain the desired information. To enable seamless and hassle-free user-agent interactions, we introduce Next Question Suggestions (NQS), which are essentially highly relevant follow-up question recommendations that act as conversation starters or discoverability tools to capture non-trivial user intents, leading to more engaging conversations. Relying on LLMs for both response as well as NQS generation is a costly ask in latency-constrained commercial settings, with an added risk of handling potentially unsafe or unanswerable generated queries. A key component of building an efficient low-latency NQS experience is, therefore, retrieval (or embedding) models that fetch the most-relevant candidate questions from an offline pre-curated Question Bank (QB). Off-the-shelf embedding models cannot capture domain-specific nuances and, more importantly, the directionality inherent in follow-up question recommendations. In this work, we propose an end-to-end retrieval system, DIRECT, that is optimized to model directional relevance. Given a user query, it produces a ranked list of highly relevant follow-up question recommendations within one second. Our system also contains an LLM-as-a-judge component, tuned on proprietary user-agent interaction logs, to evaluate the end-to-end performance in terms of CTR.
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 6: Tutorials)
Chenghua Lin | Aline Paes | Rodrigo Wilkens
AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation
Yufang Hou | Steffen Eger | Anne Lauscher | Wei Zhao | Yong Cao
This tutorial provides an in-depth overview of recent advances in AI-assisted tools and models that support and enhance the scientific research process, spanning literature discovery, idea generation, multimodal content understanding and generation, text/table generation, and AI-supported peer review.
Cognitive Effects and Biases in Large Language Models
Markus Schedl | Ralph Hertwig | Antonela Tommasel | Shahed Masoudian
This tutorial bridges psychology and NLP to clarify cognitive effects and biases in large language models, compares methodological assumptions across disciplines, and discusses practical evaluation challenges and open research directions.
Encoding and Decoding Language in the Brain with Language Models
Anuja Negi | Mathis Lamarre | Christine Tseng | Subba Reddy Oota
This tutorial introduces brain-language model alignment and recent advances in scaling, multilingual brain encoding, brain-informed fine-tuning, and brain decoding with language models, including semantic reconstruction from brain data.
Multimodal Large Language Models for Human-AI Interaction: Foundations, Agents, and Inclusive Applications
Shafiq Joty | Enamul Hoque | Ahmed Masry | Spandana Gella | Samira Ebrahimi Kahou
This tutorial presents foundations, agentic capabilities, and inclusive applications of multimodal large language models, covering architectures, multimodal alignment and reasoning, conversational GUI agents, accessibility, multilingual communication, and responsible deployment.
Findings of the Association for Computational Linguistics: EACL 2026
Vera Demberg | Kentaro Inui | Lluís Marquez
Unveiling the Deficiencies of Pre-trained Text-and-Layout Models in Real-world Visually-rich Document Information Extraction
Chong Zhang | Yixi Zhao | Yulu Xie | Chenshu Yuan | Yi Tu | Ya Guo | Mingxu Chai | Ziyu Shen | Yue Zhang | Qi Zhang
Recently developed pre-trained text-and-layout models (PTLMs) have shown remarkable success in multiple information extraction tasks on visually-rich documents (VrDs). However, despite achieving extremely high performance on benchmarks, their real-world performance falls short of expectations. Motivated by this issue, we investigate the prevailing evaluation pipeline to reveal that: (1) The inadequate annotations within benchmark datasets introduce spurious correlations between task inputs and labels, which would lead to overly-optimistic estimation of model performance. (2) The evaluation solely relies on the performance on benchmarks and is insufficient to comprehensively explore the capabilities of methods in real-world scenarios. These problems impede the prevailing evaluation pipeline from reflecting the real-world performance of methods, misleading the design choices of method optimization. In this work, we introduce EC-FUNSD, an entity-centric dataset crafted for benchmarking information extraction from visually-rich documents. This dataset contains diverse layouts and high-quality annotations. Additionally, this dataset disentangles the falsely-coupled segment and entity annotations that arise from the block-level annotation of FUNSD. Using the proposed dataset, we evaluate the real-world information extraction capabilities of PTLMs from multiple aspects, including their absolute performance, as well as generalization, robustness and fairness. The results indicate that prevalent PTLMs do not perform as well as anticipated in real-world information extraction scenarios. We hope that our study can inspire reflection on the directions of PTLM development.
Entity-aware Cross-lingual Claim Detection for Automated Fact-checking
Rrubaa Panchendrarajan | Arkaitz Zubiaga
Identifying claims requiring verification is a critical task in automated fact-checking, especially given the proliferation of misinformation on social media platforms. Despite notable progress, challenges remain—particularly in handling multilingual data prevalent in online discourse. Recent efforts have focused on fine-tuning pre-trained multilingual language models to address this. While these models can handle multiple languages, their ability to effectively transfer cross-lingual knowledge for detecting claims spreading on social media remains under-explored. In this paper, we introduce EX-Claim, an entity-aware cross-lingual claim detection model that generalizes well to handle multilingual claims. The model leverages entity information derived from named entity recognition and entity linking techniques to improve the language-level performance of both seen and unseen languages during training. Extensive experiments conducted on three datasets from different social media platforms show that our proposed model stands out as an effective solution, with consistent performance gains across 27 languages and robust knowledge transfer between languages seen and unseen during training.
WorkForceAgent-R1: Incentivizing Reasoning Capability in LLM-based Web Agents via Reinforcement Learning
Yuchen Zhuang | Di Jin | Jiaao Chen | Wenqi Shi | Hanrui Wang | Chao Zhang
Large language model (LLM)-empowered web agents enable automating complex, real-time web navigation tasks in enterprise environments. However, existing web agents relying on supervised fine-tuning (SFT) often struggle with generalization and robustness due to insufficient reasoning capabilities when handling the inherently dynamic nature of web interactions. In this study, we introduce WorkForceAgent-R1, an LLM-based web agent trained using a rule-based R1-style reinforcement learning framework explicitly designed to enhance single-step reasoning and planning for business-oriented web navigation tasks. We employ a structured reward function that evaluates both adherence to output formats and correctness of actions, enabling WorkForceAgent-R1 to implicitly learn robust intermediate reasoning without explicit annotations or extensive expert demonstrations. Extensive experiments on the WorkArena benchmark demonstrate that WorkForceAgent-R1 substantially outperforms SFT baselines by 10.26–16.59%, achieving competitive performance relative to proprietary LLM-based agents (GPT-4o) in workplace-oriented web navigation tasks.
Being Kind Isn’t Always Being Safe: Diagnosing Affective Hallucination in LLMs
Sewon Kim | Jiwon Kim | SeungWoo Shin | Hyejin Chung | Daeun Moon | Yejin Kwon | Hyunsoo Yoon
Large Language Models (LLMs) are increasingly engaged in emotionally vulnerable conversations that extend beyond information seeking to moments of personal distress. As they adopt affective tones and simulate empathy, they risk creating the illusion of genuine relational connection. We term this phenomenon Affective Hallucination, referring to emotionally immersive responses that evoke false social presence despite the model’s lack of affective capacity. To address this, we introduce AHaBench, a benchmark of 500 mental-health-related prompts with expert-informed reference responses, evaluated along three dimensions: Emotional Enmeshment, Illusion of Presence, and Fostering Overdependence. We further release AHaPairs, a 5K-instance preference dataset enabling Direct Preference Optimization (DPO) for alignment with emotionally responsible behavior. DPO fine-tuning substantially reduces affective hallucination without compromising reasoning performance, and the Pearson correlation coefficient between GPT-4o and human judgments is also strong (r=0.85), indicating that human evaluations confirm AHaBench as an effective diagnostic tool. This work establishes affective hallucination as a distinct safety concern and provides resources for developing LLMs that are both factually reliable and psychologically safe. Warning: This paper contains examples of mental health-related language that may be emotionally distressing.
Joint Multimodal Preference Optimization for Fine-Grained Visual-Textual Alignment
Jiwon Kim | Hyunsoo Yoon
Recent research has focused on addressing multimodal hallucinations in Large Vision-Language Models (LVLMs) by extending Direct Preference Optimization (DPO) to incorporate visual preference supervision. However, these methods often lack fine-grained visual contrast mechanisms and rely on single-margin optimization. This in turn limits their ability to capture precise visual semantics and results in weak multimodal alignment. To address these issues, we propose Joint Multimodal Preference Optimization (JoMPO), a novel optimization framework that symmetrically integrates a text-conditioned preference loss with a visual ranking-based objective. JoMPO leverages semantically contrastive image–text pairs and listwise ranking over multiple visual contexts, enabling fine-grained visual grounding and more robust cross-modal alignment. To support this framework, we introduce the Visual–Textual Contrast (VTC) dataset, consisting of image pairs that are semantically similar but visually distinct, each paired with a contextually grounded textual response. When trained with only 5k contrastive pairs, JoMPO consistently demonstrates superior performance across diverse benchmarks, highlighting its effectiveness in mitigating hallucinations and improving image-text alignment in LVLMs.
Let’s Put Ourselves in Sally’s Shoes: Shoes-of-Others Prefilling Improves Theory of Mind in Large Language Models
Kazutoshi Shinoda | Nobukatsu Hojo | Kyosuke Nishida | Yoshihiro Yamazaki | Keita Suzuki | Hiroaki Sugiyama | Kuniko Saito
Recent studies have shown that Theory of Mind (ToM) in large language models (LLMs) has not reached human-level performance yet. Since fine-tuning LLMs on ToM datasets often degrades their generalization, several inference-time methods have been proposed to enhance ToM in LLMs. However, existing inference-time methods for ToM are specialized for inferring beliefs from contexts involving changes in the world state. In this study, we present a new inference-time method for ToM, Shoes-of-Others (SoO) prefilling, which makes fewer assumptions about contexts and is applicable to broader scenarios. SoO prefilling simply specifies the beginning of LLM outputs with “Let’s put ourselves in A’s shoes.”, where A denotes the target character’s name. We evaluate SoO prefilling on two benchmarks that assess ToM in conversational and narrative contexts without changes in the world state and find that it consistently improves ToM across five categories of mental states. Our analysis suggests that SoO prefilling elicits faithful thoughts, thereby improving the ToM performance.
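A minimal sketch of output prefilling as the abstract describes it (forcing the response to begin with the SoO sentence and letting the model continue). The model name, prompt layout, and question are placeholders; a chat template would normally be applied first.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any instruction-tuned causal LM works the same way here.
name = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

question = "Where will Sally look for her marble?"
prefill = "Let's put ourselves in Sally's shoes."   # SoO prefilling: fixed start of the answer

# Prefilling means the model continues from the forced prefix instead of
# composing the start of its answer itself.
inputs = tok(question + "\nAnswer: " + prefill, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128)
print(prefill + tok.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```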
Plane Geometry Problem Solving with Multi-modal Reasoning: A Survey
Seunghyuk Cho | Zhenyue Qin | Yang Liu | Youngbin Choi | Seungbeom Lee | Dongwoo Kim
Plane geometry problem solving (PGPS) has recently gained significant attention as a benchmark to assess the multi-modal reasoning capabilities of large vision-language models. Despite the growing interest in PGPS, the research community still lacks a comprehensive overview that systematically synthesizes recent work in PGPS. To fill this gap, we present a survey of existing PGPS studies. We first categorize PGPS methods into an encoder-decoder framework and summarize the corresponding output formats used by their encoders and decoders. Subsequently, we classify and analyze these encoders and decoders according to their architectural designs. Finally, we outline major challenges and promising directions for future research. In particular, we discuss the hallucination issues arising during the encoding phase within encoder-decoder architectures, as well as the problem of data leakage in current PGPS benchmarks.
Examining the Utility of Self-disclosure Types for Modeling Annotators of Social Norms
Kieran Henderson | Kian Omoomi | Vasudha Varadarajan | Allison Lahnala | Charles Welch
Recent work has explored the use of personal information in the form of persona sentences or self-disclosures to improve modeling of individual characteristics and prediction of annotator labels for subjective tasks. The volume of personal information has historically been restricted and thus little exploration has gone into understanding what kind of information is most informative for predicting annotator labels. In this work, we categorize self-disclosures and use them to build annotator models for predicting judgments of social norms. We perform several ablations and analyses to examine the impact of the type of information on our ability to predict annotation patterns. Contrary to previous work, only a small number of comments related to the original post are needed. Lastly, a more diverse sample of annotator self-disclosures did not lead to the best performance. Sampling from a larger pool of comments without filtering still yields the best performance, suggesting that there is still much to uncover in terms of what information about an annotator is most useful for verdict prediction.
Position Paper: How Should We Responsibly Adopt LLMs in the Peer Review Process?
Juhwan Choi | JungMin Yun | Changhun Kim | YoungBin Kim
This position paper presents a novel perspective on the utilization of Large Language Models (LLMs) in the artificial intelligence paper review process. We first critique the current tendency for LLMs to be primarily used for simple review text generation, arguing instead that this approach overlooks more meaningful applications of LLMs that preserve human expertise at the core of evaluation. Instead, we advocate for leveraging LLMs to support key aspects of the review process—specifically, verifying the reproducibility of experimental results, checking the correctness and relevance of citations, and assisting with ethics review flagging. For example, integrating tools based on LLM Agents for code generation from research papers has recently enabled automated assessment of the reproducibility of the paper, thereby improving the transparency and reliability of research. By reorienting LLM usage toward these targeted and assistive roles, we outline a pathway for more effective and responsible integration of LLMs into peer review, ultimately supporting both reviewer efficiency and the integrity of the scientific process.
Rad-Flamingo: A Multimodal Prompt driven Radiology Report Generation Framework with Patient-Centric Explanations
Md. Tousin Akhter | Devansh Lalwani | Kshitij Sharad Jadhav | Pushpak Bhattacharyya
In modern healthcare, radiology plays a pivotal role in diagnosing and managing diseases. However, the complexity of medical imaging data and the variability in interpretation can lead to inconsistencies and a lack of patient-centered insight in radiology reports. To address this challenge, we developed Rad-Flamingo, a novel multimodal prompt-driven report generation framework that integrates diverse data modalities, such as medical images and clinical notes, to produce comprehensive and context-aware radiology reports. Our framework leverages innovative prompt engineering techniques to guide vision-language models in generating relevant information, ensuring these generated reports are not only accurate but also understandable to individual patients. A key feature of our framework is its ability to provide patient-centric explanations, offering clear and personalized insights into diagnostic findings and their implications. Additionally, we demonstrate a synthetic data generation pipeline to augment the findings and impressions of existing benchmark datasets with patient-centric explanations. Experimental results demonstrate the framework’s effectiveness in enhancing report quality and improving understandability, and suggest that it could foster better patient-doctor communication. This approach represents a significant step towards human-centered medical AI systems.
I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search
Zujie Liang | Feng Wei | Wujiang Xu | Yuxi Qian | Lin Chen | Xinhui Wu
Recent advancements in large language models (LLMs) have shown remarkable potential in automating machine learning tasks. However, existing LLM-based agents often struggle with low diversity and suboptimal code generation. While recent work (CITATION) has introduced Monte Carlo Tree Search (MCTS) to address these issues, limitations persist in the quality and diversity of thoughts generated, as well as in the scalar value feedback mechanisms used for node selection. In this study, we introduce Introspective Monte Carlo Tree Search (I-MCTS), a novel approach that iteratively expands tree nodes through an introspective process that meticulously analyzes solutions and results from parent and sibling nodes. This facilitates a continuous refinement of the node in the search tree, thereby enhancing the overall decision-making process. Furthermore, we integrate a Large Language Model (LLM)-based value model to facilitate direct evaluation of each node’s solution prior to conducting comprehensive computational rollouts. A hybrid rewarding mechanism is implemented to seamlessly transition the Q-value from estimated score to actual performance scores. Applied to various ML tasks, our approach demonstrates a 4% absolute improvement in performance compared to strong open-source AutoML agents, showcasing its effectiveness in enhancing agentic AutoML systems. Resources are available at https://github.com/jokieleung/I-MCTS
ThinkNote: Enhancing Knowledge Integration and Utilization of Large Language Models via Constructivist Cognition Modeling
Zhipeng Xu | Zhenghao Liu | Yukun Yan | Shuo Wang | Shi Yu | Zheni Zeng | Chaojun Xiao | Zhiyuan Liu | Ge Yu | Chenyan Xiong
Large Language Models (LLMs) have demonstrated strong performance across a wide range of NLP tasks. However, they often exhibit suboptimal behaviors and inconsistencies when exposed to unfamiliar external information, underscoring their limitations in effectively leveraging such knowledge. Inspired by constructivist learning theory, we propose ThinkNote, a novel framework that enhances the external knowledge utilization of LLMs through a two-stage constructivist cognitive modeling process. Specifically, ThinkNote performs knowledge assimilation to align new information with the model’s parametric memory, forming a coherent internal representation. It then applies thought accommodation to adapt internal reasoning, thereby promoting more consistent and reliable outputs. Extensive experimental results demonstrate that ThinkNote achieves a 10% improvement over strong baseline methods on various question-answering benchmarks. Further analysis indicates that ThinkNote effectively integrates and utilizes external knowledge to help LLMs generate accurate responses and improves their self-consistency. All data and code will be publicly available at https://github.com/OpenMatch/ThinkNote.
Mitigating Copy Bias in In-Context Learning through Neuron Pruning
Ameen Ali Ali | Lior Wolf | Ivan Titov
Large language models (LLMs) have demonstrated impressive few-shot in-context learning (ICL) abilities. Still, we show that they are sometimes prone to a ‘copying bias’, where they copy answers from provided examples instead of learning the underlying patterns. In this work, we propose a novel and simple method to mitigate such copying bias. First, we create a synthetic task and use the Integrated Gradients method to identify neurons that prioritize copying over generalization. We demonstrate that pruning these neurons consistently improves performance across a diverse set of ICL tasks, including both single-token and multi-token scenarios, while maintaining or even improving the model’s general capabilities. We also show that our method is applicable across various LLM architectures, including Transformers and State-Space Models, without requiring modifications. In our analysis, we adopt a task-recognition perspective on ICL and examine task vectors (Hendel et al., 2023) induced by the model. We find that pruning enhances the quality of these vectors, suggesting that the pruned neurons previously hindered effective task recognition.
How to Make LMs Strong Node Classifiers?
Zhe Xu | Kaveh Hassani | Si Zhang | Hanqing Zeng | Michihiro Yasunaga | Limei Wang | Dongqi Fu | Ning Yao | Bo Long | Hanghang Tong
Language Models (LMs) are increasingly challenging the dominance of domain-specific models, such as Graph Neural Networks (GNNs) and Graph Transformers (GTs), in graph learning tasks. Following this trend, we propose a novel approach that empowers off-the-shelf LMs to achieve performance comparable to state-of-the-art (SOTA) GNNs on node classification tasks, without requiring any architectural modifications. By preserving the LM’s original architecture, our approach retains a key benefit of LM instruction tuning: the ability to jointly train on diverse datasets, fostering greater flexibility and efficiency. To achieve this, we introduce two key augmentation strategies: (1) Enriching LMs’ input using topological and semantic retrieval methods, which provide richer contextual information, and (2) guiding the LMs’ classification process through a lightweight GNN classifier that effectively prunes class candidates. Our experiments on real-world datasets show that backbone Flan-T5 LMs equipped with these augmentation strategies outperform SOTA text-output node classifiers and are comparable to top-performing vector-output node classifiers. By bridging the gap between specialized node classifiers and general LMs, this work paves the way for more versatile and widely applicable graph learning models. We will open-source the code upon publication.
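A minimal sketch of how the two augmentations named above could come together at prompt-construction time: retrieved neighbor texts enrich the input, and the GNN-pruned label set constrains the LM's choice. The template and example values are assumptions for illustration, not the paper's exact format.

```python
def build_node_prompt(node_text, neighbor_texts, candidate_labels):
    """Compose an LM input from a node, its retrieved neighbors, and GNN-pruned classes."""
    context = "\n".join(f"- {t}" for t in neighbor_texts)   # topological/semantic retrieval results
    choices = ", ".join(candidate_labels)                    # classes kept by the lightweight GNN pruner
    return (f"Node description:\n{node_text}\n\n"
            f"Related nodes:\n{context}\n\n"
            f"Choose the node's class from: {choices}\nAnswer:")

print(build_node_prompt("Paper about graph transformers",
                        ["Survey of graph neural networks", "Attention-based sequence models"],
                        ["cs.LG", "cs.CL"]))
```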
Rethinking Data Mixture for Large Language Models: A Comprehensive Survey and New Perspectives
Yajiao Liu | Congliang Chen | Junchi Yang | Ruoyu Sun
Training large language models with data collected from various domains can improve their performance on downstream tasks. However, given a fixed training budget, the sampling proportions of these different domains significantly impact the model’s performance. How can we determine the domain weights across different data domains to train the best-performing model within constrained computational resources? In this paper, we provide a comprehensive overview of existing data mixture methods. First, we propose a fine-grained categorization of existing methods, extending beyond the previous offline and online classification. Offline methods are further grouped into heuristic-based, algorithm-based, and function fitting-based methods. For online methods, we categorize them into three groups—online min-max optimization, online mixing law, and other approaches—by drawing connections with the optimization frameworks underlying offline methods. Second, we summarize the problem formulations, representative algorithms for each subtype of offline and online methods, and clarify the relationships and distinctions among them. Finally, we discuss the advantages and disadvantages of each method and highlight key challenges in the field of data mixture.
The Mediomatix Corpus: Parallel Data for Romansh Language Varieties via Comparable Schoolbooks
Zachary William Hopton | Jannis Vamvas | Andrin Büchler | Anna Rutkiewicz | Rico Cathomas | Rico Sennrich
The five idioms (i.e., varieties) of the Romansh language are largely standardized and are taught in the schools of the respective communities in Switzerland. In this paper, we present the first parallel corpus of Romansh idioms. The corpus is based on 291 schoolbook volumes, which are comparable in content for the five idioms. We use automatic alignment methods to extract 207k multi-parallel segments from the books, with more than 2M tokens in total. A small-scale human evaluation confirms that the segments are highly parallel, making the dataset suitable for NLP applications such as machine translation between Romansh idioms. We release the parallel and unaligned versions of the dataset under a CC-BY-NC-SA license and demonstrate its utility for machine translation by training and evaluating an LLM and a supervised multilingual MT model on the dataset.
Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models
Wei Zhao | Zhe Li | Yige Li | Jun Sun
Despite significant ongoing efforts in safety alignment, large language models (LLMs) such as GPT-4 and LLaMA 3 remain vulnerable to jailbreak attacks that can induce harmful behaviors, including through the use of adversarial suffixes. Building on prior research, we hypothesize that these adversarial suffixes are not mere bugs but may represent features that can dominate the LLM’s behavior. To evaluate this hypothesis, we conduct several experiments. First, we demonstrate that benign features can be effectively made to function as adversarial suffixes, i.e., we develop a feature extraction method to extract sample-agnostic features from benign datasets in the form of suffixes and show that these suffixes may effectively compromise safety alignment. Second, we show that adversarial suffixes generated from jailbreak attacks may contain meaningful features, i.e., appending the same suffix to different prompts results in responses exhibiting specific characteristics. Third, we show that such benign-yet-safety-compromising features can be easily introduced through fine-tuning using only benign datasets. As a result, we are able to completely eliminate GPT’s safety alignment in a black-box setting through fine-tuning with only benign data. Our code and data are available at anonymous.4open.science/r/suffix-maybe-features-D17C/.
JEEM: Vision-Language Understanding in Four Arabic Dialects
Karima Kadaoui | Hanin Atwany | Hamdan Al-Ali | Abdelrahman Mohamed | Ali Mekky | Sergei Tilga | Natalia Fedorova | Ekaterina Artemova | Hanan Aldarmaki | Yova Kementchedjhieva
We introduce JEEM, a benchmark designed to evaluate Vision-Language Models (VLMs) on visual understanding across four Arabic-speaking countries: Jordan, The Emirates, Egypt, and Morocco. JEEM includes the tasks of image captioning and visual question answering, and features culturally rich and regionally diverse content. This dataset aims to assess the ability of VLMs to generalize across dialects and accurately interpret cultural elements in visual contexts. In an evaluation of five prominent open-source Arabic VLMs and GPT-4o, we find that the Arabic VLMs consistently underperform, struggling with both visual understanding and dialect-specific generation. While GPT-4o ranks best in this comparison, the model’s linguistic competence varies across dialects, and its visual understanding capabilities lag behind. This underscores the need for more inclusive models and the value of culturally-diverse evaluation paradigms.
Detecting Primary Progressive Aphasia (PPA) from Text: A Benchmarking Study
Ghofrane Merhbene | Fabian Lecron | Philippe Fortemps | Bradford C. Dickerson | Mascha Kurpicz-Briki | Neguine Rezaii
Classifying subtypes of primary progressive aphasia (PPA) from connected speech presents significant diagnostic challenges due to overlapping linguistic markers. This study benchmarks the performance of traditional machine learning models with various feature extraction techniques, transformer-based models, and large language models (LLMs) for PPA classification. Our results indicate that while transformer-based models and LLMs exceed chance-level performance in terms of balanced accuracy, traditional classifiers combined with contextual embeddings remain highly competitive. Notably, an MLP using MentalBERT embeddings achieves the highest accuracy. These findings underscore the potential of machine learning for enhancing the automatic classification of PPA subtypes.
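A minimal sketch of the best-performing recipe named above (contextual embeddings fed to an MLP), under stated assumptions: the checkpoint id is the commonly used MentalBERT repository and mean pooling is one reasonable choice; the actual feature extraction, pooling, and hyperparameters in the study may differ.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.neural_network import MLPClassifier

name = "mental/mental-bert-base-uncased"   # assumed MentalBERT checkpoint id
tok, enc = AutoTokenizer.from_pretrained(name), AutoModel.from_pretrained(name)

def embed(texts):
    """Mean-pooled contextual embeddings for a list of transcripts."""
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = enc(**batch).last_hidden_state
    return out.mean(dim=1).numpy()

# X_train: connected-speech transcripts, y_train: PPA subtype labels (not provided here).
# clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=500).fit(embed(X_train), y_train)
```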
Geometric Interpretation of Layer Normalization and a Comparative Analysis with RMSNorm
Akshat Gupta | Atahan Ozdemir | Caoqinwei Gong | Gopala Anumanchipalli
This paper presents a novel geometric interpretation of LayerNorm and explores how LayerNorm influences the norm and orientation of hidden vectors in the representation space. We show that the definition of LayerNorm is innately linked to the uniform vector, defined as the vector whose components are all equal, [1, 1, …, 1]. We then show that the standardization step in LayerNorm can be understood in three simple steps: (i) remove the component of a vector along the uniform vector, (ii) normalize the remaining vector, and (iii) scale the resultant vector by √d, where d is the dimensionality of the representation space. Finally, we compare the hidden representations of LayerNorm-based LLMs with models trained using RMSNorm and show that all LLMs naturally operate orthogonal to the uniform vector both during training and inference, that is, on average they do not have a component along the uniform vector during training or inference. This presents the first mechanistic evidence that removing the component along the uniform vector in LayerNorm is a redundant step. These results advocate for using RMSNorm over LayerNorm, which is also more computationally efficient.
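A small numerical check of the three-step decomposition described above (standard LayerNorm standardization without learned gain and bias; not code from the paper): projecting out the uniform direction, normalizing, and rescaling by √d reproduces mean-and-variance standardization.

```python
import numpy as np

d = 8
x = np.random.randn(d)
u = np.ones(d) / np.sqrt(d)            # unit vector along the uniform (all-ones) direction

# Step (i): remove the component along the uniform vector (equivalent to mean-centering).
r = x - (x @ u) * u
# Steps (ii)-(iii): normalize the remainder and rescale by sqrt(d).
geo = np.sqrt(d) * r / np.linalg.norm(r)

# Standard LayerNorm standardization (population std, no gain/bias).
ln = (x - x.mean()) / x.std()

assert np.allclose(geo, ln)
```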
Continual Pretraining on Encrypted Synthetic Data for Privacy-Preserving LLMs
Honghao Liu | Xuhui Jiang | Chengjin Xu | Cehao Yang | Yiran Cheng | Lionel Ni | Jian Guo
Preserving privacy in sensitive data while pretraining large language models on small, domain-specific corpora presents a significant challenge. In this work, we take an exploratory step toward privacy-preserving continual pretraining by proposing an entity-based framework that synthesizes encrypted training data to protect personally identifiable information (PII). Our approach constructs a weighted entity graph to guide data synthesis and applies deterministic encryption to PII entities, enabling LLMs to encode new knowledge through continual pretraining while granting authorized access to sensitive data through decryption keys. Our results on limited-scale datasets demonstrate that our pretrained models outperform base models and ensure PII security, while exhibiting a modest performance gap compared to models trained on unencrypted synthetic data. We further show that increasing the number of entities and leveraging graph-based synthesis improves model performance, and that encrypted models retain instruction-following capabilities with long retrieved contexts. We discuss the security implications and limitations of deterministic encryption, positioning this work as an initial investigation into the design space of encrypted data pretraining for privacy-preserving LLMs. Our code is available at https://github.com/DataArcTech/SoE.
Do Diacritics Matter? Evaluating the Impact of Arabic Diacritics on Tokenization and LLM Benchmarks
Go Inoue | Bashar Alhafni | Nizar Habash | Timothy Baldwin
Diacritics are orthographic marks added to letters to specify pronunciation, disambiguate lexical meanings, or indicate grammatical distinctions. Diacritics can significantly influence language processing tasks, especially in languages like Arabic, where diacritic usage varies widely across domains and contexts. While diacritics provide valuable linguistic information, their presence can increase subword fragmentation during tokenization, potentially degrading the performance of NLP models. In this paper, we systematically analyze the impact of diacritics on tokenization and benchmark task performance across major Large Language Models (LLMs). Our results demonstrate that while modern LLMs show robustness to the limited diacritics naturally found in texts, full diacritization leads to substantially increased token fragmentation and degraded performance, highlighting the need for careful handling of diacritics in the future development of Arabic LLMs.
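A minimal sketch of how the fragmentation effect described above can be measured: tokenize the same Arabic sentence with and without diacritics and compare subword counts. The tokenizer choice and example sentence are assumptions for illustration; the paper evaluates the tokenizers of major LLMs.

```python
from transformers import AutoTokenizer

# Placeholder tokenizer; any subword tokenizer used by a modern LLM can be swapped in.
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

plain = "ذهب الولد إلى المدرسة"                 # undiacritized: "the boy went to school"
full = "ذَهَبَ الوَلَدُ إِلَى المَدْرَسَةِ"      # fully diacritized version of the same sentence

for text in (plain, full):
    pieces = tok.tokenize(text)
    print(len(pieces), pieces)   # higher counts indicate more subword fragmentation
```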
Pelican Soup Framework: A Theoretical Framework for Language Model Capabilities
Ting-Rui Chiang | Dani Yogatama
In this work, we propose a simple theoretical framework, Pelican Soup, aiming to better understand how pretraining allows LLMs to (1) generalize to unseen instructions and (2) perform in-context learning, even when the verbalizers are irrelevant to the task. To this end, in our framework, we introduce the notions of "knowledge base" and "reference-sense association" and a simple formalism for natural language processing tasks. Our framework demonstrates how studies in linguistics, psychology, and philosophy can inform our understanding of language models and is connected to several other existing theoretical results. As an illustration of the usage of our framework, we derive a bound on in-context learning loss with our framework. Finally, we support our framework with empirical experiments and provide possible future research directions.
I Know, but I Don’t Know! How Persona Conflict Undermines Instruction Adherence in Large Language Models
Seonmin Koo | Jinsung Kim | Heuiseok Lim
Large Language Models (LLMs) are expected to generate appropriate responses while adhering to predefined prior constraints or knowledge, such as user personas, across various dialogue scenarios. However, real-world interactions frequently involve semantic conflicts between such prior information and actual user-provided inputs. Despite this, prior studies on persona-grounded dialogue—one of the representative tasks in personal preference modeling—have predominantly assumed idealized scenarios where persona and user utterances are fully aligned. To bridge this gap, we introduce and formalize the notion of persona conflict, wherein predefined personas contradict the personal information expressed by the user during interaction. We present a systematic verification framework to examine model behavior under such conflict scenarios. In detail, we propose a taxonomy that categorizes model behaviors into three distinct response types (adhering, sycophantic, and wavering) and develop a measurement schema grounded in this taxonomy. Our study provides a comprehensive analysis of the persona conflict phenomenon, identifying diverse key behavioral factors. Extensive experiments and in-depth analysis provide new insights into designing robust dialogue models capable of managing persona inconsistencies.
Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models
Maximilian Kreutner | Marlene Lutz | Markus Strohmaier
Large Language Models (LLMs) display remarkable capabilities to understand or even produce political discourse but have been found to consistently exhibit a progressive left-leaning bias. At the same time, so-called persona or identity prompts have been shown to produce LLM behavior that aligns with socioeconomic groups with which the base model is not aligned. In this work, we analyze whether zero-shot persona prompting with limited information can accurately predict individual voting decisions and, by aggregation, accurately predict the positions of European groups on a diverse set of policies.We evaluate whether predictions are stable in response to counterfactual arguments, different persona prompts, and generation methods. Finally, we find that we can simulate the voting behavior of Members of the European Parliament reasonably well, achieving a weighted F1 score of approximately 0.793. Our persona dataset of politicians in the 2024 European Parliament and our code are available at the following url: https://github.com/dess-mannheim/european_parliament_simulation.
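A toy illustration of the reported aggregate metric (weighted F1 between simulated and actual votes); the vote labels below are illustrative and not drawn from the released dataset.

```python
from sklearn.metrics import f1_score

# Per-member votes: actual roll-call outcomes vs. persona-prompted simulations.
y_true = ["for", "against", "for", "abstain", "for", "against"]
y_pred = ["for", "against", "for", "for", "for", "abstain"]

# average="weighted" weights each class's F1 by its support, as in the paper's headline score.
print(f1_score(y_true, y_pred, average="weighted"))
```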
Exploring Iterative Controllable Summarization with Large Language Models
Sangwon Ryu | Heejin Do | Daehui Kim | Hwanjo Yu | Dongwoo Kim | Yunsu Kim | Gary Lee | Jungseul Ok
Large language models (LLMs) have demonstrated remarkable performance in abstractive summarization tasks. However, their ability to precisely control summary attributes (e.g., length or topic) remains underexplored, limiting their adaptability to specific user preferences. In this paper, we systematically explore the controllability of LLMs. To this end, we revisit summary attribute measurements and introduce iterative evaluation metrics, failure rate and average iteration count, to more precisely evaluate controllability beyond assessment of errors. Our findings show that LLMs struggle more with numerical attributes than with linguistic attributes. To address this challenge, we propose a guide-to-explain framework (GTE) for controllable summarization. GTE enables the model to identify misaligned attributes in the initial draft and guides it to self-explain errors in the previous output. By encouraging reflection on attribute misalignment, GTE generates well-adjusted summaries that satisfy the desired attributes with robust effectiveness while requiring surprisingly few iterations compared to other iterative approaches.
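A minimal sketch of the iterative evaluation loop the two metrics imply, under stated assumptions: `generate(doc, feedback)` is a placeholder LLM call, the length check with a tolerance of two words is illustrative, and the guide-to-explain prompting itself is not reproduced here.

```python
def controlled_summarize(doc, target_len, generate, max_iters=5):
    """Iterate until the length attribute is met; return summary, iterations used, and failure flag."""
    feedback = None
    summary = ""
    for i in range(1, max_iters + 1):
        summary = generate(doc, feedback)
        n_words = len(summary.split())
        if abs(n_words - target_len) <= 2:          # attribute satisfied within tolerance
            return summary, i, False
        feedback = f"The draft has {n_words} words; the target is {target_len} words."
    return summary, max_iters, True                  # failure: attribute never satisfied

# Failure rate = fraction of documents with failure flag True;
# average iteration count = mean iterations over documents that eventually satisfy the attribute.
```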
The Price of Thought: A Multilingual Analysis of Reasoning, Performance, and Cost of Negotiation in Large Language Models
Sherzod Hakimov | Roland Bernard | Tim Leiber | Karl Osswald | Kristina Richert | Ruilin Yang | Raffaella Bernardi | David Schlangen
Negotiation is a fundamental challenge for AI agents, as it requires an ability to reason strategically, model opponents, and balance cooperation with competition. We present the first comprehensive study that systematically evaluates how explicit reasoning training affects the negotiation abilities of both commercial and open-weight large language models, comparing these models to their vanilla counterparts across three languages. Using a self-play setup across three diverse dialogue games, we analyse trade-offs between performance and cost, the language consistency of reasoning processes, and the nature of strategic adaptation exhibited by models.Our findings show that enabling reasoning—that is, scaling test time compute—significantly improves negotiation outcomes by enhancing collaboration and helping models overcome task complexities, but comes at a substantial computational cost: reasoning improves GPT-5’s performance by 31.4 % while increasing its cost by nearly 400 %. Most critically, we uncover a significant multilingual reasoning distinction: open-weight models consistently switch to English for their internal reasoning steps, even when negotiating in German or Italian (and thus possibly impacting potential explainability gains through the disclosure of reasoning traces), while a leading commercial model maintains language consistency between reasoning and final output.
ART: Adaptive Reasoning Trees for Explainable Claim Verification
Sahil Wadhwa | Himanshu Kumar | Guanqun Yang | Abbaas Alif Mohamed Nishar | Pranab Mohanty | Swapnil Shinde | Yue Wu
Large Language Models (LLMs) are powerful candidates for complex decision-making, leveraging vast encoded knowledge and remarkable zero-shot abilities. However, their adoption in high-stakes environments is hindered by their opacity; their outputs lack faithful explanations and cannot be effectively contested to correct errors, undermining trustworthiness. In this paper, we propose ART (Adaptive Reasoning Trees), a hierarchical method for claim verification. The process begins with a root claim, which branches into supporting and attacking child arguments. An argument’s strength is determined bottom-up via a pairwise tournament of its children, adjudicated by a judge LLM, allowing a final, transparent and contestable verdict to be systematically derived which is missing in methods like Chain-of-Thought (CoT). We empirically validate ART on multiple datasets, analyzing different argument generators and comparison strategies. Our findings show that ART’s structured reasoning outperforms strong baselines, establishing a new benchmark for explainable claim verification which is more reliable and ensures clarity in the overall decision making step.
VortexPIA: Indirect Prompt Injection Attack against LLMs for Efficient Extraction of User Privacy
Yu Cui | Sicheng Pan | Yifei Liu | Haibin Zhang | Cong Zuo
Large language models (LLMs) have been widely deployed in Conversational AIs (CAIs), while exposing privacy and security threats. Recent research shows that LLM-based CAIs can be manipulated to extract private information from human users, posing serious security threats. However, the methods proposed in that study rely on a white-box setting that adversaries can directly modify the system prompt. This condition is unlikely to hold in real-world deployments. The limitation raises a critical question: can unprivileged attackers still induce such privacy risks in practical LLM-integrated applications? To address this question, we propose VortexPIA, a novel indirect prompt injection attack that induces privacy extraction in LLM-integrated applications under black-box settings. By injecting token-efficient data containing false memories, VortexPIA misleads LLMs to actively request private information in batches. Unlike prior methods, VortexPIA allows attackers to flexibly define multiple categories of sensitive data. We evaluate VortexPIA on six LLMs, covering both traditional and reasoning LLMs, across four benchmark datasets. The results show that VortexPIA significantly outperforms baselines and achieves state-of-the-art (SOTA) performance. It also demonstrates efficient privacy requests, reduced token consumption, and enhanced robustness against defense mechanisms. We further validate VortexPIA on multiple realistic open-source LLM-integrated applications, demonstrating its practical effectiveness. Our code is available at https://github.com/cuiyu-ai/VortexPIA.
VisDoT: Enhancing Visual Reasoning through Human-Like Interpretation Grounding and Decomposition of Thought
Eunsoo Lee | Jeongwoo Lee | Minki Hong | Jangho Choi | Jihie Kim
Eunsoo Lee | Jeongwoo Lee | Minki Hong | Jangho Choi | Jihie Kim
Large vision-language models (LVLMs) struggle to reliably detect visual primitives in charts and align them with semantic representations, which severely limits their performance on complex visual reasoning. This lack of perceptual grounding constitutes a major bottleneck for chart-based reasoning. We propose VisDoT, a framework that enhances visual reasoning through human-like interpretation grounding. We formalize four perceptual tasks based on the theory of graphical perception, such as position and length. Building on this foundation, we introduce decomposition-of-thought (DoT) prompting, which sequentially separates questions into visual perception sub-questions and logic sub-questions. Fine-tuning InternVL with VisDoT achieves a +11.2% improvement on ChartQA and surpasses GPT-4o on the more challenging ChartQAPro benchmark. On the newly introduced VisDoTQA benchmark, the model improves by +33.2%. Furthermore, consistent zero-shot gains on diverse open-domain VQA benchmarks confirm the generalizability of the perception-logic separation strategy for visual question answering in general. VisDoT leverages human-like perception to enhance visual grounding, achieving state-of-the-art chart understanding and interpretable visual reasoning.
KNN-SSD: Enabling Dynamic Self-Speculative Decoding via Nearest Neighbor Layer Set Optimization
Mingbo Song | Heming Xia | Jun Zhang | Chak Tou Leong | Qiancheng Xu | Wenjie Li | Sujian Li
Mingbo Song | Heming Xia | Jun Zhang | Chak Tou Leong | Qiancheng Xu | Wenjie Li | Sujian Li
Towards Robust Evaluation of Visual Activity Recognition: Resolving Verb Ambiguity with Sense Clustering
Louie Hong Yao | Nicholas Jarvis | Tianyu Jiang
Louie Hong Yao | Nicholas Jarvis | Tianyu Jiang
Evaluating visual activity recognition systems is challenging due to inherent ambiguities in verb semantics and image interpretation. When describing actions in images, synonymous verbs can refer to the same event (e.g., *brushing* vs. *grooming*), while different perspectives can lead to equally valid but distinct verb choices (e.g., *piloting* vs. *operating*). Standard exact-match evaluation, which relies on a single gold answer, fails to capture these ambiguities, resulting in an incomplete assessment of model performance. To address this, we propose a vision-language clustering framework that constructs **verb sense clusters**, providing a more robust evaluation. Our analysis of the imSitu dataset shows that each image maps to around four sense clusters, with each cluster representing a distinct perspective of the image. We evaluate multiple activity recognition models and compare our cluster-based evaluation with standard evaluation methods. Additionally, our human alignment analysis suggests that the cluster-based evaluation better aligns with human judgments, offering a more nuanced assessment of model performance.
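As a rough illustration of the cluster-based scoring described above, the sketch below assumes each image is annotated with a list of verb sense clusters (sets of equally valid verbs) and counts a prediction as correct if it falls inside any of them; the data structures are hypothetical, not the released annotation format.

```python
# Cluster-based vs. exact-match scoring for verb prediction (illustrative only).
from typing import Dict, List, Set


def cluster_accuracy(predictions: Dict[str, str],
                     gold_clusters: Dict[str, List[Set[str]]]) -> float:
    """A prediction counts as correct if it falls inside any gold sense cluster."""
    correct = sum(
        any(predictions[img] in cluster for cluster in gold_clusters[img])
        for img in predictions
    )
    return correct / len(predictions)


def exact_match_accuracy(predictions: Dict[str, str],
                         gold: Dict[str, str]) -> float:
    """Standard single-gold exact match, shown for contrast."""
    return sum(predictions[i] == gold[i] for i in predictions) / len(predictions)
```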
HiKE: Hierarchical Evaluation Framework for Korean-English Code-Switching Speech Recognition
Gio Paik | Yongbeom Kim | Soungmin Lee | Sangmin Ahn | Chan Woo Kim
Gio Paik | Yongbeom Kim | Soungmin Lee | Sangmin Ahn | Chan Woo Kim
Despite advances in multilingual automatic speech recognition (ASR), code-switching (CS), the mixing of languages within an utterance common in daily speech, remains a severely underexplored challenge. In this paper, we introduce HiKE: the Hierarchical Korean-English code-switching benchmark, the first globally accessible non-synthetic evaluation framework for Korean-English CS, aiming to provide a means for the precise evaluation of multilingual ASR models and to foster research in the field. The proposed framework not only consists of high-quality, natural CS data across various topics, but also provides meticulous loanword labels and a hierarchical CS-level labeling scheme (word, phrase, and sentence) that together enable a systematic evaluation of a model’s ability to handle each distinct level of code-switching. Through evaluations of diverse multilingual ASR models and fine-tuning experiments, this paper demonstrates that although most multilingual ASR models initially exhibit inadequate CS-ASR performance, this capability can be enabled through fine-tuning with synthetic CS data. HiKE is available at https://github.com/ThetaOne-AI/HiKE.
Complexity-aware fine-tuning
Andrey Goncharov | Daniil Vyazhev | Petr Sychev | Edvard Khalafyan | Alexey Zaytsev
Andrey Goncharov | Daniil Vyazhev | Petr Sychev | Edvard Khalafyan | Alexey Zaytsev
General-purpose Large Language Models (LLMs) are frequently fine-tuned through supervised fine-tuning (SFT) to enhance performance in specific domains. Better results can be achieved by distilling the chain-of-thought of a larger model, at the cost of numerous expensive calls and a much greater amount of data. We propose a novel blueprint for efficient fine-tuning that uses reasoning only for complex data identified by entropy. Specifically, across three small open models (≈ 3B), we split the training data into complexity categories by single-token answer entropy (ROC AUC 0.73), fine-tune large language models (LLMs) via SFT and distillation, and show that our pipeline significantly outperforms the standard SFT approach (0.58 vs 0.45 average accuracy) and outperforms the distillation approach (0.58 vs 0.56 average accuracy) while using 81% less data. We publish our code and data to facilitate further research in this direction.
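A minimal sketch of entropy-based complexity routing, assuming each example comes with a probability distribution over its single answer token and that a fixed threshold separates easy from hard items; the key name, threshold, and model interface are illustrative assumptions rather than the paper's actual pipeline.

```python
# Split training examples by single-token answer entropy (illustrative only).
import math
from typing import Dict, List


def answer_entropy(token_probs: Dict[str, float]) -> float:
    """Shannon entropy (in nats) of the model's distribution over the answer token."""
    return -sum(p * math.log(p) for p in token_probs.values() if p > 0)


def split_by_complexity(examples: List[dict], threshold: float = 1.0):
    """Route high-entropy (hard) examples to CoT distillation, the rest to plain SFT."""
    easy, hard = [], []
    for ex in examples:
        dest = hard if answer_entropy(ex["answer_token_probs"]) > threshold else easy
        dest.append(ex)
    return easy, hard
```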
Multilingual Retrieval-Augmented Generation for Knowledge-Intensive Question Answering Task
Leonardo Ranaldi
Leonardo Ranaldi
Retrieval-augmented generation (RAG) has become a cornerstone of contemporary NLP, enhancing large language models (LLMs) by allowing them to access richer factual contexts through in-context retrieval. While effective in monolingual settings, especially in English, its use in multilingual tasks remains unexplored. This paper investigates the effectiveness of RAG across multiple languages by proposing novel approaches for multilingual open-domain question-answering. We evaluate the performance of various multilingual RAG strategies, including question-translation (tRAG), which translates questions into English before retrieval, and Multilingual RAG (MultiRAG), where retrieval occurs directly across multiple languages. Our findings reveal that tRAG, while useful, suffers from limited coverage. In contrast, MultiRAG improves efficiency by enabling multilingual retrieval but introduces inconsistencies due to cross-lingual variations in the retrieved content. To address these issues, we propose Crosslingual RAG (CrossRAG), a method that translates retrieved documents into a common language (e.g., English) before generating the response. Our experiments show that CrossRAG significantly enhances performance on knowledge-intensive tasks, benefiting both high-resource and low-resource languages.
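The translate-then-generate idea could look roughly like the sketch below, where `retrieve`, `translate`, and `generate` are placeholder callables rather than any specific library API; prompt wording and the pivot language are assumptions.

```python
# Translate retrieved evidence into a pivot language before generation (sketch).
from typing import Callable, List


def cross_rag_answer(question: str,
                     retrieve: Callable[[str, int], List[str]],
                     translate: Callable[[str, str], str],
                     generate: Callable[[str], str],
                     pivot: str = "en",
                     k: int = 5) -> str:
    docs = retrieve(question, k)                      # multilingual retrieval
    pivot_docs = [translate(d, pivot) for d in docs]  # unify retrieved evidence
    context = "\n\n".join(pivot_docs)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)
```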
SpatialMath: Spatial Comprehension-Infused Symbolic Reasoning for Mathematical Problem-Solving
Ashutosh Bajpai | Akshat Bhandari | Akshay Nambi | Tanmoy Chakraborty
Ashutosh Bajpai | Akshat Bhandari | Akshay Nambi | Tanmoy Chakraborty
Multimodal Small-to-Medium sized Language Models (MSLMs) have demonstrated strong capabilities in integrating visual and textual information but still face significant limitations in visual comprehension and mathematical reasoning, particularly in geometric problems with diverse levels of visual infusion. Current models struggle to accurately decompose intricate visual inputs and connect perception with structured reasoning, leading to suboptimal performance. To address these challenges, we propose SpatialMath, a novel Spatial Comprehension-Infused Symbolic Reasoning Framework designed to integrate spatial representations into structured symbolic reasoning chains. SpatialMath employs a specialized perception module to extract spatially-grounded representations from visual diagrams, capturing critical geometric structures and spatial relationships. These representations are then methodically infused into symbolic reasoning chains, facilitating visual comprehension-aware structured reasoning. To this end, we introduce MATHVERSE-PLUS, a novel dataset containing structured visual interpretations and step-by-step reasoning paths for vision-intensive mathematical problems. SpatialMath significantly outperforms strong multimodal baselines, achieving up to 10 percentage points improvement over supervised fine-tuning with data augmentation in vision-intensive settings. Robustness analysis reveals that enhanced spatial representations directly improve reasoning accuracy, reinforcing the need for structured perception-to-reasoning pipelines in MSLMs.
Skill Discovery for Software Scripting Automation via Offline Simulations with LLMs
Paiheng Xu | Gang Wu | Xiang Chen | Tong Yu | Chang Xiao | Franck Dernoncourt | Tianyi Zhou | Wei Ai | Viswanathan Swaminathan
Paiheng Xu | Gang Wu | Xiang Chen | Tong Yu | Chang Xiao | Franck Dernoncourt | Tianyi Zhou | Wei Ai | Viswanathan Swaminathan
Scripting interfaces enable users to automate tasks and customize software workflows, but creating scripts traditionally requires programming expertise and familiarity with specific APIs, posing barriers for many users. While Large Language Models (LLMs) can generate code from natural language queries, runtime code generation is severely limited due to unverified code, security risks, longer response times, and higher computational costs. To bridge the gap, we propose an offline simulation framework to curate a software-specific skillset—a collection of verified scripts—by exploiting LLMs and publicly available scripting guides. Our framework comprises two components: (1) task creation, using top-down functionality guidance and bottom-up API synergy exploration to generate helpful tasks; and (2) skill generation with trials, refining and validating scripts based on execution feedback. To efficiently navigate the extensive API landscape, we introduce a Graph Neural Network (GNN)-based link prediction model to capture API synergy, enabling the generation of skills involving underutilized APIs and expanding the skillset’s diversity. Experiments with Adobe Illustrator demonstrate that our framework significantly improves automation success rates, reduces response time, and saves runtime token costs compared to traditional runtime code generation. This is the first attempt to use software scripting interfaces as a testbed for LLM-based systems, highlighting the advantages of leveraging execution feedback in a controlled environment and offering valuable insights into aligning AI capabilities with user needs in specialized software domains.
How Important is ‘Perfect’ English for Machine Translation Prompts?
Patrícia Schmidtová | Niyati Bafna | Seth Aycock | Gianluca Vico | Wiktor Kamzela | Kathy Hämmerl | Vilém Zouhar
Patrícia Schmidtová | Niyati Bafna | Seth Aycock | Gianluca Vico | Wiktor Kamzela | Kathy Hämmerl | Vilém Zouhar
Large language models (LLMs) show state-of-the-art performance in machine translation, but are also known to be sensitive to errors in user prompts. Given these models are largely trained on and respond best to prompts in standard English, this may affect the quality of LLM outputs for second language English speakers as well as real-world lay users, with potentially disproportionate effects on the former. We explore this effect by modeling a range of error types exhibited by such users, motivated by studies of L2 English, and quantifying their impact on LLM performance. We work with two related tasks: machine translation and machine translation evaluation. We find that LLMs-as-MT are brittle to natural spelling errors but not to errors at the phrasal level. However, the variance in quality caused by these errors is lower than the variance over the initial prompt choice, suggesting that “perfect English” for a given prompt is less important than choosing a good prompt. Since lay users and L2 speakers may use non-optimal prompts as well as display imperfect language skills, our work calls for increasing the resilience of model performance to both these phenomena to best serve a diverse user base, both from a robustness and fairness perspective.
KETCHUP: K-Step Return Estimation for Sequential Knowledge Distillation
Jiabin Fan | Guoqing Luo | Michael Bowling | Lili Mou
Jiabin Fan | Guoqing Luo | Michael Bowling | Lili Mou
We propose a novel K-step return estimation method (called KETCHUP) for Reinforcement Learning (RL)-based knowledge distillation (KD) in text generation tasks. Our idea is to induce a K-step return by using the Bellman Optimality Equation for multiple steps. Theoretical analysis shows that this K-step formulation reduces the variance of the gradient estimates, thus leading to improved RL optimization, especially when the student model size is large. Empirical evaluation on three text generation tasks demonstrates that our approach yields superior performance in both standard task metrics and large language model (LLM)-based evaluation. These results suggest that our K-step return induction offers a promising direction for enhancing RL-based KD in LLM research.
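For context, the standard K-step return obtained by unrolling the Bellman optimality equation K times has the form below; the paper's exact reward definition and discounting for text generation may differ.

```latex
% Standard K-step return from unrolling the Bellman optimality equation.
G_t^{(K)} \;=\; \sum_{i=0}^{K-1} \gamma^{\,i}\, r_{t+i}
\;+\; \gamma^{K} \max_{a} Q^{*}\!\left(s_{t+K},\, a\right)
```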
Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering
Haiyan Zhao | Xuansheng Wu | Fan Yang | Bo Shen | Ninghao Liu | Mengnan Du
Haiyan Zhao | Xuansheng Wu | Fan Yang | Bo Shen | Ninghao Liu | Mengnan Du
Linear concept vectors effectively steer LLMs, but existing methods suffer from noisy features in diverse datasets that undermine steering robustness. We propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which selectively keeps the most discriminative SAE latents while reconstructing hidden representations. Our key insight is that concept-relevant signals can be explicitly separated from dataset noise by scaling up activations of top-k latents that best differentiate positive and negative samples. Applied to linear probing and difference-in-mean, SDCV consistently improves steering success rates by 4-16% across six challenging concepts, while maintaining topic relevance.
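A minimal numpy sketch of the general denoising idea, assuming access to an SAE encoder/decoder pair and contrastive sets of hidden states; the top-k selection rule and the way the result is decoded back are illustrative assumptions, not SDCV's exact procedure.

```python
# Keep only the SAE latents that best separate positive from negative samples,
# then decode the masked latent difference back into a steering direction.
import numpy as np


def denoised_concept_vector(h_pos: np.ndarray, h_neg: np.ndarray,
                            W_enc: np.ndarray, W_dec: np.ndarray,
                            k: int = 32) -> np.ndarray:
    """h_pos, h_neg: (n, d) hidden states; W_enc: (d, m); W_dec: (m, d)."""
    z_pos = np.maximum(h_pos @ W_enc, 0.0)        # SAE latent activations (ReLU)
    z_neg = np.maximum(h_neg @ W_enc, 0.0)
    gap = np.abs(z_pos.mean(axis=0) - z_neg.mean(axis=0))
    keep = np.argsort(gap)[-k:]                   # most discriminative latents
    mask = np.zeros(W_enc.shape[1])
    mask[keep] = 1.0
    z_diff = (z_pos.mean(axis=0) - z_neg.mean(axis=0)) * mask
    return z_diff @ W_dec                         # denoised concept direction
```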
Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs
Zara Siddique | Irtaza Khalid | Liam Turner | Luis Espinosa-Anke
Zara Siddique | Irtaza Khalid | Liam Turner | Luis Espinosa-Anke
We present a novel approach to bias mitigation in large language models (LLMs) by applying steering vectors to modify model activations in forward passes. We compute 8 steering vectors, each corresponding to a different social bias axis, such as age, gender, or race, on a training subset of the BBQ dataset and compare the effectiveness of these to 3 additional bias mitigation methods across 4 datasets. When optimized on the BBQ dataset, our individually tuned steering vectors achieve average improvements of 12.8% on BBQ, 8.3% on CLEAR-Bias, and 1% on StereoSet, and show improvements over prompting and Self-Debias in all cases, and improvements over fine-tuning in 12 out of 17 evaluations. In addition, steering vectors showed the lowest impact on MMLU scores of the four bias mitigation methods tested. The work presents the first systematic investigation of steering vectors for bias mitigation, and we demonstrate that they are a powerful and computationally efficient strategy for reducing bias in LLMs, with broader implications for enhancing AI safety.
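A difference-of-means steering vector of the kind described above can be sketched as follows; the normalization, scaling coefficient, and injection point are assumptions, and real implementations typically apply the shift via forward hooks on a chosen layer.

```python
# Compute a steering direction from contrastive activation sets and apply it.
import numpy as np


def compute_steering_vector(h_biased: np.ndarray, h_unbiased: np.ndarray) -> np.ndarray:
    """Mean activation difference between two contrastive prompt sets, each (n, d)."""
    v = h_unbiased.mean(axis=0) - h_biased.mean(axis=0)
    return v / np.linalg.norm(v)


def steer(hidden: np.ndarray, v: np.ndarray, alpha: float = 4.0) -> np.ndarray:
    """Shift every token's hidden state along the steering direction."""
    return hidden + alpha * v
```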
MAPS: A Multilingual Benchmark for Agent Performance and Security
Omer Hofman | Jonathan Brokman | Oren Rachmil | Shamik Bose | Vikas Pahuja | Toshiya Shimizu | Trisha Starostina | Kelly Marchisio | Seraphina Goldfarb-Tarrant | Roman Vainshtein
Omer Hofman | Jonathan Brokman | Oren Rachmil | Shamik Bose | Vikas Pahuja | Toshiya Shimizu | Trisha Starostina | Kelly Marchisio | Seraphina Goldfarb-Tarrant | Roman Vainshtein
Agentic AI systems, which build on Large Language Models (LLMs) and interact with tools and memory, have rapidly advanced in capability and scope. Yet, since LLMs have been shown to struggle in multilingual settings, typically resulting in lower performance and reduced safety, agentic systems risk inheriting these limitations. This raises concerns about the accessibility of such systems, as users interacting in languages other than English may encounter unreliable or security-critical agent behavior. Despite growing interest in evaluating agentic AI and recent initial efforts toward multilingual interaction, existing benchmarks do not yet provide a comprehensive, multi-domain, security-aware evaluation of multilingual agentic systems. To address this gap, we propose MAPS, a multilingual benchmark suite designed to evaluate agentic AI systems across diverse languages and tasks. MAPS builds on four widely used agentic benchmarks — GAIA (real-world tasks), SWE-Bench (code generation), MATH (mathematical reasoning), and the Agent Security Benchmark (security). We translate each dataset into eleven diverse languages, resulting in 805 unique tasks and 9,660 total language-specific instances, enabling a systematic analysis of the Multilingual Effect on AI agents’ performance and robustness. Empirically, we observe a degradation in both performance and security when transitioning from English to other languages, with severity varying by task and correlating with the amount of translated input. This work establishes the first standardized evaluation framework for multilingual agentic AI, encouraging future research towards equitable, reliable, and accessible agentic AI. https://huggingface.co/datasets/Fujitsu-FRE/MAPS
Linking Knowledge to Care: Knowledge Graph-Augmented Medical Follow-Up Question Generation
Liwen Sun | Xiang Yu | Ming Tan | Zhuohao Chen | Anqi Cheng | Ashutosh Joshi | Chenyan Xiong
Liwen Sun | Xiang Yu | Ming Tan | Zhuohao Chen | Anqi Cheng | Ashutosh Joshi | Chenyan Xiong
Clinical diagnosis is time-consuming, requiring intensive interactions between patients and medical professionals. While large language models (LLMs) could ease the pre-diagnostic workload, their limited domain knowledge hinders effective medical question generation. We introduce a Knowledge Graph-augmented LLM with active in-context learning to generate relevant and important follow-up questions, KG-Followup, serving as a critical module for the pre-diagnostic assessment. The structured medical domain knowledge graph serves as a seamless patch-up to provide professional domain expertise upon which the LLM can reason. Experiments demonstrate that KG-Followup outperforms state-of-the-art methods by 5% - 8% on relevant benchmarks.
DebateQA: Evaluating Question Answering on Debatable Knowledge
Rongwu Xu | Xuan Qi | Zehan Qi | Wei Xu | Zhijiang Guo
Rongwu Xu | Xuan Qi | Zehan Qi | Wei Xu | Zhijiang Guo
The rise of large language models (LLMs) has enabled us to seek answers to inherently debatable questions on LLM chatbots, necessitating a reliable way to evaluate their ability. However, traditional QA benchmarks, which assume fixed answers, are inadequate for this purpose. To address this, we introduce DebateQA, a dataset of 2,941 debatable questions, each accompanied by multiple human-annotated partial answers that capture a variety of perspectives. We develop two metrics: Perspective Diversity, which evaluates the comprehensiveness of perspectives, and Dispute Awareness, which assesses whether the LLM acknowledges the question’s debatable nature. Experiments demonstrate that both metrics are aligned with human preferences and stable across different underlying models. Using DebateQA with the two metrics, we assess 12 prevalent LLMs and retrieval-augmented generation methods. Our findings reveal that while LLMs generally excel at recognizing debatable issues, their ability to provide comprehensive answers encompassing diverse perspectives varies considerably.
Modern language models (LMs) are trained on large scrapes of the Web, containing millions of personal information (PI) instances, many of which LMs memorize, increasing privacy risks. In this work, we develop the regexes and rules (R&R) detector suite to detect email addresses, phone numbers, and IP addresses, which outperforms the best regex-based PI detectors. On a manually curated set of 483 instances of PI, we measure memorization, finding that 13.6% are parroted verbatim by the Pythia-6.9b model, i.e., when the model is prompted with the tokens that precede the PI in the original document, greedy decoding generates the entire PI span exactly. We expand this analysis to study models of varying sizes (160M-6.9B) and pretraining time steps (70k-143k iterations) in the Pythia model suite and find that both model size and amount of pretraining are positively correlated with memorization. Even the smallest model, Pythia-160m, parrots 2.7% of the instances exactly. Consequently, we strongly recommend that pretraining datasets be aggressively filtered and anonymized to minimize PI parroting.
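The verbatim-parroting test described above can be sketched as below, assuming a Hugging Face transformers-style `model`/`tokenizer` pair; model loading, batching, and the exact span-matching rule are omitted or simplified.

```python
# Check whether greedy decoding reproduces a PI span given its preceding context.
def is_parroted(model, tokenizer, prefix: str, pi_span: str) -> bool:
    inputs = tokenizer(prefix, return_tensors="pt")
    target_len = len(tokenizer(pi_span, add_special_tokens=False).input_ids)
    out = model.generate(**inputs, max_new_tokens=target_len, do_sample=False)
    continuation = tokenizer.decode(
        out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
    )
    return continuation.strip().startswith(pi_span.strip())
```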
Harmful Factuality: LLMs Correcting What They Shouldn’t
Mingchen Li | Hanzhi Zhang | Heng Fan | Junhua Ding | Yunhe Feng
Mingchen Li | Hanzhi Zhang | Heng Fan | Junhua Ding | Yunhe Feng
While Large Language Models (LLMs) are trained for factual accuracy, this objective can directly conflict with the critical demand for source fidelity. This paper isolates and formalizes this conflict as Harmful Factuality Hallucination (HFH): a previously overlooked failure mode where an LLM’s attempt to “correct” perceived source errors results in an output that is factually true but unfaithful to the input. Unlike traditional hallucination research focused on models generating falsehoods, we investigate the harm of misplaced correctness. We introduce a reproducible framework to elicit and measure HFH using controlled entity-level perturbations (both soft, embedding-based and hard, instruction-based) paired with strategic entity selection. Across summarization, rephrasing, and QA tasks, our evaluation of diverse LLMs reveals that HFH is a prevalent behavior that worsens with model scale. We identify three underlying mechanisms and demonstrate that a simple instructional prompt can reduce HFH rates by approximately 50%. Our framework turns the abstract factuality–faithfulness tension into a measurable, actionable target for building more reliable LLM systems. Our code is publicly available at https://github.com/ResponsibleAILab/Harmful-Factuality-Hallucination.
Toward Beginner-Friendly LLMs for Language Learning: Controlling Difficulty in Conversation
Meiqing Jin | Liam Dugan | Chris Callison-Burch
Meiqing Jin | Liam Dugan | Chris Callison-Burch
Practicing conversations with large language models (LLMs) presents a promising alternative to traditional in-person language learning. However, most LLMs generate text at a near-native level of complexity, making them ill-suited for beginner learners (CEFR: A1–A2). In this paper, we investigate whether controllable generation techniques can adapt LLM outputs to better support absolute beginners. We evaluate these methods through both automatic metrics and a user study with university-level learners of Japanese. Our findings show that while prompting alone fails, controllable generation techniques can successfully improve output comprehensibility for beginner speakers (from 39.4% to 83.3%). We further introduce a new token-level evaluation metric, Token Miss Rate (TMR), that quantifies the proportion of incomprehensible tokens per utterance and correlates strongly with human judgments. To support future research in AI-assisted language learning, we release our code, models, annotation tools, and dataset.
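A Token Miss Rate style quantity could be computed as in the sketch below, assuming a tokenized utterance and a set of vocabulary items the learner is expected to know; the paper's tokenizer, vocabulary source, and exact definition may differ.

```python
# Fraction of tokens in an utterance that fall outside a beginner vocabulary.
from typing import Iterable, Set


def token_miss_rate(tokens: Iterable[str], known_vocab: Set[str]) -> float:
    tokens = list(tokens)
    if not tokens:
        return 0.0
    missed = sum(t not in known_vocab for t in tokens)
    return missed / len(tokens)


# Example with a hypothetical beginner vocabulary covering two of four tokens.
print(token_miss_rate(["猫", "が", "存在", "する"], {"猫", "が"}))  # 0.5
```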
CodeGuard: Improving LLM Guardrails in CS Education
Nishat Raihan | Noah Erdachew | Fnu Jayoti Devi | Joanna C. S. Santos | Marcos Zampieri
Nishat Raihan | Noah Erdachew | Fnu Jayoti Devi | Joanna C. S. Santos | Marcos Zampieri
Large language models (LLMs) are increasingly embedded in Computer Science (CS) classrooms to automate code generation, feedback, and assessment. However, their susceptibility to adversarial or ill-intentioned prompts threatens student learning and academic integrity. To cope with this important issue, we evaluate existing off-the-shelf LLMs in handling unsafe and irrelevant prompts within the domain of CS education. We identify important shortcomings in existing LLM guardrails which motivates us to propose CodeGuard, a comprehensive guardrail framework for educational AI systems. CodeGuard includes (i) a first-of-its-kind taxonomy for classifying prompts; (ii) the CodeGuard dataset, a collection of 8,000 prompts spanning the taxonomy; and (iii) PromptShield, a lightweight sentence-encoder model fine-tuned to detect unsafe prompts in real time. Experiments show that PromptShield achieves 0.93 F1 score, surpassing existing guardrail methods. Additionally, further experimentation reveals that CodeGuard reduces potentially harmful or policy-violating code completions by 30-65% without degrading performance on legitimate educational tasks. The code, datasets, and evaluation scripts are made freely available to the community.
ATOM: AdapTive and OptiMized dynamic temporal knowledge graph construction using LLMs
Yassir Lairgi | Ludovic Moncla | Khalid Benabdeslem | Rémy Cazabet | Pierre Cléau
Yassir Lairgi | Ludovic Moncla | Khalid Benabdeslem | Rémy Cazabet | Pierre Cléau
In today’s rapidly expanding data landscape, knowledge extraction from unstructured text is vital for real-time analytics, temporal inference, and dynamic memory frameworks. However, traditional static knowledge graph (KG) construction often overlooks the dynamic and time-sensitive nature of real-world data, limiting adaptability to continuous changes. Moreover, recent zero- or few-shot approaches that avoid domain-specific fine-tuning or reliance on prebuilt ontologies often suffer from instability across multiple runs, as well as incomplete coverage of key facts. To address these challenges, we introduce ATOM (AdapTive and OptiMized), a few-shot and scalable approach that builds and continuously updates Temporal Knowledge Graphs (TKGs) from unstructured texts. ATOM splits input documents into minimal, self-contained “atomic” facts, improving extraction exhaustivity and stability. Then, it constructs atomic TKGs from these facts, employing a dual-time modeling that distinguishes between when information is observed and when it is valid. The resulting atomic TKGs are subsequently merged in parallel. Empirical evaluations demonstrate that ATOM achieves 18% higher exhaustivity, 33% better stability, and over 90% latency reduction compared to baseline methods, demonstrating a strong scalability potential for dynamic TKG construction.
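An atomic fact with dual-time modeling might be represented roughly as below; the field names and types are illustrative assumptions rather than ATOM's actual schema.

```python
# An "atomic" temporal fact distinguishing observation time from validity interval.
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class AtomicFact:
    subject: str
    relation: str
    obj: str
    observed_at: date                    # when the source text reported the fact
    valid_from: Optional[date] = None    # when the fact starts to hold
    valid_to: Optional[date] = None      # when it stops holding (None = still open)


fact = AtomicFact("ACME", "acquired", "Widgets Inc",
                  observed_at=date(2024, 3, 1), valid_from=date(2024, 2, 15))
```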
On the Interplay between Human Label Variation and Model Fairness
Kemal Kurniawan | Meladel Mistica | Timothy Baldwin | Jey Han Lau
Kemal Kurniawan | Meladel Mistica | Timothy Baldwin | Jey Han Lau
The impact of human label variation (HLV) on model fairness is an unexplored topic. This paper examines the interplay by comparing training on majority-vote labels with a range of HLV methods. Our experiments show that without explicit debiasing, HLV training methods have a positive impact on fairness under certain configurations.
Where do LLMs currently stand on biomedical NER in both clean and noisy settings?
Christophe Ye | Cassie S. Mitchell
Christophe Ye | Cassie S. Mitchell
Biomedical Named Entity Recognition (NER) consists of identifying and classifying important biomedical entities mentioned in text. Traditionally, biomedical NER has heavily relied on domain-specific pre-trained language models, particularly variants of BERT. With the emergence of large language models (LLMs), some studies have evaluated their performance on biomedical NLP tasks. These studies consistently show that, despite their general capabilities, LLMs still fall short compared to specialized BERT-based models for biomedical NER. However, as LLMs continue to advance at a remarkable pace, natural questions arise: Are they still far behind, or are they starting to be competitive? In this study, we investigate the performance of recent LLMs across multiple biomedical NER datasets under both clean and noisy dataset conditions. Our findings reveal that LLMs are progressively closing the performance gap with BERT-based models and demonstrate particular strengths in low-data settings. Moreover, our results suggest that in-context learning with LLMs exhibits a notable degree of robustness to noise, making them a promising alternative in settings where labeled data is scarce or noisy.
Scaling Data-Constrained Language Models with Synthetic Data
Hirokazu Kiyomaru | Yusuke Oda | Takashi Kodama | Chaoran Liu | Daisuke Kawahara
Hirokazu Kiyomaru | Yusuke Oda | Takashi Kodama | Chaoran Liu | Daisuke Kawahara
Large language models (LLMs) improve with more training data, but practical limits on data collection increasingly constrain further scaling. Advances in instruction-following LLMs have enabled controlled, high-quality text generation, making synthetic data a promising remedy. However, its effectiveness for pre-training non-English LLMs remains underexplored. We study this question for Japanese in a fixed token budget setting in which organic Japanese Web text constitutes only a small share, while far more organic English Web text and instruction-following LLMs capable of generating fluent Japanese are available. We compare three strategies to fill the data shortfall: generating synthetic Japanese text, repeating the limited Japanese Web text, and using English Web text. Experiments show that synthetic Japanese corpora outperform both baselines and approach the performance achieved when the entire token budget is filled with additional organic Japanese Web text.
The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs
Omar Mahmoud | Ali Khalil | Thommen George Karimpanal | Buddhika Laknath Semage | Santu Rana
Omar Mahmoud | Ali Khalil | Thommen George Karimpanal | Buddhika Laknath Semage | Santu Rana
Hallucination in large language models (LLMs) has been widely studied in recent years, with progress in both detection and mitigation aimed at improving truthfulness. Yet, a critical side effect remains largely overlooked: enhancing truthfulness can negatively impact safety alignment. In this paper, we investigate this trade-off and show that increasing factual accuracy often comes at the cost of weakened refusal behavior. Our analysis reveals that this arises from overlapping components in the model that simultaneously encode hallucination and refusal information, leading alignment methods to suppress factual knowledge unintentionally. We further examine how fine-tuning on benign datasets, even when curated for safety, can degrade alignment for the same reason. To address this, we propose a method that disentangles refusal-related features from hallucination features using sparse autoencoders, and preserves refusal behavior during fine-tuning through subspace orthogonalization. This approach prevents hallucinations from increasing while maintaining safety alignment. We evaluate our method on commonsense reasoning tasks and harmful benchmarks (AdvBench and StrongReject). Results demonstrate that our approach preserves refusal behavior and task utility, mitigating the trade-off between truthfulness and safety.
The Model’s Language Matters: A Comparative Privacy Analysis of LLMs
Abhishek Kumar Mishra | Antoine Boutet | Lucas Magnana
Abhishek Kumar Mishra | Antoine Boutet | Lucas Magnana
Large Language Models (LLMs) are increasingly deployed in multilingual settings that process sensitive data, yet their scale and linguistic variability can amplify privacy risks. While prior privacy evaluations focus predominantly on English, we investigate how language structure shapes privacy leakage in LLMs trained on English, Spanish, French, and Italian medical corpora. We quantify six corpus-level linguistic indicators and evaluate vulnerability under three attack families: extraction, counterfactual memorization, and membership inference. Across languages, we find that leakage systematically tracks structural properties: Italian exhibits the strongest exposure, consistent with its highest redundancy and longer lexical units, whereas English shows the clearest membership separability, aligning with its higher syntactic entropy and stronger surface-identifiable cues. In contrast, French and Spanish remain comparatively more resilient overall, aided by higher morphological complexity. These results provide quantitative evidence that language matters for privacy leakage, motivating language-aware and structure-adaptive privacy-preserving mechanisms for multilingual LLM deployments.
Towards the First NLP Benchmark for Ladin - an Extremely Low-Resource Language
Ulin Nuha | Adam Jatowt
Ulin Nuha | Adam Jatowt
The performance of large language models (LLMs) tends to degrade for extremely low-resource languages, primarily due to the lack of labeled training data. Despite growing interest, the availability of high-quality natural language processing (NLP) datasets for these languages remains limited. This paper addresses this gap by focusing on Ladin, an endangered Romance language, specifically the Val Badia variant. Leveraging a small set of parallel Ladin–Italian sentence pairs, we create synthetic datasets for sentiment analysis and question answering by translating monolingual Italian data. To ensure linguistic quality, we apply rigorous filtering and back-translation procedures in our method. We further demonstrate that incorporating these synthetic datasets into machine translation training leads to substantial improvements over existing Italian–Ladin translation baselines. Our contributions include sentiment analysis and question answering datasets for Ladin, establishing foundational resources that support broader NLP research and downstream applications for underrepresented languages.
DRAGON: Domain-specific Robust Automatic Data Generation for RAG Optimization
Haiyang Shen | Hang Yan | Zhongshi Xing | Mugeng Liu | Yue Li | Zhiyang Chen | Yuxiang Wang | Jiuzheng Wang | Yun Ma
Haiyang Shen | Hang Yan | Zhongshi Xing | Mugeng Liu | Yue Li | Zhiyang Chen | Yuxiang Wang | Jiuzheng Wang | Yun Ma
Retrieval-augmented generation (RAG) can substantially enhance the performance of LLMs on knowledge-intensive tasks. Various RAG paradigms—including vanilla, planning-based, and iterative RAG—all depend on a robust retriever, yet existing retrievers rely heavily on public knowledge and often falter when faced with domain-specific queries. To address these limitations, we introduce DRAGON, a framework that combines a data-construction modeling approach with a scalable synthetic data-generation pipeline, specifically designed to optimize domain-specific retrieval performance and bolster retriever robustness. To evaluate RAG performance in domain-specific settings, we propose DRAGONBench, a benchmark spanning 8 domain-specific document collections across 4 distinct fields and featuring a wide spectrum of query complexities, answerability, and hops. Leveraging DRAGON, we generate a large-scale synthetic dataset—encompassing both single-hop and multi-hop queries—to enrich retriever training. Extensive experiments demonstrate that retrievers trained on this data yield significant performance gains and exhibit strong cross-domain generalization. Moreover, when our optimized retrievers are integrated into vanilla, planning-based, and iterative RAG paradigms, we observe consistent end-to-end improvements in system accuracy.
Activation steering or editing hidden states to control language-model behavior can be framed as a causal mediation problem: inputs induce internal activations, a subset of which act as mediators transmitting targeted behaviors to outputs. We formalize a structural graph over transformer layers and derive front-door-style identification conditions that justify steering through mediating subspaces while preserving non-mediating features, thereby reducing confounding and off-target effects. Within this mediation-first view, we present CAS-BiPO, a sparse mediation steering approach that learns targeted behavioral interventions via regularized training. Empirically, our method achieves 97-100% of dense baseline effectiveness across four behavioral control tasks while using only 10-30% of activation dimensions. Learned masks concentrate 94.3% of steering effects in 26.7% of dimensions, with neurons exhibiting 2.2× higher activation changes, validating the sparse mediation hypothesis. Our causal framework provides theoretical grounding while CAS-BiPO demonstrates that end-to-end learning of interpretable, reliable interventions is both feasible and advantageous.
Causal Direct Preference Optimization for Language Model Alignment
Uyen Le | Thin Nguyen | Toan Nguyen | Toan Doan | Trung Le | Bac Le
Uyen Le | Thin Nguyen | Toan Nguyen | Toan Doan | Trung Le | Bac Le
Direct Preference Optimization (DPO) is a powerful approach for aligning large language models (LLMs) with human preferences by formulating preference learning as a supervised classification problem over pairwise human-labeled outputs, thereby enabling stable and efficient training. We show that DPO inherits bias from confounders (e.g., topic, style, user objectives) that shape data generation and carry through to training, hindering recovery of true human preferences. We address this from a causal perspective, proposing Causal Direct Preference Optimization (CDPO), a general framework that incorporates causal inference principles to mitigate the influence of confounders and sharpen the signal of genuine human preferences. Our approach preserves the tractability of direct optimization while enhancing robustness to spurious correlations and annotation biases. Empirical evaluations on benchmark datasets show that CDPO surpasses DPO-based baselines by achieving unbiased fine-tuning through causal reasoning, confirming the effectiveness of confounder-aware preference optimization.
LLMs Faithfully and Iteratively Compute Answers During CoT: A Systematic Analysis With Multi-step Arithmetics
Keito Kudo | Yoichi Aoki | Tatsuki Kuribayashi | Shusaku Sone | Masaya Taniguchi | Ana Brassard | Keisuke Sakaguchi | Kentaro Inui
Keito Kudo | Yoichi Aoki | Tatsuki Kuribayashi | Shusaku Sone | Masaya Taniguchi | Ana Brassard | Keisuke Sakaguchi | Kentaro Inui
This study investigates the internal information flow of large language models (LLMs) while performing chain-of-thought (CoT) style reasoning. Specifically, with a particular interest in the faithfulness of the CoT explanation to the LLMs’ final answer, we explore (i) when the LLMs’ answer is (pre)determined, in particular whether this happens before the CoT begins or only after, and (ii) how strongly the information from the CoT has a causal effect on the final answer. Our experiments with controlled arithmetic tasks reveal a systematic internal reasoning mechanism of LLMs. They have not derived an answer at the moment the input is fed into the model. Instead, they compute (sub-)answers on the fly while generating the reasoning chain. Therefore, the generated reasoning chains can be regarded as faithful reflections of the model’s internal computation.
VaseVQA: Multimodal Agent and Benchmark for Ancient Greek Pottery
Jinchao Ge | Tengfei Cheng | Biao Wu | Zeyu Zhang | Shiya Huang | Judith Bishop | Gillian Shepherd | Meng Fang | Ling Chen | Yang Zhao
Jinchao Ge | Tengfei Cheng | Biao Wu | Zeyu Zhang | Shiya Huang | Judith Bishop | Gillian Shepherd | Meng Fang | Ling Chen | Yang Zhao
Understanding cultural heritage artifacts such as ancient Greek pottery requires expert-level reasoning that remains challenging for current MLLMs due to limited domain-specific data. We introduce VaseVQA, a benchmark for ancient Greek pottery, primarily vases, consisting of 31,773 images and 67,614 question–answer pairs across seven expert-defined categories, enabling systematic evaluation of expert-level cultural heritage understanding. Using this dataset, we explore effective training strategies for domain-specific reasoning. While supervised fine-tuning improves adaptation to domain knowledge, it struggles with deeper reasoning tasks. We propose VaseVL, which augments SFT with reinforcement learning using verifiable rewards. Experiments show that VaseVL consistently outperforms supervised baselines, especially on reasoning-intensive questions, highlighting the value of targeted reinforcement learning for cultural heritage visual question answering.
PromptPrism: A Linguistically-Inspired Taxonomy for Prompts
Sullam Jeoung | Yueyan Chen | Yi Zhang | Shuai Wang | Haibo Ding | Lin Lee Cheong
Sullam Jeoung | Yueyan Chen | Yi Zhang | Shuai Wang | Haibo Ding | Lin Lee Cheong
Prompts are the interface for eliciting the capabilities of large language models (LLMs). Understanding their structure and components is critical for analyzing LLM behavior and optimizing performance. However, the field lacks a comprehensive framework for systematic prompt analysis and understanding. We introduce PromptPrism, a linguistically-inspired taxonomy that enables prompt analysis across three hierarchical levels: functional structure, semantic component, and syntactic pattern. By applying linguistic concepts to prompt analysis, PromptPrism bridges traditional language understanding and modern LLM research, offering insights that purely empirical approaches might miss. We show the practical utility of PromptPrism by applying it to three applications: (1) a taxonomy-guided prompt refinement approach that automatically improves prompt quality and enhances model performance across a range of tasks; (2) a multi-dimensional dataset profiling method that extracts and aggregates structural, semantic, and syntactic characteristics from prompt datasets, enabling comprehensive analysis of prompt distributions and patterns; (3) a controlled experimental framework for prompt sensitivity analysis by quantifying the impact of semantic reordering and delimiter modifications on LLM performance. Our experimental results validate the effectiveness of our taxonomy across these applications, demonstrating that PromptPrism provides a foundation for refining, profiling, and analyzing prompts.
HiGraAgent: Dual-Agent Adaptive Reasoning over Hierarchical Knowledge Graph for Open Domain Multi-hop Question Answering
Hung Luu | Long S. T. Nguyen | Trung Pham | Hieu Pham | Tho Quan
Hung Luu | Long S. T. Nguyen | Trung Pham | Hieu Pham | Tho Quan
Open Domain Multi-hop Question Answering faces a dual compositionality challenge: reasoning over complex query structures and integrating evidence scattered across contexts. Despite recent advancements in Graph-based Retrieval-Augmented Generation (GraphRAG), persistent limitations in complex reasoning and retrieval inaccuracies continue to constrain the efficacy of multi-hop QA systems. We introduce HiGraAgent, a framework that unifies graph-based retrieval with adaptive reasoning. It constructs a Hierarchical Knowledge Graph (HiGra) with entity alignment, reducing redundancy by 34.5% while preserving expressiveness; employs HiGraRetriever, a hybrid graph-semantic retriever that consistently outperforms the strongest graph-based method across benchmarks; and integrates a dual-agent adaptive reasoning protocol where a Seeker and a Librarian dynamically coordinate retrieval and reasoning. Together, these innovations enable HiGraAgent to achieve 85.3% average accuracy on HotpotQA, 2WikiMultihopQA, and MuSiQue, surpassing the strongest prior system by 11.7%. Our results highlight the importance of reframing multi-hop QA as a problem of adaptive reasoning, offering a more robust and flexible paradigm for complex information seeking.
Hire Your Anthropologist! Rethinking Culture Benchmarks Through an Anthropological Lens
Mai Alkhamissi | Yunze Xiao | Badr AlKhamissi | Mona T. Diab
Mai Alkhamissi | Yunze Xiao | Badr AlKhamissi | Mona T. Diab
Cultural evaluation of large language models has become increasingly important, yet current benchmarks often reduce culture to static facts or homogeneous values. This view conflicts with anthropological accounts that emphasize culture as dynamic, historically situated, and enacted in practice. To analyze this gap, we introduce a four-part framework that categorizes how benchmarks frame culture, such as knowledge, preference, performance, or bias. Using this lens, we qualitatively examine 20 cultural benchmarks and identify six recurring methodological issues, including treating countries as cultures, overlooking within-culture diversity, and relying on oversimplified survey formats. Drawing on established anthropological methods, we propose concrete improvements: incorporating real-world narratives and scenarios, involving cultural communities in design and validation, and evaluating models in context rather than isolation. Our aim is to guide the development of cultural benchmarks that go beyond static recall tasks and more accurately capture the responses of the models to complex cultural situations.
Suppressing Final Layer Hidden State Jumps in Transformer Pretraining
Keigo Shibata | Kazuki Yano | Ryosuke Takahashi | Jaesung Lee | Wataru Ikeda | Jun Suzuki
Keigo Shibata | Kazuki Yano | Ryosuke Takahashi | Jaesung Lee | Wataru Ikeda | Jun Suzuki
This paper discusses the internal behavior of Transformer language models. Many recent pre-trained models have been reported to exhibit only slight changes in the angular distance between the input and output hidden state vectors in the middle Transformer layers, despite a disproportionately large “jump” in the angular distance occurring in or around the final Transformer layer. To characterize this, we first introduce a quantitative metric for the jump strength around the final layer, and then demonstrate its prevalence across many open-weight models, as well as its amplification throughout pre-training. Assuming such jumps indicate an undesirable property, we propose the jump-suppressing regularizer (JREG), which penalizes this jump during pre-training, thereby encouraging more balanced capability usage across the middle layers. Empirical evaluations of Llama-based models at three sizes, trained with the proposed JREG method, reveal improved task performance compared to the baseline, without altering the model architecture.
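The layer-wise angular distance and a simple final-layer jump statistic can be sketched as follows; this is an illustrative quantity only, not the paper's exact jump-strength metric or the JREG penalty itself.

```python
# Angular distance between consecutive layers' hidden states and a jump ratio.
import numpy as np
from typing import List


def angular_distance(u: np.ndarray, v: np.ndarray) -> float:
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


def final_layer_jump(hidden_states: List[np.ndarray]) -> float:
    """Ratio of the last layer-to-layer angular step to the mean of earlier steps."""
    steps = [angular_distance(a, b)
             for a, b in zip(hidden_states, hidden_states[1:])]
    return steps[-1] / (float(np.mean(steps[:-1])) + 1e-8)
```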
Large Language Models (LLMs) have achieved impressive capabilities in various context-based text generation tasks, such as summarization and reasoning; however, their applications in intention-based generation tasks remain underexplored. One such example is revision generation, which requires the generated text to explicitly reflect the writer’s actual intentions. Identifying intentions and generating desirable revisions are challenging due to their complex and diverse nature. Although prior work has employed LLMs to generate revisions with few-shot learning, they struggle with handling entangled multi-intent scenarios. While fine-tuning LLMs using intention-based instructions appears promising, it demands large amounts of annotated data, which is expensive and scarce in the revision community. To address these challenges, we propose Intention-Tuning, an intention-adaptive layer-wise LLM fine-tuning framework that dynamically selects a subset of LLM layers to learn the intentions and subsequently transfers their representations to revision generation. Experimental results suggest that Intention-Tuning is effective and efficient on small revision corpora, outperforming several PEFT baselines.
ReAttn: Improving Attention-based Re-ranking via Attention Re-weighting
Yuxing Tian | Fengran Mo | Weixu Zhang | Yiyan Qi | Jian-Yun Nie
Yuxing Tian | Fengran Mo | Weixu Zhang | Yiyan Qi | Jian-Yun Nie
The strong capabilities of recent Large Language Models (LLMs) have made them highly effective for zero-shot re-ranking tasks. Attention-based re-ranking methods, which derive relevance scores directly from attention weights, offer an efficient and interpretable alternative to generation-based re-ranking methods. However, they still face two major limitations. First, attention signals are highly concentrated on a small subset of tokens within a few documents, making the others indistinguishable. Second, attention often overemphasizes phrases lexically similar to the query, yielding biased rankings in which irrelevant documents with mere lexical resemblance are regarded as relevant. In this paper, we propose ReAttn, a post-hoc re-weighting strategy for attention-based re-ranking methods. It first computes a cross-document IDF weighting to down-weight attention on query-overlapping tokens that frequently appear across the candidate documents, reducing lexical bias and emphasizing distinctive terms. It then employs entropy-based regularization to mitigate over-concentrated attention, encouraging a more balanced distribution across informative tokens. Both adjustments operate directly on existing attention weights without additional training or supervision. Extensive experiments demonstrate the effectiveness of our method.
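A rough numpy sketch of the two post-hoc adjustments: an IDF-style down-weighting of tokens that recur across candidate documents, and an entropy check that softens over-peaked attention; the specific formulas, the entropy threshold, and the softening rule are assumptions for illustration.

```python
# IDF re-weighting plus an entropy-based flattening of attention mass (sketch).
import numpy as np
from typing import Dict, List


def idf_weights(doc_tokens: List[List[str]]) -> Dict[str, float]:
    n_docs = len(doc_tokens)
    df: Dict[str, int] = {}
    for tokens in doc_tokens:
        for t in set(tokens):
            df[t] = df.get(t, 0) + 1
    return {t: np.log(n_docs / d) + 1.0 for t, d in df.items()}


def reweight_attention(attn: np.ndarray, tokens: List[str],
                       idf: Dict[str, float], min_entropy: float = 2.0) -> np.ndarray:
    """attn: attention mass over one document's tokens, summing to 1."""
    w = attn * np.array([idf.get(t, 1.0) for t in tokens])
    w = w / w.sum()
    entropy = -np.sum(w * np.log(w + 1e-12))
    if entropy < min_entropy:          # over-concentrated: soften the distribution
        w = np.sqrt(w)
        w = w / w.sum()
    return w
```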
MapAgent: A Hierarchical Agent for Geospatial Reasoning with Dynamic Map Tool Integration
Md Hasebul Hasan | Mahir Labib Dihan | Tanzima Hashem | Mohammed Eunus Ali | Md Rizwan Parvez
Md Hasebul Hasan | Mahir Labib Dihan | Tanzima Hashem | Mohammed Eunus Ali | Md Rizwan Parvez
Agentic AI has significantly extended the capabilities of large language models (LLMs) by enabling complex reasoning and tool use. However, most existing frameworks are tailored to domains such as mathematics, coding, or web automation, and fall short on geospatial tasks that require spatial reasoning, multi-hop planning, and real-time map interaction. To address these challenges, we introduce MapAgent, a hierarchical multi-agent plug-and-play framework with customized toolsets and agentic scaffolds for map-integrated geospatial reasoning. Unlike existing flat agent-based approaches that treat tools uniformly—often overwhelming the LLM when handling similar but subtly different geospatial APIs—MapAgent decouples planning from execution. A high-level planner decomposes complex queries into subgoals, which are routed to specialized modules. For tool-heavy modules—such as map-based services—we then design a dedicated map-tool agent that efficiently orchestrates related APIs adaptively in parallel to effectively fetch geospatial data relevant for the query, while simpler modules (e.g., solution generation or answer extraction) operate without additional agent overhead. This hierarchical design reduces cognitive load, improves tool selection accuracy, and enables precise coordination across similar APIs. We evaluate MapAgent on four diverse geospatial benchmarks—MapEval-Textual, MapEval-API, MapEval-Visual, and MapQA—and demonstrate substantial gains over state-of-the-art tool-augmented and agentic baselines.
Comprehensive Study of Bilingual and Multi-category Instruction Pre-training
Takashi Kodama | Yusuke Oda
Takashi Kodama | Yusuke Oda
Instruction pre-training (IPT) has recently emerged as an effective intermediate stage between vanilla pre-training and post-training for large language models (LLMs). However, the optimal design of IPT corpora—such as the balance between raw and instruction-response data, languages, and task categories—remains unclear. We systematically study IPT corpus composition using a bilingual (English and Japanese) and multi-category (coding, general, math, and reasoning) instruction-response dataset. Through extensive IPT experiments across four base models, including both English-centric and bilingual LLMs, we find that: (1) more instruction-response data generally enhances model performance, particularly for models with large vanilla pre-training (VPT) budgets; (2) Japanese instruction data can improve English performance through cross-lingual transfer; and (3) the effectiveness of post-training varies across categories: coding performance is largely determined during IPT, while math and reasoning continue to improve during post-training.
Reflect, Rewrite, Repeat: How Simple Arithmetic Enables Advanced Reasoning in Small Language Models
Mengdie Flora Wang | Haochen Xie | Mun Young Kim | Baishali Chaudhury | Meghana Ashok | Suren Gunturu | Sungmin Hong | Jae Oh Woo
Mengdie Flora Wang | Haochen Xie | Mun Young Kim | Baishali Chaudhury | Meghana Ashok | Suren Gunturu | Sungmin Hong | Jae Oh Woo
Contemporary advancements in language model reasoning typically require computationally intensive reinforcement learning (RL) and massive datasets, creating barriers for resource-constrained teams. In this work, we demonstrate that high-quality, iterative training on minimal data can rival modern RL approaches. We introduce a resource-efficient framework that combines Direct Preference Optimization (DPO) and Supervised Fine-Tuning (SFT) with selective guidance from larger models, iteratively refining solutions through a "reflect, rewrite, repeat" cycle (R3). Using Qwen 2.5 7B and Qwen 2.5 Math 7B as base models, our method shows meaningful performance improvements across arithmetic, symbolic and cognitive reasoning benchmarks—including GSM8K (83.1% → 88.6%), AIME’25@10 (20.0% → 30.0%) and LastLetterConcat (40.7% → 53.3%) problems. The model-agnostic nature of our R3 framework is further demonstrated through substantial improvements when applied to Mistral and LLaMA-based models. Remarkably, these gains are achieved using a mere 700 basic arithmetic training samples, in stark contrast to the hundreds of thousands of examples typically required by RL-based systems. Our results suggest that reasoning improvements need not strictly depend on large-scale data. By emphasizing strategically curated training grounded in foundational principles, we achieve competitive generalization with minimal resource overhead. Our R3 pipeline also generates high-quality SFT data with high-fidelity reasoning traces as a byproduct, further enabling scalable and annotation-free fine-tuning. Code is available at https://github.com/aws-samples/sample-for-reflect-rewrite-repeat.
Don’t Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation
Jiwon Moon | Yerin Hwang | Dongryeol Lee | Taegwan Kang | Yongil Kim | Kyomin Jung
Jiwon Moon | Yerin Hwang | Dongryeol Lee | Taegwan Kang | Yongil Kim | Kyomin Jung
With the growing use of large language models (LLMs) as evaluators, their application has expanded to code evaluation tasks, where they assess the correctness of generated code without relying on reference implementations. While this offers scalability and flexibility, it also raises a critical, unresolved question: Can LLM judges fairly and robustly evaluate semantically equivalent code with superficial variations? Functionally correct code often exhibits variations—such as differences in variable names, comments, or formatting—that should not influence its correctness. Yet, whether LLM judges can reliably handle these variations remains unclear. We present the first comprehensive study of this issue, defining six types of potential bias in code evaluation and revealing their systematic impact on LLM judges. Across five programming languages and multiple LLMs, we empirically demonstrate that all tested LLM judges are susceptible to both positive and negative biases, resulting in inflated or unfairly low scores. Moreover, we observe that LLM judges remain vulnerable to these biases even when prompted to generate test cases before scoring, highlighting the need for more robust code evaluation methods.
COMMUNITYNOTES: A Dataset for Exploring the Helpfulness of Fact-Checking Explanations
Rui Xing | Preslav Nakov | Timothy Baldwin | Jey Han Lau
Rui Xing | Preslav Nakov | Timothy Baldwin | Jey Han Lau
Fact-checking on major platforms, such as X, Meta, and TikTok, is shifting from expert-driven verification to a community-based setup, where users contribute explanatory notes to clarify why a post might be misleading. An important challenge here is determining whether an explanation is helpful for understanding real-world claims and the reasons why, which remains largely underexplored in prior research. In practice, most community notes remain unpublished due to slow community annotation, and the reasons for helpfulness lack clear definitions. To bridge these gaps, we introduce the task of predicting both the helpfulness of explanatory notes and the reason for this. We present COMMUNITYNOTES, a large-scale multilingual dataset of 104k posts with user-provided notes and helpfulness labels. We further propose a framework that automatically generates and improves reason definitions via automatic prompt optimization, and integrates them into prediction. Our experiments show that the optimized definitions can improve both helpfulness and reason prediction. Finally, we show that the helpfulness information is beneficial for existing fact-checking systems. The code and the data are available at https://github.com/ruixing76/Helpfulness-FCExp.
Persona Jailbreaking in Large Language Models
Jivnesh Sandhan | Fei Cheng | Tushar Sandhan | Yugo Murawaki
As a digraphic language, Persian uses two written standards: Perso-Arabic in Afghanistan and Iran, and Tajik-Cyrillic in Tajikistan. Despite the significant similarity between the dialects of each country, script differences prevent simple one-to-one mapping, hindering written communication and interaction between Tajikistan and its Persian-speaking “siblings”. To overcome this, previously published efforts have investigated machine transliteration models to convert between the two scripts. Unfortunately, most efforts did not use datasets other than those they created, limiting these models to certain domains of text such as archaic poetry or word lists. A truly usable transliteration system must be capable of handling varied domains, meaning that such models lack the versatility required for real-world usage. The contrast in domain between datasets also obscures the task’s true difficulty. We present a new state-of-the-art sequence-to-sequence model for Tajik-Farsi transliteration trained across all available datasets, and present two datasets of our own. Our results across domains provide a clearer understanding of the task and establish comprehensive, comparable benchmarks. Overall, our model achieves chrF++ and Normalized CER scores of 87.91 and 0.05 from Farsi to Tajik and 92.28 and 0.04 from Tajik to Farsi. Our model, data, and code are available at https://github.com/merchantrayyan/ParsTranslit.
One Sentence, Two Embeddings: Contrastive Learning of Explicit and Implicit Semantic Representations
Kohei Oda | Po-Min Chuang | Kiyoaki Shirai | Natthawut Kertkeidkachorn
Sentence embedding methods have made remarkable progress, yet they still struggle to capture the implicit semantics within sentences. This can be attributed to the inherent limitations of conventional sentence embedding methods that assign only a single vector per sentence. To overcome this limitation, we propose DualCSE, a sentence embedding method that assigns two embeddings to each sentence: one representing the explicit semantics and the other representing the implicit semantics. These embeddings coexist in the shared space, enabling the selection of the desired semantics for specific purposes such as information retrieval and text classification. Experimental results demonstrate that DualCSE can effectively encode both explicit and implicit meanings and improve performance on downstream tasks.
ETOM: A Five-Level Benchmark for Evaluating Tool Orchestration within the MCP Ecosystem
Jia-Kai Dong | I-Wei Huang | Chun-Tin Wu | Yi-tien Tsai
We introduce ETOM, a five-level benchmark for evaluating multi-hop, end-to-end tool orchestration by LLM agents within a hierarchical Model-Context Protocol (MCP) ecosystem. Existing benchmarks often assess tools in isolation, overlooking challenges such as functional overlap and cross-server orchestration, which can lead to overly optimistic evaluations. ETOM addresses these gaps by constructing ground truth through “equal function sets”, enabling objective metrics such as F1 score and reducing reliance on LLM-as-a-judge evaluation. Its five-level curriculum systematically tests agent capabilities, from single-tool orchestration to complex cross-server planning, as well as robustness to out-of-scope requests. Experiments reveal that rigid hierarchies can hinder performance without co-designed strategies, and even state-of-the-art agents exhibit systemic weaknesses in robustness. ETOM provides a diagnostic framework to expose these limitations and guide the development of more capable and efficient tool-using agents.
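One plausible reading of the set-based scoring that "equal function sets" enable, sketched under our own simplifying assumptions: compare the tools an agent actually invoked against the set of functionally acceptable tools and report F1.

```python
def tool_f1(invoked: set, equal_function_set: set) -> float:
    """F1 between the tools an agent called and the functionally acceptable set."""
    true_positives = len(invoked & equal_function_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(invoked)
    recall = true_positives / len(equal_function_set)
    return 2 * precision * recall / (precision + recall)
```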
SymCode: A Neurosymbolic Approach to Mathematical Reasoning via Verifiable Code Generation
Sina Bagheri Nezhad | Yao Li | Ameeta Agrawal
Large Language Models (LLMs) often struggle with complex mathematical reasoning, where prose-based generation leads to unverified and arithmetically unsound solutions. Current prompting strategies like Chain of Thought still operate within this unreliable medium, lacking a mechanism for deterministic verification. To address these limitations, we introduce SymCode, a neurosymbolic framework that reframes mathematical problem-solving as a task of verifiable code generation using the SymPy library. We evaluate SymCode on challenging benchmarks, including MATH-500 and OlympiadBench, demonstrating significant accuracy improvements of up to 13.6 percentage points over baselines. Our analysis shows that SymCode is not only more token-efficient but also fundamentally shifts model failures from opaque logical fallacies towards transparent, programmatic errors. By grounding LLM reasoning in a deterministic symbolic engine, SymCode represents a key step towards more accurate and trustworthy AI in formal domains.
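To illustrate the verifiable-code idea on a toy problem of our own choosing (not from the paper): rather than prose, a model in this setting would emit SymPy code along these lines, so the answer can be checked deterministically before being returned.

```python
import sympy as sp

x = sp.symbols("x")
# Toy problem: find the real solutions of x^2 - 5x + 6 = 0.
solutions = sp.solve(sp.Eq(x**2 - 5 * x + 6, 0), x)
assert solutions == [2, 3]                                          # symbolic check
assert all(sp.simplify(s**2 - 5 * s + 6) == 0 for s in solutions)   # substitute back
print(solutions)
```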
Unsupervised Detection of LLM-Generated Text in Korean Using Syntactic and Semantic Cues
Heejeong Jeon | MinSu Park | YunSeok Choi | Eunil Park
As Large Language Models (LLMs) are increasingly used for content creation, detecting AI-generated text has become a critical challenge. Prior work has largely focused on English, leaving low-resource languages such as Korean underexplored. We propose an unsupervised detection framework that integrates two complementary signals: syntactic token cohesiveness (TOCSIN) and semantic regeneration similarity (SimLLM). To support evaluation, we construct a Korean pairwise dataset of 1,000 anchors with continuation- and regeneration-style generations and further assess performance across domains (news, research paper abstracts, essays) and model families (GPT-3.5 Turbo, GPT-4o, HyperCLOVA X, LLaMA-3-8B). Without any training, our ensemble achieves up to 0.963 F1 and 0.985 ROC-AUC, outperforming baselines. These results demonstrate that the combination of syntactic and semantic cues enables robust unsupervised detection in low-resource settings. Code available at https://github.com/dxlabskku/llm-detection-main.
NLP Privacy Risk Identification in Social Media (NLP-PRISM): A Survey
Dhiman Goswami | Jai Kruthunz Naveen Kumar | Sanchari Das
Natural Language Processing (NLP) is integral to social media analytics but often processes content containing Personally Identifiable Information (PII), behavioral cues, and metadata, raising privacy risks such as surveillance, profiling, and targeted advertising. To systematically assess these risks, we review 203 peer-reviewed papers and propose the NLP Privacy Risk Identification in Social Media (NLP-PRISM) framework, which evaluates vulnerabilities across six dimensions: data collection, preprocessing, visibility, fairness, computational risk, and regulatory compliance. Our analysis shows that transformer models achieve F1-scores ranging from 0.58 to 0.84 but incur a 1%–23% drop under privacy-preserving fine-tuning. Using NLP-PRISM, we examine privacy coverage in six NLP tasks: sentiment analysis (16), emotion detection (14), offensive language identification (19), code-mixed processing (39), native language identification (29), and dialect detection (24), revealing substantial gaps in privacy research. We further find a 2%–9% trade-off in model utility, with a membership inference attack (MIA) AUC of 0.81 and an attribute inference attack (AIA) accuracy of 0.75. Finally, we advocate for stronger anonymization, privacy-aware learning, and fairness-driven training to enable ethical NLP in social media contexts.
CrowdSelect: Synthetic Instruction Data Selection with Multi-LLM Wisdom
Yisen Li | Lingfeng Yang | Wenxuan Shen | Pan Zhou | Yao Wan | Weiwei Lin | Dongping Chen
Distilling advanced Large Language Models’ instruction-following capabilities into smaller models using a selected subset has become a mainstream approach in model training. While existing synthetic instruction data selection strategies rely mainly on single-dimensional signals (i.e., reward scores, model perplexity), they fail to capture the complexity of instruction-following across diverse fields. Therefore, we investigate more diverse signals to capture comprehensive instruction-response pair characteristics and propose three foundational metrics that leverage multi-LLM wisdom, informed by (1) diverse LLM responses and (2) reward model assessment. Building upon these base metrics, we propose CrowdSelect, an integrated metric incorporating a clustering-based approach to maintain response diversity. Our comprehensive experiments demonstrate that our foundational metrics consistently improve performance across 4 base models on MT-bench and Arena-Hard. CrowdSelect, efficiently incorporating all metrics, achieves state-of-the-art performance in both full and LoRA fine-tuning, showing improvements of 4.81% on Arena-Hard and 11.1% on MT-bench with Llama-3.2-3b-instruct. We hope our findings will bring valuable insights for future research in this direction.
Bias in the Ear of the Listener: Assessing Sensitivity in Audio Language Models Across Linguistic, Demographic, and Positional Variations
Sheng-Lun Wei | Yu-Ling Liao | Yen-Hua Chang | Hen-Hsen Huang | Hsin-Hsi Chen
Recent multimodal large language models (MLLMs) extend language understanding beyond text to speech, enabling unified reasoning across modalities. While biases in text-based LLMs have been widely examined, their persistence and manifestation in spoken inputs remain underexplored. This work presents the first systematic investigation of speech bias in multilingual MLLMs. We construct and release the BiasInEar Dataset, a speech-augmented benchmark based on Global MMLU Lite, spanning English, Chinese, and Korean, balanced by gender and accent, and totaling 70.8 hours (≈4,249 minutes) of speech with 11,200 questions. Using four complementary metrics (accuracy, entropy, APES, and Fleiss’ 𝜅), we evaluate nine representative models under perturbations along linguistic (language and accent), demographic (gender), and structural (option order) dimensions. Our findings reveal that MLLMs are relatively robust to demographic factors but highly sensitive to language and option order, suggesting that speech can amplify existing structural biases. Moreover, architectural design and reasoning strategy substantially affect robustness across languages. Overall, this study establishes a unified framework for assessing fairness and robustness in speech-integrated LLMs, bridging the gap between text- and speech-based evaluation.
Large Language Models (LLMs) are increasingly being used to understand how scientific research evolves, drawing growing interest from the research community. However, limited work has explored the scientific fact-checking of research questions and claims from manuscripts, particularly within the NLP domain, an emerging direction for advancing scientific integrity and knowledge validation. In this work, we propose a novel scientific fact-checking dataset, SCINLP, tailored to the NLP domain. Our proposed framework on SCINLP systematically verifies the veracity of complex scientific research questions across varying rationale contexts, while also assessing their temporal positioning. SCINLP includes supporting and refuting research questions from a curated collection of influential and reputable NLP papers published between 2000 and 2024. In our framework, we use multiple LLMs and diverse rationale contexts from our dataset to examine scientific claims and research focus, complemented by feasibility judgments for deeper insight into scientific reasoning in NLP.
SCAN: Semantic Document Layout Analysis for Textual and Visual Retrieval-Augmented Generation
Nobuhiro Ueda | Yuyang Dong | Krisztián Boros | Daiki Ito | Takuya Sera | Masafumi Oyamada
With the increasing adoption of Large Language Models (LLMs) and Vision-Language Models (VLMs), rich document analysis technologies for applications like Retrieval-Augmented Generation (RAG) and visual RAG are gaining significant attention. Recent research indicates that using VLMs yields better RAG performance, but processing rich documents remains a challenge since a single page contains large amounts of information. In this paper, we present SCAN (SemantiC Document Layout ANalysis), a novel approach that enhances both textual and visual Retrieval-Augmented Generation (RAG) systems that work with visually rich documents. It is a VLM-friendly approach that identifies document components with appropriate semantic granularity, balancing context preservation with processing efficiency. SCAN uses a coarse-grained semantic approach that divides documents into coherent regions covering contiguous components. We trained the SCAN model by fine-tuning object detection models on an annotated dataset. Our experimental results across English and Japanese datasets demonstrate that applying SCAN improves end-to-end textual RAG performance by up to 9.4 points and visual RAG performance by up to 10.4 points, outperforming conventional approaches and even commercial document processing solutions.
Unified Multimodal Interleaved Document Representation for Retrieval
Jaewoo Lee | Joonho Ko | Jinheon Baek | Soyeong Jeong | Sung Ju Hwang
Information Retrieval (IR) methods aim to identify documents relevant to a query, which have been widely applied in various natural language tasks. However, existing approaches typically consider only the textual content within documents, overlooking the fact that documents can contain multiple modalities, including images and tables. Also, they often segment each long document into multiple discrete passages for embedding, which prevents them from capturing the overall document context and interactions between paragraphs. To address these two challenges, we propose a method that holistically embeds documents interleaved with multiple modalities by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation. Moreover, to mitigate the information loss from segmenting documents into passages, instead of representing and retrieving passages individually, we further merge the representations of segmented passages into one single document representation, while we additionally introduce a reranking strategy to decouple and identify the relevant passage within the document if necessary. Then, through extensive experiments on diverse IR scenarios considering both the textual and multimodal queries, we show that our approach substantially outperforms relevant baselines, thanks to the consideration of the multimodal information within documents.
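A minimal numpy sketch of the "merge passage representations into one document vector" step and cosine-similarity retrieval, under our own assumption of mean pooling; the passage embeddings are taken to come from some interleaved multimodal encoder.

```python
import numpy as np

def document_embedding(passage_embeddings: np.ndarray) -> np.ndarray:
    """Merge segmented passage embeddings into a single unit-norm document vector."""
    doc = passage_embeddings.mean(axis=0)
    return doc / np.linalg.norm(doc)

def retrieve(query_embedding: np.ndarray, document_embeddings: np.ndarray) -> int:
    """Return the index of the best document (rows assumed unit-normalized)."""
    query = query_embedding / np.linalg.norm(query_embedding)
    return int(np.argmax(document_embeddings @ query))
```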
TELLME: Test-Enhanced Learning for Language Model Enrichment
Minjun Kim | Inho Won | HyeonSeok Lim | MinKyu Kim | Junghun Yuk | Wooyoung Go | Jongyoul Park | Jungyeul Park | KyungTae Lim
Continual pre-training (CPT) has been widely adopted as a method for domain expansion in large language models. However, CPT has consistently been accompanied by challenges, such as the difficulty of acquiring large-scale domain-specific datasets and high computational costs. In this study, we propose a novel method called Test-Enhanced Learning for Language Model Enrichment (TELLME) to alleviate these issues. TELLME leverages the Test-Enhanced Learning (TEL) principle, whereby the model’s learning efficiency is improved using quizzes during training. It integrates this principle with CPT, thereby promoting efficient domain-specific knowledge acquisition and long-term memory retention. Experimental results demonstrate that TELLME outperforms existing methods by up to 23.6% in the financial domain and achieves a 9.8% improvement in long-term memory retention.
Beyond Accuracy: Alignment and Error Detection across Languages in the Bi-GSM8K Math-Teaching Benchmark
Jieun Park | KyungTae Lim | Joon-ho Lim
Recent advancements in LLMs have significantly improved mathematical problem-solving, with models like GPT-4 achieving human-level performance. However, proficiently solving mathematical problems differs fundamentally from effectively teaching mathematics. To bridge this gap, we introduce the Bi-GSM8K benchmark, a bilingual English-Korean dataset enriched with teacher solutions, student solutions, and annotations marking students’ initial errors. This dataset is designed to evaluate two core capabilities of LLMs: (1) measuring similarity between student and teacher solutions, and (2) identifying the initial error point in student solutions. Our method achieves high agreement with human judgments, with Pearson 0.89 and Spearman 0.88 on English, and Pearson 0.89 and Spearman 0.87 on Korean. It also offers significantly lower latency and resource usage than commercial APIs, demonstrating strong computational efficiency. In the error detection task, open-source models achieved approximately 86% accuracy, within 10 percentage points of commercial LLM APIs, suggesting strong practical potential. Our key contributions include the open-source release of Bi-GSM8K, novel evaluation metrics, and comparative analyses of LLM performance across languages.
VN-MTEB: Vietnamese Massive Text Embedding Benchmark
Loc Pham | Tung Luu | Thu Vo | Minh Nguyen | Viet Hoang
Vietnam ranks among the top countries in terms of both internet traffic and online toxicity. As a result, implementing embedding models for recommendation and content control duties in applications is crucial. However, a lack of large-scale test datasets, both in volume and task diversity, makes it difficult for researchers to effectively evaluate AI models before deploying them in real-world, large-scale projects. To solve this important problem, we introduce VN-MTEB, a Vietnamese benchmark for embedding models, which we created by translating a large number of English samples from the Massive Text Embedding Benchmark using our new automated framework, thereby contributing an extension of the Massive Multilingual Text Embedding Benchmark with additional Vietnamese tasks and datasets. We leverage the strengths of large language models (LLMs) and cutting-edge embedding models to conduct translation and filtering processes to retain high-quality samples, guaranteeing a natural flow of language and semantic fidelity while preserving named entities and code snippets. Our comprehensive benchmark consists of 41 datasets from six tasks specifically designed for Vietnamese text embeddings. In our analysis, we find that bigger and more complex models using Rotary Positional Embedding outperform those using Absolute Positional Embedding in embedding tasks.
See More, Store Less: Memory-Efficient Resolution for Video Moment Retrieval
Mingyu Jeon | Sungjin Han | Jinkwon Hwang | Minchol Kwon | Jonghee Kim | Junyeong Kim
Recent advances in Multimodal Large Language Models (MLLMs) have improved image recognition and reasoning, but video-related tasks remain challenging due to memory constraints from dense frame processing. Existing Video Moment Retrieval (VMR) methodologies rely on sparse frame sampling, risking potential information loss, especially in lengthy videos. We propose SMORE (See MORE, store less), a framework that enhances memory efficiency while maintaining high information resolution. SMORE (1) uses query-guided captions to encode semantics aligned with user intent, (2) applies query-aware importance modulation to highlight relevant segments, and (3) adaptively compresses frames to preserve key content while reducing redundancy. This enables efficient video understanding without exceeding memory budgets. Experimental validation reveals that SMORE achieves state-of-the-art performance on QVHighlights, Charades-STA, and ActivityNet-Captions benchmarks.
RB-LoRA: Rank-Balanced Aggregation for Low-Rank Adaptation with Federated Fine-Tuning
Sihyeon Ha | Yongjeong Oh | Yo-Seb Jeon
Federated fine-tuning of foundation models is impeded by the need to communicate billions of parameters. Low-rank adaptation (LoRA) alleviates this by updating only compact adapter matrices. However, varying client device capabilities lead to different adapter ranks, causing rank heterogeneity that undermines aggregation, and existing reconciliation methods still incur bias or inefficiency. To address this challenge, we propose RB-LoRA, a principled rank-balanced aggregation framework that decomposes each update into rank-wise components and aligns them using analytically derived weights. Experiments on both language and vision models demonstrate consistent improvements under one and three rounds of communication in federated learning.
Garbage In, Reasoning Out? Why Benchmark Scores are Unreliable and What to Do About It
Seyed Mahed Mousavi | Edoardo Cecchinato | Lucia Horníková | Giuseppe Riccardi
We conduct a systematic audit of three widely used social reasoning benchmarks, SocialIQa, FauxPas-EAI, and ToMi, and uncover pervasive flaws in both benchmark items and evaluation methodology. Using five LLMs (GPT-3, 3.5, 4, o1, and LLaMA 3.1) as diagnostic tools, we identify structural, semantic, and pragmatic issues in benchmark design (e.g., duplicated items, ambiguous wording, and implausible answers), as well as scoring procedures that prioritize output form over the reasoning process. Through systematic human annotation and re-evaluation on cleaned benchmark subsets, we find that model scores often improve due to erratic surface wording variations rather than improved reasoning. In fact, further analyses show that model performance is highly sensitive to minor input variations such as context availability and phrasing, revealing that high scores may reflect alignment with format-specific cues rather than consistent inference based on the input. These findings challenge the validity of current benchmark-based claims about social reasoning in LLMs, and highlight the need for evaluation protocols that assess reasoning as a process of drawing inference from available information, rather than as static output selection. We release audited data and evaluation tools to support more interpretable and diagnostic assessments of model reasoning.
Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference
Bo-Wei Chen | Chung-Chi Chen | An-Zi Yen
Large Language Models (LLMs) have revolutionized inference across diverse natural language tasks, with larger models performing better but at higher computational costs. We propose a confidence-driven strategy that dynamically selects the most suitable model based on confidence estimates. By assessing a model’s confidence in handling the task and response accuracy, tasks that are likely to be solved correctly are retained, while more uncertain or complex cases are delegated to a larger model, ensuring reliability while minimizing computation. Specifically, we evaluate a model’s likelihood of knowing the correct answer and the probability that its response is accurate. Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20% to 40%. When applied to GPT-4o API calls, it reduces token usage by approximately 60%, further improving cost efficiency. These findings indicate the potential of confidence-based model selection to enhance real-world LLM deployment, particularly in resource-constrained settings such as edge devices and commercial API applications.
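A sketch of a confidence-gated cascade in this spirit, assuming hypothetical callables: `small` and `large` each return (answer, confidence in [0, 1]), and the threshold would be tuned on a validation set rather than fixed as here.

```python
def cascade(question, small, large, threshold=0.8):
    """Answer with the small model; delegate to the large one when uncertain."""
    answer, confidence = small(question)
    if confidence >= threshold:
        return answer, "small"            # cheap path: keep the small model's answer
    return large(question)[0], "large"    # uncertain case: delegate upward
```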
Quantifying the Impact of Structured Output Format on Large Language Models through Causal Inference
Han Yuan | Yue Zhao | Li Zhang | Wuqiong Luo | Zheng Ma
Structured output from large language models (LLMs) has enhanced efficiency in processing generated information and is increasingly adopted in industrial applications. Prior studies have investigated the impact of structured output on LLMs’ generation quality, often presenting one-way findings. Some suggest that structured format enhances completeness and factual accuracy, while others argue that it restricts the reasoning capacity of LLMs and leads to reductions in standard evaluation metrics. Potential limitations of these assessments include restricted testing scenarios, weakly controlled comparative settings, and reliance on coarse metrics. In this work, we present a refined analysis using causal inference. Based on one assumed and two guaranteed constraints, we derive five potential causal structures characterizing the influence of structured output on LLMs’ generation: (1) collider without m-bias, (2) collider with m-bias, (3) single cause from instruction, (4) single cause from output format, and (5) independence. Across seven public reasoning tasks and one that we developed, we find that coarse metrics report positive, negative, or neutral effects of structured output on GPT-4o’s generation. However, causal inference reveals no causal impact in 43 out of 48 scenarios. In the remaining 5 scenarios, 3 involve multifaceted causal structures influenced by concrete instructions. Further experiments show that OpenAI o3 is more resilient to output formats than the general-purpose GPT-4o and GPT-4.1, highlighting a previously unrecognized advantage of reasoning models.
Breaking the Illusion of Reasoning in Polish LLMs: Quality over Quantity of Thought
Dzmitry Pihulski | Mikołaj Langner | Jan Eliasz | Przemyslaw Kazienko | Jan Kocon | Teddy Ferdinan
Recent advances in large language models (LLMs) have introduced explicit reasoning capabilities, yet the factors that truly drive their improved performance remain unclear. In this work, we disentangle the effects of reasoning quality and sequence length by fine-tuning 8B models on several Polish variants of the Mixture-of-Thoughts (MoT-PL) dataset, each representing a distinct reasoning style: *Detailed*, *Summarized*, *BabyThink*, *Lengthy*. We found that the model trained on high-quality reasoning traces achieved better average performance than all other models; neither *longer reasoning with similar quality* nor *low-quality reasoning with similar length* achieved similar gains. Qualitative and quantitative analyses further reveal that reasoning clarity, rather than verbosity, is the dominant factor driving model performance. These findings underscore the importance of reasoning content quality in LLM training and provide new insights into designing more effective reasoning-oriented datasets and models.
RV-Syn: Rational and Verifiable Mathematical Reasoning Data Synthesis based on Structured Function Library
Jiapeng Wang | Jinhao Jiang | Zhiqiang Zhang | Jun Zhou | Xin Zhao
The advancement of reasoning capabilities in Large Language Models (LLMs) requires substantial amounts of high-quality reasoning data, particularly in mathematics. Existing data synthesis methods, such as data augmentation from annotated training sets or direct question generation based on relevant knowledge points and documents, have expanded datasets but face challenges in mastering the internal logic of the problem during generation and ensuring the verifiability of the solutions. To address these issues, we propose RV-Syn, a novel Rational and Verifiable mathematical Synthesis approach. RV-Syn first constructs a structured library of mathematical operations and then composes them into executable computational graphs, which serve as verifiable solution blueprints. These graphs are subsequently back-translated into complex problems, enabling solution-guided, logic-aware problem generation while inherently ensuring the verifiability of the solving process. Experimental results show RV-Syn surpasses existing synthesis methods, including those involving human-crafted problems. Our method achieves a 6.3% performance gain over the previous state-of-the-art synthetic data on LLaMA-3-8B and demonstrates superior data efficiency, outperforming others with only half the training data (50k vs. 100k), enabling a more scalable and robust reasoning dataset generation framework.
WebNovelBench: Placing LLM Novelists on the Web Novel Distribution
Liangtao Lin | Jun Zheng | Haidong Wang
Robustly evaluating the long-form storytelling capabilities of Large Language Models (LLMs) remains a significant challenge, as existing benchmarks often lack the necessary scale, diversity, or objective measures. To address this, we introduce WebNovelBench, a novel benchmark specifically designed for evaluating long-form novel generation. WebNovelBench leverages a large-scale dataset of over 4,000 Chinese web novels, framing evaluation as a synopsis-to-story generation task. We propose a multi-faceted framework encompassing eight narrative quality dimensions, assessed automatically via an LLM-as-Judge approach. Scores are aggregated using Principal Component Analysis and mapped to a percentile rank against human-authored works. Our experiments demonstrate that WebNovelBench effectively differentiates between human-written masterpieces, popular web novels, and LLM-generated content. We provide a comprehensive analysis of 24 state-of-the-art LLMs, ranking their storytelling abilities and offering insights for future development. This benchmark provides a scalable, replicable, and data-driven methodology for assessing and advancing LLM-driven narrative generation.
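A sketch of the aggregation step described above, under our own simplifying assumptions: project the per-dimension scores onto their first principal component (fit on a human-authored reference pool) and report each candidate's percentile against that pool.

```python
import numpy as np

def pca_percentile(human_scores: np.ndarray, candidate_scores: np.ndarray) -> np.ndarray:
    """Percentile rank of candidates against human works along the first PC."""
    mu = human_scores.mean(axis=0)
    _, _, vt = np.linalg.svd(human_scores - mu, full_matrices=False)
    pc1 = vt[0]
    if pc1.sum() < 0:                      # orient so that higher raw scores rank higher
        pc1 = -pc1
    human_proj = (human_scores - mu) @ pc1
    candidate_proj = (candidate_scores - mu) @ pc1
    return np.array([(human_proj < c).mean() * 100 for c in candidate_proj])
```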
From Semantics to Style: A Cross-Dataset Comparative Framework for Sentence Similarity Predictions
Yusuke Yamauchi | Akiko Aizawa
While the Semantic Textual Similarity (STS) task serves as a cornerstone embedding task in natural language processing, the definition of similarity is inherently ambiguous and dataset-specific. Comprehensive cross-dataset analysis remains scarce, leaving it uncertain whether language models perceive diverse semantic and stylistic nuances as humans do. To address this, we propose a comparative framework utilizing lightweight poolers on a frozen encoder to conduct a unified analysis across STS, Paraphrase Identification (PI), and Triplet datasets. Experimental results on 21 datasets indicate a high correlation of semantic concepts between STS and PI settings, while highlighting style as a distinct dimension necessitating explicit separation from semantics. Moreover, Procrustes, layer-wise, and hierarchical clustering analyses elucidate the varying properties of these concepts and the structural organization of the embedding space. These insights imply that treating semantics and style as separate components in embedding models is crucial for enhancing both interpretability and practical utility.
Feature Drift: How Fine-Tuning Repurposes Representations in LLMs
Andrey V. Galichin | Anton Korznikov | Alexey Dontsov | Oleg Rogov | Elena Tutubalina | Ivan Oseledets
Fine-tuning LLMs introduces many important behaviors, such as instruction-following and safety alignment. This makes it crucial to study how fine-tuning changes models’ internal mechanisms. Sparse Autoencoders (SAEs) offer a powerful tool for interpreting neural networks by extracting concepts (features) represented in their activations. Previous work observed that SAEs trained on base models transfer effectively to instruction-tuned (chat) models, attributed to activation similarity. In this work, we propose *feature drift* as an alternative explanation: the feature space remains relevant, but the distribution of feature activations changes. In other words, fine-tuning recombines existing concepts rather than learning new ones. We validate this by showing base SAEs reconstruct both base and chat activations comparably despite systematic differences, with individual features exhibiting clear drift patterns. In a refusal behavior case study, we identify base SAE features that drift to activate on harmful instructions in chat models. Causal interventions using these features confirm that they mediate refusal. Our findings suggest that monitoring how existing features drift, rather than searching for entirely new features, may provide a more complete explanation of how fine-tuning changes model capabilities.
Detecting Winning Arguments with Large Language Models and Persuasion Strategies
Tiziano Labruna | Arkadiusz Modzelewski | Giorgio Satta | Giovanni Da San Martino
Detecting persuasion in argumentative text is a challenging task with important implications for understanding human communication. This work investigates the role of persuasion strategies - such as Attack on reputation, Distraction, and Manipulative wording - in determining the persuasiveness of a text. We conduct experiments on three annotated argument datasets: Winning Arguments (built from the Change My View subreddit), Anthropic/Persuasion, and Persuasion for Good. Our approach leverages large language models (LLMs) with a chain-of-thought framework that guides reasoning over six persuasion strategies. Results show that strategy-guided reasoning improves the prediction of persuasiveness. To better understand the influence of content, we organize the Winning Argument dataset into broad discussion topics and analyze performance across them. We publicly release this topic-annotated version of the dataset to facilitate future research. Overall, our methodology demonstrates the value of structured, strategy-aware prompting for enhancing interpretability and robustness in argument quality assessment.
The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning
Mingkai Tian | Guorong Li | Yuankai Qi | Anton Van Den Hengel | Qingming Huang
Zero-shot video captioning requires that a model generate high-quality captions without human-annotated video-text pairs for training. State-of-the-art approaches to the problem leverage CLIP to extract video-informed text prompts to guide language models in generating captions. However, by using representations at a single granularity (e.g., noun phrases or full sentences), these methods tend to focus on one key aspect of the scene and build a caption that ignores the rest of the visual input. To address this issue, and generate more accurate and complete captions, we propose a novel progressive multi-granularity textual prompting strategy for zero-shot video captioning. Our approach constructs three distinct memory banks, encompassing noun phrases, scene graphs of noun phrases, and entire sentences. Moreover, we introduce a category-aware retrieval mechanism that models the distribution of natural language surrounding the specific topics, to promote prompt diversity while ensuring visual relevance. Extensive experiments on both in-domain and cross-domain settings demonstrate that the proposed method consistently outperforms state-of-the-art approaches.
Best-of-L: Cross-Lingual Reward Modeling for Mathematical Reasoning
Sara Rajaee | Rochelle Choenni | Ekaterina Shutova | Christof Monz
While the reasoning abilities of large language models (LLMs) continue to advance, it remains underexplored how such abilities vary across languages in multilingual LLMs and whether different languages generate distinct reasoning paths. In this work, we show that reasoning traces generated in different languages often provide complementary signals for mathematical reasoning. We propose cross-lingual outcome reward modeling, a framework that ranks candidate reasoning traces across languages rather than within a single language. Our experiments on the MGSM benchmark show that cross-lingual reward modeling improves accuracy by up to 10 points compared to using reward modeling within a single language, benefiting both high- and low-resource languages. Notably, cross-lingual sampling improves English performance under low inference budgets, despite English being the strongest individual language. Our findings reveal new opportunities to improve multilingual reasoning by leveraging the complementary strengths of diverse languages.
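A sketch of cross-lingual best-of-N selection in this spirit: pool candidate reasoning traces from several languages and keep whichever one the outcome reward model scores highest. `reward` is a placeholder callable, not the paper's trained model.

```python
def best_of_l(question, candidates_by_language: dict, reward):
    """Pick the highest-reward trace across all languages' candidate pools."""
    pool = [
        (language, trace)
        for language, traces in candidates_by_language.items()
        for trace in traces
    ]
    return max(pool, key=lambda pair: reward(question, pair[1]))
```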
Nuanced Toxicity Detection in Spanish: A New Corpus and Benchmark Study
Alba María Mármol-Romero | Robiert Sepúlveda-Torres | Estela Saquete | María-Teresa Martín-Valdivia | L. Alfonso Ureña
The rise of toxic content on digital platforms has intensified the demand for automatic moderation tools. While English has benefited from large-scale annotated corpora, Spanish remains under-resourced, particularly for nuanced cases of toxicity such as irony, sarcasm, or indirect aggression. We present an extended version of the NECOS-TOX corpus, comprising 4,011 Spanish comments collected from 16 major news outlets. Each comment is annotated across three levels of toxicity (Non-Toxic, Slightly Toxic, and Toxic), following an iterative annotation protocol that achieved substantial inter-annotator agreement (k = 0.74). To reduce annotation costs while maintaining quality, we employed a human-in-the-loop active learning strategy, with manual correction of model pre-labels. We benchmarked the dataset with traditional machine learning (ML) methods, domain-specific transformers, and instruction-tuned large language models (LLMs). Results show that compact encoder models (e.g., RoBERTa-base-bne, 125M parameters) perform on par with much larger models (e.g., LLaMA-3.1-8B), underscoring the value of in-domain adaptation over raw scale. Our error analysis highlights persistent challenges in distinguishing subtle forms of toxicity, especially sarcasm and implicit insults, and reveals entity-related biases that motivate anonymization strategies. The dataset and trained models are released publicly.
Persona Switch: Mixing Distinct Perspectives in Decoding Time
Junseok Kim | Nakyeong Yang | Kyomin Jung
Role-play prompting is known to steer the behavior of language models by injecting a persona into the prompt, improving their zero-shot reasoning capabilities. However, such improvements are inconsistent across different tasks or instances. This inconsistency suggests that zero-shot and role-play prompting may offer complementary strengths rather than one being universally superior. Building on this insight, we propose **Persona Switch**, a novel decoding method that dynamically combines the benefits of both prompting strategies. Our method proceeds step-by-step, selecting the better output between zero-shot and role-play prompting at each step by comparing their output confidence, as measured by the logit gap. Experiments with widely-used LLMs demonstrate that Persona Switch consistently outperforms competitive baselines, achieving up to 5.13% accuracy improvement. Furthermore, we show that output confidence serves as an informative measure for selecting the more reliable output.
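One per-token reading of the switching rule, sketched for illustration: at each step, take the next token from whichever prompting condition is more confident, with confidence measured as the top-1 vs. top-2 logit gap. `next_logits(prompt, generated)` is a placeholder returning a 1-D numpy array of next-token logits; the step granularity and greedy decoding are our simplifications.

```python
import numpy as np

def logit_gap(logits: np.ndarray) -> float:
    """Confidence proxy: difference between the two largest logits."""
    top1, top2 = np.partition(logits, -2)[-2:][::-1]
    return float(top1 - top2)

def persona_switch_decode(next_logits, zero_shot_prompt, persona_prompt, steps=64):
    generated = []
    for _ in range(steps):
        z = next_logits(zero_shot_prompt, generated)
        p = next_logits(persona_prompt, generated)
        chosen = z if logit_gap(z) >= logit_gap(p) else p   # pick the more confident view
        generated.append(int(np.argmax(chosen)))            # greedy next token
    return generated
```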
Revealing the Truth with ConLLM for Detecting Multi-Modal Deepfakes
Gautam Siddharth Kashyap | Harsh Joshi | Niharika Jain | Ebad Shabbir | Jiechao Gao | Nipun Joshi | Usman Naseem
The rapid rise of deepfake technology poses a severe threat to social and political stability by enabling hyper-realistic synthetic media capable of manipulating public perception. However, existing detection methods struggle with two core limitations: (1) modality fragmentation, which leads to poor generalization across diverse and adversarial deepfake modalities; and (2) shallow inter-modal reasoning, resulting in limited detection of fine-grained semantic inconsistencies. To address these, we propose ConLLM (Contrastive Learning with Large Language Models), a hybrid framework for robust multimodal deepfake detection. ConLLM employs a two-stage architecture: stage 1 uses Pre-Trained Models (PTMs) to extract modality-specific embeddings; stage 2 aligns these embeddings via contrastive learning to mitigate modality fragmentation, and refines them using LLM-based reasoning to address shallow inter-modal reasoning by capturing semantic inconsistencies. ConLLM demonstrates strong performance across audio, video, and audio-visual modalities. It reduces audio deepfake EER by up to 50%, improves video accuracy by up to 8%, and achieves approximately 9% accuracy gains in audio-visual tasks. Ablation studies confirm that PTM-based embeddings contribute 9%–10% consistent improvements across modalities. Our code and data are available at: https://github.com/gskgautam/ConLLM/tree/main
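A minimal numpy sketch of the stage-2 contrastive-alignment idea, assuming an InfoNCE-style objective (the standard formulation, not necessarily the paper's exact loss): matched cross-modal embedding pairs are pulled together and mismatched pairs pushed apart. Real training would use a framework with autograd.

```python
import numpy as np

def info_nce(modality_a: np.ndarray, modality_b: np.ndarray, temperature: float = 0.07) -> float:
    """InfoNCE loss over a batch whose positives sit on the diagonal."""
    a = modality_a / np.linalg.norm(modality_a, axis=1, keepdims=True)
    b = modality_b / np.linalg.norm(modality_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                                   # pairwise similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    matched = np.arange(len(a))                                      # positives on the diagonal
    return float(-log_probs[matched, matched].mean())
```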
Detection of Adversarial Prompts with Model Predictive Entropy
Franziska Rubenbauer | Sebastian Steindl | Patrick Levi | Daniel Loebenberger | Ulrich Schäfer
Large Language Models (LLMs) are increasingly deployed in high-impact scenarios, raising concerns about their safety and security. Despite existing defense mechanisms, LLMs remain vulnerable to adversarial attacks. This paper introduces the novel attack-agnostic pipeline SENTRY (semantic entropy-based attack recognition system) for detecting such attacks by leveraging the predictive entropy of model outputs, quantified through the Token-Level Shifting Attention to Relevance (TokenSAR) score, a weighted token entropy measurement. Our approach dynamically identifies adversarial inputs without relying on prior knowledge of attack specifications. It requires only ten newly generated tokens, making it a computationally efficient and adaptable solution. We evaluate the pipeline on multiple state-of-the-art models, including Llama, Vicuna, Falcon, DeepSeek, and Mistral, using a diverse set of adversarial prompts generated via the h4rm31 framework. Experimental results demonstrate a clear separation in TokenSAR scores between benign, malicious, and adversarial prompts. This distinction enables effective threshold-based classification, achieving robust detection performance across various model architectures. Our method outperforms traditional defenses in terms of adaptability and resource efficiency.
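An illustrative weighted token-entropy score in the spirit of a TokenSAR-style measure, followed by threshold-based flagging; the relevance weights and the threshold here are placeholders, not the paper's calibration.

```python
def weighted_token_entropy(token_logprobs, relevance_weights) -> float:
    """Relevance-weighted average negative log-probability of generated tokens."""
    total_weight = sum(relevance_weights) or 1.0
    return sum(-lp * w for lp, w in zip(token_logprobs, relevance_weights)) / total_weight

def looks_adversarial(token_logprobs, relevance_weights, threshold=2.5) -> bool:
    """Flag a prompt whose generation entropy exceeds a tuned threshold."""
    return weighted_token_entropy(token_logprobs, relevance_weights) > threshold
```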
Actors, Frames and Arguments: A Multi-Decade Computational Analysis of Climate Discourse in Financial News using Large Language Models
Ruiran Su | Markus Leippold | Janet B. Pierrehumbert
We curate a 980,061-article corpus of climate-related financial news from the Dow Jones Newswire (2000–2023) and introduce a three-stage Actor–Frame–Argument (AFA) pipeline that uses large language models to extract actors, stances, frames, and argumentative structures. We conduct AFA extraction on a stratified, uncertainty-enriched sample of 4,143 articles that preserves the temporal and thematic distributions of the full corpus. Reliability is established with a 2,000-article human-annotated gold standard and a Decompositional Verification Framework (DVF) that decomposes evaluation into completeness, faithfulness, coherence, and relevance, with multi-judge scoring calibrated against human ratings. Our longitudinal analysis uncovers a structural shift after 2015: coverage transitions from risk and regulatory-burden frames toward economic opportunity and technological innovation; financial institutions and companies increasingly deploy opportunity-centered arguments, while NGOs emphasize environmental urgency and governments stress compliance. Methodologically, we provide a replicable paradigm for longitudinal media analysis with LLMs. For high-stakes domain insights, we map how the financial sector has internalized and reframed the climate crisis across two decades.
RECAP: REwriting Conversations for Intent Understanding in Agentic Planning
Kushan Mitra | Dan Zhang | Hannah Kim | Estevam Hruschka
Understanding user intent is essential for effective planning in conversational assistants, particularly those powered by large language models (LLMs) coordinating multiple agents. However, real-world dialogues are often ambiguous, underspecified, or dynamic, making intent understanding a persistent challenge. Traditional classification-based approaches struggle to generalize in open-ended settings, leading to brittle interpretations and poor downstream planning. We propose RECAP (REwriting Conversations for Agent Planning), a new benchmark designed to evaluate and advance intent rewriting, reframing user-agent dialogues into concise representations of user goals. RECAP captures diverse challenges such as ambiguity, intent drift, vagueness, and mixed-goal conversations. Alongside the dataset, we introduce an LLM-based evaluator that compares planning utility given a user-agent dialogue. Using RECAP, we develop a prompt-based rewriting approach that outperforms baselines in terms of plan preference. We further demonstrate that fine-tuning two DPO-based rewriters yields additional utility gains. Our results highlight intent rewriting as a critical and tractable component for improving agentic planning in open-domain dialogue systems.
Modeling Turn-Taking with Semantically Informed Gestures
Varsha Suresh | M. Hamza Mughal | Christian Theobalt | Vera Demberg
In conversation, humans use multimodal cues, such as speech, gestures, and gaze, to manage turn-taking. While linguistic and acoustic features are informative, gestures provide complementary cues for modeling these transitions. To study this, we introduce DnD Gesture++, an extension of the multi-party DnD Gesture corpus enriched with 2,663 semantic gesture annotations spanning iconic, metaphoric, deictic, and discourse types. Using this dataset, we model turn-taking prediction through a Mixture-of-Experts framework integrating text, audio, and gestures. Experiments show that incorporating semantically guided gestures yields consistent performance gains over baselines, demonstrating their complementary role in multimodal turn-taking.
Do Large Language Models Reflect Demographic Pluralism in Safety?
Usman Naseem | Gautam Siddharth Kashyap | Sushant Kumar Ray | Rafiq Ali | Ebad Shabbir | Abdullah Mohammad
Large Language Model (LLM) safety is inherently pluralistic, reflecting variations in moral norms, cultural expectations, and demographic contexts. Yet, existing alignment datasets such as Anthropic-HH and DICES rely on demographically narrow annotator pools, overlooking variation in safety perception across communities. Demo-SafetyBench addresses this gap by modeling demographic pluralism directly at the prompt level, decoupling value framing from responses. In Stage I, prompts from DICES are reclassified into 14 safety domains (adapted from BeaverTails) using Mistral-7B-Instruct-v0.3, retaining demographic metadata and expanding low-resource domains via Llama-3.1-8B-Instruct with SimHash-based deduplication, yielding 43,050 samples. In Stage II, pluralistic sensitivity is evaluated using LLMs-as-Raters—Gemma-7B, GPT-4o, and LLaMA-2-7B—under zero-shot inference. Balanced thresholds (delta = 0.5, tau = 10) achieve high reliability (ICC = 0.87) and low demographic sensitivity (DS = 0.12), confirming that pluralistic safety evaluation can be both scalable and demographically robust. Code and data available at: https://github.com/usmaann/Demo-SafetyBench
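A sketch of the SimHash-based deduplication step mentioned above: fingerprint each prompt with 64 bits and drop items within a small Hamming radius of anything already kept. The whitespace tokenizer and the radius are our simplifications, not the paper's settings.

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Standard SimHash fingerprint over whitespace tokens."""
    counts = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, c in enumerate(counts) if c > 0)

def deduplicate(texts, radius: int = 3):
    """Keep a text only if no kept fingerprint is within the Hamming radius."""
    kept, fingerprints = [], []
    for text in texts:
        fp = simhash(text)
        if all(bin(fp ^ other).count("1") > radius for other in fingerprints):
            kept.append(text)
            fingerprints.append(fp)
    return kept
```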
Adversarial Decoding: Generating Readable Documents for Adversarial Objectives
Collin Zhang | Tingwei Zhang | Vitaly Shmatikov
We design, implement, and evaluate adversarial decoding, a new, generic text generation technique that produces readable documents for adversarial objectives such as RAG poisoning, jailbreaking, and evasion of defensive filters. Prior generation methods either produce easily detectable gibberish (even methods that optimize for low perplexity), or cannot handle objectives that include embedding similarity. In particular, they cannot produce readable adversarial documents that (1) are retrieved by RAG systems in response to broad classes of queries, and (2) adversarially influence subsequent generation. We measure the effectiveness of adversarial decoding for different objectives and demonstrate that it outperforms existing methods while producing adversarial documents that cannot be automatically distinguished from natural documents by fluency and readability.
MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Dialogue Evaluators
John Mendonça | Alon Lavie | Isabel Trancoso
Evaluating the quality of open-domain chatbots has become increasingly reliant on LLMs acting as automatic judges. However, existing meta-evaluation benchmarks are static, outdated, and lacking in multilingual coverage, limiting their ability to fully capture subtle weaknesses in evaluation. We introduce MEDAL, an automated multi-agent framework for curating more representative and diverse open-domain dialogue evaluation benchmarks. Our approach leverages several LLMs to generate user-chatbot multilingual dialogues, conditioned on varied seed contexts. Then, a state-of-the-art LLM (GPT-4.1) is used for a multidimensional analysis of the performance of the chatbots, uncovering noticeable cross-lingual performance differences. Guided by this large-scale evaluation, we curate a new meta-evaluation multilingual benchmark and human-annotate samples with nuanced quality judgments. This benchmark is then used to assess the ability of several reasoning and non-reasoning LLMs to act as evaluators of open-domain dialogues. Using MEDAL, we uncover that state-of-the-art judges fail to reliably detect nuanced issues such as lack of empathy, common sense, or relevance.
Which Works Best for Vietnamese? A Practical Study of Information Retrieval Methods across Domains
Long S. T. Nguyen | Tho Quan
Large Language Models (LLMs) have achieved remarkable progress, yet their reliance on parametric knowledge often leads to hallucinations. Retrieval-Augmented Generation (RAG) mitigates this issue by grounding outputs in external documents, where the quality of retrieval is critical. While retrieval methods have been widely benchmarked in English, it remains unclear which approaches are most effective for Vietnamese, a language characterized by informal queries, noisy documents, and limited resources. Prior studies are restricted to clean datasets or narrow domains, leaving fragmented insights. To the best of our knowledge, we present the first comprehensive benchmark of retrieval methods for Vietnamese across multiple real-world domains. We systematically compare lexical, dense, and hybrid methods on datasets spanning education, legal, healthcare, customer support, lifestyle, and Wikipedia, and introduce two new datasets capturing authentic educational counseling and customer service interactions. Beyond reporting benchmark numbers, we distill a set of empirical insights that clarify trade-offs, highlight domain-specific challenges, and provide practical guidance for building robust Vietnamese QA systems. Together, these contributions offer the first large-scale, practice-oriented perspective on Vietnamese retrieval and inform both academic research and real-world deployment in low-resource languages. All datasets and evaluation scripts are available at https://github.com/longstnguyen/ViRE.
MemeWeaver: Inter-Meme Graph Reasoning for Sexism and Misogyny Detection
Paolo Italiani | David Gimeno-Gómez | Luca Ragazzi | Gianluca Moro | Paolo Rosso
Women are twice as likely as men to face online harassment due to their gender. Despite recent advances in multimodal content moderation, most approaches still overlook the social dynamics behind this phenomenon, where perpetrators reinforce prejudices and group identity within like-minded communities. Graph-based methods offer a promising way to capture such interactions, yet existing solutions remain limited by heuristic graph construction, shallow modality fusion, and instance-level reasoning. In this work, we present MemeWeaver, an end-to-end trainable multimodal framework for detecting sexism and misogyny through a novel inter-meme graph reasoning mechanism. We systematically evaluate multiple visual-textual fusion strategies and show that our approach consistently outperforms state-of-the-art baselines on the MAMI and EXIST benchmarks, while achieving faster training convergence. Further analyses reveal that the learned graph structure captures semantically meaningful patterns, offering valuable insights into the relational nature of online hate.
SEAM: Bridging the Temporal-Semantic Granularity Gap for LLM-based Speech Recognition
Junseok Oh | Ji-Hwan Kim
Speech-LLM integration faces a temporal-semantic granularity gap: speech representations scale with temporal duration while text tokens scale with semantic content. Existing duration-based methods generate embeddings at fixed rates, creating distributional mismatch with LLM pre-training. We propose SEAM (Speech Encoder-Decoder Alignment Module), an encoder-decoder architecture employing variable-rate generation through cross-attention between speech features and text embeddings. SEAM produces embeddings at adaptive rates that closely match natural text distributions while preserving pre-trained knowledge by freezing both speech encoder and LLM. We introduce a multi-stage training strategy and First Token Guidance to improve initial token prediction. SEAM achieves competitive performance on LibriSpeech (2.6%/5.2% WER). More significantly, trained only on LibriSpeech (960h), SEAM achieves 4.7% WER on cross-domain TED-LIUM-v2, demonstrating that integrating LLM’s linguistic knowledge enables effective generalization beyond limited speech training data.
Foundations of LLM Knowledge Materialization: Termination, Reproducibility, Robustness
Luca Giordano | Simon Razniewski
Large Language Models (LLMs) encode substantial factual knowledge, yet measuring and systematizing this knowledge remains challenging. Converting it into structured format—for example through recursive extraction approaches such as the GPTKB methodology (Hu et al., 2025b)—is still underexplored. Key open questions include whether such extraction can terminate, whether its outputs are reproducible, and how robust they are to variations. We systematically study LLM knowledge materialization using miniGPTKBs (domain-specific, tractable subcrawls), analyzing termination, reproducibility, and robustness across three categories of metrics: yield, lexical similarity, and semantic similarity. We experiment with four variations (seed, language, randomness, model) and three illustrative domains (from history, entertainment, and finance). Our findings show (i) high termination rates, though model-dependent; (ii) mixed reproducibility; and (iii) robustness that varies by perturbation type—high for seeds and temperature, lower for languages and models. These results suggest that LLM knowledge materialization can reliably surface core knowledge, while also revealing important limitations.
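A toy sketch of the kind of recursive knowledge-materialization crawl studied here: starting from a seed entity, elicit (subject, relation, object) triples and recurse into unseen objects until the frontier empties or a budget is hit. `elicit_triples` is a placeholder for the LLM call; the budget and traversal order are our assumptions.

```python
from collections import deque

def crawl(seed_entity, elicit_triples, budget: int = 1000):
    """Breadth-first recursive extraction with a simple termination signal."""
    seen, triples, frontier = {seed_entity}, [], deque([seed_entity])
    while frontier and len(seen) < budget:
        entity = frontier.popleft()
        for subject, relation, obj in elicit_triples(entity):
            triples.append((subject, relation, obj))
            if obj not in seen:               # only recurse into new entities
                seen.add(obj)
                frontier.append(obj)
    terminated = not frontier                 # True if the crawl emptied itself
    return triples, terminated
```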
Investigating Gender Stereotypes in Large Language Models via Social Determinants of Health
Trung Hieu Ngo | Adrien Bazoge | Solen Quiniou | Pierre-Antoine Gourraud | Emmanuel Morin
Trung Hieu Ngo | Adrien Bazoge | Solen Quiniou | Pierre-Antoine Gourraud | Emmanuel Morin
Large Language Models (LLMs) excel in Natural Language Processing (NLP) tasks, but they often propagate biases embedded in their training data, which can have serious consequences in sensitive domains like healthcare. While existing benchmarks evaluate biases related to individual social determinants of health (SDoH) such as gender or ethnicity, they often overlook interactions between these factors and lack context-specific assessments. This study investigates bias in LLMs by probing the relationships between gender and other SDoH in French patient records. Through a series of experiments, we found that embedded stereotypes can be probed using SDoH input and that LLMs rely on embedded stereotypes to make gendered decisions, suggesting that evaluating interactions among SDoH factors could usefully complement existing approaches to assessing LLM performance and bias.
FOL-Traces: Verified First-Order Logic Reasoning Traces at Scale
Isabelle Lee | Sarah Liaw | Dani Yogatama
Isabelle Lee | Sarah Liaw | Dani Yogatama
Reasoning in language models is difficult to evaluate: natural-language traces are unverifiable, symbolic datasets are too small, and most benchmarks conflate heuristics with inference. We present FOL-Traces, the first large-scale dataset of programmatically verified reasoning traces, enabling rigorous evaluation of structured logical inference. We also propose two challenging and comprehensive diagnostic tasks—masked operation prediction and step completion—that directly probe syntactic awareness and process fidelity. FOL-Traces serves as a scalable testbed for rigorously studying how models perform structured logical inference. Systematic experiments with 5 reasoning LLMs show that the dataset remains challenging: models only reach around 45.7% accuracy on masked operation prediction and around 27% on two-step completion.
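As a rough illustration of what "programmatically verified" steps and the masked operation prediction task might look like, here is a toy Python sketch; the trace representation and rule set are invented for this example and are not the dataset's actual format.

```python
# Hypothetical trace-step representation: (premises, rule, conclusion).
RULES = {
    # from P -> Q and P, derive Q
    "modus_ponens": lambda prem: prem[0][1] if prem[0][0] == prem[1] else None,
}

def verify_step(premises, rule, conclusion):
    """A step is accepted only if applying `rule` to `premises` reproduces `conclusion`."""
    return RULES[rule](premises) == conclusion

step = {"premises": [("P", "Q"), "P"], "rule": "modus_ponens", "conclusion": "Q"}
print(verify_step(**step))  # True: the step is machine-checkable

# Masked operation prediction: hide the rule name and ask a model to recover it.
masked_step = {**step, "rule": "<MASK>"}
print(masked_step)
```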
Uncertainty Quantification for Evaluating Gender Bias in Machine Translation
Ieva Staliunaite | Julius Cheng | Andreas Vlachos
Ieva Staliunaite | Julius Cheng | Andreas Vlachos
The predictive uncertainty of machine translation (MT) models is typically used as a quality estimation proxy. In this work, we posit that apart from confidently translating when a single correct translation exists, models should also maintain uncertainty when the input is ambiguous. We use uncertainty to measure gender bias in MT systems. When the source sentence includes a lexeme whose gender is not overtly marked, but whose target-language equivalent requires gender specification, the model must infer the appropriate gender from the context and can be susceptible to biases. Prior work measured bias via gender accuracy; however, this metric cannot be applied to ambiguous cases. Using semantic uncertainty, we are able to assess bias when translating both ambiguous and unambiguous source sentences, and find that high translation accuracy does not correlate with exhibiting uncertainty appropriately, and that debiasing affects the two cases differently.
Reward models are pivotal for aligning Large Language Models (LLMs) with human preferences. Existing approaches face two key limitations: Discriminative reward models require large-scale annotated data, as they cannot exploit the preference instruction-following capability of LLMs available to generative reward models. Moreover, reward models are particularly prone to reward overoptimization, where LLMs exploit weaknesses in the reward function instead of improving true alignment. We introduce PIRA, a training paradigm that integrates three complementary strategies to address these challenges: (1) reformulating question–answer pairs into preference-task instructions to explicitly leverage LLMs’ preference instruction-following capability, (2) averaging the rewards aggregated from diverse preference-task instructions for each sample, which mitigates task-specific bias and enhances robustness across evaluation perspectives, and (3) averaging outputs from the value head under different dropout rates to stabilize reward estimation. Experiments on public datasets show that PIRA improves performance considerably, enhances generalization, and effectively mitigates reward overoptimization.
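A minimal sketch of averaging strategies (2) and (3) is shown below, assuming the reward model is exposed through a hypothetical `score_with_dropout` function; the instruction templates are placeholders, not the paper's prompts.

```python
import random

TEMPLATES = [
    "Question: {q}\nWhich answer is better?\nA: {a}\nB: {b}\nAnswer A or B.",
    "Q: {q}\nCandidate 1: {a}\nCandidate 2: {b}\nPick the more helpful candidate.",
    "You are a strict grader. Q: {q}\nResponse A: {a}\nResponse B: {b}\nWhich wins?",
]

def score_with_dropout(prompt: str, dropout: float) -> float:
    """Stand-in for a value-head reward computed under a given dropout rate."""
    random.seed(hash((prompt, dropout)) % (2 ** 32))
    return random.random()

def aggregated_reward(question, answer_a, answer_b, dropout_rates=(0.0, 0.1, 0.2)):
    """Average rewards over preference-task instructions and dropout rates."""
    rewards = [score_with_dropout(t.format(q=question, a=answer_a, b=answer_b), r)
               for t in TEMPLATES for r in dropout_rates]
    return sum(rewards) / len(rewards)

print(round(aggregated_reward("Capital of France?", "Paris.", "Lyon."), 3))
```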
The Hidden Bias: A Study on Explicit and Implicit Political Stereotypes in Large Language Models
Konrad Löhr | Shuzhou Yuan | Michael Färber
Konrad Löhr | Shuzhou Yuan | Michael Färber
Large Language Models (LLMs) are increasingly integral to information dissemination and decision-making processes. Given their growing societal influence, understanding potential biases, particularly within the political domain, is crucial to prevent undue influence on public opinion and democratic processes. This work investigates political bias and stereotype propagation across eight prominent LLMs using the two-dimensional Political Compass Test (PCT). Initially, the PCT is employed to assess the inherent political leanings of these models. Subsequently, persona prompting with the PCT is used to explore explicit stereotypes across various social dimensions. In a final step, implicit stereotypes are uncovered by evaluating models with multilingual versions of the PCT. Key findings reveal a consistent left-leaning political alignment across all investigated models. Furthermore, while the nature and extent of stereotypes vary considerably between models, implicit stereotypes elicited through language variation are more pronounced than those identified via explicit persona prompting. Interestingly, for most models, implicit and explicit stereotypes show a notable alignment, suggesting a degree of transparency or "awareness" regarding their inherent biases. This study underscores the complex interplay of political bias and stereotypes in LLMs.
Massively multilingual language models enable cross-lingual generalization but underperform on low-resource and unseen languages. While adapter-based fine-tuning offers a parameter-efficient solution, training language-specific adapters at scale remains costly. We introduce Typologically Informed Parameter Aggregation (TIPA), a training-free framework that constructs proxy language adapters by aggregating existing ones, weighted by typological similarity. Integrated into the MAD-X architecture, these proxies enable zero-shot cross-lingual transfer without additional training. We evaluate TIPA on five NLP tasks and over 230 languages. TIPA consistently outperforms baselines such as English-only fine-tuning and selecting the typologically closest-language adapter, with the largest gains for languages lacking dedicated adapters. Our results demonstrate that typologically informed aggregation provides a viable alternative to language-specific modules without any training needed.
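The aggregation step can be pictured as a similarity-weighted average of adapter parameters. The sketch below is our reading of the idea, with a hypothetical `top_k` donor cutoff and toy similarity scores standing in for the paper's typological features.

```python
from typing import Dict, List

def aggregate_proxy_adapter(
    adapters: Dict[str, Dict[str, List[float]]],
    typ_similarity: Dict[str, float],
    top_k: int = 3,
) -> Dict[str, List[float]]:
    """Build a proxy adapter for an unseen language as a similarity-weighted
    average of existing language adapters (training-free)."""
    # keep the k most typologically similar donor languages and normalize weights
    donors = sorted(typ_similarity, key=typ_similarity.get, reverse=True)[:top_k]
    total = sum(typ_similarity[lang] for lang in donors)
    weights = {lang: typ_similarity[lang] / total for lang in donors}

    proxy = {name: None for name in next(iter(adapters.values()))}
    for lang, w in weights.items():
        for name, values in adapters[lang].items():
            if proxy[name] is None:
                proxy[name] = [w * v for v in values]
            else:
                proxy[name] = [acc + w * v for acc, v in zip(proxy[name], values)]
    return proxy

# toy usage: two donor adapters with a single 3-dim parameter each
adapters = {"deu": {"down_proj": [1.0, 0.0, 2.0]},
            "nld": {"down_proj": [0.0, 1.0, 2.0]}}
sim = {"deu": 0.8, "nld": 0.2}  # toy typological similarity to the target language
print(aggregate_proxy_adapter(adapters, sim, top_k=2))
```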
Can Calibration of Positional Encodings Enhance Long Context Utilization?
Tom Zehle | Matthias Aßenmacher
Tom Zehle | Matthias Aßenmacher
Large language models suffer from positional biases like the "Lost in the Middle" (LiM) phenomenon and recency bias, which reduce the effective utilization of long contexts. In this work, we investigate the role of Positional Encodings in this context. Our empirical study confirms the persistence of these biases in modern large language models. Drawing on these findings, we introduce Caliope, a training-free framework for calibrating Positional Encodings at inference time. Our calibrators yield substantial improvements on needle-in-a-haystack and cross-chunk reasoning benchmarks, and offer a practical, lightweight method for improving long-context utilization.
FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition
Jonas Golde | Patrick Haller | Alan Akbik
Jonas Golde | Patrick Haller | Alan Akbik
Recent multilingual named entity recognition (NER) work has shown that large language models (LLMs) can provide effective synthetic supervision, yet such datasets have mostly appeared as by-products of broader experiments rather than as systematic, reusable resources. We introduce FiNERweb, a dataset-creation pipeline that scales the teacher-student paradigm to 91 languages and 25 scripts. Building on FineWeb-Edu, our approach trains regression models to identify NER-relevant passages and annotates them with multilingual LLMs, resulting in about 225k passages with 235k distinct entity labels. Our experiments show that the regression model achieves more than 84 F1, and that models trained on FiNERweb obtain comparable or improved performance in zero-shot transfer settings on English, Thai, and Swahili, despite being trained on 19x less data than strong baselines. In addition, we assess annotation quality using LLM-as-a-judge and observe consistently high scores for both faithfulness (3.99/5) and completeness (4.05/5), indicating reliable and informative annotations. Further, we release the dataset with both English labels and translated label sets in the respective target languages because we observe that the performance of current state-of-the-art models drops by 0.02-0.09 F1 when evaluated using target language labels instead of English ones. We release FiNERweb together with all accompanying artifacts to the research community in order to facilitate more effective student-teacher training for multilingual named entity recognition.
Bias in the East, Bias in the West: A Bilingual Analysis of LLM Political Bias on U.S.- and China-Related Issues
Ying Ying Lim | Paul Röttger
Ying Ying Lim | Paul Röttger
Large language models (LLMs) can exhibit political biases, which creates a risk of undue influence on LLM users and public opinion. Yet despite LLMs being used across the world, there is little evidence on how political biases vary across languages. And despite a growing number of frontier LLMs (e.g., DeepSeek) released by non-U.S. organizations, there is limited understanding of how political biases vary across LLMs developed in different political contexts. To address these gaps, we measure LLM bias on U.S.- and China-related issues, and how bias varies by 1) prompt language (English vs. Chinese) and 2) model origin (U.S. vs. Chinese). For this purpose, we create a new parallel dataset of 36k realistic test prompts asking models to write about a balanced set of 60 political issues sourced from national U.S. and Chinese news outlets. Using this dataset, we show that both model origin and prompt language systematically influence bias. Language effects dominate on China-related issues, particularly those involving sovereignty and human rights, while model origin better predicts variation in bias on U.S.-related governance and foreign policy topics. Overall, our results highlight a need for language- and context-specific measurement of LLM political bias.
Ask Me Again Differently: GRAS for Measuring Bias in Vision Language Models on Gender, Race, Age, and Skin Tone
Shaivi Malik | Hasnat Md Abdullah | Sriparna Saha | Amit Sheth
Shaivi Malik | Hasnat Md Abdullah | Sriparna Saha | Amit Sheth
As Vision Language Models (VLMs) become integral to real-world applications, understanding their demographic biases is critical. We introduce GRAS, a benchmark for uncovering demographic biases in VLMs across gender, race, age, and skin tone, offering the most diverse coverage to date. We further propose the GRAS Bias Score, an interpretable metric for quantifying bias. We benchmark five state-of-the-art VLMs and reveal concerning bias levels, with the least biased model attaining a GRAS Bias Score of 98, far from the unbiased ideal of 0. Our findings also reveal a methodological insight: evaluating bias in VLMs with visual question answering (VQA) requires considering multiple formulations of a question. Our code, data, and evaluation results are publicly available at https://github.com/shaivimalik/gras_bias_bench
A Simple and Efficient Learning-Style Prompting for LLM Jailbreaking
Xuan Luo | Yue Wang | Zefeng He | Geng Tu | Jing Li | Ruifeng Xu
Xuan Luo | Yue Wang | Zefeng He | Geng Tu | Jing Li | Ruifeng Xu
This study reveals a critical safety blind spot in modern LLMs: learning-style queries, which closely resemble ordinary educational questions, can reliably elicit harmful responses. The learning-style queries are constructed by a novel reframing paradigm: HILL (Hiding Intention by Learning from LLMs). The deterministic, model-agnostic reframing framework is composed of 4 conceptual components: 1) key concept, 2) exploratory transformation, 3) detail-oriented inquiry, and optionally 4) hypotheticality. Further, new metrics are introduced to thoroughly evaluate the efficiency and harmfulness of jailbreak methods. Experiments on the AdvBench dataset across a wide range of models demonstrate HILL’s strong generalizability. It achieves top attack success rates on the majority of models and across malicious categories while maintaining high efficiency with concise prompts. On the other hand, results of various defense methods show the robustness of HILL, with most defenses having mediocre effects or even increasing the attack success rates. In addition, the assessment of defenses on the constructed safe prompts reveals inherent limitations of LLMs’ safety mechanisms and flaws in the defense methods. This work exposes significant vulnerabilities of safety measures against learning-style elicitation, highlighting a critical challenge of fulfilling both helpfulness and safety alignments.
Recent advancements in Large Language Models (LLMs) have shown promise for automated data annotation, yet reliance on expensive commercial models like GPT-4 limits accessibility. This paper rigorously evaluates the potential of open-source smaller LLMs (sLLMs) as a cost-effective alternative. We introduce a new benchmark dataset, Multidisciplinary Open Research Data (MORD), comprising 12,277 annotated sentence segments from 1,500 scholarly articles across five research domains, to systematically assess sLLM performance. Our experiments demonstrate that sLLMs achieve annotation quality surpassing Amazon MTurk workers and approach GPT-4’s accuracy at significantly lower costs. We further propose to build the Crowd of LLMs, which aggregates annotations from multiple sLLMs using label aggregation algorithms. This approach not only outperforms individual sLLMs but also reveals that combining sLLM annotations with human crowd labels yields superior results compared to either method alone. Our findings highlight the viability of sLLMs for democratizing high-quality data annotation while underscoring the need for tailored aggregation methods to fully realize their potential.
Representation Collapse in Machine Translation Through the Lens of Angular Dispersion
Evgeniia Tokarchuk | Maya K. Nachesa | Sergey Troshin | Vlad Niculae
Evgeniia Tokarchuk | Maya K. Nachesa | Sergey Troshin | Vlad Niculae
Modern neural translation models based on the Transformer architecture are known for their high performance, particularly when trained on high-resource datasets. A standard next-token prediction training strategy, while widely adopted in practice, may lead to overlooked artifacts such as representation collapse. Previous works have shown that this problem is especially pronounced in the representation of the deeper Transformer layers, where it often fails to efficiently utilize the geometric space. Representation collapse is even more evident in end-to-end training of continuous-output neural machine translation, where the trivial solution would be to set all vectors to the same value. In this work, we analyze the dynamics of representation collapse at different levels of discrete and continuous NMT transformers throughout training. We incorporate an existing regularization method based on angular dispersion and demonstrate empirically that it not only mitigates collapse but also improves translation quality. Furthermore, we show that quantized models exhibit similar collapse behavior and that the benefits of regularization are preserved even after quantization.
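As an illustration of an angular-dispersion regularizer, the following PyTorch sketch penalizes the mean pairwise cosine similarity of token representations from a chosen layer; the exact formulation used in the paper may differ.

```python
import torch

def angular_dispersion_loss(hidden: torch.Tensor) -> torch.Tensor:
    """Penalize low angular dispersion: the mean pairwise cosine similarity of
    token representations. Added (scaled) to the NMT loss, this term pushes
    vectors apart on the hypersphere instead of letting them collapse.

    hidden: (num_tokens, dim) activations from a chosen Transformer layer.
    """
    unit = torch.nn.functional.normalize(hidden, dim=-1)
    sim = unit @ unit.T                       # pairwise cosine similarities
    n = sim.size(0)
    off_diag = sim.sum() - sim.diagonal().sum()
    return off_diag / (n * (n - 1))           # mean over i != j

# toy usage on random representations
reps = torch.randn(8, 16)
print(float(angular_dispersion_loss(reps)))
```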
Can LLMs Reason Like Doctors? Exploring the Limits of Large Language Models in Complex Medical Reasoning
Flavio Merenda | Jose Manuel Gomez-Perez | German Rigau
Flavio Merenda | Jose Manuel Gomez-Perez | German Rigau
Large language models (LLMs) have shown remarkable progress in reasoning across multiple domains. However, it remains unclear whether their abilities reflect genuine reasoning or sophisticated pattern matching, a distinction critical in medical decision-making, where reliable multi-step problem-solving is required. Accordingly, we conduct one of the largest evaluations to date, assessing 77 LLMs with diverse fine-tuning approaches, ranging from 1 billion parameters to frontier models. Guided by medical problem-solving theory, we select three medical question answering (QA) benchmarks targeting key reasoning skills: reasoning processes, susceptibility to cognitive biases, and metacognitive abilities. Additionally, we manually annotate a subset of questions to assess the abduction, deduction, and induction capabilities of LLMs, offering detailed insight into the reasoning mechanisms followed by physicians, an aspect that has received relatively limited attention in this domain. Most models, particularly smaller ones, struggle even with specialized fine-tuning or advanced prompting. Larger models perform better but still show clear limitations in complex medical reasoning. Our findings highlight the need to improve specific reasoning strategies to better reflect medical decision-making. The datasets and code used in this study are publicly available at: https://github.com/expertailab/Can-LLMs-Reason-Like-Doctors
Testing Low-Resource Language Support in LLMs Using Language Proficiency Exams: the Case of Luxembourgish
Cedric Lothritz | Jordi Cabot | Laura Bernardy
Cedric Lothritz | Jordi Cabot | Laura Bernardy
Large Language Models (LLMs) have become an increasingly important tool in research and society at large. While LLMs are regularly used all over the world by experts and laypeople alike, they are predominantly developed with English-speaking users in mind, performing well in English and other widespread languages while less-resourced languages such as Luxembourgish are seen as a lower priority. This lack of attention is also reflected in the sparsity of available evaluation tools and datasets. In this study, we investigate the viability of language proficiency exams as such evaluation tools for the Luxembourgish language. We find that large models such as Claude and DeepSeek-R1 typically achieve high scores, while smaller models show weak performance. We also find that performance on such language exams can be used to predict performance on other NLP tasks in Luxembourgish.
Unveiling Decision-Making in LLMs for Text Classification: Extraction of influential and interpretable concepts with Sparse Autoencoders
Mathis Le Bail | Jérémie Dentan | Davide Buscaldi | Vanier Sonia
Mathis Le Bail | Jérémie Dentan | Davide Buscaldi | Vanier Sonia
Sparse Autoencoders (SAEs) have been successfully used to probe Large Language Models (LLMs) and extract interpretable concepts from their internal representations. These concepts are linear combinations of neuron activations that correspond to human-interpretable features. In this paper, we investigate the effectiveness of SAE-based explainability approaches for sentence classification, a domain where such methods have not been extensively explored. We present a novel SAE-based model ClassifSAE tailored for text classification, leveraging a specialized classifier head and incorporating an activation rate sparsity loss. We benchmark this architecture against established methods such as ConceptShap, Independent Component Analysis, HI-Concept and a standard TopK-SAE baseline. Our evaluation covers several classification benchmarks and backbone LLMs. We further enrich our analysis with two novel metrics for measuring the precision of concept-based explanations, using an external sentence encoder. Our empirical results show that ClassifSAE improves both the causality and interpretability of the extracted features.
TextMineX: Data, Evaluation Framework and Ontology-guided LLM Pipeline for Humanitarian Mine Action
Chenyue Zhou | Gürkan Solmaz | Flavio Cirillo | Kiril Gashteovski | Jonathan Fürst
Chenyue Zhou | Gürkan Solmaz | Flavio Cirillo | Kiril Gashteovski | Jonathan Fürst
Humanitarian Mine Action (HMA) addresses the challenge of detecting and removing landmines from conflict regions. Much of the life-saving operational knowledge produced by HMA agencies is buried in unstructured reports, limiting the transferability of information between agencies. To address this issue, we propose TextMineX: the first dataset, evaluation framework and ontology-guided large language model (LLM) pipeline for knowledge extraction from text in the HMA domain. TextMineX structures HMA reports into (subject, relation, object)-triples, thus creating domain-specific knowledge. To ensure real-world relevance, we utilized the dataset from our collaborator Cambodian Mine Action Centre (CMAC). We further introduce a bias-aware evaluation framework that combines human-annotated triples with an LLM-as-Judge protocol to mitigate position bias in reference-free scoring. Our experiments show that ontology-aligned prompts improve extraction accuracy by up to 44.2%, reduce hallucinations by 22.5%, and enhance format adherence by 20.9% compared to baseline models. We publicly release the dataset and code.
MathMist: A Parallel Multilingual Benchmark Dataset for Mathematical Problem Solving and Reasoning
Mahbub E Sobhani | Md. Faiyaz Abdullah Sayeedi | Tasnim Mohiuddin | Md Mofijul Islam | Swakkhar Shatabda
Mahbub E Sobhani | Md. Faiyaz Abdullah Sayeedi | Tasnim Mohiuddin | Md Mofijul Islam | Swakkhar Shatabda
Mathematical reasoning remains one of the most challenging domains for large language models (LLMs), requiring not only linguistic understanding but also structured logical deduction and numerical precision. While recent LLMs demonstrate strong general-purpose reasoning abilities, their mathematical competence across diverse languages remains underexplored. Existing benchmarks primarily focus on English or a narrow subset of high-resource languages, leaving significant gaps in assessing multilingual and cross-lingual mathematical reasoning. To address this, we introduce MathMist, a parallel multilingual benchmark for mathematical problem solving and reasoning. MathMist encompasses 2,890 parallel Bangla-English gold standard artifacts, totaling ≈30K aligned question–answer pairs across thirteen languages, representing an extensive coverage of high-, medium-, and low-resource linguistic settings. The dataset captures linguistic variety, multiple types of problem settings, and solution synthesizing capabilities. We systematically evaluate a diverse suite of models, including open-source small and medium LLMs, proprietary systems, and multilingual-reasoning-focused models under zero-shot, chain-of-thought (CoT), perturbed reasoning, and code-switched reasoning paradigms. Our results reveal persistent deficiencies in LLMs’ ability to perform consistent and interpretable mathematical reasoning across languages, with pronounced degradation in low-resource settings. All the codes and data are available at GitHub: https://github.com/mahbubhimel/MathMist
Enhancing Reliability in Community Question Answering with an Expert-Oriented RAG System
Seyyede Zahra Aftabi | Saeed Farzi
Seyyede Zahra Aftabi | Saeed Farzi
In recent years, pre-trained large language models (LLMs) have become a cornerstone for automatically generating answers in question-and-answer (Q&A) communities, significantly reducing user wait times and improving response quality. However, these models require substantial computational resources and are prone to generating hallucinated or unreliable content. To overcome these limitations, we propose an advanced expert-oriented Retrieval-Augmented Generation (RAG) framework as a cost-effective and reliable alternative. Central to our approach is a user-aware question entailment recognition module, which leverages user modeling to identify archived questions with answers that fully or partially address the user’s new query. This user modeling significantly improves retrieval relevance, resulting in reduced hallucination and enhanced answer quality. The framework synthesizes expert-written answers from similar questions to generate unified responses. Experimental results on the CQADupStack and SE-PQA datasets show the superiority of our user-aware approach over its user-agnostic counterpart, with ROUGE-1 gains of 3.6% and 0.9%. Both human and AI evaluations confirm the effectiveness of incorporating user modeling in minimizing hallucination and delivering contextually appropriate answers, demonstrating its potential for real-world Q&A systems. The code and data are available on a GitHub repository at https://anonymous.4open.science/r/User-Oriented-RAG-CQA.
Unsupervised Text Style Transfer for Controllable Intensity
Shuhuan Gu | Wenbiao Tao | Xinchen Ma | Kangkang He | Ye Guo | Xiang Li | Yunshi Lan
Shuhuan Gu | Wenbiao Tao | Xinchen Ma | Kangkang He | Ye Guo | Xiang Li | Yunshi Lan
Unsupervised Text Style Transfer (UTST) aims to build a system to transfer the stylistic properties of a given text without parallel text pairs. Compared with text transfer between style polarities, UTST for controllable intensity is more challenging due to the subtle differences in stylistic features across different intensity levels. Faced with the challenges posed by the lack of parallel data and the indistinguishability between adjacent intensity levels, we propose an SFT-then-PPO paradigm to fine-tune an LLM. We first fine-tune the LLM with synthesized parallel data. Then, we further train the LLM with PPO, where the rewards are elaborately designed for distinguishing the stylistic intensity in hierarchical levels. Both the global and local stylistic features are considered to formulate the reward functions. The experiments on two UTST benchmarks showcase that both rewards have their advantages and applying them to LLM fine-tuning can effectively improve the performance of an LLM backbone based on various evaluation metrics. Even for adjacent levels of intensity, we can still observe a noticeable stylistic difference among the generated text across these levels.
SchemaGraphSQL: Efficient Schema Linking with Pathfinding Graph Algorithms for Text-to-SQL on Large-Scale Databases
AmirHossein Safdarian | Milad Mohammadi | Ehsan Jahanbakhsh Bashirloo | Mona Shahamat Naderi | Heshaam Faili
AmirHossein Safdarian | Milad Mohammadi | Ehsan Jahanbakhsh Bashirloo | Mona Shahamat Naderi | Heshaam Faili
Text-to-SQL systems translate natural language questions into executable SQL queries, and recent progress with large language models (LLMs) has driven substantial improvements in this task. Schema linking remains a critical component in Text-to-SQL systems, reducing prompt size for models with narrow context windows and sharpening model focus even when the entire schema fits. We present a zero-shot, training-free schema linking approach that first constructs a schema graph based on foreign key relations, then uses a single prompt to a lightweight LLM to extract source and destination tables from the user query, followed by applying classical path-finding algorithms and post-processing to identify the optimal sequence of tables and columns that should be joined, enabling the LLM to generate more accurate SQL queries. To handle real-world databases where foreign keys may be missing or inconsistent, we further propose an LLM-guided joinability discovery step that infers table connections before graph construction, ensuring robustness across diverse schemas. Despite being simple, cost-effective, and highly scalable, our method achieves state-of-the-art results on both the BIRD and Spider 2.0 benchmarks, outperforming previous specialized, fine-tuned, and complex multi-step LLM-based approaches.
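The graph step can be reproduced in a few lines once the foreign-key edges and the LLM-extracted endpoint tables are available. The sketch below uses a toy schema and networkx; the endpoint-extraction prompt and the joinability-discovery step are omitted.

```python
import networkx as nx

# Toy schema: foreign keys as (child_table, parent_table) pairs.
foreign_keys = [
    ("orders", "customers"),
    ("order_items", "orders"),
    ("order_items", "products"),
]

schema_graph = nx.Graph()
schema_graph.add_edges_from(foreign_keys)

# Suppose a lightweight LLM extracted these endpoints from the question
# "Which customers bought product X?"
source, destination = "customers", "products"

# Classical pathfinding yields the join path to surface in the SQL prompt.
join_path = nx.shortest_path(schema_graph, source, destination)
print(join_path)  # ['customers', 'orders', 'order_items', 'products']
```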
Binary Token-Level Classification with DeBERTa for All-Type MWE Identification: A Lightweight Approach with Linguistic Enhancement
Diego Rossini | Lonneke Van Der Plas
Diego Rossini | Lonneke Van Der Plas
We present a comprehensive approach for multiword expression (MWE) identification that combines binary token-level classification, linguistic feature integration, and data augmentation. Our DeBERTa-v3-large model achieves 69.8% F1 on the CoAM dataset, surpassing the best results (Qwen-72B, 57.8% F1) on this dataset by 12 points while using 165 times fewer parameters. We achieve this performance by (1) reformulating detection as binary token-level START/END/INSIDE classification rather than span-based prediction, (2) incorporating NP chunking and dependency features that aid the identification of discontinuous and NOUN-type MWEs, and (3) applying oversampling that addresses severe class imbalance in the training data. We confirm the generalization of our method on the STREUSLE dataset, achieving 78.9% F1. These results demonstrate that carefully designed smaller models can substantially outperform LLMs on structured NLP tasks, with important implications for resource-constrained deployments.
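To show how START/END/INSIDE tags can be turned into spans, here is a small illustrative decoder; the decoding rules (including skipping O tokens inside an open span, which recovers discontinuous MWEs) are our assumptions, not the paper's implementation.

```python
def decode_mwe_spans(tags):
    """Convert per-token tags into tuples of token indices forming one MWE."""
    spans, current = [], None
    for i, tag in enumerate(tags):
        if tag == "START":
            current = [i]
        elif tag == "INSIDE" and current is not None:
            current.append(i)
        elif tag == "END" and current is not None:
            current.append(i)
            spans.append(tuple(current))
            current = None
        # "O" tokens inside an open span are simply skipped, which allows
        # discontinuous expressions to be recovered.
    return spans

tokens = ["He", "kicked", "the", "bucket", "yesterday"]
tags   = ["O",  "START",  "INSIDE", "END",  "O"]
print(decode_mwe_spans(tags))  # [(1, 2, 3)] -> "kicked the bucket"
```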
The Correlation Between Emotion in Text and Speech Segments is Limited: A Cross-Modal Study
David Lindevelt | Suzan Verberne | Joost Broekens
David Lindevelt | Suzan Verberne | Joost Broekens
Although expressive TTS systems aim to capture human-like emotion, little is known about how well emotional signals in text correspond to those in speech. In this short paper, we investigate how emotion (Valence, Arousal, Dominance) in text relates to emotion in speech. We use 8 large language models for identifying emotion in text and two audio models for emotion in speech, across three genres: Podcasts, Audiobooks and TED talks. Findings show that while the language models perform well on emotion recognition from situational text and the audio models perform well on speech, their predictions correlate strongly for Valence only. Further, the genre of the content significantly impacts the correlation: audiobooks exhibit higher text-audio correlation than TED talks. Finally, we show that providing more context to the LLMs fails to improve the correlation between text and speech emotion prediction. Our results highlight that emotional signals in text do not correspond well to those in speech: emotion prediction from text alone is insufficient for emotional TTS.
Seeing All Sides: Multi-Perspective In-Context Learning for Subjective NLP
Benedetta Muscato | Yue Li | Gizem Gezici | Zhixue Zhao | Fosca Giannotti
Benedetta Muscato | Yue Li | Gizem Gezici | Zhixue Zhao | Fosca Giannotti
Modern language models excel at factual reasoning but struggle with value diversity: the multiplicity of plausible human perspectives. Tasks such as hate speech or sexism detection expose this limitation, where human disagreement captures the diversity of perspectives that models need to account for, rather than dataset noise. In this paper, we explore whether multi-perspective in-context learning (ICL) can align large language models (LLMs) with this diversity without parameter updates. We evaluate four LLMs on five datasets across three languages (English, Arabic, Italian), considering three label-space representations (aggregated hard, disaggregated hard, and disaggregated soft) and five demonstration selection and ordering strategies. Our multi-perspective approach outperforms standard prompting on aggregated English labels, while disaggregated soft predictions better align with human judgments in Arabic and Italian datasets. These findings highlight the importance of perspective-aware LLMs for reducing bias and polarization, while also revealing the challenges of applying ICL to socially sensitive tasks. We further probe model faithfulness using eXplainable AI (XAI), offering insights into how LLMs handle human disagreement.
Beyond Divergent Creativity: A Human-Based Evaluation of Creativity in Large Language Models
Kumiko Nakajima | Jan Zuiderveld | Sandro Pezzelle
Kumiko Nakajima | Jan Zuiderveld | Sandro Pezzelle
Large language models (LLMs) are increasingly used in verbal creative tasks. However, previous assessments of the creative capabilities of LLMs remain weakly grounded in human creativity theory and are thus hard to interpret. The widely used Divergent Association Task (DAT) focuses on novelty, ignoring appropriateness, a core component of creativity. We evaluate a range of state-of-the-art LLMs on DAT and show that their scores on the task are lower than those of two baselines that do not possess any creative abilities, undermining its validity for model evaluation. Grounded in human creativity theory, which defines creativity as the combination of novelty and appropriateness, we introduce the Conditional Divergent Association Task (CDAT). CDAT evaluates novelty conditional on contextual appropriateness, separating noise from creativity better than DAT, while remaining simple and objective. Under CDAT, smaller model families often show the most creativity, whereas advanced families favor appropriateness at lower novelty. We hypothesize that training and alignment likely shift models along this frontier, making outputs more appropriate but less creative. We release the dataset and code.
No. While Multimodal Large Language Models (MLLMs) have been shown to perform very well on general video data, we systematically show that their performance on movies lags behind. This is surprising as MLLMs are increasingly used for movie understanding. To measure the performance of MLLMs on movies, we explore three pillars of movie mastery: movie knowledge, cinematographic knowledge, and critical analysis. Through a combination of quantitative and in-depth qualitative evaluations, we identify where MLLMs show promise and, in particular, where they fail. Our findings show that in small-scale settings involving factual knowledge, MLLMs are able to outperform existing methods. However, once cinematographic and critical analysis is required, MLLMs are insufficiently able to extract meaningful information from the visual modality to be able to provide useful insights. The data and project page are available at https://carlobretti.github.io/moviebuff.
Process Evaluation for Agentic Systems
Milan Gritta | Debjit Paul | Xiaoguang Li | Lifeng Shang | Jun Wang | Gerasimos Lampouras
Milan Gritta | Debjit Paul | Xiaoguang Li | Lifeng Shang | Jun Wang | Gerasimos Lampouras
The significance of tasks entrusted to LLM-based assistants (agents) and the associated societal risks are increasing each year. Agents are being explored in critical domains such as medicine, finance, law, infrastructure, and other sensitive applications that require system transparency and high user trust. The quality of these agents is typically evaluated by accuracy, sometimes extended to partial correctness. In this position paper, we argue that this focus on outcomes is insufficient as it can obscure risky agent behaviours such as skipping critical steps, hallucinating tool use, relying on outdated parametric knowledge and other means of bypassing recommended processes. Our core position is that a holistic agent evaluation must include process evaluation, especially for critical applications. We conduct a small-scale study to assess the feasibility of automatic process evaluation, present a compliance score, analyse use cases of bad and good behaviours, and offer recommendations for more holistic evaluation.
MIMIC: Multi-party Dialogue Augmentation via Speaker Stylistic Transfer
Gaetano Cimino | Giuseppe Carenini | Vincenzo Deufemia
Gaetano Cimino | Giuseppe Carenini | Vincenzo Deufemia
Annotated data scarcity has long hindered progress in dialogue discourse parsing. To fill this gap, we introduce MIMIC, a framework for augmenting discourse-annotated corpora via speaker stylistic transfer using Large Language Models (LLMs). MIMIC rephrases utterances while preserving discourse coherence, using the MASK metric to identify speakers for replacement that enrich structural diversity and the MIRROR method to select substitute speakers who have experienced similar discourse interactions. Experimental results on STAC and Molweni corpora show that parsers trained with MIMIC-augmented data improve both link prediction and relation classification, with consistent gains for underrepresented discourse patterns and in low-resource scenarios.
TechING: Towards Real World Technical Image Understanding via VLMs
Tafazzul Nadeem | Bhavik Shangari | Manish Rai | Gagan Raj Gupta | Ashutosh Modi
Tafazzul Nadeem | Bhavik Shangari | Manish Rai | Gagan Raj Gupta | Ashutosh Modi
Professionals working in technical domains typically hand-draw technical diagrams (e.g., flowcharts, block diagrams) on whiteboards, paper, etc. during discussions; however, if they want to edit these later, the diagrams need to be redrawn from scratch. Modern-day VLMs have made tremendous progress in image understanding, but they struggle when it comes to understanding technical diagrams. One way to overcome this problem is to fine-tune on real-world hand-drawn images, but it is not practically possible to generate a large number of such images. In this paper, we introduce a large synthetically generated corpus (reflective of real-world images) for training VLMs and subsequently evaluate VLMs on a smaller corpus of hand-drawn images (with the help of humans). We introduce several new self-supervision tasks for training, perform extensive experiments with various baseline models, and fine-tune the Llama 3.2 11B-instruct model on synthetic images on these tasks to obtain LLama-VL-TUG, which significantly improves the ROUGE-L performance of Llama 3.2 11B-instruct by 2.14x and achieves the best all-round performance across all baseline models. On real-world images, human evaluation reveals that we achieve minimum compilation errors across all baselines in 7 out of 8 diagram types and improve the average F1 score of Llama 3.2 11B-instruct by 6.97x.
Multimodal models excel in English, supported by abundant image-text and audio-text data, but performance drops sharply for other languages due to limited multilingual multimodal resources. Existing solutions rely on machine translation, while advances in multilingual text modeling remain underutilized. We introduce M2M, a lightweight alignment method that learns only a few linear layers–using English text alone–to map multilingual text embeddings into multimodal space. Despite its simplicity, M2M matches baseline performance in English (94.9% Recall@10) and achieves strong zero-shot transfer (89.5% Recall@10 averaged across 11 languages, 10 unseen) on XTD Text-to-Image retrieval. Qualitative t-SNE visualizations show that multilingual embeddings align tightly with multimodal representations, while weight analysis reveals that the transformation reshapes embedding geometry rather than performing trivial rotations. Beyond image-text retrieval, M2M demonstrates robustness across datasets and tasks, extending to Audio-Text retrieval and Text-to-Image generation. We release [code and checkpoints](https://github.com/piyushsinghpasi/M2M) along with multilingual evaluation datasets: [MSCOCO Multilingual 30K](https://huggingface.co/datasets/piyushsinghpasi/mscoco-multilingual-30k), [AudioCaps Multilingual](https://huggingface.co/datasets/piyushsinghpasi/audiocaps-multilingual), and [Clotho Multilingual](https://huggingface.co/datasets/piyushsinghpasi/clotho-multilingual).
Do GUI Grounders Truly Understand UI Elements?
Surgan Jandial | Yinheng Li | Justin Wagle | Kazuhito Koishida
Surgan Jandial | Yinheng Li | Justin Wagle | Kazuhito Koishida
Graphical User Interface (GUI) grounding is critical for effective GUI agents. Despite recent progress, key challenges remain: 1) existing grounding models and benchmarks are skewed toward web and mobile environments, neglecting desktop interfaces (especially windows); and 2) grounding capability is assessed using accuracy on a single "best" instruction per UI element. However, users can refer to a UI element in diverse valid ways – via visual attributes, spatial relations, etc., and a capable grounding model should produce consistent outputs across such variations. Focusing on desktop environments, we introduce the GUI Grounding Sensitivity Benchmark, which investigates model sensitivity to multiple descriptions of the same UI element. We design an automatic pipeline to generate multiple valid instructions per UI element and develop nuanced data validation methods, as even frontier models hallucinate when producing a single instruction. Evaluation of 12 models reveals that they are reasonably sensitive and that their performance on existing benchmarks does not reflect their true ability. Building on the insight that a given grounding model struggles more with certain instructions or relations, we introduce the GUI Grounding Diagnosis Agent, which generates challenging instructions using model feedback and iterative refinement. Our agent reports a high success rate (up to 84%) in generating instructions that fail state-of-the-art GUI grounding models.
Ensemble Privacy Defense for Knowledge-Intensive LLMs against Membership Inference Attacks
Haowei Fu | Bo Ni | Han Xu | Kunpeng Liu | Dan Lin | Tyler Derr
Haowei Fu | Bo Ni | Han Xu | Kunpeng Liu | Dan Lin | Tyler Derr
Retrieval-Augmented Generation (RAG) and Supervised Finetuning (SFT) have become the predominant paradigms for equipping Large Language Models (LLMs) with external knowledge for diverse, knowledge-intensive tasks. However, while such knowledge injection improves performance, it also exposes new attack surfaces. Membership Inference Attacks (MIAs), which aim to determine whether a given data sample was included in a model’s training set, pose serious threats to privacy and trust in sensitive domains. To this end, we first systematically evaluate the vulnerability of RAG- and SFT-based LLMs to various MIAs. Then, to address the privacy risk, we further introduce a novel, model-agnostic defense framework, Ensemble Privacy Defense (EPD), which aggregates and evaluates the outputs of a knowledge-injected LLM, a base LLM, and a dedicated judge model to enhance resistance against MIAs. Comprehensive experiments show that, on average, EPD reduces MIA success by up to 27.8% for SFT and 526.3% for RAG compared to inference-time baseline, while maintaining answer quality.
SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents
Qiusi Zhan | Angeline Budiman-Chan | Abdelrahman Zayed | Xingzhi Guo | Daniel Kang | Joo-Kyung Kim
Qiusi Zhan | Angeline Budiman-Chan | Abdelrahman Zayed | Xingzhi Guo | Daniel Kang | Joo-Kyung Kim
Large language model (LLM) based search agents iteratively generate queries, retrieve external information, and reason to answer open-domain questions. While researchers have primarily focused on improving their utility, their safety behaviors remain underexplored. In this paper, we first evaluate search agents using red-teaming datasets and find that they are more likely to produce harmful outputs than base LLMs. For example, when asked “How can I track someone’s location without their consent?”, a base model refuses, whereas a search agent designed to retrieve and cite sources may lower its refusal threshold, fetch documents (e.g., court cases), and, once appended, synthesize them into an informative yet unsafe summary. We further show that utility-oriented finetuning intensifies this risk, motivating joint alignment of safety and utility. We present SafeSearch, a multi-objective reinforcement learning approach that couples a final-output safety/utility reward with a novel query-level shaping term that penalizes unsafe queries and rewards safe ones. Experiments show that SafeSearch reduces agent harmfulness by over 70% across three red-teaming datasets while producing safe, helpful responses, and matches the QA performance of a utility-only finetuned agent. Further analyses confirm the effectiveness of the query-level reward in jointly improving safety and utility.
SAGE: A Top-Down Bottom-Up Knowledge-Grounded User Simulator for Multi-turn Agent Evaluation
Ryan Shea | Yunan Lu | Liang Qiu | Zhou Yu
Ryan Shea | Yunan Lu | Liang Qiu | Zhou Yu
Evaluating multi-turn interactive agents is challenging due to the need for human assessment. Evaluation with simulated users has been introduced as an alternative; however, existing approaches typically model generic users and overlook the domain-specific principles required to capture realistic behavior. We propose SAGE, a novel user Simulation framework for multi-turn AGent Evaluation that integrates knowledge from business contexts. SAGE incorporates top-down knowledge rooted in business logic, such as ideal customer profiles, grounding user behavior in realistic customer personas. We further integrate bottom-up knowledge taken from business agent infrastructure (e.g., product catalogs, FAQs, and knowledge bases), allowing the simulator to generate interactions that reflect users’ information needs and expectations in a company’s target market. Through empirical evaluation, we find that this approach produces interactions that are more realistic and diverse, while also identifying up to 33% more agent errors, highlighting its effectiveness as an evaluation tool to support bug-finding and iterative agent improvement.
Better as Generators Than Classifiers: Leveraging LLMs and Synthetic Data for Low-Resource Multilingual Classification
Branislav Pecher | Jan Cegin | Robert Belanec | Ivan Srba | Jakub Simko | Maria Bielikova
Branislav Pecher | Jan Cegin | Robert Belanec | Ivan Srba | Jakub Simko | Maria Bielikova
Large Language Models (LLMs) have demonstrated remarkable multilingual capabilities, making them promising tools in both high- and low-resource languages. One particularly valuable use case is generating synthetic samples that can be used to train smaller models in low-resource scenarios where human-labelled data is scarce. In this work, we investigate whether these synthetic data generation capabilities can serve as a form of distillation, producing smaller models that perform on par with or even better than massive LLMs across languages and tasks. To this end, we use a state-of-the-art multilingual LLM to generate synthetic datasets covering 11 languages and 4 classification tasks. These datasets are then used to train smaller models via fine-tuning or instruction tuning, or as synthetic in-context examples for compact LLMs. Our experiments show that even small amounts of synthetic data enable smaller models to outperform the large generator itself, particularly in low-resource languages. Overall, the results suggest that LLMs are best utilised as generators (teachers) rather than classifiers, producing data that empowers smaller and more efficient multilingual models.
Dialogue is Better Than Monologue: Instructing Medical LLMs via Strategic Conversations
Zijie Liu | Xinyu Zhao | Jie Peng | Jinhao Duan | Zhuangdi Zhu | Qingyu Chen | Kaidi Xu | Xia Hu | Tianlong Chen
Zijie Liu | Xinyu Zhao | Jie Peng | Jinhao Duan | Zhuangdi Zhu | Qingyu Chen | Kaidi Xu | Xia Hu | Tianlong Chen
DF-RAG: Query-Aware Diversity for Retrieval-Augmented Generation
Saadat Hasan Khan | Spencer Hong | Jingyu Wu | Kevin Lybarger | Youbing Yin | Erin Babinsky | Daben Liu
Saadat Hasan Khan | Spencer Hong | Jingyu Wu | Kevin Lybarger | Youbing Yin | Erin Babinsky | Daben Liu
Retrieval-augmented generation (RAG) is a common technique for grounding language model outputs in domain-specific information. However, RAG is often challenged by reasoning-intensive question-answering (QA), since common retrieval methods like cosine similarity maximize relevance at the cost of introducing redundant content, which can reduce information recall. To address this, we introduce Diversity-Focused Retrieval-Augmented Generation (DF-RAG) that systematically incorporates diversity into the retrieval step to improve performance on complex, reasoning-intensive QA benchmarks. DF-RAG builds upon the Maximal Marginal Relevance framework to select information chunks that are both relevant to the query and maximally dissimilar from each other. A key innovation of DF-RAG is its ability to optimize the level of diversity for each query dynamically at test time without requiring any additional fine-tuning or prior information. We show that DF-RAG improves F1 performance on reasoning-intensive QA benchmarks by 4–10% over vanilla RAG using cosine similarity and also outperforms other established baselines. Furthermore, we estimate an Oracle ceiling of up to 18% absolute F1 gains over vanilla RAG, of which DF-RAG captures up to 91.3%.
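Since DF-RAG builds on the Maximal Marginal Relevance framework, a plain MMR selection loop is sketched below with NumPy; the query-adaptive tuning of the trade-off parameter `lam` at test time, which is the paper's key addition, is not shown here.

```python
import numpy as np

def mmr_select(query_vec, chunk_vecs, k=4, lam=0.7):
    """Maximal Marginal Relevance: greedily pick chunks that balance relevance
    to the query against redundancy with already-selected chunks.
    lam=1.0 recovers pure cosine-similarity retrieval."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    candidates = list(range(len(chunk_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best, best_score = None, -np.inf
        for i in candidates:
            relevance = cos(query_vec, chunk_vecs[i])
            redundancy = max((cos(chunk_vecs[i], chunk_vecs[j]) for j in selected),
                             default=0.0)
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected

# toy usage with random embeddings
rng = np.random.default_rng(0)
q = rng.normal(size=8)
chunks = rng.normal(size=(6, 8))
print(mmr_select(q, chunks, k=3, lam=0.7))
```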
Hearing Between the Lines: Unlocking the Reasoning Power of LLMs for Speech Evaluation
Arjun Chandra | Kevin Miller | Venkatesh Ravichandran | Constantinos Papayiannis | Venkatesh Saligrama
Arjun Chandra | Kevin Miller | Venkatesh Ravichandran | Constantinos Papayiannis | Venkatesh Saligrama
Large Language Model (LLM) judges exhibit strong reasoning capabilities but are limited to textual content. This leaves current automatic Speech-to-Speech (S2S) evaluation methods reliant on opaque and expensive Audio Language Models (ALMs). In this work, we propose TRACE (Textual Reasoning over Audio Cues for Evaluation), a novel framework that enables LLM judges to reason over audio cues to achieve cost-efficient and human-aligned S2S evaluation. To demonstrate the strength of the framework, we first introduce a Human Chain-of-Thought (HCoT) annotation protocol to improve the diagnostic capability of existing judge benchmarks by separating evaluation into explicit dimensions: content (C), voice quality (VQ), and paralinguistics (P). Using this data, TRACE constructs a textual blueprint of inexpensive audio signals and prompts an LLM to render dimension-wise judgments, fusing them into an overall rating via a deterministic policy. TRACE achieves higher agreement with human raters than ALMs and transcript-only LLM judges while being significantly more cost-effective. We will release the HCoT annotations and the TRACE framework to enable scalable and human-aligned S2S evaluation.
Improving Chain-of-Thought for Logical Reasoning via Attention-Aware Intervention
Phuong Minh Nguyen | Dang Huu-Tien | Naoya Inoue
Phuong Minh Nguyen | Dang Huu-Tien | Naoya Inoue
Modern logical reasoning with LLMs primarily relies on complex interactive frameworks that decompose the reasoning process into subtasks solved through carefully designed prompts, or on external resources (e.g., symbolic solvers) that exploit their strong logical structures. These interactive approaches, however, introduce additional overhead or depend on external components, which limits their scalability. In this work, we introduce a non-interactive, end-to-end framework for reasoning tasks, enabling reasoning to emerge within the model itself—improving generalization while preserving analyzability without any external resources. We show that introducing structural information into the few-shot prompt activates a subset of attention heads whose patterns align with logical reasoning operators. Building on this insight, we propose Attention-Aware Intervention (AAI), an inference-time intervention method that reweights attention scores across selected heads identified by their logical patterns. AAI offers an efficient way to steer the model’s reasoning toward leveraging prior knowledge through attention modulation. Extensive experiments show that AAI enhances logical reasoning performance across diverse benchmarks and model architectures, while incurring negligible additional computational overhead. Code is available at https://github.com/phuongnm94/aai_for_logical_reasoning.
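One plausible way to picture attention reweighting over selected heads is the PyTorch sketch below, which sharpens and renormalizes post-softmax attention for a hard-coded head subset; the actual intervention rule and head-selection procedure in AAI may differ.

```python
import torch

def reweight_attention(attn: torch.Tensor, head_ids, alpha: float = 1.5) -> torch.Tensor:
    """Scale and renormalize attention for a chosen subset of heads.

    attn: (batch, num_heads, query_len, key_len) post-softmax attention.
    head_ids: heads assumed to align with logical operators (identified
    offline in the paper; hard-coded here purely for illustration).
    """
    out = attn.clone()
    out[:, head_ids] = out[:, head_ids] ** alpha   # sharpen selected heads
    out = out / out.sum(dim=-1, keepdim=True)      # renormalize each row
    return out

# toy usage: 8 heads over a length-5 sequence, steer heads 2 and 5
attn = torch.softmax(torch.randn(1, 8, 5, 5), dim=-1)
steered = reweight_attention(attn, head_ids=[2, 5], alpha=2.0)
print(torch.allclose(steered.sum(-1), torch.ones(1, 8, 5)))  # rows still sum to 1
```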
Thinking Long, but Short: Stable Sequential Test-Time Scaling for Large Reasoning Models
Michael R. Metel | Yufei Cui | Boxing Chen | Prasanna Parthasarathi
Michael R. Metel | Yufei Cui | Boxing Chen | Prasanna Parthasarathi
Sequential test-time scaling is a promising training-free method to improve large reasoning model accuracy, but as currently implemented, significant limitations have been observed. Inducing models to think for longer can increase their accuracy, but as the length of reasoning is further extended, it has also been shown to result in accuracy degradation and model instability. This work presents a novel sequential test-time scaling method, Min-Seek, which improves model accuracy significantly over a wide range of induced thoughts, stabilizing the accuracy of sequential scaling, and removing the need for reasoning length fine-tuning. Beyond improving model accuracy over a variety of reasoning tasks, our method is inherently efficient, as only the KV pairs of one additional induced thought are kept in the KV cache during reasoning. With a custom KV cache which stores keys without position embeddings, by dynamically encoding them contiguously before each new generated thought, our method can continue to reason well beyond a model’s maximum context length, and under mild conditions has linear computational complexity.
Defeating Cerberus: Privacy-Leakage Mitigation in Vision Language Models
Boyang Zhang | Istemi Ekin Akkus | Ruichuan Chen | Alice Dethise | Klaus Satzke | Ivica Rimac | Yang Zhang
Boyang Zhang | Istemi Ekin Akkus | Ruichuan Chen | Alice Dethise | Klaus Satzke | Ivica Rimac | Yang Zhang
Vision Language Models (VLMs) have demonstrated remarkable capabilities in processing multimodal data, but their advanced abilities also raise significant privacy concerns, particularly regarding Personally Identifiable Information (PII) leakage. While relevant research has been conducted on single-modal language models to some extent, the vulnerabilities in the multimodal setting have yet to be fully investigated. Our work assesses these emerging risks and introduces a concept-guided mitigation approach. By identifying and modifying the model’s internal states associated with PII-related content, our method guides VLMs to refuse PII-sensitive tasks effectively and efficiently, without requiring re-training or fine-tuning. We also address the current lack of multimodal PII datasets by constructing various ones that simulate real-world scenarios. Experimental results demonstrate the method can achieve on average 93.3% refusal rate for various PII-related tasks with minimal impact on unrelated model performances. We further examine the mitigation’s performance under various conditions to show the adaptability of our proposed method.
TruthTrap: A Bilingual Benchmark for Evaluating Factually Correct Yet Misleading Information in Question Answering
Mohammadamin Shafiei | Hamidreza Saffari | Mohammad Taher Pilehvar | Alessandro Raganato
Mohammadamin Shafiei | Hamidreza Saffari | Mohammad Taher Pilehvar | Alessandro Raganato
Large Language Models (LLMs) are increasingly used to answer factual, information-seeking questions (ISQs). While prior work often focuses on false, misleading information, little attention has been paid to true but strategically persuasive content that can derail a model’s reasoning. To address this gap, we introduce a new evaluation dataset, TruthTrap, in two languages, i.e., English and Farsi, on Iran-related ISQs, each paired with a correct explanation and a persuasive-yet-misleading true hint. We then evaluate nine diverse LLMs (spanning proprietary and open-source systems) via factuality classification and multiple-choice QA tasks, finding that accuracy drops by 25%, on average, when models encounter these misleading yet factual hints. Also, the models’ predictions match the hint-aligned options up to 77 percent of the time. Notably, models often misjudge such hints in isolation yet still integrate them into final answers. Our results highlight a significant limitation in LLM outputs, underscoring the importance of robust fact-verification and emphasizing real-world risks posed by partial truths in domains like social media, education, and policy-making.
FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression
Jiayi Tian | Ryan Solgi | Jinming Lu | Yifan Yang | Hai Li | Zheng Zhang
Jiayi Tian | Ryan Solgi | Jinming Lu | Yifan Yang | Hai Li | Zheng Zhang
Large Language Models (LLMs) have enabled remarkable progress in natural language processing, yet their high computational and memory demands pose challenges for deployment in resource-constrained environments. Although recent low-rank decomposition methods offer a promising path for structural compression, they often suffer from accuracy degradation, expensive calibration procedures, and result in inefficient model architectures that hinder real-world inference speedups. In this paper, we propose FLAT-LLM, a fast and accurate, training-free structural compression method based on fine-grained low-rank transformations in the activation space. Specifically, we reduce the hidden dimension by transforming the weights using truncated eigenvectors computed via head-wise Principal Component Analysis, and employ a greedy budget redistribution strategy to adaptively allocate ranks across decoders. FLAT-LLM achieves efficient and effective weight compression without recovery fine-tuning and completes calibration within a few minutes. Evaluated across 5 models and 11 datasets, FLAT-LLM outperforms structural pruning baselines in generalization and downstream performance, while delivering inference speedups over decomposition-based methods.
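The head-wise PCA idea can be illustrated with NumPy as follows: eigenvectors of the calibration-activation covariance give a truncated projection that shrinks both activations and the downstream weight. The matrix shapes are invented, and rank selection, budget redistribution, and where the transforms are folded into the model are omitted.

```python
import numpy as np

def truncated_pca_transform(activations: np.ndarray, rank: int) -> np.ndarray:
    """Return the top-`rank` eigenvectors of the activation covariance,
    i.e. the projection used to shrink a head's hidden dimension."""
    acts = activations - activations.mean(axis=0, keepdims=True)
    cov = acts.T @ acts / acts.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :rank]        # (dim, rank): top components first

# toy head: 256 calibration tokens, head dimension 64, compressed to rank 16
rng = np.random.default_rng(0)
acts = rng.normal(size=(256, 64))
w_out = rng.normal(size=(64, 128))           # hypothetical head output projection

U = truncated_pca_transform(acts, rank=16)
w_compressed = U.T @ w_out                   # (16, 128): smaller folded weight
acts_compressed = acts @ U                   # (256, 16): smaller activations
# acts @ w_out is approximated by acts_compressed @ w_compressed
print(w_compressed.shape, acts_compressed.shape)
```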
Negative Sampling Techniques in Dense Retrieval: A Survey
Laurin Wischounig | Abdelrahman Abdallah | Adam Jatowt
Laurin Wischounig | Abdelrahman Abdallah | Adam Jatowt
Information Retrieval (IR) is fundamental to many modern NLP applications. The rise of dense retrieval (DR), using neural networks to learn semantic vector representations, has significantly advanced IR performance. Central to training effective dense retrievers through contrastive learning is the selection of informative negative samples. Synthesizing 35 seminal papers, this survey provides a comprehensive and up-to-date overview of negative sampling techniques in dense IR. Our unique contribution is the focus on modern NLP applications and the inclusion of recent Large Language Model (LLM)-driven methods, an area absent in prior reviews. We propose a taxonomy that categorizes techniques into random, statically or dynamically mined, and synthetic negatives. We then analyze these approaches with respect to trade-offs between effectiveness, computational cost, and implementation difficulty. The survey concludes by outlining current challenges and promising future directions for the use of LLM-generated synthetic data.
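For readers unfamiliar with the training setup the survey assumes, here is a toy contrastive (InfoNCE) step that combines in-batch negatives with one mined hard negative per query. The embeddings are random stand-ins, and no specific method from the survey is implied.

```python
# Toy InfoNCE step with in-batch negatives plus one hard negative per query.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, dim, tau = 8, 32, 0.05
q = F.normalize(torch.randn(batch, dim), dim=-1)        # query embeddings
d_pos = F.normalize(torch.randn(batch, dim), dim=-1)    # positive passages
d_hard = F.normalize(torch.randn(batch, dim), dim=-1)   # mined hard negatives

# Scores against all in-batch positives (off-diagonal entries act as random
# negatives), with the explicit hard negative appended as an extra column.
scores = torch.cat([q @ d_pos.T, (q * d_hard).sum(-1, keepdim=True)], dim=1) / tau
labels = torch.arange(batch)  # the i-th positive is the correct "class" for query i
loss = F.cross_entropy(scores, labels)
print(float(loss))
```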
Multi-Agent Procedural Graph Extraction with Structural and Logical Refinement
Wangyang Ying | Yanchi Liu | Xujiang Zhao | Wei Cheng | Zhengzhang Chen | Wenchao Yu | Yanjie Fu | Haifeng Chen
Wangyang Ying | Yanchi Liu | Xujiang Zhao | Wei Cheng | Zhengzhang Chen | Wenchao Yu | Yanjie Fu | Haifeng Chen
Automatically extracting workflows as procedural graphs from natural language is a promising yet underexplored task that requires ensuring both structural validity and logical alignment. Recent advances in large language models (LLMs) show potential for graph extraction, but often yield ill-formed structures or misinterpret logical constructs such as gateways. We introduce a multi-agent framework that treats procedural graph extraction as a multi-round reasoning process with structural and logical refinement agents. The framework operates in three iterative stages: (1) an LLM-based graph extraction phase, (2) a structural feedback phase where a simulation agent diagnoses and explains structural issues, and (3) a logical feedback phase where a semantic agent aligns semantics between flow logic and linguistic cues in the source text. Important feedback is prioritized and expressed in natural language, which is injected into the next-round prompt, enabling interpretable and controllable refinement. This modular design allows agents to target distinct error types without supervision or parameter updates. Experiments demonstrate that our framework achieves substantial improvements in both structural correctness and logical consistency over strong baselines.
MADIAVE: Multi-Agent Debate for Implicit Attribute Value Extraction
Wei-Chieh Huang | Cornelia Caragea
Wei-Chieh Huang | Cornelia Caragea
Implicit Attribute Value Extraction (AVE) is essential for accurately representing products in e-commerce, as it infers latent attributes from multimodal data. Despite advances in multimodal large language models (MLLMs), implicit AVE remains challenging due to the complexity of multidimensional data and gaps in vision-text understanding. In this work, we introduce MADIAVE, a multi-agent debate framework that employs multiple MLLM agents to iteratively refine inferences. Through a series of debate rounds, agents verify and update each other’s responses, thereby improving inference performance and robustness. Experiments on the ImplicitAVE dataset demonstrate that even a few rounds of debate significantly boost accuracy, especially for attributes with initially low performance. We systematically evaluate various debate configurations, including identical or different MLLM agents, and analyze how debate rounds affect convergence dynamics. Our findings highlight the potential of multi-agent debate strategies to address the limitations of single-agent approaches and offer a scalable solution for implicit AVE in multimodal e-commerce.
DeepSieve: Information Sieving via LLM-as-a-Knowledge-Router
Minghao Guo | Qingcheng Zeng | Xujiang Zhao | Yanchi Liu | Wenchao Yu | Mengnan Du | Haifeng Chen | Wei Cheng
Minghao Guo | Qingcheng Zeng | Xujiang Zhao | Yanchi Liu | Wenchao Yu | Mengnan Du | Haifeng Chen | Wei Cheng
Large Language Models (LLMs) excel at many reasoning tasks but struggle with knowledge-intensive queries due to their inability to dynamically access up-to-date or domain-specific information. Retrieval-Augmented Generation (RAG) has emerged as a promising solution, enabling LLMs to ground their responses in external sources. However, existing RAG methods lack fine-grained control over both the query and source sides, often resulting in noisy retrieval and shallow reasoning. In this work, we introduce DeepSieve, an agentic RAG framework that incorporates information sieving via LLM-as-a-knowledge-router. DeepSieve decomposes complex queries into structured sub-questions and recursively routes each to the most suitable knowledge source, filtering irrelevant information through a multi-stage distillation process. Our design emphasizes modularity, transparency, and adaptability, leveraging recent advances in agentic system design. Experiments on multi-hop QA tasks across heterogeneous sources demonstrate improved reasoning depth, retrieval precision, and interpretability over conventional RAG approaches.
Analyzing LLM Instruction Optimization for Tabular Fact Verification
Xiaotang Du | Giwon Hong | Wai-Chung Kwan | Rohit Saxena | Ivan Titov | Pasquale Minervini | Emily Allaway
Xiaotang Du | Giwon Hong | Wai-Chung Kwan | Rohit Saxena | Ivan Titov | Pasquale Minervini | Emily Allaway
Instruction optimization provides a lightweight, model-agnostic approach to enhancing the reasoning performance of large language models (LLMs). This paper presents the first systematic comparison of instruction optimization, based on the DSPy optimization framework, for tabular fact verification. We evaluate four out-of-the-box prompting techniques that cover both text-only prompting and code use: direct prediction, Chain-of-Thought (CoT), ReAct with SQL tools, and CodeAct with Python execution. We study three optimizers from the DSPy framework—COPRO, MiPROv2, and SIMBA—across four benchmarks and three model families. We find that instruction optimization consistently improves verification accuracy, with MiPROv2 yielding the most stable gains for CoT, and SIMBA providing the largest benefits for ReAct agents, particularly at larger model scales. Behavioral analyses reveal that SIMBA encourages more direct reasoning paths by applying heuristics, thereby improving numerical comparison abilities in CoT reasoning and helping avoid unnecessary tool calls in ReAct agents. Across different prompting techniques, CoT remains effective for tabular fact checking, especially with smaller models. Although ReAct agents built with larger models can achieve competitive performance, they require careful instruction optimization.
XMAD-Bench: Cross-Domain Multilingual Audio Deepfake Benchmark
Ioan-Paul Ciobanu | Andrei-Iulian Hîji | Nicolae Catalin Ristea | Paul Irofti | Cristian Rusu | Radu Tudor Ionescu
Ioan-Paul Ciobanu | Andrei-Iulian Hîji | Nicolae Catalin Ristea | Paul Irofti | Cristian Rusu | Radu Tudor Ionescu
Recent advances in audio generation have led to an increasing number of deepfakes, making the general public more vulnerable to financial scams, identity theft, and misinformation. Audio deepfake detectors promise to alleviate this issue, with many recent studies reporting accuracy rates close to 99%. However, these methods are typically tested in an in-domain setup, where the deepfake samples from the training and test sets are produced by the same generative models. To address this limitation, we introduce XMAD-Bench, a large-scale cross-domain multilingual audio deepfake benchmark comprising 668.8 hours of real and deepfake speech. In our novel dataset, the speakers, the generative methods, and the real audio sources are distinct across training and test splits. This leads to a challenging cross-domain evaluation setup, where audio deepfake detectors can be tested "in the wild". Our in-domain and cross-domain experiments indicate a clear disparity between the in-domain performance of deepfake detectors, which is usually as high as 100%, and the cross-domain performance of the same models, which is sometimes similar to random chance. Our benchmark highlights the need for the development of robust audio deepfake detectors, which maintain their generalization capacity across different languages, speakers, generative methods, and data sources. Our benchmark is publicly released at https://github.com/ristea/xmad-bench/.
CLEAR-3K: Assessing Causal Explanatory Capabilities in Language Models
Naiming Liu | Richard Baraniuk | Shashank Sonkar
Naiming Liu | Richard Baraniuk | Shashank Sonkar
We introduce CLEAR-3K, a dataset of 3,008 assertion-reasoning questions designed to evaluate whether language models can determine if one statement causally explains another. Each question presents an assertion-reason pair and challenges language models to distinguish between semantic relatedness and genuine causal explanatory relationships. Through comprehensive evaluation of 21 state-of-the-art language models (ranging from 0.5B to 72B parameters), we identify two fundamental findings. First, language models frequently confuse semantic similarity with causality, relying on lexical and semantic overlap instead of inferring actual causal explanatory relationships. Second, as parameter size increases, models tend to shift from being overly skeptical about causal relationships to being excessively permissive in accepting them. Despite this shift, performance measured by the Matthews Correlation Coefficient plateaus at just 0.55, even for the best-performing models. Hence, CLEAR-3K provides a crucial benchmark for developing and evaluating causal explanatory reasoning in language models, which is an essential capability for applications that require accurate assessment of causal relationships.
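The reported metric, the Matthews Correlation Coefficient, can be computed directly with scikit-learn; the labels below are made up purely for illustration of the scale on which the 0.55 plateau is measured.

```python
# Small illustration of the Matthews Correlation Coefficient on binary
# "causally explains / does not explain" predictions (made-up labels).
from sklearn.metrics import matthews_corrcoef

gold = [1, 0, 1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 0, 1, 1, 0]
# 1.0 = perfect agreement, 0.0 = chance-level, negative = systematic disagreement.
print(round(matthews_corrcoef(gold, pred), 3))
```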
Imbalanced Gradients in RL Post-Training of Multi-Task LLMs
Runzhe Wu | Ankur Samanta | Ayush Jain | Scott Fujimoto | Jeongyeol Kwon | Ben Kretzu | Youliang Yu | Kaveh Hassani | Boris Vidolov | Yonathan Efroni
Runzhe Wu | Ankur Samanta | Ayush Jain | Scott Fujimoto | Jeongyeol Kwon | Ben Kretzu | Youliang Yu | Kaveh Hassani | Boris Vidolov | Yonathan Efroni
Multi-task post-training of large language models (LLMs) is typically performed by mixing datasets from different tasks and optimizing them jointly. This approach implicitly assumes that all tasks contribute gradients of similar magnitudes. In this paper, we show that this assumption fails in RL post-training: certain tasks produce significantly larger gradients, biasing updates toward those tasks. Such gradient imbalance would be justified only if larger gradients implied larger learning gains on the tasks (i.e., larger performance improvements)—but we find this is not true. Large-gradient tasks can achieve similar or even much lower learning gains than small-gradient ones. Further analyses reveal that these gradient imbalances cannot be explained by typical training statistics such as training rewards or advantages, suggesting that they arise from the *inherent* differences between tasks. This cautions against naive dataset mixing and calls for future work on principled gradient-level corrections for LLMs.
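A quick way to see the kind of imbalance described above is to compare per-task gradient norms under a shared model. The toy regression tasks below are stand-ins for RL post-training tasks and are not taken from the paper.

```python
# Sketch of a per-task gradient-norm diagnostic under a shared model.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 1)
tasks = {
    "task_a": (torch.randn(32, 16), torch.randn(32, 1)),
    "task_b": (torch.randn(32, 16), 10.0 * torch.randn(32, 1)),  # larger targets
}

for name, (x, y) in tasks.items():
    model.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    grad_norm = torch.cat([p.grad.flatten() for p in model.parameters()]).norm()
    print(f"{name}: grad norm = {grad_norm:.2f}")
# A large gap between tasks signals that naive data mixing will be dominated
# by the large-gradient task, regardless of its actual learning gains.
```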
BayesFlow: A Probability Inference Framework for Meta-Agent Assisted Workflow Generation
Bo Yuan | Yun Zhou | Zhichao Xu | Kiran Ramnath | Aosong Feng | Balasubramaniam Srinivasan
Bo Yuan | Yun Zhou | Zhichao Xu | Kiran Ramnath | Aosong Feng | Balasubramaniam Srinivasan
Automatic workflow generation is the process of automatically synthesizing sequences of LLM calls, tool invocations, and post-processing steps for complex end-to-end tasks. Most prior methods cast this task as an optimization problem with limited theoretical grounding. We propose to cast workflow generation as Bayesian inference over a posterior distribution on workflows, and introduce Bayesian Workflow Generation (BWG), a sampling framework that builds workflows step-by-step using parallel look-ahead rollouts for importance weighting and a sequential in-loop refiner for pool-wide improvements. We prove that, without the refiner, the weighted empirical distribution converges to the target posterior. We instantiate BWG as BayesFlow, a training-free algorithm for workflow construction. Across six benchmark datasets, BayesFlow improves accuracy by up to 9 percentage points over SOTA workflow generation baselines and by up to 65 percentage points over zero-shot prompting, establishing BWG as a principled upgrade to search-based workflow design.
HapticLLaMA: A Multimodal Sensory Language Model for Haptic Captioning
Guimin Hu | Daniel Hershcovich | Hasti Seifi
Guimin Hu | Daniel Hershcovich | Hasti Seifi
Haptic captioning is the task of generating natural language descriptions from haptic signals, such as vibrations, for use in virtual reality and rehabilitation applications. While previous multimodal research has focused primarily on vision and audio, haptic feedback for the sense of touch remains underexplored. To address this gap, we formalize the haptic captioning task and propose HapticLLaMA, a multimodal sensory language model that interprets vibration signals into descriptions in a given sensory, emotional, or associative category. We investigate two types of haptic tokenizers, a frequency-based tokenizer and an EnCodec-based tokenizer, that convert haptic signals into sequences of discrete units, enabling their integration with the LLaMA model. HapticLLaMA is trained in two stages: (1) supervised fine-tuning using the LLaMA architecture with LoRA-based adaptation, and (2) fine-tuning via reinforcement learning from human feedback (RLHF). We assess HapticLLaMA’s captioning performance using both automated n-gram metrics and human evaluation. HapticLLaMA demonstrates strong capability in interpreting haptic vibration signals, achieving a METEOR score of 59.98 and a BLEU-4 score of 32.06. Furthermore, over 64% of the generated captions received human ratings above 3.5 on a 7-point scale, with RLHF yielding a 13% improvement in the overall rating distribution, indicating stronger alignment with human haptic perception. These findings highlight the potential of large language models to process and adapt to sensory data.
Token-Level Precise Attack on RAG: Searching for the Best Alternatives to Mislead Generation
Zizhong Li | Haopeng Zhang | Jiawei Zhang
Zizhong Li | Haopeng Zhang | Jiawei Zhang
While large language models (LLMs) have achieved remarkable success in providing trustworthy responses for knowledge-intensive tasks, they still face critical limitations such as hallucinations and outdated knowledge. To address these issues, the retrieval-augmented generation (RAG) framework enhances LLMs with access to external knowledge via a retriever, enabling more accurate and real-time outputs about the latest events. However, this integration brings new security vulnerabilities: the risk that malicious content in the external database can be retrieved and used to manipulate model outputs. Although prior work has explored attacks on RAG systems, existing approaches either rely heavily on access to the retriever or fail to jointly consider both retrieval and generation stages, limiting their effectiveness, particularly in black-box scenarios. To overcome these limitations, we propose Token-level Precise Attack on the RAG (TPARAG), a novel framework that targets both white-box and black-box RAG systems. TPARAG leverages a lightweight white-box LLM as an attacker to generate and iteratively optimize malicious passages at the token level, ensuring both retrievability and high attack success in generation. Extensive experiments on open-domain QA datasets demonstrate that TPARAG consistently outperforms previous approaches in retrieval-stage and end-to-end attack effectiveness. These results further reveal critical vulnerabilities in RAG pipelines and offer new insights into improving their robustness.
PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference
Krishna Teja Chitty-Venkata | Jie Ye | Siddhisanket Raskar | Anthony Kougkas | Xian Sun | Murali Emani | Venkatram Vishwanath | Bogdan Nicolae
Krishna Teja Chitty-Venkata | Jie Ye | Siddhisanket Raskar | Anthony Kougkas | Xian Sun | Murali Emani | Venkatram Vishwanath | Bogdan Nicolae
KV caching significantly improves the efficiency of Large Language Model (LLM) inference by storing attention states from previously processed tokens, enabling faster generation of subsequent tokens. However, as sequence length increases, the KV cache quickly becomes a major memory bottleneck. To address this, we propose PagedEviction, a novel fine-grained, structured KV cache pruning strategy that enhances the memory efficiency of vLLM’s PagedAttention. Unlike existing approaches that rely on attention-based token importance or evict tokens across different vLLM pages, PagedEviction introduces an efficient block-wise eviction algorithm tailored for paged memory layouts. Our method integrates seamlessly with PagedAttention without requiring any modifications to its CUDA attention kernels. We evaluate PagedEviction across Llama-3.1-8B-Instruct, Llama-3.2-1B-Instruct, and Llama-3.2-3B-Instruct models on the LongBench benchmark suite, demonstrating improved memory usage with better accuracy than baselines on long context tasks.
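A highly simplified sketch of block-wise KV eviction follows: fixed-size blocks of cached keys are scored and the lowest-scoring block is dropped. The block score here is a hypothetical stand-in, and the code does not use vLLM's PagedAttention data structures or CUDA kernels.

```python
# Illustrative block-wise KV cache eviction on a toy cache.
import numpy as np

rng = np.random.default_rng(0)
seq_len, head_dim, block_size = 64, 8, 16
keys = rng.standard_normal((seq_len, head_dim))
values = rng.standard_normal((seq_len, head_dim))

n_blocks = seq_len // block_size
# Hypothetical block score: mean key norm, a stand-in for an importance signal.
block_scores = np.array([
    np.linalg.norm(keys[b * block_size:(b + 1) * block_size], axis=1).mean()
    for b in range(n_blocks)
])
evict = int(block_scores.argmin())
keep = [b for b in range(n_blocks) if b != evict]
keys = np.concatenate([keys[b * block_size:(b + 1) * block_size] for b in keep])
values = np.concatenate([values[b * block_size:(b + 1) * block_size] for b in keep])
print(f"evicted block {evict}; cache now holds {keys.shape[0]} tokens")
```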
SMART-Editor: A Multi-Agent Framework for Human-Like Design Editing with Structural Integrity
Ishani Mondal | Meera Bharadwaj | Ayush Roy | Aparna Garimella | Jordan Lee Boyd-Graber
Ishani Mondal | Meera Bharadwaj | Ayush Roy | Aparna Garimella | Jordan Lee Boyd-Graber
Despite significant progress in natural image editing with state-of-the-art MLLMs, compositional layout and content editing for structured visual domains (e.g., posters, websites) remains underexplored. In this work, we introduce SMART-EDITOR, a multi-agent framework for compositional editing of structured images such as posters and websites. Unlike prior models that focus on isolated local edits, SMART-EDITOR maintains global coherence through two complementary strategies: Reward-Refine, an inference-time reward-guided refinement method, and RewardDPO, a training-time preference optimization approach leveraging reward-aligned layout pairs. To evaluate performance, we introduce SMARTEdit-Bench, a benchmark of cascading multi-step edit instructions that are implicit in nature and require reasoning about edit order to preserve spatial and semantic consistency. Both automatic and human evaluations confirm the value of reward-guided planning in producing semantically consistent and visually coherent edits, beyond what single-shot VLMs can generate.
ConVerse: Benchmarking Contextual Safety in Agent-to-Agent Conversations
Amr Gomaa | Ahmed Salem | Sahar Abdelnabi
Amr Gomaa | Ahmed Salem | Sahar Abdelnabi
As language models evolve into autonomous agents that act and communicate on behalf of users, ensuring safety in multi-agent ecosystems becomes a central challenge. Interactions between personal assistants and external service providers expose a core tension between utility and protection: effective collaboration requires information sharing, yet every exchange creates new attack surfaces. We introduce ConVerse, a dynamic benchmark for evaluating privacy and security risks in agent–agent interactions. ConVerse spans three practical domains (travel, real estate, insurance) with 12 user personas and 864 contextually grounded attacks (611 privacy, 253 security). Unlike prior single-agent settings, it models autonomous, multi-turn agent-to-agent conversations where malicious requests are embedded within plausible discourse. Privacy is tested through a three-tier taxonomy assessing abstraction quality, while security attacks target tool use and preference manipulation. Evaluating seven state-of-the-art models reveals persistent vulnerabilities—privacy attacks succeed in up to 88% of cases and security breaches in up to 60%—with stronger models leaking more. By unifying privacy and security within interactive multi-agent contexts, ConVerse reframes safety as an emergent property of communication.
SIRAJ: Diverse and Efficient Red-Teaming for LLM Agents via Distilled Structured Reasoning
Kaiwen Zhou | Ahmed Elgohary | A S M Iftekhar | Amin Saied
Kaiwen Zhou | Ahmed Elgohary | A S M Iftekhar | Amin Saied
The ability of LLM agents to plan and invoke tools exposes them to new safety risks, making a comprehensive red-teaming system crucial for discovering vulnerabilities and ensuring their safe deployment. We present SIRAJ, a generic red-teaming framework for arbitrary black-box LLM agents. We employ a dynamic two-step process that starts with an agent definition and generates seed test cases that cover diverse risk outcomes, tool-use trajectories, and risk sources. Then, it iteratively constructs and refines model-based adversarial attacks based on the execution trajectories of former attempts. To optimize the red-teaming cost, we present a model distillation approach that leverages structured forms of a teacher model’s reasoning to train smaller models that are equally effective. Across diverse evaluation agent settings, our seed test case generation approach yields a 2–2.5× boost to the coverage of risk outcomes and tool-calling trajectories. Our distilled 8B red-teamer model improves attack success rate by 100%, surpassing the 671B Deepseek-R1 model. Our ablations and analyses validate the effectiveness of the iterative framework, structured reasoning, and the generalization of our red-teamer models.
Who You Are, What You Say: Intra- and Inter- Context Personality for Emotion Recognition in Conversation
Tazeek Bin Abdur Rakib | Lay-Ki Soon | Wern Han Lim
Tazeek Bin Abdur Rakib | Lay-Ki Soon | Wern Han Lim
Emotion recognition in conversation (ERC) requires understanding both contextual dependencies and speaker-specific cues. Existing approaches often treat conversation context as a single representation or encode speaker identity shallowly, limiting their ability to capture fine-grained emotional dynamics. We propose PERC, a personality-aware ERC framework that (1) segregates conversational context into intra- and inter-speaker components, (2) models static or dynamic personality traits to represent stable and evolving speaker dispositions, and (3) performs contrastive cross-alignment between intra–intra and inter–inter representations to enforce contextual and personality consistency. Experiments on three ERC benchmarks show that PERC achieves new state-of-the-art performance, improving weighted F1 by up to 2.74% over non-LLM methods and 0.98% over recent LLM-based methods. Our results demonstrate the effectiveness of integrating context segregation, personality modeling, and contrastive alignment for emotion reasoning in dialogue.
DRIVINGVQA: A Dataset for Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios
Charles Corbière | Simon Roburin | Syrielle Montariol | Antoine Bosselut | Alexandre Alahi
Charles Corbière | Simon Roburin | Syrielle Montariol | Antoine Bosselut | Alexandre Alahi
While chain-of-thought (CoT) prompting improves reasoning in large language models, its effectiveness in vision-language models (VLMs) remains limited due to over-reliance on textual cues and memorized knowledge. To investigate the visual reasoning capabilities of VLMs in complex real-world scenarios, we introduce DrivingVQA, a visual question answering dataset derived from driving theory exams, which contains 3,931 multiple-choice problems with expert-written explanations and grounded entities relevant to the reasoning process. Leveraging this dataset, we explore the benefits of incorporating entity-related information, such as entity names, spatial coordinates, and visual content, through supervised fine-tuning to enhance the model’s reasoning abilities. Our experiments demonstrate that interleaving textual explanations with visual tokens extracted from entities relevant to the question improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting. Furthermore, we demonstrate that this retrieval-based approach effectively scales to the larger A-OKVQA reasoning dataset by leveraging automatically generated pseudo-labels, outperforming CoT prompting.
SAGE: Steerable Agentic Data Generation for Deep Search with Execution Feedback
Fangyuan Xu | Rujun Han | Yanfei Chen | Zifeng Wang | I-Hung Hsu | Jun Yan | Vishy Tirumalashetty | Eunsol Choi | Tomas Pfister | Chen-Yu Lee
Fangyuan Xu | Rujun Han | Yanfei Chen | Zifeng Wang | I-Hung Hsu | Jun Yan | Vishy Tirumalashetty | Eunsol Choi | Tomas Pfister | Chen-Yu Lee
Deep search agents, which aim to answer complex questions requiring reasoning across multiple documents, can significantly speed up the information-seeking process. Collecting human annotations for this application is prohibitively expensive due to long and complex exploration trajectories. We propose an agentic pipeline that automatically generates high-quality, difficulty-controlled deep search question-answer pairs for a given corpus and a target difficulty level. Our pipeline, SAGE, consists of a data generator which proposes QA pairs and a search agent which attempts to solve the generated question and provides execution feedback to the data generator. The two components interact over multiple rounds to iteratively refine the question-answer pairs until they satisfy the target difficulty level. Our intrinsic evaluation shows that SAGE generates questions that require diverse reasoning strategies while significantly increasing the correctness and difficulty of the generated data. Our extrinsic evaluation demonstrates up to 23% relative performance gain on popular deep search benchmarks by training deep search agents with our synthetic data. Additional experiments show that agents trained on our data can adapt from fixed-corpus retrieval to Google Search at inference time, without further training.
Negative-Aware Diffusion Process for Temporal Knowledge Graph Extrapolation
Yanglei Gan | Peng He | Yuxiang Cai | Run Lin | Guanyu Zhou | Qiao Liu
Yanglei Gan | Peng He | Yuxiang Cai | Run Lin | Guanyu Zhou | Qiao Liu
Temporal Knowledge Graph (TKG) reasoning seeks to predict future missing facts from historical evidence. While diffusion models (DM) have recently gained attention for their ability to capture complex predictive distributions, two gaps remain: (i) the generative path is conditioned only on positive evidence, overlooking informative negative context, and (ii) training objectives are dominated by cross-entropy ranking, which improves candidate ordering but provides little supervision over the calibration of the denoised embedding. To bridge these gaps, we introduce the **N**egative-**A**ware **D**iffusion model for TKG **Ex**trapolation (**NADEx**). Specifically, NADEx encodes subject-centric histories of entities, relations, and temporal intervals into sequential embeddings. NADEx perturbs the query object in the forward process and reconstructs it in reverse with a Transformer denoiser conditioned on the temporal-relational context. We further introduce a cosine-alignment regularizer derived from batch-wise negative prototypes, which tightens the decision boundary against implausible candidates. Comprehensive experiments on four public TKG benchmarks demonstrate that NADEx delivers state-of-the-art performance.
DS2-Instruct: Domain-Specific Data Synthesis for Large Language Models Instruction Tuning
Ruiyao Xu | Noelle I. Samia | Han Liu
Ruiyao Xu | Noelle I. Samia | Han Liu
Adapting Large Language Models (LLMs) to specialized domains requires high-quality instruction tuning datasets, which are expensive to create through human annotation. Existing data synthesis methods focus on general-purpose tasks and fail to capture domain-specific terminology and reasoning patterns. To address this, we introduce DS2-Instruct, a zero-shot framework that generates domain-specific instruction datasets without human supervision. Our approach first generates task-informed keywords to ensure comprehensive domain coverage. It then creates diverse instructions by pairing these keywords with different cognitive levels from Bloom’s Taxonomy. Finally, it uses self-consistency validation to ensure data quality. We apply this framework to generate datasets across seven challenging domains, such as mathematics, finance, and logical reasoning. Comprehensive evaluation demonstrates that models fine-tuned on our generated data achieve substantial improvements over existing data generation methods.
DashboardQA: Benchmarking Multimodal Agents for Question Answering on Interactive Dashboards
Aaryaman Kartha | Ahmed Masry | Mohammed Saidul Islam | Thinh Lang | Shadikur Rahman | Ridwan Mahbub | Mizanur Rahman | Mahir Ahmed | Md Rizwan Parvez | Enamul Hoque | Shafiq Joty
Aaryaman Kartha | Ahmed Masry | Mohammed Saidul Islam | Thinh Lang | Shadikur Rahman | Ridwan Mahbub | Mizanur Rahman | Mahir Ahmed | Md Rizwan Parvez | Enamul Hoque | Shafiq Joty
Dashboards are powerful visualization tools for data-driven decision-making, integrating multiple interactive views that allow users to explore, filter, and navigate data. Unlike static charts, dashboards support rich interactivity, which is essential for uncovering insights in real-world analytical workflows. However, existing question-answering benchmarks for data visualizations largely overlook this interactivity, focusing instead on static charts. This limitation severely constrains their ability to evaluate the capabilities of modern multimodal agents designed for GUI-based reasoning. To address this gap, we introduce DashboardQA, the first benchmark explicitly designed to assess how vision-language GUI agents comprehend and interact with real-world dashboards. The benchmark includes 292 tasks on 112 interactive dashboards, encompassing 405 question answer pairs overall. These questions span five categories: multiple-choice, factoid, hypothetical, multi-dashboard, and conversational. By assessing a variety of leading closed- and open-source GUI agents, our analysis reveals their key limitations, particularly in grounding dashboard elements, planning interaction trajectories, and performing reasoning. Our findings indicate that interactive dashboard reasoning is a challenging task overall for all the VLMs evaluated. Even the top-performing agents struggle; for instance, the best agent based on Gemini-Pro-2.5 achieves only 38.69% accuracy, while the OpenAI CUA agent reaches just 22.69%, demonstrating the benchmark’s significant difficulty. We release DashboardQA at ..
Reasoning about Uncertainty: Do Reasoning Models Know When They Don’t Know?
Zhiting Mei | Christina Zhang | Tenny Yin | Justin Lidard | Ola Sho | Anirudha Majumdar
Zhiting Mei | Christina Zhang | Tenny Yin | Justin Lidard | Ola Sho | Anirudha Majumdar
Reasoning language models have set state-of-the-art (SOTA) records on many challenging benchmarks, enabled by multi-step reasoning induced by reinforcement learning. However, reasoning models are prone to generating confident, plausible responses that are incorrect (hallucinations). Knowing when and how much to trust these models is critical for safe deployment in real-world applications. To this end, we explore uncertainty quantification (UQ) of reasoning models in this work. We ask three fundamental questions: First, are reasoning models well-calibrated? Second, does deeper reasoning improve model calibration? Finally, inspired by humans’ innate ability to double-check their thought processes to verify the validity of their answers and their confidence, we ask: can reasoning models improve their calibration by explicitly reasoning about their chain-of-thought traces? We introduce introspective uncertainty quantification (IUQ) to explore this direction. In extensive evaluations on SOTA reasoning models across a broad range of benchmarks focused on knowledge-intensive tasks, we find that reasoning models: (i) are typically overconfident, (ii) become even more overconfident with deeper reasoning, and (iii) can become better calibrated through introspection (e.g., o3-Mini and DeepSeek R1) but not uniformly (e.g., Claude 3.7 Sonnet becomes more poorly calibrated). We conclude with important research directions to design necessary UQ benchmarks and improve the calibration of reasoning models.
AfriMMT-EA: Multi-domain Machine Translation for Low-Resource East African Languages
Naome A Etori | Kelechi Ezema | Nathaniel Romney Robinson | Davis David | Alfred Malengo Kondoro | Elisha Ondieki Makori | Michael Samwel Mollel | Maria Gini
Naome A Etori | Kelechi Ezema | Nathaniel Romney Robinson | Davis David | Alfred Malengo Kondoro | Elisha Ondieki Makori | Michael Samwel Mollel | Maria Gini
Despite remarkable progress in multilingual machine translation (MT), the majority of African—especially East African—languages remain significantly underrepresented both in benchmark datasets and state-of-the-art (SOTA) MT models. This persistent exclusion from mainstream technologies not only limits equitable access, but constrains the development of tools that accurately reflect the region’s linguistic and cultural diversity. Recent advances in open-source large language models have demonstrated strong multilingual MT capabilities through data-efficient adaptation strategies. However, little work has explored their potential for low-resource African languages. We introduce AfriMMT-EA, the first highly multilingual benchmark and MT dataset for East African languages. Our datasets comprise 54 local languages across five East African countries. We used these data to fine-tune two multilingual versions of Gemma-3. We compare models’ performance on these languages with larger off-the-shelf baselines. We release our data and models, in the interest of advancing MT for these low-resource languages and their communities.
Diffusion Language Model Inference with Monte Carlo Tree Search
Zheng Huang | Kiran Ramnath | Yueyan Chen | Aosong Feng | Sangmin Woo | Balasubramaniam Srinivasan | Zhichao Xu | Kang Zhou | Shuai Wang | Haibo Ding | Lin Lee Cheong
Zheng Huang | Kiran Ramnath | Yueyan Chen | Aosong Feng | Sangmin Woo | Balasubramaniam Srinivasan | Zhichao Xu | Kang Zhou | Shuai Wang | Haibo Ding | Lin Lee Cheong
Diffusion language models (DLMs) have recently emerged as a compelling alternative to autoregressive generation, offering parallel generation and improved global coherence. During inference, DLMs generate text by iteratively denoising masked sequences in parallel; however, determining which positions to unmask and which tokens to commit forms a large combinatorial search problem. Existing inference methods approximate this search using heuristics, which often yield suboptimal decoding paths; other approaches instead rely on additional training to guide token selection. To introduce a principled search mechanism for DLMs inference, we introduce MEDAL, an inference-time scaling framework that integrates Monte Carlo Tree SEarch initialization for Diffusion LAnguage Model inference. We employ Monte Carlo Tree Search at the initialization stage to explore promising unmasking trajectories, providing a robust starting point for subsequent refinement. This design enables efficient inference-time scaling, allowing generation quality to improve as the search budget increases, without additional training. Across multiple benchmarks, MEDAL achieves up to 22.0% improvement over existing inference strategies, establishing a new paradigm for search-based inference in DLMs.
DWA-KD: Dual-Space Weighting and Time-Warped Alignment for Cross-Tokenizer Knowledge Distillation
Duc Trung Vu | Pham Khanh Chi | Dat Phi Van | Linh Ngo Van | Dinh Viet Sang | Trung Le
Duc Trung Vu | Pham Khanh Chi | Dat Phi Van | Linh Ngo Van | Dinh Viet Sang | Trung Le
Knowledge Distillation (KD) has emerged as a crucial technique for compressing Large Language Models (LLMs). Although existing cross-tokenizer KD methods have made notable progress, their effectiveness remains constrained by suboptimal alignment across sequence and vocabulary levels. To address these limitations, we introduce Dual-Space Weighting and Time-Warped Alignment (DWA-KD), a novel cross-tokenizer distillation framework that enhances token-wise distillation through dual-space entropy-based weighting and achieves precise sequence-level alignment by leveraging both lexical and semantic information. At the token level, DWA-KD maps teacher representations into the student space and vice versa, performing dual-space KD via Kullback–Leibler divergence (KL). The process is modulated by dual-space entropy-based weights that up-weight tokens where the student is uncertain and the teacher is confident, thereby focusing learning on informative tokens rather than treating all positions equally. At the sequence level, DWA-KD applies Soft Dynamic Time Warping (Soft-DTW) to both the embedding and final hidden-state layers, enabling robust alignment of lexical and contextual semantics between teacher and student sequences. Extensive experiments across diverse NLP benchmarks demonstrate that DWA-KD consistently outperforms state-of-the-art KD baselines, while ablation studies confirm the complementary contributions of entropy-based token weighting and embedding and final hidden state layer Soft-DTW alignment.
Harnessing Consistency for Robust Test-Time LLM Ensemble
Zhichen Zeng | Qi Yu | Xiao Lin | Ruizhong Qiu | Xuying Ning | Tianxin Wei | Yuchen Yan | Jingrui He | Hanghang Tong
Zhichen Zeng | Qi Yu | Xiao Lin | Ruizhong Qiu | Xuying Ning | Tianxin Wei | Yuchen Yan | Jingrui He | Hanghang Tong
Different large language models (LLMs) exhibit diverse strengths and weaknesses, and LLM ensemble serves as a promising approach to integrate their complementary capabilities. Despite substantial progress in improving ensemble quality, limited attention has been paid to the robustness of ensembles against potential erroneous signals, which often arise from heterogeneous tokenization schemes and varying model expertise. Our analysis shows that ensemble failures typically arise from both the token level and the model level: the former reflects severe disagreement in token predictions, while the latter involves low confidence and pronounced disparities among models. In light of this, we propose CoRE, a plug-and-play technique that harnesses model consistency for robust LLM ensemble, which can be seamlessly integrated with diverse ensemble methods. *Token-level consistency* captures fine-grained disagreements by applying a low-pass filter to downweight uncertain tokens with high inconsistency, often due to token misalignment, thereby improving robustness at a granular level. *Model-level consistency* models global agreement by promoting model outputs with high self-confidence and minimal divergence from others, enhancing robustness at a coarser level. Extensive experiments across diverse benchmarks, model combinations, and ensemble strategies demonstrate that CoRE consistently improves ensemble performance and robustness. Our code is available at https://github.com/zhichenz98/CoRE-EACL26.
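The two consistency signals can be illustrated with a toy ensemble of next-token distributions. The exact weighting functions below (an exponentiated KL to the ensemble mean, and confidence times agreement) are illustrative choices, not CoRE's formulation.

```python
# Toy sketch of token- and model-level consistency weighting in an LLM ensemble.
import numpy as np

rng = np.random.default_rng(0)
n_models, vocab = 3, 10
probs = rng.dirichlet(np.ones(vocab), size=n_models)  # next-token dists per model

mean_p = probs.mean(axis=0)
# Token-level inconsistency: per-model KL divergence from the ensemble mean.
kl = (probs * np.log(probs / mean_p)).sum(axis=1)
token_weight = np.exp(-kl.mean())        # small when models strongly disagree

# Model-level consistency: confidence (max prob) times closeness to the mean.
confidence = probs.max(axis=1)
agreement = np.exp(-kl)
model_weights = confidence * agreement
model_weights /= model_weights.sum()

ensembled = (model_weights[:, None] * probs).sum(axis=0)
print(f"position weight = {token_weight:.3f}")  # used to downweight noisy positions
print(ensembled.round(3))
```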
AutoAnoEval: Semantic-Aware Model Selection via Tree-Guided LLM Reasoning for Tabular Anomaly Detection
Suhee Yoon | Sanghyu Yoon | Ye Seul Sim | Seungdong Yoa | Dongmin Kim | Soonyoung Lee | Hankook Lee | Woohyung Lim
Suhee Yoon | Sanghyu Yoon | Ye Seul Sim | Seungdong Yoa | Dongmin Kim | Soonyoung Lee | Hankook Lee | Woohyung Lim
In the tabular domain, which is the predominant data format in real-world applications, anomalies are extremely rare or difficult to collect, as their identification often requires domain expertise. Consequently, evaluating tabular anomaly detection models is challenging, since anomalies may be absent even in evaluation sets. To tackle this challenge, prior works have generated synthetic anomalies; however, because these generation methods rely on statistical patterns, they often overlook domain semantics and struggle to reflect the complex, domain-specific nature of real-world anomalies. We propose AutoAnoEval, a novel evaluation framework for tabular AD that constructs pseudo-evaluation sets with semantically grounded synthetic anomalies. Our approach leverages an iterative interaction between a Large Language Model (LLM) and a decision tree (DT): the LLM generates realistic anomaly conditions based on contextual semantics, while the DT provides structural guidance by capturing feature interactions inherent in the tabular data. This iterative loop ensures the generation of diverse anomaly conditions, ranging from easily detectable outliers to subtle cases near the decision boundary. Extensive experiments on 20 tabular AD benchmarks demonstrate that AutoAnoEval achieves superior model selection performance, with high ranking alignment and minimal performance gaps compared to evaluations on anomalies encountered in practical applications.
Conformal Feedback Alignment: Quantifying Answer-Level Reliability for Robust LLM Alignment
Tiejin Chen | Xiaoou Liu | Vishnu Nandam | Kuan-Ru Liou | Hua Wei
Tiejin Chen | Xiaoou Liu | Vishnu Nandam | Kuan-Ru Liou | Hua Wei
Preference-based alignment like Reinforcement Learning from Human Feedback (RLHF) learns from pairwise preferences, yet the labels are often noisy and inconsistent. Existing uncertainty-aware approaches weight preferences, but ignore a more fundamental factor: the reliability of the answers being compared. To address the problem, we propose Conformal Feedback Alignment (CFA), a framework that grounds preference weighting in the statistical guarantees of Conformal Prediction (CP). CFA quantifies answer-level reliability by constructing conformal prediction sets with controllable coverage and aggregates these reliabilities into principled weights for both DPO- and PPO-style training. Experiments across different datasets show that CFA improves alignment robustness and data efficiency, highlighting that modeling answer-side uncertainty complements preference-level weighting and yields more robust, data-efficient alignment.
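CFA builds on conformal prediction sets. A minimal split-conformal sketch with made-up calibration scores shows how a coverage-controlled prediction set is constructed; CFA-style weighting would then turn set size and coverage into an answer-level reliability signal, which is not reproduced here.

```python
# Minimal split conformal prediction sketch (generic CP, not CFA's full pipeline).
import numpy as np

rng = np.random.default_rng(0)
n_cal, n_classes, alpha = 200, 4, 0.1
# Hypothetical calibration softmax scores and true labels.
cal_probs = rng.dirichlet(np.ones(n_classes), size=n_cal)
cal_labels = rng.integers(0, n_classes, size=n_cal)

# Nonconformity score: 1 - probability assigned to the true label.
scores = 1.0 - cal_probs[np.arange(n_cal), cal_labels]
q_level = np.ceil((n_cal + 1) * (1 - alpha)) / n_cal
qhat = np.quantile(scores, q_level, method="higher")

test_probs = rng.dirichlet(np.ones(n_classes))
prediction_set = np.where(1.0 - test_probs <= qhat)[0]
# A small, confident set suggests a reliable answer for this test example.
print(prediction_set, f"(set size = {len(prediction_set)})")
```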
ThinkPilot: Steering Reasoning Models via Automated Think-prefixes Optimization
Sunzhu Li | Zhiyu Lin | Jiale Zhao | Shuling Yang | Chen Wei
Sunzhu Li | Zhiyu Lin | Jiale Zhao | Shuling Yang | Chen Wei
Large Reasoning Models (LRMs) are powerful, but they still suffer from inefficient and off-target reasoning. Currently, training-free methods are limited to either rigid heuristics or descriptive, non-actionable analyses. In this paper, we introduce ThinkPilot, a training-free framework that automatically optimizes LRM reasoning. It uses an evolutionary process to generate think-prefixes, namely instructions that evolve, driven by a taxonomy of reasoning behaviors, to guide models toward superior performance. Extensive experiments demonstrate ThinkPilot’s broad effectiveness: it significantly improves the accuracy-length trade-off for efficient reasoning, drastically improves safety (e.g., cutting the StrongREJECT score of DeepSeek-R1-Distill-Qwen-32B from 27.0% to 0.7%), and enhances instruction following. It also synergizes with existing training-based methods. Specifically, our analysis reveals that think-prefixes can reliably control LRMs’ reasoning behaviors, and that different tasks have strong preferences for specific behavioral distributions. By automatically identifying and eliciting these behaviors, ThinkPilot provides a generalizable framework for aligning LRM reasoning with task demands.
LitE-SQL: A Lightweight and Efficient Text-to-SQL Framework with Vector-based Schema Linking and Execution-Guided Self-Correction
Shengmin Piao | Jieun Lee | Sanghyun Park
Shengmin Piao | Jieun Lee | Sanghyun Park
The Text-to-SQL task translates natural language questions into SQL queries, enabling intuitive database interaction for non-experts. While recent methods leveraging Large Language Models (LLMs) achieve strong performance, their reliance on proprietary models raises concerns about deployment feasibility and data privacy. In this work, we introduce LitE-SQL, a Lightweight and Efficient framework with two components: (i) a Schema Retriever that performs efficient schema linking using a vector database of pre-computed schema embeddings, optimized with a hard-negative supervised contrastive objective to distinguish semantically similar but functionally irrelevant columns, and (ii) a SQL Generator fine-tuned in two stages—supervised fine-tuning followed by execution-guided reinforcement—enabling execution-guided self-correction without multi-candidate sampling, which is commonly required by prior LLM-based approaches. On BIRD, LitE-SQL achieves 72.10% execution accuracy, and on Spider 1.0 it reaches 88.45%, demonstrating comparable or superior performance to LLM-based methods despite using 2× to 30× fewer parameters. Our findings demonstrate that high-quality Text-to-SQL generation is feasible with lightweight models, offering a practical solution for privacy-sensitive and resource-constrained settings.
Beyond Coherence: Improving Temporal Consistency and Interpretability in Dynamic Topic Models
Thanh Vinh Nguyen | Ngo Van Dong | Minh Chu Xuan | Tung Nguyen | Linh Ngo Van | Dinh Viet Sang | Trung Le
Thanh Vinh Nguyen | Ngo Van Dong | Minh Chu Xuan | Tung Nguyen | Linh Ngo Van | Dinh Viet Sang | Trung Le
Dynamic topic models aim to reveal how themes emerge, evolve, and dissolve in time-stamped corpora, but existing approaches still face three major challenges: (i) encoders capture bag-of-words statistics but fail to align with the rich semantic priors of large pre-trained language models, (ii) temporal linkages are often modeled as rigid one-to-one chains, limiting the ability to track non-linear evolution such as topic splits or merges, and (iii) interpretability remains shallow, relying on noisy top-word lists that obscure thematic clarity. We propose L-DNTM (LLM-Augmented for Dynamic Neural Topic Model), a variational framework designed to capture more faithful temporal trajectories. Our model integrates three key components: multi-objective distillation to inject PLM-derived semantic knowledge into the encoder, entropy-regularized optimal transport to align entire topic constellations across time for smooth yet flexible evolution, and LLM-guided refinement to sharpen topic–word distributions for improved interpretability. Extensive experiments on diverse corpora show that L-DNTM yields more coherent, temporally consistent, and interpretable topic dynamics, and further enhances downstream classification and clustering tasks.
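The entropy-regularized optimal transport component can be illustrated with a small Sinkhorn iteration between two sets of topic embeddings. The dimensions, cost function, and regularization strength below are illustrative and are not the paper's settings.

```python
# Schematic entropy-regularized OT (Sinkhorn) between topics at adjacent time steps.
import numpy as np

rng = np.random.default_rng(0)
k, dim, eps = 5, 16, 0.1
topics_t = rng.standard_normal((k, dim))      # topic embeddings at time t
topics_t1 = rng.standard_normal((k, dim))     # topic embeddings at time t+1

cost = ((topics_t[:, None, :] - topics_t1[None, :, :]) ** 2).sum(-1)
cost = cost / cost.max()                      # scale costs to [0, 1]
K = np.exp(-cost / eps)
a = b = np.full(k, 1.0 / k)                   # uniform topic masses
u = np.ones(k)
for _ in range(200):                          # Sinkhorn iterations
    v = b / (K.T @ u)
    u = a / (K @ v)
plan = u[:, None] * K * v[None, :]            # soft many-to-many topic matching
print(plan.round(3))
```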
Interpretable Graph-Language Modeling for Detecting Youth Illicit Drug Use
Yiyang Li | Zehong Wang | Zhengqing Yuan | Zheyuan Zhang | Keerthiram Murugesan | Chuxu Zhang | Yanfang Ye
Yiyang Li | Zehong Wang | Zhengqing Yuan | Zheyuan Zhang | Keerthiram Murugesan | Chuxu Zhang | Yanfang Ye
Illicit drug use among teenagers and young adults (TYAs) remains a pressing public health concern, with rising prevalence and long-term impacts on health and well-being. To detect illicit drug use among TYAs, researchers analyze large-scale surveys such as the Youth Risk Behavior Survey (YRBS) and the National Survey on Drug Use and Health (NSDUH), which preserve rich demographic, psychological, and environmental factors related to substance use. However, existing modeling methods treat survey variables independently, overlooking latent and interconnected structures among them. To address this limitation, we propose LAMI (LAtent relation Mining with bi-modal Interpretability), a novel joint graph-language modeling framework for detecting illicit drug use and interpreting behavioral risk factors among TYAs. LAMI represents individual responses as relational graphs, learns latent connections through a specialized graph structure learning layer, and integrates a large language model to generate natural language explanations grounded in both graph structures and survey semantics. Experiments on the YRBS and NSDUH datasets show that LAMI outperforms competitive baselines in predictive accuracy. Interpretability analyses further demonstrate that LAMI reveals meaningful behavioral substructures and psychosocial pathways, such as family dynamics, peer influence, and school-related distress, that align with established risk factors for substance use. Our codebase is available here.
Tailoring Memory Granularity for Multi-Hop Reasoning over Long Contexts
Peijun Qing | Xingjian Diao | Chiyu Ma | Saeed Hassanpour | Soroush Vosoughi
Peijun Qing | Xingjian Diao | Chiyu Ma | Saeed Hassanpour | Soroush Vosoughi
Multi-hop reasoning over long contexts remains challenging, as it requires integrating relevant contexts scattered across distant sources while resisting semantic drift and noise from distracting content. While retrieval-augmented generation (RAG) has emerged as the prevailing solution, most RAG approaches encode and store context in monolithic memory representations, resulting in noisy retrieval and brittle reasoning. To overcome these limitations, we introduce TAG (Tailoring Memory Granularity), a framework that prestructures memory into diverse granularities and employs a reward-guided navigator to adaptively compose hybrid memory tailored to each query. The navigator is trained with a multi-objective Bradley–Terry loss that learns the relative utility of different memory types, enabling dynamic routing across granularities. This design allows RAG systems to balance fine-grained detail with high-level abstraction, yielding more reliable reasoning. Extensive experiments on long-context multi-hop question answering (QA) benchmarks show that TAG achieves state-of-the-art performance. With only 0.033% additional parameters, it remains lightweight, highlighting its practicality as a scalable and effective solution for enhancing language model agents in complex, real-world scenarios.
Unlocking Large Audio-Language Models for Interactive Language Learning
Hongfu Liu | Zhouying Cui | Xiangming Gu | Ye Wang
Hongfu Liu | Zhouying Cui | Xiangming Gu | Ye Wang
Achieving pronunciation proficiency in a second language (L2) remains a challenge, despite the development of Computer-Assisted Pronunciation Training (CAPT) systems. Traditional CAPT systems often provide unintuitive feedback that lacks actionable guidance, limiting their effectiveness. Recent advancements in audio-language models (ALMs) offer the potential to enhance these systems by providing more user-friendly feedback. In this work, we investigate ALMs for chat-based pronunciation training by introducing L2-Arctic-plus, an English dataset with detailed error explanations and actionable suggestions for improvement. We benchmark cascaded ASR+LLMs and existing ALMs on this dataset, specifically in detecting mispronunciation and generating actionable feedback. To improve performance, we further propose to instruction-tune ALMs on L2-Arctic-plus. Experimental results demonstrate that our instruction-tuned models significantly outperform existing baselines on mispronunciation detection and suggestion generation in terms of both objective and human evaluation, highlighting the value of the proposed dataset.
Blind Spot Navigation in Large Language Model Reasoning with Thought Space Explorer
Jinghan Zhang | Fengran Mo | Tharindu Cyril Weerasooriya | Xinyue Ye | Dongjie Wang | Yanjie Fu | Kunpeng Liu
Jinghan Zhang | Fengran Mo | Tharindu Cyril Weerasooriya | Xinyue Ye | Dongjie Wang | Yanjie Fu | Kunpeng Liu
Large language models have shown strong reasoning capabilities through chain-structured methods such as Chain-of-Thought. Recent studies optimize thought structures by generating parallel or tree-like structures, switching between long and short reasoning modes, or aligning reasoning steps with task performance. However, these approaches mainly rely on the previously generated logical directions of the chains, ignoring unexplored regions of the solution space. We denote these unexplored regions as blind spots, which limit the diversity and effectiveness of the reasoning process. To this end, we propose the “Thought Space Explorer” (TSE), a framework for navigating and expanding thought structures to overcome blind spots in LLM reasoning. Our TSE first identifies key nodes with high impact, then generates new nodes by integrating information from multiple chains. Finally, it extends new branches through connection strategies. We conduct a series of experiments on math and QA benchmarks. Compared to existing baseline methods, TSE improves the accuracy of both the final answer and intermediate reasoning steps, while maintaining a better effectiveness-efficiency trade-off for practical deployment.
StrucSum: Graph-Structured Reasoning for Long Document Extractive Summarization with LLMs
Haohan Yuan | Sukhwa Hong | Haopeng Zhang
Haohan Yuan | Sukhwa Hong | Haopeng Zhang
Large language models (LLMs) have shown strong performance in zero-shot summarization, but often struggle to model document structure and identify salient information in long texts. In this work, we introduce StrucSum, a training-free prompting framework that enhances LLM reasoning through sentence-level graph structures. StrucSum injects structural signals into prompts via three targeted strategies: Neighbor-Aware Prompting (NAP) for local context, Centrality-Aware Prompting (CAP) for importance estimation, and Centrality-Guided Masking (CGM) for efficient input reduction. Experiments on ArXiv, PubMed, and Multi-News demonstrate that StrucSum consistently improves both summary quality and factual consistency over unsupervised baselines and vanilla prompting. In particular, on ArXiv, it increases FactCC and SummaC by 19.2 and 8.0 points, demonstrating stronger alignment between summaries and source content. While the ablation study shows that combining multiple strategies does not yield clear additional gains, structure-aware prompting with graph-based information remains a promising and underexplored direction for advancing zero-shot extractive summarization with LLMs.
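Centrality-Aware Prompting relies on sentence-graph centrality scores. The sketch below builds a toy word-overlap similarity graph and injects PageRank scores into prompt lines; it only approximates the idea and does not use StrucSum's actual graph features or prompt templates.

```python
# Toy centrality-aware prompting: PageRank over a word-overlap sentence graph.
import networkx as nx

sentences = [
    "The new model compresses long documents efficiently.",
    "Compression of long documents is handled by the new model.",
    "The authors also thank their funding agencies.",
]

def overlap(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

g = nx.Graph()
for i, si in enumerate(sentences):
    for j in range(i + 1, len(sentences)):
        w = overlap(si, sentences[j])
        if w > 0:
            g.add_edge(i, j, weight=w)

scores = nx.pagerank(g, weight="weight")
# Surface the centrality of each sentence directly in the prompt text.
prompt_lines = [f"[centrality={scores.get(i, 0):.2f}] {s}" for i, s in enumerate(sentences)]
print("\n".join(prompt_lines))
```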
Logits-Based Block Pruning with Affine Transformations for Large Language Models
Zekun Hu | Yichu Xu | De-Chuan Zhan
Zekun Hu | Yichu Xu | De-Chuan Zhan
As the scale of Large Language Models (LLMs) continues to grow rapidly, the cost of training and inference has significantly increased, limiting their application in resource-constrained scenarios. To address this challenge, model pruning has been widely used to reduce computational complexity. Among various pruning strategies, block-wise pruning has gained popularity due to its ability to accelerate computation by removing entire blocks of parameters. However, existing methods often rely on hard labels from calibration datasets and neglect the cumulative effects of pruning on subsequent blocks. To address this, we propose two complementary techniques: the Logit Disruption Score (LDS), a novel block importance criterion that measures the impact of pruning by comparing the cosine similarity between the logits of the original and pruned models, focusing on the most informative logit dimensions to better preserve the model’s core capabilities; and Activation Statistics Correction (ASC), an affine transformation mechanism that aligns the mean and variance of activations in the pruned model with those of the original model, effectively mitigating the distribution shift caused by block removal and improving the information flow in subsequent blocks. Experiments across multiple datasets show that our approach reduces reliance on calibration data and improves generalization, achieving competitive results with existing methods.
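Both proposed components lend themselves to a compact numerical illustration: a cosine-based disruption score computed over the top logit dimensions, and an affine mean/variance correction of activations. The data, dimensions, and top-k choice below are synthetic assumptions, not the paper's settings.

```python
# Illustrative sketch of a logit-disruption score and an activation-statistics
# correction (schematic, not the exact LDS/ASC formulation).
import numpy as np

rng = np.random.default_rng(0)
vocab, k = 1000, 50
logits_orig = rng.standard_normal(vocab)
logits_pruned = logits_orig + 0.3 * rng.standard_normal(vocab)

top = np.argsort(-logits_orig)[:k]          # most informative logit dimensions
a, b = logits_orig[top], logits_pruned[top]
lds = 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"disruption score (lower = safer to prune): {lds:.4f}")

# ASC-style correction: match the pruned activations' mean/std to the original's.
acts_orig = 2.0 + 1.5 * rng.standard_normal(4096)
acts_pruned = 0.5 + 3.0 * rng.standard_normal(4096)
corrected = (acts_pruned - acts_pruned.mean()) / acts_pruned.std()
corrected = corrected * acts_orig.std() + acts_orig.mean()
print(f"corrected mean/std: {corrected.mean():.2f}, {corrected.std():.2f}")
```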
MultiCW: A Large-Scale Balanced Benchmark Dataset for Training Robust Check-Worthiness Detection Models
Martin Hyben | Sebastian Kula | Jan Cegin | Jakub Simko | Ivan Srba | Robert Moro
Martin Hyben | Sebastian Kula | Jan Cegin | Jakub Simko | Ivan Srba | Robert Moro
Large language models (LLMs) are beginning to reshape how media professionals verify information, yet automated support for detecting check-worthy claims—a key step in the fact-checking process—remains limited. We introduce the Multi-Check-Worthy (MultiCW) dataset, a balanced multilingual benchmark for check-worthy claim detection spanning 16 languages, six topical domains, and two writing styles. It consists of 123,722 samples, evenly distributed between noisy (informal) and structured (formal) texts, with balanced representation of check-worthy and non-check-worthy classes across all languages. To probe robustness, we also introduce an equally balanced out-of-distribution evaluation set of 27,761 samples in 4 additional languages. To provide baselines, we benchmark three common fine-tuned multilingual transformers against a diverse set of 15 commercial and open LLMs under zero-shot settings. Our findings show that fine-tuned models consistently outperform zero-shot LLMs on claim classification and show strong out-of-distribution generalization across languages, domains, and styles. MultiCW provides a rigorous multilingual resource for advancing automated fact-checking and enables systematic comparisons between fine-tuned models and cutting-edge LLMs on the check-worthy claim detection task.
What Really Matters for Table LLMs? A Meta-Evaluation of Model and Data Effects
Naihao Deng | Sheng Zhang | Henghui Zhu | Shuaichen Chang | Jiani Zhang | Alexander Hanbo Li | Chung-Wei Hang | Hideo Kobayashi | Yiqun Hu | Patrick Ng
Naihao Deng | Sheng Zhang | Henghui Zhu | Shuaichen Chang | Jiani Zhang | Alexander Hanbo Li | Chung-Wei Hang | Hideo Kobayashi | Yiqun Hu | Patrick Ng
Table modeling has progressed for decades. In this work, we revisit this trajectory and highlight emerging challenges in the LLM era, particularly the paradox of choice: the difficulty of attributing performance gains amid diverse base models and training sets in the context of table instruction tuning. We replicate four table LLMs by instruction-tuning three foundation models on four existing datasets, yielding 12 models. We then evaluate these models across 16 table benchmarks. Our study is the first to quantitatively disentangle the effects of training data and base model selection, revealing that base model choice plays a more dominant role than the training data itself. Generalization and reasoning remain challenging, inviting future effort on table modeling. Based on our findings, we share our thoughts on the future directions for table modeling.
Evaluating Morphological Plausibility of Subword Tokenization via Statistical Alignment with Morpho-Syntactic Features
Abishek Stephen | Jindřich Libovický
Abishek Stephen | Jindřich Libovický
We present a novel metric for the evaluation of morphological plausibility of subword segmentation. Unlike the typically used morpheme boundary or retrieval F-score, which requires gold segmentation data that is either unavailable or of inconsistent quality across many languages, our approach utilizes morpho-syntactic features. These are available in resources such as Universal Dependencies or UniMorph for a much wider range of languages. The metric works by probabilistically aligning subwords with morphological features through an IBM Model 1. Our experiments show that the metric correlates well with traditional morpheme boundary recall while being more broadly applicable across languages with different morphological systems.
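For readers unfamiliar with IBM Model 1, here is a toy EM sketch in one possible alignment direction, estimating p(feature | subword) from (subword sequence, feature set) pairs. The direction, initialization, and omission of a NULL word are our assumptions, not necessarily the paper's setup.

```python
# Toy IBM Model 1 EM over (subwords, morpho-syntactic features) pairs.
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    # pairs: list of (subwords, features), e.g. (["run", "##ning"], ["VerbForm=Part"])
    t = defaultdict(lambda: 1e-3)                      # t[(feature, subword)]
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for subwords, features in pairs:
            for f in features:
                z = sum(t[(f, s)] for s in subwords)   # normalization per feature
                for s in subwords:
                    delta = t[(f, s)] / z
                    count[(f, s)] += delta
                    total[s] += delta
        for (f, s), c in count.items():                # M-step
            t[(f, s)] = c / total[s]
    return t

t = ibm_model1([(["run", "##ning"], ["VerbForm=Part"]),
                (["run", "##s"], ["Number=Sing", "Person=3"])])
print(t[("VerbForm=Part", "##ning")])   # converges toward 1.0 on this toy example
```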
MOSAIC: Masked Objective with Selective Adaptation for In-domain Contrastive Learning
Vera Pavlova | Mohammed Makhlouf
Vera Pavlova | Mohammed Makhlouf
We introduce MOSAIC (Masked Objective with Selective Adaptation for In-domain Contrastive learning), a multi-stage framework for domain adaptation of text embedding models that incorporates joint domain-specific masked supervision. Our approach addresses the challenges of adapting large-scale general-domain text embedding models to specialized domains. By jointly optimizing masked language modeling (MLM) and contrastive objectives within a unified training pipeline, our method enables effective learning of domain-relevant representations while preserving the robust semantic discrimination properties of the original model. We empirically validate our approach on both high-resource and low-resource domains, achieving improvements up to 13.4% in NDCG@10 (Normalized Discounted Cumulative Gain) over strong general-domain baselines. Comprehensive ablation studies further demonstrate the effectiveness of each component, highlighting the importance of balanced joint supervision and staged adaptation.
BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage
Kalyan Nakka | Nitesh Saxena
Kalyan Nakka | Nitesh Saxena
The inherent risk of Large Language Models (LLMs) generating harmful and unsafe content has highlighted the need for their safety alignment. Various techniques, such as supervised fine-tuning, reinforcement learning from human feedback, and red-teaming, have been developed to ensure the safety alignment of LLMs. However, the robustness of these aligned LLMs is continually challenged by adversarial attacks that exploit unexplored and underlying vulnerabilities of the safety alignment. In this paper, we develop a novel black-box jailbreak attack, called BitBypass, that leverages hyphen-separated bitstream camouflage for jailbreaking aligned LLMs. This represents a new direction in jailbreaking: exploiting the fundamental representation of data as a continuous stream of bits, rather than relying on prompt engineering or adversarial manipulations. Our evaluation of five state-of-the-art LLMs, namely GPT-4o, Gemini 1.5, Claude 3.5, Llama 3.1, and Mixtral, from an adversarial perspective revealed the capability of BitBypass to bypass their safety alignment and trick them into generating harmful and unsafe content. Further, we observed that BitBypass outperforms several state-of-the-art jailbreak attacks in terms of stealthiness and attack success. Overall, these results highlight the effectiveness and efficiency of BitBypass in jailbreaking these state-of-the-art LLMs.
The Problem of Ambiguity in Table Question Answering
Jorge Osés Grijalba | L. Alfonso Ureña | Eugenio Martínez-Cámara | Jose Camacho-Collados
Jorge Osés Grijalba | L. Alfonso Ureña | Eugenio Martínez-Cámara | Jose Camacho-Collados
Question Answering on Tabular Data (or Table Question Answering) has seen tremendous advances with the advent of new-generation Large Language Models (LLMs). Despite this, significant challenges remain to be solved if we are to develop approaches robust enough for general usage. One of these is ambiguity in question answering, which historically has not merited much attention due to the previously limited capabilities of LLMs. In this work, we lay out the main types of ambiguity inherent to tabular data. Then, we discuss how they are influenced by the way our models interact with the information stored in the tables, and we test the capabilities of some LLMs in detecting them. This work provides an initial ground for a deeper discussion on how to approach ambiguity in tabular data in the age of LLMs.
Beyond Multiple Choice: Evaluating Steering Vectors for Summarization
Joschka Braun | Carsten Eickhoff | Seyed Ali Bahrainian
Joschka Braun | Carsten Eickhoff | Seyed Ali Bahrainian
Steering vectors are a lightweight method for controlling text properties by adding a learned bias to language model activations at inference time. While predominantly studied for multiple-choice and toy tasks, their effectiveness in free-form generation remains largely unexplored. Moving "Beyond Multiple Choice," we evaluate steering vectors for controlling topical focus, sentiment, toxicity, and readability in abstractive summaries across the SAMSum, NEWTS, and arXiv datasets. We find that steering effectively controls targeted properties, but high steering strengths consistently induce degenerate repetition and factual hallucinations. Prompting alone preserves summary quality but offers weaker control. Combining both methods yields the strongest control and the most favorable efficacy-quality trade-off at moderate steering strengths. Our work demonstrates that steering vectors face a critical control-quality trade-off in free-form generation, and that hybrid approaches offer the best balance in practice.
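The mechanism itself is simple to illustrate: add a fixed vector, scaled by a strength coefficient, to a chosen layer's hidden states at inference time. The sketch below uses a random stand-in vector, GPT-2, and an arbitrary mid-depth layer purely for illustration; in practice the steering vector is learned and the layer and strength are tuned.

```python
# Minimal sketch of inference-time activation steering via a forward hook.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                                   # any causal LM works here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

steering_vector = torch.randn(model.config.hidden_size) * 0.01   # stand-in for a learned vector
strength = 4.0

def add_steering(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + strength * steering_vector.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

layer = model.transformer.h[6]                        # arbitrary mid-depth block for GPT-2
handle = layer.register_forward_hook(add_steering)

ids = tok("The meeting summary:", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=30, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```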
Similar Region Search using LLMs on Spatial Feature Space
Al-Amin Sany | Mohaiminul Islam | Tanzima Hashem | Md. Ashraful Islam | Mohammed Eunus Ali
Al-Amin Sany | Mohaiminul Islam | Tanzima Hashem | Md. Ashraful Islam | Mohammed Eunus Ali
Understanding regional similarities is crucial for applications such as urban planning, tourism recommendations, business expansion, and disease prevention. While spatial data, including POI distributions, check-in activity, and building footprints, offer valuable insights, existing similarity methods—based on distance metrics, embeddings, or deep metric learning—fail to capture the contextual richness and adapt to heterogeneous spatial data. To overcome these limitations, we introduce a novel similar region search framework that ranks candidate regions based on their similarity to a query region using large language models. To further enhance performance, we fine-tune the model through self-supervised learning by introducing controlled noise into spatial data. This generates similar and dissimilar samples without relying on extensive labeled data. By transforming spatial data into natural language descriptions, our method seamlessly integrates heterogeneous datasets without requiring structural modifications, ensuring scalability across diverse urban contexts. Experiments on multiple real-world city datasets, including cross-city evaluation, demonstrate that our framework significantly outperforms state-of-the-art methods in both accuracy and ranking performance.
Learning to Ask: Multi-Decoder Fine-Tuning for Multi-Hop Visual Question Generation with External Knowledge
Arpan Phukan | Manish Gupta | Asif Ekbal
Arpan Phukan | Manish Gupta | Asif Ekbal
SLANG-GraphRAG: Multi-Layered Retrieval with Domain-Specific Knowledge for Low Resource Social Media Conversations
Ifeoluwa Wuraola | Daniel Marciniak | Nina Dethlefs
Ifeoluwa Wuraola | Daniel Marciniak | Nina Dethlefs
Emotion classification on social media is especially difficult when texts include informal, culturally grounded language like slang. Standard NLP benchmarks often miss these nuances, particularly in low-resource settings. We present SLANG-GraphRAG, a retrieval-augmented framework that integrates a culture-specific slang knowledge graph into large language models via one-shot prompting. Using multiple retrieval strategies, we incorporate slang definitions, regional usage, and conversational context. Our results show that incorporating structured cultural knowledge into the retrieval process leads to significant gains, improving accuracy by up to 31% and F1 score by 28%, and outperforming traditional and unstructured retrieval methods. To better evaluate model behavior, we propose a probabilistic metric that reflects the distribution of human annotations, providing a more nuanced measure of performance. This highlights the value of culturally sensitive applications and more balanced evaluation in subjective NLP tasks.
Science Across Languages: Assessing LLM Multilingual Translation of Scientific Papers
Hannah Calzi Kleidermacher | James Zou
Hannah Calzi Kleidermacher | James Zou
Scientific research is inherently global. However, the vast majority of academic journals are published exclusively in English, creating barriers for non-native-English-speaking researchers. In this study, we leverage large language models (LLMs) to translate published scientific articles while preserving their native JATS XML formatting, thereby developing a practical, automated approach for implementation by academic journals. Using our approach, we translate articles across multiple scientific disciplines into 28 languages. To evaluate translation accuracy, we introduce a novel question-and-answer (QA) benchmarking method and show an average performance of 95.9%, indicating that the key scientific details are accurately conveyed. In a user study, we translate the scientific papers of 15 researchers into their native languages. Interestingly, a third of the authors found many technical terms “overtranslated,” expressing a preference to keep terminology more familiar in English untranslated. Finally, we demonstrate how in-context learning techniques can be used to align translations with domain-specific preferences such as mitigating overtranslation, highlighting the adaptability and utility of LLM-driven scientific translation.
TABED: Test-Time Adaptive Ensemble Drafting for Robust Speculative Decoding in LVLMs
Minjae Lee | Wonjun Kang | Byeongkeun Ahn | Christian Classen | Kevin Galim | Seunghyuk Oh | Minghao Yan | Hyung Il Koo | Kangwook Lee
Minjae Lee | Wonjun Kang | Byeongkeun Ahn | Christian Classen | Kevin Galim | Seunghyuk Oh | Minghao Yan | Hyung Il Koo | Kangwook Lee
Speculative decoding (SD) has proven effective for accelerating LLM inference by quickly generating draft tokens and verifying them in parallel. However, SD remains largely unexplored for Large Vision-Language Models (LVLMs), which extend LLMs to process both image and text prompts. To address this gap, we benchmark existing inference methods with small draft models on 11 datasets across diverse input scenarios and observe scenario-specific performance fluctuations. Motivated by these findings, we propose **Test-time Adaptive Batched Ensemble Drafting (TABED)**, which dynamically ensembles multiple drafts obtained via batch inference by leveraging deviations from past ground truths available in the SD setting. The dynamic ensemble method achieves an average robust walltime speedup of 1.74× over autoregressive decoding and a 5% improvement over single drafting methods, while remaining training-free and keeping ensembling costs negligible through parameter sharing. With its plug-and-play compatibility, we further enhance TABED by integrating advanced verification and alternative drafting methods. Code and custom-trained models are available at https://github.com/furiosa-ai/TABED.
KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge
Alex Robertson | Huizhi Liang | Mahbub Gani | Rohit Kumar | Srijith Rajamohan
Alex Robertson | Huizhi Liang | Mahbub Gani | Rohit Kumar | Srijith Rajamohan
Large Language Models (LLMs) possess a remarkable capacity to generate persuasive and intelligible language. However, coherence does not equate to truthfulness, as the responses often contain subtle hallucinations. Existing benchmarks are limited by static and narrow questions, leading to limited coverage and misleading evaluations. We present **KGHaluBench**, a Knowledge Graph-based hallucination benchmark that assesses LLMs across the breadth and depth of their knowledge, providing a fairer and more comprehensive insight into LLM truthfulness. Our framework utilises the KG to dynamically construct challenging, multifaceted questions, whose difficulty is then statistically estimated to address popularity bias. Our automated verification pipeline detects abstentions and verifies the LLM’s response at both conceptual and correctness levels to identify different types of hallucinations. We evaluate 25 frontier models, using novel accuracy and hallucination metrics. The results provide a more interpretable insight into the knowledge factors that cause hallucinations across different model sizes. KGHaluBench is publicly available to support future developments in hallucination mitigation.
Marking Code Without Breaking It: Code Watermarking for Detecting LLM-Generated Code
Jungin Kim | Shinwoo Park | Yo-Sub Han
Jungin Kim | Shinwoo Park | Yo-Sub Han
Identifying LLM-generated code through watermarking poses a challenge in preserving functional correctness. Previous methods rely on the assumption that watermarking high-entropy tokens effectively maintains output quality. Our analysis reveals a fundamental limitation of this assumption: syntax-critical tokens such as keywords often exhibit the highest entropy, making existing approaches vulnerable to logic corruption. We present STONE, a syntax-aware watermarking method that embeds watermarks only in non-syntactic tokens and preserves code integrity. For rigorous evaluation, we also introduce STEM, a comprehensive metric that balances three critical dimensions: correctness, detectability, and imperceptibility. Across Python, C++, and Java, STONE preserves correctness, sustains strong detectability, and achieves balanced performance with minimal computational overhead. Our implementation is available at https://github.com/inistory/STONE-watermarking.
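To make the syntax-aware idea concrete, the sketch below applies a standard green-list logit bias only when a candidate token is not a language keyword or punctuation mark, so syntax-critical tokens are left untouched. The hashing scheme and the gamma/delta constants follow common LLM-watermarking practice and are our assumptions, not necessarily STONE's exact design.

```python
# Sketch of a syntax-aware green-list watermark (illustrative, not the paper's algorithm).
import hashlib
import keyword
import torch

PYTHON_SYNTAX = set(keyword.kwlist) | set("()[]{}:,.=+-*/%<>")

def green_mask(prev_token_id, vocab_size, gamma=0.5):
    seed = int(hashlib.sha256(str(prev_token_id).encode()).hexdigest(), 16) % (2**31)
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(vocab_size, generator=g)
    mask = torch.zeros(vocab_size, dtype=torch.bool)
    mask[perm[: int(gamma * vocab_size)]] = True
    return mask

def watermark_logits(logits, prev_token_id, token_strings, delta=2.0):
    # logits: 1-D tensor over the vocabulary; token_strings: decoded vocab entries.
    mask = green_mask(prev_token_id, logits.shape[-1])
    non_syntactic = torch.tensor(
        [tok.strip() not in PYTHON_SYNTAX for tok in token_strings])
    logits = logits.clone()
    logits[mask & non_syntactic] += delta      # bias only non-syntactic green tokens
    return logits
```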
VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval
Diogo Glória-Silva | David Semedo | Joao Magalhaes
Diogo Glória-Silva | David Semedo | Joao Magalhaes
We introduce VIGiA, a novel multimodal dialogue model designed to understand and reason over complex, multi-step instructional video action plans. Unlike prior work which focuses mainly on text-only guidance, or treats vision and language in isolation, VIGiA supports grounded, plan-aware dialogue that requires reasoning over visual inputs, instructional plans, and interleaved user interactions. To this end, VIGiA incorporates two key capabilities: (1) multimodal plan reasoning, enabling the model to align uni- and multimodal queries with the current task plan and respond accurately; and (2) plan-based retrieval, allowing it to retrieve relevant plan steps in either textual or visual representations. Experiments were done on a novel dataset with rich Instructional Video Dialogues aligned with Cooking and DIY plans. Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90% accuracy on plan-aware VQA.
Attribute-Controlled Translation with Preference Optimization
Inigo Jauregi Unanue | Najmeh Sadoughi | Vimal Bhat | Zhu Liu | Massimo Piccardi
Inigo Jauregi Unanue | Najmeh Sadoughi | Vimal Bhat | Zhu Liu | Massimo Piccardi
Attribute-controlled translation (ACT) seeks to produce translations that satisfy specific constraints on linguistic and stylistic attributes. While careful prompt engineering can enable large language models to perform strongly in this task, its effectiveness is mainly limited to models of very large size. For this reason, in this paper we set out to improve the performance of language models of more contained size by leveraging the contrastive nature of ACT tasks with preference optimization, as well as exploiting knowledge distillation with synthetically generated training samples from larger models. As a resource for this investigation, we also introduce PREF-FAME-MT, a large, contrastive, formality-controlled parallel corpus which has been generated by expanding the existing FAME-MT dataset with synthetic contrastive samples. Experiments conducted over three datasets for formality- and gender-controlled translation with 71 distinct language pairs have demonstrated the effectiveness of the proposed approach at simultaneously improving attribute matching and translation quality. We release all our code and datasets to allow reproduction and expansion of our work.
ReciFine: Finely Annotated Recipe Dataset for Controllable Recipe Generation
Nuhu Ibrahim | Rishi Ravikumar | Robert Stevens | Riza Batista-Navarro
Nuhu Ibrahim | Rishi Ravikumar | Robert Stevens | Riza Batista-Navarro
We introduce ReciFine, the largest human-evaluated, finely annotated recipe dataset to date, designed to advance controllable and trustworthy recipe generation. Existing resources, such as RecipeNLG, extract food items only from ingredient lists, overlooking entities expressed in instructions, including tools, chef actions, food and tool states, and durations, which are crucial for realistic and context-aware generation. To address this limitation, we extend RecipeNLG with finely annotated extraction of over 97 million entities across ten entity types from 2.2 million recipes. We are the first to explore recipe generation with explicit control over multiple entity types, enabling models to generate recipes conditioned not only on ingredients but also on tools, chef actions, cooking durations, and other contextual factors. Large language models fine-tuned or few-shot prompted with ReciFine extractions consistently outperform those trained on ingredient-list data alone across both automatic and human evaluations. ReciFine establishes a foundation for factual, coherent, structured, controllable recipe generation, and we release a human-annotated benchmark to support future evaluation and model development.
ReBPE: Iteratively Improving the Internal Structure of a Structured Tokeniser by Mining its Internal Structure
Thomas Bauwens | Miryam de Lhoneux
Thomas Bauwens | Miryam de Lhoneux
Recent work has explored pruning merges from BPE subword tokenisers using corpus data as a signal for which merges to prune. We argue that because a BPE tokeniser contains a rich data structure on top of its vocabulary set, this in itself can be used as a guide to modify its merges such that segmentations become more desirable. We apply this argument to one of those pruning algorithms, BPE-knockout, by introducing a new reification step that suggests new merges by inspecting the effects left by pruning. By alternating both processes iteratively until convergence, we get a new BPE tokeniser, ReBPE, which outperforms the original BPE-knockout algorithm on morphological alignment in all 14 languages tested by over 11% F1 on average.
Emotion Recognition in Multi-Speaker Conversations through Speaker Identification, Knowledge Distillation, and Hierarchical Fusion
Li Xiao | Kotaro Funakoshi | Manabu Okumura
Li Xiao | Kotaro Funakoshi | Manabu Okumura
Emotion recognition in multi-speaker conversations faces significant challenges due to speaker ambiguity and severe class imbalance. We propose a novel framework that addresses these issues through three key innovations: (1) a speaker identification module that leverages audio-visual synchronization to accurately identify the active speaker, (2) a knowledge distillation strategy that transfers superior textual emotion understanding to audio and visual modalities, and (3) hierarchical attention fusion with composite loss functions to handle class imbalance. Comprehensive evaluations on MELD and IEMOCAP datasets demonstrate superior performance, achieving 67.75% and 72.44% weighted F1 scores respectively, with particularly notable improvements on minority emotion classes.
Demystifying Mixed Outcomes of Self-Training: Pre-training Analyses on Non-Toy LLMs
Yusuke Nakamura | Hirokazu Kiyomaru | Chaoran Liu | Shuhei Kurita | Daisuke Kawahara
Yusuke Nakamura | Hirokazu Kiyomaru | Chaoran Liu | Shuhei Kurita | Daisuke Kawahara
We investigate whether large language models (LLMs) can improve through recursive training on self-generated text, a topic where prior studies report conflicting outcomes: some find evidence of performance gains (i.e., self-improvement), while others observe performance degradation (i.e., model collapse). To clarify this discrepancy, we use the OLMo-2 models as non-toy LLMs and perform multiple rounds of continual pre-training using self-generated text with different prompting strategies and data filtering. Our experiments show that naive recursive self-training does not improve either perplexity or downstream task performance, regardless of model size. These results suggest that model collapse observed in naive recursive training is inherent to the training procedure itself, while self-improvement likely owes its success not to the model’s autonomous refinement but to human-designed, strategic synthetic pipelines that inject external intelligence.
Revealing Redundant Syntax in Large Language Models through Multi-Hop Dependency Paths
Masaki Sashida | Takeshi Kojima | Yusuke Iwasawa | Yutaka Matsuo
Masaki Sashida | Takeshi Kojima | Yusuke Iwasawa | Yutaka Matsuo
Prior work on attention–syntax alignment has largely focused on single-hop Universal Dependency edges (DPs). In this paper, we treat short multi-hop dependency paths (MDPs) (e.g., “obl+case”) as first-class units and analyze them alongside DPs. Across three pretrained autoregressive LMs (GPT-2 XL, Llama 3 8B, Qwen3-8B) and one encoder baseline (BERT-large), we extract 2–3 hop MDPs from UD-parsed English and quantify head–relation alignment with an Unlabeled Attachment Score (UAS)–style metric modified for causal masking in decoder-only models. Rank visualizations reveal both overlap and specialization: we observe heads that align with both DPs and MDPs, as well as heads that appear specialized for one route. To test functional relevance, we first identify heads by UAS and then apply an undifferentiated (uniform) attention ablation to those heads; we evaluate the impact on BLiMP and LAMBADA. Ablating the top 10% of all heads shows that MDP-selected heads induce larger drops than DP-selected heads and that the union (“Mix”) of DP- and MDP-selected heads yields the largest drops. For GPT-2 XL, the observed drops are (BLiMP: 𝛥DP = 1.35 pp, 𝛥MDP = 4.81 pp, 𝛥Mix = 7.11 pp; LAMBADA: 𝛥DP = 4.70 pp, 𝛥MDP = 25.17 pp, 𝛥Mix = 32.99 pp), all exceeding size-matched random controls. These results indicate that models can route information consistent with syntactic dependencies via both DP and MDP pathways, with MDPs playing a distinct and measurable role in some settings under our interventions.
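For readers unfamiliar with multi-hop dependency paths, the toy function below extracts 2-hop path units (e.g., "obl+case") from a UD-style parse given as (token, head index, deprel) triples, with 1-based indices and 0 as root, as in CoNLL-U. The path-naming convention is our assumption for illustration.

```python
# Toy extraction of 2-hop dependency-path units from a UD-style parse.
def two_hop_paths(parse):
    heads = {i + 1: h for i, (_, h, _) in enumerate(parse)}
    rels = {i + 1: r for i, (_, _, r) in enumerate(parse)}
    paths = []
    for dep in heads:
        mid = heads[dep]
        if mid == 0 or heads.get(mid, 0) == 0:
            continue
        top = heads[mid]
        # path from grandparent `top` down to `dep`: rel(mid<-top) + rel(dep<-mid)
        paths.append((top, dep, f"{rels[mid]}+{rels[dep]}"))
    return paths

# "She slept in the house": house --obl--> slept, in --case--> house
parse = [("She", 2, "nsubj"), ("slept", 0, "root"), ("in", 5, "case"),
         ("the", 5, "det"), ("house", 2, "obl")]
print(two_hop_paths(parse))   # includes (2, 3, 'obl+case') linking "slept" and "in"
```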
A Scalable Framework for Automated NER Annotation Correction in Low-Resource Languages
Toqeer Ehsan | Thamar Solorio
Toqeer Ehsan | Thamar Solorio
Poor quality or noisy annotations in Named Entity Recognition (NER), as in any other NLP task, make it challenging to achieve state-of-the-art performance. In this paper, we present a multi-step framework to enhance the annotation quality of NER datasets by employing automated techniques. We propose a frequency-based iterative approach that leverages self-training and a dual-threshold mechanism to enhance inference confidence. Experimental evaluations on different NER datasets demonstrate significant improvements in NER performance with respect to the original datasets. This work further explores the potential of generative Large Language Models (LLMs) to perform NER for low-resource languages.
Can ChatGPT Really Understand Modern Chinese Poetry?
Shanshan Wang | Derek F. Wong | Jingming Yao | Lidia S. Chao
Shanshan Wang | Derek F. Wong | Jingming Yao | Lidia S. Chao
ChatGPT has demonstrated remarkable capabilities on both poetry generation and translation, yet its ability to truly understand poetry remains unexplored. Previous poetry-related work merely analyzed experimental outcomes without addressing fundamental issues of comprehension. This paper introduces a comprehensive framework for evaluating ChatGPT’s understanding of modern poetry. We collaborated with professional poets to evaluate ChatGPT’s interpretation of unpublished modern Chinese poems by different poets along multiple dimensions. Evaluation results show that ChatGPT’s interpretations align with the original poets’ intents in over 73% of the cases. However, its understanding in certain dimensions, particularly in capturing poeticity, proved to be less satisfactory. These findings highlight the effectiveness and necessity of our proposed framework. This study not only evaluates ChatGPT’s ability to understand modern poetry but also establishes a solid foundation for future research on LLMs and their application to poetry-related tasks.
Knowing What’s Missing: Assessing Information Sufficiency in Question Answering
Akriti Jain | Aparna Garimella
Akriti Jain | Aparna Garimella
Determining whether a provided context contains sufficient information to answer a question is a critical challenge for building reliable question-answering systems. While simple prompting strategies have shown success on factual questions, they frequently fail on inferential ones that require reasoning beyond direct text extraction. We hypothesize that asking a model to first reason about what specific information is missing provides a more reliable, implicit signal for assessing overall sufficiency. To this end, we propose a structured Identify-then-Verify framework for robust sufficiency modeling. Our method first generates multiple hypotheses about missing information and establishes a semantic consensus. It then performs a critical verification step, forcing the model to re-examine the source text to confirm whether this information is truly absent. We evaluate our method against established baselines across diverse multi-hop and factual QA datasets. The results demonstrate that by guiding the model to justify its claims about missing information, our framework produces more accurate sufficiency judgments while clearly articulating any information gaps.
The Curse of Verbalization: How Presentation Order Constrains LLM Reasoning
Yue Zhou | Henry Peng Zou | Barbara Di Eugenio | Yang Zhang
Yue Zhou | Henry Peng Zou | Barbara Di Eugenio | Yang Zhang
This paper delves into the factors that contribute to the difficulty of problems for large language models (LLMs). We begin with a pilot test evaluating LLMs’ understanding of esoteric programming languages and find that LLMs struggle significantly when programs execute in an order that is unaligned with how the program is presented. This phenomenon leads to the hypothesis that LLM performance on reasoning correlates with the alignment between the order in which information is presented and the order in which it should be utilized. We demonstrate that this hypothesis holds broadly in mathematical reasoning: restructuring problems to align the order of information presentation with the order of utilization consistently improves performance across state-of-the-art models. We conjecture this occurs because LLMs acquire a strong tendency to verbalize information in presentation order during training on human text, a tendency detrimental in reasoning domains where the optimal utilization order often diverges from the presentation order. To provide further evidence, we construct pseudo-mathematical problems with nonsensical terms and quantify the verbalization flexibility of LLMs without interference from mathematical knowledge. Across twelve representative LLMs, we find that this flexibility exhibits a strong correlation (ρ = 0.87) with general reasoning performance rankings on LMArena.
PATS: Personality-Aware Teaching Strategies with Large Language Model Tutors
Donya Rooein | Sankalan Pal Chowdhury | Mariia Eremeeva | Yuan Qin | Debora Nozza | Mrinmaya Sachan | Dirk Hovy
Donya Rooein | Sankalan Pal Chowdhury | Mariia Eremeeva | Yuan Qin | Debora Nozza | Mrinmaya Sachan | Dirk Hovy
Recent advances in large language models (LLMs) demonstrate their potential as educational tutors. However, different tutoring strategies benefit different student personalities, and mismatches can be counterproductive to student outcomes. Despite this, current LLM tutoring systems do not take into account student personality traits. To address this problem, we first construct a taxonomy that links pedagogical methods to personality profiles, based on pedagogical literature. We simulate student-teacher conversations and use our framework to let the LLM tutor adjust its strategy to the simulated student personality. We evaluate the scenario with human teachers and find that they consistently prefer our approach over two baselines. Our method also increases the use of less common, high-impact strategies such as role-playing, which human and LLM annotators prefer significantly. Our findings pave the way for developing more personalized and effective LLM use in educational applications.
Mitigating Causal Bias in LLMs via Potential Outcomes Framework and Actual Causality Theory
Yiheng Zhao | Yuanliang Li | Shreya Savant | Jun Yan
Yiheng Zhao | Yuanliang Li | Shreya Savant | Jun Yan
Event Causality Identification (ECI) aims to identify causal relationships between events, which is essential for root cause analysis. While recent studies reveal that Large Language Models (LLMs) exhibit significant causal hallucination, a systematic evaluation of their document-level ECI performance across varied structural characteristics and a corresponding dataset is currently lacking. To fill this gap, we first construct a structure-controlled dataset to comprehensively assess their document-level ECI performance across texts with various structural characteristics that influence the causal behaviors in ECI. We find that different LLMs exhibit divergent causal bias across texts with varied structures, ranging from consistent hallucination or neglect to structure-dependent shifts between the two. To mitigate this bias, we further formulate ECI as a causal inference problem and propose a causality identification framework grounded in the potential outcomes framework and the Halpern–Pearl (HP) definition of actual causality. Experimental results demonstrate that our framework significantly reduces the causal bias associated with directly using LLMs for ECI, while also achieving superior performance.
JuriFindIT: an Italian legal retrieval dataset
Niko Dalla Noce | Davide Colla | Sina Farhang Doust | Lorenzo De Mattei | Davide Bacciu
Niko Dalla Noce | Davide Colla | Sina Farhang Doust | Lorenzo De Mattei | Davide Bacciu
Statutory article retrieval (SAR) targets retrieval of legislative provisions relevant to a natural language question. The lexical gap between everyday queries and specialized legal language, as well as the structural dependencies of statute law, makes it a challenging task. Here, we introduce JuriFindIT, the first SAR dataset for the Italian legal domain and the first to explicitly encode cross-article references extracted from national legal code. The dataset covers four macro-areas—civil law, criminal law, anti-money laundering and counter-terrorism, and privacy—and includes 895 expert-authored questions and 169,301 generated ones, linked to more than 23,000 statutory articles. We provide retrieval models fine-tuned on JuriFindIT, proposing a pipeline that integrates dense encoders with a heterogeneous legislative graph, achieving consistent improvements over prior SAR approaches.
Towards Fair and Efficient De-identification: Quantifying the Efficiency and Generalizability of De-identification Approaches
Noopur Zambare | Kiana Aghakasiri | Carissa Lin | Carrie Ye | J Ross Mitchell | Mohamed Abdalla
Noopur Zambare | Kiana Aghakasiri | Carissa Lin | Carrie Ye | J Ross Mitchell | Mohamed Abdalla
Large language models (LLMs) have shown strong performance on clinical de-identification, the task of identifying sensitive identifiers to protect privacy. However, previous work has not examined their generalizability across formats, cultures, and genders. In this work, we systematically evaluate fine-tuned transformer models (BERT, ClinicalBERT, ModernBERT), small LLMs (Llama 1-8B, Qwen 1.5-7B), and large LLMs (Llama-70B, Qwen-72B) at de-identification. We show that smaller models achieve comparable performance while substantially reducing inference cost, making them more practical for deployment. Moreover, we demonstrate that smaller models can be fine-tuned with limited data to outperform larger models in de-identifying identifiers drawn from Mandarin, Hindi, Spanish, French, Bengali, and regional variations of English, in addition to gendered names. To improve robustness in multi-cultural contexts, we introduce and publicly release BERT-MultiCulture-DEID, a set of de-identification models based on BERT, ClinicalBERT, and ModernBERT, fine-tuned on MIMIC with identifiers from multiple language variants. Our findings provide the first comprehensive quantification of the efficiency-generalizability trade-off in de-identification and establish practical pathways for fair and efficient clinical de-identification. Details on accessing the models are available at: https://doi.org/10.5281/zenodo.18342291
How Many Ratings per Item are Necessary for Reliable Significance Testing?
Christopher M Homan | Flip Korn | Deepak Pandita | Chris Welty
Christopher M Homan | Flip Korn | Deepak Pandita | Chris Welty
A cornerstone of machine learning evaluation is the (often hidden) assumption that model and human responses are reliable enough to evaluate models against unitary, authoritative, “gold standard” data, via simple metrics such as accuracy, precision, and recall. The generative AI revolution would seem to explode this assumption, given the critical role stochastic inference plays. Yet, in spite of public demand for more transparency in AI—along with strong evidence that humans are unreliable judges—estimates of model reliability are conventionally based on, at most, a few output responses per input item. We adapt a method, previously used to evaluate the reliability of various metrics and estimators for machine learning evaluation, to determine whether an (existing or planned) dataset has enough responses per item to assure reliable null hypothesis statistical testing. We show that, for many common metrics, collecting even 5-10 responses per item (from each model and team of human evaluators) is not sufficient. We apply our methods to several of the very few extant gold standard test sets with multiple disaggregated responses per item and show that even these datasets lack enough responses per item. We show how our methods can help AI researchers make better decisions about how to collect data for AI evaluation.
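The core question can be illustrated with a small simulation of our own construction (not the paper's estimator): with noisy per-item human ratings, how often does a paired significance test detect a true mean difference between two systems as the number of ratings per item grows? All distributions and constants below are illustrative.

```python
# Illustrative power simulation: detection rate of a paired t-test vs. ratings per item.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_items, true_gap, rater_noise = 200, 0.05, 0.5

def detection_rate(k, trials=300, alpha=0.05):
    hits = 0
    for _ in range(trials):
        item_quality = rng.normal(0.0, 0.3, n_items)
        a = rng.normal(item_quality[:, None], rater_noise, (n_items, k)).mean(1)
        b = rng.normal(item_quality[:, None] + true_gap, rater_noise, (n_items, k)).mean(1)
        hits += stats.ttest_rel(a, b).pvalue < alpha
    return hits / trials

for k in (1, 5, 10, 30):
    print(f"{k:>2} ratings/item -> significance detected in {detection_rate(k):.0%} of trials")
```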
QFrBLiMP: a Quebec-French Benchmark of Linguistic Minimal Pairs
David Beauchemin | Pier-Luc Veilleux | Richard Khoury | Johanna-Pascale Roy
David Beauchemin | Pier-Luc Veilleux | Richard Khoury | Johanna-Pascale Roy
In this paper, we introduce the Quebec-French Benchmark of Linguistic Minimal Pairs (QFrBLiMP), a corpus designed to evaluate LLMs’ linguistic knowledge of prominent grammatical phenomena in Quebec-French. QFrBLiMP comprises 1,761 minimal pairs annotated with 20 LPs. Specifically, these minimal pairs have been created by manually modifying sentences extracted from an official online resource maintained by a Québec government institution. Each pair is annotated by 12 Quebec-French native speakers, who select the sentence they consider grammatical from the two. These annotations are used to compare the competency of LLMs with that of humans. We evaluate different LLMs on QFrBLiMP and MultiBLiMP-Fr by observing the rate of higher probabilities assigned to the sentences of each minimal pair for each category. We find that while grammatical competence scales with model size, a clear hierarchy of difficulty emerges. All benchmarked models consistently fail on phenomena requiring deep semantic understanding, revealing a critical limitation. Finally, our statistical analysis comparing QFrBLiMP and MultiBLiMP reveals a significant performance degradation for most models on Quebec-French; however, the most capable models remain within the statistical significance interval, demonstrating cross-dialectal robustness.
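The underlying evaluation protocol is the standard minimal-pair comparison: score both sentences of a pair with a causal LM and check whether the grammatical one receives higher total log-probability. The sketch below uses GPT-2 and a generic French agreement pair purely for illustration; the model and example are not taken from the paper.

```python
# Sketch of minimal-pair scoring with a causal LM (model choice is illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def sentence_logprob(sentence):
    ids = tok(sentence, return_tensors="pt").input_ids
    out = model(ids, labels=ids)
    return -out.loss.item() * (ids.shape[1] - 1)   # sum of token log-probabilities

good = "Les filles sont parties."   # grammatical (feminine plural agreement)
bad = "Les filles sont partis."     # ungrammatical variant
print("prefers grammatical:", sentence_logprob(good) > sentence_logprob(bad))
```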
QueerGen: How LLMs Reflect Societal Norms on Gender and Sexuality in Sentence Completion Task
Mae Sosto | Delfina S. Martinez Pandiani | Laura Hollink
Mae Sosto | Delfina S. Martinez Pandiani | Laura Hollink
This paper examines how Large Language Models (LLMs) reproduce societal norms, particularly heterocisnormativity, and how these norms translate into measurable biases in their text generations. We investigate whether explicit information about a subject’s gender or sexuality influences LLM responses across three subject categories: queer-marked, non-queer-marked, and the normalized "unmarked" category. Representational imbalances are operationalized as measurable differences in English sentence completions across four dimensions: sentiment, regard, toxicity, and prediction diversity. Our findings show that Masked Language Models (MLMs) produce the least favorable sentiment, higher toxicity, and more negative regard for queer-marked subjects. Autoregressive Language Models (ARLMs) partially mitigate these patterns, while closed-access ARLMs tend to produce more harmful outputs for unmarked subjects. Results suggest that LLMs reproduce normative social assumptions, though the form and degree of bias depend strongly on specific model characteristics, which may redistribute—but not eliminate—representational harms.
Efficient Table Retrieval and Understanding with Multimodal Large Language Models
Zhuoyan Xu | Haoyang Fang | Boran Han | Bonan Min | Bernie Wang | Cuixiong Hu | Shuai Zhang
Zhuoyan Xu | Haoyang Fang | Boran Han | Bonan Min | Bernie Wang | Cuixiong Hu | Shuai Zhang
Tabular data is frequently captured in image form across a wide range of real-world scenarios such as financial reports, handwritten records, and document scans. These visual representations pose unique challenges for machine understanding, as they combine both structural and visual complexities. While recent advances in Multimodal Large Language Models (MLLMs) show promising results in table understanding, they typically assume the relevant table is readily available. However, a more practical scenario involves identifying and reasoning over relevant tables from large-scale collections to answer user queries. To address this gap, we propose a framework that enables MLLMs to answer queries over large collections of table images. Our approach first retrieves candidate tables using jointly trained visual-text foundation models, then leverages MLLMs to perform fine-grained reranking of these candidates, and finally employs MLLMs to reason over the selected tables for answer generation. Through extensive experiments on a newly constructed dataset comprising 88,161 training and 9,819 testing samples across 8 benchmarks with 48,504 unique tables, we demonstrate that our framework significantly outperforms existing methods by 7.0% in retrieval recall and 6.1% in answer accuracy, offering a practical solution for real-world table understanding tasks.
FedReFT: Federated Representation Fine-Tuning with All-But-Me Aggregation
Fatema Siddika | Md Anwar Hossen | Juan Pablo Munoz | Tanya G. Roosta | Anuj Sharma | Ali Jannesari
Fatema Siddika | Md Anwar Hossen | Juan Pablo Munoz | Tanya G. Roosta | Anuj Sharma | Ali Jannesari
Parameter-efficient fine-tuning (PEFT) adapts large pre-trained models by updating only a small subset of parameters. Recently, Representation Fine-Tuning (ReFT) has emerged as an effective alternative. ReFT shifts the fine-tuning paradigm from updating model weights to directly manipulating hidden representations that capture rich semantic information, and outperforms state-of-the-art PEFT methods in standalone settings. However, its application in Federated Learning (FL) remains challenging due to heterogeneity in clients’ data distributions, model capacities, and computational resources. To address these challenges, we introduce Federated Representation Fine-Tuning (FedReFT), a novel approach to fine-tune clients’ hidden representations. FedReFT applies sparse intervention layers to steer hidden representations directly, offering a lightweight and semantically rich fine-tuning alternative ideal for edge devices. However, representation-level updates are especially vulnerable to aggregation mismatch under task heterogeneity, where naive averaging can corrupt semantic alignment. To mitigate this issue, we propose All-But-Me (ABM) aggregation, where each client receives the aggregated updates of others and partially incorporates them, enabling stable and personalized learning by balancing local focus with global knowledge. We further design an adaptive update strategy inspired by Test-Time Computing (TTC) to balance local and global contributions under heterogeneous conditions. FedReFT achieves state-of-the-art performance on commonsense reasoning, arithmetic reasoning, and GLUE benchmarks, while delivering 1x–49x higher parameter efficiency compared to leading LoRA-based methods.
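A minimal sketch of the All-But-Me idea: each client receives the mean of all other clients' updates and mixes it with its own update. The mixing rule and coefficient below are our assumptions for illustration, not the paper's exact update.

```python
# Minimal sketch of All-But-Me (ABM) aggregation (mixing rule is illustrative).
import torch

def abm_aggregate(client_updates, mix=0.5):
    # client_updates: list of tensors (e.g., flattened intervention parameters)
    stacked = torch.stack(client_updates)                 # [n_clients, dim]
    total = stacked.sum(dim=0)
    personalized = []
    for i, own in enumerate(client_updates):
        others_mean = (total - own) / (len(client_updates) - 1)   # aggregate of everyone else
        personalized.append((1 - mix) * own + mix * others_mean)
    return personalized

updates = [torch.randn(8) for _ in range(4)]
for i, u in enumerate(abm_aggregate(updates)):
    print(f"client {i}: {u[:3]}")
```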
RiddleBench: A New Generative Reasoning Benchmark for LLMs
Deepon Halder | Alan Saji | Thanmay Jayakumar | Anoop Kunchukuttan | Ratish Puduppully | Raj Dabre
Deepon Halder | Alan Saji | Thanmay Jayakumar | Anoop Kunchukuttan | Ratish Puduppully | Raj Dabre
While Large Language Models (LLMs) show remarkable capabilities, their complex reasoning skills require deeper investigation. We introduce **RiddleBench**, a new benchmark of 1,737 challenging puzzles designed to test reasoning beyond simple pattern matching. Our evaluation of state-of-the-art models reveals significant limitations, including hallucination cascades (uncritically accepting flawed peer reasoning) and poor self-correction due to strong self-confirmation bias. We also find that model performance is fragile, degrading when faced with reordered constraints or irrelevant information. RiddleBench serves as a resource for diagnosing these issues and guiding the development of more robust LLMs.
Language Model-Driven Data Pruning Enables Efficient Active Learning
Abdul Hameed Azeemi | Ihsan Ayyub Qazi | Agha Ali Raza
Abdul Hameed Azeemi | Ihsan Ayyub Qazi | Agha Ali Raza
Active learning (AL) optimizes data labeling efficiency by selecting the most informative instances for annotation. However, scaling active learning to large datasets remains a critical challenge, as AL acquisition functions incur prohibitive computational costs when evaluating large unlabeled data pools. To bridge this gap, we introduce a novel plug-and-play data pruning strategy, ActivePrune, which leverages language models to prune the unlabeled pool. ActivePrune implements a two-stage pruning process: an initial fast evaluation using perplexity scores from an n-gram language model, followed by a high-quality selection using data-quality metrics computed with a quantized LLM. To enhance the diversity of the unlabeled pool, we propose a novel perplexity reweighting method that systematically brings forward underrepresented instances for selection. Experiments on translation, sentiment analysis, topic classification, and summarization tasks on diverse datasets and AL strategies demonstrate that ActivePrune outperforms existing data pruning methods. Finally, we compare the trade-off between selection quality and efficiency across data pruning methods and show that ActivePrune provides up to 74% reduction in end-to-end AL time compared to other LLM score-based pruning methods.
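The shape of such a two-stage pass can be sketched as follows, with deliberately crude stand-ins: a unigram surprisal in place of the n-gram LM, a type/token ratio in place of the quantized-LLM quality score, and a simple slice of high-perplexity texts as the diversity reweighting. None of these stand-ins are the paper's actual scorers.

```python
# Two-stage pool pruning sketch with stand-in scoring functions (illustrative only).
import math
from collections import Counter

def unigram_perplexity(text, counts, total):
    toks = text.lower().split()
    logp = sum(math.log((counts[t] + 1) / (total + len(counts))) for t in toks)
    return math.exp(-logp / max(len(toks), 1))

def prune_pool(pool, stage1_keep=0.5, final_keep=0.2, diversity=0.2,
               quality_fn=lambda x: len(set(x.split())) / max(len(x.split()), 1)):
    counts = Counter(t for x in pool for t in x.lower().split())
    total = sum(counts.values())
    by_ppl = sorted(pool, key=lambda x: unigram_perplexity(x, counts, total))
    n1 = max(int(stage1_keep * len(pool)), 1)
    n_div = int(diversity * n1)
    # Stage 1: lowest-perplexity texts, plus a reweighted slice of rare (high-ppl) ones.
    stage1 = by_ppl[: n1 - n_div] + (by_ppl[-n_div:] if n_div else [])
    # Stage 2: rerank survivors with the more expensive quality score.
    ranked = sorted(stage1, key=quality_fn, reverse=True)
    return ranked[: max(int(final_keep * len(pool)), 1)]

pool = ["the cat sat on the mat", "quantum entanglement stabilizes qubits",
        "the cat sat", "gradient descent converges slowly", "the the the the"]
print(prune_pool(pool, stage1_keep=0.8, final_keep=0.4))
```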
HARM: Learning Hate-Aware Reward Model for Evaluating Natural Language Explanations of Offensive Content
Lorenzo Puppi Vecchi | Alceu De Souza Britto Jr. | Emerson Cabrera Paraiso | Rafael M. O. Cruz
Lorenzo Puppi Vecchi | Alceu De Souza Britto Jr. | Emerson Cabrera Paraiso | Rafael M. O. Cruz
Explaining why content is hateful using natural language is crucial for fostering transparency in automated content moderation systems. However, evaluating the quality of such explanations remains an open challenge. General-purpose reward models (RMs), commonly used for scoring natural language outputs, are typically optimized for broad notions of safety. We argue that this optimization penalizes situations where references to stereotypes or offensive content are essential for explanations with higher explanatory fidelity. To address this gap, we introduce SBIC-Explain, a human-validated dataset of 370,788 LLM-generated NLEs for offensive content, spanning three levels of human-annotated contextual richness: Tier 1: text-only, Tier 2: + classification-aware, and Tier 3: + semantics-informed. We hypothesize that as human-annotated context increases, explanations should be perceived as having higher explanatory fidelity. Yet, we find that existing RMs systematically assign lower scores to more contextually rich (and often more offensive) explanations, revealing a misalignment between model preferences and explanatory fidelity in this context. We propose HARM (Hate-Aware Reward Model), an RM that integrates interpretable signals to better align reward scores with the needs of hate speech explanation. HARM outperforms general-purpose baselines, improving NLE pairwise preference. Available at: https://github.com/Lorenzo815/HARM.
MATH-IDN: A Multilingual Mathematical Problem Solving Dataset Featuring Local Languages in Indonesia
Xiao Xiao | Iftitahu Ni'mah | Yuyun Wabula | Mykola Pechenizkiy | Meng Fang
Xiao Xiao | Iftitahu Ni'mah | Yuyun Wabula | Mykola Pechenizkiy | Meng Fang
Large Language Models (LLMs) excel at mathematical reasoning in English, but their performance in low-resource languages remains underexplored. This gap is particularly critical in the Indonesian context, where equitable access to AI systems depends on robust multilingual reasoning across diverse local languages.We introduce MATH-IDN, a multilingual benchmark for mathematical problem solving in Indonesian, Javanese, Sundanese, and Buginese, with English as a reference, following the MATH dataset. We evaluate multiple open-source LLMs, including math-specialized, Southeast-Asian-adapted, and general-purpose models, under a zero-shot chain-of-thought setting. Results show that MATH-IDN presents a challenging and discriminative benchmark, revealing substantial performance gaps in low-resource languages, particularly Buginese, and highlighting key limitations in current multilingual reasoning capabilities. Our data and code are available at https://github.com/aialt/MATH-IND.
Parameter-Efficient Routed Fine-Tuning: Mixture-of-Experts Demands Mixture of Adaptation Modules
Yilun Liu | Yunpu Ma | Yuetian Lu | Shuo Chen | Zifeng Ding | Volker Tresp
Yilun Liu | Yunpu Ma | Yuetian Lu | Shuo Chen | Zifeng Ding | Volker Tresp
Mixture-of-Experts (MoE) models benefit from a dynamic routing mechanism among their specialized experts, which existing Parameter-Efficient Fine-Tuning (PEFT) strategies often fail to leverage. This motivates us to investigate whether adaptation modules themselves should incorporate routing mechanisms to align with MoE’s multi-expert architecture. We analyze the dynamics of core components when applying PEFT to MoE language models, and examine how different routing strategies affect adaptation effectiveness. Extensive experiments adapting OLMoE-1B-7B and Mixtral-8×7B on various commonsense and math reasoning tasks validate the performance and efficiency of our routed approach. We identify optimal configurations for different scenarios and provide empirical analyses with practical insights to facilitate better PEFT and MoE applications.
MAPRO: Recasting Multi-Agent Prompt Optimization as Maximum a Posteriori Inference
Zheyuan Zhang | Lin Ge | Hongjiang Li | Weicheng Zhu | Chuxu Zhang | Yanfang Ye
Zheyuan Zhang | Lin Ge | Hongjiang Li | Weicheng Zhu | Chuxu Zhang | Yanfang Ye
Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, and LLM-based agents further extend these abilities to various practical workflows. While recent progress shows that multi-agent systems (MAS) can outperform single agents by coordinating specialized roles, designing effective MAS remains difficult due to prompt sensitivity and the compounded instability MAS creates. To cope with this challenge, recent efforts in automated prompt design have reduced manual effort. However, multi-agent prompt optimization remains largely unexplored. Challenges such as an exponentially expanding search space and ambiguous credit assignment together make systematic design intractable without principled methods. Therefore, we introduce Multi-Agent PRompt Optimization (MAPRO), a four-stage framework that first formulates MAS prompt optimization as a Maximum a Posteriori (MAP) inference problem and solves it using a language-guided variant of the max-product belief propagation algorithm. To address credit assignment and update the system iteratively, MAPRO employs a topology-aware refinement mechanism that integrates execution feedback and downstream blame to selectively update agent prompts. Through this process, MAPRO progressively converges to a coordinated set of agent-specific prompt policies. Across benchmarks in various tasks, MAPRO achieves state-of-the-art performance, consistently surpassing manually engineered baselines and recent automated alternatives. Beyond performance, our MAP-based formulation also delivers general guidelines for building more reliable and principled multi-agent systems in the future.
Debiasing Large Language Models via Adaptive Causal Prompting with Sketch-of-Thought
Bowen Li | Ziqi Xu | Jing Ren | Renqiang Luo | Xikun Zhang | Xiuzhen Zhang | Yongli Ren | Feng Xia
Bowen Li | Ziqi Xu | Jing Ren | Renqiang Luo | Xikun Zhang | Xiuzhen Zhang | Yongli Ren | Feng Xia
Despite notable advancements in prompting methods for Large Language Models (LLMs), such as Chain-of-Thought (CoT), existing strategies still suffer from excessive token usage and limited generalisability across diverse reasoning tasks. To address these limitations, we propose an Adaptive Causal Prompting with Sketch-of-Thought (ACPS) framework, which leverages structural causal models to infer the causal effect of a query on its answer and adaptively select an appropriate intervention (i.e., standard front-door and conditional front-door adjustments). This design enables generalisable causal reasoning across heterogeneous tasks without task-specific retraining. By replacing verbose CoT with concise Sketch-of-Thought, ACPS enables efficient reasoning that significantly reduces token usage and inference cost. Extensive experiments on multiple reasoning benchmarks and LLMs demonstrate that ACPS consistently outperforms existing prompting baselines in terms of accuracy, robustness, and computational efficiency.
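For reference, the standard front-door adjustment named above can be stated with generic variables (query X, answer Y, and a mediator M such as an intermediate reasoning sketch); the notation below is ours, not taken from the paper, and the conditional variant additionally conditions on observed covariates.

```latex
% Standard front-door adjustment: effect of X on Y through mediator M.
P\bigl(y \mid \mathrm{do}(x)\bigr)
  = \sum_{m} P(m \mid x) \sum_{x'} P\bigl(y \mid m, x'\bigr)\, P(x')
```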
ExpressivityBench: Can LLMs Communicate Implicitly?
Joshua Tint | Som Sagar | Aditya Taparia | Kelly Raines | Bimsara Pathiraja | Caleb Liu | Ransalu Senanayake
Joshua Tint | Som Sagar | Aditya Taparia | Kelly Raines | Bimsara Pathiraja | Caleb Liu | Ransalu Senanayake
Human communication is often implicit, conveying tone, identity, and intent beyond literal meanings. While large language models have achieved strong performance on explicit tasks such as summarization and reasoning, their capacity for expressivity, or implicit communication, remains underexplored. We introduce ExpressivityBench, a framework for evaluating the expressivity of LLMs using information-theoretic communication models. Our approach quantifies how well LLM-generated text communicates target properties without explicit mention, across nine tasks spanning emotion, identity, and tone. To enable scalable and reproducible evaluation, we employ LLM-based graders validated against human judgments. Our results reveal that while models are adept at expressing affective content, they struggle with sociolinguistic signals, lagging behind human baselines. This study provides a necessary step toward evaluating human-like implicit communication, with implications for applications such as education, mental health support, and socially aware dialogue systems. We provide code and data for our benchmark alongside our paper.
Rethinking Schema Linking: A Context-Aware Bidirectional Retrieval Approach for Text-to-SQL
Md Mahadi Hasan Nahid | Davood Rafiei | Weiwei Zhang | Yong Zhang
Md Mahadi Hasan Nahid | Davood Rafiei | Weiwei Zhang | Yong Zhang
Schema linking—the process of aligning natural language questions with database schema elements—is a critical yet underexplored component of Text-to-SQL systems. While recent methods have focused primarily on improving SQL generation, they often neglect the retrieval of relevant schema elements, which can lead to hallucinations and execution failures. In this work, we propose a context-aware bidirectional schema retrieval framework that treats schema linking as a standalone problem. Our approach combines two complementary strategies: table-first retrieval followed by column selection, and column-first retrieval followed by table selection. It is further augmented with techniques such as question decomposition, keyword extraction, and keyphrase extraction. Through comprehensive evaluations on challenging benchmarks such as BIRD and Spider, we demonstrate that our method significantly improves schema recall while reducing false positives. Moreover, SQL generation using our retrieved schema consistently outperforms full-schema baselines and closely approaches oracle performance, all without requiring query refinement. Notably, our method narrows the performance gap between full and perfect schema settings by 50%. Our findings highlight schema linking as a powerful lever for enhancing Text-to-SQL accuracy and efficiency.
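A rough sketch of the bidirectional idea: run table-first retrieval (then select columns inside the retrieved tables) and column-first retrieval (then lift the hit columns to their tables), and union the candidate sets. Similarity below is a plain sentence-embedding cosine score, and the model name, thresholds, and helper functions are illustrative assumptions rather than the paper's pipeline.

```python
# Sketch of bidirectional (table-first + column-first) schema retrieval.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def rank_columns(q, table, columns, k):
    scores = util.cos_sim(q, model.encode([f"{table} {c}" for c in columns],
                                          convert_to_tensor=True))[0]
    return [columns[i] for i in scores.topk(min(k, len(columns))).indices]

def retrieve_schema(question, schema, top_tables=2, top_columns=3):
    # schema: dict {table_name: [column_name, ...]}
    q = model.encode(question, convert_to_tensor=True)

    # table-first: rank tables, then rank columns within the best tables
    tables = list(schema)
    t_scores = util.cos_sim(q, model.encode(tables, convert_to_tensor=True))[0]
    best_tables = [tables[i] for i in t_scores.topk(min(top_tables, len(tables))).indices]
    table_first = {(t, c) for t in best_tables
                   for c in rank_columns(q, t, schema[t], top_columns)}

    # column-first: rank all (table, column) pairs, then lift hits to their tables
    pairs = [(t, c) for t, cols in schema.items() for c in cols]
    texts = [f"{t} {c}" for t, c in pairs]
    c_scores = util.cos_sim(q, model.encode(texts, convert_to_tensor=True))[0]
    column_first = {pairs[i] for i in c_scores.topk(min(top_columns, len(pairs))).indices}

    return table_first | column_first

schema = {"singer": ["name", "age", "country"], "concert": ["venue", "year", "singer_id"]}
print(retrieve_schema("Which singers are from France?", schema))
```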
PEAR: Planner-Executor Agent Robustness Benchmark
Shen Dong | Mingxuan Zhang | Pengfei He | Li Ma | Bhavani Thuraisingham | Hui Liu | Yue Xing
Shen Dong | Mingxuan Zhang | Pengfei He | Li Ma | Bhavani Thuraisingham | Hui Liu | Yue Xing
Large Language Model (LLM)–based Multi-Agent Systems (MAS) have emerged as a powerful paradigm for tackling complex, multi-step tasks across diverse domains. However, despite their impressive capabilities, MAS remain susceptible to adversarial manipulation. Existing studies typically examine isolated attack surfaces or specific scenarios, leaving a lack of holistic understanding of MAS vulnerabilities. To bridge this gap, we introduce PEAR, a benchmark for systematically evaluating both the utility and vulnerability of planner–executor MAS. While compatible with various MAS architectures, our benchmark focuses on the planner–executor structure—a practical and widely adopted design. Through extensive experiments, we find that (1) a weak planner degrades overall clean task performance more severely than a weak executor; (2) while a memory module is essential for the planner, incorporating a memory module into the executor yields only marginal improvements in clean-task performance; (3) there exists a trade-off between task performance and robustness; and (4) attacks targeting the planner are particularly effective at misleading the system. These findings offer actionable insights for enhancing the robustness of MAS and lay the groundwork for principled defenses in multi-agent settings.
Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition
Zheng Hui | Xiaokai Wei | Yexi Jiang | Kevin Gao | Chen Wang | Se-eun Yoon | Rachit Pareek | Michelle Gong
Conversational recommender systems (CRS) have advanced with large language models, showing strong results in domains like movies. These domains typically involve fixed content and passive consumption, where user preferences can be matched by genre or theme. In contrast, games present distinct challenges: fast-evolving catalogs, interaction-driven preferences (e.g., skill level, mechanics, hardware), and increased risk of unsafe responses in open-ended conversation. We propose MATCHA, a multi-agent framework for CRS that assigns specialized agents for intent parsing, tool-augmented retrieval, multi-LLM ranking with reflection, explanation, and risk control, enabling finer personalization, long-tail coverage, and stronger safety. Evaluated on a real user request dataset, MATCHA outperforms six baselines across eight metrics, improving Hit@5 by 20%, reducing popularity bias by 24%, and achieving 97.9% adversarial defense. Human and virtual-judge evaluations confirm improved explanation quality and user alignment. Code will be released upon acceptance.
Linguistic Cues for LLM-based Implicit Discourse Relation Classification
Yi Fan | Michael Strube | Wei Liu
Large language models (LLMs) have achieved impressive success across many NLP tasks, yet implicit discourse relation classification (IDRC) is still dominated by encoder-only pre-trained language models such as RoBERTa. This may be due to earlier reports that ChatGPT performs poorly on IDRC in zero-shot settings. In this paper, we show that fine-tuned LLMs can perform on par with, or even better than, existing encoder-based approaches. Nevertheless, we find that LLMs alone struggle to capture subtle lexical relations between arguments for the task. To address this, we propose a two-step strategy that enriches arguments with explicit lexical-level semantic cues before fine-tuning. Experiments demonstrate substantial gains, particularly in cross-domain scenarios, with F1 scores improved by more than 10 points compared to strong baselines.
SpARK: An Embarrassingly Simple Sparse Watermarking in LLMs with Enhanced Text Quality
Duy Cao Hoang | Thanh Quoc Hung Le | Rui Chu | Ping Li | Weijie Zhao | Yingjie Lao | Khoa D Doan
With the widespread adoption of Large Language Models (LLMs), concerns about potential misuse have emerged. To address this, watermarking has been adapted to LLMs, providing a simple and effective way to detect and monitor generated text. However, while the existing methods can differentiate between watermarked and unwatermarked text with high accuracy, they often face a trade-off between the quality of the generated text and the effectiveness of the watermarking process. In this work, we present a novel type of LLM watermark, Sparse WatermARK (or SpARK), which aims to mitigate this trade-off by applying watermarks to a small subset of generated tokens distributed across the text. To demonstrate this type of watermark, we introduce two novel variants, SpARK-P and SpARK-R, which achieve sparsity by anchoring watermarked tokens to words that have specific Part-of-Speech (POS) tags and specific hash values w.r.t. a pseudorandom hash function, respectively. Our experimental results demonstrate that the proposed watermarking schemes, albeit embarrassingly simple, are incredibly effective, achieving high detectability while generating text that outperforms previous LLM watermarking methods in quality across various tasks. SpARK further advances the watermarking capability for LLMs while maintaining their generated text quality.
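To make the sparse-watermark idea concrete, here is a minimal Python sketch of a SpARK-R-style rule: a position is watermarked only when the previous word's hash falls into a small bucket, and only then is a seeded "green list" of the vocabulary boosted. The function names, the bucket size `sparsity`, and the bias `delta` are illustrative assumptions, not the authors' implementation.

```python
import hashlib
import random


def _hash_int(text: str) -> int:
    """Deterministic integer hash of a word (illustrative, not the paper's scheme)."""
    return int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16)


def is_anchor(prev_word: str, sparsity: int = 8) -> bool:
    """Watermark only positions whose previous word hashes into a small bucket,
    so roughly 1/sparsity of the generated tokens carry the signal."""
    return _hash_int(prev_word) % sparsity == 0


def biased_logits(logits: dict, prev_word: str, delta: float = 2.0) -> dict:
    """Boost a pseudo-random 'green list' (half the vocabulary, seeded by prev_word)."""
    rng = random.Random(_hash_int(prev_word) % (2**32))
    vocab = sorted(logits)
    green = set(rng.sample(vocab, k=len(vocab) // 2))
    return {tok: score + (delta if tok in green else 0.0) for tok, score in logits.items()}


if __name__ == "__main__":
    toy_logits = {"cat": 1.0, "dog": 0.9, "fish": 0.2, "bird": 0.1}
    prev = "the"
    if is_anchor(prev):
        toy_logits = biased_logits(toy_logits, prev)
    print(toy_logits)
```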
Pretraining Language Models for Diachronic Linguistic Change Discovery
Elisabeth Fittschen | Sabrina Xin Li | Tom Lippincott | Leshem Choshen | Craig Messner
Large language models (LLMs) are increasingly used as knowledge discovery tools. Humanistic disciplines like historical linguistics and literary studies have shown interest in this capability. These fields often construct arguments on the basis of distinctions between phenomena like time-period or genre. Such methodological investments complicate reliance on LLMs pretrained over large sets of broadly-collected data. We show that efficient pretraining techniques produce useful models of semantic change over modest historical corpora without allowing potential contamination from anachronistic data. We verify that these trained-from-scratch models better respect historical divisions and are more computationally efficient compared to the standard approach of fine-tuning an existing LLM. We compare the trade-offs in general linguistic fluency versus detecting and characterizing various forms of linguistic change, and provide a pipeline implementation of our approach that can be readily adapted and applied to a wide range of diachronic phenomena.
Improving Language Identification for Code-Switched Speech: The Pivotal Role of Accented English
Adyasha Patra | Dhiraj Kumar Sah | Preethi Jyothi
Code-switching, where speakers alternate between languages within a single utterance, poses unique challenges for language identification (LID). Existing LID models often fail to reliably identify English spoken with the accent of the matrix (dominant) language. We show that finetuning LID models with small amounts of such accented English significantly improves code-switched LID, without degrading performance on standard monolingual speech—a limitation observed with direct finetuning on code-switched utterances. This is achieved via low-rank adaptation (LoRA) on limited accented data, which allows models to adapt efficiently. To better evaluate performance, we introduce LangRank, a metric that captures the relative ranking of identified languages often overlooked by traditional metrics. Our method generalizes across multiple language pairs, including Hindi-English, Bengali-English, Mandarin-English, and Arabic-English, providing robust LID in code-switched multilingual contexts.
Reducing Hallucinations in Language Model-based SPARQL Query Generation Using Post-Generation Memory Retrieval
Aditya Sharma | Christopher Pal | Amal Zouaq
The ability to generate SPARQL queries from natural language questions is crucial for ensuring efficient and accurate retrieval of structured data from knowledge graphs (KG). While large language models (LLMs) have been widely adopted for SPARQL query generation, they are often susceptible to hallucinations and out-of-distribution errors when generating KG elements, such as Uniform Resource Identifiers (URIs), based on opaque internal parametric knowledge. We propose PGMR (Post-Generation Memory Retrieval), a modular framework where the LLM produces an intermediate query using natural language placeholders for URIs, and a non-parametric memory module is subsequently employed to retrieve and resolve the correct KG URIs. PGMR significantly enhances query correctness (SQM) across various LLMs, datasets, and distribution shifts, while achieving the near-complete suppression of URI hallucinations. Critically, we demonstrate PGMR’s superior safety and robustness: a retrieval confidence threshold enables PGMR to effectively refuse to answer queries that lack support, and the retriever proves highly resilient to memory noise, maintaining strong performance even when the non-parametric memory size is scaled up to 9 times with irrelevant, distracting entities.
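A minimal sketch of the post-generation retrieval step described here, under assumed placeholder syntax: the LLM emits a SPARQL skeleton with natural-language labels, and a non-parametric label-to-URI memory resolves them afterwards. The `[LABEL: ...]` format, the toy `MEMORY` dictionary, and the `difflib`-based matcher are stand-ins for PGMR's actual retriever and confidence thresholding.

```python
import re
from difflib import get_close_matches

# Toy label-to-URI "memory"; in practice this would index the full KG vocabulary.
MEMORY = {
    "Barack Obama": "http://dbpedia.org/resource/Barack_Obama",
    "spouse": "http://dbpedia.org/ontology/spouse",
}


def resolve_placeholders(intermediate_query: str) -> str:
    """Replace [LABEL: ...] placeholders with URIs retrieved from the memory."""
    def _lookup(match: re.Match) -> str:
        label = match.group(1).strip()
        candidates = get_close_matches(label, list(MEMORY), n=1, cutoff=0.6)
        # If nothing is retrieved with enough similarity, keep the placeholder
        # (analogous in spirit to refusing to answer unsupported queries).
        return f"<{MEMORY[candidates[0]]}>" if candidates else match.group(0)

    return re.sub(r"\[LABEL:\s*([^\]]+)\]", _lookup, intermediate_query)


if __name__ == "__main__":
    q = "SELECT ?x WHERE { [LABEL: Barack Obama] [LABEL: spouse] ?x }"
    print(resolve_placeholders(q))
```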
Jailbreaking Safeguarded Text-to-Image Models via Large Language Models
Zhengyuan Jiang | Yuepeng Hu | Yuchen Yang | Yinzhi Cao | Neil Zhenqiang Gong
Text-to-Image models may generate harmful content, such as pornographic images, particularly when unsafe prompts are submitted. To address this issue, safety filters are often added on top of text-to-image models, or the models themselves are aligned to reduce harmful outputs. However, these defenses remain vulnerable when an attacker strategically designs adversarial prompts to bypass these safety guardrails. In this work, we propose PromptTune, a method to jailbreak text-to-image models with safety guardrails using a fine-tuned large language model. Unlike other query-based jailbreak attacks that require repeated queries to the target model, our attack generates adversarial prompts efficiently after fine-tuning our AttackLLM. We evaluate our method on three datasets of unsafe prompts and against five safety guardrails. Our results demonstrate that our approach effectively bypasses safety guardrails, outperforms existing no-box attacks, and also facilitates other query-based attacks. Our code is available at https://github.com/zhengyuan-jiang/PromptTune.
BSCodec: A Band-Split Neural Codec for High-Quality Universal Audio Reconstruction
Haoran Wang | Jiatong Shi | Jinchuan Tian | Bohan Li | Kai Yu | Shinji Watanabe
Neural audio codecs have recently enabled high-fidelity reconstruction at high compression rates, especially for speech. However, speech and non-speech audio exhibit fundamentally different spectral characteristics: speech energy concentrates in narrow bands around pitch harmonics (80-400 Hz), while non-speech audio requires faithful reproduction across the full spectrum, particularly preserving higher frequencies that define timbre and texture. This poses a challenge—speech-optimized neural codecs suffer degradation on music or sound. Treating the full spectrum holistically is suboptimal: frequency bands have vastly different information density and perceptual importance by content type, yet full-band approaches apply uniform capacity across frequencies without accounting for these acoustic structures. To address this gap, we propose **BSCodec** (Band-Split Codec), a novel neural audio codec architecture that splits the spectral dimension into separate bands and compresses each band independently. Experimental results demonstrate that BSCodec achieves superior reconstruction over baselines across sound and music, while maintaining competitive quality in the speech domain, when trained on the same combined dataset of speech, music and sound. Benchmark results on downstream tasks further confirm that BSCodec shows strong potential for such applications.
Crafting Adversarial Inputs for Large Vision-Language Models Using Black-Box Optimization
Jiwei Guan | Haibo Jin | Haohan Wang
Recent advancements in Large Vision-Language Models (LVLMs) have shown groundbreaking capabilities across diverse multimodal tasks. However, these models remain vulnerable to adversarial jailbreak attacks, where adversaries craft subtle perturbations to bypass safety mechanisms and trigger harmful outputs. Existing white-box attack methods require full model access, incur high computational costs, and exhibit insufficient adversarial transferability, making them impractical for real-world, black-box settings. To address these limitations, we propose a black-box jailbreak attack on LVLMs via Zeroth-Order optimization using Simultaneous Perturbation Stochastic Approximation (ZO-SPSA). ZO-SPSA provides three key advantages: (i) gradient-free approximation through input-output interactions without requiring model knowledge, (ii) model-agnostic optimization without a surrogate model, and (iii) lower resource requirements with reduced GPU memory consumption. We evaluate ZO-SPSA on three LVLMs, including InstructBLIP, LLaVA and MiniGPT-4, achieving the highest jailbreak success rate of 83.0% on InstructBLIP, while maintaining imperceptible perturbations comparable to white-box methods. Moreover, adversarial examples generated from MiniGPT-4 exhibit strong transferability to other LVLMs, with ASR reaching 64.18%. These findings underscore the real-world feasibility of black-box jailbreaks and expose critical weaknesses in the safety mechanisms of current LVLMs.
SALT: Step-level Advantage Assignment for Long-horizon Agents via Trajectory Graph
Jiazheng Li | Yawei Wang | Qiaojing Yan | Yijun Tian | Zhichao Xu | Huan Song | Panpan Xu | Lin Lee Cheong
Large Language Models (LLMs) have demonstrated remarkable capabilities, enabling language agents to excel at single-turn tasks. However, their application to complex, multi-step, and long-horizon tasks remains challenging. While reinforcement learning (RL) offers a promising avenue for addressing these challenges, mainstream approaches typically rely solely on sparse, outcome-based rewards — a limitation that becomes especially problematic for group-based RL algorithms lacking critic models, such as Group Relative Policy Optimization (GRPO). In such methods, uniformly rewarding or penalizing all actions within a trajectory can lead to training instability and suboptimal policies, because beneficial and detrimental actions are often entangled across multi-step interactions. To address this challenge, we propose SALT, a novel and lightweight framework that provides a finer-grained advantage assignment, derived solely from outcome rewards. We achieve this by constructing a graph from trajectories of the same prompt, which allows us to quantify the quality of each step and assign advantages accordingly. Crucially, SALT is designed as a plug-and-play module that seamlessly integrates with existing group-based RL algorithms — requiring no modifications to the rollout procedure and introducing negligible computational overhead. Extensive experiments on the WebShop, ALFWorld, and AppWorld benchmarks with various model sizes demonstrate that SALT consistently improves performance. We also conduct a thorough analysis to validate the design choices behind SALT and offer actionable insights.
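As a loose illustration of deriving step-level advantages from outcome rewards shared across trajectories of the same prompt, the sketch below scores each step by the mean return of the trajectories that contain it (keyed by an identical state-action string), centered on the group-mean return. This is a simplification under assumed data structures; SALT's actual trajectory-graph construction is not reproduced here.

```python
from collections import defaultdict


def step_level_advantages(trajectories):
    """Derive per-step advantages from outcome rewards of same-prompt trajectories.

    Steps that recur across trajectories (identical state-action key) are scored
    by the mean return of the trajectories containing them, centered on the
    group-mean return, so shared good steps receive a positive advantage even
    when their own trajectory failed.
    """
    group_mean = sum(t["reward"] for t in trajectories) / len(trajectories)
    node_returns = defaultdict(list)
    for traj in trajectories:
        for step in traj["steps"]:
            node_returns[step].append(traj["reward"])
    return [
        [sum(node_returns[s]) / len(node_returns[s]) - group_mean for s in traj["steps"]]
        for traj in trajectories
    ]


if __name__ == "__main__":
    trajs = [
        {"steps": ["search(shoes)", "click(item_1)", "buy(item_1)"], "reward": 1.0},
        {"steps": ["search(shoes)", "click(item_2)", "buy(item_2)"], "reward": 0.0},
    ]
    print(step_level_advantages(trajs))  # shared first step gets advantage 0.0
```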
UniToolBench: A Benchmark for Tool-Augmented LLMs in Cross-Domain, Universal Task Automation
Xiaojie Guo | Yang Zhang | Bing Zhang | Ryo Kawahara | Mikio Takeuchi | Yada Zhu
Recent advancements in Large Language Models (LLMs) have enabled autonomous agents to decompose complex tasks, select appropriate tools, and execute structured workflows. However, a key challenge in this field is the lack of a universal, large-scale, and cross-domain benchmark to systematically evaluate LLMs’ ability to reason over and utilize interconnected tools for automation. Existing benchmarks, such as TaskBench, focus on manually curated tool graphs for benchmark generation, which lack scalability and diversity across domains. To address this, we propose UniToolBench, a benchmark that incorporates automated tool graph construction by formulating link prediction as a probabilistic task, instead of relying on categorical LLM outputs. Furthermore, we introduce a confidence-based beam search sampling strategy to select high-confidence tool dependencies, ensuring more structured and semantically coherent subgraphs for evaluation. Through extensive experiments on multiple datasets, we demonstrate that while LLMs show promise in tool selection, significant challenges remain in parameter prediction and handling complex tool dependencies.
Benchmarking the Energy Savings with Speculative Decoding Strategies
Rohit Dutta | Paramita Koley | Soham Poddar | Janardan Misra | Sanjay Podder | Naveen Balani | Saptarshi Ghosh | Niloy Ganguly
Speculative decoding has emerged as an effective method to reduce the latency and inference cost of LLMs. However, the energy requirements of these models have received inadequate attention. To address this gap, this paper presents a comprehensive survey of the energy requirements of speculative decoding strategies, with a detailed analysis of how various factors – model size and family, speculative decoding strategy, and dataset characteristics – influence energy optimization.
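For readers who want to reproduce this kind of measurement, the sketch below estimates GPU energy for a single generation call by polling instantaneous power draw with `pynvml` (NVIDIA's management library) from a background thread and integrating over wall-clock time. This is a generic harness under stated assumptions, not the paper's instrumentation; it requires an NVIDIA GPU and the `pynvml` package, and the model object in the usage comment is hypothetical.

```python
import threading
import time

import pynvml  # NVIDIA Management Library bindings (pip install pynvml)


def measure_energy_joules(generate_fn, *args, device_index=0, poll_s=0.05):
    """Estimate GPU energy (joules) consumed while generate_fn(*args) runs.

    A background thread polls instantaneous power draw and the average is
    multiplied by wall-clock time: energy ~ mean power x duration.
    """
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    samples, stop = [], threading.Event()

    def _poll():
        while not stop.is_set():
            samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)  # watts
            time.sleep(poll_s)

    poller = threading.Thread(target=_poll, daemon=True)
    start = time.time()
    poller.start()
    result = generate_fn(*args)
    stop.set()
    poller.join()
    elapsed = time.time() - start
    pynvml.nvmlShutdown()
    mean_power = sum(samples) / max(len(samples), 1)
    return result, mean_power * elapsed


# Example usage (hypothetical model object):
#   outputs, joules = measure_energy_joules(model.generate, input_ids)
```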
Thunder-NUBench: A Benchmark for LLMs’ Sentence-Level Negation Understanding
Yeonkyoung So | Gyuseong Lee | Sungmok Jung | Joonhak Lee | JiA Kang | Sangho Kim | Jaejin Lee
Negation is a fundamental linguistic phenomenon that poses ongoing challenges for Large Language Models (LLMs), particularly in tasks requiring deep semantic understanding. Current benchmarks often treat negation as a minor detail within broader tasks, such as natural language inference. Consequently, there is a lack of benchmarks specifically designed to evaluate comprehension of negation. In this work, we introduce *Thunder-NUBench* — a novel benchmark explicitly created to assess sentence-level understanding of negation in LLMs. Thunder-NUBench goes beyond identifying surface-level cues by contrasting standard negation with structurally diverse alternatives, such as local negation, contradiction, and paraphrase. This benchmark includes manually created sentence-negation pairs and a multiple-choice dataset, allowing for a comprehensive evaluation of models’ understanding of negation.
What Makes a Good Query? Measuring the Impact of Human-Confusing Linguistic Features on LLM Performance
William Watson | Nicole Cho | Sumitra Ganesh | Manuela Veloso
Large Language Model (LLM) hallucinations are usually treated as defects of the model or its decoding strategy. Drawing on classical linguistics, we argue that a query’s form can also shape a listener’s (and model’s) response. We operationalize this insight by constructing a 22-dimension query feature vector covering clause complexity, lexical rarity, anaphora, negation, answerability, and intention grounding, all known to affect human comprehension. Using 369,837 real-world queries, we ask: Are there certain types of queries that make hallucination more likely? A large-scale analysis reveals a consistent "risk landscape": certain features such as deep clause nesting and underspecification align with higher hallucination propensity. In contrast, clear intention grounding and answerability align with lower hallucination rates. Others, including domain specificity, show mixed, dataset- and model-dependent effects. Thus, these findings establish an empirically observable query-feature representation correlated with hallucination risk, paving the way for guided query rewriting and future intervention studies.
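A few of the surface signals in such a feature vector can be approximated cheaply; the sketch below computes crude proxies (comma- and connective-based clause markers, regex negation counts). These proxies are illustrative assumptions, not the authors' 22 feature definitions.

```python
import re


def query_features(query: str) -> dict:
    """Crude proxies for a few query-form signals (not the paper's definitions)."""
    lowered = query.lower()
    return {
        "length_tokens": len(query.split()),
        "clause_markers": query.count(",")
        + len(re.findall(r"\b(that|which|who|because|although|while)\b", lowered)),
        "negations": len(re.findall(r"\b(not|no|never|without)\b|n't", lowered)),
        "is_question": int(query.strip().endswith("?")),
    }


if __name__ == "__main__":
    q = "Which vendors that we have not onboarded, although approved, are missing invoices?"
    print(query_features(q))
```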
Completely Modular Fine-tuning for Dynamic Language Adaptation
Zhe Cao | Yusuke Oda | Qianying Liu | Akiko Aizawa | Taro Watanabe
Multilingual Fine-tuning of Large Language Models (LLMs) has achieved great advancements in machine translation. However, existing research focuses only on the traditional fine-tuning setting with a fixed set of languages, lacking dynamic adaptability to new ones. Introducing new languages requires retraining and often causes catastrophic forgetting. In this study, we propose a completely modular fine-tuning pipeline that enables dynamic language adaptation for LLMs. Instead of directly fine-tuning on all languages, our approach first trains English-centric input and output LoRA adapters for each language separately, and then merges the corresponding adapters for arbitrary-direction translation without any additional training. Experiments on 12 translation directions of four low-resource and less-supported languages show that modular fine-tuning achieves up to 86% of the performance of traditional multi-parallel full-parameter fine-tuning, while training only 0.1% of the parameters and relying solely on English-centric data, without any catastrophic forgetting. Furthermore, we perform a comprehensive analysis of the merging ratio, when to merge, and the rationale for using English as a bridge language, via Bayesian Optimization and the logit lens.
A Multi-Task Learning Framework for Modeling Engagement and Topic-Sensitive Responses in Arabic Women’s Discourse
Mabrouka Bessghaier | Md. Rafiul Biswas | Shimaa Ibrahim | Wajdi Zaghouani
Predicting how audiences react to Arabic social media posts requires reasoning beyond textual sentiment: reactions emerge from collective interpretation moderated by engagement dynamics and topical context. We present a multi-task learning (MTL) framework that jointly learns (i) audience reaction classification (Love, Haha, Angry, Sad, Care, Wow), (ii) engagement magnitude regression (six reactions, comments, shares), and (iii) non-engagement detection. On a corpus of 158k Arabic Facebook posts spanning women’s rights, gender debates, and economic empowerment, our model achieves a test macro-F1 of 72.4 and weighted-F1 of 89.1.
We Are What We Repeatedly Do: Improving Long Context Instruction Following
Preston K Robinette | Andrew Hard | Swaroop Ramaswamy | Ehsan Amid | Rajiv Mathews | Taylor T Johnson
Large language model context lengths have grown rapidly in recent years, from 512 tokens in GPT to 2M tokens in Gemini 1.5 Pro. Larger context windows enable models to condition on significantly more input tokens, leading to higher quality responses for some user prompts. However, longer contexts also pose challenges to system instruction adherence. In this work, we formalize verifiable instructions to evaluate model *compliance* based on clear, measurable criteria. From these criteria, we present **VerIFY**, a **Ver**ifiable **I**nstruction **F**ollowing **Y**ardstick dataset designed to benchmark the compliance and accuracy of LLMs in adhering to various types of instructions across multi-turn, long-context conversations. From experiments with open-source models, we reveal insights into instruction-following failures in long contexts, helping to improve the reliability, safety, and precision of these models. Furthermore, we implement and evaluate six mitigation strategies to enhance instruction compliance in extended contexts, achieving an improvement of up to 79%. This is the first work to consider instruction following for multi-turn, long-context conversations.
ConRAS: Contrastive In-context Learning Framework for Retrieval-Augmented Summarization
Juseon Do | Sungwoo Han | Jingun Kwon | Hidetaka Kamigaito | Manabu Okumura
Contrastive learning (CL) has achieved remarkable progress in natural language processing (NLP), primarily as a paradigm for pre-training and fine-tuning. However, its potential during the generation phase, particularly in in-context learning (ICL)-based retrieval-augmented summarization, remains largely unexplored. While previous studies have attempted to incorporate negative samples into ICL prompts, these methods do not enforce a true contrastive objective that encourages separation of positive and negative samples in the representation space. In this paper, we first demonstrate through preliminary experiments that small language models (SLMs) can interpret contrastive prompts and effectively distinguish between positive and negative samples during inference, without any parameter updates. Building on these findings, we propose ConRAS, a novel framework that injects contrastive objectives into ICL-based retrieval-augmented summarization. Extensive experiments and in-depth analysis on three summarization benchmarks using four SLMs show that ConRAS consistently outperforms state-of-the-art retrieval-augmented methods, achieving significant improvements in summary quality.
Beyond Sampling: Self-Sorting for Long-Context Ranking
Juseon Do | Sungwoo Han | Jingun Kwon | Hidetaka Kamigaito | Katsuhiko Hayashi | Taro Watanabe
Ranking is a fundamental component in a wide range of AI applications. However, large language models (LLMs) remain unstable on long-context ranking. Sliding-window processing is costly, and listwise prompting over full candidates still yields inconsistent orders. We show that sampling alone, even with selection-based methods, cannot stabilize ranking because LLM consistency decomposes into within-list order and cross-list preference, which a single stochastic process cannot jointly align. To address this, we introduce Self-Sorting (SS), which generates m candidate lists and performs n selection-time re-rankings over those lists. SS fuses explicit within-list positions with implicit cross-list preferences to score entities and return a top-k set. Experimental results on five widely used ranking benchmarks show significant improvements in nDCG@1,5,10, highlighting the critical role of implicit consistency.
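To illustrate the fusion step in the simplest possible terms, the sketch below combines several sampled rankings with a Borda-style score (each list contributes list length minus position per entity) and returns a top-k set. SS's actual scoring of explicit positions and implicit cross-list preferences is richer; this is only a toy rendering of list fusion.

```python
def self_sort(candidate_lists, k=3):
    """Fuse several sampled rankings with a Borda-style score and return a top-k set."""
    scores = {}
    for ranking in candidate_lists:
        for position, entity in enumerate(ranking):
            # Each list contributes (list length - position) points per entity.
            scores[entity] = scores.get(entity, 0) + (len(ranking) - position)
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    sampled = [["d3", "d1", "d2"], ["d1", "d3", "d4"], ["d3", "d2", "d1"]]
    print(self_sort(sampled, k=2))  # ['d3', 'd1']
```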
Program-of-Thought Reveals LLM Abstraction Ceilings
Mike Zhou | Fenil Bardoliya | Vivek Gupta | Dan Roth
Large language models (LLMs) are often claimed to exhibit reasoning ability when supervised with chain-of-thought (CoT) traces. True reasoning, however, requires invariance: isomorphic problems should yield identical solutions regardless of superficial variation. We test this property by evaluating base and reasoning-optimized models—including LLaMA, Mistral, Qwen, GPT-OSS, and Deepseek—on isomorphic variants from GSM8K and MATH. All models exhibit substantial accuracy drops under perturbation. To assess whether training can induce invariance, we fine-tune models with Program-of-Thought (PoT) supervision under concrete and masked formulations. PoT fine-tuning increases behavioral cross-variant consistency but does not significantly reduce the accuracy gap, and these gains fail to transfer across prompting formats and domains. Our central finding is that models converge toward stable but systematically incorrect behaviors: consistency without correctness. This dissociation suggests that current reasoning supervision teaches models to reproduce solution templates rather than to abstract mathematical structure.
From Numbers to Narratives: Efficient Language Model-Based Detection for Safety-Critical Minority Classes
Ahatsham Hayat | Hunter Tridle | Mohammad Rashedul Hasan
Safety-critical classification tasks face a persistent challenge: traditional models achieve high overall accuracy but inadequate performance on critical minority classes. We introduce a numbers-to-narratives framework that transforms tabular data into contextually rich descriptions, enabling language models to leverage pre-trained knowledge for minority class detection. Our approach integrates structured verbalization, linguistically-informed augmentation, and parameter-efficient fine-tuning to address the "minority class blind spot" in high-consequence domains. Using a significantly more efficient model architecture than existing approaches, our framework achieves superior minority class F1-scores: 78.76% for machine failures (+7.42 points over XGBoost), 65.87% for at-risk students (+12.12 points over MLP), and 32.00% for semiconductor failures (+1.01 points over XGBoost, despite 14:1 class imbalance). Our approach also improves overall accuracy by up to 22.43% in five of six datasets while maintaining computational feasibility. Ablation studies confirm that narrative-based verbalization enables effective reasoning about tabular data by contextualizing abstract numerical features. This work provides a practical, resource-efficient approach for enhancing minority class performance in safety-critical domains.
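A minimal example of the verbalization idea: turning one tabular record into a narrative sentence that a language model can classify. The template and feature names below are made up for illustration; the paper's structured verbalization and augmentation are more elaborate.

```python
def verbalize_row(record: dict) -> str:
    """Turn one tabular record into a narrative sentence for an LM classifier.

    Template and feature names are made up; the paper's structured verbalization
    and linguistically-informed augmentation are richer than this.
    """
    parts = [f"its {name.replace('_', ' ')} is {value}" for name, value in record.items()]
    return "Consider a machine where " + ", ".join(parts) + ". Did it fail?"


if __name__ == "__main__":
    row = {"air_temperature_K": 300.4, "rotational_speed_rpm": 1420, "tool_wear_min": 212}
    print(verbalize_row(row))
```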
R-GDA: Reflective Guidance Data Augmentation with Multi-Agent Feedback for Domain-Specific Named Entity Recognition
Hyeonseok Kang | Hyuk Namgoong | Goun Pyeon | Sangkeun Jung
Domain-specific Named Entity Recognition (NER) often requires data augmentation due to the scarcity of annotated corpora. Guidance Data Augmentation (GDA), a method utilizing Large Language Models (LLMs) to decompose sentences into abstract components, can lead to over-abstraction, resulting in undefined entity tags and sentences lacking domain-specific vocabulary. In this work, we propose Reflective GDA (R-GDA), a framework that introduces a multi-agent feedback loop to enhance augmentation quality. R-GDA incorporates two distinct agents: a **Guidance Refiner (GR)**, which assesses the initial abstraction to prevent over-generalization, and an **Augmentation Calibrator (AC)**, which validates the final generated sample for domain-fidelity and tag integrity. On the SciERC and NCBI-disease datasets, R-GDA improves F1-Score, validating its effectiveness. Concurrently, it achieves low BERTScore in most cases, indicating greater sentence diversity. For the FIN dataset, it achieves performance comparable to the GDA baseline. R-GDA consistently prevents errors regarding domain-specific tags, demonstrating that the reflective feedback mechanism enhances data fidelity by mitigating critical generation errors.
Enabling Autoregressive Models to Fill In Masked Tokens
Daniel Mingyi Israel | Aditya Grover | Guy Van den Broeck
Historically, LLMs have been trained using either autoregressive (AR) or masked language modeling (MLM) objectives, with AR models gaining dominance in recent years. However, AR models are inherently incapable of masked infilling, which is the ability to predict masked tokens between past and future context. In contrast, MLM models suffer from intrinsic computational inefficiencies during both training and inference that hinder their scalability. This work introduces MARIA (Masked and Autoregressive Infilling Architecture), a novel approach that leverages the strengths of both paradigms to achieve state-of-the-art masked infilling performance. MARIA combines a pre-trained MLM and AR model by training a linear decoder that takes their concatenated hidden states as input. This minimal modification enables the AR model to perform infilling while retaining its inherent advantages in terms of faster inference with KV caching. Our results demonstrate that MARIA significantly outperforms existing methods, namely discrete diffusion models, on masked infilling tasks.
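The described combination can be sketched as a single linear head over the concatenated hidden states of a frozen AR model and a frozen MLM. The dimensions and random tensors below are placeholders; in a MARIA-style setup only the linear decoder would be trained.

```python
import torch
import torch.nn as nn


class ConcatLinearDecoder(nn.Module):
    """Linear head over concatenated AR and MLM hidden states.

    Only this layer would be trained; the two pre-trained encoders (represented
    below by random tensors) stay frozen. Dimensions are illustrative.
    """

    def __init__(self, ar_dim: int, mlm_dim: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(ar_dim + mlm_dim, vocab_size)

    def forward(self, h_ar: torch.Tensor, h_mlm: torch.Tensor) -> torch.Tensor:
        return self.proj(torch.cat([h_ar, h_mlm], dim=-1))


if __name__ == "__main__":
    decoder = ConcatLinearDecoder(ar_dim=768, mlm_dim=768, vocab_size=32000)
    logits = decoder(torch.randn(2, 5, 768), torch.randn(2, 5, 768))
    print(logits.shape)  # torch.Size([2, 5, 32000])
```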
Position Encoding with Random Float Sampling Enhances Length Generalization of Transformers
Atsushi Shimizu | Shohei Taniguchi | Yutaka Matsuo
Length generalization is the ability of language models to maintain performance on inputs longer than those seen during pretraining. In this work, we introduce a simple yet powerful position encoding (PE) strategy, Random Float Sampling (RFS), that generalizes well to lengths unseen during pretraining or fine-tuning. In particular, instead of selecting position indices from a predefined discrete set, RFS uses randomly sampled continuous values, thereby avoiding out-of-distribution (OOD) issues on unseen lengths by exposing the model to diverse indices during training. Since assigning indices to tokens is a common and fundamental procedure in widely used PEs, the advantage of RFS can easily be incorporated into, for instance, the absolute sinusoidal encoding, RoPE, and ALiBi. Experiments corroborate its effectiveness by showing that RFS results in superior performance in length generalization tasks as well as zero-shot commonsense reasoning benchmarks.
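A small sketch of the core idea, assuming a standard sinusoidal encoding: instead of the integer positions 0..n-1, positions are drawn as sorted continuous values from a wide range, so the encoding is evaluated at real-valued indices. The sampling range and distribution here are assumptions, not the paper's exact recipe.

```python
import numpy as np


def rfs_positions(seq_len, max_len=4096, rng=None):
    """Sample sorted continuous position indices instead of the integers 0..seq_len-1."""
    rng = rng or np.random.default_rng(0)
    return np.sort(rng.uniform(0.0, max_len, size=seq_len))


def sinusoidal_pe(positions, d_model=16):
    """Standard sinusoidal encoding, evaluated at (possibly non-integer) positions."""
    i = np.arange(d_model // 2)
    angles = positions[:, None] / (10000 ** (2 * i / d_model))
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)


if __name__ == "__main__":
    pos = rfs_positions(seq_len=8)    # e.g. [37.2, 410.9, ...] rather than 0..7
    print(sinusoidal_pe(pos).shape)   # (8, 16)
```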
Open-Domain Safety Policy Construction
Di Wu | Siyue Liu | Zixiang Ji | Ya-Liang Chang | Zhe-Yu Liu | Andrew Pleffer | Kai-Wei Chang
Moderation layers are increasingly a core component of many products built on user- or model-generated content. However, drafting and maintaining domain-specific safety policies remains costly. We present Deep Policy Research (DPR), a minimal agentic system that drafts a full content moderation policy based on only human-written seed domain information. DPR uses a single web search tool and lightweight scaffolding to iteratively propose search queries, distill diverse web sources into policy rules, and organize rules into an indexed document. We evaluate DPR on (1) the OpenAI undesired content benchmark across five domains with two compact reader LLMs and (2) an in-house multimodal advertisement moderation benchmark. DPR consistently outperforms definition-only and in-context learning baselines, and in our end-to-end setting it is competitive with expert-written policy sections in several domains. Moreover, under the same seed specification and evaluation protocol, DPR outperforms a general-purpose deep research system, suggesting that a task-specific, structured research loop can be more effective than generic web research for policy drafting. We release our experiment code at https://github.com/xiaowu0162/deep-policy-research.
Think Just Enough: Leveraging Self-Assessed Confidence for Adaptive Reasoning in Language Models
Junyeob Kim | Sang-goo Lee | Taeuk Kim
Recent reinforcement learning (RL)-trained language models have demonstrated strong performance on complex reasoning tasks by producing long and detailed reasoning traces. However, despite these advancements, they often struggle with finding the right balance in reasoning length: some terminate prematurely before reaching a correct answer (underthinking), while others continue reasoning beyond necessity, leading to inefficiency or even degraded accuracy (overthinking). To address these challenges, we propose a method for optimizing reasoning length via self-assessed confidence. By prompting the model to evaluate its own confidence at intermediate reasoning steps, we enable dynamic stopping once sufficient reasoning is achieved. Experiments across multiple reasoning benchmarks show that our approach improves computational efficiency without compromising answer quality. Furthermore, we find that confidence estimates from RL-trained reasoning models are more reliable than those from standard LLMs, making them a valuable internal signal for controlling reasoning depth.
CLICKER: Cross-Lingual Knowledge Editing via In-Context Learning with Adaptive Stepwise Reasoning
Zehui Jiang | Xin Zhao | Yuta Kumadaki | Naoki Yoshinaga
As large language models (LLMs) are increasingly deployed as multilingual services, keeping their factual knowledge accurate across languages has become both essential and challenging. However, most existing knowledge editing (KE) methods are static, in that they update parameters offline for given accumulated edits of knowledge, and they struggle to effectively propagate edits in one language to others while avoiding side effects. To mitigate this issue, we propose **CLICKER**, a KE method with stepwise reasoning that dynamically retrieves only knowledge relevant to a given query and then edits it, while maintaining cross-lingual consistency through: (1) relevance-aware knowledge retrieval, (2) on-demand in-context KE, and (3) language alignment of the outputs. To rigorously evaluate the locality of edits in cross-lingual KE, we develop the **Multi-CounterFact** dataset, which contains many semantically similar but irrelevant prompts for each edit. Experiments on Multi-CounterFact and MzsRE with both open- and closed-source LLMs confirm that CLICKER effectively localizes edits and resolves cross-lingual inconsistencies, outperforming dynamic KE baselines.
Show or Tell? Modeling the evolution of request-making in Human-LLM conversations
Shengqi Zhu | Jeffrey Rzeszotarski | David Mimno
Designing user-centered LLM systems requires understanding how people use them, but patterns of user behavior are often masked by the variability of queries. In this work, we introduce a new framework to describe request-making that segments user input into request content, roles assigned, query-specific context, and the remaining task-independent expressions. We apply the workflow to create and analyze a dataset of 211k real-world queries based on WildChat. Compared with similar human-human setups, we find significant differences in the language for request-making in the human-LLM scenario. Further, we introduce a novel and essential perspective of diachronic analyses with user expressions, which reveals fundamental and habitual user-LLM interaction patterns beyond individual task completion. We find that query patterns evolve from early ones emphasizing sole requests to combining more context later on, and individual users explore expression patterns but tend to converge with more experience. From there, we propose to understand communal trends of expressions underlying distinct tasks and discuss the preliminary findings. Finally, we discuss the key implications for user studies, computational pragmatics, and LLM alignment.
Multilingual Self-Taught Faithfulness Evaluators
Carlo Alfano | Aymen Al Marjani | Zeno Jonke | Amin Mantrach | Saab Mansour | Marcello Federico
The growing use of large language models (LLMs) has increased the need for automatic evaluation systems, particularly to address the challenge of information hallucination. Although existing faithfulness evaluation approaches have shown promise, they are predominantly English-focused and often require expensive human-labeled training data for fine-tuning specialized models. As LLMs see increased adoption in multilingual contexts, there is a need for accurate faithfulness evaluators that can operate across languages without extensive labeled data. This paper presents STEMF (Self-Taught Evaluators for Multilingual Faithfulness), a framework that learns exclusively from synthetic multilingual data while leveraging cross-lingual transfer learning. Through experiments comparing language-specific and mixed-language fine-tuning approaches, we demonstrate a consistent relationship between an LLM’s general language capabilities and its performance in language-specific evaluation tasks. Our framework shows improvements over existing baselines, including state-of-the-art English evaluators and machine translation-based approaches.
Benchmarking Direct Preference Optimization for Medical Large Vision–Language Models
Dain Kim | Jiwoo Lee | Jaehoon Yun | Yong Hoe Koo | Qingyu Chen | Hyunjae Kim | Jaewoo Kang
Large vision-language models (LVLMs) are gaining traction in clinical tasks such as diagnostic support, report generation, and medical question answering. Among post-training techniques, Direct Preference Optimization (DPO) has shown promise in aligning model outputs with human preferences, yet its effectiveness in high-stakes medical contexts remains underexplored. In this work, we present the first systematic evaluation of nine DPO variants applied to two leading medical LVLMs, LLaVA-Med and HuatuoGPT-Vision. We benchmark these models on five curated datasets covering diverse clinical tasks. Evaluations include both automated metrics and expert assessments. Our results show that while DPO improves alignment and reduces severe hallucinations, it yields inconsistent gains over supervised fine-tuning. We further introduce a DPO variant that better handles visual misinterpretations and enhances clinical understanding. These findings reveal both the potential and limitations of DPO in medical AI. To support future research, we will release all DPO training data, model checkpoints, and expert annotations upon acceptance.
Stay Focused: Problem Drift in Multi-Agent Debate
Jonas Becker | Lars Benedikt Kaesberg | Andreas Stephan | Jan Philip Wahle | Terry Ruas | Bela Gipp
Multi-agent debate – multiple instances of large language models discussing problems in turn-based interaction – has shown promise for solving knowledge and reasoning tasks. However, these methods show limitations when solving complex problems that require longer reasoning chains. We analyze how multi-agent debate drifts away from the initial problem over multiple turns, thus harming task performance. We define this phenomenon as problem drift and quantify its presence across ten tasks (i.e., three generative, three knowledge, three reasoning, and one instruction-following task). We find that generative tasks often drift due to the subjectivity of the answer space (76-89%), compared to high-complexity tasks (7-21%). To identify the reasons, eight human experts analyze 170 multi-agent debates suffering from problem drift. We find the most common issues related to this drift are a lack of progress (35% of cases), low-quality feedback (26% of cases), and a lack of clarity (25% of cases). We propose DRIFTJudge, an LLM-as-a-judge method, as a first baseline to detect problem drift. We also propose DRIFTPolicy, which mitigates 31% of problem drift cases. Our study is a step toward understanding a key limitation of multi-agent debate, highlighting why longer debates can harm task performance and how problem drift could be addressed.
FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation
Yulia Otmakhova | Thinh Hung Truong | Rahmad Mahendra | Zenan Zhai | Rongxin Zhu | Daniel Beck | Jey Han Lau
We present FLUKE (Framework for LingUistically-driven and tasK-agnostic robustness Evaluation), a framework for assessing model robustness through systematic minimal variations of test data. FLUKE introduces controlled variations across linguistic levels — from orthography to dialect and style — and leverages large language models (LLMs) with human validation to generate modifications. We demonstrate FLUKE’s utility by evaluating both fine-tuned models and LLMs across six diverse NLP tasks (four classification and two generation tasks), and reveal that (1) the impact of linguistic variations is highly task-dependent, with some tests being critical for certain tasks but irrelevant for others; (2) LLMs still exhibit significant brittleness to certain linguistic variations, with reasoning LLMs surprisingly showing less robustness on some tasks compared to base models, and scaling improving robustness only for surface-level modifications; (3) models are overall more brittle to natural, fluent modifications such as syntax or style changes (and especially to negation), compared to corruption-style tests such as letter flipping; (4) the ability of a model to use a linguistic feature in generation does not correlate to its robustness to this feature on downstream tasks. These findings highlight the importance of systematic robustness testing for understanding model behaviors.
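Two of the variation types mentioned (corruption-style letter flipping and a natural negation change) can be illustrated with trivial string perturbations, as in the sketch below. FLUKE itself generates and validates such modifications with LLMs and human checks; these hand-rolled functions are only illustrative stand-ins.

```python
import random


def flip_letters(sentence: str, rng=None) -> str:
    """Corruption-style variation: swap two adjacent letters in one random word."""
    rng = rng or random.Random(0)
    words = sentence.split()
    idx = rng.randrange(len(words))
    w = words[idx]
    if len(w) > 3:
        j = rng.randrange(len(w) - 1)
        words[idx] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)


def negate_copula(sentence: str) -> str:
    """Natural, fluent variation: crudely negate the first copula found."""
    tokens = sentence.split()
    for verb in ("is", "are", "was", "were"):
        if verb in tokens:
            tokens[tokens.index(verb)] = f"{verb} not"
            return " ".join(tokens)
    return sentence


if __name__ == "__main__":
    s = "The treatment is effective for most patients"
    print(flip_letters(s))
    print(negate_copula(s))
```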
Sparse Brains are Also Adaptive Brains: Cognitive-Load-Aware Dynamic Activation for LLMs
Yiheng Yang | Yujie Wang | Chi Ma | Lei Yu | Emmanuele Chersoni | Chu-Ren Huang
Dense large language models (LLMs) face critical efficiency bottlenecks, as they rigidly activate all parameters regardless of input complexity. While existing sparsity methods (static pruning or dynamic activation) partially address this issue, they either lack adaptivity to contextual or model structural demands or incur prohibitive computational overhead. Inspired by the human brain’s dual-process mechanisms — predictive coding (N400) for backbone sparsity and structural reanalysis (P600) for complex contexts — we propose CLADA, a Cognitive-Load-Aware Dynamic Activation framework that synergizes statistical sparsity with semantic adaptability. Our key insight is that LLM activations exhibit two complementary patterns: 1. Global Statistical Sparsity driven by sequence-level prefix information, and 2. Local Semantic Adaptability modulated by cognitive load metrics (e.g., surprisal and entropy). CLADA employs a hierarchical thresholding strategy: a baseline derived from offline error-controlled optimization ensures over 40% sparsity, which is then dynamically adjusted using real-time cognitive signals. Evaluations across six mainstream LLMs and nine benchmarks demonstrate that CLADA achieves 20% average speedup with less than 2% accuracy degradation, outperforming Griffin (over 5% degradation) and TT (negligible speedup). Crucially, we establish the first formal connection between neurolinguistic event-related potential (ERP) components and LLM efficiency mechanisms through multi-level regression analysis (R2 = 0.17), revealing a sparsity–adaptation synergy. Requiring no retraining or architectural changes, CLADA provides a deployable solution for resource-aware LLM inference while advancing biologically inspired AI design.
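A toy rendering of the hierarchical thresholding idea, under assumed knobs: a static magnitude-quantile baseline sets the default sparsity, and the threshold is relaxed when the current token's surprisal (a cognitive-load signal) grows, so harder contexts activate more units. The exact rule and calibration in CLADA are not reproduced here.

```python
import numpy as np


def adaptive_activation_mask(activations, surprisal, base_quantile=0.4, sensitivity=0.05):
    """Keep activations above a threshold that loosens under high cognitive load.

    A static magnitude-quantile baseline (roughly base_quantile sparsity) is
    relaxed as the current token's surprisal grows, so harder contexts keep
    more units active.
    """
    quantile = max(base_quantile - sensitivity * surprisal, 0.0)
    threshold = np.quantile(np.abs(activations), quantile)
    return np.abs(activations) >= threshold


if __name__ == "__main__":
    acts = np.random.default_rng(0).normal(size=1024)
    print(adaptive_activation_mask(acts, surprisal=1.0).mean())  # fewer units kept
    print(adaptive_activation_mask(acts, surprisal=6.0).mean())  # more units kept
```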
PATCH: Mitigating PII Leakage in Language Models with Privacy-Aware Targeted Circuit PatcHing
Anthony Hughes | Vasisht Duddu | N. Asokan | Nikolaos Aletras | Ning Ma
Language models (LMs) may memorize personally identifiable information (PII) from training data, enabling adversaries to extract it during inference. Existing defense mechanisms such as differential privacy (DP) reduce this leakage, but incur large drops in utility. Based on a comprehensive study using circuit discovery to identify the computational circuits responsible for PII leakage in LMs, we hypothesize that specific PII leakage circuits should be responsible for this behavior. Therefore, we propose PATCH: Privacy-Aware Targeted Circuit Patching, a novel approach that first identifies and subsequently directly edits PII circuits to reduce leakage. PATCH achieves a better privacy-utility trade-off than existing defenses, e.g., reducing recall of PII leakage from LMs by up to 65%. Finally, PATCH can be combined with DP to reduce the recall of residual leakage of an LM to as low as 0.01%. Our analysis shows that PII leakage circuits persist even after the application of existing defense mechanisms. In contrast, PATCH can effectively mitigate their impact.
Argument Component Segmentation with Fine-Tuned Large Language Models
Ettore Caputo | Sergio Greco | Lucio La Cava
Argument Mining (AM) aims to identify and interpret argumentative structures in unstructured text, with Argument Component Classification (ACC) as a core task. Despite significant advances, most ACC approaches rely on manually pre-segmented inputs, an assumption that rarely holds in practice due to the high cost and effort of expert human annotation, creating a major bottleneck for scalable AM systems. In this work, we focus on the foundation Argument Component Segmentation (ACS) task by proposing a fine-grained, paired-tag annotation schema that explicitly distinguishes between relevant and surrounding content, thus overcoming the limitations of previous single-separator approaches. Leveraging small and open Large Language Models (LLMs) fine-tuned on our paired-tag annotation schema, we can perform ACS with quality comparable to human expert annotators across multiple benchmark datasets. We further validate our approach on the downstream ACC task, showing that automated segmentation with fine-tuned LLMs yields ACC performances comparable to pipelines relying on human annotations. These findings suggest that reliable automated ACS via LLMs is both feasible and effective, paving the way for more scalable AM pipelines without human intervention.
DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection
Yuliang Yan | Haochun Tang | Shuo Yan | Enyan Dai
Large language models (LLMs) are considered valuable Intellectual Properties (IP) due to the enormous computational cost of training, making their protection against malicious stealing or unauthorized deployment crucial. Despite efforts in watermarking and fingerprinting, existing methods either affect text generation or rely on white-box access, limiting practicality. To address this, we propose DuFFin, a novel Dual-Level Fingerprinting framework for black-box ownership verification. DuFFin jointly extracts trigger patterns and knowledge-level fingerprints to identify the source of a suspect model. We conduct experiments on diverse open-source models, including four popular base LLMs and their fine-tuned, quantized, and safety-aligned variants released by large companies, start-ups, and individuals. Results show that DuFFin accurately verifies the copyright of protected LLMs on their variants, achieving an IP-ROC greater than 0.99. Our code is available at https://github.com/yuliangyan0807/llm-fingerprint.
The Art of Saying "Maybe": A Conformal Lens for Uncertainty Benchmarking in VLMs
Asif Azad | Mohammad Sadat Hossain | MD Sadik Hossain Shanto | M Saifur Rahman | Md Rizwan Parvez
Vision-Language Models (VLMs) have achieved remarkable progress in complex visual understanding across scientific and reasoning tasks. While performance benchmarking has advanced our understanding of these capabilities, the critical dimension of uncertainty quantification has received insufficient attention. Therefore, unlike prior conformal prediction studies that focused on limited settings, we conduct a comprehensive uncertainty benchmarking study, evaluating 18 state-of-the-art VLMs (open and closed-source) across 6 multimodal datasets with 3 distinct scoring functions. For closed-source models lacking token-level logprob access, we develop and validate instruction-guided likelihood proxies. Our findings demonstrate that larger models consistently exhibit better uncertainty quantification; models that know more also know better what they don’t know. More certain models achieve higher accuracy, while mathematical and reasoning tasks elicit poorer uncertainty performance across all models compared to other domains. This work establishes a foundation for reliable uncertainty evaluation in multimodal systems.
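For context, the textbook split-conformal recipe behind a "conformal lens" looks like the sketch below: calibration nonconformity scores (here, 1 minus the probability of the true option) fix a quantile threshold, and each test question receives the set of options whose scores stay below it. The paper evaluates several scoring functions; the one used here is only the standard softmax-based choice, and the Dirichlet data is a stand-in for per-option model probabilities.

```python
import numpy as np


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal threshold from calibration probabilities (standard recipe)."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]  # nonconformity: 1 - p(true option)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, min(q_level, 1.0), method="higher")


def prediction_sets(test_probs, qhat):
    """All answer options whose nonconformity stays below the calibrated threshold."""
    return [np.where(1.0 - p <= qhat)[0].tolist() for p in test_probs]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cal_probs = rng.dirichlet(np.ones(4), size=500)   # stand-in per-option probabilities
    cal_labels = rng.integers(0, 4, size=500)
    qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
    print(prediction_sets(rng.dirichlet(np.ones(4), size=3), qhat))
```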
Diagnosis of Dysarthria Severity and Explanation Generation Using XAI-Enhanced CLINIC-GENIE on Diadochokinetic Tasks
Jihyeon Kim | Insung Lee | Myoung-Wan Koo
Deep neural network classifiers for dysarthria impairment severity face limitations regarding interpretability and treatment guidance. To overcome these, we introduce CLINIC-GENIE, an explainable two-stage framework consisting of: (1) CLINIC, a dysarthria severity classification model combining acoustic and speech embeddings with Clinically Explainable Acoustic Features (CEAFs); and (2) GENIE, a module translating CEAFs and their Shapley values into intuitive natural language explanations via a large language model. CLINIC achieved a balanced accuracy of 0.952 (17.3% improvement over using CEAFs alone), and certified speech-language pathologists rated explanations from CLINIC-GENIE with an average fidelity score of 4.94, confirming enhanced clinical utility.
A Comprehensive Evaluation of Multilingual Chain-of-Thought Reasoning: Performance, Consistency, and Faithfulness Across Languages
Raoyuan Zhao | Yihong Liu | Hinrich Schuetze | Michael A. Hedderich
Large reasoning models (LRMs) increasingly rely on step-by-step Chain-of-Thought (CoT) reasoning to improve task performance, particularly in high-resource languages such as English. While recent work has examined final-answer accuracy in multilingual settings, the thinking traces themselves, i.e., the intermediate steps that lead to the final answer, remain underexplored. In this paper, we present a comprehensive study of multilingual CoT reasoning, evaluating three key dimensions: performance, consistency, and faithfulness. We begin by measuring language compliance, answer accuracy, and answer consistency when LRMs are explicitly instructed or prompt-hacked to think in a target language, revealing strong language preferences and divergent performance across languages. Next, we assess crosslingual consistency of thinking traces by interchanging them between languages. We find that the quality and effectiveness of thinking traces vary substantially depending on the prompt language. Finally, we adapt perturbation-based techniques – i.e., truncation and error injection – to probe the faithfulness of thinking traces across languages, showing that models rely on traces to varying degrees. We release our code and data to support future research.
ORSO QGen: Odds-Ratio Steerable Optimization for Controlling Question Generation
Andreea Dutulescu | Stefan Ruseti | Mihai Dascalu | Danielle S McNamara
Question generation plays an important role in educational applications, enabling automated assessment and reading comprehension support. Attribute-controlled question generation aims to produce questions that fit predefined characteristics such as difficulty, focus, or coverage. Existing methods predominantly rely on supervised fine-tuning, which often fails to impose a strong adherence to attribute values, resulting in weak coupling between prompt specifications and model outputs. We introduce Odds-Ratio Steerable Optimization (ORSO), a framework designed to enhance attribute sensitivity in question generation models. Building upon preference-based learning techniques without requiring human-curated preference sets, ORSO employs input-level perturbations to create contrastive training signals. Empirical evaluations on both exhaustive and expert-validated attribute configurations indicate that ORSO performs better in enforcing attribute conformity while maintaining output quality. These results argue for the benefits of explicit attribute-aware optimization in controllable question generation tasks.
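For orientation, an odds-ratio preference term of the kind used in recent preference-based tuning can be written as below, where odds(y) = p/(1-p) with p the length-normalized sequence probability of a generated question. Whether ORSO uses exactly this form, and how its input-level perturbations construct the (preferred, dispreferred) pairs, is not specified by the abstract; the snippet is a hedged sketch.

```python
import torch
import torch.nn.functional as F


def odds_ratio_preference_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor) -> torch.Tensor:
    """Generic odds-ratio preference term over length-normalized sequence log-probs.

    odds(y) = p / (1 - p); the loss pushes the odds of the preferred output
    above those of the perturbed, dispreferred one.
    """
    log_odds_chosen = logp_chosen - torch.log1p(-torch.exp(logp_chosen))
    log_odds_rejected = logp_rejected - torch.log1p(-torch.exp(logp_rejected))
    return -F.logsigmoid(log_odds_chosen - log_odds_rejected).mean()


if __name__ == "__main__":
    chosen, rejected = torch.tensor([-0.4]), torch.tensor([-1.2])
    print(odds_ratio_preference_loss(chosen, rejected))
```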
Leveraging Digitized Newspapers to Collect Summarization Data in Low-Resource Languages
Noam Dahan | Omer Kidron | Gabriel Stanovsky
High quality summarization data remains scarce in under-represented languages. However, historical newspapers, made available through recent digitization efforts, offer an abundant source of untapped, naturally annotated data. In this work, we present a novel method for collecting naturally occurring summaries via Front-Page Teasers, where editors summarize full length articles. We show that this phenomenon is common across seven diverse languages and supports multi-document summarization. To scale data collection, we develop an automatic process, suited to varying linguistic resource levels. Finally, we apply this process to a Hebrew newspaper title, producing HEBTEASESUM, the first dedicated multi-document summarization dataset in Hebrew.
Let’s Simplify Step by Step: Guiding LLM Towards Multilingual Unsupervised Proficiency-Controlled Sentence Simplification
Jingshen Zhang | Xin Ying Qiu | Lifang Lu | Zhuhua Huang | Yutao Hu | Yuechang Wu | JunYu Lu
Large language models demonstrate limited capability in proficiency-controlled sentence simplification, particularly when simplifying across large readability levels. We propose a framework that decomposes complex simplifications into manageable steps through dynamic path planning, semantic-aware exemplar selection, and chain-of-thought generation with conversation history for coherent reasoning. Evaluation on five languages across two benchmarks shows our approach improves simplification effectiveness while reducing computational steps. Human evaluation confirms the fundamental trade-off between simplification effectiveness and meaning preservation. Notably, even human annotators struggle to agree on semantic preservation judgments, highlighting the inherent complexity of this task. Our work shows that while step-by-step simplification improves control, preserving semantic fidelity during extensive simplification remains an open challenge.
LogToP: Logic Tree-of-Program with Table Instruction-tuned LLMs for Controlled Logical Table-to-Text Generation
Yupian Lin | Guangya Yu | Cheng Yuan | Huan Du | Hui Luo | Yuang Bian | Jingping Liu | Zhidong He | Wen Du | Tong Ruan
Logical table-to-text generation aims to generate natural language descriptions that fluently and precisely describe the given table with both surface-level and logic-level fidelity. Although large language models (LLMs) have demonstrated strong capabilities on plain text, their proficiency in interpreting and reasoning over tabular data is still limited. In this paper, we are the first to comprehensively explore the performance of various LLMs on the logical table-to-text generation task. We find that existing LLMs struggle to achieve satisfactory results on this task; worse, existing prompt strategies cannot cope with complex non-chain logical reasoning scenarios on tables. To address these challenges, we construct a new table-related instruction dataset called LogicTableInstruct and instruction-tune an open-source LLM on this dataset, resulting in a specialized LLM (LogicTableLLaMA-3.1-8B) for table-related tasks. We also introduce a novel reasoning method, Logic Tree-of-Program (LogicToP), to improve the logical reasoning ability of LLMs on tables. Our extensive experiments on various LLMs demonstrate that LogicToP effectively improves the performance of LLMs on this task. Our LogicTableLLaMA-3.1-8B model in the 5-shot LogicToP setting achieves state-of-the-art results on the Logic2Text dataset. The code and data will be released at https://github.com/FXLP/LogToP to boost future work on table-related tasks.
IRPO: Implicit Policy Regularized Preference Optimization
Youngsoo Jang | Yu Jin Kim | Geon-Hyeong Kim | Honglak Lee | Moontae Lee
Training complexity often scales with the size of hyperparameter space for Large Language Models (LLMs). While Direct Preference Optimization (DPO) offers learning stability through reparameterizing the reward function, its regularization against the reference policy can lead to suboptimal outcomes when the reference policy is not optimal. Recent DPO variants address this concern, but at a cost: they introduce additional hyperparameters, reducing feasibility for LLM fine-tuning. To overcome this challenge, we introduce Implicit policy Regularized Preference Optimization (IRPO), which tackles suboptimality while maintaining training simplicity. By treating the winning policy that generated the chosen responses in a pairwise dataset as an implicit policy, IRPO maximizes KL-regularized reward without extra hyperparameters. Then we propose a novel PO algorithm that directly optimizes the IRPO objective by estimating the likelihood ratio between implicit policies. As the winning policy generally outperforms the reference policy, IRPO can effectively address suboptimality. Our experiments show that IRPO significantly outperforms baseline algorithms with the same hyperparameter complexity. Moreover, IRPO demonstrates comparable performance to recent algorithms that rely on a larger number of hyperparameters, offering a practical solution for scalable LLM fine-tuning.
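As a rough illustration of the kind of objective the abstract describes, the sketch below shows a DPO-style pairwise loss in which the reference log-probabilities are replaced by an estimate of the implicit winning policy that generated the chosen responses; the paper's exact formulation and estimator are not given here, so the function name, the beta value, and the toy numbers are all assumptions.

```python
import torch
import torch.nn.functional as F

def irpo_style_loss(policy_logps_chosen, policy_logps_rejected,
                    implicit_logps_chosen, implicit_logps_rejected,
                    beta=0.1):
    """DPO-style pairwise loss where the reference term is an estimate of the
    implicit 'winning' policy that generated the chosen responses.
    All tensors hold per-response sequence log-probabilities."""
    # Log-ratios of the trained policy against the implicit winning policy.
    chosen_ratio = policy_logps_chosen - implicit_logps_chosen
    rejected_ratio = policy_logps_rejected - implicit_logps_rejected
    # Bradley-Terry style preference margin, scaled by beta.
    margin = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(margin).mean()

# Toy usage with made-up log-probabilities for a batch of 4 preference pairs.
lp_c = torch.tensor([-12.3, -9.8, -15.1, -11.0])
lp_r = torch.tensor([-13.0, -11.2, -14.9, -12.4])
ilp_c = torch.tensor([-11.9, -10.1, -14.8, -11.3])
ilp_r = torch.tensor([-13.5, -11.0, -15.2, -12.0])
print(irpo_style_loss(lp_c, lp_r, ilp_c, ilp_r))
```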
DFPE: A Diverse Fingerprint Ensemble for Enhancing LLM Performance
Seffi Cohen | Nurit Cohen Inger | Niv Goldshlager | Bracha Shapira | Lior Rokach
Large Language Models (LLMs) demonstrate impressive capabilities but exhibit inconsistent performance across diverse domains. We propose DFPE (Diverse Fingerprint Ensemble), a novel training-free method that systematically constructs subject-adaptive ensembles by balancing model diversity and competence. DFPE introduces three key innovations: (1) semantic fingerprinting using averaged response embeddings to capture distinct problem-solving patterns, (2) DBSCAN-based clustering with quantile-based competence filtering to ensure diverse yet capable model selection, and (3) exponentially-weighted aggregation adapted to subject-specific performance. Our method’s effectiveness is highlighted on the challenging MMLU-pro benchmark, where DFPE achieves a striking 17.1 percentage point gain over the best single model, reaching 71.4% accuracy. This strong performance is consistent across other standard benchmarks, with significant accuracy improvements of 4.4 points on AGIEval and 2.7 points on MMLU. Our results underscore that a systematic approach to ensemble construction - one that balances diversity, subject-specific competence, and adaptive weighting, can substantially enhance the generalization and robustness of LLMs on multifaceted language understanding tasks.
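A minimal sketch of the three stages named in the abstract: fingerprinting via averaged response embeddings, DBSCAN clustering with a quantile competence filter, and exponentially weighted aggregation. The eps, quantile, and temperature values are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def select_and_weight(fingerprints, accuracies, quantile=0.5, eps=0.3, temp=0.1):
    """fingerprints: (n_models, d) averaged response embeddings per model.
    accuracies: (n_models,) validation accuracy per model on a given subject."""
    labels = DBSCAN(eps=eps, min_samples=1, metric="cosine").fit_predict(fingerprints)
    keep = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        cutoff = np.quantile(accuracies[members], quantile)
        keep.extend(m for m in members if accuracies[m] >= cutoff)  # competence filter
    keep = np.array(sorted(keep))
    weights = np.exp(accuracies[keep] / temp)  # exponentially weighted by accuracy
    return keep, weights / weights.sum()

# Toy usage with random fingerprints and accuracies for 6 hypothetical models.
rng = np.random.default_rng(0)
fp = rng.normal(size=(6, 8))
acc = rng.uniform(0.4, 0.8, size=6)
print(select_and_weight(fp, acc))
```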
The paper extends the Data Movement Distance (DMD) – a metric defined to measure locality in computer memory – to text by defining a normalized version called nDMD. A key feature of nDMD is a new term designed to better characterize low-frequency tokens. By evaluating nDMD on the English subset of the M4 dataset and the GenAI detection shared task, the paper shows three key findings. First, nDMD is systematically higher in human-written text than in machine-generated text. Second, nDMD-based features not only outperform frequency baselines but also improve overall performance when combined with them. Finally, the proposed DMD normalization is more effective in distinguishing human and machine text than alternative normalization approaches.
MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding
Jeonghun Baek | Kazuki Egashira | Shota Onohara | Atsuyuki Miyai | Yuki Imajuku | Hikaru Ikuta | Kiyoharu Aizawa
Manga, or Japanese comics, is a richly multimodal narrative form that blends images and text in complex ways. Teaching large multimodal models (LMMs) to understand such narratives at a human-like level could help manga creators reflect on and refine their stories. To this end, we introduce two benchmarks for multimodal manga understanding: MangaOCR, which targets in-page text recognition, and MangaVQA, a novel benchmark designed to evaluate contextual understanding through visual question answering. MangaVQA consists of 526 high-quality, manually constructed question-answer pairs, enabling reliable evaluation across diverse narrative and visual scenarios. Building on these benchmarks, we develop MangaLMM, a manga-specialized model finetuned from the open-source LMM Qwen2.5-VL to jointly handle both tasks. Through extensive experiments, including comparisons with proprietary models such as GPT-4o and Gemini 2.5, we assess how well LMMs understand manga. Our benchmark and model provide a comprehensive foundation for evaluating and advancing LMMs in the richly narrative domain of manga.
Hierarchical User Intent Inference with Knowledge Graph Grounding
Tzu-Cheng Peng | Chien Chin Chen | Yung-Chun Chang
Understanding user intent in online reviews requires modeling not only explicit aspect ratings but also implicit motivations shaped by contextual factors. Existing large language models (LLMs) often lack structured grounding and fail to capture nuanced intent expressions. We propose HII-KG, a two-stage Hierarchical Intent Inference framework that first predicts fine-grained aspect ratings and then generates natural language intent statements, guided by contextual subgraphs retrieved from a domain-specific knowledge graph (KG). We first employ parameter-efficient fine-tuning of LLaMA3.1-8B to predict aspect ratings in an instruction-based format. We then leverage Cypher-aware prompting to generate user intent from KG summaries. Experiments on an online hotel review dataset show that HII-KG consistently outperforms strong LLM and encoder-based baselines in both aspect classification (avg. F1 +4.5%) and intent generation (BLEU +3.3, ROUGE-L +2.9). The results demonstrate that structured KG integration can significantly enhance fluency, contextual relevance, and factual alignment in user intent inference.
Improving the OOD Performance of Closed-Source LLMs on NLI Through Strategic Data Selection
Joe Stacey | Lisa Alazraki | Aran Ubhi | Beyza Ermis | Aaron Mueller | Marek Rei
We investigate the robustness of fine-tuned Large Language Models (LLMs) for the task of Natural Language Inference (NLI), finding that the in-distribution gains from fine-tuning correspond to a large drop in out-of-distribution (OOD) performance. Despite the widespread use of closed-source LLMs, there are no robustness mitigation methods that work under their API fine-tuning constraints. Existing methods to improve robustness typically require changing the fine-tuning process or large-scale data augmentation, methods that are infeasible or cost prohibitive for closed-source models. To address this, we propose strategically selecting the NLI fine-tuning data, prioritising more complex examples or replacing existing training examples with LLM-generated data. Prioritising more complex training examples improves performance on challenging OOD NLI datasets, while training with synthetic data leads to substantial improvements on easier OOD datasets. We find that synthetic examples are often too simple, and by prompting LLMs to create more complex synthetic data we can improve performance on both easy and challenging OOD datasets. Finally, we show that recent autoregressive LLMs are substantially more robust to distributional shifts compared to encoder models, and should be a preferred baseline for future research.
MMRA: A Benchmark for Evaluating Multi-Granularity and Multi-Image Relational Association Capabilities in Large Visual Language Models
Siwei Wu | King Zhu | Yu Bai | Yiming Liang | Yizhi Li | Haoning Wu | Jiaheng Liu | Ruibo Liu | Xingwei Qu | Xuxin Cheng | Ge Zhang | Wenhao Huang | Chenghua Lin
Current multi-modal benchmarks primarily focus on facts within individual images. However, they overlook the associative relations among multiple images, which necessitate conducting commonsense reasoning grounded in associated knowledge at different granularities (i.e., image-level and entity-level) as well as the ability to perceive the order of images. Therefore, we propose a multi-image relational association task and a meticulously curated Multi-granularity Multi-image Relational Association (MMRA) benchmark, comprising 1,024 samples. To systematically evaluate current LVLMs, we establish a system of associative relations among images that contains 11 subtasks (e.g., UsageSimilarity, SubEvent, etc.) at two granularity levels (i.e., image-level and entity-level), based on relations in ConceptNet. Our experiments reveal that entity-level multi-image perception tasks pose greater challenges for LVLMs than image-level tasks. Moreover, LVLMs perform poorly on spatial-related tasks, indicating limited spatial awareness. Furthermore, we find that LVLMs exhibit weak image order perception capabilities, and we design a method to significantly improve this ability, demonstrating that most current LVLMs do not adequately consider image order perception during pre-training.
COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values
Siwei Wu | JinCheng Ren | Xeron Du | Shuyue Guo | Xingwei Qu | Yiming Liang | Jie Liu | Yunwen Li | Tyler Loakman | Tianyu Zheng | Boyu Feng | Huaqing Yuan | Zili Wang | Jiaheng Liu | Wenhao Huang | Chenglin Cai | Haoran Que | Jian Yang | Yuelin Bai | Zekun Moore Wang | Zhouliang Yu | Qunshu Lin | Ding Pan | Yuchen Eleanor Jiang | Tiannan Wang | Wangchunshu Zhou | Shenzhi Wang | Xingyuan Bu | Minghao Liu | Guoyin Wang | Ge Zhang | Chenghua Lin
Existing Chinese preference datasets suffer from limited scale, restricted domain coverage, and insufficiently rigorous data validation. Human annotation significantly limits the scalability of human preference datasets. As a result, Chinese Alignment and Chinese Reward Models (CRM) have not yet been thoroughly explored. To address these challenges, we design an LLM-based data annotation pipeline with no human intervention. Based on this pipeline, we curate COIG-P (Chinese Open Instruction Generalist - Preference), a high-quality, large-scale Chinese preference dataset consisting of 1M Chinese preference pairs and 92k carefully curated Chinese queries across diverse domains, including Chat, Coding, Maths, and others. We conduct experiments to verify the quality of COIG-P from two perspectives. (1) COIG-P brings significant performance improvements for the Qwen2/2.5 and Infinity-Instruct model series on AlignBench through DPO, with gains ranging from 2% to 12%. Furthermore, it significantly outperforms other existing Chinese preference datasets. (2) We train an 8B-sized CRM and manually annotate a Chinese Reward Benchmark (CRBench). Our CRM demonstrates robust scoring ability on CRBench. In addition, in practical data construction experiments, the quality of the data constructed by our CRM is comparable to that produced by GPT-4o.
Revealing the Numeracy Gap: An Empirical Investigation of Text Embedding Models
Ningyuan Deng | Hanyu Duan | Yixuan Tang | Yi Yang
Text embedding models are widely used in natural language processing applications. However, their capability is often benchmarked on tasks that do not require understanding nuanced numerical information in text. As a result, it remains unclear whether current embedding models can precisely encode numerical content, such as numbers, into embeddings. This question is critical because embedding models are increasingly applied in domains where numbers matter, such as finance and healthcare. For example, “Company X’s market share grew by 2%” should be interpreted very differently from “Company X’s market share grew by 20%”, even though both indicate growth in market share. This study aims to examine whether text embedding models can capture such nuances. Using synthetic data in a financial context, we evaluate 13 widely used text embedding models and find that they generally struggle to capture numerical details accurately. Our further analyses provide deeper insights into embedding numeracy, informing future research to strengthen embedding model-based NLP systems with improved capacity for handling numerical content.
code-transformed: The Influence of Large Language Models on Code
Yuliang Xu | Siming Huang | Mingmeng Geng | Yao Wan | Xuanhua Shi | Dongping Chen
Coding remains one of the most fundamental modes of interaction between humans and machines. With the rapid advancement of Large Language Models (LLMs), code generation capabilities have begun to significantly reshape programming practices. This development prompts a central question: Have LLMs transformed code style, and how can such transformation be characterized? In this paper, we present a pioneering study that investigates the impact of LLMs on code style, with a focus on naming conventions, complexity, maintainability, and similarity. By analyzing code from over 20,000 GitHub repositories linked to arXiv papers published between 2020 and 2025, we identify measurable trends in the evolution of coding style that align with characteristics of LLM-generated code. For instance, the proportion of snake_case function names in Python code increased from 40.7% in Q1 2023 to 49.8% in Q3 2025. Furthermore, we investigate how LLMs approach algorithmic problems by examining their reasoning processes. Our experimental results may provide the first large-scale empirical evidence that LLMs affect real-world programming style.
Do LLMs model human linguistic variation? A case study in Hindi-English Verb code-mixing
Mukund Choudhary | Madhur Jindal | Gaurja Aeron | Monojit Choudhury
Do large language models (LLMs) model linguistic variation? We investigate this question through Hindi-English (Hinglish) verb code-mixing, where speakers can use either a Hindi verb or an English verb with the light verb karna (’do’). Both forms are grammatical, but speakers show unexplained variation in language choice for the verb. We compare human preferences on controlled code-mixed minimal pairs to LLM perplexities spanning families, sizes, and training language compositions. We find that current LLMs do not reliably classify verb language preferences to match native speaker judgments. We also see that with specific supervision, some models do predict human preference to an extent. We release native speaker acceptability judgments on 30 verb pairs, perplexity ratios for 4,279 verb pairs across 7 models, and experimental materials.
ART: Attention-Regularized Transformers for Multi-Modal Robustness
Mohammed Bouri | Mohammed Erradi | Adnane Saoud
Transformers have become the standard in Natural Language Processing (NLP) and Computer Vision (CV) due to their strong performance, yet they remain highly sensitive to small input changes, often referred to as adversarial attacks, such as synonym swaps in text or pixel-level perturbations in images. These adversarial attacks can mislead predictions, while existing defenses are often domain-specific or lack formal robustness guarantees. We propose the Attention-Regularized Transformer (ART), a framework that enhances robustness across modalities. ART builds on the Attention Sensitivity Tensor (AST), which quantifies the effect of input perturbations on attention outputs. By incorporating an AST-based regularizer into training, ART encourages stable attention maps under adversarial perturbations in both text and image tasks. We evaluate ART on IMDB, QNLI, CIFAR-10, CIFAR-100, and Imagenette. Results show consistent robustness gains over strong baselines such as FreeLB and DSRM: up to +36.9% robust accuracy on IMDB and QNLI, and +5–25% on image benchmarks across multiple Vision Transformer (ViT) architectures, while maintaining or improving clean accuracy. ART is also highly efficient, training over 10× faster than adversarial methods on text and requiring only 1.25× the cost of standard training on images, compared to 1.5–5.5× for recent robust ViTs. Codes are available at [https://github.com/cliclab-um6p/ART](https://github.com/cliclab-um6p/ART)
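The abstract does not define the Attention Sensitivity Tensor, so the sketch below only illustrates the general idea of an attention-stability regularizer: penalize the divergence between attention maps computed on clean and perturbed inputs and add it to the task loss. All names and weights are assumptions rather than the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def attention_stability_penalty(attn_clean, attn_perturbed, eps=1e-8):
    """attn_*: (batch, heads, seq, seq) attention probabilities.
    Penalizes the average KL divergence between clean and perturbed maps."""
    kl = (attn_clean * ((attn_clean + eps).log() - (attn_perturbed + eps).log())).sum(-1)
    return kl.mean()

# Toy usage: random attention maps for a 2-sample batch, 4 heads, length 5.
logits = torch.randn(2, 4, 5, 5)
clean = logits.softmax(-1)
perturbed = (logits + 0.05 * torch.randn_like(logits)).softmax(-1)
task_loss = torch.tensor(0.7)            # stand-in for the usual task loss
loss = task_loss + 0.1 * attention_stability_penalty(clean, perturbed)
print(loss)
```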
GRAFF: GRaph-Augmented Fine-grained Fusion for Large Language Models
Himanshu Chaudhary | Ruida Wang | Gowtham Ramesh | Junjie Hu
Tackling Distractor Documents in Multi-Hop QA with Reinforcement and Curriculum Learning
Jerry Huang | Siddarth Madala | Risham Sidhu | Cheng Niu | Hao Peng | Julia Hockenmaier | Tong Zhang
Retrieval-augmented generation (RAG) systems rely on retrieval models for identifying relevant contexts and answer generation models for utilizing those contexts. However, retrievers exhibit imperfect recall and precision, limiting downstream performance. We introduce RAG-RL, an answer generation model trained for multi-hop question answering (MHQA) to not only generate answers but also to identify and cite relevant information from larger sets of retrieved contexts, shifting some of the burden of identifying relevant documents from the retriever to the answer generator. Our approach uses curriculum learning, where models are trained across retrieval settings with varying levels of noise. Our experiments show that training samples with fewer distractor documents enable models to acquire citation and reasoning skills with greater sample efficiency and generalizability, demonstrating strong model performance even as the number of irrelevant passages increases. We benchmark our methods on three open-domain MHQA datasets and report significant gains in answer and citation accuracy. Furthermore, our experiments provide empirical insights into how simpler training samples can give models stronger signals for learning specific skills (e.g., citation generation) and how different components of post-training (e.g., training set construction, rule-based rewards, training sample ordering, etc.) impact final model performance.
RoD-TAL: A Benchmark for Answering Questions in Romanian Driving License Exams
Andrei Vlad Man | Răzvan-Alexandru Smădu | Cristian-George Craciun | Dumitru-Clementin Cercel | Florin Pop | Mihaela-Claudia Cercel
The intersection of AI and legal systems presents a growing need for tools that support legal education, particularly in under-resourced languages such as Romanian. In this work, we aim to evaluate the capabilities of Large Language Models (LLMs) and Vision-Language Models (VLMs) in understanding and reasoning about the Romanian driving law through textual and visual question-answering tasks. To facilitate this, we introduce RoD-TAL, a novel multimodal dataset comprising Romanian driving test questions, text-based and image-based, along with annotated legal references and explanations written by human experts. We implement and assess retrieval-augmented generation (RAG) pipelines, dense retrievers, and reasoning-optimized models across tasks, including Information Retrieval (IR), Question Answering (QA), Visual IR, and Visual QA. Our experiments demonstrate that domain-specific fine-tuning significantly enhances retrieval performance. At the same time, chain-of-thought prompting and specialized reasoning models improve QA accuracy, surpassing the minimum passing grades required for driving exams. We highlight the potential and limitations of applying LLMs and VLMs to legal education. We release the code and resources through the GitHub repository (https://github.com/vladman-25/RoD-TAL).
FactSelfCheck: Fact-Level Black-Box Hallucination Detection for LLMs
Albert Sawczyn | Jakub Binkowski | Denis Janiak | Bogdan Gabrys | Tomasz Jan Kajdanowicz
Large Language Models (LLMs) frequently generate hallucinated content, posing significant challenges for applications where factuality is crucial. While existing hallucination detection methods typically operate at the sentence level or passage level, we propose FactSelfCheck, a novel zero-resource black-box sampling-based method that enables fine-grained fact-level detection. Our approach represents text as interpretable knowledge graphs consisting of facts in the form of triples, providing clearer insights into content factuality than traditional approaches. Through analyzing factual consistency across multiple LLM responses, we compute fine-grained hallucination scores without requiring external resources or training data. Our evaluation demonstrates that FactSelfCheck performs competitively with leading sentence-level sampling-based methods while providing more detailed and interpretable insights. Most notably, our fact-level approach significantly improves hallucination correction, achieving a 35.5% increase in factual content compared to the baseline, while sentence-level SelfCheckGPT yields only a 10.6% improvement. The granular nature of our detection enables more precise identification and correction of hallucinated content. Additionally, we contribute FavaMultiSamples, a novel dataset that addresses a gap in the field by providing the research community with a second dataset for evaluating sampling-based methods.
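A minimal sketch of fact-level consistency scoring, assuming triples have already been extracted from the main response and that a support check over sampled responses is available; the paper uses LLM-based components for both, so the keyword check below is only a toy stand-in.

```python
def fact_hallucination_scores(triples, sampled_responses, supports):
    """triples: list of (subject, relation, object) from the response under test.
    sampled_responses: alternative responses sampled for the same prompt.
    supports(triple, response) -> bool, a placeholder for an LLM/NLI support check.
    Returns a score in [0, 1] per triple; higher means more likely hallucinated."""
    scores = {}
    for t in triples:
        agree = sum(supports(t, r) for r in sampled_responses)
        scores[t] = 1.0 - agree / len(sampled_responses)
    return scores

# Toy usage with a trivial keyword-based support check.
def supports(triple, response):
    s, _, o = triple
    return s in response and o in response

triples = [("Marie Curie", "born_in", "Warsaw"), ("Marie Curie", "born_in", "Paris")]
samples = ["Marie Curie was born in Warsaw.",
           "Marie Curie was born in Warsaw and later moved to Paris."]
print(fact_hallucination_scores(triples, samples, supports))
```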
Punctuations and Predicates in Language Models
Sonakshi Chauhan | Maheep Chaudhary | Choy Kwan Kiu | Samuel Nellessen | Nandi Schoots
In this paper we explore where information is collected and how it is propagated throughout layers in large language models (LLMs). We begin by examining the surprising computational importance of punctuation tokens which previous work has identified as attention sinks and memory aids. Using intervention-based techniques, we evaluate the necessity and sufficiency of punctuation tokens across layers in GPT-2, DeepSeek, and Gemma. Our results show stark model-specific differences: for GPT-2, punctuation is both necessary and sufficient in multiple layers, while this holds far less in DeepSeek and not at all in Gemma. Extending beyond punctuation, we ask whether LLMs process different components of input (e.g., subjects, adjectives, punctuation, full sentences) by forming early static summaries reused across the network, or if the model remains sensitive to changes in these components across layers. We investigate whether different reasoning rules are processed differently by LLMs. In particular, through interchange intervention and layer-swapping experiments, we find that conditional statements (if, then), and universal quantification (for all) are processed very differently. Our findings offer new insight into the internal mechanisms of punctuation usage and reasoning in LLMs and have implications for interpretability and model analysis.
Test-time Corpus Feedback: From Retrieval to RAG
Mandeep Rathee | Venktesh V | Sean MacAvaney | Avishek Anand
Retrieval-Augmented Generation (RAG) has emerged as a standard framework for knowledge-intensive NLP tasks, combining large language models (LLMs) with document retrieval from external corpora. Despite its widespread use, most RAG pipelines continue to treat retrieval and reasoning as isolated components, retrieving documents once and then generating answers without further interaction. This static design often limits performance on complex tasks that require iterative evidence gathering or high-precision retrieval. Recent work in both the information retrieval (IR) and NLP communities has begun to close this gap by introducing adaptive retrieval and ranking methods that incorporate feedback. In this survey, we present a structured overview of advanced retrieval and ranking mechanisms that integrate such feedback. We categorize feedback signals based on their source and role in improving the query, retrieved context, or document pool. By consolidating these developments, we aim to bridge IR and NLP perspectives and highlight retrieval as a dynamic, learnable component of end-to-end RAG systems.
RADAR: A Reasoning-Guided Attribution Framework for Explainable Visual Data Analysis
Anku Rani | Aparna Garimella | Apoorv Saxena | Balaji Vasan Srinivasan | Paul Pu Liang
Data visualizations like charts are fundamental tools for quantitative analysis and decision-making across fields, requiring accurate interpretation and mathematical reasoning. The emergence of Multimodal Large Language Models (MLLMs) offers promising capabilities for automated visual data analysis, such as processing charts, answering questions, and generating summaries. However, they provide no visibility into which parts of the visual data informed their conclusions; this black-box nature poses significant challenges to real-world trust and adoption. In this paper, we take the first major step toward evaluating and enhancing the capabilities of MLLMs to attribute their reasoning process by highlighting the specific regions in charts and graphs that justify model answers. To this end, we contribute RADAR, a semi-automatic approach to obtain a benchmark dataset comprising 1000 charts, 2000 question-answer pairs, 3599 reasoning steps, and 11,220 attribution annotations. We also introduce a method that provides attribution for chart-based mathematical reasoning. Experimental results demonstrate that our reasoning-guided approach improves attribution accuracy by up to 15 percentage points compared to baseline methods, and enhanced attribution capabilities translate to stronger answer generation, achieving high semantic similarity (BERTScore ∼ 0.90) with ground truth responses. This advancement represents a significant step toward more interpretable and trustworthy chart analysis systems, enabling users to verify and understand model decisions through reasoning and attribution.
MaskLoRA: Low-Rank Subspace–Induced Token Masking for Efficient and Faithful Language Models
Rifat Rafiuddin
A Domain-Specific Curated Benchmark for Entity and Document-Level Relation Extraction
Marco Martinelli | Stefano Marchesin | Vanessa Bonato | Giorgio Di Nunzio | Nicola Ferro | Ornella Irrera | Laura Menotti | Federica Vezzani | Gianmaria Silvello
Information Extraction (IE), encompassing Named Entity Recognition (NER), Named Entity Linking (NEL), and Relation Extraction (RE), is critical for transforming the rapidly growing volume of scientific publications into structured, actionable knowledge. This need is especially evident in fast-evolving biomedical fields such as the gut-brain axis, where research investigates complex interactions between the gut microbiota and brain-related disorders. Existing biomedical IE benchmarks, however, are often narrow in scope and rely heavily on distantly supervised or automatically generated annotations, limiting their utility for advancing robust IE methods. We introduce GutBrainIE, a benchmark based on more than 1,600 PubMed abstracts, manually annotated by biomedical and terminological experts with fine-grained entities, concept-level links, and relations. While grounded in the gut-brain axis, the benchmark’s rich schema, multiple tasks, and combination of highly curated and weakly supervised data make it broadly applicable to the development and evaluation of biomedical IE systems across domains.
What Matters to an LLM? Behavioral and Computational Evidences from Summarization
Yongxin Zhou | Changshun Wu | Philippe Mulhem | Didier Schwab | Maxime Peyrard
Large Language Models (LLMs) are now state-of-the-art at summarization, yet the internal notion of importance that drives their information selections remains hidden. We propose to investigate this by combining behavioral and computational analyses. Behaviorally, we generate a series of length-controlled summaries for each document and derive empirical importance distributions based on how often each information unit is selected. These reveal that LLMs converge on consistent importance patterns, sharply different from pre-LLM baselines, and that LLMs cluster more by family than by size. Computationally, we identify that certain attention heads align well with empirical importance distributions, and that middle-to-late layers are strongly predictive of importance. Together, these results provide initial insights into *what* LLMs prioritize in summarization and *how* this priority is internally represented, opening a path toward interpreting and ultimately controlling information selection in these models.
Neural network embeddings recover value dimensions from psychometric survey items on par with human data
Max Pellert | Clemens M Lechner | Indira Sen | Markus Strohmaier
We demonstrate that embeddings derived from large language models, when processed with "Survey and Questionnaire Item Embeddings Differentials" (SQuID), can recover the structure of human values obtained from human rater judgments on the Revised Portrait Value Questionnaire (PVQ-RR). We compare multiple embedding models across a number of evaluation metrics including internal consistency, dimension correlations and multidimensional scaling configurations. Unlike previous approaches, SQuID addresses the challenge of obtaining negative correlations between dimensions without requiring domain-specific fine-tuning or training data re-annotation. Quantitative analysis reveals that our embedding-based approach explains 55% of variance in dimension-dimension similarities compared to human data. Multidimensional scaling configurations show alignment with pooled human data from 49 different countries. Generalizability tests across three personality inventories (IPIP, BFI-2, HEXACO) demonstrate that SQuID consistently increases correlation ranges, suggesting applicability beyond value theory. These results show that semantic embeddings can effectively replicate psychometric structures previously established through extensive human surveys. The approach offers substantial advantages in cost, scalability and flexibility while maintaining comparable quality to traditional methods. Our findings have significant implications for psychometrics and social science research, providing a complementary methodology that could expand the scope of human behavior and experience represented in measurement tools.
Compositional Reasoning via Joint Image and Language Decomposition
Dwip Dalal | Madhav Kanda | Zhenhailong Wang | Heng Ji | Unnat Jain
Multimodal reasoning tasks such as visual question answering (VQA) require models to process both language and visual inputs. However, existing approaches typically decompose only language queries, treating images as monolithic inputs. We introduce REDI, a framework that jointly decomposes both images and questions into visual sub-domains (segmentation, material, depth, and color) with corresponding sub-questions. REDI uses an MLLM orchestrator to select the sub-domains required for each query, generate domain-specific sub-questions with grounded object references (via shared object labels), and fuse worker outputs via consistency-aware aggregation (verify–refine–override) to produce the final answer. This hierarchical multi-agent design mitigates error propagation and improves compositional reasoning across both open- and closed-source MLLMs. On SEEDBench, MMBench, and CLEVR, REDI achieves absolute accuracy improvements of 8.9%, 8.2%, and 16.0% over chain-of-thought and visual programming baselines. Project webpage: https://madhav-kanda.github.io/redi
Better Call CLAUSE: A Discrepancy Benchmark for Auditing LLMs Legal Reasoning Capabilities
Manan Roy Choudhury | Adithya Chandramouli | Mannan Anand | Vivek Gupta
The rapid integration of large language models (LLMs) into high-stakes legal work has exposed a critical gap: no benchmark exists to systematically stress-test their reliability against the nuanced, adversarial, and often subtle flaws present in real-world contracts. To address this, we introduce CLAUSE, a first-of-its-kind benchmark designed to evaluate the fragility of an LLM’s legal reasoning. We study the capabilities of LLMs to detect and reason about fine-grained discrepancies by producing over 7500 real-world perturbed contracts from foundational datasets like CUAD and ContractNLI. Our novel, persona-driven pipeline generates 10 distinct anomaly categories, which are then validated against official statutes using a Retrieval-Augmented Generation (RAG) system to ensure legal fidelity. We use CLAUSE to evaluate leading LLMs’ ability to detect embedded legal flaws and explain their significance. Our analysis shows a key weakness: these models often miss subtle errors and struggle even more to justify them legally. Our work outlines a path to identify and correct such reasoning failures in legal AI.
Token-Wise Kernels (TWiKers) for Vicinity-Aware Attention in Transformers
Kuangdai Leng | Jia Bi | Samuel Pinilla | Jaehoon Cha
Self-attention mechanisms in transformers enable tokens to interact across a sequence but lack an explicit inductive bias to capture local contextual dependencies, an inherent characteristic of natural languages. We propose Token-Wise Kernels (TWiKers), a novel enhancement to transformers that learn token-specific convolutional kernels applied to the keys or values. Each token is assigned a small kernel, initialized to the "Central Dirac" (e.g., [0,1,0] for size=3), meaning the token "bears" the attention from all other tokens alone. During training, these kernels adapt, and greater deviation from the Central Dirac indicates stronger attention redistribution to neighboring tokens. This introduces the first transformer weights with direct semantic interpretability. Our experiments show that content words (e.g., nouns and verbs) retain self-focus, while function words (e.g., prepositions and conjunctions) shift attention toward their neighbors, aligning with their syntactic and semantic roles. We further apply TWiKers to distinguish literary genres, historical periods, and authors, demonstrating their effectiveness in capturing high-level stylistic patterns. Finally, we demonstrate the potential of TWiKers as an effective inductive bias to improve transformer training, validated across a range of downstream tasks.
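A minimal PyTorch sketch of a token-wise size-3 kernel over the value sequence, initialized to the central Dirac [0, 1, 0] so the layer starts as an identity map. How the paper parameterizes per-token kernels and whether they act on keys or values is not specified in the abstract, so the linear-projection design below is illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenWiseKernel(nn.Module):
    """Per-token size-3 convolution over the value sequence.
    Kernels are predicted from each token's hidden state and initialized to the
    central Dirac [0, 1, 0], i.e. the layer starts as an identity map."""
    def __init__(self, d_model, kernel_size=3):
        super().__init__()
        self.proj = nn.Linear(d_model, kernel_size)
        nn.init.zeros_(self.proj.weight)
        with torch.no_grad():
            self.proj.bias.copy_(torch.tensor([0.0, 1.0, 0.0]))

    def forward(self, hidden, values):
        # hidden, values: (batch, seq, d_model)
        kernels = self.proj(hidden)                      # (batch, seq, 3)
        padded = F.pad(values, (0, 0, 1, 1))             # pad along the sequence dim
        neighbours = torch.stack(
            [padded[:, :-2], padded[:, 1:-1], padded[:, 2:]], dim=2
        )                                                # (batch, seq, 3, d_model)
        return (kernels.unsqueeze(-1) * neighbours).sum(dim=2)

x = torch.randn(2, 6, 16)
layer = TokenWiseKernel(16)
out = layer(x, x)
print(torch.allclose(out, x, atol=1e-6))  # True at initialization (identity)
```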
Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pre-training
Jeffrey Li | Joshua P Gardner | Doug Kang | Fangping Shi | Karanjeet Singh | Chun-Liang Li | Herumb Shandilya | David Leo Wright Hall | Oncel Tuzel | Percy Liang | Ludwig Schmidt | Hadi Pouransari | Fartash Faghri
One of the first pre-processing steps for constructing web-scale LLM pretraining datasets involves extracting text from HTML. Despite the immense diversity of web content, existing open-source datasets predominantly apply a single fixed extractor to all webpages. In this work, we investigate whether this practice leads to suboptimal coverage and utilization of Internet data. We first show that while different extractors may lead to similar model performance on standard language understanding tasks, the pages surviving a fixed filtering pipeline can differ substantially. This suggests a simple intervention: by taking a Union over different extractors, we can increase the token yield of DCLM-Baseline by up to 71% while maintaining benchmark performance. We further show that for structured content such as tables and code blocks, extractor choice can significantly impact downstream task performance, with differences of up to 10 percentage points (p.p.) on WikiTQ and 3 p.p. on HumanEval.
Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists
Michał Pietruszka | Łukasz Borchmann | Aleksander Jędrosz | Paweł Morawiecki
We present FeatEng, a novel benchmark designed to evaluate the ability of large language models (LLMs) to perform feature engineering, a critical and knowledge-intensive task in data science. FeatEng assesses LLMs by their capacity to generate Python code that transforms raw tabular data into features that improve the performance of a downstream machine learning model. Our analysis of LLM outputs reveals that success on FeatEng often requires the application of significant world and domain knowledge, along with complex reasoning, to construct novel data representations. While focused on feature engineering, the benchmark probes a confluence of abilities indicative of an LLM’s broader potential for practical, data-centric problem-solving. We demonstrate that FeatEng offers a targeted and efficient approach to assess a specific but crucial aspect of LLM capabilities relevant to real-world data science applications.
Distill and Align Decomposition for Enhanced Claim Verification
Jabez Magomere | Elena Kochkina | Samuel Mensah | Simerjot Kaur | Fernando Acero | Arturo Oncevay | Charese Smiley | Xiaomo Liu | Manuela Veloso
Complex claim verification requires decomposing sentences into verifiable subclaims, yet existing methods struggle to align decomposition quality with verification performance. We propose a reinforcement learning (RL) approach that jointly optimizes decomposition quality and verifier alignment using Group Relative Policy Optimization (GRPO). Our method integrates: (i) structured sequential reasoning; (ii) supervised finetuning on teacher-distilled exemplars; and (iii) a multi-objective reward balancing format compliance, verifier alignment, and decomposition quality. Across six evaluation settings, our trained 8B decomposer improves downstream verification performance to 71.75% macro-F1, outperforming prompt-based approaches (+1.99, +6.24) and existing RL methods (+5.84). Human evaluation confirms the high quality of the generated subclaims. Our framework enables smaller language models to achieve state-of-the-art claim verification by jointly optimising for verification accuracy and decomposition.
Argument-Based Consistency in Toxicity Explanations of LLMs
Ramaravind Kommiya Mothilal | Joanna Roy | Syed Ishtiaque Ahmed | Shion Guha
The discourse around toxicity and LLMs in NLP largely revolves around detection tasks. This work shifts the focus to evaluating LLMs’ *reasoning* about toxicity—from their explanations that justify a stance—to enhance their trustworthiness in downstream tasks. Despite extensive research on explainability, it is not straightforward to adopt existing methods to evaluate free-form toxicity explanation due to their over-reliance on input text perturbations, among other challenges. To account for these, we propose a novel, theoretically-grounded multi-dimensional criterion, **Argument-based Consistency (ArC)**, that measures the extent to which LLMs’ free-form toxicity explanations reflect an ideal and logical argumentation process. Based on uncertainty quantification, we develop six metrics for ArC to comprehensively evaluate the (in)consistencies in LLMs’ toxicity explanations. We conduct several experiments on three Llama models (of size up to 70B) and an 8B Ministral model on five diverse toxicity datasets. Our results show that while LLMs generate plausible explanations to simple prompts, their reasoning about toxicity breaks down when prompted about the nuanced relations between the complete set of reasons, the individual reasons, and their toxicity stances, resulting in inconsistent and irrelevant responses. We open-source our [code](https://github.com/uofthcdslab/ArC) and [LLM-generated explanations](https://huggingface.co/collections/uofthcdslab/arc) for future works.
Reasoning Beyond Literal: Cross-style Multimodal Reasoning for Figurative Language Understanding
Seyyed Saeid Cheshmi | Hahnemann Ortiz | James Mooney | Dongyeop Kang
Vision–language models (VLMs) have demonstrated strong reasoning abilities in literal multimodal tasks such as visual mathematics and science question answering. However, figurative language—such as sarcasm, humor, and metaphor—remains a significant challenge, as it conveys intent and emotion through subtle incongruities between expressed and intended meanings. In multimodal settings, accompanying images can amplify or invert textual meaning, demanding models that reason across modalities and account for subjectivity. We propose a three-step framework for developing efficient multimodal reasoning models that can (i) interpret multimodal figurative language, (ii) provide transparent reasoning traces, and (iii) generalize across multiple figurative styles. Experiments across four styles show that (1) incorporating reasoning traces substantially improves multimodal figurative understanding, (2) reasoning learned in one style can transfer to others—especially between related styles like sarcasm and humor, and (3) training jointly across styles yields a generalized reasoning VLM that outperforms much larger open- and closed-source models. Our findings show that lightweight VLMs with verifiable reasoning achieve robust cross-style generalization while providing inspectable reasoning traces for multimodal tasks. The code and implementation are available at https://github.com/scheshmi/CrossStyle-MMR.
QueStER: Query Specification for Generative Keyword-Based Retrieval
Arthur Satouf | Yuxuan Zong | Habiboulaye Amadou Boubacar | Pablo Piantanida | Benjamin Piwowarski
Generative retrieval (GR) differs from the traditional index–then–retrieve pipeline by storing relevance in model parameters and generating retrieval cues directly from the query, but it can be brittle out of domain and expensive to scale. We introduce QueStER (QUEry SpecificaTion for gEnerative Keyword-Based Retrieval), which bridges GR and query reformulation by learning to generate explicit keyword-based search specifications. Given a user query, a lightweight LLM produces a keyword query that is executed by a standard retriever (BM25), combining the generalization benefits of generative query rewriting with the efficiency and scalability of lexical indexing. We train the rewriting policy with reinforcement learning techniques. Across in- and out-of-domain evaluations, QueStER consistently improves over BM25 and is competitive with neural IR baselines, while maintaining strong efficiency.
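A minimal sketch of the generate-then-retrieve pipeline: a (stubbed) rewriter turns the user query into keywords, which are scored against a corpus with BM25. The rank_bm25 package and the stopword-based rewriter are assumptions standing in for the trained LLM policy and a production-grade retriever; the RL training of the rewriting policy is omitted.

```python
from rank_bm25 import BM25Okapi

corpus = [
    "BM25 is a bag-of-words ranking function used by search engines.",
    "Generative retrieval stores relevance inside model parameters.",
    "Query reformulation rewrites a user query into a better search query.",
]
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

def rewrite_to_keywords(query: str) -> str:
    """Stand-in for the LLM rewriter; a real system would call a trained policy."""
    stopwords = {"how", "does", "a", "the", "work", "what", "is"}
    return " ".join(w for w in query.lower().split() if w not in stopwords)

query = "how does a generative retrieval model work"
keywords = rewrite_to_keywords(query)
scores = bm25.get_scores(keywords.split())
best = max(range(len(corpus)), key=lambda i: scores[i])
print(keywords, "->", corpus[best])
```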
Evaluating Sparse Autoencoders for Monosemantic Representation
Moghis Fereidouni | Muhammad Umair Haider | Peizhong Ju | A.b. Siddique
A key barrier to interpreting large language models is polysemanticity, where neurons activate for multiple unrelated concepts. Sparse autoencoders (SAEs) have been proposed to mitigate this issue by transforming dense activations into sparse, more interpretable features. While prior work suggests that SAEs promote monosemanticity, no quantitative comparison has examined how concept activation distributions differ between SAEs and their base models. This paper provides the first systematic evaluation of SAEs against base models through the lens of activation distributions. We introduce a fine-grained concept separability score based on the Jensen–Shannon distance, which captures how distinctly a neuron’s activation distributions vary across concepts. Using two large language models (Gemma-2-2B and DeepSeek-R1) and multiple SAE variants across five datasets (including word-level and sentence-level), we show that SAEs reduce polysemanticity and achieve higher concept separability. To assess practical utility, we evaluate concept-level interventions using two strategies: full neuron masking and partial suppression. We find that, compared to base models, SAEs enable more precise concept-level control when using partial suppression. Building on this, we propose Attenuation via Posterior Probabilities (APP), a new intervention method that uses concept-conditioned activation distributions for targeted suppression. APP achieves the smallest perplexity increase while remaining highly effective at concept removal.
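A minimal sketch of a Jensen–Shannon-based separability score for one unit: histogram its activations under two concepts and measure the distance between the two binned distributions. The binning scheme and the synthetic activations are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def concept_separability(acts_a, acts_b, bins=30):
    """acts_a, acts_b: 1-D arrays of one unit's activations on concept A / B.
    Returns the Jensen-Shannon distance between the two binned distributions."""
    lo = min(acts_a.min(), acts_b.min())
    hi = max(acts_a.max(), acts_b.max())
    p, _ = np.histogram(acts_a, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(acts_b, bins=bins, range=(lo, hi), density=True)
    return jensenshannon(p + 1e-12, q + 1e-12)  # small floor avoids empty bins

# A unit whose activations barely differ across concepts vs. one that separates them.
rng = np.random.default_rng(0)
polysemantic = concept_separability(rng.normal(0, 1, 5000), rng.normal(0.1, 1, 5000))
monosemantic = concept_separability(rng.normal(0, 1, 5000), rng.normal(3.0, 1, 5000))
print(polysemantic, "<", monosemantic)
```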
Event Detection with a Context-Aware Encoder and LoRA for Improved Performance on Long-Tailed Classes
Abdullah Al Monsur | Nitesh Vamshi Bommisetty | Gene Louis Kim
The current state of event detection research has two notable recurring limitations that we investigate in this study. First, the unidirectional nature of decoder-only LLMs presents a fundamental architectural bottleneck for natural language understanding tasks that depend on rich, bidirectional context. Second, we confront the conventional reliance on Micro-F1 scores in the event detection literature, which systematically inflates performance by favoring majority classes. Instead, we focus on Macro-F1 as a more representative measure of a model’s ability across the long tail of event types. Our experiments demonstrate that models enhanced with sentence context achieve superior performance over canonical decoder-only baselines. Using Low-Rank Adaptation (LoRA) during fine-tuning provides a substantial boost in Macro-F1 scores, especially for the decoder-only models, showing that LoRA can be an effective tool to enhance LLMs’ performance on long-tailed event classes.
Think Hard Only When Needed: A Hybrid Best-of-N and Beam Search for Efficient Test-Time Compute
Hyewon Suh | Chaojian Li | Cheng-Jhih Shih | Zheng Wang | Kejing Xia | Yonggan Fu | Yingyan Celine Lin
Test-time compute has emerged as a promising paradigm that enables small language models (SLMs) to achieve large language model (LLM)-level capabilities by allocating additional compute for explicit reasoning during inference. Two common approaches are beam search and Best-of-N sampling. Beam search improves reasoning quality by scoring and optimizing token sequences using Process Reward Models (PRMs), but can incur non-trivial computational overhead and latency. In contrast, Best-of-N executes all reasoning trajectories without PRM guidance, often wasting compute on low-quality trajectories that may have gone astray early in the generation process. To address both inefficiencies, we propose THROW (THink haRd Only When needed)—a hybrid inference pipeline that combines the diversity of Best-of-N with the reasoning trajectory optimization of beam search. THROW introduces a selective branch truncation and expansion mechanism: it generates shorter initial trajectories than Best-of-N and evaluates them using PRMs to classify each query as "easy" or "hard." Based on this classification, THROW applies branch truncation for easy queries, mimicking Best-of-N, and PRM-guided branch expansion for hard ones, similar to beam search. Evaluations on MATH500, AMC23, and AIME24 demonstrate that THROW achieves 1.54× and 14.38× latency speedups and 35.7% and 80.4% token reductions on average while preserving high reasoning accuracy compared to Best-of-N and Beam Search, respectively.
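A minimal sketch of the easy/hard routing idea with every model call stubbed out: draw several short prefixes, score them with a PRM, then either finish the best prefixes directly (easy) or expand them beam-search style (hard). The threshold, widths, and helper functions are all assumptions, not the paper's configuration.

```python
import random

def throw_style_inference(prompt, generate_prefix, continue_from, prm_score,
                          n=8, easy_threshold=0.8, beam_width=2):
    """Hybrid Best-of-N / beam search routing, with all model calls stubbed out.
    generate_prefix(prompt) -> partial trajectory (str)
    continue_from(prompt, prefix) -> completed trajectory (str)
    prm_score(prompt, text) -> float in [0, 1]"""
    prefixes = [generate_prefix(prompt) for _ in range(n)]
    scored = sorted(prefixes, key=lambda p: prm_score(prompt, p), reverse=True)
    if prm_score(prompt, scored[0]) >= easy_threshold:
        # Easy query: truncate to the best prefixes and finish them, Best-of-N style.
        finished = [continue_from(prompt, p) for p in scored[:beam_width]]
    else:
        # Hard query: expand each surviving prefix several times, beam-search style.
        finished = [continue_from(prompt, p) for p in scored[:beam_width] for _ in range(2)]
    return max(finished, key=lambda t: prm_score(prompt, t))

# Toy usage with deterministic stubs standing in for the SLM and the PRM.
random.seed(0)
ans = throw_style_inference(
    "2 + 2 = ?",
    generate_prefix=lambda q: f"step: compute {q}",
    continue_from=lambda q, p: p + f" -> answer {random.choice([3, 4])}",
    prm_score=lambda q, t: 0.9 if "4" in t else 0.3,
)
print(ans)
```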
Role-Conditioned Refusals: Evaluating Access Control Reasoning in Large Language Models
Đorđe Klisura | Joseph Khoury | Ashish Kundu | Ram Krishnan | Anthony Rios
Access control is a cornerstone of secure computing, yet large language models often blur role boundaries by producing unrestricted responses. We study role-conditioned refusals, focusing on the LLM’s ability to adhere to access control policies by answering when authorized and refusing when not. To evaluate this behavior, we created a novel dataset that extends the Spider and BIRD text-to-SQL datasets, both of which have been modified with realistic PostgreSQL role-based policies at the table and column levels. We compare three designs: (i) zero or few-shot prompting, (ii) a two-step generator-verifier pipeline that checks SQL against policy, and (iii) LoRA fine-tuned models that learn permission awareness directly. Across multiple model families, explicit verification (the two-step framework) improves refusal precision and lowers false permits. At the same time, fine-tuning achieves a stronger balance between safety and utility (i.e., when considering execution accuracy). Longer and more complex policies consistently reduce the reliability of all systems. We release RBAC-augmented datasets and code.
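A minimal sketch of the verifier step in the two-stage design: check generated SQL against a role's table and column grants and refuse anything outside them. The regex-based extraction and the POLICY mapping are toy assumptions; a real system would use a proper SQL parser and the benchmark's PostgreSQL policies.

```python
import re

POLICY = {  # hypothetical role-based grants
    "analyst": {"orders": {"id", "total", "created_at"}},
    "support": {"orders": {"id", "created_at"}, "customers": {"id", "email"}},
}

def verify_sql(role, sql):
    """Return (allowed, reason). Naive: handles only 'SELECT cols FROM table'."""
    m = re.match(r"select\s+(.+?)\s+from\s+(\w+)", sql.strip(), re.IGNORECASE)
    if not m:
        return False, "refuse: query shape not recognized by the toy verifier"
    cols = {c.strip() for c in m.group(1).split(",")}
    table = m.group(2).lower()
    grants = POLICY.get(role, {})
    if table not in grants:
        return False, f"refuse: role '{role}' may not read table '{table}'"
    denied = cols - grants[table] if cols != {"*"} else {"*"}
    if denied:
        return False, f"refuse: columns {sorted(denied)} not granted to '{role}'"
    return True, "permit"

print(verify_sql("support", "SELECT id, created_at FROM orders"))   # permit
print(verify_sql("support", "SELECT total FROM orders"))            # refuse (column)
print(verify_sql("analyst", "SELECT email FROM customers"))         # refuse (table)
```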
NL2Logic: AST-Guided Translation of Natural Language into First-Order Logic with Large Language Models
Rizky Ramadhana Putra | Raihan Sultan Pasha Basuki | Yutong Cheng | Peng Gao
Automated reasoning is critical in domains such as law and governance, where verifying claims against facts in documents requires both accuracy and interpretability. Recent work has adopted a structured reasoning paradigm that parses first-order logic (FOL) rules from natural language and delegates inference to automated solvers. With the rise of large language models (LLMs), methods such as GCD and CODE4LOGIC leverage their reasoning and code generation capabilities to enhance logic parsing. However, these approaches suffer from (1) fragile syntax control, due to weak enforcement of global grammar consistency, and (2) low semantic faithfulness, as they lack fine-grained clause-level semantic understanding. To address these challenges, we propose NL2Logic, a FOL translation framework that uses an AST as an intermediate layer, combining a recursive LLM-based semantic parser with an AST-guided generator that deterministically produces solver-ready code. On the FOLIO, LogicNLI, and ProofWriter benchmarks, NL2Logic attains 99% syntactic accuracy and improves semantic correctness by 30% over state-of-the-art baselines. Moreover, integrating NL2Logic into Logic-LM yields near-perfect executability and improves downstream reasoning accuracy by ~31% over Logic-LM’s original few-shot unconstrained FOL translation module.
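A minimal sketch of using a small AST as the intermediate layer, with deterministic rendering from nodes to a FOL string. The node types and the Prover9-like output syntax are assumptions, since the abstract does not fix a concrete grammar; the LLM-based semantic parser that would produce such ASTs is omitted.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Pred:
    name: str
    args: Tuple[str, ...]

@dataclass
class Not:
    body: object

@dataclass
class Implies:
    left: object
    right: object

@dataclass
class ForAll:
    var: str
    body: object

def render(node) -> str:
    """Deterministically render the AST into a FOL string (Prover9-like syntax)."""
    if isinstance(node, Pred):
        return f"{node.name}({', '.join(node.args)})"
    if isinstance(node, Not):
        return f"-({render(node.body)})"
    if isinstance(node, Implies):
        return f"({render(node.left)} -> {render(node.right)})"
    if isinstance(node, ForAll):
        return f"all {node.var} ({render(node.body)})"
    raise TypeError(f"unknown node: {node!r}")

# "All squares are rectangles."
ast = ForAll("x", Implies(Pred("Square", ("x",)), Pred("Rectangle", ("x",))))
print(render(ast))   # all x ((Square(x) -> Rectangle(x)))
```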
Coding Agents with Multimodal Browsing are Generalist Problem Solvers
Aditya Bharat Soni | Boxuan Li | Xingyao Wang | Valerie Chen | Graham Neubig
Modern human labor is characterized by specialization; we train for years and develop particular tools that allow us to perform well across a variety of tasks. Similarly, specialized AI agents with task-specific tools or architectures often fail to generalize beyond their intended scope. In this work, we ask: *can agents achieve generalizability across diverse domains with a small, but well-chosen set of general tools?* We propose OpenHands-Versa, a single-agent system with a modest number of general tools, such as code execution, a search engine, a web browser, and a multimodal file viewer, for three practical domains: software engineering, deep research, and web browsing. Notably, OpenHands-Versa demonstrates superior or competitive performance over task-specific specialized agents on three challenging benchmarks: SWE-Bench Multimodal, GAIA, and The Agent Company, with absolute improvements in success rate of **9.1**, **1.3**, and **9.1** points, respectively. Thus, our *single-agent* system can achieve strong generalization, indicating that specialist agents for these domains provide no practical benefit. Furthermore, we find that specialist multi-agent systems do not generalize beyond their intended scope. These findings establish OpenHands-Versa as a strong baseline for future research.
Quantifying Data Contamination in Psychometric Evaluations of LLMs
Jongwook Han | Woojung Song | Jonggeun Lee | Yohan Jo
Recent studies apply psychometric questionnaires to Large Language Models (LLMs) to assess high-level psychological constructs such as values, personality, moral foundations, and dark traits. Although prior work has raised concerns about possible data contamination from psychometric inventories, which may threaten the reliability of such evaluations, there has been no systematic attempt to quantify the extent of this contamination. To address this gap, we propose a framework to systematically measure data contamination in psychometric evaluations of LLMs, evaluating three aspects: (1) item memorization, (2) evaluation memorization, and (3) target score matching. Applying this framework to 21 models from major families and four widely used psychometric inventories, we provide evidence that popular inventories such as the Big Five Inventory (BFI-44) and Portrait Values Questionnaire (PVQ-40) exhibit strong contamination, where models not only memorize items but can also adjust their responses to achieve specific target scores.
Task-aware Block Pruning with Output Distribution Signals for Large Language Models
Song-ha Jo | Youngrok Ko | Sang-goo Lee | Jinseok Seol
Song-ha Jo | Youngrok Ko | Sang-goo Lee | Jinseok Seol
Large language models (LLMs) provide excellent performance, but their practical deployment is limited by the substantial compute and memory demands of large models and the latency of auto-regressive decoding. To mitigate these inefficiencies, block pruning reduces the number of executed transformer blocks, effectively lowering latency while preserving architectural coherence. However, existing methods typically rely on representation similarity or computationally expensive sensitivity analyses to estimate block importance, thereby neglecting task-aware model behavior. To address this limitation, we introduce Task-aware Block Pruning (TaBP), a novel approach that directly captures task-specific inference dynamics by quantifying block-level uncertainty from the statistics of each block’s early-exited output distribution on a calibration dataset. Since output distributions reflect the model’s confidence and decision uncertainty conditioned on downstream tasks, these statistics provide a principled signal for identifying blocks that are less critical for task performance. Extensive experiments demonstrate that TaBP preserves downstream task performance while substantially reducing inference latency and computational cost, without relying on cost-heavy sensitivity analyses. To facilitate reproducibility and further research, we release our implementation of TaBP on [GitHub](https://github.com/Song-haJo/TaBP).
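The abstract above describes scoring blocks from the statistics of early-exited output distributions on a calibration set. The following is a minimal sketch of that idea under explicit assumptions (the `model.embed`/`model.blocks`/`lm_head` interface and the entropy-based importance criterion are illustrative, not the paper's implementation).

```python
# Minimal sketch (assumptions throughout): score each transformer block by the
# entropy of its early-exited next-token distribution on a calibration set,
# then keep the blocks whose early-exit distributions change the most.
import torch
import torch.nn.functional as F

@torch.no_grad()
def block_entropy_scores(model, lm_head, calib_batches):
    """Return one mean-entropy score per block; `model.blocks`, `model.embed`,
    and a shared `lm_head` for early exit are assumed interfaces."""
    scores = [0.0] * len(model.blocks)
    n = 0
    for input_ids in calib_batches:
        h = model.embed(input_ids)                # hypothetical embedding call
        for i, block in enumerate(model.blocks):
            h = block(h)
            logits = lm_head(h)                   # early exit after block i
            p = F.softmax(logits, dim=-1)
            ent = -(p * p.clamp_min(1e-9).log()).sum(-1).mean()
            scores[i] += ent.item()
        n += 1
    return [s / n for s in scores]

def prune_blocks(model, scores, keep_ratio=0.75):
    """Keep blocks whose early-exit entropy changes most relative to the previous
    block (one possible importance criterion; an assumption, not the paper's)."""
    deltas = [float("inf")]                       # always keep the first block
    deltas += [abs(scores[i] - scores[i - 1]) for i in range(1, len(scores))]
    k = max(1, int(len(scores) * keep_ratio))
    keep = sorted(sorted(range(len(scores)), key=lambda i: deltas[i], reverse=True)[:k])
    model.blocks = torch.nn.ModuleList([model.blocks[i] for i in keep])
    return keep
```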
LARA: LLM-based Agile Power Distribution Network Restoration from Disastrous Events
Jishnu Warrier | Heqing Huang | Yuzhang Lin | Sai Qian Zhang
Jishnu Warrier | Heqing Huang | Yuzhang Lin | Sai Qian Zhang
Restoring power distribution networks after disruptions demands rapid, reliable coordination across repair crews, mobile power sources, and switching actions under strict constraints. Classical optimization yields high-quality plans but can be slow, while reinforcement learning often requires feeder-specific training and careful reward shaping. We recast restoration as language-conditioned planning: a large language model generates high-level restoration plans over a compact pre-validated catalogue of feasible actions. This constrained generation design makes decisions reliably, scalably, and interpretably, and allows for real-time human-in-the-loop decision-making while requiring no topology-specific setup or retraining. Our method achieves near-mixed-integer-linear programming performance on the IEEE 13-node standard power distribution feeder and outperforms a time-capped MILP solver on the IEEE 33-node standard feeder by around 13%, while using less than 1% of its wall-clock runtime.
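As a rough illustration of the "compact pre-validated catalogue of feasible actions" idea in this abstract, the sketch below (assumed prompt/output format, not the authors' system) shows how a generated plan can be restricted to catalogue entries and validated before execution.

```python
# Minimal sketch: the LLM is only allowed to return action IDs drawn from a
# pre-validated catalogue; anything outside the catalogue is rejected before
# execution. The catalogue entries and JSON plan format are illustrative.
import json

CATALOGUE = {
    "A1": "close tie-switch 12-13",
    "A2": "dispatch mobile battery to node 7",
    "A3": "assign crew 2 to repair line 4-5",
}

def validate_plan(llm_output: str):
    """Parse the model's JSON plan and keep only catalogue actions, in order."""
    try:
        proposed = json.loads(llm_output)
    except json.JSONDecodeError:
        return []
    return [a for a in proposed if a in CATALOGUE]

print(validate_plan('["A2", "A9", "A1"]'))  # ['A2', 'A1'] -- unknown A9 is dropped
```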
Evaluating Multi-Hop Reasoning in Large Language Models: A Chemistry-Centric Benchmark
Mohammad Khodadad | Ali Shiraee Kasmaee | Mahdi Astaraki | Nicholas Sherck | Hamidreza Mahyar | Soheila Samiee
Mohammad Khodadad | Ali Shiraee Kasmaee | Mahdi Astaraki | Nicholas Sherck | Hamidreza Mahyar | Soheila Samiee
We introduce ChemComp, the first chemistry-focused benchmark for evaluating compositional multi-hop reasoning in large language models (LLMs). Our automated pipeline constructs benchmarks from proprietary or public data by integrating generative reasoning models, chemical named-entity recognition, and external knowledge bases to build knowledge graphs. Applied to recent chemistry literature, this approach minimizes overlap with LLM pretraining data. The resulting dataset comprises 1,188 multi-hop questions, refined through domain-expert feedback and robust evaluation protocols. Using ChemComp, we systematically compare LLM performance with and without retrieval augmentation, including an idealized gold-context scenario. Our results show that even state-of-the-art models struggle with compositional reasoning: retrieval significantly improves accuracy, yet reasoning errors persist even under perfect retrieval. These findings highlight the limitations of current LLMs and the critical role of retrieval-augmented methods in scientific reasoning. Furthermore, our pipeline is generalizable with fine-tuning, enabling the creation of challenging multi-hop reasoning benchmarks across domains and proprietary datasets.
SD-E2: Semantic Exploration for Reasoning Under Token Budgets
Kshitij Mishra | Nils Lukas | Salem Lahlou
Kshitij Mishra | Nils Lukas | Salem Lahlou
How to Contextualize Empirical Data for Risk Analysis with LLMs: A Case Study of Power Outages
Haiyun Huang | Yukun Li | Marco A Pretell | Jacob Naroian | Ebadah Khan | Liping Liu
Haiyun Huang | Yukun Li | Marco A Pretell | Jacob Naroian | Ebadah Khan | Liping Liu
Large Language Models (LLMs) are increasingly being considered for high-stakes decision-making, yet their application in statistical risk analysis remains largely underexplored. A central challenge in this domain is enabling LLMs to effectively leverage historical data. To address this, we propose novel methods for extracting key information from raw data and translating it into structured contextual input within the LLM prompt. Applying our methods to a case study of power outage risk assessment, we demonstrate that this contextualization strategy significantly improves the LLM’s performance in risk assessment tasks. While the LLM’s prediction performance still does not match that of a standard machine learning model, the LLM-based approach offers distinct advantages in versatility and interpretability. These findings demonstrate a new paradigm for contextualizing data to support risk assessment.
Thinking Beyond the Local: Multi-View Instructed Adaptive Reasoning in KG-Enhanced LLMs
Minghan Zhang | Shu Zhao | Zhen Yang | Hongsheng Wu | Yongxing Lin | Haodong Zou | Jie Chen | Zhen Duan
Minghan Zhang | Shu Zhao | Zhen Yang | Hongsheng Wu | Yongxing Lin | Haodong Zou | Jie Chen | Zhen Duan
Knowledge Graph-enhanced Large Language Models (KG-Enhanced LLMs) integrate the linguistic capabilities of LLMs with the structured semantics of Knowledge Graphs (KGs), showing strong potential in knowledge-intensive reasoning tasks. However, existing methods typically adopt query-driven iterative reasoning from a local perspective, which limits their ability to capture semantically distant but crucial information, leading to dual bottlenecks in efficiency and accuracy for complex multi-hop tasks. To address this issue, we propose MIAoG, a multi-view instructed adaptive reasoning framework for LLMs on KGs, designed to overcome the limitations of local exploration by enabling LLMs to plan, evaluate, and adapt reasoning paths from a global perspective. Instead of query-anchored exploration, MIAoG first prompts the LLM to generate a multi-view instruction set that outlines diverse potential reasoning paths and explicitly specifies global reasoning intentions to guide the model toward coherent and targeted reasoning. During reasoning, MIAoG integrates a real-time introspection mechanism that evaluates the alignment between the current path and the instructions, adaptively pruning inconsistent trajectories to enhance global consistency while maintaining efficiency. Extensive experiments on multiple public datasets show that MIAoG achieves state-of-the-art performance in KG-enhanced LLM reasoning, particularly excelling in complex multi-hop scenarios.
DAMASHA: Detecting AI in Mixed Adversarial Texts via Segmentation with Human-interpretable Attribution
L D M S Sai Teja | N Siva Gopala Krishna | Ufaq Khan | Muhammad Haris Khan | Atul Mishra
L D M S Sai Teja | N Siva Gopala Krishna | Ufaq Khan | Muhammad Haris Khan | Atul Mishra
In the age of advanced large language models (LLMs), the boundaries between human and AI-generated text are becoming increasingly blurred. We address the challenge of segmenting mixed-authorship text, that is, identifying transition points where authorship shifts from human to AI or vice versa, a problem with critical implications for authenticity, trust, and human oversight. We introduce a novel framework, Info-Mask, for mixed-authorship detection that integrates stylometric cues, perplexity-driven signals, and structured boundary modeling to accurately segment collaborative human-AI content. To evaluate the robustness of our system against adversarial perturbations, we construct and release an adversarial benchmark dataset, Mixed-text Adversarial setting for Segmentation (MAS), designed to probe the limits of existing detectors. Beyond segmentation accuracy, we introduce Human-Interpretable Attribution (HIA) overlays that highlight how stylometric features inform boundary predictions, and we conduct a small-scale human study assessing their usefulness. Across multiple architectures, Info-Mask significantly improves span-level robustness under adversarial conditions, establishing new baselines while revealing remaining challenges. Our findings highlight both the promise and limitations of adversarially robust, interpretable mixed-authorship detection, with implications for trust and oversight in human-AI co-authorship.
FINEST: Improving LLM Responses to Sensitive Topics Through Fine-Grained Evaluation
Juhyun Oh | Nayeon Lee | Chani Jung | Jiho Jin | Junho Myung | Jongwon Lee | Taieui Song | Alice Oh
Juhyun Oh | Nayeon Lee | Chani Jung | Jiho Jin | Junho Myung | Jongwon Lee | Taieui Song | Alice Oh
Large Language Models (LLMs) often default to overly cautious and vague responses when handling sensitive topics, sacrificing helpfulness for safety. Existing evaluation frameworks lack systematic methods to identify and address specific weaknesses in responses to sensitive topics, making it difficult to improve both safety and helpfulness simultaneously. To address this, we introduce FINEST, a FINE-grained response evaluation taxonomy for Sensitive Topics, which breaks down helpfulness and harmlessness into errors across three main categories: Content, Logic, and Appropriateness. Experiments on a Korean-sensitive question dataset demonstrate that our score- and error-based improvement pipeline, guided by FINEST, significantly improves the model responses across all three categories, outperforming refinement without guidance. Notably, score-based improvement—providing category-specific scores and justifications—yields the most significant gains, reducing the error sentence ratio for Appropriateness by up to 33.09%. This work lays the foundation for a more explainable and comprehensive evaluation and improvement of LLM responses to sensitive questions.
Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs
Junbo Li | Peng Zhou | Rui Meng | Meet P. Vadera | Lihong Li | Yang Li
Junbo Li | Peng Zhou | Rui Meng | Meet P. Vadera | Lihong Li | Yang Li
Reinforcement learning (RL) has re-emerged as a natural approach for training interactive LLM agents in real-world environments. However, directly applying the widely used Group Relative Policy Optimization (GRPO) algorithm to multi-turn tasks exposes notable limitations, particularly in scenarios requiring long-horizon reasoning. To address these challenges, we investigate more stable and effective advantage estimation strategies, especially for multi-turn settings. We first explore Proximal Policy Optimization (PPO) as an alternative and find it to be more robust than GRPO. To further enhance PPO in multi-turn scenarios, we introduce turn-PPO, a variant that operates on a turn-level MDP formulation, as opposed to the commonly used token-level MDP. Our results on the WebShop and Sokoban datasets demonstrate the effectiveness of turn-PPO, both with and without long reasoning components.
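The abstract above contrasts turn-level and token-level advantage estimation. A minimal sketch of the turn-level view follows (not the paper's implementation; the use of generalized advantage estimation and the discount defaults are assumptions): each entry of `rewards` and `values` corresponds to one agent turn, and every token within a turn would then share that turn's advantage.

```python
# Minimal sketch: generalized advantage estimation computed over turns rather
# than tokens. gamma/lam defaults are arbitrary choices for illustration.
def turn_level_gae(rewards, values, gamma=0.99, lam=0.95):
    """Return one advantage per dialogue turn."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Example: a 3-turn episode where only the final turn is rewarded.
print(turn_level_gae(rewards=[0.0, 0.0, 1.0], values=[0.2, 0.4, 0.6]))
```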
Decoding Time Series with LLMs: A Multi-Agent Framework for Cross-Domain Annotation
Minhua Lin | Zhengzhang Chen | Yanchi Liu | Xujiang Zhao | Zongyu Wu | Junxiang Wang | Xiang Zhang | Suhang Wang | Haifeng Chen
Minhua Lin | Zhengzhang Chen | Yanchi Liu | Xujiang Zhao | Zongyu Wu | Junxiang Wang | Xiang Zhang | Suhang Wang | Haifeng Chen
Time series data is ubiquitous across various domains, including manufacturing, finance, and healthcare. High-quality annotations are essential for effectively understanding time series and facilitating downstream tasks. However, obtaining such annotations is challenging, particularly in mission-critical domains. In this paper, we propose TESSA, a multi-agent system designed to automatically generate both general and domain-specific annotations for time series data. TESSA introduces two agents: a general annotation agent and a domain-specific annotation agent. The general agent captures common patterns and knowledge across multiple source domains, leveraging both time-series-wise and text-wise features to generate general annotations. Meanwhile, the domain-specific agent utilizes limited annotations from the target domain to learn domain-specific terminology and generate targeted annotations. Extensive experiments on multiple synthetic and real-world datasets demonstrate that TESSA effectively generates high-quality annotations, outperforming existing methods.
Multi-Hall-SA: A Cross-lingual Benchmark for Multi-Type Hallucination Detection in Low-Resource South African Languages
Sello Ralethe | Jan Buys
Sello Ralethe | Jan Buys
Hallucinations generated by Large Language Models (LLMs) pose significant challenges for their application to low-resource languages. We present Multi-Hall-SA, a cross-lingual benchmark for hallucination detection spanning English and four low-resource South African languages: isiZulu, isiXhosa, Sepedi, and Sesotho. Derived from government texts, this benchmark categorizes hallucinations into four types aligned with established taxonomies of factual errors: temporal shifts, entity errors, numerical inaccuracies, and location mistakes. Human validation confirms the quality and cross-lingual alignment of our synthetically generated hallucinations. Our cross-lingual alignment methodology enables direct performance comparison between high-resource and low-resource languages, revealing notable gaps in detection capabilities. Evaluation across four state-of-the-art models shows they detect up to 23.6% fewer hallucinations in South African languages compared to English. Knowledge augmentation reduces this disparity, decreasing cross-lingual performance gaps by 59.4% on average. Beyond introducing a validated resource for low-resource languages, Multi-Hall-SA provides a framework for evaluating and improving factual reliability across linguistic boundaries, advancing more inclusive and equitable AI development.
Query4Regex: Verifiable Regex Transformation through Formal Operations from NL and DSL Queries
Joonghyuk Hahn | Yo-Sub Han
Joonghyuk Hahn | Yo-Sub Han
While large language models (LLMs) excel at generating structured data, such as code, their ability to precisely manipulate it based on instructions remains relatively under-explored. Regular expressions (regexes), critical in practice, are challenging to manipulate. Crucially, the correctness of transformations can be mathematically verified, making them exceptionally well-suited for measuring the symbolic reasoning of LLMs. We introduce Query4Regex, a new benchmark for evaluating verifiable transformations on regexes. Our benchmark tests two query formats: natural language instructions and a program-like domain-specific language (DSL) that specifies the sequence of operations. We evaluate a range of LLMs, verifying semantic correctness through rigorous deterministic finite automata (DFA) equivalence testing. Our empirical studies reveal: 1) the formal DSL significantly outperforms natural language, achieving up to 6.74%p accuracy gains on average. 2) Performance for both formats degrades sharply as compositional complexity increases, highlighting a core challenge in multi-step reasoning. 3) Models often generate plausible but unparsable outputs. Even among parsable outputs, semantic errors remain common, making failures difficult to detect without formal verification. Query4Regex provides a robust framework for analyzing the gap between LLMs’ linguistic fluency and their symbolic reasoning, paving the way for more reliable and verifiable manipulation of formal languages. Our code is available at https://github.com/peer0/Query4Regex.
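The verification step this abstract relies on, DFA equivalence testing, can be illustrated with a short product-construction search. This is a sketch of the verification idea only (compiling a regex into the explicit DFA form used below is a separate step not shown, and this is not the authors' code).

```python
# Minimal sketch: two regexes are semantically equivalent iff their DFAs accept
# the same language, which a BFS over the product automaton can decide.
from collections import deque

def dfa_equivalent(dfa_a, dfa_b, alphabet):
    """Each DFA is (start_state, accepting_set, transitions) where
    transitions[(state, symbol)] -> state is total over the alphabet."""
    start_a, acc_a, trans_a = dfa_a
    start_b, acc_b, trans_b = dfa_b
    seen = {(start_a, start_b)}
    queue = deque([(start_a, start_b)])
    while queue:
        qa, qb = queue.popleft()
        if (qa in acc_a) != (qb in acc_b):
            return False                      # found a distinguishing string
        for sym in alphabet:
            nxt = (trans_a[(qa, sym)], trans_b[(qb, sym)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True

# a* over {a}  vs.  a DFA that toggles between two accepting states:
# both accept every string of a's, so they are equivalent.
dfa1 = (0, {0}, {(0, "a"): 0})
dfa2 = (0, {0, 1}, {(0, "a"): 1, (1, "a"): 0})
print(dfa_equivalent(dfa1, dfa2, alphabet=["a"]))  # True
```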
SrcMix: Mixing of Related Source Languages Benefits Extremely Low-resource Machine Translation
Sanjeev Kumar | Preethi Jyothi | Pushpak Bhattacharyya
Sanjeev Kumar | Preethi Jyothi | Pushpak Bhattacharyya
Multilingual models are widely used for machine translation (MT). However, their effectiveness for extremely low-resource languages (ELRLs) depends critically on how related languages are incorporated during fine-tuning. In this work, we study the role of language mixing directionality, linguistic relatedness, and script compatibility in ELRL translation. We propose SrcMix, a simple source-side mixing strategy that combines related ELRLs during fine-tuning while constraining the decoder to a single target language. Compared to its target-side counterpart TgtMix, SrcMix improves performance by +3 ChrF++ and +5 BLEU in high-resource to ELRL translations, and by +5 ChrF++ and +12 BLEU in mid-resource to ELRL translations. We also release the first Angika MT dataset and provide a systematic comparison of LLM (Aya-101) and NMT (mT5-Large) models under ELRL settings, highlighting the importance of directional mixing and linguistic compatibility.
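The source-side mixing strategy described above can be pictured with a short data-construction sketch (the data format and the source-language tagging convention are assumptions, not the paper's exact setup): pairs from several related source languages are pooled, while every pair keeps the same single target language.

```python
# Minimal sketch: pool fine-tuning pairs across related source languages while
# the decoder only ever sees one target language.
import random

def srcmix_examples(parallel_data):
    """`parallel_data` maps a source-language code to (source, target) pairs
    that all share one target language."""
    mixed = []
    for src_lang, pairs in parallel_data.items():
        for src, tgt in pairs:
            # a source-language tag is one common convention, not the paper's
            mixed.append({"source": f"<{src_lang}> {src}", "target": tgt})
    random.shuffle(mixed)
    return mixed

data = {
    "lang_a": [("source sentence in language A", "shared-target translation")],
    "lang_b": [("source sentence in language B", "shared-target translation")],
}
print(srcmix_examples(data)[0])
```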
IMRNNs: An Efficient Method for Interpretable Dense Retrieval via Embedding Modulation
Yash Saxena | Ankur Padia | Kalpa Gunaratna | Manas Gaur
Yash Saxena | Ankur Padia | Kalpa Gunaratna | Manas Gaur
Interpretability in black-box dense retrievers remains a central challenge in Retrieval-Augmented Generation (RAG). Understanding how queries and documents semantically interact is critical for diagnosing retrieval behavior and improving model design. However, existing dense retrievers rely on static embeddings for both queries and documents, which obscures this bidirectional relationship. Post-hoc approaches such as re-rankers are computationally expensive, add inference latency, and still fail to reveal the underlying semantic alignment. To address these limitations, we propose Interpretable Modular Retrieval Neural Networks (IMRNNs), a lightweight framework that augments any dense retriever with dynamic, bidirectional modulation at inference time. IMRNNs employ two independent adapters: one conditions document embeddings on the current query, while the other refines the query embedding using corpus-level feedback from initially retrieved documents. This iterative modulation process enables the model to adapt representations dynamically and expose interpretable semantic dependencies between queries and documents. Empirically, IMRNNs not only enhance interpretability but also improve retrieval effectiveness. Across seven benchmark datasets, applying our method to standard dense retrievers yields average gains of +6.35% nDCG, +7.14% recall, and +7.04% MRR over state-of-the-art baselines. These results demonstrate that incorporating interpretability-driven modulation can both explain and enhance retrieval in RAG systems.
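The bidirectional modulation described in this abstract can be sketched with two small adapters over frozen embeddings (shapes, gating design, and the use of a mean-pooled feedback vector are assumptions for illustration, not the authors' architecture).

```python
# Minimal sketch: one adapter conditions document vectors on the query, another
# refines the query vector using corpus-level feedback from retrieved documents.
import torch
import torch.nn as nn

class Modulator(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, target: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # element-wise gate computed from [target; context], applied to the target
        g = self.gate(torch.cat([target, context.expand_as(target)], dim=-1))
        return target * g

dim = 8
doc_adapter, query_adapter = Modulator(dim), Modulator(dim)
query = torch.randn(1, dim)
docs = torch.randn(5, dim)                       # frozen document embeddings

docs_mod = doc_adapter(docs, query)              # documents conditioned on the query
feedback = docs_mod.mean(dim=0, keepdim=True)    # corpus-level feedback signal
query_mod = query_adapter(query, feedback)       # refined query embedding
scores = (query_mod @ docs_mod.T).squeeze(0)     # re-scored retrieval
print(scores.shape)                              # torch.Size([5])
```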
MMUIE: Massive Multi-Domain Universal Information Extraction for Long Documents
Shuyi Zhang | Zhenbin Chen | Shuting Li | Kewei Tu | Li Jing | Zixia Jia | Zilong Zheng
Shuyi Zhang | Zhenbin Chen | Shuting Li | Kewei Tu | Li Jing | Zixia Jia | Zilong Zheng
We present **MMUIE**, a large-scale universal dataset for multi-domain, document-level information extraction (IE) from long texts. Existing IE systems predominantly operate at the sentence level or within narrow domains due to annotation constraints. MMUIE addresses this gap by introducing an automated annotation pipeline that integrates traditional knowledge bases with large language models to extract fine-grained entities, aliases, and relation triples across 34 domains. The dataset comprises a weakly-supervised training set and a manually verified test set, featuring 723 entity types and 456 relation types. Empirical evaluations reveal that existing sentence-level IE models and even advanced LLMs underperform on this task, highlighting the need for better domain-aware document-level models. To this end, we develop DocUIE, a universal IE model fine-tuned on MMUIE, which achieves strong generalization and transferability across domains. MMUIE lays the foundation for robust, scalable, and universal information extraction from long-form text in diverse real-world scenarios. All code, data, and models are available at https://github.com/Shuyi-zsy/Massive-Multi-Domain-UIE.
Learning to Judge: LLMs Designing and Applying Evaluation Rubrics
Clemencia Siro | Pourya Aliannejadi | Mohammad Aliannejadi
Clemencia Siro | Pourya Aliannejadi | Mohammad Aliannejadi
Large language models (LLMs) are increasingly used as evaluators for natural language generation, applying human-defined rubrics to assess system outputs. However, human rubrics are often static and misaligned with how models internally represent language quality. We introduce GER-Eval (Generating Evaluation Rubrics for Evaluation) to investigate whether LLMs can design and use their own evaluation rubrics. We evaluate the semantic coherence and scoring reliability of LLM-defined criteria and their alignment with human criteria. LLMs reliably generate interpretable and task-aware evaluation dimensions and apply them within models, but their scoring reliability degrades in factual and knowledge-intensive settings. Closed-source models such as GPT-4o achieve higher agreement and cross-model generalization than open-weight models such as Llama. Our findings position evaluation as a learned linguistic capability of LLMs—consistent within models but fragmented across them—and call for new methods that jointly model human and LLM evaluative language to improve reliability and interpretability.
PsyProbe: Proactive and Interpretable Dialogue through User State Modeling for Exploratory Counseling
Sohhyung Park | Hyunji Kang | Sungzoon Cho | Dongil Kim
Sohhyung Park | Hyunji Kang | Sungzoon Cho | Dongil Kim
Recent advances in large language models have enabled mental health dialogue systems, yet existing approaches remain predominantly reactive, lacking systematic user state modeling for proactive therapeutic exploration. We introduce PsyProbe, a dialogue system designed for the exploration phase of counseling that systematically tracks user psychological states through the PPPPPI framework (Presenting, Predisposing, Precipitating, Perpetuating, Protective, Impact) augmented with cognitive error detection. PsyProbe combines State Builder for extracting structured psychological profiles, Memory Construction for tracking information gaps, Strategy Planner for Motivational Interviewing behavioral codes, and Response Generator with Question Ideation and Critic/Revision modules to generate contextually appropriate, proactive questions. We evaluate PsyProbe with 27 participants in real-world Korean counseling scenarios, including automatic evaluation across ablation modes, user evaluation, and expert evaluation by a certified counselor. The full PsyProbe model consistently outperforms baseline and ablation modes in automatic evaluation. User evaluation demonstrates significantly increased engagement intention and improved naturalness compared to baseline. Expert evaluation shows that PsyProbe substantially improves core issue understanding and achieves question rates comparable to professional counselors, validating the effectiveness of systematic state modeling and proactive questioning for therapeutic exploration.
Learning from Child-directed Speech in Two-language Scenarios: A French-English Case-Study
Liel Binyamin | Elior Sulem
Liel Binyamin | Elior Sulem
Research on developmentally plausible language models has so far centered on English, leaving open questions about multilingual settings. We present a systematic study of compact models by extending BabyBERTa to English–French scenarios under strictly size-matched data conditions, addressing monolingual, bilingual, and cross-lingual settings. Our design contrasts two corpus types: (i) child-directed speech (≈2.5M tokens), following BabyBERTa and related work, and (ii) multi-domain corpora (≈10M tokens), extending the BabyLM framework to French. To support fair evaluation, we also introduce new resources: French versions of QAMR and QASRL, and an English and French multi-domain corpus. We evaluate the models on both syntactic and semantic tasks, comparing with Wikipedia-only training. Results reveal context-dependent effects: training on Wikipedia consistently favors semantic tasks, while child-directed speech improves grammatical judgments in monolingual settings. Bilingual pretraining yields notable gains for textual entailment, disproportionately benefiting French. Importantly, the same relative patterns are observed across BabyBERTa, RoBERTa, and LTG-BERT, indicating consistent trends across the tested architectures.
DeVisE: Towards the Behavioral Testing of Medical Large Language Models
Camila Zurdo Tagliabue | Heloisa Oss Boll | Aykut Erdem | Erkut Erdem | Iacer Calixto
Camila Zurdo Tagliabue | Heloisa Oss Boll | Aykut Erdem | Erkut Erdem | Iacer Calixto
Large language models (LLMs) are increasingly applied in clinical decision support, yet current evaluations rarely reveal whether their outputs reflect genuine medical reasoning or superficial correlations. We introduce DeVisE (Demographics and Vital signs Evaluation), a behavioral testing framework that probes fine-grained clinical understanding through controlled counterfactuals. Using intensive care unit (ICU) discharge notes from MIMIC-IV, we construct both raw (real-world) and template-based (synthetic) variants with single-variable perturbations in demographic (age, gender, ethnicity) and vital sign attributes. We evaluate eight LLMs, spanning general-purpose and medical variants, in a zero-shot setting. Model behavior is analyzed through (1) input-level sensitivity, capturing how counterfactuals alter perplexity, and (2) downstream reasoning, measuring their effect on predicted ICU length-of-stay and mortality. Overall, our results show that standard task metrics obscure clinically relevant differences in model behavior, with models differing substantially in how consistently and proportionally they adjust predictions to counterfactual perturbations.
Sequence Repetition Enhances Token Embeddings and Improves Sequence Labeling with Decoder-only Language Models
Matija Luka Kukić | Marko Čuljak | David Dukić | Martin Tutek | Jan Šnajder
Matija Luka Kukić | Marko Čuljak | David Dukić | Martin Tutek | Jan Šnajder
Modern language models (LMs) are trained in an autoregressive manner, conditioned only on the prefix. In contrast, sequence labeling (SL) tasks assign labels to each individual input token, naturally benefiting from bidirectional context. This discrepancy has historically led SL to rely on inherently bidirectional encoder-only models. However, the rapid development of decoder-only models has raised the question of whether they can be adapted to SL. While causal mask removal has emerged as a viable technique for adapting decoder-only models to leverage the full context for SL, it requires considerable changes to the base model functionality. In this work, we explore sequence repetition (SR) as a less invasive alternative for enabling bidirectionality in decoder-only models. Through fine-tuning experiments, we show that SR inherently makes decoders bidirectional, improving the quality of token-level embeddings and surpassing encoders and unmasked decoders. Contrary to earlier claims, we find that increasing the number of repetitions does not degrade SL performance. Finally, we demonstrate that embeddings from intermediate layers are highly effective for SR, comparable to those from final layers, while being significantly more efficient to compute. Our findings underscore that SR alleviates the structural limitations of decoders, enabling more efficient and adaptable LMs and broadening their applicability to other token-level tasks.
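The sequence repetition idea above has a very small core, sketched here under an assumed decoder interface (this is not the authors' code): the input is concatenated with a copy of itself, and token states are read from the last copy, whose causal attention already covers every token of the first copy.

```python
# Minimal sketch: sequence repetition for sequence labeling with a causal decoder.
import torch

def repeated_token_states(decoder, input_ids: torch.Tensor, num_repeats: int = 2):
    """`decoder(ids)` is assumed to return hidden states of shape (batch, seq, dim)."""
    seq_len = input_ids.size(1)
    repeated = input_ids.repeat(1, num_repeats)        # (batch, num_repeats * seq_len)
    hidden = decoder(repeated)                         # causal LM forward pass
    # positions of the last copy: each token there has seen the whole sequence
    return hidden[:, (num_repeats - 1) * seq_len:, :]  # (batch, seq_len, dim)

# A sequence-labeling head would then be applied per token:
# logits = classifier(repeated_token_states(decoder, input_ids))
```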
MEENA (PersianMMMU): Multimodal-Multilingual Educational Exams for N-level Assessment
Omid Ghahroodi | Arshia Hemmat | Marzia Nouri | Seyed Mohammad Hadi Hosseini | Doratossadat Dastgheib | Mohammad Vali Sanian | Alireza Sahebi | Reihaneh Zohrabi | Mohammad Hossein Rohban | Ehsaneddin Asgari | Mahdieh Soleymani Baghshah
Omid Ghahroodi | Arshia Hemmat | Marzia Nouri | Seyed Mohammad Hadi Hosseini | Doratossadat Dastgheib | Mohammad Vali Sanian | Alireza Sahebi | Reihaneh Zohrabi | Mohammad Hossein Rohban | Ehsaneddin Asgari | Mahdieh Soleymani Baghshah
Recent advancements in large vision-language models (VLMs) have primarily focused on English, with limited attention given to other languages. To address this gap, we introduce MEENA (also known as PersianMMMU), the first dataset designed to evaluate Persian VLMs across scientific, reasoning, and human-level understanding tasks. Our dataset comprises approximately 7,500 Persian and 3,000 English questions, covering a wide range of topics such as reasoning, mathematics, physics, diagrams, charts, and Persian art and literature. Key features of MEENA include: (1) diverse subject coverage spanning various educational levels, from primary to upper secondary school, (2) rich metadata, including difficulty levels and descriptive answers, (3) original Persian data that preserves cultural nuances, (4) a bilingual structure to assess cross-linguistic performance, and (5) a series of diverse experiments assessing various capabilities, including overall performance, the model’s ability to attend to images, and its tendency to generate hallucinations. We hope this benchmark contributes to enhancing VLM capabilities beyond English.
Teaching Old Tokenizers New Words: Efficient Tokenizer Adaptation for Pretrained Models
Taido Purason | Pavel Chizhov | Ivan P. Yamshchikov | Mark Fishel
Taido Purason | Pavel Chizhov | Ivan P. Yamshchikov | Mark Fishel
Tokenizer adaptation plays an important role in adapting pre-trained language models to new domains or languages. In this work, we address two complementary aspects of this process: vocabulary extension and pruning. The common approach to extension trains a new tokenizer on domain-specific text and appends the tokens that do not overlap with the existing vocabulary, which often results in many tokens that are unreachable or never used. We propose continued BPE training that extends a pre-trained tokenizer by continuing the BPE merge learning process on new data. Experiments across multiple languages and model families show that this approach improves tokenization efficiency and leads to better utilization of added vocabulary. We also introduce leaf-based vocabulary pruning, which removes redundant tokens while preserving model quality. Together, these methods provide practical tools for controlled vocabulary modification, which we release as an open-source toolkit.
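The continued BPE training described above can be illustrated with a toy, whitespace-level sketch (not the released toolkit; word frequencies and byte-level details are ignored): the pre-trained merges are replayed on the new-domain text first, and merge learning then continues from the resulting pair statistics instead of training a separate tokenizer and appending non-overlapping tokens.

```python
# Minimal toy sketch of continued BPE merge learning on new-domain words.
from collections import Counter

def apply_merge(seq, pair):
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1]); i += 2
        else:
            out.append(seq[i]); i += 1
    return out

def continued_bpe(words, existing_merges, num_new_merges):
    seqs = [list(w) for w in words]
    for pair in existing_merges:                     # replay the pre-trained merges
        seqs = [apply_merge(s, pair) for s in seqs]
    new_merges = []
    for _ in range(num_new_merges):                  # continue learning on new data
        pairs = Counter(p for s in seqs for p in zip(s, s[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        new_merges.append(best)
        seqs = [apply_merge(s, best) for s in seqs]
    return new_merges

print(continued_bpe(["tokenizer", "tokens", "token"],
                    existing_merges=[("t", "o")], num_new_merges=3))
```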
AGIC: Attention-Guided Image Captioning to Improve Caption Relevance
L D M S Sai Teja | Ashok Urlana | Pruthwik Mishra
L D M S Sai Teja | Ashok Urlana | Pruthwik Mishra
Despite significant progress in image captioning, generating accurate and descriptive captions remains a long-standing challenge. In this study, we propose Attention-Guided Image Captioning (AGIC), which amplifies salient visual regions directly in the feature space to guide caption generation. We further introduce a hybrid decoding strategy that combines deterministic and probabilistic sampling to balance fluency and diversity. To evaluate AGIC, we conduct extensive experiments on the Flickr8k, Flickr30k and MSCOCO datasets. The results show that AGIC matches or surpasses several state-of-the-art models while achieving faster inference. Moreover, AGIC demonstrates strong performance across multiple evaluation metrics, offering a scalable and interpretable solution for image captioning.
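The feature-space amplification mentioned above admits a very small sketch (the scaling rule is an assumption, and the hybrid deterministic/probabilistic decoding step is not shown): visual features are re-weighted by attention-derived saliency before caption decoding.

```python
# Minimal sketch: scale region features by attention-derived saliency so salient
# regions weigh more during caption generation.
import torch

def amplify_salient_features(features: torch.Tensor, attention: torch.Tensor, alpha: float = 1.0):
    """features: (regions, dim); attention: (regions,) non-negative saliency weights."""
    saliency = attention / attention.sum()
    return features * (1.0 + alpha * saliency).unsqueeze(-1)

features = torch.randn(4, 16)
attention = torch.tensor([0.1, 0.6, 0.2, 0.1])
print(amplify_salient_features(features, attention).shape)  # torch.Size([4, 16])
```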
Visual–Linguistic Abductive Reasoning with LLMs for Knowledge-based Visual Question Answering
Jieun Kim | Yujin Jeong | Sung-Bae Cho
Jieun Kim | Yujin Jeong | Sung-Bae Cho
Recent attempts to leverage large language models (LLMs) for reasoning and pre-trained knowledge in multi-modal reasoning focus on two main approaches: aligning image features with linguistic space, and converting images into textual cues to exploit the implicit reasoning capabilities of LLMs. Although they integrate visual information into the reasoning pipeline, they often treat visual perception and language reasoning as separate processes, limiting the potential for fully unified multi-modal reasoning. In this paper, we propose a novel method, Visual–Linguistic Abductive Reasoning (ViLA), inspired by human abductive reasoning processes. ViLA hypothesizes a plausible answer, generates the corresponding visual and textual premises, and employs fuzzy scoring to select the most coherent combination, thus deriving the final inference. This process integrates visual and linguistic modalities into interpretable abductive reasoning chains, enabling unified multi-modal reasoning. Without fine-tuning LLMs or retrieving external knowledge, ViLA improves performance by 2.31% on AOKVQA, 1.7% on OKVQA, and 1.7% on GQA over previous state-of-the-art models, while also improving interpretability and stability.
FactAppeal: Identifying Epistemic Factual Appeals in News Media
Guy Mor-Lan | Tamir Sheafer | Shaul R. Shenhav
Guy Mor-Lan | Tamir Sheafer | Shaul R. Shenhav
How is a factual claim made credible? We propose the novel task of Epistemic Appeal Identification, which identifies whether and how factual statements have been anchored by external sources or evidence. To advance research on this task, we present FactAppeal, a manually annotated dataset of 3,226 English-language news sentences. Unlike prior resources that focus solely on claim detection and verification, FactAppeal captures the nuanced epistemic structures and the evidentiary basis used to support such claims. FactAppeal contains span-level annotations which identify factual statements and mentions of sources on which they rely. Moreover, the annotations include fine-grained characteristics of factual appeals such as the type of source (e.g. Active Participant, Witness, Expert, Direct Evidence), whether it is mentioned by name, mentions of the source’s role and epistemic credentials, attribution to the source via direct or indirect quotation, and other features. We model the task with both encoder models and generative decoder models of 2B-9B parameters. Our best performing model, based on Gemma 2 9B, achieves a macro-F1 score of 0.73.
Automatic Speech Recognition (ASR) performance is heavily dependent on the availability of large-scale, high-quality datasets. For low-resource languages, existing open-source ASR datasets often suffer from insufficient quality and inconsistent annotation, hindering the development of robust models. To address these challenges, we propose a novel and generalizable data aggregation and preprocessing pipeline designed to construct high-quality ASR datasets from diverse, potentially noisy, open-source sources. Our pipeline incorporates rigorous processing steps to ensure data diversity, balance, and the inclusion of crucial features like word-level timestamps. We demonstrate the effectiveness of our methodology by applying it to Vietnamese, resulting in a unified, high-quality 500-hour dataset that provides a foundation for training and evaluating state-of-the-art Vietnamese ASR systems. Our project page is available at https://github.com/qualcomm-ai-research/PhoASR.
MapCoder-Lite: Distilling Multi-Agent Coding into a Single Small LLM
Woongkyu Lee | Junhee Cho | Jungwook Choi
Woongkyu Lee | Junhee Cho | Jungwook Choi
Large language models (LLMs) have advanced code generation from single-function tasks to competitive-programming problems, but existing multi-agent solutions either rely on costly large-scale (>30B) models or collapse when downsized to small open-source models. We present MapCoder-Lite, a framework for distilling the complex reasoning of large, multi-agent coding systems into a single 7B model. Our contribution is a novel, three-pillar methodology that synergistically generates, refines, and encodes multi-agent knowledge: (i) pass-based trajectory distillation from strong LLMs fixes format fragility in retrieval and reduces failures in debugging, (ii) supervisor-guided correction with global feedback strengthens planning and coding agents, and (iii) agent-wise LoRA fine-tuning delivers memory-efficient specialisation. Comprehensive evaluation on xCodeEval, APPS, and CodeContests shows that MapCoder-Lite more than doubles xCodeEval accuracy (13.2% → 28.3%), eliminates all format failures, and reduces GPU memory and token-generation time by 4× compared to a 32B model. It also achieves over 10% gains on simpler coding benchmarks, demonstrating broad improvements beyond competitive programming. These results demonstrate that careful agent-wise fine-tuning unleashes high-quality multi-agent coding on a small language model. Our code is publicly available at https://github.com/aiha-lab/MapCoder-Lite.
When Do Language Models Endorse Limitations on Human Rights Principles?
Keenan Samway | Miu Nicole Takagi | Rada Mihalcea | Bernhard Schölkopf | Ilias Chalkidis | Daniel Hershcovich | Zhijing Jin
Keenan Samway | Miu Nicole Takagi | Rada Mihalcea | Bernhard Schölkopf | Ilias Chalkidis | Daniel Hershcovich | Zhijing Jin
As Large Language Models (LLMs) increasingly mediate global information access with the potential to shape public discourse, their alignment with universal human rights principles becomes important to ensure that these rights are upheld in high-stakes AI-mediated interactions. In this paper, we evaluate how LLMs navigate trade-offs involving the Universal Declaration of Human Rights (UDHR), leveraging 1,152 synthetically generated scenarios across 24 rights articles and eight languages. Our analysis of eleven major LLMs reveals systematic biases where models: (1) accept limiting Economic, Social, and Cultural rights more often than Political and Civil rights, (2) demonstrate significant cross-linguistic variation with elevated endorsement rates of rights-limiting actions in Chinese and Hindi compared to English or Romanian, (3) show substantial susceptibility to prompt-based steering, and (4) exhibit noticeable differences between Likert and open-ended responses, highlighting critical challenges in LLM preference assessment.
Abstractive Summarization of Bengali Academic Videos Based on Audio Subtitles
Lamisa Bintee Mizan Deya | Farhatun Shama | Abdul Aziz | Md Kaykobad Reza | Md Shahidul Salim
Lamisa Bintee Mizan Deya | Farhatun Shama | Abdul Aziz | Md Kaykobad Reza | Md Shahidul Salim
Active Learning with Non-Uniform Costs for African Natural Language Processing
Bonaventure F. P. Dossou | Ines Arous | Audrey Durand | Jackie Chi Kit Cheung
Bonaventure F. P. Dossou | Ines Arous | Audrey Durand | Jackie Chi Kit Cheung
Labeling datasets for African languages poses substantial challenges due to the diverse settings in which annotations are collected, leading to highly variable labeling costs. These costs vary with task complexity, annotator expertise, and data availability. Yet, most active learning (AL) frameworks assume uniform annotation costs, limiting their applicability in real-world, resource-constrained scenarios. To address this, we introduce KnapsackBALD, a novel cost-aware active learning method that integrates the BatchBALD acquisition strategy with a 0-1 Knapsack optimization objective to select informative and budget-efficient samples. We evaluate KnapsackBALD on the MasakhaNEWS dataset, a multilingual news classification benchmark covering 11 African languages. Our method consistently outperforms seven strong active learning baselines, including BALD, BatchBALD, and stochastic sampling variants such as PowerBALD and Softmax-BALD, across all three cost scenarios. The performance gap widens as annotation cost imbalances become more extreme, demonstrating the robustness of KnapsackBALD in different cost settings. These findings show that when annotation costs are explicitly heterogeneous, cost-sensitive acquisition is critical for effective active learning, as demonstrated in African Languages NLP and similar settings. Our code base is open-sourced here.
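The 0-1 knapsack objective mentioned in this abstract has a standard dynamic-programming form, sketched below on toy values (acquisition scores and costs are placeholders; integer costs are assumed, and this is not the authors' implementation of KnapsackBALD).

```python
# Minimal sketch: pick a batch of unlabeled examples maximizing total acquisition
# score (e.g., BALD mutual information) subject to an annotation-cost budget.
def knapsack_select(scores, costs, budget):
    n = len(scores)
    # best[c] = (best total score, chosen indices) achievable with total cost <= c
    best = [(0.0, [])] * (budget + 1)
    for i in range(n):
        new_best = list(best)
        for c in range(costs[i], budget + 1):
            cand_score = best[c - costs[i]][0] + scores[i]
            if cand_score > new_best[c][0]:
                new_best[c] = (cand_score, best[c - costs[i]][1] + [i])
        best = new_best
    return max(best, key=lambda x: x[0])

# Three candidates with heterogeneous annotation costs and a budget of 5 units.
total, chosen = knapsack_select(scores=[0.9, 0.6, 0.5], costs=[4, 2, 3], budget=5)
print(total, chosen)   # selects items 1 and 2 (total score 1.1) under the budget
```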
CrisiText: A dataset of warning messages for LLM training in emergency communication
Giacomo Gonella | Gian Maria Campedelli | Stefano Menini | Marco Guerini
Giacomo Gonella | Gian Maria Campedelli | Stefano Menini | Marco Guerini
Effectively identifying threats and mitigating their potential damage during crisis situations, such as natural disasters or violent attacks, is paramount for safeguarding endangered individuals. To tackle these challenges, AI has been used to assist humans in emergency situations. Still, the use of NLP techniques remains limited and mostly focuses on classification tasks. The significant potential of timely warning message generation using NLG architectures, however, has been largely overlooked. In this paper, we present *CrisiText*, the first large-scale dataset for the generation of warning messages across 13 different types of crisis scenarios. The dataset contains more than 400,000 warning messages (spanning almost 18,000 crisis situations) aimed at assisting civilians during and after such events. To generate the dataset, we started from existing crisis descriptions and created chains of events related to the scenarios. Each event was then paired with a warning message. The generated messages follow expert-written guidelines to ensure correct terminology and factual suggestions. Additionally, each message is accompanied by three suboptimal variants to allow for the study of different NLG approaches. To this end, we conducted a series of experiments comparing supervised fine-tuning setups with preference alignment, zero-shot, and few-shot approaches. We further assessed model performance in out-of-distribution scenarios and evaluated the effectiveness of an automatic post-editor.
Training-Free Text Emotion Tagging via LLM-Based Best-Worst Scaling
Lukas Christ | Shahin Amiriparian
Lukas Christ | Shahin Amiriparian
Large Language Models (LLMs) have been frequently used as automatic annotators for tasks such as Text Emotion Recognition (TER). We consider a scenario in which annotators assign at least one emotion label from a large set of options to a text snippet. For this emotion tagging task, we propose a novel zero-shot algorithm that leverages Best-Worst Scaling (BWS), prompting the LLM to choose the least and most suitable emotions for a given text from several label subsets. The LLM’s choices can be represented by a graph linking labels via worse-than relations. Random walks on this graph yield the final score for each label. We compare our algorithm with naive prompting approaches as well as an established BWS-based method. Extensive experiments demonstrate the suitability of the method. It proves to compare favorably to the benchmarks in terms of both accuracy and calibration with respect to human annotations. Moreover, our algorithm’s automatic annotations are shown to be suitable for finetuning lightweight emotion classification models. The proposed method consumes considerably fewer computational resources than the established BWS approach.
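The graph-plus-random-walk scoring described above can be sketched as follows (the damping factor, iteration count, and uniform handling of labels without outgoing edges are assumptions for illustration, not the paper's exact procedure): the LLM's best/worst picks over label subsets induce worse-than edges, and a damped power iteration over that graph yields one score per emotion label.

```python
# Minimal sketch: score emotion labels by a damped random walk over a directed
# graph whose edges point from worse to better labels.
import numpy as np

def random_walk_scores(labels, worse_than_edges, damping=0.85, iters=100):
    """`worse_than_edges` is a list of (worse_label, better_label) pairs."""
    idx = {lab: i for i, lab in enumerate(labels)}
    n = len(labels)
    transition = np.zeros((n, n))
    for worse, better in worse_than_edges:
        transition[idx[worse], idx[better]] += 1.0
    row_sums = transition.sum(axis=1, keepdims=True)
    # rows with no outgoing edge fall back to a uniform step
    transition = np.where(row_sums > 0,
                          transition / np.where(row_sums == 0, 1, row_sums),
                          1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = (1 - damping) / n + damping * scores @ transition
    return dict(zip(labels, scores))

edges = [("anger", "joy"), ("sadness", "joy"), ("anger", "sadness")]
print(random_walk_scores(["joy", "anger", "sadness"], edges))  # "joy" scores highest
```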
Scaling Cultural Resources for Improving Generative Models
Hayk Stepanyan | Aishwarya Verma | Andrew Zaldivar | Rutledge Chin Feman | Erin MacMurray van Liemt | Charu Kalia | Vinodkumar Prabhakaran | Sunipa Dev
Hayk Stepanyan | Aishwarya Verma | Andrew Zaldivar | Rutledge Chin Feman | Erin MacMurray van Liemt | Charu Kalia | Vinodkumar Prabhakaran | Sunipa Dev
Generative models are known to have reduced performance in different global cultural contexts and languages. While continual data updates are commonly conducted to improve overall model performance, bolstering and evaluating the cross-cultural competence of generative AI models requires data resources to be intentionally expanded to include global contexts and languages. In this work, we construct a multi-pronged pipeline to collect and contribute culturally salient, multilingual data. We posit that such data can assess the state of the global applicability of our models and thus, in turn, help identify and improve upon cross-cultural gaps.
Cards Against Contamination: TCG-Bench for Difficulty-Scalable Multilingual LLM Reasoning
Sultan AlRashed | Jianghui Wang | Francesco Orabona
Sultan AlRashed | Jianghui Wang | Francesco Orabona
Benchmarks for language models have become essential tools for research. Yet, such benchmarks face a persistent contamination problem, with recent studies finding 25-50% of evaluation datasets appearing in training corpora. This holds even in the two-player zero-sum game setting, where most benchmarks are based on popular games, such as chess, whose optimal strategies are widely documented on the web. Such contamination makes it difficult to distinguish memorization from reasoning skills. To rectify these problems, we introduce TCG-Bench, a benchmark based on a new two-player trading card game (TCG), similar in spirit to games like Magic: The Gathering. TCG-Bench offers three key innovations: (1) a contamination-resistant design by separating the publicly released game engine from hidden card implementations, (2) a continuous difficulty spectrum via Monte Carlo simulation that prevents benchmark saturation, and (3) a parallel implementation in English and Arabic, the first multilingual text-based game benchmark to do so. We also formalize a practical threat model and refresh protocol that preserves evaluation integrity even if specific cards leak. Our analysis across 17 models (50,000+ games) reveals that performance declines exponentially with difficulty, while model size correlates only weakly with strategic ability. We also observe cross-linguistic performance gaps between English and Arabic, with a gap of 47.4% at 32B, highlighting the need for multilingual game benchmarks that target reasoning capabilities in the target language. We host a leaderboard showcasing these results and welcome evaluation requests on our private cards.
ILSIC: Corpora for Identifying Indian Legal Statutes from Queries by Laymen
Shounak Paul | Raghav Dogra | Pawan Goyal | Saptarshi Ghosh
Shounak Paul | Raghav Dogra | Pawan Goyal | Saptarshi Ghosh
Legal Statute Identification (LSI) for a given situation is one of the most fundamental tasks in Legal NLP. This task has traditionally been modeled using facts from court judgments as input queries, due to their abundance. However, in practical settings, the input queries are likely to be informal and asked by laypersons, or non-professionals. While a few laypeople LSI datasets exist, there has been little research exploring the differences between court and laypeople data for LSI. In this work, we create ILSIC, a corpus of laypeople queries covering 500+ statutes from Indian law. The corpus also contains court case judgments to enable researchers to effectively compare between court and laypeople data for LSI. We conducted extensive experiments on our corpus, including benchmarking on the laypeople dataset using zero- and few-shot inference, retrieval-augmented generation, and supervised fine-tuning. We observe that models trained purely on court judgments are ineffective when tested on laypeople queries, while transfer learning from court to laypeople data can be beneficial in certain scenarios. We also conducted fine-grained analyses of our results in terms of categories of queries and frequency of statutes.