Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, Dhirendra Pratap Singh (Editors)


Anthology ID:
2025.ijcnlp-long
Month:
December
Year:
2025
Address:
Mumbai, India
Venues:
IJCNLP | AACL
SIG:
Publisher:
The Asian Federation of Natural Language Processing and The Association for Computational Linguistics
URL:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-long/
DOI:
ISBN:
979-8-89176-298-5
Bib Export formats:
BibTeX
PDF:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-long.pdf

Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Kentaro Inui | Sakriani Sakti | Haofen Wang | Derek F. Wong | Pushpak Bhattacharyya | Biplab Banerjee | Asif Ekbal | Tanmoy Chakraborty | Dhirendra Pratap Singh

HEARTS: A Holistic Framework for Explainable, Sustainable and Robust Text Stereotype Detection
Theo King | Zekun Wu | Adriano Koshiyama | Emre Kazim | Philip Colin Treleaven

A stereotype is a generalised claim about a social group. Such claims change with culture and context and are often phrased in everyday language, which makes them hard to detect: state-of-the-art Large Language Models (LLMs) reach only 68% macro-F1 on the yes/no task “does this sentence contain a stereotype?”. We present HEARTS, a Holistic framework for Explainable, sustAinable and Robust Text Stereotype detection that brings together NLP and social science. The framework is built on the Expanded Multi-Grain Stereotype Dataset (EMGSD), 57,201 English sentences that cover gender, profession, nationality, race, religion and LGBTQ+ topics, adding 10% more data for under-represented groups while keeping high annotator agreement (𝜅 = 0.82). Fine-tuning the lightweight ALBERT-v2 model on EMGSD raises binary detection scores to 81.5% macro-F1, matching full BERT while producing 200× less CO2. For explainability, we blend SHAP and LIME token-level scores and introduce a confidence measure that increases when the model is correct (𝜌 = 0.18). We then use HEARTS to assess 16 SOTA LLMs on 1,050 neutral prompts each for stereotype propagation: stereotype rates fall by 23% between model generations, yet clear differences remain across model families (LLaMA > Gemini > GPT > Claude). HEARTS thus supplies a practical, low-carbon and interpretable toolkit for measuring stereotype bias in language.

Roles of MLLMs in Visually Rich Document Retrieval for RAG: A Survey
Xiantao Zhang

Visually rich documents (VRDs) challenge retrieval-augmented generation (RAG) with layout-dependent semantics, brittle OCR, and evidence spread across complex figures and structured tables. This survey examines how Multimodal Large Language Models (MLLMs) are being used to make VRD retrieval practical for RAG. We organize the literature into three roles: *Modality-Unifying Captioners*, *Multimodal Embedders*, and *End-to-End Representers*. We compare these roles along retrieval granularity, information fidelity, latency and index size, and compatibility with reranking and grounding. We also outline key trade-offs and offer some practical guidance on when to favor each role. Finally, we identify promising directions for future research, including adaptive retrieval units, model size reduction, and the development of evaluation methods.

With Privacy, Size Matters: On the Importance of Dataset Size in Differentially Private Text Rewriting
Stephen Meisenbacher | Florian Matthes

Recent work in Differential Privacy with Natural Language Processing (DP NLP) has proposed numerous promising techniques in the form of *text rewriting* mechanisms. In the evaluation of these mechanisms, an often-ignored aspect is that of *dataset size*, or rather, the effect of dataset size on a mechanism’s efficacy for utility and privacy preservation. In this work, we are the first to introduce this factor in the evaluation of DP text privatization, where we design utility and privacy tests on large-scale datasets with dynamic split sizes. We run these tests on datasets of varying size with up to one million texts, and we focus on quantifying the effect of increasing dataset size on the privacy-utility trade-off. Our findings reveal that dataset size plays an integral part in evaluating DP text rewriting mechanisms; additionally, these findings call for more rigorous evaluation procedures in DP NLP, as well as shed light on the future of DP NLP in practice and at scale.

From Anger to Joy: How Nationality Personas Shape Emotion Attribution in Large Language Models
Mahammed Kamruzzaman | Abdullah Al Monsur | Gene Louis Kim | Anshuman Chhabra

Emotions are a fundamental facet of human experience, varying across individuals, cultural contexts, and nationalities. Given the recent success of Large Language Models (LLMs) as role-playing agents, we examine whether LLMs exhibit emotional stereotypes when assigned nationality-specific personas. Specifically, we investigate how different countries are represented in pre-trained LLMs through emotion attributions and whether these attributions align with cultural norms. To provide a deeper interpretive lens, we incorporate four key cultural dimensions, namely Power Distance, Uncertainty Avoidance, Long-Term Orientation, and Individualism, derived from Hofstede’s cross-cultural framework. Our analysis reveals significant nationality-based differences, with emotions such as shame, fear, and joy being disproportionately assigned across regions. Furthermore, we observe notable misalignment between LLM-generated and human emotional responses, particularly for negative emotions, highlighting the presence of reductive and potentially biased stereotypes in LLM outputs.

REGULAR: A Framework for Relation-Guided Multi-Span Question Generation
Jiayi Lin | Chenyang Zhang | Bingxuan Hou | Dongyu Zhang | Qingqing Hong | Junli Wang

To alleviate the high cost of manually annotating Question Answering (QA) datasets, Question Generation (QG) requires the model to generate a question related to the given answer and passage. This work primarily focuses on Multi-Span Question Generation (MSQG), where the generated question corresponds to multiple candidate answers. Existing QG methods may not suit MSQG as they typically overlook the correlation between the candidate answers and generate trivial questions, which limits the quality of the synthetic datasets. Based on the observation that relevant entities typically share the same relationship with the same entity, we propose REGULAR, a framework of RElation-GUided MuLti-SpAn Question GeneRation. REGULAR first converts passages into relation graphs and extracts candidate answers from the relation graphs. Then, REGULAR utilizes a QG model to generate a set of candidate questions and a QA model to obtain the best question. We construct over 100,000 questions using Wikipedia corpora, named REGULAR-WIKI, and conduct experiments to compare our synthetic datasets with other synthetic QA datasets. The experiment results show that models trained with REGULAR-WIKI achieve the best performance. We also conduct ablation studies and statistical analysis to verify the quality of our synthetic dataset. Our code and data are available at https://github.com/PluseLin/REGULAR.

Feature Decomposition-Augmentation Network for Multimodal Sentiment Analysis
Dapeng Yin | Bingxuan Hou | Mengna Gao | Shuyue Zhu | Junli Wang

Multimodal sentiment analysis identifies human emotional tendencies by analyzing text, visual, and auditory modalities. In most studies, the textual modality is usually considered to contain the most emotional information and is regarded as the dominant modality. Existing methods mostly map auxiliary modalities into a semantic space close to the dominant modality, which overly relies on the dominant modality. In this work, we propose a Feature Decomposition-Augmentation (FeaDA) framework, which aims to elevate the role of auxiliary modalities in multimodal data fusion. We first design a projector to decompose auxiliary modalities into partial features, which contain features for emotion judgment, and then utilize these decomposed features to guide the fusion process with KL loss, thereby enhancing the status of auxiliary modality fusion. To verify the effectiveness of our method, we conducted experiments on the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets. The experimental results show that our FeaDA framework outperforms multimodal sentiment analysis methods of the same type in main metrics. Our code is available at https://github.com/PowerLittleYin/FeaDA-main.

CSPLADE: Learned Sparse Retrieval with Causal Language Models
Zhichao Xu | Aosong Feng | Yijun Tian | Haibo Ding | Lin Lee Cheong

In recent years, dense retrieval has been the focus of information retrieval (IR) research. While effective, dense retrieval produces uninterpretable dense vectors, and suffers from the drawback of large index size. Learned sparse retrieval (LSR) has emerged as a promising alternative, achieving competitive retrieval performance while also being able to leverage the classical inverted index data structure for efficient retrieval. However, few works have explored scaling LSR beyond BERT scale. In this work, we identify two challenges in training large language models (LLM) for LSR: (1) training instability during the early stage of contrastive training; (2) suboptimal performance due to the pre-trained LLM’s unidirectional attention. To address these challenges, we propose two corresponding techniques: (1) a lightweight adaptation training phase to eliminate training instability; (2) two model variants to enable bidirectional information flow. With these techniques, we are able to train LSR models with an 8B-scale LLM, and achieve competitive retrieval performance with reduced index size. Furthermore, we are among the first to analyze the performance-efficiency tradeoff of LLM-based LSR models through the lens of model quantization. Our findings provide insights into adapting LLMs for efficient retrieval modeling.

Bias Amplification: Large Language Models as Increasingly Biased Media
Ze Wang | Zekun Wu | Yichi Zhang | Xin Guan | Navya Jain | Qinyang Lu | Saloni Gupta | Adriano Koshiyama

Model collapse—a phenomenon where models degrade in performance due to indiscriminate use of synthetic data—is well studied. However, its role in bias amplification—the progressive reinforcement of pre-existing social biases in Large Language Models (LLMs)—remains underexplored. In this paper, we formally define the conditions for bias amplification and demonstrate through statistical simulations that bias can intensify even in the absence of sampling errors, the primary driver of model collapse. Empirically, we investigate political bias amplification in GPT-2 using a custom-built benchmark for sentence continuation tasks. Our findings reveal a progressively increasing right-leaning bias. Furthermore, we evaluate three mitigation strategies—Overfitting, Preservation, and Accumulation—and show that bias amplification persists even when model collapse is mitigated. Finally, a mechanistic interpretation identifies distinct sets of neurons responsible for model collapse and bias amplification, suggesting they arise from different underlying mechanisms.

STAR: Self-Automated Back-Querying for Production Data Generation
Kellen Tan Cheng | Anna Lisa Gentile | Chad DeLuca | Guang-Jie Ren

The pervasiveness of large language models (LLMs) in enterprise settings has also brought forth a significant number of risks associated with their usage. Guardrails technologies aim to mitigate this risk by filtering LLMs’ input/output text through various detectors. However, developing and maintaining robust detectors has many challenges, one of which is the difficulty in acquiring production-quality labeled data on real LLM outputs before deployment. In this work, we propose STAR, a simple yet intuitive solution to generate production-like labeled data for LLMs’ guardrails development. STAR is based on two key ideas: (i) using self-automated back-querying to synthetically generate data, paired with (ii) a sparse human-in-the-loop clustering technique to label the data. The aim of self-automated back-querying is to construct a parallel corpus roughly representative of the original dataset and resembling real LLM output. We then infuse existing datasets with our synthetically generated examples to produce robust training data for our detectors. We test our technique on one of the most difficult and nuanced detectors: the identification of health advice in LLM output, and demonstrate improvement versus other solutions. Our detector is able to outperform GPT-4o by up to 3.48%, despite having 400× fewer parameters.

GQSA: Group Quantization and Sparsity for Accelerating Large Language Model Inference
Chao Zeng | Songwei Liu | Shu Yang | Fangmin Chen | Xing Mei | Lean Fu

Model compression has emerged as a mainstream solution to reduce memory usage and computational overhead. This paper proposes GQSA, a novel model compression framework specifically designed for LLMs. Traditional methods typically focus exclusively on either quantization or sparsification, but relying on a single strategy often results in significant performance loss at high compression rates. In contrast, GQSA integrates quantization and sparsification in a tightly coupled manner, leveraging GPU-friendly structured group sparsity and quantization for efficient acceleration. Building upon system-algorithm co-design principles, we propose a two-stage sparse optimization strategy that ensures the performance superiority of the compressed model. On the engine side, we introduce a “task-centric” parallel strategy, which, to the best of our knowledge, is the first application in the domain of sparse computing. Compared to the traditional 2:4 sparse method, GQSA offers a more flexible and adjustable sparsity rate, as well as a higher weight compression rate, and is efficiently compatible with weight-only quantization methods. Experimental results demonstrate that, under the GQSA W4S50% compression setting, the model’s accuracy surpasses that of both 2:4 pruning and W2 quantization. Furthermore, at the inference level, GQSA outperforms W2 by 1.26× and 2:4 pruning by 2.35× in terms of speed.

Explainable Ethical Assessment on Human Behaviors by Generating Conflicting Social Norms
Yuxi Sun | Wei Gao | Hongzhan Lin | Jing Ma | Wenxuan Zhang

Human behaviors are often guided or constrained by social norms, which are defined as shared, commonsense rules. For example, underlying an action such as “report a witnessed crime” are social norms that inform our conduct, such as “It is expected to be brave to report crimes.” Current AI systems that assess the valence (i.e., support or oppose) of human actions by leveraging large-scale data training that is not grounded in explicit norms may be difficult to explain, and thus untrustworthy. Emulating human assessors by considering social norms can help AI models better understand and predict valence. While multiple norms come into play, conflicting norms can create tension and directly influence human behavior. For example, when deciding whether to report a witnessed crime, one may balance bravery against self-protection. In this paper, we introduce ClarityEthic, a novel ethical assessment approach, to enhance valence prediction and explanation by generating conflicting social norms behind human actions, which strengthens the moral reasoning capabilities of language models by using a contrastive learning strategy. Extensive experiments demonstrate that our method outperforms strong baseline approaches, and human evaluations confirm that the generated social norms provide plausible explanations for the assessment of human behaviors.

Enhancing ID and Text Fusion via Alternative Training in Session-based Recommendation
Juanhui Li | Haoyu Han | Zhikai Chen | Harry Shomer | Wei Jin | Amin Javari | Hui Liu

Session-based recommendation systems have attracted growing interest for their ability to provide personalized recommendations based on users’ in-session behaviors. While ID-based methods have shown strong performance, they often struggle with long-tail items and overlook valuable textual information. To incorporate text information, various approaches have been proposed, generally employing a naive fusion framework. Interestingly, this approach often fails to outperform the best single-modality baseline. Further exploration indicates a potential imbalance issue in the naive fusion method, where the ID tends to dominate the training and the text is undertrained. This issue indicates that the naive fusion method might not be as effective in combining ID and text as once believed. To address this, we propose AlterRec, an alternative training framework that separates the optimization of ID and text to avoid the imbalance issue. AlterRec also designs an effective strategy to enhance the interaction between the two modalities, facilitating mutual interaction and more effective text integration. Extensive experiments demonstrate the effectiveness of AlterRec in session-based recommendation.

Chain of Functions: A Programmatic Pipeline for Fine-Grained Chart Reasoning Data Generation
Zijian Li | Jingjing Fu | Lei Song | Jiang Bian | Jun Zhang | Rui Wang

Visual reasoning is crucial for multimodal large language models (MLLMs) to address complex chart queries, yet high-quality rationale data remains scarce. Existing methods leveraged (M)LLMs for data generation, but direct prompting often yields limited precision and diversity. In this paper, we propose Chain of Functions (CoF), a novel programmatic reasoning data generation pipeline that utilizes freely-explored reasoning paths as supervision to ensure data precision and diversity. Specifically, it starts with human-free exploration among the atomic functions (e.g., maximum data and arithmetic operations) to generate diverse function chains, which are then translated into linguistic rationales and questions with only a moderate open-sourced LLM. CoF provides multiple benefits: 1) Precision: function-governed generation reduces hallucinations compared to freeform generation; 2) Diversity: enumerating function chains enables varied question taxonomies; 3) Explainability: function chains serve as built-in rationales, allowing fine-grained evaluation beyond overall accuracy; 4) Practicality: it eliminates reliance on extremely large models. Employing CoF, we construct the ChartCoF dataset, with 1.4k complex reasoning Q&A for fine-grained analysis and 50k Q&A for reasoning enhancement. Experiments show that ChartCoF improves performance for MLLMs on widely used benchmarks, and the fine-grained evaluation on ChartCoF reveals varying performance across question taxonomies and step numbers for each MLLM. Furthermore, the novel paradigm of function-governed rationale generation in CoF could inspire broader applications beyond charts. The code and data are publicly available at https://github.com/microsoft/Chain-of-Functions.

Topology-Aware Gated Graph Neural Network for Social Bot Detection
Pi Jiebin | Yantuan Xian | Yuxin Huang | Yan Xiang | Ran Song | Zhengtao Yu

The rapid growth of social networks has led to a surge in social bots, which often disseminate low-quality content and may manipulate public opinion, posing threats to online security. Although recent GNN-based bot detection methods perform strongly, they still face two major challenges. First, deep GNNs are prone to over-smoothing: neighbor aggregation blends bot and human node representations, obscuring bot-specific features. Second, social graphs are dominated by human–human and human–bot connections, while direct bot–bot links are scarce, making it difficult for effective bot representations to propagate within GNNs. To address these issues, we propose a Topology-Aware Gated Graph Neural Network to detect social bots. The model employs topology-aware data augmentation to synthesize realistic bot nodes that preserve the original graph structure, mitigating class imbalance; it also introduces a hierarchical gating mechanism that restructures node embeddings into a tree format, selectively filtering noise and enhancing discriminative features. Experiments on three standard benchmark datasets show that it consistently surpasses leading baselines in highly imbalanced settings, delivering superior accuracy and robustness.

Minimizing Queries, Maximizing Impact: Adaptive Score-Based Attack and Defense for Sentiment Analysis
Yigit Efe Enhos | Shira Wein | Scott Alfeld

While state-of-the-art large language models find high rates of success on text classification tasks such as sentiment analysis, they still exhibit vulnerabilities to adversarial examples: meticulously crafted perturbations of input data that guide models into making false predictions. These adversarial attacks are of particular concern when the systems in question are user-facing. While many attacks are able to reduce the accuracy of Transformer-based classifiers by a substantial margin, they often require a large amount of computational time and a large number of queries made to the attacked model, which makes the attacks susceptible to detection. In this work, we resolve the limitations of high query counts and necessary computational time by proposing a query-efficient word-level attack that is fast during deployment and does not compromise the attack success rate of state-of-the-art methods. Our attack constructs a dictionary of adversarial word substitutions based on prior data and leverages these substitutions to flip the sentiment classification of the text. Our attack method achieves an average of 27.49 queries—over 30% fewer than the closest competitor—while maintaining a 99.70% attack success rate. We also develop an effective defense strategy inspired by our attack approach.

Generating Text from Uniform Meaning Representation
Emma Markle | Reihaneh Iranmanesh | Shira Wein

Uniform Meaning Representation (UMR) is a recently developed graph-based semantic representation, which expands on Abstract Meaning Representation (AMR) in a number of ways, in particular through the inclusion of document-level information and multilingual flexibility. In order to effectively adopt and leverage UMR for downstream tasks, efforts must be placed toward developing a UMR technological ecosystem. Though only a small number of UMR annotations have been produced to date, in this work, we investigate the first approaches to producing text from multilingual UMR graphs. Exploiting the structural similarity between UMR and AMR graphs and the wide availability of AMR technologies, we introduce (1) a baseline approach which passes UMR graphs to AMR-to-text generation models, (2) a pipeline conversion of UMR to AMR, then using AMR-to-text generation models, and (3) a fine-tuning approach for both foundation models and AMR-to-text generation models with UMR data. Our best-performing models achieve multilingual BERTscores of 0.825 for English and 0.882 for Chinese, a promising indication of the effectiveness of fine-tuning approaches for UMR-to-text generation even with limited UMR data.

Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?
Kai Yan | Yufei Xu | Zhengyin Du | Xuesong Yao | Zheyu Wang | Xiaowen Guo | Jiecao Chen

The rapid escalation in the difficulty of LLM benchmarks in recent years, from elementary school-level to frontier problems, seems to bring us close enough to the “last exam” for LLMs to surpass humanity. However, is the LLMs’ remarkable reasoning ability indeed coming from true intelligence by human standards, or are they actually reciting solutions witnessed during training at an Internet level? To study this problem, we propose RoR-Bench, a novel, multi-modal benchmark for detecting LLMs’ recitation behavior when asked simple reasoning problems with subtly shifted conditions, and conduct empirical analysis on our benchmark. Surprisingly, we found that existing cutting-edge LLMs unanimously exhibit extremely severe recitation behavior; by changing one phrase in the condition, top models such as OpenAI-o1 and DeepSeek-R1 can suffer a 60 percent performance loss on elementary school-level arithmetic and reasoning problems. Such findings are a wake-up call to the LLM community that compels us to reevaluate the true intelligence level of cutting-edge LLMs.

Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge
Lin Shi | Chiyu Ma | Wenhua Liang | Xingjian Diao | Weicheng Ma | Soroush Vosoughi

LLM-as-a-Judge has emerged as a promising alternative to human evaluators across various tasks, yet inherent biases—particularly position bias, the tendency to favor solutions based on their position within the prompt—compromise its reliability. This exploratory study evaluates position bias in LLM judges across pairwise and list-wise comparison settings, introducing three metrics: repetition stability, position consistency, and preference fairness. Our experiments, involving 15 LLM judges across MTBench and DevBench with 22 tasks and approximately 40 solution-generating models, result in over 150,000 evaluation instances. We identify Judge-Level, Candidate-Level, and Task-Level factors contributing to bias. The findings confirm that position bias is not due to random chance and varies significantly across judges and tasks. While position bias is weakly influenced by the length of prompt components, it is strongly affected by the quality gap between solutions. Our agreement and disagreement analysis among judges further provides insights into the distribution of judging difficulty across the dataset, and highlights the potential for dataset modifications.

Item-Language Model: Improving Large Language Model for Recommendation via Item-Language Representation Learning
Li Yang | Anushya Subbiah | Hardik Patel | Judith Yue Li | Yanwei Song | Reza Mirghaderi | Vikram Aggarwal | Fuli Feng | Zenglin Xu | Dongfang Liu | Qifan Wang

Large Language Models (LLMs) have recently made significant advancements in tackling complex tasks, such as retrieving hard-to-find information and solving intricate problems. Consequently, various approaches have been proposed to integrate LLMs into recommender systems, primarily by embedding them within existing architectures or training them on recommendation data. However, most existing methods fail to effectively incorporate user-item interaction signals into pretrained LLMs due to the modality gap between interaction data and the LLM’s internal knowledge. To address this challenge, we propose the Item-Language Model (ILM) to enhance LLMs for recommendation. ILM consists of two main components: an item-language representation learning module, in which an ILM encoder is pretrained to generate text-aligned item representations, and an item-language co-training module, in which the ILM encoder is integrated into a pretrained LLM for recommendation tasks. Extensive experiments demonstrate the superior performance of our approach over several state-of-the-art methods, validating the importance of text-aligned item representations in bridging this modality gap. Our ablation studies further reveal the effectiveness of our model design for integrating interaction knowledge into LLMs for recommendation tasks. Our code is available at: https://anonymous.4open.science/r/ILM-7AD4/.

Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models
Zahraa Al Sahili | Ioannis Patras | Matthew Purver

Multilingual vision–language models (VLMs) promise universal image–text retrieval, yet their social biases remain under‐explored. We perform the first systematic audit of four public multilingual CLIP variants—M‐CLIP, NLLB‐CLIP, CAPIVARA‐CLIP, and the debiased SigLIP‐2—covering ten languages that differ in resource availability and morphological gender marking. Using balanced subsets of FairFace and the PATA stereotype suite in a zero‐shot setting, we quantify race and gender bias and measure stereotype amplification. Contrary to the intuition that multilinguality mitigates bias, every model exhibits stronger gender skew than its English‐only baseline. CAPIVARA‐CLIP shows its largest biases precisely in the low‐resource languages it targets, while the shared encoder of NLLB‐CLIP and SigLIP‐2 transfers English gender stereotypes into gender‐neutral languages; loosely coupled encoders largely avoid this leakage. Although SigLIP‐2 reduces agency and communion skews, it inherits—and in caption‐sparse contexts (e.g., Xhosa) amplifies—the English anchor’s crime associations. Highly gendered languages consistently magnify all bias types, yet gender‐neutral languages remain vulnerable whenever cross‐lingual weight sharing imports foreign stereotypes. Aggregated metrics thus mask language‐specific “hot spots,” underscoring the need for fine‐grained, language‐aware bias evaluation in future multilingual VLM research.

A Scalable Pipeline for Estimating Verb Frame Frequencies Using Large Language Models
Adam M. Morgan | Adeen Flinker

We present an automated pipeline for estimating Verb Frame Frequencies (VFFs), the frequency with which a verb appears in particular syntactic frames. VFFs provide a powerful window into syntax in both human and machine language systems, but existing tools for calculating them are limited in scale, accuracy, or accessibility. We use large language models (LLMs) to generate a corpus of sentences containing 476 English verbs. Next, by instructing an LLM to behave like an expert linguist, we had it analyze the syntactic structure of the sentences in this corpus. This pipeline outperforms two widely used syntactic parsers across multiple evaluation datasets. Furthermore, it requires far fewer resources than manual parsing (the gold-standard), thereby enabling rapid, scalable VFF estimation. Using the LLM parser, we produce a new VFF database with broader verb coverage, finer-grained syntactic distinctions, and explicit estimates of the relative frequencies of structural alternates commonly studied in psycholinguistics. The pipeline is easily customizable and extensible to new verbs, syntactic frames, and even other languages. We present this work as a proof of concept for automated frame frequency estimation, and release all code and data to support future research.

Ibom NLP: A Step Toward Inclusive Natural Language Processing for Nigeria’s Minority Languages
Oluwadara Kalejaiye | Luel Hagos Beyene | David Ifeoluwa Adelani | Mmekut-mfon Gabriel Edet | Aniefon Daniel Akpan | Eno-Abasi Urua | Anietie Andy

Nigeria is the most populous country in Africa with a population of more than 200 million people. More than 500 languages are spoken in Nigeria and it is one of the most linguistically diverse countries in the world. Despite this, natural language processing (NLP) research has mostly focused on the following four languages: Hausa, Igbo, Nigerian-Pidgin, and Yoruba (i.e., <1% of the languages spoken in Nigeria). This is in part due to the unavailability of textual data in these languages to train and apply NLP algorithms. In this work, we introduce Ibom—a dataset for machine translation and topic classification in four Coastal Nigerian languages from the Akwa Ibom State region: Anaang, Efik, Ibibio, and Oro. These languages are not represented in Google Translate or in major benchmarks such as Flores-200 or SIB-200. We focus on extending the Flores-200 benchmark to these languages, and further align the translated texts with topic labels based on the SIB-200 classification dataset. Our evaluation shows that current LLMs perform poorly on machine translation for these languages in both zero- and few-shot settings. However, we find that few-shot samples steadily improve topic classification with more shots.

An Analysis of the Impact of Problem Paraphrasing on LLM-Based Mathematical Problem Solving
Yerim Han | Hyein Seo | Hyuk Namgoong | Sangkeun Jung

Recent advances in large language models (LLMs) have significantly improved mathematical problem-solving. Among various techniques, paraphrasing problem statements has emerged as a promising strategy to enhance model understanding and accuracy. We define twelve paraphrasing types grounded in mathematics education theory and analyze their impact on LLM performance across different configurations. To automate selection, we propose a Paraphrase Type Selector that predicts effective paraphrases for each problem. Experiments on MATH-500, SVAMP, and AIME show consistent performance gains from paraphrased problems. On MATH-500 with LLaMA 3.1-8B, combining the original with the best five paraphrased problems improves accuracy by +8.4%, with the selector achieving an additional +1.33% gain.

Synthetic Singers: A Review of Deep-Learning-based Singing Voice Synthesis Approaches
Changhao Pan | Dongyu Yao | Yu Zhang | Wenxiang Guo | Jingyu Lu | Zhiyuan Zhu | Zhou Zhao

Recent advances in singing voice synthesis (SVS) have attracted substantial attention from both academia and industry. With the advent of large language models and novel generative paradigms, producing controllable, high‐fidelity singing voices has become an attainable goal. Yet the field still lacks a comprehensive survey that systematically analyzes deep‐learning‐based singing voice systems and their enabling technologies. To address the aforementioned issue, this survey first categorizes existing systems by task type and then organizes current architectures into two major paradigms: cascaded and end-to-end approaches. Moreover, we provide an in-depth analysis of core technologies, covering singing modeling and control techniques. Finally, we review relevant datasets, annotation tools, and evaluation benchmarks that support training and assessment. In the appendix, we introduce training strategies and provide further discussion of SVS. This survey provides an up-to-date review of the literature on SVS models, which would be a useful reference for both researchers and engineers. Related materials are available at https://github.com/David-Pigeon/SyntheticSingers.

ASAudio: A Survey of Advanced Spatial Audio Research
Zhiyuan Zhu | Yu Zhang | Wenxiang Guo | Changhao Pan | Zhou Zhao

With the rapid development of spatial audio technologies today, applications in AR, VR and other scenarios have garnered extensive attention. Unlike traditional mono sound, spatial audio offers a more realistic and immersive auditory experience. Despite notable progress in the field, there remains a lack of comprehensive surveys that systematically organize and analyze these methods and their underlying technologies. In this paper, we provide a comprehensive overview of spatial audio and systematically review recent literature in the area. To address this, we chronologically outline existing work related to spatial audio and categorize these studies based on input-output representations, as well as generation and understanding tasks, thereby summarizing various research aspects of spatial audio. In addition, we review related datasets, evaluation metrics, and benchmarks, offering insights from both training and evaluation perspectives. Related materials are available at https://github.com/dieKarotte/ASAudio.

MossNet: Mixture of State-Space Experts is a Multi-Head Attention
Shikhar Tuli | James Seale Smith | Haris Jeelani | Chi-Heng Lin | Abhishek Patel | Vasili Ramanishka | Yen-Chang Hsu | Hongxia Jin

Large language models (LLMs) have significantly advanced generative applications in natural language processing (NLP). Recent trends in model architectures revolve around efficient variants of transformers or state-space/gated-recurrent models (SSMs, GRMs). However, prevailing SSM/GRM-based methods often emulate only a single attention head, potentially limiting their expressiveness. In this work, we propose MossNet, a novel mixture-of-state-space-experts architecture that emulates a linear multi-head attention (MHA). MossNet leverages a mixture-of-experts (MoE) implementation not only in channel-mixing multi-layered perceptron (MLP) blocks but also in the time-mixing SSM kernels to realize multiple “attention heads.” Extensive experiments on language modeling and downstream evaluations show that MossNet outperforms both transformer- and SSM-based architectures of similar model size and data budgets. Larger variants of MossNet, trained on trillions of tokens, further confirm its scalability and superior performance. In addition, real-device profiling on a Samsung Galaxy S24 Ultra and an Nvidia A100 GPU demonstrates favorable runtime speed and resource usage compared to similarly sized baselines. Our results suggest that MossNet is a compelling new direction for efficient, high-performing recurrent LLM architectures.

LLM-Based Behavior Prediction for Social Media Users with Continuous Memory
Kun Li | Chengwei Dai | Wei Zhou | Songlin Hu

Large language models (LLMs) have demonstrated strong capabilities in simulating social roles and generating human-like behaviors. However, their effectiveness in predicting real-world user behavior under continuous memory accumulation remains largely unexplored. Most existing studies focus on short-term interactions or static personas, neglecting the dynamic nature of users’ historical experiences in social media environments. To address this gap, we introduce FineRob, a novel dataset for fine-grained behavior prediction of social media users, which includes long-term memory traces from 1,866 users across three platforms. Each behavior is decomposed into three elements: object, type, and content, resulting in 78.6k QA records. We identify that as memory accumulates, prediction accuracy drops significantly due to the model’s difficulty in accessing detailed historical information. We further propose the OM-CoT fine-tuning framework to enhance the model’s ability to process and utilize long-term memory. Experimental results show that our method effectively reduces the performance degradation caused by memory growth, improving fine-grained behavior prediction.

Hassles and Uplifts Detection on Social Media Narratives
Jiyu Chen | Sarvnaz Karimi | Diego Molla | Andreas Duenser | Maria Kangas | Cecile Paris

Hassles and uplifts are psychological constructs of individuals’ positive or negative responses to daily minor incidents, with cumulative impacts on mental health. These concepts are largely overlooked in NLP, where existing tasks and models focus on identifying general sentiment expressed in text. These, however, cannot satisfy targeted information needs in psychological inquiry. To address this, we introduce Hassles and Uplifts Detection (HUD), a novel NLP application to identify these constructs in social media language. We evaluate various language models and task adaptation approaches on a probing dataset collected from a private, real-time emotional venting platform. Some of our models achieve F-scores close to 80%. We also identify open opportunities to improve affective language understanding in support of studies in psychology.

Role-Aware Language Models for Secure and Contextualized Access Control in Organizations
Saeed Almheiri | Yerulan Kongrat | Adrian Santosh | Ruslan Tasmukhanov | Josemaria Loza Vera | Muhammad Dehan Al Kautsar | Fajri Koto

As large language models (LLMs) are increasingly deployed in enterprise settings, controlling model behavior based on user roles becomes an essential requirement. Existing safety methods typically assume uniform access and focus on preventing harmful or toxic outputs, without addressing role-specific access constraints. In this work, we investigate whether LLMs can be fine-tuned to generate responses that reflect the access privileges associated with different organizational roles. We explore three modeling strategies: a BERT-based classifier, an LLM-based classifier, and role-conditioned generation. To evaluate these approaches, we construct two complementary datasets. The first is adapted from existing instruction-tuning corpora through clustering and role labeling, while the second is synthetically generated to reflect realistic, role-sensitive enterprise scenarios. We assess model performance across varying organizational structures and analyze robustness to prompt injection, role mismatch, and jailbreak attempts.

SEA-LION: Southeast Asian Languages in One Network
Raymond Ng | Thanh Ngan Nguyen | Huang Yuli | Tai Ngee Chia | Leong Wai Yi | Wei Qi Leong | Xianbin Yong | Jian Gang Ngui | Yosephine Susanto | Nicholas Cheng | Hamsawardhini Rengarajan | Peerat Limkonchotiwat | Adithya Venkatadri Hulagadri | Kok Wai Teng | Yeo Yeow Tong | Bryan Siow | Wei Yi Teo | Tan Choon Meng | Brandon Ong | Zhi Hao Ong | Jann Railey Montalan | Adwin Chan | Sajeban Antonyrex | Ren Lee | Esther Choa | David Ong Tat-Wee | Bing Jie Darius Liu | William Chandra Tjhi | Erik Cambria | Leslie Teo

Recently, Large Language Models (LLMs) have dominated much of the artificial intelligence scene with their ability to process and generate natural languages. However, the majority of LLM research and development remains English-centric, leaving low-resource languages such as those in the Southeast Asian (SEA) region under-represented. To address this representation gap, we introduce Llama-SEA-LION-v3-8B-IT and Gemma-SEA-LION-v3-9B-IT, two cutting-edge multilingual LLMs designed for SEA languages. The SEA-LION family of LLMs supports 11 SEA languages, namely English, Chinese, Indonesian, Vietnamese, Malay, Thai, Burmese, Lao, Filipino, Tamil, and Khmer. Our work leverages large-scale multilingual continued pre-training with a comprehensive post-training regime involving multiple stages of instruction fine-tuning, alignment, and model merging. Evaluation results on multilingual benchmarks indicate that our models achieve state-of-the-art performance across LLMs supporting SEA languages. We open-source the models to benefit the wider SEA community.

Do LLMs Need Inherent Reasoning Before Reinforcement Learning? A Study in Korean Self-Correction
Hongjin Kim | Jaewook Lee | Kiyoung Lee | Jong-hun Shin | Soojong Lim | Oh-Woog Kwon

Large Language Models (LLMs) demonstrate strong reasoning and self-correction abilities in high-resource languages like English, but their performance remains limited in low-resource languages such as Korean. In this study, we investigate whether reinforcement learning (RL) can enhance Korean reasoning abilities to a degree comparable to English. Our findings reveal that RL alone yields limited improvements when applied to models lacking inherent Korean reasoning capabilities. To address this, we explore several fine-tuning strategies and show that aligning the model’s internal reasoning processes with Korean inputs—particularly by tuning Korean-specific neurons in early layers—is key to unlocking RL’s effectiveness. We introduce a self-correction code-switching dataset to facilitate this alignment and observe significant performance gains in both mathematical reasoning and self-correction tasks. Ultimately, we conclude that the crucial factor in multilingual reasoning enhancement is not injecting new linguistic knowledge, but effectively eliciting and aligning existing reasoning capabilities. Our study provides a new perspective on how internal translation and neuron-level tuning contribute to multilingual reasoning alignment in LLMs.

Multilingual Iterative Model Pruning: What Matters?
Haryo Akbarianto Wibowo | Haiyue Song | Hideki Tanaka | Masao Utiyama | Alham Fikri Aji | Raj Dabre

Pruning techniques have been studied to construct small models for efficiency, yet cross-lingual effects, i.e., the transferability of performance across languages, remain understudied in this field. In this work, we investigate cross-lingual effects in multilingual large language model compression using iterative pruning and recovery. We employ structured layer pruning with LoRA-based recovery and knowledge distillation, testing whether calibration languages different from the target evaluation languages can preserve multilingual performance. Experiments on Qwen2.5-7B and Llama3.1-8B demonstrate that any recovery language consistently outperforms no-recovery baselines, with even low-resource languages like Swahili providing ~5% improvements. Contrary to expectations, dominant pretraining languages do not always yield the best results: Indonesian achieves the highest performance in Llama3.1-8B, while Japanese performs best in Qwen2.5-7B. Our findings reveal that cross-lingual calibration effectively maintains multilingual capabilities during iterative pruning.

Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems
Lijia Liu | Takumi Kondo | Kyohei Atarashi | Koh Takeuchi | Jiyi Li | Shigeru Saito | Hisashi Kashima

This paper investigates defenses in LLM-based evaluation, where prompt injection attacks can manipulate scores by deceiving the evaluation system. We formalize blind attacks as a class in which candidate answers are crafted independently of the true answer. To counter such attacks, we propose an evaluation framework that combines standard and counterfactual evaluation. Experiments show it significantly improves attack detection with minimal performance trade-offs for recent models.

Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs
Yohan Mathew | Ollie Matthews | Robert McCarthy | Joan Velja | Christian Schroeder de Witt | Dylan Cope | Nandi Schoots

The rapid proliferation of frontier model agents promises significant societal advances but also raises concerns about systemic risks arising from unsafe interactions. Collusion to the disadvantage of others has been identified as a central form of undesirable agent cooperation. The use of information hiding (steganography) in agent communications could render such collusion practically undetectable. This underscores the need for investigations into the possibility of such behaviours emerging and the robustness of corresponding countermeasures. To investigate this problem, we design two approaches – a gradient-based reinforcement learning (GBRL) method and an in-context reinforcement learning (ICRL) method – for reliably eliciting sophisticated LLM-generated linguistic text steganography. We demonstrate, for the first time, that unintended steganographic collusion in LLMs can arise due to misspecified reward incentives during training. Additionally, we find that standard mitigations — both passive oversight of model outputs and active mitigation through communication paraphrasing — are not fully effective at preventing this steganographic communication. Our findings imply that (i) the emergence of steganographic collusion is a plausible concern that should be monitored and researched, and (ii) preventing emergence may require innovation in mitigation techniques.

A Survey on LLM-Assisted Clinical Trial Recruitment
Shrestha Ghosh | Moritz Schneider | Carina Reinicke | Carsten Eickhoff

Clinical trials are designed in natural language, and the task of matching them to patients, represented via both structured and unstructured textual data, benefits from the knowledge aggregation and reasoning abilities of LLMs. LLMs, with their ability to consolidate distributed knowledge, hold the potential to build a more general solution than classical approaches that employ trial-specific heuristics. Yet, adoption of LLMs in critical domains, such as clinical research, comes with many challenges, such as the availability of public benchmarks, the dimensions of evaluation, and data sensitivity. In this survey, we contextualize emerging LLM-based approaches in clinical trial recruitment. We examine the main components of the clinical trial recruitment process, discuss existing challenges in adopting LLM technologies in clinical research, and highlight exciting future directions.

The Learning Dynamics of Subword Segmentation for Morphologically Diverse Languages
Francois Meyer | Jan Buys

Subword segmentation is typically applied in preprocessing and stays fixed during training. Alternatively, it can be learned during training to optimise the training objective. In this paper we study the learning dynamics of subword segmentation: if a language model can dynamically optimise tokenisation, how do its subwords evolve during pretraining and finetuning? To explore this, we extend the subword segmental language model (SSLM), a framework for learning subwords during training, to support pretraining and finetuning. We train models for three typologically diverse languages to study learning dynamics across the morphological spectrum: isiXhosa is conjunctive (long word forms composed of many morphemes), Setswana is disjunctive (morphemes written as separate words), and English represents a typological middle ground. We analyse subword dynamics from a linguistic perspective, tracking morphology, productivity, and fertility. We identify four stages of subword learning, with the morphologically complex isiXhosa exhibiting greater instability. During finetuning, subword boundaries shift to become finer-grained. Lastly, we show that learnable subwords offer a promising approach to improve text generation and cross-lingual transfer for low-resource, morphologically complex languages.

PRALEKHA: Cross-Lingual Document Alignment for Indic Languages
Sanjay Suryanarayanan | Haiyue Song | Mohammed Safi Ur Rahman Khan | Anoop Kunchukuttan | Raj Dabre

Mining parallel document pairs for document-level machine translation (MT) remains challenging due to the limitations of existing Cross-Lingual Document Alignment (CLDA) techniques. Existing methods often rely on metadata such as URLs, which are scarce, or on pooled document representations that fail to capture fine-grained alignment cues. Moreover, the limited context window of sentence embedding models hinders their ability to represent document-level context, while sentence-based alignment introduces a combinatorially large search space, leading to high computational cost. To address these challenges for Indic languages, we introduce Pralekha, a benchmark containing over 3 million aligned document pairs across 11 Indic languages and English, which includes 1.5 million English–Indic pairs. Furthermore, we propose Document Alignment Coefficient (DAC), a novel metric for fine-grained document alignment. Unlike pooling-based methods, DAC aligns documents by matching smaller chunks and computes similarity as the ratio of aligned chunks to the average number of chunks in a pair. Intrinsic evaluation shows that our chunk-based method is 2–3× faster while maintaining competitive performance, and that DAC achieves substantial gains over pooling-based baselines. Extrinsic evaluation further demonstrates that document-level MT models trained on DAC-aligned pairs consistently outperform those using baseline alignment methods. These results highlight DAC’s effectiveness for parallel document mining. The dataset and evaluation framework are publicly available to support further research.

Structured Document Translation via Format Reinforcement Learning
Haiyue Song | Johannes Eschbach-Dymanus | Hour Kaing | Sumire Honda | Hideki Tanaka | Bianka Buschbeck | Masao Utiyama

Recent works on structured text translation remain limited to the sentence level, as they struggle to effectively handle the complex document-level XML or HTML structures. To address this, we propose Format Reinforcement Learning (FormatRL), which employs Group Relative Policy Optimization on top of a supervised fine-tuning model to directly optimize novel structure-aware rewards: 1) TreeSim, which measures structural similarity between predicted and reference XML trees and 2) Node-chrF, which measures translation quality at the level of XML nodes. Additionally, we propose StrucAUC, a fine-grained metric distinguishing between minor errors and major structural failures. Experiments on the SAP software-documentation benchmark demonstrate improvements across six metrics and an analysis further shows how different reward functions contribute to improvements in both structural and translation quality.

StuD: A Multimodal Approach for Stuttering Detection with RAG and Fusion Strategies
Pragya Khanna | Priyanka Kommagouni | Vamshi Raghu Simha Narasinga | Anil Vuppala

Stuttering is a complex speech disorder that challenges both ASR systems and clinical assessment. We propose a multimodal stuttering detection and classification model that integrates acoustic and linguistic features through a two-stage fusion mechanism. Fine-tuned Wav2Vec 2.0 and HuBERT extract acoustic embeddings, which are early fused with MFCC features to capture fine-grained spectral and phonetic variations, while Llama-2 embeddings from Whisper ASR transcriptions provide linguistic context. To enhance robustness against out-of-distribution speech patterns, we incorporate Retrieval-Augmented Generation or adaptive classification. Our model achieves state-of-the-art performance on SEP-28k and FluencyBank, demonstrating significant improvements in detecting challenging stuttering events. Additionally, our analysis highlights the complementary nature of acoustic and linguistic modalities, reinforcing the need for multimodal approaches in speech disorder detection.

Deconstructing Attention: Investigating Design Principles for Effective Language Modeling
Huiyin Xue | Nafise Sadat Moosavi | Nikolaos Aletras

The success of Transformer language models is widely credited to their dot-product attention mechanism, which interweaves a set of key design principles: mixing information across positions (enabling multi-token interactions), sequence-dependent activations (where attention weights adapt to each input), a specific mathematical form (dot-product similarities plus softmax weighting), and coupling of queries and keys to evolving hidden states (grounding attention in the current layer). However, the necessity of each of these principles remains largely untested. In this work, we systematically deconstruct attention by designing controlled variants that selectively relax these principles, applied both uniformly across all layers and in hybrid architectures where only some layers retain standard attention. Our empirical analysis reveals that mechanisms for mixing tokens are indispensable, as their absence collapses models to near-random behavior, while the exact mathematical form and sequence dependency can be substantially relaxed, especially when preserved in just a subset of layers. Surprisingly, even variants that fail in isolation can achieve robust performance when interleaved with standard attention, highlighting a cooperative effect. These findings deepen our understanding of what truly underpins attention’s effectiveness and open new avenues for simplifying language models without sacrificing performance.

Fine-Tuning on Noisy Instructions: Effects on Generalization and Performance
Ahmed Alajrami | Xingwei Tan | Nikolaos Aletras

Instruction-tuning plays a vital role in enhancing the task-solving abilities of large language models (LLMs), improving their usability in generating helpful responses on various tasks. However, previous work has demonstrated that they are sensitive to minor variations in instruction phrasing. In this paper, we explore whether introducing perturbations in instruction-tuning data can enhance LLMs’ resistance against noisy instructions. We focus on how instruction-tuning with perturbations, such as removing stop words or shuffling words, affects LLMs’ performance on the original and perturbed versions of widely-used benchmarks (MMLU, BBH, GSM8K). We further assess learning dynamics and potential shifts in model behavior. Surprisingly, our results suggest that instruction-tuning on perturbed instructions can, in some cases, improve downstream performance. These findings highlight the importance of including perturbed instructions in instruction-tuning, which can make LLMs more resilient to noisy user inputs.

pdf bib
Captions Speak Louder than Images: Generalizing Foundation Models for E-commerce from High-quality Multimodal Instruction Data
Xinyi Ling | Hanwen Du | Bo Peng | Zhihui Zhu | Xia Ning

Multimodal foundation models (MFMs) have demonstrated strong capabilities in e-commerce by effectively leveraging multimodal data to enhance product understanding and user experience. However, the development of e-commerce MFMs is hindered by two challenges: (1) the scarcity of large-scale, high-quality multimodal benchmark datasets; and (2) the lack of effective multimodal information integration methods in e-commerce. To address these challenges, we introduce MMECInstruct, the first large-scale, high-quality multimodal instruction dataset designed specifically for e-commerce MFMs. MMECInstruct comprises 75,000 samples covering 7 real-world e-commerce tasks, supporting both in-domain (IND) and out-of-domain (OOD) evaluations. Leveraging MMECInstruct, we develop CASLIE, a lightweight framework that enhances multimodal information understanding and integration for e-commerce. Our comprehensive evaluation demonstrates that MMECInstruct endows CASLIE with advanced capability and strong generalizability in e-commerce applications. MMECInstruct and CASLIE models are publicly accessible through https://github.com/ninglab/CASLIE.

pdf bib
EcomMMMU: Strategic Utilization of Visuals for Robust Multimodal E-commerce Models
Xinyi Ling | Hanwen Du | Zhihui Zhu | Xia Ning

E-commerce platforms are rich in multimodal data, featuring a variety of images that depict product details. However, this raises an important question: do these images always enhance product understanding, or can they sometimes introduce redundancy or degrade performance? Existing datasets are limited in both scale and design, making it difficult to systematically examine this question. To this end, we introduce EcomMMMU, an e-commerce multimodal multitask understanding dataset with 406,190 samples and 8,989,510 images. EcomMMMU is comprised of multi-image visual-language data designed with 8 essential tasks and a specialized VSS subset to benchmark the capability of multimodal large language models (MLLMs) to effectively utilize visual content. Analysis on EcomMMMU reveals that product images do not consistently improve performance and can, in some cases, degrade it. This indicates that MLLMs may struggle to effectively leverage rich visual content for e-commerce tasks. Building on these insights, we propose SUMEI, a data-driven method that strategically utilizes multiple images via predicting visual utilities before using them for downstream tasks. Comprehensive experiments demonstrate the effectiveness and robustness of SUMEI. The data and code are available through https://github.com/ninglab/EcomMMMU.

pdf bib
Multilingual, Not Multicultural: Uncovering the Cultural Empathy Gap in LLMs through a Comparative Empathetic Dialogue Benchmark
Woojin Lee | Yujin Sim | Hongjin Kim | Harksoo Kim

Large Language Models (LLMs) demonstrate remarkable multilingual capabilities, yet it remains unclear whether they are truly multicultural. Do they merely process different languages, or can they genuinely comprehend the unique cultural contexts embedded within them? This study investigates this critical question by examining whether LLMs’ perception of emotion and empathy differs across linguistic and cultural boundaries. To facilitate this, we introduce the Korean Empathetic Dialogues (KoED), a benchmark extending the English-based EmpatheticDialogues (ED) dataset. Moving beyond direct translation, we meticulously reconstructed dialogues specifically selected for their potential for cultural adaptation, aligning them with Korean emotional nuances and incorporating key cultural concepts like ‘jeong’ and ‘han’ that lack direct English equivalents. Our cross-cultural evaluation of leading multilingual LLMs reveals a significant “cultural empathy gap”: models consistently underperform on KoED compared to ED, struggling especially with uniquely Korean emotional expressions. Notably, the Korean-centric model, EXAONE, exhibits significantly higher cultural appropriateness. This result provides compelling evidence that aligns with the “data provenance effect”, suggesting that the cultural alignment of pre-training data is a critical factor for genuine empathetic communication. These findings demonstrate that current LLMs have cultural blind spots and underscore the necessity of benchmarks like KoED to move beyond simple linguistic fluency towards truly culturally adaptive AI systems.

pdf bib
HiPPO: Exploring A Novel Hierarchical Pronunciation Assessment Approach for Spoken Languages
Bi-Cheng Yan | Hsin Wei Wang | Fu-An Chao | Tien-Hong Lo | Yung-Chang Hsu | Berlin Chen

Automatic pronunciation assessment (APA) seeks to quantify a second language (L2) learner’s pronunciation proficiency in a target language by offering timely and fine-grained diagnostic feedback. Most existing efforts on APA have predominantly concentrated on highly constrained reading-aloud tasks (where learners are prompted to read a reference text aloud); however, assessing pronunciation quality in unscripted speech (or free-speaking scenarios) remains relatively underexplored. In light of this, we first propose HiPPO, a hierarchical pronunciation assessment model tailored for spoken languages, which evaluates an L2 learner’s oral proficiency at multiple linguistic levels based solely on the speech uttered by the learner. To improve the overall accuracy of assessment, a contrastive ordinal regularizer and a curriculum learning strategy are introduced for model training. The former aims to generate score-discriminative features by exploiting the ordinal nature of regression targets, while the latter gradually ramps up the training complexity to facilitate the assessment task that takes unscripted speech as input. Experiments conducted on the Speechocean762 benchmark dataset validate the feasibility and superiority of our method in relation to several cutting-edge baselines.

pdf bib
Positional Bias in Long-Document Ranking: Impact, Assessment, and Mitigation
Leonid Boytsov | David Akinpelu | Nipun Katyal | Tianyi Lin | Fangwei Gao | Yutian Zhao | Jeffrey Huang | Eric Nyberg

We tested over 20 Transformer models for ranking long documents (including recent LongP models trained with FlashAttention and RankGPT models “powered” by OpenAI and Anthropic cloud APIs). We compared them with the simple FirstP baseline, which applied the same model to truncated input (up to 512 tokens). On MS MARCO, TREC DL, and Robust04, no long-document model outperformed FirstP by more than 5% (on average). We hypothesized that this lack of improvement is not due to inherent model limitations, but due to benchmark positional bias (most relevant passages tend to occur early in documents), which is known to exist in MS MARCO. To confirm this, we analyzed positional relevance distributions across four long-document corpora (with six query sets) and observed the same early-position bias. Surprisingly, we also found bias in six BEIR collections, which are typically categorized as short-document datasets. We then introduced a new diagnostic dataset, MS MARCO FarRelevant, where relevant spans were deliberately placed beyond the first 512 tokens. On this dataset, many long-context models—including RankGPT—performed at random-baseline level, suggesting overfitting to positional bias. We also experimented with debiasing training data, but with limited success. Our findings (1) highlight the need for careful benchmark design in evaluating long-context models for document ranking, (2) identify model types that are more robust to positional bias, and (3) motivate further work on approaches to debias training data. We release our code and data to support further research.

pdf bib
How Reliable are Causal Probing Interventions?
Marc E. Canby | Adam Davies | Chirag Rastogi | Julia Hockenmaier

Causal probing aims to analyze foundation models by examining how intervening on their representation of various latent properties impacts their outputs. Recent works have cast doubt on the theoretical basis of several leading causal probing methods, but it has been unclear how to systematically evaluate the effectiveness of these methods in practice. To address this, we define two key causal probing desiderata: *completeness* (how thoroughly the representation of the target property has been transformed) and *selectivity* (how little non-targeted properties have been impacted). We find that there is an inherent tradeoff between the two, which we define as *reliability*, their harmonic mean. We introduce an empirical analysis framework to measure and evaluate these quantities, allowing us to make the first direct comparisons between different families of leading causal probing methods (e.g., linear vs. nonlinear, or concept removal vs. counterfactual interventions). We find that: (1) all methods show a clear tradeoff between completeness and selectivity; (2) more complete and reliable methods have a greater impact on LLM behavior; and (3) nonlinear interventions are almost always more reliable than linear interventions. Our project webpage is available at: [https://ahdavies6.github.io/causal_probing_reliability/](https://ahdavies6.github.io/causal_probing_reliability/)
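
As a purely illustrative aside (not taken from the paper's codebase), the reliability measure described above, the harmonic mean of completeness and selectivity, can be computed as in the following Python sketch; the function and variable names are our own.

```python
def reliability(completeness: float, selectivity: float) -> float:
    """Harmonic mean of the two causal-probing desiderata (both in [0, 1])."""
    if completeness + selectivity == 0:
        return 0.0
    return 2 * completeness * selectivity / (completeness + selectivity)

# A very complete but unselective intervention still scores poorly,
# reflecting the tradeoff described in the abstract.
print(reliability(0.95, 0.30))  # ~0.456
```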

pdf bib
wavCSE: Learning Fixed-size Unified Speech Embeddings via Feature-based Multi-Task Learning
Braveenan Sritharan | Uthayasanker Thayasivam

Modern speech applications require compact embeddings that generalize across both linguistic and paralinguistic tasks. However, most existing embeddings are task-specific and fail to transfer effectively across domains. We propose wavCSE, a feature-based multi-task learning model that produces a fixed-size unified speech embedding suitable for both linguistic and paralinguistic tasks. wavCSE is jointly trained on keyword spotting, speaker identification, and emotion recognition, achieving state-of-the-art performance on all three tasks. The resulting unified embedding is then evaluated on twelve downstream tasks spanning both linguistic and paralinguistic domains. Experimental results show that it outperforms strong baselines on nine of the twelve tasks, indicating effective generalization across domains. To streamline embedding generation, we introduce a recursive layer selection strategy that identifies the most informative hidden layer outputs from the upstream model and refine how these selected outputs are aggregated in the downstream model. These enhancements reduce memory usage and computational cost while improving task performance, making them broadly applicable to self-supervised learning-based speech processing models.

pdf bib
Unveiling Empathic Triggers in Online Interactions via Empathy Cause Identification
Calliope Chloe Bandera | Gyeongeun Lee | Natalie Parde

Effectively learning language patterns that provoke empathetic expression is vital to creating emotionally intelligent technologies; however, this problem has historically been overlooked. We address this gap by proposing the new task of empathy cause identification: a challenging task aimed at pinpointing specific triggers prompting empathetic responses in communicative settings. We correspondingly introduce AcnEmpathize-Cause, a novel dataset consisting of 4K cause-identified sentences, and explore various models to evaluate and demonstrate the dataset’s efficacy. This research not only contributes to the understanding of empathy in textual communication but also paves the way for the development of AI systems capable of more nuanced and supportive interactions.

pdf bib
Assessing the Limits of In-Context Learning beyond Functions using Partially Ordered Relation
Debanjan Dutta | Faizanuddin Ansari | Swagatam Das

Generating rational and generally accurate responses to tasks, often accompanied by example demonstrations, highlights Large Language Models’ (LLMs’) remarkable In-Context Learning (ICL) capabilities without requiring updates to the model’s parameter space. While ongoing exploration has focused on inference from document-level concepts, LLMs’ behavior when learning well-defined functions or relations in context needs careful investigation. In this article, we present the performance of ICL on partially ordered relations by introducing the notion of inductively increasing complexity in prompts. In most cases, the saturated performance of the chosen metric indicates that while ICL offers some benefits, its effectiveness remains constrained as we increase the complexity of the prompts, even in the presence of sufficient demonstrative examples. This behavior is evident from our empirical findings and is further justified theoretically in terms of the implicit optimization process underlying ICL. The code is available here.

pdf bib
ProSwitch: Knowledge-Guided Instruction Tuning to Switch Between Professional and Non-Professional Responses
Chang Zong | Yuyan Chen | Weiming Lu | Jian Shao | Yongfeng Huang | Heng Chang | Yueting Zhuang

Large Language Models (LLMs) have demonstrated efficacy in various linguistic applications, including question answering and controlled text generation. However, studies into their ability to switch between opposite styles of responses in professional domains remain underexplored. This study introduces a novel approach, named ProSwitch, which enables a language model to switch between professional and non-professional answers, by tuning and evaluating through the guidance of domain and style knowledge. ProSwitch unfolds in three phases: LLM-augmented preparation to collect domain knowledge and QA pairs, instruction tuning to optimize LLMs with multiple levels of knowledge, and comprehensive evaluation to assess both style discrimination and reference-based quality of the generated text. Comparative analysis of ProSwitch against general and specialized LLMs reveals that our approach outperforms baselines in switching between professional and non-professional responses.

pdf bib
Reasoning Models Reason Well, Until They Don’t
Revanth Rameshkumar | Jimson Huang | Yunxin Sun | Fei Xia | Abulhair Saparov

Large language models (LLMs) have shown significant progress in reasoning tasks. However, recent studies show that transformers and LLMs fail catastrophically once reasoning problems exceed modest complexity. We revisit these findings through the lens of large reasoning models (LRMs): LLMs fine-tuned with incentives for step-by-step argumentation and self-verification. LRM performance on graph and reasoning benchmarks such as **NLGraph** seems extraordinary, with some even claiming that LRMs are capable of generalized reasoning and innovation in reasoning-intensive fields such as mathematics, physics, medicine, and law. However, by more carefully scaling the complexity of reasoning problems, we show that existing benchmarks actually have limited complexity. We develop a new dataset, the **Deep Reasoning Dataset (DeepRD)**, along with a generative process for producing unlimited examples of scalable complexity. We use this dataset to evaluate model performance on graph connectivity and natural language proof planning. We find that the performance of LRMs drops abruptly at sufficient complexity and does not generalize. We also relate our LRM results to the distributions of the complexities of large, real-world knowledge graphs, interaction graphs, and proof datasets. We find the majority of real-world examples fall inside the LRMs’ success regime, yet the long tails expose substantial failure potential. Our analysis highlights the near-term utility of LRMs while underscoring the need for new methods that generalize beyond the complexity of examples in the training distribution.

pdf bib
Chain-of-Query: Unleashing the Power of LLMs in SQL-Aided Table Understanding via Multi-Agent Collaboration
Songyuan Sui | Hongyi Liu | Serena Liu | Li Li | Soo-Hyun Choi | Rui Chen | Xia Hu

Table understanding requires structured, multi-step reasoning. Large Language Models (LLMs) struggle with it due to the structural complexity of tabular data. Recently, multi-agent frameworks for SQL generation have shown promise in tackling the challenges of understanding tabular data, but existing approaches often suffer from limitations such as the inability to comprehend table structure for reliable SQL generation, error propagation that results in invalid queries, and over-reliance on execution correctness. To address these issues, we propose Chain-of-Query (CoQ), a novel multi-agent framework for SQL-aided table understanding. CoQ adopts natural-language-style representations of table schemas to abstract away structural noise and enhance understanding. It employs a clause-by-clause SQL generation strategy to improve query quality and introduces a hybrid reasoning division that separates SQL-based mechanical reasoning from LLM-based logical inference, thereby reducing reliance on execution outcomes. Extensive experiments across four models and five widely used benchmarks demonstrate that CoQ achieves substantial accuracy improvements and significantly lowers invalid SQL rates compared to prior generic LLM-based, SQL-aided, and hybrid baselines, confirming its superior effectiveness in table understanding. The code is available at https://github.com/SongyuanSui/ChainofQuery.

pdf bib
Multimodal Language Models for Financial Forecasting from Interleaved Sequences of Text and Time Series
Ross Koval | Nicholas Andrews | Xifeng Yan

Text and time series data offer complementary views of financial markets: news articles provide narrative context about company events, while stock prices reflect how markets react to those events. However, despite their complementary nature, effectively integrating these interleaved modalities for improved forecasting remains challenging. In this work, we propose a unified neural architecture that models these interleaved sequences using modality-specific experts, allowing the model to learn unique time series patterns, while still enabling joint reasoning across modalities and preserving pretrained language understanding capabilities. To further improve multimodal understanding, we introduce a cross-modal alignment framework with a salient token weighting mechanism that learns to align representations across modalities with a focus on the most informative tokens. We demonstrate the effectiveness of our approach on a large-scale financial forecasting task, achieving state-of-the-art performance across a wide variety of strong unimodal and multimodal baselines. We develop an interpretability method that reveals insights into the value of time series-context and reinforces the design of our cross-modal alignment objective. Finally, we demonstrate that these improvements translate to meaningful economic gains in investment simulations.

pdf bib
Comparing Language Models of Different Scales for Security-Focused Tabular Query Generation and Reasoning
Varivashya Poladi | Sandipan Dandapat

Security-related data often exists in complex, multi-table formats and is scarce due to privacy and compliance constraints—posing a major challenge for training and evaluating language models (LMs) on security reasoning tasks. In this work, we systematically investigate the performance of large language models (LLMs) across different parameter scales in generating and solving multi-step, semantically rich queries over realistic security scenarios represented through three interlinked tabular datasets. We assess models on three key axes: (i) their ability to formulate insightful, high-complexity security questions; (ii) the quality and coherence of their reasoning chains; and (iii) their accuracy in deriving actionable answers from the underlying data. To address data scarcity, we propose a diffusion-based synthetic data generation pipeline that amplifies the existing dataset while preserving domain semantics and statistical structure. Our findings reveal that while larger models often lead in reasoning depth and query formulation, smaller models show surprising efficiency and accuracy. The study provides actionable insights for deploying generative models in security analytics and opens avenues for synthetic data-driven evaluation of LLMs in low-resource, high-stakes domains.

pdf bib
Generate but Verify: Answering with Faithfulness in RAG-based Question Answering
Simone Filice | Elad Haramaty | Guy Horowitz | Zohar Karnin | Liane Lewin-Eytan | Alex Shtoff

Retrieval-Augmented Generation (RAG) enhances LLMs by grounding answers in retrieved passages, which is key in factual Question Answering. However, generated answers may still be unfaithful to the passages, either due to retrieval or generation errors. Many RAG downstream applications rely on assessing answer faithfulness for applying fallback strategies, yet address it implicitly, without a consistent evaluation methodology. We introduce the task of Answering with Faithfulness (AwF), which brings faithfulness prediction to the forefront, explicitly coupling it with answer generation. We define variants of the precision and recall metrics tailored to this task, facilitating direct evaluation and comparison of different AwF methods. We then demonstrate, both theoretically and empirically, that for RAG applications using AwF as a sub-procedure, an improvement to the AwF metrics translates to an improvement in downstream performance, yielding gains over recently published results.

pdf bib
Critical Thinking: Which Kinds of Complexity Govern Optimal Reasoning Length?
Celine Lee | Alexander M Rush | Keyon Vafa

Large language models (LLMs) often benefit from verbalized reasoning at inference time, but it remains unclear which aspects of task difficulty these extra reasoning tokens address. To investigate this question, we construct a controlled setting where task complexity can be precisely manipulated to study its effect on reasoning length. Deterministic finite automata (DFAs) offer a formalism through which we can characterize task complexity through measurable properties such as run length (number of reasoning steps required) and state-space size (decision complexity). We first show that across different tasks and models of different sizes and training paradigms, there exists an optimal amount of reasoning tokens such that the probability of producing a correct solution is maximized. We then investigate which properties of complexity govern this critical length: we find that task instances with longer corresponding underlying DFA runs (i.e., those requiring more latent state-tracking steps) correlate with longer reasoning lengths, but, surprisingly, that DFA size (i.e., state-space complexity) does not. We then demonstrate an implication of these findings: being able to predict the optimal number of reasoning tokens for new problems and filtering out non-optimal-length answers results in consistent accuracy improvements.
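
To make the two complexity axes concrete, here is a minimal, editorially added Python sketch (not the authors' code) of a deterministic finite automaton whose run length and state-space size can be varied independently.

```python
def run_dfa(transitions, start, symbols):
    """Follow a DFA transition table over an input string.

    Run length = len(symbols); state-space size = number of distinct states.
    """
    state = start
    for s in symbols:
        state = transitions[(state, s)]
    return state

# Two-state parity DFA: tiny state space, but runs can be arbitrarily long.
parity = {("even", "0"): "even", ("even", "1"): "odd",
          ("odd", "0"): "odd", ("odd", "1"): "even"}
print(run_dfa(parity, "even", "1101"))  # 'odd' (three 1s seen)
```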

pdf bib
MAJI: A Multi-Agent Workflow for Augmenting Journalistic Interviews
Kaiwen Guo | Yimeng Wu

Journalistic interviews are creative, dynamic processes where success hinges on insightful, real-time questioning. While Large Language Models (LLMs) can assist, their tendency to generate coherent but uninspired questions optimizes for probable, not insightful, continuations. This paper investigates whether a structured, multi-agent approach can overcome this limitation to act as a more effective creative partner for journalists. We introduce MAJI, a system designed for this purpose, which employs a divergent-convergent architecture: a committee of specialized agents generates a diverse set of questions, and a convergent agent selects the optimal one. We evaluated MAJI against a suite of strong LLM baselines. Our results demonstrate that our multi-agent framework produces questions that are more coherent, elaborate, and original (+36.9% for our best model vs. a standard LLM baseline), exceeding strong LLM baselines on key measures of creative question quality. Most critically, in a blind survey, professional journalists preferred MAJI’s selected questions over those from the baseline by a margin of more than two to one. We present the system’s evolution, highlighting the architectural trade-offs that enable MAJI to augment, rather than simply automate, journalistic inquiry. We will release the code upon publication.

pdf bib
How Aligned Are Unimodal Language and Graph Encodings of Chemical Molecules?
Congfeng Cao | Zhi Zhang | Jelke Bloem | Khalil Sima’an

Chemical molecules can be represented as graphs or as language descriptions. Training unimodal models on graphs results in different encodings than training them on language. Therefore, the existing literature force-aligns the unimodal models during training to use them in downstream applications such as drug discovery. But to what extent are graph and language unimodal model representations inherently aligned, i.e., aligned prior to any force-alignment training? Knowing this is useful for a more expedient and effective forced alignment. For the first time, we explore methods to gauge the alignment of graph and language unimodal models. We find compelling differences between models and their ability to represent slight structural differences without force-alignment. We also present a unified unimodal alignment (U2A) benchmark for gauging the inherent alignment between graph and language encoders, which we make available with this paper.

pdf bib
Interpreting Multi-Attribute Confounding through Numerical Attributes in Large Language Models
Hirohane Takagi | Gouki Minegishi | Shota Kizawa | Issey Sukeda | Hitomi Yanaka

Although behavioral studies have documented numerical reasoning errors in large language models (LLMs), the underlying representational mechanisms remain unclear. We hypothesize that numerical attributes occupy shared latent subspaces and investigate two questions: (1) How do LLMs internally integrate multiple numerical attributes of a single entity? (2) How does irrelevant numerical context perturb these representations and their downstream outputs? To address these questions, we combine linear probing with partial correlation analysis and prompt-based vulnerability tests across models of varying sizes. Our results show that LLMs encode real-world numerical correlations but tend to systematically amplify them. Moreover, irrelevant context induces consistent shifts in magnitude representations, with downstream effects that vary by model size. These findings reveal a vulnerability in LLM decision-making and lay the groundwork for fairer, representation-aware control under multi-attribute entanglement.

pdf bib
EmplifAI: a Fine-grained Dataset for Japanese Empathetic Medical Dialogues in 28 Emotion Labels
Wan Jou She | Lis Pereira | Fei Cheng | Sakiko Yahata | Panote Siriaraya | Eiji Aramaki

This paper introduces EmplifAI, a Japanese empathetic dialogue dataset designed to support patients coping with chronic medical conditions. They often experience a wide range of positive and negative emotions (e.g., hope and despair) that shift across different stages of disease management. EmplifAI addresses this complexity by providing situation-based dialogues grounded in 28 fine-grained emotion categories, adapted and validated from the GoEmotions taxonomy. The dataset includes 280 medically contextualized situations and 4,125 two-turn dialogues, collected through crowdsourcing and expert review. To evaluate emotional alignment in empathetic dialogues, we assessed model predictions on situation–dialogue pairs using BERTScore across multiple large language models (LLMs), achieving F1 scores of ≤ 0.83. Fine-tuning a baseline Japanese LLM (LLM-jp-3.1-13b-instruct4) with EmplifAI resulted in notable improvements in fluency, general empathy, and emotion-specific empathy. Furthermore, we compared the scores assigned by LLM-as-a-Judge and human raters on dialogues generated by multiple LLMs to validate our evaluation pipeline and discuss the insights and potential risks derived from the correlation analysis.

pdf bib
Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs
Wenyu Zhang | Yingxu He | Geyu Lin | Zhuohan Liu | Shuo Sun | Bin Wang | Xunlong Zou | Jeremy H. M. Wong | Qiongqiong Wang | Hardik Bhupendra Sailor | Nancy F. Chen | AiTi Aw

Audio Large Language Models (AudioLLMs) have achieved strong results in semantic tasks like speech recognition and translation, but remain limited in modeling paralinguistic cues such as emotion. Existing approaches often treat emotion understanding as a classification problem, offering little insight into the underlying rationale behind predictions. In this work, we explore emotion reasoning, a strategy that leverages the generative capabilities of AudioLLMs to enhance emotion recognition by producing semantically aligned, evidence-grounded explanations. To support this in multitask AudioLLMs, we introduce a unified framework combining reasoning-augmented data supervision, dual-encoder architecture, and task-alternating training. This approach enables AudioLLMs to effectively learn different tasks while incorporating emotional reasoning. Experiments on IEMOCAP and MELD show that our approach not only improves emotion prediction accuracy but also enhances the coherence and evidential grounding of the generated responses. Experiments on two out-of-domain datasets demonstrate the generalization capabilities of the resulting model.

pdf bib
On the Convergence of Moral Self-Correction in Large Language Models
Guangliang Liu | Haitao Mao | Bochuan Cao | Xitong Zhang | Zhiyu Xue | Rongrong Wang | Kristen Johnson

Large Language Models (LLMs) are able to improve their responses when instructed to do so, a capability known as self-correction. When instructions provide only a general and abstract goal without specific details about potential issues in the response, LLMs must rely on their internal knowledge to improve response quality, a process referred to as intrinsic self-correction. The empirical success of intrinsic self-correction is evident in various applications, but how and why it is effective remains unknown. Focusing on moral self-correction in LLMs, we reveal a key characteristic of intrinsic self-correction: performance convergence through multi-round interactions; and provide a mechanistic analysis of this convergence behavior. Based on our experimental results and analysis, we uncover the underlying mechanism of convergence: consistently injected self-correction instructions activate moral concepts that reduce model uncertainty, leading to converged performance as the activated moral concepts stabilize over successive rounds. This paper demonstrates the strong potential of moral self-correction by showing that it exhibits a desirable property of converged performance.

pdf bib
FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference
Divya Jyoti Bajpai | Manjesh Kumar Hanawal

Vision-language Models (VLMs) have made significant strides in visual understanding and query response generation, but often face challenges of high computational cost and inference latency due to autoregressive decoding. In this work, we introduce an imitation-learning-based Self-Speculative Decoding (SSD) framework, named FastVLM, to address these limitations. Our approach employs a lightweight draft model for token generation in an autoregressive manner, while a full model verifies these tokens non-autoregressively. Accepted tokens proceed seamlessly, while rejected tokens are corrected by the full model and used to guide the draft model’s refinement. Through an imitation network, FastVLM enhances the draft model by integrating deeper-level insights from the full model’s architecture. Also, it maintains the performance integrity of the full model while training the draft model, achieving a balance between efficiency and accuracy. Our method speeds up the inference process by 1.55-1.85× as compared to the final layer with minimal loss in performance. The source code is available at https://github.com/Div290/SSD.
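
As a rough, editorially added illustration of the draft-and-verify idea behind speculative decoding (greedy variant), the sketch below uses placeholder `draft_next` and `full_next` functions rather than the paper's actual models; it also checks tokens one by one for clarity, whereas the paper's full-model verification is non-autoregressive.

```python
def speculative_step(prefix, draft_next, full_next, k=4):
    """Draft k tokens with a cheap model, then verify them with the full model."""
    ctx = list(prefix)
    drafted = []
    for _ in range(k):                 # cheap autoregressive drafting
        drafted.append(draft_next(ctx + drafted))
    for t in drafted:                  # verification against the full model
        v = full_next(ctx)
        if v == t:
            ctx.append(t)              # agreement: accept the drafted token
        else:
            ctx.append(v)              # disagreement: correct and stop
            break
    return ctx
```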

pdf bib
Beyond Memorization: Assessing Semantic Generalization in Large Language Models Using Phrasal Constructions
Wesley Scivetti | Melissa Torgbi | Mollie Shichman | Taylor Pellegrin | Austin Blodgett | Claire Bonial | Harish Tayyar Madabushi

The web-scale of pretraining data has created an important evaluation challenge: to disentangle linguistic competence on cases well-represented in pretraining data from generalization to out-of-domain language, specifically the dynamic, real-world instances less common in pretraining data. To this end, we construct a diagnostic evaluation to systematically assess natural language understanding in LLMs by leveraging Construction Grammar (CxG). CxG provides a psycholinguistically grounded framework for testing generalization, as it explicitly links syntactic forms to abstract, non-lexical meanings. Our novel inference evaluation dataset consists of English phrasal constructions, for which speakers are known to be able to abstract over commonplace instantiations in order to understand and produce creative instantiations. Our evaluation dataset uses CxG to evaluate two central questions: first, if models can “understand” the semantics of sentences for instances that are likely to appear in pretraining data less often, but are intuitive and easy for people to understand. Second, if LLMs can deploy the appropriate constructional semantics given constructions that are syntactically identical but with divergent meanings. Our results demonstrate that state-of-the-art models, including GPT-o1, exhibit a performance drop of over 40% on our second task, revealing a failure to generalize over syntactically identical forms to arrive at distinct constructional meanings in the way humans do. We make our novel dataset and associated experimental data, including prompts and model responses, publicly available.

pdf bib
Improving Sign Language Understanding with a Multi-Stream Masked Autoencoder Trained on ASL Videos
Junwen Mo | MinhDuc Vo | Hideki Nakayama

Sign language understanding remains a significant challenge, particularly for low-resource sign languages with limited annotated data. Motivated by the success of large-scale pretraining in deep learning, we propose Multi-Stream Masked Autoencoder (MS-MAE) — a simple yet effective framework for learning sign language representations from skeleton-based video data. We pretrained a model with MS-MAE on the YouTube-ASL dataset, and then adapted it to multiple downstream tasks across different sign languages. Experimental results show that MS-MAE achieves competitive or superior performance on a range of isolated sign language recognition benchmarks and gloss-free sign language translation tasks across several sign languages. These findings highlight the potential of leveraging large-scale, high-resource sign language data to boost performance in low-resource sign language scenarios. Additionally, visualization of the model’s attention maps reveals its ability to cluster adjacent pose sequences within a sentence, some of which align with individual signs, offering insights into the mechanisms underlying successful transfer learning.

pdf bib
Quantifying Phonosemantic Iconicity Distributionally in 6 Languages
George Flint | Kaustubh Kislay

Language is, as commonly theorized, largely arbitrary. Yet, systematic relationships between phonetics and semantics have been observed in many specific cases. To what degree could those systematic relationships manifest themselves in large-scale, quantitative investigations, both in previously identified and unidentified phenomena? This work undertakes a distributional approach to quantifying phonosemantic iconicity at scale across 6 diverse languages (English, Spanish, Hindi, Finnish, Turkish, and Tamil). In each language, we analyze the alignment of morphemes’ phonetic and semantic similarity spaces with a suite of statistical measures, and discover an array of interpretable phonosemantic alignments not previously identified in the literature, along with crosslinguistic patterns. We also analyze 5 previously hypothesized phonosemantic alignments, finding support for some such alignments and mixed results for others.
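
For intuition only (the paper's actual suite of statistical measures may differ), one simple way to gauge alignment between a phonetic and a semantic similarity space over the same morphemes is to correlate their pairwise similarities, as in this editorially added sketch.

```python
import numpy as np

def similarity_space_alignment(phon_sim: np.ndarray, sem_sim: np.ndarray) -> float:
    """Pearson correlation between the upper-triangular entries of two
    n x n similarity matrices computed over the same set of morphemes."""
    iu = np.triu_indices_from(phon_sim, k=1)
    return float(np.corrcoef(phon_sim[iu], sem_sim[iu])[0, 1])
```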

pdf bib
Fine-grained Confidence Estimation for Spurious Correctness Detection in Large Language Models
Ai Ishii | Naoya Inoue | Hisami Suzuki | Satoshi Sekine

In the deployment of Large Language Models (LLMs), “spurious correctness”—where answers are correct but reasoning contains errors—poses a critical risk by creating an illusion of reliability. While prior work on LLM confidence estimation focuses on answer-level or entire reasoning path confidence, these coarse-grained approaches fail to identify which specific parts of the reasoning contain errors. We propose a fine-grained confidence estimation framework that computes confidence scores for individual evidence triplets within reasoning chains, enabling precise localization of errors. Using carefully designed prompts, we generate answers, evidence in triplet format, and their respective confidence scores simultaneously, allowing automatic detection of spurious correctness patterns where partial evidence contains factual errors. Evaluated on both Japanese and English multi-hop QA benchmarks across multiple models from three model families representing different architectures and training approaches, we show that our approach exhibits superior calibration performance for evidence confidence and demonstrates effective ability to detect spurious correct answers (up to 0.84 on our primary discrimination metric). The consistent improvements across languages demonstrate the generalizability of our method. As a secondary benefit, joint generation of confidence scores improves answer confidence calibration by up to 43%. This prompt-based approach requires no model retraining and is immediately applicable to existing LLMs.

pdf bib
Observing Micromotives and Macrobehavior of Large Language Models
Yuyang Cheng | Xingwei Qu | Tomas Goldsack | Chenghua Lin | Chung-Chi Chen

Thomas C. Schelling, awarded the 2005 Nobel Memorial Prize in Economic Sciences, pointed out that “individuals’ decisions (micromotives), while often personal and localized, can lead to societal outcomes (macrobehavior) that are far more complex and different from what the individuals intended.” The current research related to large language models’ (LLMs’) micromotives, such as preferences or biases, assumes that users will make more appropriate decisions once LLMs are devoid of preferences or biases. Consequently, a series of studies has focused on removing bias from LLMs. In the NLP community, while there are many discussions on LLMs’ micromotives, previous studies have seldom conducted a systematic examination of how LLMs may influence society’s macrobehavior. In this paper, we follow the design of Schelling’s model of segregation to observe the relationship between the micromotives and macrobehavior of LLMs. Our results indicate that, regardless of the level of bias in LLMs, a highly segregated society will emerge as more people follow LLMs’ suggestions. We hope our discussion will spark further consideration of the fundamental assumption regarding the mitigation of LLMs’ micromotives and encourage a reevaluation of how LLMs may influence users and society.
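
For readers unfamiliar with Schelling's segregation model, the following is an editorially added, one-dimensional toy sketch (the threshold, window size, and relocation rule are illustrative, not the paper's simulation setup): agents with only mild same-type preferences still tend to produce strongly clustered outcomes.

```python
import random

def step(agents, tolerance=0.5, window=2):
    """One sweep of a 1-D Schelling-style model: unhappy agents swap to a random cell."""
    n = len(agents)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        neigh = agents[lo:i] + agents[i + 1:hi]
        same = sum(1 for a in neigh if a == agents[i])
        if neigh and same / len(neigh) < tolerance:   # mildly unhappy agent relocates
            j = random.randrange(n)
            agents[i], agents[j] = agents[j], agents[i]
    return agents

agents = [random.choice("AB") for _ in range(60)]
for _ in range(100):
    agents = step(agents)
print("".join(agents))   # typically shows long same-type runs (macro-level segregation)
```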

pdf bib
SciHallu: A Multi-Granularity Hallucination Detection Dataset for Scientific Writing
Adiba Ibnat Hossain | Sagnik Ray Choudhury | Hamed Alhoori

Large Language Models (LLMs) are increasingly used to support scientific writing, but their tendency to produce hallucinated content threatens academic reliability. Existing benchmarks have addressed hallucination detection in general-domain tasks, such as fact-checking or question answering, but they do not reflect the fine-grained, domain-specific needs of scientific communication. We introduce SciHallu, a dataset for identifying hallucinations in academic text at three levels of granularity: token, sentence, and paragraph. To establish a reliable ground truth, we select source passages from research papers published prior to the widespread adoption of LLMs. Our dataset includes both hallucinated and non-hallucinated paragraph instances, constructed through controlled perturbations at varying levels of noise and validated by human annotators. A rationale is paired with each instance, explaining the nature of the modification. SciHallu covers multiple academic fields, such as Computer Science, Health Sciences, and Humanities and Social Sciences. It is built using a model-guided annotation pipeline, followed by expert human validation. We evaluate state-of-the-art LLMs on both binary and fine-grained classification tasks, revealing challenges in detecting subtle hallucinations. SciHallu supports the development of context-aware systems for more trustworthy scientific content generation.

pdf bib
Enhancing Investment Opinion Ranking through Argument-Based Sentiment Analysis
Chung-Chi Chen | Hen-Hsen Huang | Hsin-Hsi Chen | Hiroya Takamura | Ichiro Kobayashi | Yusuke Miyao

In the era of rapid Internet and social media development, individuals readily share their investment opinions online. The overwhelming volume of such opinions makes comprehensive evaluation impractical, highlighting the need for an effective recommendation system that can identify valuable insights. To address this challenge, we propose an argument-based sentiment analysis framework that incorporates a new perspective on opinion strength. Our approach introduces the concept of a Fuzzy Strength Degree (FSD), derived from the difference between analysts’ target and closing prices, to quantify the intensity of opinions. By integrating argument mining techniques, we further decompose each opinion into claims and premises, examine their relationships, and use these structures to evaluate the persuasive strength of the arguments. This dual strategy allows us to rank both professional and amateur investor opinions without relying on user history or social signals. Experiments show that our method works best for analyst reports, while on social media, simpler approaches based on wording and professionalism features perform better. Moreover, our analysis of professional analysts’ and traders’ behaviors reveals that top-ranked opinions are more likely to influence subsequent market actions. These findings demonstrate that argument structure and quantified opinion strength provide a novel and reliable foundation for investment opinion recommendation.

pdf bib
A Japanese Language Model and Three New Evaluation Benchmarks for Pharmaceutical NLP
Shinnosuke Ono | Issey Sukeda | Takuro Fujii | Kosei Buma | Shunsuke Sasaki

We present **JPharmatron**, a Japanese domain-specific large language model (LLM) for the pharmaceutical field, developed through continual pre-training on two billion Japanese pharmaceutical tokens and eight billion English biomedical tokens. For rigorous evaluation, we introduce **JPharmaBench**, a benchmark suite consisting of three new benchmarks: YakugakuQA, based on national pharmacist licensing exams; NayoseQA, which tests cross-lingual synonym and terminology normalization; and SogoCheck, a novel task involving cross-document consistency checking. We evaluate our model against open-source medical LLMs and commercial models, including GPT-4o. Experimental results show that **JPharmatron** outperforms existing open models and achieves competitive performance with commercial ones. Interestingly, even GPT-4o performs poorly on SogoCheck, suggesting that cross-sentence consistency reasoning remains an open challenge. **JPharmatron** enables secure and local model deployment for pharmaceutical tasks, where privacy and legal constraints limit the use of closed models. In addition, **JPharmaBench** offers a reproducible framework for evaluating Japanese pharmaceutical natural language processing. Together, they demonstrate the feasibility of practical and cost-efficient language models for the Japanese healthcare and pharmaceutical sectors. Our model, code, and datasets are available on HuggingFace: https://huggingface.co/collections/EQUES/jpharmatron and https://huggingface.co/collections/EQUES/jpharmabench.

pdf bib
Investigating Feasibility of Large Language Model Agent Collaboration in Minecraft and Comparison with Human-Human Collaboration
Yuki Hirota | Ryuichiro Higashinaka

In recent years, there has been growing interest in agents that collaborate with humans on creative tasks, and research has begun to explore such collaboration within Minecraft. However, most existing studies on agents in Minecraft focus on scenarios where an agent constructs objects independently on the basis of given instructions, making it difficult to achieve joint construction through dialogue-based cooperation with humans. Prior work, such as the Action-Utterance Model, used small-scale large language models (LLMs), which resulted in limited accuracy. In this study, we attempt to build an agent capable of collaborative construction using LLMs by integrating the framework of the Action-Utterance Model with that of Creative Agents, which leverages more recent and powerful LLMs for more accurate and flexible building. We had two agents conduct the Collaborative Garden Task through simulations and evaluated both the generated gardens and the dialogue content. Through this evaluation, we confirm that the agents are capable of producing gardens with a certain level of quality and can actively offer suggestions and assert their opinions. Furthermore, we conduct a comparative analysis with human-human collaboration to identify current challenges faced by agents and to discuss future directions for improvement toward achieving more human-like cooperative behavior.

pdf bib
Is Fine-Tuning an Effective Solution? Reassessing Knowledge Editing for Unstructured Data
Hao Xiong | Chuanyuan Tan | Wenliang Chen

Unstructured Knowledge Editing (UKE) is crucial for updating the relevant knowledge of large language models (LLMs). It focuses on unstructured inputs, such as long or free-form texts, which are common forms of real-world knowledge. Although previous studies have proposed effective methods and tested them, some issues remain: (1) lack of Locality evaluation for UKE, and (2) abnormal failure of fine-tuning (FT) based methods for UKE. To address these issues, we first construct two datasets, UnKEBench-Loc and AKEW-Loc (CF), by extending two existing UKE datasets with locality test data from the unstructured and structured views. This enables a systematic evaluation of the Locality of post-edited models. Furthermore, we identify four factors that may affect the performance of FT-based methods. Based on these factors, we conduct experiments to determine how the well-performing FT-based methods should be trained for the UKE task, providing a training recipe for future research. Our experimental results indicate that the FT-based method with the optimal setting (FT-UKE) is surprisingly strong, outperforming the existing state-of-the-art (SOTA). In batch editing scenarios, FT-UKE shows strong performance as well, with its advantage over SOTA methods increasing as the batch size grows, expanding the average metric lead from +6.78% to +10.80%. Our code and data will be released on GitHub.

pdf bib
Optimizing the Arrangement of Citations in Related Work Section
Masashi Oshika | Ryohei Sasano

In the related work section of a scientific paper, authors collect relevant citations and structure them into coherent paragraphs that follow a logical order. Previous studies have addressed citation recommendation and related work section generation in settings where both the citations and their order are provided in advance. However, they have not adequately addressed the optimal ordering of these citations, which is a critical step for achieving fully automated related work section generation. In this study, we propose a new task, citation arrangement, which focuses on determining the optimal order of cited papers to enable fully automated related work section generation. Our approach decomposes citation arrangement into three tasks: citation clustering, paragraph ordering, and citation ordering within a paragraph. For each task, we propose a method that uses a large language model (LLM) in combination with a graph-based technique to comprehensively consider the context of each paper and the relationships among all cited papers. The experimental results show that our method is more effective than methods that generate outputs for each task using only an LLM.

pdf bib
Ability Transfer Through Language Mixing
Petr Hyner | Jan Mrógala | Jan Hula

We systematically investigate cross-lingual ability transfer in language models through controlled experiments across three problem sets: algorithmic addition, graph navigation, and natural language modeling. Our experimental design creates high-resource and low-resource “language” pairs differing in vocabulary, grammar, and computational requirements. We show that training on mixed datasets consistently enables strong positive transfer, significantly improving low-resource language performance compared to training on a small amount of data in isolation. We observe improvements from 0% to 100% accuracy in arithmetic tasks, from 24% to 98% accuracy in graph navigation tasks, and a 69.6% perplexity reduction in natural language modeling. We demonstrate that transfer effectiveness depends on computational complexity and linguistic differences, where grammar modifications support stronger transfer than vocabulary modifications. These findings provide compelling evidence that cross-lingual ability transfer is a robust mechanism which contributes to the quality of large language models in low-resource languages.

pdf bib
Enhancing Training Data Quality through Influence Scores for Generalizable Classification: A Case Study on Sexism Detection
Rabiraj Bandyopadhyay | Dennis Assenmacher | Jose Maria Alonso-Moral | Claudia Wagner

The quality of training data is crucial for the performance of supervised machine learning models. In particular, poor annotation quality and spurious correlations between labels and features in text datasets can significantly degrade model generalization. This problem is especially pronounced in harmful language detection, where prior studies have revealed major deficiencies in existing datasets. In this work, we design and test data selection methods based on learnability measures to improve dataset quality. Using a sexism dataset with counterfactuals designed to avoid spurious correlations, we show that pruning with EL2N and PVI scores can lead to significant performance increases and outperforms submodular and random selection. Our analysis reveals that in the presence of label imbalance, models rely on dataset shortcuts; in particular, easy-to-classify sexist instances and hard-to-classify non-sexist instances contain shortcuts. Pruning these instances leads to performance increases. Pruning hard-to-classify instances is also a promising strategy in general when shortcuts are not present.

pdf bib
Mitigating Label Length Bias in Large Language Models
Mario Sanz-Guerrero | Katharina von der Wense

Large language models (LLMs) are powerful zero- and few-shot learners. However, when predicting over a set of candidate options, LLMs suffer from label biases, and existing calibration methods overlook biases arising from multi-token class labels. We tackle an issue we call *label length bias*, where labels of different lengths are treated inconsistently, even after standard length normalization. To mitigate it, we propose *normalized contextual calibration* (NCC), an effective method that normalizes and calibrates predictions at the full-label level. NCC achieves statistically significant improvements over prior approaches across multiple datasets and models, with gains of up to 10% F1. Moreover, NCC extends bias mitigation to broader tasks such as multiple-choice question answering. Our analysis shows that, when combined with in-context learning, NCC is less sensitive to few-shot example selection, requires fewer examples for competitive performance, and produces more reliable confidence estimates. These findings highlight the importance of mitigating full-label biases to improve the performance and robustness of LLM-based methods, particularly in real-world applications where class labels naturally consist of multiple tokens.
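
A hedged, editorially added sketch of the general idea (length-normalized, full-label scoring with a contextual-calibration-style correction); `label_logprob` is a hypothetical model call, and the exact NCC formulation in the paper may differ.

```python
def calibrated_label_score(prompt, label_tokens, label_logprob):
    """Score a multi-token label: length-normalized log-probability of the full
    label under the prompt, corrected by its score under a content-free prompt."""
    score = label_logprob(prompt, label_tokens) / len(label_tokens)
    baseline = label_logprob("N/A", label_tokens) / len(label_tokens)
    return score - baseline

def predict(prompt, labels, label_logprob):
    # Pick the candidate label with the highest calibrated, length-normalized score.
    return max(labels, key=lambda lbl: calibrated_label_score(prompt, lbl, label_logprob))
```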

pdf bib
Social Bias in Popular Question-Answering Benchmarks
Angelie Kraft | Judith Simon | Sonja Schimmler

Question-answering (QA) and reading comprehension (RC) benchmarks are commonly used for assessing the capabilities of large language models (LLMs) to retrieve and reproduce knowledge. However, we demonstrate that popular QA and RC benchmarks do not cover questions about different demographics or regions in a representative way. We perform a content analysis of 30 benchmark papers and a quantitative analysis of 20 respective benchmark datasets to learn (1) who is involved in the benchmark creation, (2) whether the benchmarks exhibit social bias, or whether this is addressed or prevented, and (3) whether the demographics of the creators and annotators correspond to particular biases in the content. Most benchmark papers analyzed provide insufficient information about those involved in benchmark creation, particularly the annotators. Notably, just one (WinoGrande) explicitly reports measures taken to address social representation issues. Moreover, the data analysis revealed gender, religion, and geographic biases across a wide range of encyclopedic, commonsense, and scholarly benchmarks. Our work adds to the mounting criticism of AI evaluation practices and shines a light on biased benchmarks being a potential source of LLM bias by incentivizing biased inference heuristics.

pdf bib
ProofTeller: Exposing recency bias in LLM reasoning and its side effects on communication
Mayank Jobanputra | Alisa Kovtunova | Brisca Balthes | Fedor Grigoryevich Pogulskiy | Yifan Wang | Stefan Borgwardt | Vera Demberg

Large language models (LLMs) are increasingly applied in domains that demand reliable and interpretable reasoning. While formal methods can generate provably correct proofs, these proofs are often inaccessible to non-expert users. This raises a natural question: can LLMs, when given a verified proof, faithfully interpret its reasoning and communicate it clearly? We introduce ProofTeller, a benchmark that evaluates this ability across three tasks: (1) identifying key proof steps, (2) summarizing the reasoning, and (3) explaining the result in concise natural language. The benchmark covers three domains: _Biology_, _Drones_, and _Recipes_, representing scientific, safety-critical, and everyday reasoning scenarios. We find a consistent near-conclusion bias: LLMs tend to focus on steps closest to the final proof conclusion rather than on the most informative ones. A targeted human study confirms that explanations based on such steps are rated less appropriate for end users. These findings indicate that even when reasoning is provided, current LLMs face challenges in communicating key information in a useful manner, highlighting the need for LLMs that can communicate important details reliably.

pdf bib
Relation Extraction or Pattern Matching? Unravelling the Generalisation Limits of Language Models for Biographical RE
Varvara Arzt | Allan Hanbury | Michael Wiegand | Gabor Recski | Terra Blevins

Analysing the generalisation capabilities of relation extraction (RE) models is crucial for assessing whether they learn robust relational patterns or rely on spurious correlations. Our cross-dataset experiments find that RE models struggle with unseen data, even within similar domains. Notably, higher intra-dataset performance does not indicate better transferability, instead often signaling overfitting to dataset-specific artefacts. Our results also show that data quality, rather than lexical similarity, is key to robust transfer, and the choice of optimal adaptation strategy depends on the quality of data available: while fine-tuning yields the best cross-dataset performance with high-quality data, few-shot in-context learning (ICL) is more effective with noisier data. However, even in these cases, zero-shot baselines occasionally outperform all cross-dataset results. Structural issues in RE benchmarks, such as single-relation per sample constraints and non-standardised negative class definitions, further hinder model transferability. We release our dataset splits with sample IDs and code for reproducibility.

pdf bib
Whose story is it? Personalizing story generation by inferring author styles
Nischal Ashok Kumar | Chau Minh Pham | Mohit Iyyer | Andrew Lan

Personalization is critical for improving user experience in interactive writing and educational applications, yet remains understudied in story generation. We study the task of personalizing story generation, where our goal is to mimic an author’s writing style, given other stories written by them. We collect Mythos, a dataset of 3.6k stories from 112 authors, with an average of 16 stories per author, across five distinct sources reflecting diverse story-writing settings. We propose a two-stage pipeline for personalized story generation: first, we infer authors’ implicit writing characteristics and organize them into an Author Writing Sheet, which is validated by humans to be of high quality; second, we simulate the author’s persona using tailored persona descriptions and personalized story rules. We find that stories personalized using the Author Writing Sheet outperform a non-personalized baseline, achieving a 78% win-rate in capturing authors’ past style and 59% in similarity to ground-truth author stories. Human evaluation supports these findings and further highlights trends, such as Reddit stories being easier to personalize, and the Creativity and Language Use aspects of stories being easier to personalize than the Plot.

pdf bib
The Translation Barrier Hypothesis: Multilingual Generation with Large Language Models Suffers from Implicit Translation Failure
Niyati Bafna | Tianjian Li | Kenton Murray | David R. Mortensen | David Yarowsky | Hale Sirin | Daniel Khashabi

Multilingual generation with large language models (LLMs) is often of poor quality for mid- to low-resource languages, but the causes for this are not well-understood. We first demonstrate the existence of an implicit task-solving→translation pipeline for generation, whereby the model first solves the required task in a largely target-language-agnostic manner, and subsequently translates answer concepts into the intended target language. We hypothesize that the failure of the translation stage, despite task-solving success, is an important culprit for the observed low quality of final outputs, and formalize this as the translation barrier hypothesis. We quantify the extent to which either stage in the pipeline is responsible for final failure for a word translation task across 108 language pairs, and find that the translation barrier explains a dominant portion of error for a majority of language pairs, and is especially severe for low-resource target languages. Our results highlight an important bottleneck for end-to-end multilingual generation, relevant for future work seeking to improve multilinguality in LLMs.

pdf bib
TaCL-CoMoE: Task-adaptive Contrastive Learning with Cooperative Mixture of Experts for Multi-task Social Media Analysis
Wang Xingren | Hongde Liu | Liushanhong | Feiyang Meng | Chenyuan He | Senbin Zhu | Li Zechen | Yuxiang Jia

Social media has become a crucial platform for information dissemination and opinion expression. The massive and continuous generation of user content has given rise to various natural language processing tasks, such as sentiment analysis and topic classification. However, existing mainstream approaches typically focus on modeling individual tasks in isolation, lacking systematic exploration of collaborative modeling across multiple tasks. This neglects the inherent correlations among social media tasks, thereby limiting the model’s ability to fully comprehend and exploit the rich, multi-dimensional semantic information embedded in text. To address this challenge, we propose Task-adaptive Contrastive Learning with Cooperative Mixture of Experts (TaCL-CoMoE), a unified framework for social media multi-task learning. Specifically, we improve the gating mechanism by replacing the traditional softmax routing with sigmoid activation, enabling cooperative selection among multiple experts and mitigating the “expert monopoly” phenomenon. In addition, we introduce a task-adaptive contrastive learning strategy to further enhance the model’s ability to capture and distinguish semantic structures across different tasks. Experimental results on multiple public social media datasets demonstrate that TaCL-CoMoE consistently achieves state-of-the-art (SOTA) performance. The code is available at https://github.com/wxr2847/TaCL-CoMoE.
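
As a minimal illustration of the gating change described above (hypothetical shapes and names, not the authors' released implementation), a sigmoid router assigns each expert an independent score in (0, 1), so several experts can be selected cooperatively instead of competing for softmax probability mass:

```python
import torch
import torch.nn as nn

class CooperativeGate(nn.Module):
    """Sketch of sigmoid-based cooperative expert routing.

    With softmax routing, expert weights must sum to 1 and therefore compete;
    here each expert receives an independent score, which is one way to
    mitigate the "expert monopoly" phenomenon mentioned in the abstract.
    """
    def __init__(self, hidden_dim: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(hidden_dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.router(x)          # (batch, num_experts)
        return torch.sigmoid(logits)     # cooperative, non-competing weights


if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.randn(4, 256)              # toy hidden states
    weights = CooperativeGate(256, 8)(x)
    print(weights.sum(dim=-1))           # need not sum to 1, unlike softmax
```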

pdf bib
Towards Generalizable Generic Harmful Speech Datasets for Implicit Hate Speech Detection
Saad Almohaimeed | Saleh Almohaimeed | Damla Turgut | Ladislau Bölöni

Implicit hate speech has increasingly been recognized as a significant issue for social media platforms. While much of the research has traditionally focused on harmful speech in general, the need for generalizable techniques to detect veiled and subtle forms of hate has become increasingly pressing. Based on lexicon analysis, we hypothesize that implicit hate speech is already present in publicly available harmful speech datasets but may not have been explicitly recognized or labeled by annotators. Additionally, crowdsourced datasets are prone to mislabeling due to the complexity of the task and are often influenced by annotators’ subjective interpretations. In this paper, we propose an approach to address the detection of implicit hate speech and enhance generalizability across diverse datasets by leveraging existing harmful speech datasets. Our method comprises three key components: influential sample identification, reannotation, and augmentation using Llama-3 70B and GPT-4o. Experimental results demonstrate the effectiveness of our approach in improving implicit hate detection, achieving a +12.9-point F1 score improvement compared to the baseline.

pdf bib
SEAGraph: Unveiling the Whole Story of Paper Review Comments
Jianxiang Yu | Jiaqi Tan | Zichen Ding | Jiapeng Zhu | Jiahao Li | Yao Cheng | Qier Cui | Yunshi Lan | Yao Liu | Xiang Li

Peer review, as a cornerstone of scientific research, ensures the integrity and quality of scholarly work by providing authors with objective feedback for refinement. However, in the traditional peer review process, authors often receive vague or insufficiently detailed feedback, which provides limited assistance and leads to a more time-consuming review cycle. If authors can identify some specific weaknesses in their paper, they can not only address the reviewer’s concerns but also improve their work. This raises the critical question of how to enhance authors’ comprehension of review comments. In this paper, we present SEAGraph, a novel framework developed to clarify review comments by uncovering the underlying intentions behind them. We construct two types of graphs for each paper: the semantic mind graph, which captures the author’s thought process, and the hierarchical background graph, which delineates the research domains related to the paper. A retrieval method is then designed to extract relevant content from both graphs, facilitating coherent explanations for the review comments. Extensive experiments show that SEAGraph excels in review comment understanding tasks, offering significant benefits to authors. By bridging the gap between reviewers’ critiques and authors’ comprehension, SEAGraph contributes to a more efficient, transparent, and collaborative scientific publishing ecosystem. Our code is available at https://anonymous.4open.science/r/seagraph/.

pdf bib
A Comparative Study of Human-operated and AI-driven Guidance with a Teleoperated Mobile Robot
Ao Guo | Shota Mochizuki | Sanae Yamashita | Hoshimure Kenya | Jun Baba | Ryuichiro Higashinaka

Recent advances in large language models (LLMs) such as GPT-4o offer the potential for enhancing AI-driven robotic interactions, but their effectiveness in mobile tour guidance remains unexplored. This study investigates the differences between human-operated and AI-driven guidance at an aquarium using Teleco, a teleoperated mobile robot, in a real-world field experiment. A total of 277 guidance sessions were collected under two modes: human-operated, where the operator controlled all dialogue, actions, and movement, and AI-driven, where GPT-4o generated responses while the operator only controlled the robot’s actions and movement. Our results indicate that human-operated guidance places greater emphasis on visitor movement, spatial positioning during observation guidance, and empathetic expressions, whereas AI-driven guidance promotes conversational engagement by frequently prompting visitors to ask questions. In addition, we found that user behaviors, including users’ gaze patterns and vocabulary richness, also serve as valuable indicators reflecting their overall experience during guidance interactions. Furthermore, empathetic expression is recognized as the key differentiating factor between the two guidance modes, significantly influencing users’ overall experience.

pdf bib
R²-CoD: Understanding Text-Graph Complementarity in Relational Reasoning via Knowledge Co-Distillation
Zhen Wu | Ritam Dutt | Luke M. Breitfeller | Armineh Nourbakhsh | Siddharth Parekh | Carolyn Rose

Relational reasoning lies at the core of many NLP tasks, drawing on complementary signals from text and graphs. While prior research has investigated how to leverage this dual complementarity, a detailed and systematic understanding of text-graph interplay and its effect on hybrid models remains underexplored. We take an analysis-driven approach to investigate text–graph representation complementarity via a unified architecture that supports knowledge co-distillation (CoD). We explore five tasks involving relational reasoning that differ in how text and graph structures encode the information needed to solve that task. By tracking how these dual representations evolve during training, we uncover interpretable patterns of alignment and divergence, and provide insights into when and why their integration is beneficial.

pdf bib
ThaiOCRBench: A Task-Diverse Benchmark for Vision-Language Understanding in Thai
Surapon Nonesung | Teetouch Jaknamon | Sirinya Chaiophat | Natapong Nitarach | Chanakan Wittayasakpan | Warit Sirichotedumrong | Adisai Na-Thalang | Kunat Pipatanakul

We present ThaiOCRBench, the first comprehensive benchmark for evaluating vision-language models (VLMs) on Thai text-rich visual understanding tasks. Despite recent progress in multimodal modeling, existing benchmarks predominantly focus on high-resource languages, leaving Thai underrepresented, especially in tasks requiring document structure understanding. ThaiOCRBench addresses this gap by offering a diverse, human-annotated dataset comprising 2,808 samples across 13 task categories. We evaluate a wide range of state-of-the-art VLMs in a zero-shot setting, spanning both proprietary and open-source systems. Results show a significant performance gap, with proprietary models (e.g., Gemini 2.5 Pro) outperforming open-source counterparts. Notably, fine-grained text recognition and handwritten content extraction exhibit the steepest performance drops among open-source models. Through detailed error analysis, we identify key challenges such as language bias, structural mismatch, and hallucinated content. ThaiOCRBench provides a standardized framework for assessing VLMs in low-resource, script-complex settings, and provides actionable insights for improving Thai-language document understanding.

pdf bib
An Adversary-Resistant Multi-Agent LLM System via Credibility Scoring
Sana Ebrahimi | Mohsen Dehghankar | Abolfazl Asudeh

While multi-agent LLM systems show strong capabilities in various domains, they are highly vulnerable to adversarial and low-performing agents. To resolve this issue, in this paper, we introduce a general and adversary-resistant multi-agent LLM framework based on credibility scoring. We model the collaborative query-answering process as an iterative game, where the agents communicate and contribute to a final system output. Our system associates a credibility score with each agent, which is used when aggregating the team outputs. The credibility scores are learned gradually based on the past contributions of each agent in query answering. Our experiments across multiple tasks and settings demonstrate our system’s effectiveness in mitigating adversarial influence and enhancing the resilience of multi-agent cooperation, even in adversary-majority settings.
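
A minimal sketch of the credibility-weighted aggregation idea (the update rule, learning rate, and interfaces below are illustrative assumptions, not the authors' exact formulation): answers are aggregated by credibility-weighted voting, and each agent's score drifts toward or away from agreement with the system output over successive rounds.

```python
from collections import defaultdict

def aggregate(answers, credibility):
    """Pick the answer with the highest total credibility-weighted support."""
    votes = defaultdict(float)
    for agent, answer in answers.items():
        votes[answer] += credibility[agent]
    return max(votes, key=votes.get)

def update_credibility(credibility, answers, system_answer, lr=0.1):
    """Nudge each agent's score toward 1 or 0 depending on whether it agreed
    with the aggregated output (a simple proxy for past contributions)."""
    for agent, answer in answers.items():
        target = 1.0 if answer == system_answer else 0.0
        credibility[agent] += lr * (target - credibility[agent])
    return credibility

if __name__ == "__main__":
    credibility = {"agent_1": 0.5, "agent_2": 0.5, "adversary": 0.5}
    for _ in range(5):  # iterative query-answering rounds
        answers = {"agent_1": "Paris", "agent_2": "Paris", "adversary": "Berlin"}
        final = aggregate(answers, credibility)
        credibility = update_credibility(credibility, answers, final)
    print(final, credibility)  # the adversary's weight decays across rounds
```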

pdf bib
Are LLMs Rigorous Logical Reasoners? Empowering Natural Language Proof Generation by Stepwise Decoding with Contrastive Learning
Ying Su | Mingwen Liu | Zhijiang Guo

Logical reasoning is a pivotal component in the field of artificial intelligence. Proof planning, particularly in contexts requiring the validation of explanation accuracy, continues to present challenges. The recent advancement of large language models (LLMs) has led to significant progress in natural language proof planning, evolving from one-stage generators to more complex three-stage systems that include additional searchers or verifiers. While these assisted methods improve the quality of generated results, they also introduce increased search efforts and computational costs. Furthermore, the generative process itself remains underexplored. In this study, we propose a stepwise decoding approach augmented by contrastive learning to address two common errors encountered during the LLM generator’s decoding process. We fine-tune the language model using both vanilla and enhanced hard negatives to mitigate these decoding errors. Empirical results demonstrate the effectiveness of our strategy. Additionally, our further analysis reveals that even larger LLMs still struggle to generate rigorous logical chains.

pdf bib
NyayaRAG: Realistic Legal Judgment Prediction with RAG under the Indian Common Law System
Shubham Kumar Nigam | Balaramamahanthi Deepak Patnaik | Shivam Mishra | Ajay Varghese Thomas | Noel Shallum | Kripabandhu Ghosh | Arnab Bhattacharya

Legal Judgment Prediction (LJP) has emerged as a key area in AI for law, aiming to automate judicial outcome forecasting and enhance interpretability in legal reasoning. While previous approaches in the Indian context have relied on internal case content such as facts, issues, and reasoning, they often overlook a core element of common law systems, which is reliance on statutory provisions and judicial precedents. In this work, we propose NyayaRAG, a Retrieval-Augmented Generation (RAG) framework that simulates realistic courtroom scenarios by providing models with factual case descriptions, relevant legal statutes, and semantically retrieved prior cases. NyayaRAG evaluates the effectiveness of these combined inputs in predicting court decisions and generating legal explanations using a domain-specific pipeline tailored to the Indian legal system. We assess performance across various input configurations using both standard lexical and semantic metrics as well as LLM-based evaluators such as G-Eval. Our results show that augmenting factual inputs with structured legal knowledge significantly improves both predictive accuracy and explanation quality.

pdf bib
Exploring Working Memory Capacity in LLMs: From Stressors to Human-Inspired Strategies
Eunjin Hong | Sumin Cho | Juae Kim

Large language models (LLMs) exhibit inherent limitations in working memory, which often affect their overall capabilities. However, prior studies have largely focused on describing such constraints without identifying their causes or providing practical strategies to cope with them. In this paper, we investigate the limited working memory capacity of LLMs through a series of empirical studies. Specifically, we examine the factors involved in the limited capacity and explore strategies to make more effective use of it. Our analysis shows that the number and difficulty of tasks in a single input largely strain the working memory of LLMs. In response, we design a cognitive marker consisting of simple token sequences theoretically grounded in cognitive science. Further analyses show that the cognitive marker reduces the overall prediction difficulty and uncertainty for the models to process the input, and its effectiveness is confirmed across various evaluation settings. Overall, our study incorporates cognitively motivated perspectives into the analysis of model behavior and highlights the need for deeper exploration of working memory in LLMs.

pdf bib
CLASSER: Cross-lingual Annotation Projection enhancement through Script Similarity for Fine-grained Named Entity Recognition
Prachuryya Kaushik | Ashish Anand

We introduce CLASSER, a cross-lingual annotation projection framework enhanced through script similarity, to create fine-grained named entity recognition (FgNER) datasets for low-resource languages. Manual annotation for named entity recognition (NER) is expensive, and distant supervision often produces noisy data that are largely limited to high-resource languages. CLASSER employs a two-stage process: first, annotations are projected from high-resource NER datasets to the target language using source-to-target parallel corpora and a projection tool built on a multilingual encoder; they are then refined by leveraging datasets in script-similar languages. We apply this to five low-resource Indian languages: *Assamese*, *Marathi*, *Nepali*, *Sanskrit*, and *Bodo*, a vulnerable language. The resulting dataset comprises 1.8M sentences, 2.6M entity mentions and 24.7M tokens. Through rigorous analyses, the effectiveness of our method and the high quality of the resulting dataset are ascertained with F1 score improvements of 26% in Marathi and 46% in Sanskrit over the current state-of-the-art. We further extend our analyses to zero-shot and cross-lingual settings, systematically investigating the impact of script similarity and multilingualism on cross-lingual FgNER performance. The dataset is publicly available at [huggingface.co/datasets/prachuryyaIITG/CLASSER](https://huggingface.co/datasets/prachuryyaIITG/CLASSER).

pdf bib
On the Interplay between Positional Encodings, Morphological Complexity, and Word Order Flexibility
Kushal Tatariya | Wessel Poelman | Miryam de Lhoneux

Language model architectures are predominantly first created for English and afterwards applied to other languages. This can lead to problems for languages that are structurally different from English. We study one specific architectural choice: positional encodings. We do this through the lens of the trade-off hypothesis: the supposed interplay between morphological complexity and word order flexibility. This hypothesis states there exists a trade-off between the two: a more morphologically complex language can have a more flexible word order, and vice-versa. Positional encodings are a direct target to investigate the implications of this hypothesis in relation to language modelling. We pre-train and evaluate three monolingual model variants with absolute, relative and no position encodings for seven typologically diverse languages and evaluate on four downstream tasks. We fail to find a consistent trend with various proxies for morphological complexity and word order flexibility. Our work shows that the choice of tasks, languages, and metrics is essential for drawing stable conclusions.

pdf bib
Doppelganger-JC: Benchmarking the LLMs’ Understanding of Cross-Lingual Homographs between Japanese and Chinese
Yuka Kitamura | Jiahao Huang | Akiko Aizawa

The recent development of LLMs is remarkable, but they still struggle to handle cross-lingual homographs effectively. This research focuses on cross-lingual homographs between Japanese and Chinese—the spellings of the words are the same, but their meanings differ entirely between the two languages. We introduce a new benchmark dataset named Doppelganger-JC to evaluate the ability of LLMs to handle them correctly. We provide three kinds of tasks for evaluation: word meaning tasks, word meaning in context tasks, and translation tasks. Through the evaluation, we found that LLMs’ performance in understanding and using homographs is significantly inferior to that of humans. We point out the significant issue of the homograph shortcut, which means that the model tends to preferentially interpret the cross-lingual homographs in its easy-to-understand language. We investigate the potential cause of this homograph shortcut from a linguistic perspective and posit that it is difficult for LLMs to recognize a word as a cross-lingual homograph, especially when it shares the same part-of-speech (POS) in both languages. The data and code are publicly available here: https://github.com/0017-alt/Doppelganger-JC.git.

pdf bib
Don’t Take it Literally! Idiom-aware Vietnamese Translation via In-context Learning
Luan Thanh Nguyen | Parisa Kordjamshidi

The translation of idiomatic expressions often results in misunderstandings and inaccuracies, affecting everyday communication as well as machine translation systems. This paper introduces Idiom-aware Vietnamese Translation (IDiAT), a new framework for the evaluation of idiomatic translation for Vietnamese, along with state-of-the-art results for this task. We collect and curate a high-quality Vietnamese-English idiom set that serves as a resource for in-context learning (ICL). IDiAT’s evaluation benchmark includes both idiomatic and non-idiomatic text pairs to assess general translation quality and idiomatic translation performance. We leverage ICL in large language models to augment few-shot demonstrations with idiom and topic descriptions and consequently improve the translation accuracy. Empirical results demonstrate that our IDiAT-based ICL outperforms traditional supervised methods using only a few data samples. Multiple evaluations confirm the effectiveness of our proposed approach. Though focusing on the Vietnamese language, our approach advances idiomatic translation and contributes to the development of culturally aware translation systems, paving the way for future research in low-resource languages. The experimental materials are publicly available.

pdf bib
LLMs Do Not See Age: Assessing Demographic Bias in Automated Systematic Review Synthesis
Favour Y. Aghaebe | Elizabeth A Williams | Tanefa Apekey | Nafise Sadat Moosavi

Clinical interventions often hinge on age: medications and procedures safe for adults may be harmful to children or ineffective for older adults. However, as language models are increasingly integrated into biomedical evidence synthesis workflows, it remains uncertain whether these systems preserve such crucial demographic distinctions. To address this gap, we evaluate how well state-of-the-art language models retain age-related information when generating abstractive summaries of biomedical studies. We construct DemogSummary, a novel age-stratified dataset of systematic review primary studies, covering child, adult, and older adult populations. We evaluate three prominent summarisation-capable LLMs, Qwen (open-source), Longformer (open-source) and GPT-4.1 Nano (proprietary), using both standard metrics and a newly proposed Demographic Salience Score (DSS), which quantifies age-related entity retention and hallucination. Our results reveal systematic disparities across models and age groups: demographic fidelity is lowest for adult-focused summaries, and underrepresented populations are more prone to hallucinations. These findings highlight the limitations of current LLMs in faithful and bias-free summarisation and point to the need for fairness-aware evaluation frameworks and summarisation pipelines in biomedical NLP.

pdf bib
KERLQA: Knowledge-Enhanced Reinforcement Learning for Question Answering in Low-resource Languages
Sello Ralethe | Jan Buys

Question answering in low-resource languages faces critical challenges when models encounter questions beyond their knowledge boundaries, often producing confident but incorrect answers. We propose Knowledge-Enhanced Reinforcement Learning for Question Answering (KERLQA), a novel approach that combines knowledge graph integration with reinforcement learning to enable principled abstention decisions. Unlike existing refusal-tuned methods that make binary decisions based solely on internal confidence, KERLQA implements a three-way decision process: answer with internal knowledge, answer with external knowledge assistance, or abstain. Using a composite reward function that jointly optimizes for correctness, appropriate abstention, and efficient knowledge utilization, we train policies via PPO and DPO with dynamic calibration for low-resource settings. Experiments on CommonsenseQA and OpenBookQA across English and four South African languages show KERLQA achieves improved F1 scores, with up to 6.2 point improvements in low-resource languages. Our analysis reveals that KERLQA reduces false positive abstention rates by 30% while expanding the boundary of answerable questions through external knowledge integration.

pdf bib
The Visual Counter Turing Test (VCT²): A Benchmark for Evaluating AI-Generated Image Detection and the Visual AI Index (V_AI)
Nasrin Imanpour | Abhilekh Borah | Shashwat Bajpai | Subhankar Ghosh | Sainath Reddy Sankepally | Hasnat Md Abdullah | Nishoak Kosaraju | Shreyas Dixit | Ashhar Aziz | Shwetangshu Biswas | Vinija Jain | Aman Chadha | Song Wang | Amit Sheth | Amitava Das

The rapid progress and widespread availability of text-to-image (T2I) generation models have heightened concerns about the misuse of AI-generated visuals, particularly in the context of misinformation campaigns. Existing AI-generated image detection (AGID) methods often overfit to known generators and falter on outputs from newer or unseen models. To systematically address this generalization gap, we introduce the Visual Counter Turing Test (VCT²), a comprehensive benchmark of 166,000 images, comprising both real and synthetic prompt-image pairs produced by six state-of-the-art (SoTA) T2I systems: Stable Diffusion 2.1, SDXL, SD3 Medium, SD3.5 Large, DALL·E 3, and Midjourney 6. We curate two distinct subsets: COCO_AI, featuring structured captions from MS COCO, and Twitter_AI, containing narrative-style tweets from The New York Times. Under a unified zero-shot evaluation, we benchmark 17 leading AGID models and observe alarmingly low detection accuracy: 58% on COCO_AI and 58.34% on Twitter_AI. To transcend binary classification, we propose the Visual AI Index (V_AI), an interpretable, prompt-agnostic realism metric based on twelve low-level visual features, enabling us to quantify and rank the perceptual quality of generated outputs with greater nuance. Correlation analysis reveals a moderate inverse relationship between V_AI and detection accuracy: a Pearson ρ of -0.532 on COCO_AI and -0.503 on Twitter_AI, suggesting that more visually realistic images tend to be harder to detect, a trend observed consistently across generators. We release COCO_AI and Twitter_AI to catalyze future advances in robust AGID and perceptual realism assessment.

pdf bib
Hildoc: Leveraging Hilbert Curve Representation for Accurate and Efficient Document Retrieval
Muhammad AL-Qurishi | Zhaozhi Qian | Faroq AL-Tam | Riad Souissi

Document retrieval is a critical challenge in information retrieval systems, where the goal is to efficiently retrieve relevant documents in response to a given query. Dense retrieval methods, which utilize vector embeddings to represent semantic information, require effective indexing to ensure fast and accurate retrieval. Existing methods, such as MEVI, have attempted to address this by using hierarchical K-Means for clustering, but they often face limitations in computational efficiency and retrieval accuracy. In this paper, we introduce the Hildoc Index, a novel document indexing approach that leverages the Hilbert Curve to map document embeddings onto a one-dimensional space. This innovative representation facilitates efficient clustering using a 1D quantile-based algorithm, ensuring uniform partition sizes and preserving the inherent structure of the data. As a result, Hildoc Index not only reduces training complexity but also enhances retrieval accuracy and speed during inference. Our method can be seamlessly integrated into both dense retrieval systems and hybrid ensemble systems. Through comprehensive experiments on standard benchmarks like MSMARCO Passage and Natural Questions, we demonstrate that the Hildoc Index significantly outperforms the current state-of-the-art MEVI in terms of both retrieval speed and recall. These results underscore the Hildoc Index as a solution for fast and accurate dense document retrieval.
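
A rough sketch of the indexing idea (assuming the third-party `hilbertcurve` package and an illustrative 4-bit quantization; this is not the released Hildoc pipeline): embeddings are quantized to an integer grid, mapped to a one-dimensional Hilbert key, and then split into uniform-size partitions with a quantile cut.

```python
import numpy as np
from hilbertcurve.hilbertcurve import HilbertCurve  # assumed third-party dependency

def hilbert_keys(embeddings: np.ndarray, bits: int = 4) -> np.ndarray:
    """Quantize embeddings onto an integer grid and map each point to its
    position along a Hilbert curve, giving a locality-preserving 1D key."""
    lo, hi = embeddings.min(0), embeddings.max(0)
    grid = ((embeddings - lo) / (hi - lo + 1e-9) * (2**bits - 1)).astype(int)
    curve = HilbertCurve(p=bits, n=embeddings.shape[1])
    return np.array(curve.distances_from_points(grid.tolist()))

def quantile_partition(keys: np.ndarray, n_clusters: int) -> np.ndarray:
    """1D quantile-based clustering: cut the sorted keys into equal-size bins."""
    edges = np.quantile(keys, np.linspace(0, 1, n_clusters + 1)[1:-1])
    return np.digitize(keys, edges)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    docs = rng.normal(size=(1000, 8))   # toy 8-dimensional document embeddings
    clusters = quantile_partition(hilbert_keys(docs), n_clusters=16)
    print(np.bincount(clusters))        # roughly uniform partition sizes
```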

pdf bib
Rethinking what matters: Effective and Robust Multilingual Realignment for Low-Resource Languages
Quang Phuoc Nguyen | David Anugraha | Félix Gaschi | Jun Bin Cheng | En-Shiun Annie Lee

Realignment is a promising strategy to improve cross-lingual transfer in multilingual language models. However, empirical results are mixed and often unreliable, particularly for typologically distant or low-resource languages (LRLs) compared to English. Moreover, word realignment tools often rely on high-quality parallel data, which can be scarce or noisy for many LRLs. In this work, we conduct an extensive empirical study to investigate whether realignment truly benefits from using all available languages, or if strategically selected subsets can offer comparable or even improved cross-lingual transfer, and study the impact on LRLs. Our controlled experiments show that realignment can be particularly effective for LRLs and that using carefully selected, linguistically diverse subsets can match full multilingual alignment, and even outperform it for unseen LRLs. This indicates that effective realignment does not require exhaustive language coverage and can reduce data collection overhead, while remaining both efficient and robust when guided by informed language selection.

pdf bib
Decode Like a Clinician: Enhancing LLM Fine-Tuning with Temporal Structured Data Representation
Daniel Fadlon | David Dov | Aviya Bennett | Daphna Heller-Miron | Gad Levy | Kfir Bar | Ahuva Weiss-Meilik

Predictive modeling of hospital patient data is challenging due to its structured format, irregular timing of measurements, and variation in data representation across institutions. While traditional models often struggle with such inconsistencies, Large Language Models (LLMs) offer a flexible alternative. In this work, we propose a method for verbalizing structured Electronic Health Records (EHRs) into a format suitable for LLMs and systematically examine how to include time-stamped clinical observations—such as lab tests and vital signs—from previous time points in the prompt. We study how different ways of structuring this temporal information affect predictive performance, and whether fine-tuning alone enables LLMs to effectively reason over such data. Evaluated on two real-world hospital datasets and MIMIC-IV, our approach achieves strong in-hospital and cross-hospital performance, laying the groundwork for more generalizable clinical modeling.
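
A small sketch of what verbalizing time-stamped structured observations into a prompt can look like (the record fields, grouping-by-timestamp layout, and task sentence are illustrative assumptions, not the paper's exact template):

```python
from datetime import datetime

# Hypothetical time-stamped observations for one patient (synthetic, not real data).
observations = [
    {"time": "2024-03-01 08:00", "name": "creatinine", "value": 1.1, "unit": "mg/dL"},
    {"time": "2024-03-01 20:00", "name": "heart rate", "value": 96, "unit": "bpm"},
    {"time": "2024-03-02 08:00", "name": "creatinine", "value": 1.6, "unit": "mg/dL"},
]

def verbalize(obs):
    """Render observations chronologically, grouped by timestamp, so the model
    sees explicit temporal structure rather than a flat table."""
    obs = sorted(obs, key=lambda o: datetime.strptime(o["time"], "%Y-%m-%d %H:%M"))
    lines, current = [], None
    for o in obs:
        if o["time"] != current:
            current = o["time"]
            lines.append(f"[{current}]")
        lines.append(f"  {o['name']}: {o['value']} {o['unit']}")
    return "\n".join(lines)

if __name__ == "__main__":
    prompt = ("Patient record:\n" + verbalize(observations) +
              "\nQuestion: will the patient be readmitted within 30 days? Answer yes or no.")
    print(prompt)
```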

pdf bib
Curved Worlds, Clear Boundaries: Generalizing Speech Deepfake Detection using Hyperbolic and Spherical Geometry Spaces
Farhan Sheth | Girish | Mohd Mujtaba Akhtar | Muskaan Singh

In this work, we address the challenge of generalizable audio deepfake detection (ADD) across diverse speech synthesis paradigms—including conventional text-to-speech (TTS) systems and modern diffusion or flow-matching (FM) based generators. Prior work has mostly targeted individual synthesis families and often fails to generalize across paradigms due to overfitting to generation-specific artifacts. We hypothesize that synthetic speech, irrespective of its generative origin, leaves behind shared structural distortions in the embedding space that can be aligned through geometry-aware modeling. To this end, we propose RHYME, a unified detection framework that fuses utterance-level embeddings from diverse pretrained speech encoders using non-Euclidean projections. RHYME maps representations into hyperbolic and spherical manifolds—where hyperbolic geometry excels at modeling hierarchical generator families, and spherical projections capture angular, energy-invariant cues such as periodic vocoder artifacts. The fused representation is obtained via Riemannian barycentric averaging, enabling synthesis invariant alignment. RHYME outperforms individual PTMs and homogeneous fusion baselines, achieving top performance and setting new state-of-the-art in cross-paradigm ADD.

pdf bib
MELAC: Massive Evaluation of Large Language Models with Alignment of Culture in Persian Language
Farhan Farsi | Farnaz Aghababaloo | Shahriar Shariati Motlagh | Parsa Ghofrani | MohammadAli SadraeiJavaheri | Shayan Bali | Amir Hossein Shabani | Farbod Bijary | Ghazal Zamaninejad | AmirMohammad Salehoof | Saeedeh Momtazi

As large language models (LLMs) become increasingly embedded in our daily lives, evaluating their quality and reliability across diverse contexts has become essential. While comprehensive benchmarks exist for assessing LLM performance in English, there remains a significant gap in evaluation resources for other languages. Moreover, because most LLMs are trained primarily on data rooted in European and American cultures, they often lack familiarity with non-Western cultural contexts. To address this limitation, our study focuses on the Persian language and Iranian culture. We introduce 19 new evaluation datasets specifically designed to assess LLMs on topics such as Iranian law, Persian grammar, Persian idioms, and university entrance exams. Using these datasets, we benchmarked 41 prominent LLMs, aiming to bridge the existing cultural and linguistic evaluation gap in the field. The evaluation results are publicly available on our live leaderboard: https://huggingface.co/spaces/opll-org/Open-Persian-LLM-Leaderboard

pdf bib
Atomic Consistency Preference Optimization for Long-Form Question Answering
Jingfeng Chen | Raghuveer Thirukovalluru | Junlin Wang | Kaiwei Luo | Bhuwan Dhingra

Large Language Models (LLMs) often produce factoid hallucinations - plausible yet incorrect answers. A common mitigation strategy is model alignment, which improves factual accuracy by training on curated (factual, non-factual) pairs. However, this approach often relies on a stronger model (e.g., GPT-4) or an external knowledge base to assess factual correctness that may not always be accessible. Addressing this, we propose Atomic Consistency Preference Optimization (ACPO), a self-supervised preference-tuning method that enhances factual accuracy without external supervision. ACPO leverages atomic consistency signals (i.e., the agreement of individual facts across multiple stochastic responses) to identify high- and low-quality data pairs for model alignment. Despite being fully self-supervised, ACPO outperforms the strong supervised alignment baseline by 1.95 points averaged across Phi-3 and Llama3 on the LongFact and BioGen datasets, demonstrating its effectiveness in improving factual reliability without relying on external models or knowledge bases.
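
A toy sketch of the atomic consistency signal (the sentence-level fact splitter below is a stand-in for a real atomic-fact extractor, and the ranking rule is an illustrative simplification): sample several stochastic responses, score each response by how often its facts recur in the other samples, and pair the most and least consistent responses for preference tuning.

```python
def atomic_facts(response: str) -> list[str]:
    """Stand-in for a real atomic-fact extractor: one 'fact' per sentence."""
    return [s.strip() for s in response.split(".") if s.strip()]

def consistency_scores(responses: list[str]) -> list[float]:
    """Score each response by the fraction of other samples supporting its facts."""
    fact_sets = [set(atomic_facts(r)) for r in responses]
    scores = []
    for i, facts in enumerate(fact_sets):
        others = [fs for j, fs in enumerate(fact_sets) if j != i]
        support = [sum(f in fs for fs in others) / len(others) for f in facts]
        scores.append(sum(support) / len(support) if support else 0.0)
    return scores

def preference_pair(responses: list[str]) -> tuple[str, str]:
    """Chosen = most consistent sample, rejected = least consistent sample."""
    ranked = sorted(zip(consistency_scores(responses), responses))
    return ranked[-1][1], ranked[0][1]

if __name__ == "__main__":
    samples = [
        "Marie Curie won two Nobel Prizes. She was born in Warsaw.",
        "Marie Curie won two Nobel Prizes. She was born in Paris.",
        "Marie Curie won two Nobel Prizes. She was born in Warsaw.",
    ]
    chosen, rejected = preference_pair(samples)
    print("chosen:", chosen)
    print("rejected:", rejected)
```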

pdf bib
Breaking Bad: Norms for Valence, Arousal, and Dominance for over 10k English Multiword Expressions
Saif M. Mohammad

Factor analysis studies have shown that the primary dimensions of word meaning are Valence (V), Arousal (A), and Dominance (D). Existing lexicons such as the NRC VAD Lexicon, published in 2018, include VAD association ratings for words. Here, we present a complement to it, which has **human ratings of valence, arousal, and dominance for ∼10k English Multiword Expressions (MWEs) and their constituent words**. We also increase the coverage of unigrams, especially words that have become more common since 2018. In all, the new **NRC VAD Lexicon v2 now has entries for ∼10k MWEs and ∼25k words, in addition to the entries in v1**. We show that the associations are highly reliable. We use the lexicon to examine emotional characteristics of MWEs, including: (1) the degree to which MWEs (idioms, noun compounds, and verb particle constructions) exhibit strong emotionality; and (2) the degree of emotional compositionality in MWEs. The lexicon enables a wide variety of research in NLP, Psychology, Public Health, Digital Humanities, and Social Sciences. The NRC VAD Lexicon v2 is freely available through the project webpage: http://saifmohammad.com/WebPages/nrc-vad.html

pdf bib
A Multimodal Recaptioning Framework to Account for Perceptual Diversity Across Languages in Vision-Language Modeling
Kyle Buettner | Jacob T. Emmerson | Adriana Kovashka

When captioning an image, people describe objects in diverse ways, such as by using different terms and/or including details that are perceptually noteworthy to them. Descriptions can be especially unique across languages and cultures. Modern vision-language models (VLMs) gain understanding of images with text in different languages often through training on machine translations of English captions. However, this process relies on input content written from the perception of English speakers, leading to a perceptual bias. In this work, we outline a framework to address this bias. We specifically use a small amount of native speaker data, nearest-neighbor example guidance, and multimodal LLM reasoning to augment captions to better reflect descriptions in a target language. When adding the resulting rewrites to multilingual CLIP finetuning, we improve on German and Japanese text-image retrieval case studies (up to +3.5 mean recall, +4.4 on native vs. translation errors). We also propose a mechanism to build understanding of object description variation across languages, and offer insights into cross-dataset and cross-language generalization.

pdf bib
Characterizing Mamba’s Selective Memory using Auto-Encoders
Tamanna Hossain | Robert L. Logan Iv | Chandrasekhara Ganesh Jagadeesan | Sameer Singh | Joel R. Tetreault | Alejandro Jaimes

State space models (SSMs) are a promising alternative to transformers for language modeling because they use fixed memory during inference. However, this fixed memory usage requires some information loss in the hidden state when processing long sequences. While prior work has studied the sequence length at which this information loss occurs, it does not characterize the types of information SSM language models (LMs) tend to forget. In this paper, we address this knowledge gap by identifying the types of tokens (e.g., parts of speech, named entities) and sequences (e.g., code, math problems) that are more frequently forgotten by SSM LMs. We achieve this by training an auto-encoder to reconstruct sequences from the SSM’s hidden state, and measure information loss by comparing inputs with their reconstructions. We perform experiments using the Mamba family of SSM LMs (130M–1.4B) on sequences ranging from 4–256 tokens. Our results show significantly higher rates of information loss on math-related tokens (e.g., numbers, variables), mentions of organization entities, and alternative dialects to Standard American English. We then examine the frequency that these tokens appear in Mamba’s pretraining data and find that less prevalent tokens tend to be the ones Mamba is most likely to forget. By identifying these patterns, our work provides clear direction for future research to develop methods that better control Mamba’s ability to retain important information.

pdf bib
Task-Aligned Tool Recommendation for Large Language Models
Hang Gao | Yongfeng Zhang

Augmenting Large Language Models (LLMs) with external tools has significantly enhanced their capacity to solve complex problems. However, despite ongoing advancements in the parsing capabilities of LLMs, incorporating all available tools simultaneously in the prompt remains impractical due to the vast number of external tools. Consequently, it is essential to provide LLMs with a precise set of tools tailored to the specific task, considering both quantity and quality. Current tool retrieval methods primarily focus on refining the ranking list of tools and directly packaging a fixed number of top-ranked tools as the tool set. However, these approaches often fail to equip LLMs with the optimal set of tools prior to execution, since the optimal number of tools varies across tasks; the result is redundant or unsuitable tools that impede immediate access to the most relevant ones. This paper addresses the challenge of recommending precise toolsets for LLMs. We introduce the problem of tool recommendation, define its scope, and propose a novel Precision-driven Tool Recommendation (PTR) approach. PTR captures an initial, concise set of tools by leveraging historical tool bundle usage and dynamically adjusts the tool set by performing tool matching, culminating in a multi-view-based tool addition. Additionally, we present a new dataset, RecTools, and a metric, TRACC, designed to evaluate the effectiveness of tool recommendation for LLMs. We further validate our design choices through comprehensive experiments, demonstrating promising accuracy across two open benchmarks and our RecTools dataset.

pdf bib
INTERCHART: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information
Anirudh Iyengar Kaniyar Narayana Iyengar | Srija Mukhopadhyay | Adnan Qidwai | Shubhankar Singh | Dan Roth | Vivek Gupta

We introduce InterChart, a diagnostic benchmark that evaluates how well vision-language models (VLMs) reason across multiple related charts, a task central to real-world applications such as scientific reporting, financial analysis, and public policy dashboards. Unlike prior benchmarks focusing on isolated, visually uniform charts, InterChart challenges models with diverse question types ranging from entity inference and trend correlation to numerical estimation and abstract multi-step reasoning grounded in 2–3 thematically or structurally related charts. We organize the benchmark into three tiers of increasing difficulty: (1) factual reasoning over individual charts, (2) integrative analysis across synthetically aligned chart sets, and (3) semantic inference over visually complex, real-world chart pairs. Our evaluation of state-of-the-art open- and closed-source VLMs reveals consistent and steep accuracy declines as chart complexity increases. We find that models perform better when we decompose multi-entity charts into simpler visual units, underscoring their struggles with cross-chart integration. By exposing these systematic limitations, InterChart provides a rigorous framework for advancing multimodal reasoning in complex, multi-visual environments.

pdf bib
LangCompress: Language-Aware Compression of Large Language Models
Dieu-Hien Nguyen | Nguyen-Khang Le | Truong Dinh Do | Le-Minh Nguyen

Large Language Models (LLMs) demonstrate strong multilingual capabilities but are costly to deploy due to their size and computational demands. To mitigate this, compression techniques such as pruning and quantization are widely used. However, these methods face two key limitations: (1) they assume access to high-quality instruction or calibration data, which is often unavailable for low-resource languages; and (2) they aim to preserve multilingual generality, making them inefficient for language-specific applications. We introduce LangCompress, a language-aware compression framework that enhances existing compression methods for targeted deployment. LangCompress is method-agnostic and improves state-of-the-art pruning and quantization approaches. It features two core components: an iterative self-supervised pipeline for generating instruction data in the target language, and a vocabulary simplification strategy that reduces the LM head to focus on key tokens. Experiments on perplexity, translation, and summarization tasks show that LangCompress improves performance in the target language. The code and data are publicly available.

pdf bib
The Confidence Paradox: Can LLM Know When It’s Wrong?
Sahil Tripathi | MD Tabrez Nafis | Imran Hussain | Jiechao Gao

Document Visual Question Answering (DocVQA) models often produce overconfident or ethically misaligned responses, especially under uncertainty. Existing models like LayoutLMv3, UDOP, and DONUT focus on accuracy but lack ethical calibration. We propose HonestVQA, a model-agnostic, self-supervised framework that aligns model confidence with correctness using weighted loss and contrastive learning. We introduce two new metrics—Honesty Score (H-Score) and Ethical Confidence Index (ECI)—to evaluate ethical alignment. HonestVQA improves accuracy and F1 by up to 4.3% across SpDocVQA, InfographicsVQA, and SROIE datasets, while reducing overconfidence. It also generalizes well across domains, achieving 78.9% accuracy and 76.1% F1-score.

pdf bib
DharmaBench: Evaluating Language Models on Buddhist Texts in Sanskrit and Tibetan
Kai Golan Hashiloni | Shay Cohen | Asaf Shina | Jingyi Yang | Orr Meir Zwebner | Nicola Bajetta | Guy Bilitski | Rebecca Sundén | Guy Maduel | Ryan Conlon | Ari Barzilai | Daniel Mass | Shanshan Jia | Aviv Naaman | Sonam Choden | Sonam Jamtsho | Yadi Qu | Harunaga Isaacson | Dorji Wangchuk | Shai Fine | Orna Almogi | Kfir Bar

We assess the capabilities of large language models on tasks involving Buddhist texts written in Sanskrit and Classical Tibetan—two typologically distinct, low-resource historical languages. To this end, we introduce DharmaBench, a benchmark suite comprising 13 classification and detection tasks grounded in Buddhist textual traditions: six in Sanskrit and seven in Tibetan, with four shared across both. The tasks are curated from scratch, tailored to the linguistic and cultural characteristics of each language. We evaluate a range of models, from proprietary systems like GPT-4o to smaller, domain-specific open-weight models, analyzing their performance across tasks and languages. All datasets and code are publicly released, under the CC-BY-4 License and the Apache-2.0 License respectively, to support research on historical language processing and the development of culturally inclusive NLP systems.

pdf bib
BhashaSetu: Cross-Lingual Knowledge Transfer from High-Resource to Extreme Low-Resource Languages
Subhadip Maji | Arnab Bhattacharya

Despite remarkable advances in natural language processing, developing effective systems for low-resource languages remains a formidable challenge, with performances typically lagging far behind high-resource counterparts due to data scarcity and insufficient linguistic resources. Cross-lingual knowledge transfer has emerged as a promising approach to address this challenge by leveraging resources from high-resource languages. In this paper, we investigate methods for transferring linguistic knowledge from high-resource languages to low-resource languages, where the number of labeled training instances is in hundreds. We focus on sentence-level and word-level tasks. We introduce a novel method, GETR (Graph-Enhanced Token Representation) for cross-lingual knowledge transfer along with two adopted baselines (a) augmentation in hidden layers and (b) token embedding transfer through token translation. Experimental results demonstrate that our GNN-based approach significantly outperforms existing multilingual and cross-lingual baseline methods, achieving 13 percentage point improvements on truly low-resource languages (Mizo, Khasi) for POS tagging, and 20 and 27 percentage point improvements in macro-F1 on simulated low-resource languages (Marathi, Bangla, Malayalam) across sentiment classification and NER tasks respectively. We also present a detailed analysis of the transfer mechanisms and identify key factors that contribute to successful knowledge transfer in this linguistic context.

pdf bib
Are Humans as Brittle as Large Language Models?
Jiahui Li | Sean Papay | Roman Klinger

The output of large language models (LLMs) is unstable, due both to non-determinism of the decoding process as well as to prompt brittleness. While the intrinsic non-determinism of LLM generation may mimic existing uncertainty in human annotations through distributional shifts in outputs, it is largely assumed, yet unexplored, that the prompt brittleness effect is unique to LLMs. This raises the question: do human annotators show similar sensitivity to prompt changes? If so, should prompt brittleness in LLMs be considered problematic? One may alternatively hypothesize that prompt brittleness correctly reflects human annotation variances. To fill this research gap, we systematically compare the effects of prompt modifications on LLMs and identical instruction modifications for human annotators, focusing on the question of whether humans are similarly sensitive to prompt perturbations. To study this, we prompt both humans and LLMs for a set of text classification tasks conditioned on prompt variations. Our findings indicate that both humans and LLMs exhibit increased brittleness in response to specific types of prompt modifications, particularly those involving the substitution of alternative label sets or label formats. However, the distribution of human judgments is less affected by typographical errors and reversed label order than that of LLMs.

pdf bib
Large Temporal Models: Unlocking Temporal Understanding in LLMs for Temporal Relation Classification
Omri Homburger | Kfir Bar

We present Large Temporal Model, a Large Language Model (LLM) that excels in Temporal Relation Classification (TRC). We show how a carefully designed fine-tuning strategy, using a novel two-step fine-tuning approach, can adapt LLMs for TRC. Our approach is focused on global TRC, enabling simultaneous classification of all temporal relations within a document. Unlike traditional pairwise methods, our approach performs global inference in a single step, improving both efficiency and consistency. Evaluations on the MATRES and OmniTemp benchmarks demonstrate that, for the first time, an LLM achieves state-of-the-art performance, outperforming previous pairwise and global TRC methods. Results show that our global approach produces more consistent and accurate temporal graphs. Ablation studies further validate the effectiveness of our two-step fine-tuning strategy, while analyses reveal why our approach succeeds in increasing performance and reducing inconsistencies.

pdf bib
What Are They Talking About? A Benchmark of Knowledge-Grounded Discussion Summarization
Weixiao Zhou | Junnan Zhu | Gengyao Li | Xianfu Cheng | Xinnian Liang | Feifei Zhai | Zhoujun Li

Traditional dialogue summarization primarily focuses on dialogue content, assuming it comprises adequate information for a clear summary. However, this assumption often fails for discussions grounded in shared background, where participants frequently omit context and use implicit references. This results in summaries that are confusing to readers unfamiliar with the background. To address this, we introduce Knowledge-Grounded Discussion Summarization (KGDS), a novel task that produces a supplementary background summary for context and a clear opinion summary with clarified references. To facilitate research, we construct the first KGDS benchmark, featuring news-discussion pairs and expert-created multi-granularity gold annotations for evaluating sub-summaries. We also propose a novel hierarchical evaluation framework with fine-grained and interpretable metrics. Our extensive evaluation of 12 advanced large language models (LLMs) reveals that KGDS remains a significant challenge. The models frequently miss key facts and retain irrelevant ones in background summarization, and often fail to resolve implicit references in opinion summary integration.

pdf bib
CtrlShift: Steering Language Models for Dense Quotation Retrieval with Dynamic Prompts
Chuang Liang | Wei Li | Yanqiu Shao

Quotation recommendation is an inherently asymmetric retrieval task, where the intended meaning of a quote often diverges from surface expressions, creating significant semantic shifts. Combined with minimal lexical overlap, this poses a core challenge for classic dense retrievers, which struggle to capture non-literal and rhetorical alignments. To bridge this semantic gap, we propose introducing controllable signals to guide the model’s attention toward abstract, context-relevant concepts. We propose CtrlShift, a framework that leverages a Variational Autoencoder (VAE) to capture latent associations between context and quotation, which are used to derive context-aware control signals to modulate semantic focus and support bidirectional alignment and rhetorical intent modeling. Experiments show that our method consistently outperforms baselines on the quotation recommendation task and can be effectively transferred to a general-purpose benchmark. Further, CtrlShift integrates seamlessly with general-purpose generative models without additional fine-tuning, and provides satisfactory interpretability by generating textual explanations that uncover the model’s focus on abstract, citation-aligned semantics.

pdf bib
PerCoR: Evaluating Commonsense Reasoning in Persian via Multiple-Choice Sentence Completion
Morteza Alikhani | Mohammadtaha Bagherifard | Erfan Zinvandi | Mehran Sarmadi

We introduce PerCoR—Persian Commonsense Reasoning—the first large-scale Persian benchmark for commonsense reasoning. PerCoR contains 106K multiple-choice sentence-completion problems drawn from more than forty news, cultural and other web sources. We adopt a linguistically grounded, conjunction-based segmentation strategy to generate coherent prefix–continuation pairs. To create challenging distractors, we propose DRESS-AF—Distractor Ranking via Embedding Similarity Scoring and Adversarial Filtering—a generation-free adversarial filtering method that selects distractors from the pool of gold continuations while maximising model confusion. Human annotators score 89% on PerCoR, while OpenAI-o3 achieves the highest performance at 92.18%, followed closely by Claude-Sonnet-3.7 (91.17%). The strongest open-source model, DeepSeek-R1, reaches 82.51%, underscoring both the dataset’s difficulty and the remaining performance gap in Persian commonsense reasoning. We further show that DRESS-AF transfers to the English HellaSwag benchmark, increasing its difficulty without hurting human solvability. The dataset is available at https://huggingface.co/datasets/MCINext/PerCoR .
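
A compact sketch of the distractor-selection step described above (the similarity floor, the use of a model probability as the confusion proxy, and all array shapes are illustrative assumptions): candidate gold continuations from other items are ranked by embedding similarity to the prefix and then adversarially filtered toward the ones the model finds most attractive.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def select_distractors(prefix_emb, pool_embs, model_scores, k=3, sim_floor=0.0):
    """Keep continuations that are semantically close to the prefix (plausible)
    and, among those, the ones the scoring model is most fooled by (adversarial)."""
    sims = np.array([cosine(prefix_emb, e) for e in pool_embs])
    plausible = [i for i in range(len(pool_embs)) if sims[i] >= sim_floor]
    ranked = sorted(plausible, key=lambda i: -model_scores[i])
    return ranked[:k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prefix_emb = rng.normal(size=128)        # embedding of the sentence prefix
    pool_embs = rng.normal(size=(50, 128))   # embeddings of other items' gold continuations
    model_scores = rng.uniform(size=50)      # proxy: probability a model assigns to each candidate
    print(select_distractors(prefix_emb, pool_embs, model_scores))
```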

pdf bib
Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval
Shubhashis Roy Dipta | Francis Ferraro

Recent approaches have shown impressive proficiency in extracting and leveraging parametric knowledge from Large-Language Models (LLMs) and Vision-Language Models (VLMs). In this work, we consider how we can improve the retrieval of videos related to complex real-world events by automatically extracting latent parametric knowledge about those events. We present Q2E: a Query-to-Event decomposition method for zero-shot multilingual text-to-video retrieval, adaptable across datasets, domains, LLMs, or VLMs. Our approach demonstrates that we can enhance the understanding of otherwise overly simplified human queries by decomposing the query using the knowledge embedded in LLMs and VLMs. We additionally show how to apply our approach to both visual and speech-based inputs. To combine this varied multimodal knowledge, we adopt entropy-based fusion scoring for zero-shot fusion. Q2E outperforms the previous SOTA on the MultiVENT dataset by 8 NDCG points, while improving on MSR-VTT and MSVD by 4 and 3 points, respectively, outperforming several existing retrieval methods, including many fine-tuned and SOTA zero-shot approaches. We have released both code and data.
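
A small sketch of entropy-based fusion in the spirit described above (the softmax normalization and the 1/(1+H) weighting are assumptions about the exact form): each modality's score distribution over candidate videos is weighted inversely to its entropy, so confident (peaked) modalities dominate the fused ranking.

```python
import numpy as np

def entropy_weight(scores: np.ndarray) -> float:
    """Lower-entropy (more peaked) score distributions get a larger weight."""
    p = np.exp(scores - scores.max())
    p /= p.sum()
    h = -(p * np.log(p + 1e-12)).sum()
    return 1.0 / (1.0 + h)

def fuse(modality_scores: dict) -> np.ndarray:
    """Entropy-weighted late fusion of per-modality video relevance scores."""
    weights = {m: entropy_weight(s) for m, s in modality_scores.items()}
    total = sum(weights.values())
    return sum((w / total) * modality_scores[m] for m, w in weights.items())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = {
        "visual": rng.normal(size=10),  # query-to-video scores from visual features
        "speech": rng.normal(size=10),  # scores from matching transcribed speech
    }
    print("top-ranked video:", int(np.argmax(fuse(scores))))
```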

pdf bib
VideoChain: A Transformer-Based Framework for Multi-hop Video Question Generation
Arpan Phukan | Anupam Pandey | Deepjyoti Bodo | Asif Ekbal

Multi-hop Question Generation (QG) effectively evaluates reasoning but remains confined to text; Video Question Generation (VideoQG) is limited to zero-hop questions over single segments. To address this, we introduce VideoChain, a novel Multi-hop Video Question Generation (MVQG) framework designed to generate questions that require reasoning across multiple, temporally separated video segments. VideoChain features a modular architecture built on a modified BART backbone enhanced with video embeddings, capturing textual and visual dependencies. Using the TVQA+ dataset, we automatically construct the large-scale MVQ-60 dataset by merging zero-hop QA pairs, ensuring scalability and diversity. Evaluations show VideoChain’s strong performance across standard generation metrics: ROUGE-L (0.6454), ROUGE-1 (0.6854), BLEU-1 (0.6711), BERTScore-F1 (0.7967), and semantic similarity (0.8110). These results highlight the model’s ability to generate coherent, contextually grounded, and reasoning-intensive questions. To facilitate future research, we publicly release our code and dataset.

pdf bib
Interpreting the Effects of Quantization on LLMs
Manpreet Singh | Hassan Sajjad

Quantization offers a practical solution for deploying LLMs in resource-constrained environments. However, its impact on internal representations remains understudied, raising questions about the reliability of quantized models. In this study, we employ a range of interpretability techniques to investigate how quantization affects model and neuron behavior. We analyze multiple LLMs under 4-bit and 8-bit quantization. Our findings reveal that the impact of quantization on model calibration is generally minor. Analysis of neuron activations indicates that the number of dead neurons, i.e., those with activation values close to 0 across the dataset, remains consistent regardless of quantization. In terms of neuron contribution to predictions, we observe that smaller full-precision models exhibit fewer salient neurons, whereas larger models tend to have more, with the exception of Llama-2-7B. The effect of quantization on neuron redundancy varies across models. Overall, our findings suggest that the effect of quantization may vary by model and task; however, we did not observe any drastic changes that would discourage the use of quantization as a reliable model compression technique.
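
The notion of dead neurons used above (activation values close to 0 across the dataset) can be operationalized with a simple threshold count, as in the hedged sketch below; the threshold and the way activations are collected are assumptions, not the authors' setup.

```python
# Hedged sketch: count neurons whose activations stay near zero on every example.
import torch

def count_dead_neurons(activations, eps=1e-3):
    """activations: [num_examples, num_neurons] tensor of post-activation values.
    A neuron is 'dead' if its activation stays within eps of zero on every example."""
    dead = (activations.abs() < eps).all(dim=0)
    return int(dead.sum()), dead

acts = torch.relu(torch.randn(1000, 512))   # stand-in for activations collected over a dataset
acts[:, :7] = 0.0                           # simulate 7 dead neurons
n_dead, mask = count_dead_neurons(acts)
print(n_dead)
```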

pdf bib
Large Language Models Encode Semantics and Alignment in Linearly Separable Representations
Baturay Saglam | Paul Kassianik | Blaine Nelson | Sajana Weerawardhena | Yaron Singer | Amin Karbasi

Understanding the latent space geometry of large language models (LLMs) is key to interpreting their behavior and improving alignment. Yet it remains unclear to what extent LLMs linearly organize representations related to semantic understanding. To explore this, we conduct a large-scale empirical study of hidden representations in 11 autoregressive models across six scientific topics. We find that high-level semantic information consistently resides in low-dimensional subspaces that form linearly separable representations across domains. This separability becomes more pronounced in deeper layers and under prompts that elicit structured reasoning or alignment behavior—even when surface content remains unchanged. These findings motivate geometry-aware tools that operate directly in latent space to detect and mitigate harmful and adversarial content. As a proof of concept, we train an MLP probe on final-layer hidden states as a lightweight latent-space guardrail. This approach substantially improves refusal rates on malicious queries and prompt injections that bypass both the model’s built-in safety alignment and external token-level filters.
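
A minimal sketch of the proof of concept described above: an MLP probe trained on final-layer hidden states and used as a refusal guardrail. The probe size, synthetic data, and threshold are placeholders, not the authors' configuration.

```python
# Hedged sketch of a latent-space guardrail: an MLP probe over hidden states.
import torch
import torch.nn as nn

# Toy stand-in data: final-layer hidden states (dim 64) labeled benign (0) or malicious (1).
torch.manual_seed(0)
X = torch.randn(2000, 64)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).long()      # synthetic separable labels

probe = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(probe(X), y)
    loss.backward()
    opt.step()

def guardrail(hidden_state, threshold=0.5):
    """Refuse when the probe's malicious-class probability exceeds the threshold."""
    prob = torch.softmax(probe(hidden_state), dim=-1)[..., 1]
    return prob > threshold

print(guardrail(X[:5]))
```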

pdf bib
Beyond statistical significance: Quantifying uncertainty and statistical variability in multilingual and multitask NLP evaluation
Jonne Sälevä | Duygu Ataman | Constantine Lignos

We introduce a set of resampling-based methods for quantifying uncertainty and statistical precision of evaluation metrics in multilingual and/or multitask NLP benchmarks. We show how experimental variation in performance scores arises from both model- and data-related sources, and that accounting for both of them is necessary to avoid substantially underestimating the overall variability over hypothetical replications. Using multilingual question answering, machine translation, and named entity recognition as example tasks, we also demonstrate how resampling methods are useful for quantifying the replication uncertainty of various quantities used in leaderboards, such as model rankings and pairwise differences between models.
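
One simple resampling scheme consistent with the description above resamples both evaluation items and model runs (e.g., training seeds) when bootstrapping a score, so that both data- and model-related variation enter the interval. The concrete procedure below is an assumption, not the paper's exact method.

```python
# Hedged sketch: a two-axis bootstrap over model runs and evaluation items.
import numpy as np

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """scores: [num_seeds, num_examples] per-example scores for one model,
    one row per training/evaluation run. Resamples both axes to reflect
    model- and data-related variation."""
    rng = np.random.default_rng(seed)
    n_seeds, n_items = scores.shape
    means = []
    for _ in range(n_boot):
        seed_idx = rng.integers(0, n_seeds, n_seeds)     # resample model runs
        item_idx = rng.integers(0, n_items, n_items)     # resample evaluation items
        means.append(scores[np.ix_(seed_idx, item_idx)].mean())
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

scores = np.random.default_rng(1).normal(0.75, 0.05, size=(5, 400)).clip(0, 1)
print(bootstrap_ci(scores))
```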

pdf bib
What Would You Ask When You First Saw a²+b²=c²? Evaluating LLM on Curiosity-Driven Question Generation
Shashidhar Reddy Javaji | Zining Zhu

Large language models (LLMs) are increasingly widely used as critical components of knowledge retrieval systems and agentic systems. These systems can benefit from knowledge-seeking capabilities of LLMs, in other words, curiosity. However, this capability has not been evaluated quantitatively. Towards bridging this gap, we propose an evaluation framework, CDQG (Curiosity-Driven Question Generation). The CDQG task prompts LLMs to generate questions about a statement introducing scientific knowledge, simulating a curious person when facing the statement for the first time. The CDQG dataset contains 1,988 statements including physics, chemistry, and mathematics with distinct levels of difficulty, general knowledge statements, and intentionally erroneous statements. We score the qualities of the questions generated by LLMs along multiple dimensions. These scores are validated by rigorous controlled ablation studies and human evaluations. While large models like GPT-4 and Mistral 8x7b can generate highly coherent and relevant questions, the smaller Phi-2 model is equally or more effective. This indicates that size does not solely determine a model’s knowledge acquisition potential. CDQG quantifies a critical model capability, and opens up research opportunities for developing future knowledge retrieval systems driven by LLMs.

pdf bib
Can AI Validate Science? Benchmarking LLMs on Claim → Evidence Reasoning in AI Papers
Shashidhar Reddy Javaji | Yupeng Cao | Haohang Li | Yangyang Yu | Nikhil Muralidhar | Zining Zhu

Large language models (LLMs) are increasingly being used for complex research tasks such as literature review, idea generation, and scientific paper analysis, yet their ability to truly understand and process the intricate relationships within complex research papers, such as the logical links between claims and supporting evidence, remains largely unexplored. In this study, we present CLAIM-BENCH, a comprehensive benchmark for evaluating LLMs’ capabilities in scientific claim-evidence extraction and validation, a task that reflects deeper comprehension of scientific argumentation. We systematically compare three approaches, inspired by divide-and-conquer strategies, across six diverse LLMs, highlighting model-specific strengths and weaknesses in scientific comprehension. Through evaluation involving over 300 claim-evidence pairs across multiple research domains, we reveal significant limitations in LLMs’ ability to process complex scientific content. Our results demonstrate that closed-source models like GPT-4 and Claude consistently outperform open-source counterparts in precision and recall across claim-evidence identification tasks. Furthermore, strategically designed three-pass and one-by-one prompting approaches significantly improve LLMs’ abilities to accurately link dispersed evidence with claims, although this comes at increased computational cost. CLAIM-BENCH sets a new standard for evaluating scientific comprehension in LLMs, offering both a diagnostic tool and a path forward for building systems capable of deeper, more reliable reasoning across full-length papers.

pdf bib
More Than a Score: Probing the Impact of Prompt Specificity on LLM Code Generation
Yangtian Zi | Harshitha Menon | Arjun Guha

State-of-the-art Large Language Models (LLMs) achieve high pass@1 on general benchmarks like HumanEval (Chen et al., 2021) but underperform on specialized suites such as ParEval (Nichols et al., 2024). Is this because LLMs lack domain knowledge, or because the prompts provide insufficient detail? To answer this, we introduce PartialOrderEval, which augments any code generation benchmark with a partial order of prompts from minimal to maximally detailed. Applying it to HumanEval and both serial and OpenMP subsets of ParEval, we measure how pass@1 scales with prompt specificity. Our experiments with Llama-3.x and Qwen2.5-Coder demonstrate varying degrees of prompt sensitivity across different tasks, and a qualitative analysis highlights explicit I/O specifications, edge-case handling, and stepwise breakdowns as the key drivers of improvement from more detailed prompts.

pdf bib
Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation
Galann Pennec | Zhengyuan Liu | Nicholas Asher | Philippe Muller | Nancy F. Chen

Vision-Language Models (VLMs) often struggle to balance visual and textual information when summarizing complex multimodal inputs, such as entire TV show episodes. In this paper, we propose a zero-shot video-to-text summarization approach that builds its own screenplay-like representation of an episode, effectively integrating key video moments, dialogue, and character information into a unified document. Unlike previous approaches, we simultaneously generate screenplays and name the characters in zero-shot, using only the audio, video, and transcripts as input. Additionally, we highlight that existing summarization metrics can fail to assess the multimodal content in summaries. To address this, we introduce MFactSum, a multimodal metric that evaluates summaries with respect to both vision and text modalities. Using MFactSum, we evaluate our screenplay summaries on the SummScreen3D dataset, demonstrating superiority against state-of-the-art VLMs such as Gemini 1.5 by generating summaries containing 20% more relevant visual information while requiring 75% less of the video as input.

pdf bib
PMPO: A Self-Optimizing Framework for Creating High-Fidelity Measurement Tools for Social Bias in Large Language Models
Zeqiang Wang | Yuqi Wang | Xinyue Wu | Chenxi Li | Yiran Liu | Linghan Ge | Zhan Yu | Jiaxin Shi | Suparna De

The potential of Large Language Models (LLMs) as instruments for measuring social phenomena is constrained by the methodological limitations of current probing techniques. Prevailing methods rely on static, handcrafted probe sets whose quality is highly dependent on their authors’ subjective expertise. This results in measurement tools with inconsistent statistical reliability that defy systematic optimization. Such an “artisanal” approach, akin to using an “uneven ruler,” undermines the scientific rigor of its findings and severely limits the applicability of LLMs in the social sciences. To elevate bias measurement from a craft to a science, we introduce the Psychometric-driven Probe Optimization (PMPO) framework. This framework treats a probe set as an optimizable scientific instrument and, for the first time, utilizes a Neural Genetic Algorithm that leverages a powerful LLM as a “neural genetic operator.” Through a hybrid strategy of gradient-guided mutation and creative rephrasing, PMPO automatically enhances the probe set’s reliability, sensitivity, and diversity. We first establish the external validity of our foundational measurement method (PLC), demonstrating a high correlation between its measurement of occupational gender bias and real-world U.S. Bureau of Labor Statistics data (average Pearson’s r=0.83, p<.001). Building on this, we show that the PMPO framework can elevate a standard probe set’s internal consistency (Cronbach’s Alpha) from 0.90 to an exceptional 0.96 within 10 generations. Critically, in a rigorous, double-blind “Turing Test,” probes evolved by PMPO from non-expert seeds were judged by sociology experts to have achieved a level of quality, sophistication, and nuance that is comparable to, and even indistinguishable from, those handcrafted by domain experts. This work provides a systematic pathway to upgrade LLM measurement tools from artisanal artifacts to automated scientific instruments, offering an unprecedented and trustworthy tool for AI safety auditing and computational social science.
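
For reference, the internal-consistency statistic reported above (Cronbach's alpha) has a standard closed form; the sketch below computes it for a toy probe-score matrix with made-up data, purely to show the calculation, not the PMPO pipeline.

```python
# Hedged sketch: standard Cronbach's alpha over a toy probe-score matrix.
import numpy as np

def cronbach_alpha(item_scores):
    """item_scores: [num_respondents, num_probes] matrix of scores.
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1)
    total_var = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
latent = rng.normal(size=(30, 1))                       # shared signal per "respondent"
probes = latent + 0.3 * rng.normal(size=(30, 20))       # 20 probes measuring the same construct
print(round(cronbach_alpha(probes), 3))
```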

pdf bib
ELR-1000: A Community-Generated Dataset for Endangered Indic Indigenous Languages
Neha Joshi | Pamir Gogoi | AasimBaig Mirza | Aayush Jansari | Aditya Yadavalli | Ayushi Pandey | Arunima Shukla | Deepthi Sudharsan | Kalika Bali | Vivek Seshadri

We present a culturally-grounded multimodal dataset of 1,060 traditional recipes crowdsourced from rural communities across remote regions of Eastern India, spanning 10 endangered languages. These recipes, rich in linguistic and cultural nuance, were collected using a mobile interface designed for contributors with low digital literacy. The resulting dataset, Endangered Language Recipes (ELR-1000), captures not only culinary practices but also the socio-cultural context embedded in indigenous food traditions. We evaluate the performance of several state-of-the-art large language models (LLMs) on translating these recipes into English and find that, despite the models’ capabilities, they struggle with low-resource, culturally-specific language. However, we observe that providing targeted context—including background information about the languages, translation examples, and guidelines for cultural preservation—leads to significant improvements in translation quality. Our results underscore the need for benchmarks that cater to underrepresented languages and domains to advance equitable and culturally-aware language technologies. As part of this work, we release the ELR-1000 dataset to the NLP community, hoping it motivates the development of language technologies for endangered languages.

pdf bib
Pragmatic Theories Enhance Understanding of Implied Meanings in LLMs
Takuma Sato | Seiya Kawano | Koichiro Yoshino

The ability to accurately interpret implied meanings plays a crucial role in human communication and language use, and language models are also expected to possess this capability. This study demonstrates that providing language models with pragmatic theories as prompts is an effective in-context learning approach for tasks that require understanding implied meanings. Specifically, we propose an approach in which an overview of pragmatic theories, such as Gricean pragmatics and Relevance Theory, is presented as a prompt to the language model, guiding it through a step-by-step reasoning process to derive a final interpretation. Experimental results showed that, compared to the baseline, which prompts intermediate reasoning without presenting pragmatic theories (0-shot Chain-of-Thought), our methods enabled language models to achieve up to 9.6% higher scores on pragmatic reasoning tasks. Furthermore, we show that even without explaining the details of pragmatic theories, merely mentioning their names in the prompt leads to a certain performance improvement (around 1-3%) in larger models compared to the baseline.

pdf bib
IndicClaimBuster: A Multilingual Claim Verification Dataset
Pritam Pal | Shyamal Krishna Jana | Dipankar Das

The present article introduces **IndicClaimBuster**, a novel multilingual claim verification dataset comprising 9K claims and their corresponding evidence in English, Hindi, Bengali, and Hindi-English CodeMixed texts. The dataset covers three key domains (politics, law and order, and health) to address the challenges of fact verification. Each claim was sourced from reputable Indian news portals and is accompanied by three pieces of evidence, two LLM-generated and one manually curated. Additionally, a separate attempt was conducted to generate refuted claims by employing an LLM. We further develop two frameworks: an unsupervised baseline and a two-stage pipeline that comprises evidence retrieval and veracity prediction modules. For retrieval, we fine-tuned SBERT models, with e5-base demonstrating superior average performance across languages, whereas for veracity prediction, multilingual transformers (mBERT, XLM-R, MuRIL, IndicBERTv2) were fine-tuned. Results indicate MuRIL and IndicBERTv2 excel in Indian languages, while XLM-R performs the best for CodeMix. Our work contributes a high-quality multilingual dataset and strong baseline methodologies, offering valuable resources for advancing automated claim verification in linguistically diverse and low-resource settings for Indian languages. The IndicClaimBuster dataset is available at: https://github.com/pritampal98/indic-claim-buster

pdf bib
IncogniText: Privacy-enhancing Conditional Text Anonymization via LLM-based Private Attribute Randomization
Ahmed Frikha | Nassim Walha | Krishna Kanth Nakka | Ricardo Mendes | Xue Jiang | Xuebing Zhou

In this work, we address the problem of text anonymization where the goal is to prevent adversaries from correctly inferring private attributes of the author, while keeping the text utility, i.e., meaning and semantics. We propose IncogniText, a technique that anonymizes the text to mislead a potential adversary into predicting a wrong private attribute value. Our empirical evaluation shows a substantial reduction of private attribute leakage across 8 different private attributes. Finally, we demonstrate the maturity of IncogniText for real-world applications by distilling its anonymization capability into a set of LoRA parameters associated with an on-device model. Our results show the possibility of reducing privacy leakage by more than half with limited impact on utility.

pdf bib
Crypto-LLM: Two-Stage Language Model Pre-training with Ciphered and Natural Language Data
Yohei Kobashi | Fumiya Uchiyama | Takeshi Kojima | Andrew Gambardella | Qi Cao | Yusuke Iwasawa | Yutaka Matsuo

As the adoption of large language models (LLMs) continues to grow, the risk of sensitive data leakage from their training datasets has become a critical concern. This study proposes a novel method for encrypting training data using a polyalphabetic substitution cipher. This approach prevents the model from learning sensitive information while allowing it to capture abstract linguistic patterns. We pre-trained a Llama 3 model (551M parameters) using approximately 7.5 billion tokens of encrypted data and subsequently conducted continual pre-training with another 2.5 billion tokens of plaintext data. The effectiveness of the model was evaluated by comparing its downstream task performance with a model trained solely on plaintext data. In addition, we evaluated the risk of sensitive data leakage through name reconstruction, true-prefix and data extraction attacks. These results demonstrate the potential of our approach to balance data security with model performance.
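
The abstract does not specify the exact cipher, but a classic polyalphabetic substitution (Vigenère-style, over characters) looks like the sketch below; treating non-letters as pass-through and the key choice are illustrative assumptions, not the authors' scheme.

```python
# Hedged sketch of a polyalphabetic (Vigenère-style) substitution over lowercase letters.
import string

ALPHABET = string.ascii_lowercase

def vigenere_encrypt(text, key):
    """Each letter is shifted by a key-dependent amount that cycles through the key,
    so identical plaintext letters map to different ciphertext letters by position."""
    out, ki = [], 0
    for ch in text.lower():
        if ch in ALPHABET:
            shift = ALPHABET.index(key[ki % len(key)])
            out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
            ki += 1
        else:
            out.append(ch)          # keep spaces/punctuation so token boundaries survive
    return "".join(out)

print(vigenere_encrypt("patient john doe was admitted", "nlp"))
```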

pdf bib
Not Just a Piece of Cake: Cross-Lingual Fine-Tuning for Idiom Identification
Ofri Hefetz | Kai Golan Hashiloni | Alon Mannor | Kfir Bar

We investigate cross-lingual fine-tuning for idiomatic expression identification, addressing the limited availability of annotated data in many languages. We evaluate encoder and generative decoder models to examine their ability to generalize idiom identification across languages. Additionally, we conduct an explainability study using linear probing and LogitLens to analyze how idiomatic meaning is represented across model layers. Results show consistent cross-lingual transfer, with English emerging as a strong source language. All code and models are released to support future research.

pdf bib
Hallucinations in Code Change to Natural Language Generation: Prevalence and Evaluation of Detection Metrics
Chunhua Liu | Hong Yi Lin | Patanamon Thongtanunam

Language models have shown strong capabilities across a wide range of tasks in software engineering, such as code generation, yet they suffer from hallucinations. While hallucinations have been studied independently in natural language and code generation, their occurrence in tasks involving code changes which have a structurally complex and context-dependent format of code remains largely unexplored. This paper presents the first comprehensive analysis of hallucinations in two critical tasks involving code change to natural language generation: commit message generation and code review comment generation. We quantify the prevalence of hallucinations in recent language models and explore a range of metric-based approaches to automatically detect them. Our findings reveal that approximately 50% of generated code reviews and 20% of generated commit messages contain hallucinations. Whilst commonly used metrics are weak detectors on their own, combining multiple metrics substantially improves performance. Notably, model confidence and feature attribution metrics effectively contribute to hallucination detection, showing promise for inference-time detection.
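
Combining multiple weak metric-based detectors, as the findings above suggest, can be done with a simple learned combiner. The sketch below fits a logistic regression over toy metric features to predict a hallucination label, purely as an illustration of the idea; the features, labels, and combiner are assumptions, not the paper's method.

```python
# Hedged sketch: combine several metric scores into one hallucination detector.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy feature matrix: one row per generated message, columns are metric scores
# (e.g., a similarity metric, model confidence, a feature-attribution score).
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
y = (0.8 * X[:, 1] - 0.6 * X[:, 2] + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print(round(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]), 3))
```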

pdf bib
Differential Mamba
Nadav Schneider | Itamar Zimerman | Eliya Nachmani

Sequence models like Transformers and RNNs often overallocate attention to irrelevant context, leading to noisy intermediate representations. This degrades LLM capabilities by promoting hallucinations, weakening long-range and retrieval abilities, and reducing robustness. Recent work has shown that differential design can mitigate this issue in Transformers, improving their effectiveness across various applications. In this paper, we explore whether these techniques, originally developed for Transformers, can be applied to Mamba, a recent architecture based on selective state-space layers that achieves Transformer-level performance with greater efficiency. We show that a naive adaptation of differential design to Mamba is insufficient and requires careful architectural modifications. To address this, we introduce a novel differential mechanism for Mamba, empirically validated on language modeling benchmarks, demonstrating improved retrieval capabilities and superior performance over vanilla Mamba. Finally, we conduct extensive ablation studies and empirical analyses to justify our design choices and provide evidence that our approach effectively mitigates the overallocation problem in Mamba-based models.

pdf bib
From Templates to Natural Language: Generalization Challenges in Instruction-Tuned LLMs for Spatial Reasoning
Chalamalasetti Kranti | Sherzod Hakimov | David Schlangen

Instruction-tuned large language models (LLMs) have shown strong performance on a variety of tasks; however, generalizing from synthetic to human-authored instructions in grounded environments remains a challenge for them. In this work, we study generalization challenges in spatial grounding tasks where models interpret and translate instructions for building object arrangements on a 2.5D grid. We fine-tune LLMs using only synthetic instructions and evaluate their performance on a benchmark dataset containing both synthetic and human-authored instructions. Our results reveal that while models generalize well on simple tasks, their performance degrades significantly on more complex tasks. We present a detailed error analysis of the gaps in instruction generalization.

pdf bib
Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization
Masahiro Kaneko | Zeerak Talat | Timothy Baldwin

Iterative jailbreak methods that repeatedly rewrite and input prompts into large language models (LLMs) to induce harmful outputs—using the model’s previous responses to guide each new iteration—have been found to be a highly effective attack strategy. Despite the effectiveness of this attack strategy against LLMs and their safety mechanisms, existing defenses do not proactively disrupt this dynamic trial-and-error cycle. In this study, we propose a novel framework that dynamically updates its defense strategy through online learning in response to each new prompt from iterative jailbreak methods. Leveraging the distinctions between harmful jailbreak-generated prompts and typical harmless prompts, we introduce a reinforcement learning-based approach that optimizes prompts to ensure appropriate responses for harmless tasks while explicitly rejecting harmful prompts. Additionally, to curb overfitting to the narrow band of partial input rewrites explored during an attack, we introduce Past-Direction Gradient Damping (PDGD). Experiments conducted on three LLMs show that our approach significantly outperforms five existing defense methods against five iterative jailbreak methods. Moreover, our results indicate that our prompt optimization strategy simultaneously enhances response quality for harmless tasks.

pdf bib
HARBOR: Exploring Persona Dynamics in Multi-Agent Competition
Kenan Jiang | Li Xiong | Fei Liu

We investigate factors contributing to LLM agents’ success in competitive multi-agent environments, using auctions as a testbed where agents bid to maximize profit. The agents are equipped with bidding domain knowledge, distinct personas that reflect item preferences, and a memory of auction history. Our work extends the classic auction scenario by creating a realistic environment where multiple agents bid on houses, weighing aspects such as size, location, and budget to secure the most desirable homes at the lowest prices. Particularly, we investigate three key questions: (a) How does a persona influence an agent’s behavior in a competitive setting? (b) Can an agent effectively profile its competitors’ behavior during auctions? (c) How can persona profiling be leveraged to create an advantage using strategies such as theory of mind? Through a series of experiments, we analyze the behaviors of LLM agents and shed light on new findings. Our testbed, called HARBOR, offers a valuable platform for deepening the understanding of multi-agent workflows in competitive environments.

pdf bib
A Diagnostic Framework for Auditing Reference-Free Vision-Language Metrics
Angeline Charles | Srikant Panda | Amit Agarwal | Hitesh Laxmichand Patel | Priyaranjan Pattnayak | Bhargava Kumar | Tejaswini Kumar

Reference-free metrics such as CLIPScore and PAC-S are increasingly used in vision-language tasks due to their scalability and independence from human-written references. However, their reliability under linguistic, visual, and cultural variation remains underexplored. In this work, we present a systematic audit of CLIPScore and PAC-S using an eight-factor diagnostic framework applied to MS-COCO validation images. Our analysis reveals consistent failure modes across dimensions including object size, content category, syntax, named entities, spatial relations, and cultural context. Both metrics penalize captions referencing African (−5.5%, −4.8%) and Arabian (−4.9%, −5.3%) cultures, favor large-object and animal-centric scenes (by 20-30%), and show limited sensitivity to spatial negation and word order. CLIPScore correlates more strongly with syntactic complexity, while PAC-S demonstrates greater robustness to verbosity and named-entity variation, highlighting complementary strengths rather than superiority. These findings expose cultural and content bias, weak semantic robustness, and limited compositional understanding. We conclude with design recommendations to improve fairness, scale invariance, and semantic grounding in future reference-free evaluation metrics.

pdf bib
Small Changes, Large Consequences: Analyzing the Allocational Fairness of LLMs in Hiring Contexts
Preethi Seshadri | Hongyu Chen | Sameer Singh | Seraphina Goldfarb-Tarrant

Large language models (LLMs) are increasingly being deployed in high-stakes applications like hiring, yet their potential for unfair decision-making remains understudied in generative and retrieval settings. In this work, we examine the allocational fairness of LLM-based hiring systems through two tasks that reflect actual HR usage: resume summarization and applicant ranking. By constructing a synthetic resume dataset with controlled perturbations and curating job postings, we investigate whether model behavior differs across demographic groups. Our findings reveal that generated summaries exhibit meaningful differences more frequently for race than for gender perturbations. Models also display non-uniform retrieval selection patterns across demographic groups and exhibit high ranking sensitivity to both gender and race perturbations. Surprisingly, retrieval models can show comparable sensitivity to both demographic and non-demographic changes, suggesting that fairness issues may stem from broader model brittleness. Overall, our results indicate that LLM-based hiring systems, especially in the retrieval stage, can exhibit notable biases that lead to discriminatory outcomes in real-world contexts.

pdf bib
CultureGuard: Towards Culturally-Aware Dataset and Guard Model for Multilingual Safety Applications
Raviraj Bhuminand Joshi | Rakesh Paul | Kanishk Singla | Anusha Kamath | Michael Evans | Katherine Luna | Shaona Ghosh | Utkarsh Vaidya | Eileen Margaret Peters Long | Sanjay Singh Chauhan | Niranjan Wartikar

The increasing use of Large Language Models (LLMs) in agentic applications highlights the need for robust safety guard models. While content safety in English is well-studied, non-English languages lack similar advancements due to the high cost of collecting culturally aligned labeled datasets. We present CultureGuard, a novel solution for curating culturally aligned, high-quality safety datasets across multiple languages. Our approach introduces a four-stage synthetic data generation and filtering pipeline: cultural data segregation, cultural data adaptation, machine translation, and quality filtering. This pipeline enables the conversion and expansion of the Nemotron-Content-Safety-Dataset-V2 English safety dataset into eight distinct languages: Arabic, German, Spanish, French, Hindi, Japanese, Thai, and Chinese. The resulting dataset, Nemotron-Safety-Guard-Dataset-v3, comprises 386,661 samples in 9 languages and facilitates the training of Llama-3.1-Nemotron-Safety-Guard-8B-v3 via LoRA-based fine-tuning. The final model achieves state-of-the-art performance on several multilingual content safety benchmarks. Furthermore, we show our moderately multilingual fine-tuning enables robust cross-lingual transfer and strong zero-shot generalization to unseen languages. We also benchmark the latest open LLMs on multilingual safety and observe that these LLMs are more prone to give unsafe responses when prompted in non-English languages. This work advances multilingual LLM safety by enabling the development of culturally aware safety guard models.

pdf bib
Revisiting Word Embeddings in the LLM Era
Yash Mahajan | Matthew Freestone | Naman Bansal | Sathyanarayanan N. Aakur | Santu Karmaker

Large Language Models (LLMs) have recently shown remarkable advancement in various NLP tasks. As such, a popular trend has emerged lately where NLP researchers extract word/sentence/document embeddings from these large decoder-only models and use them for various inference tasks with promising results. However, it is still unclear whether the performance improvement of LLM-induced embeddings is merely because of scale or whether the underlying embeddings they produce differ significantly from those of classical encoding models like Word2Vec, GloVe, Sentence-BERT (SBERT), or Universal Sentence Encoder (USE). This is the central question we investigate in this paper by systematically comparing classical decontextualized and contextualized word embeddings with their LLM-induced counterparts. Our results show that LLMs cluster semantically related words more tightly and perform better on analogy tasks in decontextualized settings. However, in contextualized settings, classical models like SimCSE often outperform LLMs in sentence-level similarity assessment tasks, highlighting their continued relevance for fine-grained semantics.

pdf bib
Who Remembers What? Tracing Information Fidelity in Human-AI Chains
Suvojit Acharjee | Utathya Aich | Diptarka Mandal | Asfak Ali

In many real-world settings like journalism, law, medicine, and science communication, information is passed from one person or system to another through multiple rounds of summarization or rewriting. This process, known as multi-hop information transfer, also happens increasingly in workflows involving large language models (LLMs). But while summarization models and factuality metrics have improved, we still don’t fully understand how meaning and factual accuracy hold up across long chains of transformations, especially when both humans and LLMs are involved. In this paper, we take a fresh look at this problem by combining insights from cognitive science (Bartlett’s serial reproduction) and information theory (Shannon’s noisy-channel model). We build a new dataset of 700 five-step transmission chains that include human-only, LLM-only, mixed human-LLM, and cross-LLM settings across a wide range of source texts. To track how meaning degrades, we introduce three new metrics: Information Degradation Rate (IDR) for semantic drift, Meaning Preservation Entropy (MPE) for uncertainty in factual content, and Cascaded Hallucination Propagation Index (CHPI) for how hallucinations accumulate over time. Our findings reveal that hybrid chains behave asymmetrically. When a human summary is refined by a language model, the final output tends to preserve meaning well, suggesting that models can improve upon human-written summaries. The code and data will be available at: https://github.com/transtrace6/TransTrace.git.

pdf bib
QA-Noun: Representing Nominal Semantics via Natural Language Question-Answer Pairs
Maria Tseytlin | Paul Roit | Omri Abend | Ido Dagan | Ayal Klein

Decomposing sentences into fine-grained meaning units is increasingly used to model semantic alignment. While QA-based semantic approaches have shown effectiveness for representing predicate-argument relations, they have so far left noun-centered semantics largely unaddressed. We introduce QA-Noun, a QA-based framework for capturing noun-centered semantic relations. QA-Noun defines nine question templates that cover both explicit syntactical and implicit contextual roles for nouns, producing interpretable QA pairs that complement verbal QA-SRL. We release detailed guidelines, a dataset of over 2,000 annotated noun mentions, and a trained model integrated with QA-SRL to yield a unified decomposition of sentence meaning into individual, highly fine-grained, facts. Evaluation shows that QA-Noun achieves near-complete coverage of AMR’s noun arguments while surfacing additional contextually implied relations, and that combining QA-Noun with QA-SRL yields over 130% higher granularity than recent fact-based decomposition methods such as FactScore and DecompScore. QA-Noun thus complements the broader QA-based semantic framework, forming a comprehensive and scalable approach to fine-grained semantic decomposition for cross-text alignment.

pdf bib
On Memorization of Large Language Models in Logical Reasoning
Chulin Xie | Yangsibo Huang | Chiyuan Zhang | Da Yu | Xinyun Chen | Bill Yuchen Lin | Bo Li | Badih Ghazi | Ravi Kumar

Large language models (LLMs) achieve good performance on challenging reasoning benchmarks, yet could also make basic reasoning mistakes. This contrasting behavior is puzzling when it comes to understanding the mechanisms behind LLMs’ reasoning capabilities. One hypothesis is that the increasingly high and nearly saturated performance on common reasoning benchmarks could be due to the memorization of similar problems. In this paper, we systematically investigate this hypothesis with a quantitative measurement of memorization in reasoning tasks, using two dynamically generated logical reasoning benchmarks based on Knights and Knaves (K&K) puzzles and Zebra puzzles (DynamicZebra). We find that LLMs could interpolate and memorize the training puzzles (achieving near-perfect accuracy) after fine-tuning, yet they struggle with slight variations of these puzzles. On the other hand, we show that while fine-tuning leads to heavy memorization, it also consistently improves generalization performance. Through in-depth analyses with perturbation tests, cross difficulty-level transferability, probing model internals, and fine-tuning with wrong answers, we establish that LLMs develop reasoning skills on logical puzzles alongside memorization. Finally, our analysis based on a per-sample memorization score sheds light on how LLMs switch between reasoning and memorization when solving logical puzzles.

pdf bib
ControlMed: Adding Reasoning Control to Medical Language Model
Sung-Min Lee | Siyoon Lee | Juyeon Kim | Kyoungmin Roh

Reasoning Large Language Models (LLMs) with enhanced accuracy and explainability are increasingly being adopted in the medical domain, as the life-critical nature of clinical decision-making demands reliable support. Despite these advancements, existing reasoning LLMs often generate unnecessarily lengthy reasoning processes, leading to significant computational overhead and response latency. These limitations hinder their practical deployment in real-world clinical environments. To address these challenges, we introduce ControlMed, a medical language model that enables users to actively control the length of the reasoning process at inference time through fine-grained control markers. ControlMed is trained through a three-stage pipeline: 1) pre-training on a large-scale synthetic medical instruction dataset covering both direct and reasoning responses; 2) supervised fine-tuning with multi-length reasoning data and explicit length-control markers; and 3) reinforcement learning with model-based reward signals to enhance factual accuracy and response quality. Experimental results on a variety of English and Korean medical benchmarks demonstrate that our model achieves similar or better performance compared to state-of-the-art models. Furthermore, users can flexibly balance reasoning accuracy and computational efficiency by controlling the reasoning length as needed. These findings demonstrate that ControlMed is a practical and adaptable solution for clinical question answering and medical information analysis.

pdf bib
No Universal Prompt: Unifying Reasoning through Adaptive Prompting for Temporal Table Reasoning.
Abhishek Rajgaria | Kushagra Dixit | Mayank Vyas | Harshavardhan Kalalbandi | Dan Roth | Vivek Gupta

Temporal Table Reasoning poses a significant challenge for Large Language Models (LLMs), requiring effective reasoning to extract relevant insights. Despite the existence of multiple prompting methods, their impact on table reasoning remains largely unexplored. Furthermore, model performance varies drastically across different table and context structures, making it difficult to determine an optimal approach. This work investigates multiple prompting techniques on diverse table types and finds that performance depends on factors such as entity type, table structure, the need for additional context, and question complexity, with no single method consistently outperforming the others. To address this, we introduce SEAR, an adaptive prompting framework inspired by human reasoning that dynamically adjusts to context and integrates structured reasoning, along with SEAR_Unified, its cost-efficient variant. We also demonstrate that optional table refactoring (preprocessing) enhances both approaches when tables lack structural consistency. Our results demonstrate that SEAR prompts achieve superior performance across all table types compared to baseline prompting techniques.

pdf bib
ClinStructor: AI-Powered Structuring of Unstructured Clinical Texts
Karthikeyan K | Raghuveer Thirukovalluru | David Carlson

Clinical notes contain valuable, context-rich information, but their unstructured format introduces several challenges, including unintended biases (e.g., gender or racial bias), poor generalization across clinical settings (e.g., models trained on one EHR system may perform poorly on another due to format differences), and poor interpretability. To address these issues, we present ClinStructor, a pipeline that leverages large language models (LLMs) to convert clinical free-text into structured, task-specific question–answer pairs prior to predictive modeling. Our method substantially enhances transparency and controllability and only leads to a modest reduction in predictive performance (a 2–3% drop in AUC), compared to direct fine-tuning, on the ICU mortality prediction task. ClinStructor lays a strong foundation for building reliable, interpretable, and generalizable machine learning models in clinical environments.

pdf bib
Adaptive Collaborative Labeling with MLLMs for Low-Resource Multimodal Emotion Recognition
Wenwen Zhuang | Lu Xiang | Shubei Tang | Yaping Zhang | Yu Zhou

Multimodal emotion recognition (MER) plays a crucial role in human-centric AI applications, yet existing models struggle in low-resource scenarios due to their heavy reliance on large amounts of high-quality labeled data. To address this challenge, we propose Adaptive Collaborative Labeling for Low-Resource MER (ACL-MER), a novel framework that leverages off-the-shelf multimodal large language models (MLLMs) to effectively exploit abundant unlabeled data. Specifically, ACL-MER incorporates a diverse teacher model zoo, wherein each MLLM specializes in a specific modality and is prompted to generate chain-of-thought predictions accompanied by scalar confidence scores. Rather than directly adopting these pseudo-labels, ACL-MER introduces an adaptive refinement strategy that selectively distills knowledge based on teacher confidence, iteratively guiding the lightweight student model toward robust learning under limited supervision. Extensive experiments on two benchmarks demonstrate that ACL-MER consistently outperforms strong baselines, especially in extremely low-resource settings.

pdf bib
Understanding and Controlling Repetition Neurons and Induction Heads in In-Context Learning
Nhi Hoai Doan | Tatsuya Hiraoka | Kentaro Inui

This paper investigates the relationship between large language models’ (LLMs) ability to recognize repetitive input patterns and their performance on in-context learning (ICL). In contrast to prior work that has primarily focused on attention heads, we examine this relationship from the perspective of skill neurons, specifically repetition neurons. Our experiments reveal that the impact of these neurons on ICL performance varies depending on the depth of the layer in which they reside. By comparing the effects of repetition neurons and induction heads, we further identify strategies for reducing repetitive outputs while maintaining strong ICL capabilities.

pdf bib
ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting
Abhijit Mishra | Mingda Li | Hsiang Fu | Richard Noh | Minji Kim

Efficient and privacy-preserving multimodal interaction is essential as AR, VR, and modern smartphones with powerful cameras become primary interfaces for human-computer communication. Existing powerful large vision-language models (VLMs) enabling multimodal interaction often rely on cloud-based processing, raising significant concerns about (1) visual privacy by transmitting sensitive vision data to servers, and (2) their limited real-time, on-device usability. This paper explores **Visual Instruction Rewriting**, a novel approach that transforms multimodal instructions into text-only commands, allowing seamless integration of lightweight on-device instruction rewriter VLMs **(250M parameters)** with existing conversational AI systems, enhancing vision data privacy. To achieve this, we present a dataset of over 39,000 examples across 14 domains and develop a compact VLM, pretrained on image captioning datasets and fine-tuned for instruction rewriting. Experimental results, evaluated through NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic parsing analysis, demonstrate that even a quantized version of the model (<500MB storage footprint) can achieve effective instruction rewriting, thus enabling privacy-focused, multimodal AI applications.

pdf bib
Quantifying Cognitive Bias Induction in LLM-Generated Content
Abeer Alessa | Param Somane | Akshaya Thenkarai Lakshminarasimhan | Julian Skirzynski | Julian McAuley | Jessica Maria Echterhoff

Large language models (LLMs) are integrated into applications like shopping reviews, summarization, or medical diagnosis support, where their use affects human decisions. We investigate the extent to which LLMs expose users to biased content and demonstrate its effect on human decision-making. We assess five LLM families in summarization and news fact-checking tasks, evaluating the consistency of LLMs with their context and their tendency to hallucinate on a new self-updating dataset. Our findings show that LLMs expose users to content that changes the context’s sentiment in 26.42% of cases (framing bias), hallucinate on 60.33% of post-knowledge-cutoff questions, and highlight context from earlier parts of the prompt (primacy bias) in 10.12% of cases, averaged across all tested models. We further find that humans are 32% more likely to purchase the same product after reading a summary of the review generated by an LLM rather than the original review. To address these issues, we evaluate 18 mitigation methods across three LLM families and find that targeted interventions are effective.

pdf bib
Language Arithmetics: Towards Systematic Language Neuron Identification and Manipulation
Daniil Gurgurov | Katharina Trinley | Yusser Al Ghussin | Tanja Baeumel | Josef Van Genabith | Simon Ostermann

Large language models (LLMs) exhibit strong multilingual abilities, yet the neural mechanisms behind language-specific processing remain unclear. We analyze language-specific neurons in Llama-3.1-8B, Mistral-Nemo-12B, and Aya-Expanse-8B & 32B across 21 typologically diverse languages, identifying neurons that control language behavior. Using the Language Activation Probability Entropy (LAPE) method, we show that these neurons cluster in deeper layers, with non-Latin scripts showing greater specialization. Related languages share overlapping neurons, reflecting internal representations of linguistic proximity. Through language arithmetics, i.e., systematic activation addition and multiplication, we steer models to deactivate unwanted languages and activate desired ones, outperforming established replacement approaches. These interventions effectively guide behavior across five multilingual tasks: language forcing, translation, QA, comprehension, and NLI. Manipulation is more successful for high-resource languages, while typological similarity improves effectiveness. We also demonstrate that neuron steering enhances downstream performance and reveal internal "fallback" mechanisms for language selection when neurons are progressively deactivated. Our code is made publicly available at https://github.com/d-gurgurov/Language-Neurons-Manipulation.
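
A minimal reading of the LAPE idea referenced above: for each neuron, normalize its per-language firing probabilities and compute the entropy over languages, so that low entropy flags language-specific neurons. The probability estimates and toy values below are assumptions, not the paper's exact computation.

```python
# Hedged sketch: entropy of per-language activation probabilities, per neuron.
import numpy as np

def lape(activation_probs):
    """activation_probs: [num_languages, num_neurons], where each entry is the
    estimated probability that a neuron fires on text in that language.
    Lower entropy over languages suggests a more language-specific neuron."""
    p = activation_probs / (activation_probs.sum(axis=0, keepdims=True) + 1e-12)
    return -(p * np.log(p + 1e-12)).sum(axis=0)

rng = np.random.default_rng(0)
probs = rng.uniform(0.0, 0.5, size=(21, 4096))   # 21 languages, 4096 neurons (toy values)
probs[:, :10] = 0.02                             # mostly silent neurons...
probs[3, :10] = 0.9                              # ...that fire strongly for one language
entropy = lape(probs)
language_specific = np.argsort(entropy)[:10]     # lowest-entropy (most specific) neurons
print(language_specific)
```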

pdf bib
FOCUS: A Benchmark for Targeted Socratic Question Generation via Source-Span Grounding
Surawat Pothong | Machi Shimmei | Naoya Inoue | Paul Reisert | Ana Brassard | Wenzhi Wang | Shoichi Naito | Jungmin Choi | Kentaro Inui

We present FOCUS, a benchmark and task setting for Socratic question generation that delivers more informative and targeted feedback to learners. Unlike prior datasets, which rely on broad typologies and lack grounding in the source text, FOCUS introduces a new formulation: each Socratic question is paired with a fine-grained, 11-type typology and an explicit source span from the argument it targets. This design supports clearer, more actionable feedback and facilitates interpretable model evaluation. FOCUS includes 440 annotated instances with moderate partial-match agreement, establishing it as a reliable benchmark. Baseline experiments with representative state-of-the-art models reveal, through detailed error analysis, that even strong models struggle with span selection and context-sensitive categories. An extension study on the LogicClimate dataset further confirms the generalizability of the task and annotation framework. FOCUS sets a new standard for pedagogically grounded and informative Socratic question generation.

pdf bib
MedPath: Multi-Domain Cross-Vocabulary Hierarchical Paths for Biomedical Entity Linking
Nishant Mishra | Wilker Aziz | Iacer Calixto

Progress in biomedical Named Entity Recognition (NER) and Entity Linking (EL) is currently hindered by a fragmented data landscape, a lack of resources for building explainable models, and the limitations of semantically-blind evaluation metrics. To address these challenges, we present MedPath, a large-scale and multi-domain biomedical EL dataset that builds upon nine existing expert-annotated EL datasets. In MedPath, all entities are 1) normalized using the latest version of the Unified Medical Language System (UMLS), 2) augmented with mappings to 62 other biomedical vocabularies and, crucially, 3) enriched with full ontological paths—i.e., from general to specific—in up to 11 biomedical vocabularies. MedPath directly enables new research frontiers in biomedical NLP, facilitating training and evaluation of semantic-rich and interpretable EL systems, and the development of the next generation of interoperable and explainable clinical NLP models.

pdf bib
AURA-QG: Automated Unsupervised Replicable Assessment for Question Generation
Rajshekar K | Harshad Khadilkar | Pushpak Bhattacharyya

Question Generation (QG) is central to information retrieval, education, and knowledge assessment, yet its progress is bottlenecked by unreliable and non-scalable evaluation practices. Traditional metrics fall short in structured settings like document-grounded QG, and human evaluation, while insightful, remains expensive, inconsistent, and difficult to replicate at scale. We introduce AURA-QG: an Automated, Unsupervised, Replicable Assessment pipeline that scores question sets using only the source document. It captures four orthogonal dimensions, i.e., answerability, non-redundancy, coverage, and structural entropy, without needing reference questions or relative baselines. Our method is modular, efficient, and agnostic to the question generation strategy. Through extensive experiments across four domains, i.e., car manuals, economic surveys, health brochures, and fiction, we demonstrate its robustness across input granularities and prompting paradigms. Chain-of-Thought prompting, which first extracts answer spans and then generates targeted questions, consistently yields higher answerability and coverage, validating the pipeline’s fidelity. The metrics also exhibit strong agreement with human judgments, reinforcing their reliability for practical adoption. The complete implementation of our evaluation pipeline is publicly available.
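
The abstract does not define its four dimensions formally; as one plausible proxy for the non-redundancy dimension, the sketch below scores a question set by one minus the mean pairwise cosine similarity of question embeddings. All details are assumptions, not the AURA-QG implementation.

```python
# Hedged sketch: a non-redundancy-style score over question embeddings.
import numpy as np

def non_redundancy(question_embs):
    """1 minus the mean pairwise cosine similarity among generated questions:
    higher means the question set covers more distinct content."""
    q = question_embs / np.linalg.norm(question_embs, axis=1, keepdims=True)
    sims = q @ q.T
    n = len(q)
    off_diag = sims[~np.eye(n, dtype=bool)]
    return float(1.0 - off_diag.mean())

rng = np.random.default_rng(0)
print(non_redundancy(rng.normal(size=(6, 32))))   # toy embeddings for 6 questions
```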

pdf bib
EFSA-CLC: Enhancing Zero-shot Entity-level Financial Sentiment Analysis with Cross-lingual Collaboration
Senbin Zhu | Hongde Liu | Chenyuan He | Yuxiang Jia

Entity-level sentiment analysis is becoming increasingly important in the context of diverse financial texts, and large language models demonstrate significant potential under zero-shot settings. While it is well recognized that different languages embody distinct cognitive patterns, the use of multilingual capabilities in large language models to enable cross-lingual collaborative reasoning in the financial domain remains insufficiently studied. To address this, we propose a Cross-Lingual Collaboration (CLC) method: first, financial texts are aligned from one language to another based on semantic and syntactic structures, enabling the model to capture complementary linguistic features. Then, we integrate sentiment analysis results from both languages through redundancy removal and conflict resolution, enhancing the effectiveness of cross-lingual collaboration. Our experiments cover seven languages from three language families, including six UN official languages, and evaluate CLC on two English datasets and one Chinese dataset. Results show that multilingual collaboration improves sentiment analysis accuracy, especially among linguistically similar languages. Furthermore, stronger reasoning capabilities in LLMs amplify these benefits. Our code is available at https://anonymous.4open.science/r/Cross-lingual-Collaboration.

pdf bib
Enhancing Low-Resource Text Classification with LLM-Generated Corpora : A Case Study on Olfactory Reference Extraction
Cédric Boscher | Shannon Bruderer | Christine Largeron | Véronique Eglin | Elöd Egyed-Zsigmond

Extracting sensory information from text, particularly olfactory references, is challenging due to limited annotated datasets and the implicit, subjective nature of sensory experiences. This study investigates whether GPT-4o-generated data can complement or replace human annotations. We evaluate human- and LLM-labeled corpora on two tasks: coarse-grained detection of olfactory content and fine-grained sensory term extraction. Despite lexical variation, generated texts align well with real data in semantic and sensorimotor embedding spaces. Models trained on synthetic data perform strongly, especially in low-resource settings. Human annotations offer better recall by capturing implicit and diverse aspects of sensoriality, while GPT-4o annotations show higher precision through clearer pattern alignment. Data augmentation experiments confirm the utility of synthetic data, though trade-offs remain between label consistency and lexical diversity. These findings support using synthetic data to enhance sensory information mining when annotated data is limited.

pdf bib
Learning from *Sufficient* Rationales: Analysing the Relationship Between Explanation Faithfulness and Token-level Regularisation Strategies
Jonathan Kamp | Lisa Beinborn | Antske Fokkens

Human explanations of natural language, *rationales*, form a tool to assess whether models learn a label *for the right reasons* or rely on dataset-specific shortcuts. *Sufficiency* is a common metric for estimating the informativeness of rationales, but it provides limited insight into the effects of rationale information on model performance. We address this limitation by relating sufficiency to two modelling paradigms: the ability of models to identify which tokens are part of the rationale (through token classification) and the ability of improving model performance by incorporating rationales in the input (through attention regularisation). We find that highly informative rationales are not likely to help classify the instance correctly. Sufficiency conversely captures the classification impact of the non-rationalised context, which interferes with rationale information in the same input. We also find that incorporating rationale information in model inputs can boost cross-domain classification, but results are inconsistent per task and model type. Finally, sufficiency and token classification appear to be unrelated. These results exemplify the complexity of rationales, showing that metrics capable of systematically capturing this type of information merit further investigation.

pdf bib
ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation
Xiao Wang | Daniil Larionov | Siwei Wu | Yiqi Liu | Steffen Eger | Nafise Sadat Moosavi | Chenghua Lin

Recent advances in automatic evaluation of natural language generation have increasingly relied on large language models as general-purpose metrics. While effective, these approaches often require high-capacity models, which introduce substantial computational costs, and remain susceptible to known evaluation pathologies, such as over-reliance on likelihood. We introduce ContrastScore, a contrastive evaluation paradigm that builds on the widely used BARTScore formulation by comparing token-level probabilities between a stronger and a weaker model. Instead of relying on single-model likelihoods or prompt-based judgments, ContrastScore captures disagreement between models to better reflect confidence and uncertainty in generation quality. Empirical results on summarization and machine translation benchmarks show that ContrastScore, instantiated with paired moderate-scale models across both Qwen and LLaMA families, consistently outperforms larger alternatives, such as Qwen 7B and LLaMA 8B, in correlation with human ratings. In addition to improving evaluation quality, ContrastScore significantly reduces susceptibility to likelihood bias, offering a more robust and cost-effective alternative to larger LLM-based evaluation methods.
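
A minimal sketch of the contrastive idea described above: score a candidate by the gap between a stronger and a weaker scorer's token-level log-probabilities, so that tokens the strong model prefers much more than the weak model raise the score. The averaging scheme and toy values are assumptions rather than the exact ContrastScore formulation.

```python
# Hedged sketch: token-level contrast between a stronger and a weaker scorer.
def contrast_score(strong_logprobs, weak_logprobs):
    """Average per-token difference of log-probabilities under two scorers."""
    assert len(strong_logprobs) == len(weak_logprobs)
    diffs = [s - w for s, w in zip(strong_logprobs, weak_logprobs)]
    return sum(diffs) / len(diffs)

# toy per-token log-probs for one candidate summary (hypothetical values)
strong = [-0.2, -0.5, -0.1, -1.0]
weak   = [-0.8, -0.9, -0.6, -1.2]
print(round(contrast_score(strong, weak), 3))
```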

pdf bib
ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance
Wissam Antoun | Benoît Sagot | Djamé Seddah

Pretrained transformer-encoder models like DeBERTaV3 and ModernBERT introduce architectural advancements aimed at improving efficiency and performance. Although the authors of ModernBERT report improved performance over DeBERTaV3 on several benchmarks, the lack of disclosed training data and the absence of comparisons using a shared dataset make it difficult to determine whether these gains are due to architectural improvements or differences in training data. In this work, we conduct a controlled study by pretraining ModernBERT on the same dataset as CamemBERTaV2, a DeBERTaV3 French model, isolating the effect of model design. Our results show that the previous model generation remains superior in sample efficiency and overall benchmark performance, with ModernBERT’s primary advantage being its support for long context, faster training, and inference speed. However, the new proposed model still provides meaningful architectural improvements compared to earlier models such as BERT and RoBERTa. Additionally, we observe that high-quality pre-training data accelerates convergence but does not significantly improve final performance, suggesting potential benchmark saturation. These findings show the importance of disentangling pretraining data from architectural innovations when evaluating transformer models.

pdf bib
Simplified Rewriting Improves Expert Summarization
Xingmeng Zhao | Tongnian Wang | Anthony Rios

Radiology report summarization (RRS) is critical for clinical workflows, requiring concise “Impressions” distilled from detailed “Findings.” This paper proposes a novel prompting strategy that enhances RRS by introducing a layperson summary as an intermediate step. This summary helps normalize key observations and simplify complex terminology using communication techniques inspired by doctor–patient interactions. Combined with few-shot in-context learning, this approach improves the model’s ability to map generalized descriptions to specific clinical findings. We evaluate our method on three benchmark datasets, MIMIC-CXR, CheXpert, and MIMIC-III, and compare it against state-of-the-art open-source language models in the 7B/8B parameter range, such as Llama-3.1-8B-Instruct. Results show consistent improvements in summarization quality, with gains of up to 5% on some metrics from prompting alone, and more than 20% for some models with instruction tuning.

pdf bib
RASTeR: Robust, Agentic, and Structured Temporal Reasoning
Dan Schumacher | Fatemeh Haji | Tara Grey | Niharika Bandlamudi | Nupoor Karnik | Gagana Uday Kumar | Cho-Yu Jason Chiang | Peyman Najafirad | Nishant Vishwamitra | Anthony Rios

Temporal question answering (TQA) remains a persistent challenge for large language models (LLMs), particularly in retrieval-augmented generation (RAG) settings where retrieved content may be irrelevant, outdated, or temporally inconsistent. This is especially critical in applications like clinical event ordering, policy tracking, and real-time decision-making, which require reliable temporal reasoning even under noisy or misleading context. To address this challenge, we introduce RASTeR: Robust, Agentic, and Structured Temporal Reasoning, an agentic prompting framework that separates context evaluation from answer generation. RASTeR first assesses the relevance and temporal coherence of retrieved context, then constructs a structured temporal knowledge graph (TKG) to better facilitate reasoning. When inconsistencies are detected, RASTeR selectively corrects or discards context before generating an answer. Across multiple datasets and LLMs, RASTeR consistently improves robustness, defined here as the model’s ability to generate correct predictions despite suboptimal context. We further validate our approach through a “needle-in-the-haystack” study, in which relevant context is buried among irrelevant distractors. Even with forty distractors, RASTeR achieves 75% accuracy, compared to the runner-up model, which reaches only 62%.

pdf bib
ReasoningWeekly: A General Knowledge and Verbal Reasoning Challenge for Large Language Models
Zixuan Wu | Francesca Lucchetti | Aleksander Boruch-Gruszecki | Jingmiao Zhao | Carolyn Jane Anderson | Joydeep Biswas | Federico Cassano | Arjun Guha

Existing benchmarks for frontier models often test specialized, “PhD-level” knowledge that is difficult for non-experts to grasp. In contrast, we present a benchmark with 613 problems based on the NPR Sunday Puzzle Challenge that require only general knowledge. Our benchmark is challenging for both humans and models; however, correct solutions are easy to verify, and models’ mistakes are easy to spot. As LLMs are more widely deployed in society, we believe it is useful to develop benchmarks for frontier models that humans can understand without the need for deep domain expertise. Our work reveals capability gaps that are not evident in existing benchmarks: OpenAI o1 significantly outperforms other reasoning models on our benchmark, despite being on par with other models when tested on benchmarks that test specialized knowledge. Furthermore, our analysis of reasoning outputs uncovers new kinds of failures. DeepSeek R1, for instance, often concedes with “I give up” before providing an answer that it knows is wrong. R1 can also be remarkably “uncertain” in its output and in rare cases, it does not “finish thinking,” which suggests the need for techniques to “wrap up” before the context window limit is reached. We also quantify the effectiveness of reasoning longer to identify the point beyond which more reasoning is unlikely to improve accuracy on our benchmark.

pdf bib
Noise May Drown Out Words but Foster Compositionality: The Advantage of the Erasure and Deletion Noisy Channels on Emergent Communication
Cezary Klamra | Francijn Keur | Raquel G. Alhama

We investigate communication emerging in noisy environments with the goal of capturing the impact of message disruption on the emerged protocols. We implement two different noise mechanisms, inspired by the erasure and deletion channels studied in information theory, and simulate a referential game in a neural agent-based model with a variable message length channel. We leverage a stochastic evaluation setting to apply noise only after a message is sampled, which adds ecological validity and allows us to estimate information-theoretic measures of the emerged protocol directly from symbol probabilities. Contrary to our expectations, the emerged protocols do not become more redundant in the presence of noise; instead, we observe that certain levels of noise encourage the sender to produce more compositional messages, although the impact varies depending on the type of noise and input representation.

pdf bib
Zero-Shot Grammar Competency Estimation Using Large Language Model Generated Pseudo Labels
Sourya Dipta Das | Shubham Kumar | Kuldeep Yadav

Grammar competency estimation is essential for assessing linguistic proficiency in both written and spoken language; however, the spoken modality presents additional challenges due to its spontaneous, unstructured, and disfluent nature. Developing accurate grammar scoring models further requires extensive expert annotation, making large-scale data creation impractical. To address these limitations, we propose a zero-shot grammar competency estimation framework that leverages unlabeled data and Large Language Models (LLMs) without relying on manual labels. During training, we employ LLM-generated predictions on unlabeled data by using grammar competency rubric-based prompts. These predictions, treated as pseudo labels, are utilized to train a transformer-based model through a novel training framework designed to handle label noise effectively. We show that the choice of LLM for pseudo-label generation critically affects model performance and that the ratio of clean-to-noisy samples during training strongly influences stability and accuracy. Finally, a qualitative analysis of error intensity and score prediction confirms the robustness and interpretability of our approach. Experimental results demonstrate the efficacy of our approach in estimating grammar competency scores with high accuracy, paving the way for scalable, low-resource grammar assessment systems.

pdf bib
LLM-Guided Lifecycle-Aware Clustering of Multi-Turn Customer Support Conversations
Priyaranjan Pattnayak | Sanchari Chowdhuri | Amit Agarwal | Hitesh Laxmichand Patel

Clustering customer chat data is vital for cloud providers handling multi-service queries. Traditional methods struggle with overlapping concerns and create broad, static clusters that degrade over time. Re-clustering disrupts continuity, making issue tracking difficult. We propose an adaptive system that segments multi-turn chats into service-specific concerns and incrementally refines clusters as new issues arise. Cluster quality is tracked via Davies–Bouldin Index (DBI) and Silhouette Scores, with LLM-based splitting applied only to degraded clusters. Our method improves Silhouette Scores by over 100% and reduces DBI by 65.6% compared to baselines, enabling scalable, real-time analytics without full re-clustering.
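
A minimal sketch of the cluster-health tracking step described above, using scikit-learn's Davies–Bouldin and Silhouette metrics to flag only degraded clusters for downstream LLM-based splitting; the silhouette threshold and the split hook are illustrative assumptions.

```python
# Sketch: track cluster quality and flag only degraded clusters for (LLM-based) splitting.
# The silhouette floor and the downstream splitting step are illustrative assumptions.
import numpy as np
from sklearn.metrics import davies_bouldin_score, silhouette_score, silhouette_samples

def degraded_clusters(embeddings: np.ndarray, labels: np.ndarray,
                      silhouette_floor: float = 0.1) -> list[int]:
    """Return cluster ids whose mean per-sample silhouette falls below a floor."""
    per_sample = silhouette_samples(embeddings, labels)
    return [int(c) for c in np.unique(labels) if per_sample[labels == c].mean() < silhouette_floor]

def quality_report(embeddings: np.ndarray, labels: np.ndarray) -> dict:
    return {
        "davies_bouldin": davies_bouldin_score(embeddings, labels),  # lower is better
        "silhouette": silhouette_score(embeddings, labels),          # higher is better
        "to_split": degraded_clusters(embeddings, labels),           # only these go to the LLM
    }
```

Only the clusters returned in `to_split` would then be handed to an LLM for refinement, avoiding full re-clustering.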

pdf bib
Cascaded Information Disclosure for Generalized Evaluation of Problem Solving Capabilities
Yunxiang Yan | Tomohiro Sawada | Kartik Goyal

While question-answering (QA) benchmark performance is an automatic and scalable method to compare LLMs, it is an indirect method of evaluating their underlying problem-solving capabilities. Therefore, we propose a holistic and generalizable framework based on **cascaded question disclosure** that provides a more accurate estimate of the models’ problem-solving capabilities while maintaining scalability and automation. This approach collects model responses in a stagewise manner, with each stage revealing partial information about the question, designed to elicit generalized reasoning in LLMs. We find that our approach not only provides a better comparison between LLMs, but also induces better intermediate traces in models compared to the standard QA paradigm. We empirically verify this behavior on diverse reasoning and knowledge-heavy QA datasets by comparing LLMs of varying sizes and families. Our approach narrows the performance gap observed in the standard QA evaluation settings, indicating that the prevalent indirect QA paradigm of evaluation overestimates the differences in performance between models. We further validate our findings through extensive ablation studies.
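
The stagewise collection can be sketched as a simple loop; the staging scheme and the `ask_model` hook below are illustrative assumptions rather than the paper's exact protocol.

```python
# Illustrative sketch of cascaded question disclosure: the question is split into stages
# and the model is queried after each partial reveal, so intermediate responses can be
# inspected. The staging scheme and `ask_model` are assumptions for illustration.
from typing import Callable

def cascaded_disclosure(stages: list[str], ask_model: Callable[[str], str]) -> list[str]:
    """Query the model once per stage, each time with all information revealed so far."""
    revealed, responses = [], []
    for stage in stages:
        revealed.append(stage)
        prompt = "\n".join(revealed) + "\nAnswer with your best current solution."
        responses.append(ask_model(prompt))
    # The last entry sees the full question; earlier entries are partial-information traces.
    return responses
```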

pdf bib
Minority-Aware Satisfaction Estimation in Dialogue Systems via Preference-Adaptive Reinforcement Learning
Yahui Fu | Zi Haur Pang | Tatsuya Kawahara

User satisfaction in dialogue systems is inherently subjective. When the same response strategy is applied across users, minority users may assign different satisfaction ratings than majority users due to variations in individual intents and preferences. However, existing alignment methods typically train one-size-fits-all models that aim for broad consensus, often overlooking minority perspectives and user-specific adaptation. We propose a unified framework that models both individual- and group-level preferences for user satisfaction estimation. First, we introduce Chain-of-Personalized-Reasoning (CoPeR) to capture individual preferences through interpretable reasoning chains. Second, we propose an expectation-maximization-based Majority-Minority Preference-Aware Clustering (M²PC) algorithm that discovers distinct user groups in an unsupervised manner to learn group-level preferences. Finally, we integrate these components into a preference-adaptive reinforcement learning framework (PAda-PPO) that jointly optimizes alignment with both individual and group preferences. Experiments on the Emotional Support Conversation dataset demonstrate consistent improvements in user satisfaction estimation, particularly for underrepresented user groups.

pdf bib
The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1
Kaiwen Zhou | Chengzhi Liu | Xuandong Zhao | Shreedhar Jangam | Jayanth Srinivasa | Gaowen Liu | Dawn Song | Xin Eric Wang

The rapid development of large reasoning models (LRMs), such as OpenAI-o3 and DeepSeek-R1, has led to significant improvements in complex reasoning over non-reasoning large language models (LLMs). However, their enhanced capabilities, combined with the open-source access of models like DeepSeek-R1, raise serious safety concerns, particularly regarding their potential for misuse. In this work, we present a comprehensive safety assessment of these reasoning models, leveraging established safety benchmarks to evaluate their compliance with safety regulations. Furthermore, we investigate their susceptibility to adversarial attacks, such as jailbreaking and prompt injection, to assess their robustness in real-world applications. Through our multi-faceted analysis, we uncover four key findings: (1) There is a significant safety gap between the open-source reasoning models and the o3-mini model, on both safety benchmarks and attacks, suggesting that more safety effort is needed for open LRMs. (2) The distilled reasoning model shows poorer safety performance compared to its safety-aligned base models. (3) The stronger the model’s reasoning ability, the greater the potential harm it may cause when answering unsafe questions. (4) The thinking process in R1 models poses greater safety concerns than their final answers. Our study provides insights into the security implications of reasoning models and highlights the need for further advancements in R1 models’ safety to close the gap.

pdf bib
Agnus LLM: Robust and Flexible Entity Disambiguation with decoder-only Language Models
Kristian Noullet | Ayoub Ourgani | Niklas Thomas Lakner | Lukas Kinder | Tobias Käfer

Entity disambiguation (ED) links ambiguous mentions in text to entries in a knowledge base and is a core task in entity linking systems. While pretrained decoder-only language models (DLMs) offer strong generalization capabilities, their effective use in ED has been restricted due to sensitivity to candidate order, susceptibility to hallucinated outputs, and potential dataset leakage. We introduce Agnus, a zero-shot ED framework that addresses these challenges through three core innovations: (1) order-invariant candidate encoding via shared positional embeddings and modified autoregressive attention masking, which eliminates bias from input ordering; (2) constrained decoding that ensures outputs are restricted to valid candidates, effectively preventing hallucinations; and (3) a synthetic dataset creation approach that serves as a diagnostic tool for data contamination detection and mitigation. Agnus eliminates up to 15.2% of F1 variability caused by candidate permutations, delivering consistent and order-robust predictions previously unattainable with autoregressive architectures. In our experiments, Agnus achieves state-of-the-art performance on four standard ED benchmarks, surpassing prior zero-shot approaches by an average of 3.7% using small language models. We release code, data including candidate sets, and a synthetic benchmark to support reproducibility and controlled evaluation.

pdf bib
MuSciClaims: Multimodal Scientific Claim Verification
Yash Kumar Lal | Manikanta Bandham | Mohammad Saqib Hasan | Apoorva Kashi | Mahnaz Koupaee | Niranjan Balasubramanian

Assessing scientific claims requires identifying, extracting, and reasoning with multimodal data expressed in information-rich figures in scientific literature. Despite the large body of work in scientific QA, figure captioning, and other multimodal reasoning tasks over chart-based data, there are no readily usable multimodal benchmarks that directly test claim verification abilities. To remedy this gap, we introduce a new benchmark, MuSciClaims, accompanied by diagnostic tasks. We automatically extract supported claims from scientific articles, which we manually perturb to produce contradicted claims. The perturbations are designed to test for a specific set of claim verification capabilities. We also introduce a suite of diagnostic tasks that help understand model failures. Our results show most vision-language models perform poorly (~0.3–0.5 F1), with even the best model only achieving 0.72 F1. They are also biased towards judging claims as supported, likely misunderstanding nuanced perturbations within the claims. Our diagnostics show models are bad at localizing correct evidence within figures, struggle with aggregating information across modalities, and often fail to understand basic components of the figure.

pdf bib
Program Synthesis Dialog Agents for Interactive Decision-Making
Matthew Toles | Nikhil Balwani | Rattandeep Singh | Valentina Giulia Sartori Rodriguez | Zhou Yu

Many real-world eligibility problems, ranging from medical diagnosis to tax planning, can be mapped to decision problems expressed in natural language, wherein a model must make a binary choice based on the features of the user. Large-scale domains such as legal codes or frequently updated funding opportunities render human annotation (e.g., web forms or decision trees) impractical, suggesting a need for agents that can automatically assist in decision-making. Since relevant information is often only known to the user, it is important that these agents can ask the right questions. To evaluate this task, we propose BeNYfits, a new benchmark for determining user eligibility for multiple overlapping social benefits opportunities through interactive decision-making. Our experiments show that current language models struggle with frequent hallucinations, with GPT-4o scoring only 35.7 F1 using a ReAct-style chain-of-thought. We therefore introduce ProADA, a novel approach that uses program synthesis to assist in decision-making by mapping dialog planning to a code generation problem and using gaps in structured data to determine the best next action. Our agent, ProADA, improves the F1 score to 56.2 while using nearly the same number of dialog turns.

pdf bib
Learning a Continue-Thinking Token for Enhanced Test-Time Scaling
Liran Ringel | Elad Tolochinsky | Yaniv Romano

Test-time scaling has emerged as an effective approach for improving language model performance by utilizing additional compute at inference time. Recent studies have shown that overriding end-of-thinking tokens (e.g., replacing “</think>” with “Wait”) can extend reasoning steps and improve accuracy. In this work, we explore whether a dedicated continue-thinking token can be learned to trigger extended reasoning. We augment distilled versions of DeepSeek-R1 with a single learned “<|continue-thinking|>” token, training only its embedding via reinforcement learning while keeping the model weights frozen. Our experiments show that this learned token achieves improved accuracy on standard math benchmarks compared to both the baseline model and a test-time scaling approach that uses a fixed token (e.g., “Wait”) for budget forcing. In particular, we observe that in cases where the fixed-token approach enhances the base model’s accuracy, our method achieves a markedly greater improvement. For example, on the GSM8K benchmark, the fixed-token approach yields a 1.3% absolute improvement in accuracy, whereas our learned-token method achieves a 4.2% improvement over the base model that does not use budget forcing.
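
A hedged sketch of the core training setup follows: a new token is added and only its embedding row receives gradient updates while every other weight stays frozen. The model name is a stand-in, and the paper's reinforcement-learning objective is replaced by a placeholder comment.

```python
# Sketch: learn only the embedding of a newly added continue-thinking token.
# The base model is a placeholder; the RL objective is replaced by a comment.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for a distilled reasoning model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

tok.add_special_tokens({"additional_special_tokens": ["<|continue-thinking|>"]})
model.resize_token_embeddings(len(tok))
new_id = tok.convert_tokens_to_ids("<|continue-thinking|>")

for p in model.parameters():          # freeze everything...
    p.requires_grad_(False)
emb = model.get_input_embeddings().weight
emb.requires_grad_(True)              # ...except the embedding matrix

def keep_only_new_row(grad):          # zero gradients for all rows but the new token's
    mask = torch.zeros_like(grad)
    mask[new_id] = 1.0
    return grad * mask

emb.register_hook(keep_only_new_row)
optimizer = torch.optim.Adam([emb], lr=1e-3)
# Training loop (placeholder objective): compute a reward-weighted or supervised loss on
# sequences containing <|continue-thinking|>, call loss.backward(), then optimizer.step();
# only the new token's embedding row changes.
```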

pdf bib
Video-guided Machine Translation: A Survey of Models, Datasets, and Challenges
Pinaki Das | Virendra Singh | Pushpak Bhattacharyya | Gholamreza Haffari

In recent years, machine translation has evolved with the integration of multimodal information. Infusing multimodality into translation tasks reduces ambiguity and improves translation scores. Common modalities include images, speech, and videos, which provide additional context alongside the text to be translated. While multimodal translation with images has been extensively studied, video-guided machine translation (VMT) has gained increasing attention, particularly since Wang et al. (2019) first explored this task. In this paper, we provide a comprehensive overview of VMT, highlighting its unique challenges, methodologies, and recent advancements. Unlike previous surveys that primarily focus on image-guided multimodal machine translation, this work explores the distinct complexities and opportunities introduced by adding video as a modality to the translation task.

pdf bib
ProST: Progressive Sub-task Training for Pareto-Optimal Multi-agent Systems Using Small Language Models
Biddut Sarker Bijoy | Mohammad Saqib Hasan | Pegah Alipoormolabashi | Avirup Sil | Aruna Balasubramanian | Niranjan Balasubramanian

Multi-agent systems with smaller language models (SLMs) present a viable alternative to single agent systems powered by large language models (LLMs) for addressing complex problems. In this work, we study how these alternatives compare in terms of both effectiveness and efficiency. To study this trade-off, we instantiate single and multi-agent systems for the complex problems in the AppWorld environment using different-sized language models. We find that difficulties with long-trajectory learning in smaller language models (SLMs) limit their performance. Even when trained for specialized roles, SLMs fail to learn all subtasks effectively. To address this issue, we introduce a simple progressive sub-task training strategy, which introduces new sub-tasks progressively in each training epoch. We find that this novel strategy, analogous to instance-level curriculum learning, consistently improves the effectiveness of multi-agents at all configurations. Our Pareto analysis shows that fine-tuned multi-agent systems yield better effectiveness-efficiency trade-offs. Additional ablations and analyses show the importance of our progressive training strategy and its ability to reduce subtask error rates.
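
The progressive schedule can be sketched as a small curriculum loop; the data layout and the `train_one_epoch` hook are assumptions for illustration, not the released ProST code.

```python
# Sketch of a progressive sub-task schedule: each epoch adds the next sub-task's
# examples to the training pool, so early epochs focus on fewer sub-tasks.
from typing import Callable, Sequence

def progressive_subtask_training(subtask_datasets: Sequence[list],
                                 train_one_epoch: Callable[[list, int], None],
                                 extra_epochs: int = 0) -> None:
    pool: list = []
    n_epochs = len(subtask_datasets) + extra_epochs
    for epoch in range(n_epochs):
        if epoch < len(subtask_datasets):
            pool = pool + subtask_datasets[epoch]   # introduce the next sub-task
        train_one_epoch(pool, epoch)                # train on everything seen so far
```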

pdf bib
Rethinking Large Language Model Architectures for Sequential Recommendations
Hanbing Wang | Xiaorui Liu | Wenqi Fan | Xiangyu Zhao | Venkataramana Kini | Devendra Pratap Yadav | Fei Wang | Zhen Wen | Hui Liu

In recent times, there has been a shift towards adapting sequential recommendation to the LLM paradigm to harness the capabilities of LLMs. These methods typically formulate recommendation data into natural language and train the model to forecast the subsequent item in an auto-regressive manner. Despite their notable success, the significant computational burden during inference poses a major challenge to their practical implementation. In this study, we aim to streamline current LLM-based recommendation models and introduce a straightforward yet highly effective model, Lite-LLM4Rec. The primary objective of Lite-LLM4Rec is to ensure efficient inference for the sequential recommendation task. Lite-LLM4Rec circumvents step-by-step beam search decoding by employing a direct item projection head to produce ranking scores in one step. This design arises from our empirical finding that beam search decoding is ultimately unnecessary for sequential recommendations. Additionally, Lite-LLM4Rec introduces a hierarchical LLM structure crafted to efficiently handle the extensive contextual information of items and the redundant computation issue, thus diminishing computational overhead while enjoying the power of LLMs. Experiments on four publicly available datasets validate the efficacy of Lite-LLM4Rec in enhancing both performance and inference efficiency (notably 46.8% performance improvement and 99.48% efficiency improvement on ML-1m) compared to existing LLM-based methods. Our implementations are available at: https://github.com/HanbingWang2001/Lite-LLM4Rec-PyTorch.
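
A minimal sketch of the direct item projection head idea: the sequence representation is mapped to per-item ranking scores in a single forward pass, with no beam search. Dimensions and catalogue size are placeholders, not the released Lite-LLM4Rec code.

```python
# Sketch: replace beam-search decoding with a one-step item projection head that scores
# every catalogue item from the hidden state of the user's interaction history.
import torch
import torch.nn as nn

class ItemProjectionHead(nn.Module):
    def __init__(self, hidden_size: int, num_items: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_items)

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # last_hidden: (batch, hidden_size) representation of the interaction history.
        return self.proj(last_hidden)  # (batch, num_items) ranking scores in one pass

head = ItemProjectionHead(hidden_size=4096, num_items=3706)   # placeholder sizes
scores = head(torch.randn(2, 4096))
top_k = scores.topk(k=10, dim=-1).indices                      # recommended item ids, no decoding
```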

pdf bib
DSBC : Data Science task Benchmarking with Context engineering
Ram Mohan Rao Kadiyala | Jebish Purbey | Siddhant Gupta | Giulio Martini | Suman Debnath | Hamza Farooq

Recent advances in large language models (LLMs) have significantly impacted data science workflows, giving rise to specialized data science agents designed to automate analytical tasks. Despite rapid adoption, systematic benchmarks evaluating the efficacy and limitations of these agents remain scarce. In this paper, we introduce a comprehensive benchmark specifically crafted to reflect real-world user interactions with data science agents by observing usage of our commercial applications. We evaluate three LLMs: Claude-4.0-Sonnet, Gemini-2.5-Flash, and OpenAI-o4-Mini across three approaches: zero-shot with context engineering, multi-step with context engineering, and with SmolAgent. Our benchmark assesses performance across a diverse set of eight data science task categories, additionally exploring the sensitivity of models to common prompting issues, such as data leakage and slightly ambiguous instructions. We further investigate the influence of temperature parameters on overall and task-specific outcomes for each model and approach. Our findings reveal distinct performance disparities among the evaluated models and methodologies, highlighting critical factors that affect practical deployment. The benchmark dataset and evaluation framework introduced herein aim to provide a foundation for future research on more robust and effective data science agents.

pdf bib
Improving Document Retrieval Coherence for Semantically Equivalent Queries
Stefano Campese | Alessandro Moschitti | Ivano Lauriola

Dense Retrieval (DR) models have proven to be effective for Document Retrieval and Information Grounding tasks. Usually, these models are trained and optimized for improving the relevance of top-ranked documents for a given query. Previous work has shown that popular DR models are sensitive to the query and document lexicon: small variations in it may lead to a significant difference in the set of retrieved documents. In this paper, we propose a variation of the Multi-Negative Ranking loss for training DR models that improves their coherence in retrieving the same documents for semantically similar queries. The loss penalizes discrepancies between the top-k ranked documents retrieved for diverse but semantically equivalent queries. We conducted extensive experiments on various datasets: MS-MARCO, Natural Questions, BEIR, and TREC DL 19/20. The results show that models optimized by our loss exhibit (i) lower sensitivity and, (ii) interestingly, higher accuracy.
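
One plausible instantiation of such a coherence-aware objective is sketched below: a standard in-batch multiple-negatives ranking loss plus a KL penalty between the score distributions produced by a query and its paraphrase. This is an illustrative assumption, not the paper's exact loss.

```python
# Sketch: multiple-negatives ranking loss with a coherence penalty that encourages a
# query and its paraphrase to produce similar score distributions over the same documents.
import torch
import torch.nn.functional as F

def coherent_mnr_loss(q: torch.Tensor, q_para: torch.Tensor, docs: torch.Tensor,
                      alpha: float = 0.1, temperature: float = 0.05) -> torch.Tensor:
    """q, q_para: (B, d) embeddings of a query and a paraphrase; docs: (B, d) positives.
    In-batch documents act as negatives for one another."""
    scores = q @ docs.T / temperature            # (B, B) query-document similarities
    scores_para = q_para @ docs.T / temperature
    labels = torch.arange(q.size(0), device=q.device)
    ranking = F.cross_entropy(scores, labels) + F.cross_entropy(scores_para, labels)
    # Coherence term: penalize disagreement between the two queries' score distributions.
    coherence = F.kl_div(F.log_softmax(scores_para, dim=-1),
                         F.softmax(scores, dim=-1), reduction="batchmean")
    return ranking + alpha * coherence
```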

pdf bib
Referring Expressions as a Lens into Spatial Language Grounding in Vision-Language Models
Akshar Tumu | Varad Shinde | Parisa Kordjamshidi

Spatial Reasoning is an important component of human cognition and is an area in which the latest Vision-language models (VLMs) show signs of difficulty. Current analyses often rely on image captioning and visual question answering tasks. In this work, we propose using the Referring Expression Comprehension task instead as a platform for the evaluation of spatial reasoning by VLMs. This platform provides the opportunity for a deeper analysis of spatial comprehension and grounding abilities when there is 1) ambiguity in object detection, 2) complex spatial expressions with a longer sentence structure and multiple spatial relations, and 3) expressions with negation (‘not’). In our analysis, we use task-specific architectures as well as large VLMs and highlight their strengths and weaknesses in dealing with these specific situations. While all these models face challenges with the task at hand, the relative behaviors depend on the underlying models and the specific categories of spatial semantics (topological, directional, proximal, etc.). Our results highlight these challenges and behaviors and provide insight into research gaps and future directions.

pdf bib
FINDR: A Fast Influential Data Selector for NL2Code Pretraining
Xinliang Frederick Zhang | Lu Wang

Pretraining on massive corpora has given rise to large language models (LLMs) with multi-task capabilities. However, real-world applications often require more specialized training, as is the case for NL2Code. We approach this specialization through the lens of data selection, i.e., identifying a subset of a large corpus that aligns with a desired target distribution—a challenge that remains under-explored within NL2Code. Existing methods are typically designed for selecting instruction-tuning data, and might not easily scale to large-scale code repositories; while methods for NL2Code do exist, they primarily rely on coarse heuristics—such as repo stars—for filtering. To bridge this gap, we propose FINDR, an efficient data selection method that extends logistic regression with feature-wise importance reweighting—making it, to our knowledge, the first fine-grained solution to NL2Code pretraining. Our method uses hashed n-grams and code-aware features to capture code-specific patterns, and then applies informative priors to reweight feature importance when computing influence scores. Extensive experiments on NL2Python and NL2SQL, with two model families, show that FINDR consistently outperforms strong baselines in both execution accuracy and token efficiency. Notably, pretraining on only 2% of FINDR-selected data boosts Gemma by over 29% in both domains, even surpassing CodeGemma (pretrained on 300x more examples) by 10% in Python.
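
A simplified sketch of the logistic-regression backbone with hashed n-gram features is shown below; the feature-wise importance reweighting and code-aware features that distinguish FINDR are omitted, and the data variables are placeholders.

```python
# Sketch: classifier-based data selection with hashed n-gram features. A logistic
# regression separates target-domain examples from general text, then scores a
# candidate pool; higher scores indicate more target-like examples.
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

target_examples = ["SELECT name FROM users WHERE age > 30", "def add(a, b): return a + b"]
general_examples = ["The weather was pleasant yesterday.", "Stocks rallied after the report."]
candidate_pool = ["def square(x): return x * x", "He walked to the store."]

vec = HashingVectorizer(analyzer="char_wb", ngram_range=(3, 5), n_features=2**18)
X = vec.transform(target_examples + general_examples)
y = np.array([1] * len(target_examples) + [0] * len(general_examples))

clf = LogisticRegression(max_iter=1000).fit(X, y)
influence = clf.predict_proba(vec.transform(candidate_pool))[:, 1]
selected = [c for c, _ in sorted(zip(candidate_pool, influence), key=lambda t: -t[1])][:1]
print(selected)  # the most target-like candidate
```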

pdf bib
Found in Translation: Measuring Multilingual LLM Consistency as Simple as Translate then Evaluate
Ashim Gupta | Maitrey Mehta | Zhichao Xu | Vivek Srikumar

Large language models (LLMs) provide detailed and impressive responses to queries in English. However, are they really consistent when responding to the same query in other languages? The popular way of evaluating the multilingual performance of LLMs requires expensive-to-collect annotated datasets. Further, evaluating tasks like open-ended generation, where multiple correct answers may exist, is nontrivial. Instead, we propose to evaluate the predictability of model responses across different languages. In this work, we propose a framework to evaluate LLMs’ cross-lingual consistency based on a simple Translate then Evaluate strategy. We instantiate this evaluation framework along two dimensions of consistency: information and empathy. Our results reveal pronounced inconsistencies in popular LLM responses across thirty languages, with severe performance deficits in certain language families and scripts, underscoring critical weaknesses in their multilingual capabilities. These findings necessitate cross-lingual evaluations that are consistent along multiple dimensions. We invite practitioners to use our framework for future multilingual LLM benchmarking.
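
The Translate then Evaluate strategy can be sketched as translating every response into a pivot language and scoring their agreement with sentence embeddings; the `translate` hook, embedding model, and similarity choice below are illustrative assumptions, not the paper's released pipeline.

```python
# Sketch: translate responses into a pivot language, then score cross-lingual
# consistency as average pairwise embedding similarity of the translations.
from typing import Callable
from sentence_transformers import SentenceTransformer, util

def consistency_score(responses_by_lang: dict[str, str],
                      translate: Callable[[str, str], str],
                      pivot: str = "en") -> float:
    """Average pairwise cosine similarity of pivot-language translations of the responses."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    texts = [resp if lang == pivot else translate(resp, pivot)
             for lang, resp in responses_by_lang.items()]
    emb = model.encode(texts, convert_to_tensor=True)
    sims = util.cos_sim(emb, emb)
    n = len(texts)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return float(sum(sims[i, j] for i, j in pairs) / len(pairs))
```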

pdf bib
Do Persona-Infused LLMs Affect Performance in a Strategic Reasoning Game?
John Licato | Stephen Steinle

Although the use of persona prompting in large language models appears to trigger different styles of generated text, it is unclear whether these translate into measurable behavioral differences. Furthermore, little work has studied whether these differences, when they do exist, can affect decision-making in an adversarial strategic environment. We investigate the impact of persona prompting on strategic performance in PERIL, a world domination board game. Specifically, we compare the effectiveness of persona-derived heuristics to those chosen manually. Our findings reveal that personality traits intuitively associated with strategic thinking do appear to improve game performance, but only when an additional mediator is used to translate personas into heuristic values. We introduce this mediator as a structured translation process, inspired by exploratory factor analysis, that maps LLM-generated inventory responses into strategic heuristics. Results indicate our method enhances heuristic reliability and face validity when compared to directly inferred heuristics, allowing us to better study the effect of persona types on decision-making behaviors. These insights advance our understanding of how persona prompting influences LLM-based decision-making and propose a novel heuristic generation method that adds to the growing body of work applying psychometric principles to LLMs.

pdf bib
FarSense: A Comprehensive Commonsense Benchmark and Evaluation Framework for the Farsi Language
Kamyar Zeinalipour | Neda Jamshidi | Seyedehbahareh Hejazi | Marco Maggini | Monica Bianchini | Simone Paoletti | Marco Gori

Although Farsi is widely spoken, no comprehensive benchmark exists for assessing commonsense reasoning in language models. We therefore present FarSense, a 6‐task benchmark for Farsi covering True/False judgment, multiple-choice questions, Explanation, Cause‐Effect inference, Counterfactual reasoning, and Knowledge Completion. Starting from Farsi‐Wikipedia, we filtered noise and retained ~4,210 passages, rewrote them into realistic daily scenarios, and derived the above tasks from each scenario. Scenario and task generation quality was first judged via native‐speaker annotations on outputs from five major LLMs—GPT‐4o, Gemini-2.5-Flash, Mistral-Large, Qwen‐Plus, and DeepSeek‐Chat. Gemini-2.5-Flash demonstrated the highest performance, leading to its use in generating a large-scale dataset, subsequently finalized through meticulous two-step human validation. Using FarSense, we measured the commonsense ability of the same five flagship LLMs and also fine‐tuned six compact models (1B–24B parameters) before re‐evaluating them. To ensure broad applicability, task wording was designed to minimize dialectal, cultural, or religious bias. Experiments show that targeted fine‐tuning yields substantial gains, confirming FarSense as a reliable, openly licensed resource for advancing reproducible commonsense understanding research in Farsi NLP. We publicly release all code and data at https://github.com/KamyarZeinalipour/FarSense.

pdf bib
GL-CLiC: Global-Local Coherence and Lexical Complexity for Sentence-Level AI-Generated Text Detection
Rizky Adi | Bassamtiano Renaufalgi Irnawan | Yoshimi Suzuki | Fumiyo Fukumoto

Unlike document-level AI-generated text (AIGT) detection, sentence-level AIGT detection remains underexplored, despite its importance for addressing collaborative writing scenarios where humans modify AIGT suggestions on a sentence-by-sentence basis. Prior sentence-level detectors often neglect the valuable context surrounding the target sentence, which may contain crucial linguistic artifacts that indicate a potential change in authorship. We propose **GL-CLiC**, a novel technique that leverages both **G**lobal and **L**ocal signals of **C**oherence and **L**ex**i**cal **C**omplexity, which we operationalize through discourse analysis and CEFR-based vocabulary sophistication. **GL-CLiC** models local coherence and lexical complexity by examining a sentence’s relationship with its neighbors or peers, complemented with its document-wide analysis. Our experimental results show that **GL-CLiC** achieves superior performance and better generalization across domains compared to existing methods.

pdf bib
Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance
Ram Mohan Rao Kadiyala | Siddartha Pullakhandam | Siddhant Gupta | Jebish Purbey | Drishti Sharma | Kanwal Mehreen | Muhammad Arham | Suman Debnath | Hamza Farooq

Large Language Models (LLMs) have shown remarkable capabilities, but their development has primarily focused on English and other high-resource languages, leaving many languages underserved. We present our latest Hindi-English bilingual LLM with ~3% average improvement in benchmark scores over both languages, outperforming models twice its size. Using a curated dataset composed of English and Hindi instruction data of 485K samples, we instruction-tuned models such as Qwen-2.5-14B-Instruct and Phi-4 to improve performance over both English and Hindi. Our experiments encompassing seven different LLMs of varying parameter sizes and over 140 training attempts with varying English-Hindi training data ratios demonstrated that it is possible to significantly improve multilingual performance without compromising native performance. Further, our approach avoids resource-intensive techniques like vocabulary expansion or architectural modifications, thus keeping the model size small. Our results indicate that modest fine-tuning with culturally and locally informed data can bridge performance gaps without incurring significant computational overhead. We release our training code, datasets, and models under MIT and Apache licenses to aid further research towards under-represented and low-resource languages.

pdf bib
AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR
Gabrial Zencha Ashungafac | Mardhiyah Sanni | Busayo Awobade | Alex Gichamba | Tobi Olatunji

Recent advances in speech‐enabled AI, including Google’s NotebookLM and OpenAI’s speech-to-speech API, are driving widespread interest in voice interfaces across sectors such as finance, health, agritech, legal services, and call‐centers in the global north and south. Despite this momentum, there exists no publicly available application-specific model evaluation that caters to Africa’s linguistic diversity. We present AfriSpeech‑MultiBench, the first domain‐specific evaluation suite for over 100 African English accents across 10+ countries and seven application domains: Finance, Legal, Medical, General dialogue, Call Center, Named Entities, and Hallucination Robustness. We benchmark a diverse range of open, closed, unimodal ASR and multimodal LLM-based speech recognition systems using both spontaneous and non-spontaneous speech conversations drawn from various open African accented English speech datasets. Our empirical analysis reveals systematic variation: open‐source ASR excels in spontaneous speech contexts but degrades on noisy, non‐native dialogue; multimodal LLMs are more accent‐robust yet struggle with domain‐specific named entities; proprietary models deliver high accuracy on clean speech but vary significantly by country and domain. Smaller models fine‐tuned on African English achieve competitive accuracy with lower latency, a practical advantage for deployment. By releasing this benchmark, we empower practitioners and researchers to select voice technologies suited to African use‐cases, fostering inclusive voice applications for underserved communities.

pdf bib
An Offline Mobile Conversational Agent for Mental Health Support: Learning from Emotional Dialogues and Psychological Texts with Student-Centered Evaluation
Vimaleswar A | Prabhu Nandan Sahu | Nilesh Kumar Sahu | Haroon R Lone

Mental health plays a crucial role in the overall well-being of an individual. In recent years, digital platforms have increasingly been used to expand mental health and emotional support. However, there are persistent challenges related to limited user accessibility, internet connectivity, and data privacy, which highlight the need for an offline, smartphone-based solution. To address these challenges, we propose **EmoSApp (Emotional Support App)**: an entirely offline, smartphone-based conversational app designed to provide mental health and emotional support. EmoSApp leverages a language model, specifically LLaMA-3.2-1B-Instruct, which is fine-tuned and quantized on a custom-curated “Knowledge Dataset” comprising 14,582 mental health QA pairs along with multi-turn conversational data, enabling robust domain expertise and fully on-device inference on resource-constrained smartphones. Through qualitative evaluation with students and mental health professionals, we demonstrate that EmoSApp has the ability to respond coherently and empathetically, provide relevant suggestions to users’ mental health problems, and maintain interactive dialogue. Additionally, quantitative evaluations on nine commonsense and reasoning benchmarks, along with two mental-health-specific datasets, demonstrate EmoSApp’s effectiveness in low-resource settings. By prioritizing on-device deployment and specialized domain-specific adaptation, EmoSApp serves as a blueprint for future innovations in portable, secure, and highly tailored AI-driven mental health support.

pdf bib
Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective
Tejas Anvekar | Krishna Singh Rajput | Chitta Baral | Vivek Gupta

Recent advances in multimodal question answering have primarily focused on combining heterogeneous modalities or fine-tuning multimodal large language models. While these approaches have shown strong performance, they often rely on a single, generalized reasoning strategy, overlooking the unique characteristics of each modality, ultimately limiting both accuracy and interpretability. To address these limitations, we propose MAMMQA, a multi-agent QA framework for multimodal inputs spanning text, tables, and images. Our system includes two Visual Language Model (VLM) agents and one text-based Large Language Model (LLM) agent. The first VLM decomposes the user query into sub-questions and sequentially retrieves partial answers from each modality. The second VLM synthesizes and refines these results through cross-modal reasoning. Finally, the LLM integrates the insights into a cohesive answer. This modular design enhances interpretability by making the reasoning process transparent and allows each agent to operate within its domain of expertise. Experiments on diverse multimodal QA benchmarks demonstrate that our cooperative, multi-agent framework consistently outperforms existing baselines in both accuracy and robustness.

pdf bib
SurveyGen-I: Consistent Scientific Survey Generation with Evolving Plans and Memory-Guided Writing
Jing Chen | Zhiheng Yang | Yixian Shen | Jie Liu | Adam Belloum | Paola Grosso | Chrysa Papagianni

Survey papers play a critical role in scientific communication by consolidating progress across a field. Recent advances in Large Language Models (LLMs) offer a promising solution by automating key steps in the survey-generation pipeline, such as retrieval, structuring, and summarization. However, existing LLM-based approaches often struggle with maintaining coherence across long, multi-section surveys and providing comprehensive citation coverage. To address these limitations, we introduce SurveyGen-I, an automatic survey generation framework that combines coarse-to-fine retrieval, adaptive planning, and memory-guided generation. SurveyGen-I performs survey-level retrieval to construct the initial outline and writing plan, then dynamically refines both during generation through a memory mechanism that stores previously written content and terminology, ensuring coherence across subsections. When the system detects insufficient context, it triggers fine-grained subsection-level retrieval. Experiments across six scientific domains demonstrate that SurveyGen-I consistently outperforms previous works in content quality, consistency, and citation coverage. The code is available at https://github.com/SurveyGens/SurveyGen-I.

pdf bib
VAGUEGate: Plug‐and‐Play Local‐Privacy Shield for Retrieval‐Augmented Generation
Arshia Hemmat | Matin Moqadas | Ali Mamanpoosh | Amirmasoud Rismanchian | Afsaneh Fatemi

Retrieval-augmented generation (RAG) still *forwards* raw passages to large language models, so private facts slip through. Prior defenses are either (i) **heavyweight**—full DP training that is impractical for today’s 70B-parameter models—or (ii) **over-zealous**—blanket redaction of every named entity, which slashes answer quality. We introduce **VAGUE-Gate**, a lightweight, *locally* differentially-private gate deployable in front of *any* RAG system. A precision pass drops low-utility tokens under a user budget ε, then up to k(ε) high-temperature paraphrase passes further cloud residual cues; post-processing guarantees preserve the same ε-LDP bound. To measure both privacy and utility, we release **BlendPriv** (3k blended-sensitivity QA pairs) and two new metrics: a lexical Information-Leakage Score and an LLM-as-Judge score. Across eight pipelines and four SOTA LLMs, **VAGUE-Gate** at ε = 0.3 lowers lexical leakage by **70%** and semantic leakage by **1.8** points (1–5 scale) while retaining **91%** of Plain-RAG faithfulness with only a **240 ms** latency overhead. All code, data, and prompts are publicly released: code at <https://github.com/arshiahemmat/LDP_RAG> and dataset at <https://huggingface.co/datasets/AliMnp/BlendPriv>.

pdf bib
PII-Scope: A Comprehensive Study on Training Data Privacy Leakage in Pretrained LLMs
Krishna Kanth Nakka | Ahmed Frikha | Ricardo Mendes | Xue Jiang | Xuebing Zhou

In this work, we introduce PII-Scope, a comprehensive benchmark designed to evaluate state-of-the-art methodologies for PII extraction attacks targeting LLMs across diverse threat settings. Our study provides a deeper understanding of these attacks by uncovering several hyperparameters (e.g., demonstration selection) crucial to their effectiveness. Building on this understanding, we extend our study to more realistic attack scenarios, exploring PII attacks that employ advanced adversarial strategies, including repeated and diverse querying, and leveraging iterative learning for continual PII extraction. Through extensive experimentation, our results reveal a notable underestimation of PII leakage in existing single-query attacks. In fact, we show that with sophisticated adversarial capabilities and a limited query budget, PII extraction rates can increase by up to fivefold when targeting the pretrained model. Moreover, we evaluate PII leakage on finetuned models, showing that they are more vulnerable to leakage than pretrained models. Overall, our work establishes a rigorous empirical benchmark for PII extraction attacks in realistic threat scenarios and provides a strong foundation for developing effective mitigation strategies.

pdf bib
Indic-S2ST: a Multilingual and Multimodal Many-to-Many Indic Speech-to-Speech Translation Dataset
Nivedita Sethiya | Puneet Walia | Chandresh Kumar Maurya

Speech-to-Speech Translation (S2ST) converts speech from one language to speech in a different language. While various S2ST models exist, none adequately support Indic languages, primarily due to the lack of a suitable dataset. We fill this gap by introducing Indic-S2ST, a multilingual and multimodal many-to-many S2ST dataset of approximately 600 hours in 14 Indic languages, including Indian-accented English. To the best of our knowledge, this is the largest dataset for the S2ST task with parallel speech and text in 14 scheduled Indic languages. Our data also supports Automatic Speech Recognition (ASR), Text-to-Speech (TTS) synthesis, Speech-to-Text translation (ST), and Machine Translation (MT) due to parallel speech and text alignment. Thus, our data may be useful for training a model like Meta’s SeamlessM4T for Indic languages. We also propose Indic-S2UT, a discrete unit-based S2ST model for Indic languages. To showcase the utility of the data, we present baseline results on the Indic-S2ST data using Indic-S2UT. The dataset and codes are available at https://github.com/Nivedita5/Indic-S2ST/blob/main/README.md.