Computational Linguistics (2026)
Computational Linguistics, Volume 52, Issue 1 - March 2026
Truth or Mirage? Towards End-To-End Factuality Evaluation with LLM-Oasis
Alessandro Scirè | Andrei Stefan Bejgu | Simone Tedeschi | Karim Ghonim | Federico Martelli | Roberto Navigli
After the introduction of Large Language Models (LLMs), there have been substantial improvements in the performance of Natural Language Generation (NLG) tasks, including Text Summarization and Machine Translation. However, LLMs still produce outputs containing hallucinations, that is, content not grounded in factual information. Therefore, developing methods to assess the factuality of LLMs has become urgent. Indeed, resources for factuality evaluation have recently emerged. Although challenging, these resources face one or more of the following limitations: (i) they are tailored to a specific task or domain; (ii) they are limited in size, thereby preventing the training of new factuality evaluators; (iii) they are designed for simpler verification tasks, such as claim verification. To address these issues, we introduce LLM-Oasis, to the best of our knowledge the largest resource for training end-to-end factuality evaluators. LLM-Oasis is constructed by extracting claims from Wikipedia, falsifying a subset of these claims, and generating pairs of factual and unfactual texts. We then rely on human annotators to both validate the quality of our dataset and to create a gold standard test set for benchmarking factuality evaluation systems. Our experiments demonstrate that LLM-Oasis presents a significant challenge for state-of-the-art LLMs, with GPT-4o achieving up to 60% accuracy in our proposed end-to-end factuality evaluation task, highlighting its potential to drive future research in the field.
Are Formal and Functional Linguistic Mechanisms Dissociated in Language Models?
Michael Hanna | Yonatan Belinkov | Sandro Pezzelle
Although large language models (LLMs) are increasingly capable, these capabilities are unevenly distributed: They excel at formal linguistic tasks, such as producing fluent, grammatical text, but struggle more with functional linguistic tasks like reasoning and consistent fact retrieval. Inspired by neuroscience, recent work suggests that to succeed on both formal and functional linguistic tasks, LLMs should use different mechanisms for each; such localization could either be built-in or emerge spontaneously through training. In this article, we ask: do current models, with fast-improving functional linguistic abilities, exhibit distinct localization of formal and functional linguistic mechanisms? We answer this by finding and comparing the “circuits”, or minimal computational subgraphs, responsible for various formal and functional tasks. Comparing 5 LLMs across 10 distinct tasks, we find that while there is indeed little overlap between circuits for formal and functional tasks, there is also little overlap between formal linguistic tasks, as exists in the human brain. Thus, a single formal linguistic network, unified and distinct from functional task circuits, remains elusive. However, in terms of cross-task faithfulness—the ability of one circuit to solve another’s task—we observe a separation between formal and functional mechanisms, with formal task circuits achieving higher performance on other formal tasks. This suggests the existence of a set of formal linguistic mechanisms that is shared across formal tasks, even if not all mechanisms are strictly necessary for all formal tasks.
Training and Evaluating with Human Label Variation: An Empirical Study
Kemal Kurniawan | Meladel Mistica | Timothy Baldwin | Jey Han Lau
Human label variation (HLV) challenges the standard assumption that a labeled instance has a single ground truth, instead embracing the natural variation in human annotation to train and evaluate models. While various training methods and metrics for HLV have been proposed, it is still unclear which methods and metrics perform best in what settings. We propose new evaluation metrics for HLV leveraging fuzzy set theory. Because these new proposed metrics are differentiable, we then in turn experiment with using these metrics as training objectives. We conduct an extensive study over 6 HLV datasets testing 14 training methods and 6 evaluation metrics. We find that training on either disaggregated annotations or soft labels performs best across metrics, outperforming training using the proposed training objectives with differentiable metrics. We also show that our proposed soft micro F1 score is one of the best metrics for HLV data.
Linguistic Steganography via Self-Adjusting Asymmetric Number System
Yiting Liu | Chungen Xu | Fei Yang | Pan Zhang | Linlong Wang
Linguistic steganography (stego) seeks to conceal secret information within natural language text. However, existing methods often struggle to balance stego text quality with embedding efficiency, largely due to limitations in generation strategies and coding mechanisms. We propose SA-ANS, a self-adaptive linguistic steganography framework based on a self-adjusting Asymmetric Numeral System. SA-ANS allows user-specified embedding rates and uses probabilistic coding with adaptive candidate selection, dynamically tailoring the token pool to the language model’s probability distribution. This design produces fluent, semantically coherent stego text while preserving statistical indistinguishability from natural language. Extensive experiments on multiple benchmark datasets, evaluated across embedding efficiency, linguistic quality, statistical similarity, robustness to steganalysis, and human judgment, show that SA-ANS consistently outperforms state-of-the-art methods, demonstrating both effectiveness and practicality.
Defensive Dual Masking for Robust Adversarial Defense
Wangli Yang | Jie Yang | Yi Guo | Johan Barthelemy
Adversarial defenses for textual data have gained considerable attention in recent years due to the increasing vulnerability of Natural Language Processing (NLP) models to adversarial attacks. These attacks exploit subtle perturbations in input text to deceive models, posing significant challenges to model robustness and reliability. This article introduces Defensive Dual Masking (DDM), a simple yet effective algorithm that uses two unique masking strategies to mitigate adversarial threats. Specifically, during training, [MASK] tokens are directly inserted into input samples to prepare the model for handling perturbed inputs. At inference time, suspicious tokens are identified and strategically replaced with [MASK] tokens, effectively neutralizing perturbations while preserving core semantics of the input text. The theoretical foundation of DDM demonstrates how the proposed masking strategies enhance the model capacity to mitigate adversarial attacks. Empirical evaluations based on four benchmark datasets and four adversarial attacks consistently demonstrate that DDM outperforms state-of-the-art defense techniques, achieving superior robustness and substantial improvements in model accuracy. Furthermore, DDM seamlessly integrates with Large Language Models, enhancing their resilience to adversarial attacks and providing a scalable defense solution for large-scale NLP applications.
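The two masking steps described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function names `insert_masks` and `mask_suspicious`, the insertion rate, the per-token suspicion scores, and the threshold are all hypothetical, and the abstract does not say how suspicious tokens are identified.

```python
import random

MASK = "[MASK]"


def insert_masks(tokens, rate=0.1, seed=None):
    """Training-time step (sketch): insert [MASK] before a token with probability `rate`.

    The original tokens are all kept, in order; only extra [MASK] tokens are added.
    `rate` is a hypothetical hyperparameter.
    """
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if rng.random() < rate:
            out.append(MASK)
        out.append(tok)
    return out


def mask_suspicious(tokens, scores, threshold=0.5):
    """Inference-time step (sketch): replace tokens whose suspicion score exceeds `threshold`.

    `scores` is assumed to come from some external detector; how it is
    computed is outside this sketch.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    return [MASK if s > threshold else t for t, s in zip(tokens, scores)]


if __name__ == "__main__":
    print(insert_masks("the movie was great".split(), rate=0.3, seed=0))
    print(mask_suspicious(["the", "mvoie", "was", "great"], [0.1, 0.9, 0.1, 0.2]))
```

The sketch only shows the data transformation on token lists; in practice the masked sequences would be fed to a masked-language-model classifier.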
Meta4XNLI: A Cross-lingual Parallel Corpus for Metaphor Detection and Interpretation
Elisa Sanchez-Bayona | Rodrigo Agerri
Metaphors are a ubiquitous but often overlooked part of everyday language. As a complex cognitive-linguistic phenomenon, they provide a valuable means to evaluate whether language models can capture deeper aspects of meaning, including semantic, pragmatic, and cultural context. In this work, we present Meta4XNLI, the first parallel dataset for Natural Language Inference (NLI) newly annotated for metaphor detection and interpretation in both English and Spanish. Meta4XNLI facilitates the comparison of encoder- and decoder-based models in detecting and understanding metaphorical language in multilingual and cross-lingual settings. Our results show that fine-tuned encoders outperform decoder-only LLMs in metaphor detection. Metaphor interpretation is evaluated via the NLI framework with comparable performance of masked and autoregressive models, which notably decreases when the inference is affected by metaphorical language. Our study also finds that translation plays an important role in the preservation or loss of metaphors across languages, introducing shifts that might impact metaphor occurrence and model performance. These findings underscore the importance of resources like Meta4XNLI for advancing the analysis of the capabilities of language models and improving our understanding of metaphor processing across languages. Furthermore, the dataset offers previously unavailable opportunities to investigate metaphor interpretation, cross-lingual metaphor transferability, and the impact of translation on the development of multilingual annotated resources.
A Principled Framework for Evaluating on Typologically Diverse Languages
Esther Ploeger | Wessel Poelman | Andreas Holck Høeg-Petersen | Anders Schlichtkrull | Miryam de Lhoneux | Johannes Bjerva
Beyond individual languages, multilingual natural language processing (NLP) research increasingly aims to develop models that perform well across languages generally. However, evaluating these systems on all the world’s languages is practically infeasible. To attain generalizability, representative language sampling is essential. Previous work argues that generalizable multilingual evaluation sets should contain languages with diverse typological properties. However, “typologically diverse” language samples have been found to vary considerably in this regard, and popular sampling methods are flawed and inconsistent. We present a language sampling framework for selecting highly typologically diverse languages given a sampling frame, informed by language typology. We compare sampling methods with a range of metrics and find that our systematic methods consistently retrieve more typologically diverse language selections than previous methods in NLP. Moreover, we provide evidence that this affects generalizability in multilingual model evaluation, emphasizing the importance of diverse language sampling in NLP evaluation.
Re-evaluating the Word Token for Bilingual Speech Processing: The Case for Intonation Units
Rebecca Pattichis | Dora LaCasse | Rena Torres Cacoullos
Natural Language Processing (NLP) metrics for bilingual code-switching (CS) have, until now, used words as the token level. However, the assumption that any two words constitute an equally likely switch point is erroneous. In spoken language, a major delimiter of CS is a prosodic chunk known as the Intonation Unit (IU). Switch points are far more likely between words at IU boundaries than between words in the same IU. The word as an elementary NLP unit is thus incommensurate with bilingual speech patterns. Here, we put forward an IU-based adaptation of a familiar metric of CS probability. We then compare the token levels on this metric for ten bilingual datasets featuring multi-word CS. Our comparison shows that the currently standard two-significant-figure precision of the word-based metric is insufficient, as the token level compresses the range of values by inflating the universe of CS. More discerning CS probability values can be obtained by normalizing word-based counts using mean IU length.
How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text?
Atsuki Yamaguchi | Aline Villavicencio | Nikolaos Aletras
Large language models (LLMs) have shown remarkable capabilities in many languages beyond English. Yet, LLMs require more inference steps when generating non-English text due to their reliance on English-centric tokenizers and vocabulary, resulting in higher usage costs to non-English speakers. Vocabulary expansion with target language tokens is a widely used cross-lingual vocabulary adaptation approach to remedy this issue. Despite its effectiveness in inference speedup, previous work on vocabulary expansion has focused on high-resource settings assuming access to a substantial amount of target language data to effectively initialize the embeddings of the new tokens and adapt the LLM to the target language. However, vocabulary expansion in low-resource settings has yet to be explored. In this article, we investigate vocabulary expansion in low-resource settings by considering embedding initialization methods and continual pre-training strategies. Through extensive experiments across typologically diverse languages, tasks, and models, we establish a set of strategies to perform vocabulary expansion for faster inference, while striving to maintain competitive downstream performance to baselines. This is achieved with only 30K sentences (∼0.01GB text data) from the target language.
The Quest for the Right Mediator: Surveying Mechanistic Interpretability for NLP Through the Lens of Causal Mediation Analysis
Aaron Mueller | Jannik Brinkmann | Millicent Li | Samuel Marks | Koyena Pal | Nikhil Prakash | Can Rager | Aruna Sankaranarayanan | Arnab Sen Sharma | Jiuding Sun | Eric Todd | David Bau | Yonatan Belinkov
Interpretability provides a toolset for understanding how and why language models behave in certain ways. However, there is little unity in the field: Most studies use ad-hoc evaluations and do not share theoretical foundations, making it difficult to measure progress and compare the pros and cons of different techniques. Furthermore, while mechanistic understanding is frequently discussed, the basic causal units underlying these mechanisms are often not explicitly defined. In this article, we propose a perspective on interpretability research grounded in causal mediation analysis. Specifically, we describe the history and current state of interpretability taxonomized according to the types of causal units (mediators) utilized, as well as methods used to search over mediators. We discuss the pros and cons of each mediator, providing insights as to when particular kinds of mediators and search methods are most appropriate. We argue that this framing yields a more cohesive narrative of the field and helps researchers select appropriate methods based on their research objective. Our analysis yields actionable recommendations for future work, including the discovery of new mediators and the development of standardized evaluations tailored to these goals.
Evaluating the cognitive capacities of large language models (LLMs) requires overcoming not only anthropomorphic but also anthropocentric biases. This article identifies two types of anthropocentric bias that have been neglected: overlooking how auxiliary factors can impede LLM performance despite competence (auxiliary oversight), and dismissing LLM mechanistic strategies that differ from those of humans as not genuinely competent (mechanistic chauvinism). Mitigating these biases requires an empirical, iterative approach to mapping cognitive tasks to LLM-specific capacities and mechanisms, achieved by supplementing behavioral experiments with mechanistic studies.
Computational Linguistics, Volume 52, Issue 2 - June 2026
I was honored to receive the Association for Computational Linguistics Lifetime Achievement Award in 2025. I especially want to thank the people who nominated me for the award as I know nominations require time and effort. This retrospective is a rough transcript of the speech I gave accepting the award at the conference in Vienna, Austria. In the talk, I look back at my research at early stages of my career and then look at the arc that research takes and how it relates to work that I still carry out today. I look at the trajectories of four areas of my research: language generation, text summarization, social media analysis, and multimodal analysis of artwork. In the talk, I featured videos of my current students speaking about their research and where they think the field is heading. I dedicate the talk and this article to the amazing students I have had the honor to work with over the years.
LLMs and Cultural Values: The Impact of Prompt Language and Explicit Cultural Framing
Bram Bulté | Ayla Rigouts Terryn
Large language models (LLMs) are rapidly being adopted by users across the globe, who interact with them in a diverse range of languages. At the same time, there are well-documented imbalances in the training data and optimization objectives of this technology, raising doubts as to whether LLMs can accurately represent the cultural diversity of their broad user base. In this study, we look at LLMs and cultural values in particular, and examine how prompt language and cultural framing influence model responses and their alignment with human values in different countries. We do so by probing 10 LLMs with 63 items from the Hofstede Values Survey Module and World Values Survey, translated into 11 languages, and formulated as prompts with and without different explicit cultural perspectives. Our study confirms that both prompt language and cultural perspective produce variation in LLM outputs, but with an important caveat: While targeted prompting can, to a certain extent, steer LLM responses in the direction of the predominant values of the corresponding countries, it does not overcome the models’ systematic bias toward the values associated with a restricted set of countries in our dataset: the Netherlands, Germany, the United States, and Japan. All tested models, regardless of their origin, exhibit remarkably similar patterns: They produce fairly neutral responses on most topics, with selective progressive stances on issues such as social tolerance. Alignment with cultural values of human respondents is improved more with an explicit cultural perspective than with a targeted prompt language. Unexpectedly, combining both approaches is no more effective than cultural framing with an English prompt. These findings reveal that LLMs occupy an uncomfortable middle ground: They are responsive enough to changes in prompts to produce variation, but they are also too firmly anchored to specific cultural defaults to adequately represent cultural diversity.
Sequence Labeling for Constituent Parsing: A Comparative Study and Encoding Innovations
Diego Roca | David Vilares | Carlos Gómez-Rodríguez
Various encodings have been proposed to cast constituent parsing in terms of a sequence labeling task. However, unlike in the case of dependency parsing, existing comparisons have not been entirely homogeneous and, to the best of our knowledge, there is no systematic evaluation of these encodings under uniform configurations. A homogeneous evaluation needs to account for various aspects that could influence results, either by controlling for these aspects to ensure uniformity (e.g., network architecture, parameter settings, postprocessing of ill-formed output), or by systematically analyzing their impact (e.g., the impact of binary versus arbitrary structures). In this article, we: (1) compare different encodings comprehensively both theoretically and empirically, on a modern neural architecture and across nine languages, and (2) introduce new encodings and variants, including an encoding that our analysis finds particularly accurate and compact.
Multimodal OXYmorons: A Comprehensive Introduction and Computational Analysis Using a Dataset of Oxymoronic Memes in Italian and Spanish
Eliana Di Palma | Giulia Rizzi | Francesca Masini | Paolo Rosso | Elisabetta Fersini
This article introduces the concept of multimodal oxymorons. Multimodal oxymorons extend the traditional oxymoron theory by constructing and communicating meaning through the interplay of multiple modalities (such as visual and textual) rather than relying solely on language. We argue that multimodal oxymorons are central mechanisms of meaning-making in contemporary communication, as evidenced by the use of memes as an example. While textual oxymorons have long been the subject of analysis in order to ascertain their role in shaping thought and meaning, multimodal oxymorons demonstrate how human cognitive process transcends linguistic boundaries, integrating different modalities (e.g., visual) in order to convey complex ideas. To encourage further study, we present a curated multilingual dataset of Multimodal OXYmoron (MOXY), which can be used as a foundation for further analysis and experimentation. Furthermore, we propose a methodical approach for the identification of multimodal oxymorons along with a pipeline for automated generation. Through illustrative examples and a detailed methodology, this work establishes a comprehensive framework for understanding, identifying, and generating multimodal oxymorons, paving the way for advancements in computational linguistics, artificial intelligence, and figurative language studies.
VPO: Leveraging the Number of Votes in Preference Optimization
Jae Hyeon Cho | Minkyung Park | Byung-Jun Lee
Direct Preference Optimization (DPO) trains a language model using human preference data, bypassing the explicit reward modeling phase of Reinforcement Learning from Human Feedback (RLHF). By iterating over sentence pairs in a preference dataset, DPO enhances generation quality by increasing the likelihood of producing preferred sentences over less favored ones. Preference datasets, typically labeled with votes or scores, provide valuable insights into whether a sentence pair exhibits a clear preference or remains controversial. However, existing methods do not fully utilize this information. In this article, we propose a technique that leverages user voting data to better align language models with diverse subjective preferences. We use the Bayesian Minimum Mean Square Error (Bayesian MMSE) estimator to model the probability that one generation is preferred over another. Using this estimated probability as a target, we introduce the Vote-based Preference Optimization (VPO) framework, which incorporates the number of votes on both sides to distinguish between controversial and clearly preferred generation pairs. Furthermore, we demonstrate that previous algorithms, such as DPO and Identity Preference Optimization (IPO), can be extended using the proposed framework, termed VDPO and VIPO. Our experiments demonstrate that these proposed algorithms outperform various existing methods, including their base algorithms. Additionally, our framework can be applied to reward modeling, demonstrating that our approach is compatible with the broader RLHF pipeline.
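As a rough illustration of how vote counts can become a soft preference target: the Bayesian MMSE estimator of a probability is its posterior mean, and under a Beta prior with binomially distributed votes that mean has a closed form. The sketch below assumes a Beta(alpha, beta) prior with a uniform default; the abstract does not state which prior the authors use, and the function name `preference_target` is hypothetical.

```python
def preference_target(votes_a: int, votes_b: int,
                      alpha: float = 1.0, beta: float = 1.0) -> float:
    """Posterior-mean estimate of P(A is preferred over B).

    Assumes votes are binomial draws and the prior is Beta(alpha, beta).
    The default Beta(1, 1) (uniform) prior is an assumption of this sketch.
    """
    if votes_a < 0 or votes_b < 0:
        raise ValueError("vote counts must be non-negative")
    if alpha <= 0 or beta <= 0:
        raise ValueError("prior parameters must be positive")
    return (votes_a + alpha) / (votes_a + votes_b + alpha + beta)


if __name__ == "__main__":
    # A controversial pair stays close to 0.5; a clear preference moves toward 1.
    for a, b in [(1, 1), (6, 4), (9, 1), (90, 10)]:
        print(a, b, round(preference_target(a, b), 3))
```

With this estimator, a 6-to-4 vote yields a target much closer to 0.5 than a 9-to-1 vote, which is the distinction between controversial and clearly preferred pairs that the abstract describes. A target of this kind could replace the hard 0/1 label in a preference loss.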
Yesterday’s News: Benchmarking Multi-Dimensional Out-of-Distribution Generalization of Misinformation Detection Models
Ivo Verhoeven | Pushkar Mishra | Ekaterina Shutova
This article introduces misinfo-general, a benchmark dataset for evaluating misinformation models’ ability to perform out-of-distribution generalization. Misinformation changes rapidly, much more quickly than moderators can annotate at scale, resulting in a shift between the training and inference data distributions. As a result, misinformation detectors need to be able to perform out-of-distribution generalization, an attribute they currently lack. Our benchmark uses distant labeling to enable simulating covariate shifts in misinformation content. We identify time, event, topic, publisher, political bias, and misinformation type as important axes for generalization, and we evaluate a common class of baseline models on each. Using article metadata, we show how this model fails desiderata, which is not necessarily obvious from classification metrics. Finally, we analyze properties of the data to ensure limited presence of modelling shortcuts. We make the dataset and accompanying code publicly available.
Can a Large Language Model Replace Humans at Rating Lexical Semantic Relations Strength?
André Fernandes dos Santos | José Paulo Leal
This article investigates the ability of large language models (LLMs) to evaluate semantic relations between word pairs by examining their alignment with human-generated semantic ratings. Semantic relations represent the degree of connection (e.g., relatedness or similarity) between linguistic elements and are traditionally validated against human-annotated datasets. Due to the challenges of building such datasets and recent progress in LLMs’ capacity to model human-like understanding, we explore whether LLMs can serve as reliable substitutes for traditional human ratings. We conducted experiments using multiple LLMs from OpenAI, Google, Mistral, and Anthropic, evaluating their performance across diverse English and Portuguese semantic relations datasets. We included in the analysis PAP900, a recently published dataset of semantic relations in Portuguese, to examine the influence of prior exposure to the dataset on LLM training. The results show that the LLM predictions correlate strongly with human ratings. The findings reveal the potential of LLMs to supplement or replace traditional semantic measure algorithms and crowd-sourced human annotations in semantic tasks.
The Conundrum of Trustworthy Research on Attacking Personally Identifiable Information Removal Techniques
Sebastian Ochs | Ivan Habernal
Removing personally identifiable information (PII) from texts is necessary to comply with various data protection regulations and to enable data sharing without compromising privacy. However, recent works show that documents sanitized by PII-removal techniques are vulnerable to reconstruction attacks. Yet, we suspect that the reported success of these attacks is largely overestimated. We critically analyze the evaluation of existing attacks and find that data leakage and data contamination are not properly mitigated, leaving the question whether or not PII removal techniques truly protect privacy in real-world scenarios unaddressed. We investigate possible data sources and attack setups that avoid data leakage and conclude that only truly private data can allow us to objectively evaluate vulnerabilities in PII removal techniques. However, access to private data is heavily restricted—and for good reasons—which also means that the public research community cannot address this problem in a transparent, reproducible, and trustworthy manner.
Instructional Text Across Disciplines: A Survey of Representations, Downstream Tasks, and Open Challenges Toward Capable AI Agents
Abdulfattah Safa | Tamta Kapanadze | Arda Uzunoğlu | Gözde Gül Şahin
Recent advances in large language models have demonstrated promising capabilities in following simple instructions through instruction tuning. However, real-world tasks often involve complex, multi-step instructions that remain challenging for current NLP systems. Robust understanding of such instructions is essential for deploying LLMs as general-purpose agents that can be programmed in natural language to perform complex, real-world tasks across domains like robotics, business automation, and interactive systems. Despite growing interest in this area, there is a lack of a comprehensive survey that systematically analyzes the landscape of complex instruction understanding and processing. Through a systematic review of the literature, we analyze available resources, representation schemes, and downstream tasks related to instructional text. Our study examines 181 papers, identifying trends, challenges, and opportunities in this emerging field. We provide AI/NLP researchers with essential background knowledge and a unified view of various approaches to complex instruction understanding, bridging gaps between different research directions and highlighting future research opportunities.
It is demonstrated that the proofs given in prominent and well-established weak generative capacity arguments for natural language are flawed, due to unexpected interpretations of strings. However, once unique representations of lexical semantic senses form part of such intersection-based proofs, the arguments stand.