Computational Linguistics (2026)
Computational Linguistics, Volume 52, Issue 1 - March 2026
Truth or Mirage? Towards End-To-End Factuality Evaluation with LLM-Oasis
Alessandro Scirè | Andrei Stefan Bejgu | Simone Tedeschi | Karim Ghonim | Federico Martelli | Roberto Navigli
After the introduction of Large Language Models (LLMs), there have been substantial improvements in the performance of Natural Language Generation (NLG) tasks, including Text Summarization and Machine Translation. However, LLMs still produce outputs containing hallucinations, that is, content not grounded in factual information. Therefore, developing methods to assess the factuality of LLMs has become urgent. Indeed, resources for factuality evaluation have recently emerged. Although challenging, these resources face one or more of the following limitations: (i) they are tailored to a specific task or domain; (ii) they are limited in size, thereby preventing the training of new factuality evaluators; (iii) they are designed for simpler verification tasks, such as claim verification. To address these issues, we introduce LLM-Oasis, to the best of our knowledge the largest resource for training end-to-end factuality evaluators. LLM-Oasis is constructed by extracting claims from Wikipedia, falsifying a subset of these claims, and generating pairs of factual and unfactual texts. We then rely on human annotators to both validate the quality of our dataset and to create a gold standard test set for benchmarking factuality evaluation systems. Our experiments demonstrate that LLM-Oasis presents a significant challenge for state-of-the-art LLMs, with GPT-4o achieving up to 60% accuracy in our proposed end-to-end factuality evaluation task, highlighting its potential to drive future research in the field.
Are Formal and Functional Linguistic Mechanisms Dissociated in Language Models?
Michael Hanna | Yonatan Belinkov | Sandro Pezzelle
Although large language models (LLMs) are increasingly capable, these capabilities are unevenly distributed: They excel at formal linguistic tasks, such as producing fluent, grammatical text, but struggle more with functional linguistic tasks like reasoning and consistent fact retrieval. Inspired by neuroscience, recent work suggests that to succeed on both formal and functional linguistic tasks, LLMs should use different mechanisms for each; such localization could either be built-in or emerge spontaneously through training. In this article, we ask: do current models, with fast-improving functional linguistic abilities, exhibit distinct localization of formal and functional linguistic mechanisms? We answer this by finding and comparing the “circuits”, or minimal computational subgraphs, responsible for various formal and functional tasks. Comparing 5 LLMs across 10 distinct tasks, we find that while there is indeed little overlap between circuits for formal and functional tasks, there is also little overlap between formal linguistic tasks, as exists in the human brain. Thus, a single formal linguistic network, unified and distinct from functional task circuits, remains elusive. However, in terms of cross-task faithfulness—the ability of one circuit to solve another’s task—we observe a separation between formal and functional mechanisms, with formal task circuits achieving higher performance on other formal tasks. This suggests the existence of a set of formal linguistic mechanisms that is shared across formal tasks, even if not all mechanisms are strictly necessary for all formal tasks.
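The notion of circuit overlap in the abstract above can be illustrated with a small sketch. It assumes, purely for illustration, that a circuit is represented as a set of directed edges between named model components and that overlap is measured as intersection over union; the abstract does not state which overlap measure the article uses, and the name `circuit_iou` and the component labels are hypothetical.

```python
from typing import Hashable, Iterable, Tuple

# Assumption: an edge is a (source component, target component) pair.
Edge = Tuple[Hashable, Hashable]


def circuit_iou(circuit_a: Iterable[Edge], circuit_b: Iterable[Edge]) -> float:
    """Intersection over union of two circuits given as edge collections.

    Returns 0.0 when both circuits are empty, to avoid dividing by zero.
    """
    a, b = set(circuit_a), set(circuit_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical component names, for illustration only.
formal_circuit = {("embed", "attn0"), ("attn0", "mlp1")}
functional_circuit = {("attn0", "mlp1"), ("mlp1", "logits")}
overlap = circuit_iou(formal_circuit, functional_circuit)
```

Under this measure, a value near 0 means two tasks rely on largely disjoint edges, which is the kind of low overlap the abstract reports.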
Training and Evaluating with Human Label Variation: An Empirical Study
Kemal Kurniawan | Meladel Mistica | Timothy Baldwin | Jey Han Lau
Human label variation (HLV) challenges the standard assumption that a labeled instance has a single ground truth, instead embracing the natural variation in human annotation to train and evaluate models. While various training methods and metrics for HLV have been proposed, it is still unclear which methods and metrics perform best in what settings. We propose new evaluation metrics for HLV leveraging fuzzy set theory. Because these new proposed metrics are differentiable, we then in turn experiment with using these metrics as training objectives. We conduct an extensive study over 6 HLV datasets testing 14 training methods and 6 evaluation metrics. We find that training on either disaggregated annotations or soft labels performs best across metrics, outperforming training using the proposed training objectives with differentiable metrics. We also show that our proposed soft micro F1 score is one of the best metrics for HLV data.
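One way to read a fuzzy-set-based soft micro F1 is sketched below. It assumes that each instance carries a soft label distribution over classes, that the fuzzy intersection is the elementwise minimum, and that set size is the sum of membership values. These are standard fuzzy set conventions but are assumptions here; the article's exact definition may differ, and the function name `soft_micro_f1` is hypothetical.

```python
from typing import Sequence


def soft_micro_f1(
    true_dists: Sequence[Sequence[float]],
    pred_dists: Sequence[Sequence[float]],
) -> float:
    """Soft micro F1 with fuzzy intersection taken as elementwise min.

    Soft true positives are pooled over all instances and classes
    (the "micro" part) before precision and recall are computed.
    """
    if len(true_dists) != len(pred_dists):
        raise ValueError("true and predicted label sets differ in length")
    soft_tp = sum(
        min(t, p)
        for t_row, p_row in zip(true_dists, pred_dists)
        for t, p in zip(t_row, p_row)
    )
    pred_total = sum(sum(row) for row in pred_dists)
    true_total = sum(sum(row) for row in true_dists)
    if pred_total == 0 or true_total == 0:
        return 0.0
    precision = soft_tp / pred_total
    recall = soft_tp / true_total
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Annotators split evenly between two classes; the model commits to one.
score = soft_micro_f1([[0.5, 0.5]], [[1.0, 0.0]])
```

Because this sketch uses only sums and minima, it is differentiable almost everywhere when written in an autodiff framework, which is consistent with the abstract's point that such metrics can serve as training objectives.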
Linguistic Steganography via Self-Adjusting Asymmetric Number System
Yiting Liu | Chungen Xu | Fei Yang | Pan Zhang | Linlong Wang
Linguistic steganography (stego) seeks to conceal secret information within natural language text. However, existing methods often struggle to balance stego text quality with embedding efficiency, largely due to limitations in generation strategies and coding mechanisms. We propose SA-ANS, a self-adaptive linguistic steganography framework based on a self-adjusting Asymmetric Numeral System. SA-ANS allows user-specified embedding rates and uses probabilistic coding with adaptive candidate selection, dynamically tailoring the token pool to the language model’s probability distribution. This design produces fluent, semantically coherent stego text while preserving statistical indistinguishability from natural language. Extensive experiments on multiple benchmark datasets, evaluated across embedding efficiency, linguistic quality, statistical similarity, robustness to steganalysis, and human judgment, show that SA-ANS consistently outperforms state-of-the-art methods, demonstrating both effectiveness and practicality.
Defensive Dual Masking for Robust Adversarial Defense
Wangli Yang | Jie Yang | Yi Guo | Johan Barthelemy
Adversarial defenses for textual data have gained considerable attention in recent years due to the increasing vulnerability of Natural Language Processing (NLP) models to adversarial attacks. These attacks exploit subtle perturbations in input text to deceive models, posing significant challenges to model robustness and reliability. This article introduces Defensive Dual Masking (DDM), a simple yet effective algorithm that uses two unique masking strategies to mitigate adversarial threats. Specifically, during training, [MASK] tokens are directly inserted into input samples to prepare the model for handling perturbed inputs. At inference time, suspicious tokens are identified and strategically replaced with [MASK] tokens, effectively neutralizing perturbations while preserving core semantics of the input text. The theoretical foundation of DDM demonstrates how the proposed masking strategies enhance the model capacity to mitigate adversarial attacks. Empirical evaluations based on four benchmark datasets and four adversarial attacks consistently demonstrate that DDM outperforms state-of-the-art defense techniques, achieving superior robustness and substantial improvements in model accuracy. Furthermore, DDM seamlessly integrates with Large Language Models, enhancing their resilience to adversarial attacks and providing a scalable defense solution for large-scale NLP applications.
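The two masking steps described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say where masks are inserted during training or how suspicious tokens are identified, so the sketch assumes uniformly random insertion positions and takes per-token suspicion scores as a caller-supplied input. The function names and the example scores are hypothetical.

```python
import random
from typing import List, Sequence

MASK = "[MASK]"


def insert_masks(tokens: Sequence[str], num_masks: int, seed: int = 0) -> List[str]:
    """Training-time step: insert [MASK] tokens at random positions.

    Assumption: positions are drawn uniformly, including the very end.
    """
    rng = random.Random(seed)
    out = list(tokens)
    for _ in range(num_masks):
        pos = rng.randint(0, len(out))  # randint is inclusive on both ends
        out.insert(pos, MASK)
    return out


def mask_suspicious(
    tokens: Sequence[str], scores: Sequence[float], k: int
) -> List[str]:
    """Inference-time step: replace the k highest-scoring tokens with [MASK].

    Assumption: `scores` come from some external suspicion estimator.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    return [MASK if i in chosen else tok for i, tok in enumerate(tokens)]


train_example = insert_masks(["the", "film", "was", "great"], num_masks=2)
defended = mask_suspicious(
    ["the", "fi1m", "was", "great"], [0.1, 0.9, 0.05, 0.2], k=1
)
```

In this example the perturbed token `fi1m` receives the highest hypothetical score and is replaced, while the remaining tokens are left unchanged.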
Meta4XNLI: A Cross-lingual Parallel Corpus for Metaphor Detection and Interpretation
Elisa Sanchez-Bayona | Rodrigo Agerri
Metaphors are a ubiquitous but often overlooked part of everyday language. As a complex cognitive-linguistic phenomenon, they provide a valuable means to evaluate whether language models can capture deeper aspects of meaning, including semantic, pragmatic, and cultural context. In this work, we present Meta4XNLI, the first parallel dataset for Natural Language Inference (NLI) newly annotated for metaphor detection and interpretation in both English and Spanish. Meta4XNLI facilitates the comparison of encoder- and decoder-based models in detecting and understanding metaphorical language in multilingual and cross-lingual settings. Our results show that fine-tuned encoders outperform decoder-only LLMs in metaphor detection. Metaphor interpretation is evaluated via the NLI framework with comparable performance of masked and autoregressive models, which notably decreases when the inference is affected by metaphorical language. Our study also finds that translation plays an important role in the preservation or loss of metaphors across languages, introducing shifts that might impact metaphor occurrence and model performance. These findings underscore the importance of resources like Meta4XNLI for advancing the analysis of the capabilities of language models and improving our understanding of metaphor processing across languages. Furthermore, the dataset offers previously unavailable opportunities to investigate metaphor interpretation, cross-lingual metaphor transferability, and the impact of translation on the development of multilingual annotated resources.
A Principled Framework for Evaluating on Typologically Diverse Languages
Esther Ploeger | Wessel Poelman | Andreas Holck Høeg-Petersen | Anders Schlichtkrull | Miryam de Lhoneux | Johannes Bjerva
Beyond individual languages, multilingual natural language processing (NLP) research increasingly aims to develop models that perform well across languages generally. However, evaluating these systems on all the world’s languages is practically infeasible. To attain generalizability, representative language sampling is essential. Previous work argues that generalizable multilingual evaluation sets should contain languages with diverse typological properties. However, “typologically diverse” language samples have been found to vary considerably in this regard, and popular sampling methods are flawed and inconsistent. We present a language sampling framework for selecting highly typologically diverse languages given a sampling frame, informed by language typology. We compare sampling methods with a range of metrics and find that our systematic methods consistently retrieve more typologically diverse language selections than previous methods in NLP. Moreover, we provide evidence that this affects generalizability in multilingual model evaluation, emphasizing the importance of diverse language sampling in NLP evaluation.
Re-evaluating the Word Token for Bilingual Speech Processing: The Case for Intonation Units
Rebecca Pattichis | Dora LaCasse | Rena Torres Cacoullos
Natural Language Processing (NLP) metrics for bilingual code-switching (CS) have, until now, used words as the token level. However, the assumption that any two words constitute an equally likely switch point is erroneous. In spoken language, a major delimiter of CS is a prosodic chunk known as the Intonation Unit (IU). Switch points are far more likely between words at IU boundaries than between words in the same IU. The word as an elementary NLP unit is thus incommensurate with bilingual speech patterns. Here, we put forward an IU-based adaptation of a familiar metric of CS probability. We then compare the token levels on this metric for ten bilingual datasets featuring multi-word CS. Our comparison shows that the currently standard two-significant-figure precision of the word-based metric is insufficient, as the token level compresses the range of values by inflating the universe of CS. More discerning CS probability values can be obtained by normalizing word-based counts using mean IU length.
How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text?
Atsuki Yamaguchi | Aline Villavicencio | Nikolaos Aletras
Large language models (LLMs) have shown remarkable capabilities in many languages beyond English. Yet, LLMs require more inference steps when generating non-English text due to their reliance on English-centric tokenizers and vocabulary, resulting in higher usage costs to non-English speakers. Vocabulary expansion with target language tokens is a widely used cross-lingual vocabulary adaptation approach to remedy this issue. Despite its effectiveness in inference speedup, previous work on vocabulary expansion has focused on high-resource settings assuming access to a substantial amount of target language data to effectively initialize the embeddings of the new tokens and adapt the LLM to the target language. However, vocabulary expansion in low-resource settings has yet to be explored. In this article, we investigate vocabulary expansion in low-resource settings by considering embedding initialization methods and continual pre-training strategies. Through extensive experiments across typologically diverse languages, tasks, and models, we establish a set of strategies to perform vocabulary expansion for faster inference, while striving to maintain competitive downstream performance to baselines. This is achieved with only 30K sentences (∼0.01GB text data) from the target language.
The Quest for the Right Mediator: Surveying Mechanistic Interpretability for NLP Through the Lens of Causal Mediation Analysis
Aaron Mueller | Jannik Brinkmann | Millicent Li | Samuel Marks | Koyena Pal | Nikhil Prakash | Can Rager | Aruna Sankaranarayanan | Arnab Sen Sharma | Jiuding Sun | Eric Todd | David Bau | Yonatan Belinkov
Interpretability provides a toolset for understanding how and why language models behave in certain ways. However, there is little unity in the field: Most studies use ad-hoc evaluations and do not share theoretical foundations, making it difficult to measure progress and compare the pros and cons of different techniques. Furthermore, while mechanistic understanding is frequently discussed, the basic causal units underlying these mechanisms are often not explicitly defined. In this article, we propose a perspective on interpretability research grounded in causal mediation analysis. Specifically, we describe the history and current state of interpretability taxonomized according to the types of causal units (mediators) utilized, as well as methods used to search over mediators. We discuss the pros and cons of each mediator, providing insights as to when particular kinds of mediators and search methods are most appropriate. We argue that this framing yields a more cohesive narrative of the field and helps researchers select appropriate methods based on their research objective. Our analysis yields actionable recommendations for future work, including the discovery of new mediators and the development of standardized evaluations tailored to these goals.
Evaluating the cognitive capacities of large language models (LLMs) requires overcoming not only anthropomorphic but also anthropocentric biases. This article identifies two types of anthropocentric bias that have been neglected: overlooking how auxiliary factors can impede LLM performance despite competence (auxiliary oversight), and dismissing LLM mechanistic strategies that differ from those of humans as not genuinely competent (mechanistic chauvinism). Mitigating these biases requires an empirical, iterative approach to mapping cognitive tasks to LLM-specific capacities and mechanisms, achieved by supplementing behavioral experiments with mechanistic studies.