Computational Linguistics, Volume 52, Issue 2 - June 2026


Anthology ID:
2026.cl-2
Month:
June
Year:
2026
Address:
Cambridge, MA
Venue:
CL
Publisher:
MIT Press
URL:
https://preview.aclanthology.org/codex___ingest-cl-2026-issue-2/2026.cl-2/

I was honored to receive the Association for Computational Linguistics Lifetime Achievement Award in 2025. I especially want to thank the people who nominated me for the award as I know nominations require time and effort. This retrospective is a rough transcript of the speech I gave accepting the award at the conference in Vienna, Austria. In the talk, I look back at my research at early stages of my career and then look at the arc that research takes and how it relates to work that I still carry out today. I look at the trajectories of four areas of my research: language generation, text summarization, social media analysis, and multimodal analysis of artwork. In the talk, I featured videos of my current students speaking about their research and where they think the field is heading. I dedicate the talk and this article to the amazing students I have had the honor to work with over the years.
Large language models (LLMs) are rapidly being adopted by users across the globe, who interact with them in a diverse range of languages. At the same time, there are well-documented imbalances in the training data and optimization objectives of this technology, raising doubts as to whether LLMs can accurately represent the cultural diversity of their broad user base. In this study, we look at LLMs and cultural values in particular, and examine how prompt language and cultural framing influence model responses and their alignment with human values in different countries. We do so by probing 10 LLMs with 63 items from the Hofstede Values Survey Module and World Values Survey, translated into 11 languages, and formulated as prompts with and without different explicit cultural perspectives. Our study confirms that both prompt language and cultural perspective produce variation in LLM outputs, but with an important caveat: While targeted prompting can, to a certain extent, steer LLM responses in the direction of the predominant values of the corresponding countries, it does not overcome the models’ systematic bias toward the values associated with a restricted set of countries in our dataset: the Netherlands, Germany, the United States, and Japan. All tested models, regardless of their origin, exhibit remarkably similar patterns: They produce fairly neutral responses on most topics, with selective progressive stances on issues such as social tolerance. Alignment with cultural values of human respondents is improved more with an explicit cultural perspective than with a targeted prompt language. Unexpectedly, combining both approaches is no more effective than cultural framing with an English prompt. These findings reveal that LLMs occupy an uncomfortable middle ground: They are responsive enough to changes in prompts to produce variation, but they are also too firmly anchored to specific cultural defaults to adequately represent cultural diversity.
Various encodings have been proposed to cast constituent parsing in terms of a sequence labeling task. However, unlike in the case of dependency parsing, existing comparisons have not been entirely homogeneous and, to the best of our knowledge, there is no systematic evaluation of these encodings under uniform configurations. A homogeneous evaluation needs to account for various aspects that could influence results, either by controlling for these aspects to ensure uniformity (e.g., network architecture, parameter settings, postprocessing of ill-formed output), or by systematically analyzing their impact (e.g., the impact of binary versus arbitrary structures). In this article, we: (1) compare different encodings comprehensively both theoretically and empirically, on a modern neural architecture and across nine languages, and (2) introduce new encodings and variants, including an encoding that our analysis finds particularly accurate and compact.
This article introduces the concept of multimodal oxymorons. Multimodal oxymorons extend traditional oxymoron theory by constructing and communicating meaning through the interplay of multiple modalities (such as visual and textual) rather than relying solely on language. We argue that multimodal oxymorons are central mechanisms of meaning-making in contemporary communication, as illustrated by memes. While textual oxymorons have long been analyzed to ascertain their role in shaping thought and meaning, multimodal oxymorons demonstrate how human cognitive processes transcend linguistic boundaries, integrating different modalities (e.g., visual) in order to convey complex ideas. To encourage further study, we present a curated multilingual dataset of Multimodal OXYmoron (MOXY), which can be used as a foundation for further analysis and experimentation. Furthermore, we propose a methodical approach for the identification of multimodal oxymorons along with a pipeline for automated generation. Through illustrative examples and a detailed methodology, this work establishes a comprehensive framework for understanding, identifying, and generating multimodal oxymorons, paving the way for advancements in computational linguistics, artificial intelligence, and figurative language studies.
Direct Preference Optimization (DPO) trains a language model using human preference data, bypassing the explicit reward modeling phase of Reinforcement Learning from Human Feedback (RLHF). By iterating over sentence pairs in a preference dataset, DPO enhances generation quality by increasing the likelihood of producing preferred sentences over less favored ones. Preference datasets, typically labeled with votes or scores, provide valuable insights into whether a sentence pair exhibits a clear preference or remains controversial. However, existing methods do not fully utilize this information. In this article, we propose a technique that leverages user voting data to better align language models with diverse subjective preferences. We use the Bayesian Minimum Mean Square Error (Bayesian MMSE) estimator to model the probability that one generation is preferred over another. Using this estimated probability as a target, we introduce the Vote-based Preference Optimization (VPO) framework, which incorporates the number of votes on both sides to distinguish between controversial and clearly preferred generation pairs. Furthermore, we demonstrate that previous algorithms, such as DPO and Identity Preference Optimization (IPO), can be extended using the proposed framework, termed VDPO and VIPO. Our experiments demonstrate that these proposed algorithms outperform various existing methods, including their base algorithms. Additionally, our framework can be applied to reward modeling, demonstrating that our approach is compatible with the broader RLHF pipeline.
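As a rough illustration of the idea in this abstract, the sketch below turns vote counts into a target preference probability and uses it as a soft label in a DPO-style loss. It is not the authors' implementation. It assumes a Beta prior over the preference probability (for which the Bayesian MMSE estimate is the posterior mean) and a cross-entropy form of the soft-target loss. The function names, the prior parameters `prior_a` and `prior_b`, and the `scale` parameter are all hypothetical.

```python
import math


def mmse_preference(votes_a: int, votes_b: int,
                    prior_a: float = 1.0, prior_b: float = 1.0) -> float:
    """Posterior-mean estimate of P(response A is preferred over response B).

    Assumption: votes are Bernoulli draws with a Beta(prior_a, prior_b) prior.
    Under squared-error loss, the Bayes (MMSE) estimator is the posterior mean.
    """
    return (votes_a + prior_a) / (votes_a + votes_b + prior_a + prior_b)


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def soft_target_dpo_loss(margin: float, target: float, scale: float = 0.1) -> float:
    """Cross-entropy between the vote-based target and the model's implied preference.

    `margin` stands for the difference between the policy-vs-reference
    log-probability ratios of responses A and B (hypothetical input here).
    With target == 1.0 this reduces to the standard DPO loss -log(sigmoid(scale * margin)).
    """
    z = scale * margin
    return -(target * log_sigmoid(z) + (1.0 - target) * log_sigmoid(-z))


# A clearly preferred pair (9 votes to 1) versus a controversial pair (5 to 5).
clear_target = mmse_preference(9, 1)
controversial_target = mmse_preference(5, 5)
print(clear_target, controversial_target)
print(soft_target_dpo_loss(margin=2.0, target=clear_target))
print(soft_target_dpo_loss(margin=2.0, target=controversial_target))
```

Under these assumptions, a controversial pair yields a target near 0.5, so the loss no longer rewards pushing the margin arbitrarily far in one direction. A lopsided vote yields a target close to 1 and behaves much like ordinary DPO.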
This article introduces misinfo-general, a benchmark dataset for evaluating misinformation models’ ability to perform out-of-distribution generalization. Misinformation changes rapidly, much more quickly than moderators can annotate at scale, resulting in a shift between the training and inference data distributions. As a result, misinformation detectors need to be able to perform out-of-distribution generalization, an attribute they currently lack. Our benchmark uses distant labeling to enable simulating covariate shifts in misinformation content. We identify time, event, topic, publisher, political bias, and misinformation type as important axes for generalization, and we evaluate a common class of baseline models on each. Using article metadata, we show how this model fails desiderata, which is not necessarily obvious from classification metrics. Finally, we analyze properties of the data to ensure limited presence of modelling shortcuts. We make the dataset and accompanying code publicly available.
This article investigates the ability of large language models (LLMs) to evaluate semantic relations between word pairs by examining their alignment with human-generated semantic ratings. Semantic relations represent the degree of connection (e.g., relatedness or similarity) between linguistic elements and are traditionally validated against human-annotated datasets. Due to the challenges of building such datasets and recent progress in LLMs’ capacity to model human-like understanding, we explore whether LLMs can serve as reliable substitutes for traditional human ratings. We conducted experiments using multiple LLMs from OpenAI, Google, Mistral, and Anthropic, evaluating their performance across diverse English and Portuguese semantic relations datasets. We included in the analysis PAP900, a recently published dataset of semantic relations in Portuguese, to examine the influence of prior exposure to the dataset on LLM training. The results show that the LLM predictions correlate strongly with human ratings. The findings reveal the potential of LLMs to supplement or replace traditional semantic measure algorithms and crowd-sourced human annotations in semantic tasks.
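The comparison between model predictions and human ratings described in this abstract can be sketched as a rank correlation. The abstract does not name the correlation measure, so Spearman's coefficient is an assumption here, and the ratings below are made-up illustrative values rather than data from PAP900 or any other dataset.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical relatedness ratings for five word pairs (illustrative only).
human_ratings = [4.8, 1.2, 3.5, 2.0, 4.1]
model_ratings = [4.5, 0.9, 3.9, 2.4, 3.0]
print(spearman(human_ratings, model_ratings))
```

A rank-based measure is a natural choice when human and model ratings may sit on different scales, since only the ordering of the word pairs matters.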
Removing personally identifiable information (PII) from texts is necessary to comply with various data protection regulations and to enable data sharing without compromising privacy. However, recent works show that documents sanitized by PII removal techniques are vulnerable to reconstruction attacks. Yet, we suspect that the reported success of these attacks is largely overestimated. We critically analyze the evaluation of existing attacks and find that data leakage and data contamination are not properly mitigated, leaving unaddressed the question of whether PII removal techniques truly protect privacy in real-world scenarios. We investigate possible data sources and attack setups that avoid data leakage and conclude that only truly private data can allow us to objectively evaluate vulnerabilities in PII removal techniques. However, access to private data is heavily restricted, and for good reasons, which also means that the public research community cannot address this problem in a transparent, reproducible, and trustworthy manner.
Recent advances in large language models have demonstrated promising capabilities in following simple instructions through instruction tuning. However, real-world tasks often involve complex, multi-step instructions that remain challenging for current NLP systems. Robust understanding of such instructions is essential for deploying LLMs as general-purpose agents that can be programmed in natural language to perform complex, real-world tasks across domains like robotics, business automation, and interactive systems. Despite growing interest in this area, there is a lack of a comprehensive survey that systematically analyzes the landscape of complex instruction understanding and processing. Through a systematic review of the literature, we analyze available resources, representation schemes, and downstream tasks related to instructional text. Our study examines 181 papers, identifying trends, challenges, and opportunities in this emerging field. We provide AI/NLP researchers with essential background knowledge and a unified view of various approaches to complex instruction understanding, bridging gaps between different research directions and highlighting future research opportunities.
It is demonstrated that the proofs given in prominent and well-established weak generative capacity arguments for natural language are flawed, due to unexpected interpretations of strings. However, once unique representations of lexical semantic senses form part of such intersection-based proofs, the arguments stand.