Computational Linguistics (2025)
up
Computational Linguistics, Volume 51, Issue 1 - March 2025
By the end of 2024, the journal Computational Linguistics has reached a significant milestone: It has published exactly 50 volumes over the past half-century. As we launch the first issue of Volume 51, this is an opportune moment to reflect on the journal’s legacy, ongoing evolution, and the exciting changes that lie ahead. Together, we embark on a journey to open a new chapter for this storied publication.
I want to thank the ACL for this Lifetime Achievement Award. I am deeply honored to be receiving it. I would also like to thank the students, faculty, and researchers who were members of the Proteus Project during most of my professional lifetime. It was an honor to serve that group.
eRST: A Signaled Graph Theory of Discourse Relations and Organization
Amir Zeldes | Tatsuya Aoyama | Yang Janet Liu | Siyao Peng | Debopam Das | Luke Gessler
Amir Zeldes | Tatsuya Aoyama | Yang Janet Liu | Siyao Peng | Debopam Das | Luke Gessler
In this article we present Enhanced Rhetorical Structure Theory (eRST), a new theoretical framework for computational discourse analysis, based on an expansion of Rhetorical Structure Theory (RST). The framework encompasses discourse relation graphs with tree-breaking, non-projective and concurrent relations, as well as implicit and explicit signals which give explainable rationales to our analyses. We survey shortcomings of RST and other existing frameworks, such as Segmented Discourse Representation Theory, the Penn Discourse Treebank, and Discourse Dependencies, and address these using constructs in the proposed theory. We provide annotation, search, and visualization tools for data, and present and evaluate a freely available corpus of English annotated according to our framework, encompassing 12 spoken and written genres with over 200K tokens. Finally, we discuss automatic parsing, evaluation metrics, and applications for data in our framework.
Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets
Nikita Moghe | Arnisa Fazla | Chantal Amrhein | Tom Kocmi | Mark Steedman | Alexandra Birch | Rico Sennrich | Liane Guillou
Nikita Moghe | Arnisa Fazla | Chantal Amrhein | Tom Kocmi | Mark Steedman | Alexandra Birch | Rico Sennrich | Liane Guillou
Recent machine translation (MT) metrics calibrate their effectiveness by correlating with human judgment. However, these results are often obtained by averaging predictions across large test sets without any insights into the strengths and weaknesses of these metrics across different error types. Challenge sets are used to probe specific dimensions of metric behavior but there are very few such datasets and they either focus on a limited number of phenomena or a limited number of language pairs. We introduce ACES, a contrastive challenge set spanning 146 language pairs, aimed at discovering whether metrics can identify 68 translation accuracy errors. These phenomena range from basic alterations at the word/character level to more intricate errors based on discourse and real-world knowledge. We conducted a large-scale study by benchmarking ACES on 47 metrics submitted to the WMT 2022 and WMT 2023 metrics shared tasks. We also measure their sensitivity to a range of linguistic phenomena. We further investigate claims that large language models (LLMs) are effective as MT evaluators, addressing the limitations of previous studies by using a dataset that covers a range of linguistic phenomena and language pairs and includes both low- and medium-resource languages. Our results demonstrate that different metric families struggle with different phenomena and that LLM-based methods are unreliable. We expose a number of major flaws with existing methods: Most metrics ignore the source sentence; metrics tend to prefer surface level overlap; and over-reliance on language-agnostic representations leads to confusion when the target language is similar to the source language. To further encourage detailed evaluation beyond singular scores, we expand ACES to include error span annotations, denoted as SPAN-ACES, and we use this dataset to evaluate span-based error metrics, showing that these metrics also need considerable improvement. Based on our observations, we provide a set of recommendations for building better MT metrics, including focusing on error labels instead of scores, ensembling, designing metrics to explicitly focus on the source sentence, focusing on semantic content rather than relying on the lexical overlap, and choosing the right pre-trained model for obtaining representations.
Compositionality and Sentence Meaning: Comparing Semantic Parsing and Transformers on a Challenging Sentence Similarity Dataset
James Fodor | Simon De Deyne | Shinsuke Suzuki
James Fodor | Simon De Deyne | Shinsuke Suzuki
One of the major outstanding questions in computational semantics is how humans integrate the meaning of individual words into a sentence in a way that enables understanding of complex and novel combinations of words, a phenomenon known as compositionality. Many approaches to modeling the process of compositionality can be classified as either “vector-based” models, in which the meaning of a sentence is represented as a vector of numbers, or “syntax-based” models, in which the meaning of a sentence is represented as a structured tree of labeled components. A major barrier in assessing and comparing these contrasting approaches is the lack of large, relevant datasets for model comparison. This article aims to address this gap by introducing a new dataset, STS3k, which consists of 2,800 pairs of sentences rated for semantic similarity by human participants. The sentence pairs have been selected to systematically vary different combinations of words, providing a rigorous test and enabling a clearer picture of the comparative strengths and weaknesses of vector-based and syntax-based methods. Our results show that when tested on the new STS3k dataset, state-of-the-art transformers poorly capture the pattern of human semantic similarity judgments, while even simple methods for combining syntax- and vector-based components into a novel hybrid model yield substantial improvements. We further show that this improvement is due to the ability of the hybrid model to replicate human sensitivity to specific changes in sentence structure. Our findings provide evidence for the value of integrating multiple methods to better reflect the way in which humans mentally represent compositional meaning.
User-generated content provides a rich resource to study social and behavioral phenomena. Although its application potential is currently limited by the paucity of expert labels and the privacy risks inherent in personal data, synthetic data can help mitigate this bottleneck. In this work, we introduce an evaluation framework to facilitate research on synthetic language data generation for user-generated text. We define a set of aspects for assessing data quality, namely, style preservation, meaning preservation, and divergence, as a proxy for privacy. We introduce metrics corresponding to each aspect. Moreover, through a set of generation strategies and representative tasks and baselines across domains, we demonstrate the relation between the quality aspects of synthetic user generated content, generation strategies, metrics, and downstream performance. To our knowledge, our work is the first unified evaluation framework for user-generated text in relation to the specified aspects, offering both intrinsic and extrinsic evaluation. We envisage it will facilitate developments towards shareable, high-quality synthetic language data.
Neural Semantic Parsing with Extremely Rich Symbolic Meaning Representations
Xiao Zhang | Gosse Bouma | Johan Bos
Xiao Zhang | Gosse Bouma | Johan Bos
Current open-domain neural semantics parsers show impressive performance. However, closer inspection of the symbolic meaning representations they produce reveals significant weaknesses: Sometimes they tend to merely copy character sequences from the source text to form symbolic concepts, defaulting to the most frequent word sense based in the training distribution. By leveraging the hierarchical structure of a lexical ontology, we introduce a novel compositional symbolic representation for concepts based on their position in the taxonomical hierarchy. This representation provides richer semantic information and enhances interpretability. We introduce a neural “taxonomical” semantic parser to utilize this new representation system of predicates, and compare it with a standard neural semantic parser trained on the traditional meaning representation format, employing a novel challenge set and evaluation metric for evaluation. Our experimental findings demonstrate that the taxonomical model, trained on much richer and complex meaning representations, is slightly subordinate in performance to the traditional model using the standard metrics for evaluation, but outperforms it when dealing with out-of-vocabulary concepts. We further show through neural model probing that training on a taxonomic representation enhances the model’s ability to learn the taxonomical hierarchy. This finding is encouraging for research in computational semantics that aims to combine data-driven distributional meanings with knowledge-based symbolic representations.
A Survey on LLM-Generated Text Detection: Necessity, Methods, and Future Directions
Junchao Wu | Shu Yang | Runzhe Zhan | Yulin Yuan | Lidia Sam Chao | Derek Fai Wong
Junchao Wu | Shu Yang | Runzhe Zhan | Yulin Yuan | Lidia Sam Chao | Derek Fai Wong
The remarkable ability of large language models (LLMs) to comprehend, interpret, and generate complex language has rapidly integrated LLM-generated text into various aspects of daily life, where users increasingly accept it. However, the growing reliance on LLMs underscores the urgent need for effective detection mechanisms to identify LLM-generated text. Such mechanisms are critical to mitigating misuse and safeguarding domains like artistic expression and social networks from potential negative consequences. LLM-generated text detection, conceptualized as a binary classification task, seeks to determine whether an LLM produced a given text. Recent advances in this field stem from innovations in watermarking techniques, statistics-based detectors, and neural-based detectors. Human-assisted methods also play a crucial role. In this survey, we consolidate recent research breakthroughs in this field, emphasizing the urgent need to strengthen detector research. Additionally, we review existing datasets, highlighting their limitations and developmental requirements. Furthermore, we examine various LLM-generated text detection paradigms, shedding light on challenges like out-of-distribution problems, potential attacks, real-world data issues, and ineffective evaluation frameworks. Finally, we outline intriguing directions for future research in LLM-generated text detection to advance responsible artificial intelligence. This survey aims to provide a clear and comprehensive introduction for newcomers while offering seasoned researchers valuable updates in the field.1
up
Computational Linguistics, Volume 51, Issue 2 - June 2025
DiBiMT: A Gold Evaluation Benchmark for Studying Lexical Ambiguity in Machine Translation
Federico Martelli | Stefano Perrella | Niccolò Campolungo | Tina Munda | Svetla Koeva | Carole Tiberius | Roberto Navigli
Federico Martelli | Stefano Perrella | Niccolò Campolungo | Tina Munda | Svetla Koeva | Carole Tiberius | Roberto Navigli
Despite the remarkable progress made in the field of Machine Translation (MT), current systems still struggle when translating ambiguous words, especially when these express infrequent meanings. In order to investigate and analyze the impact of lexical ambiguity on automatic translations, several tasks and evaluation benchmarks have been proposed over the course of the last few years. However, work in this research direction suffers from critical shortcomings. Indeed, existing evaluation datasets are not entirely manually curated, which significantly compromises their reliability. Furthermore, current literature fails to provide detailed insights into the nature of the errors produced by models translating ambiguous words, lacking a thorough manual analysis across languages. With a view to overcoming these limitations, we propose Disambiguation Biases in MT (DiBiMT), an entirely manually curated evaluation benchmark for investigating disambiguation biases in eight language combinations and assessing the ability of both commercial and non-commercial systems to handle ambiguous words. We also examine and detail the errors produced by models in this scenario by carrying out a manual error analysis in all language pairs. Additionally, we perform an extensive array of experiments aimed at studying the behavior of models when dealing with ambiguous words. Finally, we show the ineffectiveness of standard MT evaluation settings for assessing the disambiguation capabilities of systems and highlight the need for additional efforts in this research direction and ad-hoc testbeds such as DiBiMT. Our benchmark is available at: https://nlp.uniroma1.it/dibimt/.
Train and Constrain: Phonologically Informed Tongue Twister Generation from Topics and Paraphrases
Tyler Loakman | Chen Tang | Chenghua Lin
Tyler Loakman | Chen Tang | Chenghua Lin
Previous work in phonologically and phonetically grounded language generation has mainly focused on domains such as puns and poetry. In this article, we present new work on the generation of English tongue twisters—a form of language that is required to be conditioned on a phoneme level to maximize sound overlap, while maintaining semantic consistency with an input topic or phrase and still being grammatically correct. We present TwisterLister, a pipeline for generating phonologically informed tongue twisters from large language models (LLMs) that we use to generate TwistList 2.0, the largest annotated dataset of tongue twisters to date, consisting of 17k+ examples from a combination of human and LLM authors. Our generation pipeline involves the use of a phonologically constrained vocabulary alongside LLM prompting to generate novel, non-derivative tongue twister examples. We additionally present the results of automatic and human evaluation of smaller models trained on our generated dataset to demonstrate the extent to which phonologically motivated language types can be generated without explicit injection of phonological knowledge. Additionally, we introduce a phoneme-aware constrained decoding module (PACD) that can be integrated into an autoregressive language model and demonstrate that this method generates good quality tongue twisters both with and without fine-tuning the underlying language model. We also design and implement a range of automatic metrics for the task of tongue twister generation that is phonologically motivated and captures the unique essence of tongue twisters, primarily based on phonemic edit distance (PED).1
Eliciting and Improving the Causal Reasoning Abilities of Large Language Models with Conditional Statements
Xiao Liu | Da Yin | Chen Zhang | Dongyan Zhao | Yansong Feng
Xiao Liu | Da Yin | Chen Zhang | Dongyan Zhao | Yansong Feng
Causal reasoning, the ability to identify cause-and-effect relationships, is crucial in human thinking. Although large language models (LLMs) succeed in many NLP tasks, it is still challenging for them to conduct complex causal reasoning like abductive reasoning and counterfactual reasoning. Complex causal structures are rarely expressed explicitly in the text, which could make learning them challenging for LLMs. Given the fact that programming code may express causal relations more often and explicitly with conditional statements like if, we want to explore whether large language models of code (Code-LLMs) acquire better causal reasoning abilities, and whether code prompts better describe the causal structure than text prompts. Our experiments show that compared with general-purpose LLMs like Llama-2 and GPT-3, Code-LLMs like CodeLlama and Codex are significantly better in causal reasoning. Code prompts not only work well for Code-LLMs, but also help improve the performance of most general-purpose LLMs. To understand why code prompts are effective, we intervene on the prompts from different aspects, and discover that the programming structure is crucial in code prompt design, while models are more robust towards format perturbations. We further explore whether exposing models with more code with conditional statements aids in enhancing causal reasoning abilities. We finetune LLMs on such code corpus, and find their performance improves when prompted with either code prompts or text prompts.1
Investigating Idiomaticity in Word Representations
Wei He | Tiago Kramer Vieira | Marcos Garcia | Carolina Scarton | Marco Idiart | Aline Villavicencio
Wei He | Tiago Kramer Vieira | Marcos Garcia | Carolina Scarton | Marco Idiart | Aline Villavicencio
Idiomatic expressions are an integral part of human languages, often used to express complex ideas in compressed or conventional ways (e.g., eager beaver as a keen and enthusiastic person). However, their interpretations may not be straightforwardly linked to the meanings of their individual components in isolation and this may have an impact for compositional approaches. In this article, we investigate to what extent word representation models are able to go beyond compositional word combinations and capture multiword expression idiomaticity and some of the expected properties related to idiomatic meanings. We focus on noun compounds of varying levels of idiomaticity in two languages (English and Portuguese), presenting a dataset of minimal pairs containing human idiomaticity judgments for each noun compound at both type and token levels, their paraphrases and their occurrences in naturalistic and sense-neutral contexts, totalling 32,200 sentences. We propose this set of minimal pairs for evaluating how well a model captures idiomatic meanings, and define a set of fine-grained metrics of Affinity and Scaled Similarity, to determine how sensitive the models are to perturbations that may lead to changes in idiomaticity. Affinity is a comparative measure of the similarity between an experimental item, a target and a potential distractor, and Scaled Similarity incorporates a rescaling factor to magnify the meaningful similarities within the spaces defined by each specific model. The results obtained with a variety of representative and widely used models indicate that, despite superficial indications to the contrary in the form of high similarities, idiomaticity is not yet accurately represented in current models. Moreover, the performance of models with different levels of contextualization suggests that their ability to capture context is not yet able to go beyond more superficial lexical clues provided by the words and to actually incorporate the relevant semantic clues needed for idiomaticity. By proposing model-agnostic measures for assessing the ability of models to capture idiomaticity, this article contributes to determining limitations in the handling of non-compositional structures, which is one of the directions that needs to be considered for more natural, accurate, and robust language understanding. The source code and additional materials related to this paper are available at our GitHub repository.1
This article introduces a novel representation of Arabic text as an alternative approach for Arabic NLP, inspired by the dotless script of ancient Arabic. We explored this representation through extensive analysis on various text corpora, differing in size and domain, and tokenized using multiple tokenization techniques. Furthermore, we examined the information density of this representation and compared it with the standard dotted Arabic text using text entropy analysis. Utilizing parallel corpora, we also drew comparisons between Arabic and English text analysis to gain additional insights. Our investigation extended to various upstream and downstream NLP tasks, including language modeling, text classification, sequence labeling, and machine translation, examining the implications of both the representations. Specifically, we performed seven different downstream tasks using various tokenization schemes comparing the standard dotted text with dotless Arabic text representations. Performance using both the representations was comparable across different tokenizations. However, dotless representation achieves these results with significant reduction in vocabulary sizes, and in some scenarios showing reduction of up to 50%. Additionally, we present a system that restores dots to the dotless Arabic text. This system is useful for tasks that require Arabic texts as output.
LMLPA: Language Model Linguistic Personality Assessment
Jingyao Zheng | Xian Wang | Simo Hosio | Xiaoxian Xu | Lik-Hang Lee
Jingyao Zheng | Xian Wang | Simo Hosio | Xiaoxian Xu | Lik-Hang Lee
Large language models (LLMs) are increasingly used in everyday life and research. One of the most common use cases is conversational interactions, enabled by the language generation capabilities of LLMs. Just as between two humans, a conversation between an LLM-powered entity and a human depends on the personality of the conversants. However, measuring the personality of a given LLM is currently a challenge. This article introduces the Language Model Linguistic Personality Assessment (LMLPA), a system designed to evaluate the linguistic personalities of LLMs. Our system helps to understand LLMs’ language generation capabilities by quantitatively assessing the distinct personality traits reflected in their linguistic outputs. Unlike traditional human-centric psychometrics, the LMLPA adapts a personality assessment questionnaire, specifically the Big Five Inventory, to align with the operational capabilities of LLMs, and also incorporates the findings from previous language-based personality measurement literature. To mitigate sensitivity to the order of options, our questionnaire is designed to be open-ended, resulting in textual answers. Thus, the Artificial Intelligence (AI) rater is needed to transform ambiguous personality information from text responses into clear numerical indicators of personality traits. Utilizing Principal Component Analysis and reliability validation methods, our findings demonstrate that LLMs possess distinct personality traits that can be effectively quantified by the LMLPA. This research contributes to Human-Centered AI and Computational Linguistics, providing a robust framework for future studies to refine AI personality assessments and expand their applications in multiple areas, including education and manufacturing.
A central goal of linguistic theory is to find a precise characterization of the notion “possible human language”, in the form of a computational device that is capable of describing all and only the languages that can be acquired by a typically developing human child. The success of recent large language models (LLMs) in NLP applications arguably raises the possibility that LLMs might be computational devices that meet this goal. This would only be the case if, in addition to succeeding in learning human languages, LLMs struggle to learn “impossible” human languages. Kallini et al. (2024) conducted experiments aiming to test this by training GPT-2 on a variety of synthetic languages, and found that it learns some more successfully than others. They present these asymmetries as support for the idea that LLMs’ inductive biases align with what is regarded as “possible” for human languages, but the most significant comparison has a confound that makes this conclusion unwarranted.
Do texts generated by language models (LMs) refer? Mandelkern and Linzen (2024) argue that externalist principles point to an affirmative conclusion. What grounds reference, according to their externalism, is a term’s “natural history”. For example, ‘water’ refers to H2O among English speakers, and not to the phenomenally indistinguishable chemical XYZ, because H2O, and not XYZ, is implicated in the natural history of ‘water’. Appealing to the literature on contrastive explanation, I show that a term’s natural history does not generally ground its referential properties. Thus, Mandelkern and Linzen’s quick route to the referentiality of LM-generated texts fails.
LLM-based NLG Evaluation: Current Status and Challenges
Mingqi Gao | Xinyu Hu | Xunjian Yin | Jie Ruan | Xiao Pu | Xiaojun Wan
Mingqi Gao | Xinyu Hu | Xunjian Yin | Jie Ruan | Xiao Pu | Xiaojun Wan
Evaluating natural language generation (NLG) is a vital but challenging problem in natural language processing. Traditional evaluation metrics mainly capturing content (e.g., n-gram) overlap between system outputs and references are far from satisfactory, and large language models (LLMs) such as ChatGPT have demonstrated great potential in NLG evaluation in recent years. Various automatic evaluation methods based on LLMs have been proposed, including metrics derived from LLMs, prompting LLMs, fine-tuning LLMs, and human–LLM collaborative evaluation. In this survey, we first give a taxonomy of LLM-based NLG evaluation methods, and discuss their pros and cons, respectively. Lastly, we discuss several open problems in this area and point out future research directions.
Socially Aware Language Technologies: Perspectives and Practices
Diyi Yang | Dirk Hovy | David Jurgens | Barbara Plank
Diyi Yang | Dirk Hovy | David Jurgens | Barbara Plank
Language technologies have advanced substantially, particularly with the introduction of large language models. However, these advancements can exacerbate several issues that models have traditionally faced, including bias, evaluation, and risk. In this perspective piece, we argue that many of these issues share a common core: a lack of awareness of the social factors, interactions, and implications of the social environment in which NLP operates. We call this social awareness. While NLP is improving at addressing linguistic issues, there has been relatively limited progress in incorporating social awareness into models to work in all situations for all users. Integrating social awareness into NLP will improve the naturalness, usefulness, and safety of applications while also opening up new applications. Today, we are only at the start of a new, important era in the field.
up
Computational Linguistics, Volume 51, Issue 3 - September 2025
Graded Suspiciousness of Adversarial Texts to Humans
Shakila Mahjabin Tonni | Pedro Faustini | Mark Dras
Shakila Mahjabin Tonni | Pedro Faustini | Mark Dras
Adversarial examples pose a significant challenge to deep neural networks across both image and text domains, with the intent to degrade model performance through carefully altered inputs. Adversarial texts, however, are distinct from adversarial images due to their requirement for semantic similarity and the discrete nature of the textual contents. This study delves into the concept of human suspiciousness, a quality distinct from the traditional focus on imperceptibility found in image-based adversarial examples, where adversarial changes are often desired to be indistinguishable to the human eye even when placed side by side with originals. Although this is generally not possible with text, textual adversarial content must still often remain undetected or non-suspicious to human readers. Even when the text’s purpose is to deceive NLP systems or bypass filters, the text is often expected to be natural to read. In this research, we expand the study of human suspiciousness by analyzing how individuals perceive adversarial texts. We gather and publish a novel dataset of Likert-scale human evaluations on the suspiciousness of adversarial sentences, crafted by four widely used adversarial attack methods and assess their correlation with the human ability to detect machine-generated alterations. Additionally, we develop a regression-based model to predict levels of suspiciousness and establish a baseline for future research in reducing the suspiciousness in adversarial text generation. We also demonstrate how the regressor-generated suspicious scores can be incorporated into adversarial generation methods to produce texts that are less likely to be perceived as computer-generated.
UniASA: A Unified Generative Framework for Argument Structure Analysis
Jianzhu Bao | Mohan Jing | Kuicai Dong | Aixin Sun | Yang Sun | Ruifeng Xu
Jianzhu Bao | Mohan Jing | Kuicai Dong | Aixin Sun | Yang Sun | Ruifeng Xu
Argumentation is a fundamental human activity that involves reasoning and persuasion, which also serves as the basis for the development of AI systems capable of complex reasoning. In NLP, to better understand human argumentation, argument structure analysis aims to identify argument components, such as claims and premises, and their relations from free text. It encompasses a variety of divergent tasks, such as end-to-end argument mining, argument pair extraction, and argument quadruplet extraction. Existing methods are usually tailored to only one specific argument structure analysis task, overlooking the inherent connections among different tasks. We observe that the fundamental goal of these tasks is similar: identifying argument components and their interrelations. Motivated by this, we present a unified generative framework for argument structure analysis (UniASA). It can uniformly address multiple argument structure analysis tasks in a sequence-to-sequence manner. Further, we enhance UniASA with a multi-view learning strategy based on subtask decomposition. We conduct experiments on seven datasets across three tasks. The results indicate that UniASA can address these tasks uniformly and achieve performance that is either superior to or comparable with the previous state-of-the-art methods. Also, we show that UniASA can be effectively integrated with large language models, such as Llama, through fine-tuning or in-context learning.
Large language models segment many words into multiple tokens, and there is mixed evidence as to whether tokenization affects how state-of-the-art models represent meanings. Chinese characters present an opportunity to investigate this issue: They contain semantic radicals, which often convey useful information; characters with the same semantic radical tend to begin with the same one or two bytes (when using UTF-8 encodings); and tokens are common strings of bytes, so characters with the same radical often begin with the same token. This study asked GPT-4, GPT-4o, and Llama 3 whether characters contain the same semantic radical, elicited semantic similarity ratings, and conducted odd-one-out tasks (i.e., which character is not like the others). In all cases, misalignment between tokens and radicals systematically corrupted representations of Chinese characters. In experiments comparing characters represented by single tokens to multi-token characters, the models were less accurate for single-token characters, which suggests that segmenting words into fewer, longer tokens obscures valuable information in word form and will not resolve the problems introduced by tokenization. In experiments with 12 European languages, misalignment between tokens and suffixes systematically corrupted categorization of words by all three models, which suggests that the tendency to treat malformed tokens like linguistic units is pervasive.
The Emergence of Chunking Structures with Hierarchical RNN
Zijun Wu | Anup Anand Deshmukh | Yongkang Wu | Jimmy Lin | Lili Mou
Zijun Wu | Anup Anand Deshmukh | Yongkang Wu | Jimmy Lin | Lili Mou
In Natural Language Processing (NLP), predicting linguistic structures, such as parsing and chunking, has mostly relied on manual annotations of syntactic structures. This article introduces an unsupervised approach to chunking, a syntactic task that involves grouping words in a non-hierarchical manner. We present a Hierarchical Recurrent Neural Network (HRNN) designed to model word-to-chunk and chunk-to-sentence compositions. Our approach involves a two-stage training process: pretraining with an unsupervised parser and finetuning on downstream NLP tasks. Experiments on multiple datasets reveal a notable improvement of unsupervised chunking performance in both pretraining and finetuning stages. Interestingly, we observe that the emergence of the chunking structure is transient during the neural model’s downstream-task training. This study contributes to the advancement of unsupervised syntactic structure discovery and opens avenues for further research in linguistic theory.1
Exploiting Contextual Embeddings in Hierarchical Topic Modeling and Investigating the Limits of the Current Evaluation Metrics
Felipe Viegas | Antonio Pereira | Washington Cunha | Celso França | Claudio Andrade | Elisa Tuler | Leonardo Rocha | Marcos André Gonçalves
Felipe Viegas | Antonio Pereira | Washington Cunha | Celso França | Claudio Andrade | Elisa Tuler | Leonardo Rocha | Marcos André Gonçalves
We investigate two essential challenges in the context of hierarchical topic modeling (HTM)—(i) the impact of data representation and (ii) topic evaluation. The data representation directly influences the performance of the topic generation, and the impact of new representations such as contextual embeddings in this task has been under-investigated. Topic evaluation, responsible for driving the advances in the field, assesses the overall quality of the topic generation process. HTM studies exploit the exact topic modeling (TM) evaluation metrics as traditional TM to measure the quality of topics. One significant result of our work is demonstrating that the HTM’s hierarchical nature demands novel ways of evaluating the quality of topics. As our main contribution, we propose two new topic quality metrics to assess the topical quality of the hierarchical structures. Uniqueness considers topic topological consistency, while the Semantic Hierarchical Structure (SHS) captures the semantic relatedness of the hierarchies. We also present an additional advance to the state-of-the-art by proposing the c-CluHTM. To the best of our knowledge, c-CluHTM is the first method that exploits contextual embeddings into NMF in HTM tasks. c-CluHTM enhances the topics’ semantics while preserving the hierarchical structure. We perform an experimental evaluation, and our results demonstrate the superiority of our proposal with gains between 12% and 21%, regarding NPMI and Coherence over the best baselines. Regarding the newly proposed metrics, our results reveal that Uniqueness and SHS can capture relevant information about the structure of the hierarchical topics that traditional metrics cannot.
This position paper’s primary goal is to provoke thoughtful discussion about the relationship between bias and fundamental properties of large language models (LLMs). I do this by seeking to convince the reader that harmful biases are an inevitable consequence arising from the design of any large language model as LLMs are currently formulated. To the extent that this is true, it suggests that the problem of harmful bias cannot be properly addressed without a serious reconsideration of AI driven by LLMs, going back to the foundational assumptions underlying their design.
Survey of Cultural Awareness in Language Models: Text and Beyond
Siddhesh Pawar | Junyeong Park | Jiho Jin | Arnav Arora | Junho Myung | Srishti Yadav | Faiz Ghifari Haznitrama | Inhwa Song | Alice Oh | Isabelle Augenstein
Siddhesh Pawar | Junyeong Park | Jiho Jin | Arnav Arora | Junho Myung | Srishti Yadav | Faiz Ghifari Haznitrama | Inhwa Song | Alice Oh | Isabelle Augenstein
Large-scale deployment of large language models (LLMs) in various applications, such as chatbots and virtual assistants, requires LLMs to be culturally sensitive to the user to ensure inclusivity. Culture has been widely studied in psychology and anthropology, and there has been a recent surge in research on making LLMs more culturally inclusive, going beyond multilinguality and building on findings from psychology and anthropology. In this article, we survey efforts towards incorporating cultural awareness into text-based and multimodal LLMs. We start by defining cultural awareness in LLMs, taking definitions of culture from the anthropology and psychology literature as a point of departure. We then examine methodologies adopted for creating cross-cultural datasets, strategies for cultural inclusion in downstream tasks, and methodologies that have been used for benchmarking cultural awareness in LLMs. Further, we discuss the ethical implications of cultural alignment, the role of human–computer interaction in driving cultural inclusion in LLMs, and the role of cultural alignment in driving social science research. We finally provide pointers to future research based on our findings about gaps in the literature.1
Large Language Models have become capable of generating highly fluent text in certain languages, without modules specially designed to capture grammar or semantic coherence. What does this mean for the future of linguistic expertise in NLP? We highlight several aspects in which NLP (still) relies on linguistics, or where linguistic thinking can illuminate new directions. We argue our case around the acronym RELIES, which encapsulates six major facets where linguistics contributes to NLP: Resources, Evaluation, Low-resource settings, Interpretability, Explanation, and the Study of language. This list is not exhaustive, nor is linguistics the main point of reference for every effort under these themes; but at a macro level, these facets highlight the enduring importance of studying machine systems vis-à-vis systems of human language.
up
Computational Linguistics, Volume 51, Issue 4 - December 2025
Large language models (LMs) such as ChatGPT have revolutionized natural language processing and artificial intelligence more broadly. In this work, we discuss our research on understanding and advancing these models, centered around how they use the very large text corpora they are trained on. First, we describe our efforts to understand how these models learn to perform new tasks after training, demonstrating that their so-called in-context learning capabilities are almost entirely determined by what they learn from the training data. Next, we introduce a new class of LMs—nonparametric LMs—that repurpose this training data as a data store from which they retrieve information for improved accuracy and updatability. We discuss our work establishing the foundations of such models, including one of the first broadly used neural retrieval models and an approach that simplifies a traditional, two-stage pipeline into one. We also discuss how nonparametric models open up new avenues for responsible data use, e.g., by segregating permissive and copyrighted text and using them differently. Finally, we envision the next generation of LMs we should build, focusing on efficient scaling, improved factuality, and decentralization.
Tokenization is the first step in modern neural language model pipelines where an input text is converted to a sequence of subword tokens. We introduce from first principles a finite-state transduction framework that can encode all possible tokenizations of a regular language. We then constructively show that Byte-Pair Encoding (BPE) and MaxMatch (WordPiece), two popular tokenization schemes, are also efficiently representable by simple finite-state transducers. For BPE, this is particularly surprising given that it does not tokenize strings from left to right and requires a notion of priority. We also discuss an application of subword-level pattern promotion to guided generation, where the outputs of a language model are constrained to match a specified pattern, and how tokenization-aware promotion offers a theoretical benefit to modeling.
IDT: Dual-Task Adversarial Rewriting for Attribute Anonymization
Pedro Faustini | Shakila Mahjabin Tonni | Annabelle McIver | Qiongkai Xu | Mark Dras
Pedro Faustini | Shakila Mahjabin Tonni | Annabelle McIver | Qiongkai Xu | Mark Dras
Natural language processing (NLP) models may leak private information in different ways, including membership inference, reconstruction, or attribute inference attacks. Sensitive information may not be explicit in the text, but hidden in underlying writing characteristics. Methods to protect privacy can involve using representations inside models that are demonstrated not to detect sensitive attributes or—for instance, in cases where users might be at risk from an untrustworthy model, the sort of scenario of interest here—changing the raw text before models can have access to it. The goal is to rewrite text to prevent someone from inferring a sensitive attribute (e.g., the gender of the author, or their location by the writing style) while keeping the text useful for its original intention (e.g., the sentiment of a product review). The few works tackling this have focused on generative techniques. However, these often create extensively different texts from the original ones or face problems such as mode collapse. This article explores a novel adaptation of adversarial attack techniques to manipulate a text to deceive a classifier w.r.t. one task (privacy) while keeping the predictions of another classifier trained for another task (utility) unchanged. We propose IDT, a method that analyses predictions made by auxiliary and interpretable models to identify which tokens are important to change for the privacy task, and which ones should be kept for the utility task. We evaluate different datasets for NLP suitable for different tasks. Automatic and human evaluations show that IDT retains the utility of text, while also outperforming existing methods when deceiving a classifier w.r.t. a privacy task.
Measuring Grammatical Diversity from Small Corpora: Derivational Entropy Rates, Mean Length of Utterances, and Annotation Invariance
Fermín Moscoso del Prado Martín
Fermín Moscoso del Prado Martín
In many fields, such as language acquisition, neuropsychology of language, the study of aging, and historical linguistics, corpora are used for estimating the diversity of grammatical structures that are produced during a period by an individual, community, or type of speakers. In these cases, treebanks are taken as representative samples of the syntactic structures that might be encountered. Generalizing the potential syntactic diversity from the structures documented in a small corpus requires careful extrapolation whose accuracy is constrained by the limited size of representative sub-corpora. In this article, I demonstrate—both theoretically and empirically—that a grammar’s derivational entropy and the mean length of the utterances (MLU) it generates are fundamentally linked, giving rise to a new measure, the derivational entropy rate. The mean length of utterances becomes the most practical index of syntactic complexity; I demonstrate that MLU is not a mere proxy, but a fundamental measure of syntactic diversity. In combination with the new derivational entropy rate measure, it provides a theory-free assessment of grammatical complexity. The derivational entropy rate indexes the rate at which different grammatical annotation frameworks determine the grammatical complexity of treebanks. I evaluate the Smoothed Induced Treebank Entropy (SITE) as a tool for estimating these measures accurately, even from very small treebanks. I conclude by discussing important implications of these results for both NLP and human language processing.
LGDE: Local Graph-based Dictionary Expansion
Juni Schindler | Sneha Jha | Xixuan Zhang | Kilian Buehling | Annett Heft | Mauricio Barahona
Juni Schindler | Sneha Jha | Xixuan Zhang | Kilian Buehling | Annett Heft | Mauricio Barahona
We present Local Graph-based Dictionary Expansion (LGDE), a method for data-driven discovery of the semantic neighborhood of words using tools from manifold learning and network science. At the heart of LGDE lies the creation of a word similarity graph from the geometry of word embeddings followed by local community detection based on graph diffusion. The diffusion in the local graph manifold allows the exploration of the complex nonlinear geometry of word embeddings to capture word similarities based on paths of semantic association, over and above direct pairwise similarities. Exploiting such semantic neighborhoods enables the expansion of dictionaries of pre-selected keywords, an important step for tasks in information retrieval, such as database queries and online data collection. We validate LGDE on two user-generated English-language corpora and show that LGDE enriches the list of keywords with improved performance relative to methods based on direct word similarities or co-occurrences. We further demonstrate our method through a real-world use case from communication science, where LGDE is evaluated quantitatively on the expansion of a conspiracy-related dictionary from online data collected and analyzed by domain experts. Our empirical results and expert user assessment indicate that LGDE expands the seed dictionary with more useful keywords due to the manifold-learning-based similarity network.
BLiMP-NL: A Corpus of Dutch Minimal Pairs and Acceptability Judgments for Language Model Evaluation
Michelle Suijkerbuijk | Zoë Prins | Marianne de Heer Kloots | Willem Zuidema | Stefan L. Frank
Michelle Suijkerbuijk | Zoë Prins | Marianne de Heer Kloots | Willem Zuidema | Stefan L. Frank
We present a corpus of 8,400 Dutch sentence pairs, intended primarily for the grammatical evaluation of language models. Each pair consists of a grammatical sentence and a minimally different ungrammatical sentence. The corpus covers 84 paradigms, classified into 22 syntactic phenomena. Ten sentence pairs of each paradigm were created by hand, while the remaining 90 were generated semi-automatically and manually validated afterwards. Nine of the 10 hand-crafted sentences of each paradigm are rated for acceptability by at least 30 participants each, and for the same 9 sentences reading times are recorded per word, through self-paced reading. Here, we report on the construction of the dataset, the measured acceptability ratings and reading times, as well as the extent to which a variety of language models can be used to predict both the ground-truth grammaticality and human acceptability ratings.
A Novel Methodology for Enhancing Cross-language and Domain Adaptability in Temporal Expression Normalization
Alejandro Sánchez de Castro | Lourdes Araujo | Juan Martinez-Romo
Alejandro Sánchez de Castro | Lourdes Araujo | Juan Martinez-Romo
Accurate temporal expression normalization, the process of assigning a numerical value to a temporal expression, is essential for tasks such as timeline creation and temporal reasoning. While rule-based normalization systems are limited in adaptability across different domains and languages, deep-learning solutions in this area have not been extensively explored. An additional challenge is the scarcity of manually annotated corpora with temporal annotations. To address the adaptability limitations of current systems, we propose a highly adaptable methodology that can be applied to multiple domains and languages. This can be achieved by leveraging a multilingual Pre-trained Language Model (PTLM) with a fill-mask architecture, using a Value Intermediate Representation (VIR) where the temporal expression value format is adjusted to the fill-mask representation. Our approach involves a two-phase training process. Initially, the model is trained with a novel masking policy on a large English biomedical corpus that is automatically annotated with normalized temporal expressions, along with a complementary hand-crafted temporal expressions corpus. This addresses the lack of manually annotated data and helps to achieve sufficient capacity for adaptation to diverse domains or languages. In the second phase, we show how the model can be tailored to different domains and languages using various techniques, showcasing the versatility of the proposed methodology. This approach significantly outperforms existing systems.
Problem Solving Through Human–AI Preference-based Cooperation
Subhabrata Dutta | Timo Kaufmann | Goran Glavaš | Ivan Habernal | Kristian Kersting | Frauke Kreuter | Mira Mezini | Iryna Gurevych | Eyke Hüllermeier | Hinrich Schütze
Subhabrata Dutta | Timo Kaufmann | Goran Glavaš | Ivan Habernal | Kristian Kersting | Frauke Kreuter | Mira Mezini | Iryna Gurevych | Eyke Hüllermeier | Hinrich Schütze
While there is a widespread belief that artificial general intelligence—or even superhuman AI—is imminent, complex problems in expert domains are far from being solved. We argue that such problems require human–AI cooperation and that the current state of the art in generative AI is unable to play the role of a reliable partner due to a multitude of shortcomings, including difficulty in keeping track of a complex solution artifact (e.g., a software program), limited support for versatile human preference expression, and lack of adapting to human preference in an interactive setting. To address these challenges, we propose HAI-Co2, a novel human–AI co-construction framework. We take first steps towards a formalization of HAI-Co2 and discuss the difficult open research problems that it faces.
🧜Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models
Yue Zhang | Yafu Li | Leyang Cui | Deng Cai | Lemao Liu | Tingchen Fu | Xinting Huang | Enbo Zhao | Yu Zhang | Yulong Chen | Longyue Wang | Anh Tuan Luu | Wei Bi | Freda Shi | Shuming Shi
Yue Zhang | Yafu Li | Leyang Cui | Deng Cai | Lemao Liu | Tingchen Fu | Xinting Huang | Enbo Zhao | Yu Zhang | Yulong Chen | Longyue Wang | Anh Tuan Luu | Wei Bi | Freda Shi | Shuming Shi
While large language models (LLMs) have demonstrated remarkable capabilities across a range of downstream tasks, a significant concern revolves around their propensity to exhibit hallucinations: LLMs occasionally generate content that diverges from the user input, contradicts previously generated context, or misaligns with established world knowledge. This phenomenon poses a substantial challenge to the reliability of LLMs in real-world scenarios. In this article, we survey recent efforts on the detection, explanation, and mitigation of hallucination, with an emphasis on the unique challenges posed by LLMs. We present taxonomies of the LLM hallucination phenomena and evaluation benchmarks, analyze existing approaches aiming at mitigating LLM hallucination, and discuss potential directions for future research.
The ACL community has very little interest in evaluating the real-world impact of NLP systems. A structured survey of the ACL Anthology shows that perhaps 0.1% of its papers contain such evaluations; furthermore most papers that include impact evaluations present them very sketchily and instead focus on metric evaluations. NLP technology would be more useful and more quickly adopted if we seriously tried to understand and evaluate its real-world impact.