Conference on Computational Natural Language Learning (2026)



Proceedings of the 30th Conference on Computational Natural Language Learning

Recent studies examining cued recall in Transformers have observed that these language models remember information from the beginning or end of a passage more easily than information in the middle, a pattern which is evocative of serial position effects (primacy and recency) observed in human memory. However, while these effects have been documented in humans across a range of memory tasks (e.g., serial recall, free recall, item recognition), it is less clear whether they generalize beyond cued recall in Transformers. We address this limitation of previous work by performing novel behavioral evaluations on Transformers using a simple item recognition paradigm, which we compare against evaluations using cued recall. We find that Transformers show weak or absent recency effects in item recognition, a pattern which differs from human behavior and from Transformers’ own behavior in cued recall. A subsequent experiment examines the role of Transformers’ architectural biases in producing serial position effects in item recognition and cued recall.
Real-time sentence comprehension imposes a significant load on working memory, as comprehenders must maintain contextual information to anticipate future input. While measures of such load have played an important role in psycholinguistic theories, they have largely been formalized using symbolic grammars, which assign discrete, uniform costs to syntactic predictions. This study proposes a measure of processing storage cost based on an information-theoretic formalization, as the amount of information previous words carry about future context, under uncertainty. Unlike previous discrete, grammar-based metrics, this measure is continuous, probabilistic, theory-neutral, and can be estimated from pre-trained neural language models. The validity of this approach is demonstrated through three analyses in English: our measure (i) recovers well-known processing asymmetries in center embeddings and relative clauses, (ii) correlates with a grammar-based storage cost in a syntactically-annotated corpus, and (iii) predicts reading-time variance in two large-scale naturalistic datasets over and above baseline models with traditional information-based predictors. Our code is available at https://github.com/kohei-kaji/info-storage.
Much recent work has been interested in modeling language processing using measures of predictability estimated from pretrained language models. These models, however, are primarily built as language technologies rather than cognitive models, and make many design choices that may align poorly with theories of human language processing. We investigate one such choice, the size of the vocabulary learned by a BPE tokenizer, and examine (1) its effect on the linguistic plausibility of the subword units the model learns, (2) whether vocabulary size has a substantial influence on the surprisal estimates a model generates, and (3) whether those differences in surprisal translate to differences in the quality of downstream reading time predictions. We find that while vocabulary size does not substantially affect the rate of morphologically reasonable tokenizations, it does have an impact on surprisal estimates and reading time predictions from 5-gram, LSTM, and GPT-2 language models. Moreover, we find that these differences primarily affect words that are split by the tokenizer, suggesting that psycholinguists should take care to design stimuli meant for computational modeling with subword tokenization in mind.
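To make concrete why words split by the tokenizer are affected, the following minimal sketch shows one common way to obtain word-level surprisal from a subword model: summing the surprisals of a word's pieces. This is an illustration under stated assumptions, not the authors' pipeline; the function name `word_surprisals`, the input layout (one natural-log probability and one word index per subword), and the toy probabilities are all hypothetical.

```python
import math
from collections import defaultdict


def word_surprisals(subword_logprobs, word_ids):
    """Sum subword surprisals (in bits) into one surprisal per word.

    subword_logprobs: natural-log probabilities, one per subword token
                      (hypothetical input layout).
    word_ids: index of the word each subword belongs to, same length.
    """
    totals = defaultdict(float)
    for logprob, word in zip(subword_logprobs, word_ids):
        # Convert the natural-log probability to surprisal in bits.
        totals[word] += -logprob / math.log(2)
    # Return surprisals in word order.
    return [totals[word] for word in sorted(totals)]


# Toy example: word 0 is a single token, word 1 is split into two pieces.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5)]
word_ids = [0, 1, 1]
print(word_surprisals(logprobs, word_ids))
```

Because a split word's surprisal is a sum over its pieces, a different vocabulary size changes how many terms enter that sum and which conditional probabilities they use, whereas an unsplit word contributes a single term either way.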
Can LLMs make metalinguistic judgments? While LLM embeddings are often regarded as high-quality semantic representations, it is not clear that prompting an LLM is a useful way to obtain metalinguistic insights (e.g., whether a DIY gun kit is a “firearm”). While some prior work has suggested LLM prompting can simulate surveys with human participants, computational studies in the domain of legal interpretation have found that LLMs are unreliable for metalinguistic judgments due to prompt sensitivity. However, these studies did not directly compare humans and LLMs on identical tasks, nor did they test so-called “reasoning” models. The current study addresses these gaps by directly comparing the robustness of human and LLM judgments (with and without reasoning) in an English-language legal interpretation task. Our results show that LLMs were more sensitive to irrelevant prompt features compared to human participants. Enabling reasoning improved the stability of LLM responses. However, even reasoning model outputs had only moderate correlations with human judgments, and all models sometimes output interpretations that no humans reached in response to the same prompt. We conclude that while reasoning decreases prompt sensitivity, LLMs are still poor proxies for human metalinguistic judgments.
Maintaining information in context is essential in successful real-time language comprehension, but maintenance is cognitively costly and can slow processing. We hypothesize that rational language users selectively maintain information that is crucial for future prediction, guided by syntactic structure. Under this view, two factors affect maintenance cost: the number of predicted heads and the number of incomplete dependencies. Although these factors have been treated as competing hypotheses in the literature, our account predicts that they are not reducible to one another. We show this is the case in a naturalistic reading time dataset in Japanese, a language in which the two factors contrast particularly clearly. We further show that there is a tradeoff such that readers that slow down for maintenance tend to benefit more from predictability, providing additional support for the proposed account. These patterns are not evident in English, however, and we highlight some issues to be resolved to understand the contribution of syntax in memory-efficient processing of various languages.
Humans and large language models (LLMs) both generate predictions during language processing, but whether they integrate structural and prosodic cues similarly during visually grounded speech remains underexplored. Multimodal LLMs that jointly process speech and vision now make it possible to compare not only what humans and models predict, but also when predictions emerge. We compared Mandarin speakers and Qwen2.5-Omni-7B on Mandarin dative constructions in a visual world paradigm (VWP), asking how these cues guide predictions about upcoming referents. Experiment 1 used a cloze-in-VWP task to assess offline prediction outputs; Experiment 2 examined online processing via human eye-tracking and a model audio-to-image cross-modal attention measure. In Experiment 1, humans and the model were both sensitive to structure and prosody, consistent with partial output-level alignment, but the model showed a larger structural effect and a condition-specific atypical prosody pattern. In Experiment 2, the time courses diverged: humans showed structural effects before the contrastive connective, whereas the model’s sensitivity emerged later, after connective onset. These findings indicate that output-level and process-level alignment can dissociate in this paradigm. This study contributes a methodology for multi-level human–model comparison and provides empirical constraints on claims about the cognitive plausibility of multimodal LLMs.
In what ways might statistical signals in linguistic input assist with the acquisition of syntax? Here we hypothesize a mechanism called collocational bootstrapping, in which regularities in word co-occurrence patterns can provide cues to syntactic dependencies. We investigate whether this mechanism can support the acquisition of English subject-verb agreement. First, we simulate language acquisition by training neural networks on synthetic datasets that vary in how predictable their subject-verb pairings are. We find that there is a range of variability levels at which these statistical learners robustly learn subject-verb agreement. We then analyze the variability of subject-verb pairings in child-directed language, and we find that the variability in such data falls within the range that supported robust generalization in our computational simulations. Taken together, these results suggest that collocational bootstrapping is a viable learning strategy for the type of input that children receive.
Emergent communication models support interaction-based language learning, benefiting both Natural Language Processing (NLP) applications and simulations of language evolution, but they are prone to destabilizing language drift. Inspired by developmental trajectories in human language acquisition, this paper investigates whether age-based plasticity, where younger agents learn quickly and older agents maintain stable representations, can reduce language drift. In our set-up, static populations first reliably develop shared languages, followed by a phase in which population turnover gradually replaces older agents with new learners. Age-based plasticity significantly reduces drift in this setting, maintaining high accuracy and language similarity. In contrast, in populations with uniformly low plasticity agents cannot adapt quickly enough to integrate newcomers and in those with uniformly high plasticity the language changes faster than stable conventions can form. These findings demonstrate that developmental trajectories in individual learners substantially reduce overall language drift in dynamic populations.
Language model training and inference ignore a fundamental linguistic fact: there is a dependence between multiple sequences of text written by the same person. Prior work has shown that addressing this form of ecological fallacy can greatly improve the performance of multiple smaller (~124M) GPT-based models. In this work, we ask if addressing the ecological fallacy by modeling the author’s language context with a specific LM task (called HuLM) can provide similar benefits for a larger-scale model, an 8B Llama model. To this end, we explore variants that process an author’s language in the context of their other temporally ordered texts. We study the effect of pre-training with this author context using the HuLM objective, as well as using it during fine-tuning with author context (HuFT: Human-aware Fine-Tuning). Empirical comparisons show that addressing the ecological fallacy during fine-tuning alone using QLoRA improves the performance of the larger 8B model over standard fine-tuning. Additionally, QLoRA-based continued HuLM pre-training yields a human-aware model that generalizes, improving performance on eight downstream tasks with linear task classifier training alone. These results indicate the utility and importance of modeling language in the context of its original generators, the authors.
Transformer language models embed tokens in high-dimensional spaces, but whether geometry reflects linguistic structure remains unclear. We analyse token representations in BERT and GPT-2, selected as canonical encoder-only and decoder-only Transformer architectures, through a linguistically grounded geometric lens. We partition tokens from the UD English-EWT treebank by surface and syntactic features (position, length, POS, head distance and arity) and examine how their representational geometry evolves across layers. We employ complementary diagnostic metrics, including isotropy, linear and nonlinear intrinsic dimensionality, to capture distinct aspects of embedding structure. Our findings reveal that BERT maintains more isotropic and higher-dimensional subspaces, whereas GPT-2 exhibits stronger anisotropy driven by a compact cluster of sentence-initial tokens. Across models, open-class words, longer tokens, and high-arity predicates occupy more isotropic, higher-dimensional manifolds than short function words and pre-head modifiers, indicating that semantic richness and syntactic centrality play a key role in structuring embedding space. Our analysis provides a reusable framework for profiling how linguistic abstractions organize the geometry of Transformer embeddings.
Brain-tuning language models (LMs)—fine-tuning LMs to predict brain recordings elicited by linguistic stimuli—has been proposed as a promising way to align LMs closer to the human brain, with recent work reporting gains on a small number of downstream tasks. However, it remains unclear what benefits brain data provide beyond those obtainable from further training on the same underlying linguistic input, and whether such benefits generalize across tasks. Here, we present a comprehensive evaluation of jointly-tuned LMs, trained on both brain recordings and text-based stimuli, brain-tuned LMs and LMs tuned only on text-based stimuli (i.e., stimulus-tuned LMs). We compare models across a diverse suite of downstream linguistic tasks. We find that jointly-tuned LMs outperform other fine-tuned and pretrained models, and that brain-tuned LMs outperform stimulus-tuned LMs, demonstrating the richness of brain data as an additional training signal for LMs.
We present a method for learning large-scale, broad-coverage construction grammars from corpora of language use. Starting from utterances annotated with constituency structure and semantic frames, the method facilitates the learning of human-interpretable computational construction grammars that capture the intricate relationship between syntactic structures and the semantic relations they express. The resulting grammars consist of networks of tens of thousands of constructions formalised within the Fluid Construction Grammar framework. Not only do these grammars support the frame-semantic analysis of open-domain text, they also house a trove of information about the syntactico-semantic usage patterns present in the data they were learnt from. The method and learnt grammars contribute to the scaling of usage-based, constructionist approaches to language, as they corroborate the scalability of a number of fundamental construction grammar conjectures while also providing a practical instrument for the constructionist study of English argument structure in broad-coverage corpora.
Recent studies suggest that child-directed speech (CDS) is not conducive to language learning in BabyLMs. However, current evaluations focus predominantly on comprehension rather than production, which is central to usage-based theories of language acquisition; these theories argue that CDS facilitates early language use through constructional “frames” (frequent lexical patterns with open slots). We introduce a novel generation-based evaluation inspired by such theories in the form of a **frame-completion task**, and compare Llama models trained with CDS, the BabyLM corpus, and web-crawl data (FineWeb-edu) on comprehension benchmarks and our novel framework. Our results reveal a clear dissociation between models’ comprehension and production capabilities: while FineWeb-trained models excel at minimal pairs, CDS-trained models produce grammatical completions substantially earlier in training and concentrate probability mass on appropriate slot-fillers. These findings show that comprehension benchmarks underestimate what CDS affords to BabyLMs.
Grasping the semantics of rare constructions (form–meaning pairings) has been shown to be a challenging problem that has currently only been solved by the largest LLMs. It remains an open question if open-source models have robust constructional understanding, and if so, what learning dynamics underlie the acquisition of this knowledge. Focusing on a set of rare Paired-Focus constructions in English (e.g. "let alone", "much less"), we construct a novel dataset to test their meanings using both scalar adjectival semantics and general world knowledge. Testing a wide range of models differing in parameter count, architecture, and pretraining dataset size, we find that several modestly sized models are sensitive to both the forms and the meanings of Paired-Focus constructions, though models trained on human-scale data fail at all meaning evaluations. Turning to training dynamics for a set of open-checkpoint models, we find that Paired-Focus understanding emerges later in training than Paired-Focus syntactic knowledge, and that learning of Paired-Focus semantics is correlated with gains in some domains of world knowledge. Overall, our empirical results support the conclusion that modestly sized open-source models can grasp the rare Paired-Focus constructions, and demonstrate a connection between knowledge of Paired-Focus constructions and other meaning domains.
Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word order. In this paper, we extend this paradigm to differential argument marking (DAM), a semantic licensing system in which morphological marking depends on semantic prominence. Using a controlled synthetic learning method, we train GPT-2 models on 18 corpora implementing distinct DAM systems and evaluate their generalization using minimal pairs. Our results reveal a dissociation between two typological dimensions of DAM. Models reliably exhibit human-like preferences for natural markedness direction, favoring systems in which overt marking targets semantically atypical arguments. In contrast, models do not reproduce the strong object preference in human languages, in which overt marking in DAM more often targets objects rather than subjects. These findings suggest that different typological tendencies may arise from distinct underlying sources.
Low-resource language varieties used by specific groups remain neglected in the development of Multilingual Language Models. A great deal of cross-lingual research focuses on inter-lingual language transfer which strives to align allied varieties and minimize differences between them. However, for low-resource varieties, linguistic dissimilarity is also an important cue allowing generalization to unseen varieties. Unlike prior approaches, we propose a two-stage Language Generalization framework that focuses on capturing variety-specific cues while also exploiting the rich overlap offered by a high-resource source variety. First, we propose TOPPing, a source-selection method specifically designed for low-resource varieties. Second, we suggest a lightweight VAÇAÍ-Bowl architecture that learns variety-specific attributes with one branch while a parallel branch captures variety-invariant attributes using adversarial training. We evaluate our framework on structural prediction tasks, which are among the few tasks available, as a proxy for performance on other downstream tasks. Using VAÇAÍ-Bowl with TOPPing yields an average 54.62% improvement in the dependency parsing task across 10 low-resource varieties.
How does our perception of the world influence the way we talk about it? Psycholinguistic studies have investigated whether visual salience correlates with entity mention and ordering, but often disregarded its effect on grammar or relied on simplistic images or artificial cues. In this study, we explore the use of generative AI to better control for salience in visual stimuli while keeping them realistic, and to serve as a proxy for human participants in studying how different types of salience impact image descriptions. We consider three salience types: *perceptual* (e.g. relative size in the image), *inherent* (e.g. animacy), and *relational* (e.g. human–object interaction). We first analyze human- and AI-generated captions for natural images to examine how salience correlates with how early, and in what grammatical role, an entity is mentioned. We find strong correlations between models and humans in this observational study, justifying the use of AI models alone in a further causal study. For this second study, we created datasets composed of pairs of images, where we used an image-editing model to intervene on the salience of a target entity. We show that relational and perceptual salience lead to the entity being mentioned earlier in captions and being mapped to more prominent grammatical roles. The magnitude of this effect varies across entity types, with animate entities (high inherent salience) showing a particularly distinct pattern.
Vision–language models (VLMs) often fail on abstract visual reasoning benchmarks such as Bongard problems, raising the question of whether the main bottleneck lies in reasoning or representation. We study this on Bongard-LOGO, a synthetic benchmark of abstract concept learning with ground-truth generative programs, by comparing end-to-end VLMs on raw images with large language models (LLMs) given symbolic inputs derived from those images. Using symbolic inputs as a diagnostic probe rather than a practical multimodal architecture, our Componential–Grammatical (C–G) paradigm reformulates Bongard-LOGO as a symbolic reasoning task based on LOGO-style action programs or structured descriptions. LLMs achieve large and consistent gains, reaching mid–90s accuracy on Free-form problems, while a strong visual baseline remains near chance under matched task definitions. Ablations on input format, explicit concept prompts, and minimal visual grounding show that these factors matter much less than the shift from pixels to symbolic structure. These results identify representation as a key bottleneck in abstract visual reasoning and show how symbolic input can serve as a controlled diagnostic upper bound.
Generative AI systems, especially those driven by autoregressive and diffusion-based models, are known to struggle with spatial reasoning. As such, it becomes critical to understand how humans regard those failure modes. In this paper, we examine how humans judge different types of errors in images generated by a text-to-image model. We curated prompts that described common household objects with variance in number, spatial relations, and orientations, and generated a variety of images using each prompt. Humans observed pairs of images generated using the same prompt and answered a set of systematic questions about each image. Survey results showed that incorrect spatial *orientation* regularly emerges as a reason that the generated images do not accurately represent the prompt. We further investigated how RLHF-based multimodal reward models score prompt-image alignment over the same data, and whether they can reliably distinguish the better image in a pairwise setting, as humans do. We find that even though a general cross-task reward model may output alignment scores that accord with those of humans, its reasoning traces are flawed with respect to spatial orientational and relational indicators—the very factors that human annotators rated as the most consequential errors in generated images. Our results show that human annotators regard spatial reasoning errors as highly impactful on the correctness of generated images, and undermine the reliability of multimodal reward model scores as a baseline for evaluating image quality.
This paper compares the empathetic quality of responses generated by humans and large language models (LLMs). We evaluate four LLMs that were widely used at the time of study—GPT-4, LLaMA-2-70B-Chat, Gemini-1.0-Pro, and Mixtral-8×7B-Instruct—against a human baseline using a large-scale between-subjects study. A total of 1,000 human participants evaluated the empathetic quality of human- and LLM-generated responses to 2,000 dialogue prompts spanning 32 positive and negative emotions. To complement human judgments, we also employed an LLM-as-judge (GPT-4o-mini) to assess the same responses. Across emotions and evaluators, LLM-generated responses were rated as significantly more empathetic than human-written responses. We also observed that both human judges and the LLM-as-judge tended to rate responses generated by their own group more favorably, indicating self-favoring tendencies. These findings highlight both the strong performance of contemporary LLMs in empathetic responding and the need to interpret human- and LLM-based evaluations with care.
Adversarial red teaming is a central component of large language model (LLM) safety evaluation. While prior work has cataloged attack types and measured aggregate failure rates, less attention has been paid to the structured decision-making behavior of human attackers in multi-turn interaction. In this work, we model adversarial dialogue as a hierarchical and sequential process. We introduce a structured representation that decomposes red teaming conversations into goals, strategies, and tactics, where strategies capture distinct vulnerability dimensions and tactics operationalize these strategies at the linguistic level. Using 38,961 multi-turn conversations from a large-scale red teaming dataset, we analyze both first-turn strategy effects and multi-turn adaptation dynamics. Causal estimation reveals systematic differences in success rates across strategic categories. Predictive modeling further shows that incorporating structured strategy, tactic, and adaptation features improves AUC from 0.719 to 0.746 over a baseline without structure. Our findings suggest that adversarial effectiveness is not uniform but varies across structured vulnerability dimensions, and that modeling red teaming as sequential strategic interaction provides measurable explanatory and predictive gains.
CHILDES is a paramount resource for language acquisition studies—yet computational tools for analyzing its syntactic structure remain limited. Leveraging the recent release of the UD-English-CHILDES treebank with gold-standard Universal Dependencies (UD) annotations, we train a state-of-the-art dependency parser specifically tailored to CHILDES. The parser more accurately captures syntactic patterns in child–adult interactions, outperforming widely used off-the-shelf English parsers, including SpaCy and Stanza. Alongside the parser, we also release a Part-of-Speech tagger and an utterance-level construction tagger, which together form the open-source Syntactic Annotation Toolkit for Child–Adult InTeractions (CAIT). Through a detailed error analysis and a case study tracking the distribution of syntactic constructions across developmental time in CHILDES, we demonstrate the practical utility of the toolkit for large-scale, reproducible research on language acquisition.
Recent work suggests that transformer language models show a bias towards human languages over unnatural ("impossible") languages argued to be unacquirable by humans. However, this literature has largely based these claims on differences in sample efficiency and test-set perplexity, rather than on direct evaluations of the linguistic capacities that could plausibly explain non-attestation in human languages. We evaluate two theoretically motivated linking hypotheses: impossibility arising from deficiencies in grammatical sensitivity or generative production. Using GPT-2 style models trained on perturbed "impossible" variants of English, we measure sensitivity to grammaticality using BLiMP minimal pairs, finding that model performance exhibits only gradual degradation, mediated by the language’s information locality. In contrast, these models exhibited pronounced failures in generation, producing substantially fewer high-quality sentences at longer lengths. Together, these results suggest generative deficiency and transmission failures as a plausible linking hypothesis between language model behaviour and non-attestation of impossible languages.
A key question in psycholinguistics is how inferences about the meaning of linguistic input unfold incrementally in a comprehender’s mind. In this work, we study reading dynamics for “noisy-channel garden-path” sentences, which temporarily appear well-formed but feature late-appearing violations of expectation that can be resolved not by inferring an alternative syntactic structure, but by inferring the presence of an error. We find evidence for targeted regressions – eye movements towards regions that are promising loci of possible errors in light of later-arriving information, showing patterns consistent with the posterior inferences of a model of noisy-channel processing with reanalysis. We discuss the implications of these findings for theories of noisy-channel language comprehension and information-theoretic explanations of reading dynamics.
Presupposition projection in conditionals is central to theories of meaning and pragmatics, yet it remains largely unevaluated in large language models. We address this gap through a parallel behavioral study comparing human judgments and LLM predictions on a normed dataset of conditional sentences that controls the relation between the antecedent and the projected presupposition. We collect likelihood ratings from 120 participants and four LLMs under matched contextual conditions. Results show that humans integrate probabilistic and pragmatic cues in their judgments, whereas LLMs show variable alignment with human patterns. Using a linguistically motivated checklist within an LLM-as-a-Judge framework, we further evaluate model reasoning. We observe that the models that best match human ratings often lack coherent pragmatic reasoning, while models with stronger reasoning produce less human-like judgments. These findings suggest that LLMs’ performance on such tasks may result from surface pattern matching rather than pragmatic competence. Our findings highlight the importance of benchmarks grounded in linguistic theory for comparing humans and models.
Logical Reasoning is a novel approach to deal with challenging Machine Reading Comprehension tasks by utilizing the ability to construct logical structures in natural language. However, previous promising studies struggle with the accuracy of logical unit division and the consistency of model prediction on equivalent semantics. In this paper, we propose ThinkStruct, a new method that leverages a transformer network enhanced with the information of Rhetorical Structure (RS) relations for logical reasoning. Specifically, our method uses Rhetorical Structure Theory (RST) to split natural language text into Elementary Discourse Units (EDUs) and identify the relationships among these units. Node information is then fed into the fully connected transformer network, which is enhanced with logical relationships among the extracted units via an adjacency matrix. Subsequently, the features of the transformer network are integrated before being passed into the answer prediction module. In addition, we employ a contrastive learning module to improve the model's understanding of the relationships between Elementary Discourse Units. Our experiments on the LogiQA and ReClor datasets demonstrate that our method outperforms other state-of-the-art models.
Linguistic puzzles, wherein the solver must deduce rules of an unfamiliar language purely in-context, represent a uniquely perplexing problem format even for state-of-the-art large language models. Yet by exploring various inference-time scaling methods, we demonstrate that language models’ performance on these problems can be improved without the need for fine-tuning or providing supplementary linguistic context. To this end, this paper introduces the first domain-specific inference-time scaling framework for linguistic puzzles, which we use to improve the performance of three model families, R1 (DeepSeek), Gemini 2.5 Flash (Google), and Llama 3.3 70B Instruct (Meta), on a challenging Linguistics Olympiad-based benchmark by 4.9, 13.1, and 4.9 percentage points, respectively. Nonetheless, even when multiple optimisations are applied, we find that LLMs’ linguistic puzzle performance remains well below comparable mathematical and commonsense benchmarks, and we speculate as to why linguistic reasoning continues to pose a distinctive challenge for even the most capable large language models.
Visual Word Sense Disambiguation (Visual-WSD) requires ranking the correct image for an ambiguous word given a short trigger phrase. For low-resource languages, it is bottlenecked by scarce sense-level benchmarks and limited sense-aligned multimodal supervision. We study Ukrainian and (i) extend the Ukrainian Visual-WSD benchmark from 87 to 381 instances and benchmark multilingual CLIP checkpoints and multimodal large models, and (ii) introduce two scalable Wikipedia-derived dataset construction methods. Using compute-efficient adaptation we fine-tune a multilingual CLIP backbone and show that sense-grounded supervision drives the improvements: combining our two Wikipedia-derived datasets improves HIT@1 from 37.00% to 43.05%.
Multi-domain Dialogue State Tracking (DST) requires discourse coherence that transcends independent slot-filling. Most existing approaches rely on statistical regularities within static schemas, failing to capture the semantic coordination governing simultaneous slot updates. In this paper, we propose Event-DST, which models latent events as cognitive organizing units to dynamically coordinate slot interactions. By projecting dialogue context into a continuous semantic space, our model induces a dynamic structural bias to enforce pragmatic consistency. This structural guidance is integrated via a dual-stream fusion strategy that balances top-down structural constraints with bottom-up textual precision. Experimental results on two benchmarks demonstrate the superiority of our framework, providing an interpretable and parameter-efficient path toward robust dialogue understanding.
Evaluating and optimizing authorial style in long-form story generation is challenging because style judgments often rely on subjective human voting, and there is no stable automatic evaluation method. We propose a two-stage pipeline. First, we train a style-similarity judge by fine-tuning a sentence-transformer with authorship-verification supervision, and calibrate its similarity outputs into a bounded [0,1] reward. Second, we use this judge as the primary reward in Group Relative Policy Optimization (GRPO) to fine-tune an 8B story generator for style-conditioned writing, avoiding the accept/reject supervision required by Direct Preference Optimization (DPO). Across four target authors (Mark Twain, Jane Austen, Charles Dickens, Thomas Hardy), the GRPO-trained 8B model achieves higher style scores than open-weight baselines, with an average style score of 0.893 across authors. These results suggest that AV-calibrated reward modeling provides a practical mechanism for controllable long-form style transfer under moderate model size and training budget.
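For readers unfamiliar with GRPO, the group-relative step can be sketched as follows. This is a minimal illustration and not the authors' training code: the function name and the sample rewards are hypothetical, and the normalisation used here (population standard deviation plus a small epsilon) is one common choice that implementations vary on.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of samples for the same prompt.

    Each sample's advantage is its reward minus the group mean, divided by
    the group standard deviation. `eps` (an assumption) avoids division by
    zero when every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical style rewards in [0, 1] for four stories from one prompt.
example_advantages = group_relative_advantages([0.91, 0.62, 0.75, 0.88])
```

Samples scoring above the group mean receive positive advantages and those below receive negative ones, which is why a bounded scalar reward is sufficient and no accept/reject pairs are needed.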
Recent work has shown that larger language models have better predictive power for eye movement and reading time data. However, we know less about how model capacity relates to human production statistics in the cloze task, which are used to predict reading times as well. While even the best models under-allocate probability mass to human responses, larger models assign higher-quality estimates of next tokens and their likelihood of production in cloze data because they are less sensitive to lexical co-occurrence statistics while being better aligned semantically to human cloze responses. The results provide support for the claim that the greater memorization capacity of larger models helps them guess more semantically appropriate words, but makes them less sensitive to low-level information that is relevant for word recognition.
Recognizing and navigating client resistance is critical for effective mental health counseling, yet its detection remains particularly challenging in text-based interactions. Existing NLP approaches oversimplify resistance categories, ignore the sequential dynamics of therapeutic interventions, and offer limited interpretability. To address these limitations, we propose PsyFIRE, a theoretically grounded framework capturing 13 fine-grained resistance behaviors alongside collaborative interactions. Based on PsyFIRE, we construct the ClientResistance corpus with 23,930 annotated utterances from real-world Chinese text-based counseling, each supported by context-specific rationales. Leveraging this dataset, we develop RECAP, a two-stage framework that detects resistance and fine-grained resistance types with explanations. RECAP achieves 91.25% F1 for distinguishing collaboration and resistance and 66.58% macro-F1 for fine-grained resistance categories classification, outperforming leading prompt-based LLM baselines by over 20 points. Expert evaluations confirm that the generated explanations are highly faithful and reliable. Applied to a separate counseling dataset and a pilot study with 62 counselors, RECAP reveals the prevalence of resistance, its negative impact on therapeutic relationships, and its potential to improve counselors’ understanding and intervention strategies.
Understanding how neural models represent human-interpretable concepts is challenging. Prior work has explored linear concept subspaces from diverse perspectives, such as probing and concept erasure. We introduce a unified framework to study these subspaces along two axes: containment, which tests if a concept is fully represented in a subspace but not outside it, and disentanglement, which tests for isolation from other concepts. In experiments on both text and speech models, we first highlight that concept subspaces may not be uniquely determined, and discuss the implications for concept subspace analysis. Then, we compare properties of concept subspaces estimated using five estimators, proposed in different communities. We find that (1) the choice of estimator impacts the containment and disentanglement properties; (2) the state-of-the-art concept erasure method, LEACE, performs well on both testing axes, but still struggles to generalize to unseen data; and (3) in HuBERT speech representations, phone information is both contained and disentangled from speaker information, while speaker information is hard to contain in a compact subspace, despite being disentangled from phones.
Among English child speech corpora, very few focus on oral reading. Existing resources such as the CMU Kids Corpus (Ellis Weismer et al., 2013) are limited by a lack of grade-appropriate, curriculum-aligned reading texts, by their annotation scope and quality, and, most crucially, by the lack of a comprehensive annotation scheme for characterizing children’s reading errors. This study presents a multi-layered, fully manually annotated corpus of oral reading from 63 1st-3rd grade students residing in the U.S. who grow up hearing and speaking English. Additionally, we contribute methodologically rigorous annotation guidelines that define 10 reading error categories and 26 sublevel error labels. Using a digital reading platform supported by GPT-4o-mini (OpenAI, 2024), children read stories on topics of their own interest, while the system records their speech and logs their interactions with embedded digital supports. Each recording is paired with detailed demographic and educational metadata and subjected to linguistic annotations, including: (1) sentence- and word-level time alignment; (2) phonemic transcription; (3) reading errors.
Combinatory Categorial Grammar (CCG), a lexicalized formalism known for its flexible constituency, is well-suited for modeling head-final languages with flexible word order like Turkish. Building on Kuzgun et al. (2023), we first develop a Turkish CCG lexicon by automatically inducing categories from a dependency treebank. By leveraging standard and extended operations tailored to Turkish syntax, our parser achieves a robust coverage of 92.5%. Furthermore, we introduce the first (partially) incremental, left-to-right CCG parser for Turkish, designed to facilitate the immediate integration of words into the evolving representation. Finally, we present an example experiment showing that CCG parsers can model psycholinguistic evidence for extra processing costs associated with arguments in noncanonical positions, via the frequency of order-reversing operations. These findings provide evidence that CCG offers a cognitively plausible framework for modeling real-time processing in languages like Turkish.
Construction Grammar (CxG) knowledge in language models has been extensively studied for English, but remains underexplored in other languages. In Mandarin Chinese, the ba (把, disposal) and bei (被, passive) constructions are widely used for managing information structure. They foreground topical elements (information structure) and encode systematic form-meaning mappings (CxG), particularly with respect to the semantic role of the object. We probe language models’ linguistic competence with these constructions using minimal pairs, constructing a new minimal-pair dataset comprising seven paradigms that target both syntactic constraints and verb–construction compatibility. Our results show that it remains a challenge for many models to capture the form-meaning mappings underlying the ba construction, although they achieve high accuracy on paradigms driven by surface syntactic cues.
Language models (LMs) exhibit human-like behavior across linguistic tasks, yet behavioral similarity does not establish mechanistic correspondence. Animacy — whether an entity is alive and sentient — is a well-documented semantic feature shaping linguistic behavior in humans. Although LMs show animacy sensitivity behaviorally, the mechanistic basis remains unexplored. In this study, we probe GPT-2 Small’s internal circuitry to test whether animacy representations causally drive syntactic structure choice. Activation patching confirms causality: swapping animacy representations in the model shifts its downstream output. Critically, bidirectional patching reveals that animacy conditions differ in how strongly they commit to a structure: some animacy configurations resist perturbation and exert strong causal influence, while others remain flexible. We identify 22 attention heads mediating these effects, split between passive-promoting and passive-suppressing populations, suggesting GPT-2 Small’s structure choice likely emerges from internal competition between opposing heads. These findings provide mechanistic grounding for animacy effects documented in extensive psycholinguistics research and demonstrate how interpretability methods can enrich and test psycholinguistic theory.
Lexical gaps are words that do not exist in certain languages. They pose challenges for building multilingual lexical resources, for machine translation, and for cross-lingual transfer. Existing lexical gap detection relies on human judgments or fixed conceptual taxonomies. We propose a data-driven framework for identifying cross-lingual lexical gaps. We extracted contextualized embeddings from Korean-English bilingual LLMs for Korean-to-English and English-to-Korean translation pairs. Combinations of LLMs, embedding types, dimensionality, and orthogonal transformations across 100 train-test splits yielded 4000 distinct embedding spaces in each source language. In each space, we computed the semantic similarity between each source word and its nearest neighbor in the target language, and compared their distribution for gap words versus non-gap words. In 94% (Korean-to-English) and 97% (English-to-Korean) of embedding spaces, gap words showed weaker cross-lingual semantic alignment than non-gap words. Logistic classifiers trained on unaligned embedding spaces can reliably separate gap words from non-gap words, achieving AUCs of 0.81 (Korean-to-English) and 0.76 (English-to-Korean) and retrieving 18/19 Korean and 26/27 English gap words. This approach provides a language-agnostic and taxonomy-free method for scalable lexical gap identification.
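The nearest-neighbor alignment score described above can be illustrated with a short sketch. It assumes cosine similarity as the similarity measure and uses toy two-dimensional vectors; the function names and data are hypothetical and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbor_similarity(source_vec, target_vecs):
    """Similarity between a source word and its closest target-language word."""
    return max(cosine(source_vec, t) for t in target_vecs)


# Toy embeddings (hypothetical): the first source word has a close target
# neighbor, while the second has only distant ones, as a gap word might.
target_space = [[1.0, 0.0], [0.8, 0.6]]
well_aligned = nearest_neighbor_similarity([1.0, 0.1], target_space)
poorly_aligned = nearest_neighbor_similarity([-0.2, 1.0], target_space)
```

Under this reading, a gap word would tend to show a lower score than a non-gap word, and the distributions of such scores could then be compared across the two groups.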
Previous work has found that ordering training data by children’s Age of Acquisition (AoA) for words increases the stability of distributional word embeddings, suggesting that early-learned words play a privileged role in shaping semantic structure. In this study, we determine whether AoA itself drives these effects, or whether they emerge from correlated lexical factors such as frequency, concreteness, and phonological complexity. Using incremental Word2Vec training, we construct curricula ordered by AoA and by individual lexical features, while systematically controlling for vocabulary growth and deterministic ordering effects. We show that AoA-ordered curricula produce greater early-phase stability than shuffled baselines, even under controlled exposure conditions. We find that the advantage observed with AoA can be largely explained by correlated factors like overall word frequency. Despite limited gains on general similarity benchmarks, AoA-ordered embeddings outperform shuffled embeddings on a proxy domain-specific task: predicting human AoA norms. This advantage persists after debiasing timestamp effects, implying that AoA curricula induce developmentally meaningful semantic structure.
Logical Table-to-Text (LT2T) generation aims to produce natural-language sentences that are logically faithful to structured tabular data. While recent Large Language Models (LLMs) show high performance on aggregate fidelity metrics, these scores provide only a coarse view of performance, obscuring specific logic-type reasoning failures and models’ meta-logical awareness. We propose an operation-aware diagnostic framework that evaluates four core competencies: (1) Logical Form (LF) execution accuracy, (2) fidelity of LF-conditioned generation, (3) logic-type identification, and (4) LF-free generation. We apply this framework to a suite of frontier LLMs and perform fine-grained analysis across logic types such as aggregation, ordinal, and superlative reasoning. Our results show that LT2T fidelity assessment can be unstable; the choice of verifier and logic type can substantially alter conclusions and model rankings. Crucially, we identify a meta-logical gap: models often generate faithful statements while failing to identify the underlying operation.
Children’s acquisition of filler-gap dependencies has been argued by some to depend on innate grammatical knowledge, while others suggest that the distributional evidence available in child-directed speech suffices. Unfortunately, the relevant input is difficult to quantify at scale with fine granularity, making this question difficult to resolve. We present a system that identifies three core filler-gap constructions in spoken English corpora – matrix wh-questions, embedded wh-questions, and relative clauses – and further identifies the extraction site (i.e., subject vs. object vs. adjunct). Our approach combines constituency and dependency parsing, leveraging their complementary strengths for construction classification and extraction site identification. We validate the system on human-annotated data and find that it scores well across most categories. Applying the system to 57 English CHILDES corpora, we are able to characterize children’s filler-gap input and their filler-gap production trajectories over the course of development, including construction-specific frequencies and extraction-site asymmetries. The resulting fine-grained labels enable future work in both acquisition and computational studies, which we demonstrate with a case study using filtered corpus training with language models.
Reinforcement Learning with Human Feedback (RLHF) is a common post-training procedure to align the outputs of Large Language Models (LLMs) with human preferences. As a result, one might expect RLHF to induce some elements of human-like audience design into LLMs. However, RLHF and other post-training alignment methods have many complex effects on the outputs of LLMs that have yet to be studied quantitatively. We apply an information-theoretic lens to investigate the changes in the "naturalness" of language and the presence of audience design in LLMs before and after post-training. The Uniform Information Density (UID) Hypothesis posits that humans optimize language production and comprehension across a noisy channel by transferring information at a more uniform rate. Accordingly, we analyze and compare how information is distributed within model- and human-generated text from different domains. We find that pretrained and post-trained LLMs both show superhuman uniformity across various text domains, and both RLHF and other post-training methods reduce uniformity slightly from their pretrained counterparts. However, RLHF uniquely encourages lower variance in uniformity between documents, potentially demonstrating that training on human preferences encourages consistency in information flow.
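One common way to operationalise uniformity of information is the variance of per-token surprisal, where lower variance means a more uniform information rate. The sketch below assumes that operationalisation, which may differ from the measure used in the paper; the function names and probabilities are hypothetical.

```python
import math
from statistics import pvariance


def surprisals(token_probs):
    """Per-token surprisal in bits, given each token's model probability."""
    return [-math.log2(p) for p in token_probs]


def surprisal_variance(token_probs):
    """Variance of surprisal across a text; 0.0 means perfectly uniform."""
    return pvariance(surprisals(token_probs))


# Hypothetical token probabilities for two short texts.
uniform_text = surprisal_variance([0.5, 0.5, 0.5])   # every token carries 1 bit
uneven_text = surprisal_variance([0.5, 0.125])       # 1 bit, then 3 bits
```

Computing this score per document and then taking the variance of the scores across documents would give one possible reading of "variance in uniformity between documents".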
Understanding how language models compose meaning from linguistic input remains a central problem in interpretability research. Mechanistic studies have attributed functional roles to core transformer components; however, these findings derive largely from factual retrieval settings. Whether the same mechanisms support conceptual interpretation, the compositional mapping from definitional expressions to abstract meaning, remains insufficiently characterised. We introduce DSRA (Definitional Semantic Role Analysis), a methodology that applies causal tracing within the reverse dictionary task and augments restoration traces with definitional semantic roles (DSRs) grounded in Argument Structure Theory. This linguistic overlay identifies which compositional functions (e.g., genus, differentia quality) are associated with high-recovery states, extending activation patching beyond token-level localisation. Applied to GPT-J-6B (English) and BERTIN GPT-J-6B (Spanish), the results show that MLP layers associate content-bearing tokens with high-specificity DSR categories in early layers, MHA layers distribute integration across middle-to-upper layers with concentration at the final token, and hidden states aggregate information in upper layers. Alignment between restored states and DSR categories indicates systematic correspondence between internal activations and definitional structure, with consistent localisation patterns across both languages.
The False Belief Test (FBT) has been the main method for assessing Theory of Mind (ToM) and related socio-cognitive competencies. For Large Language Models (LLMs), the reliability and explanatory potential of this test have remained limited due to issues like data contamination, insufficient model details, and inconsistent controls. We address these issues by testing 17 open-weight models on a balanced set of 192 FBT variants (Trott et al., 2023) using Bayesian logistic regression to identify how model size and post-training affect socio-cognitive competence. We find that scaling model size benefits performance, but not strictly. A cross-over effect reveals that explicating propositional attitudes (X *thinks*) fundamentally alters response patterns. Instruction tuning partially mitigates this effect, but further reasoning-oriented fine-tuning amplifies it. In a case study analysing social reasoning ability throughout OLMo 2 training, we show that this cross-over effect emerges during pre-training, suggesting that models acquire stereotypical response patterns tied to mental-state vocabulary that can outweigh other scenario semantics. Finally, vector steering allows us to isolate a *think* vector as the causal driver of observed FBT behaviour.
Grammatical theories which specify grammars by means of symbolic well-formedness constraints (e.g., Context Free Grammars, HPSG, LFG, Minimalism, Dependency Grammars, etc.) are ill-suited to model the (semantically and statistically) gradual character of grammatical change as it manifests in successive historical corpora. Grammatical theories which claim that the language system is subject to change based on what speakers do in life (i.e., usage-based accounts) are better-suited to handle such phenomena. Nevertheless, current usage-based theories (e.g., Cognitive Grammar, Construction Grammar) lack a clearly formalized model that specifies how usage can affect the grammatical system. In this paper, we describe Stretched Tree Metric Grammars (STMGs), a new formal model of syntax and semantics that exhibits usage-based effects. We show that the model can generate and parse simple sentences. Then we show how it supports morphological innovation in appropriately limited circumstances. We conclude by noting that STMGs are closely related to Large Language Models (LLMs), but they have the benefit of being more analytically interpretable.
Large Language Models (LLMs) are increasingly used for Automated Essay Scoring (AES), yet the scoring rubrics they rely on are typically designed for human raters and may not be optimal for LLMs. Inspired by the calibration process that human raters undergo before formal scoring, we propose Reflect-and-Revise, an iterative framework that refines scoring rubrics by prompting models to reflect on their own chain-of-thought rationales and score discrepancies with human labels. At each iteration, the model identifies scoring-error patterns from sampled mismatches and revises the rubric accordingly. Experiments on three essay scoring benchmarks (ASAP, ASAP 2.0, and TOEFL11) with three LLMs (GPT-5 mini, Gemini 3 Flash, and Qwen3-Next-80B-A3B-Instruct) demonstrate that our method yields improvements in Quadratic Weighted Kappa (QWK), achieving gains of up to +0.403 over human-authored rubrics. Starting from a minimal seed rubric that specifies only the score scale, our method matches or exceeds expert rubric performance in most dataset-model combinations, indicating that iterative refinement can reduce the manual effort of rubric authoring. Analysis of the refined rubrics reveals that the refinement process introduces explicit procedural structures, such as conditional gating rules and quantitative thresholds, that are absent from human-authored rubrics, highlighting a gap between rubrics designed for human raters and those effective for LLMs.
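Quadratic Weighted Kappa, the agreement metric reported above, can be computed as in the following sketch. It is a plain-Python illustration of the standard definition rather than the authors' evaluation code; the function name, argument names, and the handling of degenerate cases are assumptions.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Agreement between two integer score lists on a fixed rating scale.

    Returns 1.0 for perfect agreement, about 0.0 for chance-level agreement,
    and negative values for systematic disagreement.
    """
    n = max_rating - min_rating + 1
    total = len(rater_a)

    # Observed confusion matrix between the two raters.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    # Marginal histograms, used to build the chance-expected matrix.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            # Quadratic penalty grows with the squared score distance.
            weight = (i - j) ** 2 / (n - 1) ** 2 if n > 1 else 0.0
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * hist_a[i] * hist_b[j] / total

    if weighted_expected == 0.0:
        # Assumption: treat the degenerate single-category case as agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

For instance, comparing model scores with human scores on a 1 to 6 scale would call `quadratic_weighted_kappa(model_scores, human_scores, 1, 6)`, where both lists are hypothetical inputs of equal length.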