Brian W. Dillon

Also published as: Brian Dillon


2026

Suspended affixation (SA) allows a suffix on one conjunct to scope over all coordinated elements. While inflectional SA is productive in Turkish, derivational SA is claimed to be highly restricted; yet speakers readily accept certain cases. We propose that this gradient acceptability reflects a frequency-modulated choice between two possible syntactic representations: base-generation, which licenses derivational SA, and ellipsis. To test this, we conducted a rating task on the acceptability of four derivational suffixes in SA form while manipulating the frequency of coordinations. Using a Multinomial Processing Tree model to isolate latent structural choices from surface ratings, we found that frequency modulated SA acceptability for some suffixes (i.e., sIz ‘-less’ and cI ‘-maker’), but not others (i.e., lI ‘-having’ and lIk ‘-for’). These findings suggest that frequency shapes syntactic parsing in morphologically complex environments.
Many gradable properties have been found to be encoded as axes in embedding space. Most commonly, property axes are computed using seed words, but recent work has noted limitations to seed-based axes. Here, we present a novel methodology for computing property axes that is based on human ratings and does not require seeds. We apply this methodology to a particular problem at the syntax-semantics interface: which semantic properties of intransitive verbs affect their likelihood of occurring in one of two syntactic structures, unergative and unaccusative. Comparing property axes that encode different semantic dimensions of the concept of agentivity, we find that properties like movement and being alive are better predictors of the syntactic behavior of intransitives than goal-directedness or intentionality. We discuss the potential of rating-based axes for future work in semantics and at the syntax-semantics interface.
How well large language models (LLMs) capture the grammatical structure of low-resource languages remains underexplored. This paper presents the Urdu Benchmark of Linguistic Minimal Pairs (UrBLiMP), a diagnostic suite of 5,696 minimal pairs that contrast grammatical acceptability across ten core syntactic and morpho-syntactic phenomena in Urdu. The dataset is constructed from the Urdu Treebank and diverse text corpora, and human validation achieves 96.1% inter-annotator agreement, confirming its reliability. We evaluate twenty-one multilingual LLMs, including LLaMA-3-70B and Gemma-3-27B-PT, and additionally assess the proprietary GPT-4o model using grammar-prompting techniques. GPT-4o (grammar-prompted) attains the highest average accuracy (97.4%), reaching near-human performance on regular phenomena such as aspect agreement and ergativity. However, all models continue to struggle with flexible syntactic patterns like word-order variation and long-distance subject–verb agreement. UrBLiMP provides the first controlled evaluation framework for probing morpho-syntactic competence in Urdu and highlights both the progress and remaining challenges of multilingual and proprietary LLMs in low-resource settings.
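As an illustration of how a minimal-pair benchmark is typically scored, the sketch below counts a pair as correct when the model scores the grammatical sentence above its ungrammatical counterpart. This convention is an assumption here; the abstract does not state the scoring rule, and the function name `minimal_pair_accuracy` and the toy log-probabilities are hypothetical.

```python
from typing import Sequence, Tuple


def minimal_pair_accuracy(pairs: Sequence[Tuple[float, float]]) -> float:
    """Return the fraction of pairs where the grammatical sentence wins.

    Each pair is (score_grammatical, score_ungrammatical), e.g. sentence
    log-probabilities from a language model. Ties count as incorrect.
    """
    if not pairs:
        raise ValueError("at least one minimal pair is required")
    correct = sum(1 for good, bad in pairs if good > bad)
    return correct / len(pairs)


# Toy scores (hypothetical): the model prefers the grammatical
# sentence in three of the four pairs.
toy_pairs = [(-10.0, -12.0), (-8.0, -7.5), (-5.0, -9.0), (-3.0, -3.5)]
accuracy = minimal_pair_accuracy(toy_pairs)
```

Per-phenomenon accuracy can be obtained by calling the same function on the subset of pairs belonging to each phenomenon.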
Combinatory Categorial Grammar (CCG), a lexicalized formalism known for its flexible constituency, is well-suited for modeling head-final languages with flexible word order like Turkish. Building on Kuzgun et al. (2023), we first develop a Turkish CCG lexicon by automatically inducing categories from a dependency treebank. By leveraging standard and extended operations tailored to Turkish syntax, our parser achieves a robust coverage of 92.5%. Furthermore, we introduce the first (partially) incremental, left-to-right CCG parser for Turkish, designed to facilitate the immediate integration of words into the evolving representation. Finally, we present an example experiment showing that CCG parsers can model psycholinguistic evidence for extra processing costs associated with arguments in noncanonical positions, via the frequency of order-reversing operations. These findings provide evidence that CCG offers a cognitively plausible framework for modeling real-time processing in languages like Turkish.
There is a growing consensus that, in order to serve as models of human language processing, language models (LMs) need to be constrained in their use of memory for context, the analogue to human working memory (WM). Here we take a novel yet simple approach to constraining WM in language models, in a way that reflects models of human cognition where memory is treated as a limited resource and deployed strategically. In order to capture this constraint on memory encoding, we inject noise into the hidden representations of Transformer-based LMs at tunable rates. Then we train the models with a hybrid objective, such that they learn to maximize the performance of next-word prediction subject to explicit constraints on the total encoding precision. We find that explicit WM constraints improve the model’s alignment with human reading times. More importantly, we find that the need to manage encoding precision reshapes the nature of the models’ context representations, making them more compressed and categorical. Our results show how resource-rational models of WM allocation can be implemented in neural models simply and successfully, and point to a dissociation between WM retrieval mechanisms and the underlying memory representations in models of human sentence processing.
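The noise-injection idea can be pictured with a minimal sketch, assuming additive zero-mean Gaussian noise applied to a single hidden vector. The abstract does not specify the noise distribution or where in the network it is applied, so the function name `add_gaussian_noise` and the `noise_std` parameter are hypothetical, and plain Python lists stand in for model tensors.

```python
import random
from typing import List, Optional


def add_gaussian_noise(
    hidden: List[float],
    noise_std: float,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return a copy of `hidden` with zero-mean Gaussian noise added.

    `noise_std` plays the role of a tunable noise rate: larger values
    mean a less precise encoding of the context representation.
    """
    if noise_std < 0:
        raise ValueError("noise_std must be non-negative")
    if rng is None:
        rng = random.Random()
    return [h + rng.gauss(0.0, noise_std) for h in hidden]


# With noise_std = 0 the representation is passed through unchanged;
# with noise_std > 0 each dimension is perturbed independently.
clean = add_gaussian_noise([0.5, -1.0, 2.0], 0.0, random.Random(0))
noisy = add_gaussian_noise([0.5, -1.0, 2.0], 0.3, random.Random(0))
```

In a trained model the noise level would be balanced against prediction accuracy by the training objective; this sketch covers only the perturbation step.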

2025

2022

Recent progress in large pretrained language models (LMs) has led to a growth of analyses examining what kinds of linguistic knowledge are encoded by these models. Due to computational constraints, existing analyses are mostly conducted on publicly-released LM checkpoints, which makes it difficult to study how various factors during training affect the models’ acquisition of linguistic knowledge. In this paper, we train a suite of small-scale Transformer LMs that differ from each other with respect to architectural decisions (e.g., self-attention configuration) or training objectives (e.g., multi-tasking, focal loss). We evaluate these LMs on BLiMP, a targeted evaluation benchmark of multiple English linguistic phenomena. Our experiments show that while none of these modifications yields significant improvements on aggregate, changes to the loss function result in promising improvements on several subcategories (e.g., detecting adjunct islands, correctly scoping negative polarity items). We hope our work offers useful insights for future research into designing Transformer LMs that more effectively learn linguistic knowledge.
Humans exhibit garden path effects: When reading sentences that are temporarily structurally ambiguous, they slow down when the structure is disambiguated in favor of the less preferred alternative. Surprisal theory (Hale, 2001; Levy, 2008), a prominent explanation of this finding, proposes that these slowdowns are due to the unpredictability of each of the words that occur in these sentences. Challenging this hypothesis, van Schijndel and Linzen (2021) find that estimates of the cost of word predictability derived from language models severely underestimate the magnitude of human garden path effects. In this work, we consider whether this underestimation is due to the fact that humans weight syntactic factors in their predictions more highly than language models do. We propose a method for estimating syntactic predictability from a language model, allowing us to weigh the cost of lexical and syntactic predictability independently. We find that treating syntactic predictability independently from lexical predictability indeed results in larger estimates of garden path effects. At the same time, even when syntactic predictability is independently weighted, surprisal still greatly underestimates the magnitude of human garden path effects. Our results support the hypothesis that predictability is not the only factor responsible for the processing cost associated with garden path sentences.
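For readers unfamiliar with the quantity involved, the surprisal of a word is the negative log probability of that word given its context. The sketch below computes it in bits and then combines lexical and syntactic surprisal with separate weights. The linear combination, the function `weighted_cost`, and the weights `w_lex` and `w_syn` are assumptions made for illustration, not the paper's estimation method.

```python
import math


def surprisal(prob: float) -> float:
    """Surprisal in bits of an event with probability `prob`."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def weighted_cost(
    lexical_prob: float,
    syntactic_prob: float,
    w_lex: float = 1.0,
    w_syn: float = 1.0,
) -> float:
    """Hypothetical processing cost with independently weighted terms."""
    return w_lex * surprisal(lexical_prob) + w_syn * surprisal(syntactic_prob)


# A word with probability 0.5 carries 1 bit of surprisal; a structure
# with probability 0.25 carries 2 bits. Doubling the syntactic weight
# raises the predicted cost from 3 to 5.
equal_weights = weighted_cost(0.5, 0.25)
syntax_heavy = weighted_cost(0.5, 0.25, w_lex=1.0, w_syn=2.0)
```

Giving the syntactic term its own weight is what lets the predicted cost at a disambiguating word grow without changing the lexical estimate.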

2019

2018

Sequence-to-sequence (seq2seq) models are often employed in settings where the target output is natural language. However, the syntactic properties of the language generated by these models are not well understood. We explore whether such output belongs to a formal and realistic grammar by employing the English Resource Grammar (ERG), a broad-coverage, linguistically precise HPSG-based grammar of English. Using a French-to-English parallel corpus, we analyze the parseability of, and the grammatical constructions occurring in, output from a seq2seq translation model. Over 93% of the model translations are parseable, suggesting that the model learns to generate output that conforms to a grammar. The model has trouble learning the distribution of rarer syntactic rules, and we pinpoint several constructions that differentiate the reference translations from our model’s output.