Society for Computation in Linguistics (2026)
Proceedings of the Society for Computation in Linguistics 2026
Rob Voigt | Alex Warstadt | Naomi Feldman | Tal Linzen
Measuring Perceptions of Personhood with Semantic Proto-role Properties
Elizabeth Spaulding Hoefer | James Martin
We show that semantic proto-role properties can be used as a tool to measure implicit human perceptions of agency and patiency of entities in human-generated text. First, we demonstrate that silver-generated semantic proto-role property labels are strongly correlated with both human judgment and a probabilistic text-based measure of anthropomorphism. Then, we use our measure to quantify linguistic idiosyncrasies across different AI-related Reddit communities. Our measure shows that subreddits dedicated to discussing AI companionship ascribe higher sentience to "bots" and higher agency to "companies" when compared to other subreddits. This phenomenon reveals not only the unique way in which chatbots are anthropomorphized in such subreddits, but also the users’ keen awareness of their power imbalance with the companies that created the chatbots.
Given a listener’s native language, some non-native contrasts may be harder to discriminate than others. The computation required to mimic this variable difficulty is not yet known. The present work approaches this question by training small supervised feedforward neural networks to perform Spanish vowel classification and then evaluating model classification of Catalan vowels, thereby approximating Spanish listeners’ cross-linguistic perception of Catalan. Vowels were extracted from Spanish and Catalan audio corpora. Ultimately, Spanish models exhibited the expected misperception of Catalan’s /e/-/ɛ/, /o/-/ɔ/, and /ɛ/-/a/ contrasts: Spanish-dominant listeners have difficulty perceiving these contrasts, and Spanish models classified Catalan /ɛ/ as /e/ or /a/, and Catalan /ɔ/ as /o/. This demonstrates that small supervised neural models are capable of making specific, cross-linguistic perceptual predictions given realistic input.
A Family of Effective Methods for Decompiling Canonical Acceptors, Instantiated for Languages of Dot-Depth One and Tier-Based Extensions
Dakotah Lambert
Many kinds of logical systems have been employed in constructing formal languages to model phonological phenomena. A common theme among them is that the systems compile into finite automata. Two questions naturally arise. Can a given phenomenon be described with another logical system? And, if so, what is that description? To the first question, algebraic techniques are well established through deep connections with logic and automata. To the second, the situation is less clear. Translations from automata are established for first-order and monadic second-order logics under precedence, but these may not translate easily to the simpler systems we often use. Translations for simple cases of restricted propositional logic (strictly local or strictly piecewise languages) are established, but insufficient to describe attested phenomena. The present work establishes a general way to handle many systems in between. Specifically, we show how to translate between certain kinds of algebraic varieties 𝐕 (systems defined by universally satisfied identities) and associated logical systems, then use decomposition to handle classes of the form 𝐕∗𝐃, where the notion of “symbol” is replaced by “k-block”. With this, we handle several (unrestricted) propositional logics, facilitating logical description of natural language.
Word2Vec’s effectiveness at generating semantic embeddings has been widely validated, yet it has been tested almost exclusively on languages with large vocabulary inventories. This study examines whether Word2Vec can successfully capture semantic relationships within an extremely reduced vocabulary using data from Toki Pona, a constructed language with approximately 130 words. We sourced 1.4 million sentences (7.95 million tokens) from the Toki Pona community for training. Approximately 23% of sentences in the corpus contain non-Toki Pona tokens such as named entities, loanwords, and neologisms. To investigate whether this linguistic noise enhances or hinders performance—a topic rarely addressed in word embedding literature—we trained two distinct models: one retaining these incidental tokens and another filtering them out completely. Evaluation was conducted using quantitative methods measuring word proximity to semantic category centroids, automated silhouette scores via agglomerative clustering, and qualitative analysis utilizing representational similarity matrices compared against English. The results indicate that while sparse, non-core tokens do not affect the relative structure of the learned embeddings, they actually draw similar words closer together in the vector space. Importantly, Word2Vec’s effectiveness depends more on distributional patterns than lexicon size even at this extreme lower bound.
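One of the quantitative evaluations mentioned above, word proximity to semantic category centroids, can be illustrated with a minimal Python sketch. The two-dimensional vectors are invented for illustration and are not embeddings from the paper. The helper names (`cosine`, `centroid`) and the choice of cosine similarity as the proximity measure are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Toy embeddings (made-up numbers) for four Toki Pona words.
EMBEDDINGS = {
    "soweli": [1.0, 0.1],  # land animal
    "waso":   [0.9, 0.2],  # bird
    "kala":   [0.8, 0.1],  # fish
    "moku":   [0.1, 1.0],  # food / to eat
}

# Hypothetical "animal" category and its centroid.
ANIMALS = ["soweli", "waso", "kala"]
animal_centroid = centroid([EMBEDDINGS[w] for w in ANIMALS])

# Similarity of every word to the animal centroid.
proximity = {w: cosine(vec, animal_centroid) for w, vec in EMBEDDINGS.items()}
for word, score in sorted(proximity.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {score:.3f}")
```

In this toy setup the three category members score higher against their own centroid than the out-of-category word does, which is the kind of contrast such a measure is meant to expose.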
Learning Latent Representations with Progressive Hypothesis Space Expansion
Jonathan Charles Paramore
This paper introduces a learning model to address the computational challenges arising from including highly abstract underlying representations (URs) in morphophonemic learning. The proposed learner structures the UR hypothesis space by disparity distance and considers potential URs in batches, beginning with fully concrete URs, only expanding the UR candidate space if the current set of UR candidates fails to meet a predetermined likelihood threshold. When expanding the UR candidate set, the learner uses markedness constraint weights and violation profiles to identify features that are potentially mis-specified underlyingly, limiting the generation of new URs to changes of those feature values. Overall, the learner inherently restricts abstraction to cases where introducing it demonstrably improves likelihood, while avoiding issues associated with the exhaustive search of an unbounded hypothesis space. Applied to a Pakistani Punjabi vowel nasality pattern, the model is shown to successfully acquire abstract URs for phonological patterns that parallel learners fail to capture.
What Exactly do Children Receive in Language Acquisition? A Case Study on CHILDES with Automated Detection of Filler-Gap Dependencies
Zhenghao Zhou | William Dai | Maya Viswanathan | Simon Charlow | R. Thomas McCoy | Robert Frank
Children’s acquisition of filler-gap dependencies has been argued by some to depend on innate grammatical knowledge, while others suggest that the distributional evidence available in child-directed speech suffices. Unfortunately, the relevant input is difficult to quantify at scale with fine granularity, making this question difficult to resolve. We present a system that identifies three core filler-gap constructions in spoken English corpora – matrix wh-questions, embedded wh-questions, and relative clauses – and further identifies the extraction site (i.e., subject vs. object vs. adjunct). Our approach combines constituency and dependency parsing, leveraging their complementary strengths for construction classification and extraction site identification. We validate the system on human-annotated data and find that it scores well across most categories. Applying the system to 57 English CHILDES corpora, we are able to characterize children’s filler-gap input and their filler-gap production trajectories over the course of development, including construction-specific frequencies and extraction-site asymmetries. The resulting fine-grained labels enable future work in both acquisition and computational studies, which we demonstrate with a case study using filtered corpus training with language models.
Determinants of Hesitations and Repetitions in Hindi Spontaneous Speech
Eashani Sharma | Ishita Arun | Samar Husain
This study investigates the factors that predict disfluencies in Hindi spontaneous speech. In particular, we probe the influence of lexical, syntactic, phonological, and prosodic factors on two kinds of disfluencies, namely, hesitations and repetitions. These disfluencies are probed through both the nature of linguistic factors and the source (preceding vs. following word) of these factors. Our results show that hesitations and repetitions pattern differently during spontaneous speech. Hesitations increase due to lexical, syntactic, as well as articulatory features from both preceding and following words. On the other hand, repetitions arise mainly due to lexical and articulatory factors of the upcoming word. Further, while previous research (e.g., Bell et al., 2009; Dammalapati et al., 2021) on English highlights the importance of upcoming difficulty on disfluencies, our results suggest that previously encountered difficulties can also lead to an increase in disfluencies. This suggests that language typology (SVO vs. SOV) can play a critical role in determining the planning process and thereby affecting the distribution of disfluencies in a language. Together, these findings highlight the need for more cross-linguistic research to understand the nature of incrementality and monitoring of the production system.
An LLM Investigation into Inherent and Structural Case Representation: a German Case Study
Iona Carslaw | András Bárány | Itamar Kastner | Mark Steedman
A question for computational linguistics has been to what degree language models encode case information. However, the majority of the work has focused on structural cases (cases which change when the syntactic configuration changes). On the other hand, inherent cases (which are assigned by specific lexical items and do not change if the syntactic configuration changes) have been overlooked. This paper sets out to investigate whether German language models distinctly encode inherent dative from structural accusative and nominative. We conducted a linguistic probing investigation where probes are trained on contextual word embeddings of active nominative, accusative, and dative arguments to predict if passivised datives are analysed as a structural nominative. We provide a cased and caseless version of the experiment. Our results suggest that when case information is removed language models can distinguish between inherent dative and structural accusative, regardless of argument position, due to verb information. However, language models cannot distinguish between structural nominative and inherent dative when the dative appears in a position where there is an expected nominative, due to over-relying on surface patterns.
Autosegmental approaches to Arabic root-and-pattern morphology generally take a three-tier approach, with tiers corresponding to the prosodic template, consonantal root, and affixes (e.g., McCarthy 1981); association between these tiers proceeds from left to right. However, Jardine (2017) shows that left-to-right association exceeds regular computation for autosegmental representations of arbitrary length, challenging the cognitive plausibility of this approach. This paper demonstrates that in the case of Arabic morphology, the constraints of the system itself (in particular, the finite length of the consonantal root) allow such a left-to-right autosegmental association to be definable not only with Monadic Second Order (MSO) logic, but also with First Order logic. This paper introduces a logical relational structure formalizing the three-tier autosegmental representations and defines a set of transductions which apply in parallel over these structures to yield well-formed root and affix associations.
Comonadic Morphophonology: A Compositional Framework for Context-Dependent Morphological Rules in Finnish
Yongseok Jang
Composing finite-state transducers (FSTs) for context-dependent morphophonological rules—consonant gradation, vowel harmony, possessive suffix assimilation—leads to multiplicative state explosion; neural models sidestep the problem but provide no formal account of the rules themselves. We present the first framework where each morphophonological rule is a function from a focused local context to a single output segment—the type of a local rule familiar from cellular automata—and where length-changing rules compose as coKleisli arrows of a comonad. Our central contribution is the Writer comonad (DeletionSet x Zipper), a new algebraic construction that restores strict coKleisli compositionality for such rules: each rule is a coKleisli arrow, extend lifts it to a global transformation, and deletions accumulate as a monoid action rather than requiring intermediate materialization. As supporting evidence, thirteen coKleisli arrows provide an alternative formulation expressing the same morphophonological behaviors that Omorfi encodes via 874 continuation classes (67:1 reduction at the rule-representation level), and the same abstraction enables bidirectional morphology—a MorphGenerator reuses the analysis arrows for generation. On UD Finnish-TDT, the system achieves 83.92% UPOS accuracy with rule-only disambiguation (94.66% with an external suffix tagger), validating the framework as a practical morphological engine.
This study uses a modeling approach to explore the development of spectral and positional encodings in speech sounds. Humans rely on their auditory system to differentiate between individual sounds in words by analyzing both spectral properties of phonemes and their relative positions. Previous neuroscientific research has identified specific neural populations in the auditory cortex that respond to spectral processing, while behavioral studies have confirmed humans’ ability to perceive the relative positions of phonemes in speech sequences. To investigate these encodings, we employ a Long Short-Term Memory (LSTM) autoencoder with a cross-attention mechanism, trained on Mel-spectrograms derived from raw speech data. By conducting ABX tests on the model’s representations at various learning stages, we observe the emergence of spectral and positional encodings. The results show that the model excels in distinguishing spectral features, consistent with neuroscientific findings, and also reveal independent positional encoding through accurate temporal distinctions. Furthermore, we illustrate the developmental trajectory of spectral and positional encodings during the learning process, which motivates further investigation of their neural correlates.
The Spanish Learner and Heritage Speaker Dependency Treebank
Valeria Pagliai | Sergio José Salazar Rodó | Emiliana Pulido | Andres Gutierrez-Quintero | Zoey Liu
We present a manually curated L2-Heritage Speaker Spanish dataset (N = 49,247) following the Universal Dependencies framework, including lemmatizations, part-of-speech tags, syntactic dependencies, and instances of pro-drop and ungrammatical structures. In addition to this, for dependency parsing we examined different data partitioning strategies and data representations, as well as different training configurations using our data and the AnCora treebank. Overall, the results yield reasonable LAS scores and comparable performance between AnCora and our dataset.
Omnivorous Agreement, like Uyghur Backness Harmony, is a Challenge for Tier-Based Strict Locality
Allison Verbil | Tim Hunter
A well-known exception to the characterization that phonological patterns belong to the subregular class of TSL dependencies is found in Uyghur backness harmony (Mayer and Major, 2018). At the same time, a recent line of work has argued that many long-distance syntactic phenomena are subsumed by the TSL class, revealing an interesting parallel between phonology and syntax. We show that a certain omnivorous syntactic agreement pattern, namely Mundari object agreement (Murugesan et al., 2025), poses the same challenge to TSL as Uyghur backness harmony.
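For readers unfamiliar with the class, a tier-based strictly 2-local (TSL-2) grammar can be sketched in Python: project the string onto a tier of designated symbols, then reject any string whose tier contains a banned adjacent pair. The tier alphabet and forbidden pairs below are toy values in the style of a sibilant harmony pattern, not the Uyghur or Mundari analyses discussed in the abstract. The function name `tsl2_accepts` is hypothetical.

```python
def tsl2_accepts(word, tier, forbidden, edge="#"):
    """Return True if `word` satisfies a toy TSL-2 grammar.

    tier      -- set of symbols that project onto the tier
    forbidden -- set of (left, right) pairs banned on the tier
    edge      -- boundary marker added to both ends of the projection
    """
    # Step 1: erase every symbol that is not on the tier.
    projected = [edge] + [s for s in word if s in tier] + [edge]
    # Step 2: check each pair of tier-adjacent symbols.
    return all((a, b) not in forbidden
               for a, b in zip(projected, projected[1:]))

# Toy grammar: "s" and "S" may not co-occur, at any distance.
TIER = {"s", "S"}
FORBIDDEN = {("s", "S"), ("S", "s")}

print(tsl2_accepts("sokasu", TIER, FORBIDDEN))  # True: tier is s s
print(tsl2_accepts("sokaSu", TIER, FORBIDDEN))  # False: tier is s S
```

Because the non-tier symbols are erased before the check, the dependency holds across any amount of intervening material. Patterns that challenge TSL are those for which no single fixed tier of this kind works.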
Modelling the Diachronic Emergence of Phoneme Frequency Distributions
Fermin Moscoso Del Prado Martin | Suchir Salhan
Phoneme frequency distributions exhibit robust statistical regularities across languages, including exponential-tailed rank-frequency patterns and a negative relationship between phonemic inventory size and the relative entropy of the distribution. The origin of these patterns remains largely unexplained. In this paper, we investigate whether they can arise as consequences of the historical processes that shape phonological systems. We introduce a stochastic model of phonological change and simulate the diachronic evolution of phoneme inventories. A naïve version of the model reproduces the general shape of phoneme rank-frequency distributions but fails to capture other empirical properties. Extending the model with two additional assumptions – an effect related to frequency and a stabilising tendency toward a preferred inventory size – yields simulations that match both the observed distributions and the negative relationship between inventory size and relative entropy. These results suggest that some statistical regularities of phonological systems may arise as a result of diachronic sound change instead of – or in addition to – explicit optimisation or compensatory mechanisms.
Morpheme structure phonotactics: a categorical model for morpho-phonological productivity in Russian vowel-zero alternations
Daniar Kasenov
Nonce word studies motivate a notion of gradient similarity between nonce words and real words. In morpho-phonological research, similarity is often taken to be a relationship between a nonce word and the list of morphemes / words that undergo a given morphophonological alternation (Albright and Hayes 2003; Becker et al. 2011 i.a.). This paper challenges this view on the basis of nonce word data on Russian vowel–zero alternations (Gouskova and Becker 2013; Becker and Gouskova 2016). I propose a model where morpho-phonological similarity is a relationship between the available underlying representations and the underlying representation the nonce item must have in order to undergo the alternation. The implementation of the proposed model matches, and in some comparisons exceeds, the performance of Becker and Gouskova’s (2016) MaxEnt model. This study thus presents a linking hypothesis between nonce word studies and approaches that mark segments themselves as undergoing certain restricted alternations.
Agreement attraction errors, in which a verb erroneously agrees with an intervening noun rather than its grammatical head, are amplified by morphological syncretism in some languages (English, German, Russian) but not others (Turkish, Armenian), a cross-linguistic pattern without a principled account. We use surprisal and attention entropy from large language models as processing proxies to investigate this variation across four languages. LLM-derived measures replicate behavioral findings in English and German (syncretism modulates attraction), align with Turkish null results (no modulation), and partially capture Russian patterns. We discuss further directions for better understanding why syncretism affects agreement attraction differently across languages.
Search & Change (S&C) is a procedural model of phonological rule application that is conceptually clear and linguistically motivated, but whose computational properties have not been fully characterized. This paper provides a formal specification of S&C within the framework of Logical Phonology, presents a linear-time algorithm for rule application with a proof of correctness, and gives a compilation procedure mapping S&C rules to a single transition structure that is subsequential in one scan orientation and reverse-subsequential in the other. This situates S&C within a well-understood subclass of regular string-to-string functions with known learnability guarantees and algebraic characterizations, implying that S&C-definable mappings are learnable from positive input/output pairs and amenable to algebraic classification.
This paper investigates whether tonotactic learning differs across representations and learning models. We conduct an experiment using the same dataset encoded in three representations: segments, features, and autosegmental representations (ARs). To the extent possible, two learning models are evaluated, the Maximum Entropy (MaxEnt) model and the Bottom-Up Factor Inference Algorithm (BUFIA), to examine how learning outcomes interact with both model type and representations. A follow-up experiment further explores the roles of frequency and complexity thresholds. The results show that (1) AR-based learning gives the strongest overall performance; (2) there is no consistent advantage between segmental and featural representations across learning models; (3) MaxEnt performance improves substantially when frequency information is introduced; and (4) the effects of complexity bounds interact with representation type and frequency information. These findings suggest that tonotactic learning benefits from structurally explicit representations. Overall, this work highlights the importance of incorporating linguistically meaningful representations into learning.
Investigating Syntactic Biases in Multilingual Transformers with RC Attachment Ambiguities in Italian and English
Michael Kamerath | Aniello De Santo
This paper investigates whether monolingual and multilingual LLMs show human-like preferences when presented with examples of relative clause attachment ambiguities in Italian and English. We also test whether these preferences can be modulated by lexical factors (the type of verb/noun in the matrix clause) which have been shown to be tied to subtle constraints on syntactic and semantic relations. Our results overall showcase how LLM behavior varies inconsistently across models and languages, and highlight the importance of leveraging subtle syntactic contrasts in exploring these models’ ability to correctly align with human-like preferences.
A Feature-Driven Tensor Semantics for Minimalist Grammars
John Paulson | Aniello De Santo | Jonathan Rawski
This paper shows how tensor-based distributional semantics can be incorporated into Minimalist Grammars (MGs), leveraging the tensor-based MG representations of beim Graben and Gerth (2012). We embed the Minimalist feature calculus with a tensor algebra and give a joint tensor-based representation where compositional semantics is guided by the minimalist syntax. By bridging syntactic and semantic operations in tensor spaces, we aim to contribute to the broader enterprise of neurosymbolic approaches to linguistic cognition.
Word Predictability on Code-switching Points in Cantonese–English Discourse
Ariel Shuk Ling Chan | Yanting Li | Jacob Poschl
This paper investigates how word predictability influences code-switching probability. We analyze 1,010 code-switched instances drawn from naturalistic sociolinguistic interviews with 41 Cantonese–English bilinguals across three bilingual groups (homeland, immersed, and heritage). In particular, we examine whether the predictability of switch points, operationalized as surprisal, influences the likelihood of code-switching. Using pretrained transformer-based language models, we estimate surprisal at the switch point under different modeling conditions, including autoregressive and masked models and varying amounts of contextual information. Mixed-effects logistic regression analyses show that less predictable words are more likely to be code-switched. These effects are largely consistent across model types and bilingual groups. Overall, these findings highlight the role of predictability in bilingual speech production and provide new insights into code-switching among bilingual speakers with diverse language experiences.
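Surprisal, the predictability measure named above, is conventionally defined as the negative log probability of a word given its context. The Python sketch below shows only that arithmetic. The probabilities are invented placeholders standing in for language-model estimates, and the function name `surprisal_bits` and the use of base-2 logarithms are assumptions rather than details from the paper.

```python
import math

def surprisal_bits(prob):
    """Surprisal in bits: -log2 P(word | context)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)

# Hypothetical conditional probabilities for two candidate words
# in the same context (placeholder values, not model output).
predictable_word = 0.5
unpredictable_word = 0.125

print(surprisal_bits(predictable_word))    # 1.0
print(surprisal_bits(unpredictable_word))  # 3.0
```

Lower probability yields higher surprisal, so a finding that less predictable words are more likely to be code-switched corresponds to a positive effect of surprisal on switching.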
Non-literal Meaning Representation in the Brain during Naturalistic Listening
Zhengwu Ma | Yuhan Huang | Chengcheng Wang | Jixing Li
Naturalistic language comprehension often involves interpretations that go beyond literal meaning. In continuous narratives, literal and non-literal meanings are tightly intertwined, making them difficult to distinguish computationally. Here, we combined literal sentence representations and human-annotated non-literal interpretations for model-brain alignment. Using fMRI data recorded during passive listening to the Chinese version of The Little Prince, we annotated sentences containing non-literal meaning with human-written interpretations of their implied meaning. We then derived the literal and non-literal representations from LLaMA3.1-8B and evaluated their correspondence with neural activity using whole-brain encoding models. Literal representations aligned strongly with left-lateralized frontotemporal regions, whereas non-literal interpretations showed broader right-hemisphere involvement. Combining the two further improved encoding performance in the bilateral temporal and dorsal frontal cortices, suggesting that naturalistic comprehension engages complementary levels of meaning.
Probing the Attention Representation of Filler-Gap Dependency in Transformers
Ruoqing Yao | Pranav Anand
Prior work (Wilcox et al., 2024; Kobzeva et al., 2025) shows that neural language models exhibit filled-gap and unlicensed-gap effects, yet these effects attenuate with intervening clauses, especially with intervening overt complementizers. We conduct attention probing experiments on GPT-2 and identify two specific heads (layer 5, head 2, and layer 8, head 9) whose verb-to-filler attention correlates with filled-gap surprisal. The two heads are sensitive to clausal intervention but not to linear distance, and they show distinct patterns in islands. When intervening overt complementizers appear, the attention of layer 5, head 2 redistributes from the filler to the nearest complementizer, producing an “attend-closest-C” pattern, while that of layer 8, head 9 does not. These results suggest that LMs may have allocated distinct linguistically meaningful representations from the training data to individual attention heads, but that they fail to fully learn the correct grammars of FGDs.
Learning Stress in Arabic Low-Resource Settings
Abed Qaddoumi | Jordan Kodner | Owen Rambow | Salam Khalifa | Jeffrey Heinz
We predict lexical stress in Arabic varieties using syllable structure (a sequence of CVs, with C for consonants and V for vowels). Our task is generation: given an unstressed input, the system outputs a stress-marked word. We compare four approaches: a grammar induction algorithm (BUFIA), a transformer-based neural network (NN), a rule-based method, and a frequency baseline. The models are evaluated across several low-resource settings by varying the training data size by words, structural type, and syllable count. BUFIA outperforms the neural network, especially when data are scarce. This supports grammar induction as an interpretable and sample-efficient alternative for learning stress.
Many gradable properties have been found to be encoded as axes in embedding space. Most commonly, property axes are computed using seed words, but recent work has noted limitations to seed-based axes. Here, we present a novel methodology for computing property axes that is based on human ratings and does not require seeds. We apply this methodology to a particular problem at the syntax-semantics interface: which semantic properties of intransitive verbs affect their likelihood to occur in one of two syntactic structures, unergative and unaccusative. Comparing property axes that encode different semantic dimensions of the concept of agentivity, we find that properties like movement and being alive are better predictors of the syntactic behavior of intransitives than goal-directedness or intentionality. We discuss the potential of rating-based axes for future work in semantics and at the syntax-semantics interface.
Mapping the meaning of Hungarian impulsative constructions
Ágnes Kalivoda | Robert Malouf | Fackerman@Ucsd.Edu Fackerman@Ucsd.Edu
We upload the abstract as a PDF file.
Various work in computational phonology has studied the computational properties of Optimality Theory. Some algorithms exist for the universal generation problem, including those of Ellison and Tesar, but their domain of applicability is poorly understood. I propose and study a concrete ‘minimal’ fragment of finite-state Optimality Theory. I show that the universal generation problem for it is efficiently solvable by improving Ellison’s Algorithm, demonstrate that it has been implicitly used in the literature, and discuss its limitations. The minimal fragment is a natural and foundational step towards a computationally tractable general formalism for phonological analysis.
This paper investigates the learnability of interacting phonological processes by restricting the hypothesis space to a subregular class of functions. Interacting processes can be modeled as function composition, where the output of one function serves as the input to another. We focus specifically on interactions between two simplex Input Strictly Local (ISL2) functions, a proper subclass of the ISL function class. We propose a decomposition algorithm that reconstructs both the individual component processes and their relative ordering by exploiting structural properties of simplex ISL2 transducers and their compositions. This work provides an initial step toward understanding how learners can infer not only single phonological processes, but structured interactions between processes.
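The idea of interacting processes as function composition can be made concrete with a minimal Python sketch. Each toy function below is Input Strictly 2-Local in the usual sense: the output for a symbol depends only on that symbol and the immediately preceding input symbol. The two rules (post-nasal voicing and post-nasal deletion) are invented examples. The sketch does not implement the paper's notion of "simplex" functions or its decomposition algorithm, and the helper names are hypothetical.

```python
def make_isl2(rewrite, edge="#"):
    """Build a toy ISL-2 string function.

    rewrite maps (previous_input_symbol, current_symbol) to an output
    string; pairs not listed are copied unchanged.
    """
    def apply(word):
        out, prev = [], edge
        for cur in word:
            out.append(rewrite.get((prev, cur), cur))
            prev = cur  # the window slides over the INPUT, not the output
        return "".join(out)
    return apply

def compose(first, second):
    """Apply `first`, then feed its output to `second`."""
    return lambda word: second(first(word))

# Two invented processes.
voicing = make_isl2({("n", "t"): "d"})   # t -> d after n
deletion = make_isl2({("n", "d"): ""})   # d -> (nothing) after n

voicing_first = compose(voicing, deletion)
deletion_first = compose(deletion, voicing)

print(voicing_first("anta"))   # ana  (voicing feeds deletion)
print(deletion_first("anta"))  # anda (deletion applies too early)
```

The two orderings give different outputs for the same input, which is why a learner of interacting processes has to recover the relative ordering as well as the component functions.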
Do I know what I want to say? Modeling meaning uncertainty in RSA
Anzi Wang | Carolyn Jane Anderson | Grusha Prasad
Models using the Rational Speech Act (RSA) framework typically assume that speakers are certain about the meaning being communicated. In this work we note that there are contexts in which this assumption does not hold, and propose a method (um-RSA) to incorporate this meaning uncertainty within the RSA framework. As a case study, we explore two sources of meaning uncertainty: Counting-Uncertainty (from numerical cognition) and Discounting-Uncertainty (from behavioral economics). We generate predictions from these two hypotheses and test these predictions with two human experiments. The results show that um-RSA can account for differences in uncertainty expression usage that the standard RSA framework cannot account for, thus demonstrating the usefulness of modeling meaning uncertainty.
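As background, the standard RSA recursion that the abstract takes as its starting point can be sketched in a few lines of Python. This is the textbook formulation (literal listener, speaker, pragmatic listener) applied to the familiar "some"/"all" scalar example, with a uniform prior, no utterance costs, and rationality parameter `alpha`. All of these are assumptions chosen for illustration. It is not the paper's um-RSA: here the speaker is certain of the meaning `m`, which is the assumption um-RSA relaxes.

```python
def normalize(scores):
    """Rescale non-negative scores so they sum to 1."""
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()} if total > 0 else dict(scores)

MEANINGS = ["some_not_all", "all"]
UTTERANCES = ["some", "all"]
# Literal semantics: the set of meanings each utterance is true of.
TRUE_OF = {"some": {"some_not_all", "all"}, "all": {"all"}}
PRIOR = {m: 0.5 for m in MEANINGS}  # uniform prior (assumption)

def literal_listener(u):
    """L0(m | u) is proportional to [[u]](m) * P(m)."""
    return normalize({m: PRIOR[m] * (m in TRUE_OF[u]) for m in MEANINGS})

def speaker(m, alpha=1.0):
    """S1(u | m) is proportional to L0(m | u) ** alpha (no utterance cost)."""
    return normalize({u: literal_listener(u)[m] ** alpha for u in UTTERANCES})

def pragmatic_listener(u, alpha=1.0):
    """L1(m | u) is proportional to S1(u | m) * P(m)."""
    return normalize({m: PRIOR[m] * speaker(m, alpha)[u] for m in MEANINGS})

# Hearing "some", the pragmatic listener shifts toward "some but not all".
print(pragmatic_listener("some"))  # some_not_all is about 0.75
```

With these settings the literal listener is indifferent between the two meanings on hearing "some", while the pragmatic listener favors "some but not all". This is the standard scalar implicature result.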
This paper examines the learnability of different types of tone sandhi in Structural Optimality, a constraint-based framework that posits hierarchical scales and defines constraints over the scales. Approaching learning as a hidden structure problem, we show that Expectation Driven Parameter Learning can acquire these grammars, but that their properties can make learning difficult.
Concrete words (e.g., apple) are often described in the literature as sharing more semantic features across languages than abstract words (e.g., appetite). We test this hypothesis using multilingual aligned word embeddings by measuring the distance between words and their nearest neighbor in other languages, and examining whether shorter distances predict higher concreteness ratings in six languages: Dutch, English, French, Cypriot Greek, Mandarin, and Portuguese. The relationship between concreteness and cross-linguistic distance varied across languages, suggesting that concreteness does not uniformly correspond to cross-linguistic semantic relatedness. Our attempt highlights the potential of using aligned word embeddings for operationalizing psycholinguistic constructs.
Frequency modulates structural choice in Turkish suspended affixation: a latent-process account
Utku Turk | Eva Neu | Özge Bakay | Brian Dillon | Gaja Jarosz
Suspended affixation (SA) allows a suffix on one conjunct to scope over all coordinated elements. While inflectional SA is productive in Turkish, derivational SA is claimed to be highly restricted; yet speakers readily accept certain cases. We propose that this gradient acceptability reflects a frequency-modulated choice between two possible syntactic representations: base-generation, which licenses derivational SA, and ellipsis. To test this, we conducted a rating task on the acceptability of four derivational suffixes in SA form while manipulating the frequency of coordinations. Using a Multinomial Processing Tree model to isolate latent structural choices from surface ratings, we found that frequency modulated SA acceptability for some suffixes (i.e., sIz ’-less’ and cI ’-maker’), but not others (i.e., lI ’-having’ and lIk ’-for’). These findings suggest that frequency shapes syntactic parsing in morphologically complex environments.
Effect of case markers during agreement production: A model comparison using Armenian forced choice data
Pranab Bagartti | Samar Husain
Agreement attraction errors, where the verb erroneously agrees with a non-subject noun, have been a useful tool for investigating processes that subserve sentence production. Research has shown that case markers play an important role in modulating such errors. These effects have been argued to arise due to an underlying cue-based retrieval system. However, subsequent research in Armenian has challenged this conclusion (Avetisyan et al., 2020), arguing against a cue-based retrieval account. The current paper revisits the Armenian production data through computational modeling. Specifically, we implemented three distinct models and compared their predictions: (a) a cue-based retrieval model, (b) a feature migration model, and (c) a case as markers for agreement prediction model. Our model comparison results show that a case as markers for agreement prediction model followed by an inference component explains the effect of case better than either the cue-based retrieval model or the feature migration model.
One of the most fundamental representations in linguistic semantics is that of the proposition (McGrath and Frank, 2005), standardly taken as the carrier of truth-conditions. Recent work shows that some form of truth can be decoded from language models (Azaria and Mitchell, 2023; Li et al., 2023), and strikingly, that for some models, truth is even represented linearly in intermediate layers (Marks and Tegmark, 2024, GoT). We take this line of work a step further and argue that neural language models can use propositional representations compositionally (Janssen 2010; Pickel and Szabó 2025 a.o.), drawing from evidence of the behaviour of logical connectives: the linear compositionality hypothesis. Specifically, we show (a) that the truth values of individual conjuncts can be decoded independently of the truth value of a complex conjunction, and (b) that we can causally intervene on individual conjuncts in a way that affects the truth value of the whole.
Honorifics are linguistic forms that encode respect toward a socially valued individual or entity. This paper investigates how language models process Korean subject honorifics, which signal the social status of the subject through specific morphological markers. We evaluate a set of language models to determine whether they process honorifics in a human-like way by capturing the socio-pragmatic constraints governing their use, rather than merely relying on surface co-occurrence patterns. Our results indicate a systematic dissociation: models generally succeeded in detecting surface morphosyntactic mismatches, successfully treating unacceptable honorific constructions as less expected. However, models consistently favored overt honorific marking regardless of the subject’s social status, suggesting reliance on surface heuristics over genuine pragmatic knowledge. These findings suggest that language models have not fully acquired the socio-pragmatic constraints underlying honorific use, even when extensively trained on Korean text.
CrosSing: Cross-Scale Reasoning Evaluation on LLMs against Humans
Qi Han | Yifan Wu | Marten Van Schijndel
While many studies have shown that LLMs perform well in various reasoning tasks, few have examined their capacity on semantic reasoning tasks. As LLMs reason with language, it is crucial to understand how well they grasp and use the underlying scalar relationships in language. In this study, we introduced a new dataset, CrosSing (Cross-Scale reasoning), providing a human baseline against which to evaluate LLMs’ ability to reason across lexical scales in gradable adjectives. We further probed how their understanding is influenced by overinformative contexts. We evaluated ten high-performing LLMs and found that some outperformed humans when no extra information was provided, but that LLM performance declined in certain overinformative contexts while human performance improved significantly. This contrast reveals a fundamental difference between recent LLMs and humans in understanding adjectives’ scalar relationships and how such understanding behaves in overinformative contexts.
We fine-tune Whisper large-v3 independently on each of the 81 languages in the FLEURS benchmark. Fine-tuning improves WER for all 81 languages, reducing it by nearly 30% on average. However, improvement varies widely, and the language’s writing system is the best predictor of success. Latin and Cyrillic script languages reach single-digit WERs, while languages with unique scripts (Thai, Georgian, Burmese, Khmer) benefit least. We further show that Whisper’s BPE compression ratio predicts fine-tuning headroom (Spearman ρ ≈ −0.78), pointing to tokenization as the underlying bottleneck. We will release model weights upon publication.
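The rank correlation mentioned above can be illustrated with a small pure-Python sketch of Spearman's ρ (Pearson correlation computed on average ranks). The per-language numbers below are invented placeholders for illustration only; they are not results from the paper, and the variable names are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Placeholder values (NOT from the paper): one entry per hypothetical language.
compression_ratio = [4.1, 3.8, 2.2, 1.5, 1.1]
post_finetune_wer = [6.0, 7.5, 14.0, 21.0, 30.0]
rho = spearman(compression_ratio, post_finetune_wer)
```

In practice a library routine such as scipy.stats.spearmanr would typically be used; the hand-written version is shown only to make the computation explicit.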
Human comprehenders have greater difficulty forming pairwise grammatical dependencies in cases where the earlier word competes with a "distractor" to which it is similar. Cue-based retrieval theories (see, e.g., Lewis et al., 2006) address this "interference" phenomenon with explicit quantifications of memory retrieval difficulty. We propose a computational model, consistent with cue-based retrieval, that separately quantifies two different kinds of similarity. A linear combination of the two reproduces the graded interference pattern reported in Van Dyke (2007). This simple account offers a more straightforward mechanistic interpretation than attention-based predictors from opaque Transformer-based models.
How much capacity does Turkish inflection require? An empirical study of GRU encoder–decoder bottlenecks.
Fred Mailhot
Encoder–decoder neural networks with high-dimensional (e.g., d=300–500) embeddings and hidden layers can be used to model a variety of morphophonological phenomena as sequence-to-sequence mappings, achieving high accuracy across languages and patterns. We show here that these high-capacity models are overparameterized, at least for the task of morphological inflection, and that simpler and smaller networks can perform near ceiling on the task of inflecting Turkish stems. Moreover, these reduced-capacity models encode linguistically relevant information even when they are too small to succeed at the inflectional task.
The signal is coming from inside the noun phrase! Tracking semantic proto-role inferences during sentence processing
Lucas Y. Li | Zander Lynch | Marten Van Schijndel
Semantic roles between a predicate and argument can be decomposed into proto-role properties (e.g., Instigation). We introduce a novel LLM feature attribution method, Generalized Contextual Decomposition for Transformers (GCD-T), which we use to probe which parts of a sentence enable models to infer proto-role properties. We compare our findings with human inferences.
Quantifying mutual intelligibility gradients in Turkic languages using language models
Moldir Baidildinova | Shiva Upadhye | Austin Wagner | Connor Mayer | Richard Futrell
Mutual intelligibility (MI) among related languages is a gradient phenomenon shaped by lexical, grammatical, and phonetic-phonological similarity. This study proposes a neural language modeling approach to quantifying MI patterns within the Turkic language family. Using IPA-transcribed naturalistic text from six Turkic languages, we train character-level LSTM models on a source language and fine-tune them on target languages that vary in genealogical distance. Cross-lingual transfer is evaluated using character-level cross-entropy (CE) loss, Area Under the Curve (AUC), and Rate of Change (ROC), which together capture model generalization, learning dynamics, and early-stage adaptation. We further examine whether model performance is predicted by cophenetic distance, lexical similarity, weighted trigram frequency overlap, and differences in vowel harmony index. Overall, the results suggest that character-level language models can approximate MI gradients across Turkic languages: closely related pairs generally show lower CE loss and smaller AUC, while more distant pairs show greater early-stage change. Lexical similarity, local phonotactic overlap, and genealogical distance appear to be the most informative predictors of model convergence. These findings provide preliminary evidence that neural language models trained on naturalistic text can offer a scalable way to model MI patterns, including directional asymmetries, across closely related languages.
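Two of the evaluation quantities named above can be sketched in a few lines of Python: mean character-level cross-entropy, and the area under a fine-tuning loss curve. The abstract does not spell out its exact AUC definition, so the trapezoidal rule over (step, loss) points used here is an assumption, and the function names are hypothetical.

```python
import math


def mean_cross_entropy(gold_char_probs):
    """Mean negative log-probability (in nats) assigned to the gold characters."""
    return -sum(math.log(p) for p in gold_char_probs) / len(gold_char_probs)


def trapezoid_auc(steps, losses):
    """Area under a loss curve by the trapezoidal rule (assumed definition)."""
    area = 0.0
    for i in range(len(steps) - 1):
        width = steps[i + 1] - steps[i]
        area += width * (losses[i] + losses[i + 1]) / 2
    return area
```

Under this reading, a smaller area means the model reached low loss on the target language quickly, which matches the abstract's observation that closely related pairs show smaller AUC.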
This paper offers an updated perspective on the computational complexity of reduplication. Since one-way deterministic transducers cannot model reduplication in a straightforward way, the phenomenon has long been considered the outlier of morphology from a complexity perspective. Drawing on algebraic methods, I show that the vast majority of reduplicative processes belong to a few remarkably simple classes of subregular functions. A detailed study of the RedTyp database (Dolatian and Heinz, 2019) reveals that 100% of the surveyed reduplicative processes correspond to string-to-string functions in the class DA, while over 98% are locally testable (LJ1) and over 87% are locally trivial (L1). These results indicate a new upper bound on the complexity of reduplication that is comparable to that of morphological processes in general.
Learning reduplicative templates as hidden structures: the case of reduplication-phonology interactions
Yang Wang
Models of morphophonological learning have focused primarily on concatenative processes, leaving the challenges of non-concatenative morphology largely unaddressed. Reduplication, the systematic copying operation (e.g., Ilokano pluralization [kal-kaldíŋ] ‘goats’), is particularly revealing because successful learning requires the joint inference of prosodic templates that govern copying, underlying representations (URs) of stems and other affixes, and the phonological grammar. In this paper, we present a learner that tackles this challenge by allowing reduplication to be learned alongside general morphophonemic alternations, a combination that, to our knowledge, has not been directly modeled in prior computational work. We show that the learner successfully captures the attested typology of reduplication–phonology interaction.
Do Large Language Models Acquire Phrase-Based Processing? Evidence from Eye Movements and Model-Brain Alignment After Fine-Tuning
Xufeng Duan | Zhengwu Ma | Zhaoqian Yao | Jixing Li | Zhenguang Cai
Autoregressive large language models (LLMs) process text token-by-token, yet the human language system operates over multi-word units. We ask whether aggregating LLM representations at the phrase level yields a closer correspondence to human reading behavior and language cortex than the default word-level representations, and whether phrase-segmentation fine-tuning amplifies this correspondence. Using Meta-Llama-3.1-8B (base and fine-tuned), we provide three converging lines of evidence. First, phrase-level attention features predict regressive eye-saccade patterns more closely than word-level features; a partial correlation analysis with a shuffled-boundary control indicates that this is not solely an aggregation artifact and that linguistic chunk boundaries explain unique variance beyond word-level attention. Second, fMRI encoding analyses show that fine-tuning selectively improves phrase encoding in left superior temporal gyrus and inferior frontal gyrus, with no improvement for word representations. Third, representational similarity analysis confirms a phrase-specific gain in model-brain geometric alignment. These results identify phrase-level representation as a critical granularity for LLM–human correspondence and suggest that targeted training can model human-like compositional processing, linking computational representations to hierarchical theories of language.
Roles of Predictability and Acoustic Distance in Sound Discrimination via Contrastive Learning
Shuhao Zhang | Youngah Do
Research in sound discrimination demonstrates that listeners exhibit reduced sensitivity to acoustic differences between allophones, as opposed to phonemes. Previous studies indicate that the highly predictable, complementary distribution of allophones contributes to this limited sensitivity by providing strong contextual cues. Building on these insights, this study investigates the role of predictability in sound discrimination within a supervised contrastive learning framework. Specifically, we examine how varying levels of predictability affect the ability to distinguish sounds and whether this influence is categorical or gradual. Additionally, we explore the interaction between acoustic distance and predictability, as well as how the presence of other contrasts within a language modulates this process. Our findings indicate that only full predictability leads to a significant decline in discrimination performance, demonstrating a categorical effect. This impairment can be alleviated as acoustic distance increases. Moreover, the presence of additional contrasts sharing the relevant acoustic dimension enhances discriminability, showing the importance of contextual contrasts in speech perception.
Graded Expectations: Do Large Language Models Show Human-like Sensitivity to the Likelihood of Deceptive Speech Acts?
Xingyuan Zhao | Seana Coulson
Human discourse comprehension includes graded expectations about whether a speaker is likely to lie. If language models capture human-like discourse expectations, they should be sensitive not only to factual consistency but also to lie expectancy as a contextual probability from complex pragmatic cues. We test this idea using discourse scenarios with varying incentives to deceive. Human lie probability is estimated from free continuations, and model lie expectancy is derived from the probability mass assigned to human-produced lie versus truth continuations. Across Qwen3 models, likelihood-derived lie mass aligns strongly with human lie expectancy. The best performance comes from the base checkpoints. By contrast, post-trained and mode-specialized variants show weaker alignment. Qualitative analysis suggests a structured error pattern: models tend to overpredict lies when a response directly conflicts with known facts, but underpredict them when lie expectancy depends more on contextual pressures such as politeness, self-protection, or strategic gain. These results suggest that graded lie expectancy is recoverable from model continuation probabilities and can be learned, at least in part, through the ordinary next-token prediction objective.
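One plausible reading of "likelihood-derived lie mass" is the share of probability that a model assigns to the lie continuations, out of the total it assigns to lie and truth continuations together. The sketch below computes that ratio from sequence log-probabilities. The normalization and the function names are assumptions for illustration; the paper's exact scoring procedure may differ.

```python
import math


def logsumexp(log_values):
    """Numerically stable log of a sum of exponentials."""
    peak = max(log_values)
    return peak + math.log(sum(math.exp(v - peak) for v in log_values))


def lie_mass(lie_logprobs, truth_logprobs):
    """Fraction of continuation probability that falls on lie continuations.

    Each argument is a list of sequence log-probabilities (natural log) that
    a model assigns to human-produced continuations of one scenario.
    """
    log_lie = logsumexp(lie_logprobs)
    log_truth = logsumexp(truth_logprobs)
    return math.exp(log_lie - logsumexp([log_lie, log_truth]))
```

Working in log space avoids underflow, since full-sequence probabilities from a language model are usually far too small to multiply or sum directly.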
Lexical exceptionality in paradigm-specific learning: modeling stem-final obstruent alternations in Korean verbs and adjectives
Stella Eunsoo Hong
Korean stem-final conjugations illustrate the interaction between lexical exceptionality and heterogeneous phonological processes. When /p/-, /t/-, and /s/-final stems occur before vowel-initial suffixes, the irregular classes in these paradigms undergo intervocalic lenition, each exhibiting a distinct alternation pattern. Learners must therefore not only identify which roots trigger lenition, but also determine the corresponding repair strategy. This study investigates how lexically-specific phonological patterns are acquired when multiple repair strategies are available. We employ a lexically scaled MaxEnt model (Linzen et al., 2013; Hughto et al., 2019) to learn these paradigm-specific alternations and run simulations under two learning scenarios: (1) when repair strategies occur at equal frequencies and (2) when one strategy significantly outnumbers the others. Results show that the model favors a least-cost solution by treating statistically dominant morpheme classes as the general pattern. We conclude by discussing the model’s sensitivity to lexical statistics, predictions for empirical testing, and implications for language acquisition.
Adaptive Speech Perception: Empirical Indeterminacy and a Path Forward
Shawn N. Cummings | T. Florian Jaeger | Chigusa Kurumada | Xin Xie
Human listeners rapidly adapt to unfamiliar talkers, but the underlying computational mechanisms remain contested. Three candidate hypotheses—pre-linguistic normalization, changes in phonetic category representations, and changing decision biases—have largely been pursued separately, using subfield-specific paradigms. Researchers working in these paradigms often assume that adaptivity observed in their particular paradigm can only be explained by one of the three mechanisms. We test this assumption for one of the most popular experimental paradigms (lexically-guided perceptual learning, or LGPL) using a unified computational framework (ASP). We apply ASP to the largest existing LGPL dataset: 89,600 categorization responses from over 1000 listeners after lexically-guided exposure to 32 different stimulus sets. Despite the unprecedented scale of these data, we find that behavioral data are equally compatible with all three candidate mechanisms. We discuss how model-guided stimulus selection can increase the diagnosticity of future LGPL experiments. Our simulation code can easily be adapted to other experimental paradigms.
Modeling generalization in perceptual learning of speech
Yiming Lu | Xinyu Leslie Liao | Alejandro Tabas | Xin Xie
A hallmark of learning is generalization to novel instances. In speech, exposure to atypical pronunciation drives perceptual adjustment that can generalize to unheard tokens. Prior work has attributed constraints on generalization primarily to acoustic similarity between exposure and test contexts. We propose that generalization can also be understood as an inference problem: listeners must determine whether, and how strongly, a learned phonetic mapping should apply in a new context. We test this proposal using data from a recent experiment in which listeners were exposed to shifted vowel pronunciations and then tested on minimal pairs varying in lexical frequency. Learning effects appeared strongest when the exposure direction aligned with a high-frequency alternative in mixed-frequency pairs, and were absent for low-frequency pairs. The observed pattern could reflect token-level acoustic similarity, reliance on prior expectations, or frequency-dependent constraints in applying the learned mapping. We formalized these alternatives within a Bayesian belief-updating framework: a talker-specific model assuming full transfer, a mixture-of-expectations model that interpolates between the updated representation and the listener’s prior, and a hierarchical Bayesian model that deploys the updated representation with uncertainty. The talker-specific model captured most generalization patterns through its sensitivity to token-level acoustic properties, but overpredicted learning for low-frequency pairs. The hierarchical model best recovered the theoretically central exposure-control contrast pattern, suggesting that lexical frequency may constrain how learned representations are applied. Our results provide a computationally explicit framework for studying how contextual factors shape generalization in speech perception.
This paper investigates the relationship between strictly local phonological processes and strictly local phonotactic constraints. On the theoretical side, I identify phonological rewrite rules that do not produce strictly local output languages and that do not weakly preserve the class of strictly local languages. Empirically, I find that strictly local rules without strictly local output languages are largely absent from the PBase database.
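For readers unfamiliar with the notion, a strictly k-local phonotactic constraint can be stated as a finite set of forbidden substrings of length k, checked over a word padded with boundary symbols. The Python sketch below illustrates that membership test. The "#" boundary symbol and the example constraint against "nb" are illustrative assumptions and are not drawn from the paper or from PBase.

```python
def k_factors(word, k):
    """Return the set of length-k substrings of the boundary-padded word."""
    padded = "#" * (k - 1) + word + "#" * (k - 1)
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}


def is_well_formed(word, forbidden, k=2):
    """A word is in the strictly k-local language if it has no forbidden factor."""
    return k_factors(word, k).isdisjoint(forbidden)


# Hypothetical strictly 2-local constraint: no "n" immediately before "b".
NO_NB = {"nb"}
```

A rewrite rule produces a strictly local output language when the set of its possible outputs can be characterized by some such finite set of forbidden factors; the abstract's theoretical result concerns rules for which no such set exists.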
This paper argues in favor of a fundamentally new perspective on phonology via modal logic. We show that the class of total Boolean Monadic Recursive Schemes (BMRS), used in computational modeling of phonological processes (Bhaskar et al., 2020; Chandlee and Jardine, 2021), is equivalent in expressive power to the well-studied modal 𝜇-calculus. As a corollary of this result, we obtain an alternative proof that order-preserving BMRS transductions capture the class of rational functions, which have been posited as a complexity bound on natural language phonological grammars.