Joshua Rozner

2025

pdf bib abs
Constructions are Revealed in Word Distributions
Joshua Rozner | Leonie Weissweiler | Kyle Mahowald | Cory Shain
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Construction grammar posits that constructions, or form-meaning pairings, are acquired through experience with language (the distributional learning hypothesis).But how much information about constructions does this distribution actually contain? Corpus-based analyses provide some answers, but text alone cannot answer counterfactual questions about what caused a particular word to occur.This requires computable models of the distribution over strings—namely, pretrained language models (PLMs).Here, we treat a RoBERTa model as a proxy for this distribution and hypothesize that constructions will be revealed within it as patterns of statistical affinity.We support this hypothesis experimentally: many constructions are robustly distinguished, including (i) hard cases where semantically distinct constructions are superficially similar, as well as (ii) schematic constructions, whose “slots” can be filled by abstract word classes.Despite this success, we also provide qualitative evidence that statistical affinity alone may be insufficient to identify all constructions from text.Thus, statistical affinity is likely an important, but partial, signal available to learners.

pdf bib abs
BabyLM’s First Constructions: Causal interventions provide a signal of learning
Joshua Rozner | Leonie Weissweiler | Cory Shain
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Construction grammar posits that language learners acquire constructions (form-meaning pairings) from the statistics of their environment. Recent work supports this hypothesis by showing sensitivity to constructions in pretrained language models (PLMs), including one recent study (Rozner et al., 2025) demonstrating that constructions shape RoBERTa’s output distribution. However, models under study have generally been trained on developmentally implausible amounts of data, casting doubt on their relevance to human language learning. Here we use Rozner et al.’s methods to evaluate construction learning in masked language models from the 2024 BabyLM Challenge.Our results show that even when trained on developmentally plausible quantities of data, models learn diverse constructions, even hard cases that are superficially indistinguishable.We further find correlational evidence that constructional performance may be functionally relevant: models that better represent constructions perform better on the BabyLM benchmarks.

2022

Distillation efforts have led to language models that are more compact and efficient without serious drops in performance. The standard approach to distillation trains a student model against two objectives: a task-specific objective (e.g., language modeling) and an imitation objective that encourages the hidden states of the student model to be similar to those of the larger teacher model. In this paper, we show that it is beneficial to augment distillation with a third objective that encourages the student to imitate the causal dynamics of the teacher through a distillation interchange intervention training objective (DIITO). DIITO pushes the student model to become a causal abstraction of the teacher model – a faithful model with simpler causal structure. DIITO is fully differentiable, easily implemented, and combines flexibly with other objectives. Compared against standard distillation with the same setting, DIITO results in lower perplexity on the WikiText-103M corpus (masked language modeling) and marked improvements on the GLUE benchmark (natural language understanding), SQuAD (question answering), and CoNLL-2003 (named entity recognition).

Co-authors

Venues

emnlp2
naacl1

Fix author