Jason Katz-Brown

Also published as: Jason Brown


2025

Evaluating The Impact of Stimulus Quality in Investigations of LLM Language Performance
Timothy Pistotti | Jason Brown | Michael J. Witbrock
Proceedings of the Second Workshop on the Bridges and Gaps between Formal and Computational Linguistics (BriGap-2)

Recent studies employing Large Language Models (LLMs) to test the Argument from the Poverty of the Stimulus (APS) have yielded contrasting results across syntactic phenomena. This paper investigates the hypothesis that characteristics of the stimuli used in recent studies, including lexical ambiguities and structural complexities, may confound model performance. A methodology is proposed for re-evaluating LLM competence on syntactic prediction, focusing on GPT-2. This involves: 1) establishing a baseline on previously used (both filtered and unfiltered) stimuli, and 2) generating a new, refined dataset using a state-of-the-art generative LLM (Gemini 2.5 Pro Preview) guided by linguistically informed templates designed to mitigate the identified confounds. Our preliminary findings indicate that GPT-2 performs notably better on these refined parasitic gap (PG) stimuli than on the baselines, suggesting that stimulus quality significantly influences outcomes in surprisal-based evaluations of LLM syntactic competence.
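
The surprisal-based evaluation at the core of this methodology can be illustrated with a short sketch. The following is a minimal example using the Hugging Face transformers library and the off-the-shelf gpt2 checkpoint; the setup and the two example sentences are illustrative assumptions, not the paper's actual pipeline or stimuli:

```python
# Minimal sketch of sentence-level surprisal under GPT-2 (assumed
# setup; not the authors' exact pipeline or stimuli).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def total_surprisal(sentence: str) -> float:
    """Sum of -log2 p(token | prefix) over every token after the first."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Each position's logits predict the *next* token, so align
    # positions 0..T-2 with tokens 1..T-1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = log_probs[torch.arange(ids.size(1) - 1), ids[0, 1:]]
    return -(token_lp / torch.log(torch.tensor(2.0))).sum().item()

# Hypothetical contrast: the grammatical variant should receive the
# lower total surprisal.
print(total_surprisal("The keys to the cabinet are on the table."))
print(total_surprisal("The keys to the cabinet is on the table."))
```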

Exploring Gaps in the APS: Direct Minimal Pair Analysis in LLM Syntactic Assessments
Timothy Pistotti | Jason Brown | Michael J. Witbrock
Proceedings of the Second Workshop on the Bridges and Gaps between Formal and Computational Linguistics (BriGap-2)

Recent studies probing the Argument from the Poverty of the Stimulus (APS) have applied Large Language Models (LLMs) to test the learnability of complex syntax through surprisal-based metrics. However, divergent conclusions raise questions concerning the insights these metrics offer. While Wilcox et al. (2024) used direct minimal pair comparisons (the “wh-effect”) to demonstrate that models successfully generalise knowledge of filler-gap dependencies, Lan et al. (2024) used a Difference-in-Differences (DiD) metric and found that models largely fail on parasitic gaps (PGs). This paper argues that the direct minimal pair approach offers greater diagnostic transparency. We demonstrate this by generating a full 8-permutation paradigm of refined PG stimuli and evaluating the GPT-2 model used in previous studies with a systematic Wilcox-style wh-effect analysis. Our results show that GPT-2 succeeds across all four tested conditions, indicating robust knowledge of filler-gap licensing principles even in complex PG environments. This finding, which contrasts with the more ambiguous results from DiD-style metrics, suggests that the choice of evaluation metric is critical for assessing an LLM’s syntactic competence.
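
The wh-effect logic described here can be sketched as two directional contrasts over a four-way minimal paradigm. The sketch below reuses a sentence-level surprisal function such as total_surprisal from the sketch above; note that published wh-effect analyses typically measure surprisal at the critical region rather than over the whole sentence, and the example sentences are hypothetical placeholders, not items from the refined PG dataset:

```python
# Sketch of a Wilcox-style wh-effect check over one 4-way paradigm
# (illustrative sentences, not items from the refined PG dataset).
# Knowledge of filler-gap licensing predicts that a wh-filler lowers
# surprisal where a gap occurs and raises it where no gap occurs.

def wh_effects(surprisal, paradigm: dict) -> tuple[bool, bool]:
    """paradigm maps '+/-filler,+/-gap' condition labels to sentences."""
    gap_effect = (
        surprisal(paradigm["-filler,+gap"]) > surprisal(paradigm["+filler,+gap"])
    )
    no_gap_effect = (
        surprisal(paradigm["+filler,-gap"]) > surprisal(paradigm["-filler,-gap"])
    )
    return gap_effect, no_gap_effect

# Hypothetical paradigm cell; a full PG study would cross these factors
# with parasitic-gap position to yield the 8-permutation design
# described above.
example = {
    "+filler,+gap": "I know what the editor filed without reading.",
    "-filler,+gap": "I know that the editor filed without reading.",
    "+filler,-gap": "I know what the editor filed the report without reading.",
    "-filler,-gap": "I know that the editor filed the report without reading it.",
}
# Both contrasts should come out True for a model that has learned
# filler-gap dependencies (using total_surprisal from the sketch above).
print(wh_effects(total_surprisal, example))
```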

2013

Rhythm, Metrics, and the Link to Phonology
Jason Brown | Sam Mandal
Proceedings of the Australasian Language Technology Association Workshop 2013 (ALTA 2013)

2011

Training a Parser for Machine Translation Reordering
Jason Katz-Brown | Slav Petrov | Ryan McDonald | Franz Och | David Talbot | Hiroshi Ichikawa | Masakazu Seno | Hideto Kazawa
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

Training dependency parsers by jointly optimizing multiple objectives
Keith Hall | Ryan McDonald | Jason Katz-Brown | Michael Ringgaard
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

A Lightweight Evaluation Framework for Machine Translation Reordering
David Talbot | Hideto Kazawa | Hiroshi Ichikawa | Jason Katz-Brown | Masakazu Seno | Franz Och
Proceedings of the Sixth Workshop on Statistical Machine Translation