Evaluating The Impact of Stimulus Quality in Investigations of LLM Language Performance

Timothy Pistotti; Jason Brown; Michael J. Witbrock

Abstract

Recent studies employing Large Language Models (LLMs) to test the Argument from the Poverty of the Stimulus (APS) have yielded contrasting results across syntactic phenomena. This paper investigates the hypothesis that characteristics of the stimuli used in those studies, including lexical ambiguities and structural complexities, may confound model performance. A methodology is proposed for re-evaluating LLM competence on syntactic prediction, focusing on GPT-2. This involves: 1) establishing a baseline on previously used (both filtered and unfiltered) stimuli, and 2) generating a new, refined dataset using a state-of-the-art (SOTA) generative LLM (Gemini 2.5 Pro Preview) guided by linguistically informed templates designed to mitigate the identified confounds. Our preliminary findings indicate that GPT-2 performs notably better on these refined parasitic gap (PG) stimuli than on the baselines, suggesting that stimulus quality significantly influences outcomes in surprisal-based evaluations of LLM syntactic competence.

Anthology ID: 2025.brigap-1.2
Volume: Proceedings of the Second Workshop on the Bridges and Gaps between Formal and Computational Linguistics (BriGap-2)
Month: September
Year: 2025
Address: Düsseldorf, Germany
Editors: Timothée Bernard, Timothee Mickus
Venues: BriGap | WS
Publisher: Association for Computational Linguistics
Pages: 8–14
URL: https://preview.aclanthology.org/iwcs-25-ingestion/2025.brigap-1.2/
Cite (ACL): Timothy Pistotti, Jason Brown, and Michael J. Witbrock. 2025. Evaluating The Impact of Stimulus Quality in Investigations of LLM Language Performance. In Proceedings of the Second Workshop on the Bridges and Gaps between Formal and Computational Linguistics (BriGap-2), pages 8–14, Düsseldorf, Germany. Association for Computational Linguistics.
Cite (Informal): Evaluating The Impact of Stimulus Quality in Investigations of LLM Language Performance (Pistotti et al., BriGap 2025)
PDF: https://preview.aclanthology.org/iwcs-25-ingestion/2025.brigap-1.2.pdf