Aya Meltzer-Asscher
2025
When the LM misunderstood the human chuckled: Analyzing garden path effects in humans and language models
Samuel Joseph Amouyal | Aya Meltzer-Asscher | Jonathan Berant
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Modern Large Language Models (LLMs) have shown human-like abilities in many language tasks, sparking interest in comparing LLMs’ and humans’ language processing. In this paper, we try to answer two questions: 1. What makes garden-path sentences hard for humans to understand? 2. Do the same factors make garden-path sentences hard for LLMs as well? Based on psycholinguistic research, we formulate hypotheses on why garden-path sentences are hard, and test these hypotheses on human participants and a large suite of LLMs using comprehension questions. Our findings reveal that both LLMs and humans struggle with specific syntactic complexities, with some models showing high correlation with human comprehension. To complement our findings, we test LLM comprehension of garden-path constructions with paraphrasing and text-to-image generation tasks, and find that the results mirror those of the comprehension questions, further validating our findings on LLM understanding of these constructions.
2024
Large Language Models for Psycholinguistic Plausibility Pretesting
Samuel Amouyal | Aya Meltzer-Asscher | Jonathan Berant
Findings of the Association for Computational Linguistics: EACL 2024
In psycholinguistics, the creation of controlled materials is crucial to ensure that research outcomes are solely attributable to the intended manipulations and not influenced by extraneous factors. To achieve this, psycholinguists typically pretest linguistic materials, where a common pretest is to solicit plausibility judgments from human evaluators on specific sentences. In this work, we investigate whether Language Models (LMs) can be used to generate these plausibility judgments. We examine a wide range of LMs across multiple linguistic structures and evaluate whether their plausibility judgments correlate with human judgments. We find that GPT-4 plausibility judgments correlate highly with human judgments across the structures we examine, whereas other LMs correlate well with humans on commonly used syntactic structures. We then test whether this correlation implies that LMs can be used instead of humans for pretesting. We find that when coarse-grained plausibility judgments are needed, this works well, but when fine-grained judgments are necessary, even GPT-4 does not provide satisfactory discriminative power.