2025
Who Relies More on World Knowledge and Bias for Syntactic Ambiguity Resolution: Humans or LLMs?
So Young Lee | Russell Scheinberg | Amber Shore | Ameeta Agrawal
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
This study explores how recent large language models (LLMs) navigate relative clause attachment ambiguity and use world knowledge biases for disambiguation in six typologically diverse languages: English, Chinese, Japanese, Korean, Russian, and Spanish. We describe the process of creating a novel dataset – MultiWho – for fine-grained evaluation of relative clause attachment preferences in ambiguous and unambiguous contexts. Our experiments with three LLMs indicate that, contrary to humans, LLMs consistently exhibit a preference for local attachment, displaying limited responsiveness to syntactic variations or language-specific attachment patterns. Although LLMs performed well in unambiguous cases, they rigidly prioritized world knowledge biases, lacking the flexibility of human language processing. These findings highlight the need for more diverse, pragmatically nuanced multilingual training to improve LLMs’ handling of complex structures and human-like comprehension.
2024
Multilingual Relative Clause Attachment Ambiguity Resolution in Large Language Models
So Young Lee | Russell Scheinberg | Amber Shore | Ameeta Agrawal
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation
2023
MEEP: Is this Engaging? Prompting Large Language Models for Dialogue Evaluation in Multilingual Settings
Amila Ferron | Amber Shore | Ekata Mitra | Ameeta Agrawal
Findings of the Association for Computational Linguistics: EMNLP 2023
As dialogue systems become more popular, evaluation of their response quality gains importance. Engagingness highly correlates with overall quality and creates a sense of connection that gives human participants a more fulfilling experience. Although qualities like coherence and fluency are readily measured with well-worn automatic metrics, evaluating engagingness often relies on human assessment, which is a costly and time-consuming process. Existing automatic engagingness metrics evaluate the response without the conversation history, are designed for one dataset, or have limited correlation with human annotations. Furthermore, they have been tested exclusively on English conversations. Given that dialogue systems are increasingly available in languages beyond English, multilingual evaluation capabilities are essential. We propose that large language models (LLMs) may be used for evaluation of engagingness in dialogue through prompting, and ask how prompt constructs and translated prompts compare in a multilingual setting. We provide a prompt-design taxonomy for engagingness and find that using selected prompt elements with LLMs, including our comprehensive definition of engagingness, outperforms state-of-the-art methods on evaluation of engagingness in dialogue across multiple languages.