Eric Ragan


2025

As Large Language Models (LLMs) gain mainstream public usage, understanding how users interact with them becomes increasingly important. The limited linguistic variety of most training data raises concerns about LLM reliability across different language inputs. To explore this, we test several LLMs with functionally equivalent prompts expressed in different English sublanguages. We frame the analysis around Question-Answer (QA) pairs, which let us detect and evaluate both appropriate and anomalous model behavior. We contribute a cross-LLM testing method and a new QA dataset translated into African American Vernacular English (AAVE) and West African Pidgin English (WAPE) variants. Early results reveal a notable drop in accuracy for one sublanguage relative to the baseline.
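
To make the testing setup concrete, the sketch below shows one way a cross-LLM, cross-sublanguage QA evaluation could be structured: each QA item carries a gold answer and functionally equivalent phrasings of the question in each sublanguage, and accuracy is tallied per model and per sublanguage. The dataset format, the dialect phrasings in the example item, the `ask()` callable, and the model names are all hypothetical placeholders, not the pipeline used in this work.

```python
"""Minimal sketch of a cross-LLM, cross-sublanguage QA evaluation loop.

Hypothetical throughout: the dataset schema, the example dialect phrasings,
the ask() stub, and the model names are illustrative placeholders only.
"""

from collections import defaultdict
from typing import Callable, Dict, List

# Each QA item holds one gold answer plus functionally equivalent
# phrasings of the question in different English sublanguages.
QAItem = Dict[str, object]

EXAMPLE_DATASET: List[QAItem] = [
    {
        "answer": "Paris",
        "prompts": {
            "SAE":  "What is the capital of France?",  # baseline phrasing
            "AAVE": "What the capital of France be?",  # illustrative only
            "WAPE": "Wetin be di capital of France?",  # illustrative only
        },
    },
]


def evaluate(models: List[str],
             dataset: List[QAItem],
             ask: Callable[[str, str], str]) -> Dict[str, Dict[str, float]]:
    """Return accuracy[model][sublanguage] using simple containment matching."""
    correct = defaultdict(lambda: defaultdict(int))
    total = defaultdict(lambda: defaultdict(int))
    for model in models:
        for item in dataset:
            gold = str(item["answer"]).lower()
            for sublanguage, prompt in item["prompts"].items():
                reply = ask(model, prompt).lower()
                correct[model][sublanguage] += int(gold in reply)
                total[model][sublanguage] += 1
    return {m: {s: correct[m][s] / total[m][s] for s in total[m]} for m in total}


if __name__ == "__main__":
    # Stand-in for a real API call; swap in an actual LLM client here.
    def fake_ask(model: str, prompt: str) -> str:
        return "I think the answer is Paris."

    print(evaluate(["model-a", "model-b"], EXAMPLE_DATASET, fake_ask))
```

In this sketch, comparing `accuracy[model]["SAE"]` against `accuracy[model]["AAVE"]` or `accuracy[model]["WAPE"]` is what would surface a sublanguage-specific drop relative to the baseline; the actual scoring criteria used in the study may differ from the containment check shown here.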