Katerina Marazopoulou


Fixing paper assignments

  1. Please select all papers that belong to the same person.
  2. Indicate below which author they should be assigned to.
Provide a valid ORCID iD here. This will be used to match future papers to this author.
Provide the name of the school or the university where the author has received or will receive their highest degree (e.g., Ph.D. institution for researchers, or current affiliation for students). This will be used to form the new author page ID, if needed.

TODO: "submit" and "cancel" buttons here


2024

pdf bib
Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks
Charlotte Siska | Katerina Marazopoulou | Melissa Ailem | James Bono
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Benchmarks have emerged as the central approach for evaluating Large Language Models (LLMs). The research community often relies on a model’s average performance across the test prompts of a benchmark to evaluate the model’s performance. This is consistent with the assumption that the test prompts within a benchmark represent a random sample from some real-world distribution of interest. We note that this is generally not the case; instead, we hold that the distribution of interest varies according to the specific use case. Hence, we analyze the robustness of LLM benchmarks to their underlying distributional assumptions. We find that (1) the correlation in model performance across test prompts is non-random, (2) accounting for correlations across test prompts can change model rankings on major benchmarks, (3) explanatory factors for these correlations include semantic similarity and common LLM failure points.