Luis Garcés-Erice




2025

Are Bias Evaluation Methods Biased?
Lina Berrayana | Sean Rooney | Luis Garcés-Erice | Ioana Giurgiu
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)

The creation of benchmarks to evaluate the safety of Large Language Models is one of the key activities within the trusted AI community. These benchmarks allow models to be compared on different aspects of safety, such as toxicity, bias, and harmful behavior. Independent benchmarks adopt different approaches, with distinct data sets and evaluation methods. We investigate how robust such benchmarks are by using different approaches to rank a set of representative models for bias and comparing how similar the resulting rankings are. We show that different but widely used bias evaluation methods result in disparate model rankings. We conclude with recommendations for the community on the usage of such benchmarks.