2025
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks
Anna Bavaresco | Raffaella Bernardi | Leonardo Bertolazzi | Desmond Elliott | Raquel Fernández | Albert Gatt | Esam Ghaleb | Mario Giulianelli | Michael Hanna | Alexander Koller | Andre Martins | Philipp Mondorf | Vera Neplenbroek | Sandro Pezzelle | Barbara Plank | David Schlangen | Alessandro Suglia | Aditya K Surikuchi | Ece Takmaz | Alberto Testoni
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
There is an increasing trend towards evaluating NLP models with LLMs instead of human judgments, raising questions about the validity of these evaluations, as well as their reproducibility in the case of proprietary models. We provide JUDGE-BENCH, an extensible collection of 20 NLP datasets with human annotations covering a broad range of evaluated properties and types of data, and comprehensively evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations. Our evaluations show substantial variance across models and datasets. Models are reliable evaluators on some tasks, but overall display substantial variability depending on the property being evaluated, the expertise level of the human judges, and whether the language is human or model-generated. We conclude that LLMs should be carefully validated against human judgments before being used as evaluators.
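To make the core measurement concrete, the sketch below (not the JUDGE-BENCH code) shows one standard way to quantify how well an LLM judge replicates human annotations on a single dataset: Spearman correlation for graded judgments and Cohen's kappa for categorical ones. All score and label lists are hypothetical placeholders.

# Minimal illustrative sketch, not the JUDGE-BENCH implementation:
# measure agreement between an LLM "judge" and human annotators.
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-item ratings (1-5 scale) from humans and from an LLM judge
human_scores = [4, 2, 5, 3, 1]
llm_scores = [5, 2, 4, 3, 2]
rho, _ = spearmanr(human_scores, llm_scores)
print(f"Spearman correlation (graded judgments): {rho:.2f}")

# Hypothetical categorical judgments, compared with Cohen's kappa
human_labels = ["good", "bad", "good", "good", "bad"]
llm_labels = ["good", "bad", "bad", "good", "bad"]
kappa = cohen_kappa_score(human_labels, llm_labels)
print(f"Cohen's kappa (categorical judgments): {kappa:.2f}")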
I see what you mean: Co-Speech Gestures for Reference Resolution in Multimodal Dialogue
Esam Ghaleb | Bulat Khaertdinov | Asli Ozyurek | Raquel Fernández
Findings of the Association for Computational Linguistics: ACL 2025
In face-to-face interaction, we use multiple modalities, including speech and gestures, to communicate information and resolve references to objects. However, how representational co-speech gestures refer to objects remains understudied from a computational perspective. In this work, we address this gap by introducing a multimodal reference resolution task centred on representational gestures, while simultaneously tackling the challenge of learning robust gesture embeddings. We propose a self-supervised pre-training approach to gesture representation learning that grounds body movements in spoken language. Our experiments show that the learned embeddings align with expert annotations and have significant predictive power. Moreover, reference resolution accuracy further improves when (1) using multimodal gesture representations, even when speech is unavailable at inference time, and (2) leveraging dialogue history. Overall, our findings highlight the complementary roles of gesture and speech in reference resolution, offering a step towards more naturalistic models of human-machine interaction.
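As a rough illustration of what "grounding body movements in spoken language" can look like computationally, the sketch below implements a generic contrastive (InfoNCE-style) objective that aligns gesture embeddings with the speech they co-occur with. The encoders, embedding size, and temperature are assumptions for illustration; this is not the authors' actual pre-training method.

# Illustrative sketch only: a generic contrastive objective aligning gesture
# embeddings with co-occurring speech embeddings. Not the paper's implementation.
import torch
import torch.nn.functional as F

def gesture_speech_contrastive_loss(gesture_emb, speech_emb, temperature=0.07):
    # Normalise both modalities, then treat matching (gesture, speech) pairs
    # as positives and all other pairs in the batch as negatives.
    gesture_emb = F.normalize(gesture_emb, dim=-1)
    speech_emb = F.normalize(speech_emb, dim=-1)
    logits = gesture_emb @ speech_emb.T / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric InfoNCE: gesture-to-speech and speech-to-gesture directions
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Toy usage with random tensors standing in for encoder outputs
gestures = torch.randn(8, 256)  # hypothetical batch of gesture embeddings
speech = torch.randn(8, 256)    # hypothetical batch of speech embeddings
print(gesture_speech_contrastive_loss(gestures, speech).item())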