Stefana Arina Tabusca


2025

ReproHum #0033-05: Human Evaluation of Factuality from A Multidisciplinary Perspective
Andra-Maria Florescu | Marius Micluța-Câmpeanu | Stefana Arina Tabusca | Liviu P Dinu
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)

This paper is a joint contribution to the 2025 ReproNLP shared task, part of the ReproHum project. We focused on reproducing the human evaluation for one criterion, namely, factuality of Scientific Automated Generated Systems from August et al. (2022). In accordance with the ReproHum guidelines, we followed the original study as closely as possible, with two human raters who coded 300 ratings each. In addition, we conducted a further study on two domain-based subsets of the dataset (medicine and physics), for which we employed expert annotators. Our reproduction of the factuality assessment found similar overall rates of factual inaccuracies across models. However, variability and weak agreement with the original model rankings suggest challenges in reliably reproducing results, especially when results are close.