Giuseppe Savino


2025

Do Large Language Models understand how to be judges?
Nicolò Donati | Paolo Torroni | Giuseppe Savino
Proceedings of the 2nd LUHME Workshop

This paper investigates whether Large Language Models (LLMs) can effectively act as judges for evaluating open-ended text generation tasks, such as summarization, by interpreting nuanced editorial criteria. Traditional metrics like ROUGE and BLEU rely on surface-level overlap, while human evaluations remain costly and inconsistent. To address this, we propose a structured rubric with five dimensions: coherence, consistency, fluency, relevance, and ordering, each defined with explicit sub-criteria to guide LLMs in assessing semantic fidelity and structural quality. Using a purpose-built dataset of Italian news summaries generated by GPT-4o, each tailored to isolate specific criteria, we evaluate LLMs’ ability to assign scores and rationales aligned with expert human judgments. Results show moderate alignment (Spearman’s ρ = 0.6–0.7) for criteria like relevance but reveal systematic biases, such as overestimating fluency and coherence, likely due to training data biases. We identify challenges in rubric interpretation, particularly for hierarchical or abstract criteria, and highlight limitations in cross-genre generalization. The study underscores the potential of LLMs as scalable evaluators but emphasizes the need for fine-tuning, diverse benchmarks, and refined rubrics to mitigate biases and enhance reliability. Future directions include expanding to multilingual and multi-genre contexts and exploring task-specific instruction tuning to improve alignment with human editorial standards.
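The reported alignment is a rank correlation between LLM-assigned rubric scores and human expert scores. As a minimal illustration of that evaluation setup (not the paper's actual code), one might compute per-dimension Spearman's ρ as in the sketch below; all score values are hypothetical placeholders on a 1-5 scale.

```python
# Illustrative sketch: rank alignment between LLM-as-judge scores and
# human expert scores for each rubric dimension (hypothetical data).
from scipy.stats import spearmanr

DIMENSIONS = ["coherence", "consistency", "fluency", "relevance", "ordering"]

# One score per summary, per dimension (placeholder values).
human_scores = {
    "coherence":   [4, 3, 5, 2, 4],
    "consistency": [5, 4, 4, 3, 5],
    "fluency":     [4, 4, 5, 3, 4],
    "relevance":   [3, 2, 5, 4, 4],
    "ordering":    [4, 3, 4, 2, 5],
}
llm_scores = {
    "coherence":   [5, 4, 5, 4, 5],  # e.g. a systematic over-estimation
    "consistency": [5, 4, 4, 3, 4],
    "fluency":     [5, 5, 5, 4, 5],
    "relevance":   [3, 2, 4, 4, 4],
    "ordering":    [4, 4, 4, 3, 5],
}

for dim in DIMENSIONS:
    rho, p = spearmanr(human_scores[dim], llm_scores[dim])
    print(f"{dim:12s} Spearman's rho = {rho:.2f} (p = {p:.3f})")
```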

2024

Generation and Evaluation of English Grammar Multiple-Choice Cloze Exercises
Nicolò Donati | Matteo Periani | Paolo Di Natale | Giuseppe Savino | Paolo Torroni
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

English grammar Multiple-Choice Cloze (MCC) exercises are crucial for improving learners' grammatical proficiency and comprehension skills. However, creating these exercises is labour-intensive and requires expert knowledge. Effective MCC exercises must be contextually relevant and engaging, incorporating distractors (plausible but incorrect alternatives) to balance difficulty and maintain learner motivation. Despite the increasing interest in utilizing large language models (LLMs) in education, their application in generating English grammar MCC exercises is still limited. Previous methods typically impose constraints on LLMs, producing grammatically correct yet uncreative results. This paper explores the potential of LLMs to independently generate diverse and contextually relevant MCC exercises without predefined limitations. We hypothesize that LLMs can craft self-contained sentences that foster learners' communicative competence. Our analysis of existing MCC exercise datasets revealed issues of diversity, completeness, and correctness. Furthermore, we address the lack of a standardized automatic metric for evaluating the quality of generated exercises. Our contributions include developing an LLM-based solution for generating MCC exercises, curating a comprehensive dataset spanning 19 grammar topics, and proposing an automatic metric validated against human expert evaluations. This work aims to advance the automatic generation of English grammar MCC exercises, enhancing both their quality and creativity.
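As a rough illustration of the kind of exercise object and well-formedness checks such a generation-and-evaluation pipeline deals with, the sketch below encodes an MCC item with a stem, a correct answer, and distractors. All names and checks are hypothetical assumptions for illustration, not the paper's implementation or metric.

```python
# Illustrative sketch: a minimal data structure for a Multiple-Choice Cloze
# (MCC) exercise plus basic structural checks of the kind an automatic
# quality metric might build on. Hypothetical, not the paper's pipeline.
from dataclasses import dataclass


@dataclass
class MCCExercise:
    stem: str               # sentence containing a single "___" blank
    answer: str             # the grammatically correct option
    distractors: list[str]  # plausible but incorrect alternatives
    topic: str              # e.g. "past perfect"

    def is_well_formed(self) -> bool:
        """Exactly one blank, non-empty answer, at least two unique
        distractors, and no distractor duplicating the correct answer."""
        return (
            self.stem.count("___") == 1
            and bool(self.answer)
            and len(self.distractors) >= 2
            and self.answer not in self.distractors
            and len(set(self.distractors)) == len(self.distractors)
        )


example = MCCExercise(
    stem="By the time we arrived, the film ___ already started.",
    answer="had",
    distractors=["has", "have", "was"],
    topic="past perfect",
)
print(example.is_well_formed())  # True
```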