Paraphrase Generation Evaluation Powered by an LLM: A Semantic Metric, Not a Lexical One

Quentin Lemesle; Jonathan Chevelu; Philippe Martin; Damien Lolive; Arnaud Delhay; Nelly Barbot

Abstract

Evaluating automatic paraphrase production systems is a difficult task because it involves, among other things, assessing the semantic proximity between two sentences. Usual measures are based on lexical distances, or at least on semantic embedding alignments. The rise of Large Language Models (LLMs) has provided tools to model relationships within a text thanks to the attention mechanism. In this article, we introduce ParaPLUIE, a new measure based on a log-likelihood ratio from an LLM, to assess the quality of a potential paraphrase. This measure is compared with usual measures on two datasets that were already known to the NLP community prior to this study. Three new small datasets have been built to allow metrics to be compared in different scenarios and to avoid data contamination bias. According to our evaluations, the proposed measure is better at sorting pairs of sentences by semantic proximity. In particular, it is much more independent of lexical distance and provides an interpretable classification threshold between paraphrases and non-paraphrases.

Anthology ID: 2025.coling-main.538
Volume: Proceedings of the 31st International Conference on Computational Linguistics
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 8057–8087
URL: https://preview.aclanthology.org/jlcl-multiple-ingestion/2025.coling-main.538/
Cite (ACL): Quentin Lemesle, Jonathan Chevelu, Philippe Martin, Damien Lolive, Arnaud Delhay, and Nelly Barbot. 2025. Paraphrase Generation Evaluation Powered by an LLM: A Semantic Metric, Not a Lexical One. In Proceedings of the 31st International Conference on Computational Linguistics, pages 8057–8087, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): Paraphrase Generation Evaluation Powered by an LLM: A Semantic Metric, Not a Lexical One (Lemesle et al., COLING 2025)
PDF: https://preview.aclanthology.org/jlcl-multiple-ingestion/2025.coling-main.538.pdf
