John Mendonça

Also published as: John Mendonca


2023

Simple LLM Prompting is State-of-the-Art for Robust and Multilingual Dialogue Evaluation
John Mendonça | Patrícia Pereira | Helena Moniz | Joao Paulo Carvalho | Alon Lavie | Isabel Trancoso
Proceedings of The Eleventh Dialog System Technology Challenge

Despite significant research effort in the development of automatic dialogue evaluation metrics, little attention has been paid to evaluating dialogues in languages other than English. Likewise, ensuring that metrics are invariant to semantically similar responses is an overlooked topic. To achieve the desired properties of robustness and multilinguality in dialogue evaluation metrics, we propose a novel framework that combines the strengths of current evaluation models with the newly established paradigm of prompting Large Language Models (LLMs). Empirical results show that our framework achieves state-of-the-art mean Spearman correlation scores across several benchmarks and ranks first in both the Robust and Multilingual tasks of DSTC11 Track 4, “Automatic Evaluation Metrics for Open-Domain Dialogue Systems”, demonstrating the evaluation capabilities of prompted LLMs.
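The prompting approach can be illustrated with a minimal sketch. The template, quality dimension, and 1-to-5 scale below are illustrative assumptions, not the paper's actual prompts; the string would be sent to an LLM whose reply is parsed into a score.

```python
# Minimal sketch of LLM-based dialogue evaluation via prompting.
# The template and score range are illustrative, not the paper's exact prompts.

def build_eval_prompt(context, response, quality="overall"):
    """Compose a prompt asking an LLM to rate a dialogue response."""
    return (
        "Given the following dialogue context:\n"
        f"{context}\n\n"
        f"Rate the {quality} quality of this response on a scale of 1 to 5:\n"
        f"{response}\n\n"
        "Score:"
    )

prompt = build_eval_prompt("A: Hi, how are you?", "B: I'm fine, thanks!")
print(prompt)
```

In practice the same template can be instantiated per language and per quality dimension, which is what makes prompting attractive for multilingual evaluation.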

Towards Multilingual Automatic Open-Domain Dialogue Evaluation
John Mendonca | Alon Lavie | Isabel Trancoso
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue

The main factor limiting the development of robust multilingual open-domain dialogue evaluation metrics is the lack of multilingual data and the limited availability of open-sourced multilingual dialogue systems. In this work, we propose a workaround for this lack of data: leveraging a strong multilingual pretrained encoder-based Language Model and augmenting existing English dialogue data using Machine Translation. We show empirically that the naive approach of finetuning a pretrained multilingual encoder model on translated data is insufficient to outperform the strong baseline of finetuning a multilingual model on source data alone. Instead, the best approach carefully curates the translated data using MT Quality Estimation metrics, excluding low-quality translations that hinder performance.
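The QE-based curation step can be sketched in a few lines. This is a simplified stand-in for the paper's pipeline: it assumes each machine-translated example already carries a quality-estimation score in [0, 1], and the 0.8 threshold is illustrative, not the paper's value.

```python
# Sketch of QE-based curation of machine-translated training data.
# Assumes each example carries a precomputed QE score in [0, 1];
# the 0.8 threshold is illustrative only.

def curate_by_qe(examples, threshold=0.8):
    """Keep only translations whose QE score meets the threshold."""
    return [ex for ex in examples if ex["qe_score"] >= threshold]

data = [
    {"text": "Bonjour, comment ça va ?", "qe_score": 0.95},
    {"text": "??? garbled output",       "qe_score": 0.30},
]
print(curate_by_qe(data))  # only the high-quality translation survives
```

The curated subset, rather than the full translated corpus, is then used for finetuning, which is what lets translation augmentation beat the source-only baseline.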

2022

Towards Speaker Verification for Crowdsourced Speech Collections
John Mendonca | Rui Correia | Mariana Lourenço | João Freitas | Isabel Trancoso
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Crowdsourcing the collection of speech provides a scalable way to reach a customisable demographic according to each dataset’s needs. The correctness of speaker metadata is especially relevant for speaker-centred collections, i.e., those that require a fixed amount of data per speaker. This paper identifies two types of misalignment present in such collections: Multiple Accounts misalignment (different contributor accounts map to the same speaker) and Multiple Speakers misalignment (multiple speakers map to the same contributor account). Building on state-of-the-art approaches to Speaker Verification, this paper proposes an unsupervised method for measuring the speaker-metadata plausibility of a collection, i.e., evaluating the match (or lack thereof) between contributors and speakers. The proposed solution is composed of an embedding extractor and a clustering module. Results indicate high precision in automatically classifying contributor alignment (>0.94).
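The embedding-plus-clustering pipeline can be sketched with a toy threshold clusterer. This is a simplified stand-in for the paper's actual speaker-embedding extractor and clustering module: the 2-D vectors and the 0.8 cosine threshold are illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def greedy_cluster(embeddings, threshold=0.8):
    """Assign each embedding to the first cluster whose centroid is
    similar enough; otherwise start a new cluster."""
    clusters = []
    for emb in embeddings:
        for cluster in clusters:
            centroid = [sum(vals) / len(vals) for vals in zip(*cluster)]
            if cosine(emb, centroid) >= threshold:
                cluster.append(emb)
                break
        else:
            clusters.append([emb])
    return clusters

# Toy speaker embeddings for utterances uploaded under ONE contributor account:
# two similar voices plus one distinct voice.
embs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
print(len(greedy_cluster(embs)))  # 2
```

Finding more clusters than contributor accounts signals a Multiple Speakers misalignment; utterances from different accounts collapsing into one cluster signals a Multiple Accounts misalignment.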

QualityAdapt: an Automatic Dialogue Quality Estimation Framework
John Mendonca | Alon Lavie | Isabel Trancoso
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue

Despite considerable advances in open-domain neural dialogue systems, their evaluation remains a bottleneck. Several automated metrics have been proposed to evaluate these systems; however, they mostly focus on a single notion of quality or, when they do combine several sub-metrics, are computationally expensive. This paper tackles the latter problem: QualityAdapt leverages the Adapter framework for the task of Dialogue Quality Estimation. Using well-defined semi-supervised tasks, we train adapters for different subqualities and score generated responses with AdapterFusion. This compositionality provides a metric that is easily adapted to the task at hand and incorporates multiple subqualities. It also reduces computational costs, since individual predictions for all subqualities are obtained in a single forward pass. The approach achieves results comparable to state-of-the-art metrics on several datasets while retaining the aforementioned advantages.