Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems
Differential Evaluation: a Qualitative Analysis of Natural Language Processing System Behavior Based Upon Data Resistance to Processing
Lucie Gianola | Hicham El Boukkouri | Cyril Grouin | Thomas Lavergne | Patrick Paroubek | Pierre Zweigenbaum
Most of the time, when dealing with a particular Natural Language Processing task, systems are compared on the basis of global statistics such as recall, precision, F1-score, etc. While such scores provide a general idea of the behavior of these systems, they ignore a key piece of information that can be useful for assessing progress and discerning remaining challenges: the relative difficulty of test instances. To address this shortcoming, we introduce the notion of differential evaluation which effectively defines a pragmatic partition of instances into gradually more difficult bins by leveraging the predictions made by a set of systems. Comparing systems along these difficulty bins enables us to produce a finer-grained analysis of their relative merits, which we illustrate on two use-cases: a comparison of systems participating in a multi-label text classification task (CLEF eHealth 2018 ICD-10 coding), and a comparison of neural models trained for biomedical entity detection (BioCreative V chemical-disease relations dataset).
Data annotation plays a crucial role in ensuring your named entity recognition (NER) projects are trained with the right information to learn from. Producing the most accurate labels is a challenge due to the complexity involved with annotation. Label inconsistency between multiple subsets of data annotation (e.g., training set and test set, or multiple training subsets) is an indicator of label mistakes. In this work, we present an empirical method to explore the relationship between label (in-)consistency and NER model performance. It can be used to validate the label consistency (or catches the inconsistency) in multiple sets of NER data annotation. In experiments, our method identified the label inconsistency of test data in SCIERC and CoNLL03 datasets (with 26.7% and 5.4% label mistakes). It validated the consistency in the corrected version of both datasets.
Despite their success, modern language models are fragile. Even small changes in their training pipeline can lead to unexpected results. We study this phenomenon by examining the robustness of ALBERT (Lan et al., 2020) in combination with Stochastic Weight Averaging (SWA)—a cheap way of ensembling—on a sentiment analysis task (SST-2). In particular, we analyze SWA’s stability via CheckList criteria (Ribeiro et al., 2020), examining the agreement on errors made by models differing only in their random seed. We hypothesize that SWA is more stable because it ensembles model snapshots taken along the gradient descent trajectory. We quantify stability by comparing the models’ mistakes with Fleiss’ Kappa (Fleiss, 1971) and overlap ratio scores. We find that SWA reduces error rates in general; yet the models still suffer from their own distinct biases (according to CheckList).
This paper presents StoryDB — a broad multi-language dataset of narratives. StoryDB is a corpus of texts that includes stories in 42 different languages. Every language includes 500+ stories. Some of the languages include more than 20 000 stories. Every story is indexed across languages and labeled with tags such as a genre or a topic. The corpus shows rich topical and language variation and can serve as a resource for the study of the role of narrative in natural language processing across various languages including low resource ones. We also demonstrate how the dataset could be used to benchmark three modern multilanguage models, namely, mDistillBERT, mBERT, and XLM-RoBERTa.
To address a looming crisis of unreproducible evaluation for named entity recognition, we propose guidelines and introduce SeqScore, a software package to improve reproducibility. The guidelines we propose are extremely simple and center around transparency regarding how chunks are encoded and scored. We demonstrate that despite the apparent simplicity of NER evaluation, unreported differences in the scoring procedure can result in changes to scores that are both of noticeable magnitude and statistically significant. We describe SeqScore, which addresses many of the issues that cause replication failures.
In this paper, we introduce a new embedding-based metric relying on trainable ranking models to evaluate the semantic accuracy of neural data-to-text generators. This metric is especially well suited to semantically and factually assess the performance of a text generator when tables can be associated with multiple references and table values contain textual utterances. We first present how one can implement and further specialize the metric by training the underlying ranking models on a legal Data-to-Text dataset. We show how it may provide a more robust evaluation than other evaluation schemes in challenging settings using a dataset comprising paraphrases between the table values and their respective references. Finally, we evaluate its generalization capabilities on a well-known dataset, WebNLG, by comparing it with human evaluation and a recently introduced metric based on natural language inference. We then illustrate how it naturally characterizes, both quantitatively and qualitatively, omissions and hallucinations.
Automatic readability assessment (ARA) is the task of automatically assessing readability with little or no human supervision. ARA is essential for many second language acquisition applications to reduce the workload of annotators, who are usually language teachers. Previous unsupervised approaches manually searched textual features that correlated well with readability labels, such as perplexity scores of large language models. This paper argues that, to evaluate an assessors’ performance, rank-correlation coefficients should be used instead of Pearson’s correlation coefficient (𝜌). In the experiments, we show that its performance can be easily underestimated using Pearson’s 𝜌, which is significantly affected by the linearity of the output readability scores. We also propose a lightweight unsupervised readability assessor that achieved the best performance in both the rank correlations and Pearson’s 𝜌 among all unsupervised assessors compared.
The benchmark performance of cross-database semantic parsing has climbed steadily in recent years, catalyzed by the wide adoption of pre-trained language models. Yet existing work have shown that state-of-the-art cross-database semantic parsers struggle to generalize to novel user utterances, databases and query structures. To obtain transparent details on the strengths and limitation of these models, we propose a diagnostic testing approach based on controlled synthesis of canonical natural language and SQL pairs. Inspired by the CheckList, we characterize a set of essential capabilities for cross-database semantic parsing models, and detailed the method for synthesizing the corresponding test data. We evaluated a variety of high performing models using the proposed approach, and identified several non-obvious weaknesses across models (e.g. unable to correctly select many columns). Our dataset and code are released as a test suite at http://github.com/hclent/BehaviorCheckingSemPar.
Learning authors representations from their textual productions is now widely used to solve multiple downstream tasks, such as classification, link prediction or user recommendation. Author embedding methods are often built on top of either Doc2Vec (Mikolov et al. 2014) or the Transformer architecture (Devlin et al. 2019). Evaluating the quality of these embeddings and what they capture is a difficult task. Most articles use either classification accuracy or authorship attribution, which does not clearly measure the quality of the representation space, if it really captures what it has been built for. In this paper, we propose a novel evaluation framework of author embedding methods based on the writing style. It allows to quantify if the embedding space effectively captures a set of stylistic features, chosen to be the best proxy of an author writing style. This approach gives less importance to the topics conveyed by the documents. It turns out that recent models are mostly driven by the inner semantic of authors’ production. They are outperformed by simple baselines, based on state-of-the-art pretrained sentence embedding models, on several linguistic axes. These baselines can grasp complex linguistic phenomena and writing style more efficiently, paving the way for designing new style-driven author embedding models.
We propose a new reference-free summary quality evaluation measure, with emphasis on the faithfulness. The measure is based on finding and counting all probable potential inconsistencies of the summary with respect to the source document. The proposed ESTIME, Estimator of Summary-to-Text Inconsistency by Mismatched Embeddings, correlates with expert scores in summary-level SummEval dataset stronger than other common evaluation measures not only in Consistency but also in Fluency. We also introduce a method of generating subtle factual errors in human summaries. We show that ESTIME is more sensitive to subtle errors than other common evaluation measures.
Detecting lexical semantic change in smaller data sets, e.g. in historical linguistics and digital humanities, is challenging due to a lack of statistical power. This issue is exacerbated by non-contextual embedding models that produce one embedding per word and, therefore, mask the variability present in the data. In this article, we propose an approach to estimate semantic shift by combining contextual word embeddings with permutation-based statistical tests. We use the false discovery rate procedure to address the large number of hypothesis tests being conducted simultaneously. We demonstrate the performance of this approach in simulation where it achieves consistently high precision by suppressing false positives. We additionally analyze real-world data from SemEval-2020 Task 1 and the Liverpool FC subreddit corpus. We show that by taking sample variation into account, we can improve the robustness of individual semantic shift estimates without degrading overall performance.
Reference-based automatic evaluation metrics are notoriously limited for NLG due to their inability to fully capture the range of possible outputs. We examine a referenceless alternative: evaluating the adequacy of English sentences generated from Abstract Meaning Representation (AMR) graphs by parsing into AMR and comparing the parse directly to the input. We find that the errors introduced by automatic AMR parsing substantially limit the effectiveness of this approach, but a manual editing study indicates that as parsing improves, parsing-based evaluation has the potential to outperform most reference-based metrics.
Code-mixing is a phenomenon of mixing words and phrases from two or more languages in a single utterance of speech and text. Due to the high linguistic diversity, code-mixing presents several challenges in evaluating standard natural language generation (NLG) tasks. Various widely popular metrics perform poorly with the code-mixed NLG tasks. To address this challenge, we present a metric in- dependent evaluation pipeline MIPE that significantly improves the correlation between evaluation metrics and human judgments on the generated code-mixed text. As a use case, we demonstrate the performance of MIPE on the machine-generated Hinglish (code-mixing of Hindi and English languages) sentences from the HinGE corpus. We can extend the proposed evaluation strategy to other code-mixed language pairs, NLG tasks, and evaluation metrics with minimal to no effort.
We present the joint contribution of Instituto Superior Técnico (IST) and Unbabel to the Explainable Quality Estimation (QE) shared task, where systems were submitted to two tracks: constrained (without word-level supervision) and unconstrained (with word-level supervision). For the constrained track, we experimented with several explainability methods to extract the relevance of input tokens from sentence-level QE models built on top of multilingual pre-trained transformers. Among the different tested methods, composing explanations in the form of attention weights scaled by the norm of value vectors yielded the best results. When word-level labels are used during training, our best results were obtained by using word-level predicted probabilities. We further improve the performance of our methods on the two tracks by ensembling explanation scores extracted from models trained with different pre-trained transformers, achieving strong results for in-domain and zero-shot language pairs.
Quality Estimation (QE) for Machine Translation has been shown to reach relatively high accuracy in predicting sentence-level scores, relying on pretrained contextual embeddings and human-produced quality scores. However, the lack of explanations along with decisions made by end-to-end neural models makes the results difficult to interpret. Furthermore, word-level annotated datasets are rare due to the prohibitive effort required to perform this task, while they could provide interpretable signals in addition to sentence-level QE outputs. In this paper, we propose a novel QE architecture which tackles both the word-level data scarcity and the interpretability limitations of recent approaches. Sentence-level and word-level components are jointly pretrained through an attention mechanism based on synthetic data and a set of MT metrics embedded in a common space. Our approach is evaluated on the Eval4NLP 2021 shared task and our submissions reach the first position in all language pairs. The extraction of metric-to-input attention weights show that different metrics focus on different parts of the source and target text, providing strong rationales in the decision-making process of the QE model.
Many modern machine translation evaluation metrics like BERTScore, BLEURT, COMET, MonoTransquest or XMoverScore are based on black-box language models. Hence, it is difficult to explain why these metrics return certain scores. This year’s Eval4NLP shared task tackles this challenge by searching for methods that can extract feature importance scores that correlate well with human word-level error annotations. In this paper we show that unsupervised metrics that are based on tokenmatching can intrinsically provide such scores. The submitted system interprets the similarities of the contextualized word-embeddings that are used to compute (X)BERTScore as word-level importance scores.
In this paper, we introduce the Eval4NLP-2021 shared task on explainable quality estimation. Given a source-translation pair, this shared task requires not only to provide a sentence-level score indicating the overall quality of the translation, but also to explain this score by identifying the words that negatively impact translation quality. We present the data, annotation guidelines and evaluation setup of the shared task, describe the six participating systems, and analyze the results. To the best of our knowledge, this is the first shared task on explainable NLP evaluation metrics. Datasets and results are available at https://github.com/eval4nlp/SharedTask2021.
Authorship attribution is the task of assigning an unknown document to an author from a set of candidates. In the past, studies in this field use various evaluation datasets to demonstrate the effectiveness of preprocessing steps, features, and models. However, only a small fraction of works use more than one dataset to prove claims. In this paper, we present a collection of highly diverse authorship attribution datasets, which better generalizes evaluation results from authorship attribution research. Furthermore, we implement a wide variety of previously used machine learning models and show that many approaches show vastly different performances when applied to different datasets. We include pre-trained language models, for the first time testing them in this field in a systematic way. Finally, we propose a set of aggregated scores to evaluate different aspects of the dataset collection.
Product reviews and satisfaction surveys seek customer feedback in the form of ranked scales. In these settings, widely used evaluation metrics including F1 and accuracy ignore the rank in the responses (e.g., ‘very likely’ is closer to ‘likely’ than ‘not at all’). In this paper, we hypothesize that the order of class values is important for evaluating classifiers on ordinal target variables and should not be disregarded. To test this hypothesis, we compared Multi-class Classification (MC) and Ordinal Regression (OR) by applying OR and MC to benchmark tasks involving ordinal target variables using the same underlying model architecture. Experimental results show that while MC outperformed OR for some datasets in accuracy and F1, OR is significantly better than MC for minimizing the error between prediction and target for all benchmarks, as revealed by error-sensitive metrics, e.g. mean-squared error (MSE) and Spearman correlation. Our findings motivate the need to establish consistent, error-sensitive metrics for evaluating benchmarks with ordinal target variables, and we hope that it stimulates interest in exploring alternative losses for ordinal problems.
Text generation is a highly active area of research in the computational linguistic community. The evaluation of the generated text is a challenging task and multiple theories and metrics have been proposed over the years. Unfortunately, text generation and evaluation are relatively understudied due to the scarcity of high-quality resources in code-mixed languages where the words and phrases from multiple languages are mixed in a single utterance of text and speech. To address this challenge, we present a corpus (HinGE) for a widely popular code-mixed language Hinglish (code-mixing of Hindi and English languages). HinGE has Hinglish sentences generated by humans as well as two rule-based algorithms corresponding to the parallel Hindi-English sentences. In addition, we demonstrate the in- efficacy of widely-used evaluation metrics on the code-mixed data. The HinGE dataset will facilitate the progress of natural language generation research in code-mixed languages.
SemEval is the primary venue in the NLP community for the proposal of new challenges and for the systematic empirical evaluation of NLP systems. This paper provides a systematic quantitative analysis of SemEval aiming to evidence the patterns of the contributions behind SemEval. By understanding the distribution of task types, metrics, architectures, participation and citations over time we aim to answer the question on what is being evaluated by SemEval.
This paper describes the UMD submission to the Explainable Quality Estimation Shared Task at the EMNLP 2021 Workshop on “Evaluation & Comparison of NLP Systems”. We participated in the word-level and sentence-level MT Quality Estimation (QE) constrained tasks for all language pairs: Estonian-English, Romanian-English, German-Chinese, and Russian-German. Our approach combines the predictions of a word-level explainer model on top of a sentence-level QE model and a sequence labeler trained on synthetic data. These models are based on pre-trained multilingual language models and do not require any word-level annotations for training, making them well suited to zero-shot settings. Our best-performing system improves over the best baseline across all metrics and language pairs, with an average gain of 0.1 in AUC, Average Precision, and Recall at Top-K score.
Current research on quality estimation of machine translation focuses on the sentence-level quality of the translations. By using explainability methods, we can use these quality estimations for word-level error identification. In this work, we compare different explainability techniques and investigate gradient-based and perturbation-based methods by measuring their performance and required computational efforts. Throughout our experiments, we observed that using absolute word scores boosts the performance of gradient-based explainers significantly. Further, we combine explainability methods to ensembles to exploit the strengths of individual explainers to get better explanations. We propose the usage of absolute gradient-based methods. These work comparably well to popular perturbation-based ones while being more time-efficient.
This paper describes our participating system in the shared task Explainable quality estimation of 2nd Workshop on Evaluation & Comparison of NLP Systems. The task of quality estimation (QE, a.k.a. reference-free evaluation) is to predict the quality of MT output at inference time without access to reference translations. In this proposed work, we first build a word-level quality estimation model, then we finetune this model for sentence-level QE. Our proposed models achieve near state-of-the-art results. In the word-level QE, we place 2nd and 3rd on the supervised Ro-En and Et-En test sets. In the sentence-level QE, we achieve a relative improvement of 8.86% (Ro-En) and 10.6% (Et-En) in terms of the Pearson correlation coefficient over the baseline model.