Albert Gatt

2021

pdf bib abs
On the Language-specificity of Multilingual BERT and the Impact of Fine-tuning
Marc Tanti | Lonneke van der Plas | Claudia Borg | Albert Gatt
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Recent work has shown evidence that the knowledge acquired by multilingual BERT (mBERT) has two components: a language-specific and a language-neutral one. This paper analyses the relationship between them, in the context of fine-tuning on two tasks – POS tagging and natural language inference – which require the model to bring to bear different degrees of language-specific knowledge. Visualisations reveal that mBERT loses the ability to cluster representations by language after fine-tuning, a result that is supported by evidence from language identification experiments. However, further experiments on ‘unlearning’ language-specific representations using gradient reversal and iterative adversarial learning are shown not to add further improvement to the language-independent component over and above the effect of fine-tuning. The results presented here suggest that the process of fine-tuning causes a reorganisation of the model’s limited representational capacity, enhancing language-independent representations at the expense of language-specific ones.

pdf bib abs
Entity-Based Semantic Adequacy for Data-to-Text Generation
Juliette Faille | Albert Gatt | Claire Gardent
Findings of the Association for Computational Linguistics: EMNLP 2021

While powerful pre-trained language models have improved the fluency of text generation models, semantic adequacy (the ability to generate text that is semantically faithful to the input) remains an unsolved issue. In this paper, we introduce a novel automatic evaluation metric, Entity-Based Semantic Adequacy, which can be used to assess to what extent generation models that verbalise RDF (Resource Description Framework) graphs produce text that contains mentions of the entities occurring in the RDF input. This is important as RDF subject and object entities make up 2/3 of the input. We use our metric to compare 25 models from the WebNLG Shared Tasks and we examine correlation with results from human evaluations of semantic adequacy. We show that while our metric correlates with human evaluation scores, this correlation varies with the specifics of the human evaluation setup. This suggests that in order to measure the entity-based adequacy of generated texts, an automatic metric such as the one proposed here might be more reliable, as less subjective and more focused on correct verbalisation of the input, than human evaluation measures.
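As a rough illustration of the idea of checking whether input entities are mentioned in generated text, the sketch below computes the fraction of subject and object entities whose surface form appears in an output sentence. It is not the metric defined in the paper: the function name `entity_coverage`, the example triples, and the use of case-insensitive exact string matching are all assumptions made for illustration.

```python
def entity_coverage(entities, text):
    """Fraction of input entities whose surface form occurs in the text.

    Hypothetical helper: matching is plain case-insensitive substring search,
    with underscores in entity names treated as spaces.
    """
    if not entities:
        return 1.0
    lowered = text.lower()
    found = [e for e in entities if e.replace("_", " ").lower() in lowered]
    return len(found) / len(entities)


# Toy RDF-style triples (subject, property, object); invented for this example.
triples = [
    ("Alan_Bean", "occupation", "Test_pilot"),
    ("Alan_Bean", "birthPlace", "Wheeler,_Texas"),
]

# Collect the distinct subject and object entities of the input graph.
entities = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})

full = entity_coverage(entities, "Alan Bean, born in Wheeler, Texas, worked as a test pilot.")
partial = entity_coverage(entities, "Alan Bean was a test pilot.")
print(full, partial)
```

Exact matching of this kind would miss paraphrases and referring expressions such as pronouns, so any practical variant would need a more tolerant way of detecting mentions.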

pdf bib abs
Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks
Letitia Parcalabescu | Albert Gatt | Anette Frank | Iacer Calixto
Proceedings of the 1st Workshop on Multimodal Semantic Representations (MMSR)

We investigate the reasoning ability of pretrained vision and language (V&L) models in two tasks that require multimodal integration: (1) discriminating a correct image-sentence pair from an incorrect one, and (2) counting entities in an image. We evaluate three pretrained V&L models on these tasks: ViLBERT, ViLBERT 12-in-1 and LXMERT, in zero-shot and finetuned settings. Our results show that models solve task (1) very well, as expected, since all models are pretrained on task (1). However, none of the pretrained V&L models is able to adequately solve task (2), our counting probe, and they cannot generalise to out-of-distribution quantities. We propose a number of explanations for these findings: LXMERT (and to some extent ViLBERT 12-in-1) show some evidence of catastrophic forgetting on task (1). Concerning our results on the counting probe, we find evidence that all models are impacted by dataset bias, and also fail to individuate entities in the visual input. While a selling point of pretrained V&L models is their ability to solve complex tasks, our findings suggest that understanding their reasoning and grounding capabilities requires more targeted investigations on specific phenomena.

2020

pdf bib
Proceedings of the Workshop on Intelligent Information Processing and Natural Language Generation
Daniel Sánchez | Raquel Hervás | Albert Gatt
Proceedings of the Workshop on Intelligent Information Processing and Natural Language Generation

pdf bib abs
The Natural Language Pipeline, Neural Text Generation and Explainability
Juliette Faille | Albert Gatt | Claire Gardent
2nd Workshop on Interactive Natural Language Technology for Explainable Artificial Intelligence

End-to-end encoder-decoder approaches to data-to-text generation are often black boxes whose predictions are difficult to explain. Breaking up the end-to-end model into sub-modules is a natural way to address this problem. The traditional pre-neural Natural Language Generation (NLG) pipeline provides a framework for breaking up the end-to-end encoder-decoder. We survey recent papers that integrate traditional NLG submodules in neural approaches and analyse their explainability. Our survey is a first step towards building explainable neural NLG models.

pdf bib abs
Towards Harnessing Natural Language Generation to Explain Black-box Models
Ettore Mariotti | Jose M. Alonso | Albert Gatt
2nd Workshop on Interactive Natural Language Technology for Explainable Artificial Intelligence

The opaque nature of many machine learning techniques prevents the wide adoption of powerful information processing tools for high stakes scenarios. The emerging field of eXplainable Artificial Intelligence (XAI) aims at providing justifications for automatic decision-making systems in order to ensure reliability and trustworthiness in the users. To achieve this vision, we emphasize the importance of a natural language textual modality as a key component for a future intelligent interactive agent. We outline the challenges of XAI and review a set of publications that work in this direction.

Earlier research has shown that evaluation metrics based on textual similarity (e.g., BLEU, CIDEr, Meteor) do not correlate well with human evaluation scores for automatically generated text. We carried out an experiment with Chinese speakers, where we systematically manipulated image descriptions to contain different kinds of errors. Because our manipulated descriptions form minimal pairs with the reference descriptions, we are able to assess the impact of different kinds of errors on the perceived quality of the descriptions. Our results show that different kinds of errors elicit significantly different evaluation scores, even though all erroneous descriptions differ in only one character from the reference descriptions. Evaluation metrics based solely on textual similarity are unable to capture these differences, which (at least partially) explains their poor correlation with human judgments. Our work provides the foundations for future work, where we aim to understand why different errors are seen as more or less severe.

pdf bib
Proceedings of LREC2020 Workshop "People in language, vision and the mind" (ONION2020)
Patrizia Paggio | Albert Gatt | Roman Klinger
Proceedings of LREC2020 Workshop "People in language, vision and the mind" (ONION2020)

pdf bib abs
On the interaction of automatic evaluation and task framing in headline style transfer
Lorenzo De Mattei | Michele Cafagna | Huiyuan Lai | Felice Dell’Orletta | Malvina Nissim | Albert Gatt
Proceedings of the 1st Workshop on Evaluating NLG Evaluation

An ongoing debate in the NLG community concerns the best way to evaluate systems, with human evaluation often being considered the most reliable method, compared to corpus-based metrics. However, tasks involving subtle textual differences, such as style transfer, tend to be hard for humans to perform. In this paper, we propose an evaluation method for this task based on purposely-trained classifiers, showing that it better reflects system differences than traditional metrics such as BLEU.

pdf bib abs
Unmasking Contextual Stereotypes: Measuring and Mitigating BERT’s Gender Bias
Marion Bartl | Malvina Nissim | Albert Gatt
Proceedings of the Second Workshop on Gender Bias in Natural Language Processing

Contextualized word embeddings have been replacing standard embeddings as the representational knowledge source of choice in NLP systems. Since a variety of biases have previously been found in standard word embeddings, it is crucial to assess biases encoded in their replacements as well. Focusing on BERT (Devlin et al., 2018), we measure gender bias by studying associations between gender-denoting target words and names of professions in English and German, comparing the findings with real-world workforce statistics. We mitigate bias by fine-tuning BERT on the GAP corpus (Webster et al., 2018), after applying Counterfactual Data Substitution (CDS) (Maudslay et al., 2019). We show that our method of measuring bias is appropriate for languages such as English, but not for languages with a rich morphology and gender-marking, such as German. Our results highlight the importance of investigating bias and mitigation techniques cross-linguistically, especially in view of the current emphasis on large-scale, multilingual language models.

pdf bib abs
Annotating for Hate Speech: The MaNeCo Corpus and Some Input from Critical Discourse Analysis
Stavros Assimakopoulos | Rebecca Vella Muskat | Lonneke van der Plas | Albert Gatt
Proceedings of the 12th Language Resources and Evaluation Conference

This paper presents a novel scheme for the annotation of hate speech in corpora of Web 2.0 commentary. The proposed scheme is motivated by the critical analysis of posts made in reaction to news reports on the Mediterranean migration crisis and LGBTIQ+ matters in Malta, which was conducted under the auspices of the EU-funded C.O.N.T.A.C.T. project. Based on the realisation that hate speech is not a clear-cut category to begin with, appears to belong to a continuum of discriminatory discourse and is often realised through the use of indirect linguistic means, it is argued that annotation schemes for its detection should refrain from directly including the label ‘hate speech,’ as different annotators might have different thresholds as to what constitutes hate speech and what not. In view of this, we propose a multi-layer annotation scheme, which is pilot-tested against a binary ±hate speech classification and appears to yield higher inter-annotator agreement. Motivating the postulation of our scheme, we then present the MaNeCo corpus on which it will eventually be used; a substantial corpus of on-line newspaper comments spanning 10 years.

Maltese, the national language of Malta, is spoken by approximately 500,000 people. Speech processing for Maltese is still in its early stages of development. In this paper, we present the first spoken Maltese corpus designed purposely for Automatic Speech Recognition (ASR). The MASRI-HEADSET corpus was developed by the MASRI project at the University of Malta. It consists of 8 hours of speech paired with text, recorded using short text snippets in a laboratory environment. The speakers were recruited from different geographical locations all over the Maltese islands, and were roughly evenly distributed by gender. This paper also presents some initial results achieved in baseline experiments for Maltese ASR using Sphinx and Kaldi. The MASRI-HEADSET corpus is publicly available for research/academic purposes.

2019

pdf bib abs
You Write like You Eat: Stylistic Variation as a Predictor of Social Stratification
Angelo Basile | Albert Gatt | Malvina Nissim
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Inspired by Labov’s seminal work on stylistic variation as a function of social stratification, we develop and compare neural models that predict a person’s presumed socio-economic status, obtained through distant supervision, from their writing style on social media. The focus of our work is on identifying the most important stylistic parameters to predict socio-economic group. In particular, we show the effectiveness of morpho-syntactic features as predictors of style, in contrast to lexical features, which are good predictors of topic.

pdf bib abs
Visually grounded generation of entailments from premises
Somayeh Jafaritazehjani | Albert Gatt | Marc Tanti
Proceedings of the 12th International Conference on Natural Language Generation

Natural Language Inference (NLI) is the task of determining the semantic relationship between a premise and a hypothesis. In this paper, we focus on the generation of hypotheses from premises in a multimodal setting, to generate a sentence (hypothesis) given an image and/or its description (premise) as the input. The main goals of this paper are (a) to investigate whether it is reasonable to frame NLI as a generation task; and (b) to consider the degree to which grounding textual premises in visual information is beneficial to generation. We compare different neural architectures, showing through automatic and human evaluation that entailments can indeed be generated successfully. We also show that multimodal models outperform unimodal models in this task, albeit marginally.

pdf bib abs
Best practices for the human evaluation of automatically generated text
Chris van der Lee | Albert Gatt | Emiel van Miltenburg | Sander Wubben | Emiel Krahmer
Proceedings of the 12th International Conference on Natural Language Generation

Currently, there is little agreement as to how Natural Language Generation (NLG) systems should be evaluated. While there is some agreement regarding automatic metrics, there is a high degree of variation in the way that human evaluation is carried out. This paper provides an overview of how human evaluation is currently conducted, and presents a set of best practices, grounded in the literature. With this paper, we hope to contribute to the quality and consistency of human evaluations in NLG.

2018

Capturing semantic relations between sentences, such as entailment, is a long-standing challenge for computational semantics. Logic-based models analyse entailment in terms of possible worlds (interpretations, or situations) where a premise P entails a hypothesis H iff in all worlds where P is true, H is also true. Statistical models view this relationship probabilistically, addressing it in terms of whether a human would likely infer H from P. In this paper, we wish to bridge these two perspectives, by arguing for a visually-grounded version of the Textual Entailment task. Specifically, we ask whether models can perform better if, in addition to P and H, there is also an image (corresponding to the relevant “world” or “situation”). We use a multimodal version of the SNLI dataset (Bowman et al., 2015) and we compare “blind” and visually-augmented models of textual entailment. We show that visual information is beneficial, but we also conduct an in-depth error analysis that reveals that current multimodal models are not performing “grounding” in an optimal fashion.
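The logic-based definition mentioned in this abstract (P entails H iff H is true in every world where P is true) can be made concrete with a small propositional sketch. This is only an illustration of that textbook definition, not of the models in the paper; the function `entails`, the atom names and the example propositions are hypothetical.

```python
from itertools import product


def entails(premise, hypothesis, atoms):
    """Return True iff the hypothesis holds in every world where the premise holds.

    A "world" is an assignment of truth values to the given atoms; premise and
    hypothesis are functions from a world (a dict) to a bool.
    """
    for values in product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        if premise(world) and not hypothesis(world):
            return False  # counterexample world found
    return True


# Hypothetical atoms standing in for simple facts about a scene.
atoms = ["dog_present", "dog_running"]


def p(world):
    # P: "A dog is running" (a dog is present and it is running).
    return world["dog_present"] and world["dog_running"]


def h(world):
    # H: "There is a dog".
    return world["dog_present"]


print(entails(p, h, atoms))  # P entails H
print(entails(h, p, atoms))  # H does not entail P
```

Enumerating all assignments is only feasible for a handful of atoms, which is one reason the statistical and visually grounded views described in the abstract are attractive in practice.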

pdf bib
Proceedings of the 11th International Conference on Natural Language Generation
Emiel Krahmer | Albert Gatt | Martijn Goudbeek
Proceedings of the 11th International Conference on Natural Language Generation

pdf bib abs
Meteorologists and Students: A resource for language grounding of geographical descriptors
Alejandro Ramos-Soto | Ehud Reiter | Kees van Deemter | Jose Alonso | Albert Gatt
Proceedings of the 11th International Conference on Natural Language Generation

We present a data resource which can be useful for research purposes on language grounding tasks in the context of geographical referring expression generation. The resource is composed of two data sets that encompass 25 different geographical descriptors and a set of associated graphical representations, drawn as polygons on a map by two groups of human subjects: teenage students and expert meteorologists.

pdf bib abs
Specificity measures and reference
Albert Gatt | Nicolás Marín | Gustavo Rivas-Gervilla | Daniel Sánchez
Proceedings of the 11th International Conference on Natural Language Generation

In this paper we study empirically the validity of measures of referential success for referring expressions involving gradual properties. More specifically, we study the ability of several measures of referential success to predict the success of a user in choosing the right object, given a referring expression. Experimental results indicate that certain fuzzy measures of success are able to predict human accuracy in reference resolution. Such measures are therefore suitable for the estimation of the success or otherwise of a referring expression produced by a generation algorithm, especially in case the properties in a domain cannot be assumed to have crisp denotations.

2017

pdf bib abs
Morphological Analysis for the Maltese Language: The challenges of a hybrid system
Claudia Borg | Albert Gatt
Proceedings of the Third Arabic Natural Language Processing Workshop

Maltese is a morphologically rich language with a hybrid morphological system which features both concatenative and non-concatenative processes. This paper analyses the impact of this hybridity on the performance of machine learning techniques for morphological labelling and clustering. In particular, we analyse a dataset of morphologically related word clusters to evaluate the difference in results for concatenative and non-concatenative clusters. We also describe research carried out in morphological labelling, with a particular focus on the verb category. Two evaluations were carried out, one using an unseen dataset, and another one using a gold standard dataset which was manually labelled. The gold standard dataset was split into concatenative and non-concatenative to analyse the difference in results between the two morphological systems.

pdf bib abs
What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?
Marc Tanti | Albert Gatt | Kenneth Camilleri
Proceedings of the 10th International Conference on Natural Language Generation

Image captioning has evolved into a core task for Natural Language Generation and has also proved to be an important testbed for deep learning approaches to handling multimodal representations. Most contemporary approaches rely on a combination of a convolutional network to handle image features, and a recurrent network to encode linguistic information. The latter is typically viewed as the primary “generation” component. Beyond this high-level characterisation, a CNN+RNN model supports a variety of architectural designs. The dominant model in the literature is one in which visual features encoded by a CNN are “injected” as part of the linguistic encoding process, driving the RNN’s linguistic choices. By contrast, it is possible to envisage an architecture in which visual and linguistic features are encoded separately, and merged at a subsequent stage. In this paper, we address two related questions: (1) Is direct injection the best way of combining multimodal information, or is a late merging alternative better for the image captioning task? (2) To what extent should a recurrent network be viewed as actually generating, rather than simply encoding, linguistic information?

System using BiLSTM and max pooling. Embedding is enhanced by POS, character and dependency info.

2015

pdf bib
Proceedings of the 15th European Workshop on Natural Language Generation (ENLG)
Anya Belz | Albert Gatt | François Portet | Matthew Purver
Proceedings of the 15th European Workshop on Natural Language Generation (ENLG)

2014

pdf bib abs
Crowd-sourcing evaluation of automatically acquired, morphologically related word groupings
Claudia Borg | Albert Gatt
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The automatic discovery and clustering of morphologically related words is an important problem with several practical applications. This paper describes the evaluation of word clusters carried out through crowd-sourcing techniques for the Maltese language. The hybrid (Semitic-Romance) nature of Maltese morphology, together with the fact that no large-scale lexical resources are available for Maltese, makes this an interesting and challenging problem.

pdf bib
Learning when to point: A data-driven approach
Albert Gatt | Patrizia Paggio
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

pdf bib
Proceedings of the 14th European Workshop on Natural Language Generation
Albert Gatt | Horacio Saggion
Proceedings of the 14th European Workshop on Natural Language Generation

pdf bib
What and Where: An Empirical Investigation of Pointing Gestures and Descriptions in Multimodal Referring Actions
Albert Gatt | Patrizia Paggio
Proceedings of the 14th European Workshop on Natural Language Generation

2012

pdf bib abs
A Repository of Data and Evaluation Resources for Natural Language Generation
Anja Belz | Albert Gatt
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Starting in 2007, the field of natural language generation (NLG) has organised shared-task evaluation events every year, under the Generation Challenges umbrella. In the course of these shared tasks, a wealth of data has been created, along with associated task definitions and evaluation regimes. In other contexts too, sharable NLG data is now being created. In this paper, we describe the online repository that we have created as a one-stop resource for obtaining NLG task materials, both from Generation Challenges tasks and from other sources, where the set of materials provided for each task consists of (i) task definition, (ii) input and output data, (iii) evaluation software, (iv) documentation, and (v) publications reporting previous results.

pdf bib abs
Incorporating an Error Corpus into a Spellchecker for Maltese
Michael Rosner | Albert Gatt | Andrew Attard | Jan Joachimsen
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper discusses the ongoing development of a new Maltese spell checker, highlighting the methodologies which would best suit such a language. We thus discuss several previous attempts, highlighting what we believe to be their weakest point: a lack of attention to context. Two developments are of particular interest, both of which concern the availability of language resources relevant to spellchecking: (i) the Maltese Language Resource Server (MLRS) which now includes a representative corpus of c. 100M words extracted from diverse documents including the Maltese Legislation, press releases and extracts from Maltese web-pages and (ii) an extensive and detailed corpus of spelling errors that was collected whilst part of the MLRS texts were being prepared. We describe the structure of these resources as well as the experimental approaches focused on context that we are now in a position to adopt. We describe the framework within which a variety of different approaches to spellchecking and evaluation will be carried out, and briefly discuss the first baseline system we have implemented. We conclude the paper with a roadmap for future improvements.

Our society generates an ever-growing mass of information, whether in medicine, meteorology, or other fields. The most widely used method for analysing such data is to summarise it in graphical form. However, it has been shown that a textual summary is also an effective mode of presentation. The goal of the BT-45 prototype, developed as part of the Babytalk project, is to generate summaries of 45 minutes of continuous physiological signals and discrete temporal events in a neonatal intensive care unit (NICU). This article presents the text generation aspect of the prototype. A clinical experiment showed that human-written summaries improve decision-making compared to the graphical approach, whereas the BT-45 texts yield results similar to the graphical approach. An analysis identified some of BT-45’s limitations, but despite these, our work shows that it is possible to automatically produce effective textual summaries of complex data.

pdf bib
SimpleNLG: A Realisation Engine for Practical Applications
Albert Gatt | Ehud Reiter
Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009)

pdf bib
A Hearer-Oriented Evaluation of Referring Expression Generation
Imtiaz Hussain Khan | Kees van Deemter | Graeme Ritchie | Albert Gatt | Alexandra A. Cleland
Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009)

pdf bib
Generation Challenges 2009: Preface
Anja Belz | Albert Gatt
Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009)

pdf bib
The TUNA-REG Challenge 2009: Overview and Evaluation Results
Albert Gatt | Anja Belz | Eric Kow
Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009)

pdf bib
The GREC Main Subject Reference Generation Challenge 2009: Overview and Evaluation Results
Anja Belz | Eric Kow | Jette Viethen | Albert Gatt
Proceedings of the 2009 Workshop on Language Generation and Summarisation (UCNLG+Sum 2009)

pdf bib
Text Content and Task Performance in the Evaluation of a Natural Language Generation System
Albert Gatt | François Portet
Proceedings of the International Conference RANLP-2009