Edward Gow-Smith


2023

Sheffield’s Submission to the AmericasNLP Shared Task on Machine Translation into Indigenous Languages
Edward Gow-Smith | Danae Sánchez Villegas
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)

The University of Sheffield took part in the AmericasNLP 2023 shared task for all eleven language pairs. We train different variations of the NLLB-200 model on data provided by the organizers and on available data from various sources such as constitutions, handbooks and news articles. Our models outperform the baseline model on the development set in terms of chrF, with substantial improvements particularly for Aymara, Guarani and Quechua. On the test set, our best submission achieves the highest average chrF of all the submissions; we rank first in four of the eleven languages, and at least one of our models ranks in the top 3 for all languages.

NAVER LABS Europe’s Multilingual Speech Translation Systems for the IWSLT 2023 Low-Resource Track
Edward Gow-Smith | Alexandre Berard | Marcely Zanon Boito | Ioan Calapodescu
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)

This paper presents NAVER LABS Europe’s systems for Tamasheq-French and Quechua-Spanish speech translation in the IWSLT 2023 Low-Resource track. Our work attempts to maximize translation quality in low-resource settings using multilingual parameter-efficient solutions that leverage strong pre-trained models. Our primary submission for Tamasheq outperforms the previous state of the art by 7.5 BLEU points on the IWSLT 2022 test set, and achieves 23.6 BLEU on this year’s test set, outperforming the second-best participant by 7.7 points. For Quechua, we also rank first and achieve 17.7 BLEU, despite having only two hours of translation data. Finally, we show that our proposed multilingual architecture is also competitive for high-resource languages, outperforming the best unconstrained submission to the IWSLT 2021 Multilingual track, despite using much less training data and compute.

2022

Sample Efficient Approaches for Idiomaticity Detection
Dylan Phelps | Xuan-Rui Fan | Edward Gow-Smith | Harish Tayyar Madabushi | Carolina Scarton | Aline Villavicencio
Proceedings of the 18th Workshop on Multiword Expressions @LREC2022

Deep neural models, in particular Transformer-based pre-trained language models, require a significant amount of data to train. This need for data tends to lead to problems when dealing with idiomatic multiword expressions (MWEs), which are inherently less frequent in natural text. As such, this work explores sample-efficient methods of idiomaticity detection. In particular, we study the impact of Pattern-Exploiting Training (PET), a few-shot method of classification, and BERTRAM, an efficient method of creating contextual embeddings, on the task of idiomaticity detection. In addition, to further explore generalisability, we focus on the identification of MWEs not present in the training data. Our experiments show that while these methods improve performance on English, they are much less effective on Portuguese and Galician, leaving overall performance roughly on par with vanilla mBERT. Regardless, we believe sample-efficient methods for both identifying and representing potentially idiomatic MWEs are very encouraging and hold significant potential for future exploration.

SemEval-2022 Task 2: Multilingual Idiomaticity Detection and Sentence Embedding
Harish Tayyar Madabushi | Edward Gow-Smith | Marcos Garcia | Carolina Scarton | Marco Idiart | Aline Villavicencio
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

This paper presents the shared task on Multilingual Idiomaticity Detection and Sentence Embedding, which consists of two subtasks: (a) a binary classification task aimed at identifying whether a sentence contains an idiomatic expression, and (b) a task based on semantic text similarity which requires the model to adequately represent potentially idiomatic expressions in context. Each subtask includes different settings regarding the amount of training data. Besides the task description, this paper introduces the datasets in English, Portuguese, and Galician and their annotation procedure, the evaluation metrics, and a summary of the participant systems and their results. The task had close to 100 registered participants organised into twenty-five teams, making over 650 and 150 submissions in the practice and evaluation phases, respectively.

Use of Transformer-Based Models for Word-Level Transliteration of the Book of the Dean of Lismore
Edward Gow-Smith | Mark McConville | William Gillies | Jade Scott | Roibeard Ó Maolalaigh
Proceedings of the 4th Celtic Language Technology Workshop within LREC2022

The Book of the Dean of Lismore (BDL) is a 16th-century Scottish Gaelic manuscript written in a non-standard orthography. In this work, we outline the problem of transliterating the text of the BDL into a standardised orthography, and perform exploratory experiments using Transformer-based models for this task. In particular, we focus on the task of word-level transliteration, and achieve a character-level BLEU score of 54.15 with our best model, a BART architecture pre-trained on the text of Scottish Gaelic Wikipedia and then fine-tuned on around 2,000 word-level parallel examples. Our initial experiments give promising results, but we highlight the shortcomings of our model, and discuss directions for future work.

Improving Tokenisation by Alternative Treatment of Spaces
Edward Gow-Smith | Harish Tayyar Madabushi | Carolina Scarton | Aline Villavicencio
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

2021

AStitchInLanguageModels: Dataset and Methods for the Exploration of Idiomaticity in Pre-Trained Language Models
Harish Tayyar Madabushi | Edward Gow-Smith | Carolina Scarton | Aline Villavicencio
Findings of the Association for Computational Linguistics: EMNLP 2021

Despite their success in a variety of NLP tasks, pre-trained language models, due to their heavy reliance on compositionality, fail to effectively capture the meanings of multiword expressions (MWEs), especially idioms. Therefore, datasets and methods to improve the representation of MWEs are urgently needed. Existing datasets are limited to providing the degree of idiomaticity of expressions along with the literal and, where applicable, (a single) non-literal interpretation of MWEs. This work presents a novel dataset of naturally occurring sentences containing MWEs manually classified into a fine-grained set of meanings, spanning both English and Portuguese. We use this dataset in two tasks designed to test i) a language model’s ability to detect idiom usage, and ii) the effectiveness of a language model in generating representations of sentences containing idioms. Our experiments demonstrate that, on the task of detecting idiomatic usage, these models perform reasonably well in the one-shot and few-shot scenarios, but that there is significant scope for improvement in the zero-shot scenario. On the task of representing idiomaticity, we find that pre-training is not always effective, while fine-tuning could provide a sample-efficient method of learning representations of sentences containing MWEs.