Euan Mcgill

Also published as: Euan McGill

2024

pdf abs
Bootstrapping Pre-trained Word Embedding Models for Sign Language Gloss Translation
Euan McGill | Luis Chiruzzo | Horacio Saggion
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)

This paper explores a novel method to modify existing pre-trained word embedding models of spoken languages for Sign Language glosses. These newly-generated embeddings are described, visualised, and then used in the encoder and/or decoder of models for the Text2Gloss and Gloss2Text task of machine translation. In two translation settings (one including data augmentation-based pre-training and a baseline), we find that bootstrapped word embeddings for glosses improve translation across four Signed/spoken language pairs. Many improvements are statistically significant, including those where the bootstrapped gloss embedding models are used.Languages included: American Sign Language, Finnish Sign Language, Spanish Sign Language, Sign Language of The Netherlands.

SignON, a 3-year Horizon 20202 project addressing the lack of technology and services for MT between sign languages (SLs) and spoken languages (SpLs) ended in December 2023. SignON was unprecedented. Not only it addressed the wider complexity of the aforementioned problem – from research and development of recognition, translation and synthesis, through development of easy-to-use mobile applications and a cloud-based framework to do the “heavy lifting” as well as to establishing ethical, privacy and inclusivenesspolicies and operation guidelines – but also engaged with the deaf and hard of hearing communities in an effective co-creation approach where these main stakeholders drove the development in the right direction and had the final say.Currently we are witnessing advances in natural language processing for SLs, including MT. SignON was one of the largest projects that contributed to this surge with 17 partners and more than 60 consortium members, working in parallel with other international and European initiatives, such as project EASIER and others.

pdf
Can True Zero-shot Methods with Large Language Models be Adopted for Sign Language Machine Translation?
Euan McGill | Horacio Saggion
Proceedings of the First International Workshop on Knowledge-Enhanced Machine Translation

pdf abs
TRIBBLE - TRanslating IBerian languages Based on Limited E-resources
Igor Kuzmin | Piotr Przybyła | Euan Mcgill | Horacio Saggion
Proceedings of the Ninth Conference on Machine Translation

In this short overview paper, we describe our system submission for the language pairs Spanish to Aragonese (spa-arg), Spanish to Aranese (spa-arn), and Spanish to Asturian (spa-ast). We train a unified model for all language pairs in the constrained scenario. In addition, we add two language control tokens for Aragonese and Aranese Occitan, as there is already one present for Asturian. We take the distilled NLLB-200 model with 600M parameters and extend special tokens with 2 tokens that denote target languages (arn_Latn, arg_Latn) because Asturian was already presented in NLLB-200 model. We adapt the model by training on a special regime of data augmentation with both monolingual and bilingual training data for the language pairs in this challenge.

pdf abs
Know Thine Enemy: Adaptive Attacks on Misinformation Detection Using Reinforcement Learning
Piotr Przybyła | Euan McGill | Horacio Saggion
Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

We present XARELLO: a generator of adversarial examples for testing the robustness of text classifiers based on reinforcement learning. Our solution is adaptive, it learns from previous successes and failures in order to better adjust to the vulnerabilities of the attacked model. This reflects the behaviour of a persistent and experienced attacker, which are common in the misinformation-spreading environment. We evaluate our approach using several victim classifiers and credibility-assessment tasks, showing it generates better-quality examples with less queries, and is especially effective against the modern LLMs. We also perform a qualitative analysis to understand the language patterns in the misinformation text that play a role in the attacks.

2023

pdf abs
BSL-Hansard: A parallel, multimodal corpus of English and interpreted British Sign Language data from parliamentary proceedings
Euan McGill | Horacio Saggion
Proceedings of the Second International Workshop on Automatic Translation for Signed and Spoken Languages

BSL-Hansard is a novel open source and multimodal resource composed by combining Sign Language video data in BSL and English text from the official transcription of British parliamentary sessions. This paper describes the method followed to compile BSL-Hansard including time alignment of text using the MAUS (Schiel, 2015) segmentation system, gives some statistics about this dataset, and suggests experiments. These primarily include end-to-end Sign Language-to-text translation, but is also relevant for broader machine translation, and speech and language processing tasks.

pdf abs
Part-of-Speech tagging Spanish Sign Language data and its applications in Sign Language machine translation
Euan McGill | Luis Chiruzzo | Santiago Egea Gómez | Horacio Saggion
Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023)

This paper examines the use of manually part-of-speech tagged sign language gloss data in the Text2Gloss and Gloss2Text translation tasks, as well as running an LSTM-based sequence labelling model on the same glosses for automatic part-of-speech tagging. We find that a combination of tag-enhanced glosses and pretraining the neural model positively impacts performance in the translation tasks. The results of the tagging task are limited, but provide a methodological framework for further research into tagging sign language gloss data.

SignON (https://signon-project.eu/) is a Horizon 2020 project, running from 2021 until the end of 2023, which addresses the lack of technology and services for the automatic translation between sign languages (SLs) and spoken languages, through an inclusive, human-centric solution, hence contributing to the repertoire of communication media for deaf, hard of hearing (DHH) and hearing individuals. In this paper, we present an update of the status of the project, describing the approaches developed to address the challenges and peculiarities of SL machine translation (SLMT).

2022

pdf abs
Translating Spanish into Spanish Sign Language: Combining Rules and Data-driven Approaches
Luis Chiruzzo | Euan McGill | Santiago Egea-Gómez | Horacio Saggion
Proceedings of the Fifth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2022)

This paper presents a series of experiments on translating between spoken Spanish and Spanish Sign Language glosses (LSE), including enriching Neural Machine Translation (NMT) systems with linguistic features, and creating synthetic data to pretrain and later on finetune a neural translation model. We found evidence that pretraining over a large corpus of LSE synthetic data aligned to Spanish sentences could markedly improve the performance of the translation models.

2021

pdf abs
Syntax-aware Transformers for Neural Machine Translation: The Case of Text to Sign Gloss Translation
Santiago Egea Gómez | Euan McGill | Horacio Saggion
Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)

It is well-established that the preferred mode of communication of the deaf and hard of hearing (DHH) community are Sign Languages (SLs), but they are considered low resource languages where natural language processing technologies are of concern. In this paper we study the problem of text to SL gloss Machine Translation (MT) using Transformer-based architectures. Despite the significant advances of MT for spoken languages in the recent couple of decades, MT is in its infancy when it comes to SLs. We enrich a Transformer-based architecture aggregating syntactic information extracted from a dependency parser to word-embeddings. We test our model on a well-known dataset showing that the syntax-aware model obtains performance gains in terms of MT evaluation metrics.

Co-authors

Venues

bucc1

kemt1

wmt1

wassa1