Proceedings of the 4th Celtic Language Technology Workshop within LREC2022

Theodorus Fransen, William Lamb, Delyth Prys (Editors)

Anthology ID:: 2022.cltw-1
Month:: June
Year:: 2022
Address:: Marseille, France
Venue:: CLTW
SIG:
Publisher:: European Language Resources Association
URL:: https://aclanthology.org/2022.cltw-1
DOI:
Bib Export formats:: BibTeX
PDF:: https://preview.aclanthology.org/dois-2013-emnlp/2022.cltw-1.pdf

pdf bib
Proceedings of the 4th Celtic Language Technology Workshop within LREC2022
Theodorus Fransen | William Lamb | Delyth Prys

pdf bib abs
Multilingual Abstract Meaning Representation for Celtic Languages
Johannes Heinecke | Anastasia Shimorina

Deep Semantic Parsing into Abstract Meaning Representation (AMR) graphs has reached a high quality with neural-based seq2seq approaches. However, the training corpus for AMR is only available for English. Several approaches to process other languages exist, but only for high resource languages. We present an approach to create a multilingual text-to-AMR model for three Celtic languages, Welsh (P-Celtic) and the closely related Irish and Scottish-Gaelic (Q-Celtic). The main success of this approach are underlying multilingual transformers like mT5. We finally show that machine translated test corpora unfairly improve the AMR evaluation for about 1 or 2 points (depending on the language).

pdf bib abs
Diachronic Parsing of Pre-Standard Irish
Kevin Scannell

Irish underwent a major spelling standardization in the 1940’s and 1950’s, and as a result it can be challenging to apply language technologies designed for the modern language to older, “pre-standard” texts. Lemmatization, tagging, and parsing of these pre-standard texts play an important role in a number of applications, including the lexicographical work on Foclóir Stairiúil na Gaeilge, a historical dictionary of Irish covering the period from 1600 to the present. We have two main goals in this paper. First, we introduce a small benchmark corpus containing just over 3800 words, annotated according to the Universal Dependencies guidelines and covering a range of dialects and time periods since 1600. Second, we establish baselines for lemmatization, tagging, and dependency parsing on this corpus by experimenting with a variety of machine learning approaches.

pdf abs
Creation of an Evaluation Corpus and Baseline Evaluation Scores for Welsh Text Summarisation
Mahmoud El-Haj | Ignatius Ezeani | Jonathan Morris | Dawn Knight

As part of the effort to increase the availability of Welsh digital technology, this paper introduces the first human vs metrics Welsh summarisation evaluation results and dataset, which we provide freely for research purposes to help advance the work on Welsh summarisation. The system summaries were created using an extractive graph-based Welsh summariser. The system summaries were evaluated by both human and a range of ROUGE metric variants (e.g. ROUGE 1, 2, L and SU4). The summaries and evaluation results will serve as benchmarks for the development of summarisers and evaluation metrics in other minority language contexts.

pdf abs
CLILSTORE.EU - A Multilingual online CLIL platform
Caoimhín Ó Dónaill

CLILSTORE.EU is an open educational resource (OER) that was created by the Erasmus + funded CLIL Open Online Learning (COOL) project which ran from 2018-2021. The project consortium included teaching practitioners from the primary, secondary, tertiary and vocational sectors who each brought their influence to bear on the design and functionality of the OER and subsequently evaluated its development within the learning contexts of their respective sectors. CLILSTORE.EU serves as both an authoring and sharing platform where multimedia learning materials can be created and accessed. Its name comprises the acronym CLIL, owing to its particular suitablity as a tool to support the Content and Language Integrated Learning methodology (Marsh, D. (ed.), 2002). The main educational aims of the OER are to provide teachers with a relatively straightforward means of creating reusable, multimodal learning units that can be used within the classroom or via remote learning to underpin and scaffold the delivery of curricular content in any subject area, especially in contexts where learners are acquiring new knowledge through the medium of a second or additional language. The following account details recent development work on the OER’s functionality and usability and presents case studies showing how it can benefit Celtic languages.

pdf abs
Evaluation of Three Welsh Language POS Taggers
Gruffudd Prys | Gareth Watkins

In this paper we describe our quantitative and qualitative evaluation of three Welsh language Part of Speech (POS) taggers. Following an introductory section, we explore some of the issues which face POS taggers, discuss the state of the art in English language tagging, and describe the three Welsh language POS taggers that will be evaluated in this paper, namely WNLT2, CyTag and TagTeg. In section 3 we describe the challenges involved in evaluating POS taggers which make use of different tagsets, and introduce our mapping of the taggers’ individual tagsets to an Intermediate Tagset used to facilitate their comparative evaluation. Section 4 introduces our benchmarking corpus as an important component of our methodology. In section 5 we describe how the inconsistencies in text tokenization between the different taggers present an issue when undertaking such evaluations, and discuss the method used to overcome this complication. Section 6 illustrates how we annotated the benchmark corpus, while section 7 describes the scoring method used. Section 8 provides an in-depth analysis of the results, and a summary of the work is presented in the conclusion found in section 9. Keywords: POS Tagger, Welsh, Evaluation, Machine Learning

pdf abs
Iterated Dependencies in a Breton treebank and implications for a Categorial Dependency Grammar
Annie Foret | Denis Béchet | Valérie Bellynck

Categorial Dependency Grammars (CDG) are computational grammars for natural language processing, defining dependency structures. They can be viewed as a formal system, where types are attached to words, combining the classical categorial grammars’ elimination rules with valency pairing rules able to define discontinuous (non-projective) dependencies. Algorithms have been proposed to infer grammars in this class from treebanks, with respect to Mel’čuk principles. We consider this approach with experiments on Breton. We focus in particular on ”repeatable dependencies” (iterated) and their patterns. A dependency d is iterated in a dependency structure if some word in this structure governs several other words through dependency d. We illustrate this approach with data in the universal dependencies format and dependency patterns written in Grew (a graph rewriting tool dedicated to applications in natural Language Processing).

This paper describes ÉIST, automatic speech recogniser for Irish, developed as part of the ongoing ABAIR initiative, combining (1) acoustic models, (2) pronunciation lexicons and (3) language models into a hybrid system. A priority for now is a system that can deal with the multiple diverse native-speaker dialects. Consequently, (1) was built using predominately native-speaker speech, which included earlier recordings used for synthesis development as well as more diverse recordings obtained using the MíleGlór platform. The pronunciation variation across the dialects is a particular challenge in the development of (2) and is explored by testing both Trans-dialect and Multi-dialect letter-to-sound rules. Two approaches to language modelling (3) are used in the hybrid system, a simple n-gram model and recurrent neural network lattice rescoring, the latter garnering impressive performance improvements. The system is evaluated using a test set that is comprised of both native and non-native speakers, which allows for some inferences to be made on the performance of the system on both cohorts.

pdf abs
Development and Evaluation of Speech Recognition for the Welsh Language
Dewi Jones

This paper reports on ongoing work on developing and evaluating speech recognition models for the Welsh language using data from the Common Voice project and two popular open development kits – HuggingFace wav2vec2 and coqui STT. Activities for ensuring the growth and improvement of the Welsh Common Voice dataset are described. Two applications have been developed – a voice assistant and an online transcription service that allow users and organisations to use the new models in a practical and useful context, but which have also helped source additional test data for better evaluation of recognition accuracy and establishing the optimal selection and configurations of models. Test results suggest that in transcription good accuracy can be achieved for read speech, but further data and research is required for improving recognition results of freely spoken formal and informal speech. Meanwhile a limited domain language model provides excellent accuracy for a voice assistant. All code, data and models produced from this work are freely available.

pdf abs
Handwriting recognition for Scottish Gaelic
William Lamb | Beatrice Alex | Mark Sinclair

Like most other minority languages, Scottish Gaelic has limited tools and resources available for Natural Language Processing research and applications. These limitations restrict the potential of the language to participate in modern speech technology, while also restricting research in fields such as corpus linguistics and the Digital Humanities. At the same time, Gaelic has a long written history, is well-described linguistically, and is unusually well-supported in terms of potential NLP training data. For instance, archives such as the School of Scottish Studies hold thousands of digitised recordings of vernacular speech, many of which have been transcribed as paper-based, handwritten manuscripts. In this paper, we describe a project to digitise and recognise a corpus of handwritten narrative transcriptions, with the intention of re-purposing it to develop a Gaelic speech recognition system.

In this paper, we present the Irish language learning platform, An Sc ́eala ́ı, an intelligent Computer-Assisted Language Learning (iCALL) system which incorporates speech and language technologies in ways that promote the holistic development of the language skills - writing, listening, reading, and speaking. The technologies offer the advantage of extensive feedback in spoken and written form, enabling learners to improve their production. The system works equally as a classroom-based tool and as a standalone platform for the autonomous learner. Given the key role of education for the transmission of all the Celtic languages, it is vital that digital technologies be harnessed to maximise the effectiveness of language teaching/learning. An Scéalaí has been used by large numbers of learners and teachers and has received very positive feedback. It is built as a modular system which allows existing and newly emerging technologies to be readily integrated, even if those technologies are still in development phase. The architecture is largely language-independent, and as an open-source system, it is hoped that it can be usefully deployed in other Celtic languages.

pdf abs
Cipher – Faoi Gheasa: A Game-with-a-Purpose for Irish
Elaine Uí Dhonnchadha | Monica Ward | Liang Xu

This paper describes Cipher – Faoi Gheasa, a ‘game with a purpose’ designed to support the learning of Irish in a fun and enjoyable way. The aim of the game is to promote language ‘noticing’ and to combine the benefits of reading with the enjoyment of computer game playing, in a pedagogically beneficial way. In this paper we discuss pedagogical challenges for Irish, the development of measures for the selection and ranking of reading materials, as well as initial results of game evaluation. Overall user feedback is positive and further testing and development is envisaged.

pdf abs
Towards Coreference Resolution for Early Irish
Mark Darling | Marieke Meelen | David Willis

In this article, we present an outline of some of the issues involved in developing a semi-supervised procedure for coreference resolution for early Irish as part of a wider enterprise to create a parsed corpus of historical Irish with enriched annotation for information structure and anaphoric coreference. We outline the ways in which existing resources, notably the POMIC historical Irish corpus and the Cesax annotation algorithm, have had to be adapted, the first to provide suitable input for coreference resolution, the second to cope with specific aspects of early Irish grammar. We also outline features of a part-of-speech tagger that we have developed for early Irish as part of the first task and with a view to expanding the size of the future corpus.

pdf abs
Use of Transformer-Based Models for Word-Level Transliteration of the Book of the Dean of Lismore
Edward Gow-Smith | Mark McConville | William Gillies | Jade Scott | Roibeard Ó Maolalaigh

The Book of the Dean of Lismore (BDL) is a 16th-century Scottish Gaelic manuscript written in a non-standard orthography. In this work, we outline the problem of transliterating the text of the BDL into a standardised orthography, and perform exploratory experiments using Transformer-based models for this task. In particular, we focus on the task of word-level transliteration, and achieve a character-level BLEU score of 54.15 with our best model, a BART architecture pre-trained on the text of Scottish Gaelic Wikipedia and then fine-tuned on around 2,000 word-level parallel examples. Our initial experiments give promising results, but we highlight the shortcomings of our model, and discuss directions for future work.

pdf abs
Introducing the National Corpus of Irish Project
Mícheál Ó Meachair | Úna Bhreathnach | Gearóid Ó Cleircín

This paper introduces the National Corpus of Irish, an initiative to develop a large national corpus of written and spoken contemporary Irish as well as related specialised corpora. The newly-compiled corpora will be hosted at corpas.ie, in what will become a hub for corpus-based research on the Irish language. Users will be able to search the corpora and download data generated during the project from the corpas.ie website and appropriate third-party repositories. Corpus 1 will be a balanced general-purpose corpus containing c.155m words. Corpus 2 will be a written corpus consisting of c100m words. Corpus 3 will be a spoken corpus containing 6.5m words. Corpus 4 will be a monitor corpus with a target size of 1m words per year from 2000 onwards. Token, lemma, and n-gram frequency lists will be published at regular intervals on the project website, and language models will be published there and on other appropriate platforms during the course of the project. This paper focuses on the background and crucial scoping stage of the project, and examines user needs as identified in a survey of potential users.

pdf abs
BU-TTS: An Open-Source, Bilingual Welsh-English, Text-to-Speech Corpus
Stephen Russell | Dewi Jones | Delyth Prys

This paper presents the design, collection and verification of a bilingual text-to-speech synthesis corpus for Welsh and English. The ever expanding voice collection currently contains almost 10 hours of recordings from a bilingual, phonetically balanced text corpus. The speakers consist of a professional voice actor and three amateur contributors, with male and female accents from north and south Wales. This corpus provides audio-text pairs for building and training high-quality bilingual Welsh-English neural based TTS systems. We describe the process by which we created a phonetically balanced prompt set and the challenges of attempting to collate such a dataset during the COVID-19 pandemic. Our initial findings in validating the corpus via the implementation of a state-of-the-art TTS models are presented. This corpus represents the first open-source Welsh language corpus large enough to capitalise on neural TTS architectures.

pdf abs
Developing Automatic Speech Recognition for Scottish Gaelic
Lucy Evans | William Lamb | Mark Sinclair | Beatrice Alex

This paper discusses our efforts to develop a full automatic speech recognition (ASR) system for Scottish Gaelic, starting from a point of limited resource. Building ASR technology is important for documenting and revitalising endangered languages; it enables existing resources to be enhanced with automatic subtitles and transcriptions, improves accessibility for users, and, in turn, encourages continued use of the language. In this paper, we explain the many difficulties faced when collecting minority language data for speech recognition. A novel cross-lingual approach to the alignment of training data is used to overcome one such difficulty, and in this way we demonstrate how majority language resources can bootstrap the development of lower-resourced language technology. We use the Kaldi speech recognition toolkit to develop several Gaelic ASR systems, and report a final WER of 26.30%. This is a 9.50% improvement on our original model.

pdf abs
Handwritten Text Recognition (HTR) for Irish-Language Folklore
Brian Ó Raghallaigh | Andrea Palandri | Críostóir Mac Cárthaigh

In this paper we present our method for digitising a large collection of handwritten Irish-language texts as part of a project to mine information from a large corpus of Irish and Scottish Gaelic folktales. The handwritten texts form part of the Main Manuscript Collection of the National Folklore Collection of Ireland and contain handwritten transcriptions of oral folklore collected in Ireland in the 20th century. With the goal of creating a large text corpus of the Irish-language folktales contained within this collection, our method involves scanning the pages of the physical volumes and digitising the text on these pages using Transkribus, a platform for the recognition of historical documents. Given the nature of the collection, the approach we have taken involves the creation of individual text recognition models for multiple collectors’ hands. Doing it this way was motivated by the fact that a relatively small number of collectors contributed the bulk of the material, while the differences between each collector in terms of style, layout and orthography were difficult to reconcile within a single handwriting model. We present our preliminary results along with a discussion on the viability of using crowdsourced correction to improve our HTR models.

This paper describes the prototype development of an Alternative and Augmentative Communication (AAC) system for the Irish language. This system allows users to communicate using the ABAIR synthetic voices, by selecting a series of words or images. Similar systems are widely available in English and are often used by autistic people, as well as by people with Cerebral Palsy, Alzheimer’s and Parkinson’s disease. A dual-pronged approach to development has been adopted: this involves (i) the initial short-term prototype development that targets the immediate needs of specific users, as well as considerations for (ii) the longer term development of a bilingual AAC system which will suit a broader range of users with varying linguistic backgrounds, age ranges and needs. This paper described the design considerations and the implementation steps in the current system. Given the substantial differences in linguistic structures in Irish and English, the development of a bilingual system raises many research questions and avenues for future development.