William Lamb


2022

Proceedings of the 4th Celtic Language Technology Workshop within LREC2022
Theodorus Fransen | William Lamb | Delyth Prys

Handwriting recognition for Scottish Gaelic
William Lamb | Beatrice Alex | Mark Sinclair
Proceedings of the 4th Celtic Language Technology Workshop within LREC2022

Like most other minority languages, Scottish Gaelic has limited tools and resources available for Natural Language Processing research and applications. These limitations restrict the potential of the language to participate in modern speech technology, while also restricting research in fields such as corpus linguistics and the Digital Humanities. At the same time, Gaelic has a long written history, is well-described linguistically, and is unusually well-supported in terms of potential NLP training data. For instance, archives such as the School of Scottish Studies hold thousands of digitised recordings of vernacular speech, many of which have been transcribed as paper-based, handwritten manuscripts. In this paper, we describe a project to digitise and recognise a corpus of handwritten narrative transcriptions, with the intention of re-purposing it to develop a Gaelic speech recognition system.

Developing Automatic Speech Recognition for Scottish Gaelic
Lucy Evans | William Lamb | Mark Sinclair | Beatrice Alex
Proceedings of the 4th Celtic Language Technology Workshop within LREC2022

This paper discusses our efforts to develop a full automatic speech recognition (ASR) system for Scottish Gaelic, starting from a point of limited resources. Building ASR technology is important for documenting and revitalising endangered languages; it enables existing resources to be enhanced with automatic subtitles and transcriptions, improves accessibility for users, and, in turn, encourages continued use of the language. In this paper, we explain the many difficulties faced when collecting minority language data for speech recognition. A novel cross-lingual approach to the alignment of training data is used to overcome one such difficulty, and in this way we demonstrate how majority language resources can bootstrap the development of lower-resourced language technology. We use the Kaldi speech recognition toolkit to develop several Gaelic ASR systems, and report a final word error rate (WER) of 26.30%. This is a 9.50% improvement on our original model.

2014

Developing an Automatic Part-of-Speech Tagger for Scottish Gaelic
William Lamb | Samuel Danso
Proceedings of the First Celtic Language Technology Workshop