Mícheál J. Ó Meachair
2026
Creating a Hybrid Rule and Neural Network Based Semantic Tagger Using Silver Standard Data: The PyMUSAS Framework for Multilingual Semantic Annotation
Andrew Moore | Paul Rayson | Dawn Archer | Tim Czerniak | Dawn Knight | Daisy Monika Lal | Gearóid Ó Donnchadha | Mícheál J. Ó Meachair | Scott Piao | Elaine Uí Dhonnchadha | Johanna Vuorinen | Yan Yabo | Xiaobin Yang
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Word Sense Disambiguation (WSD) has been widely evaluated using the semantic frameworks of WordNet, BabelNet, and the Oxford Dictionary of English. However, for the UCREL Semantic Analysis System (USAS) framework, no open, extensive evaluation has been performed beyond lexical coverage or single-language evaluation. In this work, we perform the largest semantic tagging evaluation to date of the rule-based system that uses the lexical resources of the USAS framework, covering five languages using four existing datasets and one novel Chinese dataset. To overcome the lack of manually tagged training data, we create a new silver-labelled English dataset, on which we train and evaluate various mono- and multilingual neural models in both mono- and cross-lingual evaluation setups, compare them to their rule-based counterparts, and show how a rule-based system can be enhanced with a neural network model. The resulting neural network models, the data they were trained on, the Chinese evaluation dataset, and all of the code will be released as open resources.
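For context, the PyMUSAS rule-based tagger evaluated in this paper is distributed as a spaCy component. The sketch below shows a typical invocation following the project's documented usage; the model name `en_dual_none_contextual` and the `pymusas_tags` token attribute reflect the released English resources at the time of writing and should be checked against the current PyMUSAS documentation.

```python
# A minimal sketch of running the PyMUSAS rule-based USAS tagger via spaCy.
# Assumes spaCy, pymusas, and the PyMUSAS English model "en_dual_none_contextual"
# are installed (see the PyMUSAS documentation for install instructions).
import spacy

# Load an English pipeline for tokenisation, POS tagging, and lemmatisation;
# the USAS tagger consumes these annotations, so parsing/NER can be excluded.
nlp = spacy.load("en_core_web_sm", exclude=["parser", "ner"])

# Load the PyMUSAS rule-based tagger from its released spaCy model and
# append it to the English pipeline.
english_tagger_pipeline = spacy.load("en_dual_none_contextual")
nlp.add_pipe("pymusas_rule_based_tagger", source=english_tagger_pipeline)

doc = nlp("The river bank was steep.")
for token in doc:
    # Each token carries a ranked list of candidate USAS semantic tags.
    print(token.text, token._.pymusas_tags)
```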
2025
Gaeilge Bhriste ó Shamhlacha Cliste: How Clever Are LLMs When Translating Irish Text?
Teresa Clifford | Abigail Walsh | Brian Davis | Mícheál J. Ó Meachair
Proceedings of the 5th Celtic Language Technology Workshop
Large Language Models have been widely adopted in NLP tasks and applications; however, their ability to accurately process Irish and other minority languages has not been fully explored. In this paper we describe preliminary experiments examining the capacity of publicly-available machine translation engines (Google Translate, Microsoft Bing, and eTranslation) and prompt-based AI systems (ChatGPT 3.5, Llama 2) to translate and handle challenging language features of Irish. A hand-crafted selection of challenging Irish language features was incorporated into translation prompts, and the output from each model was examined by a human evaluator. The results of these experiments indicate that these LLM-based models still struggle with translating rare linguistic phenomena and ambiguous constructions. This preliminary analysis helps to inform further research in this field, providing a simple ranking of publicly-available models and indicating which language features require particular attention when evaluating model capacity.
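The paper does not publish its prompts or evaluation harness, but the kind of probe it describes can be sketched as follows: hand-crafted Irish test items targeting specific grammatical features are sent to a chat LLM, and the translations are collected for human evaluation. The prompt wording, model name, and test sentences below are illustrative assumptions, not the authors' materials.

```python
# A hypothetical sketch of a prompt-based translation probe: send hand-crafted
# Irish sentences to a chat LLM and collect its English translations for
# later human evaluation. Requires the `openai` package (v1+) and an API key.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Example items targeting challenging Irish features; the paper's actual
# test set is a hand-crafted selection of such phenomena.
test_sentences = [
    "Is múinteoir í.",          # copula construction
    "Briseadh an fhuinneog.",   # autonomous (impersonal) verb form
]

for sentence in test_sentences:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Translate the Irish sentence into English."},
            {"role": "user", "content": sentence},
        ],
    )
    print(sentence, "->", response.choices[0].message.content)
```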
2022
gaBERT — an Irish Language Model
James Barry | Joachim Wagner | Lauren Cassidy | Alan Cowap | Teresa Lynn | Abigail Walsh | Mícheál J. Ó Meachair | Jennifer Foster
Proceedings of the Thirteenth Language Resources and Evaluation Conference
The BERT family of neural language models has become highly popular due to its ability to provide sequences of text with rich, context-sensitive token encodings that generalise well to many NLP tasks. We introduce gaBERT, a monolingual BERT model for the Irish language. We compare gaBERT to multilingual BERT and the monolingual Irish WikiBERT, and we show that gaBERT provides better representations for a downstream parsing task. We also show how different filtering criteria, vocabulary size, and the choice of subword tokenisation model affect downstream performance. We compare the results of fine-tuning a gaBERT model with an mBERT model on the task of identifying verbal multiword expressions, and show that the fine-tuned gaBERT model also performs better at this task. We release gaBERT and related code to the community.
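The released gaBERT checkpoint can be loaded with Hugging Face Transformers; the sketch below assumes the model identifier `DCU-NLP/bert-base-irish-cased-v1`, which is how the checkpoint appears on the Hugging Face Hub at the time of writing, and probes it with a simple masked-token prediction.

```python
# A minimal sketch of loading gaBERT via Hugging Face Transformers and
# querying it with a masked-token prediction. Assumes the checkpoint is
# published on the Hub as "DCU-NLP/bert-base-irish-cased-v1".
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="DCU-NLP/bert-base-irish-cased-v1")

# "Tá an [MASK] go maith." -- let the model fill in the masked Irish word.
for prediction in fill_mask("Tá an [MASK] go maith."):
    print(prediction["token_str"], round(prediction["score"], 3))
```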