Marieke Meelen

2026

Revitalising Endangered Languages and Cultural Heritage through Language Technology: A Pilot Study for Dzardzongke
Hannah Claus | Songbo Hu | Emre Isik | Anna Korhonen | Kitty Liu | Marieke Meelen
Proceedings of the Ninth Workshop on the Use of Computational Methods in the Study of Endangered Languages (ComputEL-9)

In this short paper, we present the first prototype of a mobile application to help preserve and revitalise the endangered language and cultural heritage of the speakers of Dzardzongke, a Tibetic language spoken in South Mustang, Nepal. With this pilot study, we provide a collaborative and highly accessible solution to revitalisation that has potential for any community interested in preserving their language and culture.

Despite decades of progress in human language technology (HLT) and growing research interest in endangered languages, practical uptake of HLT in documentary linguistics workflows remains rare. In this opinion piece, we report on a structured dialogue among approximately twenty academics convened to diagnose why this gap persists. Across all topics, we identify a recurring structural problem, which we call the missing middle: despite the existence of many potentially useful HLTs, the connective infrastructure necessary to make them genuinely accessible to linguists and language communities does not exist. We report the details of our discussion and make four specific recommendations for how those active in language documentation and HLT research might orient their future work.

2025

Comparing efficacy of IPA vs Pinyin romanisation transcriptions for complex tonal languages: A case study in Baima
Katia Chirkova | Rolando Coto-Solano | Rachael Griffiths | Marieke Meelen
Proceedings of the Eight Workshop on the Use of Computational Methods in the Study of Endangered Languages

How is automated tone transcription affected by the choice of transcription orthography? In this paper we present a range of experiments that indicate that, even when the tonal representations are kept the same, the way vowels and consonants are transcribed can affect tonal character outputs. Our results also indicate that using a Language Model (LM) for decoding can mitigate problems with tonal outputs, but tones remain the most difficult part of the transcription. In doing this we also present the first Automatic Speech Recognition (ASR) models for the Baima language, spoken in Sichuan and Gansu, China. We hope to use these models to contribute to ongoing documentation efforts.

2024

End-to-End Speech Recognition for Endangered Languages of Nepal
Marieke Meelen | Alexander O’Neill | Rolando Coto-Solano
Proceedings of the Seventh Workshop on the Use of Computational Methods in the Study of Endangered Languages

This paper presents three experiments to identify the most effective and efficient ASR pipeline for facilitating the documentation and preservation of endangered languages, which are often extremely low-resourced. With data from two languages in Nepal, Dzardzongke and Newar, we show that model improvements differ with the amount of data available, and that transfer learning as well as a range of modifications (e.g. normalising amplitude and pitch) can be effective, but that a consistently standardised orthography as NLP input and post-training dictionary corrections improve results even more.

2022

NLP Pipeline for Annotating (Endangered) Tibetan and Newar Varieties
Christian Faggionato | Nathan Hill | Marieke Meelen
Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia within the 13th Language Resources and Evaluation Conference

In this paper we present our work-in-progress on a fully-implemented pipeline to create deeply-annotated corpora of a number of historical and contemporary Tibetan and Newar varieties. Our off-the-shelf tools allow researchers to create corpora with five different layers of annotation, ranging from morphosyntactic to information-structural annotation. We build on and optimise existing tools (in line with FAIR principles), as well as develop new ones, and show how they can be adapted to other Tibetan and Newar languages, most notably modern endangered languages that are both extremely low-resourced and under-researched.

Towards Coreference Resolution for Early Irish
Mark Darling | Marieke Meelen | David Willis
Proceedings of the 4th Celtic Language Technology Workshop within LREC2022

In this article, we present an outline of some of the issues involved in developing a semi-supervised procedure for coreference resolution for early Irish as part of a wider enterprise to create a parsed corpus of historical Irish with enriched annotation for information structure and anaphoric coreference. We outline the ways in which existing resources, notably the POMIC historical Irish corpus and the Cesax annotation algorithm, have had to be adapted, the first to provide suitable input for coreference resolution, the second to cope with specific aspects of early Irish grammar. We also outline features of a part-of-speech tagger that we have developed for early Irish as part of the first task and with a view to expanding the size of the future corpus.

2020

Meta-dating the PArsed Corpus of Tibetan (PACTib)
Marieke Meelen | Élie Roux
Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories

2019

Developing the Old Tibetan Treebank
Christian Faggionato | Marieke Meelen
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

This paper presents a full procedure for the development of a segmented, POS-tagged and chunk-parsed corpus of Old Tibetan. As an extremely low-resource language, Old Tibetan poses non-trivial problems in every step towards the development of a searchable treebank. We demonstrate, however, that a carefully developed, semi-supervised method of optimising and extending existing tools for Classical Tibetan, as well as creating specific ones for Old Tibetan, can address these issues. We thus also present the very first Tibetan Treebank in a variety of formats to facilitate research in the fields of NLP, historical linguistics and Tibetan Studies.
