Nick Thieberger


2025

English-based acoustic models perform well in the forced alignment of two English-based Pacific Creoles
Sam Passmore | Lila San Roque | Kirsty Gillespie | Saurabh Nath | Kira Davey | Keira Mullan | Tim Cawley | Jennifer Biggs | Rosey Billington | Bethwyn Evans | Nick Thieberger | Danielle Barth
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Expanding the breadth of languages used to study sociophonetic variation and change is an important step in the theoretical development of sociophonetics. As data archives grow, forced alignment can accelerate the study of sociophonetic variation in minority languages. This paper examines the application of English and custom-made acoustic models to the alignment of vowels in two Pacific Creoles, Tok Pisin (59 hours) and Bislama (38.5 hours). We find that English models perform acceptably well in both languages, and as well as humans in vowel environments described as ‘Highly Reliable’. Custom models performed better in Bislama than in Tok Pisin. We end the paper with recommendations on the use of cross-linguistic acoustic models in the case of English-based Creoles.

Tulun: Transparent and Adaptable Low-resource Machine Translation
Raphael Merx | Hanna Suominen | Lois Yinghui Hong | Nick Thieberger | Trevor Cohn | Ekaterina Vylomova
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

Machine translation (MT) systems that support low-resource languages often struggle on specialized domains. While researchers have proposed various techniques for domain adaptation, these approaches typically require model fine-tuning, making them impractical for non-technical users and small organizations. To address this gap, we propose Tulun, a versatile solution for terminology-aware translation, combining neural MT with large language model (LLM)-based post-editing guided by existing glossaries and translation memories. Our open-source web-based platform enables users to easily create, edit, and leverage terminology resources, fostering a collaborative human-machine translation process that respects and incorporates domain expertise while increasing MT accuracy. Evaluations show effectiveness in both real-world and benchmark scenarios: on medical and disaster relief translation tasks for Tetun and Bislama, our system achieves improvements of 16.90-22.41 ChrF++ points over baseline MT systems. Across six low-resource languages on the FLORES dataset, Tulun outperforms both standalone MT and LLM approaches, achieving an average improvement of 2.8 ChrF++ points over NLLB-54B. Tulun is publicly accessible at https://bislama-trans.rapha.dev.

2021

The language documentation quartet
Simon Musgrave | Nick Thieberger
Proceedings of the 4th Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

2017

From Small to Big Data: paper manuscripts to RDF triples of Australian Indigenous Vocabularies
Nick Thieberger | Conal Tuohy
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages

Developing collection management tools to create more robust and reliable linguistic data
Gary Holton | Kavon Hooshiar | Nick Thieberger
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages

2016

Phonotactic Modeling of Extremely Low Resource Languages
Andrei Shcherbakov | Ekaterina Vylomova | Nick Thieberger
Proceedings of the Australasian Language Technology Association Workshop 2016

2007

Does Language Technology Offer Anything to Small Languages?
Nick Thieberger
Proceedings of the Australasian Language Technology Workshop 2007