2024
Developing Infrastructure for Low-Resource Language Corpus Building
Hedwig G. Sekeres | Wilbert Heeringa | Wietse de Vries | Oscar Yde Zwagers | Martijn Wieling | Goffe Th. Jensma
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024
For many of the world’s small languages, few resources are available. In this project, a written, online-accessible corpus was created for the minority language variant Gronings, which serves both researchers interested in language change and variation and a general audience of (new) speakers interested in finding real-life examples of language use. The corpus was created using a combination of volunteer work and automation, which together formed an efficient pipeline for converting printed text to Key Words in Context (KWICs), annotated with lemmas and part-of-speech tags. In the creation of the corpus, we have taken into account several of the challenges that can occur when creating resources for minority languages, such as a lack of standardisation and limited (financial) resources. As the solutions we offer are applicable to other small languages as well, each step of the corpus creation process is discussed, and resources will be made available to benefit future projects on other low-resource languages.
2022
PoS Tagging, Lemmatization and Dependency Parsing of West Frisian
Wilbert Heeringa | Gosse Bouma | Martha Hofman | Jelle Brouwer | Eduard Drenth | Jan Wijffels | Hans Van de Velde
Proceedings of the Thirteenth Language Resources and Evaluation Conference
We present a lemmatizer/PoS tagger/dependency parser for West Frisian using a corpus of 44,714 words in 3,126 sentences that were annotated according to the guidelines of Universal Dependencies version 2. PoS tags were assigned to words by using a Dutch PoS tagger that was applied to a Dutch word-by-word translation, or to sentences of a Dutch parallel text. Best results were obtained when using word-by-word translations that were created by using the previous version of the Frisian translation program Oersetter. Morphologic and syntactic annotations were generated on the basis of a Dutch word-by-word translation as well. The performance of the lemmatizer/tagger/annotator when it was trained using default parameters was compared to the performance that was obtained when using the parameter values that were used for training the LassySmall UD 2.5 corpus. We study the effects of different hyperparameter settings on the accuracy of the annotation pipeline. The Frisian lemmatizer/PoS tagger/dependency parser is released as a web app and as a web service.
2018
The Boarnsterhim Corpus: A Bilingual Frisian-Dutch Panel and Trend Study
Marjoleine Sloos | Eduard Drenth | Wilbert Heeringa
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2007
The Relative Divergence of Dutch Dialect Pronunciations from their Common Source: An Exploratory Study
Wilbert Heeringa | Brian Joseph
Proceedings of Ninth Meeting of the ACL Special Interest Group in Computational Morphology and Phonology
2006
Evaluation of String Distance Algorithms for Dialectology
Wilbert Heeringa | Peter Kleiweg | Charlotte Gooskens | John Nerbonne
Proceedings of the Workshop on Linguistic Distances
1999
Comparison and Classification of Dialects
John Nerbonne | Wilbert Heeringa | Peter Kleiweg
Ninth Conference of the European Chapter of the Association for Computational Linguistics
1997
Measuring Dialect Distance Phonetically
John Nerbonne | Wilbert Heeringa
Computational Phonology: Third Meeting of the ACL Special Interest Group in Computational Phonology