Martijn Wieling


Quantifying Language Variation Acoustically with Few Resources
Martijn Bartelds | Martijn Wieling
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Deep acoustic models represent linguistic information based on massive amounts of data. Unfortunately, for regional languages and dialects such resources are mostly not available. However, deep acoustic models might have learned linguistic information that transfers to low-resource languages. In this study, we evaluate whether this is the case through the task of distinguishing low-resource (Dutch) regional varieties. By extracting embeddings from the hidden layers of various wav2vec 2.0 models (including new models which are pre-trained and/or fine-tuned on Dutch) and using dynamic time warping, we compute pairwise pronunciation differences averaged over 10 words for over 100 individual dialects from four (regional) languages. We then cluster the resulting difference matrix in four groups and compare these to a gold standard, and a partitioning on the basis of comparing phonetic transcriptions. Our results show that acoustic models outperform the (traditional) transcription-based approach without requiring phonetic transcriptions, with the best performance achieved by the multilingual XLSR-53 model fine-tuned on Dutch. On the basis of only six seconds of speech, the resulting clustering closely matches the gold standard.

Make the Best of Cross-lingual Transfer: Evidence from POS Tagging with over 100 Languages
Wietse de Vries | Martijn Wieling | Malvina Nissim
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Cross-lingual transfer learning with large multilingual pre-trained models can be an effective approach for low-resource languages with no labeled training data. Existing evaluations of zero-shot cross-lingual generalisability of large pre-trained models use datasets with English training data, and test data in a selection of target languages. We explore a more extensive transfer learning setup with 65 different source languages and 105 target languages for part-of-speech tagging. Through our analysis, we show that pre-training of both source and target language, as well as matching language families, writing systems, word order systems, and lexical-phonetic distance significantly impact cross-lingual performance. The findings described in this paper can be used as indicators of which factors are important for effective zero-shot cross-lingual transfer to zero- and low-resource languages.

Low Saxon dialect distances at the orthographic and syntactic level
Janine Siewert | Yves Scherrer | Martijn Wieling
Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change

We compare five Low Saxon dialects from the 19th and 21st century from Germany and the Netherlands with each other as well as with modern Standard Dutch and Standard German. Our comparison is based on character n-grams on the one hand and PoS n-grams on the other and we show that these two lead to different distances. Particularly in the PoS-based distances, one can observe all of the 21st century Low Saxon dialects shifting towards the modern majority languages.


Adapting Monolingual Models: Data can be Scarce when Language Similarity is High
Wietse de Vries | Martijn Bartelds | Malvina Nissim | Martijn Wieling
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021


LSDC - A comprehensive dataset for Low Saxon Dialect Classification
Janine Siewert | Yves Scherrer | Martijn Wieling | Jörg Tiedemann
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

We present a new comprehensive dataset for the unstandardised West-Germanic language Low Saxon covering the last two centuries, the majority of modern dialects and various genres, which will be made openly available in connection with the final version of this paper. Since so far no such comprehensive dataset of contemporary Low Saxon exists, this provides a great contribution to NLP research on this language. We also test the use of this dataset for dialect classification by training a few baseline models comparing statistical and neural approaches. The performance of these models shows that in spite of an imbalance in the amount of data per dialect, enough features can be learned for a relatively high classification accuracy.


Squib: Reproducibility in Computational Linguistics: Are We Willing to Share?
Martijn Wieling | Josine Rawee | Gertjan van Noord
Computational Linguistics, Volume 44, Issue 4 - December 2018

This study focuses on an essential precondition for reproducibility in computational linguistics: the willingness of authors to share relevant source code and data. Ten years after Ted Pedersen’s influential “Last Words” contribution in Computational Linguistics, we investigate to what extent researchers in computational linguistics are willing and able to share their data and code. We surveyed all 395 full papers presented at the 2011 and 2016 ACL Annual Meetings, and identified whether links to data and code were provided. If working links were not provided, authors were requested to provide this information. Although data were often available, code was shared less often. When working links to code or data were not provided in the paper, authors provided the code in about one third of cases. For a selection of ten papers, we attempted to reproduce the results using the provided data and code. We were able to reproduce the results approximately for six papers. For only a single paper did we obtain the exact same results. Our findings show that even though the situation appears to have improved comparing 2016 to 2011, empiricism in computational linguistics still largely remains a matter of faith. Nevertheless, we are somewhat optimistic about the future. Ensuring reproducibility is not only important for the field as a whole, but also seems worthwhile for individual researchers: The median citation count for studies with working links to the source code is higher.


The Power of Character N-grams in Native Language Identification
Artur Kulmizev | Bo Blankers | Johannes Bjerva | Malvina Nissim | Gertjan van Noord | Barbara Plank | Martijn Wieling
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

In this paper, we explore the performance of a linear SVM trained on language independent character features for the NLI Shared Task 2017. Our basic system (GRONINGEN) achieves the best performance (87.56 F1-score) on the evaluation set using only 1-9 character n-grams as features. We compare this against several ensemble and meta-classifiers in order to examine how the linear system fares when combined with other, especially non-linear classifiers. Special emphasis is placed on the topic bias that exists by virtue of the assessment essay prompt distribution.

Last Words: Sharing Is Caring: The Future of Shared Tasks
Malvina Nissim | Lasha Abzianidze | Kilian Evang | Rob van der Goot | Hessel Haagsma | Barbara Plank | Martijn Wieling
Computational Linguistics, Volume 43, Issue 4 - December 2017


Read my points: Effect of animation type when speech-reading from EMA data
Kristy James | Martijn Wieling
Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

ALT Explored: Integrating an Online Dialectometric Tool and an Online Dialect Atlas
Martijn Wieling | Eva Sassolini | Sebastiana Cucurullo | Simonetta Montemagni
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, we illustrate the integration of an online dialectometric tool, Gabmap, together with an online dialect atlas, the Atlante Lessicale Toscano (ALT-Web). By using a newly created url-based interface to Gabmap, ALT-Web is able to take advantage of the sophisticated dialect visualization and exploration options incorporated in Gabmap. For example, distribution maps showing the distribution in the Tuscan dialect area of a specific dialectal form (selected via the ALT-Web website) are easily obtainable. Furthermore, the complete ALT-Web dataset as well as subsets of the data (selected via the ALT-Web website) can be automatically uploaded and explored in Gabmap. By combining these two online applications, macro- and micro-analyses of dialectal data (respectively offered by Gabmap and ALT-Web) are effectively and dynamically combined.


Assessing the Readability of Sentences: Which Corpora and Features?
Felice Dell’Orletta | Martijn Wieling | Giulia Venturi | Andrea Cimino | Simonetta Montemagni
Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications


Hierarchical Spectral Partitioning of Bipartite Graphs to Cluster Dialects and Identify Distinguishing Features
Martijn Wieling | John Nerbonne
Proceedings of TextGraphs-5 - 2010 Workshop on Graph-based Methods for Natural Language Processing


Multiple Sequence Alignments in Linguistics
Jelena Prokić | Martijn Wieling | John Nerbonne
Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education (LaTeCH – SHELT&R 2009)

Evaluating the Pairwise String Alignment of Pronunciations
Martijn Wieling | Jelena Prokić | John Nerbonne
Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education (LaTeCH – SHELT&R 2009)

Bipartite spectral graph partitioning to co-cluster varieties and sound correspondences in dialectology
Martijn Wieling | John Nerbonne
Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing (TextGraphs-4)


Inducing Sound Segment Differences Using Pair Hidden Markov Models
Martijn Wieling | Therese Leinonen | John Nerbonne
Proceedings of Ninth Meeting of the ACL Special Interest Group in Computational Morphology and Phonology