Francis Tyers

Also published as: Francis M. Tyers

2023

Codex to corpus: Exploring annotation and processing for an open and extensible machine-readable edition of the Florentine Codex
Francis Tyers | Robert Pugh | Valery Berthoud F.
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)

This paper describes an ongoing effort to create, from the original hand-written text, a machine-readable, linguistically-annotated, and easily-searchable corpus of the Nahuatl portion of the Florentine Codex, a 16th century Mesoamerican manuscript written in Nahuatl and Spanish. The Codex consists of 12 books and over 300,000 tokens. We describe the process of annotating 3 of these books, the steps of text preprocessing undertaken, our approach to efficient manual processing and annotation, and some of the challenges faced along the way. We also report on a set of experiments evaluating our ability to automate the text processing tasks to aid in the remaining annotation effort, and find the results promising despite the relatively low volume of training data. Finally, we briefly present a real use case from the humanities that would benefit from the searchable, linguistically annotated corpus we describe.

Developing finite-state language technology for Maya
Robert Pugh | Francis Tyers | Quetzil Castañeda
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)

We describe a suite of finite-state language technologies for Maya, a Mayan language spoken in Mexico. At the core is a computational model of Maya morphology and phonology using a finite-state transducer. This model results in a morphological analyzer and a morphologically-informed spell-checker. All of these technologies are designed for use as both a pedagogical reading/writing aid for L2 learners and as a general language processing tool capable of supporting much of the natural variation in written Maya. We discuss the relevant features of Maya morphosyntax and orthography, and then outline the implementation details of the analyzer. To conclude, we present a longer-term vision for these tools and their use by both native speakers and learners.

A finite-state morphological analyser for Highland Puebla Nahuatl
Robert Pugh | Francis Tyers
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)

This paper describes the development of a free/open-source finite-state morphological transducer for Highland Puebla Nahuatl, a Uto-Aztecan language spoken in and around the state of Puebla in Mexico. The finite-state toolkit used for the work is the Helsinki Finite-State Toolkit (HFST); we use the lexc formalism for modelling the morphotactics and twol formalism for modelling morphophonological alternations. An evaluation is presented which shows that the transducer has a reasonable coverage (around 90%) on freely-available corpora of the language, and high precision (over 95%) on a manually verified test set.

Comparing methods of orthographic conversion for Bàsàá, a language of Cameroon
Alexandra O’neil | Daniel Swanson | Robert Pugh | Francis Tyers | Emmanuel Ngue Um
Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)

Orthographical standardization is a milestone in a language’s documentation and the development of its resources. However, texts written in former orthographies remain relevant to the language’s history and development and therefore must be converted to the standardized orthography. Ensuring a language has access to the orthographically standardized version of all of its recorded texts is important in the development of resources as it provides additional textual resources for training, supports contribution of authors using former writing systems, and provides information about the development of the language. This paper evaluates the performance of natural language processing methods, specifically Finite State Transducers and Long Short-term Memory networks, for the orthographical conversion of Bàsàá texts from the Protestant missionary orthography to the now-standard AGLC orthography, with the conclusion that LSTMs are somewhat more effective in the absence of explicit lexical information.

Towards a finite-state morphological analyser for San Mateo Huave
Francis M. Tyers | Samuel Herrera Castro
Proceedings of the Sixth Workshop on the Use of Computational Methods in the Study of Endangered Languages

Proceedings of the Sixth Workshop on Universal Dependencies (UDW, GURT/SyntaxFest 2023)
Loïc Grobol | Francis Tyers
Proceedings of the Sixth Workshop on Universal Dependencies (UDW, GURT/SyntaxFest 2023)

WITH Context: Adding Rule-Grouping to VISL CG-3
Daniel Swanson | Tino Didriksen | Francis M. Tyers
Proceedings of the NoDaLiDa 2023 Workshop on Constraint Grammar - Methods, Tools and Applications

This paper presents an extension to the VISL CG-3 compiler and processor which enables complex contexts to be shared between rules. This sharing substantially improves the readability and maintainability of sets of rules performing multi-step operations.

2022

Handling Stress in Finite-State Morphological Analyzers for Ancient Greek and Ancient Hebrew
Daniel Swanson | Francis Tyers
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages

Modeling stress placement has historically been a challenge for computational morphological analysis, especially in finite-state systems because lexically conditioned stress cannot be modeled using only rewrite rules on the phonological form of a word. However, these phenomena can be modeled fairly easily if the lexicon’s internal representation is allowed to contain more information than the pure phonological form. In this paper we describe the stress systems of Ancient Greek and Ancient Hebrew and we present two prototype finite-state morphological analyzers, one for each language, which successfully implement these stress systems by inserting a small number of control characters into the phonological form, thus conclusively refuting the claim that finite-state systems are not powerful enough to model such stress systems and arguing in favor of the continued relevance of finite-state systems as an appropriate tool for modeling the morphology of historical languages.

The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation, and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements on several fronts that were made in the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 66 new languages, including 24 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g., missing gender and macrons information. We have amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.

A Universal Dependencies Treebank of Ancient Hebrew
Daniel Swanson | Francis Tyers
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper we present the initial construction of a Universal Dependencies treebank with morphological annotations of Ancient Hebrew containing portions of the Hebrew Scriptures (1579 sentences, 27K tokens) for use in comparative study with ancient translations and for analysis of the development of Hebrew syntax. We construct this treebank by applying a rule-based parser (300 rules) to an existing morphologically-annotated corpus with minimal constituency structure and manually verifying the output. We present the results of this semi-automated annotation process and some of the annotation decisions made in the process of applying the UD guidelines to a new language.

Universal Dependencies for Western Sierra Puebla Nahuatl
Robert Pugh | Marivel Huerta Mendez | Mitsuya Sasaki | Francis Tyers
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present a morpho-syntactically-annotated corpus of Western Sierra Puebla Nahuatl that conforms to the annotation guidelines of the Universal Dependencies project. We describe the sources of the texts that make up the corpus, the annotation process, and important annotation decisions made throughout the development of the corpus. As the first indigenous language of Mexico to be added to the Universal Dependencies project, this corpus offers a good opportunity to test and more clearly define annotation guidelines for the Mesoamerican linguistic area, spontaneous and elicited spoken data, and code-switching.

A Free/Open-Source Morphological Analyser and Generator for Sakha
Sardana Ivanova | Jonathan Washington | Francis Tyers
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present, to our knowledge, the first ever published morphological analyser and generator for Sakha, a marginalised language of Siberia. The transducer, developed using HFST, has coverage of solidly above 90%, and high precision. In the development of the analyser, we have expanded linguistic knowledge about Sakha, and developed strategies for complex grammatical patterns. The transducer is already being used in downstream tasks, including computer assisted language learning applications for linguistic maintenance and computational linguistic shared tasks.

How to encode arbitrarily complex morphology in word embeddings, no corpus needed
Lane Schwartz | Coleman Haley | Francis Tyers
Proceedings of the first workshop on NLP applications to field linguistics

In this paper, we present a straightforward technique for constructing interpretable word embeddings from morphologically analyzed examples (such as interlinear glosses) for all of the world’s languages. Currently, fewer than 300-400 languages out of approximately 7000 have more than a trivial amount of digitized texts; of those, between 100-200 languages (most in the Indo-European language family) have enough text data for BERT embeddings of reasonable quality to be trained. The word embeddings in this paper are explicitly designed to be both linguistically interpretable and fully capable of handling the broad variety found in the world’s diverse set of 7000 languages, regardless of corpus size or morphological characteristics. We demonstrate the applicability of our representation through examples drawn from a typologically diverse set of languages whose morphology includes prefixes, suffixes, infixes, circumfixes, templatic morphemes, derivational morphemes, inflectional morphemes, and reduplication.

Predictive Text for Agglutinative and Polysynthetic Languages
Sergey Kosyak | Francis Tyers
Proceedings of the first workshop on NLP applications to field linguistics

This paper presents a set of experiments in the area of morphological modelling and prediction. We test whether morphological segmentation can compete against statistical segmentation in the tasks of language modelling and predictive text entry for two under-resourced and indigenous languages, K’iche’ and Chukchi. We use different segmentation methods (both statistical and morphological) to make datasets that are used to train models of different types: single-way segmented, which are trained using data from one segmenter; two-way segmented, which are trained using concatenated data from two segmenters; and finetuned, which are trained on two datasets from different segmenters. We compute word and character level perplexities and find that single-way segmented models trained on morphologically segmented data show the highest performance. Finally, we evaluate the language models on the task of predictive text entry using gold standard data and measure the average number of clicks per character and keystroke savings rate. We find that the models trained on morphologically segmented data show better scores, although with substantial room for improvement. Lastly, we propose using morphological segmentation to improve the end-user experience of predictive text, and we plan to test this assumption through end-user evaluation.

In this study, we propose a morpheme-based scheme for Korean dependency parsing and adopt the proposed scheme to Universal Dependencies. We present the linguistic rationale that illustrates the motivation and the necessity of adopting the morpheme-based format, and develop scripts that convert between the original format used by Universal Dependencies and the proposed morpheme-based format automatically. The effectiveness of the proposed format for Korean dependency parsing is then tested with both statistical and neural models, including UDPipe and Stanza, with our carefully constructed morpheme-based word embedding for Korean. morphUD outperforms parsing results for all Korean UD treebanks, and we also present detailed error analysis.

2021

Keyword spotting for audiovisual archival search in Uralic languages
Nils Hjortnaes | Niko Partanen | Francis M. Tyers
Proceedings of the Seventh International Workshop on Computational Linguistics of Uralic Languages

Despite the increasing number of large and comprehensive machine translation (MT) systems, evaluation of these methods in various languages has been restrained by the lack of high-quality parallel corpora as well as engagement with the people that speak these languages. In this study, we present an evaluation of state-of-the-art approaches to training and evaluating MT systems in 22 languages from the Turkic language family, most of which are extremely under-explored. First, we adopt the TIL Corpus with a few key improvements to the training and the evaluation sets. Then, we train 26 bilingual baselines as well as a multi-way neural MT (MNMT) model using the corpus and perform an extensive analysis using automatic metrics as well as human evaluations. We find that the MNMT model outperforms almost all bilingual baselines in the out-of-domain test sets and finetuning the model on a downstream task of a single pair also results in a huge performance boost in both low- and high-resource scenarios. Our attentive analysis of evaluation criteria for MT models in Turkic languages also points to the necessity for further research in this direction. We release the corpus splits, test sets as well as models to the public.

Recent advances in neural machine translation (NMT) have pushed the quality of machine translation systems to the point where they are becoming widely adopted to build competitive systems. However, there is still a large number of languages that are yet to reap the benefits of NMT. In this paper, we provide the first large-scale case study of the practical application of MT in the Turkic language family in order to realize the gains of NMT for Turkic languages under high-resource to extremely low-resource scenarios. In addition to presenting an extensive analysis that identifies the bottlenecks towards building competitive systems to ameliorate data scarcity, our study has several key contributions, including, i) a large parallel corpus covering 22 Turkic languages consisting of common public datasets in combination with new datasets of approximately 1.4 million parallel sentences, ii) bilingual baselines for 26 language pairs, iii) novel high-quality test sets in three different translation domains and iv) human evaluation scores. All models, scripts, and data will be released to the public.

Do RNN States Encode Abstract Phonological Alternations?
Miikka Silfverberg | Francis Tyers | Garrett Nicolai | Mans Hulden
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Sequence-to-sequence models have delivered impressive results in word formation tasks such as morphological inflection, often learning to model subtle morphophonological details with limited training data. Despite the performance, the opacity of neural models makes it difficult to determine whether complex generalizations are learned, or whether a kind of separate rote memorization of each morphophonological process takes place. To investigate whether complex alternations are simply memorized or whether there is some level of generalization across related sound changes in a sequence-to-sequence model, we perform several experiments on Finnish consonant gradation—a complex set of sound changes triggered in some words by certain suffixes. We find that our models often—though not always—encode 17 different consonant gradation processes in a handful of dimensions in the RNN. We also show that by scaling the activations in these dimensions we can control whether consonant gradation occurs and the direction of the gradation.

A corpus of K’iche’ annotated for morphosyntactic structure
Francis Tyers | Robert Henderson
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

This article describes a collection of sentences in K’iche’ annotated for morphology and syntax. K’iche’ is a language in the Mayan language family, spoken in Guatemala. The annotation is done according to the guidelines of the Universal Dependencies project. The corpus consists of a total of 1,433 sentences containing approximately 10,000 tokens and is released under a free/open-source licence. We present a comparison of parsing systems for K’iche’ using this corpus and describe how it can be used for mining linguistic examples.

Investigating variation in written forms of Nahuatl using character-based language models
Robert Pugh | Francis Tyers
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

We describe experiments with character-based language modeling for written variants of Nahuatl. Using a standard LSTM model and publicly available Bible translations, we explore how character language models can be applied to the tasks of estimating mutual intelligibility, identifying genetic similarity, and distinguishing written variants. We demonstrate that these simple language models are able to capture similarities and differences that have been described in the linguistic literature.

A survey of part-of-speech tagging approaches applied to K’iche’
Francis Tyers | Nick Howell
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

We study the performance of several popular neural part-of-speech taggers from the Universal Dependencies ecosystem on Mayan languages using a small corpus of 1435 annotated K’iche’ sentences consisting of approximately 10,000 tokens, with encouraging results: F₁ scores of 93%+ on lemmatisation, part-of-speech and morphological feature assignment. The high performance motivates a cross-language part-of-speech tagging study, where K’iche’-trained models are evaluated on two other Mayan languages, Kaqchikel and Uspanteko: performance on Kaqchikel is good, 63-85%, and on Uspanteko modest, 60-71%. Supporting experiments lead us to conclude that the relative diversity of morphological features is a plausible explanation for the limiting factors in cross-language tagging performance, providing some direction for future sentence annotation and collection work to support these and other Mayan languages.

A finite-state morphological analyser for Paraguayan Guaraní
Anastasia Kuznetsova | Francis Tyers
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

This article describes the development of a morphological analyser for Paraguayan Guaraní, an agglutinative indigenous language spoken by nearly 6 million people in South America. The implementation of our analyser uses HFST (Helsinki Finite State Technology) and a two-level transducer that covers the morphotactics and phonological processes occurring in Guaraní. We assess the efficacy of the approach on publicly available Wikipedia and Bible corpora; the naive coverage of the analyser reaches 86% on the Wikipedia corpus and 91% on the Bible corpus.
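
The coverage figures in the abstract above are naive coverage, which is commonly taken to mean the proportion of corpus tokens that receive at least one analysis, whether or not that analysis is correct. The following is a minimal Python sketch under that assumption; `toy_analyse` and `TOY_LEXICON` are hypothetical stand-ins for a real transducer lookup and are not part of HFST or of the authors' code.

```python
from typing import Callable, Iterable, List


def naive_coverage(tokens: Iterable[str],
                   analyse: Callable[[str], List[str]]) -> float:
    """Return the share of tokens that get at least one analysis.

    Correctness of the analyses is not checked, which is why the
    measure is called "naive".
    """
    total = 0
    covered = 0
    for token in tokens:
        total += 1
        if analyse(token):  # a non-empty list counts as covered
            covered += 1
    return covered / total if total else 0.0


# Hypothetical toy analyser: a lookup table with placeholder forms and tags,
# standing in for a compiled finite-state transducer.
TOY_LEXICON = {
    "ab": ["ab<n>"],
    "abc": ["ab<n><pl>"],
}


def toy_analyse(token: str) -> List[str]:
    return TOY_LEXICON.get(token.lower(), [])


sample = ["Ab", "abc", "xyz", "ab"]
coverage = naive_coverage(sample, toy_analyse)
print(f"naive coverage: {coverage:.2%}")
```

Because the measure ignores whether an analysis is right, it is usually reported alongside precision on a manually verified sample, as several of the abstracts on this page do.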

Expanding Universal Dependencies for Polysynthetic Languages: A Case of St. Lawrence Island Yupik
Hyunji Hayley Park | Lane Schwartz | Francis Tyers
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

This paper describes the development of the first Universal Dependencies (UD) treebank for St. Lawrence Island Yupik, an endangered language spoken in the Bering Strait region. While the UD guidelines provided a general framework for our annotations, language-specific decisions were made necessary by the rich morphology of the polysynthetic language. Most notably, we annotated a corpus at the morpheme level as well as the word level. The morpheme level annotation was conducted using an existing morphological analyzer and manual disambiguation. By comparing the two resulting annotation schemes, we argue that morpheme-level annotation is essential for polysynthetic languages like St. Lawrence Island Yupik. Word-level annotation results in degenerate trees for some Yupik sentences and often fails to capture syntactic relations that can be manifested at the morpheme level. Dependency parsing experiments provide further support for morpheme-level annotation. Implications for UD annotation of other polysynthetic languages are discussed.

The Relevance of the Source Language in Transfer Learning for ASR
Nils Hjortnaes | Niko Partanen | Michael Rießler | Francis M. Tyers
Proceedings of the 4th Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

Towards an Open Source Finite-State Morphological Analyzer for Zacatlán-Ahuacatlán-Tepetzintla Nahuatl
Robert Pugh | Francis Tyers
Proceedings of the 4th Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

This year’s iteration of the SIGMORPHON Shared Task on morphological reinflection focuses on typological diversity and cross-lingual variation of morphosyntactic features. In terms of the task, we enrich UniMorph with new data for 32 languages from 13 language families, with most of them being under-resourced: Kunwinjku, Classical Syriac, Arabic (Modern Standard, Egyptian, Gulf), Hebrew, Amharic, Aymara, Magahi, Braj, Kurdish (Central, Northern, Southern), Polish, Karelian, Livvi, Ludic, Veps, Võro, Evenki, Xibe, Tuvan, Sakha, Turkish, Indonesian, Kodi, Seneca, Asháninka, Yanesha, Chukchi, Itelmen, Eibela. We evaluate six systems on the new data and conduct an extensive error analysis of the systems’ predictions. Transformer-based models generally demonstrate superior performance on the majority of languages, achieving >90% accuracy on 65% of them. The languages on which systems yielded low accuracy are mainly under-resourced, with a limited amount of data. Most errors made by the systems are due to allomorphy, honorificity, and form variation. In addition, we observe that systems especially struggle to inflect multiword lemmas. The systems also produce misspelled forms or end up in repetitive loops (e.g., RNN-based models). Finally, we report a large drop in systems’ performance on previously unseen lemmas.

2020

A Finite-State Morphological Analyser for Evenki
Anna Zueva | Anastasia Kuznetsova | Francis Tyers
Proceedings of the Twelfth Language Resources and Evaluation Conference

It has been widely acknowledged that morphological analysis is an important step in automated text processing for morphologically rich languages. Evenki is a language with rich morphology, therefore a morphological analyser is highly desirable for processing Evenki texts and developing applications for Evenki. Although two morphological analysers for Evenki have already been developed, they are able to analyse less than half of the available Evenki corpora. The aim of this paper is to create a new morphological analyser for Evenki. It is implemented using the Helsinki Finite-State Transducer toolkit (HFST). The lexc formalism is used to specify the morphotactic rules, which define the valid orderings of morphemes in a word. Morphophonological alternations and orthographic rules are described using the twol formalism. The lexicon is extracted from available machine-readable dictionaries. Since a part of the corpora belongs to texts in Evenki dialects, a version of the analyser with relaxed rules is developed for processing dialectal features. We evaluate the analyser on available Evenki corpora and estimate precision, recall and F-score. We obtain coverage scores of between 61% and 87% on the available Evenki corpora.

An Unsupervised Method for Weighting Finite-state Morphological Analyzers
Amr Keleg | Francis Tyers | Nick Howell | Tommi Pirinen
Proceedings of the Twelfth Language Resources and Evaluation Conference

Morphological analysis is one of the tasks that have been studied for years. Different techniques have been used to develop models for performing morphological analysis. Models based on finite state transducers have proved to be more suitable for languages with low available resources. In this paper, we have developed a method for weighting a morphological analyzer built using finite state transducers in order to disambiguate its results. The method is based on a word2vec model that is trained in a completely unsupervised way using raw untagged corpora and is able to capture the semantic meaning of the words. Most of the methods used for disambiguating the results of a morphological analyzer have relied on tagged corpora that need to be manually built. Additionally, the method developed uses information about the token irrespective of its context, unlike most of the other techniques that heavily rely on the word’s context to disambiguate its set of candidate analyses.

Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework. The annotation consists in a linguistically motivated word segmentation; a morphological layer comprising lemmas, universal part-of-speech tags, and standardized morphological features; and a syntactic layer focusing on syntactic relations between predicates, arguments and modifiers. In this paper, we describe version 2 of the universal guidelines (UD v2), discuss the major changes from UD v1 to UD v2, and give an overview of the currently available treebanks for 90 languages.

The Common Voice corpus is a massively-multilingual collection of transcribed speech intended for speech technology research and development. Common Voice is designed for Automatic Speech Recognition purposes but can be useful in other domains (e.g. language identification). To achieve scale and sustainability, the Common Voice project employs crowdsourcing for both data collection and data validation. The most recent release includes 29 languages, and as of November 2019 there are a total of 38 languages collecting data. Over 50,000 individuals have participated so far, resulting in 2,500 hours of collected audio. To our knowledge this is the largest audio corpus in the public domain for speech recognition, both in terms of number of hours and number of languages. As an example use case for Common Voice, we present speech recognition experiments using Mozilla’s DeepSpeech Speech-to-Text toolkit. By applying transfer learning from a source English model, we find an average Character Error Rate improvement of 5.99 ± 5.48 for twelve target languages (German, French, Italian, Turkish, Catalan, Slovenian, Welsh, Irish, Breton, Tatar, Chuvash, and Kabyle). For most of these languages, these are the first ever published results on end-to-end Automatic Speech Recognition.

A broad goal in natural language processing (NLP) is to develop a system that has the capacity to process any natural language. Most systems, however, are developed using data from just one language such as English. The SIGMORPHON 2020 shared task on morphological reinflection aims to investigate systems’ ability to generalize across typologically distinct languages, many of which are low resource. Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages. A total of 22 systems (19 neural) from 10 teams were submitted to the task. All four winning systems were neural (two monolingual transformers and two massively multilingual RNN-based models with gated attention). Most teams demonstrate utility of data hallucination and augmentation, ensembles, and multilingual training for low-resource languages. Non-neural learners and manually designed grammars showed competitive and even superior performance on some languages (such as Ingrian, Tajik, Tagalog, Zarma, Lingala), especially with very limited data. Some language families (Afro-Asiatic, Niger-Congo, Turkic) were relatively easy for most systems and achieved over 90% mean accuracy while others were more challenging.

Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages
Tommi A Pirinen | Francis M. Tyers | Michael Rießler
Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages

Improving the Language Model for Low-Resource ASR with Online Text Corpora
Nils Hjortnaes | Timofey Arkhangelskiy | Niko Partanen | Michael Rießler | Francis Tyers
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

In this paper, we expand on previous work on automatic speech recognition in a low-resource scenario typical of data collected by field linguists. We train DeepSpeech models on 35 hours of dialectal Komi speech recordings and correct the output using language models constructed from various sources. Previous experiments showed that transfer learning using DeepSpeech can improve the accuracy of a speech recognizer for Komi, though the error rate remained very high. In this paper we present further experiments with language models created using KenLM from text materials available online. The language models are constructed from two corpora, one containing literary texts and one containing social media content, plus a third that combines the two. We then trained the model using each language model to explore the impact of the language model data source on the speech recognition model. Our results show significant improvements of over 25% in character error rate and nearly 20% in word error rate. This offers important methodological insight into how ASR results can be improved under low-resource conditions: transfer learning can be used to compensate for the lack of training data in the target language, and online texts are a very useful resource when developing language models in this context.
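
Character error rate (CER) and word error rate (WER), the two metrics reported in the abstract above, are conventionally computed as the Levenshtein edit distance between the reference and the recognizer output, divided by the reference length. The following is a minimal Python sketch of that conventional definition; it is not the authors' evaluation script, and the function names are chosen here for illustration only.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions cost 1)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference items
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance normalised by reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: compare the strings character by character."""
    return error_rate(list(reference), list(hypothesis))


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: compare whitespace-separated tokens."""
    return error_rate(reference.split(), hypothesis.split())


print(cer("kitten", "sitting"))
print(wer("a b c", "a x c"))
```

Both rates can exceed 1.0 when the hypothesis is much longer than the reference, so an "improvement" in either metric should always state whether it is absolute or relative.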

Dependency annotation of noun incorporation in polysynthetic languages
Francis Tyers | Karina Mishchenkova
Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)

This paper describes an approach to annotating noun incorporation in Universal Dependencies. It motivates the need to annotate this particular morphosyntactic phenomenon and justifies it with respect to frequency of the construction. A case study is presented in which the proposed annotation scheme is applied to Chukchi, a language that exhibits noun incorporation. We compare argument encoding in Chukchi, English and Russian and find that while in English and Russian discourse elements are primarily tracked through noun phrases and pronouns, in Chukchi they are tracked through agreement marking and incorporation, with a lesser role for noun phrases.

pdf abs
Universal Dependency Treebank for Xibe
He Zhou | Juyeon Chung | Sandra Kübler | Francis Tyers
Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)

We present our work on constructing the first treebank for the Xibe language following the Universal Dependencies (UD) annotation scheme. Xibe is a low-resourced and severely endangered Tungusic language spoken by the Xibe minority living in the Xinjiang Uygur Autonomous Region of China. We have collected 810 sentences so far, including 544 sentences from a grammar book on written Xibe and 266 sentences from Cabcal News. We annotated those sentences manually from scratch. In this paper, we report the procedure of building this treebank and analyze several important annotation issues in our treebank. Finally, we propose our plans for future work.

2019

pdf abs
Building a Morphological Analyser for Laz
Esra Onal | Francis Tyers
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

This study is an attempt to contribute to documentation and revitalization efforts for the endangered Laz language, a member of the South Caucasian language family mainly spoken on the northeastern coastline of Turkey. It constitutes the first steps towards a general computational model for word form recognition and production for Laz by building a rule-based morphological analyser using the Helsinki Finite-State Toolkit (HFST). The evaluation results show that the analyser has 64.9% coverage over a corpus of 111,365 tokens collected for this study. We have also performed an error analysis on 100 randomly selected tokens from the corpus that are not covered by the analyser; the results show that the errors mostly result from Turkish words in the corpus and missing stems in our lexicon.

pdf bib
Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages
Tommi A. Pirinen | Heiki-Jaan Kaalep | Francis M. Tyers
Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages

pdf bib
Data-Driven Morphological Analysis for Uralic Languages
Miikka Silfverberg | Francis Tyers
Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages

In this paper, we present the findings of the Third VarDial Evaluation Campaign organized as part of the sixth edition of the workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with NAACL 2019. This year, the campaign included five shared tasks, including one task re-run – German Dialect Identification (GDI) – and four new tasks – Cross-lingual Morphological Analysis (CMA), Discriminating between Mainland and Taiwan variation of Mandarin Chinese (DMT), Moldavian vs. Romanian Cross-dialect Topic identification (MRC), and Cuneiform Language Identification (CLI). A total of 22 teams submitted runs across the five shared tasks. After the end of the competition, we received 14 system description papers, which are published in the VarDial workshop proceedings and referred to in this report.

pdf abs
A New Annotation Scheme for the Sejong Part-of-speech Tagged Corpus
Jungyeul Park | Francis Tyers
Proceedings of the 13th Linguistic Annotation Workshop

In this paper we present a new annotation scheme for the Sejong part-of-speech tagged corpus based on Universal Dependencies style annotation. By using a new annotation scheme, we can produce Sejong-style morphological analysis and part-of-speech tagging results which have been the de facto standard for Korean language processing. We also explore the possibility of doing named-entity recognition and semantic-role labelling for Korean using the new annotation scheme.

pdf
A biscriptual morphological transducer for Crimean Tatar
Francis M. Tyers | Jonathan Washington | Darya Kavitskaya | Memduh Gökırmak | Nick Howell | Remziye Berberova
Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

pdf
A free/open-source rule-based machine translation system for Crimean Tatar to Turkish
Memduh Gökırmak | Francis Tyers | Jonathan Washington
Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages

pdf bib
Proceedings of the Celtic Language Technology Workshop
Teresa Lynn | Delyth Prys | Colin Batchelor | Francis Tyers
Proceedings of the Celtic Language Technology Workshop

pdf
Development of a Universal Dependencies treebank for Welsh
Johannes Heinecke | Francis M. Tyers
Proceedings of the Celtic Language Technology Workshop

pdf bib
Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019)
Alexandre Rademaker | Francis Tyers
Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019)

2018

pdf bib
Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages
Tommi A. Pirinen | Michael Rießler | Jack Rueter | Trond Trosterud | Francis M. Tyers
Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages

pdf
Towards an open-source universal-dependency treebank for Erzya
Jack Rueter | Francis Tyers
Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages

pdf abs
A prototype finite-state morphological analyser for Chukchi
Vasilisa Andriyanets | Francis Tyers
Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages

In this article we describe the application of finite-state transducers to the morphological and phonological systems of Chukchi, a polysynthetic language spoken in the north of the Russian Federation. The language exhibits progressive and regressive vowel harmony, productive incorporation and extensive circumfixing. To implement the analyser we use the well-known Helsinki Finite-State Toolkit (HFST). The resulting model covers the majority of the morphological and phonological processes. A brief evaluation carried out on publicly available corpora shows that the coverage of the transducer is between 53% and 76%. An error evaluation of 100 tokens randomly selected from the corpus that were not covered by the analyser shows that most of the morphological processes are covered and that the majority of errors are caused by a limited stem lexicon.

pdf abs
Can LSTM Learn to Capture Agreement? The Case of Basque
Shauli Ravfogel | Yoav Goldberg | Francis Tyers
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

Sequential neural network models are powerful tools in a variety of Natural Language Processing (NLP) tasks. The sequential nature of these models raises the questions: to what extent can these models implicitly learn hierarchical structures typical of human language, and what kind of grammatical phenomena can they acquire? We focus on the task of agreement prediction in Basque, as a case study for a task that requires implicit understanding of sentence structure and the acquisition of a complex but consistent morphological system. Analyzing experimental results from two syntactic prediction tasks – verb number prediction and suffix recovery – we find that sequential models perform worse on agreement prediction in Basque than one might expect on the basis of previous agreement prediction work in English. Tentative findings based on diagnostic classifiers suggest the network makes use of local heuristics as a proxy for the hierarchical structure of the sentence. We propose the Basque agreement prediction task as a challenging benchmark for models that attempt to learn regularities in human language.

pdf abs
Multi-source synthetic treebank creation for improved cross-lingual dependency parsing
Francis Tyers | Mariya Sheyanova | Aleksandra Martynova | Pavel Stepachev | Konstantin Vinogorodskiy
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

This paper describes a method of creating synthetic treebanks for cross-lingual dependency parsing using a combination of machine translation (including pivot translation), annotation projection and the spanning tree algorithm. Sentences are first automatically translated from a lesser-resourced language to a number of related highly-resourced languages, parsed and then the annotations are projected back to the lesser-resourced language, leading to multiple trees for each sentence from the lesser-resourced language. The final treebank is created by merging the possible trees into a graph and running the spanning tree algorithm to vote for the best tree for each sentence. We present experiments aimed at parsing Faroese using a combination of Danish, Swedish and Norwegian. In a similar experimental setup to the CoNLL 2018 shared task on dependency parsing we report state-of-the-art results on dependency parsing for Faroese using an off-the-shelf parser.

pdf
Finite-state morphological analysis for Gagauz
Francis Tyers | Sevilay Bayatli | Güllü Karanfil | Memduh Gökırmak
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf abs
Rule-based machine translation from Kazakh to Turkish
Sevilay Bayatli | Sefer Kurnaz | Ilnar Salimzyanov | Jonathan Washington | Francis M. Tyers
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

This paper presents a shallow-transfer machine translation (MT) system for translating from Kazakh to Turkish. Background on the differences between the languages is presented, followed by how the system was designed to handle some of these differences. The system is based on the Apertium free/open-source machine translation platform. The structure of the system and how it works is described, along with an evaluation against two competing systems. Linguistic components were developed, including a Kazakh-Turkish bilingual dictionary, Constraint Grammar disambiguation rules, lexical selection rules, and structural transfer rules. With many known issues yet to be addressed, our RBMT system has reached performance comparable to publicly-available corpus-based MT systems between the languages.

2017

pdf bib
Proceedings of the Third Workshop on Computational Linguistics for Uralic Languages
Francis M. Tyers | Michael Rießler | Tommi A. Pirinen | Trond Trosterud
Proceedings of the Third Workshop on Computational Linguistics for Uralic Languages

pdf
Annotation schemes in North Sámi dependency parsing
Francis M. Tyers | Mariya Sheyanova
Proceedings of the Third Workshop on Computational Linguistics for Uralic Languages

pdf
Finite-State Morphological Analysis for Marathi
Vinit Ravishankar | Francis M. Tyers
Proceedings of the 13th International Conference on Finite State Methods and Natural Language Processing (FSMNLP 2017)

pdf
A Dependency Treebank for Kurmanji Kurdish
Memduh Gökırmak | Francis M. Tyers
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

pdf
UD Annotatrix: An annotation tool for Universal Dependencies
Francis M. Tyers | Mariya Sheyanova | Jonathan North Washington
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories

pdf
Towards a dependency-annotated treebank for Bambara
Ekaterina Aplonova | Francis M. Tyers
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories

The Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets. In 2017, the task was devoted to learning dependency parsers for a large number of languages, in a real-world setting without any gold-standard annotation on input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. In this paper, we define the task and evaluation methodology, describe how the data sets were prepared, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.

pdf bib abs
Universal Dependencies
Joakim Nivre | Daniel Zeman | Filip Ginter | Francis Tyers
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts

Universal Dependencies (UD) is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages. This tutorial gives an introduction to the UD framework and resources, from basic design principles to annotation guidelines and existing treebanks. We also discuss tools for developing and exploiting UD treebanks and survey applications of UD in NLP and linguistics.

2016

pdf abs
A Finite-state Morphological Analyser for Tuvan
Francis Tyers | Aziyana Bayyr-ool | Aelita Salchak | Jonathan Washington
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper describes the development of free/open-source finite-state morphological transducers for Tuvan, a Turkic language spoken in and around the Tuvan Republic in Russia. The finite-state toolkit used for the work is the Helsinki Finite-State Toolkit (HFST); we use the lexc formalism for modelling the morphotactics and the twol formalism for modelling morphophonological alternations. We present a novel description of the morphological combinatorics of pseudo-derivational morphemes in Tuvan. An evaluation is presented which shows that the transducer has reasonable coverage (around 93%) on freely available corpora of the language, and high precision (over 99%) on a manually verified test set.

pdf abs
A Finite-State Morphological Analyser for Sindhi
Raveesh Motlani | Francis Tyers | Dipti Sharma
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Morphological analysis is a fundamental task in natural-language processing, which is used in other NLP applications such as part-of-speech tagging, syntactic parsing, information retrieval, and machine translation. In this paper, we present our work on the development of a free/open-source finite-state morphological analyser for Sindhi. We have used Apertium’s lttoolbox as our finite-state toolkit to implement the transducer. The system is developed using a paradigm-based approach, wherein a paradigm defines all the word forms and their morphological features for a given stem (lemma). We have evaluated our system on the Sindhi Wikipedia corpus and achieved a reasonable coverage of 81% and a precision of over 97%.

pdf
Apertium: a free/open source platform for machine translation and basic language technology
Mikel L. Forcada | Francis M. Tyers
Proceedings of the 19th Annual Conference of the European Association for Machine Translation: Projects/Products

The Universal Dependencies (UD) project was conceived after the substantial recent interest in unifying annotation schemes across languages. With its own annotation principles and abstract inventory for parts of speech, morphosyntactic features and dependency relations, UD aims to facilitate multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. This paper presents the Turkish IMST-UD Treebank, the first Turkish treebank to be in a UD release. The IMST-UD Treebank was automatically converted from the IMST Treebank, which was also recently released. We describe this conversion procedure in detail, complete with mapping tables. We also present our evaluation of the parsing performances of both versions of the IMST Treebank. Our findings suggest that the UD framework is at least as viable for Turkish as the original annotation framework of the IMST Treebank.

2015

pdf
Automatic word stress annotation of Russian unrestricted text
Robert Reynolds | Francis Tyers
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

pdf
Automatic conversion of colloquial Finnish to standard Finnish
Inari Listenmaa | Francis M. Tyers
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

pdf
Evaluating machine translation for assimilation via a gap-filling task
Ekaterina Ageeva | Mikel L. Forcada | Francis M. Tyers | Juan Antonio Pérez-Ortiz
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf
Unsupervised training of maximum-entropy models for lexical selection in rule-based machine translation
Francis M. Tyers | Felipe Sánchez-Martínez | Mikel L. Forcada
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf
Evaluating machine translation for assimilation via a gap-filling task
Ekaterina Ageeva | Francis M. Tyers | Mikel L. Forcada | Juan Antonio Pérez-Ortiz
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf
Unsupervised training of maximum-entropy models for lexical selection in rule-based machine translation
Francis M. Tyers | Felipe Sánchez-Martínez | Mikel L. Forcada
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

2014

pdf abs
Finite-state morphological transducers for three Kypchak languages
Jonathan Washington | Ilnar Salimzyanov | Francis Tyers
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper describes the development of free/open-source finite-state morphological transducers for three Turkic languages―Kazakh, Tatar, and Kumyk―representing one language from each of the three sub-branches of the Kypchak branch of Turkic. The finite-state toolkit used for the work is the Helsinki Finite-State Toolkit (HFST). This paper describes how the development of a transducer for each subsequent closely-related language took less development time. An evaluation is presented which shows that the transducers all have a reasonable coverage―around 90%―on freely available corpora of the languages, and high precision over a manually verified test set.

pdf
Subsegmental language detection in Celtic language text
Akshay Minocha | Francis Tyers
Proceedings of the First Celtic Language Technology Workshop

pdf
Why Implementation Matters: Evaluation of an Open-source Constraint Grammar Parser
Dávid Márk Nemeskey | Francis Tyers | Mans Hulden
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

pdf
A Free/Open-source Kazakh-Tatar Machine Translation System
Ilnar Salimzyanov | Jonathan Washington | Francis Tyers
Proceedings of Machine Translation Summit XIV: Papers

2012

pdf
Rule-based Machine Translation between Indonesian and Malaysian
Raymond Hendy Susanto | Septina Dian Larasati | Francis M. Tyers
Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing

pdf abs
Free/Open Source Shallow-Transfer Based Machine Translation for Spanish and Aragonese
Juan Pablo Martínez Cortés | Jim O’Regan | Francis Tyers
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This article describes the development of a bidirectional shallow-transfer based machine translation system for Spanish and Aragonese, based on the Apertium platform, reusing the resources provided by other translators built for the platform. The system, and the morphological analyser built for it, are both the first resources of their kind for Aragonese. The morphological analyser has coverage of over 80%, and is being reused to create a spelling checker for Aragonese. The translator is bidirectional: the Word Error Rate for Spanish to Aragonese is 16.83%, while Aragonese to Spanish is 11.61%.

pdf abs
A finite-state morphological transducer for Kyrgyz
Jonathan Washington | Mirlan Ipasov | Francis Tyers
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper describes the development of a free/open-source finite-state morphological transducer for Kyrgyz. The transducer has been developed for morphological generation for use within a prototype Turkish–Kyrgyz machine translation system, but has also been extensively tested for analysis. The finite-state toolkit used for the work was the Helsinki Finite-State Toolkit (HFST). The paper describes some issues in Kyrgyz morphology, the development of the tool, some linguistic issues encountered and how they were dealt with, and which issues are left to resolve. An evaluation is presented which shows that the transducer has medium-level coverage, between 82% and 87% on two freely available corpora of Kyrgyz, and high precision and recall over a manually verified test set.

pdf abs
A rule-based machine translation system from Serbo-Croatian to Macedonian
Hrvoje Peradin | Francis Tyers
Proceedings of the Third International Workshop on Free/Open-Source Rule-Based Machine Translation

This paper describes the development of a one-way machine translation system from Serbo-Croatian to Macedonian on the Apertium platform. Details of resources and development methods are given, as well as an evaluation, and general directives for future work.

pdf
Flexible finite-state lexical selection for rule-based machine translation
Francis M. Tyers | Felipe Sánchez-Martínez | Mikel L. Forcada
Proceedings of the 16th Annual Conference of the European Association for Machine Translation

2011

pdf
Rapid rule-based machine translation between Dutch and Afrikaans
Pim Otte | Francis M. Tyers
Proceedings of the 15th Annual Conference of the European Association for Machine Translation

pdf
Apertium-IceNLP: A rule-based Icelandic to English machine translation system
Martha Dís Brandt | Hrafn Loftsson | Hlynur Sigurþórsson | Francis M. Tyers
Proceedings of the 15th Annual Conference of the European Association for Machine Translation

pdf abs
An Italian to Catalan RBMT system reusing data from existing language pairs
Antonio Toral | Mireia Ginestí-Rosell | Francis Tyers
Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation

This paper presents an Italian→Catalan RBMT system automatically built by combining the linguistic data of the existing pairs Spanish–Catalan and Spanish–Italian. A lightweight manual postprocessing is carried out in order to fix inconsistencies in the automatically derived dictionaries and to add very frequent words that are missing according to a corpus analysis. The system is evaluated on the KDE4 corpus and outperforms Google Translate by approximately ten absolute points in terms of both TER and GTM.

2010

pdf
Rule-based Breton to French machine translation
Francis Tyers
Proceedings of the 14th Annual Conference of the European Association for Machine Translation

2009

pdf
Developing Prototypes for Machine Translation between Two Sami Languages
Francis M. Tyers | Linda Wiechetek | Trond Trosterud
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

pdf
Rule-Based Augmentation of Training Data in Breton-French Statistical Machine Translation
Francis M. Tyers
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

pdf bib
Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation
Juan Antonio Pérez-Ortiz | Felipe Sánchez-Martínez | Francis M. Tyers
Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation

pdf abs
The Apertium machine translation platform: Five years on
Mikel L. Forcada | Francis M. Tyers | Gema Ramírez-Sánchez
Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation

This paper describes Apertium: a free/open-source machine translation platform (engine, toolbox and data), its history, its philosophy of design, its technology, the community of developers, the research and business based on it, and its prospects and challenges, now that it is five years old.

pdf abs
Matxin: Moving towards language independence
Aingeru Mayor | Francis M. Tyers
Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation

This paper describes some of the issues found when adapting and extending the Matxin free-software machine translation system to other language pairs. It sketches out some of the characteristics of Matxin and offers some possible solutions to these issues.

pdf abs
Shallow-transfer rule-based machine translation for Swedish to Danish
Francis M. Tyers | Jacob Nordfalk
Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation

This article describes the development of a shallow-transfer machine translation system from Swedish to Danish in the Apertium platform. It gives details of the resources used, the methods for constructing the system and an evaluation of the translation quality. The quality is found to be comparable with that of current commercial systems, despite the particularly low coverage of the lexicons.

pdf abs
Development of a morphological analyser for Bengali
Abu Zaher Md Faridee | Francis M. Tyers
Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation

This article describes the development of an open-source morphological analyser for the Bengali language using finite-state technology. First we discuss the challenges of creating a morphological analyser for a highly inflectional language like Bengali and then propose a solution to that using lttoolbox, an open-source finite-state toolkit. We then evaluate the performance of our developed system and propose ways of improving it further.