Francis Tyers

Also published as: Francis M. Tyers

2021

pdf bib
Keyword spotting for audiovisual archival search in Uralic languages
Nils Hjortnaes | Niko Partanen | Francis M. Tyers
Proceedings of the Seventh International Workshop on Computational Linguistics of Uralic Languages

pdf bib abs
A corpus of K’iche’ annotated for morphosyntactic structure
Francis Tyers | Robert Henderson
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

This article describes a collection of sentences in K’iche’ annotated for morphology and syntax. K’iche’ is a language in the Mayan language family, spoken in Guatemala. The annotation is done according to the guidelines of the Universal Dependencies project. The corpus consists of a total of 1,433 sentences containing approximately 10,000 tokens and is released under a free/open-source licence. We present a comparison of parsing systems for K’iche’ using this corpus and describe how it can be used for mining linguistic examples.

pdf bib abs
Investigating variation in written forms of Nahuatl using character-based language models
Robert Pugh | Francis Tyers
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

We describe experiments with character-based language modeling for written variants of Nahuatl. Using a standard LSTM model and publicly available Bible translations, we explore how character language models can be applied to the tasks of estimating mutual intelligibility, identifying genetic similarity, and distinguishing written variants. We demonstrate that these simple language models are able to capture similarities and differences that have been described in the linguistic literature.

pdf bib abs
A survey of part-of-speech tagging approaches applied to K’iche’
Francis Tyers | Nick Howell
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

We study the performance of several popular neural part-of-speech taggers from the Universal Dependencies ecosystem on Mayan languages using a small corpus of 1435 annotated K’iche’ sentences consisting of approximately 10,000 tokens, with encouraging results: F₁ scores 93%+ on lemmatisation, part-of-speech and morphological feature assignment. The high performance motivates a cross-language part-of-speech tagging study, where K’iche’-trained models are evaluated on two other Mayan languages, Kaqchikel and Uspanteko: performance on Kaqchikel is good, 63-85%, and on Uspanteko modest, 60-71%. Supporting experiments lead us to conclude the relative diversity of morphological features as a plausible explanation for the limiting factors in cross-language tagging performance, providing some direction for future sentence annotation and collection work to support these and other Mayan languages.

pdf bib abs
A finite-state morphological analyser for Paraguayan Guaraní
Anastasia Kuznetsova | Francis Tyers
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

This article describes the development of morphological analyser for Paraguayan Guaraní, agglutinative indigenous language spoken by nearly 6 million people in South America. The implementation of our analyser uses HFST (Helsiki Finite State Technology) and two-level transducer that covers morphotactics and phonological processes occurring in Guaraní. We assess the efficacy of the approach on publicly available Wikipedia and Bible corpora and the naive coverage of analyser reaches 86% on Wikipedia and 91% on Bible corpora.

pdf bib abs
Expanding Universal Dependencies for Polysynthetic Languages: A Case of St. Lawrence Island Yupik
Hyunji Park | Lane Schwartz | Francis Tyers
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

This paper describes the development of the first Universal Dependencies (UD) treebank for St. Lawrence Island Yupik, an endangered language spoken in the Bering Strait region. While the UD guidelines provided a general framework for our annotations, language-specific decisions were made necessary by the rich morphology of the polysynthetic language. Most notably, we annotated a corpus at the morpheme level as well as the word level. The morpheme level annotation was conducted using an existing morphological analyzer and manual disambiguation. By comparing the two resulting annotation schemes, we argue that morpheme-level annotation is essential for polysynthetic languages like St. Lawrence Island Yupik. Word-level annotation results in degenerate trees for some Yupik sentences and often fails to capture syntactic relations that can be manifested at the morpheme level. Dependency parsing experiments provide further support for morpheme-level annotation. Implications for UD annotation of other polysynthetic languages are discussed.

This year's iteration of the SIGMORPHON Shared Task on morphological reinflection focuses on typological diversity and cross-lingual variation of morphosyntactic features. In terms of the task, we enrich UniMorph with new data for 32 languages from 13 language families, with most of them being under-resourced: Kunwinjku, Classical Syriac, Arabic (Modern Standard, Egyptian, Gulf), Hebrew, Amharic, Aymara, Magahi, Braj, Kurdish (Central, Northern, Southern), Polish, Karelian, Livvi, Ludic, Veps, Võro, Evenki, Xibe, Tuvan, Sakha, Turkish, Indonesian, Kodi, Seneca, Asháninka, Yanesha, Chukchi, Itelmen, Eibela. We evaluate six systems on the new data and conduct an extensive error analysis of the systems' predictions. Transformer-based models generally demonstrate superior performance on the majority of languages, achieving >90% accuracy on 65% of them. The languages on which systems yielded low accuracy are mainly under-resourced, with a limited amount of data. Most errors made by the systems are due to allomorphy, honorificity, and form variation. In addition, we observe that systems especially struggle to inflect multiword lemmas. The systems also produce misspelled forms or end up in repetitive loops (e.g., RNN-based models). Finally, we report a large drop in systems' performance on previously unseen lemmas.

pdf bib
The Relevance of the Source Language in Transfer Learning for ASR
Nils Hjortnaes | Niko Partanen | Michael Rießler | Francis M. Tyers
Proceedings of the 4th Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

pdf bib
Towards an Open Source Finite-State Morphological Analyzer for Zacatlán-Ahuacatlán-Tepetzintla Nahuatl
Robert Pugh | Francis Tyers
Proceedings of the 4th Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

pdf bib abs
Do RNN States Encode Abstract Phonological Alternations?
Miikka Silfverberg | Francis Tyers | Garrett Nicolai | Mans Hulden
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Sequence-to-sequence models have delivered impressive results in word formation tasks such as morphological inflection, often learning to model subtle morphophonological details with limited training data. Despite the performance, the opacity of neural models makes it difficult to determine whether complex generalizations are learned, or whether a kind of separate rote memorization of each morphophonological process takes place. To investigate whether complex alternations are simply memorized or whether there is some level of generalization across related sound changes in a sequence-to-sequence model, we perform several experiments on Finnish consonant gradation—a complex set of sound changes triggered in some words by certain suffixes. We find that our models often—though not always—encode 17 different consonant gradation processes in a handful of dimensions in the RNN. We also show that by scaling the activations in these dimensions we can control whether consonant gradation occurs and the direction of the gradation.

Recent advances in neural machine translation (NMT) have pushed the quality of machine translation systems to the point where they are becoming widely adopted to build competitive systems. However, there is still a large number of languages that are yet to reap the benefits of NMT. In this paper, we provide the first large-scale case study of the practical application of MT in the Turkic language family in order to realize the gains of NMT for Turkic languages under high-resource to extremely low-resource scenarios. In addition to presenting an extensive analysis that identifies the bottlenecks towards building competitive systems to ameliorate data scarcity, our study has several key contributions, including, i) a large parallel corpus covering 22 Turkic languages consisting of common public datasets in combination with new datasets of approximately 1.4 million parallel sentences, ii) bilingual baselines for 26 language pairs, iii) novel high-quality test sets in three different translation domains and iv) human evaluation scores. All models, scripts, and data will be released to the public.

Despite the increasing number of large and comprehensive machine translation (MT) systems, evaluation of these methods in various languages has been restrained by the lack of high-quality parallel corpora as well as engagement with the people that speak these languages. In this study, we present an evaluation of state-of-the-art approaches to training and evaluating MT systems in 22 languages from the Turkic language family, most of which being extremely under-explored. First, we adopt the TIL Corpus with a few key improvements to the training and the evaluation sets. Then, we train 26 bilingual baselines as well as a multi-way neural MT (MNMT) model using the corpus and perform an extensive analysis using automatic metrics as well as human evaluations. We find that the MNMT model outperforms almost all bilingual baselines in the out-of-domain test sets and finetuning the model on a downstream task of a single pair also results in a huge performance boost in both low- and high-resource scenarios. Our attentive analysis of evaluation criteria for MT models in Turkic languages also points to the necessity for further research in this direction. We release the corpus splits, test sets as well as models to the public.

2020

pdf bib abs
Dependency annotation of noun incorporation in polysynthetic languages
Francis Tyers | Karina Mishchenkova
Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)

This paper describes an approach to annotating noun incorporation in Universal Dependencies. It motivates the need to annotate this particular morphosyntactic phenomenon and justifies it with respect to frequency of the construction. A case study is presented in which the proposed annotation scheme is applied to Chukchi, a language that exhibits noun incorporation. We compare argument encoding in Chukchi, English and Russian and find that while in English and Russian discourse elements are primarily tracked through noun phrases and pronouns, in Chukchi they are tracked through agreement marking and incorporation, with a lesser role for noun phrases.

pdf bib abs
Universal Dependency Treebank for Xibe
He Zhou | Juyeon Chung | Sandra Kübler | Francis Tyers
Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)

We present our work of constructing the first treebank for the Xibe language following the Universal Dependencies (UD) annotation scheme. Xibe is a low-resourced and severely endangered Tungusic language spoken by the Xibe minority living in the Xinjiang Uygur Autonomous Region of China. We collected 810 sentences so far, including 544 sentences from a grammar book on written Xibe and 266 sentences from Cabcal News. We annotated those sentences manually from scratch. In this paper, we report the procedure of building this treebank and analyze several important annotation issues of our treebank. Finally, we propose our plans for future work.

A broad goal in natural language processing (NLP) is to develop a system that has the capacity to process any natural language. Most systems, however, are developed using data from just one language such as English. The SIGMORPHON 2020 shared task on morphological reinflection aims to investigate systems’ ability to generalize across typologically distinct languages, many of which are low resource. Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages. A total of 22 systems (19 neural) from 10 teams were submitted to the task. All four winning systems were neural (two monolingual transformers and two massively multilingual RNN-based models with gated attention). Most teams demonstrate utility of data hallucination and augmentation, ensembles, and multilingual training for low-resource languages. Non-neural learners and manually designed grammars showed competitive and even superior performance on some languages (such as Ingrian, Tajik, Tagalog, Zarma, Lingala), especially with very limited data. Some language families (Afro-Asiatic, Niger-Congo, Turkic) were relatively easy for most systems and achieved over 90% mean accuracy while others were more challenging.

pdf bib abs
Improving the Language Model for Low-Resource ASR with Online Text Corpora
Nils Hjortnaes | Timofey Arkhangelskiy | Niko Partanen | Michael Rießler | Francis Tyers
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

In this paper, we expand on previous work on automatic speech recognition in a low-resource scenario typical of data collected by field linguists. We train DeepSpeech models on 35 hours of dialectal Komi speech recordings and correct the output using language models constructed from various sources. Previous experiments showed that transfer learning using DeepSpeech can improve the accuracy of a speech recognizer for Komi, though the error rate remained very high. In this paper we present further experiments with language models created using KenLM from text materials available online. These are constructed from two corpora, one containing literary texts, one for social media content, and another combining the two. We then trained the model using each language model to explore the impact of the language model data source on the speech recognition model. Our results show significant improvements of over 25% in character error rate and nearly 20% in word error rate. This offers important methodological insight into how ASR results can be improved under low-resource conditions: transfer learning can be used to compensate the lack of training data in the target language, and online texts are a very useful resource when developing language models in this context.

pdf bib
Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages
Tommi A Pirinen | Francis M. Tyers | Michael Rießler
Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages

pdf bib abs
A Finite-State Morphological Analyser for Evenki
Anna Zueva | Anastasia Kuznetsova | Francis Tyers
Proceedings of the 12th Language Resources and Evaluation Conference

It has been widely admitted that morphological analysis is an important step in automated text processing for morphologically rich languages. Evenki is a language with rich morphology, therefore a morphological analyser is highly desirable for processing Evenki texts and developing applications for Evenki. Although two morphological analysers for Evenki have already been developed, they are able to analyse less than a half of the available Evenki corpora. The aim of this paper is to create a new morphological analyser for Evenki. It is implemented using the Helsinki Finite-State Transducer toolkit (HFST). The lexc formalism is used to specify the morphotactic rules, which define the valid orderings of morphemes in a word. Morphophonological alternations and orthographic rules are described using the twol formalism. The lexicon is extracted from available machine-readable dictionaries. Since a part of the corpora belongs to texts in Evenki dialects, a version of the analyser with relaxed rules is developed for processing dialectal features. We evaluate the analyser on available Evenki corpora and estimate precision, recall and F-score. We obtain coverage scores of between 61% and 87% on the available Evenki corpora.

pdf bib abs
An Unsupervised Method for Weighting Finite-state Morphological Analyzers
Amr Keleg | Francis Tyers | Nick Howell | Tommi Pirinen
Proceedings of the 12th Language Resources and Evaluation Conference

Morphological analysis is one of the tasks that have been studied for years. Different techniques have been used to develop models for performing morphological analysis. Models based on finite state transducers have proved to be more suitable for languages with low available resources. In this paper, we have developed a method for weighting a morphological analyzer built using finite state transducers in order to disambiguate its results. The method is based on a word2vec model that is trained in a completely unsupervised way using raw untagged corpora and is able to capture the semantic meaning of the words. Most of the methods used for disambiguating the results of a morphological analyzer relied on having tagged corpora that need to manually built. Additionally, the method developed uses information about the token irrespective of its context unlike most of the other techniques that heavily rely on the word’s context to disambiguate its set of candidate analyses.

Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework. The annotation consists in a linguistically motivated word segmentation; a morphological layer comprising lemmas, universal part-of-speech tags, and standardized morphological features; and a syntactic layer focusing on syntactic relations between predicates, arguments and modifiers. In this paper, we describe version 2 of the universal guidelines (UD v2), discuss the major changes from UD v1 to UD v2, and give an overview of the currently available treebanks for 90 languages.

The Common Voice corpus is a massively-multilingual collection of transcribed speech intended for speech technology research and development. Common Voice is designed for Automatic Speech Recognition purposes but can be useful in other domains (e.g. language identification). To achieve scale and sustainability, the Common Voice project employs crowdsourcing for both data collection and data validation. The most recent release includes 29 languages, and as of November 2019 there are a total of 38 languages collecting data. Over 50,000 individuals have participated so far, resulting in 2,500 hours of collected audio. To our knowledge this is the largest audio corpus in the public domain for speech recognition, both in terms of number of hours and number of languages. As an example use case for Common Voice, we present speech recognition experiments using Mozilla’s DeepSpeech Speech-to-Text toolkit. By applying transfer learning from a source English model, we find an average Character Error Rate improvement of 5.99 ± 5.48 for twelve target languages (German, French, Italian, Turkish, Catalan, Slovenian, Welsh, Irish, Breton, Tatar, Chuvash, and Kabyle). For most of these languages, these are the first ever published results on end-to-end Automatic Speech Recognition.

2019

pdf bib
Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages
Tommi A. Pirinen | Heiki-Jaan Kaalep | Francis M. Tyers
Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages

pdf bib
Data-Driven Morphological Analysis for Uralic Languages
Miikka Silfverberg | Francis Tyers
Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages

In this paper, we present the findings of the Third VarDial Evaluation Campaign organized as part of the sixth edition of the workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with NAACL 2019. This year, the campaign included five shared tasks, including one task re-run – German Dialect Identification (GDI) – and four new tasks – Cross-lingual Morphological Analysis (CMA), Discriminating between Mainland and Taiwan variation of Mandarin Chinese (DMT), Moldavian vs. Romanian Cross-dialect Topic identification (MRC), and Cuneiform Language Identification (CLI). A total of 22 teams submitted runs across the five shared tasks. After the end of the competition, we received 14 system description papers, which are published in the VarDial workshop proceedings and referred to in this report.

pdf bib abs
A New Annotation Scheme for the Sejong Part-of-speech Tagged Corpus
Jungyeul Park | Francis Tyers
Proceedings of the 13th Linguistic Annotation Workshop

In this paper we present a new annotation scheme for the Sejong part-of-speech tagged corpus based on Universal Dependencies style annotation. By using a new annotation scheme, we can produce Sejong-style morphological analysis and part-of-speech tagging results which have been the de facto standard for Korean language processing. We also explore the possibility of doing named-entity recognition and semantic-role labelling for Korean using the new annotation scheme.

pdf bib
A biscriptual morphological transducer for Crimean Tatar
Francis M. Tyers | Jonathan Washington | Darya Kavitskaya | Memduh Gökırmak | Nick Howell | Remziye Berberova
Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

pdf bib
Machine Translation for Crimean Tatar to Turkish
Memduh Gökırmak | Francis Tyers | Jonathan Washington
Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages

pdf bib
Proceedings of the Celtic Language Technology Workshop
Teresa Lynn | Delyth Prys | Colin Batchelor | Francis Tyers
Proceedings of the Celtic Language Technology Workshop

pdf bib
Development of a Universal Dependencies treebank for Welsh
Johannes Heinecke | Francis M. Tyers
Proceedings of the Celtic Language Technology Workshop

pdf bib
Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019)
Alexandre Rademaker | Francis Tyers
Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019)

pdf bib abs
Building a Morphological Analyser for Laz
Esra Onal | Francis Tyers
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

This study is an attempt to contribute to documentation and revitalization efforts of endangered Laz language, a member of South Caucasian language family mainly spoken on northeastern coastline of Turkey. It constitutes the first steps to create a general computational model for word form recognition and production for Laz by building a rule-based morphological analyser using Helsinki Finite-State Toolkit (HFST). The evaluation results show that the analyser has a 64.9% coverage over a corpus collected for this study with 111,365 tokens. We have also performed an error analysis on randomly selected 100 tokens from the corpus which are not covered by the analyser, and these results show that the errors mostly result from Turkish words in the corpus and missing stems in our lexicon.

2018

pdf bib
Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages
Tommi A. Pirinen | Michael Rießler | Jack Rueter | Trond Trosterud | Francis M. Tyers
Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages

pdf bib
Towards an open-source universal-dependency treebank for Erzya
Jack Rueter | Francis Tyers
Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages

pdf bib abs
A prototype finite-state morphological analyser for Chukchi
Vasilisa Andriyanets | Francis Tyers
Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages

In this article we describe the application of finite-state transducers to the morphological and phonological systems of Chukchi, a polysynthetic language spoken in the north of the Russian Federation. The language exhibits progressive and regressive vowel harmony, productive incorporation and extensive circumfixing. To implement the analyser we use the well-known Helsinki Finite-State Toolkit (HFST). The resulting model covers the majority of the morphological and phonological processes. A brief evaluation carried out on publically-available corpora shows that the coverage of the transducer is between and 53% and 76%. An error evaluation of 100 tokens randomly selected from the corpus, which were not covered by the analyser shows that most of the morphological processes are covered and that the majority of errors are caused by a limited stem lexicon.

pdf bib abs
Can LSTM Learn to Capture Agreement? The Case of Basque
Shauli Ravfogel | Yoav Goldberg | Francis Tyers
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

Sequential neural networks models are powerful tools in a variety of Natural Language Processing (NLP) tasks. The sequential nature of these models raises the questions: to what extent can these models implicitly learn hierarchical structures typical to human language, and what kind of grammatical phenomena can they acquire? We focus on the task of agreement prediction in Basque, as a case study for a task that requires implicit understanding of sentence structure and the acquisition of a complex but consistent morphological system. Analyzing experimental results from two syntactic prediction tasks – verb number prediction and suffix recovery – we find that sequential models perform worse on agreement prediction in Basque than one might expect on the basis of a previous agreement prediction work in English. Tentative findings based on diagnostic classifiers suggest the network makes use of local heuristics as a proxy for the hierarchical structure of the sentence. We propose the Basque agreement prediction task as challenging benchmark for models that attempt to learn regularities in human language.

pdf bib abs
Multi-source synthetic treebank creation for improved cross-lingual dependency parsing
Francis Tyers | Mariya Sheyanova | Aleksandra Martynova | Pavel Stepachev | Konstantin Vinogorodskiy
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

This paper describes a method of creating synthetic treebanks for cross-lingual dependency parsing using a combination of machine translation (including pivot translation), annotation projection and the spanning tree algorithm. Sentences are first automatically translated from a lesser-resourced language to a number of related highly-resourced languages, parsed and then the annotations are projected back to the lesser-resourced language, leading to multiple trees for each sentence from the lesser-resourced language. The final treebank is created by merging the possible trees into a graph and running the spanning tree algorithm to vote for the best tree for each sentence. We present experiments aimed at parsing Faroese using a combination of Danish, Swedish and Norwegian. In a similar experimental setup to the CoNLL 2018 shared task on dependency parsing we report state-of-the-art results on dependency parsing for Faroese using an off-the-shelf parser.

pdf bib
Finite-state morphological analysis for Gagauz
Francis Tyers | Sevilay Bayatli | Güllü Karanfil | Memduh Gökırmak | Francis M. Tyers
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib
Proceedings of the Third Workshop on Computational Linguistics for Uralic Languages
Francis M. Tyers | Michael Rießler | Tommi A. Pirinen | Trond Trosterud
Proceedings of the Third Workshop on Computational Linguistics for Uralic Languages

pdf bib
Annotation schemes in North Sámi dependency parsing
Francis M. Tyers | Mariya Sheyanova
Proceedings of the Third Workshop on Computational Linguistics for Uralic Languages

pdf bib
Finite-State Morphological Analysis for Marathi
Vinit Ravishankar | Francis M. Tyers
Proceedings of the 13th International Conference on Finite State Methods and Natural Language Processing (FSMNLP 2017)

pdf bib
A Dependency Treebank for Kurmanji Kurdish
Memduh Gökırmak | Francis M. Tyers
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

pdf bib
UD Annotatrix: An annotation tool for Universal Dependencies
Francis M. Tyers | Mariya Sheyanova | Jonathan North Washington
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories

pdf bib
Towards a dependency-annotated treebank for Bambara
Ekaterina Aplonova | Francis M. Tyers
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories

The Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets. In 2017, the task was devoted to learning dependency parsers for a large number of languages, in a real-world setting without any gold-standard annotation on input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. In this paper, we define the task and evaluation methodology, describe how the data sets were prepared, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.

pdf bib abs
Universal Dependencies
Joakim Nivre | Daniel Zeman | Filip Ginter | Francis Tyers
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts

Universal Dependencies (UD) is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages. This tutorial gives an introduction to the UD framework and resources, from basic design principles to annotation guidelines and existing treebanks. We also discuss tools for developing and exploiting UD treebanks and survey applications of UD in NLP and linguistics.

2016

The Universal Dependencies (UD) project was conceived after the substantial recent interest in unifying annotation schemes across languages. With its own annotation principles and abstract inventory for parts of speech, morphosyntactic features and dependency relations, UD aims to facilitate multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. This paper presents the Turkish IMST-UD Treebank, the first Turkish treebank to be in a UD release. The IMST-UD Treebank was automatically converted from the IMST Treebank, which was also recently released. We describe this conversion procedure in detail, complete with mapping tables. We also present our evaluation of the parsing performances of both versions of the IMST Treebank. Our findings suggest that the UD framework is at least as viable for Turkish as the original annotation framework of the IMST Treebank.

pdf bib
Apertium: a free/open source platform for machine translation and basic language technology
Mikel L. Forcada | Francis M. Tyers
Proceedings of the 19th Annual Conference of the European Association for Machine Translation: Projects/Products

pdf bib abs
A Finite-state Morphological Analyser for Tuvan
Francis Tyers | Aziyana Bayyr-ool | Aelita Salchak | Jonathan Washington
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

~This paper describes the development of free/open-source finite-state morphological transducers for Tuvan, a Turkic language spoken in and around the Tuvan Republic in Russia. The finite-state toolkit used for the work is the Helsinki Finite-State Toolkit (HFST), we use the lexc formalism for modelling the morphotactics and twol formalism for modelling morphophonological alternations. We present a novel description of the morphological combinatorics of pseudo-derivational morphemes in Tuvan. An evaluation is presented which shows that the transducer has a reasonable coverage―around 93%―on freely-available corpora of the languages, and high precision―over 99%―on a manually verified test set.

pdf bib abs
A Finite-State Morphological Analyser for Sindhi
Raveesh Motlani | Francis Tyers | Dipti Sharma
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Morphological analysis is a fundamental task in natural-language processing, which is used in other NLP applications such as part-of-speech tagging, syntactic parsing, information retrieval, machine translation, etc. In this paper, we present our work on the development of free/open-source finite-state morphological analyser for Sindhi. We have used Apertium’s lttoolbox as our finite-state toolkit to implement the transducer. The system is developed using a paradigm-based approach, wherein a paradigm defines all the word forms and their morphological features for a given stem (lemma). We have evaluated our system on the Sindhi Wikipedia corpus and achieved a reasonable coverage of 81% and a precision of over 97%.

2015

pdf bib
Automatic word stress annotation of Russian unrestricted text
Robert Reynolds | Francis Tyers
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

pdf bib
Automatic conversion of colloquial Finnishto standard Finnish
Inari Listenmaa | Francis M. Tyers
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

pdf bib
Evaluating machine translation for assimilation via a gap-filling task
Ekaterina Ageeva | Mikel L. Forcada | Francis M. Tyers | Juan Antonio Pérez-Ortiz
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf bib
Unsupervised training of maximum-entropy models for lexical selection in rule-based machine translation
Francis M. Tyers | Felipe Sánchez-Martínez | Mikel L. Forcada
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf bib
Evaluating machine translation for assimilation via a gap-filling task
Ekaterina Ageeva | Francis M. Tyers | Mikel L. Forcada | Juan Antonio Pérez-Ortiz
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf bib
Unsupervised training of maximum-entropy models for lexical selection i in rule-based machine translation
Francis M. Tyers | Felipe Sánchez-Martinez | Mikel L. Forcada
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

2014

pdf bib abs
Finite-state morphological transducers for three Kypchak languages
Jonathan Washington | Ilnar Salimzyanov | Francis Tyers
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper describes the development of free/open-source finite-state morphological transducers for three Turkic languages―Kazakh, Tatar, and Kumyk―representing one language from each of the three sub-branches of the Kypchak branch of Turkic. The finite-state toolkit used for the work is the Helsinki Finite-State Toolkit (HFST). This paper describes how the development of a transducer for each subsequent closely-related language took less development time. An evaluation is presented which shows that the transducers all have a reasonable coverage―around 90\%―on freely available corpora of the languages, and high precision over a manually verified test set.

pdf bib
Subsegmental language detection in Celtic language text
Akshay Minocha | Francis Tyers
Proceedings of the First Celtic Language Technology Workshop

pdf bib
Why Implementation Matters: Evaluation of an Open-source Constraint Grammar Parser
Dávid Márk Nemeskey | Francis Tyers | Mans Hulden
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

pdf bib
A Free/Open-source Kazakh-Tatar Machine Translation System
Ilnar Salimzyanov | Jonathan Washington | Francis Tyers
Proceedings of Machine Translation Summit XIV: Papers

2012

pdf bib
Rule-based Machine Translation between Indonesian and Malaysian
Raymond Hendy Susanto | Septina Dian Larasati | Francis M. Tyers
Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing

pdf bib abs
A rule-based machine translation system from Serbo-Croatian to Macedonian
Hrvoje Peradin | Francis Tyers
Proceedings of the Third International Workshop on Free/Open-Source Rule-Based Machine Translation

This paper describes the development of a one-way machine translation system from SerboCroatian to Macedonian on the Apertium platform. Details of resources and development methods are given, as well as an evaluation, and general directives for future work.

pdf bib
Flexible finite-state lexical selection for rule-based machine translation
Francis M. Tyers | Felipe Sánchez-Martínez | Mikel L. Forcada
Proceedings of the 16th Annual conference of the European Association for Machine Translation

pdf bib abs
Free/Open Source Shallow-Transfer Based Machine Translation for Spanish and Aragonese
Juan Pablo Martínez Cortés | Jim O’Regan | Francis Tyers
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This article describes the development of a bidirectional shallow-transfer based machine translation system for Spanish and Aragonese, based on the Apertium platform, reusing the resources provided by other translators built for the platform. The system, and the morphological analyser built for it, are both the first resources of their kind for Aragonese. The morphological analyser has coverage of over 80\%, and is being reused to create a spelling checker for Aragonese. The translator is bidirectional: the Word Error Rate for Spanish to Aragonese is 16.83%, while Aragonese to Spanish is 11.61%.

pdf bib abs
A finite-state morphological transducer for Kyrgyz
Jonathan Washington | Mirlan Ipasov | Francis Tyers
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper describes the development of a free/open-source finite-state morphological transducer for Kyrgyz. The transducer has been developed for morphological generation for use within a prototype TurkishâKyrgyz machine translation system, but has also been extensively tested for analysis. The finite-state toolkit used for the work was the Helsinki Finite-State Toolkit (HFST). The paper describes some issues in Kyrgyz morphology, the development of the tool, some linguistic issues encountered and how they were dealt with, and which issues are left to resolve. An evaluation is presented which shows that the transducer has medium-level coverage, between 82% and 87% on two freely available corpora of Kyrgyz, and high precision and recall over a manually verified test set.

2011

pdf bib
Rapid rule-based machine translation between Dutch and Afrikaans
Pim Otte | Francis M. Tyers
Proceedings of the 15th Annual conference of the European Association for Machine Translation

pdf bib
Apertium-IceNLP: A rule-based Icelandic to English machine translation system
Martha Dís Brandt | Hrafh Loftsson | Hlynur Sigurþórsson | Francis M. Tyers
Proceedings of the 15th Annual conference of the European Association for Machine Translation

pdf bib abs
An Italian to Catalan RBMT system reusing data from existing language pairs
Antonio Toral | Mireia Ginestí-Rosell | Francis Tyers
Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation

This paper presents an Italian→Catalan RBMT system automatically built by combining the linguistic data of the existing pairs Spanish–Catalan and Spanish–Italian. A lightweight manual postprocessing is carried out in order to fix inconsistencies in the automatically derived dictionaries and to add very frequent words that are missing according to a corpus analysis. The system is evaluated on the KDE4 corpus and outperforms Google Translate by approximately ten absolute points in terms of both TER and GTM.

2010

pdf bib
Rule-based Breton to French machine translation
Francis Tyers
Proceedings of the 14th Annual conference of the European Association for Machine Translation

2009

pdf bib
Developing Prototypes for Machine Translation between Two Sami Languages
Francis M. Tyers | Linda Wiechetek | Trond Trosterud
Proceedings of the 13th Annual conference of the European Association for Machine Translation

pdf bib
Rule-Based Augmentation of Training Data in Breton-French Statistical Machine Translation
Francis M. Tyers
Proceedings of the 13th Annual conference of the European Association for Machine Translation

pdf bib
Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation
Juan Antonio Pérez-Ortiz | Felipe Sánchez-Martinez | Francis M. Tyers
Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation

pdf bib abs
The Apertium machine translation platform: Five years on
Mikel L. Forcada | Francis M. Tyers | Gema Ramírez-Sánchez
Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation

This paper describes Apertium: a free/open-source machine translation platform (engine, toolbox and data), its history, its philosophy of design, its technology, the community of developers, the research and business based on it, and its prospects and challenges, now that it is five years old.

pdf bib abs
Matxin: Moving towards language independence
Aingeru Mayor | Francis M. Tyers
Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation

This paper describes some of the issues found when adapting and extending the Matxin free-software machine translation system to other language pairs. It sketches out some of the characteristics of Matxin and offers some possible solutions to these issues.

pdf bib abs
Shallow-transfer rule-based machine translation for Swedish to Danish
Francis M. Tyers | Jacob Nordfalk
Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation

This article describes the development of a shallow-transfer machine translation system from Swedish to Danish in the Apertium platform. It gives details of the resources used, the methods for constructing the system and an evaluation of the translation quality. The quality is found to be comparable with that of current commercial systems, despite the particularly low coverage of the lexicons.

pdf bib abs
Development of a morphological analyser for Bengali
Abu Zaher Md Faridee | Francis M. Tyers
Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation

This article describes the development of an open-source morphological analyser for Bengali Language using 􏰁nitestate technology. First we discuss the challenges of creating a morphological analyser for a highly in􏰂ectional language like Bengali and then propose a solution to that using lttoolbox, an open-source 􏰁nite-state toolkit. We then evaluate the performance of our developed system and propose ways of improving it further.