Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects (2019)


Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects

Marcos Zampieri | Preslav Nakov | Shervin Malmasi | Nikola Ljubešić | Jörg Tiedemann | Ahmed Ali

A Report on the Third VarDial Evaluation Campaign
Marcos Zampieri | Shervin Malmasi | Yves Scherrer | Tanja Samardžić | Francis Tyers | Miikka Silfverberg | Natalia Klyueva | Tung-Le Pan | Chu-Ren Huang | Radu Tudor Ionescu | Andrei M. Butnaru | Tommi Jauhiainen

In this paper, we present the findings of the Third VarDial Evaluation Campaign organized as part of the sixth edition of the workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with NAACL 2019. This year, the campaign included five shared tasks: one task re-run – German Dialect Identification (GDI) – and four new tasks – Cross-lingual Morphological Analysis (CMA), Discriminating between Mainland and Taiwan variation of Mandarin Chinese (DMT), Moldavian vs. Romanian Cross-dialect Topic identification (MRC), and Cuneiform Language Identification (CLI). A total of 22 teams submitted runs across the five shared tasks. After the end of the competition, we received 14 system description papers, which are published in the VarDial workshop proceedings and referred to in this report.

Improving Cuneiform Language Identification with BERT
Gabriel Bernier-Colborne | Cyril Goutte | Serge Léger

We describe the systems developed by the National Research Council Canada for the Cuneiform Language Identification (CLI) shared task at the 2019 VarDial evaluation campaign. We compare a state-of-the-art baseline relying on character n-grams and a traditional statistical classifier, a voting ensemble of classifiers, and a deep learning approach using a Transformer network. We describe how these systems were trained, and analyze the impact of some preprocessing and model estimation decisions. The deep neural network achieved 77% accuracy on the test data, which turned out to be the best performance at the CLI evaluation, establishing a new state-of-the-art for cuneiform language identification.

Joint Approach to Deromanization of Code-mixed Texts
Rashed Rubby Riyadh | Grzegorz Kondrak

The conversion of romanized texts back to the native scripts is a challenging task because of the inconsistent romanization conventions and non-standard language use. This problem is compounded by code-mixing, i.e., using words from more than one language within the same discourse. In this paper, we propose a novel approach for handling these two problems together in a single system. Our approach combines three components: language identification, back-transliteration, and sequence prediction. The results of our experiments on Bengali and Hindi datasets establish the state of the art for the task of deromanization of code-mixed texts.

Char-RNN for Word Stress Detection in East Slavic Languages
Ekaterina Chernyak | Maria Ponomareva | Kirill Milintsevich

We explore how well a sequence labeling approach, namely a recurrent neural network, is suited to the task of resource-poor and POS-tagging-free word stress detection in the Russian, Ukrainian, and Belarusian languages. We present new datasets, annotated with word stress, for the three languages, compare several RNN models trained on the three languages, and explore possible applications of transfer learning to the task. We show that it is possible to train a model in a cross-lingual setting and that using additional languages improves the quality of the results.

Modeling Global Syntactic Variation in English Using Dialect Classification
Jonathan Dunn

This paper evaluates global-scale dialect identification for 14 national varieties of English on both web-crawled data and Twitter data. The paper makes three main contributions: (i) introducing data-driven language mapping as a method for selecting the inventory of national varieties to include in the task; (ii) producing a large and dynamic set of syntactic features using grammar induction rather than focusing on a few hand-selected features such as function words; and (iii) comparing models across both web corpora and social media corpora in order to measure the robustness of syntactic variation across registers.

Language Discrimination and Transfer Learning for Similar Languages: Experiments with Feature Combinations and Adaptation
Nianheng Wu | Eric DeMattos | Kwok Him So | Pin-zhen Chen | Çağrı Çöltekin

This paper describes the work done by team tearsofjoy participating in the VarDial 2019 Evaluation Campaign. We developed two systems based on Support Vector Machines: SVM with a flat combination of features and SVM ensembles. We participated in all language/dialect identification tasks, as well as the Moldavian vs. Romanian cross-dialect topic identification (MRC) task. Our team achieved first place in German Dialect identification (GDI) and MRC subtasks 2 and 3, second place in the simplified variant of Discriminating between Mainland and Taiwan variation of Mandarin Chinese (DMT) as well as Cuneiform Language Identification (CLI), and third and fifth place in DMT traditional and MRC subtask 1 respectively. In most cases, the SVM with a flat combination of features performed better than SVM ensembles. Besides describing the systems and the results obtained by them, we provide a tentative comparison between the feature combination methods, and present additional experiments with a method of adaptation to the test set, which may indicate potential pitfalls with some of the data sets.

Variation between Different Discourse Types: Literate vs. Oral
Katrin Ortmann | Stefanie Dipper

This paper deals with the automatic identification of literate and oral discourse in German texts. A range of linguistic features is selected and their role in distinguishing between literate- and oral-oriented registers is investigated, using a decision-tree classifier. It turns out that all of the investigated features are related in some way to oral conceptuality. In particular, simple measures of complexity (average sentence and word length) are prominent indicators of oral and literate discourse. In addition, features of reference and deixis (realized by different types of pronouns) prove to be very useful in determining the degree of orality of different registers.

Neural Machine Translation between Myanmar (Burmese) and Rakhine (Arakanese)
Thazin Myint Oo | Ye Kyaw Thu | Khin Mar Soe

This work explores neural machine translation between Myanmar (Burmese) and Rakhine (Arakanese). Rakhine is a language closely related to Myanmar, often considered a dialect. We implemented three prominent neural machine translation (NMT) systems: recurrent neural networks (RNN), transformer, and convolutional neural networks (CNN). The systems were evaluated on a Myanmar-Rakhine parallel text corpus developed by us. In addition, two types of word segmentation schemes for word embeddings were studied: Word-BPE and Syllable-BPE segmentation. Our experimental results clearly show that the highest quality NMT and statistical machine translation (SMT) performances are obtained with Syllable-BPE segmentation for both types of translations. If we focus on NMT, we find that the transformer with Word-BPE segmentation outperforms CNN and RNN for both Myanmar-Rakhine and Rakhine-Myanmar translation. However, CNN with Syllable-BPE segmentation obtains a higher score than the RNN and transformer.

Language and Dialect Identification of Cuneiform Texts
Tommi Jauhiainen | Heidi Jauhiainen | Tero Alstola | Krister Lindén

This article introduces a corpus of cuneiform texts from which the dataset for the use of the Cuneiform Language Identification (CLI) 2019 shared task was derived as well as some preliminary language identification experiments conducted using that corpus. We also describe the CLI dataset and how it was derived from the corpus. In addition, we provide some baseline language identification results using the CLI dataset. To the best of our knowledge, the experiments detailed here represent the first time that automatic language identification methods have been used on cuneiform data.

Leveraging Pretrained Word Embeddings for Part-of-Speech Tagging of Code Switching Data
Fahad AlGhamdi | Mona Diab

Linguistic Code Switching (CS) is a phenomenon that occurs when multilingual speakers alternate between two or more languages/dialects within a single conversation. Processing CS data is especially challenging in intra-sentential data given state-of-the-art monolingual NLP technologies since such technologies are geared toward the processing of one language at a time. In this paper, we address the problem of Part-of-Speech tagging (POS) in the context of linguistic code switching (CS). We explore leveraging multiple neural network architectures to measure the impact of different pre-trained embedding methods on POS tagging CS data. We investigate the landscape in four CS language pairs: Spanish-English, Hindi-English, Modern Standard Arabic-Egyptian Arabic dialect (MSA-EGY), and Modern Standard Arabic-Levantine Arabic dialect (MSA-LEV). Our results show that multilingual embedding (e.g., MSA-EGY and MSA-LEV) helps closely related languages (EGY/LEV) but adds noise to the languages that are distant (SPA/HIN). Finally, we show that our proposed models outperform state-of-the-art CS taggers for the MSA-EGY language pair.

Toward a deep dialectological representation of Indo-Aryan
Chundra Cathcart

This paper presents a new approach to disentangling inter-dialectal and intra-dialectal relationships within one such group, the Indo-Aryan subgroup of Indo-European. We draw upon admixture models and deep generative models to tease apart historic language contact and language-specific behavior in the overall patterns of sound change displayed by Indo-Aryan languages. We show that a “deep” model of Indo-Aryan dialectology sheds some light on questions regarding inter-relationships among the Indo-Aryan languages, and performs better than a “shallow” model in terms of certain qualities of the posterior distribution (e.g., entropy of posterior distributions), and outline future pathways for model development.

Naive Bayes and BiLSTM Ensemble for Discriminating between Mainland and Taiwan Variation of Mandarin Chinese
Li Yang | Yang Xiang

Automatic dialect identification is a more challenging task than language identification, as it requires the ability to discriminate between varieties of one language. In this paper, we propose an ensemble-based system, which combines traditional machine learning models trained on bag-of-n-gram features with deep learning models trained on word embeddings, to solve the Discriminating between Mainland and Taiwan Variation of Mandarin Chinese (DMT) shared task at VarDial 2019. Our experiments show that a Naive Bayes classifier based on a combination of character bigrams and trigrams is a very strong model for identifying varieties of Mandarin Chinese. Through a further ensemble of Naive Bayes and BiLSTM, our system (team: itsalexyang) achieved macro-averaged F1 scores of 0.8530 and 0.8687 in the two tracks.
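
As an illustration of the kind of character n-gram Naive Bayes model mentioned in this abstract, here is a minimal sketch in plain Python. It is not the authors' system: the class name `CharNgramNaiveBayes`, the add-alpha smoothing, and the choice to skip n-grams unseen in training are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, sizes=(2, 3)):
    """Return all character n-grams of the given sizes (bigrams, then trigrams)."""
    return [text[i:i + n] for n in sizes for i in range(len(text) - n + 1)]


class CharNgramNaiveBayes:
    """Multinomial Naive Bayes over character n-gram counts (hypothetical helper)."""

    def __init__(self, sizes=(2, 3), alpha=1.0):
        self.sizes = sizes
        self.alpha = alpha  # add-alpha smoothing constant (an assumption)

    def fit(self, texts, labels):
        self.class_docs = Counter(labels)
        self.ngram_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.ngram_counts[label].update(char_ngrams(text, self.sizes))
        self.vocab = {g for counts in self.ngram_counts.values() for g in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.ngram_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for gram in char_ngrams(text, self.sizes):
                if gram in self.vocab:  # n-grams unseen in training are skipped
                    score += math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

A model is trained with `CharNgramNaiveBayes().fit(texts, labels)` and queried with `predict(text)`; a real system would add tuning and proper evaluation on held-out data.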

BAM: A combination of deep and shallow models for German Dialect Identification.
Andrei M. Butnaru

This is a submission for the Third VarDial Evaluation Campaign. In this paper, we present a machine learning approach for the German Dialect Identification (GDI) Closed Shared Task of the DSL 2019 Challenge. The proposed approach combines deep and shallow models by applying a voting scheme to the outputs of a character-level Convolutional Neural Network (Char-CNN), a Long Short-Term Memory (LSTM) network, and a model based on string kernels. The first model is the Char-CNN, which merges multiple convolutions computed with kernels of different sizes. The second model is the LSTM network, which applies global max pooling over the sequences returned over time. Both models pass the activation maps to two fully-connected layers. The final model is based on string kernels computed on character p-grams extracted from speech transcripts. It combines two blended kernel functions: the presence bits kernel and the intersection kernel. The empirical results obtained in the shared task show that the approach can achieve good results. The system proposed in this paper obtained fourth place with a macro-F1 score of 62.55%.
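
The two string kernels named in this abstract can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and reading "blended" as a sum of the kernel over several p-gram lengths is an assumption, since the abstract does not give the definition or the range of p used.

```python
from collections import Counter


def pgram_counts(text, p):
    """Count every character p-gram (substring of length p) in text."""
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))


def presence_bits_kernel(s, t, p):
    """Number of distinct p-grams that occur in both strings."""
    return len(set(pgram_counts(s, p)) & set(pgram_counts(t, p)))


def intersection_kernel(s, t, p):
    """Sum, over shared p-grams, of the smaller of the two occurrence counts."""
    cs, ct = pgram_counts(s, p), pgram_counts(t, p)
    return sum(min(cs[g], ct[g]) for g in cs.keys() & ct.keys())


def blended(kernel, s, t, p_range=(2, 3)):
    """Assumed blending: sum the kernel over several p-gram lengths."""
    return sum(kernel(s, t, p) for p in p_range)


# "banana" and "bandana" share the bigrams "ba", "an", and "na".
print(presence_bits_kernel("banana", "bandana", 2))
print(intersection_kernel("banana", "bandana", 2))
```

The presence bits kernel ignores how often a p-gram repeats, while the intersection kernel rewards repeated shared p-grams up to the smaller count.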

The R2I_LIS Team Proposes Majority Vote for VarDial’s MRC Task
Adrian-Gabriel Chifu

This article presents the model that generated the runs submitted by the R2I_LIS team to the VarDial2019 evaluation campaign, more particularly, to the binary classification by dialect sub-task of the Moldavian vs. Romanian Cross-dialect Topic identification (MRC) task. The team proposed a majority vote-based model, between five supervised machine learning models, trained on forty manually-crafted features. One of the three submitted runs was ranked second at the binary classification sub-task, with a performance of 0.7963, in terms of macro-F1 measure. The other two runs were ranked third and fourth, respectively.
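
A majority vote over the outputs of several classifiers, as described in this abstract, can be sketched as follows. The function name `majority_vote` and the toy labels are assumptions for illustration; the five underlying models and their forty features are not reproduced here.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one list by majority vote.

    predictions: one list of labels per model, all of the same length.
    With an odd number of models and two classes, ties cannot occur.
    """
    voted = []
    for labels in zip(*predictions):  # labels given by all models to one item
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Toy outputs of five hypothetical models on three documents.
model_outputs = [
    ["MD", "RO", "RO"],
    ["MD", "MD", "RO"],
    ["RO", "MD", "RO"],
    ["MD", "RO", "MD"],
    ["RO", "RO", "RO"],
]
print(majority_vote(model_outputs))
```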

Initial Experiments In Cross-Lingual Morphological Analysis Using Morpheme Segmentation
Vladislav Mikhailov | Lorenzo Tosi | Anastasia Khorosheva | Oleg Serikov

The paper describes initial experiments in data-driven cross-lingual morphological analysis of open-category words using a combination of unsupervised morpheme segmentation, annotation projection and an LSTM encoder-decoder model with attention. Our algorithm provides lemmatisation and morphological analysis generation for previously unseen low-resource language surface forms with only annotated data on the related languages given. Despite the inherently lossy annotation projection, we achieved the best lemmatisation F1-score in the VarDial 2019 Shared Task on Cross-Lingual Morphological Analysis for both Karachay-Balkar (Turkic languages, agglutinative morphology) and Sardinian (Romance languages, fusional morphology).

Neural and Linear Pipeline Approaches to Cross-lingual Morphological Analysis
Çağrı Çöltekin | Jeremy Barnes

This paper describes the Tübingen-Oslo team’s participation in the cross-lingual morphological analysis task in the VarDial 2019 evaluation campaign. We participated in the shared task with a standard neural network model. Our model achieved analysis F1-scores of 31.48 and 23.67 on test languages Karachay-Balkar (Turkic) and Sardinian (Romance) respectively. The scores are comparable to the scores obtained by the other participants in both language families, and the analysis score on the Romance data set was also the best result obtained in the shared task. Besides describing the system used in our shared task participation, we describe another, simpler, model based on linear classifiers, and present further analyses using both models. Our analyses, besides revealing some of the difficult cases, also confirm that the usefulness of a source language in this task is highly correlated with the similarity of source and target languages.

Ensemble Methods to Distinguish Mainland and Taiwan Chinese
Hai Hu | Wen Li | He Zhou | Zuoyu Tian | Yiwen Zhang | Liang Zou

This paper describes the IUCL system at the VarDial 2019 evaluation campaign for the task of discriminating between Mainland and Taiwan variation of Mandarin Chinese. We first build several base classifiers, including a Naive Bayes classifier with word n-grams as features, SVMs with both character and syntactic features, and neural networks with pre-trained character/word embeddings. Then we adopt ensemble methods to combine output from base classifiers to make final predictions. Our ensemble models achieve the highest F1 score (0.893) in the simplified Chinese track and the second highest (0.901) in the traditional Chinese track. Our results demonstrate the effectiveness and robustness of the ensemble methods.

SC-UPB at the VarDial 2019 Evaluation Campaign: Moldavian vs. Romanian Cross-Dialect Topic Identification
Cristian Onose | Dumitru-Clementin Cercel | Stefan Trausan-Matu

This paper describes our models for the Moldavian vs. Romanian Cross-Topic Identification (MRC) evaluation campaign, part of the VarDial 2019 workshop. We focus on the three subtasks for MRC: binary classification between the Moldavian (MD) and the Romanian (RO) dialects and two cross-dialect multi-class classification between six news topics, MD to RO and RO to MD. We propose several deep learning models based on long short-term memory cells, Bidirectional Gated Recurrent Unit (BiGRU) and Hierarchical Attention Networks (HAN). We also employ three word embedding models to represent the text as a low dimensional vector. Our official submission includes two runs of the BiGRU and HAN models for each of the three subtasks. The best submitted model obtained the following macro-averaged F1 scores: 0.708 for subtask 1, 0.481 for subtask 2 and 0.480 for the last one. Due to a read error caused by the quoting behaviour over the test file, our final submissions contained a smaller number of items than expected. More than 50% of the submission files were corrupted. Thus, we also present the results obtained with the corrected labels for which the HAN model achieves the following results: 0.930 for subtask 1, 0.590 for subtask 2 and 0.687 for the third one.

Discriminating between Mandarin Chinese and Swiss-German varieties using adaptive language models
Tommi Jauhiainen | Krister Lindén | Heidi Jauhiainen

This paper describes the language identification systems used by the SUKI team in the Discriminating between the Mainland and Taiwan variation of Mandarin Chinese (DMT) and the German Dialect Identification (GDI) shared tasks which were held as part of the third VarDial Evaluation Campaign. The DMT shared task included two separate tracks, one for the simplified Chinese script and one for the traditional Chinese script. We submitted three runs on both tracks of the DMT task as well as on the GDI task. We won the traditional Chinese track using Naive Bayes with language model adaptation, came second on GDI with an adaptive version of the HeLI 2.0 method, and third on the simplified Chinese track using again the adaptive Naive Bayes.

Investigating Machine Learning Methods for Language and Dialect Identification of Cuneiform Texts
Ehsan Doostmohammadi | Minoo Nassajian

Identification of the languages written using cuneiform symbols is a difficult task due to the lack of resources and the problem of tokenization. The Cuneiform Language Identification task in VarDial 2019 addresses the problem of identifying seven languages and dialects written in cuneiform: Sumerian and six dialects of the Akkadian language (Old Babylonian, Middle Babylonian Peripheral, Standard Babylonian, Neo-Babylonian, Late Babylonian, and Neo-Assyrian). This paper describes the approaches taken by the SharifCL team to this problem in VarDial 2019. The best result belongs to an ensemble of Support Vector Machines and a naive Bayes classifier, both working on character-level features, with a macro-averaged F1-score of 72.10%.
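
Several abstracts in this volume report macro-averaged F1. For readers unfamiliar with the metric, here is a minimal sketch of how it is commonly computed: F1 is calculated per class and the per-class values are averaged with equal weight. The function name and the choice to take the label set from both gold and predicted labels are assumptions; the shared tasks' official scorer may differ in such details.

```python
def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)  # true positives
        fp = sum(g != label and p == label for g, p in pairs)  # false positives
        fn = sum(g == label and p != label for g, p in pairs)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Class "a" has F1 = 0.8 and class "b" has F1 = 2/3 on this toy example.
print(macro_f1(["a", "a", "a", "b"], ["a", "a", "b", "b"]))
```

Because every class counts equally, a rare dialect that is classified poorly lowers the macro average as much as a frequent one would.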

TwistBytes - Identification of Cuneiform Languages and German Dialects at VarDial 2019
Fernando Benites | Pius von Däniken | Mark Cieliebak

We describe our approaches for the German Dialect Identification (GDI) and the Cuneiform Language Identification (CLI) tasks at the VarDial Evaluation Campaign 2019. The goal was to identify dialects of Swiss German in GDI and Sumerian and Akkadian in CLI. In GDI, the system should distinguish four dialects from the German-speaking part of Switzerland. Our system for GDI achieved third place out of 6 teams, with a macro averaged F-1 of 74.6%. In CLI, the system should distinguish seven languages written in cuneiform script. Our system achieved third place out of 8 teams, with a macro averaged F-1 of 74.7%.

DTeam @ VarDial 2019: Ensemble based on skip-gram and triplet loss neural networks for Moldavian vs. Romanian cross-dialect topic identification
Diana Tudoreanu

This paper presents the solution proposed by DTeam in the VarDial 2019 Evaluation Campaign for the Moldavian vs. Romanian cross-topic identification task. The solution proposed is a Support Vector Machines (SVM) ensemble composed of two character-level neural networks. The first network is a skip-gram classification model formed of an embedding layer, three convolutional layers and two fully-connected layers. The second network has a similar architecture, but is trained using the triplet loss function.
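
The triplet loss mentioned in this abstract can be illustrated for a single (anchor, positive, negative) triple. This is a sketch of one common formulation, not the paper's training code: the Euclidean distance, the margin value, and the function names are assumptions, and some variants use squared distances instead.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin.

    The loss is zero once the negative example is farther from the anchor
    than the positive example by at least the margin.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative) + margin
    return max(0.0, gap)


# The positive is close and the negative is far, so the loss is zero.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
# Swapping positive and negative gives a large penalty.
print(triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0]))
```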

Experiments in Cuneiform Language Identification
Gustavo Henrique Paetzold | Marcos Zampieri

This paper presents methods to discriminate between languages and dialects written in Cuneiform script, one of the first writing systems in the world. We report the results obtained by the PZ team in the Cuneiform Language Identification (CLI) shared task organized within the scope of the VarDial Evaluation Campaign 2019. The task included two languages, Sumerian and Akkadian. The latter is divided into six dialects: Old Babylonian, Middle Babylonian peripheral, Standard Babylonian, Neo Babylonian, Late Babylonian, and Neo Assyrian. We approach the task using a meta-classifier trained on various SVM models and we show the effectiveness of the system for this task. Our submission achieved 0.738 F1 score in discriminating between the seven languages and dialects and it was ranked fourth in the competition among eight teams.

Comparing Pipelined and Integrated Approaches to Dialectal Arabic Neural Machine Translation
Pamela Shapiro | Kevin Duh

When translating diglossic languages such as Arabic, situations may arise where we would like to translate a text but do not know which dialect it is. A traditional approach to this problem is to design dialect identification systems and dialect-specific machine translation systems. However, under the recent paradigm of neural machine translation, shared multi-dialectal systems have become a natural alternative. Here we explore under which conditions it is beneficial to perform dialect identification for Arabic neural machine translation versus using a general system for all dialects.

Cross-lingual Annotation Projection Is Effective for Neural Part-of-Speech Tagging
Matthias Huck | Diana Dutka | Alexander Fraser

We tackle the important task of part-of-speech tagging using a neural model in the zero-resource scenario, where we have no access to gold-standard POS training data. We compare this scenario with the low-resource scenario, where we have access to a small amount of gold-standard POS training data. Our experiments focus on Ukrainian as a representative of under-resourced languages. Russian is highly related to Ukrainian, so we exploit gold-standard Russian POS tags. We consider four techniques to perform Ukrainian POS tagging: zero-shot tagging and cross-lingual annotation projection (for the zero-resource scenario), and compare these with self-training and multilingual learning (for the low-resource scenario). We find that cross-lingual annotation projection works particularly well in the zero-resource scenario.