Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects (2018)


pdf (full)
bib (full)
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

pdf bib
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)
Marcos Zampieri | Preslav Nakov | Nikola Ljubešić | Jörg Tiedemann | Shervin Malmasi | Ahmed Ali

pdf bib
Language Identification and Morphosyntactic Tagging: The Second VarDial Evaluation Campaign
Marcos Zampieri | Shervin Malmasi | Preslav Nakov | Ahmed Ali | Suwon Shon | James Glass | Yves Scherrer | Tanja Samardžić | Nikola Ljubešić | Jörg Tiedemann | Chris van der Lee | Stefan Grondelaers | Nelleke Oostdijk | Dirk Speelman | Antal van den Bosch | Ritesh Kumar | Bornini Lahiri | Mayank Jain

We present the results and the findings of the Second VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects. The campaign was organized as part of the fifth edition of the VarDial workshop, collocated with COLING’2018. This year, the campaign included five shared tasks, including two task re-runs – Arabic Dialect Identification (ADI) and German Dialect Identification (GDI) –, and three new tasks – Morphosyntactic Tagging of Tweets (MTT), Discriminating between Dutch and Flemish in Subtitles (DFS), and Indo-Aryan Language Identification (ILI). A total of 24 teams submitted runs across the five shared tasks, and contributed 22 system description papers, which were included in the VarDial workshop proceedings and are referred to in this report.

pdf bib
Encoder-Decoder Methods for Text Normalization
Massimo Lusetti | Tatyana Ruzsics | Anne Göhring | Tanja Samardžić | Elisabeth Stark

Text normalization is the task of mapping non-canonical language, typical of speech transcription and computer-mediated communication, to a standardized writing. It is an up-stream task necessary to enable the subsequent direct employment of standard natural language processing tools and indispensable for languages such as Swiss German, with strong regional variation and no written standard. Text normalization has been addressed with a variety of methods, most successfully with character-level statistical machine translation (CSMT). In the meantime, machine translation has changed and the new methods, known as neural encoder-decoder (ED) models, resulted in remarkable improvements. Text normalization, however, has not yet followed. A number of neural methods have been tried, but CSMT remains the state-of-the-art. In this work, we normalize Swiss German WhatsApp messages using the ED framework. We exploit the flexibility of this framework, which allows us to learn from the same training data in different ways. In particular, we modify the decoding stage of a plain ED model to include target-side language models operating at different levels of granularity: characters and words. Our systematic comparison shows that our approach results in an improvement over the CSMT state-of-the-art.

A High Coverage Method for Automatic False Friends Detection for Spanish and Portuguese
Santiago Castro | Jairo Bonanata | Aiala Rosá

False friends are words in two languages that look or sound similar, but have different meanings. They are a common source of confusion among language learners. Methods to detect them automatically do exist, however they make use of large aligned bilingual corpora, which are hard to find and expensive to build, or encounter problems dealing with infrequent words. In this work we propose a high coverage method that uses word vector representations to build a false friends classifier for any pair of languages, which we apply to the particular case of Spanish and Portuguese. The required resources are a large corpus for each language and a small bilingual lexicon for the pair.

Sub-label dependencies for Neural Morphological Tagging – The Joint Submission of University of Colorado and University of Helsinki for VarDial 2018
Miikka Silfverberg | Senka Drobac

This paper presents the submission of the UH&CU team (Joint University of Colorado and University of Helsinki team) for the VarDial 2018 shared task on morphosyntactic tagging of Croatian, Slovenian and Serbian tweets. Our system is a bidirectional LSTM tagger which emits tags as character sequences using an LSTM generator in order to be able to handle unknown tags and combinations of several tags for one token which occur in the shared task data sets. To the best of our knowledge, using an LSTM generator is a novel approach. The system delivers sizable improvements of more than 6%-points over a baseline trigram tagger. Overall, the performance of our system is quite even for all three languages.

Part of Speech Tagging in Luyia: A Bantu Macrolanguage
Kenneth Steimel

Luyia is a macrolanguage in central Kenya. The Luyia languages, like other Bantu languages, have a complex morphological system. This system can be leveraged to aid in part of speech tagging. Bag-of-characters taggers trained on a source Luyia language can be applied directly to another Luyia language with some degree of success. In addition, mixing data from the target language with data from the source language does produce more accurate predictive models compared to models trained on just the target language data when the training set size is small. However, for both of these tagging tasks, models involving the more distantly related language, Tiriki, are better at predicting part of speech tags for Wanga data. The models incorporating Bukusu data are not as successful despite the closer relationship between Bukusu and Wanga. Overlapping vocabulary between the Wanga and Tiriki corpora as well as a bias towards open class words help Tiriki outperform Bukusu.

Tübingen-Oslo Team at the VarDial 2018 Evaluation Campaign: An Analysis of N-gram Features in Language Variety Identification
Çağrı Çöltekin | Taraka Rama | Verena Blaschke

This paper describes our systems for the VarDial 2018 evaluation campaign. We participated in all language identification tasks, namely, Arabic dialect identification (ADI), German dialect identification (GDI), discriminating between Dutch and Flemish in Subtitles (DFS), and Indo-Aryan Language Identification (ILI). In all of the tasks, we only used textual transcripts (not using audio features for ADI). We submitted system runs based on support vector machine classifiers (SVMs) with bag of character and word n-grams as features, and gated bidirectional recurrent neural networks (RNNs) using units of characters and words. Our SVM models outperformed our RNN models in all tasks, obtaining the first place on the DFS task, third place on the ADI task, and second place on others according to the official rankings. As well as describing the models we used in the shared task participation, we present an analysis of the n-gram features used by the SVM models in each task, and also report additional results (that were run after the official competition deadline) on the GDI surprise dialect track.

Iterative Language Model Adaptation for Indo-Aryan Language Identification
Tommi Jauhiainen | Heidi Jauhiainen | Krister Lindén

This paper presents the experiments and results obtained by the SUKI team in the Indo-Aryan Language Identification shared task of the VarDial 2018 Evaluation Campaign. The shared task was an open one, but we did not use any corpora other than what was distributed by the organizers. A total of eight teams provided results for this shared task. Our submission using a HeLI-method based language identifier with iterative language model adaptation obtained the best results in the shared task with a macro F1-score of 0.958.

Language and the Shifting Sands of Domain, Space and Time (Invited Talk)
Timothy Baldwin

In this talk, I will first present recent work on domain debiasing in the context of language identification, then discuss a new line of work on language variety analysis in the form of dialect map generation. Finally, I will reflect on the interplay between time and space on language variation, and speculate on how these can be captured in a single model.

UnibucKernel Reloaded: First Place in Arabic Dialect Identification for the Second Year in a Row
Andrei Butnaru | Radu Tudor Ionescu

We present a machine learning approach that ranked on the first place in the Arabic Dialect Identification (ADI) Closed Shared Tasks of the 2018 VarDial Evaluation Campaign. The proposed approach combines several kernels using multiple kernel learning. While most of our kernels are based on character p-grams (also known as n-grams) extracted from speech or phonetic transcripts, we also use a kernel based on dialectal embeddings generated from audio recordings by the organizers. In the learning stage, we independently employ Kernel Discriminant Analysis (KDA) and Kernel Ridge Regression (KRR). Preliminary experiments indicate that KRR provides better classification results. Our approach is shallow and simple, but the empirical results obtained in the 2018 ADI Closed Shared Task prove that it achieves the best performance. Furthermore, our top macro-F1 score (58.92%) is significantly better than the second best score (57.59%) in the 2018 ADI Shared Task, according to the statistical significance test performed by the organizers. Nevertheless, we obtain even better post-competition results (a macro-F1 score of 62.28%) using the audio embeddings released by the organizers after the competition. With a very similar approach (that did not include phonetic features), we also ranked first in the ADI Closed Shared Tasks of the 2017 VarDial Evaluation Campaign, surpassing the second best method by 4.62%. We therefore conclude that our multiple kernel learning method is the best approach to date for Arabic dialect identification.

Varying image description tasks: spoken versus written descriptions
Emiel van Miltenburg | Ruud Koolen | Emiel Krahmer

Automatic image description systems are commonly trained and evaluated on written image descriptions. At the same time, these systems are often used to provide spoken descriptions (e.g. for visually impaired users) through apps like TapTapSee or Seeing AI. This is not a problem, as long as spoken and written descriptions are very similar. However, linguistic research suggests that spoken language often differs from written language. These differences are not regular, and vary from context to context. Therefore, this paper investigates whether there are differences between written and spoken image descriptions, even if they are elicited through similar tasks. We compare descriptions produced in two languages (English and Dutch), and in both languages observe substantial differences between spoken and written descriptions. Future research should see if users prefer the spoken over the written style and, if so, aim to emulate spoken descriptions.

Transfer Learning for British Sign Language Modelling
Boris Mocialov | Helen Hastie | Graham Turner

Automatic speech recognition and spoken dialogue systems have made great advances through the use of deep machine learning methods. This is partly due to greater computing power but also through the large amount of data available in common languages, such as English. Conversely, research in minority languages, including sign languages, is hampered by the severe lack of data. This has led to work on transfer learning methods, whereby a model developed for one language is reused as the starting point for a model on a second language, which is less resourced. In this paper, we examine two transfer learning techniques of fine-tuning and layer substitution for language modelling of British Sign Language. Our results show improvement in perplexity when using transfer learning with standard stacked LSTM models, trained initially using a large corpus for standard English from the Penn Treebank corpus.

Paraphrastic Variance between European and Brazilian Portuguese
Anabela Barreiro | Cristina Mota

This paper presents a methodology to extract a paraphrase database for the European and Brazilian varieties of Portuguese, and discusses a set of paraphrastic categories of multiwords and phrasal units, such as the compounds “toda a gente” versus “todo o mundo” ‘everybody’ or the gerundive constructions [estar a + V-Inf] versus [ficar + V-Ger] (e.g., “estive a observar” | “fiquei observando” ‘I was observing’), which are extremely relevant to high quality paraphrasing. The variants were manually aligned in the e-PACT corpus, using the CLUE-Aligner tool. The methodology, inspired in the Logos Model, focuses on a semantico-syntactic analysis of each paraphrastic unit and constitutes a subset of the Gold-CLUE-Paraphrases. The construction of a larger dataset of paraphrastic contrasts among the distinct varieties of the Portuguese language is indispensable for variety adaptation, i.e., for dealing with the cultural, linguistic and stylistic differences between them, making it possible to convert texts (semi-)automatically from one variety into another, a key function in paraphrasing systems. This topic represents an interesting new line of research with valuable applications in language learning, language generation, question-answering, summarization, and machine translation, among others. The paraphrastic units are the first resource of its kind for Portuguese to become available to the scientific community for research purposes.

Character Level Convolutional Neural Network for Arabic Dialect Identification
Mohamed Ali

This submission is for the description paper for our system in the ADI shared task.

Neural Network Architectures for Arabic Dialect Identification
Elise Michon | Minh Quang Pham | Josep Crego | Jean Senellart

SYSTRAN competes this year for the first time to the DSL shared task, in the Arabic Dialect Identification subtask. We participate by training several Neural Network models showing that we can obtain competitive results despite the limited amount of training data available for learning. We report our experiments and detail the network architecture and parameters of our 3 runs: our best performing system consists in a Multi-Input CNN that learns separate embeddings for lexical, phonetic and acoustic input features (F1: 0.5289); we also built a CNN-biLSTM network aimed at capturing both spatial and sequential features directly from speech spectrograms (F1: 0.3894 at submission time, F1: 0.4235 with later found parameters); and finally a system relying on binary CNN-biLSTMs (F1: 0.4339).

HeLI-based Experiments in Discriminating Between Dutch and Flemish Subtitles
Tommi Jauhiainen | Heidi Jauhiainen | Krister Lindén

This paper presents the experiments and results obtained by the SUKI team in the Discriminating between Dutch and Flemish in Subtitles shared task of the VarDial 2018 Evaluation Campaign. Our best submission was ranked 8th, obtaining macro F1-score of 0.61. Our best results were produced by a language identifier implementing the HeLI method without any modifications. We describe, in addition to the best method we used, some of the experiments we did with unsupervised clustering.

Measuring language distance among historical varieties using perplexity. Application to European Portuguese.
Jose Ramom Pichel Campos | Pablo Gamallo | Iñaki Alegria

The objective of this work is to quantify, with a simple and robust measure, the distance between historical varieties of a language. The measure will be inferred from text corpora corresponding to historical periods. Different approaches have been proposed for similar aims: Language Identification, Phylogenetics, Historical Linguistics or Dialectology. In our approach, we used a perplexity-based measure to calculate language distance between all the historical periods of a specific language: European Portuguese. Perplexity has also proven to be a robust metric to calculate distance between languages. However, this measure has not been tested yet to identify diachronic periods within the historical evolution of a specific language. For this purpose, a historical Portuguese corpus has been constructed from different open sources containing texts with close original spelling. The results of our experiments show that Portuguese keeps an important degree of homogeneity over time. We anticipate this metric to be a starting point to be applied to other languages.

Comparing CRF and LSTM performance on the task of morphosyntactic tagging of non-standard varieties of South Slavic languages
Nikola Ljubešić

This paper presents two systems taking part in the Morphosyntactic Tagging of Tweets shared task on Slovene, Croatian and Serbian data, organized inside the VarDial Evaluation Campaign. While one system relies on the traditional method for sequence labeling (conditional random fields), the other relies on its neural alternative (bidirectional long short-term memory). We investigate the similarities and differences of these two approaches, showing that both methods yield very good and quite similar results, with the neural model outperforming the traditional one more as the level of non-standardness of the text increases. Through an error analysis we show that the neural system is better at long-range dependencies, while the traditional system excels and slightly outperforms the neural system at the local ones. We present in the paper new state-of-the-art results in morphosyntactic annotation of non-standard text for Slovene, Croatian and Serbian.

Computationally efficient discrimination between language varieties with large feature vectors and regularized classifiers
Adrien Barbaresi

The present contribution revolves around efficient approaches to language classification which have been field-tested in the Vardial evaluation campaign. The methods used in several language identification tasks comprising different language types are presented and their results are discussed, giving insights on real-world application of regularization, linear classifiers and corresponding linguistic features. The use of a specially adapted Ridge classifier proved useful in 2 tasks out of 3. The overall approach (XAC) has slightly outperformed most of the other systems on the DFS task (Dutch and Flemish) and on the ILI task (Indo-Aryan languages), while its comparative performance was poorer in on the GDI task (Swiss German dialects).

Character Level Convolutional Neural Network for German Dialect Identification
Mohamed Ali

This submission is a description paper for our system in GDI shared task

Discriminating between Indo-Aryan Languages Using SVM Ensembles
Alina Maria Ciobanu | Marcos Zampieri | Shervin Malmasi | Santanu Pal | Liviu P. Dinu

In this paper we present a system based on SVM ensembles trained on characters and words to discriminate between five similar languages of the Indo-Aryan family: Hindi, Braj Bhasha, Awadhi, Bhojpuri, and Magahi. The system competed in the Indo-Aryan Language Identification (ILI) shared task organized within the VarDial Evaluation Campaign 2018. Our best entry in the competition, named ILIdentification, scored 88.95% F1 score and it was ranked 3rd out of 8 teams.

IIT (BHU) System for Indo-Aryan Language Identification (ILI) at VarDial 2018
Divyanshu Gupta | Gourav Dhakad | Jayprakash Gupta | Anil Kumar Singh

Text language Identification is a Natural Language Processing task of identifying and recognizing a given language out of many different languages from a piece of text. This paper describes our submission to the ILI 2018 shared-task, which includes the identification of 5 closely related Indo-Aryan languages. We developed a word-level LSTM(Long Short-term Memory) model, a specific type of Recurrent Neural Network model, for this task. Given a sentence, our model embeds each word of the sentence and convert into its trainable word embedding, feeds them into our LSTM network and finally predict the language. We obtained an F1 macro score of 0.836, ranking 5th in the task.

Exploring Classifier Combinations for Language Variety Identification
Tim Kreutz | Walter Daelemans

This paper describes CLiPS’s submissions for the Discriminating between Dutch and Flemish in Subtitles (DFS) shared task at VarDial 2018. We explore different ways to combine classifiers trained on different feature groups. Our best system uses two Linear SVM classifiers; one trained on lexical features (word n-grams) and one trained on syntactic features (PoS n-grams). The final prediction for a document to be in Flemish Dutch or Netherlandic Dutch is made by the classifier that outputs the highest probability for one of the two labels. This confidence vote approach outperforms a meta-classifier on the development data and on the test data.

Identification of Differences between Dutch Language Varieties with the VarDial2018 Dutch-Flemish Subtitle Data
Hans van Halteren | Nelleke Oostdijk

With the goal of discovering differences between Belgian and Netherlandic Dutch, we participated as Team Taurus in the Dutch-Flemish Subtitles task of VarDial2018. We used a rather simple marker-based method, but a wide range of features, including lexical, lexico-syntactic and syntactic ones, and achieved a second position in the ranking. Inspection of highly distin-guishing features did point towards differences between the two language varieties, but because of the nature of the experimental data, we have to treat our observations as very tentative and in need of further investigation.

Birzeit Arabic Dialect Identification System for the 2018 VarDial Challenge
Rabee Naser | Abualsoud Hanani

This paper describes our Automatic Dialect Recognition (ADI) system for the VarDial 2018 challenge, with the goal of distinguishing four major Arabic dialects, as well as Modern Standard Arabic (MSA). The training and development ADI VarDial 2018 data consists of 16,157 utterances, their words transcription, their phonetic transcriptions obtained with four non-Arabic phoneme recognizers and acoustic embedding data. Our overall system is a combination of four different systems. One system uses the words transcriptions and tries to recognize the speaker dialect by modeling the sequence of words for each dialect. Another system tries to recognize the dialect by modeling the phones sequence produced by non-Arabic phone recognizers, whereas, the other two systems use GMM trained on the acoustic features for recognizing the dialect. The best performance was achieved by the fused system which combines four systems together, with F1 micro of 68.77%.

Twist Bytes - German Dialect Identification with Data Mining Optimization
Fernando Benites | Ralf Grubenmann | Pius von Däniken | Dirk von Grünigen | Jan Deriu | Mark Cieliebak

We describe our approaches used in the German Dialect Identification (GDI) task at the VarDial Evaluation Campaign 2018. The goal was to identify to which out of four dialects spoken in German speaking part of Switzerland a sentence belonged to. We adopted two different meta classifier approaches and used some data mining insights to improve the preprocessing and the meta classifier parameters. Especially, we focused on using different feature extraction methods and how to combine them, since they influenced very differently the performance of the system. Our system achieved second place out of 8 teams, with a macro averaged F-1 of 64.6%.

STEVENDU2018’s system in VarDial 2018: Discriminating between Dutch and Flemish in Subtitles
Steven Du | Yuan Yuan Wang

This paper introduces the submitted system for team STEVENDU2018 during VarDial 2018 Discriminating between Dutch and Flemish in Subtitles(DFS). Post evaluation analyses are also presented, the results obtained indicate that it is a challenging task to discriminate Dutch and Flemish.

Using Neural Transfer Learning for Morpho-syntactic Tagging of South-Slavic Languages Tweets
Sara Meftah | Nasredine Semmar | Fatiha Sadat | Stephan Raaijmakers

In this paper, we describe a morpho-syntactic tagger of tweets, an important component of the CEA List DeepLIMA tool which is a multilingual text analysis platform based on deep learning. This tagger is built for the Morpho-syntactic Tagging of Tweets (MTT) Shared task of the 2018 VarDial Evaluation Campaign. The MTT task focuses on morpho-syntactic annotation of non-canonical Twitter varieties of three South-Slavic languages: Slovene, Croatian and Serbian. We propose to use a neural network model trained in an end-to-end manner for the three languages without any need for task or domain specific features engineering. The proposed approach combines both character and word level representations. Considering the lack of annotated data in the social media domain for South-Slavic languages, we have also implemented a cross-domain Transfer Learning (TL) approach to exploit any available related out-of-domain annotated data.

When Simple n-gram Models Outperform Syntactic Approaches: Discriminating between Dutch and Flemish
Martin Kroon | Masha Medvedeva | Barbara Plank

In this paper we present the results of our participation in the Discriminating between Dutch and Flemish in Subtitles VarDial 2018 shared task. We try techniques proven to work well for discriminating between language varieties as well as explore the potential of using syntactic features, i.e. hierarchical syntactic subtrees. We experiment with different combinations of features. Discriminating between these two languages turned out to be a very hard task, not only for a machine: human performance is only around 0.51 F1 score; our best system is still a simple Naive Bayes model with word unigrams and bigrams. The system achieved an F1 score (macro) of 0.62, which ranked us 4th in the shared task.

HeLI-based Experiments in Swiss German Dialect Identification
Tommi Jauhiainen | Heidi Jauhiainen | Krister Lindén

In this paper we present the experiments and results by the SUKI team in the German Dialect Identification shared task of the VarDial 2018 Evaluation Campaign. Our submission using HeLI with adaptive language models obtained the best results in the shared task with a macro F1-score of 0.686, which is clearly higher than the other submitted results. Without some form of unsupervised adaptation on the test set, it might not be possible to reach as high an F1-score with the level of domain difference between the datasets of the shared task. We describe the methods used in detail, as well as some additional experiments carried out during the shared task.

Deep Models for Arabic Dialect Identification on Benchmarked Data
Mohamed Elaraby | Muhammad Abdul-Mageed

The Arabic Online Commentary (AOC) (Zaidan and Callison-Burch, 2011) is a large-scale repos-itory of Arabic dialects with manual labels for4varieties of the language. Existing dialect iden-tification models exploiting the dataset pre-date the recent boost deep learning brought to NLPand hence the data are not benchmarked for use with deep learning, nor is it clear how much neural networks can help tease the categories in the data apart. We treat these two limitations:We (1) benchmark the data, and (2) empirically test6different deep learning methods on thetask, comparing peformance to several classical machine learning models under different condi-tions (i.e., both binary and multi-way classification). Our experimental results show that variantsof (attention-based) bidirectional recurrent neural networks achieve best accuracy (acc) on thetask, significantly outperforming all competitive baselines. On blind test data, our models reach87.65%acc on the binary task (MSA vs. dialects),87.4%acc on the 3-way dialect task (Egyptianvs. Gulf vs. Levantine), and82.45%acc on the 4-way variants task (MSA vs. Egyptian vs. Gulfvs. Levantine). We release our benchmark for future work on the dataset

A Neural Approach to Language Variety Translation
Marta R. Costa-jussà | Marcos Zampieri | Santanu Pal

In this paper we present the first neural-based machine translation system trained to translate between standard national varieties of the same language. We take the pair Brazilian - European Portuguese as an example and compare the performance of this method to a phrase-based statistical machine translation system. We report a performance improvement of 0.9 BLEU points in translating from European to Brazilian Portuguese and 0.2 BLEU points when translating in the opposite direction. We also carried out a human evaluation experiment with native speakers of Brazilian Portuguese which indicates that humans prefer the output produced by the neural-based system in comparison to the statistical system.

Character Level Convolutional Neural Network for Indo-Aryan Language Identification
Mohamed Ali

This submission is a description paper for our system in ILI shared task

German Dialect Identification Using Classifier Ensembles
Alina Maria Ciobanu | Shervin Malmasi | Liviu P. Dinu

In this paper we present the GDI classification entry to the second German Dialect Identification (GDI) shared task organized within the scope of the VarDial Evaluation Campaign 2018. We present a system based on SVM classifier ensembles trained on characters and words. The system was trained on a collection of speech transcripts of five Swiss-German dialects provided by the organizers. The transcripts included in the dataset contained speakers from Basel, Bern, Lucerne, and Zurich. Our entry in the challenge reached 62.03% F1 score and was ranked third out of eight teams.