Widening Natural Language Processing (2020)


pdf (full)
bib (full)
Proceedings of the The Fourth Widening Natural Language Processing Workshop

pdf bib
Proceedings of the The Fourth Widening Natural Language Processing Workshop
Rossana Cunha | Samira Shaikh | Erika Varis | Ryan Georgi | Alicia Tsai | Antonios Anastasopoulos | Khyathi Raghavi Chandu

Corpus based Amharic sentiment lexicon generation
Girma Neshir Alemneh | Andreas Rauber | Solomon Atnafu

Sentiment classification is an active research area with several applications including analysis of political opinions, classifying comments, movie reviews, news reviews and product reviews. To employ rule based sentiment classification, we require sentiment lexicons. However, manual construction of sentiment lexicon is time consuming and costly for resource-limited languages. To bypass manual development time and costs, we tried to build Amharic Sentiment Lexicons relying on corpus based approach. The intention of this approach is to handle sentiment terms specific to Amharic language from Amharic Corpus. Small set of seed terms are manually prepared from three parts of speech such as noun, adjective and verb. We developed algorithms for constructing Amharic sentiment lexicons automatically from Amharic news corpus. Corpus based approach is proposed relying on the word co-occurrence distributional embedding including frequency based embedding (i.e. Positive Point-wise Mutual Information PPMI). Using PPMI with threshold value of 100 and 200, we got corpus based Amharic Sentiment lexicons of size 1811 and 3794 respectively by expanding 519 seeds. Finally, the lexicon generated in corpus based approach is evaluated.

Negation handling for Amharic sentiment classification
Girma Neshir Alemneh | Andreas Rauber | Solomon Atnafu

User generated content is bringing new aspects of processing data on the web. Due to the advancement of World Wide Web technology, users are not only consumer of web contents but also they are producers of contents in the form of text, audio, video and picture. This study focuses on the analysis of textual contents with subjective information (referring to sentiment analysis). Most of conventional approaches of sentiment analysis do not effectively capture negation in languages where there are limited computational linguistic resources (e.g. Amharic). For this research, we proposed Amharic negation handling framework for Amharic sentiment classification. The proposed framework combines the lexicon based sentiment classification approach and character ngram based machine learning algorithms. Finally, the performance of framework is evaluated using the annotated Amharic news comments. The system is performing the best of all models and the baselines with accuracy of 98.0. The result is compared with the baselines (without negation handling and word level ngram model).

Embedding Oriented Adaptable Semantic Annotation Framework for Amharic Web Documents
Kidane Woldemariyam | Dr. Fekade Getahun

The Web has become a source of information, where information is provided by humans for humans and its growth has increased necessity to get solutions that intelligently extract valuable knowledge from existing and newly added web documents with no (minimal) supervisions. However, due to the unstructured nature of existing data on the Web, effective extraction of this knowledge is limited for both human beings and software agents. Thus, this research work designed generic and embedding oriented framework that automatically annotates semantically Amharic web documents using ontology. This framework significantly reduces manual annotation and learning cost used for semantic annotation of Amharic web documents with its nature of adaptability with minimal modification. The results have also implied that neural network techniques are promising for semantic annotation, especially for less resourced languages like Amharic in comparison to language dependent techniques that have cost of speed and challenge of adaptation into new domains and languages. We experiment the feasibility of the proposed approach using Amharic news collected from WALTA news agency and Amharic Wikipedia. Our results show that the proposed solution exhibits 70.68% of precision, 66.89% of recall and 68.53% of f-measure in semantic annotation for a morphologically complex Amharic language with limited size dataset.

Similarity and Farness Based Bidirectional Neural Co-Attention for Amharic Natural Language Inference
Abebawu Eshetu | Getenesh Teshome | Ribka Alemayehu

In natural language one idea can be conveyed using different sentences; higher Natural Language Processing applications get difficulties in capturing meaning of ideas stated in different expressions. To solve this difficulty, different scholars have conducted Natural Language Inference (NLI) researches using methods from traditional discrete models with hard logic to an end-to-end neural network for different languages. In context of Amharic language, even though there are number of research efforts in higher NLP applications, still they have limitation on understanding idea expressed in different ways due to an absence of NLI in Amharic language. Accordingly, we proposed deep learning based Natural Language Inference using similarity and farness aware bidirectional attentive matching for Amharic texts. The experiment on limited Amharic NLI dataset prepared also shows promising result that can be used as baseline for subsequent works.

Large Vocabulary Read Speech Corpora for Four Ethiopian Languages: Amharic, Tigrigna, Oromo, and Wolaytta
Solomon Teferra Abate | Martha Yifiru Tachbelie | Michael Melese | Hafte Abera | Tewodros Gebreselassie | Wondwossen Mulugeta | Yaregal Assabie | Million Meshesha Beyene | Solomon Atinafu | Binyam Ephrem Seyoum

Automatic Speech Recognition (ASR) is one of the most important technologies to help people live a better life in the 21st century. However, its development requires a big speech corpus for a language. The development of such a corpus is expensive especially for under-resourced Ethiopian languages. To address this problem we have developed four medium-sized (longer than 22 hours each) speech corpora for four Ethiopian languages: Amharic, Tigrigna, Oromo, and Wolaytta. In a way of checking the usability of the corpora and deliver a baseline ASR for each language. In this paper, we present the corpora and the baseline ASR systems for each language. The word error rates (WERs) we achieved show that the corpora are usable for further investigation and we recommend the collection of text corpora to train strong language models for Oromo and Wolaytta compared to others.

SIMPLEX-PB 2.0: A Reliable Dataset for Lexical Simplification in Brazilian Portuguese
Nathan Hartmann | Gustavo Henrique Paetzold | Sandra Aluísio

Most research on Lexical Simplification (LS) addresses non-native speakers of English, since they are numerous and easy to recruit. This makes it difficult to create LS solutions for other languages and target audiences. This paper presents SIMPLEX-PB 2.0, a dataset for LS in Brazilian Portuguese that, unlike its predecessor SIMPLEX-PB, accurately captures the needs of Brazilian underprivileged children. To create SIMPLEX-PB 2.0, we addressed all limitations of the old SIMPLEX-PB through multiple rounds of manual annotation. As a result, SIMPLEX-PB 2.0 features much more reliable and numerous candidate substitutions to complex words, as well as word complexity rankings produced by a group underprivileged children.

Bi-directional Answer-to-Answer Co-attention for Short Answer Grading using Deep Learning
Abebawu Eshetu | Getenesh Teshome | Ribka Alemahu

So far different research works have been conducted to achieve short answer questions. Hence, due to the advancement of artificial intelligence and adaptability of deep learning models, we introduced a new model to score short answer subjective questions. Using bi-directional answer to answer co-attention, we have demonstrated the extent to which each words and sentences features of student answer detected by the model and shown prom-ising result on both Kaggle and Mohler’s dataset. The experiment on Amharic short an-swer dataset prepared for this research work also shows promising result that can be used as baseline for subsequent works.

Effective questions in referential visual dialogue
Mauricio Mazuecos | Alberto Testoni | Raffaella Bernardi | Luciana Benotti

An interesting challenge for situated dialogue systems is referential visual dialog: by asking questions, the system has to identify the referent to which the user refers to. Task success is the standard metric used to evaluate these systems. However, it does not consider how effective each question is, that is how much each question contributes to the goal. We propose a new metric, that measures question effectiveness. As a preliminary study, we report the new metric for state of the art publicly available models on GuessWhat?!. Surprisingly, successful dialogues do not have a higher percentage of effective questions than failed dialogues. This suggests that a system with high task success is not necessarily one that generates good questions.

A Translation-Based Approach to Morphology Learning for Low Resource Languages
Tewodros Gebreselassie | Amanuel Mersha | Michael Gasser

“Low resource languages” usually refers to languages that lack corpora and basic tools such as part-of-speech taggers. But a significant number of such languages do benefit from the availability of relatively complex linguistic descriptions of phonology, morphology, and syntax, as well as dictionaries. A further category, probably the majority of the world’s languages, suffers from the lack of even these resources. In this paper, we investigate the possibility of learning the morphology of such a language by relying on its close relationship to a language with more resources. Specifically, we use a transfer-based approach to learn the morphology of the severely under-resourced language Gofa, starting with a neural morphological generator for the closely related language, Wolaytta. Both languages are members of the Omotic family, spoken and southwestern Ethiopia, and, like other Omotic languages, both are morphologically complex. We first create a finite- state transducer for morphological analysis and generation for Wolaytta, based on relatively complete linguistic descriptions and lexicons for the language. Next, we train an encoder-decoder neural network on the task of morphological generation for Wolaytta, using data generated by the FST. Such a network takes a root and a set of grammatical features as input and generates a word form as output. We then elicit Gofa translations of a small set of Wolaytta words from bilingual speakers. Finally, we retrain the decoder of the Wolaytta network, using a small set of Gofa target words that are translations of the Wolaytta outputs of the original network. The evaluation shows that the transfer network performs better than a separate encoder-decoder network trained on a larger set of Gofa words. We conclude with implications for the learning of morphology for severely under-resourced languages in regions where there are related languages with more resources.

Tigrinya Automatic Speech recognition with Morpheme based recognition units
Hafte Abera | Sebsibe Hailemariam

The Tigrinya language is agglutinative and has a large number of inflected and derived forms of words. Therefore a Tigrinya large vocabulary continuous speech recognition system often has a large number of different units and a high out-of-vocabulary (OOV) rate if a word is used as a recognition unit of a language model (LM) and lexicon. Therefore a morpheme-based approach has often been used and a morpheme is used as the recognition unit to reduce the high OOV rate. This paper presents an automatic speech recognition experiment conducted to see the effect of OOV words on the performance speech recognition system for Tigrinya. We tried to solve the OOV problem by using morphemes as lexicon and language model units. It has been found that the morpheme-based recognition system is better lexical and language modeling units than words. An absolute improvement (in word recognition accuracy) of 3.45 token and 8.36 types has been obtained as a result of using a morph-based vocabulary.

Variants of Vector Space Reductions for Predicting the Compositionality of English Noun Compounds
Pegah Alipoormolabashi | Sabine Schulte im Walde

Predicting the degree of compositionality of noun compounds is a crucial ingredient for lexicography and NLP applications, to know whether the compound should be treated as a whole, or through its constituents. Computational approaches for an automatic prediction typically represent compounds and their constituents within a vector space to have a numeric relatedness measure for the words. This paper provides a systematic evaluation of using different vector-space reduction variants for the prediction. We demonstrate that Word2vec and nouns-only dimensionality reductions are the most successful and stable vector space reduction variants for our task.

An Assessment of Language Identification Methods on Tweets and Wikipedia Articles
Pedro Vernetti | Larissa Freitas

Language identification is the task of determining the language which a given text is written. This task is important for Natural Language Processing and Information Retrieval activities. Two popular approaches for language identification are the N-grams and stopwords models. In this paper, these two models were tested on different types of documents such as short, irregular texts (tweets) and long, regular texts (Wikipedia articles).

A Comparison of Identification Methods of Brazilian Music Styles by Lyrics
Patrick Guimarães | Jader Froes | Douglas Costa | Larissa Freitas

In our work, we applied different techniques for the task of genre classification using lyrics. Utilizing our dataset with lyrics of typical genres in Brazil divided into seven classes, we apply some models used in machine learning and deep learning classification tasks. We explore the performance of usual models for text classification using an input in the Portuguese language. We also compare the use of RNN and classic machine learning approaches for text classification, exploring the most used methods in the field.

Enabling fast and correct typing in ‘Leichte Sprache’ (Easy Language)
Ina Steinmetz | Karin Harbusch

Simplified languages are instruments for inclusion aiming to overcome language barriers. Leichte Sprache (LS), for instance, is a variety of German with reduced complexity (cf. Basic English). So far, LS is mainly provided for, but rarely written by, its target groups, e.g. people with cognitive impairments. One reason may be the lack of technical support during the process from message conceptualization to sentence realization. In the following, we present a system for assisted typing in LS whose accuracy and speed is largely due to the deployment of real time natural-language processing enabling efficient prediction and context-sensitive grammar support.

AI4D - African Language Dataset Challenge
Kathleen Siminyu | Sackey Freshia

As language and speech technologies become more advanced, the lack of fundamental digital resources for African languages, such as data, spell checkers and PoS taggers, means that the digital divide between these languages and others keeps growing. This work details the organisation of the AI4D - African Language Dataset Challenge, an effort to incentivize the creation, curation and uncovering to African language datasets through a competitive challenge, particularly datasets that are annotated or prepared for use in a downstream NLP task.

Can Wikipedia Categories Improve Masked Language Model Pretraining?
Diksha Meghwal | Katharina Kann | Iacer Calixto | Stanislaw Jastrzebski

Pretrained language models have obtained impressive results for a large set of natural language understanding tasks. However, training these models is computationally expensive and requires huge amounts of data. Thus, it would be desirable to automatically detect groups of more or less important examples. Here, we investigate if we can leverage sources of information which are commonly overlooked, Wikipedia categories as listed in DBPedia, to identify useful or harmful data points during pretraining. We define an experimental setup in which we analyze correlations between language model perplexity on specific clusters and downstream NLP task performances during pretraining. Our experiments show that Wikipedia categories are not a good indicator of the importance of specific sentences for pretraining.

FFR v1.1: Fon-French Neural Machine Translation
Chris Chinenye Emezue | Femi Pancrace Bonaventure Dossou

All over the world and especially in Africa, researchers are putting efforts into building Neural Machine Translation (NMT) systems to help tackle the language barriers in Africa, a continent of over 2000 different languages. However, the low-resourceness, diacritical, and tonal complexities of African languages are major issues being faced. The FFR project is a major step towards creating a robust translation model from Fon, a very low-resource and tonal language, to French, for research and public use. In this paper, we introduce FFR Dataset, a corpus of Fon-to-French translations, describe the diacritical encoding process, and introduce our FFR v1.1 model, trained on the dataset. The dataset and model are made publicly available, to promote collaboration and reproducibility.

Classification and Analysis of Neologisms Produced by Learners of Spanish: Effects of Proficiency and Task
Shira Wein

The Spanish Learner Language Oral Corpora (SPLLOC) of transcribed conversations between investigators and language learners contains a set of neologism tags. In this work, the utterances tagged as neologisms are broken down into three categories: true neologisms, loanwords, and errors. This work examines the relationships between neologism, loanword, and error production and both language learner level and conversation task. The results of this study suggest that loanwords and errors are produced most frequently by language learners with moderate experience, while neologisms are produced most frequently by native speakers. This study also indicates that tasks that require descriptions of images draw more neologism, loanword and error production. We ultimately present a unique analysis of the implications of neologism, loanword, and error production useful for further work in second language acquisition research, as well as for language educators.

Developing a Monolingual Sentence Simplification Corpus for Urdu
Yusra Anees | Sadaf Abdul Rauf | Nauman Iqbal | Abdul Basit Siddiqi

Complex sentences are a hurdle in the learning process of language learners. Sentence simplification aims to convert a complex sentence into its simpler form such that it is easily comprehensible. To build such automated simplification systems, corpora of complex sentences and their simplified versions is the first step to understand sentence complexity and enable the development of automatic text simplification systems. No such corpus has yet been developed for Urdu and we fill this gap by developing one such corpus to help start readability and automatic sentence simplification research. We present a lexical and syntactically simplified Urdu simplification corpus and a detailed analysis of the various simplification operations. We further analyze our corpora using text readability measures and present a comparison of the original, lexical simplified, and syntactically simplified corpora.

Translating Natural Language Instructions for Behavioral Robot Navigation with a Multi-Head Attention Mechanism
Patricio Cerda-Mardini | Vladimir Araujo | Álvaro Soto

We propose a multi-head attention mechanism as a blending layer in a neural network model that translates natural language to a high level behavioral language for indoor robot navigation. We follow the framework established by (Zang et al., 2018a) that proposes the use of a navigation graph as a knowledge base for the task. Our results show significant performance gains when translating instructions on previously unseen environments, therefore, improving the generalization capabilities of the model.

Towards Mitigating Gender Bias in a decoder-based Neural Machine Translation model by Adding Contextual Information
Christine Basta | Marta R. Costa-jussà | José A. R. Fonollosa

Gender bias negatively impacts many natural language processing applications, including machine translation (MT). The motivation behind this work is to study whether recent proposed MT techniques are significantly contributing to attenuate biases in document-level and gender-balanced data. For the study, we consider approaches of adding the previous sentence and the speaker information, implemented in a decoder-based neural MT system. We show improvements both in translation quality (+1 BLEU point) as well as in gender bias mitigation on WinoMT (+5% accuracy).

Predicting and Analyzing Law-Making in Kenya
Oyinlola Babafemi | Adewale Akinfaderin

Modelling and analyzing parliamentary legislation, roll-call votes and order of proceedings in developed countries has received significant attention in recent years. In this paper, we focused on understanding the bills introduced in a developing democracy, the Kenyan bicameral parliament. We developed and trained machine learning models on a combination of features extracted from the bills to predict the outcome - if a bill will be enacted or not. We observed that the texts in a bill are not as relevant as the year and month the bill was introduced and the category the bill belongs to.

Defining and Evaluating Fair Natural Language Generation
Catherine Yeo | Alyssa Chen

Our work focuses on the biases that emerge in the natural language generation (NLG) task of sentence completion. In this paper, we introduce a mathematical framework of fairness for NLG followed by an evaluation of gender biases in two state-of-the-art language models. Our analysis provides a theoretical formulation for biases in NLG and empirical evidence that existing language generation models embed gender bias.

Political Advertising Dataset: the use case of the Polish 2020 Presidential Elections
Lukasz Augustyniak | Krzysztof Rajda | Tomasz Kajdanowicz | Michał Bernaczyk

Political campaigns are full of political ads posted by candidates on social media. Political advertisements constitute a basic form of campaigning, subjected to various social requirements. We present the first publicly open dataset for detecting specific text chunks and categories of political advertising in the Polish language. It contains 1,705 human-annotated tweets tagged with nine categories, which constitute campaigning under Polish electoral law. We achieved a 0.65 inter-annotator agreement (Cohen’s kappa score). An additional annotator resolved the mismatches between the first two annotators improving the consistency and complexity of the annotation process. We used the newly created dataset to train a well established neural tagger (achieving a 70% percent points F1 score). We also present a possible direction of use cases for such datasets and models with an initial analysis of the Polish 2020 Presidential Elections on Twitter.

The human unlikeness of neural language models in next-word prediction
Cassandra L. Jacobs | Arya D. McCarthy

The training objective of unidirectional language models (LMs) is similar to a psycholinguistic benchmark known as the cloze task, which measures next-word predictability. However, LMs lack the rich set of experiences that people do, and humans can be highly creative. To assess human parity in these models’ training objective, we compare the predictions of three neural language models to those of human participants in a freely available behavioral dataset (Luke & Christianson, 2016). Our results show that while neural models show a close correspondence to human productions, they nevertheless assign insufficient probability to how often speakers guess upcoming words, especially for open-class content words.

Long-Tail Predictions with Continuous-Output Language Models
Shiran Dudy | Steven Bedrick

Neural language models typically employ a categorical approach to prediction and training, leading to well-known computational and numerical limitations. An under-explored alternative approach is to perform prediction directly against a continuous word embedding space, which according to recent research is more akin to how lexemes are represented in the brain. Choosing this method opens the door for for large-vocabulary, language models and enables substantially smaller and simpler computational complexities. In this research we explore a different important trait - the continuous output prediction models reach low-frequency vocabulary words which we show are often ignored by the categorical model. Such words are essential, as they can contribute to personalization and user vocabulary adaptation. In this work, we explore continuous-space language modeling in the context of a word prediction task over two different textual domains (newswire text and biomedical journal articles). We investigate both traditional and adversarial training approaches, and report results using several different embedding spaces and decoding mechanisms. We find that our continuous-prediction approach outperforms the standard categorical approach in terms of term diversity, in particular with rare words.

Analyzing the Framing of 2020 Presidential Candidates in the News
Audrey Acken | Dorottya Demszky

In this study, we apply NLP methods to learn about the framing of the 2020 Democratic Presidential candidates in news media. We use both a lexicon-based approach and word embeddings to analyze how candidates are discussed in news sources with different political leanings. Our results show significant differences in the framing of candidates across the news sources along several dimensions, such as sentiment and agency, paving the way for a deeper investigation.

Understanding the Impact of Experiment Design for Evaluating Dialogue System Output
Sashank Santhanam | Samira Shaikh

Evaluation of output from natural language generation (NLG) systems is typically conducted via crowdsourced human judgments. To understand the impact of how experiment design might affect the quality and consistency of such human judgments, we designed a between-subjects study with four experimental conditions. Through our systematic study with 40 crowdsourced workers in each task, we find that using continuous scales achieves more consistent ratings than Likert scale or ranking-based experiment design. Additionally, we find that factors such as no prior experience of participating in similar studies of rating dialogue system output

Studying The Effect of Emotional and Moral Language on Information Contagion during the Charlottesville Event
Khyati Mahajan | Samira Shaikh

We highlight the contribution of emotional and moral language towards information contagion online. We find that retweet count on Twitter is significantly predicted by the use of negative emotions with negative moral language. We find that a tweet is less likely to be retweeted (hence less engagement and less potential for contagion) when it has emotional language expressed as anger along with a specific type of moral language, known as authority-vice. Conversely, when sadness is expressed with authority-vice, the tweet is more likely to be retweeted. Our findings indicate how emotional and moral language can interact in predicting information contagion.

Mapping of Narrative Text Fields To ICD-10 Codes Using Natural Language Processing and Machine Learning
Risuna Nkolele

The assignment of ICD-10 codes is done manually, which is laborious and prone to errors. The use of natural language processing and machine learning approaches have been receiving increasing attention on automating the task of assigning ICD-10 codes. In this study, we investigate the effect of different approaches on automating the task of assigning ICD-10 codes. To do this we use the South African clinical dataset containing three narrative text fields (Clinical Summary, Presenting Complaints, and Examination Findings). The following traditional machine learning algorithms, namely: Logistic Regression, Multinomial Naive Bayes, Support Vector Machine, Decision Tree, RandomForest, and Extreme Gradient Boost were used as our classifiers. Our study results show the strong potential of automated ICD-10 coding from the narrative text fields. ExtremeGradient Boost outperformed other classifiers in automating the task of assigning ICD-10 codes based on the three narrative text fields with an accuracy of 79%, precision of75%, and recall of 78%. While our worst classifier (Decision Tree) achieved the accuracy of 54%, precision of 60% and recall of 56%.

Multitask Models for Controlling the Complexity of Neural Machine Translation
Sweta Agrawal | Marine Carpuat

We introduce a machine translation task where the output is aimed at audiences of different levels of target language proficiency. We collect a novel dataset of news articles available in English and Spanish and written for diverse reading grade levels. We leverage this dataset to train multitask sequence to sequence models that translate Spanish into English targeted at an easier reading grade level than the original Spanish. We show that multitask models outperform pipeline approaches that translate and simplify text independently.

Using Social Media For Bitcoin Day Trading Behavior Prediction
Anna Paula Pawlicka Maule | Kristen Johnson

This abstract presents preliminary work in the application of natural language processing techniques and social network modeling for the prediction of cryptocurrency trading and investment behavior. Specifically, we are building models to use language and social network behaviors to predict if the tweets of a 24-hour period can be used to buy or sell cryptocurrency to make a profit. In this paper we present our novel task and initial language modeling studies.

HausaMT v1.0: Towards English–Hausa Neural Machine Translation
Adewale Akinfaderin

Neural Machine Translation (NMT) for low-resource languages suffers from low performance because of the lack of large amounts of parallel data and language diversity. To contribute to ameliorating this problem, we built a baseline model for English–Hausa machine translation, which is considered a task for low–resource language. The Hausa language is the second largest Afro–Asiatic language in the world after Arabic and it is the third largest language for trading across a larger swath of West Africa countries, after English and French. In this paper, we curated different datasets containing Hausa–English parallel corpus for our translation. We trained baseline models and evaluated the performance of our models using the Recurrent and Transformer encoder–decoder architecture with two tokenization approaches: standard word–level tokenization and Byte Pair Encoding (BPE) subword tokenization.

Outcomes of coming out: Analyzing stories of LGBTQ+
Krithika Ramesh | Tanvi Anand

The Internet is frequently used as a platform through which opinions and views on various topics can be expressed. One such topic that draws controversial attention is LGBTQ+ rights. This paper attempts to analyze the reaction that members of the LGBTQ+ community face when they reveal their gender or sexuality, or in other words, when they ‘come out of the closet’. We aim to classify the experiences shared by them as positive or negative. We collected data from various sources, primarily Twitter. We have applied deep learning techniques and compared the results to other classifiers, and the results obtained from applying classical sentiment analysis techniques to it.

An Evaluation of Subword Segmentation Strategies for Neural Machine Translation of Morphologically Rich Languages
Aquia Richburg | Ramy Eskander | Smaranda Muresan | Marine Carpuat

Byte-Pair Encoding (BPE) (Sennrich et al., 2016) has become a standard pre-processing step when building neural machine translation systems. However, it is not clear whether this is an optimal strategy in all settings. We conduct a controlled comparison of subword segmentation strategies for translating two low-resource morphologically rich languages (Swahili and Turkish) into English. We show that segmentations based on a unigram language model (Kudo, 2018) yield comparable BLEU and better recall for translating rare source words than BPE.

Enhanced Urdu Word Segmentation using Conditional Random Fields and Morphological Context Features
Aamir Farhan | Mashrukh Islam | Dipti Misra Sharma

Word segmentation is a fundamental task for most of the NLP applications. Urdu adopts Nastalique writing style which does not have a concept of space. Furthermore, the inherent non-joining attributes of certain characters in Urdu create spaces within a word while writing in digital format. Thus, Urdu not only has space omission but also space insertion issues which make the word segmentation task challenging. In this paper, we improve upon the results of Zia, Raza and Athar (2018) by using a manually annotated corpus of 19,651 sentences along with morphological context features. Using the Conditional Random Field sequence modeler, our model achieves F 1 score of 0.98 for word boundary identification and 0.92 for sub-word boundary identification tasks. The results demonstrated in this paper outperform the state-of-the-art methods.