Widening Natural Language Processing (2019)


bib (full) Proceedings of the 2019 Workshop on Widening NLP

pdf bib
Proceedings of the 2019 Workshop on Widening NLP
Amittai Axelrod | Diyi Yang | Rossana Cunha | Samira Shaikh | Zeerak Waseem

Development of a General Purpose Sentiment Lexicon for Igbo Language
Emeka Ogbuju | Moses Onyesolu

There are publicly available general purpose sentiment lexicons in some high resource languages but very few exist in the low resource languages. This makes it difficult to directly perform sentiment analysis tasks in such languages. The objective of this work is to create a general purpose sentiment lexicon for Igbo language that can determine the sentiment of documents written in Igbo language without having to translate it to English language. The material used was an automatically translated Liu’s lexicon and manual addition of Igbo native words. The result of this work is a general purpose lexicon – IgboSentilex. The performance was tested on the BBC Igbo news channel. It returned an average polarity agreement of 95% with other general purpose sentiment lexicons.

Towards a Resource Grammar for Runyankore and Rukiga
David Bamutura | Peter Ljunglöf

Currently, there is a lack of computational grammar resources for many under-resourced languages which limits the ability to develop Natural Language Processing (NLP) tools and applications such as Multilingual Document Authoring, Computer-Assisted Language Learning (CALL) and Low-Coverage Machine Translation (MT) for these languages. In this paper, we present our attempt to formalise the grammar of two such languages: Runyankore and Rukiga. For this formalisation we use the Grammatical Framework (GF) and its Resource Grammar Library (GF-RGL).

Speech Recognition for Tigrinya language Using Deep Neural Network Approach
Hafte Abera | Sebsibe H/mariam

This work presents a speech recognition model for Tigrinya language .The Deep Neural Network is used to make the recognition model. The Long Short-Term Memory Network (LSTM), which is a special kind of Recurrent Neural Network composed of Long Short-Term Memory blocks, is the primary layer of our neural network model. The 40-dimensional features are MFCC-LDA-MLLT-fMLLR with CMN were used. The acoustic models are trained on features that are obtained by projecting down to 40 dimensions using linear discriminant analysis (LDA). Moreover, speaker adaptive training (SAT) is done using a single feature-space maximum likelihood linear regression (FMLLR) transform estimated per speaker. We train and compare LSTM and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models. Finally, the accuracy of the model is evaluated based on the recognition rate.

Knowledge-Based Word Sense Disambiguation with Distributional Semantic Expansion
Hossein Rouhizadeh | Mehrnoush Shamsfard | Masoud Rouhizadeh

In this paper, we presented a WSD system that uses LDA topics for semantic expansion of document words. Our system also uses sense frequency information from SemCor to give higher priority to the senses which are more probable to happen.

AspeRa: Aspect-Based Rating Prediction Based on User Reviews
Elena Tutubalina | Valentin Malykh | Sergey Nikolenko | Anton Alekseev | Ilya Shenbin

We propose a novel Aspect-based Rating Prediction model (AspeRa) that estimates user rating based on review texts for the items. It is based on aspect extraction with neural networks and combines the advantages of deep learning and topic modeling. It is mainly designed for recommendations, but an important secondary goal of AspeRa is to discover coherent aspects of reviews that can be used to explain predictions or for user profiling. We conduct a comprehensive empirical study of AspeRa, showing that it outperforms state-of-the-art models in terms of recommendation quality and produces interpretable aspects. This paper is an abridged version of our work (Nikolenko et al., 2019)

Recognizing Arrow Of Time In The Short Stories
Fahimeh Hosseini | Hosein Fooladi | Mohammad Reza Samsami

Recognizing the arrow of time in the context of paragraphs in short stories is a challenging task. i.e., given only two paragraphs (excerpted from a random position in a short story), determining which comes first and which comes next is a difficult task even for humans. In this paper, we have collected and curated a novel dataset for tackling this challenging task. We have shown that a pre-trained BERT architecture achieves reasonable accuracy on the task, and outperforms RNN-based architectures.

Amharic Word Sequence Prediction
Nuniyat Kifle

The significance of computers and handheld devices are not deniable in the modern world of today. Texts are entered to these devices using word processing programs as well as other techniques and word prediction is one of the techniques. Word Prediction is the action of guessing or forecasting what word comes after, based on some current information, and it is the main focus of this study. Even though Amharic is used by a large number of populations, no significant work is done on the topic of word sequence prediction. In this study, Amharic word sequence prediction model is developed with statistical methods using Hidden Markov Model by incorporating detailed Part of speech tag. Evaluation of the model is performed using developed prototype and keystroke savings (KSS) as a metrics. According to our experiment, prediction result using a bi-gram with detailed Part of Speech tag model has higher KSS and it is better compared to tri-gram model and better than those without Part of Speech tag. Therefore, statistical approach with Detailed POS has quite good potential on word sequence prediction for Amharic language. This research deals with designing word sequence prediction model in Amharic language. It is a language that is spoken in eastern Africa. One of the needs for Amharic word sequence prediction for mobile use and other digital devices is in order to facilitate data entry and communication in our language. Word sequence prediction is a challenging task for inflected languages. (Arora, 2007) These kinds of languages are morphologically rich and have enormous word forms. i.e. one word can have different forms. As Amharic language is highly inflected language and morphologically rich it shares this problem. (prediction, 2008) This problem makes word prediction system much more difficult and results poor performance. Due to this reason storing all forms in dictionary won’t solve the problem as in English and other less inflected languages. But considering other techniques that could help the predictor to suggest the next word like a POS based prediction should be used. Previous researches used dictionary approach with no consideration of context information. Hence storing all forms of words in dictionary for inflected languages such as Amharic language has been less effective. The main goal of this thesis is to implement Amharic word prediction model that works with better prediction speed and with narrowed search space as much as possible. We introduced two models; tags and words and linear interpolation that use part of speech tag information in addition to word n-grams in order to maximize the likelihood of syntactic appropriateness of the suggestions. We believe the results found reflect this. Amharic word sequence prediction using bi-gram model with higher POS weight and detailed Part of speech tag gave better keystroke savings in all scenarios of our experiment. The study followed Design Science Research Methodology (DSRM). Since DSRM includes approaches, techniques, tools, algorithms and evaluation mechanisms in the process, we followed statistical approach with statistical language modeling and built Amharic prediction model based on information from Part of Speech tagger. The statistics included in the systems varies from single word frequencies to part-of-speech tag n-grams. That means it included the statistics of Word frequencies, Word sequence frequencies, Part-of-speech sequence frequencies and other important information. Later on the system was evaluated using Keystroke Savings. (Lindh, 011). Linux mint was used as the main Operation System during the frame work design. We used corpus of 680,000 tagged words that has 31 tag sets, python programming language and its libraries for both the part of speech tagger and the predictor module. Other Tool that was used is the SRILIM (The SRI language modeling toolkit) in order to generate unigram bigram and trigram count as an input for the language model. SRILIM is toolkit that uses to build and apply statistical language modeling. This thesis presented Amharic word sequence prediction model using the statistical approach. We described a combined statistical and lexical word prediction system for handling inflected languages by making use of POS tags to build the language model. We developed Amharic language models of bigram and trigram for the training purpose. We obtained 29% of KSS using bigram model with detailed part ofspeech tag. Hence, Based on the experiments carried out for this study and the results obtained, the following conclusions were made. We concluded that employing syntactic information in the form of Part-of-Speech (POS) n-grams promises more effective predictions. We also can conclude data quantity, performance of POS tagger and data quality highly affects the keystroke savings. Here in our study the tests were done on a small collection of 100 phrases. According to our evaluation better Keystroke saving (KSS) is achieved when using bi-gram model than the tri-gram models. We believe the results obtained using the experiment of detailed Part of speech tags were effective Since speed and search space are the basic issues in word sequence prediction

A Framework for Relation Extraction Across Multiple Datasets in Multiple Domains
Geeticka Chauhan | Matthew McDermott | Peter Szolovits

In this work, we aim to build a unifying framework for relation extraction (RE), applying this on 3 highly used datasets with the ability to be extendable to new datasets. At the moment, the domain suffers from lack of reproducibility as well as a lack of consensus on generalizable techniques. Our framework will be open-sourced and will aid in performing systematic exploration on the effect of different modeling techniques, pre-processing, training methodologies and evaluation metrics on the 3 datasets to help establish a consensus.

Learning and Understanding Different Categories of Sexism Using Convolutional Neural Network’s Filters
Sima Sharifirad | Alon Jacovi

Sexism is very common in social media and makes the boundaries of free speech tighter for female users. Automatically flagging and removing sexist content requires niche identification and description of the categories. In this study, inspired by social science work, we propose three categories of sexism toward women as follows: “Indirect sexism”, “Sexual sexism” and “Physical sexism”. We build classifiers such as Convolutional Neural Network (CNN) to automatically detect different types of sexism and address problems of annotation. Even though inherent non-interpretability of CNN is a challenge for users who detect sexism, as the reason classifying a given speech instance with regard to sexism is difficult to glance from a CNN. However, recent research developed interpretable CNN filters for text data. In a CNN, filters followed by different activation patterns along with global max-pooling can help us tease apart the most important ngrams from the rest. In this paper, we interpret a CNN model trained to classify sexism in order to understand different categories of sexism by detecting semantic categories of ngrams and clustering them. Then, these ngrams in each category are used to improve the performance of the classification task. It is a preliminary work using machine learning and natural language techniques to learn the concept of sexism and distinguishes itself by looking at more precise categories of sexism in social media along with an in-depth investigation of CNN’s filters.

Modeling Five Sentence Quality Representations by Finding Latent Spaces Produced with Deep Long Short-Memory Models
Pablo Rivas

We present a study in which we train neural models that approximate rules that assess the quality of English sentences. We modeled five rules using deep LSTMs trained over a dataset of sentences whose quality is evaluated under such rules. Preliminary results suggest the neural architecture can model such rules to high accuracy.

English-Ethiopian Languages Statistical Machine Translation
Solomon Teferra Abate | Michael Melese | Martha Yifiru Tachbelie | Million Meshesha | Solomon Atinafu | Wondwossen Mulugeta | Yaregal Assabie | Hafte Abera | Biniyam Ephrem | Tewodros Gebreselassie | Wondimagegnhue Tsegaye Tufa | Amanuel Lemma | Tsegaye Andargie | Seifedin Shifaw

In this paper, we describe an attempt towards the development of parallel corpora for English and Ethiopian Languages, such as Amharic, Tigrigna, Afan-Oromo, Wolaytta and Ge’ez. The corpora are used for conducting bi-directional SMT experiments. The BLEU scores of the bi-directional SMT systems show a promising result. The morphological richness of the Ethiopian languages has a great impact on the performance of SMT especially when the targets are Ethiopian languages.

An automatic discourse relation alignment experiment on TED-MDB
Sibel Ozer | Deniz Zeyrek

This paper describes an automatic discourse relation alignment experiment as an empirical justification of the planned annotation projection approach to enlarge the 3600-word multilingual corpus of TED Multilingual Discourse Bank (TED-MDB). The experiment is carried out on a single language pair (English-Turkish) included in TED-MDB. The paper first describes the creation of a large corpus of English-Turkish bi-sentences, then it presents a sense-based experiment that automatically aligns the relations in the English sentences of TED-MDB with the Turkish sentences. The results are very close to the results obtained from an earlier semi-automatic post-annotation alignment experiment validated by human annotators and are encouraging for future annotation projection tasks.

The Design and Construction of the Corpus of China English
Lixin Xia | Yun Xia

The paper describes the development a corpus of an English variety, i.e. China English, in or-der to provide a linguistic resource for researchers in the field of China English. The Corpus of China English (CCE) was built with due consideration given to its representativeness and authenticity. It was composed of more than 13,962,102 tokens in 15,333 texts evenly divided between the following four genres: newspapers, magazines, fiction and academic writings. The texts cover a wide range of domains, such as news, financial, politics, environment, social, culture, technology, sports, education, philosophy, literary, etc. It is a helpful resource for research on China English, computational linguistics, natural language processing, corpus linguistics and English language education.

Learning Trilingual Dictionaries for Urdu – Roman Urdu – English
Moiz Rauf | Sebastian Padó

In this paper, we present an effort to generate a joint Urdu, Roman Urdu and English trilingual lexicon using automated methods. We make a case for using statistical machine translation approaches and parallel corpora for dictionary creation. To this purpose, we use word alignment tools on the corpus and evaluate translations using human evaluators. Despite different writing script and considerable noise in the corpus our results show promise with over 85% accuracy of Roman Urdu–Urdu and 45% English–Urdu pairs.

Joint Inference on Bilingual Parse Trees for PP-attachment Disambiguation
Geetanjali Rakshit

Prepositional Phrase (PP) attachment is a classical problem in NLP for languages like English, which suffer from structural ambiguity. In this work, we solve this problem with the help of another language free from such ambiguities, using the parse tree of the parallel sentence in the other language, and word alignments. We formulate an optimization framework that encourages agreement between the parse trees for two languages, and solve it using a novel Dual Decomposition (DD) based algorithm. Experiments on the English-Hindi language pair show promising improvements over the baseline.

Using Attention-based Bidirectional LSTM to Identify Different Categories of Offensive Language Directed Toward Female Celebrities
Sima Sharifirad | Stan Matwin

Social media posts reflect the emotions, intentions and mental state of the users. Twitter users who harass famous female figures may do so with different intentions and intensities. Recent studies have published datasets focusing on different types of online harassment, vulgar language, and emotional intensities. We trained, validate and test our proposed model, attention-based bidirectional neural network, on the three datasets:”online harassment”, “vulgar language” and “valance” and achieved state of the art performance in two of the datasets. We report F1 score for each dataset separately along with the final precision, recall and macro-averaged F1 score. In addition, we identify ten female figures from different professions and racial backgrounds who have experienced harassment on Twitter. We tested the trained models on ten collected corpuses each related to one famous female figure to predict the type of harassing language, the type of vulgar language and the degree of intensity of language occurring on their social platforms. Interestingly, the achieved results show different patterns of linguistic use targeting different racial background and occupations. The contribution of this study is two-fold. From the technical perspective, our proposed methodology is shown to be effective with a good margin in comparison to the previous state-of-the-art results on one of the two available datasets. From the social perspective, we introduce a methodology which can unlock facts about the nature of offensive language targeting women on online social platforms. The collected dataset will be shared publicly for further investigation.

Sentiment Analysis Model for Opinionated Awngi Text: Case of Music Reviews
Melese Mihret | Muluneh Atinaf

Abstract The analysis of sentiments is imperative to make a decision for individuals, organizations, and governments. Due to the rapid growth of Awngi (Agew) text on the web, there is no available corpus annotated for sentiment analysis. In this paper, we present a SA model for the Awngi language spoken in Ethiopia, by using a supervised machine learning approach. We developed our corpus by collecting around 1500 posts from online sources. This research is begun to build and evaluate the model for opinionated Awngi music reviews. Thus, pre-processing techniques have been employed to clean the data, to convert transliterations to the native Ethiopic script for accessibility and convenience to typing and to change the words to their base form by removing the inflectional morphemes. After pre-processing, the corpus is manually annotated by three the language professional for giving polarity, and rate, their level of confidence in their selection and sentiment intensity scale values. To improve the calculation method of feature selection and weighting and proposed a more suitable SA algorithm for feature extraction named CHI and weight calculation named TF IDF, increasing the proportion and weight of sentiment words in the feature words. We employed Support Vector Machines (SVM), Naïve Bayes (NB) and Maximum Entropy (MxEn) machine learning algorithms. Generally, the results are encouraging, despite the morphological challenge in Awngi, the data cleanness and small size of data. We are believed that the results could improve further with a larger corpus.

A compositional view of questions
Maria Boritchev | Maxime Amblard

We present a research on compositional treatment of questions in neo-davidsonian event semantics style. Our work is based on (Champollion, 2011) where only declarative sentences were considered. Our research is based on complex formal examples, paving the way towards further research in this domain and further testing on real-life corpora.

Controlling the Specificity of Clarification Question Generation
Yang Trista Cao | Sudha Rao | Hal Daumé III

Unlike comprehension-style questions, clarification questions look for some missing information in a given context. However, without guidance, neural models for question generation, similar to dialog generation models, lead to generic and bland questions that cannot elicit useful information. We argue that controlling the level of specificity of the generated questions can have useful applications and propose a neural clarification question generation model for the same. We first train a classifier that annotates a clarification question with its level of specificity (generic or specific) to the given context. Our results on the Amazon questions dataset demonstrate that training a clarification question generation model on specificity annotated data can generate questions with varied levels of specificity to the given context.

Non-Monotonic Sequential Text Generation
Kiante Brantley | Kyunghyun Cho | Hal Daumé | Sean Welleck

Standard sequential generation methods assume a pre-specified generation order, such as text generation methods which generate words from left to right. In this work, we propose a framework for training models of text generation that operate in non-monotonic orders; the model directly learns good orders, without any additional annotation. Our framework operates by generating a word at an arbitrary position, and then recursively generating words to its left and then words to its right, yielding a binary tree. Learning is framed as imitation learning, including a coaching method which moves from imitating an oracle to reinforcing the policy’s own preferences. Experimental results demonstrate that using the proposed method, it is possible to learn policies which generate text without pre-specifying a generation order while achieving competitive performance with conventional left-to-right generation.

Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them
Hila Gonen | Yoav Goldberg

Word embeddings are widely used in NLP for a vast range of tasks. It was shown that word embeddings derived from text corpora reflect gender biases in society, causing serious concern. Several recent works tackle this problem, and propose methods for significantly reducing this gender bias in word embeddings, demonstrating convincing results. However, we argue that this removal is superficial. While the bias is indeed substantially reduced according to the provided bias definition, the actual effect is mostly hiding the bias, not removing it. The gender bias information is still reflected in the distances between “gender-neutralized” words in the debiased embeddings, and can be recovered from them. We present a series of experiments to support this claim, for two debiasing methods. We conclude that existing bias removal techniques are insufficient, and should not be trusted for providing gender-neutral modeling.

How does Grammatical Gender Affect Noun Representations in Gender-Marking Languages?
Hila Gonen | Yova Kementchedjhieva | Yoav Goldberg

Many natural languages assign grammatical gender also to inanimate nouns in the language. In such languages, words that relate to the gender-marked nouns are inflected to agree with the noun’s gender. We show that this affects the word representations of inanimate nouns, resulting in nouns with the same gender being closer to each other than nouns with different gender. While “embedding debiasing” methods fail to remove the effect, we demonstrate that a careful application of methods that neutralize grammatical gender signals from the words’ context when training word embeddings is effective in removing it. Fixing the grammatical gender bias results in a positive effect on the quality of the resulting word embeddings, both in monolingual and cross lingual settings. We note that successfully removing gender signals, while achievable, is not trivial to do and that a language-specific morphological analyzer, together with careful usage of it, are essential for achieving good results.

Automatic Product Categorization for Official Statistics
Andrea Roberson

The North American Product Classification System (NAPCS) is a comprehensive, hierarchical classification system for products (goods and services) that is consistent across the three North American countries. Beginning in 2017, the Economic Census will use NAPCS to produce economy-wide product tabulations. Respondents are asked to report data from a long, pre-specified list of potential products in a given industry, with some lists containing more than 50 potential products. Businesses have expressed the desire to alternatively supply Universal Product Codes (UPC) to the U. S. Census Bureau. Much work has been done around the categorization of products using product descriptions. No study has applied these efforts for the calculation of official statistics (statistics published by government agencies) using only the text of UPC product descriptions. The question we address in this paper is: Given UPC codes and their associated product descriptions, can we accurately predict NAPCS? We tested the feasibility of businesses submitting a spreadsheet with Universal Product Codes and their associated text descriptions. This novel strategy classified text with very high accuracy rates, all of our algorithms surpassed over 90 percent.

An Online Topic Modeling Framework with Topics Automatically Labeled
Jin Fenglei | Gao Cuiyun | Lyu Michael R.

In this paper, we propose a novel online topic tracking framework, named IEDL, for tracking the topic changes related to deep learning techniques on Stack Exchange and automatically interpreting each identified topic. The proposed framework combines the prior topic distributions in a time window during inferring the topics in current time slice, and introduces a new ranking scheme to select most representative phrases and sentences for the inferred topics. Experiments on 7,076 Stack Exchange posts show the effectiveness of IEDL in tracking topic changes.

Construction and Alignment of Multilingual Entailment Graphs for Semantic Inference
Sabine Weber | Mark Steedman

This paper presents ongoing work on the construction and alignment of predicate entailment graphs in English and German. We extract predicate-argument pairs from large corpora of monolingual English and German news text and construct monolingual paraphrase clusters and entailment graphs. We use an aligned subset of entities to derive the bilingual alignment of entities and relations, and achieve better than baseline results on a translated subset of a predicate entailment data set (Levy and Dagan, 2016) and the German portion of XNLI (Conneau et al., 2018).

KB-NLG: From Knowledge Base to Natural Language Generation
Wen Cui | Minghui Zhou | Rongwen Zhao | Narges Norouzi

We perform the natural language generation (NLG) task by mapping sets of Resource Description Framework (RDF) triples into text. First we investigate the impact of increasing the number of entity types in delexicalisaiton on the generation quality. Second we conduct different experiments to evaluate two widely applied language generation systems, encoder-decoder with attention and the Transformer model on a large benchmark dataset. We evaluate different models on automatic metrics, as well as the training time. To our knowledge, we are the first to apply Transformer model to this task.

Acoustic Characterization of Singaporean Children’s English: Comparisons to American and British Counterparts
Yuling Gu | Nancy Chen

We investigate English pronunciation patterns in Singaporean children in relation to their American and British counterparts by conducting archetypal analysis on selected vowel pairs. Given that Singapore adopts British English as the institutional standard, one might expect Singaporean children to follow British pronunciation patterns, but we observe that Singaporean children also present similar patterns to Americans for TRAP-BATH spilt vowels: (1) British and Singaporean children both produce these vowels with a relatively lowered tongue height. (2) These vowels are more fronted for American and Singaporean children (p < 0.001). In addition, when comparing /æ/ and /ε/ productions, British speakers show the clearest distinction between the two vowels; Singaporean and American speakers exhibit a higher and more fronted tongue position for /æ/ (p < 0.001), causing /æ/ to be acoustically more similar to /ε/.

Rethinking Phonotactic Complexity
Tiago Pimentel | Brian Roark | Ryan Cotterell

In this work, we propose the use of phone-level language models to estimate phonotactic complexity—measured in bits per phoneme—which makes cross-linguistic comparison straightforward. We compare the entropy across languages using this simple measure, gaining insight on how complex different language’s phonotactics are. Finally, we show a very strong negative correlation between phonotactic complexity and the average length of words—Spearman rho=-0.744—when analysing a collection of 106 languages with 1016 basic concepts each.

Implementing a Multi-lingual Chatbot for Positive Reinforcement in Young Learners
Francisca Oladipo | Abdulmalik Rufai

This is a humanitarian work –a counter-terrorism effort. The presentation describes the experiences of developing a multi-lingua, interactive chatbot trained on the corpus of two Nigerian Languages (Hausa and Fulfude), with simultaneous translation to a third (Kanuri), to stimulate conversations, deliver tailored contents to the users thereby aiding in the detection of the probability and degree of radicalization in young learners through data analysis of the games moves and vocabularies. As chatbots have the ability to simulate a human conversation based on rhetorical behavior, the system is able to learn the need of individual user through constant interaction and deliver tailored contents that promote good behavior in Hausa, Fulfulde and Kanuri languages.

A Deep Learning Approach to Language-independent Gender Prediction on Twitter
Reyhaneh Hashempour

This work presents a set of experiments conducted to predict the gender of Twitter users based on language-independent features extracted from the text of the users’ tweets. The experiments were performed on a version of TwiSty dataset including tweets written by the users of six different languages: Portuguese, French, Dutch, English, German, and Italian. Logistic regression (LR), and feed-forward neural networks (FFNN) with back-propagation were used to build models in two different settings: Inter-Lingual (IL) and Cross-Lingual (CL). In the IL setting, the training and testing were performed on the same language whereas in the CL, Italian and German datasets were set aside and only used as test sets and the rest were combined to compose training and development sets. In the IL, the highest accuracy score belongs to LR whereas, in the CL, FFNN with three hidden layers yields the highest score. The results show that neural network based models underperform traditional models when the size of the training set is small; however, they beat traditional models by a non-trivial margin, when they are fed with large enough data. Finally, the feature analysis confirms that men and women have different writing styles independent of their language.

Isolating the Effects of Modeling Recursive Structures: A Case Study in Pronunciation Prediction of Chinese Characters
Minh Nguyen | Gia H Ngo | Nancy Chen

Finding that explicitly modeling structures leads to better generalization, we consider the task of predicting Cantonese pronunciations of logographs (Chinese characters) using logographs’ recursive structures. This task is a suitable case study for two reasons. First, logographs’ pronunciations depend on structures (i.e. the hierarchies of sub-units in logographs) Second, the quality of logographic structures is consistent since the structures are constructed automatically using a set of rules. Thus, this task is less affected by confounds such as varying quality between annotators. Empirical results show that modeling structures explicitly using treeLSTM outperforms LSTM baseline, reducing prediction error by 6.0% relative.

Benchmarking Neural Machine Translation for Southern African Languages
Jade Abbott | Laura Martinus

Unlike major Western languages, most African languages are very low-resourced. Furthermore, the resources that do exist are often scattered and difficult to obtain and discover. As a result, the data and code for existing research has rarely been shared, meaning researchers struggle to reproduce reported results, and almost no publicly available benchmarks or leaderboards for African machine translation models exist. To start to address these problems, we trained neural machine translation models for a subset of Southern African languages on publicly-available datasets. We provide the code for training the models and evaluate the models on a newly released evaluation set, with the aim of starting a leaderboard for Southern African languages and spur future research in the field.

OCR Quality and NLP Preprocessing
Margot Mieskes | Stefan Schmunk

We present initial experiments to evaluate the performance of tasks such as Part of Speech Tagging on data corrupted by Optical Character Recognition (OCR). Our results, based on English and German data, using artificial experiments as well as initial real OCRed data indicate that already a small drop in OCR quality considerably increases the error rates, which would have a significant impact on subsequent processing steps.

Developing a Fine-grained Corpus for a Less-resourced Language: the case of Kurdish
Roshna Abdulrahman | Hossein Hassani | Sina Ahmadi

Kurdish is a less-resourced language consisting of different dialects written in various scripts. Approximately 30 million people in different countries speak the language. The lack of corpora is one of the main obstacles in Kurdish language processing. In this paper, we present KTC-the Kurdish Textbooks Corpus, which is composed of 31 K-12 textbooks in Sorani dialect. The corpus is normalized and categorized into 12 educational subjects containing 693,800 tokens (110,297 types). Our resource is publicly available for non-commercial use under the CC BY-NC-SA 4.0 license.

Amharic Question Answering for Biography, Definition, and Description Questions
Tilahun Abedissa Taffa | Mulugeta Libsie

A broad range of information needs can often be stated as a question. Question Answering (QA) systems attempt to provide users concise answer(s) to natural language questions. The existing Amharic QA systems handle fact-based questions that usually take named entities as an answer. To deal with more complex information needs we developed an Amharic non-factoid QA for biography, definition, and description questions. A hybrid approach has been used for the question classification. For document filtering and answer extraction we have used lexical patterns. On the other hand to answer biography questions we have used a summarizer and the generated summary is validated using a text classifier. Our QA system is evaluated and has shown a promising result.

Polysemous Language in Child Directed Speech
Sammy Floyd | Libby Barak | Adele Goldberg | Casey Lew-Williams

Polysemous Language in Child Directed Speech Learning the meaning of words is one of the fundamental building blocks of verbal communication. Models of child language acquisition have generally made the simplifying assumption that each word appears in child-directed speech with a single meaning. To understand naturalistic word learning during childhood, it is essential to know whether children hear input that is in fact constrained to single meaning per word, or whether the environment naturally contains multiple senses.In this study, we use a topic modeling approach to automatically induce word senses from child-directed speech. Our results confirm the plausibility of our automated analysis approach and reveal an increasing rate of using multiple senses in child-directed speech, starting with corpora from children as early as the first year of life.

Principled Frameworks for Evaluating Ethics in NLP Systems
Shrimai Prabhumoye | Elijah Mayfield | Alan W Black

We critique recent work on ethics in natural language processing. Those discussions have focused on data collection, experimental design, and interventions in modeling. But we argue that we ought to first understand the frameworks of ethics that are being used to evaluate the fairness and justice of algorithmic systems. Here, we begin that discussion by outlining deontological and consequentialist ethics, and make predictions on the research agenda prioritized by each.

Understanding the Shades of Sexism in Popular TV Series
Nayeon Lee | Yejin Bang | Jamin Shin | Pascale Fung

[Multiple-submission] In the midst of a generation widely exposed to and influenced by media entertainment, the NLP research community has shown relatively little attention on the sexist comments in popular TV series. To understand sexism in TV series, we propose a way of collecting distant supervision dataset using Character Persona information with the psychological theories on sexism. We assume that sexist characters from TV shows are more prone to making sexist comments when talking about women, and show that this hypothesis is valid through experiment. Finally, we conduct an interesting analysis on popular TV show characters and successfully identify different shades of sexism that is often overlooked.

Evaluating Ways of Adapting Word Similarity
Libby Barak | Adele Goldberg

People judge pairwise similarity by deciding which aspects of the words’ meanings are relevant for the comparison of the given pair. However, computational representations of meaning rely on dimensions of the vector representation for similarity comparisons, without considering the specific pairing at hand. Prior work has adapted computational similarity judgments by using the softmax function in order to address this limitation by capturing asymmetry in human judgments. We extend this analysis by showing that a simple modification of cosine similarity offers a better correlation with human judgments over a comprehensive dataset. The modification performs best when the similarity between two words is calculated with reference to other words that are most similar and dissimilar to the pair.

Exploring the Use of Lexicons to aid Deep Learning towards the Detection of Abusive Language
Anna Koufakou | Jason Scott

Detecting abusive language is a significant research topic, which has received a lot of attention recently. Our work focused on detecting personal attacks in online conversations. State-of-the-art research on this task has largely used deep learning with word embeddings. We explored the use of sentiment lexicons as well as semantic lexicons towards improving the accuracy of the baseline Convolutional Neural Network (CNN) using regular word embeddings. This is a work in progress, limited by time constraints and appropriate infrastructure. Our preliminary results showed promise for utilizing lexicons, especially semantic lexicons, for the task of detecting abusive language.

Entity-level Classification of Adverse Drug Reactions: a Comparison of Neural Network Models
Ilseyar Alimova | Elena Tutubalina

This paper presents our experimental work on exploring the potential of neural network models developed for aspect-based sentiment analysis for entity-level adverse drug reaction (ADR) classification. Our goal is to explore how to represent local context around ADR mentions and learn an entity representation, interacting with its context. We conducted extensive experiments on various sources of text-based information, including social media, electronic health records, and abstracts of scientific articles from PubMed. The results show that Interactive Attention Neural Network (IAN) outperformed other models on four corpora in terms of macro F-measure. This work is an abridged version of our recent paper accepted to Programming and Computer Software journal in 2019.

Context Effects on Human Judgments of Similarity
Libby Barak | Noe Kong-Johnson | Adele Goldberg

The semantic similarity of words forms the basis of many natural language processing methods. These computational similarity measures are often based on a mathematical comparison of vector representations of word meanings, while human judgments of similarity differ in lacking geometrical properties, e.g., symmetric similarity and triangular similarity. In this study, we propose a novel task design to further explore human behavior by asking whether a pair of words is deemed more similar depending on an immediately preceding judgment. Results from a crowdsourcing experiment show that people consistently judge words as more similar when primed by a judgment that evokes a relevant relationship. Our analysis further shows that word2vec similarity correlated significantly better with the out-of-context judgments, thus confirming the methodological differences in human-computer judgments, and offering a new testbed for probing the differences.

NLP Automation to Read Radiological Reports to Detect the Stage of Cancer Among Lung Cancer Patients
Khushbu Gupta | Ratchainant Thammasudjarit | Ammarin Thakkinstian

A common challenge in the healthcare industry today is physicians have access to massive amounts of healthcare data but have little time and no appropriate tools. For instance, the risk prediction model generated by logistic regression could predict the probability of diseases occurrence and thus prioritizing patients’ waiting list for further investigations. However, many medical reports available in current clinical practice system are not yet ready for analysis using either statistics or machine learning as they are in unstructured text format. The complexity of medical information makes the annotation or validation of data very challenging and thus acts as a bottleneck to apply machine learning techniques in medical data. This study is therefore conducted to create such annotations automatically where the computer can read radiological reports for oncologists and mark the staging of lung cancer. This staging information is obtained using the rule-based method implemented using the standards of Tumor Node Metastasis (TNM) staging along with deep learning technology called Long Short Term Memory (LSTM) to extract clinical information from the Computed Tomography (CT) text report. The empirical experiment shows promising results being the accuracy of up to 85%.

Augmenting Named Entity Recognition with Commonsense Knowledge
Gaith Dekhili | Tan Ngoc Le | Fatiha Sadat

Commonsense can be vital in some applications like Natural Language Understanding (NLU), where it is often required to resolve ambiguity arising from implicit knowledge and underspecification. In spite of the remarkable success of neural network approaches on a variety of Natural Language Processing tasks, many of them struggle to react effectively in cases that require commonsense knowledge. In the present research, we take advantage of the availability of the open multilingual knowledge graph ConceptNet, by using it as an additional external resource in Named Entity Recognition (NER). Our proposed architecture involves BiLSTM layers combined with a CRF layer that was augmented with some features such as pre-trained word embedding layers and dropout layers. Moreover, apart from using word representations, we used also character-based representation to capture the morphological and the orthographic information. Our experiments and evaluations showed an improvement in the overall performance with +2.86 in the F1-measure. Commonsense reasonnig has been employed in other studies and NLP tasks but to the best of our knowledge, there is no study relating the integration of a commonsense knowledge base in NER.

Pardon the Interruption: Automatic Analysis of Gender and Competitive Turn-Taking in United States Supreme Court Hearings
Haley Lepp

The United States Supreme Court plays a key role in defining the legal basis for gender discrimination throughout the country, yet there are few checks on gender bias within the court itself. In conversational turn-taking, interruptions have been documented as a marker of bias between speakers of different genders. The goal of this study is to automatically differentiate between respectful and disrespectful conversational turns taken during official hearings, which could help in detecting bias and finding remediation techniques for discourse in the courtroom. In this paper, I present a corpus of turns annotated by legal professionals, and describe the design of a semi-supervised classifier that will use acoustic and lexical features to analyze turn-taking at scale. On completion of annotations, this classifier will be trained to extract the likelihood that turns are respectful or disrespectful for use in studies of speech trends.

Evaluating Coherence in Dialogue Systems using Entailment
Nouha Dziri | Ehsan Kamalloo | Kory Mathewson | Osmar Zaiane

Evaluating open-domain dialogue systems is difficult due to the diversity of possible correct answers. Automatic metrics such as BLEU correlate weakly with human annotations, resulting in a significant bias across different models and datasets. Some researchers resort to human judgment experimentation for assessing response quality, which is expensive, time consuming, and not scalable. Moreover, judges tend to evaluate a small number of dialogues, meaning that minor differences in evaluation configuration may lead to dissimilar results. In this paper, we present interpretable metrics for evaluating topic coherence by making use of distributed sentence representations. Furthermore, we introduce calculable approximations of human judgment based on conversational coherence by adopting state-of-the-art entailment techniques. Results show that our metrics can be used as a surrogate for human judgment, making it easy to evaluate dialogue systems on large-scale datasets and allowing an unbiased estimate for the quality of the responses. This paper has been accepted in NAACL 2019.

Exploiting machine algorithms in vocalic quantification of African English corpora
Lasisi Adeiza Isiaka

Towards procedural fidelity in the processing of African English speech corpora, this work demonstrates how the adaptation of machine-assisted segmentation of phonemes and automatic extraction of acoustic values can significantly speed up the processing of naturalistic data and make the vocalic analysis of the varieties less impressionistic. Research in African English phonology has, till date, been least data-driven – much less the use of comparative corpora for cross-varietal assessments. Using over 30 hours of naturalistic data (from 28 speakers in 5 Nigerian cities), the procedures for segmenting audio files into phonemic units via the Munich Automatic Segmentation System (MAUS), and the extraction of their spectral values in Praat are explained. Evidence from the speech corpora supports a more complex vocalic inventory than attested in previous auditory/manual-based accounts – thus reinforcing the resourcefulness of the algorithms for the current data and cognate varieties. Keywords: machine algorithms; naturalistic data; African English phonology; vowel segmentation

Assessing the Ability of Neural Machine Translation Models to Perform Syntactic Rewriting
Jahkel Robin | Alvin Grissom II | Matthew Roselli

We describe work in progress for evaluating performance of sequence-to-sequence neural networks on the task of syntax-based reordering for rules applicable to simultaneous machine translation. We train models that attempt to rewrite English sentences using rules that are commonly used by human interpreters. We examine the performance of these models to determine which forms of rewriting are more difficult for them to learn and which architectures are the best at learning them.

Authorship Recognition with Short-Text using Graph-based Techniques
Laura Cruz

In recent years, studies of authorship recognition has aroused great interest in graph-based analysis. Modeling the writing style of each author using a network of co-occurrence words. However, short texts can generate some changes in the topology of network that cause impact on techniques of feature extraction based on graph topology. In this work, we evaluate the robustness of global-strategy and local-strategy based on complex network measurements comparing with graph2vec a graph embedding technique based on skip-gram model. The experiment consists of evaluating how each modification in the length of text affects the accuracy of authorship recognition on both techniques using cross-validation and machine learning techniques.

A Parallel Corpus Mixtec-Spanish
Cynthia Montaño | Gerardo Sierra Martínez | Gemma Bel-Enguix | Helena Gomez

This work is about the compilation process of parallel documents Spanish-Mixtec. There are not many Spanish-Mixec parallel texts and most of the sources are non-digital books. Due to this, we need to face the errors when digitizing the sources and difficulties in sentence alignment, as well as the fact that does not exist a standard orthography. Our parallel corpus consists of sixty texts coming from books and digital repositories. These documents belong to different domains: history, traditional stories, didactic material, recipes, ethnographical descriptions of each town and instruction manuals for disease prevention. We have classified this material in five major categories: didactic (6 texts), educative (6 texts), interpretative (7 texts), narrative (39 texts), and poetic (2 texts). The final total of tokens is 49,814 Spanish words and 47,774 Mixtec words. The texts belong to the states of Oaxaca (48 texts), Guerrero (9 texts) and Puebla (3 texts). According to this data, we see that the corpus is unbalanced in what refers to the representation of the different territories. While 55% of speakers are in Oaxaca, 80% of texts come from this region. Guerrero has the 30% of speakers and the 15% of texts and Puebla, with the 15% of the speakers has a representation of the 5% in the corpus.

Emoji Usage Across Platforms: A Case Study for the Charlottesville Event
Khyati Mahajan | Samira Shaikh

We study emoji usage patterns across two social media platforms, one of them considered a fringe community called Gab, and the other Twitter. We find that Gab tends to comparatively use more emotionally charged emoji, but also seems more apathetic towards the violence during the event, while Twitter takes a more empathetic approach to the event.

Reading KITTY: Pitch Range as an Indicator of Reading Skill
Alfredo Gomez | Alicia Ngo | Alessandra Otondo | Julie Medero

While affective outcomes are generally positive for the use of eBooks and computer-based reading tutors in teaching children to read, learning outcomes are often poorer (Korat and Shamir, 2004). We describe the first iteration of Reading Kitty, an iOS application that uses NLP and speech processing to focus children’s time on close reading and prosody in oral reading, while maintaining an emphasis on creativity and artifact creation. We also share preliminary results demonstrating that pitch range can be used to automatically predict readers’ skill level.

Adversarial Attack on Sentiment Classification
Yi-Ting (Alicia) Tsai | Min-Chu Yang | Han-Yu Chen

In this paper, we propose a white-box attack algorithm called “Global Search” method and compare it with a simple misspelling noise and a more sophisticated and common white-box attack approach called “Greedy Search”. The attack methods are evaluated on the Convolutional Neural Network (CNN) sentiment classifier trained on the IMDB movie review dataset. The attack success rate is used to evaluate the effectiveness of the attack methods and the perplexity of the sentences is used to measure the degree of distortion of the generated adversarial examples. The experiment results show that the proposed “Global Search” method generates more powerful adversarial examples with less distortion or less modification to the source text.

CSI Peru News: finding the culprit, victim and location in news articles
Gina Bustamante | Arturo Oncevay

We introduce a shift on the DS method over the domain of crime-related news from Peru, attempting to find the culprit, victim and location of a crime description from a RE perspective. Obtained results are highly promising and show that proposed modifications are effective in non-traditional domains.

Exploring Social Bias in Chatbots using Stereotype Knowledge
Nayeon Lee | Andrea Madotto | Pascale Fung

Exploring social bias in chatbot is an important, yet relatively unexplored problem. In this paper, we propose an approach to understand social bias in chatbots by leveraging stereotype knowledge. It allows interesting comparison of bias between chatbots and humans, and provides intuitive analysis of existing chatbots by borrowing the finer-grain concepts of sexism and racism.

Cross-Sentence Transformations in Text Simplification
Fernando Alva-Manchego | Carolina Scarton | Lucia Specia

Current approaches to Text Simplification focus on simplifying sentences individually. However, certain simplification transformations span beyond single sentences (e.g. joining and re-ordering sentences). In this paper, we motivate the need for modelling the simplification task at the document level, and assess the performance of sequence-to-sequence neural models in this setup. We analyse parallel original-simplified documents created by professional editors and show that there are frequent rewriting transformations that are not restricted to sentence boundaries. We also propose strategies to automatically evaluate the performance of a simplification model on these cross-sentence transformations. Our experiments show the inability of standard sequence-to-sequence neural models to learn these transformations, and suggest directions towards document-level simplification.