Allan Ramsay


2018

In this paper we present our contribution to SemEval-2018: a multi-label emotion classifier for Arabic and English tweets. We attempted the “Affect in Tweets” task, specifically Task E-c: Detecting Emotions (multi-label classification). Our method is based on preprocessing the tweets and creating word vectors, combined with a self-correction step to remove noise. We also make use of emotion-specific thresholds. The final submission was the configuration that achieved the best performance across a range of thresholds. Our system was evaluated on the Arabic and English datasets provided by the competition organisers, where it ranked 2nd on the Arabic dataset (out of 14 entries) and 12th on the English dataset (out of 35 entries).
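The idea of emotion-specific thresholds can be illustrated with a minimal sketch. It assumes each tweet already has a per-emotion score in [0, 1]; the function names, the example scores, and the use of F1 as the tuning metric are all assumptions made for illustration, not details taken from the paper.

```python
from typing import Dict, List, Sequence


def f1(gold: Sequence[bool], pred: Sequence[bool]) -> float:
    """F1 score for one emotion over a set of tweets."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(dev_scores: Sequence[float],
                   dev_gold: Sequence[bool],
                   candidates: Sequence[float]) -> float:
    """Pick the candidate threshold with the highest F1 on development data
    (hypothetical tuning criterion; ties go to the earliest candidate)."""
    return max(candidates,
               key=lambda t: f1(dev_gold, [s >= t for s in dev_scores]))


def predict_labels(scores: Dict[str, float],
                   thresholds: Dict[str, float],
                   default: float = 0.5) -> List[str]:
    """Assign every emotion whose score reaches that emotion's own threshold."""
    return [emotion for emotion, score in scores.items()
            if score >= thresholds.get(emotion, default)]


# Illustrative values only.
tuned = best_threshold([0.2, 0.4, 0.6, 0.8], [False, True, True, True],
                       [0.3, 0.5, 0.7])
labels = predict_labels({"anger": 0.62, "joy": 0.30, "sadness": 0.41},
                        {"anger": 0.50, "joy": 0.35, "sadness": 0.40})
```

Tuning one threshold per emotion lets rare emotions be assigned at lower scores than frequent ones, which a single global cut-off cannot do.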

2017

To facilitate cross-lingual studies, there is increasing interest in identifying linguistic universals. Recently, a new universal scheme was designed as part of the Universal Dependency project. In this paper, we map the Arabic tweets dependency treebank (ATDT) to the Universal Dependency (UD) scheme so that it can be compared with other language resources and used in cross-lingual studies.
This paper presents an approach to generating common sense knowledge written as raw English sentences. Instead of relying on public contributors to supply this knowledge, the system draws on expert linguistic decisions by using definitions from English dictionaries. Because dictionary definitions are not written to be transformed into inference rules, some preprocessing steps were taken to turn each word:definition relation into an inference rule of the form left-hand side ⇒ right-hand side. In this paper, we applied this mechanism to two sources: the MacMillan Dictionary and the WordNet definitions. A random set of 200 inference rules was extracted, drawn equally from the two sources, and human judges then decided whether each rule was ‘True’ or not. For the MacMillan Dictionary the precision reaches 0.74 with 0.508 recall, and the WordNet definitions resulted in 0.73 precision with 0.09 recall.
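As a rough illustration of the word:definition to left-hand side ⇒ right-hand side step, the sketch below builds a rule from a single dictionary entry. The `Rule` class, the `definition_to_rule` helper, the cleaning steps, and the example entry are all hypothetical; the preprocessing used in the paper is not specified in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """An inference rule of the form lhs ⇒ rhs."""
    lhs: str
    rhs: str

    def __str__(self) -> str:
        return f"{self.lhs} ⇒ {self.rhs}"


def definition_to_rule(word: str, definition: str) -> Rule:
    """Turn one word:definition pair into a rule.

    Assumed cleaning steps: keep only the first sense (text before ';'),
    drop the trailing full stop, and lowercase both sides.
    """
    first_sense = definition.split(";")[0]
    rhs = first_sense.strip().rstrip(".").strip().lower()
    return Rule(lhs=word.strip().lower(), rhs=rhs)


# Invented dictionary entry, for illustration only.
rule = definition_to_rule("Puppy", "A young dog.")
print(rule)
```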
This paper presents the development of a natural language inference engine that benefits from the two current standard approaches, shallow and deep. The system combines two non-deterministic algorithms: approximate matching, from the shallow approach, and a theorem prover, from the deep approach, for handling multi-step inference tasks. The theorem prover is customized to accept dependency trees and to apply inference rules to these trees. The inference rules are automatically generated as syllogistic rules from our test data (the FraCaS test suite). The theorem prover exploits a non-deterministic matching algorithm within a standard backward chaining inference engine. We employ continuation programming as a way of seamlessly combining these two non-deterministic algorithms. Testing the matching algorithm on the “Generalized quantifiers” and “adjectives” topics in FraCaS (MacCartney and Manning 2007), we achieved an accuracy of 92.8% on the single-premise cases. For multi-step inference, we checked the validity of our syllogistic rules and then extracted four generic instances that can be applied to more than one problem.
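For readers unfamiliar with backward chaining, the following is a deliberately simplified sketch over propositional atoms. It is not the system described above: it does not operate on dependency trees, has no approximate matching, and uses plain recursion rather than continuations. The `prove` function and the example atoms are assumptions made for illustration.

```python
from typing import FrozenSet, List, Sequence, Set, Tuple

# A rule is (head, body): the head holds if every atom in the body holds.
HornRule = Tuple[str, Sequence[str]]


def prove(goal: str,
          facts: Set[str],
          rules: List[HornRule],
          pending: FrozenSet[str] = frozenset()) -> bool:
    """Backward chaining: start from the goal and work back to known facts."""
    if goal in facts:
        return True
    if goal in pending:
        # The goal is already being proved further up; give up on this
        # branch so that cyclic rules cannot cause infinite recursion.
        return False
    for head, body in rules:
        if head == goal and all(
                prove(sub, facts, rules, pending | {goal}) for sub in body):
            return True
    return False


# Illustrative two-step chain: is_man -> is_human -> is_mortal.
facts = {"socrates_is_man"}
rules = [
    ("socrates_is_human", ["socrates_is_man"]),
    ("socrates_is_mortal", ["socrates_is_human"]),
]
result = prove("socrates_is_mortal", facts, rules)
```

The multi-step case corresponds to the prover having to satisfy an intermediate goal (`socrates_is_human` here) before the original goal can be established.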
In this paper, we propose a “bootstrapping” method for constructing a dependency treebank of Arabic tweets. The method uses a rule-based parser to create a small treebank of one thousand Arabic tweets, and then a data-driven parser to create a larger treebank, using the small treebank as a seed training set. In this way we are able to create a dependency treebank from unlabelled tweets without any manual intervention. Experimental results show that this method can improve both the speed of training the parser and the accuracy of the resulting parsers.

2016

Part-of-speech (POS) tagging is a key step in many NLP algorithms. However, tweets are difficult to POS tag because they are short, do not always follow formal grammar and standard spelling, and often use abbreviations to fit within their restricted length. Arabic tweets also exhibit a further range of linguistic phenomena, such as the use of different dialects, romanised Arabic and borrowed foreign words. In this paper, we present an evaluation and a detailed error analysis of state-of-the-art POS taggers for Arabic when applied to Arabic tweets. On the basis of this analysis, we combine normalisation and external knowledge to handle the noisiness of the domain, and exploit bootstrapping to construct extra training data in order to improve POS tagging for Arabic tweets. Our results show significant improvements over the performance of a number of well-known taggers for Arabic.
Stemming, which reduces words to their stems, is an essential processing step in a wide range of high-level text processing applications such as information extraction, machine translation and sentiment analysis. Many stemming algorithms have been developed for Modern Standard Arabic (MSA). Although Arabic tweets and MSA are closely related and share many characteristics, there are substantial differences between them in lexicon and syntax. In this paper, we introduce a light Arabic stemmer for Arabic tweets. Our results show improvements over the performance of a number of well-known stemmers for Arabic.
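Light stemming in general strips a small set of common prefixes and suffixes without attempting full morphological analysis. The sketch below shows that general technique only; the affix lists, the minimum stem length, and the `light_stem` function are assumptions chosen for illustration and are not the stemmer introduced in the paper.

```python
# Hypothetical affix lists, ordered longest first so that the longest
# matching affix is removed.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]


def light_stem(word: str, min_len: int = 3) -> str:
    """Remove at most one prefix and one suffix, never leaving fewer
    than min_len characters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


# "and the book" -> "book"; "the teachers" -> "teacher"
stem_book = light_stem("والكتاب")
stem_teacher = light_stem("المعلمون")
```

The length guard is what keeps short words intact: a three-letter word that happens to begin with one of the listed prefixes is returned unchanged.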

2015

2014

2013

2011

2009

2008

2007

2006

2003

2000

1999

1996

1994

1992

1991

1990

1989

1985