Louise Guthrie

Also published as: L. Guthrie


2022

We present the BeSt corpus, which records cognitive state: who believes what (i.e., factuality), and who has what sentiment towards what. This corpus is inspired by similar source-and-target corpora, specifically MPQA and FactBank. The corpus comprises two genres, newswire and discussion forums, in three languages, Chinese (Mandarin), English, and Spanish. The corpus is distributed through the LDC.
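To make the source-and-target structure concrete, the sketch below shows one way such an annotation could be represented as a record; the class and field names are illustrative assumptions, not the corpus's actual annotation schema.

```python
# Illustrative sketch only: the class and field names below are assumptions,
# not the actual BeSt annotation schema.
from dataclasses import dataclass
from typing import Literal

@dataclass
class CognitiveStateAnnotation:
    source: str                           # who holds the belief or sentiment
    target: str                           # what the belief or sentiment is about
    kind: Literal["belief", "sentiment"]  # which cognitive state is recorded
    value: str                            # e.g. a factuality label or a sentiment polarity
    language: Literal["zh", "en", "es"]   # the three corpus languages

example = CognitiveStateAnnotation(
    source="the reporter",
    target="the merger will go ahead",
    kind="belief",
    value="committed belief",
    language="en",
)
print(example)
```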

2012

This paper describes our research into methods for inferring social and instrumental roles and relationships from document and discourse corpora. The goal is to identify the roles of initial authors and participants in internet discussions with respect to leadership, influence and expertise. Web documents, forums and blogs provide data from which the relationships between these concepts are empirically derived and compared. Using techniques from Natural Language Processing (NLP), characterizations of authority and expertise are hypothesized and then tested to see whether, for any given level of these qualities (i.e. leadership, expertise and influence), they pick out the same or different discourse participants as those chosen by techniques based on social network analysis (Huffaker 2010). Our methods could be applied, in principle, to any domain topic, but this paper describes an initial investigation into two subject areas where a range of differing opinions is available and which differ in the nature of their appeals to authority and truth: ‘genetic engineering’ and a ‘Muslim Forum’. The available online corpora for these topics contain discussions from a variety of users with different levels of expertise, backgrounds and personalities.

2010

Lexical substitution is the task of finding a replacement for a target word in a sentence so as to preserve, as closely as possible, the meaning of the original sentence. It has been proposed that lexical substitution be used as a basis for assessing the performance of word sense disambiguation systems, an idea realised in the English Lexical Substitution Task of SemEval-2007. In this paper, we examine the evaluation metrics used for the English Lexical Substitution Task and identify some problems that arise for them. We go on to propose some alternative measures for this purpose that avoid these problems and that can in turn be seen as redefining the key tasks lexical substitution systems should be expected to perform. We hope that these new metrics will better serve to guide the development of lexical substitution systems in future work. One of the new metrics addresses how effective systems are at ranking substitution candidates, a key ability for lexical substitution systems, and we report some results comparing the system assessments produced by this measure with those produced by the relevant measure from SemEval-2007.
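As an illustration of what a ranking-oriented evaluation looks like (not the specific metric proposed in this paper), the sketch below rank-correlates a system's ordering of substitution candidates against a gold ordering derived from annotator counts; the use of Spearman's rho and the toy gold data are assumptions made for the example.

```python
# Illustrative only: a simple rank-based score for candidate ranking in
# lexical substitution. The gold counts and the use of Spearman's rho are
# assumptions for this example, not the metric defined in the paper.

def spearman_rho(system_ranking, gold_weights):
    """Rank-correlate a system's candidate ordering with gold annotator counts."""
    # Keep only candidates that appear in the gold data, preserving system order.
    common = [w for w in system_ranking if w in gold_weights]
    n = len(common)
    if n < 2:
        return 0.0
    sys_rank = {w: i for i, w in enumerate(common)}
    gold_order = sorted(common, key=gold_weights.get, reverse=True)
    gold_rank = {w: i for i, w in enumerate(gold_order)}
    d_sq = sum((sys_rank[w] - gold_rank[w]) ** 2 for w in common)
    return 1.0 - (6.0 * d_sq) / (n * (n ** 2 - 1))

# Target word "bright" in "he is a bright pupil"; the counts are how many
# annotators proposed each substitute (invented numbers).
gold = {"clever": 4, "smart": 3, "brilliant": 2, "shiny": 0}
system = ["smart", "clever", "brilliant", "shiny"]
print(spearman_rho(system, gold))  # 0.8
```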

2008

Many applications of computational linguistics are greatly influenced by the quality of the corpora available, and as automatically generated corpora continue to play an increasingly common role, it is essential that we not overlook the importance of well-constructed and homogeneous corpora. This paper describes an automatic approach to improving the homogeneity of corpora using an unsupervised method of statistical outlier detection to find documents and segments that do not belong in a corpus. We consider collections of corpora that are homogeneous with respect to topic (i.e. about the same subject) or genre (written for the same audience or from the same source) and use a combination of stylistic and lexical features of the texts to automatically identify pieces of text in these collections that break the homogeneity. These pieces of text that are significantly different from the rest of the corpus are likely to be errors that are out of place and should be removed from the corpus before it is used for other tasks. We evaluate our techniques by running extensive experiments over large artificially constructed corpora that each contain single pieces of text from a different topic, author, or genre than the rest of the collection, and we measure the accuracy of identifying these pieces of text without the use of training data. We show that when these pieces of text are reasonably large (1,000 words) we can reliably identify them in a corpus.
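A minimal sketch of the general idea of unsupervised outlier detection over per-document feature vectors is given below; the specific features (average sentence length, type-token ratio, function-word rate) and the z-score threshold are assumptions for illustration, not the feature set or statistical test used in the paper.

```python
# Illustrative sketch: flag documents whose stylistic/lexical feature vectors
# deviate strongly from the rest of the collection. The features and the
# z-score threshold are assumptions, not the paper's method.
import statistics

FUNCTION_WORDS = {"the", "a", "of", "and", "to", "in", "is", "that", "it", "for"}

def features(text):
    words = text.lower().split()
    sentences = [s for s in text.split(".") if s.strip()]
    return (
        len(words) / max(len(sentences), 1),                           # average sentence length
        len(set(words)) / max(len(words), 1),                          # type-token ratio
        sum(w in FUNCTION_WORDS for w in words) / max(len(words), 1),  # function-word rate
    )

def find_outliers(documents, threshold=2.5):
    """Return indices of documents that break homogeneity on any feature."""
    vectors = [features(d) for d in documents]
    flagged = set()
    for dim in range(len(vectors[0])):
        column = [v[dim] for v in vectors]
        mean, stdev = statistics.mean(column), statistics.pstdev(column)
        if stdev == 0:
            continue
        for i, value in enumerate(column):
            if abs(value - mean) / stdev > threshold:
                flagged.add(i)
    return flagged
```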
The release of the Enron corpus provided a unique resource for studying aspects of email use, because it is largely unfiltered and therefore presents a relatively complete collection of emails for a reasonably large number of correspondents. This paper describes a newly created subcorpus of the Enron emails which we suggest can be used to test techniques for authorship attribution, and further shows the application of three different classification methods to this task to present baseline results. Two of the classifiers used are standard and have been shown to perform well in the literature, and one is novel, based on concurrent work that proposes a Bayesian hierarchical distribution for word counts in documents. For each of the classifiers, we present results using six text representations, including the use of linguistic structures derived from a parser as well as lexical information.
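For context, a baseline of the standard kind mentioned above might look like the sketch below: a bag-of-words classifier (multinomial naive Bayes) over word counts. This is not the paper's novel Bayesian hierarchical model, and the tiny `emails`/`authors` lists merely stand in for the Enron subcorpus.

```python
# Baseline sketch only: a standard bag-of-words classifier for authorship
# attribution, not the paper's Bayesian hierarchical model. The `emails`
# and `authors` lists stand in for the Enron subcorpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "Please find the revised gas contract attached for your review.",
    "Let's grab lunch tomorrow and talk about the offsite.",
]
authors = ["author_a", "author_b"]

model = make_pipeline(CountVectorizer(lowercase=True), MultinomialNB())
model.fit(emails, authors)
print(model.predict(["The attached contract needs your review."]))
# With the real subcorpus, cross-validation over many messages per author
# would give baseline accuracies of the kind reported in the paper.
```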
The growing dependence of modern society on the Web as a vital source of information and communication has become inevitable. However, the Web has also become an ideal channel for various terrorist organisations to publish misleading information and to send unintelligible messages to their clients. The increase in the amount of anomalous, misleading information published on the Web has led to an increase in security threats. The existing Web security mechanisms and protocols are not appropriately designed to deal with such recently developed problems. Developing technology to detect anomalous textual information has become one of the major challenges within the NLP community. This paper introduces the problem of anomalous text detection by automatically extracting linguistic features from documents and evaluating those features for patterns of suspicious and/or inconsistent information in Arabic documents. In order to achieve that, we define specific linguistic features that characterise various Arabic writing styles. The paper also introduces the main challenges in Arabic processing and describes the proposed unsupervised learning model for detecting anomalous Arabic textual information.
The Internet has become the most popular platform for communication. However, because most modern computer keyboards are Latin-based, the characters (Hanzi) of Asian languages such as Chinese cannot be input directly with these keyboards. As a result, methods for representing Chinese characters using Latin alphabets were introduced. The most popular of these is the Pinyin input system. Pinyin is also called “Romanised” Chinese in that it phonetically resembles the Chinese characters it represents. Due to the highly ambiguous mapping from Pinyin to Chinese characters, word misuses can occur when using a standard computer keyboard, and more commonly so in internet chat-rooms or instant messengers where the language used is less formal. In this paper we aim to develop a system that can automatically identify such anomalies, whether they are simple typos or intentional. After identifying them, the system should suggest the correct word to be used.
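A toy sketch of the ambiguity at issue, and of flagging an unlikely character choice, is given below; the pinyin-to-character table, the context frequencies, and the threshold are invented for illustration and are not the data or method of the system described.

```python
# Toy illustration only: the pinyin table, the context frequencies, and the
# threshold below are invented; they are not the paper's data or system.

# One pinyin syllable can map to many Hanzi (here, "ma").
PINYIN_TO_HANZI = {"ma": ["妈", "马", "吗", "骂"]}

# Hypothetical frequencies of each candidate character following "好" ("good/okay").
NEXT_CHAR_FREQ = {"吗": 0.80, "妈": 0.10, "马": 0.08, "骂": 0.02}

def flag_unlikely_choice(chosen_char, pinyin, threshold=0.05):
    """Flag a character that is not a valid reading of the pinyin, or is rare in context."""
    candidates = PINYIN_TO_HANZI.get(pinyin, [])
    if chosen_char not in candidates:
        return True   # not a valid reading at all: almost certainly a typo
    return NEXT_CHAR_FREQ.get(chosen_char, 0.0) < threshold

def suggest(pinyin):
    """Suggest the most likely character for this pinyin in the current context."""
    candidates = PINYIN_TO_HANZI.get(pinyin, [])
    return max(candidates, key=lambda c: NEXT_CHAR_FREQ.get(c, 0.0), default=None)

# "好骂" is a valid reading of "hao ma" but rare after "好"; flag it and suggest "吗".
print(flag_unlikely_choice("骂", "ma"))  # True
print(suggest("ma"))                     # 吗
```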
This paper proposes a distributional model of word use and word meaning which is derived purely from a body of text, and then applies this model to determine whether certain words are used in or out of context. We suggest that we can view the contexts of words as multinomially distributed random variables. We illustrate how, using this basic idea, we can formulate the problem of detecting whether or not a word is used in context as a likelihood ratio test. We also define a measure of semantic relatedness between a word and its context using the same model. We assume that words that typically appear together are related and thus have similar probability distributions, and that words used in an unusual way will have probability distributions that are dissimilar from those of their surrounding context. The relatedness of a word to its context is based on the Kullback-Leibler divergence between the probability distributions assigned to the constituent words in the given sentence. We apply our methods to a defense-oriented application in which certain words have been substituted with other words in an intercepted communication.
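A compact sketch of the quantities described above follows: smoothed multinomial context distributions, the Kullback-Leibler divergence between a word's distribution and that of its sentence context, and (in comments) how a likelihood-ratio test would use them; the toy counts and the add-one smoothing are assumptions for the example.

```python
# Sketch of the measures described above. The toy context counts and the
# add-one smoothing are assumptions for illustration, not the paper's model.
import math
from collections import Counter

def to_distribution(counts, vocab, alpha=1.0):
    """Turn context counts into a smoothed multinomial over a shared vocabulary."""
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts.get(w, 0) + alpha) / total for w in vocab}

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary (smoothing keeps q non-zero)."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

# Toy counts: words typically seen near the target word, and the words of the
# sentence it actually appears in (an intercepted message, say).
vocab = {"bank", "money", "loan", "river", "water", "fish"}
word_context = Counter({"money": 8, "loan": 5, "bank": 3})
sentence_context = Counter({"river": 6, "water": 7, "fish": 4})

p = to_distribution(word_context, vocab)
q = to_distribution(sentence_context, vocab)

# A large divergence, relative to values seen for words used in context,
# suggests the word is out of place; a likelihood-ratio test compares the
# sentence's likelihood under the word's own distribution with its likelihood
# under the distribution of the surrounding context.
print(kl_divergence(p, q))
```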

2006

Data sparsity is a large problem in natural language processing that refers to the fact that language is a system of rare events, so varied and complex that, even using an extremely large corpus, we can never accurately model all possible strings of words. This paper examines the use of skip-grams (a technique whereby n-grams are still used to model language, but in which tokens are allowed to be skipped) to overcome the data sparsity problem. We analyze this by computing all possible skip-grams in a training corpus and measuring how many adjacent (standard) n-grams they cover in test documents. We examine skip-gram modelling using one to four skips with various amounts of training data and test against similar documents as well as documents generated by a machine translation system. In this paper we also determine the amount of extra training data required for standard adjacent tri-grams to achieve the coverage obtained with skip-grams.
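A short sketch of k-skip-n-gram extraction as described above; the function enumerates the n-grams that allow a total of up to k skipped tokens, which is the unit counted when measuring coverage (the corpora and the exact coverage protocol are not reproduced here).

```python
# Sketch of k-skip-n-gram extraction: n-grams that may skip up to k tokens
# in total between their elements. Measuring coverage against the adjacent
# n-grams of test documents is not reproduced here.
from itertools import combinations

def skip_grams(tokens, n=3, k=2):
    """All n-grams over `tokens` whose elements span at most n + k positions."""
    grams = set()
    for start in range(len(tokens)):
        window = tokens[start:start + n + k]
        if len(window) < n:
            continue
        # Anchor each gram at the first token of its window, then choose the
        # remaining n-1 positions from the rest of the window.
        for rest in combinations(range(1, len(window)), n - 1):
            grams.add(tuple(window[i] for i in (0, *rest)))
    return grams

tokens = "insurgents killed in ongoing fighting".split()
print(sorted(skip_grams(tokens, n=2, k=2)))   # the nine 2-skip-bi-grams
# With k=0 the same function yields only the standard adjacent n-grams.
```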
