Louise Guthrie

Also published as: L. Guthrie

2022

We present the BeSt corpus, which records cognitive state: who believes what (i.e., factuality), and who has what sentiment towards what. This corpus is inspired by similar source-and-target corpora, specifically MPQA and FactBank. The corpus comprises two genres, newswire and discussion forums, in three languages, Chinese (Mandarin), English, and Spanish. The corpus is distributed through the LDC.

2015

2012

LIE: Leadership, Influence and Expertise
Roberta Catizone | Louise Guthrie | Arthur Thomas | Yorick Wilks
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper describes our research into methods for inferring social and instrumental roles and relationships from document and discourse corpora. The goal is to identify the roles of initial authors and participants in internet discussions with respect to leadership, influence and expertise. Web documents, forums and blogs provide data from which the relationships between these concepts are empirically derived and compared. Using techniques from Natural Language Processing (NLP), characterizations of authority and expertise are hypothesized and then tested to see whether they pick out the same discourse participants as techniques based on social network analysis (Huffaker 2010) for any given level of these qualities (i.e. leadership, expertise and influence). Our methods could be applied, in principle, to any domain topic, but this paper describes an initial investigation into two subject areas where a range of differing opinions is available and which differ in the nature of their appeals to authority and truth: 'genetic engineering' and a 'Muslim Forum'. The available online corpora for these topics contain discussions from a variety of users with different levels of expertise, backgrounds and personalities.

2010

Evaluation Metrics for the Lexical Substitution Task
Sanaz Jabbari | Mark Hepple | Louise Guthrie
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Evaluating Lexical Substitution: Analysis and New Measures
Sanaz Jabbari | Mark Hepple | Louise Guthrie
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Lexical substitution is the task of finding a replacement for a target word in a sentence so as to preserve, as closely as possible, the meaning of the original sentence. It has been proposed that lexical substitution be used as a basis for assessing the performance of word sense disambiguation systems, an idea realised in the English Lexical Substitution Task of SemEval-2007. In this paper, we examine the evaluation metrics used for the English Lexical Substitution Task and identify some problems that arise for them. We go on to propose some alternative measures for this purpose, that avoid these problems, and which in turn can be seen as redefining the key tasks that lexical substitution systems should be expected to perform. We hope that these new metrics will better serve to guide the development of lexical substitution systems in future work. One of the new metrics addresses how effective systems are in ranking substitution candidates, a key ability for lexical substitution systems, and we report some results concerning the assessment of systems produced by this measure as compared to the relevant measure from SemEval-2007.

2008

An Unsupervised Probabilistic Approach for the Detection of Outliers in Corpora
David Guthrie | Louise Guthrie | Yorick Wilks
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Many applications of computational linguistics are greatly influenced by the quality of corpora available and as automatically generated corpora continue to play an increasingly common role, it is essential that we not overlook the importance of well-constructed and homogeneous corpora. This paper describes an automatic approach to improving the homogeneity of corpora using an unsupervised method of statistical outlier detection to find documents and segments that do not belong in a corpus. We consider collections of corpora that are homogeneous with respect to topic (i.e. about the same subject), or genre (written for the same audience or from the same source) and use a combination of stylistic and lexical features of the texts to automatically identify pieces of text in these collections that break the homogeneity. These pieces of text that are significantly different from the rest of the corpus are likely to be errors that are out of place and should be removed from the corpus before it is used for other tasks. We evaluate our techniques by running extensive experiments over large artificially constructed corpora that each contain single pieces of text from a different topic, author, or genre than the rest of the collection and measure the accuracy of identifying these pieces of text without the use of training data. We show that when these pieces of text are reasonably large (1,000 words) we can reliably identify them in a corpus.

Authorship Attribution of E-Mail: Comparing Classifiers over a New Corpus for Evaluation
Ben Allison | Louise Guthrie
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The release of the Enron corpus provided a unique resource for studying aspects of email use, because it is largely unfiltered, and therefore presents a relatively complete collection of emails for a reasonably large number of correspondents. This paper describes a newly created subcorpus of the Enron emails which we suggest can be used to test techniques for authorship attribution, and further shows the application of three different classification methods to this task to present baseline results. Two of the classifiers used are standard, and have been shown to perform well in the literature, and one of the classifiers is novel and based on concurrent work that proposes a Bayesian hierarchical distribution for word counts in documents. For each of the classifiers, we present results using six text representations, including use of linguistic structures derived from a parser as well as lexical information.

Unsupervised Learning-based Anomalous Arabic Text Detection
Nasser Abouzakhar | Ben Allison | Louise Guthrie
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The growing dependence of modern society on the Web as a vital source of information and communication has become inevitable. However, the Web has also become an ideal channel for various terrorist organisations to publish misleading information and to send unintelligible messages to communicate with their clients. The increase in the amount of anomalous, misleading information published on the Web has led to an increase in security threats. The existing Web security mechanisms and protocols are not appropriately designed to deal with such recently developed problems. Developing technology to detect anomalous textual information has become one of the major challenges within the NLP community. This paper introduces the problem of anomalous text detection by automatically extracting linguistic features from documents and evaluating those features for patterns of suspicious and/or inconsistent information in Arabic documents. To achieve that, we defined specific linguistic features that characterise various Arabic writing styles. The paper also introduces the main challenges in Arabic processing and describes the proposed unsupervised learning model for detecting anomalous Arabic textual information.

Professor or Screaming Beast? Detecting Anomalous Words in Chinese
Wei Liu | Ben Allison | Louise Guthrie
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The Internet has become the most popular platform for communication. However, because most modern computer keyboards are Latin-based, the characters (Hanzi) of Asian languages such as Chinese cannot be input directly with these keyboards. As a result, methods for representing Chinese characters using the Latin alphabet were introduced. The most popular method among these is the Pinyin input system. Pinyin is also called Romanised Chinese in that it phonetically resembles a Chinese character. Due to the highly ambiguous mapping from Pinyin to Chinese characters, word misuses can occur when using a standard computer keyboard, and more commonly so in internet chat-rooms or instant messengers where the language used is less formal. In this paper we aim to develop a system that can automatically identify such anomalies, whether they are simple typos or intentional. After identifying them, the system should suggest the correct word to be used.

Using a Probabilistic Model of Context to Detect Word Obfuscation
Sanaz Jabbari | Ben Allison | Louise Guthrie
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper proposes a distributional model of word use and word meaning which is derived purely from a body of text, and then applies this model to determine whether certain words are used in or out of context. We suggest that we can view the contexts of words as multinomially distributed random variables. We illustrate how using this basic idea, we can formulate the problem of detecting whether or not a word is used in context as a likelihood ratio test. We also define a measure of semantic relatedness between a word and its context using the same model. We assume that words that typically appear together are related, and thus have similar probability distributions and that words used in an unusual way will have probability distributions which are dissimilar from those of their surrounding context. The relatedness of a word to its context is based on Kullback-Leibler divergence between probability distributions assigned to the constituent words in the given sentence. We employed our methods on a defense-oriented application where certain words are substituted with other words in an intercepted communication.
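
As a rough illustration of the relatedness idea in the abstract above, the following sketch compares the context distributions of two words with Kullback-Leibler divergence. It is not the paper's model: the toy corpus, the sentence-level co-occurrence counts, the add-one smoothing and the function names `context_distribution` and `kl_divergence` are all assumptions made for this example, and the likelihood ratio test is not shown.

```python
import math
from collections import Counter


def context_distribution(word, sentences, vocab, alpha=1.0):
    """Smoothed distribution over words that share a sentence with `word`.

    Assumption for illustration: contexts are whole sentences and
    probabilities use add-alpha smoothing so no entry is zero.
    """
    counts = Counter()
    for sent in sentences:
        if word in sent:
            counts.update(t for t in sent if t != word)
    total = sum(counts[v] for v in vocab) + alpha * len(vocab)
    return {v: (counts[v] + alpha) / total for v in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[v] * math.log(p[v] / q[v]) for v in p)


# Hypothetical toy corpus, tokenised by whitespace.
sentences = [
    "the bank approved the loan".split(),
    "the bank raised the interest rate".split(),
    "the loan has a high interest rate".split(),
    "the dog chased the cat".split(),
    "the cat chased the mouse".split(),
]
vocab = sorted({t for s in sentences for t in s})

p_loan = context_distribution("loan", sentences, vocab)
p_bank = context_distribution("bank", sentences, vocab)
p_cat = context_distribution("cat", sentences, vocab)

# A lower divergence suggests the two words are used in more similar contexts.
print(kl_divergence(p_loan, p_bank), kl_divergence(p_loan, p_cat))
```

On this toy data, "loan" diverges less from "bank" than from "cat", which matches the intuition that a word substituted from an unrelated domain would stand out from its surrounding context.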

2006

A Closer Look at Skip-gram Modelling
David Guthrie | Ben Allison | Wei Liu | Louise Guthrie | Yorick Wilks
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Data sparsity is a large problem in natural language processing that refers to the fact that language is a system of rare events, so varied and complex that, even using an extremely large corpus, we can never accurately model all possible strings of words. This paper examines the use of skip-grams (a technique whereby n-grams are still stored to model language, but they allow for tokens to be skipped) to overcome the data sparsity problem. We analyze this by computing all possible skip-grams in a training corpus and measuring how many adjacent (standard) n-grams these cover in test documents. We examine skip-gram modelling using one to four skips with various amounts of training data and test against similar documents as well as documents generated from a machine translation system. In this paper we also determine the amount of extra training data required to achieve skip-gram coverage using standard adjacent tri-grams.
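
The skip-gram idea in the abstract above can be sketched in a few lines. This is a minimal illustration, assuming that "k skips" means at most k tokens skipped in total within one n-gram; the function name `skip_grams` and the example sentence are chosen for this sketch and are not taken from the paper's code.

```python
from itertools import combinations


def skip_grams(tokens, n, k):
    """Return every n-gram that skips at most k tokens in total.

    With k = 0 this reduces to ordinary adjacent n-grams.
    """
    grams = []
    for start in range(len(tokens) - n + 1):
        # The remaining n - 1 tokens must fall inside a window of
        # n + k positions that begins at `start`.
        window_end = min(start + n + k, len(tokens))
        for rest in combinations(range(start + 1, window_end), n - 1):
            grams.append(tuple(tokens[i] for i in (start, *rest)))
    return grams


tokens = "insurgents killed in ongoing fighting".split()
bigrams = skip_grams(tokens, n=2, k=0)       # adjacent bi-grams only
skip_bigrams = skip_grams(tokens, n=2, k=2)  # bi-grams with up to 2 skips
print(len(bigrams), len(skip_bigrams))
```

The adjacent bi-grams are a subset of the 2-skip bi-grams, which is why skip-grams collected from training text can cover more of the n-grams seen in test documents than adjacent n-grams alone.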

Towards the Orwellian Nightmare: Separation of Business and Personal Emails
Sanaz Jabbari | Ben Allison | David Guthrie | Louise Guthrie
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

2004

2001

1996

A Simple Probabilistic Approach to Classification and Routing
Louise Guthrie | James Leistensnider
TIPSTER TEXT PROGRAM PHASE II: Proceedings of a Workshop held at Vienna, Virginia, May 6-8, 1996

Integration of Document Detection and Information Extraction
Louise Guthrie | Tomek Strzalkowski | Jin Wang | Fang Lin
TIPSTER TEXT PROGRAM PHASE II: Proceedings of a Workshop held at Vienna, Virginia, May 6-8, 1996

1995

1994

The Consortium for Lexical Research
Louise Guthrie
Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994

Document Classification by Machine: Theory and Practice
Louise Guthrie | Elbert Walker
COLING 1994 Volume 2: The 15th International Conference on Computational Linguistics

1993
