David Guthrie

2010

Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval
David Guthrie | Mark Hepple
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

pdf bib abs

Efficient Minimal Perfect Hash Language Models
David Guthrie | Mark Hepple | Wei Liu
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The availability of large collections of text have made it possible to build language models that incorporate counts of billions of n-grams. This paper proposes two new methods of efficiently storing large language models that allow O(1) random access and use significantly less space than all known approaches. We introduce two novel data structures that take advantage of the distribution of n-grams in corpora and make use of various numbers of minimal perfect hashes to compactly store language models containing full frequency counts of billions of n-grams using 2.5 Bytes per n-gram and language models of quantized probabilities using 2.26 Bytes per n-gram. These methods allow language processing applications to take advantage of much larger language models than previously was possible using the same hardware and we additionally describe how they can be used in a distributed environment to store even larger models. We show that our approaches are simple to implement and can easily be combined with pruning and quantization to achieve additional reductions in the size of the language model.

2008

pdf bib abs

An Unsupervised Probabilistic Approach for the Detection of Outliers in Corpora
David Guthrie | Louise Guthrie | Yorick Wilks
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Many applications of computational linguistics are greatly influenced by the quality of corpora available and as automatically generated corpora continue to play an increasingly common role, it is essential that we not overlook the importance of well-constructed and homogeneous corpora. This paper describes an automatic approach to improving the homogeneity of corpora using an unsupervised method of statistical outlier detection to find documents and segments that do not belong in a corpus. We consider collections of corpora that are homogeneous with respect to topic (i.e. about the same subject), or genre (written for the same audience or from the same source) and use a combination of stylistic and lexical features of the texts to automatically identify pieces of text in these collections that break the homogeneity. These pieces of text that are significantly different from the rest of the corpus are likely to be errors that are out of place and should be removed from the corpus before it is used for other tasks. We evaluate our techniques by running extensive experiments over large artificially constructed corpora that each contain single pieces of text from a different topic, author, or genre than the rest of the collection and measure the accuracy of identifying these pieces of text without the use of training data. We show that when these pieces of text are reasonably large (1,000 words) we can reliably identify them in a corpus.

2006

pdf bib abs

A Closer Look at Skip-gram Modelling
David Guthrie | Ben Allison | Wei Liu | Louise Guthrie | Yorick Wilks
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Data sparsity is a large problem in natural language processing that refers to the fact that language is a system of rare events, so varied and complex, that even using an extremely large corpus, we can never accurately model all possible strings of words. This paper examines the use of skip-grams (a technique where by n-grams are still stored to model language, but they allow for tokens to be skipped) to overcome the data sparsity problem. We analyze this by computing all possible skip-grams in a training corpus and measure how many adjacent (standard) n-grams these cover in test documents. We examine skip-gram modelling using one to four skips with various amount of training data and test against similar documents as well as documents generated from a machine translation system. In this paper we also determine the amount of extra training data required to achieve skip-gram coverage using standard adjacent tri-grams.

pdf bib

Towards the Orwellian Nightmare: Separation of Business and Personal Emails
Sanaz Jabbari | Ben Allison | David Guthrie | Louise Guthrie
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

David Guthrie

2010

2008

2006

2004

2003

Co-authors

Venues