Gregory Grefenstette

2016

pdf abs
Extracting Weighted Language Lexicons from Wikipedia
Gregory Grefenstette
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Language models are used in applications as diverse as speech recognition, optical character recognition and information retrieval. They are used to predict word appearance, and to weight the importance of words in these applications. One basic element of language models is the list of words in a language. Another is the unigram frequency of each word. But this basic information is not available for most languages in the world. Since the multilingual Wikipedia project encourages the production of encyclopedic-like articles in many world languages, we can find there an ever-growing source of text from which to extract these two language modelling elements: word list and frequency. Here we present a simple technique for converting this Wikipedia text into lexicons of weighted unigrams for the more than 280 languages present currently present in Wikipedia. The lexicons produced, and the source code for producing them in a Linux-based system are here made available for free on the Web.

2015

pdf
INRIASAC: Simple Hypernym Extraction Methods
Gregory Grefenstette
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2008

pdf abs
A Conceptual Approach to Web Image Retrieval
Adrian Popescu | Gregory Grefenstette
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

People use the Internet to find a wide variety of images. Existing image search engines do not understand the pictures they return. The introduction of semantic layers in information retrieval frameworks may enhance the quality of the results compared to existing systems. One important challenge in the field is to develop architectures that fit the requirements of real-life applications, like the Internet search engines. In this paper, we describe Olive, an image retrieval application that exploits a large scale conceptual hierarchy (extracted from WordNet) to automatically reformulate user queries, search for associated images and present results in an interactive and structured fashion. When searching a concept in the hierarchy, Olive reformulates the query using its deepest subtypes in WordNet. On the answers page, the system displays a selection of related classes and proposes a content based retrieval functionality among the pictures sharing the same linguistic label. In order to validate our approach, we run to series of tests to assess the performances of the application and report the results here. First, two precision evaluations over a panel of concepts from different domains are realized and second, a user test is designed so as to assess the interaction with the system.

pdf abs
Semi-automatic Building Method for a Multidimensional Affect Dictionary for a New Language
Guillaume Pitel | Gregory Grefenstette
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Detecting the tone or emotive content of a text message is increasingly important in many natural language processing applications. While for the English language there exists a number of affect, emotive, opinion, or affect computer-usable lexicons for automatically processing text, other languages rarely possess these primary resources. Here we present a semi-automatic technique for quickly building a multidimensional affect lexicon for a new language. Most of the work consists of defining 44 paired affect directions (e.g. love-hate, courage-fear, etc.) and choosing a small number of seed words for each dimension. From this initial investment, we show how a first pass affect lexicon can be created for new language, using a SVM classifier trained on a feature space produced from Latent Semantic Analysis over a large corpus in the new language. We evaluate the accuracy of placing newly found emotive words in one or more of the defined semantic dimensions. We illustrate this technique by creating an affect lexicon for French, but the techniques can be applied to any language found on the Web and for which a large quantity of text exists.

2006

pdf abs
Exploiting text for extracting image processing resources
Gregory Grefenstette | Fathi Debili | Christian Fluhr | Svitlana Zinger
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Much everyday knowledge about physical aspects of objects does not exist as computer data, though such computer-based knowledge will be needed to communicate with next generation voice-commanded personal robots as well in other applications involving visual scene recognition. The largest attempt at manually creating common-sense knowledge, the CYC project, has not yet produced the information needed for these tasks. A new direction is needed, based on an automated approach to knowledge extraction. In this article we present our project to mine web text to find properties of objects that are not currently stored in computer readable form.