Gregory Grefenstette


Extracting Weighted Language Lexicons from Wikipedia
Gregory Grefenstette
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Language models are used in applications as diverse as speech recognition, optical character recognition and information retrieval. They are used to predict word appearance, and to weight the importance of words in these applications. One basic element of language models is the list of words in a language. Another is the unigram frequency of each word. But this basic information is not available for most languages in the world. Since the multilingual Wikipedia project encourages the production of encyclopedic-like articles in many world languages, we can find there an ever-growing source of text from which to extract these two language modelling elements: word list and frequency. Here we present a simple technique for converting this Wikipedia text into lexicons of weighted unigrams for the more than 280 languages currently present in Wikipedia. The lexicons produced, and the source code for producing them in a Linux-based system, are here made available for free on the Web.
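
As a rough illustration of what a lexicon of weighted unigrams looks like, the sketch below counts words in a plain-text sample and divides by the total token count. It is not the paper's released tooling: the function name `weighted_unigrams`, the regular-expression tokenizer, and the use of relative frequency as the weight are all assumptions made for this example, and real Wikipedia text would first need markup removal.

```python
import re
from collections import Counter


def weighted_unigrams(text):
    """Return {word: relative frequency} for a plain-text sample.

    Hypothetical helper: the tokenizer (lowercased runs of word
    characters) and the weighting scheme are assumptions, not the
    method used in the paper.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


sample = "The cat sat on the mat. The mat was red."
lexicon = weighted_unigrams(sample)

# Print the three heaviest entries, ties broken alphabetically.
for word, weight in sorted(lexicon.items(), key=lambda kv: (-kv[1], kv[0]))[:3]:
    print(f"{word}\t{weight:.3f}")
```

A whitespace-and-punctuation tokenizer of this kind would not suit languages written without spaces between words, so a per-language segmentation step would be needed there.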


INRIASAC: Simple Hypernym Extraction Methods
Gregory Grefenstette
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)


A Conceptual Approach to Web Image Retrieval
Adrian Popescu | Gregory Grefenstette
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

People use the Internet to find a wide variety of images. Existing image search engines do not understand the pictures they return. The introduction of semantic layers in information retrieval frameworks may enhance the quality of the results compared to existing systems. One important challenge in the field is to develop architectures that fit the requirements of real-life applications, like the Internet search engines. In this paper, we describe Olive, an image retrieval application that exploits a large scale conceptual hierarchy (extracted from WordNet) to automatically reformulate user queries, search for associated images and present results in an interactive and structured fashion. When searching for a concept in the hierarchy, Olive reformulates the query using its deepest subtypes in WordNet. On the answers page, the system displays a selection of related classes and proposes a content based retrieval functionality among the pictures sharing the same linguistic label. In order to validate our approach, we run two series of tests to assess the performance of the application and report the results here. First, two precision evaluations over a panel of concepts from different domains are carried out and second, a user test is designed so as to assess the interaction with the system.
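
The query reformulation step can be pictured with a small sketch. It reads "deepest subtypes" as the leaves below a concept, which is an interpretation, and it uses a hand-made dictionary (`TOY_HIERARCHY`, hypothetical) in place of WordNet. The function names are also invented for this example and do not come from Olive.

```python
# Hypothetical toy hierarchy: each concept maps to its direct subtypes.
TOY_HIERARCHY = {
    "dog": ["hound", "terrier"],
    "hound": ["beagle", "greyhound"],
    "terrier": ["airedale"],
}


def leaf_subtypes(hierarchy, concept):
    """Collect the leaf concepts under `concept`, depth first.

    A concept with no recorded subtypes is treated as its own leaf.
    """
    children = hierarchy.get(concept, [])
    if not children:
        return [concept]
    leaves = []
    for child in children:
        leaves.extend(leaf_subtypes(hierarchy, child))
    return leaves


def reformulate(hierarchy, concept):
    """Expand a query concept into one query term per leaf subtype."""
    return leaf_subtypes(hierarchy, concept)


print(reformulate(TOY_HIERARCHY, "dog"))
```

Under this reading, a query for a general concept is replaced by several more specific queries, one per leaf, whose results can then be grouped for display.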

Semi-automatic Building Method for a Multidimensional Affect Dictionary for a New Language
Guillaume Pitel | Gregory Grefenstette
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Detecting the tone or emotive content of a text message is increasingly important in many natural language processing applications. While for the English language there exist a number of computer-usable affect, emotive, or opinion lexicons for automatically processing text, other languages rarely possess these primary resources. Here we present a semi-automatic technique for quickly building a multidimensional affect lexicon for a new language. Most of the work consists of defining 44 paired affect directions (e.g. love-hate, courage-fear, etc.) and choosing a small number of seed words for each dimension. From this initial investment, we show how a first pass affect lexicon can be created for a new language, using an SVM classifier trained on a feature space produced from Latent Semantic Analysis over a large corpus in the new language. We evaluate the accuracy of placing newly found emotive words in one or more of the defined semantic dimensions. We illustrate this technique by creating an affect lexicon for French, but the techniques can be applied to any language found on the Web and for which a large quantity of text exists.


Exploiting text for extracting image processing resources
Gregory Grefenstette | Fathi Debili | Christian Fluhr | Svitlana Zinger
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Much everyday knowledge about physical aspects of objects does not exist as computer data, though such computer-based knowledge will be needed to communicate with next generation voice-commanded personal robots as well as in other applications involving visual scene recognition. The largest attempt at manually creating common-sense knowledge, the CYC project, has not yet produced the information needed for these tasks. A new direction is needed, based on an automated approach to knowledge extraction. In this article we present our project to mine web text to find properties of objects that are not currently stored in computer readable form.


Modifying a Natural Language Processing System for European Languages to Treat Arabic in Information Processing and Information Retrieval Applications
Gregory Grefenstette | Nasredine Semmar | Faïza Elkateb-Gara
Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages

The Use of Monolingual Context Vectors for Missing Translations in Cross-Language Information Retrieval
Yan Qu | Gregory Grefenstette | David A. Evans
Second International Joint Conference on Natural Language Processing: Full Papers


Finding Ideographic Representations of Japanese Names Written in Latin Script via Language Identification and Corpus Validation
Yan Qu | Gregory Grefenstette
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)


Introduction to the Special Issue on the Web as Corpus
Adam Kilgarriff | Gregory Grefenstette
Computational Linguistics, Volume 29, Number 3, September 2003: Special Issue on the Web as Corpus


Expanding lexicons by inducing paradigms and validating attested forms
Gregory Grefenstette | Yan Qu | David A. Evans
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)


The World Wide Web as a Resource for Example-Based Machine Translation Tasks
Gregory Grefenstette
Proceedings of Translating and the Computer 21


Cross language information retrieval
Gregory Grefenstette
Proceedings of the Third Conference of the Association for Machine Translation in the Americas: Tutorial Descriptions


An Experiment in Semantic Tagging using Hidden Markov Model Tagging
Frederique Segond | Anne Schiller | Gregory Grefenstette | Jean-Pierre Chanod
Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications


Corpus-based Method for Automatic Identification of Support Verbs for Nominalizations
Simone Teufel | Gregory Grefenstette
Seventh Conference of the European Chapter of the Association for Computational Linguistics


Evaluation Techniques for Automatic Semantic Extraction: Comparing Syntactic and Window Based Approaches
Gregory Grefenstette
Acquisition of Lexical Knowledge from Text


SEXTANT: Exploring Unexplored Contexts for Semantic Extraction From Syntactic Analysis
Gregory Grefenstette
30th Annual Meeting of the Association for Computational Linguistics