Kapil Dalwani


2010

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information
Satoshi Sekine | Kapil Dalwani
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We developed a search tool for ngrams extracted from a very large corpus (the current system uses the entire Wikipedia, which has 1.7 billion tokens). The tool supports queries with an arbitrary number of wildcards and/or specification by a combination of token, POS, chunk (such as NP, VP, PP) and Named Entity (NE). The previous system (Sekine 08) could only handle tokens and unrestricted wildcards in the query, such as “* was established in *”. However, being able to constrain the wildcards by POS, chunk or NE is quite useful for filtering out noise. For example, the new system can search for “NE=COMPANY was established in POS=CD”. This finer specification reduces the number of outputs to less than half and avoids ngrams that have a comma or a common noun in the first position or location information in the last position. The tool outputs the matched ngrams with their frequencies as well as all the contexts (i.e. sentences, KWIC lists and document ID information) where the matched ngrams occur in the corpus. A search takes a fraction of a second on a single-CPU Linux PC (1 GB memory, 500 GB disk).
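To illustrate the kind of constrained-wildcard query the abstract describes, the following is a minimal sketch, assuming each indexed ngram carries per-token POS, chunk and NE annotations. The field names (token, pos, chunk, ne) and the pattern syntax are illustrative assumptions, not the paper's actual index format or implementation.

```python
# Sketch: matching a constrained-wildcard pattern against an annotated ngram.
# Field names and pattern syntax are hypothetical, for illustration only.

def term_matches(term, word):
    """Match one pattern term against one annotated word."""
    if term == "*":                        # unrestricted wildcard
        return True
    if "=" in term:                        # constrained wildcard, e.g. POS=CD
        layer, value = term.split("=", 1)
        return word.get(layer.lower()) == value
    return word["token"] == term           # literal token match


def ngram_matches(pattern, ngram):
    """An ngram matches if every position satisfies its pattern term."""
    return len(pattern) == len(ngram) and all(
        term_matches(t, w) for t, w in zip(pattern, ngram)
    )


# Example query from the abstract: "NE=COMPANY was established in POS=CD"
pattern = ["NE=COMPANY", "was", "established", "in", "POS=CD"]
ngram = [
    {"token": "Google", "pos": "NNP", "chunk": "NP", "ne": "COMPANY"},
    {"token": "was", "pos": "VBD", "chunk": "VP", "ne": "O"},
    {"token": "established", "pos": "VBN", "chunk": "VP", "ne": "O"},
    {"token": "in", "pos": "IN", "chunk": "PP", "ne": "O"},
    {"token": "1998", "pos": "CD", "chunk": "NP", "ne": "DATE"},
]
print(ngram_matches(pattern, ngram))       # True
```

In the actual system the pattern would be resolved against a prebuilt index rather than by scanning annotated ngrams one by one; the sketch only shows how the per-position constraints filter out noisy matches such as ngrams starting with a comma.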

New Tools for Web-Scale N-grams
Dekang Lin | Kenneth Church | Heng Ji | Satoshi Sekine | David Yarowsky | Shane Bergsma | Kailash Patil | Emily Pitler | Rachel Lathbury | Vikram Rao | Kapil Dalwani | Sushant Narsale
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

While the web provides a fantastic linguistic resource, collecting and processing data at web-scale is beyond the reach of most academic laboratories. Previous research has relied on search engines to collect online information, but this is hopelessly inefficient for building large-scale linguistic resources, such as lists of named-entity types or clusters of distributionally similar words. An alternative to processing web-scale text directly is to use the information provided in an N-gram corpus. An N-gram corpus is an efficient compression of large amounts of text. An N-gram corpus states how often each sequence of words (up to length N) occurs. We propose tools for working with enhanced web-scale N-gram corpora that include richer levels of source annotation, such as part-of-speech tags. We describe a new set of search tools that make use of these tags, and collectively lower the barrier for lexical learning and ambiguity resolution at web-scale. They will allow novel sources of information to be applied to long-standing natural language challenges.
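As a concrete illustration of what an N-gram corpus stores, the following is a minimal sketch that reduces raw text to (sequence, frequency) pairs for all sequences up to length N. Real web-scale corpora add frequency cutoffs and, as proposed here, source annotations such as part-of-speech tags, which this toy example omits.

```python
from collections import Counter

def build_ngram_counts(sentences, max_n=3):
    """Count every word sequence up to length max_n -- the core content of an
    N-gram corpus, which replaces raw text with (sequence, frequency) pairs."""
    counts = Counter()
    for sent in sentences:
        tokens = sent.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy "corpus" of two sentences; a web-scale corpus would hold billions.
corpus = ["the cat sat on the mat", "the dog sat on the rug"]
counts = build_ngram_counts(corpus)
print(counts[("sat", "on", "the")])   # 2
```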