An Unsupervised Query Rewriting Approach Using N-gram Co-occurrence Statistics to Find Similar Phrases in Large Text Corpora

Hans Moen; Laura-Maria Peltonen; Henry Suhonen; Hanna-Maria Matinolli; Riitta Mieronkoski; Kirsi Telen; Kirsi Terho; Tapio Salakoski; Sanna Salanterä

Abstract

We present our work towards developing a system that finds, in a large text corpus, contiguous phrases whose meaning is similar to that of a query phrase of arbitrary length. Depending on the use case, this task can be seen as a form of (phrase-level) query rewriting. The suggested approach works in a generative manner, is unsupervised, and uses a combination of a semantic word n-gram model, a statistical language model and a document search engine. A central component is a distributional semantic model containing word n-gram vectors (or embeddings), which models semantic similarities between n-grams of different orders. As data we use a large corpus of PubMed abstracts. The presented experiment is based on manual evaluation of extracted phrases for arbitrary queries provided by a group of evaluators. The results indicate that the proposed approach is promising and that the use of distributional semantic models trained with uni-, bi- and trigrams seems to work better than a more traditional unigram model.
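To make the idea of comparing n-grams of different orders concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract describes trained n-gram embeddings, whereas this sketch substitutes raw sentence-level co-occurrence counts as a stand-in. The function names (`ngrams`, `cooccurrence_vectors`, `cosine`) and the two toy sentences are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def ngrams(tokens, max_n=3):
    """Return all contiguous n-grams (as tuples) of order 1..max_n."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def cooccurrence_vectors(sentences, max_n=3):
    """Map each n-gram to counts of the other words in its sentence.

    Assumption: sentence-level co-occurrence with unigram contexts,
    used here as a simple substitute for trained n-gram embeddings.
    """
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for gram in ngrams(tokens, max_n):
            for tok in tokens:
                if tok not in gram:  # skip the n-gram's own words
                    vectors[gram][tok] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Toy corpus (illustrative only, not data from the paper).
sentences = [
    ["heart", "attack", "is", "serious"],
    ["myocardial", "infarction", "is", "serious"],
]
vectors = cooccurrence_vectors(sentences)

# A bigram compared with another bigram, and with a unigram.
same_order = cosine(vectors[("heart", "attack")], vectors[("myocardial", "infarction")])
cross_order = cosine(vectors[("heart", "attack")], vectors[("infarction",)])
```

Because unigrams, bigrams and trigrams share one context vocabulary, their vectors can be compared directly. That is the property the abstract attributes to the n-gram model, shown here only in a simplified count-based form.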

Anthology ID: W19-6114
Volume: Proceedings of the 22nd Nordic Conference on Computational Linguistics
Month: September–October
Year: 2019
Address: Turku, Finland
Editors: Mareike Hartmann, Barbara Plank
Venue: NoDaLiDa
Publisher: Linköping University Electronic Press
Pages: 131–139
URL: https://aclanthology.org/W19-6114
Cite (ACL): Hans Moen, Laura-Maria Peltonen, Henry Suhonen, Hanna-Maria Matinolli, Riitta Mieronkoski, Kirsi Telen, Kirsi Terho, Tapio Salakoski, and Sanna Salanterä. 2019. An Unsupervised Query Rewriting Approach Using N-gram Co-occurrence Statistics to Find Similar Phrases in Large Text Corpora. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, pages 131–139, Turku, Finland. Linköping University Electronic Press.
Cite (Informal): An Unsupervised Query Rewriting Approach Using N-gram Co-occurrence Statistics to Find Similar Phrases in Large Text Corpora (Moen et al., NoDaLiDa 2019)
PDF: https://preview.aclanthology.org/ingest-acl-2023-videos/W19-6114.pdf
