Fast Approximate String Matching with Suffix Arrays and A* Parsing

Philipp Koehn; Jean Senellart

Fast Approximate String Matching with Suffix Arrays and A* Parsing

Abstract

We present a novel exact solution to the approximate string matching problem in the context of translation memories, where a text segment has to be matched against a large corpus, while allowing for errors. We use suffix arrays to detect exact n-gram matches, A* search heuristics to discard matches and A* parsing to validate candidate segments. The method outperforms the canonical baseline by a factor of 100, with average lookup times of 4.3–247ms for a segment in a realistic scenario.

Anthology ID:: 2010.amta-papers.2
Volume:: Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers
Month:: October 31-November 4
Year:: 2010
Address:: Denver, Colorado, USA
Venue:: AMTA
SIG:
Publisher:: Association for Machine Translation in the Americas
Note:
Pages:
Language:
URL:: https://aclanthology.org/2010.amta-papers.2
DOI:
Bibkey:
Cite (ACL):: Philipp Koehn and Jean Senellart. 2010. Fast Approximate String Matching with Suffix Arrays and A* Parsing. In Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers, Denver, Colorado, USA. Association for Machine Translation in the Americas.
Cite (Informal):: Fast Approximate String Matching with Suffix Arrays and A* Parsing (Koehn & Senellart, AMTA 2010)
Copy Citation:
PDF:: https://preview.aclanthology.org/auto-file-uploads/2010.amta-papers.2.pdf

PDF Search