Fast Approximate String Matching with Suffix Arrays and A* Parsing

Philipp Koehn, Jean Senellart


Abstract
We present a novel exact solution to the approximate string matching problem in the context of translation memories, where a text segment has to be matched against a large corpus, while allowing for errors. We use suffix arrays to detect exact n-gram matches, A* search heuristics to discard matches and A* parsing to validate candidate segments. The method outperforms the canonical baseline by a factor of 100, with average lookup times of 4.3–247ms for a segment in a realistic scenario.
Anthology ID:
2010.amta-papers.2
Volume:
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers
Month:
October 31-November 4
Year:
2010
Address:
Denver, Colorado, USA
Venue:
AMTA
SIG:
Publisher:
Association for Machine Translation in the Americas
Note:
Pages:
Language:
URL:
https://aclanthology.org/2010.amta-papers.2
DOI:
Bibkey:
Cite (ACL):
Philipp Koehn and Jean Senellart. 2010. Fast Approximate String Matching with Suffix Arrays and A* Parsing. In Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers, Denver, Colorado, USA. Association for Machine Translation in the Americas.
Cite (Informal):
Fast Approximate String Matching with Suffix Arrays and A* Parsing (Koehn & Senellart, AMTA 2010)
Copy Citation:
PDF:
https://preview.aclanthology.org/nschneid-patch-2/2010.amta-papers.2.pdf