Abstract
We present a novel exact solution to the approximate string matching problem in the context of translation memories, where a text segment has to be matched against a large corpus, while allowing for errors. We use suffix arrays to detect exact n-gram matches, A* search heuristics to discard matches and A* parsing to validate candidate segments. The method outperforms the canonical baseline by a factor of 100, with average lookup times of 4.3–247ms for a segment in a realistic scenario.- Anthology ID:
- 2010.amta-papers.2
- Volume:
- Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers
- Month:
- October 31-November 4
- Year:
- 2010
- Address:
- Denver, Colorado, USA
- Venue:
- AMTA
- SIG:
- Publisher:
- Association for Machine Translation in the Americas
- Note:
- Pages:
- Language:
- URL:
- https://aclanthology.org/2010.amta-papers.2
- DOI:
- Cite (ACL):
- Philipp Koehn and Jean Senellart. 2010. Fast Approximate String Matching with Suffix Arrays and A* Parsing. In Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers, Denver, Colorado, USA. Association for Machine Translation in the Americas.
- Cite (Informal):
- Fast Approximate String Matching with Suffix Arrays and A* Parsing (Koehn & Senellart, AMTA 2010)
- PDF:
- https://preview.aclanthology.org/auto-file-uploads/2010.amta-papers.2.pdf