Abstract
Edit distance has been successfully used to extract training data, i.e., misspelling-correction pairs, of spelling correction models from search query logs in languages including English. However, the success does not readily apply to Japanese, where misspellings are often dissimilar to correct spellings due to the romanization-based input methods. To address this problem, we introduce lattice path edit distance, which utilizes romanization lattices to efficiently consider all possible romanized forms of input strings. Empirical experiments using Japanese search query logs demonstrated that the lattice path edit distance outperformed baseline methods including the standard edit distance combined with an existing transliterator and morphological analyzer. A training data collection pipeline that uses the lattice path edit distance has been deployed in production at our search engine for over a year.
- Anthology ID:
- 2023.emnlp-industry.24
- Volume:
- Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track
- Month:
- December
- Year:
- 2023
- Address:
- Singapore
- Editors:
- Mingxuan Wang, Imed Zitouni
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 233–242
- URL:
- https://aclanthology.org/2023.emnlp-industry.24
- DOI:
- 10.18653/v1/2023.emnlp-industry.24
- Cite (ACL):
- Nobuhiro Kaji. 2023. Lattice Path Edit Distance: A Romanization-aware Edit Distance for Extracting Misspelling-Correction Pairs from Japanese Search Query Logs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 233–242, Singapore. Association for Computational Linguistics.
- Cite (Informal):
- Lattice Path Edit Distance: A Romanization-aware Edit Distance for Extracting Misspelling-Correction Pairs from Japanese Search Query Logs (Kaji, EMNLP 2023)
- PDF:
- https://preview.aclanthology.org/ingest-2024-clasp/2023.emnlp-industry.24.pdf