Phonetic Normalization for Machine Translation of User Generated Content
José Carlos Rosales Núñez, Djamé Seddah, Guillaume Wisniewski
Abstract
We present an approach to correcting noisy User Generated Content (UGC) in French, with the aim of producing a preprocessing pipeline that improves Machine Translation for this kind of non-canonical corpus. To do so, we have implemented a character-based neural phonetizer that produces IPA pronunciations of words. In this way, we intend to correct the grammar, vocabulary and accentuation errors often present in noisy UGC corpora. Our method leverages the fact that some errors are due to confusion between words with similar pronunciations, which can be corrected by using a phonetic look-up table to produce normalization candidates. These potential corrections are then encoded in a lattice and ranked using a language model to output the most probable corrected phrase. Compared to using other phonetizers, our method boosts a transformer-based machine translation system on UGC.
- Anthology ID:
- D19-5553
- Volume:
- Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)
- Month:
- November
- Year:
- 2019
- Address:
- Hong Kong, China
- Editors:
- Wei Xu, Alan Ritter, Tim Baldwin, Afshin Rahimi
- Venue:
- WNUT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 407–416
- URL:
- https://aclanthology.org/D19-5553
- DOI:
- 10.18653/v1/D19-5553
- Cite (ACL):
- José Carlos Rosales Núñez, Djamé Seddah, and Guillaume Wisniewski. 2019. Phonetic Normalization for Machine Translation of User Generated Content. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 407–416, Hong Kong, China. Association for Computational Linguistics.
- Cite (Informal):
- Phonetic Normalization for Machine Translation of User Generated Content (Rosales Núñez et al., WNUT 2019)
- PDF:
- https://preview.aclanthology.org/teach-a-man-to-fish/D19-5553.pdf
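The pipeline outlined in the abstract (phonetize each token, look up words with the same pronunciation, encode the candidates in a lattice, rank paths with a language model) can be illustrated with a minimal sketch. This is not the authors' implementation. The pronunciation dictionary stands in for the neural phonetizer, its entries are simplified and only loosely IPA-like, the bigram scores are invented, and every name (`PRONUNCIATIONS`, `BIGRAM_LOGPROB`, `candidates`, `normalize`) is hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-in for the phonetizer: word -> simplified pronunciation.
# The paper uses a character-based neural model instead of a fixed dictionary.
PRONUNCIATIONS = {
    "c": "se", "ces": "se", "ses": "se", "c'est": "se", "sait": "se",
    "tro": "tʁo", "trop": "tʁo",
    "bo": "bo", "beau": "bo",
}

# Phonetic look-up table: pronunciation -> all words sharing it.
PHONETIC_TABLE = defaultdict(list)
for word, phon in sorted(PRONUNCIATIONS.items()):
    PHONETIC_TABLE[phon].append(word)

# Invented bigram log-probabilities standing in for a real language model.
BIGRAM_LOGPROB = {
    ("<s>", "c'est"): -1.0,
    ("c'est", "trop"): -1.0,
    ("trop", "beau"): -1.0,
}
UNSEEN_LOGPROB = -10.0  # assumed flat penalty for unseen bigrams


def candidates(token):
    """Return normalization candidates: words pronounced like `token`."""
    phon = PRONUNCIATIONS.get(token)
    if phon is None:
        return [token]  # unknown token: keep it unchanged
    return PHONETIC_TABLE[phon]


def bigram_score(prev, word):
    return BIGRAM_LOGPROB.get((prev, word), UNSEEN_LOGPROB)


def normalize(tokens):
    """Pick the best path through the candidate lattice (Viterbi search)."""
    lattice = [candidates(token) for token in tokens]
    # best[word] = (score, path) of the best path ending in `word`.
    best = {"<s>": (0.0, [])}
    for column in lattice:
        new_best = {}
        for word in column:
            score, path = max(
                ((prev_score + bigram_score(prev, word), prev_path)
                 for prev, (prev_score, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            new_best[word] = (score, path + [word])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


print(" ".join(normalize(["c", "tro", "bo"])))
```

With these toy tables, the noisy input `c tro bo` is mapped to `c'est trop beau`, because that is the only path through the lattice whose bigrams are all known to the toy language model. A real system would replace the dictionary with a trained phonetizer and the bigram table with a properly estimated language model.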