Abstract
This paper presents an approach for detecting and normalizing neologisms in social media content. Neologisms refer to recent expressions that are specific to certain entities or events and are being increasingly used by the public, but have not yet been accepted in mainstream language. Automated methods for handling neologisms are important for natural language understanding and normalization, especially for informal genres with user generated content. We present an unsupervised approach for detecting neologisms and then normalizing them to canonical words without relying on parallel training data. Our approach builds on the text normalization literature and introduces adaptations to fit the specificities of this task, including phonetic and etymological considerations. We evaluate the proposed techniques on a dataset of Reddit comments, with detected neologisms and corresponding normalizations.

- Anthology ID: D19-5555
- Volume: Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)
- Month: November
- Year: 2019
- Address: Hong Kong, China
- Editors: Wei Xu, Alan Ritter, Tim Baldwin, Afshin Rahimi
- Venue: WNUT
- Publisher: Association for Computational Linguistics
- Pages: 425–430
- URL: https://aclanthology.org/D19-5555
- DOI: 10.18653/v1/D19-5555
- Cite (ACL): Nasser Zalmout, Kapil Thadani, and Aasish Pappu. 2019. Unsupervised Neologism Normalization Using Embedding Space Mapping. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 425–430, Hong Kong, China. Association for Computational Linguistics.
- Cite (Informal): Unsupervised Neologism Normalization Using Embedding Space Mapping (Zalmout et al., WNUT 2019)
- PDF: https://preview.aclanthology.org/nschneid-patch-3/D19-5555.pdf