Abstract
Pre-trained language-model representations have become ubiquitous in natural language processing. They have been shown to significantly improve the performance of neural models on a wide variety of tasks. However, it remains unclear how useful such general-purpose models are for handling non-canonical text. In this article, focusing on User Generated Content (UGC), we study the ability of BERT to perform lexical normalisation. Our contribution is simple: by framing lexical normalisation as a token prediction task, by enhancing its architecture and by carefully fine-tuning it, we show that BERT can be a competitive lexical normalisation model without the need for any UGC resources aside from 3,000 training sentences. To the best of our knowledge, this is the first work to adapt this model to noisy UGC data and to analyse its ability to handle it.
- Anthology ID: D19-5539
- Volume: Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)
- Month: November
- Year: 2019
- Address: Hong Kong, China
- Editors: Wei Xu, Alan Ritter, Tim Baldwin, Afshin Rahimi
- Venue: WNUT
- Publisher: Association for Computational Linguistics
- Pages: 297–306
- URL: https://aclanthology.org/D19-5539
- DOI: 10.18653/v1/D19-5539
- Cite (ACL): Benjamin Muller, Benoit Sagot, and Djamé Seddah. 2019. Enhancing BERT for Lexical Normalization. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 297–306, Hong Kong, China. Association for Computational Linguistics.
- Cite (Informal): Enhancing BERT for Lexical Normalization (Muller et al., WNUT 2019)
- PDF: https://preview.aclanthology.org/nschneid-patch-2/D19-5539.pdf
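The abstract frames lexical normalisation as a token prediction task: each noisy input token is mapped to its canonical form. A minimal illustrative sketch of that framing is below; the toy lexicon stands in for the paper's fine-tuned BERT token predictor, and all names here are hypothetical, not the authors' code.

```python
# Toy stand-in for a learned per-token normalisation model.
# In the paper, the prediction step is a fine-tuned BERT; here it is
# a small illustrative lookup table.
NORMALISATION_LEXICON = {
    "u": "you",
    "r": "are",
    "gr8": "great",
}

def predict(token: str) -> str:
    """Predict the normalised form of one token (identity if already canonical)."""
    return NORMALISATION_LEXICON.get(token.lower(), token)

def normalise(sentence: str) -> str:
    """Treat normalisation as a sequence of independent token predictions."""
    return " ".join(predict(tok) for tok in sentence.split())

print(normalise("u r gr8"))  # → you are great
```

The point of the framing is that normalisation reduces to labelling each input position with an output token, which is exactly the shape of task a token-level BERT head can be fine-tuned on.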