Abstract
Text normalization techniques based on rules, lexicons, or supervised training that requires large corpora are neither scalable nor domain-interchangeable, which makes them unsuitable for normalizing user-generated content (UGC). The tools currently available for Brazilian Portuguese rely on such techniques. In this work we propose a technique based on distributed representations of words (word embeddings), which represent each word as a high-dimensional continuous numeric vector. These vectors explicitly encode many linguistic regularities and patterns, as well as syntactic and semantic word relationships: words that are semantically similar are represented by similar vectors. Building on these properties, we present a fully unsupervised, expandable, language- and domain-independent method for learning normalization lexicons from word embeddings. Our approach achieves a high correction rate for orthographic errors and internet slang in product reviews, outperforming the tools currently available for Brazilian Portuguese.
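To make the idea concrete, the sketch below illustrates how a normalization lexicon can be induced from word embeddings: out-of-vocabulary (noisy) forms are mapped to embedding-space neighbours that belong to a canonical vocabulary and pass a lexical-similarity filter. This is a minimal illustration of the general technique, not the authors' exact pipeline; it assumes gensim's Word2Vec implementation, and the toy corpus, canonical word list, SequenceMatcher filter, and 0.5 threshold are assumptions chosen only for demonstration.

```python
# Illustrative sketch: induce a normalization lexicon from word embeddings.
# Noisy forms are mapped to the most similar canonical word in embedding
# space, filtered by a cheap string-similarity check.
# The corpus, canonical vocabulary, and thresholds below are illustrative
# assumptions, not the configuration used in the paper.
from difflib import SequenceMatcher
from gensim.models import Word2Vec

# Toy UGC-like corpus (already tokenized); in practice this would be a
# large collection of product reviews.
corpus = [
    ["o", "produto", "eh", "mto", "bom"],
    ["o", "produto", "é", "muito", "bom"],
    ["chegou", "rapido", "e", "muito", "bem", "embalado"],
    ["chegou", "rápido", "e", "mto", "bem", "embalado"],
]

# Train embeddings directly on the raw, unnormalized text.
model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, seed=42)

# Words treated as canonical (in practice, a standard dictionary).
canonical = {"o", "produto", "é", "muito", "bom", "chegou",
             "rápido", "e", "bem", "embalado"}

def lexical_similarity(a: str, b: str) -> float:
    """String-level similarity used to filter embedding neighbours."""
    return SequenceMatcher(None, a, b).ratio()

# Build the lexicon: for each non-canonical (noisy) word, pick the first
# embedding neighbour that is canonical and lexically close enough.
lexicon = {}
for word in model.wv.index_to_key:
    if word in canonical:
        continue
    for neighbour, _cosine in model.wv.most_similar(word, topn=10):
        if neighbour in canonical and lexical_similarity(word, neighbour) >= 0.5:
            lexicon[word] = neighbour
            break

# On this toy corpus the neighbours are essentially arbitrary; trained on a
# realistic UGC corpus, the lexicon contains pairs such as "rapido" -> "rápido".
print(lexicon)
```

Because the embeddings are trained on the noisy text itself, the procedure needs no annotated data; the lexicon can be regrown for a new domain or language simply by retraining on a different corpus, which is the property the abstract emphasizes.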
- Anthology ID: W16-3916
- Volume: Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)
- Month: December
- Year: 2016
- Address: Osaka, Japan
- Editors: Bo Han, Alan Ritter, Leon Derczynski, Wei Xu, Tim Baldwin
- Venue: WNUT
- Publisher: The COLING 2016 Organizing Committee
- Pages: 112–120
- URL: https://aclanthology.org/W16-3916
- Cite (ACL): Thales Felipe Costa Bertaglia and Maria das Graças Volpe Nunes. 2016. Exploring Word Embeddings for Unsupervised Textual User-Generated Content Normalization. In Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT), pages 112–120, Osaka, Japan. The COLING 2016 Organizing Committee.
- Cite (Informal): Exploring Word Embeddings for Unsupervised Textual User-Generated Content Normalization (Costa Bertaglia & Volpe Nunes, WNUT 2016)
- PDF: https://preview.aclanthology.org/nschneid-patch-5/W16-3916.pdf