Classification of Micro-Texts Using Sub-Word Embeddings

Mihir Joshi, Nur Zincir-Heywood


Abstract
Extracting features and writing styles from short text messages is always a challenge. Short messages, like tweets, do not have enough data to perform statistical authorship attribution. Besides, the vocabulary used in these texts is sometimes improvised or misspelled. Therefore, in this paper, we propose combining four feature extraction techniques, namely character n-grams, word n-grams, Flexible Patterns, and a new sub-word embedding using the skip-gram model. Our system uses a Multi-Layer Perceptron that takes these features, extracted from tweets, as input to analyze short text messages. The proposed system achieves 85% accuracy, which is a considerable improvement over previous systems.
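As a rough illustration of two of the four feature types named in the abstract, the sketch below counts character n-grams and word n-grams for a single short message. It is not the authors' implementation: the function names, the n-gram sizes, the `c:`/`w:` prefixes, and the whitespace tokenization are all assumptions made for this example, and the Flexible Patterns, skip-gram sub-word embeddings, and Multi-Layer Perceptron classifier are omitted.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams, using naive whitespace tokenization."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, char_n=3, word_n=2):
    """Count character and word n-grams in one sparse feature bag.

    The "c:" and "w:" prefixes (hypothetical, chosen for this sketch)
    keep the two feature families from colliding.
    """
    features = Counter()
    features.update("c:" + gram for gram in char_ngrams(text, char_n))
    features.update("w:" + gram for gram in word_ngrams(text, word_n))
    return features


# Improvised spellings such as "gr8" or "2nite" still yield usable
# character n-grams even when the whole word is out of vocabulary.
features = ngram_features("gr8 game 2nite")
```

Counts like these could be vectorized and concatenated with other feature types before being passed to a classifier; how the paper combines them is not specified on this page.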
Anthology ID:
R19-1062
Volume:
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
Month:
September
Year:
2019
Address:
Varna, Bulgaria
Editors:
Ruslan Mitkov, Galia Angelova
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
526–533
URL:
https://aclanthology.org/R19-1062
DOI:
10.26615/978-954-452-056-4_062
Cite (ACL):
Mihir Joshi and Nur Zincir-Heywood. 2019. Classification of Micro-Texts Using Sub-Word Embeddings. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 526–533, Varna, Bulgaria. INCOMA Ltd.
Cite (Informal):
Classification of Micro-Texts Using Sub-Word Embeddings (Joshi & Zincir-Heywood, RANLP 2019)
PDF:
https://preview.aclanthology.org/naacl24-info/R19-1062.pdf