Sparse Coding in Authorship Attribution for Polish Tweets

Piotr Grzybowski, Ewa Juralewicz, Maciej Piasecki


Abstract
The study explores the application of a simple Convolutional Neural Network to the problem of authorship attribution of tweets written in Polish. In our solution, we use two-step compression of tweets using the Byte Pair Encoding algorithm and vectorisation as an input to the distributional model generated for a large corpus of Polish tweets by the word2vec algorithm. Our method achieves results comparable to state-of-the-art approaches for a similar task on English tweets and performs very well in the classification of Polish tweets. We tested the proposed method in relation to the number of authors and the number of tweets per author. We also compared results for authors with different topic backgrounds.
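For readers unfamiliar with Byte Pair Encoding, the compression step named in the abstract can be illustrated with a minimal sketch. This is a toy, character-level version of the general BPE idea (repeatedly merging the most frequent adjacent symbol pair), not the authors' implementation: the function names learn_bpe, merge_pair and encode are hypothetical, and the paper's actual BPE settings, word2vec model and CNN are not reproduced here.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE).

    Assumption: words are already tokenised; no end-of-word marker is used.
    """
    # Each word starts as a tuple of single characters, counted by frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is a single symbol; nothing left to merge
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Apply the new merge rule to the whole vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Segment `word` by applying the learned merge rules in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

In a pipeline like the one the abstract outlines, the subword units returned by encode would then be mapped to embedding vectors before being passed to a classifier; that mapping is outside the scope of this sketch.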
Anthology ID:
R19-1048
Volume:
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
Month:
September
Year:
2019
Address:
Varna, Bulgaria
Editors:
Ruslan Mitkov, Galia Angelova
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
409–417
URL:
https://aclanthology.org/R19-1048
DOI:
10.26615/978-954-452-056-4_048
Cite (ACL):
Piotr Grzybowski, Ewa Juralewicz, and Maciej Piasecki. 2019. Sparse Coding in Authorship Attribution for Polish Tweets. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 409–417, Varna, Bulgaria. INCOMA Ltd.
Cite (Informal):
Sparse Coding in Authorship Attribution for Polish Tweets (Grzybowski et al., RANLP 2019)
PDF:
https://preview.aclanthology.org/nschneid-patch-5/R19-1048.pdf