Abstract
The study explores the application of a simple Convolutional Neural Network to the problem of authorship attribution of tweets written in Polish. In our solution we use a two-step compression of tweets, using the Byte Pair Encoding algorithm and vectorisation, as input to the distributional model generated from a large corpus of Polish tweets by the word2vec algorithm. Our method achieves results comparable to state-of-the-art approaches for a similar task on English tweets and performs very well in the classification of Polish tweets. We tested the proposed method with respect to the number of authors and the number of tweets per author. We also compared results for authors with different topic backgrounds.
- Anthology ID:
- R19-1048
- Volume:
- Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
- Month:
- September
- Year:
- 2019
- Address:
- Varna, Bulgaria
- Editors:
- Ruslan Mitkov, Galia Angelova
- Venue:
- RANLP
- Publisher:
- INCOMA Ltd.
- Pages:
- 409–417
- URL:
- https://aclanthology.org/R19-1048
- DOI:
- 10.26615/978-954-452-056-4_048
- Cite (ACL):
- Piotr Grzybowski, Ewa Juralewicz, and Maciej Piasecki. 2019. Sparse Coding in Authorship Attribution for Polish Tweets. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 409–417, Varna, Bulgaria. INCOMA Ltd.
- Cite (Informal):
- Sparse Coding in Authorship Attribution for Polish Tweets (Grzybowski et al., RANLP 2019)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-3/R19-1048.pdf