Towards the WhAP Corpus: A Resource for the Study of Italian on WhatsApp

Ilaria Fiorentini, Marco Forlano, Nicholas Nese


Abstract
Over the past two decades, the rise of new technologies and social networks has significantly shaped written language, imbuing it with characteristics akin to those of spoken language. This study reports on the ongoing initiative to build the WhAP Corpus, a resource featuring WhatsApp conversations in Italian, encompassing both written and spoken messages and currently totaling more than 400,000 tokens, 89 conversations, and 194 participants from diverse age groups and geographical regions of Italy. More specifically, this paper focuses on the practical steps involved in the construction of the resource. Once publicly accessible, the WhAP Corpus will enable in-depth linguistic research on the language used on WhatsApp, which shows unique features such as the blending of written and spoken elements.
Anthology ID:
2024.lrec-main.1448
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
16659–16663
URL:
https://aclanthology.org/2024.lrec-main.1448
Cite (ACL):
Ilaria Fiorentini, Marco Forlano, and Nicholas Nese. 2024. Towards the WhAP Corpus: A Resource for the Study of Italian on WhatsApp. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 16659–16663, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Towards the WhAP Corpus: A Resource for the Study of Italian on WhatsApp (Fiorentini et al., LREC-COLING 2024)
PDF:
https://preview.aclanthology.org/add_acl24_videos/2024.lrec-main.1448.pdf