DiatopIt: A Corpus of Social Media Posts for the Study of Diatopic Language Variation in Italy

Alan Ramponi; Camilla Casula

doi:10.18653/v1/2023.vardial-1.19

Abstract

We introduce DiatopIt, the first corpus specifically focused on diatopic language variation in Italy for language varieties other than Standard Italian. DiatopIt comprises over 15K geolocated social media posts from Twitter, collected over a period of two years, including regional Italian usage and content fully written in local language varieties or exhibiting code-switching with Standard Italian. We detail how we tackled key challenges in creating such a resource, including the absence of orthography standards for most local language varieties and the lack of reliable language identification tools. We assess the representativeness of DiatopIt across time and space, and show that the density of non-Standard Italian content across areas correlates with actual language use. Finally, we conduct computational experiments and find that modeling diatopic variation in highly multilingual areas such as Italy is a complex task even for recent language models.

Anthology ID: 2023.vardial-1.19
Volume: Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)
Month: May
Year: 2023
Address: Dubrovnik, Croatia
Editors: Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Preslav Nakov, Jörg Tiedemann, Marcos Zampieri
Venue: VarDial
Publisher: Association for Computational Linguistics
Pages: 187–199
URL: https://aclanthology.org/2023.vardial-1.19
DOI: 10.18653/v1/2023.vardial-1.19
Cite (ACL): Alan Ramponi and Camilla Casula. 2023. DiatopIt: A Corpus of Social Media Posts for the Study of Diatopic Language Variation in Italy. In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), pages 187–199, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal): DiatopIt: A Corpus of Social Media Posts for the Study of Diatopic Language Variation in Italy (Ramponi & Casula, VarDial 2023)
PDF: https://preview.aclanthology.org/proper-vol2-ingestion/2023.vardial-1.19.pdf
Video: https://preview.aclanthology.org/proper-vol2-ingestion/2023.vardial-1.19.mp4