Abstract
We present a 78.8-million-tweet, 1.3-billion-word corpus aimed at studying regional variation in Canadian English, with a specific focus on the dialect regions of Toronto, Montreal, and Vancouver. Our data collection and filtering pipeline reflects complex design criteria, which aim to allow for both data-intensive modeling methods and user-level variationist sociolinguistic analysis. Specifically, it consists of identifying Twitter users from the three cities, crawling their entire timelines, filtering the collected data by user location and tweet language, and automatically excluding near-duplicate content. The resulting corpus mirrors national and regional specificities of Canadian English, provides sufficient aggregate and user-level data, and maintains a reasonably balanced distribution of content across regions and users. The utility of this dataset is illustrated by two example applications: the detection of regional lexical and topical variation, and the identification of contact-induced semantic shifts using vector space models. In accordance with Twitter’s developer policy, the corpus will be publicly released in the form of tweet IDs.

- Anthology ID: 2020.lrec-1.767
- Volume: Proceedings of the Twelfth Language Resources and Evaluation Conference
- Month: May
- Year: 2020
- Address: Marseille, France
- Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue: LREC
- SIG:
- Publisher: European Language Resources Association
- Note:
- Pages: 6255–6264
- Language: English
- URL: https://aclanthology.org/2020.lrec-1.767
- DOI:
- Cite (ACL): Filip Miletic, Anne Przewozny-Desriaux, and Ludovic Tanguy. 2020. Collecting Tweets to Investigate Regional Variation in Canadian English. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6255–6264, Marseille, France. European Language Resources Association.
- Cite (Informal): Collecting Tweets to Investigate Regional Variation in Canadian English (Miletic et al., LREC 2020)
- PDF: https://preview.aclanthology.org/add_acl24_videos/2020.lrec-1.767.pdf