Toward a Comparable Corpus of Latvian, Russian and English Tweets

Dmitrijs Milajevs

doi:10.18653/v1/W17-2505

Abstract

Twitter has become a rich source of linguistic data. This pilot study investigates the possibility of building a trilingual Latvian-Russian-English corpus of tweets from Riga, Latvia. Such a corpus, once constructed, could serve multiple purposes, including training machine translation models, examining cross-lingual phenomena and studying the population of Riga. By collecting and analysing a pilot corpus, the study shows that building such a resource is feasible. The pilot corpus is publicly available and can be used to construct a large comparable corpus.

Anthology ID: W17-2505
Volume: Proceedings of the 10th Workshop on Building and Using Comparable Corpora
Month: August
Year: 2017
Address: Vancouver, Canada
Editors: Serge Sharoff, Pierre Zweigenbaum, Reinhard Rapp
Venue: BUCC
Publisher: Association for Computational Linguistics
Pages: 26–30
URL: https://preview.aclanthology.org/nschneid-patch-2/W17-2505/
DOI: 10.18653/v1/W17-2505
Cite (ACL): Dmitrijs Milajevs. 2017. Toward a Comparable Corpus of Latvian, Russian and English Tweets. In Proceedings of the 10th Workshop on Building and Using Comparable Corpora, pages 26–30, Vancouver, Canada. Association for Computational Linguistics.
Cite (Informal): Toward a Comparable Corpus of Latvian, Russian and English Tweets (Milajevs, BUCC 2017)
PDF: https://preview.aclanthology.org/nschneid-patch-2/W17-2505.pdf
