Abstract
We applied word unigram models, character ngram models, and CNNs to the task of distinguishing tweets written in two related dialects of Romanian (standard Romanian and Moldavian) for the VarDial 2020 RDI shared task (Gaman et al. 2020). The main challenge of the task was cross-genre text classification: the models had to be trained on text from news articles and then used to classify tweets. Our best model was a Naive Bayes model trained on character ngrams, with the most common ngrams filtered out. We also applied SVMs and CNNs, but while they yielded the best performance on an evaluation dataset of news articles, their accuracy dropped significantly when they were used to classify tweets. Our best model reached an F1 score of 0.715 on the evaluation dataset of tweets, and 0.667 on the held-out test dataset. The model placed third in the shared task.
- Anthology ID:
- 2020.vardial-1.25
- Volume:
- Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects
- Month:
- December
- Year:
- 2020
- Address:
- Barcelona, Spain (Online)
- Editors:
- Marcos Zampieri, Preslav Nakov, Nikola Ljubešić, Jörg Tiedemann, Yves Scherrer
- Venue:
- VarDial
- Publisher:
- International Committee on Computational Linguistics (ICCL)
- Pages:
- 265–272
- URL:
- https://aclanthology.org/2020.vardial-1.25
- Cite (ACL):
- Andrea Ceolin and Hong Zhang. 2020. Discriminating between standard Romanian and Moldavian tweets using filtered character ngrams. In Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, pages 265–272, Barcelona, Spain (Online). International Committee on Computational Linguistics (ICCL).
- Cite (Informal):
- Discriminating between standard Romanian and Moldavian tweets using filtered character ngrams (Ceolin & Zhang, VarDial 2020)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-2/2020.vardial-1.25.pdf
- Code:
- AndreaCeolin/VarDial2020
- Data:
- MOROCO
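
The approach named in the abstract, a Naive Bayes classifier over character ngrams with the most common ngrams filtered out, can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that): the function names, the ngram length range, the `drop_top` cutoff, the add-one smoothing, and the toy inputs are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=2, n_max=4):
    """Return every character ngram of length n_min..n_max, shortest first."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def train(docs, labels, drop_top=2, n_min=2, n_max=4):
    """Fit a multinomial Naive Bayes model on character ngrams.

    The drop_top most frequent ngrams across the whole training set are
    removed from the vocabulary (hypothetical cutoff, not from the paper).
    """
    total = Counter()
    per_class = defaultdict(Counter)
    for text, label in zip(docs, labels):
        grams = char_ngrams(text, n_min, n_max)
        total.update(grams)
        per_class[label].update(grams)

    filtered = {gram for gram, _ in total.most_common(drop_top)}
    vocab = set(total) - filtered

    class_docs = Counter(labels)
    model = {"vocab": vocab, "n": (n_min, n_max), "prior": {}, "loglik": {}}
    for label, counts in per_class.items():
        model["prior"][label] = math.log(class_docs[label] / len(labels))
        # Add-one smoothing over the filtered vocabulary.
        denom = sum(counts[g] for g in vocab) + len(vocab)
        model["loglik"][label] = {
            g: math.log((counts[g] + 1) / denom) for g in vocab
        }
    return model


def predict(model, text):
    """Return the label with the highest posterior log-score for text."""
    n_min, n_max = model["n"]
    grams = [g for g in char_ngrams(text, n_min, n_max) if g in model["vocab"]]
    scores = {
        label: model["prior"][label]
        + sum(model["loglik"][label][g] for g in grams)
        for label in model["prior"]
    }
    return max(scores, key=scores.get)


# Toy usage: "ab" and "ba" dominate both classes, so they are filtered out
# and only the class-specific bigrams remain informative.
toy_model = train(["ababab xx", "ababab yy"], ["X", "Y"],
                  drop_top=2, n_min=2, n_max=2)
toy_label = predict(toy_model, "xx")
```

In this sketch the filter is a fixed count of top ngrams; a frequency threshold would be an equally plausible reading of "most common ngrams filtered out", and the abstract does not say which was used.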