Abstract
In this study conducted on the occasion of the Discriminating between Similar Languages shared task, I introduce an additional decision factor focusing on the token and subtoken level. The motivation behind this submission is to test whether a morphologically-informed criterion can add linguistically relevant information to global categorization and thus improve performance. The contributions of this paper are (1) a description of the unsupervised, low-resource method; (2) an evaluation and analysis of its raw performance; and (3) an assessment of its impact within a model comprising common indicators used in language identification. I present and discuss the systems used in the task A, a 12-way language identification task comprising varieties of five main language groups. Additionally I introduce a new off-the-shelf Naive Bayes classifier using a contrastive word and subword n-gram model (“Bayesline”) which outperforms the best submissions.- Anthology ID:
- W16-4827
- Volume:
- Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
- Month:
- December
- Year:
- 2016
- Address:
- Osaka, Japan
- Editors:
- Preslav Nakov, Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi
- Venue:
- VarDial
- SIG:
- Publisher:
- The COLING 2016 Organizing Committee
- Note:
- Pages:
- 212–220
- Language:
- URL:
- https://aclanthology.org/W16-4827
- DOI:
- Cite (ACL):
- Adrien Barbaresi. 2016. An Unsupervised Morphological Criterion for Discriminating Similar Languages. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 212–220, Osaka, Japan. The COLING 2016 Organizing Committee.
- Cite (Informal):
- An Unsupervised Morphological Criterion for Discriminating Similar Languages (Barbaresi, VarDial 2016)
- PDF:
- https://preview.aclanthology.org/fix-dup-bibkey/W16-4827.pdf