Abstract
This contribution presents a contrastive subword n-gram model that was tested in the Discriminating between Similar Languages (DSL) shared task. I present and discuss the method used in this 14-way language identification task, which comprises varieties of 6 main language groups. The method has the following characteristics: (1) preprocessing and conversion of a collection of documents to sparse features; (2) weighted character n-gram profiles; (3) a multinomial Bayesian classifier. Meaningful bag-of-n-grams features can be used as a system in a straightforward way: my approach outperforms most of the systems used in the DSL shared task (3rd rank).
- Anthology ID:
- W17-1223
- Volume:
- Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)
- Month:
- April
- Year:
- 2017
- Address:
- Valencia, Spain
- Editors:
- Preslav Nakov, Marcos Zampieri, Nikola Ljubešić, Jörg Tiedemann, Shevin Malmasi, Ahmed Ali
- Venue:
- VarDial
- Publisher:
- Association for Computational Linguistics
- Pages:
- 184–189
- URL:
- https://aclanthology.org/W17-1223
- DOI:
- 10.18653/v1/W17-1223
- Cite (ACL):
- Adrien Barbaresi. 2017. Discriminating between Similar Languages using Weighted Subword Features. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 184–189, Valencia, Spain. Association for Computational Linguistics.
- Cite (Informal):
- Discriminating between Similar Languages using Weighted Subword Features (Barbaresi, VarDial 2017)
- PDF:
- https://preview.aclanthology.org/ingest-bitext-workshop/W17-1223.pdf
- Code
- adbar/vardial-experiments
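
The three components listed in the abstract (sparse features, weighted character n-gram profiles, a multinomial Bayesian classifier) can be illustrated with a minimal standard-library sketch. This is not the code from the repository above and not necessarily the author's configuration: the n-gram range (2 to 4), the TF-IDF-style weighting, the smoothing value, the names `char_ngrams` and `WeightedNgramNB`, and the toy Spanish/Portuguese sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=4):
    """Return the character n-grams (lengths n_min..n_max) of a padded, lowercased text."""
    padded = f" {text.lower()} "
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


class WeightedNgramNB:
    """Multinomial naive Bayes over TF-IDF-weighted character n-grams (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive smoothing; the value is an assumption

    def fit(self, texts, labels):
        # Step 1: convert each document to a sparse n-gram count profile.
        docs = [Counter(char_ngrams(t)) for t in texts]
        doc_freq = Counter()
        for doc in docs:
            doc_freq.update(doc.keys())
        n_docs = len(docs)
        # Step 2: smoothed inverse document frequency, so rarer n-grams weigh more.
        self.idf = {
            g: math.log((1 + n_docs) / (1 + df)) + 1.0 for g, df in doc_freq.items()
        }
        # Step 3: accumulate weighted n-gram mass per class.
        self.class_weights = {}
        for doc, label in zip(docs, labels):
            weights = self.class_weights.setdefault(label, Counter())
            for gram, tf in doc.items():
                weights[gram] += tf * self.idf[gram]
        class_counts = Counter(labels)
        self.log_prior = {c: math.log(k / n_docs) for c, k in class_counts.items()}
        self.class_total = {c: sum(w.values()) for c, w in self.class_weights.items()}
        return self

    def predict(self, text):
        counts = Counter(char_ngrams(text))
        vocab_size = len(self.idf)
        scores = {}
        for label, weights in self.class_weights.items():
            denom = self.class_total[label] + self.alpha * vocab_size
            score = self.log_prior[label]
            for gram, tf in counts.items():
                if gram in self.idf:  # n-grams unseen in training are ignored
                    likelihood = (weights[gram] + self.alpha) / denom
                    score += tf * self.idf[gram] * math.log(likelihood)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy training data, invented for this example (not from the DSL corpus).
train_texts = [
    "el perro come en la casa",
    "la niña lee un libro en la escuela",
    "o cachorro come na casa",
    "a menina lê um livro na escola",
]
train_labels = ["es", "es", "pt", "pt"]

model = WeightedNgramNB().fit(train_texts, train_labels)
print(model.predict("la escuela"), model.predict("na escola"))
```

With real data, the same pipeline would be trained on the shared task's 14 classes; the sketch only shows how character n-gram weights feed the class-conditional likelihoods.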