Abstract
We describe the systems entered by the National Research Council in the 2016 shared task on discriminating similar languages. Like previous years, we relied on character ngram features, and a mixture of discriminative and generative statistical classifiers. We mostly investigated the influence of the amount of data on the performance, in the open task, and compared the two-stage approach (predicting language/group, then variant) to a flat approach. Results suggest that ngrams are still state-of-the-art for language and variant identification, and that additional data has a small but decisive impact.- Anthology ID:
- W16-4823
- Volume:
- Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
- Month:
- December
- Year:
- 2016
- Address:
- Osaka, Japan
- Editors:
- Preslav Nakov, Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi
- Venue:
- VarDial
- SIG:
- Publisher:
- The COLING 2016 Organizing Committee
- Note:
- Pages:
- 178–184
- Language:
- URL:
- https://aclanthology.org/W16-4823
- DOI:
- Cite (ACL):
- Cyril Goutte and Serge Léger. 2016. Advances in Ngram-based Discrimination of Similar Languages. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 178–184, Osaka, Japan. The COLING 2016 Organizing Committee.
- Cite (Informal):
- Advances in Ngram-based Discrimination of Similar Languages (Goutte & Léger, VarDial 2016)
- PDF:
- https://preview.aclanthology.org/ingest-acl-2023-videos/W16-4823.pdf