Abstract
This paper describes the system developed by the Laboratoire d’analyse statistique des textes for the Dravidian Language Identification (DLI) shared task of VarDial 2021. This task is particularly difficult because the material consists of short YouTube comments, written in Roman script, from three closely related Dravidian languages, plus a fourth category consisting of several other languages in varying proportions, all mixed with English. The proposed system is a logistic regression model whose only features are character n-grams with a maximum length of 5. After optimization of both the feature weighting and the classifier parameters, it ranked first in the challenge. The additional analyses carried out underline the importance of optimization, especially when the measure of effectiveness is the Macro-F1.
- Anthology ID:
- 2021.vardial-1.11
- Volume:
- Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects
- Month:
- April
- Year:
- 2021
- Address:
- Kiyv, Ukraine
- Editors:
- Marcos Zampieri, Preslav Nakov, Nikola Ljubešić, Jörg Tiedemann, Yves Scherrer, Tommi Jauhiainen
- Venue:
- VarDial
- Publisher:
- Association for Computational Linguistics
- Pages:
- 96–101
- URL:
- https://aclanthology.org/2021.vardial-1.11
- Cite (ACL):
- Yves Bestgen. 2021. Optimizing a Supervised Classifier for a Difficult Language Identification Problem. In Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, pages 96–101, Kiyv, Ukraine. Association for Computational Linguistics.
- Cite (Informal):
- Optimizing a Supervised Classifier for a Difficult Language Identification Problem (Bestgen, VarDial 2021)
- PDF:
- https://preview.aclanthology.org/ingest-2024-clasp/2021.vardial-1.11.pdf
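
The abstract states that the classifier's only features are character n-grams with a maximum length of 5. As a rough illustration of what such features look like, the sketch below counts every character n-gram in a short comment. The function name `char_ngrams`, the minimum length of 1, and the sample comment are assumptions made for this example; the abstract does not specify the paper's preprocessing or feature weighting, and none of that is reproduced here.

```python
from collections import Counter


def char_ngrams(text: str, max_n: int = 5) -> Counter:
    """Count all character n-grams of length 1..max_n in `text`.

    Hypothetical helper: the lower bound of 1 is an assumption,
    only the upper bound of 5 comes from the abstract.
    """
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n over the text, spaces included.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# Made-up Roman-script comment, not taken from the shared-task data.
features = char_ngrams("super da")
for gram, count in features.most_common(5):
    print(repr(gram), count)
```

Counts like these would typically be weighted and passed to a logistic regression model; the abstract reports that tuning both the weighting and the classifier parameters was important for the final ranking.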