Optimizing a Supervised Classifier for a Difficult Language Identification Problem

Yves Bestgen


Abstract
This paper describes the system developed by the Laboratoire d’analyse statistique des textes for the Dravidian Language Identification (DLI) shared task of VarDial 2021. The task is particularly difficult because the material consists of short YouTube comments, written in Roman script, from three closely related Dravidian languages, plus a fourth category made up of several other languages in varying proportions, all mixed with English. The proposed system is a logistic regression model whose only features are character n-grams with a maximum length of 5. After optimization of both the feature weighting and the classifier parameters, it ranked first in the challenge. Additional analyses underline the importance of optimization, especially when effectiveness is measured with Macro-F1.
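To make the two technical ingredients named in the abstract concrete, the following is a minimal sketch, not the author's implementation. It shows one plausible way to count character n-grams up to length 5 and to compute Macro-F1 as the unweighted mean of per-class F1. The function names `char_ngrams` and `macro_f1`, the sample comment, and the labels `lang_a` and `lang_b` are hypothetical and chosen only for illustration. The feature weighting and the logistic regression step are deliberately left out, since the abstract does not specify them.

```python
from collections import Counter


def char_ngrams(text, max_n=5):
    """Count every character n-gram of length 1..max_n in text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical romanized comment, used only to show the feature extraction.
features = char_ngrams("nalla padam")

# Hypothetical labels: a classifier that always predicts the majority class.
gold = ["lang_a", "lang_a", "lang_a", "lang_b"]
pred = ["lang_a", "lang_a", "lang_a", "lang_a"]
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
score = macro_f1(gold, pred)
```

In this toy case accuracy is 0.75 while Macro-F1 is 3/7 (about 0.43), because the missed minority class contributes an F1 of 0 with the same weight as the majority class. This illustrates why a system tuned for accuracy may need further optimization when Macro-F1 is the evaluation measure.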
Anthology ID:
2021.vardial-1.11
Volume:
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects
Month:
April
Year:
2021
Address:
Kyiv, Ukraine
Editors:
Marcos Zampieri, Preslav Nakov, Nikola Ljubešić, Jörg Tiedemann, Yves Scherrer, Tommi Jauhiainen
Venue:
VarDial
Publisher:
Association for Computational Linguistics
Pages:
96–101
URL:
https://aclanthology.org/2021.vardial-1.11
Cite (ACL):
Yves Bestgen. 2021. Optimizing a Supervised Classifier for a Difficult Language Identification Problem. In Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects, pages 96–101, Kyiv, Ukraine. Association for Computational Linguistics.
Cite (Informal):
Optimizing a Supervised Classifier for a Difficult Language Identification Problem (Bestgen, VarDial 2021)
PDF:
https://preview.aclanthology.org/ingest-2024-clasp/2021.vardial-1.11.pdf