An Open Dataset and Model for Language Identification

Laurie Burchell, Alexandra Birch, Nikolay Bogoychev, Kenneth Heafield


Abstract
Language identification (LID) is a fundamental step in many natural language processing pipelines. However, current LID systems are far from perfect, particularly on lower-resource languages. We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033% across 201 languages, outperforming previous work. We achieve this by training on a curated dataset of monolingual data, which we audit manually to ensure reliability. We make both the model and the dataset available to the research community. Finally, we carry out detailed analysis into our model’s performance, both in comparison to existing open models and by language class.
Anthology ID:
2023.acl-short.75
Volume:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
865–879
URL:
https://aclanthology.org/2023.acl-short.75
DOI:
10.18653/v1/2023.acl-short.75
Cite (ACL):
Laurie Burchell, Alexandra Birch, Nikolay Bogoychev, and Kenneth Heafield. 2023. An Open Dataset and Model for Language Identification. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 865–879, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
An Open Dataset and Model for Language Identification (Burchell et al., ACL 2023)
PDF:
https://preview.aclanthology.org/naacl24-info/2023.acl-short.75.pdf
Video:
https://preview.aclanthology.org/naacl24-info/2023.acl-short.75.mp4