A reproduction of Apple’s bi-directional LSTM models for language identification in short strings

Mads Toftrup; Søren Asger Sørensen; Manuel R. Ciosici; Ira Assent

doi:10.18653/v1/2021.eacl-srw.6

Abstract

Language Identification is the task of identifying a document’s language. For applications like automatic spell checker selection, language identification must use very short strings such as text message fragments. In this work, we reproduce a language identification architecture that Apple briefly sketched in a blog post. We confirm the bi-LSTM model’s performance and find that it outperforms current open-source language identifiers. We further find that its language identification mistakes are due to confusion between related languages.

Anthology ID: 2021.eacl-srw.6
Volume: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop
Month: April
Year: 2021
Address: Online
Editors: Ionut-Teodor Sorodoc, Madhumita Sushil, Ece Takmaz, Eneko Agirre
Venue: EACL
Publisher: Association for Computational Linguistics
Pages: 36–42
URL: https://preview.aclanthology.org/jlcl-multiple-ingestion/2021.eacl-srw.6/
DOI: 10.18653/v1/2021.eacl-srw.6
Cite (ACL): Mads Toftrup, Søren Asger Sørensen, Manuel R. Ciosici, and Ira Assent. 2021. A reproduction of Apple’s bi-directional LSTM models for language identification in short strings. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 36–42, Online. Association for Computational Linguistics.
Cite (Informal): A reproduction of Apple’s bi-directional LSTM models for language identification in short strings (Toftrup et al., EACL 2021)
PDF: https://preview.aclanthology.org/jlcl-multiple-ingestion/2021.eacl-srw.6.pdf
Code: AU-DIS/LSTM_langid
Data: OpenSubtitles, Universal Dependencies
