A reproduction of Apple’s bi-directional LSTM models for language identification in short strings
Mads Toftrup, Søren Asger Sørensen, Manuel R. Ciosici, Ira Assent
Abstract
Language Identification is the task of identifying a document’s language. For applications like automatic spell checker selection, language identification must use very short strings such as text message fragments. In this work, we reproduce a language identification architecture that Apple briefly sketched in a blog post. We confirm the bi-LSTM model’s performance and find that it outperforms current open-source language identifiers. We further find that its language identification mistakes are due to confusion between related languages.
- Anthology ID: 2021.eacl-srw.6
- Volume: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop
- Month: April
- Year: 2021
- Address: Online
- Editors: Ionut-Teodor Sorodoc, Madhumita Sushil, Ece Takmaz, Eneko Agirre
- Venue: EACL
- Publisher: Association for Computational Linguistics
- Pages: 36–42
- URL: https://preview.aclanthology.org/jlcl-multiple-ingestion/2021.eacl-srw.6/
- DOI: 10.18653/v1/2021.eacl-srw.6
- Cite (ACL): Mads Toftrup, Søren Asger Sørensen, Manuel R. Ciosici, and Ira Assent. 2021. A reproduction of Apple’s bi-directional LSTM models for language identification in short strings. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 36–42, Online. Association for Computational Linguistics.
- Cite (Informal): A reproduction of Apple’s bi-directional LSTM models for language identification in short strings (Toftrup et al., EACL 2021)
- PDF: https://preview.aclanthology.org/jlcl-multiple-ingestion/2021.eacl-srw.6.pdf
- Code: AU-DIS/LSTM_langid
- Data: OpenSubtitles, Universal Dependencies