Abstract
This paper introduces HeLI-OTS, an off-the-shelf text language identification tool using the HeLI language identification method. The HeLI-OTS language identifier is equipped with language models for 200 languages and licensed for academic as well as commercial use. We present the HeLI method and its use in our previous research. Then we compare the performance of the HeLI-OTS language identifier with that of fastText on two different data sets, showing that fastText favors the recall of common languages, whereas HeLI-OTS reaches both high recall and high precision for all languages. While introducing existing off-the-shelf language identification tools, we also give a picture of digital humanities-related research that uses such tools. The validity of the results of such research depends on the results given by the language identifier used, and especially for research focusing on the less common languages, the tendency to favor widely used languages might be very detrimental, which HeLI-OTS is now able to remedy.
- Anthology ID:
- 2022.lrec-1.416
- Volume:
- Proceedings of the Thirteenth Language Resources and Evaluation Conference
- Month:
- June
- Year:
- 2022
- Address:
- Marseille, France
- Editors:
- Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association
- Pages:
- 3912–3922
- URL:
- https://aclanthology.org/2022.lrec-1.416
- Cite (ACL):
- Tommi Jauhiainen, Heidi Jauhiainen, and Krister Lindén. 2022. HeLI-OTS, Off-the-shelf Language Identifier for Text. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3912–3922, Marseille, France. European Language Resources Association.
- Cite (Informal):
- HeLI-OTS, Off-the-shelf Language Identifier for Text (Jauhiainen et al., LREC 2022)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-5/2022.lrec-1.416.pdf