An Assessment of Language Identification Methods on Tweets and Wikipedia Articles
Abstract
Language identification is the task of determining the language in which a given text is written. This task is important for Natural Language Processing and Information Retrieval activities. Two popular approaches for language identification are the N-gram and stopword models. In this paper, these two models were tested on different types of documents such as short, irregular texts (tweets) and long, regular texts (Wikipedia articles).
- Anthology ID:
- 2020.winlp-1.15
- Volume:
- Proceedings of the Fourth Widening Natural Language Processing Workshop
- Month:
- July
- Year:
- 2020
- Address:
- Seattle, USA
- Editors:
- Rossana Cunha, Samira Shaikh, Erika Varis, Ryan Georgi, Alicia Tsai, Antonios Anastasopoulos, Khyathi Raghavi Chandu
- Venue:
- WiNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 58–60
- URL:
- https://aclanthology.org/2020.winlp-1.15
- DOI:
- 10.18653/v1/2020.winlp-1.15
- Cite (ACL):
- Pedro Vernetti and Larissa Freitas. 2020. An Assessment of Language Identification Methods on Tweets and Wikipedia Articles. In Proceedings of the Fourth Widening Natural Language Processing Workshop, pages 58–60, Seattle, USA. Association for Computational Linguistics.
- Cite (Informal):
- An Assessment of Language Identification Methods on Tweets and Wikipedia Articles (Vernetti & Freitas, WiNLP 2020)
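The abstract names two families of language identification models, stopword-based and N-gram-based, without giving implementation details. The following is a minimal illustrative sketch of both ideas and is not the authors' implementation. The stopword lists, the training sentences, the use of character trigrams, the cosine-similarity scoring, and all function names are assumptions chosen only for demonstration.

```python
import math
from collections import Counter

# Tiny illustrative stopword lists (assumption: not taken from the paper).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "a"},
    "pt": {"o", "a", "e", "de", "que", "em", "um", "para"},
    "es": {"el", "la", "y", "de", "que", "en", "un", "los"},
}


def identify_by_stopwords(text):
    """Return the language whose stopword list matches the most tokens.

    Ties (including zero matches) fall back to the first language listed.
    """
    tokens = text.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    return max(scores, key=scores.get)


def char_ngrams(text, n=3):
    """Count character n-grams of a lowercased, whitespace-normalized text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_profiles(samples, n=3):
    """Build one n-gram profile per language from sample text."""
    return {lang: char_ngrams(text, n) for lang, text in samples.items()}


def identify_by_ngrams(text, profiles, n=3):
    """Return the language whose profile is most similar to the text."""
    grams = char_ngrams(text, n)
    return max(profiles, key=lambda lang: cosine(grams, profiles[lang]))


# Hypothetical training text; a realistic profile needs far more data.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "pt": "o rato roeu a roupa do rei de roma e a rainha",
}
PROFILES = build_profiles(SAMPLES)

print(identify_by_stopwords("the cat is in the house"))
print(identify_by_ngrams("the dog and the fox", PROFILES))
```

The stopword approach depends on whole tokens being present, so it can be expected to struggle on very short or irregular inputs, while character N-grams still produce a signal from partial or misspelled words. This is a general property of the two techniques and not a result reported on this page.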