An Assessment of Language Identification Methods on Tweets and Wikipedia Articles
Pedro Vernetti | Larissa Freitas
Proceedings of the Fourth Widening Natural Language Processing Workshop

Language identification is the task of determining the language in which a given text is written. This task is important for Natural Language Processing and Information Retrieval activities. Two popular approaches to language identification are the n-gram and stopword models. In this paper, these two models were tested on different types of documents: short, irregular texts (tweets) and long, regular texts (Wikipedia articles).
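
To make the two approaches concrete, the following is a minimal sketch, not the authors' implementation. It assumes a character-trigram profile compared by cosine similarity for the n-gram model and a simple stopword-overlap count for the stopword model. The stopword lists, the training sentences, and the function names (`identify_by_ngrams`, `identify_by_stopwords`) are all hypothetical and chosen only for illustration.

```python
from collections import Counter
import math

# Tiny illustrative stopword lists (assumption: real systems use much larger lists).
STOPWORDS = {
    "en": {"the", "is", "and", "of", "to", "in", "a", "that"},
    "pt": {"o", "a", "e", "de", "que", "em", "um", "para", "é"},
    "es": {"el", "la", "y", "de", "que", "en", "un", "para", "es"},
}

# Tiny illustrative training samples for the n-gram profiles (assumption:
# real profiles are built from large corpora, not one sentence per language).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and runs to the river",
    "pt": "a rápida raposa marrom pula sobre o cão preguiçoso e corre para o rio",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y corre hacia el río",
}


def char_ngrams(text, n=3):
    """Count character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# One trigram profile per language, built from the sample sentences.
PROFILES = {lang: char_ngrams(sample) for lang, sample in SAMPLES.items()}


def identify_by_ngrams(text):
    """Return the language whose trigram profile is most similar to the text."""
    grams = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(grams, PROFILES[lang]))


def identify_by_stopwords(text):
    """Return the language whose stopword list matches the most tokens."""
    tokens = text.lower().split()
    return max(STOPWORDS, key=lambda lang: sum(t in STOPWORDS[lang] for t in tokens))
```

In this sketch a short input with few function words gives the stopword model little evidence to work with, whereas the trigram model still receives a signal from every character. That difference is one reason the two models can be expected to behave differently on short tweets and on long Wikipedia articles.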