Tommi Vatanen


Fixing paper assignments

  1. Please select all papers that belong to the same person.
  2. Indicate below which author they should be assigned to.
Provide a valid ORCID iD here. This will be used to match future papers to this author.
Provide the name of the school or the university where the author has received or will receive their highest degree (e.g., Ph.D. institution for researchers, or current affiliation for students). This will be used to form the new author page ID, if needed.

TODO: "submit" and "cancel" buttons here


2010

pdf bib
Language Identification of Short Text Segments with N-gram Models
Tommi Vatanen | Jaakko J. Väyrynen | Sami Virpioja
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

There are many accurate methods for language identification of long text samples, but identification of very short strings still presents a challenge. This paper studies a language identification task, in which the test samples have only 5-21 characters. We compare two distinct methods that are well suited for this task: a naive Bayes classifier based on character n-gram models, and the ranking method by Cavnar and Trenkle (1994). For the n-gram models, we test several standard smoothing techniques, including the current state-of-the-art, the modified Kneser-Ney interpolation. Experiments are conducted with 281 languages using the Universal Declaration of Human Rights. Advanced language model smoothing techniques improve the identification accuracy and the respective classifiers outperform the ranking method. The higher accuracy is obtained at the cost of larger models and slower classification speed. However, there are several methods to reduce the size of an n-gram model, and our experiments with model pruning show that it provides an easy way to balance the size and the identification accuracy. We also compare the results to the language identifier in Google AJAX Language API, using a subset of 50 languages.