Vocabulary-Based Language Similarity using Web Corpora

Dirk Goldhahn, Uwe Quasthoff


Abstract
This paper focuses on the evaluation of automatic methods for quantifying language similarity. Language similarity is approximated by the similarity of text corpora. Corpus similarity is first determined by how closely the vocabularies of the languages resemble each other. To this end, words or parts of words, such as letter n-grams, are examined. Extensions such as transliteration of the text data make the methods independent of text characteristics such as the writing system used. Further analyses show to what extent knowledge about the distribution of words in parallel text can be used in the context of language similarity.
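As a rough illustration of the letter n-gram idea mentioned in the abstract, the following minimal sketch compares two text samples by the cosine similarity of their letter-trigram counts. The function names (`ngram_profile`, `cosine_similarity`), the boundary marker `_`, the choice of trigrams, and the use of cosine similarity are all assumptions made for this example; the abstract does not specify which measure the paper uses.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text: str, n: int = 3) -> Counter:
    """Count letter n-grams inside each whitespace-separated word.

    Hypothetical helper: words are lowercased and padded with "_" so that
    word-initial and word-final n-grams are kept distinct.
    """
    profile = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"
        for i in range(len(padded) - n + 1):
            profile[padded[i:i + n]] += 1
    return profile


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two n-gram count vectors (assumed measure)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile shares nothing with any other profile
    return dot / (norm_a * norm_b)


# Toy samples, not corpus data: three short phrases in different languages.
german = ngram_profile("das haus ist gross")
dutch = ngram_profile("het huis is groot")
finnish = ngram_profile("talo on suuri")

print(cosine_similarity(german, dutch))
print(cosine_similarity(german, finnish))
```

On these toy phrases the German and Dutch samples share a few trigrams while the German and Finnish samples share none, so the first score is higher. Real use would require large corpora and, for languages with different writing systems, a transliteration step before counting.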
Anthology ID:
L14-1373
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3294–3299
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/435_Paper.pdf
Cite (ACL):
Dirk Goldhahn and Uwe Quasthoff. 2014. Vocabulary-Based Language Similarity using Web Corpora. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3294–3299, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
Vocabulary-Based Language Similarity using Web Corpora (Goldhahn & Quasthoff, LREC 2014)