A language-independent and fully unsupervised approach to lexicon induction and part-of-speech tagging for closely related languages

Yves Scherrer, Benoît Sagot


Abstract
In this paper, we describe a generic approach for transferring part-of-speech annotations from a resourced language to an etymologically closely related non-resourced language, without using any bilingual (i.e., parallel) data. First, we induce a translation lexicon from monolingual corpora, based on cognate detection followed by cross-lingual contextual similarity. Second, POS information is transferred from the resourced language along translation pairs to the non-resourced language and used for tagging the corpus. We evaluate our methods on three language families, consisting of five Romance languages, three Germanic languages and five Slavic languages. We obtain tagging accuracies of up to 91.6%.
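The two steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how cognates are detected, so the sketch substitutes a plain string-similarity ratio from Python's standard library, and it omits the contextual-similarity step entirely. The function names, the 0.7 threshold, the "UNK" fallback tag, and the toy word forms are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def induce_lexicon(target_words, source_words, threshold=0.7):
    """Pair each target word with its most similar source word.

    String similarity is only a crude stand-in for cognate detection
    (an assumption of this sketch, not the method of the paper).
    """
    lexicon = {}
    for t in target_words:
        best, best_score = None, 0.0
        for s in source_words:
            score = SequenceMatcher(None, t, s).ratio()
            if score > best_score:
                best, best_score = s, score
        # Keep the pair only if it is similar enough to be a plausible cognate.
        if best is not None and best_score >= threshold:
            lexicon[t] = best
    return lexicon


def transfer_tags(tokens, lexicon, source_tags, default="UNK"):
    """Tag target tokens with the POS of their induced source translation."""
    return [(tok, source_tags.get(lexicon.get(tok), default)) for tok in tokens]


# Hypothetical toy data: a tagged source lexicon and an untagged target sentence.
source_tags = {"el": "DET", "gato": "NOUN", "come": "VERB"}
target_tokens = ["el", "gat", "menja"]

lexicon = induce_lexicon(target_tokens, source_tags.keys())
tagged = transfer_tags(target_tokens, lexicon, source_tags)
print(lexicon)
print(tagged)
```

In this toy run, the target word that has no similar-looking source word receives the fallback tag. Cases of that kind are where a second signal, such as the cross-lingual contextual similarity mentioned in the abstract, would be needed.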
Anthology ID:
L14-1619
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
502–508
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/797_Paper.pdf
Cite (ACL):
Yves Scherrer and Benoît Sagot. 2014. A language-independent and fully unsupervised approach to lexicon induction and part-of-speech tagging for closely related languages. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 502–508, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
A language-independent and fully unsupervised approach to lexicon induction and part-of-speech tagging for closely related languages (Scherrer & Sagot, LREC 2014)
PDF:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/797_Paper.pdf