Restoring the Sister: Reconstructing a Lexicon from Sister Languages using Neural Machine Translation

Remo Nitschke


Abstract
The historical comparative method has a long history in historical linguistics. It describes a process by which historical linguists aim to reverse-engineer the historical developments of language families in order to reconstruct proto-forms and familial relations between languages. In recent years, there have been multiple attempts to replicate this process through machine learning, especially in the realm of cognate detection (List et al., 2016; Ciobanu and Dinu, 2014; Rama et al., 2018). So far, most of these experiments aimed at actual reconstruction have attempted the prediction of a proto-form from the forms of the daughter languages (Ciobanu and Dinu, 2018; Meloni et al., 2019). Here, we propose a reimplementation that uses modern related languages, or sisters, instead, to reconstruct the vocabulary of a target language. In particular, we show that we can reconstruct the vocabulary of a target language by using a fairly small data set of parallel cognates from different sister languages, using a neural machine translation (NMT) architecture with a standard encoder-decoder setup. This effort is directly in furtherance of the goal to use machine learning tools to help under-served language communities in their efforts at reclaiming, preserving, or reconstructing their own languages.
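To make the setup described in the abstract more concrete, the following is a minimal sketch of how one set of parallel cognates might be serialized into a character-level source/target pair for an encoder-decoder model. It is an illustration only, not the paper's actual preprocessing: the function name `make_pair`, the language-tag format (`<es>`, `<fr>`, ...), and the choice of languages are assumptions introduced here.

```python
from typing import Dict, List, Tuple


def make_pair(cognates: Dict[str, str], target_lang: str) -> Tuple[List[str], List[str]]:
    """Turn one cognate set into a character-level (source, target) pair.

    Hypothetical helper: the sister-language forms are concatenated into one
    source sequence, each preceded by a language tag, and the target-language
    form becomes the sequence the decoder should produce.
    """
    if target_lang not in cognates:
        raise KeyError(f"no form for target language {target_lang!r}")

    source: List[str] = []
    # Sort language codes so every training example uses the same order.
    for lang in sorted(cognates):
        if lang == target_lang:
            continue
        source.append(f"<{lang}>")     # tag marking which sister follows
        source.extend(cognates[lang])  # one token per character

    target = list(cognates[target_lang])
    return source, target


# Illustrative cognate set ("night" in four Romance languages).
cognate_set = {"es": "noche", "fr": "nuit", "pt": "noite", "it": "notte"}
src, tgt = make_pair(cognate_set, target_lang="it")
print(" ".join(src))
print(" ".join(tgt))
```

Pairs of this shape could then be passed to any standard sequence-to-sequence toolkit; the tagging scheme is just one plausible way to let a single encoder read several sister forms at once.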
Anthology ID:
2021.americasnlp-1.13
Volume:
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas
Month:
June
Year:
2021
Address:
Online
Editors:
Manuel Mager, Arturo Oncevay, Annette Rios, Ivan Vladimir Meza Ruiz, Alexis Palmer, Graham Neubig, Katharina Kann
Venue:
AmericasNLP
Publisher:
Association for Computational Linguistics
Pages:
122–130
URL:
https://aclanthology.org/2021.americasnlp-1.13
DOI:
10.18653/v1/2021.americasnlp-1.13
Cite (ACL):
Remo Nitschke. 2021. Restoring the Sister: Reconstructing a Lexicon from Sister Languages using Neural Machine Translation. In Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, pages 122–130, Online. Association for Computational Linguistics.
Cite (Informal):
Restoring the Sister: Reconstructing a Lexicon from Sister Languages using Neural Machine Translation (Nitschke, AmericasNLP 2021)
PDF:
https://preview.aclanthology.org/nschneid-patch-3/2021.americasnlp-1.13.pdf
Code:
remo-help/restoring_the_sister