Cognate prediction is a key task in historical linguistics and has much in common with machine translation. However, while neural methods have flourished in machine translation, they remain largely absent from the tools used in historical linguistics. In this paper, we therefore study how the neural methods used in translation (encoder-decoder networks) perform on the cognate prediction task. We focus in particular on the types of data that can be used for training, and we compare the results obtained by statistical and neural methods across different types of data. We show that phonetic correspondences can only be learned from cognate pairs, and that statistical and neural methods seem to have complementary strengths and weaknesses in what they learn from the data.
Cognate prediction and proto-form reconstruction are key tasks in computational historical linguistics that rely on the study of sound change regularity. Solving these tasks appears to be very similar to machine translation, though methods from that field have barely been applied to historical linguistics. Therefore, in this paper, we investigate the learnability of sound correspondences between a proto-language and daughter languages for two machine-translation-inspired models, one statistical, the other neural. We first carry out our experiments on plausible artificial languages, without noise, in order to study the effect of each parameter on the algorithms' respective performance under almost perfect conditions. We then study real languages, namely Latin, Italian and Spanish, to see whether these results generalise. We show that both model types manage to learn sound changes despite data scarcity, although the best-performing model type depends on several parameters, such as the size of the training data, the ambiguity, and the prediction direction.
Diachronic lexical information is not only important in the field of historical linguistics, but is also increasingly used in NLP, most recently for machine translation of low-resource languages. Therefore, there is a need for fine-grained, large-coverage and accurate etymological lexical resources. In this paper, we propose a set of guidelines for generating such resources, covering each step of the life-cycle of an etymological lexicon: creation, update, evaluation, dissemination, and exploitation. To illustrate the guidelines, we introduce EtymDB 2.0, an etymological database automatically generated from Wiktionary, which contains 1.8 million lexemes, linked by more than 700,000 fine-grained etymological relations, across 2,536 living and dead languages. We also introduce use cases for which EtymDB 2.0 could represent a key resource, such as phylogenetic tree generation, low-resource machine translation, or the study of medieval languages.