Automatic Discrimination between Inherited and Borrowed Latin Words in Romance Languages
Alina Maria Cristea, Liviu P. Dinu, Simona Georgescu, Mihnea-Lucian Mihai, Ana Sabina Uban
Abstract
In this paper, we address the problem of automatically discriminating between inherited and borrowed Latin words. We introduce a new dataset and investigate the case of Romance languages (Romanian, Italian, French, Spanish, Portuguese and Catalan), where words directly inherited from Latin coexist with words borrowed from Latin, and explore whether automatic discrimination between them is possible. Having entered the language at a later stage, borrowed words are no longer subject to historical sound shift rules, hence they are presumably less eroded, which is why we expect them to have a different intrinsic structure distinguishable by computational means. We employ several machine learning models to automatically discriminate between inherited and borrowed words and compare their performance with various feature sets. We analyze the models’ predictive power on two versions of the datasets, orthographic and phonetic. We also investigate whether prior knowledge of the etymon provides better results, employing n-gram character features extracted from the word-etymon pairs and from their alignment.
- Anthology ID:
- 2021.findings-emnlp.243
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2021
- Month:
- November
- Year:
- 2021
- Address:
- Punta Cana, Dominican Republic
- Editors:
- Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
- Venue:
- Findings
- SIG:
- SIGDAT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 2845–2855
- URL:
- https://aclanthology.org/2021.findings-emnlp.243
- DOI:
- 10.18653/v1/2021.findings-emnlp.243
- Cite (ACL):
- Alina Maria Cristea, Liviu P. Dinu, Simona Georgescu, Mihnea-Lucian Mihai, and Ana Sabina Uban. 2021. Automatic Discrimination between Inherited and Borrowed Latin Words in Romance Languages. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2845–2855, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Cite (Informal):
- Automatic Discrimination between Inherited and Borrowed Latin Words in Romance Languages (Cristea et al., Findings 2021)
- PDF:
- https://preview.aclanthology.org/proper-vol2-ingestion/2021.findings-emnlp.243.pdf
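The abstract mentions n-gram character features extracted from word-etymon pairs. As a rough illustration only, and not the authors' implementation, the sketch below counts character n-grams for a word and for its Latin etymon and tags each feature with its source. The function names `char_ngrams` and `pair_features`, the `$` boundary marker, the `w:`/`e:` prefixes, and the n-gram sizes are all assumptions made for this example. Alignment-based features are not covered here. The example uses the well-known Spanish doublet from Latin plenus: lleno (inherited) and pleno (borrowed).

```python
from collections import Counter


def char_ngrams(word, n_max=3, pad="$"):
    """Count character n-grams of sizes 1..n_max.

    The word is padded with a boundary marker so that word-initial and
    word-final sequences get their own features (an assumption of this
    sketch, not something stated in the abstract).
    """
    padded = f"{pad}{word}{pad}"
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            # Skip n-grams made only of padding; they carry no information.
            if gram.strip(pad) == "":
                continue
            counts[gram] += 1
    return counts


def pair_features(word, etymon, n_max=3):
    """Combine n-gram counts of a word and its etymon into one feature bag.

    Features are prefixed with "w:" (modern word) or "e:" (etymon) so a
    classifier can tell the two sources apart. The prefixes are hypothetical.
    """
    feats = Counter()
    for gram, count in char_ngrams(word, n_max).items():
        feats[f"w:{gram}"] += count
    for gram, count in char_ngrams(etymon, n_max).items():
        feats[f"e:{gram}"] += count
    return feats


# Spanish doublet from Latin "plenus": inherited "lleno", borrowed "pleno".
inherited = pair_features("lleno", "plenus")
borrowed = pair_features("pleno", "plenus")

# The borrowed form keeps the Latin onset "pl"; the inherited form does not.
print("w:$pl" in borrowed, "w:$pl" in inherited)
```

Feature bags like these could then be passed to any standard classifier. The abstract does not name the models, so none is assumed here.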