Alina Maria Cristea

2021

pdf bib abs
Tracking Semantic Change in Cognate Sets for English and Romance Languages
Ana Sabina Uban | Alina Maria Cristea | Anca Dinu | Liviu P. Dinu | Simona Georgescu | Laurentiu Zoicas
Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021

Semantic divergence in related languages is a key concern of historical linguistics. We cross-linguistically investigate the semantic divergence of cognate pairs in English and Romance languages, by means of word embeddings. To this end, we introduce a new curated dataset of cognates in all pairs of those languages. We describe the types of errors that occurred during the automated cognate identification process and manually correct them. Additionally, we label the English cognates according to their etymology, separating them into two groups: old borrowings and recent borrowings. On this curated dataset, we analyse word properties such as frequency and polysemy, and the distribution of similarity scores between cognate sets in different languages. We automatically identify different clusters of English cognates, setting a new direction of research in cognates, borrowings and possibly false friends analysis in related languages.

pdf bib abs
Towards an Etymological Map of Romanian
Alina Maria Cristea | Anca Dinu | Liviu P. Dinu | Simona Georgescu | Ana Sabina Uban | Laurentiu Zoicas
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

In this paper we investigate the etymology of Romanian words. We start from the Romanian lexicon and automatically extract information from multiple etymological dictionaries. We evaluate the results and perform extensive quantitative and qualitative analyses with the goal of building an etymological map of the language.

pdf bib abs
Automatic Discrimination between Inherited and Borrowed Latin Words in Romance Languages
Alina Maria Cristea | Liviu P. Dinu | Simona Georgescu | Mihnea-Lucian Mihai | Ana Sabina Uban
Findings of the Association for Computational Linguistics: EMNLP 2021

In this paper, we address the problem of automatically discriminating between inherited and borrowed Latin words. We introduce a new dataset and investigate the case of Romance languages (Romanian, Italian, French, Spanish, Portuguese and Catalan), where words directly inherited from Latin coexist with words borrowed from Latin, and explore whether automatic discrimination between them is possible. Having entered the language at a later stage, borrowed words are no longer subject to historical sound shift rules, hence they are presumably less eroded, which is why we expect them to have a different intrinsic structure distinguishable by computational means. We employ several machine learning models to automatically discriminate between inherited and borrowed words and compare their performance with various feature sets. We analyze the models’ predictive power on two versions of the datasets, orthographic and phonetic. We also investigate whether prior knowledge of the etymon provides better results, employing n-gram character features extracted from the word-etymon pairs and from their alignment.

Co-authors

Mihnea-Lucian Mihai 1