Simona Georgescu

2023

pdf abs
RoBoCoP: A Comprehensive ROmance BOrrowing COgnate Package and Benchmark for Multilingual Cognate Identification
Liviu Dinu | Ana Uban | Alina Cristea | Anca Dinu | Ioan-Bogdan Iordache | Simona Georgescu | Laurentiu Zoicas
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

The identification of cognates is a fundamental process in historical linguistics, on which any further research is based. Even though there are several cognate databases for Romance languages, they are rather scattered, incomplete, noisy, contain unreliable information, or have uncertain availability. In this paper we introduce a comprehensive database of Romance cognates and borrowings based on the etymological information provided by the dictionaries. We extract pairs of cognates between any two Romance languages by parsing electronic dictionaries of Romanian, Italian, Spanish, Portuguese and French. Based on this resource, we propose a strong benchmark for the automatic detection of cognates, by applying machine learning and deep learning based methods on any two pairs of Romance languages. We find that automatic identification of cognates is possible with accuracy averaging around 94% for the more difficult task formulations.

2022

pdf abs
CoToHiLi at LSCDiscovery: the Role of Linguistic Features in Predicting Semantic Change
Ana Sabina Uban | Alina Maria Cristea | Anca Daniela Dinu | Liviu P Dinu | Simona Georgescu | Laurentiu Zoicas
Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change

This paper presents the contributions of the CoToHiLi team for the LSCDiscovery shared task on semantic change in the Spanish language. We participated in both tasks (graded discovery and binary change, including sense gain and sense loss) and proposed models based on word embedding distances combined with hand-crafted linguistic features, including polysemy, number of neological synonyms, and relation to cognates in English. We find that models that include linguistically informed features combined using weights assigned manually by experts lead to promising results.

2021

pdf abs
Automatic Discrimination between Inherited and Borrowed Latin Words in Romance Languages
Alina Maria Cristea | Liviu P. Dinu | Simona Georgescu | Mihnea-Lucian Mihai | Ana Sabina Uban
Findings of the Association for Computational Linguistics: EMNLP 2021

In this paper, we address the problem of automatically discriminating between inherited and borrowed Latin words. We introduce a new dataset and investigate the case of Romance languages (Romanian, Italian, French, Spanish, Portuguese and Catalan), where words directly inherited from Latin coexist with words borrowed from Latin, and explore whether automatic discrimination between them is possible. Having entered the language at a later stage, borrowed words are no longer subject to historical sound shift rules, hence they are presumably less eroded, which is why we expect them to have a different intrinsic structure distinguishable by computational means. We employ several machine learning models to automatically discriminate between inherited and borrowed words and compare their performance with various feature sets. We analyze the models’ predictive power on two versions of the datasets, orthographic and phonetic. We also investigate whether prior knowledge of the etymon provides better results, employing n-gram character features extracted from the word-etymon pairs and from their alignment.

pdf abs
Tracking Semantic Change in Cognate Sets for English and Romance Languages
Ana Sabina Uban | Alina Maria Cristea | Anca Dinu | Liviu P. Dinu | Simona Georgescu | Laurentiu Zoicas
Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021

Semantic divergence in related languages is a key concern of historical linguistics. We cross-linguistically investigate the semantic divergence of cognate pairs in English and Romance languages, by means of word embeddings. To this end, we introduce a new curated dataset of cognates in all pairs of those languages. We describe the types of errors that occurred during the automated cognate identification process and manually correct them. Additionally, we label the English cognates according to their etymology, separating them into two groups: old borrowings and recent borrowings. On this curated dataset, we analyse word properties such as frequency and polysemy, and the distribution of similarity scores between cognate sets in different languages. We automatically identify different clusters of English cognates, setting a new direction of research in cognates, borrowings and possibly false friends analysis in related languages.

pdf abs
Towards an Etymological Map of Romanian
Alina Maria Cristea | Anca Dinu | Liviu P. Dinu | Simona Georgescu | Ana Sabina Uban | Laurentiu Zoicas
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

In this paper we investigate the etymology of Romanian words. We start from the Romanian lexicon and automatically extract information from multiple etymological dictionaries. We evaluate the results and perform extensive quantitative and qualitative analyses with the goal of building an etymological map of the language.

Co-authors

Mihnea-Lucian Mihai 1

Ana Uban 1

Alina Cristea 1

Ioan-Bogdan Iordache 1

Anca Daniela Dinu 1