Alina Maria Ciobanu

Also published as: Alina Ciobanu

2020

Automatic Reconstruction of Missing Romanian Cognates and Unattested Latin Words
Alina Maria Ciobanu | Liviu P. Dinu | Laurentiu Zoicas
Proceedings of the Twelfth Language Resources and Evaluation Conference

Producing related words is a key concern in historical linguistics. Given an input word, the task is to automatically produce either its proto-word, a cognate, or a modern word derived from it. In this paper, we apply a method for producing related words based on sequence labeling, aiming to fill in the gaps in incomplete cognate sets in Romance languages with Latin etymology (producing Romanian cognates that are missing) and to reconstruct unattested Latin words. We further investigate an ensemble-based aggregation for combining and re-ranking the word productions of multiple languages.
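As a rough illustration of combining and re-ranking productions from several languages, the sketch below fuses per-language n-best lists with reciprocal-rank voting. The abstract does not specify the aggregation scheme, so the fusion rule, the function name `fuse_rankings`, and the toy candidate strings are all assumptions made for illustration, not the authors' method or data.

```python
from collections import defaultdict


def fuse_rankings(nbest_by_language, weights=None):
    """Merge per-language n-best lists into a single ranking.

    Each candidate receives weight / rank from every list it appears in
    (reciprocal-rank voting, an assumed scheme). Ties are broken
    alphabetically so the output is deterministic.
    """
    scores = defaultdict(float)
    for lang, candidates in nbest_by_language.items():
        weight = 1.0 if weights is None else weights.get(lang, 1.0)
        for rank, candidate in enumerate(candidates, start=1):
            scores[candidate] += weight / rank
    return sorted(scores, key=lambda c: (-scores[c], c))


# Toy, made-up n-best outputs of three hypothetical per-language models.
nbest = {
    "it": ["noctem", "nottem"],
    "es": ["nochem", "noctem"],
    "ro": ["noctem", "noaptem"],
}
ranking = fuse_rankings(nbest)
print(ranking)
```

A candidate proposed near the top by several languages accumulates the highest score, which is the intuition behind using multiple modern languages as evidence for one ancestor form.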

2019

Automatic Identification and Production of Related Words for Historical Linguistics
Alina Maria Ciobanu | Liviu P. Dinu
Computational Linguistics, Volume 45, Issue 4 - December 2019

Language change across space and time is one of the main concerns in historical linguistics. In this article, we develop tools to assist researchers and domain experts in the study of language evolution. First, we introduce a method to automatically determine whether two words are cognates. We propose an algorithm for extracting cognates from electronic dictionaries that contain etymological information. Having built a data set of related words, we further develop machine learning methods based on orthographic alignment for identifying cognates. We use aligned subsequences as features for classification algorithms in order to infer rules for linguistic changes undergone by words when entering new languages and to discriminate between cognates and non-cognates. Second, we extend the method to a finer-grained level, to identify the type of relationship between words. Discriminating between cognates and borrowings provides a deeper insight into the history of a language and allows a better characterization of language relatedness. We show that orthographic features have discriminative power and we analyze the underlying linguistic factors that prove relevant in the classification task. To our knowledge, this is the first attempt of this kind. Third, we develop a machine learning method for automatically producing related words. We focus on reconstructing proto-words, but we also address two related sub-problems, producing modern word forms and producing cognates. The task of reconstructing proto-words consists of recreating the words in an ancient language from its modern daughter languages. Having modern word forms in multiple Romance languages, we infer the form of their common Latin ancestors. Our approach relies on the regularities that occurred when words entered the modern languages. We leverage information from several modern languages, building an ensemble system for reconstructing proto-words. 
We apply our method to multiple data sets, showing that it improves on previous results while requiring less input data, which is essential in historical linguistics, where resources are generally scarce.

Studying Laws of Semantic Divergence across Languages using Cognate Sets
Ana Uban | Alina Maria Ciobanu | Liviu P. Dinu
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change

Semantic divergence in related languages is a key concern of historical linguistics. Intra-lingual semantic shift has been previously studied in computational linguistics, but this can only provide a limited picture of the evolution of word meanings, which often develop in a multilingual environment. In this paper we investigate semantic change across languages by measuring the semantic distance of cognate words in multiple languages. By comparing current meanings of cognates in different languages, we hope to uncover information about their previous meanings, and about how they diverged in their respective languages from their common original etymon. We further study the properties of their semantic divergence, by analyzing how the features of words such as frequency and polysemy are related to the divergence in their meaning, and thus make the first steps towards formulating laws of cross-lingual semantic change.

2018

Ab Initio: Automatic Latin Proto-word Reconstruction
Alina Maria Ciobanu | Liviu P. Dinu
Proceedings of the 27th International Conference on Computational Linguistics

Proto-word reconstruction is central to the study of language evolution. It consists of recreating the words in an ancient language from its modern daughter languages. In this paper we investigate automatic word form reconstruction for Latin proto-words. Having modern word forms in multiple Romance languages (French, Italian, Spanish, Portuguese and Romanian), we infer the form of their common Latin ancestors. Our approach relies on the regularities that occurred when the Latin words entered the modern languages. We leverage information from all modern languages, building an ensemble system for proto-word reconstruction. We use conditional random fields for sequence labeling, but we conduct preliminary experiments with recurrent neural networks as well. We apply our method to multiple datasets, showing that it improves on previous results while requiring less input data, which is essential in historical linguistics, where resources are generally scarce.

Simulating Language Evolution: a Tool for Historical Linguistics
Alina Maria Ciobanu | Liviu P. Dinu
Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations

Language change across space and time is one of the main concerns in historical linguistics. In this paper, we develop a language evolution simulator: a web-based tool for word form production that assists historical linguists in studying the evolution of languages. Given a word in a source language, the system automatically predicts how the word evolves in a target language. The method that we propose is language-agnostic and does not use any external knowledge, except for the training word pairs.

Discriminating between Indo-Aryan Languages Using SVM Ensembles
Alina Maria Ciobanu | Marcos Zampieri | Shervin Malmasi | Santanu Pal | Liviu P. Dinu
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

In this paper we present a system based on SVM ensembles trained on characters and words to discriminate between five similar languages of the Indo-Aryan family: Hindi, Braj Bhasha, Awadhi, Bhojpuri, and Magahi. The system competed in the Indo-Aryan Language Identification (ILI) shared task organized within the VarDial Evaluation Campaign 2018. Our best entry in the competition, named ILIdentification, achieved an F1 score of 88.95% and was ranked 3rd out of 8 teams.

German Dialect Identification Using Classifier Ensembles
Alina Maria Ciobanu | Shervin Malmasi | Liviu P. Dinu
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

In this paper we present the GDI classification entry to the second German Dialect Identification (GDI) shared task organized within the scope of the VarDial Evaluation Campaign 2018. We present a system based on SVM classifier ensembles trained on characters and words. The system was trained on a collection of speech transcripts of five Swiss-German dialects provided by the organizers. The transcripts included in the dataset contained speakers from Basel, Bern, Lucerne, and Zurich. Our entry in the challenge reached an F1 score of 62.03% and was ranked third out of eight teams.

ALB at SemEval-2018 Task 10: A System for Capturing Discriminative Attributes
Bogdan Dumitru | Alina Maria Ciobanu | Liviu P. Dinu
Proceedings of the 12th International Workshop on Semantic Evaluation

Semantic difference detection attempts to capture whether a word is a discriminative attribute between two other words. For example, the discriminative feature red characterizes the first word from the (apple, banana) pair, but not the second. Modeling semantic difference is essential for language understanding systems, as it provides useful information for identifying particular aspects of word senses. This paper describes our system implementation (the ALB system of the NLP@Unibuc team) for the 10th task of the SemEval 2018 workshop, “Capturing Discriminative Attributes”. We propose a method for semantic difference detection that uses an SVM classifier with features based on co-occurrence counts and shallow semantic parsing, achieving an F1 score of 0.63 in the competition.

2017

Native Language Identification on Text and Speech
Marcos Zampieri | Alina Maria Ciobanu | Liviu P. Dinu
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

This paper presents an ensemble system that combines the output of multiple SVM classifiers for native language identification (NLI). The system was submitted to the fusion track of the NLI Shared Task 2017, which featured student essays and spoken responses, in the form of audio transcriptions and iVectors, by non-native English speakers of eleven native languages. Our system competed in the challenge under the team name ZCD and was based on an ensemble of SVM classifiers trained on character n-grams, achieving 83.58% accuracy and ranking 3rd in the shared task.

2016

Vanilla Classifiers for Distinguishing between Similar Languages
Sergiu Nisioi | Alina Maria Ciobanu | Liviu P. Dinu
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

In this paper we describe the submission of the UniBuc-NLP team for the Discriminating between Similar Languages Shared Task, DSL 2016. We present and analyze the results we obtained in the closed track of sub-task 1 (Similar languages and language varieties) and sub-task 2 (Arabic dialects). For sub-task 1 we used a logistic regression classifier with tf-idf feature weighting and for sub-task 2 a character-based string kernel with an SVM classifier. Our results show that good accuracy scores can be obtained with limited feature and model engineering. While certain limitations are to be acknowledged, our approach worked surprisingly well for out-of-domain, social media data, with 0.898 accuracy (3rd place) for dataset B1 and 0.838 accuracy (4th place) for dataset B2.

A Computational Perspective on the Romanian Dialects
Alina Maria Ciobanu | Liviu P. Dinu
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we conduct an initial study on the dialects of Romanian. We analyze the differences between Romanian and its dialects using the Swadesh list. We analyze the predictive power of the orthographic and phonetic features of the words, framing dialect identification as a classification problem.

2015

AMBRA: A Ranking Approach to Temporal Text Classification
Marcos Zampieri | Alina Maria Ciobanu | Vlad Niculae | Liviu P. Dinu
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

Readability Assessment of Translated Texts
Alina Maria Ciobanu | Liviu P. Dinu | Flaviu Pepelea
Proceedings of the International Conference Recent Advances in Natural Language Processing

Automatic Discrimination between Cognates and Borrowings
Alina Maria Ciobanu | Liviu P. Dinu
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2014

On the Romance Languages Mutual Intelligibility
Liviu Dinu | Alina Maria Ciobanu
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We propose a method for computing the similarity of natural languages and for clustering them based on their lexical similarity. Our study provides evidence to be used in the investigation of written intelligibility, i.e., the ability of people writing in different languages to understand one another without prior knowledge of foreign languages. We account for etymons and cognates, we quantify lexical similarity and we extend our analysis from words to languages. Based on the introduced methodology, we compute an intelligibility matrix for the Romance languages.

Using a machine learning model to assess the complexity of stress systems
Liviu Dinu | Alina Maria Ciobanu | Ioana Chitoran | Vlad Niculae
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We address the task of stress prediction as a sequence tagging problem. We present sequential models with averaged perceptron training for learning primary stress in Romanian words. We use character n-grams and syllable n-grams as features and we account for the consonant-vowel structure of the words. We show in this paper that Romanian stress is predictable, though not deterministic, by using data-driven machine learning techniques.
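The per-character features mentioned above (character n-grams and consonant-vowel structure) might be extracted along the following lines. This is a minimal sketch under stated assumptions: the vowel inventory is a simplification that ignores Romanian semivowels, syllable n-grams are omitted, and the names `cv_pattern` and `position_features` as well as the feature keys are hypothetical rather than taken from the paper.

```python
# Simplified Romanian vowel set (assumption: semivowel uses of i, u, e, o are ignored).
VOWELS = set("aăâeiîou")


def cv_pattern(word):
    """Map each character to C (consonant) or V (vowel)."""
    return "".join("V" if ch in VOWELS else "C" for ch in word.lower())


def position_features(word, i, max_n=3):
    """Features for the character at index i, as input to a sequence tagger.

    Includes the character itself, its C/V class, and every character
    n-gram (2 <= n <= max_n) that covers position i and fits in the word.
    """
    w = word.lower()
    feats = {"char": w[i], "cv": cv_pattern(w)[i]}
    for n in range(2, max_n + 1):
        for start in range(i - n + 1, i + 1):
            if start >= 0 and start + n <= len(w):
                # Key encodes the n-gram length and its offset relative to i.
                feats[f"ng{n}@{start - i}"] = w[start:start + n]
    return feats


features = [position_features("copil", i) for i in range(len("copil"))]
print(cv_pattern("copil"))
print(features[1])
```

A tagger such as an averaged perceptron would then consume one such feature dictionary per character and predict whether that position carries primary stress.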

Building a Dataset of Multilingual Cognates for the Romanian Lexicon
Liviu Dinu | Alina Maria Ciobanu
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Identifying cognates is an interesting task with applications in numerous research areas, such as historical and comparative linguistics, language acquisition, cross-lingual information retrieval, readability and machine translation. We propose a dictionary-based approach to identifying cognates based on etymology and etymons. We account for relationships between languages and we extract etymology-related information from electronic dictionaries. We employ the dataset of cognates that we obtain as a gold standard for evaluating the extent to which orthographic methods can be used to detect cognate pairs. The question that arises is whether they are able to discriminate between cognates and non-cognates, given the orthographic changes undergone by foreign words when entering new languages. We investigate some orthographic approaches widely used in this research area and some original metrics as well. We run our experiments on the Romanian lexicon, but the method we propose is adaptable to any language, as long as resources are available.
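Two orthographic measures commonly used for cognate detection are normalized edit distance and the longest common subsequence ratio. The abstract does not list which metrics were evaluated, so the sketch below is only an assumed illustration of the general idea; the function names are hypothetical, and the word pair is just an example of a Romanian and an Italian form.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """1 minus the edit distance normalized by the longer word's length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def lcs_ratio(a, b):
    """LCS length divided by the longer word's length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else lcs_length(a, b) / longest


print(edit_similarity("noapte", "notte"), lcs_ratio("noapte", "notte"))
```

Scores like these can be thresholded or passed as features to a classifier, and then compared against a dictionary-derived gold standard of cognate pairs.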

A Quantitative Insight into the Impact of Translation on Readability
Alina Maria Ciobanu | Liviu Dinu
Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR)

An Etymological Approach to Cross-Language Orthographic Similarity. Application on Romanian
Alina Maria Ciobanu | Liviu P. Dinu
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Automatic Detection of Cognates Using Orthographic Alignment
Alina Maria Ciobanu | Liviu P. Dinu
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Temporal Text Ranking and Automatic Dating of Texts
Vlad Niculae | Marcos Zampieri | Liviu Dinu | Alina Maria Ciobanu
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

Predicting Romanian Stress Assignment
Alina Maria Ciobanu | Anca Dinu | Liviu Dinu
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers