2022
Rethinking the Authorship Verification Experimental Setups
Florin Brad | Andrei Manolache | Elena Burceanu | Antonio Barbalau | Radu Tudor Ionescu | Marius Popescu
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
One of the main drivers of recent advances in authorship verification is the large-scale PAN authorship dataset. Despite generating significant progress in the field, inconsistent performance differences between the closed and open test sets have been reported. To address this, we improve the experimental setup by proposing five new public splits over the PAN dataset, specifically designed to isolate and identify biases related to the text topic and to the author’s writing style. We evaluate several BERT-like baselines on these splits, showing that such models are competitive with state-of-the-art authorship verification methods. Furthermore, using explainable AI, we find that these baselines are biased towards named entities. We show that models trained without named entities obtain better results and generalize better when tested on DarkReddit, our new dataset for authorship verification.
2021
jurBERT: A Romanian BERT Model for Legal Judgement Prediction
Mihai Masala | Radu Cristian Alexandru Iacob | Ana Sabina Uban | Marina Cidota | Horia Velicu | Traian Rebedea | Marius Popescu
Proceedings of the Natural Legal Language Processing Workshop 2021
Transformer-based models have become the de facto standard in Natural Language Processing (NLP). By leveraging large unlabeled text corpora, they enable efficient transfer learning, leading to state-of-the-art results on numerous NLP tasks. Nevertheless, for low-resource languages and highly specialized tasks, transformer models tend to lag behind more classical approaches (e.g. SVM, LSTM) due to the lack of such corpora. In this paper, we focus on the legal domain and introduce a Romanian BERT model pre-trained on a large specialized corpus. Our model outperforms several strong baselines for legal judgement prediction on two different corpora consisting of cases from trials involving banks in Romania.
2017
Can string kernels pass the test of time in Native Language Identification?
Radu Tudor Ionescu | Marius Popescu
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications
We describe a machine learning approach for the 2017 shared task on Native Language Identification (NLI). The proposed approach combines several kernels using multiple kernel learning. While most of our kernels are based on character p-grams (also known as n-grams) extracted from essays or speech transcripts, we also use a kernel based on i-vectors, a low-dimensional representation of audio recordings, provided by the shared task organizers. For the learning stage, we choose Kernel Discriminant Analysis (KDA) over Kernel Ridge Regression (KRR), because the former obtains better results on the development set. In previous work, we used a similar machine learning approach to achieve state-of-the-art NLI results. The goal of this paper is to demonstrate that our shallow and simple approach based on string kernels (with minor improvements) can pass the test of time and reach state-of-the-art performance in the 2017 NLI shared task, despite the recent advances in natural language processing. We participated in all three tracks, in which competitors were allowed to use only the essays (essay track), only the speech transcripts (speech track), or both (fusion track). Using only the data provided by the organizers to train our models, we reached a macro F1 score of 86.95% in the closed essay track, 87.55% in the closed speech track, and 93.19% in the closed fusion track. With these scores, our team (UnibucKernel) ranked in the first group of teams in all three tracks, attaining the best scores in the speech and fusion tracks.
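The character p-gram kernel underlying this line of work can be illustrated with a minimal sketch. The function names, the toy strings, and the uniform blending of p values are illustrative assumptions, not the paper's exact formulation (which includes normalization and learned kernel weights):

```python
from collections import Counter

def pgram_counts(text, p):
    """Count all contiguous character p-grams in a string."""
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))

def spectrum_kernel(a, b, p=3):
    """Inner product of character p-gram count vectors (the p-spectrum kernel)."""
    ca, cb = pgram_counts(a, p), pgram_counts(b, p)
    return sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())

def combined_kernel(a, b, ps=(2, 3, 4)):
    """Simplest multiple-kernel combination: an unweighted sum over p values."""
    return sum(spectrum_kernel(a, b, p) for p in ps)

# Shared character sequences ("the ", " language") drive the similarity score,
# with no tokenization or linguistic preprocessing involved.
k = combined_kernel("the native language", "the target language")
```

In practice such kernels are computed for all training pairs to form a Gram matrix, which is then fed to a kernel classifier such as KDA or KRR; the blending weights can also be learned rather than fixed.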
2016
UnibucKernel: An Approach for Arabic Dialect Identification Based on Multiple String Kernels
Radu Tudor Ionescu | Marius Popescu
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
The most common approach in text mining classification tasks is to rely on features such as words, part-of-speech tags, stems, or other high-level linguistic features. In contrast, we present a method that uses only character p-grams (also known as n-grams) as features for the Arabic Dialect Identification (ADI) Closed Shared Task of the DSL 2016 Challenge. The proposed approach combines several string kernels using multiple kernel learning. In the learning stage, we try both Kernel Discriminant Analysis (KDA) and Kernel Ridge Regression (KRR), and we choose KDA, as it gives better results in a 10-fold cross-validation carried out on the training set. Our approach is shallow and simple, but the empirical results obtained in the ADI Shared Task show that it performs very well. Indeed, we ranked second, with an accuracy of 50.91% and a weighted F1 score of 51.31%. We also present improved results in this paper, obtained after the competition ended. Simply by adding more regularization to our model, to make it more suitable for test data that comes from a different distribution than the training data, we obtain an accuracy of 51.82% and a weighted F1 score of 52.18%. Furthermore, the proposed approach has an important advantage: it is language independent and linguistic-theory neutral, as it does not require any NLP tools.
String Kernels for Native Language Identification: Insights from Behind the Curtains
Radu Tudor Ionescu | Marius Popescu | Aoife Cahill
Computational Linguistics, Volume 42, Issue 3 - September 2016
2014
Can characters reveal your native language? A language-independent approach to native language identification
Radu Tudor Ionescu | Marius Popescu | Aoife Cahill
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)
2013
The Story of the Characters, the DNA and the Native Language
Marius Popescu | Radu Tudor Ionescu
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications
2011
Studying Translationese at the Character Level
Marius Popescu
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011
2009
Comparing Statistical Similarity Measures for Stylistic Multivariate Analysis
Marius Popescu | Liviu P. Dinu
Proceedings of the International Conference RANLP-2009
What’s in a name? In some languages, grammatical gender
Vivi Nastase | Marius Popescu
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing
2008
Rank Distance as a Stylistic Similarity
Marius Popescu | Liviu P. Dinu
Coling 2008: Companion volume: Posters
Authorship Identification of Romanian Texts with Controversial Paternity
Liviu Dinu | Marius Popescu | Anca Dinu
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
In this work, we propose a new strategy for the authorship identification problem and test it on an example from Romanian literature: did Radu Albala find the continuation of Mateiu Caragiale’s novel Sub pecetea tainei, or did he write the continuation himself? The proposed strategy is based on the similarity of rankings of function words; we compare the obtained results with those of a learning method (namely, a Support Vector Machine (SVM) with a string kernel).
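The ranking-similarity idea can be sketched in a few lines: rank a fixed set of function words by frequency in each text, then sum the absolute rank differences (rank distance). The helper names, the toy word list, and the stable-sort tie-breaking are illustrative assumptions, not the paper's exact procedure:

```python
def frequency_ranking(words, text):
    """Rank the given function words by frequency in text (rank 1 = most frequent)."""
    freqs = {w: text.split().count(w) for w in words}
    ordered = sorted(words, key=lambda w: -freqs[w])  # stable sort breaks ties
    return {w: r for r, w in enumerate(ordered, start=1)}

def rank_distance(rank_a, rank_b):
    """Sum of absolute differences between the ranks of each word."""
    return sum(abs(rank_a[w] - rank_b[w]) for w in rank_a)

fw = ["si", "de", "la", "cu"]  # a few Romanian function words (illustrative)
ra = frequency_ranking(fw, "de la si de cu de la")
rb = frequency_ranking(fw, "si si de la cu si")
d = rank_distance(ra, rb)  # small distance suggests stylistically similar texts
```

A smaller distance between two texts' function-word rankings is taken as evidence of a common author, since function-word usage is largely topic-independent.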
2004
Regularized Least-Squares classification for Word Sense Disambiguation
Marius Popescu
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text