Carmen Mîrzea Vasile


2026

The Romanian journalistic corpus previously annotated with verbal multiword expressions (PARSEME-Ro) has recently been extended with further journalistic texts and annotated with multiword expressions of all parts of speech, closely following version 2.0 of the PARSEME guidelines. The corpus size has been increased by about 40%. It underwent automatic morpho-syntactic annotation following the Universal Dependencies principles, as well as extensive semi-automatic annotation of multiword expressions of all morphological types (nominal, adjectival, adverbial, determiner, pronominal, prepositional, conjunction, interjection, and verbal for the newly added texts). We present our methodology, which includes an automatic annotation phase, although manual work predominates in checking the annotation and its consistency. We also offer quantitative data about the new version of the corpus, the types of multiword expressions that exist in Romanian and occur in the corpus, and their characteristics. The new version of the PARSEME-Ro corpus contributes both to multiword expression resources per se, i.e. resources describing this linguistic phenomenon, and to resources for training, tuning and testing the performance of tools and large language models when dealing with it. The paper also offers some remarks on the MWE paraphrasing subtask, in which a part of the corpus was used. The corpus is released under a permissive license.
This paper presents an enhanced version of the Romanian corpus previously annotated only for verbal multiword expressions. The new release extends the annotation to multiword expressions of other parts of speech, following version 2.0 of the PARSEME guidelines. The corpus has been expanded; its new part received automatic morpho-syntactic annotation based on the Universal Dependencies framework, followed by extensive semi-automatic annotation of multiword expressions across all morphological categories. The paper also reports quantitative data on the updated corpus and discusses the distribution and characteristics of Romanian multiword expressions. We also highlight language-specific annotation challenges and issues arising from the PARSEME 2.0 guidelines.

2023

This article presents a work-in-progress project that aims to build and use a corpus of Romanian texts written or spoken by non-native students of different nationalities who learn Romanian as a foreign language in the one-year intensive academic program organized by the University of Bucharest. This corpus, called LECOR – Learner Corpus for Romanian – is made up of pairs of texts: the student's version and the teacher's corrected version. Each version is automatically annotated with lemmas and POS tags; the two versions are then compared, and at this stage the differences are marked as errors. The corpus also contains sets of metadata files about the students and their samples. The article presents the conceptual framework for building and using the corpus, including the acquisition and organization of the primary material, the annotation process, and the first attempts to adapt the NoSketch Engine query interface to the project's objectives. It concludes by outlining the next steps in the development of the corpus: increasing its size, refining the error correction process, and introducing more complex error annotation.
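The comparison step described in this abstract, in which a student version is aligned with the teacher's corrected version and the differences are marked as errors, can be illustrated with a minimal sketch. It assumes whitespace tokenization and uses Python's standard `difflib` module; the function name `diff_tokens` and the sentence pair are invented for illustration and are not taken from LECOR, whose actual comparison procedure is not detailed here.

```python
from difflib import SequenceMatcher


def diff_tokens(student, corrected):
    """Return (operation, student_span, corrected_span) for each difference.

    Hypothetical helper: aligns two token lists and keeps only the
    non-matching regions, which a later stage could label as errors.
    """
    matcher = SequenceMatcher(a=student, b=corrected, autojunk=False)
    return [
        (tag, student[i1:i2], corrected[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Invented learner sentence and correction, tokenized on whitespace.
student = "Eu merge la facultate ieri".split()
corrected = "Eu am mers la facultate ieri".split()

for operation, source_span, target_span in diff_tokens(student, corrected):
    print(operation, source_span, target_span)
```

Each reported difference carries an operation label (`replace`, `delete`, or `insert`) together with the affected tokens on both sides. In a pipeline like the one described, the same alignment could be run over lemma or POS-tag sequences instead of surface forms.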