Sergiu Nisioi
2026
Archaeology at MWE-2026 PARSEME 2.0 Subtask 1 and 2: Parsing is for Encoders, Paraphrasing is for LLMs
Rares-Alexandru Roscan | Sergiu Nisioi
Proceedings of the 22nd Workshop on Multiword Expressions (MWE 2026)
This paper presents our approach to the PARSEME 2.0 Shared Task on Romanian, covering both Identification (Subtask 1) and Paraphrasing (Subtask 2). While Large Language Models (LLMs) excel at semantic generation, we hypothesize that they lack the structural precision required for MWE identification, leading to "boundary hallucinations" that compromise downstream simplification. Our Rank 1 results on Romanian confirm this: a specialized encoder (RoBERT) using standard sequence labeling outperforms both few-shot LLMs and complex structural parsers (MTLB-STRUCT). This justifies our proposed pipeline: using encoders as precise “pointers” to guide the generative power of LLMs.
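As a rough illustration of the sequence-labeling formulation mentioned above (the BIO tag set and the example sentence are ours, not taken from the paper), encoder predictions can be decoded into MWE spans as follows:

```python
# Hypothetical sketch of standard BIO sequence labeling for MWE
# identification: token-level tags are decoded into contiguous spans.

def decode_bio(tokens, tags):
    """Collect (start, end) spans (end exclusive) from B-/I- tags."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B-MWE":
            if start is not None:       # close a previous span
                spans.append((start, i))
            start = i
        elif tag == "I-MWE":
            if start is None:           # stray I- tag: open a new span
                start = i
        else:                           # "O" closes any open span
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans

tokens = ["a", "dat", "ortul", "popii", "ieri"]
tags   = ["O", "B-MWE", "I-MWE", "I-MWE", "O"]
print(decode_bio(tokens, tags))  # [(1, 4)]
```

Precise span boundaries like these are exactly what the paper argues LLMs tend to hallucinate.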
MorphoFiltered-Gemini at MWE-2026 PARSEME 2.0 Subtask 1: Tackling LLM Overgeneration via Universal POS-based Constraints
Irina Moise | Sergiu Nisioi
Proceedings of the 22nd Workshop on Multiword Expressions (MWE 2026)
This paper describes MorphoFiltered-Gemini, a multilingual system submitted to the PARSEME 2.0 shared task on multiword expression (MWE) identification. The system relies on Google Gemini 2.0 Flash-Lite to generate MWE predictions using zero-shot and selectively applied few-shot prompting, without fine-tuning or language-specific resources. To reduce the tendency of large language models to over-generate MWEs, we introduce a lightweight morphological post-filter that removes unlikely constructions while preserving high-precision patterns. Rather than optimizing peak performance for individual languages, our approach prioritizes precision and cross-lingual robustness. As a result, the system exhibits stable behavior across 17 typologically diverse languages and achieves the highest Shannon evenness score among all submitted systems. The experimental results highlight a clear trade-off between recall-oriented LLM prompting strategies and precision-oriented filtering, and show that simple linguistic constraints can effectively improve the stability of LLM-based multilingual MWE identification systems.
A Typology of Synthetic Datasets for Dialogue Processing in Clinical Contexts
Steven Bedrick | A. Seza Dogruoz | Sergiu Nisioi
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Synthetic datasets are used across linguistic domains and NLP tasks, particularly in scenarios where authentic data is limited (or even non-existent). One such domain is that of clinical (healthcare) contexts, where there exist significant and long-standing challenges (e.g., privacy, anonymization, and data governance) which have led to the development of an increasing number of synthetic datasets. One increasingly important category of clinical dataset is that of clinical dialogues which are especially sensitive and difficult to collect. Therefore, they are commonly synthesized. While such synthetic datasets have been shown to be sufficient in some situations, little theory exists to inform how they may be best used and generalized to new applications. In this paper, we provide an overview of how synthetic datasets are created, evaluated and used for dialogue related tasks in the medical domain. Additionally, we propose a novel typology for use in classifying types and degrees of data synthesis, to facilitate comparison and evaluation.
The MultiplEYE Text Corpus: Towards a Diverse and Ever-Expanding Multilingual Text Corpus
Ramunė Kasperė | Anna Bondar | Sergiu Nisioi | Maja Stegenwallner-Schütz | Hanne B. Søndergaard Knudsen | Ana Matić | Eva Pavlinušić Vilus | Dorota Klimek-Jankowska | Chiara Tschirner | Not Battesta Soliva | Deborah N. Jakobi | Cui Ding | Dima Abu Romi | Cengiz Acarturk | Matilda Agdler | Anton Marius Alexandru | Mohd Faizan Ansari | Annalisa Arcidiacono | Elizabete Ausma Velta Barisa | Ana Bautista | Lisa Beinborn | Yevgeni Berzak | Nedeljka Bjelanović | Anna Isabelle Bothmann | Jan Brasser | Caterina Cacioli | Anila Çepani | Ilze Ceple | Adelina Cerpja | Dalí Chirino | Jan Chromý | Alessandro Corona Mendozza | Iria de-Dios-Flores | Nazik Dinçtopal Deniz | Ana Došen | Kristian Elersič | Inmaculada Fajardo | Zigmunds Freibergs | Angelina Ganebnaya | Shan Gao | Jéssica Gomes | Annjo Klungervik Greenall | Alba Haveriku | Miao He | Anamaria Hodivoianu | Yu-Yin Hsu | Amanda Isaksen | Andreia Janeiro | Kristine Jensen de López | Aleksandar Jevremovic | Vojislav Jovanovic | Hanna Kędzierska | Nik Kharlamov | Sara Kosutar | Nelda Kote | Vanja Kovic | Izabela Krejtz | Thyra Krosness | Oleksandra Kuvshynova | Eilam Lavy | Ella Lion | Marta Łockiewicz | Kaidi Lõo | Paula Luegi | Mircea Mihai Marin | Clara Martin | Svitlana Matvieieva | Diane C. Mézière | Xavier Mínguez-López | Valeriia Modina | Jurgita Motiejūnienė | Marie-Luise Müller | Tolgonai Nasipbek kyzy | Jamal Abdul Nasir | Johanne S. K. Nedergård | Ayşegül Özkan | Patrizia Paggio | Marijan Palmović | Maria Christina Panagiotopoulou | Alberto Parola | Helena Pérez | Klaudia Petersen | Anja Podlesek | Eva Pospíšilová | Marta Praulina | Mikuláš Preininger | Loredana Pungă | Diego Rossini | Špela Rot | Habib Sani Yahaya | Irina A. Sekerina | Anne Gabija Skadina | Jordi Solé-Casals | Lonneke van der Plas | Saara M. 
Varjopuro | Spyridoula Varlokosta | João Veríssimo | Oskari Juhapekka Virtanen | Nemanja Vračar | Mila Vulchanova | Ahmad Mustapha Wali | Peizheng Wu | Nilgün Yücel | Stefan Frank | Nora Hollenstein | Lena Jäger
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present the MultiplEYE Text Corpus, a large-scale, document-level, multi-parallel resource designed to advance cross-linguistic research on reading and language processing. The corpus provides paragraph-level alignment for texts in 39 languages spanning seven language families and seven scripts. Unlike many existing multilingual corpora, a substantial number of documents were originally written in languages other than English, reducing English-centric bias and supporting more typologically diverse investigations. The texts are carefully selected to balance linguistic richness with experimental feasibility, particularly for eye-tracking-while-reading studies. Developed within a multi-lab initiative, the MultiplEYE Text Corpus follows unified translation, alignment, and experimental design guidelines to ensure cross-linguistic comparability. Its inclusion of texts varying in type and difficulty enables research on discourse-level processing, genre effects, and individual differences across a wide range of languages. The text corpus and accompanying metadata provide a robust foundation for multilingual psycholinguistic and computational modeling research. Data and materials are publicly available at https://doi.org/10.23668/psycharchives.21750.
DCSN-NLP at MWE-2026 AdMIRe 2: Bridging Literal and Figurative Meaning Through Hierarchical Multimodal Reasoning
David Cotigă | Sergiu Nisioi
Proceedings of the 22nd Workshop on Multiword Expressions (MWE 2026)
This paper presents our system for the MWE-2026 AdMIRe 2.0 shared task, which aimed to advance multimodal idiomatic understanding across 15 languages. We address the task of selecting, from a set of five images, the one that best represents either the literal or idiomatic meaning of a given compound in context. Our approach follows a multi-step pipeline: a large language model (LLM) first determines whether the compound is used literally or idiomatically and generates auxiliary text, consisting of an idiomatic meaning explanation and a visual description of the literal meaning. An ensemble of three CLIP models then identifies the two images most semantically similar to the appropriate generated text via a voting mechanism. Finally, the LLM selects the best image from these two candidates.
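The voting step of the ensemble can be sketched roughly as follows; the similarity scores below are made-up stand-ins for real CLIP image-text similarities, and the exact tie-breaking used in the paper is not specified here:

```python
# Illustrative sketch of a two-candidate voting mechanism over an
# ensemble of image-text similarity scorers (e.g., three CLIP models).
from collections import Counter

def vote_top2(similarities_per_model):
    """Each model votes for its two most similar images; the two images
    with the most votes overall become the candidate pair."""
    votes = Counter()
    for sims in similarities_per_model:            # one score list per model
        ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
        for img in ranked[:2]:
            votes[img] += 1
    return [img for img, _ in votes.most_common(2)]

# Three mock models scoring five candidate images:
sims = [
    [0.10, 0.80, 0.75, 0.20, 0.05],
    [0.15, 0.70, 0.60, 0.65, 0.10],
    [0.05, 0.85, 0.40, 0.75, 0.20],
]
print(vote_top2(sims))  # image 1 gets 3 votes, image 3 gets 2 -> [1, 3]
```

The surviving pair is then handed back to the LLM for the final choice.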
2025
A Comparison of Elementary Baselines for BabyLM
Rareș Păpușoi | Sergiu Nisioi
Proceedings of the First BabyLM Workshop
This paper explores multiple simple baselines for the BabyLM challenge, covering random models, elementary frequency-based predictions, n-gram language models, LSTMs with several tokenizers (BPE, Unigram, SuperBPE), and GPT-BERT, the winning architecture from the prior BabyLM edition. The evaluation focuses on the BLiMP and BLiMP-Supplement benchmarks. Our experiments show that Strict-Small can sometimes outperform Strict, that performance can be highly sensitive to tokenization, and that data efficiency is important. A simple word-frequency baseline scored unexpectedly high, which led to identifying an evaluation artifact in the pipeline: a system that returns identical logits for both sentences in a minimal pair can achieve maximal accuracy.
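The evaluation artifact can be reproduced in a few lines; the comparison rules and the minimal pairs below are illustrative, not the benchmark's actual implementation:

```python
# Sketch of the artifact: if a minimal-pair benchmark counts ties in
# favor of the "good" sentence (>= instead of >), a degenerate model
# that returns identical scores for every sentence is always "correct".

def accuracy(pairs, score, correct_if):
    """Fraction of pairs where the good sentence beats the bad one."""
    return sum(correct_if(score(good), score(bad)) for good, bad in pairs) / len(pairs)

pairs = [("the cats sleep", "the cats sleeps"),
         ("she has left", "she have left")]
constant = lambda sentence: 0.0        # identical score for everything

print(accuracy(pairs, constant, lambda g, b: g >= b))  # 1.0 -- the artifact
print(accuracy(pairs, constant, lambda g, b: g > b))   # 0.0 -- strict comparison
```

A strict comparison (or explicit tie handling) removes the loophole.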
Predicting Total Reading Time Using Romanian Eye-Tracking Data
Anamaria Hodivoianu | Oleksandra Kuvshynova | Filip Popovici | Adrian Luca | Sergiu Nisioi
Proceedings of the First International Workshop on Gaze Data and Natural Language Processing
This work introduces the first Romanian eye-tracking dataset for reading and investigates methods for predicting word-level total reading times (TRT). We develop and compare a range of models, from traditional machine learning using handcrafted linguistic features to fine-tuned Romanian BERT architectures, demonstrating strong correlations between predicted and observed reading times. Additionally, we propose a lexical simplification pipeline that leverages these TRT predictions to identify and substitute complex words, enhancing text readability. Our approach is integrated into an interactive web tool, illustrating the practical benefits of combining cognitive signals with NLP techniques for Romanian, a language with limited resources in this area.
Arabic to Romanian Machine Translation: A Case Study on Distant Language Pairs
Ioan Alexandru Hirica | Stefana Arina Tabusca | Sergiu Nisioi
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
This paper investigates machine translation between two linguistically distant languages, Arabic and Romanian, with a focus on translating from Arabic to Romanian. Dataset cleaning techniques are addressed, offering insights into the impact of translation for a language pair with limited resources. Using publicly available corpora (e.g., OPUS) and manually translated diplomatic texts, filtering methods are applied, such as duplicate removal, embedding similarity analysis (LEALLA), and Large Language Model (LLM)-based validation (Gemini-flash-002). Transformer models are trained and evaluated with diverse preprocessing pipelines that incorporate subword tokenization. Additionally, a fine-tuned LLM is assessed on this task and compared to its pre-trained counterpart. Despite computational limitations, the results emphasize the importance of targeted preprocessing and model adaptation in improving Arabic-Romanian translation quality.
RALS: Resources and Baselines for Romanian Automatic Lexical Simplification
Fabian Anghel | Cristea Petru-Theodor | Claudiu Creanga | Sergiu Nisioi
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
We introduce the first dataset that jointly covers both lexical complexity prediction (LCP) annotations and lexical simplification (LS) for Romanian, along with a comparison of lexical simplification approaches. We propose a methodology for ordering simplification suggestions using a pairwise ranking approximation method, arranging candidates from simple to complex based on a separate set of human judgments. In addition, we provide human lexical complexity annotations for 3,921 word samples in context. Finally, we explore several novel pipelines for complexity prediction and simplification and present the first text simplification system for Romanian.
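The pairwise ordering idea can be sketched as a simple win-count approximation; the Romanian word candidates and judgments below are toy examples, and the paper's actual ranking-approximation method may differ:

```python
# Illustrative sketch: order simplification candidates from simple to
# complex using pairwise human judgments, counting how often each
# candidate was judged the simpler member of a pair.
from collections import Counter

def order_by_wins(candidates, judgments):
    """judgments: (simpler, more_complex) pairs from human comparisons."""
    wins = Counter({c: 0 for c in candidates})
    for simpler, _harder in judgments:
        wins[simpler] += 1
    return sorted(candidates, key=lambda c: wins[c], reverse=True)

candidates = ["casa", "locuinta", "domiciliu"]
judgments = [("casa", "locuinta"), ("casa", "domiciliu"), ("locuinta", "domiciliu")]
print(order_by_wins(candidates, judgments))  # ['casa', 'locuinta', 'domiciliu']
```

With consistent judgments this recovers a total order without requiring annotators to rank full candidate lists directly.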
Exploring Mouse Tracking for Reading on Romanian Data
Cristina Maria Popescu | Sergiu Nisioi
Proceedings of the First International Workshop on Gaze Data and Natural Language Processing
In this paper, we investigate the use of the Mouse Tracking for Reading (MoTR) method for a sample of Romanian texts. MoTR is a novel measurement tool that is meant to collect word-by-word reading times. In a typical MoTR trial, the text is blurred, except for a small area around the mouse pointer, and the participants must move the mouse to reveal and read the text. In the current experiment, participants read such texts and afterwards answered comprehension questions, aiming to evaluate reading behavior and cognitive engagement. Mouse movement is recorded and analyzed to evaluate attention distribution across a sentence, providing insights into incremental language processing. Based on all the information gathered, the study confirms the feasibility of this method in a controlled setting and emphasizes MoTR’s potential as an accessible and naturalistic approach for studying text comprehension.
Archaeology at BEA 2025 Shared Task: Are Simple Baselines Good Enough?
Ana Roșu | Jany-Gabriel Ispas | Sergiu Nisioi
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)
This paper describes our approaches to five classification tasks from the Building Educational Applications (BEA) 2025 Shared Task. Our methods range from classical machine learning models to large-scale transformers with fine-tuning and prompting strategies. Despite the diversity of approaches, performance differences were often minor, suggesting a strong surface-level signal and the limiting effect of annotation noise, particularly around the “To some extent” label. Under lenient evaluation, simple models perform competitively, showing their effectiveness in low-resource settings. Our submissions ranked in the top 10 in four of the five tracks.
Graph-based RAG for Low-Resource Aromanian–Romanian Translation
Laurentiu G. Ghetoiu | Sergiu Nisioi
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
Aromanian, a linguistically and culturally significant yet low-resource Romance language, poses substantial challenges in computational linguistic research due to its limited NLP resources and non-standardized orthography. In this paper, we present an experimental study aimed at translating Aromanian texts into Romanian using a variety of modern NLP methodologies. We leverage two key resources: a parallel corpus consisting of approximately 3,000 sentence-aligned short stories and a dictionary of over 28,000 Aromanian-Romanian word pairs. Our approaches include Retrieval-Augmented Generation (RAG) supported by a graph-based alignment database, fine-tuning multilingual transformer models (specifically Meta’s NLLB), and parameter-efficient fine-tuning techniques such as LoRA applied to LLaMA-derived models. Evaluations using standard metrics (BLEU, chrF) demonstrate varied effectiveness across these methodologies, highlighting the strong performance of NLLB for general translation tasks, while RAG excels in translating familiar content. Our findings underline the complexities inherent in low-resource language translation and provide valuable insights into effective digital preservation and NLP adaptation strategies for underrepresented languages.
Archaeology at TSAR 2025 Shared Task: Teaching Small Models to do CEFR Simplifications
Rareş-Alexandru Roşcan | Sergiu Nisioi
Proceedings of the Fourth Workshop on Text Simplification, Accessibility and Readability (TSAR 2025)
Large language models (LLMs) have demonstrated strong performance in text simplification tasks, but their high computational cost and proprietary nature often limit practical use, especially in education. We explore open-source LLMs for CEFR-level text simplification. By reducing model size and computational requirements, our approach enables greater accessibility and deployment in educational environments. Our results show some of the lowest error rates in producing CEFR-compliant texts at TSAR 2025, using models with 8 billion and 1 billion parameters. Such approaches have the potential to democratize NLP technologies for real-world applications.
Dialectal and Low Resource Machine Translation for Aromanian
Alexandru-Iulius Jerpelea | Alina Radoi | Sergiu Nisioi
Proceedings of the 31st International Conference on Computational Linguistics
We present a neural machine translation system, the first of its kind, that can translate between Romanian, English, and Aromanian (an endangered Eastern Romance language). BLEU scores range from 17 to 32, depending on the direction and genre of the text. Alongside the system, we release the largest known Aromanian-Romanian bilingual corpus, consisting of 80k cleaned sentence pairs. Additional tools, such as a language-agnostic sentence embedder (used for both text mining and automatic evaluation) and a diacritics converter, are also presented. Lastly, we describe the online deployment of our quantized model in a CPU-driven, limited-resource scenario.
2024
A Multilingual Parallel Corpus for Aromanian
Iulia Petrariu | Sergiu Nisioi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We report the creation of the first high-quality corpus of Aromanian, an endangered Romance language spoken in the Balkans, together with sentence-aligned translations into Romanian, English, and French. The corpus is released publicly in several orthographic standards and consists of short stories collected in the 1970s in Romania. Additionally, we provide a corpus-based analysis of Aromanian linguistic particularities and of the overall demographic and political context that shapes the contemporary development of the language.
Cheap Ways of Extracting Clinical Markers from Texts
Anastasia Sandu | Teodor Mihailescu | Sergiu Nisioi
Proceedings of the 9th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2024)
This paper describes the Unibuc Archaeology team's work for the CLPsych 2024 Shared Task, which involved finding evidence within the text supporting the assigned suicide risk level. Two types of evidence were required: highlights (relevant spans extracted from the text) and summaries (evidence aggregated into a synthesis). Our work focuses on evaluating Large Language Models (LLMs) against an alternative method that is far more memory- and resource-efficient. The first approach employs an LLM to generate the summaries and guides it, through a processing chain, to provide sequences of text indicating suicidal tendencies as highlights. The second approach implements good old-fashioned machine learning: a tf-idf representation with a logistic regression classifier, whose most representative features we use to extract relevant highlights.
Archaeology at MLSP 2024: Machine Translation for Lexical Complexity Prediction and Lexical Simplification
Petru Cristea | Sergiu Nisioi
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)
We present the submissions of team Archaeology for the Lexical Simplification and Lexical Complexity Prediction Shared Tasks at BEA 2024. Our approach consists of two pipelines for generating lexical substitutions and estimating complexity: one operating on machine translations of the texts into English and one operating on the original language. For the LCP subtask, our XGBoost regressor is trained on engineered features (based primarily on English language resources) and shallow word-structure features. For the LS subtask, we use a locally executed quantized LLM to generate candidates and sort them by the complexity score computed with the pipeline designed for LCP. These pipelines provide distinct perspectives on the lexical simplification process, offering insights into the efficacy and limitations of employing Machine Translation versus direct processing of the original language data.
2023
Clark Kent at SemEval-2023 Task 5: SVMs, Transformers, and Pixels for Clickbait Spoiling
Dragos-stefan Mihalcea | Sergiu Nisioi
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
In this paper we present an analysis of our approaches for the SemEval-2023 Clickbait Challenge. We participated only in the sub-task aiming at identifying different clickbait spoiling types, comparing several machine learning and deep learning approaches. Our analysis confirms previous results on this task and shows that automatic methods are able to reach approximately 70% accuracy at predicting what type of additional content is needed to mitigate sensationalistic posts on social media. Furthermore, we provide a qualitative analysis of the results, showing that the models may do better in practice than the metric indicates, since the evaluation does not depend only on the predictor, but also on the typology we choose to define clickbait spoiling.
2022
Identifying Draft Bills Impacting Existing Legislation: a Case Study on Romanian
Corina Ceausu | Sergiu Nisioi
Proceedings of the Thirteenth Language Resources and Evaluation Conference
In our paper, we present a novel corpus of historical legal documents on the Romanian public procurement legislation and an annotated subset of draft bills that have been screened by legal experts and identified as impacting past public procurement legislation. Using the manual annotations provided by the experts, we attempt to automatically identify future draft bills that have the potential to impact existing policies on public procurement.
2020
CoCo: A Tool for Automatically Assessing Conceptual Complexity of Texts
Sanja Stajner | Sergiu Nisioi | Ioana Hulpuș
Proceedings of the Twelfth Language Resources and Evaluation Conference
Traditional text complexity assessment usually takes into account only syntactic and lexical complexity. The task of automatically assessing conceptual text complexity, important for maintaining the reader’s interest and for adapting texts for struggling readers, has only been proposed recently. In this paper, we present CoCo, a tool for automatic assessment of conceptual text complexity based on the current state-of-the-art unsupervised approach. We make the code and API freely available for research purposes, and describe the code and the possibilities for its personalization and adaptation in detail. We compare the current implementation with the state of the art, discussing the influence of the choice of entity linker on the performance of the tool. Finally, we present results obtained on two widely used text simplification corpora, discussing the full potential of the tool.
2018
A Detailed Evaluation of Neural Sequence-to-Sequence Models for In-domain and Cross-domain Text Simplification
Sanja Štajner | Sergiu Nisioi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Content Extraction and Lexical Analysis from Customer-Agent Interactions
Sergiu Nisioi | Anca Bucur | Liviu P. Dinu
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text
In this paper, we provide a comparative lexical analysis of the vocabulary used by customers and agents in an Enterprise Resource Planning (ERP) environment, along with a potential solution for cleaning the data and extracting content relevant for NLP. We demonstrate that the vocabulary of the language that prevails in ERP conversations diverges strongly from the standardized dictionary and differs further from general language usage as extracted from the Common Crawl corpus. Moreover, even in business communication circumstances where a high usage of standardized language would be expected, code-switching and non-standard expressions are predominant, emphasizing once more the discrepancy between day-to-day language use and the standardized one.
2017
Exploring Neural Text Simplification Models
Sergiu Nisioi | Sanja Štajner | Simone Paolo Ponzetto | Liviu P. Dinu
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
We present the first attempt at using sequence-to-sequence neural networks to model text simplification (TS). Unlike previously proposed automated TS systems, our neural text simplification (NTS) systems are able to perform lexical simplification and content reduction simultaneously. An extensive human evaluation of the output has shown that NTS systems achieve almost perfect grammaticality and meaning preservation of output sentences, and a higher level of simplification than state-of-the-art automated TS systems.
2016
Vanilla Classifiers for Distinguishing between Similar Languages
Sergiu Nisioi | Alina Maria Ciobanu | Liviu P. Dinu
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
In this paper we describe the submission of the UniBuc-NLP team for the Discriminating between Similar Languages Shared Task, DSL 2016. We present and analyze the results we obtained in the closed track of sub-task 1 (Similar languages and language varieties) and sub-task 2 (Arabic dialects). For sub-task 1 we used a logistic regression classifier with tf-idf feature weighting and for sub-task 2 a character-based string kernel with an SVM classifier. Our results show that good accuracy scores can be obtained with limited feature and model engineering. While certain limitations are to be acknowledged, our approach worked surprisingly well for out-of-domain, social media data, with 0.898 accuracy (3rd place) for dataset B1 and 0.838 accuracy (4th place) for dataset B2.
On the Similarities Between Native, Non-native and Translated Texts
Ella Rabinovich | Sergiu Nisioi | Noam Ordan | Shuly Wintner
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
A Visual Representation of Wittgenstein’s Tractatus Logico-Philosophicus
Anca Bucur | Sergiu Nisioi
Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)
In this paper we will discuss a method for data visualization together with its potential usefulness in digital humanities and philosophy of language. We compiled a multilingual parallel corpus from different versions of Wittgenstein’s Tractatus Logico-philosophicus, including the original in German and translations into English, Spanish, French, and Russian. Using this corpus, we compute a similarity measure between propositions and render a visual network of relations for different languages.
Comparing Speech and Text Classification on ICNALE
Sergiu Nisioi
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
In this paper we explore and compare a speech and text classification approach on a corpus of native and non-native English speakers. We experiment on a subset of the International Corpus Network of Asian Learners of English containing the recorded speeches and the equivalent text transcriptions. Our results suggest a high correlation between the spoken and written classification results, showing that native accent is highly correlated with grammatical structures found in text.
Using Word Embeddings to Translate Named Entities
Octavia-Maria Şulea | Sergiu Nisioi | Liviu P. Dinu
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
In this paper we investigate the usefulness of neural word embeddings for translating Named Entities (NEs) from a resource-rich language to a language low on resources relevant to the task at hand, introducing a novel yet simple way of obtaining bilingual word vectors. We build on two observations: Mikolov et al. (2013b) show that word vector models trained on comparable corpora yield comparable vector space representations, reducing word translation to finding a rotation matrix, and Zou et al. (2013) show that bilingual word embeddings can improve Chinese Named Entity Recognition (NER) and English-to-Chinese phrase translation. Using the sentence-aligned English-French EuroParl corpora, we show that word embeddings extracted from a merged corpus (the corpus resulting from the merger of the two aligned corpora) can be used for NE translation. We extrapolate that word embeddings trained on merged parallel corpora are useful in Named Entity Recognition and Translation tasks for resource-poor languages.
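The Mikolov et al. (2013b) idea referenced in the abstract, learning a linear map between two embedding spaces from a seed dictionary and translating by nearest neighbour, can be sketched on synthetic vectors; the embeddings below are random placeholders, not real EuroParl-trained vectors:

```python
# Toy sketch of cross-lingual translation via a learned linear map.
# src/tgt are synthetic "embeddings" of seed dictionary pairs.
import numpy as np

rng = np.random.default_rng(0)
src = rng.normal(size=(5, 4))      # source-language vectors (seed pairs)
true_map = rng.normal(size=(4, 4))
tgt = src @ true_map               # corresponding target-language vectors

# Least-squares solution of min_W ||src @ W - tgt||
W, *_ = np.linalg.lstsq(src, tgt, rcond=None)

# "Translate" the third seed word: map it, then take the
# cosine-nearest neighbour in the target space.
mapped = src[2] @ W
sims = tgt @ mapped / (np.linalg.norm(tgt, axis=1) * np.linalg.norm(mapped))
best = int(sims.argmax())
print(best)  # → 2, the correct translation pair
```

With real embeddings the map is trained on a seed lexicon and evaluated on held-out word pairs; the exact recovery here is an artifact of the noise-free synthetic setup.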
A Corpus of Native, Non-native and Translated Texts
Sergiu Nisioi | Ella Rabinovich | Liviu P. Dinu | Shuly Wintner
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
We describe a monolingual English corpus of original and (human) translated texts, with an accurate annotation of speaker properties, including the original language of the utterances and the speaker’s country of origin. We thus obtain three sub-corpora of texts reflecting native English, non-native English, and English translated from a variety of European languages. This dataset will facilitate the investigation of similarities and differences between these kinds of sub-languages. Moreover, it will facilitate a unified comparative study of translations and language produced by (highly fluent) non-native speakers, two closely-related phenomena that have only been studied in isolation so far.
2014
On the syllabic structures of Aromanian
Sergiu Nisioi
Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)
2013
A clustering approach for translationese identification
Sergiu Nisioi | Liviu P. Dinu
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013
Co-authors
- Liviu P. Dinu 7
- Sanja Štajner 3
- Anca Bucur 2
- Anamaria Hodivoianu 2
- Oleksandra Kuvshynova 2
- Ella Rabinovich 2
- Rareş-Alexandru Roşcan 2
- Shuly Wintner 2
- Jamal Abdul Nasir 1
- Dima Abu Romi 1
- Cengiz Acarturk 1
- Matilda Agdler 1
- Anton Marius Alexandru 1
- Fabian Anghel 1
- Mohd Faizan Ansari 1
- Annalisa Arcidiacono 1
- Hanne B. Søndergaard Knudsen 1
- Elizabete Ausma Velta Barisa 1
- Not Battesta Soliva 1
- Ana Bautista 1
- Steven Bedrick 1
- Lisa Beinborn 1
- Yevgeni Berzak 1
- Nedeljka Bjelanović 1
- Anna Bondar 1
- Anna Isabelle Bothmann 1
- Jan Brasser 1
- Caterina Cacioli 1
- Corina Ceausu 1
- Ilze Ceple 1
- Adelina Cerpja 1
- Dalí Chirino 1
- Jan Chromý 1
- Alina Maria Ciobanu 1
- Alessandro Corona Mendozza 1
- David Cotigă 1
- Claudiu Creangă 1
- Petru Cristea 1
- Nazik Dinctopal Deniz 1
- Cui Ding 1
- A. Seza Doğruöz 1
- Ana Došen 1
- Kristian Elersič 1
- Inmaculada Fajardo 1
- Stefan L. Frank 1
- Zigmunds Freibergs 1
- Angelina Ganebnaya 1
- Shan Gao 1
- Laurentiu G. Ghetoiu 1
- Jéssica Gomes 1
- Annjo Klungervik Greenall 1
- Alba Haveriku 1
- Miao He 1
- Ioan Alexandru Hirica 1
- Nora Hollenstein 1
- Yu-Yin Hsu 1
- Ioana Hulpuș 1
- Amanda Isaksen 1
- Jany-Gabriel Ispas 1
- Deborah N. Jakobi 1
- Andreia Janeiro 1
- Kristine Jensen de López 1
- Alexandru-Iulius Jerpelea 1
- Aleksandar Jevremovic 1
- Vojislav Jovanovic 1
- Lena Ann Jäger 1
- Ramunė Kasperė 1
- Nik Kharlamov 1
- Dorota Klimek-Jankowska 1
- Nelda Kote 1
- Vanja Kovic 1
- Sara Košutar 1
- Izabela Krejtz 1
- Thyra Krosness 1
- Hanna Kędzierska 1
- Eilam Lavy 1
- Ella Lion 1
- Adrian Luca 1
- Paula Luegi 1
- Kaidi Lõo 1
- Mircea Mihai Marin 1
- Clara Martin 1
- Ana Matić 1
- Svitlana Matvieieva 1
- Teodor Mihailescu 1
- Dragos-stefan Mihalcea 1
- Valeriia Modina 1
- Irina Moise 1
- Jurgita Motiejūnienė 1
- Diane C. Mézière 1
- Xavier Mínguez-López 1
- Marie-Luise Müller 1
- Tolgonai Nasipbek kyzy 1
- Johanne S. K. Nedergård 1
- Noam Ordan 1
- Patrizia Paggio 1
- Marijan Palmović 1
- Maria Christina Panagiotopoulou 1
- Alberto Parola 1
- Eva Pavlinušić Vilus 1
- Klaudia Petersen 1
- Iulia Petrariu 1
- Cristea Petru-Theodor 1
- Anja Podlesek 1
- Simone Paolo Ponzetto 1
- Cristina Maria Popescu 1
- Filip Popovici 1
- Eva Pospíšilová 1
- Marta Praulina 1
- Mikuláš Preininger 1
- Loredana Pungă 1
- Helena Pérez 1
- Rareș Păpușoi 1
- Alina Radoi 1
- Diego Rossini 1
- Špela Rot 1
- Ana Roșu 1
- Anastasia Sandu 1
- Habib Sani Yahaya 1
- Irina A. Sekerina 1
- Anne Gabija Skadina 1
- Jordi Solé-Casals 1
- Maja Stegenwallner-Schütz 1
- Stefana Arina Tabusca 1
- Chiara Tschirner 1
- Saara M. Varjopuro 1
- Spyridoula Varlokosta 1
- João Veríssimo 1
- Oskari Juhapekka Virtanen 1
- Nemanja Vračar 1
- Mila Vulchanova 1
- Ahmad Mustapha Wali 1
- Peizheng Wu 1
- Nilgün Yücel 1
- Iria de-Dios-Flores 1
- Lonneke van der Plas 1
- Anila Çepani 1
- Ayşegül Özkan 1
- Marta Łockiewicz 1
- Octavia-Maria Şulea 1