Sergiu Nisioi
2026
Archaeology at MWE-2026 PARSEME 2.0 Subtask 1 and 2: Parsing is for Encoders, Paraphrasing is for LLMs
Rares-Alexandru Roscan | Sergiu Nisioi
Proceedings of the 22nd Workshop on Multiword Expressions (MWE 2026)
This paper presents our approach to the PARSEME 2.0 Shared Task on Romanian, covering both Identification (Subtask 1) and Paraphrasing (Subtask 2). While Large Language Models (LLMs) excel at semantic generation, we hypothesize that they lack the structural precision required for MWE identification, leading to "boundary hallucinations" that compromise downstream simplification. Our Rank 1 results on Romanian confirm this: a specialized encoder (RoBERT) using standard sequence labeling outperforms both few-shot LLMs and complex structural parsers (MTLB-STRUCT). This justifies our proposed pipeline: using encoders as precise “pointers” to guide the generative power of LLMs.
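As a rough illustration of the sequence-labeling formulation mentioned above (the BIO tag set and the example sentence are ours, not taken from the paper), encoder predictions can be decoded into MWE spans as follows:

```python
# Hypothetical sketch of standard BIO sequence labeling for MWE
# identification: token-level tags are decoded into contiguous spans.

def decode_bio(tokens, tags):
    """Collect (start, end) spans (end exclusive) from B-/I- tags."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B-MWE":
            if start is not None:       # close a previous span
                spans.append((start, i))
            start = i
        elif tag == "I-MWE":
            if start is None:           # stray I- tag: open a new span
                start = i
        else:                           # "O" closes any open span
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans

tokens = ["a", "dat", "ortul", "popii", "ieri"]
tags   = ["O", "B-MWE", "I-MWE", "I-MWE", "O"]
print(decode_bio(tokens, tags))  # [(1, 4)]
```

Precise span boundaries like these are exactly what the paper argues LLMs tend to hallucinate.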
MorphoFiltered-Gemini at MWE-2026 PARSEME 2.0 Subtask 1: Tackling LLM Overgeneration via Universal POS-based Constraints
Irina Moise | Sergiu Nisioi
Proceedings of the 22nd Workshop on Multiword Expressions (MWE 2026)
This paper describes MorphoFiltered-Gemini, a multilingual system submitted to the PARSEME 2.0 shared task on multiword expression (MWE) identification. The system relies on Google Gemini 2.0 Flash-Lite to generate MWE predictions using zero-shot and selectively applied few-shot prompting, without fine-tuning or language-specific resources. To reduce the tendency of large language models to over-generate MWEs, we introduce a lightweight morphological post-filter that removes unlikely constructions while preserving high-precision patterns. Rather than optimizing peak performance for individual languages, our approach prioritizes precision and cross-lingual robustness. As a result, the system exhibits stable behavior across 17 typologically diverse languages and achieves the highest Shannon evenness score among all submitted systems. The experimental results highlight a clear trade-off between recall-oriented LLM prompting strategies and precision-oriented filtering, and show that simple linguistic constraints can effectively improve the stability of LLM-based multilingual MWE identification systems.
A Typology of Synthetic Datasets for Dialogue Processing in Clinical Contexts
Steven Bedrick | A. Seza Dogruoz | Sergiu Nisioi
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Synthetic datasets are used across linguistic domains and NLP tasks, particularly in scenarios where authentic data is limited (or even non-existent). One such domain is that of clinical (healthcare) contexts, where there exist significant and long-standing challenges (e.g., privacy, anonymization, and data governance) which have led to the development of an increasing number of synthetic datasets. One increasingly important category of clinical dataset is that of clinical dialogues which are especially sensitive and difficult to collect. Therefore, they are commonly synthesized. While such synthetic datasets have been shown to be sufficient in some situations, little theory exists to inform how they may be best used and generalized to new applications. In this paper, we provide an overview of how synthetic datasets are created, evaluated and used for dialogue related tasks in the medical domain. Additionally, we propose a novel typology for use in classifying types and degrees of data synthesis, to facilitate comparison and evaluation.
The MultiplEYE Text Corpus: Towards a Diverse and Ever-Expanding Multilingual Text Corpus
Ramunė Kasperė | Anna Bondar | Sergiu Nisioi | Maja Stegenwallner-Schütz | Hanne B. Søndergaard Knudsen | Ana Matić | Eva Pavlinušić Vilus | Dorota Klimek-Jankowska | Chiara Tschirner | Not Battesta Soliva | Deborah N. Jakobi | Cui Ding | Dima Abu Romi | Cengiz Acarturk | Matilda Agdler | Anton Marius Alexandru | Mohd Faizan Ansari | Annalisa Arcidiacono | Elizabete Ausma Velta Barisa | Ana Bautista | Lisa Beinborn | Yevgeni Berzak | Nedeljka Bjelanović | Anna Isabelle Bothmann | Jan Brasser | Caterina Cacioli | Anila Çepani | Ilze Ceple | Adelina Cerpja | Dalí Chirino | Jan Chromý | Alessandro Corona Mendozza | Iria de-Dios-Flores | Nazik Dinçtopal Deniz | Ana Došen | Kristian Elersič | Inmaculada Fajardo | Zigmunds Freibergs | Angelina Ganebnaya | Shan Gao | Jéssica Gomes | Annjo Klungervik Greenall | Alba Haveriku | Miao He | Anamaria Hodivoianu | Yu-Yin Hsu | Amanda Isaksen | Andreia Janeiro | Kristine Jensen de López | Aleksandar Jevremovic | Vojislav Jovanovic | Hanna Kędzierska | Nik Kharlamov | Sara Kosutar | Nelda Kote | Vanja Kovic | Izabela Krejtz | Thyra Krosness | Oleksandra Kuvshynova | Eilam Lavy | Ella Lion | Marta Łockiewicz | Kaidi Lõo | Paula Luegi | Mircea Mihai Marin | Clara Martin | Svitlana Matvieieva | Diane C. Mézière | Xavier Mínguez-López | Valeriia Modina | Jurgita Motiejūnienė | Marie-Luise Müller | Tolgonai Nasipbek kyzy | Jamal Abdul Nasir | Johanne S. K. Nedergård | Ayşegül Özkan | Patrizia Paggio | Marijan Palmović | Maria Christina Panagiotopoulou | Alberto Parola | Helena Pérez | Klaudia Petersen | Anja Podlesek | Eva Pospíšilová | Marta Praulina | Mikuláš Preininger | Loredana Pungă | Diego Rossini | Špela Rot | Habib Sani Yahaya | Irina A. Sekerina | Anne Gabija Skadina | Jordi Solé-Casals | Lonneke van der Plas | Saara M. 
Varjopuro | Spyridoula Varlokosta | João Veríssimo | Oskari Juhapekka Virtanen | Nemanja Vračar | Mila Vulchanova | Ahmad Mustapha Wali | Peizheng Wu | Nilgün Yücel | Stefan Frank | Nora Hollenstein | Lena Jäger
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present the MultiplEYE Text Corpus, a large-scale, document-level, multi-parallel resource designed to advance cross-linguistic research on reading and language processing. The corpus provides paragraph-level alignment for texts in 39 languages spanning seven language families and seven scripts. Unlike many existing multilingual corpora, a substantial number of documents were originally written in languages other than English, reducing English-centric bias and supporting more typologically diverse investigations. The texts are carefully selected to balance linguistic richness with experimental feasibility, particularly for eye-tracking-while-reading studies. Developed within a multi-lab initiative, the MultiplEYE Text Corpus follows unified translation, alignment, and experimental design guidelines to ensure cross-linguistic comparability. Its inclusion of texts varying in type and difficulty enables research on discourse-level processing, genre effects, and individual differences across a wide range of languages. The text corpus and accompanying metadata provide a robust foundation for multilingual psycholinguistic and computational modeling research. Data and materials are publicly available at https://doi.org/10.23668/psycharchives.21750.
DCSN-NLP at MWE-2026 AdMIRe 2: Bridging Literal and Figurative Meaning Through Hierarchical Multimodal Reasoning
David Cotigă | Sergiu Nisioi
Proceedings of the 22nd Workshop on Multiword Expressions (MWE 2026)
This paper presents our system for the MWE-2026 AdMIRe 2.0 shared task, which aimed to advance multimodal idiomatic understanding across 15 languages. We address the task of selecting, from a set of five images, the one that best represents either the literal or idiomatic meaning of a given compound in context. Our approach follows a multi-step pipeline: a large language model (LLM) first determines whether the compound is used literally or idiomatically and generates auxiliary text, consisting of an idiomatic meaning explanation and a visual description of the literal meaning. An ensemble of three CLIP models then identifies the two images most semantically similar to the appropriate generated text via a voting mechanism. Finally, the LLM selects the best image from these two candidates.
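The voting step of the ensemble can be sketched roughly as follows; the similarity scores below are made-up stand-ins for real CLIP image-text similarities, and the exact tie-breaking used in the paper is not specified here:

```python
# Illustrative sketch of a two-candidate voting mechanism over an
# ensemble of image-text similarity scorers (e.g., three CLIP models).
from collections import Counter

def vote_top2(similarities_per_model):
    """Each model votes for its two most similar images; the two images
    with the most votes overall become the candidate pair."""
    votes = Counter()
    for sims in similarities_per_model:            # one score list per model
        ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
        for img in ranked[:2]:
            votes[img] += 1
    return [img for img, _ in votes.most_common(2)]

# Three mock models scoring five candidate images:
sims = [
    [0.10, 0.80, 0.75, 0.20, 0.05],
    [0.15, 0.70, 0.60, 0.65, 0.10],
    [0.05, 0.85, 0.40, 0.75, 0.20],
]
print(vote_top2(sims))  # image 1 gets 3 votes, image 3 gets 2 -> [1, 3]
```

The surviving pair is then handed back to the LLM for the final choice.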
2025
A Comparison of Elementary Baselines for BabyLM
Rareș Păpușoi | Sergiu Nisioi
Proceedings of the First BabyLM Workshop
This paper explores multiple simple baselines for the BabyLM challenge, covering random models, elementary frequency-based predictions, n-gram language models, LSTMs with several tokenizers (BPE, Unigram, SuperBPE), and GPT-BERT, the winning architecture from the prior BabyLM edition. The evaluation focuses on the BLiMP and BLiMP-Supplement benchmarks. Our experiments show that Strict-Small can sometimes outperform Strict, that performance can be highly sensitive to tokenization, and that data efficiency is important. A simple word-frequency baseline scored unexpectedly high, which led to identifying an evaluation artifact in the pipeline: a system that returns identical logits for both sentences in a minimal pair can achieve maximal accuracy.
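The evaluation artifact can be reproduced in a few lines; the comparison rules and the minimal pairs below are illustrative, not the benchmark's actual implementation:

```python
# Sketch of the artifact: if a minimal-pair benchmark counts ties in
# favor of the "good" sentence (>= instead of >), a degenerate model
# that returns identical scores for every sentence is always "correct".

def accuracy(pairs, score, correct_if):
    """Fraction of pairs where the good sentence beats the bad one."""
    return sum(correct_if(score(good), score(bad)) for good, bad in pairs) / len(pairs)

pairs = [("the cats sleep", "the cats sleeps"),
         ("she has left", "she have left")]
constant = lambda sentence: 0.0        # identical score for everything

print(accuracy(pairs, constant, lambda g, b: g >= b))  # 1.0 -- the artifact
print(accuracy(pairs, constant, lambda g, b: g > b))   # 0.0 -- strict comparison
```

A strict comparison (or explicit tie handling) removes the loophole.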
Predicting Total Reading Time Using Romanian Eye-Tracking Data
Anamaria Hodivoianu | Oleksandra Kuvshynova | Filip Popovici | Adrian Luca | Sergiu Nisioi
Proceedings of the First International Workshop on Gaze Data and Natural Language Processing
This work introduces the first Romanian eye-tracking dataset for reading and investigates methods for predicting word-level total reading times (TRT). We develop and compare a range of models, from traditional machine learning using handcrafted linguistic features to fine-tuned Romanian BERT architectures, demonstrating strong correlations between predicted and observed reading times. Additionally, we propose a lexical simplification pipeline that leverages these TRT predictions to identify and substitute complex words, enhancing text readability. Our approach is integrated into an interactive web tool, illustrating the practical benefits of combining cognitive signals with NLP techniques for Romanian, a language with limited resources in this area.
Arabic to Romanian Machine Translation: A Case Study on Distant Language Pairs
Ioan Alexandru Hirica | Stefana Arina Tabusca | Sergiu Nisioi
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
This paper investigates machine translation between two linguistically distant languages, Arabic and Romanian, with a focus on translating from Arabic to Romanian. Dataset cleaning techniques are addressed, offering insights into the impact of translation for a language pair with limited resources. Using publicly available corpora (e.g., OPUS) and manually translated diplomatic texts, filtering methods are applied, such as duplicate removal, embedding similarity analysis (LEALLA), and Large Language Model (LLM)-based validation (Gemini-flash-002). Transformer models are trained and evaluated with diverse preprocessing pipelines that incorporate subword tokenization. Additionally, a fine-tuned LLM is assessed on this task and compared to its pre-trained counterpart. Despite computational limitations, the results emphasize the importance of targeted preprocessing and model adaptation in improving Arabic-Romanian translation quality.
RALS: Resources and Baselines for Romanian Automatic Lexical Simplification
Fabian Anghel | Cristea Petru-Theodor | Claudiu Creanga | Sergiu Nisioi
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
We introduce the first dataset that jointly covers both lexical complexity prediction (LCP) annotations and lexical simplification (LS) for Romanian, along with a comparison of lexical simplification approaches. We propose a methodology for ordering simplification suggestions using a pairwise ranking approximation method, arranging candidates from simple to complex based on a separate set of human judgments. In addition, we provide human lexical complexity annotations for 3,921 word samples in context. Finally, we explore several novel pipelines for complexity prediction and simplification and present the first text simplification system for Romanian.
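The pairwise ordering idea can be sketched as a simple win-count approximation; the Romanian word candidates and judgments below are toy examples, and the paper's actual ranking-approximation method may differ:

```python
# Illustrative sketch: order simplification candidates from simple to
# complex using pairwise human judgments, counting how often each
# candidate was judged the simpler member of a pair.
from collections import Counter

def order_by_wins(candidates, judgments):
    """judgments: (simpler, more_complex) pairs from human comparisons."""
    wins = Counter({c: 0 for c in candidates})
    for simpler, _harder in judgments:
        wins[simpler] += 1
    return sorted(candidates, key=lambda c: wins[c], reverse=True)

candidates = ["casa", "locuinta", "domiciliu"]
judgments = [("casa", "locuinta"), ("casa", "domiciliu"), ("locuinta", "domiciliu")]
print(order_by_wins(candidates, judgments))  # ['casa', 'locuinta', 'domiciliu']
```

With consistent judgments this recovers a total order without requiring annotators to rank full candidate lists directly.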
Exploring Mouse Tracking for Reading on Romanian Data
Cristina Maria Popescu | Sergiu Nisioi
Proceedings of the First International Workshop on Gaze Data and Natural Language Processing
In this paper, we investigate the use of the Mouse Tracking for Reading (MoTR) method for a sample of Romanian texts. MoTR is a novel measurement tool that is meant to collect word-by-word reading times. In a typical MoTR trial, the text is blurred, except for a small area around the mouse pointer, and the participants must move the mouse to reveal and read the text. In the current experiment, participants read such texts and afterwards answered comprehension questions, aiming to evaluate reading behavior and cognitive engagement. Mouse movement is recorded and analyzed to evaluate attention distribution across a sentence, providing insights into incremental language processing. Based on all the information gathered, the study confirms the feasibility of this method in a controlled setting and emphasizes MoTR’s potential as an accessible and naturalistic approach for studying text comprehension.
Archaeology at BEA 2025 Shared Task: Are Simple Baselines Good Enough?
Ana Roșu | Jany-Gabriel Ispas | Sergiu Nisioi
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)
This paper describes our approaches to five classification tasks from the Building Educational Applications (BEA) 2025 Shared Task. Our methods range from classical machine learning models to large-scale transformers with fine-tuning and prompting strategies. Despite the diversity of approaches, performance differences were often minor, suggesting a strong surface-level signal and the limiting effect of annotation noise, particularly around the “To some extent” label. Under lenient evaluation, simple models perform competitively, showing their effectiveness in low-resource settings. Our submissions ranked in the top 10 in four of the five tracks.
Graph-based RAG for Low-Resource Aromanian–Romanian Translation
Laurentiu G. Ghetoiu | Sergiu Nisioi
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
Aromanian, a linguistically and culturally significant yet low-resource Romance language, poses substantial challenges in computational linguistic research due to its limited NLP resources and non-standardized orthography. In this paper, we present an experimental study aimed at translating Aromanian texts into Romanian using a variety of modern NLP methodologies. We leverage two key resources: a parallel corpus consisting of approximately 3,000 sentence-aligned short stories and a dictionary of over 28,000 Aromanian-Romanian word pairs. Our approaches include Retrieval-Augmented Generation (RAG) supported by a graph-based alignment database, fine-tuning multilingual transformer models (specifically Meta’s NLLB), and parameter-efficient fine-tuning techniques such as LoRA applied to LLaMA-derived models. Evaluations using standard metrics (BLEU, chrF) demonstrate varied effectiveness across these methodologies, highlighting the strong performance of NLLB for general translation tasks, while RAG excels in translating familiar content. Our findings underline the complexities inherent in low-resource language translation and provide valuable insights into effective digital preservation and NLP adaptation strategies for underrepresented languages.
Archaeology at TSAR 2025 Shared Task: Teaching Small Models to do CEFR Simplifications
Rareş-Alexandru Roşcan | Sergiu Nisioi
Proceedings of the Fourth Workshop on Text Simplification, Accessibility and Readability (TSAR 2025)
Large language models (LLMs) have demonstrated strong performance in text simplification tasks, but their high computational cost and proprietary nature often limit practical use, especially in education. We explore open-source LLMs for CEFR-level text simplification. By reducing model size and computational requirements, our approach enables greater accessibility and deployment in educational environments. Our results show some of the lowest error rates in producing CEFR-compliant texts at TSAR 2025, using models with 8 billion and 1 billion parameters. Such approaches have the potential to democratize NLP technologies for real-world applications.
Dialectal and Low Resource Machine Translation for Aromanian
Alexandru-Iulius Jerpelea | Alina Radoi | Sergiu Nisioi
Proceedings of the 31st International Conference on Computational Linguistics
We present a neural machine translation system, the first of its kind, that can translate between Romanian, English, and Aromanian (an endangered Eastern Romance language). BLEU scores range from 17 to 32, depending on the direction and genre of the text. Alongside the system, we release the largest known Aromanian-Romanian bilingual corpus, consisting of 80k cleaned sentence pairs. Additional tools, such as a language-agnostic sentence embedder (used for both text mining and automatic evaluation) and a diacritics converter, are also presented. Lastly, we describe the online deployment of our quantized model in a CPU-driven, limited-resource scenario.
2024
A Multilingual Parallel Corpus for Aromanian
Iulia Petrariu | Sergiu Nisioi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We report the creation of the first high-quality corpus of Aromanian, an endangered Romance language spoken in the Balkans, together with sentence-aligned translations into Romanian, English, and French. The corpus is released publicly in several orthographic standards and consists of short stories collected in the 1970s in Romania. Additionally, we provide a corpus-based analysis of Aromanian linguistic particularities and of the overall demographic and political context that shapes the contemporary development of the language.
Cheap Ways of Extracting Clinical Markers from Texts
Anastasia Sandu | Teodor Mihailescu | Sergiu Nisioi
Proceedings of the 9th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2024)
This paper describes the Unibuc Archaeology team's work for the CLPsych 2024 Shared Task, which involved finding evidence within the text supporting the assigned suicide risk level. Two types of evidence were required: highlights (relevant spans extracted from the text) and summaries (evidence aggregated into a synthesis). Our work focuses on evaluating Large Language Models (LLMs) against an alternative method that is far more memory- and resource-efficient. The first approach employs an LLM to generate the summaries and guides it, through a processing chain, to provide sequences of text indicating suicidal tendencies as highlights. The second approach implements good old-fashioned machine learning: a tf-idf representation with a logistic regression classifier, whose most representative features we use to extract relevant highlights.
Archaeology at MLSP 2024: Machine Translation for Lexical Complexity Prediction and Lexical Simplification
Petru Cristea | Sergiu Nisioi
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)
We present the submissions of team Archaeology for the Lexical Simplification and Lexical Complexity Prediction Shared Tasks at BEA 2024. Our approach consists of two pipelines for generating lexical substitutions and estimating complexity: one operating on machine translations of the texts into English and one operating on the original language. For the LCP subtask, our XGBoost regressor is trained on engineered features (based primarily on English language resources) and shallow word-structure features. For the LS subtask, we use a locally executed quantized LLM to generate candidates and sort them by the complexity score computed with the pipeline designed for LCP. These pipelines provide distinct perspectives on the lexical simplification process, offering insights into the efficacy and limitations of employing Machine Translation versus direct processing of the original language data.
2023
Clark Kent at SemEval-2023 Task 5: SVMs, Transformers, and Pixels for Clickbait Spoiling
Dragos-stefan Mihalcea | Sergiu Nisioi
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
In this paper we present an analysis of our approaches for the SemEval-2023 Clickbait Challenge. We participated only in the sub-task aiming at identifying different clickbait spoiling types, comparing several machine learning and deep learning approaches. Our analysis confirms previous results on this task and shows that automatic methods are able to reach approximately 70% accuracy at predicting what type of additional content is needed to mitigate sensationalistic posts on social media. Furthermore, we provide a qualitative analysis of the results, showing that the models may do better in practice than the metric indicates, since the evaluation does not depend only on the predictor, but also on the typology we choose to define clickbait spoiling.
2022
Identifying Draft Bills Impacting Existing Legislation: a Case Study on Romanian
Corina Ceausu | Sergiu Nisioi
Proceedings of the Thirteenth Language Resources and Evaluation Conference
In our paper, we present a novel corpus of historical legal documents on the Romanian public procurement legislation and an annotated subset of draft bills that have been screened by legal experts and identified as impacting past public procurement legislation. Using the manual annotations provided by the experts, we attempt to automatically identify future draft bills that have the potential to impact existing policies on public procurement.
2020
CoCo: A Tool for Automatically Assessing Conceptual Complexity of Texts
Sanja Stajner | Sergiu Nisioi | Ioana Hulpuș
Proceedings of the Twelfth Language Resources and Evaluation Conference
Traditional text complexity assessment usually takes into account only syntactic and lexical complexity. The task of automatically assessing conceptual text complexity, important for maintaining the reader’s interest and for adapting texts for struggling readers, has only been proposed recently. In this paper, we present CoCo, a tool for automatic assessment of conceptual text complexity based on the current state-of-the-art unsupervised approach. We make the code and API freely available for research purposes, and describe the code and the possibilities for its personalization and adaptation in detail. We compare the current implementation with the state of the art, discussing the influence of the choice of entity linker on the performance of the tool. Finally, we present results obtained on two widely used text simplification corpora, discussing the full potential of the tool.
2018
A Detailed Evaluation of Neural Sequence-to-Sequence Models for In-domain and Cross-domain Text Simplification
Sanja Štajner | Sergiu Nisioi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Content Extraction and Lexical Analysis from Customer-Agent Interactions
Sergiu Nisioi | Anca Bucur | Liviu P. Dinu
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text
In this paper, we provide a comparative lexical analysis of the vocabulary used by customers and agents in an Enterprise Resource Planning (ERP) environment, along with a potential solution for cleaning the data and extracting content relevant for NLP. We demonstrate that the vocabulary of the language that prevails in ERP conversations diverges strongly from the standardized dictionary and differs further from general language usage as extracted from the Common Crawl corpus. Moreover, even in business communication circumstances where a high usage of standardized language would be expected, code-switching and non-standard expressions are predominant, emphasizing once more the discrepancy between day-to-day language use and the standardized one.
2017
Exploring Neural Text Simplification Models
Sergiu Nisioi | Sanja Štajner | Simone Paolo Ponzetto | Liviu P. Dinu
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
We present the first attempt at using sequence-to-sequence neural networks to model text simplification (TS). Unlike previously proposed automated TS systems, our neural text simplification (NTS) systems are able to perform lexical simplification and content reduction simultaneously. An extensive human evaluation of the output has shown that NTS systems achieve almost perfect grammaticality and meaning preservation of output sentences, and a higher level of simplification than state-of-the-art automated TS systems.
2016
Vanilla Classifiers for Distinguishing between Similar Languages
Sergiu Nisioi | Alina Maria Ciobanu | Liviu P. Dinu
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
In this paper we describe the submission of the UniBuc-NLP team for the Discriminating between Similar Languages Shared Task, DSL 2016. We present and analyze the results we obtained in the closed track of sub-task 1 (Similar languages and language varieties) and sub-task 2 (Arabic dialects). For sub-task 1 we used a logistic regression classifier with tf-idf feature weighting and for sub-task 2 a character-based string kernel with an SVM classifier. Our results show that good accuracy scores can be obtained with limited feature and model engineering. While certain limitations are to be acknowledged, our approach worked surprisingly well for out-of-domain, social media data, with 0.898 accuracy (3rd place) for dataset B1 and 0.838 accuracy (4th place) for dataset B2.
On the Similarities Between Native, Non-native and Translated Texts
Ella Rabinovich | Sergiu Nisioi | Noam Ordan | Shuly Wintner
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
A Visual Representation of Wittgenstein’s Tractatus Logico-Philosophicus
Anca Bucur | Sergiu Nisioi
Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)
In this paper we will discuss a method for data visualization together with its potential usefulness in digital humanities and philosophy of language. We compiled a multilingual parallel corpus from different versions of Wittgenstein’s Tractatus Logico-philosophicus, including the original in German and translations into English, Spanish, French, and Russian. Using this corpus, we compute a similarity measure between propositions and render a visual network of relations for different languages.
Comparing Speech and Text Classification on ICNALE
Sergiu Nisioi
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
In this paper we explore and compare a speech and text classification approach on a corpus of native and non-native English speakers. We experiment on a subset of the International Corpus Network of Asian Learners of English containing the recorded speeches and the equivalent text transcriptions. Our results suggest a high correlation between the spoken and written classification results, showing that native accent is highly correlated with grammatical structures found in text.
Using Word Embeddings to Translate Named Entities
Octavia-Maria Şulea | Sergiu Nisioi | Liviu P. Dinu
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
In this paper we investigate the usefulness of neural word embeddings for translating Named Entities (NEs) from a resource-rich language to a language low on resources relevant to the task at hand, introducing a novel yet simple way of obtaining bilingual word vectors. We build on two observations: Mikolov et al. (2013b) show that word vector models trained on comparable corpora yield comparable vector space representations, reducing word translation to finding a rotation matrix, and Zou et al. (2013) show that bilingual word embeddings can improve Chinese Named Entity Recognition (NER) and English-to-Chinese phrase translation. Using the sentence-aligned English-French EuroParl corpora, we show that word embeddings extracted from a merged corpus (the corpus resulting from the merger of the two aligned corpora) can be used for NE translation. We extrapolate that word embeddings trained on merged parallel corpora are useful in Named Entity Recognition and Translation tasks for resource-poor languages.
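The Mikolov et al. (2013b) idea referenced in the abstract, learning a linear map between two embedding spaces from a seed dictionary and translating by nearest neighbour, can be sketched on synthetic vectors; the embeddings below are random placeholders, not real EuroParl-trained vectors:

```python
# Toy sketch of cross-lingual translation via a learned linear map.
# src/tgt are synthetic "embeddings" of seed dictionary pairs.
import numpy as np

rng = np.random.default_rng(0)
src = rng.normal(size=(5, 4))      # source-language vectors (seed pairs)
true_map = rng.normal(size=(4, 4))
tgt = src @ true_map               # corresponding target-language vectors

# Least-squares solution of min_W ||src @ W - tgt||
W, *_ = np.linalg.lstsq(src, tgt, rcond=None)

# "Translate" the third seed word: map it, then take the
# cosine-nearest neighbour in the target space.
mapped = src[2] @ W
sims = tgt @ mapped / (np.linalg.norm(tgt, axis=1) * np.linalg.norm(mapped))
best = int(sims.argmax())
print(best)  # → 2, the correct translation pair
```

With real embeddings the map is trained on a seed lexicon and evaluated on held-out word pairs; the exact recovery here is an artifact of the noise-free synthetic setup.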
A Corpus of Native, Non-native and Translated Texts
Sergiu Nisioi | Ella Rabinovich | Liviu P. Dinu | Shuly Wintner
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
We describe a monolingual English corpus of original and (human) translated texts, with an accurate annotation of speaker properties, including the original language of the utterances and the speaker’s country of origin. We thus obtain three sub-corpora of texts reflecting native English, non-native English, and English translated from a variety of European languages. This dataset will facilitate the investigation of similarities and differences between these kinds of sub-languages. Moreover, it will facilitate a unified comparative study of translations and language produced by (highly fluent) non-native speakers, two closely-related phenomena that have only been studied in isolation so far.
2014
On the syllabic structures of Aromanian
Sergiu Nisioi
Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)
2013
A clustering approach for translationese identification
Sergiu Nisioi | Liviu P. Dinu
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013
Co-authors
- Liviu P. Dinu 7
- Sanja Štajner 3
- Anca Bucur 2
- Anamaria Hodivoianu 2
- Oleksandra Kuvshynova 2
- Ella Rabinovich 2
- Rareş-Alexandru Roşcan 2
- Shuly Wintner 2
- Jamal Abdul Nasir 1
- Dima Abu Romi 1
- Cengiz Acarturk 1
- Matilda Agdler 1
- Anton Marius Alexandru 1
- Fabian Anghel 1
- Mohd Faizan Ansari 1
- Annalisa Arcidiacono 1
- Hanne B. Søndergaard Knudsen 1
- Elizabete Ausma Velta Barisa 1
- Not Battesta Soliva 1
- Ana Bautista 1
- Steven Bedrick 1
- Lisa Beinborn 1
- Yevgeni Berzak 1
- Nedeljka Bjelanović 1
- Anna Bondar 1
- Anna Isabelle Bothmann 1
- Jan Brasser 1
- Caterina Cacioli 1
- Corina Ceausu 1
- Ilze Ceple 1
- Adelina Cerpja 1
- Dalí Chirino 1
- Jan Chromý 1
- Alina Maria Ciobanu 1
- Alessandro Corona Mendozza 1
- David Cotigă 1
- Claudiu Creangă 1
- Petru Cristea 1
- Nazik Dinctopal Deniz 1
- Cui Ding 1
- A. Seza Doğruöz 1
- Ana Došen 1
- Kristian Elersič 1
- Inmaculada Fajardo 1
- Stefan L. Frank 1
- Zigmunds Freibergs 1
- Angelina Ganebnaya 1
- Shan Gao 1
- Laurentiu G. Ghetoiu 1
- Jéssica Gomes 1
- Annjo Klungervik Greenall 1
- Alba Haveriku 1
- Miao He 1
- Ioan Alexandru Hirica 1
- Nora Hollenstein 1
- Yu-Yin Hsu 1
- Ioana Hulpuș 1
- Amanda Isaksen 1
- Jany-Gabriel Ispas 1
- Deborah N. Jakobi 1
- Andreia Janeiro 1
- Kristine Jensen de López 1
- Alexandru-Iulius Jerpelea 1
- Aleksandar Jevremovic 1
- Vojislav Jovanovic 1
- Lena Ann Jäger 1
- Ramunė Kasperė 1
- Nik Kharlamov 1
- Dorota Klimek-Jankowska 1
- Nelda Kote 1
- Vanja Kovic 1
- Sara Košutar 1
- Izabela Krejtz 1
- Thyra Krosness 1
- Hanna Kędzierska 1
- Eilam Lavy 1
- Ella Lion 1
- Adrian Luca 1
- Paula Luegi 1
- Kaidi Lõo 1
- Mircea Mihai Marin 1
- Clara Martin 1
- Ana Matić 1
- Svitlana Matvieieva 1
- Teodor Mihailescu 1
- Dragos-stefan Mihalcea 1
- Valeriia Modina 1
- Irina Moise 1
- Jurgita Motiejūnienė 1
- Diane C. Mézière 1
- Xavier Mínguez-López 1
- Marie-Luise Müller 1
- Tolgonai Nasipbek kyzy 1
- Johanne S. K. Nedergård 1
- Noam Ordan 1
- Patrizia Paggio 1
- Marijan Palmović 1
- Maria Christina Panagiotopoulou 1
- Alberto Parola 1
- Eva Pavlinušić Vilus 1
- Klaudia Petersen 1
- Iulia Petrariu 1
- Cristea Petru-Theodor 1
- Anja Podlesek 1
- Simone Paolo Ponzetto 1
- Cristina Maria Popescu 1
- Filip Popovici 1
- Eva Pospíšilová 1
- Marta Praulina 1
- Mikuláš Preininger 1
- Loredana Pungă 1
- Helena Pérez 1
- Rareș Păpușoi 1
- Alina Radoi 1
- Diego Rossini 1
- Špela Rot 1
- Ana Roșu 1
- Anastasia Sandu 1
- Habib Sani Yahaya 1
- Irina A. Sekerina 1
- Anne Gabija Skadina 1
- Jordi Solé-Casals 1
- Maja Stegenwallner-Schütz 1
- Stefana Arina Tabusca 1
- Chiara Tschirner 1
- Saara M. Varjopuro 1
- Spyridoula Varlokosta 1
- João Veríssimo 1
- Oskari Juhapekka Virtanen 1
- Nemanja Vračar 1
- Mila Vulchanova 1
- Ahmad Mustapha Wali 1
- Peizheng Wu 1
- Nilgün Yücel 1
- Iria de-Dios-Flores 1
- Lonneke van der Plas 1
- Anila Çepani 1
- Ayşegül Özkan 1
- Marta Łockiewicz 1
- Octavia-Maria Şulea 1