Sergiu Nisioi
This paper explores multiple simple baselines for the BabyLM challenge, covering random models, elementary frequency-based predictions, n-gram language models, LSTMs with several tokenizers (BPE, Unigram, SuperBPE), and GPT-BERT, the winning architecture from the previous BabyLM edition. The evaluation focuses on the BLiMP and BLiMP-Supplement benchmarks. Our experiments show that Strict-Small can sometimes outperform Strict, that performance can be highly sensitive to tokenization, and that data efficiency matters. A simple word-frequency baseline scored unexpectedly high, which led to identifying an evaluation artifact in the pipeline: a system that returns identical logits for both sentences in a minimal pair can achieve maximal accuracy.
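As a minimal, hypothetical illustration (not the official evaluation code), the snippet below shows how a tie-friendly comparison over minimal-pair scores rewards a scorer that returns identical values for both sentences.

```python
# Degenerate "model" that returns the same log-probability for every input,
# i.e. identical logits for both sentences of a minimal pair.
def score(sentence):
    return 0.0

pairs = [
    ("The cats sleep.", "The cats sleeps."),  # (acceptable, unacceptable)
    ("She has eaten.",  "She has ate."),
]

# Artifact: using >= means a tie counts as a correct prediction.
correct = sum(score(good) >= score(bad) for good, bad in pairs)
print(correct / len(pairs))  # prints 1.0 although the model is uninformative
```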
This paper describes our approaches for the 5 classification tasks of the Building Educational Applications (BEA) 2025 Shared Task. Our methods range from classical machine learning models to large-scale transformers with fine-tuning and prompting strategies. Despite the diversity of approaches, performance differences were often minor, suggesting a strong surface-level signal and the limiting effect of annotation noise, particularly around the “To some extent” label. Under lenient evaluation, simple models perform competitively, showing their effectiveness in low-resource settings. Our submissions ranked in the top 10 in four of the five tracks.
We present a neural machine translation system, the first of its kind, that can translate between Romanian, English, and Aromanian (an endangered Eastern Romance language). BLEU scores range from 17 to 32, depending on the direction and genre of the text. Alongside the system, we release the largest known Aromanian-Romanian bilingual corpus, consisting of 80k cleaned sentence pairs. We also present additional tools such as a language-agnostic sentence embedder (used for both text mining and automatic evaluation) and a diacritics converter. Lastly, we describe the online deployment of our quantized model in a CPU-only, limited-resource scenario.
We introduce the first dataset that jointly covers both lexical complexity prediction (LCP) annotations and lexical simplification (LS) for Romanian, along with a comparison of lexical simplification approaches. We propose a methodology for ordering simplification suggestions using a pairwise ranking approximation method, arranging candidates from simple to complex based on a separate set of human judgments. In addition, we provide human lexical complexity annotations for 3,921 word samples in context. Finally, we explore several novel pipelines for complexity prediction and simplification and present the first text simplification system for Romanian.
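As a rough sketch of the pairwise ranking idea, assuming only a generic comparator rather than the actual model trained on human judgments, candidates can be ordered by counting pairwise "wins":

```python
from itertools import combinations

def rank_by_pairwise_wins(candidates, simpler_than):
    """simpler_than(a, b) -> True if a is judged simpler than b."""
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        if simpler_than(a, b):
            wins[a] += 1
        else:
            wins[b] += 1
    # Sort from simple (most wins) to complex (fewest wins).
    return sorted(candidates, key=lambda c: -wins[c])

# Toy comparator: shorter word = simpler (a stand-in for a learned ranker).
print(rank_by_pairwise_wins(["locuință", "casă", "domiciliu"],
                            lambda a, b: len(a) < len(b)))
```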
Large language models (LLMs) have demonstrated strong performance in text simplification tasks, but their high computational cost and proprietary nature often limit practical use, especially in education. We explore open-source LLMs for CEFR-level text simplification. By reducing model size and computational requirements, our approach enables greater accessibility and deployment in educational environments. Our results show some of the lowest error rates in producing CEFR-compliant texts at TSAR 2025, using models with 8 billion and 1 billion parameters. Such approaches have the potential to democratize NLP technologies for real-world applications.
We present the submissions of team Archaeology for the Lexical Simplification and Lexical Complexity Prediction Shared Tasks at BEA 2024. Our approach consists of two pipelines for generating lexical substitutions and estimating their complexity: one that machine-translates the texts into English and one that works on the original language. For the LCP subtask, our XGBoost regressor is trained with engineered features (based primarily on English language resources) and shallow word-structure features. For the LS subtask, we use a locally executed quantized LLM to generate candidates and sort them by the complexity score computed with the pipeline designed for LCP. These pipelines provide distinct perspectives on the lexical simplification process, offering insights into the efficacy and limitations of employing machine translation versus processing the original language data directly.
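A rough sketch of such a pipeline, using XGBoost's scikit-learn interface with purely illustrative features and data (not the exact feature set from the paper):

```python
import numpy as np
from xgboost import XGBRegressor

def features(word, context):
    # Illustrative shallow features: word length and context size in tokens.
    return [len(word), context.count(" ") + 1]

# X_train / y_train would come from the LCP training data.
X_train = np.array([features("cat", "the cat sat"),
                    features("ubiquitous", "it is ubiquitous")])
y_train = np.array([0.1, 0.8])

regressor = XGBRegressor(n_estimators=200, max_depth=4)
regressor.fit(X_train, y_train)

# LS step: candidates (e.g. produced by a local quantized LLM) are sorted
# by predicted complexity of the substitution, simplest first.
candidates = ["common", "ubiquitous", "widespread"]
scores = regressor.predict(np.array([features(c, "it is ubiquitous") for c in candidates]))
print([c for _, c in sorted(zip(scores, candidates))])
```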
This paper describes the Unibuc Archaeology team's work for the CLPsych 2024 Shared Task, which involved finding evidence within the text supporting the assigned suicide risk level. Two types of evidence were required: highlights (relevant spans extracted from the text) and summaries (evidence aggregated into a synthesis). Our work focuses on evaluating Large Language Models (LLMs) against an alternative method that is much more memory- and resource-efficient. The first approach employs an LLM to generate the summaries and guides it, through a processing chain, to provide sequences of text indicating suicidal tendencies as highlights. The second approach implements a good old-fashioned machine learning pipeline: tf-idf features with a logistic regression classifier, whose most representative features we use to extract relevant highlights.
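A minimal sketch of the second approach on toy data, assuming scikit-learn; the actual features, data, and span-extraction heuristics differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

posts = ["i feel hopeless and cannot go on", "had a great day at the park"]
labels = [1, 0]  # 1 = contains risk evidence (illustrative only)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(posts)
clf = LogisticRegression().fit(X, labels)

# Tokens with the largest positive coefficients act as candidate highlights.
vocab = vectorizer.get_feature_names_out()
weights = clf.coef_[0]
top = {vocab[i] for i in weights.argsort()[::-1][:3]}
print([tok for tok in posts[0].split() if tok in top])
```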
We report the creation of the first high-quality corpus of Aromanian, an endangered Romance language spoken in the Balkans, together with sentence-aligned translations into Romanian, English, and French. The corpus is released publicly under several orthographic standards and consists of short stories collected in the 1970s in Romania. Additionally, we provide a corpus-based analysis of Aromanian linguistic particularities and of the overall demographic and political context which impacts the contemporary development of the language.
In this paper we present an analysis of our approaches for the SemEval-2023 Clickbait Challenge. We only participated in the sub-task aimed at identifying different clickbait spoiling types, comparing several machine learning and deep learning approaches. Our analysis confirms previous results on this task and shows that automatic methods are able to reach approximately 70% accuracy at predicting what type of additional content is needed to mitigate sensationalistic posts on social media. Furthermore, we provide a qualitative analysis of the results, showing that the models may do better in practice than the metric indicates, since the evaluation does not depend only on the predictor, but also on the typology we choose to define clickbait spoiling.
In our paper, we present a novel corpus of historical legal documents on Romanian public procurement legislation, together with an annotated subset of draft bills that have been screened by legal experts and identified as impacting past public procurement legislation. Using the manual annotations provided by the experts, we attempt to automatically identify future draft bills that have the potential to impact existing policies on public procurement.
Traditional text complexity assessment usually takes into account only syntactic and lexical complexity. The task of automatically assessing conceptual text complexity, important for maintaining the reader’s interest and for adapting texts for struggling readers, has only been proposed recently. In this paper, we present CoCo - a tool for automatic assessment of conceptual text complexity, based on the current state-of-the-art unsupervised approach. We make the code and API freely available for research purposes, and describe the code and the possibilities for its personalization and adaptation in detail. We compare the current implementation with the state of the art, discussing the influence of the choice of entity linker on the performance of the tool. Finally, we present results obtained on two widely used text simplification corpora, discussing the full potential of the tool.
In this paper, we provide a comparative lexical analysis of the vocabulary used by customers and agents in an Enterprise Resource Planning (ERP) environment, along with a potential solution for cleaning the data and extracting content relevant for NLP. We demonstrate that the vocabulary that prevails in ERP conversations diverges strongly from the standardized dictionary and differs further from general language usage as extracted from the Common Crawl corpus. Moreover, in specific business communication circumstances, where a high usage of standardized language would be expected, code switching and non-standard expressions are predominant, emphasizing once more the discrepancy between day-to-day language use and the standardized one.
We present the first attempt at using sequence-to-sequence neural networks to model text simplification (TS). Unlike previously proposed automated TS systems, our neural text simplification (NTS) systems are able to perform lexical simplification and content reduction simultaneously. An extensive human evaluation of the output has shown that NTS systems achieve almost perfect grammaticality and meaning preservation of output sentences and a higher level of simplification than the state-of-the-art automated TS systems.
In this paper we investigate the usefulness of neural word embeddings for translating Named Entities (NEs) from a resource-rich language into a language with few resources relevant to the task at hand, introducing a novel yet simple way of obtaining bilingual word vectors. We are inspired by observations in (Mikolov et al., 2013b), which show that training their word vector model on comparable corpora yields comparable vector space representations of those corpora, reducing the problem of translating words to that of finding a rotation matrix, and by results in (Zou et al., 2013), which show that bilingual word embeddings can improve Chinese Named Entity Recognition (NER) and English-to-Chinese phrase translation. Using the sentence-aligned English-French EuroParl corpora, we show that word embeddings extracted from a merged corpus (obtained by merging the two aligned corpora) can be used for NE translation. We extrapolate that word embeddings trained on merged parallel corpora are useful for Named Entity Recognition and Translation tasks in resource-poor languages.
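An illustrative sketch of the merged-corpus idea, assuming gensim's Word2Vec and two toy aligned sentence pairs; translation candidates for a named entity are then its nearest neighbours in the shared vector space:

```python
from gensim.models import Word2Vec

english = [["the", "european", "parliament"], ["the", "commission", "agrees"]]
french  = [["le", "parlement", "européen"], ["la", "commission", "accepte"]]

# Merge each aligned pair into a single training "sentence" so that words
# from both languages co-occur and end up in one shared vector space.
merged = [en + fr for en, fr in zip(english, french)]

model = Word2Vec(merged, vector_size=50, window=10, min_count=1, epochs=50)

# Nearest neighbours of an English NE are candidate French translations.
print(model.wv.most_similar("parliament", topn=3))
```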
In this paper we explore and compare a speech and text classification approach on a corpus of native and non-native English speakers. We experiment on a subset of the International Corpus Network of Asian Learners of English containing the recorded speeches and the equivalent text transcriptions. Our results suggest a high correlation between the spoken and written classification results, showing that native accent is highly correlated with grammatical structures found in text.
We describe a monolingual English corpus of original and (human) translated texts, with an accurate annotation of speaker properties, including the original language of the utterances and the speaker’s country of origin. We thus obtain three sub-corpora of texts reflecting native English, non-native English, and English translated from a variety of European languages. This dataset will facilitate the investigation of similarities and differences between these kinds of sub-languages. Moreover, it will facilitate a unified comparative study of translations and language produced by (highly fluent) non-native speakers, two closely-related phenomena that have only been studied in isolation so far.
In this paper we discuss a method for data visualization together with its potential usefulness in digital humanities and philosophy of language. We compiled a multilingual parallel corpus from different versions of Wittgenstein’s Tractatus Logico-philosophicus, including the original in German and translations into English, Spanish, French, and Russian. Using this corpus, we compute a similarity measure between propositions and render a visual network of relations for different languages.
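A small illustrative sketch of such a pipeline, using tf-idf cosine similarity as a stand-in for the similarity measure and networkx for the relation graph (assumptions for illustration, not the paper's exact setup):

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

propositions = [
    "The world is all that is the case.",
    "The world is the totality of facts, not of things.",
    "What is the case, a fact, is the existence of states of affairs.",
]

sim = cosine_similarity(TfidfVectorizer().fit_transform(propositions))

graph = nx.Graph()
for i in range(len(propositions)):
    for j in range(i + 1, len(propositions)):
        if sim[i, j] > 0.1:  # keep only sufficiently similar propositions
            graph.add_edge(i, j, weight=float(sim[i, j]))

print(graph.edges(data=True))
```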
In this paper we describe the submission of the UniBuc-NLP team for the Discriminating between Similar Languages Shared Task, DSL 2016. We present and analyze the results we obtained in the closed track of sub-task 1 (Similar languages and language varieties) and sub-task 2 (Arabic dialects). For sub-task 1 we used a logistic regression classifier with tf-idf feature weighting and for sub-task 2 a character-based string kernel with an SVM classifier. Our results show that good accuracy scores can be obtained with limited feature and model engineering. While certain limitations are to be acknowledged, our approach worked surprisingly well for out-of-domain, social media data, with 0.898 accuracy (3rd place) for dataset B1 and 0.838 accuracy (4th place) for dataset B2.
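A schematic reproduction of the two configurations on toy data (character n-grams with a linear SVM stand in for the string kernel; this is not the shared-task data or tuning):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Sub-task 1: similar languages and language varieties.
subtask1 = make_pipeline(TfidfVectorizer(), LogisticRegression())
subtask1.fit(["obrigado pela ajuda", "gracias por la ayuda"], ["pt", "es"])

# Sub-task 2: Arabic dialects, approximated with character n-grams + SVM.
subtask2 = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),
    SVC(kernel="linear"),
)
subtask2.fit(["شلونك اليوم", "ازيك النهاردة"], ["glf", "egy"])

print(subtask1.predict(["muito obrigado"]), subtask2.predict(["شلونكم"]))
```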