This is an internal, incomplete preview of a proposed change to the ACL Anthology.
For efficiency reasons, we don't generate MODS or Endnote formats, and the preview may be incomplete in other ways, or contain mistakes.
Do not treat this content as an official publication.
LauraOcchipinti
Fixing paper assignments
Please select all papers that belong to the same person.
Indicate below which author they should be assigned to.
We report the findings of the 2024 Multilingual Lexical Simplification Pipeline shared task. We released a new dataset comprising 5,927 instances of lexical complexity prediction and lexical simplification on common contexts across 10 languages, split into trial (300) and test (5,627). 10 teams participated across 2 tracks and 10 languages with 233 runs evaluated across all systems. Five teams participated in all languages for the lexical complexity prediction task and 4 teams participated in all languages for the lexical simplification task. Teams employed a range of strategies, making use of open and closed source large language models for lexical simplification, as well as feature-based approaches for lexical complexity prediction. The highest scoring team on the combined multilingual data was able to obtain a Pearson’s correlation of 0.6241 and an ACC@1@Top1 of 0.3772, both demonstrating that there is still room for improvement on two difficult sub-tasks of the lexical simplification pipeline.
Assessing word complexity in Italian poses significant challenges, particularly due to the absence of a standardized dataset. This study introduces the first automatic model designed to identify word complexity for native Italian speakers. A dictionary of simple and complex words was constructed, and various configurations of linguistic features were explored to find the best statistical classifier based on Random Forest algorithm. Considering the probabilities of a word to belong to a class, a comparison between the models’ predictions and human assessments derived from a dataset annotated for complexity perception was made. Finally, the degree of accord between the model predictions and the human inter-annotator agreement was analyzed using Spearman correlation. Our findings indicate that a model incorporating both linguistic features and word embeddings performed better than other simpler models, also showing a value of correlation with the human judgements similar to the inter-annotator agreement. This study demonstrates the feasibility of an automatic system for detecting complexity in the Italian language with good performances and comparable effectiveness to humans in this subjective task.
Lexical simplification is a fundamental task in Natural Language Processing, aiming to replace complex words with simpler synonyms while preserving the original meaning of the text. This task is crucial for improving the accessibility of texts for different user groups. In this article, we present MultiLS-IT, the first dataset specifically designed for automatic lexical simplification in Italian, as part of the larger multilingual Multi-LS dataset. We offer a detailed description of the data collection and annotation process, along with a comprehensive statistical analysis of the dataset. Our dataset provides a basis for the development and evaluation of automatic simplification models, contributing to the broader goal of making texts more accessible to all readers.
Morphological analysis is vital for various NLP tasks as it provides insights into word structures and enhances the understanding of morphological and syntactic relationships. This study focuses on surface morphological segmentation for the Italian language, addressing the lack of detailed morphological representation in existing corpora. By utilizing an automatic segmenter, we aim to extract quantitative morphological parameters to understand their impact on word complexity perception. Our correlation analysis reveals that morphological features significantly influence the perceived complexity of words.
We present preliminary findings on the MultiLS dataset, developed in support of the 2024 Multilingual Lexical Simplification Pipeline (MLSP) Shared Task. This dataset currently comprises of 300 instances of lexical complexity prediction and lexical simplification across 10 languages. In this paper, we (1) describe the annotation protocol in support of the contribution of future datasets and (2) present summary statistics on the existing data that we have gathered. Multilingual lexical simplification can be used to support low-ability readers to engage with otherwise difficult texts in their native, often low-resourced, languages.