Serena Auriemma


2025

ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models
Martina Miliani | Serena Auriemma | Alessandro Bondielli | Emmanuele Chersoni | Lucia Passaro | Irene Sucameli | Alessandro Lenci
Findings of the Association for Computational Linguistics: ACL 2025

Large Language Models (LLMs) are increasingly used in tasks requiring interpretive and inferential accuracy. In this paper, we introduce ExpliCa, a new dataset for evaluating LLMs in explicit causal reasoning. ExpliCa uniquely integrates both causal and temporal relations presented in different linguistic orders and explicitly expressed by linguistic connectives. The dataset is enriched with crowdsourced human acceptability ratings. We tested LLMs on ExpliCa through prompting and perplexity-based metrics. We assessed seven commercial and open-source LLMs, revealing that even top models struggle to reach 0.80 accuracy. Interestingly, models tend to confound temporal relations with causal ones, and their performance is also strongly influenced by the linguistic order of the events. Finally, perplexity-based scores and prompting performance are differently affected by model size.

2024

Lost in Disambiguation: How Instruction-Tuned LLMs Master Lexical Ambiguity
Luca Capone | Serena Auriemma | Martina Miliani | Alessandro Bondielli | Alessandro Lenci
Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)

This paper investigates how decoder-only instruction-tuned LLMs handle lexical ambiguity. Two distinct methodologies are employed: eliciting rating scores from the model via prompting and analysing the cosine similarity between pairs of polysemous words in context. Ratings and embeddings are obtained by providing pairs of sentences from Haber and Poesio (2021) to the model. These ratings and cosine similarity scores are compared with each other and with the human similarity judgments in the dataset. Surprisingly, the model scores show only a moderate correlation with the subjects’ similarity judgments and no correlation with the target word embedding similarities. A vector space anisotropy inspection has also been performed, as a potential source of the experimental results. The analysis reveals that the embedding spaces of two out of the three analyzed models exhibit poor anisotropy, while the third model shows relatively moderate anisotropy compared to previous findings for models with similar architecture (Ethayarajh 2019). These findings offer new insights into the relationship between generation quality and vector representations in decoder-only LLMs.

2023

Challenging Specialized Transformers on Zero-shot Classification
Serena Auriemma | Mauro Madeddu | Martina Miliani | Alessandro Bondielli | Alessandro Lenci | Lucia Passaro
Proceedings of the 9th Italian Conference on Computational Linguistics (CLiC-it 2023)

2022

Neural Readability Pairwise Ranking for Sentences in Italian Administrative Language
Martina Miliani | Serena Auriemma | Fernando Alva-Manchego | Alessandro Lenci
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Automatic Readability Assessment aims at assigning a complexity level to a given text, which could help improve the accessibility of information in specific domains, such as the administrative one. In this paper, we investigate the behavior of a Neural Pairwise Ranking Model (NPRM) for sentence-level readability assessment of Italian administrative texts. To deal with data scarcity, we experiment with cross-lingual, cross- and in-domain approaches, and test our models on Admin-It, a new parallel corpus in the Italian administrative language, containing sentences simplified using three different rewriting strategies. We show that NPRMs are effective in zero-shot scenarios (~0.78 ranking accuracy), especially with ranking pairs containing simplifications produced by overall rewriting at the sentence level, and that the best results are obtained by adding in-domain data (achieving perfect performance for such sentence pairs). Finally, we investigate where NPRMs failed, showing that the characteristics of the training data, rather than its size, have a bigger effect on a model’s performance.