Niccoló Antonelli-Dziri

2026

While large language models (LLMs) excel at semantic reasoning, their discrete token-based outputs introduce limitations for fine-grained regression tasks requiring continuous scoring. We address graded word-sense plausibility estimation by reformulating it as a Natural Language Inference (NLI) regression problem, adapting DeBERTa-v3-large with NLI pretraining and a regression head to predict continuous plausibility scores from story-sense pairs. We compare this model against BERT, vanilla DeBERTa, SmolLM variants, and state-of-the-art LLMs under various prompting strategies, and show that the NLI-finetuned model achieves superior rank correlation and alignment with human judgments. While several baselines collapse toward mean predictions and LLMs show unstable prompting sensitivity, our findings establish NLI-informed pretraining as highly effective for narrative plausibility regression, highlighting fundamental LLM limitations for word sense disambiguation.

Co-authors

Lorenzo Vaiani 1

Omar Wafaay 1

Venues

SemEval 1
WS 1
