Elio Musacchio
With the rising interest in Large Language Models, deep architectures capable of solving a wide range of Natural Language Generation tasks, an increasing number of open weights architectures have been developed and released online. In contrast with older architectures, which were aimed at solving specific linguistic assignments, Large Language Models have shown outstanding capabilities in solving several tasks at once, raising the question of whether they can truly comprehend natural language. Nevertheless, evaluating this kind of capability is far from easy. One of the proposed solutions so far is using benchmarks that combine various types of tasks. This approach is based on the premise that achieving good performance in each of these individual tasks can imply having developed a model capable of understanding language. However, while this assumption is not incorrect, it is evident that it is not sufficient, and the evaluation of Large Language Models still remains an open challenge. In this paper, we conduct a study aimed at highlighting the potential and limitations of current datasets and how a new evaluation setting applied to language-adapted Large Language Models may provide more insight than traditional approaches.
The rapid development of Large Language Models (LLMs) has called for robust benchmarks to assess their abilities, track progress, and compare iterations. While existing benchmarks provide extensive evaluations across diverse tasks, they predominantly focus on English, leaving other languages underserved. For Italian, the EVALITA campaigns have provided a long-standing tradition of classification-focused shared tasks. However, their scope does not fully align with the nuanced evaluation required for modern LLMs. To address this gap, we introduce “Challenge the Abilities of LAnguage Models in ITAlian” (CALAMITA), a collaborative effort to create a dynamic and growing benchmark tailored to Italian. CALAMITA emphasizes diversity in task design to test a wide range of LLM capabilities through resources natively developed in Italian by the community. This initiative includes a shared platform, live leaderboard, and centralized evaluation framework. This paper outlines the collaborative process, initial challenges, and evaluation framework of CALAMITA.
The challenge is designed to assess LLMs’ abilities in understanding lexical semantics through Word Sense Disambiguation, providing valuable insights into their performance. The idea is to cast the classical Word Sense Disambiguation task as a generative problem along two directions, giving rise to two tasks: (T1) given a target word and a sentence in which the word occurs, the LLM must generate the correct meaning definition; (T2) given a target word and a sentence in which the word occurs, the LLM should choose the correct meaning definition from a predefined set. For T1, we compare the generated definition against the correct one taken from a sense inventory, adopting metrics that measure the quality of the generated definition, such as ROUGE-L and BERTScore; for T2, a classical accuracy metric is used. For CALAMITA, we test LLMs in a zero-shot setting.
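As a minimal illustrative sketch (not the challenge's official scoring code), the two task metrics could be computed along the following lines, assuming the generated definitions, gold definitions, and multiple-choice predictions are already available as Python lists; the `rouge_score` and `bert_score` packages and all variable names are assumptions standing in for whatever implementation the organizers actually use.

```python
# Illustrative scoring sketch for T1 (definition generation) and T2 (multiple choice).
# Assumes pred_defs / gold_defs are parallel lists of strings and
# pred_choices / gold_choices are parallel lists of option indices;
# the challenge's real data format is not specified here.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def t1_scores(pred_defs, gold_defs, lang="it"):
    """Average ROUGE-L F1 and BERTScore F1 between generated and gold definitions."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
    rouge_l = sum(
        scorer.score(gold, pred)["rougeL"].fmeasure
        for gold, pred in zip(gold_defs, pred_defs)
    ) / len(gold_defs)
    _, _, f1 = bert_score(pred_defs, gold_defs, lang=lang)
    return {"rougeL": rouge_l, "bertscore_f1": f1.mean().item()}

def t2_accuracy(pred_choices, gold_choices):
    """Plain accuracy over the multiple-choice predictions."""
    correct = sum(p == g for p, g in zip(pred_choices, gold_choices))
    return correct / len(gold_choices)
```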
Dungeons & Dragons (D&D) is a classic tabletop game with a 50-year history. Its intricate and customizable gameplay allows players to create endless worlds and stories. Due to the highly narrative component of this game, D&D and many other interactive games represent a challenging setting for the Natural Language Generation (NLG) capabilities of LLMs. This paper explores using LLMs to generate new spells, which are one of the most captivating aspects of D&D gameplay. Due to the scarcity of resources available for such a specific task, we build a dataset of 3,259 instances by combining official and fan-made D&D spells. We considered several LLMs in generating spells, which underwent a quantitative and qualitative evaluation. Metrics including Bleu and BertScore were computed for quantitative assessments. Subsequently, we also conducted an in-vivo evaluation with a survey involving D&D players, which could assess the quality of the generated spells as well as their adherence to the rules. Furthermore, the paper emphasizes the open-sourcing of all models, datasets, and findings, aiming to catalyze further research on this topic.