Proceedings of the 2nd Workshop on Analogical Abstraction in Cognition, Perception, and Language (Analogy-Angle II)
Giulia Rambelli | Filip Ilievski | Marianna Bolognesi | Pia Sommerauer
Tore-Klose: Record Scorer, Goal Hunter, Machine? Human Association Norms for German Personal Name Compounds
Annerose Eichel | Tana Deeg | Andre Blessing | Milena Belosevic | Sabine Arndt-Lappe | Sabine Schulte Im Walde
We present a collection of human association norms for German personal name compounds (PNCs) such as “Tore-Klose” (goal-Klose) and corresponding full names (Miroslav Klose), thus providing a novel testbed for PNC evaluation, i.e., analogical vs. contrastive as well as positive vs. negative perception effects. The associations are obtained in an online experiment with German native speakers, analyzed regarding our novel intertwined PNC–person association setup, and accompanied by an LLM-based synthetic generation approach for augmentation.
Using Large Language Models to Perform MIPVU-Inspired Automatic Metaphor Detection
Sebastian Reimann | Tatjana Scheffler
Automatic metaphor detection has often been inspired by linguistic procedures for manual metaphor identification. In this work, we test how closely the steps required by the Metaphor Identification Procedure VU Amsterdam (MIPVU) can be translated into prompts for generative Large Language Models (LLMs) and how well three commonly used LLMs are able to perform these steps. We find that while the procedure itself can be modeled with only a few compromises, none of the language models is able to match the performance of supervised, fine-tuned methods for metaphor detection. All models failed to sufficiently filter out literal examples, where no contrast between the contextual and a more basic or concrete meaning was present. Both versions of LLaMa, however, showed interesting potential in detecting similarities between literal and metaphoric meanings that may be exploited in further work.
Modeling Background Knowledge with Frame Semantics for Fine-grained Sentiment Classification
Muhammad Okky Ibrohim | Valerio Basile | Danilo Croce | Cristina Bosco | Roberto Basili
Few-shot learning via in-context learning (ICL) is widely used in NLP, but its effectiveness is highly sensitive to example selection, often leading to unstable performance. To address this, we introduce BacKGen, a framework for generating structured Background Knowledge (BK) as an alternative to instance-based prompting. Our approach leverages Frame Semantics to uncover recurring conceptual patterns across data instances, clustering examples based on shared event structures and semantic roles. These patterns are then synthesized into generalized knowledge statements using a large language model (LLM) and injected into prompts to support contextual reasoning beyond surface-level cues. We apply BacKGen to Sentiment Phrase Classification (SPC), a task where polarity judgments frequently depend on implicit commonsense knowledge. In this setting, BK serves as an abstract representation of prototypical scenarios, enabling schematic generalization to help the model perform analogical reasoning by mapping new inputs onto generalized event structures. Experimental results with Mistral-7B and Llama3-8B demonstrate that BK-based prompting consistently outperforms standard few-shot approaches, achieving up to 29.94% error reduction.
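To make the described pipeline (cluster instances by shared frame-semantic structure, abstract each cluster into a background-knowledge statement with an LLM, inject the statements into the classification prompt) easier to picture, here is a minimal illustrative sketch in Python. It is not the BacKGen implementation: the frame annotations, example phrases, and prompt wording are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical frame-annotated training examples: (phrase, evoked frame, polarity).
examples = [
    ("prices skyrocketed after the merger", "Change_position_on_a_scale", "negative"),
    ("costs dropped once the plant reopened", "Change_position_on_a_scale", "positive"),
    ("the staff ignored every complaint", "Response", "negative"),
]

# 1) Cluster examples by shared frame (a stand-in for event-structure clustering).
clusters = defaultdict(list)
for phrase, frame, polarity in examples:
    clusters[frame].append((phrase, polarity))

# 2) For each cluster, an LLM would be asked to abstract the instances into one
#    generalized background-knowledge statement. Here we only build that request.
def bk_generation_prompt(frame, members):
    listing = "\n".join(f"- {p} ({pol})" for p, pol in members)
    return (
        f"The following phrases evoke the frame '{frame}':\n{listing}\n"
        "Write one general statement about how such situations are typically evaluated."
    )

# 3) Inject the generated statements (placeholders here) into the SPC prompt.
def classification_prompt(bk_statements, target_phrase):
    bk_block = "\n".join(f"- {s}" for s in bk_statements)
    return (
        f"Background knowledge:\n{bk_block}\n\n"
        f"Classify the sentiment of the phrase: \"{target_phrase}\"\n"
        "Answer with positive or negative."
    )

if __name__ == "__main__":
    for frame, members in clusters.items():
        print(bk_generation_prompt(frame, members), "\n")
    print(classification_prompt(
        ["Sudden increases in costs are usually perceived negatively."],
        "rents surged across the city",
    ))
```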
On choosing the vehicles of metaphors without a body: evidence from Large Language Models
Veronica Mangiaterra | Chiara Barattieri Di San Pietro | Federico Frau | Valentina Bambini | Hamad Al-Azary
Since the advent of Large Language Models (LLMs), much work has been devoted to comparing the linguistic abilities of humans and machines. Figurative language, which is known to rely on pragmatic inferential processes as well as lexical-semantic, sensorimotor, and socio-cognitive information, has often been used as a benchmark for this comparison. In the present study, we build on previous behavioral evidence showing that both distributional and sensorimotor variables come into play when people are asked to produce novel and apt metaphors, and examine the behavior of LLMs in the same task. We show that, while distributional features still hold a special status, LLMs are insensitive to the sensorimotor aspects of words. This points to the lack of human-like, experience-based grounding in LLMs trained on linguistic input only, while offering further support for the multimodality of conceptual knowledge involved in metaphor processing in humans.
Prompting Metaphoricity: Soft Labeling with Large Language Models in Popular Communication of Science Tweets in Spanish
Alec Sánchez-Montero | Gemma Bel-Enguix | Sergio-Luis Ojeda-Trueba | Gerardo Sierra
In this paper, we explore how large language models (LLMs) can be used to assign soft labels for metaphoricity in Popular Communication of Science (PCS) tweets written in Spanish. Instead of treating metaphors as a binary yes/no phenomenon, we focus on their graded nature and the variability commonly found in human annotations. Through a combination of prompt design and quantitative evaluation over a stratified sample of our dataset, we show that GPT-4 can consistently assign probabilistic scores not only for general metaphoricity but also for specific metaphor types (Direct, Indirect, and Personification). The results show that, while LLMs align reasonably well with average human judgments for some categories, capturing the subtle patterns of inter-annotator disagreement remains a challenge. We present a corpus of 3,733 tweets annotated with LLM-generated soft labels, a valuable resource for further metaphor analysis in scientific discourse and figurative language annotation with LLMs.
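As a rough illustration of how soft labels can be elicited from a chat LLM (not the authors' actual prompt, schema, or evaluation setup), the sketch below asks a model for probabilistic scores per metaphor type and parses its JSON reply. The model name, prompt wording, and score fields are assumptions.

```python
import json
from openai import OpenAI  # assumes the openai>=1.0 Python client and an API key in the environment

client = OpenAI()

def build_prompt(tweet: str) -> str:
    # Illustrative wording and fields; not taken from the paper.
    return (
        "Rate the tweet below for metaphoricity. Reply with JSON only, for example "
        '{"metaphoricity": 0.7, "direct": 0.1, "indirect": 0.6, "personification": 0.0}, '
        "where each value is a probability between 0 and 1.\n\n"
        "Tweet: " + tweet
    )

def soft_labels(tweet: str, model: str = "gpt-4") -> dict:
    """Ask the model for probabilistic metaphor labels and parse its JSON reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(tweet)}],
        temperature=0,
    )
    # In practice the reply may need stripping of code fences before parsing.
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    print(soft_labels("Las neuronas son pequeñas fábricas de energía."))
```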
HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models
Ashray Gupta | Rohan Joseph | Sunny Rai
Analogies test a model’s ability to infer implicit relationships between concepts, making them a key benchmark for evaluating reasoning capabilities. While large language models (LLMs) are widely evaluated for reasoning in English, their abilities in Indic languages remain understudied, limiting our understanding of whether these models generalize across languages. To address this gap, we introduce a new Hindi Analogy Test Set (HATS), comprising 405 multiple-choice questions sourced from Indian government exams. We benchmark state-of-the-art multilingual LLMs using various prompting strategies and introduce a grounded Chain of Thought approach that leverages cognitive theories of analogical reasoning. This approach improves model performance on Hindi analogy questions. Our experiments show that models perform best with English prompts, irrespective of the prompting strategy. Our test set addresses the lack of a critical resource to evaluate LLM reasoning capabilities in Hindi. The test set is publicly available for research purposes at https://github.com/Inequilazitive/HATS-Hindi_Analogy_Test_Set
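To make the idea of a multiple-choice analogy prompt with a cognitively grounded chain of thought concrete, here is a small sketch. The wording of the reasoning steps and the sample item are invented for illustration and are not taken from HATS.

```python
# Illustrative sketch of a grounded chain-of-thought prompt for an A : B :: C : ? item.
# The steps loosely follow the classic encode-relation / map / evaluate stages of
# analogical reasoning; the concrete wording and example are hypothetical.

def grounded_cot_prompt(a: str, b: str, c: str, options: list[str]) -> str:
    choices = "\n".join(f"{i + 1}. {opt}" for i, opt in enumerate(options))
    return (
        f"Solve the analogy: {a} : {b} :: {c} : ?\n"
        f"Options:\n{choices}\n\n"
        "Think step by step:\n"
        "1. State the relation between the first pair.\n"
        "2. Apply the same relation to the third term.\n"
        "3. Check each option against that relation.\n"
        "4. Answer with the number of the best option.\n"
    )

if __name__ == "__main__":
    print(grounded_cot_prompt("doctor", "hospital", "teacher",
                              ["school", "book", "student", "lesson"]))
```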
Simulating Emotional Intelligence in LLMs through Behavioral Conditioning and Analogical Retrieval
G. Sai Linisha Reddy | Mounil Hiren Kankhara | Mridul Maheshwari | Swayam Bansal | Rishit Kapoor | Himesh Reddy M | Bagesh Kumar
Human emotional expression emerges from a complex interplay of verbal, para-verbal, and non-verbal cues. This paper presents a dual-path framework for emotionally grounded text generation in large language models by integrating behavioral metadata with analogical retrieval. We introduce the MECC (Multimodal Emotionally Conditioned Corpus), a dataset of 1,764 question-answer pairs collected via structured interviews and annotated across 15 emotion categories with tone, response time, and body language. A LLaMA-3.1-8B-Instruct model is fine-tuned on MECC using behavior-encoded prompts, and inference is supported by a metadata-filtered Retrieval-Augmented Generation (RAG) pipeline. Detailed emotion-level analysis reveals trade-offs between emotional fidelity and semantic diversity, emphasizing the need for nuanced evaluation. This study contributes a richly annotated multimodal emotion corpus, a metadata-driven RAG architecture, and a well-structured framework for building emotionally aware language models. Our code is available at https://github.com/MetaResearcher/Framework
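The abstract mentions a metadata-filtered retrieval step before generation. A minimal sketch of such a step is given below, using sentence-transformers embeddings plus a simple emotion-metadata filter; the corpus entries, field names, and embedding model are assumptions, not the MECC schema or the paper's pipeline.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Toy stand-in for a behaviorally annotated QA corpus (fields are hypothetical).
corpus = [
    {"question": "How did you feel when the project failed?",
     "answer": "I was devastated but tried to stay calm.",
     "emotion": "sadness", "tone": "low", "body_language": "slumped shoulders"},
    {"question": "What made the celebration special?",
     "answer": "Everyone I love was in the same room.",
     "emotion": "joy", "tone": "bright", "body_language": "open posture"},
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
corpus_embeddings = encoder.encode([entry["question"] for entry in corpus])

def retrieve(query: str, emotion: str, top_k: int = 1):
    """Return top-k entries matching the target emotion, ranked by cosine similarity."""
    candidates = [i for i, e in enumerate(corpus) if e["emotion"] == emotion]  # metadata filter
    if not candidates:
        return []
    query_emb = encoder.encode([query])[0]
    sims = [
        float(np.dot(query_emb, corpus_embeddings[i])
              / (np.linalg.norm(query_emb) * np.linalg.norm(corpus_embeddings[i])))
        for i in candidates
    ]
    ranked = sorted(zip(candidates, sims), key=lambda x: x[1], reverse=True)
    return [corpus[i] for i, _ in ranked[:top_k]]

if __name__ == "__main__":
    for hit in retrieve("Tell me about a moment of loss.", emotion="sadness"):
        print(hit["answer"], "|", hit["body_language"])
```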
Can Stories Help LLMs Reason? Curating Information Space Through Narrative
Vahid Sadiri Javadi | Johanne Trippas | Yash Kumar Lal | Lucie Flek
Narratives are widely recognized as a powerful tool for structuring information and facilitating comprehension of complex ideas in various domains such as science communication. This paper explores whether generating narratives can serve “as a specialized mode of thinking” that improves the reasoning abilities of Large Language Models (LLMs). We introduce Story of Thought (SoT), a novel prompt-driven reasoning framework that guides LLMs to construct narratives around the problem statement to solve the task more effectively. SoT enables LLMs to integrate narrative techniques such as metaphor and analogy into their reasoning process. Our experiments show that SoT significantly improves the LLMs’ problem-solving abilities on various tasks, including physics, chemistry, and biology in both JEEBench and GPQA (e.g., SoT resulted in 13% improvement compared to CoT when using GPT-4). To validate LLM-based evaluation for generated narratives, we conduct a human annotation of the narrative techniques used by LLMs. Our results show strong inter-annotator agreement between Llama 3 70B and human annotators. This work brings LLM reasoning closer to human cognitive processes by mirroring mechanisms such as analogical problem-solving, which are central to how humans understand and process complex ideas.
Testing Spatial Intuitions of Humans and Large Language and Multimodal Models in Analogies
Ivo Bueno | Anna Bavaresco | João Miguel Cunha | Philipp Wicke
Language and Vision-Language Models exhibit impressive language capabilities akin to human reasoning. However, unlike humans who acquire language through embodied, interactive experiences, these models learn from static datasets without real-world interaction. This difference raises questions about how they conceptualize abstract notions and whether their reasoning aligns with human cognition. We investigate spatial conceptualizations of LLMs and VLMs by conducting analogy prompting studies with LLMs, VLMs, and human participants. We assess their ability to generate and interpret analogies for spatial concepts. We quantitatively compare the analogies produced by each group, examining the impact of multimodal inputs and reasoning mechanisms. Our findings indicate that generative models can produce and interpret analogies but differ significantly from human reasoning in their abstraction of spatial concepts, with variability influenced by input modality, model size, and prompting methods, and with analogy-based prompts not consistently enhancing alignment. Contributions include a methodology for probing generative models through analogies; a comparative analysis of analogical reasoning among models and humans; and insights into the effect of multimodal inputs on reasoning.