Proceedings of the 2nd LUHME Workshop
Henrique Lopes Cardoso | Rui Sousa-Silva | Maarit Koponen | Antonio Pareja-Lora
Understanding Social Interactions in the Era of LLMs – the Challenges of Transparency
Chloé Clavel
Research on AI and social interaction is not entirely new — it falls within the field of social and affective computing, which emerged in the late 1990s. To understand social interactions, the research community has long drawn on both artificial intelligence and social science. In recent years, however, the field has shifted toward a dominant focus on generative large language models (LLMs). These models are undeniably powerful but often opaque. In this talk, I will present our current work on developing machine learning approaches — from classical methods to LLMs — for modeling the socio-emotional layer of interaction, with a particular focus on improving model transparency. I will also briefly present some of the applications we are developing to support human skill development, particularly in the fields of education and health.
Building Common Ground in Dialogue: A Survey
Tatiana Anikina | Alina Leippert | Simon Ostermann
Common ground plays a crucial role in human communication and the grounding process helps to establish shared knowledge. However, common ground is also a heavily loaded term that may be interpreted in different ways depending on the context. The scope of common ground ranges from domain-specific and personal shared experiences to common sense knowledge. Representationally, common ground can be uni- or multi-modal, and static or dynamic. In this survey, we attempt to systematize different facets of common ground in dialogue and position it within the current landscape of NLP research that often relies on the usage of language models (LMs) and task-specific short-term interactions. We outline different dimensions of common ground and describe modeling approaches for several grounding tasks, discuss issues caused by the lack of common ground in human-LM interactions, and suggest future research directions. This survey serves as a roadmap of what to pay attention to when equipping a dialogue system with grounding capabilities and provides a summary of current research on grounding in dialogue, categorizing 448 papers and compiling a list of the available datasets.
Do Large Language Models Understand Morality Across Cultures?
Hadi Mohammadi | Yasmeen F. S. S. Meijer | Efthymia Papadopoulou | Ayoub Bagheri
Recent advancements in large language models (LLMs) have established them as powerful tools across numerous domains. However, persistent concerns about embedded biases, such as gender, racial, and cultural biases arising from their training data, raise significant questions about the ethical use and societal consequences of these technologies. This study investigates the extent to which LLMs capture cross-cultural differences and similarities in moral perspectives. Specifically, we examine whether LLM outputs align with patterns observed in international survey data on moral attitudes. To this end, we employ three complementary methods: (1) comparing variances in moral scores produced by models versus those reported in surveys, (2) conducting cluster alignment analyses to assess correspondence between country groupings derived from LLM outputs and survey data, and (3) directly probing models with comparative prompts using systematically chosen token pairs. Our results reveal that current LLMs often fail to reproduce the full spectrum of cross-cultural moral variation, tending to compress differences and exhibit low alignment with empirical survey patterns. These findings highlight a pressing need for more robust approaches to mitigate biases and improve cultural representativeness in LLMs. We conclude by discussing the implications for the responsible development and global deployment of LLMs, emphasizing fairness and ethical alignment.
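The cluster-alignment idea in method (2) can be illustrated with a small, self-contained sketch: cluster countries by survey-based moral scores and by LLM-derived scores, then measure how well the two groupings agree. The data below are randomly generated stand-ins, and the choice of k-means plus an adjusted Rand index is an assumption for illustration, not necessarily the paper's exact procedure.

# Hypothetical sketch of methods (1) and (2): variance comparison and cluster alignment.
# All numbers are synthetic placeholders, not survey or model outputs.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
n_countries, n_items = 40, 10

# Rows: countries; columns: mean approval scores for moral items (placeholder values).
survey_scores = rng.uniform(0, 10, size=(n_countries, n_items))
llm_scores = survey_scores + rng.normal(0, 2.0, size=survey_scores.shape)  # noisy stand-in for model outputs

k = 4  # number of cultural clusters, chosen arbitrarily here
survey_clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(survey_scores)
llm_clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(llm_scores)

# Method (1): does the model compress cross-country variation?
print("mean per-item variance (survey):", survey_scores.var(axis=0).mean())
print("mean per-item variance (LLM):   ", llm_scores.var(axis=0).mean())
# Method (2): 1.0 = identical country groupings, ~0 = chance-level agreement.
print("adjusted Rand index:", adjusted_rand_score(survey_clusters, llm_clusters))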
A Nightmare on LLMs Street: On the Importance of Cultural Awareness in Text Adaptation for LRLs
David C. T. Freitas | Henrique Lopes Cardoso
Large Language Models (LLMs) have revolutionized how we generate, interact with, and process language. Still, these models are biased toward WEIRD (Western, Educated, Industrialized, Rich, and Democratic) values. This bias is not merely linguistic but also cultural. Sociocultural contexts influence how people express ideas, interpret meaning, and communicate. In low-resource language settings, where data and cultural representation are limited, this issue becomes even more pronounced when models are applied without cultural adaptation, often leading to outputs that are irrelevant, inaccessible, or even harmful. In this paper, we argue for the importance of incorporating sociocultural context into LLMs. We review existing frameworks that explore culture in Natural Language Processing (NLP), and examine some work aimed at culturally aligning language models. As an illustrative scenario, we analyze the case of Guinea-Bissau. In this linguistically and culturally diverse country, Portuguese is the official language but not the primary means of communication for most of the population, highlighting the urgent need to adapt educational materials to the local sociocultural context. Finally, we propose a revised framework to address the challenge of adapting educational materials to diverse contexts, aiming to improve both the relevance and pedagogical impact of text adaptation.
Terminologists as Stewards of Meaning in the Age of LLMs: A Digital Humanism Perspective
Barbara Heinisch
Digital Humanism calls for a reconfiguration of the development of digital technologies that embeds interdisciplinary collaboration, ethical reflexivity and critical scrutiny into both the design and evaluation of these systems. From a Digital Humanism perspective, terminologists play a vital role in safeguarding language understanding in specialized domains where clarity and consistency are critical (in both monolingual and multilingual contexts). This conceptual paper, therefore, examines the role of terminologists (and terminology) in the era of LLMs, with a focus on their function as stewards of meaning in specialized communication. The study draws on the principles of Digital Humanism to critically assess how terminologists can counteract various ethically and epistemologically problematic features characterizing current LLM development and deployment. In this regard, terminologists can ensure terminological precision and help preserve linguistic diversity and knowledge excluded from LLMs. They may also support inclusive, transparent and accountable digital infrastructures. By documenting system- and variety-specific terms, they counteract the homogenizing tendencies of LLMs and challenge epistemic monopolies. Their expertise bridges disciplines and reinforces that language is not neutral, but culturally and institutionally embedded. As educators and stewards of meaning, terminologists empower users to critically engage with LLM outputs, ensuring that language technologies remain ethically grounded and responsive to human contexts and values.
A Toolbox for Improving Evolutionary Prompt Search
Daniel Grießhaber | Maximilian Kimmich | Johannes Maucher | Thang Vu
Evolutionary prompt optimization has demonstrated effectiveness in refining prompts for LLMs. However, existing approaches lack robust operators and efficient evaluation mechanisms. In this work, we propose several key improvements to evolutionary prompt optimization, some of which carry over to prompt optimization more broadly: 1) decomposing evolution into distinct steps to enhance the evolution and its control, 2) introducing an LLM-based judge to verify the evolutions, 3) integrating human feedback to refine the evolutionary operator, and 4) developing more efficient evaluation strategies that maintain performance while reducing computational overhead. Our approach improves both optimization quality and efficiency. We release our code, enabling prompt optimization on new tasks and facilitating further research in this area.
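As a rough illustration of the overall loop described here, a minimal evolutionary prompt-search sketch follows, with an LLM-based judge gating each mutation and evaluation on a random subsample. The call_llm placeholder, the mutation prompt, and the judging prompt are assumptions for illustration, not the operators released with the paper.

# Minimal sketch of an evolutionary prompt-search loop with an LLM judge gate.
import random

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def mutate(prompt: str) -> str:
    # Evolution step 1: generate a candidate variant of the current prompt.
    return call_llm(f"Rewrite this instruction to make it clearer and more specific:\n{prompt}")

def judge_ok(old: str, new: str) -> bool:
    # Evolution step 2: an LLM judge verifies that the variant is a faithful improvement.
    verdict = call_llm(f"Is prompt B a faithful, improved variant of prompt A? Answer yes or no.\nA: {old}\nB: {new}")
    return verdict.strip().lower().startswith("yes")

def score(prompt: str, dev_set, sample_size=32) -> float:
    # Cheaper evaluation: accuracy on a small random subsample instead of the full dev set.
    batch = random.sample(dev_set, min(sample_size, len(dev_set)))
    return sum(call_llm(f"{prompt}\n\n{x}") == y for x, y in batch) / len(batch)

def evolve(seed_prompts, dev_set, generations=5, population=8):
    pool = [(score(p, dev_set), p) for p in seed_prompts]
    for _ in range(generations):
        parents = sorted(pool, reverse=True)[:population // 2]
        children = []
        for _, parent in parents:
            child = mutate(parent)
            if judge_ok(parent, child):  # only keep mutations the judge accepts
                children.append((score(child, dev_set), child))
        pool = sorted(parents + children, reverse=True)[:population]
    return max(pool)[1]  # best prompt found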
Improving LLMs for Machine Translation Using Synthetic Preference Data
Dario Vajda
|
Domen Vreš
|
Marko Robnik Šikonja
Large language models have emerged as effective machine translation systems. In this paper, we explore how a general instruction-tuned large language model can be improved for machine translation using relatively few easily produced data resources. Using Slovene as a use case, we improve the GaMS-9B-Instruct model using Direct Preference Optimization (DPO) training on a programmatically curated and enhanced subset of a public dataset. As DPO requires pairs of quality-ranked instances, we generated its training dataset by translating English Wikipedia articles using two LLMs, GaMS-9B-Instruct and EuroLLM-9B-Instruct. We ranked the resulting translations based on heuristics coupled with automatic evaluation metrics such as COMET. The evaluation shows that our fine-tuned model outperforms both models involved in the dataset generation. In comparison to the baseline models, the fine-tuned model achieved a COMET score gain of around 0.04 and 0.02, respectively, on translating Wikipedia articles. It also more consistently avoids language and formatting errors.
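A hedged sketch of how such preference pairs might be assembled follows: each source sentence is translated by two models and the two candidates are ranked with a reference-free COMET model. The model identifiers, the translate placeholder, and the output format (as expected by common DPO trainers such as TRL's DPOTrainer) are assumptions for illustration; the additional heuristics mentioned in the abstract are omitted.

# Sketch: build DPO preference pairs from two candidate translations per source sentence.
from comet import download_model, load_from_checkpoint

def translate(model_name: str, src: str) -> str:
    raise NotImplementedError("call the respective translation LLM here")

# Reference-free COMET quality estimation model (illustrative choice).
comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))

def build_preference_pairs(sources, model_a="model-A", model_b="model-B"):
    candidates = [(s, translate(model_a, s), translate(model_b, s)) for s in sources]
    batch = [{"src": s, "mt": mt} for s, a, b in candidates for mt in (a, b)]
    scores = comet_model.predict(batch, batch_size=8).scores
    pairs = []
    for i, (src, a, b) in enumerate(candidates):
        score_a, score_b = scores[2 * i], scores[2 * i + 1]
        chosen, rejected = (a, b) if score_a >= score_b else (b, a)
        pairs.append({"prompt": f"Translate into Slovene:\n{src}", "chosen": chosen, "rejected": rejected})
    return pairs  # format expected by common DPO trainers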
Probing Vision-Language Understanding through the Visual Entailment Task: promises and pitfalls
Elena Pitta | Tom Kouwenhoven | Tessa Verhoef
This study investigates the extent to which the Visual Entailment (VE) task serves as a reliable probe of vision-language understanding in multimodal language models, using the LLaMA 3.2 11B Vision model as a test case. Beyond reporting performance metrics, we aim to interpret what these results reveal about the underlying possibilities and limitations of the VE task. We conduct a series of experiments across zero-shot, few-shot, and fine-tuning settings, exploring how factors such as prompt design, the number and order of in-context examples and access to visual information might affect VE performance. To further probe the reasoning processes of the model, we used explanation-based evaluations. Results indicate that three-shot inference outperforms the zero-shot baselines. However, additional examples introduce more noise than they provide benefits. Additionally, the order of the labels in the prompt is a critical factor that influences the predictions. In the absence of visual information, the model has a strong tendency to hallucinate and imagine content, raising questions about the model’s over-reliance on linguistic priors. Fine-tuning yields strong results, achieving an accuracy of 83.3% on the e-SNLI-VE dataset and outperforming the state-of-the-art OFA-X model. Additionally, the explanation evaluation demonstrates that the fine-tuned model provides semantically meaningful explanations similar to those of humans, with a BERTScore F1-score of 89.2%. We do, however, find comparable BERTScore results in experiments with limited vision, questioning the visual grounding of this task. Overall, our results highlight both the utility and limitations of VE as a diagnostic task for vision-language understanding and point to directions for refining multimodal evaluation methods.
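The explanation-based evaluation mentioned here amounts to comparing model-generated explanations against human references with BERTScore. A minimal sketch using the bert_score package follows; the example strings are invented for illustration.

# Sketch: score model explanations against human reference explanations with BERTScore.
from bert_score import score

model_explanations = ["A man cannot be sleeping while he is riding a bicycle."]
human_explanations = ["Someone riding a bike is awake, so he cannot be asleep."]

precision, recall, f1 = score(model_explanations, human_explanations, lang="en")
print(f"BERTScore F1: {f1.mean().item():.3f}")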
Do Large Language Models understand how to be judges?
Nicolò Donati | Paolo Torroni | Giuseppe Savino
This paper investigates whether Large Language Models (LLMs) can effectively act as judges for evaluating open-ended text generation tasks, such as summarization, by interpreting nuanced editorial criteria. Traditional metrics like ROUGE and BLEU rely on surface-level overlap, while human evaluations remain costly and inconsistent. To address this, we propose a structured rubric with five dimensions: coherence, consistency, fluency, relevance, and ordering, each defined with explicit sub-criteria to guide LLMs in assessing semantic fidelity and structural quality. Using a purpose-built dataset of Italian news summaries generated by GPT-4o, each tailored to isolate specific criteria, we evaluate LLMs’ ability to assign scores and rationales aligned with expert human judgments. Results show moderate alignment (Spearman’s ρ = 0.6–0.7) for criteria like relevance but reveal systematic biases, such as overestimating fluency and coherence, likely due to training data biases. We identify challenges in rubric interpretation, particularly for hierarchical or abstract criteria, and highlight limitations in cross-genre generalization. The study underscores the potential of LLMs as scalable evaluators but emphasizes the need for fine-tuning, diverse benchmarks, and refined rubrics to mitigate biases and enhance reliability. Future directions include expanding to multilingual and multi-genre contexts and exploring task-specific instruction tuning to improve alignment with human editorial standards.
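A minimal sketch of rubric-based judging and the alignment check follows. The call_llm placeholder stands in for the judge model, the rubric text is abbreviated rather than the paper's full five-dimension rubric with sub-criteria, and Spearman's ρ is computed with SciPy.

# Sketch: rubric-based LLM judging plus rank correlation with human scores.
import json
from scipy.stats import spearmanr

RUBRIC = """Rate the summary from 1 to 5 on each dimension:
coherence, consistency, fluency, relevance, ordering.
Return JSON, e.g. {"coherence": 4, "consistency": 5, "fluency": 4, "relevance": 3, "ordering": 5}."""

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your judge model here")

def judge(article: str, summary: str) -> dict:
    reply = call_llm(f"{RUBRIC}\n\nArticle:\n{article}\n\nSummary:\n{summary}")
    return json.loads(reply)  # per-dimension scores assigned by the judge

def agreement(llm_scores, human_scores) -> float:
    # Spearman's rho over matched items for one dimension, e.g. relevance.
    return spearmanr(llm_scores, human_scores).correlation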
Cross-Genre Native Language Identification with Open-Source Large Language Models
Robin Nicholls | Kenneth Alperin
Native Language Identification (NLI) is a crucial area within computational linguistics, aimed at determining an author’s first language (L1) based on their proficiency in a second language (L2). Recent studies have shown remarkable improvements in NLI accuracy due to advancements in large language models (LLMs). This paper investigates the performance of open-source LLMs on short-form comments from the Reddit-L2 corpus compared to their performance on the TOEFL11 corpus of non-native English essays. Our experiments revealed that fine-tuning on TOEFL11 significantly improved accuracy on Reddit-L2, demonstrating the transferability of linguistic features across different text genres. Conversely, models fine-tuned on Reddit-L2 also generalized well to TOEFL11, achieving over 90% accuracy and F1 scores for the native languages that appear in both corpora. This shows the strong transfer performance from long-form to short-form text and vice versa. Additionally, we explored the task of classifying authors as native or non-native English speakers, where fine-tuned models achieve near-perfect accuracy on the Reddit-L2 dataset. Our findings emphasize the impact of document length on model performance, with optimal results observed up to approximately 1200 tokens. This study highlights the effectiveness of open-source LLMs in NLI tasks across diverse linguistic contexts, suggesting their potential for broader applications in real-world scenarios.
Climate Change Discourse Over Time: A Topic-Sentiment Perspective
Chaya Liebeskind | Barbara Lewandowska-Tomaszczyk
This paper studies opinion dynamics and opinion shifts in social media in the context of climate change discourse, combining quantitative NLP analysis with a pragma-linguistic perspective. The research draws on two comparable collections of climate-related social media data from different time periods, each based on trending climate-related hashtags and annotated for relevant sentiment values. We used a BERT-based clustering approach to identify dominant themes within a combined dataset of tweets from both periods. Subsequently, a unified sentiment classification framework using a Large Language Model (LLM) was applied to reclassify all tweets, ensuring consistent and climate-specific sentiment analysis across both datasets. This methodology allows for a coherent comparison of public attitudes and their evolution across time periods and thematic structures. The analysis shows that, for the majority of identified topics, the more recent collection exhibits a significant reduction in negative sentiment and a dominance of positive sentiment, suggesting a temporal evolution in public sentiment toward climate change.
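A compact sketch of such a two-stage pipeline follows: tweets are embedded with a BERT-family sentence encoder, clustered into themes, and then relabeled for sentiment with one uniform LLM prompt. The encoder name, cluster count, and prompt wording are illustrative defaults, not necessarily those used in the study.

# Sketch: BERT-based topic clustering followed by uniform LLM sentiment relabeling.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your sentiment LLM here")

def cluster_topics(tweets, n_topics=10):
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder choice
    embeddings = encoder.encode(tweets, normalize_embeddings=True)
    return KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit_predict(embeddings)

def classify_sentiment(tweet: str) -> str:
    prompt = ("Classify the sentiment of this tweet toward climate change as "
              f"positive, negative, or neutral. Tweet: {tweet}\nLabel:")
    return call_llm(prompt).strip().lower()

# Usage idea: label every tweet from both collections with the same prompt, then
# compare per-topic sentiment distributions across the two time periods.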