Veronique Hoste


2026

Evaluating emotion generation in large language models (LLMs) remains a challenging problem due to the subjective nature of emotions and the lack of reliable automatic evaluation metrics. In this paper, we introduce a robust and extensible benchmark for systematically assessing automatic metrics in emotion generation tasks. The benchmark currently includes 13 automatic evaluation metrics and five state-of-the-art LLMs, and can be easily extended without requiring additional human annotations. Through a correlation analysis with human evaluations on a carefully curated annotated subset, we identify the emotion recognition score (ERS) metric, computed with gpt-5-nano in a one-shot setting, as the most reliable automatic evaluator, achieving a correlation exceeding 0.99. Interestingly, despite relying on the same underlying LLM, the emotion absolute score (EAS) metric shows a negative correlation, demonstrating that LLM strength alone does not guarantee that an automatic metric aligns with human judgment. We also provide lightweight, non-LLM-based alternatives, R2_m and R3_m, in the emotion analogy score (EAnS) metric family, suitable for low-resource settings where large models are not accessible. A comprehensive per-class emotion analysis further highlights the strengths and weaknesses of the evaluated models. Overall, our results offer a practical and scalable framework for benchmarking emotion generation evaluation metrics and pave the way for more reliable, fair, and interpretable emotional language evaluation.
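The correlation analysis mentioned in this abstract can be illustrated with a minimal sketch. It assumes that each automatic metric and the human annotators each yield one aggregate score per evaluated system and that agreement is measured with Pearson correlation; the abstract does not state which coefficient or aggregation level was used. The names `pearson`, `rank_metrics`, `metric_a`, and `metric_b`, as well as all score values, are hypothetical and not taken from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def rank_metrics(metrics, human_scores):
    """Sort metrics by their correlation with human scores, highest first."""
    scored = [(name, pearson(scores, human_scores)) for name, scores in metrics.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical per-system scores, for illustration only (not from the paper).
human = [3.1, 3.8, 4.2, 2.5, 3.5]
metrics = {
    "metric_a": [0.52, 0.66, 0.74, 0.40, 0.60],  # rises and falls with the human scores
    "metric_b": [0.70, 0.55, 0.45, 0.80, 0.62],  # moves in the opposite direction
}

ranking = rank_metrics(metrics, human)
for name, r in ranking:
    print(f"{name}: r = {r:.3f}")
```

In this toy setup, a metric that moves against the human scores ends up with a negative coefficient, which is the kind of pattern the abstract reports for EAS. A rank-based coefficient such as Spearman's could be swapped in without changing the ranking logic.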
We present TrobaCor, a curated corpus of medieval troubadour poetry comprising 1668 unique Old Occitan texts by a large variety of authors. Clustering and stylometric experiments show that we can accurately model authorial style beyond topical content, even though formulaic or topically diverse genres remain challenging. Furthermore, we can model and detect traces of an author’s stylistic "DNA" even in short-form collaborative poetry, offering a uniquely fine-grained perspective in the field. In addition, we provide self-organizing map visualizations that offer an interpretable view of stylistic patterns across authors. TrobaCor is publicly released to support reproducible research in NLP and digital humanities on this low-resource historical corpus.
This paper introduces LoveHate, a new multi-topic corpus of user-generated arguments in Russian, collected from the historical data of the debate platform lovehate.ru. The dataset contains nearly 19,000 posts spanning 16 socially and politically relevant topics, each mapped to binary pro and con stances. We test multiple approaches to stance detection and stance generation across Russian and English data, including translated variants, using both classifier-based (Roberta, RuRoberta) and instruction-tuned generative (Llama, Qwen) models. Results demonstrate that language-specific pretraining yields the strongest performance for stance classification (F1 = 0.892 with RuRoberta), while multilingual generative models – when fine-tuned on sufficient data – can effectively generate stance in Russian without explicit Russian pretraining. Cross-domain experiments show that English datasets generalize better across corpora, whereas Russian data capture language- and culture-specific argumentation but are less effective for generalizable models. Generating topics remains a more challenging task for both Russian and English data. The dataset and accompanying results contribute to multilingual stance research and provide a valuable new resource for argument mining in Russian.
News recommendation systems play a central role in how readers access and process current events. Most recommenders’ underlying algorithmic strategies, however, prioritize user engagement over comprehension, amplifying risks of misinformation and filter bubbles. This study investigates whether fine-grained content-based recommendation strategies favor human knowledge retention and explores how such content-based recommendation can be operationalized using event coreference–based document modeling. To this end, we first measure the effect of manually curated content-based news recommendation on knowledge retention across five news topics with 126 Dutch-speaking participants. Next, we investigate document retrieval in three increasingly complex test settings, comparing a state-of-the-art event coreference resolution system for Dutch, which recommends news articles based on event chains, with a document similarity retrieval baseline that uses state-of-the-art embedding models. The results demonstrate that human-curated content-based recommendation can positively and significantly impact readers’ knowledge retention. Moreover, we show that a fine-grained coreference system can approach this level of human curation better than state-of-the-art document retrieval methods. In general, this holds potential for scalable, comprehension-oriented news recommendation.
Reasoning about alternatives is a fundamental component of human cognition and argumentation, yet it remains unclear whether large language models (LLMs) can coherently generate and assess them. This paper introduces Counter-Hypothesis Generation (CHG), a novel task for evaluating how LLMs construct plausible hypotheses when contextual information changes. Inspired by open-domain commonsense reasoning, where models infer and compare multiple explanations, CHG bridges commonsense and counterfactual reasoning by requiring models to generate hypotheses that remain logically consistent with modified premises. We present a test set annotated by a human expert and complemented with counter-hypotheses generated by OpenAI-o3 and DeepSeek-r1. Experimental results reveal that even advanced reasoning models exhibit notable limitations in counter-hypothesis generation.
Explanation generation has gained increasing attention in the field of NLP because it makes the output of classification models more intuitively understandable for humans. This is particularly relevant for complex semantic tasks such as irony detection, where there may not be any explicit linguistic markers. Generative models have shown great potential for irony explanation in earlier work, but most studies have been limited to English. Since English is the highest-resourced language, these capabilities may not carry over to other languages. To address this gap, this paper analyses the performance of generative models for explanation generation in Dutch, a lower-resourced language that is closely related to English. Our work shows that larger proprietary models, like GPT-4, can generate meaningful explanations based on relevant world knowledge, whereas smaller open-source models still struggle to perform this task. Besides quality evaluation, we also analyse the limitations of these models, showing that GPT models struggle most with verbosity and that both open-source and proprietary models exhibit circular reasoning ("this text is ironic because the person expresses this in an ironic way"). Finally, open-source models struggle in particular with Dutch because they fail to produce the relevant world knowledge that is required to understand the irony. All models and data used for the experiments are available at iRONNIE on Hugging Face.