Proceedings of the 15th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis (WASSA 2026)
Jeremy Barnes, Valentin Barriere, Orphée De Clercq, Roman Klinger, Célia Nouri, Debora Nozza, Pranaydeep Singh (Editors)
- Anthology ID:
- 2026.wassa-1
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Venues:
- WASSA | WS
- SIG:
- Publisher:
- Association for Computational Linguistics
- URL:
- https://preview.aclanthology.org/ingest-eacl/2026.wassa-1/
- DOI:
- ISBN:
- 979-8-89176-378-4
- PDF:
- https://preview.aclanthology.org/ingest-eacl/2026.wassa-1.pdf
Proceedings of the 15th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis (WASSA 2026)
Jeremy Barnes | Valentin Barriere | Orphée De Clercq | Roman Klinger | Célia Nouri | Debora Nozza | Pranaydeep Singh
Council of LLMs: Evaluating Capability of Large Language Models to Annotate Propaganda
Vivek Sharma | Shweta Jain | Mohammad Shokri | Sarah Ita Levitan | Elena Filatova
Data annotation is essential for supervised natural language processing tasks but remains labor-intensive and expensive. Large language models (LLMs) have emerged as promising alternatives, capable of generating high-quality annotations either autonomously or in collaboration with human annotators. However, their use for autonomous annotation is often questioned because of their ethical take on subjective matters. This study investigates the effectiveness of LLMs in autonomous and hybrid annotation setups for propaganda detection. We evaluate GPT and open-source models on two datasets from different domains: the Propaganda Techniques Corpus (PTC) for news articles and the Journalist Media Bias on X (JMBX) dataset for social media. Our results show that LLMs generally exhibit high recall but lower precision in detecting propaganda, often over-predicting persuasive content. Multi-annotator setups did not outperform the best models in the single-annotator setting, although they helped reasoning models boost their performance. Hybrid annotation, combining LLM and human input, achieved higher overall accuracy than LLM-only settings. We further analyze misclassifications and find that LLMs have higher sensitivity towards certain propaganda techniques such as loaded language, name calling, and doubt. Finally, using error typology analysis, we explore the reasoning the LLMs provide for their misclassifications. Our results show that although some studies report LLMs outperforming manual annotation, and LLMs could prove useful in hybrid annotation, their incorporation into the human annotation pipeline must be implemented with caution.
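A minimal sketch of how such a multi-annotator "council" setup can aggregate labels by majority vote; the labels, strict-majority threshold, and data layout below are illustrative assumptions, not the paper's exact protocol:

```python
from collections import Counter

def aggregate_council(annotations: list[list[str]]) -> list[str]:
    """Majority-vote aggregation over per-model label sequences.

    `annotations` holds one label list per council member, aligned by
    sentence; a sentence keeps a propaganda label only if a strict
    majority of members assigned that same label.
    """
    n_models = len(annotations)
    aggregated = []
    for sentence_labels in zip(*annotations):
        label, count = Counter(sentence_labels).most_common(1)[0]
        aggregated.append(label if count > n_models / 2 else "none")
    return aggregated

# Three hypothetical council members labeling four sentences:
council = [
    ["loaded_language", "none", "doubt", "name_calling"],
    ["loaded_language", "none", "none", "name_calling"],
    ["none", "none", "doubt", "name_calling"],
]
print(aggregate_council(council))
# -> ['loaded_language', 'none', 'doubt', 'name_calling']
```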
Emoji Reactions on Telegram: Unreliable Indicators of Emotional Resonance
Serena Tardelli | Lorenzo Alvisi | Lorenzo Cima | Stefano Cresci | Maurizio Tesconi
Emoji reactions are a frequently used feature of messaging platforms, yet their communicative role remains understudied. Prior work on emojis has focused predominantly on in-text usage, showing that emojis embedded in messages tend to amplify and mirror the author’s affective tone. This evidence has often been extended to emoji reactions, treating them as indicators of emotional resonance or user sentiment. However, they may reflect broader social dynamics. Here, we investigate the communicative function of emoji reactions on Telegram. We analyze over 650k crypto-related messages that received at least one reaction, annotating each with sentiment, emotion, persuasion strategy, and speech act labels, and inferring the sentiment and emotion of emoji reactions using both lexicons and LLMs. We uncover a systematic mismatch between message and reaction sentiment, with positive reactions dominating even for neutral or negative content. This pattern persists across rhetorical strategies and emotional tones, indicating that emojis used as reactions do not reliably function as indicators of emotional mirroring or resonance of the content, in contrast to findings reported for in-text emojis. Finally, we identify the features that most predict emoji engagement. Overall, our findings caution against treating emoji reactions as sentiment labels, highlighting the need for more nuanced approaches in sentiment and engagement analysis.
This paper presents a domain-specific transformer pipeline for quantifying social atmosphere in hostel reviews, an experiential dimension that travelers consistently prioritize but that existing NLP methods and booking platforms fail to capture. We train a cross-encoder on 4,994 manually annotated reviews and use it to pseudo-label 162,840 additional reviews; these labels are then distilled into a sentence-transformer bi-encoder, producing embeddings where proximity reflects social interaction level rather than generic sentiment. On held-out human-labeled data, the domain-adapted embeddings achieve F1 = 0.826, outperforming generic sentence embeddings (0.671) and zero-shot GPT-4o (0.774), with a 40-fold improvement in intra-class versus inter-class similarity. Aggregating predictions to the property level reveals that hostel socialness follows an approximate exponential distribution, confirming that highly social hostels are rare. This work formalizes socialness as a measurable semantic construct and provides a general template for extracting implicit experiential attributes from text at scale.
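As a rough illustration of the pseudo-label-then-distill pattern described above, here is a minimal sketch using the transformers and sentence-transformers libraries; the stand-in models and the pair-similarity target are assumptions for illustration, not the authors' configuration:

```python
from torch.utils.data import DataLoader
from transformers import pipeline
from sentence_transformers import SentenceTransformer, InputExample, losses

# Stage 1: pseudo-label unlabeled reviews with a trained classifier.
# A generic sentiment model stands in for the paper's socialness cross-encoder.
scorer = pipeline("text-classification",
                  model="distilbert-base-uncased-finetuned-sst-2-english")

def positivity(pred: dict) -> float:
    # Map the pipeline's (label, confidence) output to a [0, 1] score.
    return pred["score"] if pred["label"] == "POSITIVE" else 1 - pred["score"]

unlabeled = ["Great rooftop bar, met people every night.",
             "Quiet and clean, good wifi, kept to myself."]
pseudo = [(text, positivity(scorer(text)[0])) for text in unlabeled]

# Stage 2: distill the pseudo-scores into a bi-encoder so that embedding
# proximity tracks the target construct rather than generic similarity.
# Using the product of per-text scores as a pair-similarity target is a
# crude illustrative choice, not the authors' training objective.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
pairs = [InputExample(texts=[a, b], label=float(sa * sb))
         for i, (a, sa) in enumerate(pseudo)
         for j, (b, sb) in enumerate(pseudo) if i < j]
loader = DataLoader(pairs, shuffle=True, batch_size=16)
bi_encoder.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(bi_encoder))],
               epochs=1, warmup_steps=0)
```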
Predicting Convincingness in Political Speech: How Emotional Tone Shapes Persuasive Strength
Bhuvanesh Verma | Mounika Marreddy | Alexander Mehler
Emotional tone plays a central role in persuasion, yet its impact on computational assessments of political argument quality in real-world election campaign speeches remains understudied. In this work, we investigate whether positive emotional framing correlates with higher perceived convincingness in political arguments. We fine-tune language models on argument quality datasets and test their ability to transfer convincingness predictions to real-world campaign speeches. Using a corpus of U.S. presidential campaign speeches, we analyze emotional polarity in relation to predicted persuasive strength to test whether positively framed arguments are judged more convincing than neutral or negative ones. Our empirical analysis shows that political parties rely heavily on argumentation during their election campaigns. We also find evidence that politicians strategically employ emotional cues within their arguments during these campaign speeches, with positive emotions being more strongly associated with persuasive strength, for example in topics such as USMCA’s Effect on American Jobs and Agriculture, Border Control Policies, and Progressive Tax Reforms. At the same time, we find that negative emotions have a weaker yet still non-negligible influence on voter persuasion in topics such as City Crime and Civil Unrest and White Supremacist Violence (Charlottesville Incident).
Large language models (LLMs) are now widely used in applications that depend on closed-ended decisions, including automated surveys, policy screening, and decision-support tools. In such contexts, these models are typically expected to produce consistent binary or ternary responses (for example, Yes, No, or Neither) when presented with questions that are semantically equivalent. However, recent studies show that LLM outputs can be influenced by relatively minor changes in prompt wording, raising concerns about the reliability of their decisions under paraphrasing. In this paper, we conduct a systematic analysis of paraphrase robustness across five widely used LLMs. To support this evaluation, we develop a controlled dataset consisting of 200 opinion-based questions drawn from multiple domains, each accompanied by five human-validated paraphrases. All models are evaluated under deterministic inference settings and constrained to a fixed Yes/No/Neither response format. We assess model behavior using a set of complementary metrics that capture the stability of each evaluated model. DeepSeek Reasoner and Gemini 2.0 Flash show the highest stability when responding to paraphrased inputs, whereas Claude 3.7 Sonnet exhibits strong internal consistency but produces judgments that differ more frequently from those of other models. By contrast, GPT-3.5 Turbo and LLaMA 3 70B display greater sensitivity to surface-level variations in prompt phrasing. Overall, these findings suggest that robustness to paraphrasing is driven more by alignment strategies and reasoning design choices than by model size alone.
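One simple way to quantify stability under paraphrasing, as a hedged illustration (the metric and data layout are assumptions, not necessarily those used in the paper):

```python
from collections import Counter

def stability(responses_per_question: list[list[str]]) -> float:
    """Mean fraction of paraphrases agreeing with the per-question
    majority answer; 1.0 means fully stable under paraphrasing."""
    scores = []
    for answers in responses_per_question:  # e.g. 5 paraphrase answers
        majority_count = Counter(answers).most_common(1)[0][1]
        scores.append(majority_count / len(answers))
    return sum(scores) / len(scores)

print(stability([["Yes"] * 5,
                 ["Yes", "Yes", "No", "Yes", "Neither"]]))  # -> 0.8
```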
The Impact of Highlighting Subjective Language on Perceived News Trustworthiness
Mohammad Shokri | Vivek Sharma | Emily Klapper | Shweta Jain | Elena Filatova | Sarah Ita Levitan
The rise of misinformation and opinionated articles has made understanding how misleading or biased content influences readers an increasingly important problem. While most prior work focuses on detecting misinformation or deceptive language in real time, far less attention has been paid to how such content is perceived by readers, which is an essential component of misinformation’s effectiveness. In this study, we examine whether highlighting subjective sentences in news articles affects perceived trustworthiness. Using a controlled user experiment and 1,334 article–reader evaluations, we find that highlighting subjective content produces a modest yet statistically significant decrease in trust, with substantial variation across articles and participants. To explain this variation, we model trust change after highlighting subjective language as a function of article-level linguistic features and reader-level attitudes. Our findings suggest that readers’ reactions to highlighted subjective language are driven primarily by characteristics of the text itself, and that highlighting subjective language may help readers better assess the reliability of potentially misleading news articles.
Appraisal Trajectories in Narratives Reveal Distinct Patterns of Emotion Evocation
Johannes Schäfer | Janne Wagner | Roman Klinger
Understanding emotion responses relies on reconstructing how individuals appraise events. While prior work has studied emotion trajectories and inherent correlations with appraisals, it has considered appraisals only in a snapshot analysis. However, because appraisal is a complex, sequential process, we argue that it should be analyzed based on how it unfolds throughout a narrative. In this study, we investigate whether trajectories of appraisals are distinctive for different emotions in five-event stories – narratives where each of five sentences describes an event. We employ zero-shot prompting with a large language model to predict appraisals on sub-sequences of a narrative. We find that this approach is effective in identifying relevant appraisals in narratives, without prior knowledge of the evoked emotion, enabling a comprehensive analysis of appraisal trajectories. Furthermore, we are the first to quantitatively identify typical patterns of appraisal trajectories that distinguish emotions. For example, a rising trajectory for self-responsibility indicates trust, while a falling trajectory suggests anger.
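A minimal sketch of the prefix-wise prediction the abstract describes, applied to a five-event story; `ask_llm` is a hypothetical stand-in for the zero-shot prompting call, and the prompt wording is illustrative:

```python
def appraisal_trajectory(story: list[str], ask_llm) -> list[dict]:
    """Predict appraisals on each growing prefix of a five-event story.

    `ask_llm` is a hypothetical stand-in for a zero-shot prompting call
    that returns appraisal ratings (e.g., {"self-responsibility": 4}).
    """
    trajectory = []
    for i in range(1, len(story) + 1):
        prefix = " ".join(story[:i])
        trajectory.append(ask_llm(
            "Rate the narrator's appraisals (e.g., self-responsibility, "
            f"pleasantness) for the events so far:\n{prefix}"))
    return trajectory  # one appraisal profile per sub-sequence
```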
Exploring Subjective Tasks in Farsi: A Survey Analysis and Evaluation of Language Models
Donya Rooein | Flor Miriam Plaza-del-Arco | Debora Nozza | Dirk Hovy
Given Farsi’s speaker base of over 127 million people and the growing availability of digital text, including more than 1.3 million articles on Wikipedia, it is considered a middle-resource language. However, this label quickly crumbles when the situation is examined more closely. We focus on three subjective tasks (Sentiment Analysis, Emotion Analysis, and Toxicity Detection) and identify significant challenges in data availability and quality, despite overall increases in data availability. We review 110 publications on subjective tasks in Farsi and observe a lack of publicly available datasets. Furthermore, existing datasets often lack essential demographic factors, such as age and gender, that are crucial for accurately modeling subjectivity in language. When evaluating prediction models using the few available datasets, the results are highly unstable across both datasets and models. Our findings show that the volume of data alone is insufficient to improve a language’s standing in NLP.
Emotional Lexicons: How Large Language Models Predict Emotional Ratings of Russian Words
Polina V. Iaroshenko | Natalia V Loukachevitch
This study examines the capability of LLMs to predict emotional ratings of Russian words by comparing their assessments with both native speakers’ ratings and expert evaluations. The research utilises two datasets: the ENRuN database containing associative emotional ratings of Russian nouns by native speakers, and RusEmoLex, an expert-compiled lexicon. Various open-source LLMs were evaluated, including international models (Llama-3, Qwen 2.5), Russian-developed models, and Russian-adapted variants, representing three parameter scales. The findings reveal distinct patterns in model performance: Russian-adapted models demonstrated superior alignment with native speakers’ ratings, whilst model size was not a decisive factor. Conversely, larger models showed better performance in matching expert assessments, with language adaptation having minimal impact. Emotional or sensitive lexis with strong connotations produces a more substantial human-model gap.
Emotion-aware text simplification of user-generated content using LLMs
Anastasiia Bezobrazova | Daria Sokova | Constantin Orasan
Digital inclusion increasingly supports adults with intellectual disabilities (ID) to participate online, yet social media posts can be difficult to understand, particularly when they contain strong emotions, slang, or non-standard writing. This paper investigates whether large language models (LLMs) can simplify social media texts to improve cognitive accessibility and preserve emotional meaning. Using an accessibility-oriented prompt based on existing guidance, posts are simplified and emotion preservation is assessed. The results suggest that many simplified posts retain the same emotions, though changes occur, especially when emotions are weakly expressed or ambiguous. Qualitative analysis shows that simplification improves fluency and structure but can also shift perceived emotion through changes to tone, formatting, and other affective cues common in social media text. The research has also revealed that different LLMs produce very different outputs.
Crowd-Based Evaluation of Emotion Intensity Preservation in Spanish–Basque Tweet Machine Translation
Nora Aranberri
Machine translation (MT) systems perform well on standard benchmarks, yet their ability to preserve emotional meaning in informal user-generated content—particularly for low-resource languages—remains underexplored. We investigate the preservation of emotion intensity in Spanish–Basque tweet translation, focusing on Basque, an under-represented language in MT research. We compile a small, controlled corpus of Spanish reaction tweets and evaluate Basque translations from three publicly available systems through a crowd-based study. While all systems achieve comparable and above mid-range accuracy and fluency, emotion intensity is systematically attenuated in the translations, with greater loss for more emotionally intense inputs. A follow-up on highly emotional tweets shows that LLM prompting reduces emotion loss, yet substantial attenuation remains, highlighting emotion preservation as a persistent challenge in Spanish–Basque MT.
A Position Paper on Toxic Reasoning: Grounding Categories of Toxic Language in Implications and Attitudes
Stefan F. Schouten | Ilia Markov | Piek Vossen
Automatic detection of toxic language has the potential to considerably improve engagement with online spaces. Previous work has characterized toxic language detection as a classification problem, often using fine-grained classes for increased explainability. In this position paper, we argue for a particular way of operationalizing categories of toxic language. Our approach focuses on what is expressed or implied, and breaks down implications based on two traits: (i) the core content of what was expressed, and (ii) relevant stakeholders’ attitudes towards that content. We argue for an approach, which we call toxic reasoning, where such distinctions are made explicit. We point out the benefits of such an approach, and develop a toxic reasoning schema, which can explain categories of toxic language from diverse sources. We demonstrate this by mapping the classes of existing toxic language datasets to the schema. Toxic reasoning promises to provide improved understanding of implicit toxicity while increasing explainability.
Is Sentiment Banana-Shaped? Exploring the Geometry and Portability of Sentiment Concept Vectors
Laurits Lyngbaek | Pascale Feldkamp | Yuri Bizzoni | Kristoffer Nielbo | Kenneth Enevoldsen
Use cases of sentiment analysis in the humanities often require contextualized, continuous scores. Concept Vector Projections (CVP) offer a recent solution: by modeling sentiment as a direction in embedding space, they produce continuous, multilingual scores that align closely with human judgments. Yet the method’s portability across domains and underlying assumptions remain underexplored. We evaluate CVP across genres, historical periods, languages, and affective dimensions, finding that concept vectors trained on one corpus transfer well to others with minimal performance loss. To understand the patterns of generalization, we further examine the linearity assumption underlying CVP. Our findings suggest that while CVP is a portable approach that effectively captures generalizable patterns, its linearity assumption is approximate, pointing to potential for further development. Code available at: github.com/lauritswl/representation-transfer
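For readers unfamiliar with CVP, a minimal numpy sketch of the general idea: fit a concept direction from seed embeddings and score new texts by scalar projection onto it. The mean-difference construction shown is one common variant, not necessarily the paper's exact recipe:

```python
import numpy as np

def concept_vector(pos: np.ndarray, neg: np.ndarray) -> np.ndarray:
    """Unit direction from the negative to the positive seed centroid."""
    v = pos.mean(axis=0) - neg.mean(axis=0)
    return v / np.linalg.norm(v)

def cvp_score(embeddings: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Continuous sentiment score = scalar projection onto the concept axis."""
    return embeddings @ v

rng = np.random.default_rng(0)
pos_seeds = rng.normal(+1.0, 1.0, size=(20, 8))  # embeddings of positive seed texts
neg_seeds = rng.normal(-1.0, 1.0, size=(20, 8))  # embeddings of negative seed texts
v = concept_vector(pos_seeds, neg_seeds)
print(cvp_score(rng.normal(0.0, 1.0, size=(3, 8)), v))  # near 0 for neutral points
```

The linearity assumption the abstract probes is visible here: scoring is a single dot product, so it can only capture sentiment insofar as it varies along one straight axis of the embedding space.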
Disentangling Emotion Understanding and Generation in Large Language Models
Sadegh Jafari | Els Lefever | Veronique Hoste
Large language models (LLMs) have demonstrated strong performance on emotion understanding tasks, yet their ability to faithfully generate emotionally aligned text remains less well understood. We propose a semantic evaluation framework that jointly assesses emotion understanding, emotion generation, and internal consistency, using a VAE-based emotion cost matrix that captures graded semantic similarity between emotion categories. Our framework introduces four complementary metrics that disentangle baseline understanding, human-perceived emotion in generated text, generation quality, and model consistency. Experimental results show that while understanding and consistency scores are highly correlated, emotion generation exhibits substantially weaker correlations with these metrics. These findings motivate the development of specialized evaluation protocols that independently measure emotional understanding and generation, enabling more reliable assessments of LLM emotional intelligence.
News Credibility Assessment by LLMs and Humans: Implications for Political Bias
Pia Wenzel Neves | Charlott Jakob | Vera Schmitt
In an era of rapid misinformation spread, LLMs have emerged as tools for assessing news credibility at scale. However, these assessments are influenced by social and cultural biases. Studies investigating political bias compare model credibility ratings with expert credibility ratings. Comparing LLMs to the perceptions of political camps extends this approach to detecting similarities in their biases. We compare LLM-generated credibility and bias ratings of news outlets with expert assessments and stratified political opinions collected through surveys. We analyse three models (Llama 3.3 70B, Mixtral 8x7B, and GPT-OSS 120B) across 47 news outlets from two countries (U.S. and Germany). We found that the models demonstrated consistently high alignment with expert ratings, while showing weaker and more variable alignment with public opinions. For U.S. news outlets, all models showed stronger alignment with center-left perceptions, while for German news outlets the alignment was more diverse.
Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction
Nils Schwager | Simon Münker | Alistair Plum | Achim Rettinger
The transition of Large Language Models (LLMs) from exploratory tools to active "silicon subjects" in social science lacks extensive validation of operational validity. This study introduces Conditioned Comment Prediction (CCP), a task in which a model predicts how a user would comment on a given stimulus by comparing generated outputs with authentic digital traces. This framework enables a rigorous evaluation of current LLM capabilities with respect to the simulation of social media user behavior. We evaluated open-weight 8B models (Llama-3.1, Qwen3, Ministral) in English, German, and Luxembourgish language scenarios. By systematically comparing prompting strategies (explicit vs. implicit) and the impact of Supervised Fine-Tuning (SFT), we identify a critical form vs. content decoupling in low-resource settings: while SFT aligns the surface structure of the text output (length and syntax), it degrades semantic grounding. Furthermore, we demonstrate that explicit conditioning (generated biographies) becomes redundant under fine-tuning, as models successfully perform latent inference directly from behavioral histories. Our findings challenge current "naive prompting" paradigms and offer operational guidelines prioritizing authentic behavioral traces over descriptive personas for high-fidelity simulation.
Label-Consistent Data Generation for Aspect-Based Sentiment Analysis Using LLM Agents
Mohammad Hossein Akbari Monfared | Lucie Flek | Akbar Karimi
We propose an agentic data augmentation method for Aspect-Based Sentiment Analysis (ABSA) that uses iterative generation and verification to produce high-quality synthetic training examples. To isolate the effect of agentic structure, we also develop a closely matched prompting-based baseline using the same model and instructions. Both methods are evaluated across three ABSA subtasks—Aspect Term Extraction (ATE), Aspect Sentiment Classification (ATSC), and Aspect Sentiment Pair Extraction (ASPE)—four SemEval datasets, and two encoder–decoder models: T5-Base and Tk-Instruct. Our results show that the agentic augmentation outperforms raw prompting in label preservation of the augmented data, especially when the tasks require aspect term generation. In addition, when combined with real data, agentic augmentation provides higher gains, consistently outperforming prompting-based generation. These benefits are most pronounced for T5-Base, while the more heavily pretrained Tk-Instruct exhibits smaller improvements. As a result, augmented data helps T5-Base achieve performance comparable to that of its counterpart.
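A minimal sketch of the generation-verification loop such an agentic setup implies; `generate` and `verify` are hypothetical stand-ins for the LLM agents, and the retry policy is an assumption rather than the paper's exact design:

```python
def agentic_augment(seed_example, generate, verify, max_rounds=3):
    """Iterative generation-verification loop for label-consistent
    ABSA augmentation.

    `generate` and `verify` are hypothetical stand-ins for LLM agents:
    `generate` drafts a synthetic sentence plus aspect/sentiment labels
    (optionally conditioned on verifier feedback), and `verify` checks
    that the labels are preserved, returning (ok, feedback).
    """
    feedback = None
    for _ in range(max_rounds):
        candidate = generate(seed_example, feedback)
        ok, feedback = verify(candidate, seed_example)
        if ok:
            return candidate
    return None  # discard samples the verifier never accepts
```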
Antisocial behavior (ASB) on social media encompasses online behaviors that harm individuals, groups, or platform ecosystems, including hate speech, harassment, cyberbullying, trolling, and coordinated abuse. While most prior work has focused on detecting harm after it occurs, a growing body of research on ASB prediction seeks to forecast future harmful outcomes before they materialize, including—but not limited to—hate-speech diffusion, conversational derailment, and user recidivism. However, this emerging field remains fragmented, with limited conceptual grounding and few integrative frameworks. This paper establishes a foundation for ASB prediction by introducing a structured taxonomy spanning temporal, structural, and behavioral dimensions. Drawing on 49 machine learning studies identified through a literature review, we map predictive goals to datasets, modeling choices, and evaluation practices, and identify key challenges, including the lack of standardized benchmarks, the dominance of text-centric representations, and trade-offs between accuracy and interpretability. We conclude by outlining actionable directions toward more robust, generalizable, and responsible ASB prediction systems.
Real-Time Mitigation of Negative Emotion in Customer Care Calls
Surupendu Gangopadhyay | Mahnoosh Mehrabani
Speech emotion recognition (SER) is a compelling yet challenging research area with substantial practical relevance, particularly in enhancing human–machine interaction. Despite considerable progress in the field, the scarcity of realistic datasets that reflect real-world conditions makes it difficult to analyze system behavior in practice and can lead to degraded performance in industrial applications. In this study, we propose a system that detects negative emotions at each turn in a conversation by leveraging both linguistic and acoustic features. The approach is evaluated on real-world data, with a particular focus on identifying and responding to negative emotion in customer support scenarios. Designed for real-time application, the system is suitable for live deployment in call center environments. Furthermore, we propose an effective prompting strategy for using large language models (LLMs) as annotators, generating labeled data used to fine-tune small language models that achieve performance on par with the LLM used for annotation, while remaining suitable for real-time deployment.
Says Who? Argument Convincingness and Reader Stance Are Correlated with Perceived Author Personality
Sabine Weber | Lynn Greschner | Roman Klinger
Alongside its literal meaning, text also carries implicit social signals: information that is used by the reader to assign the author of the text a specific identity or make assumptions about the author’s character. The reader creates a mental image of the author which influences the interpretation of the presented information. This is especially relevant for argumentative text, where the credibility of the information might depend on who provides it. We therefore focus on the question: How do readers of an argument imagine its author? Using the ContArgA corpus, we study arguments annotated for convincingness and perceived author properties (level of education and Big Five personality traits). We find that annotators perceive an author to be similar to themselves when they agree with the stance of the argument. We also find that the envisioned personality traits and education level of the author are statistically significantly correlated with the argument’s convincingness. We conduct experiments with four generative LLMs and a RoBERTa-based regression model, showing that LLMs do not replicate the annotators’ judgments. Argument convincingness can, however, provide a useful signal for modeling perceived author personality when it is explicitly used during training.
A Transformer and Prototype-based Interpretable Model for Contextual Sarcasm Detection
Ximing Wen | Rezvaneh Rezapour
Sarcasm detection, with its figurative nature, poses unique challenges for affective systems designed to perform sentiment analysis. While these systems typically perform well at identifying direct expressions of emotion, they struggle with sarcasm’s inherent contradiction between literal and intended sentiment. Since transformer-based language models (LMs) are known for their ability to efficiently capture contextual meaning, we propose a method that leverages LMs and prototype-based networks, enhanced by sentiment embeddings, to conduct interpretable sarcasm detection. Our approach is intrinsically interpretable without extra post-hoc interpretability techniques. We test our model on three public benchmark datasets and show that it outperforms the current state of the art. At the same time, the prototypical layer enhances the model’s inherent interpretability by generating explanations through similar examples at inference time. Furthermore, we demonstrate the effectiveness of the incongruity loss, which we construct using sentiment prototypes, in an ablation study.
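A minimal PyTorch sketch of a ProtoPNet-style prototype similarity layer of the kind such models build on; the log((d^2+1)/(d^2+eps)) similarity is the standard formulation from that line of work, and its use here is illustrative rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class PrototypeLayer(nn.Module):
    """Similarity of an encoded text to learned prototype vectors.

    A classifier on top of these similarities is interpretable by
    construction: each prediction can be explained by the training
    examples closest to the most activated prototypes.
    """
    def __init__(self, n_prototypes: int, dim: int):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Squared distances (batch, n_prototypes), mapped to similarities
        # that grow as an input approaches a prototype.
        d2 = torch.cdist(h, self.prototypes).pow(2)
        return torch.log((d2 + 1.0) / (d2 + 1e-4))

layer = PrototypeLayer(n_prototypes=10, dim=768)
sims = layer(torch.randn(4, 768))  # e.g. pooled transformer outputs
print(sims.shape)                  # torch.Size([4, 10])
```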
Multimodal Claim Extraction for Fact-Checking
Joycelyn Teo | Rui Cao | Zhenyun Deng | Zifeng Ding | Michael Sejr Schlichtkrull | Andreas Vlachos
Automated Fact-Checking (AFC) relies on claim extraction as a first step, yet existing methods largely overlook the multimodal nature of today’s misinformation. Social media posts often combine short, informal text with images such as memes, screenshots, and photos, creating challenges that differ from both text-only claim extraction and well-studied multimodal tasks like image captioning or visual question answering. In this work, we present the first benchmark for multimodal claim extraction from social media, consisting of posts containing text and one or more images, annotated with gold-standard claims derived from real-world fact-checkers. We evaluate state-of-the-art multimodal LLMs (MLLMs) under a three-part evaluation framework (semantic alignment, faithfulness, and decontextualization) and find that baseline MLLMs struggle to model rhetorical intent and contextual cues. To address this, we introduce MICE, an intent-aware framework which shows improvements in intent-critical cases.
A Multi-Aspect Evaluation Framework for Synthetic Data: Case Study on Irony and Sarcasm
Laura Majer | Ana Barić | Florijan Sandalj | Ivan Unković | Bojan Puvača | Jan Šnajder
Data augmentation (DA) using large language models (LLMs) is a cost-effective method for generating synthetic data, particularly for tasks with scarce datasets. However, its potential remains largely underexplored, both in terms of augmentation configuration and evaluation of synthetic data. This paper investigates LLM-based synthetic data generation for irony and sarcasm, two subjective and context-dependent forms of figurative language. We propose a multi-aspect evaluation framework assessing synthetic data’s utility-plausibility and extrinsic-intrinsic dimensions through four aspects: predictive performance, sample diversity, linguistic properties, and human judgment. Our findings indicate that other aspects of evaluation, like diversity and linguistic features, do not necessarily correlate with an increase in predictive performance, underscoring the importance of multi-faceted evaluation. This work highlights the potential of LLM-based DA for irony and sarcasm detection, offering insights into the linguistic competence of LLMs. As synthetic data becomes increasingly prevalent, our framework offers a broadly applicable and crucial evaluation method, particularly for linguistically complex tasks.
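As one concrete example of a sample-diversity measure that such a framework might include (distinct-n is a common lexical diversity choice; whether the paper uses this exact metric is an assumption):

```python
def distinct_n(texts: list[str], n: int = 2) -> float:
    """Share of unique n-grams across a set of synthetic samples;
    higher values indicate less repetitive generations."""
    seen, total = set(), 0
    for text in texts:
        tokens = text.split()
        grams = list(zip(*(tokens[i:] for i in range(n))))
        seen.update(grams)
        total += len(grams)
    return len(seen) / max(total, 1)

print(distinct_n(["oh great, another monday",
                  "oh great, more meetings"]))  # -> ~0.83
```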