Proceedings of the Fourth Workshop on NLP for Positive Impact (NLP4PI)

Katherine Atwell, Laura Biester, Angana Borah, Daryna Dementieva, Oana Ignat, Neema Kotonya, Ziyi Liu, Ruyuan Wan, Steven Wilson, Jieyu Zhao (Editors)

Anthology ID: 2025.nlp4pi-1
Month: July
Year: 2025
Address: Vienna, Austria
Venues: NLP4PI | WS
Publisher: Association for Computational Linguistics
URL: https://preview.aclanthology.org/acl25-workshop-ingestion/2025.nlp4pi-1/
ISBN: 978-1-959429-19-7
PDF: https://preview.aclanthology.org/acl25-workshop-ingestion/2025.nlp4pi-1.pdf

Tracking Green Industrial Policies with LLMs: A Demonstration
Yucheng Lu

Green industrial policies (GIPs) are government interventions that support environmentally sustainable economic growth through targeted incentives, regulations, and investments in clean technologies. As the backbone of climate mitigation and adaptation, GIPs deserve systematic documentation and analysis. However, two major hurdles impede this systematic documentation. First, unlike other climate policy documents, such as Nationally Determined Contributions (NDCs), which are centrally curated, GIPs are scattered across numerous pieces of government legislation and policy announcements. Second, extracting information from these diverse documents is expensive when relying on expert annotation. We address this gap by proposing GreenSpyder, an LLM-based workflow that actively monitors, classifies, and annotates GIPs from open-source information. As a demonstration, we benchmark LLM performance in classifying and annotating GIPs on a small expert-curated dataset. Our results show that LLMs can be quite effective for classification and coarse annotation tasks, though they still need improvement for more nuanced classification. Finally, as a real-world application, we apply GreenSpyder to U.S. Legislative Records from the 117th Congress, paving the way for more comprehensive LLM-based GIP documentation in the future.

Guardians of Trust: Risks and Opportunities for LLMs in Mental Health
Miguel Baidal | Erik Derner | Nuria Oliver

The integration of large language models (LLMs) into mental health applications offers promising opportunities for positive social impact. However, it also presents critical risks. While previous studies have often addressed these challenges and risks individually, a broader and multi-dimensional approach is still lacking. In this paper, we introduce a taxonomy of the main challenges related to the use of LLMs for mental health and propose a structured, comprehensive research agenda to mitigate them. We emphasize the need for explainable, emotionally aware, culturally sensitive, and clinically aligned systems, supported by continuous monitoring and human oversight. By placing our work within the broader context of natural language processing (NLP) for positive impact, this research contributes to ongoing efforts to ensure that technological advances in NLP responsibly serve vulnerable populations, fostering a future where mental health solutions improve rather than endanger well-being.

Early detection of disease outbreaks is crucial to ensure timely intervention by the health authorities. Due to the challenges associated with traditional indicator-based surveillance, monitoring informal sources such as online media has become increasingly popular. However, owing to the number of online articles published every day, manual screening of the articles is impractical. To address this, we propose Health Sentinel, a multi-stage information extraction pipeline that uses a combination of ML and non-ML methods to extract events (structured information concerning disease outbreaks or other unusual health events) from online articles. The extracted events are made available to the Media Scanning and Verification Cell (MSVC) at the National Centre for Disease Control (NCDC), Delhi, for analysis, interpretation, and further dissemination to local agencies for timely intervention. From April 2022 to date, Health Sentinel has processed over 300 million news articles and identified over 95,000 unique health events across India, of which over 3,500 were shortlisted by the public health experts at NCDC as potential outbreaks.

CliME: Evaluating Multimodal Climate Discourse on Social Media and the Climate Alignment Quotient (CAQ)
Abhilekh Borah | Hasnat Md Abdullah | Kangda Wei | Ruihong Huang

The rise of Large Language Models (LLMs) has raised questions about their ability to understand climate-related contexts. Though climate change dominates social media, analyzing its multimodal expressions is understudied, and current tools have failed to determine whether LLMs amplify credible solutions or spread unsubstantiated claims. To address this, we introduce CliME (Climate Change Multimodal Evaluation), a first-of-its-kind multimodal dataset, comprising 2579 Twitter and Reddit posts. The benchmark features a diverse collection of humorous memes and skeptical posts, capturing how these formats distill complex issues into viral narratives that shape public opinion and policy discussions. To systematically evaluate LLM performance, we present the Climate Alignment Quotient (CAQ), a novel metric comprising five distinct dimensions: Articulation, Evidence, Resonance, Transition, and Specificity. Additionally, we propose three analytical lenses: Actionability, Criticality, and Justice, to guide the assessment of LLM-generated climate discourse using CAQ. Our findings, based on the CAQ metric, indicate that while most evaluated LLMs perform relatively well in Criticality and Justice, they consistently underperform on the Actionability axis. Among the models evaluated, Claude 3.7 Sonnet achieves the highest overall performance. We publicly release our code and dataset to foster further research in this domain.

Does “Reasoning” with Large Language Models Improve Recognizing, Generating and Reframing Unhelpful Thoughts?
Yilin Qi | Dong Won Lee | Cynthia Breazeal | Hae Won Park

Cognitive Reframing, a core element of Cognitive Behavioral Therapy (CBT), helps individuals reinterpret negative experiences by finding positive meaning. Recent advances in Large Language Models (LLMs) have demonstrated improved performance through reasoning-based strategies. This inspires a promising direction of leveraging the reasoning capabilities of LLMs to improve CBT and mental reframing by simulating the process of critical thinking, potentially enabling more effective recognition, generation and reframing of cognitive distortions. In this work, we investigate the role of various reasoning methods, including pre-trained reasoning LLMs, such as DeepSeek-R1, and augmented reasoning strategies, such as CoT (Wei et al., 2022) and self-consistency (Wang et al., 2022), in enhancing LLMs’ ability to perform cognitive reframing tasks. We find that augmented reasoning methods, even when applied to older LLMs like GPT-3.5, consistently outperform state-of-the-art pretrained reasoning models such as DeepSeek-R1 (Guo et al., 2025) and o1 (Jaech et al., 2024) on recognizing, generating and reframing unhelpful thoughts.

Take Shelter, Zanmi: Digitally Alerting Cyclone Victims in Their Languages
Nathaniel Romney Robinson

Natural disasters such as tropical cyclones cause annual devastation and take a heavy social cost, as disadvantaged communities are typically hit hardest. Among these communities are the speakers of minority and low-resource languages, who may not be sufficiently informed about incoming weather events to prepare. This work presents an analysis of the current state of machine translation for natural disasters in the languages of communities that are threatened by them. Results suggest that commercial systems are promising, and that in-genre fine-tuning data are beneficial.

Phishing attacks represent a significant cybersecurity threat, necessitating adaptive detection techniques. This study explores few-shot Adaptive Linguistic Prompting (ALP) in detecting phishing webpages through the multimodal capabilities of state-of-the-art large language models (LLMs) such as GPT-4o and Gemini 1.5 Pro. ALP is a structured semantic reasoning method that guides LLMs to analyze textual deception by breaking down linguistic patterns, detecting urgency cues, and identifying manipulative diction commonly found in phishing content. By integrating textual, visual, and URL-based analysis, we propose a unified model capable of identifying sophisticated phishing attempts. Our experiments demonstrate that ALP significantly enhances phishing detection accuracy by guiding LLMs through structured reasoning and contextual analysis. The findings highlight the potential of ALP-integrated multimodal LLMs to advance phishing detection frameworks, achieving an F1-score of 0.93—surpassing traditional approaches. These results establish a foundation for more robust, interpretable, and adaptive linguistic-based phishing detection systems using LLMs.

Bridging Perceptual Gaps in Food NLP: A Structured Approach Using Sensory Anchors
Kana Maruyama | Angel Hsing-Chi Hwang | Tarek R. Besold

Understanding how humans perceive and describe food is essential for NLP applications such as semantic search, recommendation, and structured food communication. However, textual similarity often fails to reflect perceptual similarity, which is shaped by sensory experience, wine knowledge, and individual context. To address this, we introduce Sensory Anchors—structured reference points that align textual and perceptual representations. Using Red Wine as a case study, we collect free-form descriptions, metaphor-style responses, and perceptual similarity rankings from participants with varying levels of wine knowledge. These rankings reflect holistic perceptual judgments, with wine knowledge emerging as a key factor. Participants with higher wine knowledge produced more consistent rankings and moderately aligned descriptions, while those with lower knowledge showed greater variability. These findings suggest that structured descriptions based on higher wine knowledge may not generalize across users, underscoring the importance of modeling perceptual diversity. We also find that metaphor-style prompts enhance alignment between language and perception, particularly for less knowledgeable participants. Sensory Anchors thus provide a flexible foundation for capturing perceptual variability in food language, supporting the development of more inclusive and interpretable NLP systems.

We present a study investigating the linguistic sentiment associated with schizophrenia and depression in research-based texts. To this end, we construct a corpus of over 260,000 PubMed abstracts published between 1975 and 2025, covering both disorders. For sentiment analysis, we fine-tune two sentence-transformer models using SetFit with a training dataset consisting of sentences rated for valence by psychiatrists and clinical psychologists. Our analysis identifies significant temporal trends and differences between the two conditions. While the mean positive sentiment in abstracts and titles increases over time, a more detailed analysis reveals a marked rise in both maximum negative and maximum positive sentiment, suggesting a shift toward more polarized language. Notably, sentiment in abstracts on schizophrenia is significantly more negative overall. Furthermore, an exploratory analysis indicates that negative sentences are disproportionately concentrated at the beginning of abstracts. These findings suggest that linguistic style in scientific literature is evolving. We discuss the broader ethical and societal implications of these results and propose recommendations for more cautious language use in scientific discourse.

Dataset of News Articles with Provenance Metadata for Media Relevance Assessment
Tomas Peterka | Matyas Bohacek

Out-of-context and misattributed imagery is the leading form of media manipulation in today’s misinformation and disinformation landscape. Existing methods for detecting this practice often consider only whether the semantics of the imagery corresponds to the text narrative, missing manipulation so long as the depicted objects or scenes somewhat correspond to the narrative at hand. To tackle this, we introduce the News Media Provenance Dataset, a dataset of news articles with provenance-tagged images. We formulate two tasks on this dataset, location of origin relevance (LOR) and date and time of origin relevance (DTOR), and present baseline results on six large language models (LLMs). We find that, while the zero-shot performance on LOR is promising, the performance on DTOR lags behind, leaving room for specialized architectures and future work.

TikTok has emerged as a key platform for discussing polarizing topics, including climate change. Despite its growing influence, there is limited research exploring how content features shape emotional alignment between video creators and audience comments, as well as their impact on user engagement. Using a combination of pretrained and fine-tuned textual and visual models, we analyzed 7,110 TikTok videos related to climate change, focusing on content features such as semantic clustering of video transcriptions, visual elements, tonal shifts, and detected emotions. (1) Our findings reveal that positive emotions and videos featuring factual content or vivid environmental visuals exhibit stronger emotional alignment. Furthermore, emotional intensity and tonal coherence in video speech are significant predictors of higher engagement levels, offering new insights into the dynamics of climate change communication on social media. (2) Our preference learning analysis reveals that comment emotions play a dominant role in predicting video shareability, with both positive and negative emotional responses acting as key drivers of content diffusion. We conclude that user engagement—particularly emotional discourse in comments—significantly shapes climate change content shareability.

What Counts Underlying LLMs’ Moral Dilemma Judgments?
Wenya Wu | Weihong Deng

Moral judgments in LLMs increasingly capture the attention of researchers in the AI ethics domain. This study explores the moral judgments of three open-source large language models (LLMs), namely Qwen-1.5-14B, Llama3-8B, and DeepSeek-R1, in plausible moral dilemmas, examining their sensitivity to social exposure and collaborative decision-making. Using a dual-process framework grounded in deontology and utilitarianism, we evaluate LLMs’ responses to moral dilemmas under varying social contexts. Results reveal that all models are significantly influenced by moral norms rather than consequences, with DeepSeek-R1 exhibiting a stronger action tendency compared to Qwen-1.5-14B and Llama3-8B, which show higher inaction preferences. Social exposure and collaboration impact LLMs differently: Qwen-1.5-14B becomes less aligned with moral norms under observation, while DeepSeek-R1’s action tendency is moderated by social collaboration. These findings highlight the nuanced moral reasoning capabilities of LLMs and their varying sensitivity to social cues, providing insights into the ethical alignment of AI systems in socially embedded contexts.

Unsupervised Sustainability Report Labeling based on the integration of the GRI and SDG standards
Seyed Alireza Mousavian Anaraki | Danilo Croce | Roberto Basili

Sustainability reports are key instruments for communicating corporate impact, but their unstructured format and varied content pose challenges for large-scale analysis. This paper presents an unsupervised method to annotate paragraphs from sustainability reports against both the Global Reporting Initiative (GRI) and Sustainable Development Goals (SDG) standards. The approach combines structured metadata from GRI content indexes, official GRI–SDG mappings, and text semantic similarity models to produce weakly supervised annotations at scale. To evaluate the quality of these annotations, we train a multi-label classifier on the automatically labeled data and evaluate it on the trusted OSDG Community Dataset. The results show that our method yields meaningful labels and improves classification performance when combined with human-annotated data. Although preliminary, this work offers a foundation for scalable sustainability analysis and opens future directions toward assessing the credibility and depth of corporate sustainability claims.

AfD-CCC: Analyzing the Climate Change Discourse of a German Right-wing Political Party
Manfred Stede | Ronja Memminger

While the scientific consensus on anthropogenic climate change (CC) has been undisputed for a long time, public discourse is still divided. Considering the case of Europe, in the majority of countries, an influential right-wing party propagates climate scepticism or outright denial. Our work addresses the German party AfD, which represents the second-largest faction in the federal parliament. In order to make the party’s discourse on CC accessible to NLP-based analyses, we are compiling the AfD-CCC corpus, a collection of parliamentary speeches and other material from various sources. We report on first analyses of this new dataset using sentiment and emotion analysis as well as classification of populist language, which demonstrate clear differences from the language use of the two largest competing parties (social democrats and conservatives). We make the corpus available to enable further studies of the party’s rhetoric on CC topics.

Multilingual large language models have gained prominence for their proficiency in processing and generating text across languages. Like their monolingual counterparts, multilingual models are likely to pick up on stereotypes and other social biases during training. In this paper, we study a phenomenon we term “stereotype leakage”, which refers to how training a model multilingually may lead to stereotypes expressed in one language showing up in the models’ behavior in another. We propose a measurement framework for stereotype leakage and investigate its effect in English, Russian, Chinese, and Hindi and with GPT-3.5, mT5, and mBERT. Our findings show a noticeable leakage of positive, negative, and nonpolar associations across all languages. We find that GPT-3.5 exhibits the most stereotype leakage of these models, and Hindi is the most susceptible to leakage effects.

Publications in the AI for Good space have tended to focus on the research and model development that can support high-impact applications. However, very few AI for Good papers discuss the process of deploying and collaborating with the partner organization, and the resulting real-world impact. In this work, we share details about the close collaboration with a humanitarian-to-humanitarian (H2H) organization and how to not only deploy the AI model in a resource-constrained environment, but also how to maintain it for continuous performance updates, and share key takeaways for practitioners.

While several previous studies have analyzed gender bias in research, we are still missing a comprehensive analysis of gender differences in the AI community, covering diverse topics and different development trends. Using the AI Scholar dataset of 78K researchers in the field of AI, we identify several gender differences: (1) Although female researchers tend to have fewer overall citations than males, this citation difference does not hold for all academic-age groups; (2) There exists large gender homophily in co-authorship on AI papers; (3) Female first-authored papers show distinct linguistic styles, such as longer text, more positive emotion words, and more catchy titles than male first-authored papers. Our analysis provides a window into the current demographic trends in our AI community, and encourages more gender equality and diversity in the future.

Propaganda detection on social media remains challenging due to task complexity and limited high-quality labeled data. This paper introduces a novel framework that combines human expertise with Large Language Model (LLM) assistance to improve both annotation consistency and scalability. We propose a hierarchical taxonomy that organizes 14 fine-grained propaganda techniques into three broader categories, conduct a human annotation study on the HQP dataset that reveals low inter-annotator agreement for fine-grained labels, and implement an LLM-assisted pre-annotation pipeline that extracts propagandistic spans, generates concise explanations, and assigns local labels as well as a global label. A secondary human verification study shows significant improvements in both agreement and time-efficiency. Building on this, we fine-tune smaller language models (SLMs) to perform structured annotation. Instead of fine-tuning on human annotations, we train on high-quality LLM-generated data, allowing a large model to produce these annotations and a smaller model to learn to generate them via knowledge distillation. Our work contributes towards the development of scalable and robust propaganda detection systems, supporting the idea of transparent and accountable media ecosystems in line with SDG 16. The code is publicly available at our GitHub repository.

Multi-Task Learning approach to identify sentences with impact and affected location in a disaster news report
Sumanta Banerjee | Shyamapada Mukherjee | Sivaji Bandyopadhyay

The first priority of action in the Sendai Framework for Disaster Risk Reduction 2015-2030 advocates the understanding of disaster risk by collecting and processing practical information related to disasters. A smart collection may be the compilation of relevant and summarized news articles focused on some key pieces of information such as disaster event type, geographic location(s), and impacts. In this article, a Multi-Task Learning (MTL) based end-to-end model has been developed to perform three related tasks: sentence classification depending on the presence of (1) relevant locations and (2) impact information to generate a summary, and (3) identification of the causes or event types in disaster news. Each of the three tasks is formulated as a multilabel binary classification problem. The results of the proposed MTL model have been compared with three popular transformer models: BERT, RoBERTa, and ALBERT. The proposed model achieved better performance scores than the other models in most cases.

Wind energy project assessments present significant challenges for decision-makers, who must navigate and synthesize hundreds of pages of environmental and scientific documentation. These documents often span different regions and project scales, covering multiple domains of expertise. This process traditionally demands immense time and specialized knowledge from decision-makers. The advent of Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) approaches offers a transformative solution, enabling rapid, accurate cross-document information retrieval and synthesis. As the landscape of Natural Language Processing (NLP) and text generation continues to evolve, benchmarking becomes essential to evaluate and compare the performance of different RAG-based LLMs. In this paper, we present a comprehensive framework to generate a domain-relevant RAG benchmark. Our framework is based on automatic question-answer generation with Human (domain experts)-AI (LLM) teaming. As a case study, we demonstrate the framework by introducing WeQA, a first-of-its-kind benchmark on the wind energy domain which comprises multiple scientific documents and reports related to environmental aspects of wind energy projects. Our framework systematically evaluates RAG performance using diverse metrics and multiple question types with varying complexity levels, providing a foundation for rigorous assessment of RAG-based systems in complex scientific domains and enabling researchers to identify areas for improvement in domain-specific applications.

Participatory Design for Positive Impact: Behind the Scenes of Three NLP Projects
Marianne Wilson | David M. Howcroft | Ioannis Konstas | Dimitra Gkatzia | Gavin Abercrombie

Researchers in Natural Language Processing (NLP) are increasingly adopting participatory design (PD) principles to better achieve positive outcomes for stakeholders. This paper evaluates two PD perspectives proposed by Delgado et al. (2023) and Caselli et al. (2021) as interpretive and planning tools for NLP research. We reflect on our experiences adopting PD practices in three NLP projects that aim to create positive impact for different communities, and that span different domains and stages of NLP research. We assess how our projects align with PD goals and use these perspectives to identify the benefits and challenges of PD in NLP research. Our findings suggest that, while Caselli et al. (2021) and Delgado et al. (2023) provide valuable guidance, their application in research can be hindered by existing NLP practices, funding structures, and limited access to stakeholders. We propose that researchers adapt their PD praxis to the circumstances of specific projects and communities, using them as flexible guides rather than rigid prescriptions.

Mitigating Gender Bias in Job Ranking Systems Using Job Advertisement Neutrality
Deepak Kumar | Shahed Masoudian | Alessandro B. Melchiorre | Markus Schedl

Transformer-based Job Ranking Systems (JRSs) are vulnerable to societal biases inherited in unbalanced datasets. These biases often manifest as unjust job rankings, particularly disadvantaging candidates of different genders. Most bias mitigation techniques leverage candidates’ gender and align gender distributions within the embeddings of JRSs to mitigate bias. While such methods effectively align distributional properties and make JRSs agnostic to gender, they frequently fall short in addressing empirical fairness metrics, such as the performance gap across genders. In this study, we shift our attention from candidate gender to mitigate bias based on gendered language in job advertisements. We propose a novel neutrality score based on automatically discovered biased words in job ads and use it to re-rank the model’s decisions. We evaluate our method by comparing it with different bias mitigation strategies and empirically demonstrate that our proposed method not only improves fairness but can also enhance the model’s performance.

STAR: Strategy-Aware Refinement Module in Multitask Learning for Emotional Support Conversations
Suhyun Lee | Changheon Han | Woohwan Jung | Minsam Ko

Effective emotional support in conversation requires strategic decision making, as it involves complex, context-sensitive reasoning tailored to diverse individual needs. The Emotional Support Conversation framework addresses this by organizing interactions into three distinct phases—exploration, comforting, and action—which guide strategy selection during response generation. While multitask learning has been applied to jointly optimize strategy prediction and response generation, it often suffers from task interference due to conflicting learning objectives. To overcome this, we propose the Strategy-Aware Refinement Module (STAR), which disentangles the decoder’s hidden states for each task and selectively fuses them via a dynamic gating mechanism. This design preserves task-specific representations while allowing controlled information exchange between tasks, thus reducing interference. Experimental results demonstrate that STAR effectively reduces task interference and achieves state-of-the-art performance in both strategy prediction and supportive response generation.

AI Tools Can Generate Misculture Visuals! Detecting Prompts Generating Misculture Visuals For Prevention
Venkatesh Velugubantla | Raj Sonani | Msvpj Sathvik

Advanced AI models that generate realistic images from text prompts offer new creative possibilities but also risk producing culturally insensitive or offensive content. To address this issue, we introduce a novel dataset designed to classify text prompts that could lead to the generation of harmful images misrepresenting different cultures and communities. By training machine learning models on this dataset, we aim to automatically identify and filter out harmful prompts before image generation, balancing cultural sensitivity with creative freedom. Benchmarking with state-of-the-art language models, our baseline models achieved an accuracy of 73.34%.

Cross-cultural Sentiment Analysis of Social Media Responses to a Sudden Crisis Event
Zheng Hui | Zihang Xu | John Kender

Although the responses to events such as COVID-19 have been extensively studied, research on sudden crisis response in a multicultural context is still limited. In this paper, our contributions are as follows. (1) We examine cultural differences in social media posts related to such events in two different countries, specifically the United Kingdom lockdown of 2020-03-23 and the China Urumqi fire of 2022-11-24. (2) We extract the emotional polarity of tweets and weibos gathered temporally adjacent to those two events, by fine-tuning transformer-based language models for each language. We evaluate each model’s performance on two benchmarks, and show that, despite being trained on a relatively small amount of data, they exceed baseline accuracies. We find that in both events, the increase in negative responses is both dramatic and persistent, and does not return to baseline even after two weeks. Nevertheless, the Chinese dataset reflects, at the same time, positive responses to subsequent government action. Our study is one of the first to show how sudden crisis events can be used to explore affective reactions across cultures.

Tapping into Social Media in Crisis: A Survey
William D. Lewis | Haotian Zhu | Keaton Strawn | Fei Xia

When a crisis hits, people often turn to social media to ask for help, offer help, find out how others are doing, and decide what they should do. The growth of social media use during crises has been helpful to aid providers as well, giving them a nearly immediate read of the on-the-ground situation that they might not otherwise have. The amount of crisis-related content posted to social media over the past two decades has been explosive, which, in turn, has been a boon to Language Technology (LT) researchers. In this study, we conducted a systematic survey of 355 papers published in the past five years to better understand the expanding growth of LT as it is applied to crisis content, specifically focusing on corpora built over crisis social media data as well as systems and applications that have been developed on this content. We highlight the challenges and possible future directions of research in this space. Our goal is to engender interest in the LT field writ large, in particular in an area of study that can have dramatic impacts on people’s lives. Indeed, the use of LT in crisis response has already been shown to save people’s lives.