Workshop on NLP and Computational Social Science (2026)
Proceedings of the Seventh Workshop on Natural Language Processing and Computational Social Science
Dallas Card | Anjalie Field | Katherine Keith | Julia Mendelsohn
Prompt Perturbations Reveal Human-Like Biases in Large Language Model Survey Responses
Jens Rupprecht | Georg Ahnert | Markus Strohmaier
Large Language Models (LLMs) are increasingly used as proxies for human subjects in social science surveys, but their reliability and susceptibility to known human-like response biases, such as central tendency, opinion floating, and primacy bias, are poorly understood. This work investigates the response robustness of LLMs in normative survey contexts: we test 18 LLMs on questions taken from the World Values Survey (WVS), applying a comprehensive set of ten perturbations to both question phrasing and answer option structure, resulting in over 334,800 simulated survey interviews. In doing so, we not only reveal LLMs’ vulnerabilities to perturbations but also show that almost all tested models exhibit a consistent recency bias, disproportionately favoring the last-presented answer option. While larger models are generally more robust, all models remain sensitive to semantic variations like paraphrasing and to combined perturbations. This underscores the critical importance of prompt design and robustness testing when using LLMs to generate synthetic survey data.
Borrowed Words, Borrowed Minds: Probing LLM Choice of English-Derived Loanwords in Japanese
Joseph James
The choice between English-derived loanwords (gairaigo) and native Japanese equivalents is a socially meaningful aspect of language use, carrying implications for register, style, and pragmatic interpretation. We introduce a controlled evaluation dataset probing how large language models encode this form of sociolinguistic variation. The dataset comprises 113 interchangeable lexical pairs embedded across six communicative contexts spanning formal and informal, spoken and written registers. We evaluate 16 Japanese-capable LLMs across three complementary tasks: sentence rating, pairwise choice, and masked word prediction. Although both lexical forms were generally rated as natural, models diverged substantially in contextual sensitivity and lexical preference, revealing architectural differences in how socially grounded lexical alternatives are represented. These findings suggest that surface fluency may mask instability in modeling pragmatic variation, with implications for socially aware language generation and evaluation.
Does Local News Stay Local?: Online Content Shifts in Sinclair-Acquired Stations
Miriam Wanner | Sophia Hager | Anjalie Field
Local news stations are often considered to be reliable sources of non-politicized information, particularly on local concerns that residents care about. The Sinclair Broadcast Group is a broadcasting company that has acquired many local news stations in the last decade. We investigate the effects of local news stations being acquired by Sinclair: how does coverage change? We analyze YouTube content published by local news stations through topic modeling, log-odds ratios, and word embedding analyses to investigate changes after acquisition by Sinclair. We find evidence that local news stations report more frequently on national news at the expense of local topics, and that their coverage of polarizing national topics increases. These findings associate acquisition by Sinclair with increasing polarization and nationalization of news content, which in turn risks increasing political polarization of local news viewers.
Learning Moral Diversity: Modelling Individual Perspectives in Moral Classification of Texts
Yi Ren | Lewis Mitchell | Matthew Roughan
Understanding moral values in social media text offers insight into moral judgement formation, and supervised NLP models trained on crowdsourced data have achieved strong classification performance. However, most approaches simplify the problem by aggregating multiple annotators’ labels into a single "ground truth", overlooking the inherent subjectivity of the task. In practice, there are disagreements between annotators caused by personal viewpoint or inherent ambiguities, particularly for short tweets. Here, we extend a pretrained language model with a layer that learns annotator-specific features. Our model improves predictions of individual annotations and yields representations that reveal meaningful insights into annotators’ moral perspectives. We show that models trained on aggregated labels may hide variation and give a misleading impression of performance. Overall, we demonstrate that disagreement reflects the inherent subjectivity of the task and that modelling individual perspectives creates benefits for moral classification of texts.
Launch and Aftermath: Contrasting Social Media Responses to Chatbot Releases. The Cases of Meta’s Galactica and OpenAI’s ChatGPT
Maximilian Weber | Johannes B. Gruber
In November 2022, Meta’s Galactica and OpenAI’s ChatGPT were released within fifteen days of each other. The two transformer-based language models were architecturally similar and built on comparable underlying technology, yet experienced starkly different outcomes. Where they diverged was not in technical kind but in domain positioning and epistemic framing: Galactica was explicitly marketed as a reliable scientific assistant, while ChatGPT was presented as a general-purpose conversational tool. Using Twitter data collected via the Twitter Research API, we conduct a comparative analysis of early social media discourse surrounding both models. Through sentiment classification, zero-shot harm and risk annotation, and LLM-based topic modeling, we find that negative sentiment escalated rapidly for Galactica while remaining comparatively stable for ChatGPT in the release period. Galactica experienced a marked escalation in criticism during its first week, eventually structuring much of the conversation. In contrast, ChatGPT’s early discourse remained more evenly distributed across hype, experimentation, practical engagement, and criticism. We argue that domain positioning and epistemic expectations, rather than any meaningful technological difference, played a central role in shaping public perception, with Galactica’s scientific presentation making its well-documented hallucinations appear far more damaging in public opinion.
When Do LLMs Need Human Experts? Evidence for Social Science from Jurisprudential Classification
Caroline Cheng | Edward Stiglitz | David Mimno | Matthew Wilkens
Social scientists increasingly use large language models (LLMs) to classify text at scale, raising a key question: when can LLMs replace expert human annotation? Prior work found that earlier generative models failed on complex social science tasks while fine-tuned BERT succeeded, but whether current frontier-scale models close this gap remained untested. We investigate this question on a challenging legal reasoning task—classifying paragraphs from U.S. Supreme Court opinions as employing formal, grand, or no reasoning. Testing frontier LLMs including GPT-5.2 and leading open-weight alternatives, we find that even the most capable prompted models consistently underperform fine-tuned BERT. Only when high-parameter-count generative LLMs are fine-tuned on human-annotated training data does performance improve, and fine-tuned BERT remains a cost-effective alternative. Contrary to a common view, our results demonstrate that scaling to frontier-size LLMs does not eliminate the need for expert annotation on tasks requiring deep domain expertise—a finding with important implications for computational social science measurement.
An NLP Framework for Analyzing Corporate Strategic Behavior in the Opioid Industry Documents Archive
Duy Dang Phu | Thìn Đặng Văn
The Opioid Industry Documents Archive (OIDA) provides extensive internal corporate records that offer valuable insight into the drivers of the opioid crisis, yet its use in systematic analysis of corporate strategy remains limited. In this study, we propose an NLP-based framework to analyze strategic behavior in large-scale litigation archives, combining relevance filtering and topic modeling with large language model (LLM)-assisted interpretation. Applied to documents from Insys Therapeutics and Mallinckrodt Pharmaceuticals, our approach uncovers systematic differences in corporate strategies and organizational priorities. These results highlight the potential of integrating representation learning and LLMs for large-scale analysis in public health and corporate accountability research.
Large-scale ASR systems such as Whisper achieve competitive aggregate Word Error Rate (WER) on multilingual benchmarks, but this aggregate conceals systematic disparities across speaker populations. We evaluate Whisper large-v3 on 276 recordings from the Corpus Oral y Sonoro del Español Rural (COSER), a dialectological archive of elderly rural speakers across all Spanish provinces. WER is computed separately for Informants and Interviewers within each recording, revealing that mixed-role evaluation underestimates Informant WER in the majority of provinces, with the largest corrections in southern areas. Negative Binomial regression with cluster-robust standard errors shows that Andalusia and Extremadura generate significantly more Informant errors than the Castilian heartland (Andalusia IRR = 1.20, p < 0.001; Extremadura IRR = 1.24, p = 0.020), while no geographic predictor reaches significance for Interviewers sharing the same recording environment. Male Informants generate 12.5% more errors than females after geographic adjustment (p < 0.001), consistent with differential vernacular retention in traditional rural communities. The geographic pattern aligns with established dialectological classifications of Peninsular Spanish. These results demonstrate that role-disaggregated evaluation is a necessary methodological prerequisite for fairness audits of ASR systems applied to sociolinguistically diverse corpora: aggregate benchmarks systematically suppress disparities that are borne disproportionately by the most underrepresented speaker populations, and their use in isolation constitutes both an allocative harm and a measurement failure.
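The role-disaggregated evaluation described in this abstract can be pictured with a minimal Python sketch. It assumes each recording has already been split into turns labeled with a speaker role, and it uses a plain word-level edit distance. The function names and the toy Spanish segments are hypothetical and not taken from the paper, which may normalize text or aggregate errors differently.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_role(segments):
    """Return {role: WER}, pooling errors and reference words within each role.

    `segments` is an iterable of (role, reference_text, hypothesis_text).
    Assumption: whitespace tokenization with no text normalization.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for role, ref, hyp in segments:
        ref_tokens = ref.split()
        errors[role] += edit_distance(ref_tokens, hyp.split())
        words[role] += len(ref_tokens)
    return {role: errors[role] / words[role] for role in words if words[role] > 0}


def pooled_wer(segments):
    """Mixed-role WER: every segment is treated as one speaker population."""
    relabeled = [("all", ref, hyp) for _, ref, hyp in segments]
    return wer_by_role(relabeled)["all"]


# Toy data (invented for illustration): one interviewer turn, one informant turn.
toy_segments = [
    ("interviewer", "dónde nació usted", "dónde nació usted"),
    ("informant", "yo nací aquí en el pueblo", "yo nací aquí en pueblo"),
]
per_role = wer_by_role(toy_segments)
mixed = pooled_wer(toy_segments)
```

In this toy case the pooled figure sits below the Informant figure because the error-free Interviewer words dilute it, which is the masking effect the abstract describes.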
Who Speaks for Whom? LLM-Generated Survey Data as a Proxy for Public Opinion
Radhakrishnan Venkatakrishnan | Travis Brodbeck | Michael D. Young
Technological advancements, such as Large Language Models (LLMs), offer a potential solution to the two-faceted problem facing social science researchers: rising costs and declining response rates. The use of artificial personas is a budding practice, in which chatbots are given the demographic characteristics of the person they are supposed to role-play and then answer questions for researchers. Before scholars and practitioners augment or replace the data created by interviewing humans, it is essential to understand how well models perform in generating accurate, reliable, and robust data, with concerns that the training of LLMs results in a bias towards the norms of WEIRD cultures. We present a procedure for practitioners to use to evaluate the quality of their synthetic data by measuring Intraclass Correlation (ICC), Earth Mover’s Distance (EMD), Variance, Hedging, and demographic drivers of LLM output. We find that the models may generate plausible results in the aggregate, but these synthetic data do not exhibit the depth or nuance of human respondents. Secondarily, we find that despite having generated definitive answers on a ten-point scale, the reasoning provided by the LLM exhibited varying degrees of hedging that do not consistently align with the LLM’s answer. The distortion of the results was not uniformly distributed; instead, the effects were more extreme for some demographic groups. Our findings suggest that the technology generating synthetic survey data may not be mature enough to address the increasing challenges of interviewing humans for public opinion research.
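As an illustration of one of the measures named in this abstract, the sketch below computes an Earth Mover's Distance between human and synthetic answers on a ten-point scale. It assumes the standard one-dimensional formulation with unit distance between adjacent scale points, in which EMD equals the summed absolute difference of the two cumulative distributions. The abstract does not state how the authors compute it, and the function names and response counts here are invented for illustration.

```python
from itertools import accumulate


def normalize(counts):
    """Turn raw response counts per scale point into proportions."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return [c / total for c in counts]


def earth_movers_distance(counts_a, counts_b):
    """1-D EMD between two distributions over the same ordered scale.

    Assumption: adjacent scale points are one unit apart, so EMD is the
    sum of absolute differences between the cumulative distributions.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both distributions must cover the same scale")
    cdf_a = accumulate(normalize(counts_a))
    cdf_b = accumulate(normalize(counts_b))
    return sum(abs(a - b) for a, b in zip(cdf_a, cdf_b))


# Invented counts: humans spread evenly over the scale, while the synthetic
# respondents cluster on the two middle points.
human_counts = [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
synthetic_counts = [0, 0, 0, 0, 50, 50, 0, 0, 0, 0]
emd = earth_movers_distance(human_counts, synthetic_counts)
```

The two toy distributions share the same mean, so a comparison of averages would show no gap, while the EMD is clearly non-zero. That is one reason a distribution-level measure is useful alongside aggregate summaries.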
Documenting Corporate Harm: A Semantic Action Trajectories Approach to the Opioid Industry Document Archive Shared Task
Ben Miller
This paper presents a method for modeling change in the possibility space of actors over time as represented in the Opioid Industry Document Archive (OIDA). The approach treats documents as a structured field of actor–action relations and models these relations as semantic action trajectories across time. Semantic role labeling (SRL) using the Emory Language and Information Toolkit (ELIT) is applied to extract subject–predicate structures from a corpus of internal industry documents. Subjects are normalized and grouped into actor categories using a combination of rule-based heuristics and constrained language model adjudication. Predicate vocabularies associated with these actors are mapped to psycholinguistic categories using the LIWC lexicon, and random forest feature selection with principal component analysis is used to construct a low-dimensional representation of discourse structure across periods. The resulting discourse space reveals systematic shifts in how corporate actors, regulators, clinicians, and patients are positioned over time. In particular, corporate entities and the opioid products they produce follow nearly identical semantic trajectories, suggesting that companies and the pharmaceutical drugs they produce occupy similar roles in the archive’s discourse. This method provides a way to analyze changing institutional behavior at scale across heterogeneous litigation and historical archives.
Toward Unsupervised Conceptual Metaphor Discovery: A Case Study in Online Immigration Discourse
Alexandria Leto | Maria Leonor Pacheco
In Conceptual Metaphor Theory (CMT), a metaphor is a systematic mapping from a concrete source domain (e.g., physical load) to a more abstract target domain (e.g., taxes), so that reasoning about concepts in the target domain is guided by inferences from the source domain. In this work, we propose that since different source domains can frame the same target in starkly different ways, the conceptual mappings evidenced by metaphorical expressions can guide computational political discourse analysis. We present a proof-of-concept for an unsupervised method that uncovers salient conceptual mappings from a corpus. Prior work in computational political metaphor analysis has drawn on CMT, but it typically requires a predetermined inventory of focused source and target domains. In contrast, we introduce a simple LLM-based method that detects metaphorical expressions from a corpus with strong performance, then clusters them to approximate source domain categories. We demonstrate its utility through a case study on online immigration discourse, showing that the resulting metaphor clusters provide context for frame analysis. We conclude by outlining future work needed to develop a robust framework for conceptual metaphor discovery in political discourse.
Simulating Social Attitudes with LLMs: Accuracy, Demographic Effects, and Refusal Behavior in the Sensitive Domain of Suicide Prevention
Cristina J. Perez | Michael P. Vasquez Jr | Philippe Giabbanelli | Patrick Y. Wu
Large language models (LLMs) are increasingly used to simulate public opinion, yet their validity in sensitive policy domains remains underexplored. We evaluate whether LLMs can reproduce attitudes toward suicide prevention policies using 32 questions drawn from seven nationally representative U.S. surveys (2023-2025). We systematically vary demographic conditioning (race/ethnicity, gender, age, education, income, party), prompt framing (direct elicitation, respondent embodiment, specialist embodiment), and model architecture (GPT-5 Nano, DeepSeek V3.2, Meta Llama 3.1 8B, Mistral Small 24B). Across 811,560 prompts, the mean absolute error—the average gap between predicted and human response distributions—is 23 percentage points. We also find that LLM responses to demographic-conditioned prompts diverge substantially from prompts without demographic information. In short, what distribution LLMs draw on when generating responses to sensitive polling questions remains unclear. Model choice matters more than framing for accuracy, whereas refusal behavior varies sharply across models and prompt designs. Our findings highlight the limitations of LLMs for social simulation in the context of sensitive topics.
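To make the error measure in this abstract concrete, the following sketch computes a mean absolute error in percentage points between a predicted and a human response distribution. It assumes the error is averaged over the response options of a question and then over questions, which the abstract does not specify. The function names, response options, and percentages are invented for illustration.

```python
def mae_percentage_points(human, predicted):
    """Mean absolute gap, in percentage points, across response options.

    `human` and `predicted` map each response option to a percentage.
    Assumption: both dictionaries cover exactly the same options.
    """
    if set(human) != set(predicted):
        raise ValueError("both distributions must share the same options")
    gaps = [abs(human[option] - predicted[option]) for option in human]
    return sum(gaps) / len(gaps)


def mean_over_questions(pairs):
    """Average the per-question error over (human, predicted) pairs."""
    errors = [mae_percentage_points(h, p) for h, p in pairs]
    return sum(errors) / len(errors)


# Invented example: one polling question with three response options.
human_dist = {"support": 60.0, "oppose": 30.0, "unsure": 10.0}
predicted_dist = {"support": 85.0, "oppose": 10.0, "unsure": 5.0}
question_error = mae_percentage_points(human_dist, predicted_dist)
```

Averaging over options first keeps questions with many response options from dominating the overall figure; a different weighting would be equally defensible.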
Gender Disparities in LLM-Based Intimate Partner Violence Detection
Tabia Tanzin Prama | Mikaela Irene Fudolig | Abigail M. Crocker | Christopher M. Danforth | Peter Dodds
Intimate Partner Violence (IPV) is a major public health concern, and large language models (LLMs) are increasingly used for support and information-seeking in sensitive domains. We examine whether LLMs perceive relationship abuse differently depending on victim–perpetrator gender configuration. Using 475 Reddit posts from r/relationship_advice, we generate counterfactual variants by swapping gendered identifiers to create four dyads: female–female (F/F), female–male (F/M), male–female (M/F), and male–male (M/M), where the first position denotes the victim. Four recent LLMs (GPT-5o, Gemini 3, Llama 4, and Grok 3) evaluate each variant using a structured questionnaire covering IPV, perpetrator intent, cheating, and abuse subtypes. Results show substantial variation across models and dyads. Abuse and intent detection systematically decrease in mixed-gender dyads where the victim is male, with female perpetrator identity emerging as a consistent negative predictor of abuse recognition. Mixed-effects logistic regression confirms that gender roles significantly shape model outputs. Our findings suggest that LLMs reproduce gendered biases from online training data, with implications for support-related deployment. Code and resources are available at GitHub.
Datasets and Methods for Improving the Cultural Capabilities of NLP Systems: A Survey
Tania Chakraborty | Eylon Caplan | Zhaoqing Wu | Kevin Cushing | Bruce Qin | Shreya Havaldar | Dan Goldwasser
In recent years, there has been a surge of interest in Cultural NLP, with substantial efforts to create globally inclusive NLP systems. The rapid growth of literature in this field makes it difficult to track trends in methods and data resources. To address this, we survey over 375 papers to answer three complementary questions: (1) What Cultural Capabilities (CCs) are being targeted in NLP systems? (2) How are cultural data resources being created? and (3) What methods are being used to improve the CCs of those systems? We discuss trends observed across the three questions, and identify relevant research gaps. To facilitate further research in this field, we release our full list of surveyed papers, in the form of an interactive web interface, CultureMine, which includes a feature to allow researchers to add their work; we hope this facilitates future research and proves to be a valuable resource for the Cultural NLP community.
Towards More Transparent Online Campaigning: Detecting Political Campaign Content in Election-related Social Media Posts
Abdullah Alabdullah | Conor Gaughan | Thomas Flavel | Shubhanjay Varma | Rachel Gibson | Marta Cantijoch | Alexandru Cernat | Riza Batista-Navarro
A large part of political campaigns during elections is now being conducted online, with political actors leveraging their networks on social media platforms. To maintain transparency in political communications, regulations applicable to online campaigning have been put in place in many democracies. While it should be straightforward for voters to determine who produced and funded online advertisements comprising paid political campaigns, it is much more challenging to detect if organic content, i.e., social media posts, pertains to political campaigning, due to possibly subtle yet suggestive language that can be used by certain actors. In this paper, we investigate the feasibility of automatically detecting whether a given tweet posted by a political actor pertains to political campaigning, and if yes, whether it was conveyed in a direct or indirect (subtle) manner. After establishing an annotation scheme for the task of detecting political campaign content in tweets, we fine-tuned three encoder models (BERT, BERTweet and PoliBERTweet) for the same task and evaluated their performance. Our results show that fine-tuning BERTweet leads to the best macro-averaged F1-score (0.776), although all models consistently struggle to detect indirect campaigning.
Mapping the Landscape of Unregulated eXplicit Contents on Reddit
Msvpj Sathvik | Manan Roy Choudhury | Rishita Agarwal | Sathwik Narkedimilli | Thao Ha | Liesel Sharabi | Vivek Gupta
The rise of online platforms has facilitated covert forms of explicit content, which pose significant challenges for detection and regulation. Often using coded language to bypass moderation, this content erodes user trust and may be associated with scams that pose direct financial and personal risks. In this study, we map the landscape of online explicit content posts, focusing on their categorization, linguistic strategies, and temporal and behavioral patterns as they appear on the mainstream platform Reddit. We investigated five distinct content categories, namely Virtual Services (VS), Physical Services (PS), Exhibitionism (Ex), Couples and Group Interactions (CGI), and Content Creation and Sales (CCS), and performed large-scale experimentation using state-of-the-art large language models (LLMs) such as GPT-4, LLaMA 3.3-70B-Instruct, Gemini 1.5 Flash, Mistral 8×7B, Qwen 2.5 Turbo, and Claude 3.5 Haiku. Our work demonstrates that a nuanced classification of these services requires moving beyond simple keywords, and we establish that expressive signals such as sentiment, emotion, and tone are critical features for accurate detection. Our analysis reveals the distinct behavioral and psychosocial expression patterns that characterize each service category, providing a robust framework for future moderation.
From Adoption to Adaptation: Tracing the Diffusion of New Emojis on Twitter
Yuhang Zhou | Xuan Lu | Wei Ai
The frequent introduction of new emojis in each Unicode release creates a dynamic shift in social media content, providing a unique opportunity to explore the evolution of digital language. Analyzing a large dataset of sampled English tweets, we examine how newly released emojis gain popularity and evolve in meaning. We find that the community size of early adopters and emoji semantics are positively correlated with their popularity. Certain emojis experienced notable shifts in the meanings and sentiment associations during the diffusion process. Additionally, we propose a novel framework utilizing language models to extract words and pre-existing emojis with semantically similar contexts, which enhances the interpretation of new emojis. The framework demonstrates its effectiveness in improving downstream text classification performance by substituting unknown new emojis with familiar ones. This study offers a new perspective in understanding how new language units are adopted, adapted, and integrated into the fabric of online communication.
Social Construction of Urban Space: Using LLMs to Identify Neighborhood Boundaries From Craigslist Ads
Adam Visokay | Ruth Bagley | Chris Hess | Ian Kennedy | Kyle Crowder | Rob Voigt | Denis Peskoff
Rental listings offer a window into how urban space is socially constructed through language. We analyze Chicago Craigslist rental advertisements from 2018 to 2024 to examine how listing agents characterize neighborhoods, identifying mismatches between institutional boundaries and neighborhood claims. Through manual and large language model annotation, we classify unstructured listings from Craigslist according to their neighborhood. Further geospatial analysis reveals three distinct patterns: properties with conflicting neighborhood designations due to competing spatial definitions, border properties with valid claims to adjacent neighborhoods, and “reputation laundering” where listings claim association with distant, desirable neighborhoods. Through topic modeling, we identify patterns that correlate with spatial positioning: listings further from neighborhood centers emphasize different amenities than centrally-located units. Natural language processing techniques reveal how definitions of urban spaces are contested in ways that traditional methods overlook.
The Hidden Language of Harm: Examining the Role of Emojis in Harmful Online Communication and Content Moderation
Yuhang Zhou | Yimin Xiao | Wei Ai | Ge Gao
Social media platforms have become central to modern communication, yet they also harbor offensive content that challenges platform safety and inclusivity. While prior research has primarily focused on textual indicators of offense, the role of emojis, ubiquitous visual elements in online discourse, remains underexplored. Emojis, despite being rarely offensive in isolation, can acquire harmful meanings through symbolic associations, sarcasm, and contextual misuse. In this work, we systematically examine emoji contributions to offensive Twitter messages, analyzing their distribution across offense categories and how users exploit emoji ambiguity. To address this, we propose an LLM-powered, multi-step moderation pipeline that selectively replaces harmful emojis while preserving the tweet’s semantic intent. Human evaluations demonstrate that our approach effectively reduces offensiveness while preserving semantic integrity. Our analysis also reveals heterogeneous effects across offense types, offering nuanced insights for online communication and emoji moderation.