Tania Chakraborty
2026
Datasets and Methods for Improving the Cultural Capabilities of NLP Systems: A Survey
Tania Chakraborty | Eylon Caplan | Zhaoqing Wu | Kevin Cushing | Bruce Qin | Shreya Havaldar | Dan Goldwasser
Proceedings of the Seventh Workshop on Natural Language Processing and Computational Social Science
In recent years, there has been a surge of interest in Cultural NLP, with substantial efforts to create globally inclusive NLP systems. The rapid growth of literature in this field makes it difficult to track trends in methods and data resources. To address this, we survey over 375 papers to answer three complementary questions: (1) What Cultural Capabilities (CCs) are being targeted in NLP systems? (2) How are cultural data resources being created? and (3) What methods are being used to improve the CCs of those systems? We discuss trends observed across the three questions and identify relevant research gaps. To support further research in this field, we release our full list of surveyed papers as an interactive web interface, CultureMine, which also allows researchers to add their own work; we hope it proves to be a valuable resource for the Cultural NLP community.
Splits! Flexible Sociocultural Linguistic Investigation at Scale
Eylon Caplan | Tania Chakraborty | Dan Goldwasser
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Variation in language use, shaped by speakers’ sociocultural background and specific context of use, offers a rich lens into cultural perspectives, values, and opinions. For example, Chinese students discuss *healthy eating* with words like *timing*, *regularity*, and *digestion*, whereas Americans use vocabulary like *balancing food groups* and *avoiding fat and sugar*, reflecting distinct cultural models of nutrition (Banna et al., 2016). The computational study of these Sociocultural Linguistic Phenomena (SLP) has traditionally been done in NLP via tailored analyses of specific groups or topics, requiring specialized data collection and experimental operationalization—a process not well-suited to quick hypothesis exploration and prototyping. To address this, we propose constructing a "sandbox" designed for systematic and flexible sociolinguistic research. Using our method, we construct a demographically/topically split Reddit dataset, **Splits!**, validated by self-identification and by replicating several known SLPs from existing literature. We showcase the sandbox’s utility with a scalable, two-stage process that filters large collections of *potential* SLPs (PSLPs) to surface the most promising candidates for deeper, qualitative investigation.
2025
VIBE: Can a VLM Read the Room?
Tania Chakraborty | Eylon Caplan | Dan Goldwasser
Findings of the Association for Computational Linguistics: EMNLP 2025
Understanding human social behavior, such as recognizing emotions and the social dynamics causing them, is an important and challenging problem. While LLMs have made remarkable advances, they are limited to the textual domain and cannot account for the major role that non-verbal cues play in understanding social situations. Vision Language Models (VLMs) can potentially fill this gap; however, their ability to make correct inferences over such social cues has received little attention. In this paper, we explore the capabilities of VLMs at social reasoning. We identify a previously overlooked limitation in VLMs: the Visual Social-Pragmatic Inference gap. To target this gap, we propose a new task for VLMs: Visual Social-Pragmatic Inference. We construct a high-quality dataset to test the abilities of VLMs on this task and benchmark the performance of several VLMs on it.