2025
Towards Style Alignment in Cross-Cultural Translation
Shreya Havaldar | Adam Stein | Eric Wong | Lyle Ungar
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Successful communication depends on the speaker’s intended style (i.e., what the speaker is trying to convey) aligning with the listener’s interpreted style (i.e., what the listener perceives). However, cultural differences often lead to misalignment between the two; for example, politeness is often lost in translation. We characterize the ways that LLMs fail to translate style – biasing translations towards neutrality and performing worse in non-Western languages. We mitigate these failures with RASTA (Retrieval-Augmented STylistic Alignment), a method that leverages learned stylistic concepts to encourage LLM translation to appropriately convey cultural communication norms and align style.
Entailed Between the Lines: Incorporating Implication into NLI
Shreya Havaldar | Hamidreza Alvari | John Palowitch | Mohammad Javad Hosseini | Senaka Buthpitiya | Alex Fabrikant
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Much of human communication depends on implication, conveying meaning beyond literal words to express a wider range of thoughts, intentions, and feelings. For models to better understand and facilitate human communication, they must be responsive to the text’s implicit meaning. We focus on Natural Language Inference (NLI), a core tool for many language tasks, and find that state-of-the-art NLI models and datasets struggle to recognize a range of cases where entailment is implied rather than explicit in the text. We formalize implied entailment as an extension of the NLI task and introduce the Implied NLI dataset (INLI) to help today’s LLMs both recognize a broader variety of implied entailments and distinguish between implicit and explicit entailment. We show how LLMs fine-tuned on INLI understand implied entailment and can generalize this understanding across datasets and domains.
Probabilistic Soundness Guarantees in LLM Reasoning Chains
Weiqiu You | Anton Xue | Shreya Havaldar | Delip Rao | Helen Jin | Chris Callison-Burch | Eric Wong
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
In reasoning chains generated by large language models (LLMs), initial errors often propagate and undermine the reliability of the final conclusion. Current LLM-based error detection methods often fail to detect propagated errors because earlier errors can corrupt judgments of downstream reasoning. To better detect such errors, we introduce Autoregressive Reasoning Entailment Stability (ARES), a probabilistic framework that evaluates each reasoning step based solely on previously-verified premises. This inductive method yields a nuanced score for each step and provides certified statistical guarantees of its soundness, rather than a brittle binary label. ARES achieves state-of-the-art performance across four benchmarks (72.1% Macro-F1, +8.2 points) and demonstrates superior robustness on very long synthetic reasoning chains, where it excels at detecting propagated errors (90.3% F1, +27.6 points).
Adaptively profiling models with task elicitation
Davis Brown | Prithvi Balehannina | Helen Jin | Shreya Havaldar | Hamed Hassani | Eric Wong
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Language model evaluations often fail to characterize consequential failure modes, forcing experts to inspect outputs and build new benchmarks. We introduce task elicitation, a method that automatically builds new evaluations to profile model behavior. Task elicitation finds hundreds of natural-language tasks—an order of magnitude more than prior work—where frontier models exhibit systematic failures, in domains ranging from forecasting to online harassment. For example, we find that Sonnet 3.5 over-associates quantum computing and AGI and that o3-mini is prone to hallucination when fabrications are repeated in-context.
Culturally-Aware Conversations: A Framework & Benchmark for LLMs
Shreya Havaldar | Young Min Cho | Sunny Rai | Lyle Ungar
Proceedings of the Fourth Workshop on Bridging Human-Computer Interaction and Natural Language Processing (HCI+NLP)
Existing benchmarks that measure cultural adaptation in LLMs are misaligned with the actual challenges these models face when interacting with users from diverse cultural backgrounds. In this work, we introduce the first framework and benchmark designed to evaluate LLMs in realistic, multicultural conversational settings. Grounded in sociocultural theory, our framework formalizes how linguistic style — a key element of cultural communication — is shaped by situational, relational, and cultural context. We construct a benchmark dataset based on this framework, annotated by culturally diverse raters, and propose a new set of desiderata for cross-cultural evaluation in NLP: conversational framing, stylistic sensitivity, and subjective correctness. We evaluate today’s top LLMs on our benchmark and show that these models struggle with cultural adaptation in a conversational setting.
Social Norms in Cinema: A Cross-Cultural Analysis of Shame, Pride and Prejudice
Sunny Rai | Khushang Zaveri | Shreya Havaldar | Soumna Nema | Lyle Ungar | Sharath Chandra Guntuku
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Shame and pride are social emotions expressed across cultures to motivate and regulate people’s thoughts, feelings, and behaviors. In this paper, we introduce the first cross-cultural dataset of over 10K shame/pride-related expressions, with their underlying social expectations, from ~5.4K Bollywood and Hollywood movies. We examine *how* and *why* shame and pride are expressed across cultures using psychology-informed language analysis combined with large language models. We find significant cross-cultural differences in shame and pride expression that align with known cultural tendencies of the USA and India – e.g., shame expressions in Hollywood predominantly discuss the *self*, whereas shame is expressed toward *others* in Bollywood. In both cultures, women are sanctioned more, and for violating similar social expectations.
2024
Building Knowledge-Guided Lexica to Model Cultural Variation
Shreya Havaldar | Salvatore Giorgi | Sunny Rai | Thomas Talhelm | Sharath Chandra Guntuku | Lyle Ungar
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Cultural variation exists between nations (e.g., the United States vs. China), but also within regions (e.g., California vs. Texas, Los Angeles vs. San Francisco). Measuring this regional cultural variation can illuminate how and why people think and behave differently. Historically, it has been difficult to computationally model cultural variation due to a lack of training data and scalability constraints. In this work, we introduce a new research problem for the NLP community: How do we measure variation in cultural constructs across regions using language? We then provide a scalable solution: building knowledge-guided lexica to model cultural variation, encouraging future work at the intersection of NLP and cultural understanding. We also highlight modern LLMs’ failure to measure cultural variation or generate culturally varied language.
2023
Comparing Styles across Languages
Shreya Havaldar | Matthew Pressimone | Eric Wong | Lyle Ungar
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Understanding how styles differ across languages is advantageous for training both humans and computers to generate culturally appropriate text. We introduce an explanation framework to extract stylistic differences from multilingual LMs and compare styles across languages. Our framework (1) generates comprehensive style lexica in any language and (2) consolidates feature importances from LMs into comparable lexical categories. We apply this framework to compare politeness, creating the first holistic multilingual politeness dataset and exploring how politeness varies across four languages. Our approach enables an effective evaluation of how distinct linguistic categories contribute to stylistic variations and provides interpretable insights into how people communicate differently around the world.
Faithful Chain-of-Thought Reasoning
Qing Lyu | Shreya Havaldar | Adam Stein | Li Zhang | Delip Rao | Eric Wong | Marianna Apidianaki | Chris Callison-Burch
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Multilingual Language Models are not Multicultural: A Case Study in Emotion
Shreya Havaldar | Sunny Rai | Bhumika Singhal | Langchen Liu | Sharath Chandra Guntuku | Lyle Ungar
Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis
Emotions are experienced and expressed differently across the world. In order to use Language Models (LMs) for multilingual tasks that require emotional sensitivity, LMs must reflect this cultural variation in emotion. In this study, we investigate whether widely used multilingual LMs in 2023 reflect differences in emotional expression across cultures and languages. We find that embeddings obtained from LMs (e.g., XLM-RoBERTa) are Anglocentric, and that generative LMs (e.g., ChatGPT) reflect Western norms, even when responding to prompts in other languages. Our results show that multilingual LMs do not successfully learn the culturally appropriate nuances of emotion, and we highlight possible research directions toward correcting this.