2025
Towards Style Alignment in Cross-Cultural Translation
Shreya Havaldar | Adam Stein | Eric Wong | Lyle Ungar
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Successful communication depends on the speaker’s intended style (i.e., what the speaker is trying to convey) aligning with the listener’s interpreted style (i.e., what the listener perceives). However, cultural differences often lead to misalignment between the two; for example, politeness is often lost in translation. We characterize the ways that LLMs fail to translate style – biasing translations towards neutrality and performing worse in non-Western languages. We mitigate these failures with RASTA (Retrieval-Augmented STylistic Alignment), a method that leverages learned stylistic concepts to encourage LLM translation to appropriately convey cultural communication norms and align style.
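To make the retrieval-augmented idea concrete, here is a minimal sketch of how a stylistically aligned translation step could be wired. The llm callable, the exemplar-store schema (entries with "text" and "norm" fields), the toy token-overlap retriever, and the prompt wording are illustrative assumptions, not the paper's actual RASTA implementation.

    from typing import Callable, Dict, List

    def retrieve_style_exemplars(source: str, store: List[Dict[str, str]],
                                 k: int = 3) -> List[Dict[str, str]]:
        # Toy retriever: rank stored exemplars of the target culture's
        # communication norms by naive token overlap with the source sentence.
        def overlap(ex: Dict[str, str]) -> int:
            return len(set(source.lower().split()) & set(ex["text"].lower().split()))
        return sorted(store, key=overlap, reverse=True)[:k]

    def translate_with_style(source: str, target_lang: str,
                             store: List[Dict[str, str]],
                             llm: Callable[[str], str]) -> str:
        # Prompt the model to translate while explicitly conveying the stylistic
        # norms in the retrieved exemplars, rather than neutralizing them.
        exemplars = retrieve_style_exemplars(source, store)
        norms = "\n".join(f"- {e['text']} (norm: {e['norm']})" for e in exemplars)
        prompt = (
            f"Translate the sentence into {target_lang}, preserving the speaker's "
            f"politeness and style rather than neutralizing it.\n"
            f"Relevant cultural style exemplars:\n{norms}\n\n"
            f"Sentence: {source}\nTranslation:"
        )
        return llm(prompt)

In practice the retrieval would draw on learned stylistic concepts rather than simple token overlap; the sketch only shows how retrieved norms can be folded into the translation prompt.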
Probabilistic Soundness Guarantees in LLM Reasoning Chains
Weiqiu You | Anton Xue | Shreya Havaldar | Delip Rao | Helen Jin | Chris Callison-Burch | Eric Wong
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
In reasoning chains generated by large language models (LLMs), initial errors often propagate and undermine the reliability of the final conclusion. Current LLM-based error detection methods often fail to detect propagated errors because earlier errors can corrupt judgments of downstream reasoning. To better detect such errors, we introduce Autoregressive Reasoning Entailment Stability (ARES), a probabilistic framework that evaluates each reasoning step based solely on previously-verified premises. This inductive method yields a nuanced score for each step and provides certified statistical guarantees of its soundness, rather than a brittle binary label. ARES achieves state-of-the-art performance across four benchmarks (72.1% Macro-F1, +8.2 points) and demonstrates superior robustness on very long synthetic reasoning chains, where it excels at detecting propagated errors (90.3% F1, +27.6 points).
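A minimal sketch of premise-conditioned, step-by-step verification in this spirit follows. The stochastic entails judge (e.g., a sampled LLM call), the sample count, and the acceptance threshold are assumptions for illustration; the paper's actual scoring and certified statistical guarantees are more involved.

    from typing import Callable, List

    def verify_chain(premises: List[str], steps: List[str],
                     entails: Callable[[List[str], str], bool],
                     n_samples: int = 10, threshold: float = 0.9) -> List[float]:
        # Score each reasoning step against only previously verified context,
        # so an earlier error cannot corrupt the judgment of later steps.
        verified = list(premises)
        scores: List[float] = []
        for step in steps:
            hits = sum(entails(verified, step) for _ in range(n_samples))
            p = hits / n_samples          # empirical entailment probability
            scores.append(p)
            if p >= threshold:            # only well-supported steps become premises
                verified.append(step)
        return scores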
Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs
Andong Hua | Kenan Tang | Chenhe Gu | Jindong Gu | Eric Wong | Yao Qin
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Prompt sensitivity, referring to the phenomenon where paraphrasing (i.e., expressing the same instruction in different words) leads to significant changes in large language model (LLM) performance, has been widely accepted as a core limitation of LLMs. In this work, we revisit this issue and ask: Is the widely reported high prompt sensitivity truly an inherent weakness of LLMs, or is it largely an artifact of evaluation processes? To answer this question, we systematically evaluate seven LLMs (e.g., the GPT and Gemini families) across six benchmarks, including both multiple-choice and open-ended tasks, using twelve diverse prompt templates. We find that much of the prompt sensitivity stems from heuristic evaluation methods, including log-likelihood scoring and rigid answer matching, which often overlook semantically correct responses expressed through alternative phrasings, such as synonyms or paraphrases. When we adopt LLM-as-a-judge evaluations, we observe a substantial reduction in performance variance and a consistently higher correlation in model rankings across prompts. Our findings suggest that modern LLMs are more robust to prompt templates than previously believed, and that prompt sensitivity may be more an artifact of evaluation than a flaw in the models.
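A minimal sketch of the measurement involved: accuracy spread across prompt templates under a pluggable scorer. The answer (model call) and judge (scorer) callables, the {question} template slot, and the example schema are assumptions for illustration; the paper's finding is that replacing rigid matching with an LLM-as-a-judge scorer shrinks the spread.

    from statistics import pstdev
    from typing import Callable, Dict, List

    def template_variance(templates: List[str], examples: List[Dict[str, str]],
                          answer: Callable[[str], str],
                          judge: Callable[[str, str], bool]) -> float:
        # Accuracy per template, then the spread of those accuracies.
        accs = []
        for tpl in templates:
            correct = 0
            for ex in examples:
                pred = answer(tpl.format(question=ex["question"]))
                correct += judge(pred, ex["gold"])
            accs.append(correct / len(examples))
        return pstdev(accs)

    def strict(pred: str, gold: str) -> bool:
        # Rigid answer matching; an LLM-as-a-judge scorer would instead accept
        # semantically equivalent paraphrases and synonyms.
        return pred.strip().lower() == gold.strip().lower()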
Adaptively profiling models with task elicitation
Davis Brown | Prithvi Balehannina | Helen Jin | Shreya Havaldar | Hamed Hassani | Eric Wong
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Language model evaluations often fail to characterize consequential failure modes, forcing experts to inspect outputs and build new benchmarks. We introduce task elicitation, a method that automatically builds new evaluations to profile model behavior. Task elicitation finds hundreds of natural-language tasks—an order of magnitude more than prior work—where frontier models exhibit systematic failures, in domains ranging from forecasting to online harassment. For example, we find that Sonnet 3.5 over-associates quantum computing and AGI and that o3-mini is prone to hallucination when fabrications are repeated in-context.
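A minimal sketch of an adaptive elicitation loop of this flavor, assuming hypothetical propose (task-generator call) and evaluate (target-model scorer) callables and a simple failure threshold; the paper's actual search and scoring are more sophisticated.

    from typing import Callable, List

    def elicit_failure_tasks(seed_domains: List[str],
                             propose: Callable[[str, List[str]], str],
                             evaluate: Callable[[str], float],
                             rounds: int = 5, fail_below: float = 0.5) -> List[str]:
        # Adaptively search for natural-language tasks the target model fails at:
        # propose a candidate task conditioned on failures found so far, score the
        # model on it, and keep it if performance falls below the threshold.
        failures: List[str] = []
        for domain in seed_domains:
            for _ in range(rounds):
                task = propose(domain, failures)
                if evaluate(task) < fail_below:
                    failures.append(task)
        return failures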
Avoiding Copyright Infringement via Large Language Model Unlearning
Guangyao Dou | Zheyuan Liu | Qing Lyu | Kaize Ding | Eric Wong
Findings of the Association for Computational Linguistics: NAACL 2025
Pre-trained Large Language Models (LLMs) have demonstrated remarkable capabilities but also pose risks by learning and generating copyrighted material, leading to significant legal and ethical concerns. In real-world scenarios, model owners need to continuously address copyright infringement as new requests for content removal emerge at different time points. This leads to the need for sequential unlearning, where copyrighted content is removed sequentially as new requests arise. Despite its practical relevance, sequential unlearning in the context of copyright infringement has not been rigorously explored in existing literature. To address this gap, we propose Stable Sequential Unlearning (SSU), a novel framework designed to unlearn copyrighted content from LLMs over multiple time steps. Our approach works by identifying and removing specific weight updates in the model’s parameters that correspond to copyrighted content. We improve unlearning efficacy by introducing a random labeling loss, and we ensure the model retains its general-purpose knowledge by adjusting targeted parameters. Experimental results show that SSU achieves an effective trade-off between unlearning efficacy and general-purpose language abilities, outperforming existing baselines.
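A generic sketch of task-vector-style weight-update removal, the family of edit the abstract describes. The state-dict inputs, the magnitude-based pruning, and the alpha scale are illustrative assumptions; the actual SSU procedure additionally uses a random labeling loss and targeted parameter adjustment.

    import torch

    def remove_weight_update(current_state: dict, content_tuned_state: dict,
                             alpha: float = 1.0, keep_frac: float = 0.9) -> dict:
        # Subtract (a pruned version of) the weight update attributed to the
        # unwanted content: delta = weights fine-tuned on that content minus the
        # current weights, negated task-vector style.
        new_state = {}
        for name, w in current_state.items():
            delta = content_tuned_state[name] - w
            k = int(delta.numel() * (1 - keep_frac))
            if k > 0:
                # Zero the smallest-magnitude entries so the edit stays targeted
                # and general-purpose knowledge is disturbed less.
                cutoff = delta.abs().flatten().kthvalue(k).values
                delta = torch.where(delta.abs() >= cutoff, delta,
                                    torch.zeros_like(delta))
            new_state[name] = w - alpha * delta
        return new_state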
NSF-SciFy: Mining the NSF Awards Database for Scientific Claims
Delip Rao | Weiqiu You | Eric Wong | Chris Callison-Burch
Proceedings of The 5th New Frontiers in Summarization Workshop
We introduce NSF-SciFy, a comprehensive dataset of scientific claims and investigation proposals extracted from National Science Foundation award abstracts. While previous scientific claim verification datasets have been limited in size and scope, NSF-SciFy represents a significant advance with an estimated 2.8 million claims from 400,000 abstracts spanning all science and mathematics disciplines. We present two focused subsets: NSF-SciFy-MatSci with 114,000 claims from materials science awards, and NSF-SciFy-20K with 135,000 claims across five NSF directorates. Using zero-shot prompting, we develop a scalable approach for joint extraction of scientific claims and investigation proposals. We demonstrate the dataset’s utility through three downstream tasks: non-technical abstract generation, claim extraction, and investigation proposal extraction. Fine-tuning language models on our dataset yields substantial improvements, with relative gains often exceeding 100%, particularly for claim and proposal extraction tasks. Our error analysis reveals that extracted claims exhibit high precision but lower recall, suggesting opportunities for further methodological refinement. NSF-SciFy enables new research directions in large-scale claim verification, scientific discovery tracking, and meta-scientific analysis.
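A minimal sketch of zero-shot joint extraction in this style, assuming a hypothetical llm completion callable and an illustrative prompt and JSON schema rather than the paper's actual prompts.

    import json
    from typing import Callable, Dict, List

    EXTRACTION_PROMPT = """From the award abstract below, list (1) the scientific
    claims it asserts and (2) the investigations it proposes. Answer as JSON with
    keys "claims" and "proposals", each a list of strings.

    Abstract:
    {abstract}
    """

    def extract_claims_and_proposals(abstract: str,
                                     llm: Callable[[str], str]) -> Dict[str, List[str]]:
        # Single zero-shot prompt for joint extraction; fall back to empty lists
        # if the model's output is not valid JSON.
        raw = llm(EXTRACTION_PROMPT.format(abstract=abstract))
        try:
            parsed = json.loads(raw)
            return {"claims": parsed.get("claims", []),
                    "proposals": parsed.get("proposals", [])}
        except json.JSONDecodeError:
            return {"claims": [], "proposals": []}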
2023
Comparing Styles across Languages
Shreya Havaldar | Matthew Pressimone | Eric Wong | Lyle Ungar
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Understanding how styles differ across languages is advantageous for training both humans and computers to generate culturally appropriate text. We introduce an explanation framework to extract stylistic differences from multilingual LMs and compare styles across languages. Our framework (1) generates comprehensive style lexica in any language and (2) consolidates feature importances from LMs into comparable lexical categories. We apply this framework to compare politeness, creating the first holistic multilingual politeness dataset and exploring how politeness varies across four languages. Our approach enables an effective evaluation of how distinct linguistic categories contribute to stylistic variations and provides interpretable insights into how people communicate differently around the world.
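A minimal sketch of the consolidation step described here, assuming an illustrative token-to-category lexicon and per-token token_importance scores (e.g., attributions from a multilingual politeness classifier); the paper's lexica generation and attribution methods are not reproduced.

    from collections import defaultdict
    from typing import Dict

    def consolidate_importances(token_importance: Dict[str, float],
                                lexicon: Dict[str, str]) -> Dict[str, float]:
        # Aggregate per-token feature importances into shared lexical categories
        # (e.g., greetings, honorifics) so scores are comparable across languages.
        totals: Dict[str, float] = defaultdict(float)
        for token, score in token_importance.items():
            category = lexicon.get(token)
            if category is not None:
                totals[category] += score
        return dict(totals)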
Faithful Chain-of-Thought Reasoning
Qing Lyu | Shreya Havaldar | Adam Stein | Li Zhang | Delip Rao | Eric Wong | Marianna Apidianaki | Chris Callison-Burch
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)