2025
Mind the Value-Action Gap: Do LLMs Act in Alignment with Their Values?
Hua Shen | Nicholas Clark | Tanu Mitra
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Existing research assesses LLMs’ values by analyzing their stated inclinations, overlooking potential discrepancies between stated values and actions, termed the “Value-Action Gap.” This study introduces ValueActionLens, a framework for evaluating the alignment between LLMs’ stated values and their value-informed actions. The framework includes a dataset of 14.8k value-informed actions across 12 cultures and 11 social topics, along with two tasks that measure alignment through three metrics. Experiments show substantial misalignment between LLM-generated value statements and their actions, with significant variation across scenarios and models. These misalignments reveal potential harms, highlighting the risks of relying solely on stated values to predict behavior. The findings underscore the need for context-aware evaluations of LLM values and their value-action gaps.
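To make the abstract’s alignment measurement concrete, a minimal sketch of one possible metric is shown below: the fraction of scenarios in which a model’s chosen action matches its stated stance. The function, labels, and data are hypothetical and do not reproduce ValueActionLens’s actual tasks or its three metrics.

```python
# Hypothetical sketch: share of scenarios where an LLM's chosen action
# agrees with its previously stated value stance ("support" / "oppose").
# This illustrates the idea of a value-action gap; it is not the paper's metric.

def value_action_alignment(stated: list[str], acted: list[str]) -> float:
    """Return the fraction of scenarios where the stated stance matches the acted stance."""
    assert len(stated) == len(acted)
    matches = sum(s == a for s, a in zip(stated, acted))
    return matches / len(stated)

stated = ["support", "support", "oppose", "support"]
acted  = ["support", "oppose",  "oppose", "oppose"]
print(f"alignment rate: {value_action_alignment(stated, acted):.2f}")  # 0.50
```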
Causally Modeling the Linguistic and Social Factors that Predict Email Response
Yinuo Xu | Hong Chen | Sushrita Rakshit | Aparna Ananthasubramaniam | Omkar Yadav | Mingqian Zheng | Michael Jiang | Lechen Zhang | Bowen Yi | Kenan Alkiek | Abraham Israeli | Bangzhao Shu | Hua Shen | Jiaxin Pei | Haotian Zhang | Miriam Schirmer | David Jurgens
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Email is a vital conduit for human communication across businesses, organizations, and broader societal contexts. In this study, we model the intents, expectations, and responsiveness of email exchanges. To this end, we release SIZZLER, a new dataset of 1,800 emails annotated with nuanced types of intents and expectations. We benchmark models ranging from feature-based logistic regression to zero-shot prompting of large language models. Leveraging the predictive model for intent, expectations, and 14 other features, we analyze 11.3M emails from GMANE to study how linguistic and social factors shape the conversational dynamics of email exchanges. Through our causal analysis, we find that email response rates are influenced by social status, argumentation, and, in certain limited contexts, the strength of social connection.
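The feature-based logistic-regression baseline mentioned in the abstract might, in spirit, look like the sketch below. The features and toy data are invented for illustration; the paper’s actual predictors (intent, expectations, and 14 other features) are far richer.

```python
# Illustrative sketch of a feature-based logistic-regression baseline for
# predicting whether an email receives a response. Feature names and data
# are invented; they do not reflect the paper's actual feature set.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy features per email: [word_count, num_questions, sender_is_high_status]
X = np.array([[120, 1, 1], [45, 0, 0], [300, 2, 1], [15, 0, 0]])
y = np.array([1, 0, 1, 0])  # 1 = the email received a response

clf = LogisticRegression().fit(X, y)
print(clf.predict([[80, 1, 0]]))  # predicted response label for a new email
```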
Proceedings of the 9th Widening NLP Workshop
Chen Zhang | Emily Allaway | Hua Shen | Lesly Miculicich | Yinqiao Li | Meryem M'hamdi | Peerat Limkonchotiwat | Richard He Bai | Santosh T.y.s.s. | Sophia Simeng Han | Surendrabikram Thapa | Wiem Ben Rim
Proceedings of the 9th Widening NLP Workshop
ValueCompass: A Framework for Measuring Contextual Value Alignment Between Human and LLMs
Hua Shen | Tiffany Knearem | Reshmi Ghosh | Yu-Ju Yang | Nicholas Clark | Tanu Mitra | Yun Huang
Proceedings of the 9th Widening NLP Workshop
As AI advances, aligning it with diverse human and societal values grows increasingly critical. But how do we define these values and measure AI’s adherence to them? We present ValueCompass, a framework grounded in psychological theories, for assessing human-AI alignment. Applying it to five diverse LLMs and 112 humans from seven countries across four scenarios (collaborative writing, education, the public sector, and healthcare), we uncover key misalignments. For example, humans prioritize national security while LLMs often reject it. Values also shift across contexts, demanding scenario-specific alignment strategies. This work advances AI design by mapping how systems can better reflect societal ethics.
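One simple way to quantify the kind of human-LLM value alignment ValueCompass measures is a rank correlation over per-value ratings. The sketch below uses invented values and ratings; it is not the framework’s actual instrument.

```python
# Hypothetical sketch: quantify human-LLM value alignment in one scenario as
# a rank correlation over per-value importance ratings. Values and ratings
# are invented; ValueCompass's actual measurement is not reproduced here.
from scipy.stats import spearmanr

values = ["national security", "privacy", "fairness", "autonomy"]
human_ratings = [5, 4, 4, 3]  # e.g., mean human importance ratings (1-5)
llm_ratings   = [1, 5, 4, 4]  # e.g., an LLM's elicited ratings

rho, _ = spearmanr(human_ratings, llm_ratings)
print(f"alignment (Spearman rho): {rho:.2f}")
```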
2023
MultiTurnCleanup: A Benchmark for Multi-Turn Spoken Conversational Transcript Cleanup
Hua Shen | Vicky Zayats | Johann Rocholl | Daniel Walker | Dirk Padfield
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Current disfluency detection models focus on individual utterances, each from a single speaker. However, many discontinuity phenomena in spoken conversational transcripts span multiple turns and therefore cannot be identified by single-utterance disfluency detection models. This study addresses these phenomena by proposing a novel Multi-Turn Cleanup task for spoken conversational transcripts and collecting a new dataset, MultiTurnCleanup. We design a data labeling schema to collect a high-quality dataset and provide extensive data analysis. Furthermore, we evaluate two modeling approaches as benchmarks for future research.
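To illustrate why single-utterance models miss these phenomena, the invented example below shows a restart abandoned across a turn boundary. Representing cleanup as keep/drop decisions over spans is one simple framing for illustration, not necessarily the paper’s labeling schema.

```python
# Invented example of cross-turn cleanup: a restart spanning a turn boundary
# is dropped. A disfluency detector that sees one utterance at a time cannot
# recognize that turn 0 was abandoned and restarted in turn 2.
transcript = [
    ("A", "so we should meet on--"),
    ("B", "sorry, go ahead"),
    ("A", "we should meet on Friday"),
]
keep = [False, True, True]  # one possible labeling: drop the abandoned restart
cleaned = [text for (_, text), k in zip(transcript, keep) if k]
print(cleaned)  # ['sorry, go ahead', 'we should meet on Friday']
```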
Gentopia.AI: A Collaborative Platform for Tool-Augmented LLMs
Binfeng Xu | Xukun Liu | Hua Shen | Zeyu Han | Yuhan Li | Murong Yue | Zhiyuan Peng | Yuchen Liu | Ziyu Yao | Dongkuan Xu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Augmented Language Models (ALMs) empower large language models with the ability to use tools, transforming them into intelligent agents for real-world interactions. However, most existing frameworks for ALMs are, to varying degrees, deficient in critical features: flexible customization, collaborative democratization, and holistic evaluation. This paper proposes Gentopia, a lightweight and extensible framework for ALMs. Gentopia allows the flexible customization of agents through simple configurations, seamlessly integrating various language models, task formats, prompting modules, and plugins into a unified paradigm. Furthermore, we establish Gentpool, a public platform for registering and sharing user-customized agents. Agents registered in Gentpool are composable, so they can be assembled for agent collaboration, advancing the democratization of artificial intelligence. To ensure high-quality agents, Gentbench, an integral component of Gentpool, thoroughly evaluates user-customized agents across diverse aspects such as safety, robustness, and efficiency. We release Gentopia on GitHub and will continue to develop it.
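To give a flavor of the “simple configurations” the abstract mentions, a hypothetical agent config might look like the following. The field names are invented for illustration and do not reflect Gentopia’s actual schema; see the project’s GitHub repository for the real interface.

```python
# Hypothetical agent configuration in the spirit of Gentopia's "simple
# configurations." All field names here are invented and do not reflect
# Gentopia's actual schema.
agent_config = {
    "name": "web_researcher",
    "llm": "gpt-4",
    "prompt_template": "You are a careful research assistant. {task}",
    "plugins": ["web_search", "calculator"],
    "evaluation": {"aspects": ["safety", "robustness"]},  # cf. Gentbench
}

def build_agent(config: dict) -> str:
    """Stub: a real framework would instantiate the LLM and plugins here."""
    return f"Agent({config['name']}, plugins={config['plugins']})"

print(build_agent(agent_config))
```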
2022
Are Shortest Rationales the Best Explanations for Human Understanding?
Hua Shen | Tongshuang Wu | Wenbo Guo | Ting-Hao Huang
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Existing self-explaining models typically favor extracting the shortest possible rationales (snippets of the input text “responsible for” the corresponding output) to explain model predictions, under the assumption that shorter rationales are more intuitive to humans. However, this assumption has yet to be validated. Is the shortest rationale indeed the most human-understandable? To answer this question, we design a self-explaining model, LimitedInk, which allows users to extract rationales at any target length. Compared to existing baselines, LimitedInk achieves comparable end-task performance and human-annotated rationale agreement, making it a suitable representative of the recent class of self-explaining models. We use LimitedInk to conduct a user study on the impact of rationale length, in which we ask human judges to predict the sentiment labels of documents based only on LimitedInk-generated rationales of different lengths. We show that rationales that are too short do not help humans predict labels better than randomly masked text, suggesting that the most human-understandable rationales require more careful design.
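A length-controlled rationale extractor can be caricatured as keeping the top-k tokens by importance, with k set by the user’s target length. The sketch below uses made-up scores and a simple top-k rule; LimitedInk itself learns the selection end to end, so this is only an illustration of the length-control idea.

```python
# Caricature of length-controlled rationale extraction: keep the top-k tokens
# by an importance score, where k is a user-chosen rationale length. Scores
# here are made up; LimitedInk's actual selection is learned, not top-k.
def extract_rationale(tokens: list[str], scores: list[float], k: int) -> list[str]:
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]  # preserve original word order

tokens = ["the", "food", "was", "absolutely", "wonderful", "despite", "slow", "service"]
scores = [0.1, 0.6, 0.1, 0.8, 0.9, 0.3, 0.5, 0.4]
print(extract_rationale(tokens, scores, k=3))  # ['food', 'absolutely', 'wonderful']
```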