Yun-Shiuan Chuang
2026
Ada-RS: Adaptive Rejection Sampling for Selective Thinking
Yirou Ge | Yixi Li | Alec M. Chiu | Shivani Shekhar | Zijie Pan | Avinash Thangali | Yun-Shiuan Chuang | Chaitanya Kulkarni | Uma Kona | Linsey Pang | Prakhar Mehrotra
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Large language models (LLMs) are increasingly being deployed in cost- and latency-sensitive settings. While chain-of-thought improves reasoning, it can waste tokens on simple requests. We study selective thinking for tool-using LLMs and introduce Adaptive Rejection Sampling (Ada-RS), an algorithm-agnostic sample filtering framework for learning selective and efficient reasoning. For each given context, Ada-RS scores multiple sampled completions with an adaptive length-penalized reward, then applies stochastic rejection sampling to retain only high-reward candidates (or preference pairs) for downstream optimization. We demonstrate how Ada-RS plugs into both preference-pair (e.g., DPO) and grouped policy optimization strategies (e.g., DAPO). Using Qwen3-8B with LoRA on a synthetic tool call-oriented e-commerce benchmark, Ada-RS improves the accuracy-efficiency frontier over standard algorithms by reducing average output tokens by up to ∼80% and reducing thinking rate by up to ∼95% while maintaining or improving tool call accuracy. We further demonstrate that these gains generalize across model scales (Qwen3-1.7B, 8B, 14B) and domains (τ²-Bench airline and telecom). These results highlight that training signal selection is a powerful lever for efficient reasoning in latency-sensitive deployments.
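The filtering step described in this abstract can be illustrated with a minimal sketch. The abstract does not give the reward formula or the acceptance rule, so everything below is an assumption made for illustration: the `Completion` type, the `penalty` and `beta` parameters, the reward (correctness minus a penalty scaled by the longest completion in the group), and the acceptance probability exp((r - r_max) / beta). It is not the authors' implementation.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Completion:
    """Hypothetical container for one sampled completion."""
    text: str
    correct: bool      # e.g., whether the tool call matched the reference
    num_tokens: int    # output length, including any thinking tokens


def length_penalized_reward(c, group, penalty=0.5):
    # Assumed reward: 1 for a correct answer, minus a length penalty that is
    # scaled by the longest completion sampled for the same context.
    longest = max(1, max(x.num_tokens for x in group))
    return float(c.correct) - penalty * c.num_tokens / longest


def rejection_sample(group, beta=0.1, rng=None):
    # Assumed acceptance rule: keep a candidate with probability
    # exp((reward - best_reward) / beta), so the best candidate is always
    # kept and lower-reward candidates are kept less and less often.
    rng = rng or random.Random(0)
    rewards = [length_penalized_reward(c, group) for c in group]
    best = max(rewards)
    kept = []
    for c, r in zip(group, rewards):
        accept_prob = math.exp((r - best) / beta)
        if rng.random() < accept_prob:
            kept.append(c)
    return kept


group = [
    Completion("short, correct", correct=True, num_tokens=10),
    Completion("long, correct", correct=True, num_tokens=200),
    Completion("short, wrong", correct=False, num_tokens=10),
]
kept = rejection_sample(group)
```

Under these assumptions the short correct completion has the highest reward and is always retained; the retained set would then feed a downstream optimizer such as DPO or DAPO.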
Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents
Yun-Shiuan Chuang | Chaitanya Kulkarni | Alec M. Chiu | Avinash Thangali | Zijie Pan | Shivani Shekhar | Yirou Ge | Yixi Li | Uma Kona | Linsey Pang | Prakhar Mehrotra
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Interactive large language model (LLM) agents operating via multi-turn dialogue and multi-step tool calling are increasingly used in production. Benchmarks for these agents must both reliably compare models and yield on-policy training data. Prior agentic benchmarks (e.g., tau-bench, tau^2-bench, AppWorld) rely on fully deterministic backends, which are costly to build and iterate. We propose Proxy State-Based Evaluation, an LLM-driven simulation framework that preserves final state-based evaluation without a deterministic database. Specifically, a scenario specifies the user goal, user/system facts, expected final state, and expected agent behavior, and an LLM state tracker infers a structured proxy state from the full interaction trace. LLM judges then verify goal completion and detect tool/user hallucinations against scenario constraints. Empirically, our benchmark produces stable, model-differentiating rankings across families and inference-time reasoning efforts, and its on- and off-policy rollouts provide supervision that transfers to unseen scenarios. Careful scenario specification yields near-zero simulator hallucination rates, as supported by ablation studies. The framework also supports sensitivity analyses over user personas. Human-LLM judge agreement exceeds 90%, indicating reliable automated evaluation. Overall, proxy state-based evaluation offers a practical, scalable alternative to deterministic agentic benchmarks for industrial LLM agents.
2025
Probing LLM World Models: Enhancing Guesstimation with Wisdom of Crowds Decoding
Yun-Shiuan Chuang | Sameer Narendran | Nikunj Harlalka | Alexander Cheung | Sizhe Gao | Siddharth Suresh | Junjie Hu | Timothy T. Rogers
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Guesstimation, the task of making approximate quantitative estimates about objects or events, is a common real-world skill, yet remains underexplored in large language model (LLM) research. We introduce three guesstimation datasets: MARBLES, FUTURE, and ELECPRED, spanning physical estimation (e.g., how many marbles fit in a cup) to abstract predictions (e.g., the 2024 U.S. presidential election). Inspired by the social science concept of Wisdom of Crowds (WOC), where the median of multiple estimates improves accuracy, we propose WOC decoding for LLMs. We replicate WOC effects in human participants and find that LLMs exhibit similar benefits: median aggregation across sampled responses consistently improves accuracy over greedy decoding, self-consistency decoding, and mean decoding. This suggests that LLMs encode a world model that supports approximate reasoning. Our results position guesstimation as a useful probe of LLM world knowledge and highlight WOC decoding as a strategy for enhancing LLM guesstimation performance on real-world tasks.
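The median aggregation at the core of WOC decoding can be shown with a small sketch. The function names `woc_decode` and `mean_decode` and the sample values are hypothetical, and the step of sampling and parsing numeric answers from an LLM is assumed to have already happened.

```python
import statistics


def woc_decode(estimates):
    # Median of the sampled numeric estimates (the aggregation the abstract
    # attributes to WOC decoding).
    if not estimates:
        raise ValueError("need at least one estimate")
    return statistics.median(estimates)


def mean_decode(estimates):
    # Mean aggregation, shown only for comparison.
    if not estimates:
        raise ValueError("need at least one estimate")
    return statistics.fmean(estimates)


# Made-up estimates for a "how many marbles fit in a cup" style question,
# with one extreme outlier among the samples.
samples = [120.0, 150.0, 135.0, 140.0, 5000.0]
print(woc_decode(samples))   # 140.0
print(mean_decode(samples))  # 1109.0
```

The outlier barely moves the median but pulls the mean far from the bulk of the estimates, which is one intuition for why median aggregation can be more robust than mean decoding.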
2024
Simulating Opinion Dynamics with Networks of LLM-based Agents
Yun-Shiuan Chuang | Agam Goyal | Nikunj Harlalka | Siddharth Suresh | Robert Hawkins | Sijia Yang | Dhavan Shah | Junjie Hu | Timothy Rogers
Findings of the Association for Computational Linguistics: NAACL 2024
Accurately simulating human opinion dynamics is crucial for understanding a variety of societal phenomena, including polarization and the spread of misinformation. However, the agent-based models (ABMs) commonly used for such simulations often over-simplify human behavior. We propose a new approach to simulating opinion dynamics based on populations of Large Language Models (LLMs). Our findings reveal a strong inherent bias in LLM agents towards producing accurate information, leading simulated agents to consensus in line with scientific reality. This bias limits their utility for understanding resistance to consensus views on issues like climate change. After inducing confirmation bias through prompt engineering, however, we observed opinion fragmentation in line with existing agent-based modeling and opinion dynamics research. These insights highlight the promise and limitations of LLM agents in this domain and suggest a path forward: refining LLMs with real-world discourse to better simulate the evolution of human beliefs.
Beyond Demographics: Aligning Role-playing LLM-based Agents Using Human Belief Networks
Yun-Shiuan Chuang | Krirk Nirunwiroj | Zach Studdiford | Agam Goyal | Vincent V. Frigo | Sijia Yang | Dhavan V. Shah | Junjie Hu | Timothy T. Rogers
Findings of the Association for Computational Linguistics: EMNLP 2024
Creating human-like large language model (LLM) agents is crucial for faithful social simulation. Having LLMs role-play based on demographic information sometimes improves human likeness but often does not. This study assessed whether LLM alignment with human behavior can be improved by integrating information from empirically derived human belief networks. Using data from a human survey, we estimated a belief network encompassing 64 topics loading on nine non-overlapping latent factors. We then seeded LLM-based agents with an opinion on one topic and assessed the alignment of their expressed opinions on the remaining test topics with corresponding human data. Role-playing based on demographic information alone did not align LLM and human opinions, but seeding the agent with a single belief greatly improved alignment for topics related in the belief network, and not for topics outside the network. These results suggest a novel path for human-LLM belief alignment in work seeking to simulate and understand patterns of belief distributions in society.
Co-authors
- Junjie Hu 3
- Alec M. Chiu 2
- Yirou Ge 2
- Agam Goyal 2
- Nikunj Harlalka 2
- Uma Kona 2
- Chaitanya Kulkarni 2
- Yixi Li 2
- Prakhar Mehrotra 2
- Zijie Pan 2
- Linsey Pang 2
- Timothy T. Rogers 2
- Shivani Shekhar 2
- Siddharth Suresh 2
- Avinash Thangali 2
- Sijia Yang 2
- Alexander Cheung 1
- Vincent V. Frigo 1
- Sizhe Gao 1
- Robert D. Hawkins 1
- Sameer Narendran 1
- Krirk Nirunwiroj 1
- Timothy Rogers 1
- Dhavan Shah 1
- Dhavan V. Shah 1
- Zach Studdiford 1