Giyong Kim


2026

We examine whether large language models (LLMs) can reliably simulate historical FOMC policy decisions and whether persona-based agentic deliberation improves performance. Using strictly time-consistent vintage economic information, we evaluate multiple state-of-the-art LLMs on a three-way Hike/Hold/Cut classification task in both single-agent and multi-agent settings. Single-LLM baselines achieve nontrivial accuracy and track broad policy regime shifts, establishing a simple but strong benchmark. However, we identify a systematic behavioral asymmetry that we term Hold bias: models disproportionately favor Hold decisions and remain reluctant to predict Cut outcomes even during easing cycles. This conservatism is especially costly around regime turning points, where reliable adaptation matters most. We further find that standard agentic workflows, including debate and consensus-style aggregation, do not mitigate this problem and often amplify caution rather than improve accuracy. Overall, our results show that plausible deliberation is not sufficient for trustworthy decision support. Progress will require agentic systems explicitly designed to diagnose and correct structural bias, rather than merely reproducing surface-level committee interaction.
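As an illustration of how the Hold bias described above could be quantified, the following is a minimal sketch, not the paper's evaluation code. It assumes predictions and realized decisions are given as lists of "Hike", "Hold", or "Cut" labels. The function names (`label_shares`, `recall`, `hold_bias`) and the toy data are hypothetical and are not drawn from the study's results.

```python
from collections import Counter

LABELS = ("Hike", "Hold", "Cut")


def label_shares(labels):
    """Return the fraction of decisions assigned to each of the three labels."""
    counts = Counter(labels)
    n = len(labels)
    return {k: counts.get(k, 0) / n for k in LABELS}


def recall(actual, predicted, label):
    """Share of meetings with realized `label` that the model also predicted as `label`."""
    hits = sum(1 for a, p in zip(actual, predicted) if a == label and p == label)
    total = sum(1 for a in actual if a == label)
    return hits / total if total else float("nan")


def hold_bias(actual, predicted):
    """Predicted Hold share minus realized Hold share (positive means over-predicting Hold)."""
    return label_shares(predicted)["Hold"] - label_shares(actual)["Hold"]


# Toy labels for illustration only (assumption; not data from the paper).
actual = ["Hike", "Hike", "Hold", "Hold", "Cut", "Cut", "Cut", "Hold"]
predicted = ["Hike", "Hold", "Hold", "Hold", "Hold", "Hold", "Cut", "Hold"]

print("Hold share gap:", hold_bias(actual, predicted))
print("Cut recall:", recall(actual, predicted, "Cut"))
```

Under this sketch, a positive Hold share gap combined with low Cut recall would be one way to express the asymmetry: the model predicts Hold more often than it occurs and misses realized cuts. Other definitions, such as comparing per-class confusion rates around regime turning points, are equally plausible.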