Sojung Kim


2026

We examine whether large language models (LLMs) can reliably simulate historical FOMC policy decisions and whether persona-based agentic deliberation improves performance. Using strictly time-consistent vintage economic information, we evaluate multiple state-of-the-art LLMs on a three-way Hike/Hold/Cut classification task in both single-agent and multi-agent settings. Single-LLM baselines achieve nontrivial accuracy and track broad policy regime shifts, establishing a simple but strong benchmark. However, we identify a systematic behavioral asymmetry that we term Hold bias: models disproportionately favor Hold decisions and remain reluctant to predict Cut outcomes even during easing cycles. This conservatism is especially costly around regime turning points, where reliable adaptation matters most. We further find that standard agentic workflows, including debate and consensus-style aggregation, do not mitigate this problem and often amplify caution rather than improve accuracy. Overall, our results show that plausible deliberation is not sufficient for trustworthy decision support. Progress will require agentic systems explicitly designed to diagnose and correct structural bias, rather than merely reproducing surface-level committee interaction.
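One way to make the Hold bias described above concrete is to compare each class's share of predictions with its share of true decisions, alongside per-class recall. The sketch below is illustrative only: the function name `hold_bias_report` is hypothetical, the toy labels are invented for the example, and nothing here is taken from the paper's evaluation code or results.

```python
from collections import Counter

LABELS = ("Hike", "Hold", "Cut")


def hold_bias_report(y_true, y_pred):
    """Return per-class recall and (predicted share - actual share) per label.

    A large positive share gap for "Hold" together with low recall on "Cut"
    is one simple signature of the asymmetry the abstract calls Hold bias.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    n = len(y_true)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Count correct predictions per true label.
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    report = {}
    for label in LABELS:
        # Recall is undefined when the label never occurs in y_true.
        recall = hits[label] / true_counts[label] if true_counts[label] else None
        share_gap = (pred_counts[label] - true_counts[label]) / n
        report[label] = {"recall": recall, "share_gap": share_gap}
    return report


# Toy data, invented for illustration (not results from the paper).
y_true = ["Hike", "Hold", "Hold", "Cut", "Cut", "Cut", "Hold", "Hike"]
y_pred = ["Hike", "Hold", "Hold", "Hold", "Hold", "Cut", "Hold", "Hold"]
report = hold_bias_report(y_true, y_pred)
```

In this toy case Hold is over-predicted relative to how often it actually occurs, while Cut has the lowest recall, which is the pattern the abstract describes around easing cycles.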

2024

Large Language Models (LLMs) have great potential to serve as readily available and cost-efficient Conversational Intelligent Tutoring Systems (CITS) for teaching L2 learners of English. Existing CITS, however, are designed to teach only simple concepts or lack the pedagogical depth necessary to address diverse learning strategies. To develop a more pedagogically informed CITS capable of teaching complex concepts, we construct a BIlingual PEDagogically-informed Tutoring Dataset (BIPED) of one-on-one, human-to-human English tutoring interactions. Through post-hoc analysis of the tutoring interactions, we derive a lexicon of dialogue acts (34 tutor acts and 9 student acts), which we use to further annotate the collected dataset. Based on a two-step framework that first predicts the appropriate tutor act and then generates the corresponding response, we implement two CITS models using GPT-4 and SOLAR-KO, respectively. We experimentally demonstrate that the implemented models not only replicate the style of human teachers but also employ diverse and contextually appropriate pedagogical strategies.
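The two-step framework mentioned above (predict a tutor act, then generate a response conditioned on it) can be sketched as a small pipeline. This is a minimal illustration under stated assumptions: `two_step_reply`, `toy_predict_act`, and `toy_generate` are hypothetical names, the act labels are invented placeholders rather than entries from the BIPED lexicon, and the rule-based stand-ins take the place of the GPT-4 and SOLAR-KO models used in the paper.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, utterance)


def two_step_reply(
    history: List[Turn],
    predict_act: Callable[[List[Turn]], str],
    generate: Callable[[List[Turn], str], str],
) -> Tuple[str, str]:
    """Step 1: choose a tutor act. Step 2: generate a reply conditioned on it."""
    act = predict_act(history)
    return act, generate(history, act)


# Placeholder stand-ins for the two model calls (assumptions, not the paper's models).
def toy_predict_act(history: List[Turn]) -> str:
    # Invented act labels; the real lexicon has 34 tutor acts.
    _, last_text = history[-1]
    return "answer_question" if last_text.strip().endswith("?") else "give_feedback"


def toy_generate(history: List[Turn], act: str) -> str:
    # A real system would prompt a language model with the history and the act.
    return f"[{act}] reply to: {history[-1][1]}"


history = [
    ("tutor", "Let's read the first sentence together."),
    ("student", "What does 'reluctant' mean?"),
]
act, reply = two_step_reply(history, toy_predict_act, toy_generate)
```

Keeping act prediction separate from generation means the chosen act can be inspected or evaluated on its own before any text is produced.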