Bo Sun
2026
OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation
Ziyi Wang | Yuxuan Lu | Wenbo Li | Amirali Amini | Bo Sun | Yakov Bart | Weimin Lyu | Jiri Gesi | Tian Wang | Jing Huang | Yu Su | Upol Ehsan | Malihe Alikhani | Toby Jia-Jun Li | Lydia Chilton | Dakuo Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating believable human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPeRA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. **OPeRA is the first public dataset that comprehensively captures user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales**. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPeRA, we establish **the first benchmark to evaluate how well current LLMs can predict a specific user’s next action** and rationale given a persona and <observation, action, rationale> history. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for humans.
Unifying Inference-Time Planning Language Generation
Prabhu Prakash Kagitha | Bo Sun | Ishan Desai | Andrew Zhu | Cassie Huang | Manling Li | Ziyang Li | Li Zhang
Findings of the Association for Computational Linguistics: ACL 2026
A line of work in planning uses an LLM not to generate a plan, but to generate a formal representation in some planning language, which can be input into a symbolic solver to deterministically find a plan. While this approach has shown improved trust and promising performance, dozens of recent publications have proposed scattered methods on a variety of benchmarks under different experimental settings. We attempt to unify the inference-time LLM-as-formalizer methodology for classical planning by proposing a unifying organizational framework based on intermediate representations. We thus systematically evaluate more than a dozen pipelines that subsume most existing work, while proposing novel ones that involve syntactically similar but high-resource intermediate languages (such as a Python wrapper of PDDL). We provide recipes for planning language generation pipelines, draw a series of conclusions showing the efficacy of their various components, and provide evidence of their robustness against problem complexity.
2023
System Report for CCL23-Eval Task 8: Chinese Grammar Error Detection and Correction Using Multi-Granularity Information
Yixuan Wang | Yijun Liu | Bo Sun | Wanxiang Che
Proceedings of the 22nd Chinese National Conference on Computational Linguistics (Volume 3: Evaluations)
This paper introduces our system at CCL-2023 Task: Chinese Essay Fluency Evaluation (CEFE). The CEFE task aims to study the identification and correction of grammatical errors in primary and middle school students’ test compositions. The evaluation has three tracks to examine the recognition of wrong sentence types, character-level error correction, and wrong sentence rewriting. According to the task characteristics and data distribution of each track, we propose a token-level discriminative model based on sequence labeling for the multi-label classification task of wrong sentences, an auto-encoder model based on edited labels for character-level error correction, and a seq2seq model obtained by pre-training on pseudo data and fine-tuning on labeled data to solve the wrong sentence rewriting task. In the final evaluation results, the method we proposed won first place in all three tracks according to the corresponding evaluation metrics.