Bo Sun


2026

Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating believable human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of actual human users. To address this gap, we introduce OPeRA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. **OPeRA is the first public dataset that comprehensively captures user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales**. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPeRA, we establish **the first benchmark to evaluate how well current LLMs can predict a specific user’s next action** and rationale given a persona and <observation, action, rationale> history. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for humans.
A line of work in planning uses LLMs not to generate a plan, but to generate a formal representation in some planning language, which can be input into a symbolic solver to deterministically find a plan. While this approach has shown improved trust and promising performance, dozens of recent publications have proposed scattered methods on a variety of benchmarks under different experimental settings. We attempt to unify the inference-time LLM-as-formalizer methodology for classical planning by proposing an organizational framework based on intermediate representations. We thus systematically evaluate more than a dozen pipelines that subsume most existing work, while proposing novel ones that involve syntactically similar but high-resource intermediate languages (such as a Python wrapper of PDDL). We provide recipes for planning language generation pipelines, draw a series of conclusions showing the efficacy of their various components, and provide evidence of their robustness against problem complexity.

2023

This paper introduces our system for the CCL-2023 task Chinese Essay Fluency Evaluation (CEFE). The CEFE task aims to study the identification and correction of grammatical errors in primary and middle school students’ test compositions. The evaluation has three tracks, which examine the recognition of wrong sentence types, character-level error correction, and wrong sentence rewriting. According to the task characteristics and data distribution of each track, we propose a token-level discriminative model based on sequence labeling for the multi-label classification of wrong sentences, an auto-encoder model based on edit labels for character-level error correction, and a seq2seq model obtained by pre-training on pseudo data and fine-tuning on labeled data for the wrong sentence rewriting task. In the final evaluation results, our proposed method won first place in all three tracks according to the corresponding evaluation metrics.