Kevin Gao

2026

Conversational recommender systems (CRS) have advanced with large language models, showing strong results in domains like movies. These domains typically involve fixed content and passive consumption, where user preferences can be matched by genre or theme. In contrast, games present distinct challenges: fast-evolving catalogs, interaction-driven preferences (e.g., skill level, mechanics, hardware), and increased risk of unsafe responses in open-ended conversation. We propose MATCHA, a multi-agent framework for CRS that assigns specialized agents for intent parsing, tool-augmented retrieval, multi-LLM ranking with reflection, explanation, and risk control, enabling finer personalization, long-tail coverage, and stronger safety. Evaluated on a dataset of real user requests, MATCHA outperforms six baselines across eight metrics, improving Hit@5 by 20%, reducing popularity bias by 24%, and achieving 97.9% adversarial defense. Human and virtual-judge evaluations confirm improved explanation quality and user alignment. Code will be released upon acceptance.


Evaluating LLM-Generated Formative Feedback for Undergraduate Mathematics Through the Lens of Feedback Theory
Aron Gohr | Marie-Amelie Lawn | Kevin Gao | Inigo Serjeant | Stephen Heslip
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)

Large language models can generate feedback on free-form student writing, but it is unclear whether such feedback is correct and pedagogically useful. We evaluate LLM-generated feedback on 65 undergraduate proof-writing exercises using Hattie and Timperley’s feedback framework and a grade agreement metric, comparing two models (GPT-4.1, GPT-5) under two workflow configurations, each graded by two independent LLM evaluators. A mark-scheme-augmented workflow improves grade correlation with human experts for both models, and its precomputed mark schemes allow instructors to audit the system before deployment. GPT-5 produces higher-quality feedback across all dimensions. The metrics we collect give some evidence that feedback quality is high in the setting studied, and several sanity checks on our experiments support this finding. However, meaningful self-regulation support and controlled tests with students remain future work. These results show that feedback theory provides a useful lens for evaluating automated mathematical feedback.
