Mark Gaynor
2025
ASPERA: A Simulated Environment to Evaluate Planning for Complex Action Execution
Alexandru Coca | Mark Gaynor | Zhenxing Zhang | Jianpeng Cheng | Bo-Hsiang Tseng | Peter Boothroyd | Hector Martinez Alonso | Diarmuid O Seaghdha | Anders Johannsen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This work evaluates the potential of large language models (LLMs) to power digital assistants capable of complex action execution. Such assistants rely on pre-trained programming knowledge to execute multi-step goals by composing objects and functions defined in assistant libraries into action execution programs. To achieve this, we develop ASPERA, a framework comprising an assistant library simulation and a human-assisted LLM data generation engine. Our engine allows developers to guide LLM generation of high-quality tasks consisting of complex user queries, simulation state and corresponding validation programs, tackling data availability and evaluation robustness challenges. Alongside the framework we release Asper-Bench, an evaluation dataset of 250 challenging tasks generated using ASPERA, which we use to show that program generation grounded in custom assistant libraries is a significant challenge to LLMs compared to dependency-free code generation.
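To make the task format concrete, below is a minimal, purely illustrative Python sketch of what composing assistant-library functions into an action execution program, and checking the result against simulation state with a validation program, might look like. The library functions, state fields, user query, and validation check are invented for this illustration and are not the actual ASPERA or Asper-Bench API.

# Illustrative sketch only: a toy "assistant library", an action execution
# program, and a validation program of the kind the abstract describes.
# All names (SimulationState, find_contact, send_message, validate) are
# invented for this example.
from dataclasses import dataclass, field


@dataclass
class SimulationState:
    """Minimal stand-in for the simulated assistant environment."""
    contacts: dict = field(default_factory=lambda: {"Ana": "ana@example.com"})
    sent_messages: list = field(default_factory=list)


# Toy assistant-library functions the LLM would compose.
def find_contact(state: SimulationState, name: str) -> str:
    return state.contacts[name]


def send_message(state: SimulationState, recipient: str, body: str) -> None:
    state.sent_messages.append({"to": recipient, "body": body})


# Action execution program for the (hypothetical) complex query:
# "Email Ana to say the meeting moved to 3pm."
def action_program(state: SimulationState) -> None:
    address = find_contact(state, "Ana")
    send_message(state, address, "The meeting has moved to 3pm.")


# Validation program: inspects the resulting simulation state.
def validate(state: SimulationState) -> bool:
    return any(
        m["to"] == "ana@example.com" and "3pm" in m["body"]
        for m in state.sent_messages
    )


if __name__ == "__main__":
    state = SimulationState()
    action_program(state)
    assert validate(state)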
2024
LUCID: LLM-Generated Utterances for Complex and Interesting Dialogues
Joe Stacey | Jianpeng Cheng | John Torr | Tristan Guigue | Joris Driesen | Alexandru Coca | Mark Gaynor | Anders Johannsen
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
Spurred by recent advances in Large Language Models (LLMs), virtual assistants are poised to take a leap forward in terms of their dialogue capabilities. Yet a major bottleneck to achieving genuinely transformative task-oriented dialogue capabilities remains the scarcity of high quality data. Existing datasets, while impressive in scale, have limited domain coverage and contain few genuinely challenging conversational phenomena; those which are present are typically unlabelled, making it difficult to assess the strengths and weaknesses of models without time-consuming and costly human evaluation. Moreover, creating high quality dialogue data has until now required considerable human input, limiting both the scale of these datasets and the ability to rapidly bootstrap data for a new target domain. We aim to overcome these issues with LUCID, a modularised and highly automated LLM-driven data generation system that produces realistic, diverse and challenging dialogues. We use LUCID to generate a seed dataset of 4,277 conversations across 100 intents to demonstrate its capabilities, with a human review finding consistently high quality labels in the generated data.
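As a purely illustrative sketch of what labelled generated dialogue data can look like, the snippet below represents a short conversation with intent, slot, and phenomenon annotations in Python. The schema (Turn, intent, phenomena, slots) is an assumption made for this example and is not the actual LUCID format.

# Illustrative sketch only: one plausible representation of a generated,
# labelled dialogue. Field names are assumptions, not the LUCID schema.
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Turn:
    speaker: str                       # "user" or "assistant"
    text: str
    intent: str | None = None          # intent label on user turns
    phenomena: list[str] = field(default_factory=list)  # e.g. self-correction
    slots: dict = field(default_factory=dict)


conversation = [
    Turn("user", "Book me a table for two at 7pm.",
         intent="book_restaurant", slots={"party_size": 2, "time": "19:00"}),
    Turn("assistant", "Sure, which restaurant would you like?"),
    Turn("user", "Actually make it 8pm, at Luigi's.",
         intent="book_restaurant",
         phenomena=["self_correction"],
         slots={"time": "20:00", "restaurant": "Luigi's"}),
]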