David Traum

2026

Rethinking Evaluation in Retrieval-Augmented Personalized Dialogue: A Cognitive and Linguistic Perspective
Tianyi Zhang | David Traum
Proceedings of the Fifteenth Language Resources and Evaluation Conference

In cognitive science and linguistic theory, dialogue is not seen as a chain of independent utterances but rather as a joint activity sustained by coherence, consistency, and shared understanding. However, many systems for open-domain and personalized dialogue use surface-level similarity metrics (e.g., BLEU, ROUGE, F1) as one of their main reporting measures, which fail to capture these deeper aspects of conversational quality. We re-examine a notable retrieval-augmented framework for personalized dialogue, LAPDOG, as a case study for evaluation methodology. Using both human and LLM-based judges, we identify limitations in current evaluation practices, including corrupted dialogue histories, contradictions between retrieved stories and persona, and incoherent response generation. Our results show that human and LLM judgments align closely but diverge from lexical similarity metrics, underscoring the need for cognitively grounded evaluation methods. Broadly, this work charts a path toward more reliable assessment frameworks for retrieval-augmented dialogue systems that better reflect the principles of natural human communication.

Disentangling Approaches to Conversation Disentanglement: Fine-Tune or Learn from Scratch?
Debaditya Pal | Anton Leuski | Ron Artstein | David Traum | Kallirroi Georgila
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Conversation disentanglement is the process of segmenting a stream of messages or utterances into separate conversations or "threads" that can be more easily understood and processed. We compare the performance of GPT-4o and GPT-4o Mini with deep learning models built from scratch for this task. We show that, using the same amount of training data, out-of-the-box GPT-4o performs poorly, and fine-tuning GPT-4o Mini results in performance comparable to learning small-size models from scratch (based on standard hand-crafted features for this task), with performance reaching 74.4% F1-score for prediction of links between messages and 45.3% F1-score for prediction of perfectly matching conversations. However, the fine-tuned GPT-4o Mini model underperforms when compared to models that utilize complex structural information. We also provide a new method for detailed analysis of the successes and failures of our models, and a new visualization method.
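
The two scores quoted in the abstract above (F1 over links between messages, and F1 over perfectly matching conversations) can be illustrated with a minimal sketch. The abstract does not give the exact definitions, so this assumes one common formulation: a link is a (message, parent) pair, and a predicted conversation counts as correct only if its set of messages is identical to a gold conversation. The function names and the toy data are hypothetical and are not taken from the paper.

```python
def f1(n_correct, n_pred, n_gold):
    """Harmonic mean of precision and recall; 0.0 when nothing is correct."""
    if n_correct == 0:
        return 0.0
    precision = n_correct / n_pred
    recall = n_correct / n_gold
    return 2 * precision * recall / (precision + recall)


def link_f1(gold_links, pred_links):
    """F1 over (message_id, parent_id) pairs (assumed link representation)."""
    gold, pred = set(gold_links), set(pred_links)
    return f1(len(gold & pred), len(pred), len(gold))


def exact_match_f1(gold_threads, pred_threads):
    """F1 over conversations, counting only exact set-of-messages matches."""
    gold = {frozenset(t) for t in gold_threads}
    pred = {frozenset(t) for t in pred_threads}
    return f1(len(gold & pred), len(pred), len(gold))


# Toy example: four messages; message 3 starts its own conversation
# (represented here as a self-link, which is an assumption of this sketch).
gold_links = [(1, 0), (2, 1), (3, 3)]
pred_links = [(1, 0), (2, 0), (3, 3)]  # message 2 attached to the wrong parent

gold_threads = [[0, 1, 2], [3]]
pred_threads = [[0, 1], [2], [3]]      # first conversation split in two

print(link_f1(gold_links, pred_links))
print(exact_match_f1(gold_threads, pred_threads))
```

The two toy predictions are independent of each other. They show how a single wrong link lowers link-level F1 only modestly, while a single wrong split removes a whole conversation from the exact-match count, which is consistent with exact-match scores typically being the lower of the two.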
