GEAR: A Scalable and Interpretable Evaluation Framework for RAG-Based Car Assistant Systems
Niloufar Beyranvand | Hamidreza Dastmalchi | Aijun An | Heidar Davoudi | Winston Chan | Ron DiCarlantonio
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Large language models (LLMs) increasingly power car assistants, enabling natural language interaction for tasks such as maintenance, troubleshooting, and operational guidance. While retrieval-augmented generation (RAG) improves grounding in vehicle manuals, evaluating response quality remains a key challenge: traditional metrics like BLEU and ROUGE fail to capture critical aspects such as factual accuracy and information coverage. We propose GEAR, a fully automated, reference-based evaluation framework for car assistant systems. GEAR uses LLMs as evaluators to compare assistant responses against ground-truth counterparts, assessing coverage, correctness, and other dimensions of answer quality. To enable fine-grained evaluation, both the assistant response and the ground-truth answer are decomposed by LLMs into key facts, each labeled essential, optional, or safety-critical. The evaluator then determines which of these facts are correct and covered. Experiments show that GEAR aligns closely with human annotations, offering a scalable and reliable solution for evaluating car assistants.
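To make the fact-level scoring concrete, below is a minimal Python sketch of how a GEAR-style evaluator might turn decomposed facts into coverage and correctness scores. This is an illustration under assumptions, not the authors' implementation: the `Fact` structure, the `gear_scores` function, and the `llm_entails` placeholder (which stands in for an actual LLM evaluator call) are hypothetical names invented here.

```python
# Hypothetical sketch of GEAR-style fact-level scoring (not the paper's code).
# Assumes an LLM has already decomposed each response into labeled key facts.

from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    text: str
    label: str  # "essential" | "optional" | "safety-critical"

def llm_entails(claim: str, evidence: str) -> bool:
    """Placeholder for the LLM evaluator call that judges whether
    `evidence` supports `claim`. A real system would prompt an LLM here;
    this toy stand-in just does a case-insensitive substring match."""
    return claim.lower() in evidence.lower()

def gear_scores(gt_facts: list[Fact], response: str,
                resp_facts: list[Fact], ground_truth: str) -> dict:
    # Coverage: fraction of ground-truth facts expressed in the response.
    covered = [f for f in gt_facts if llm_entails(f.text, response)]
    # Correctness: fraction of response facts supported by the ground truth.
    correct = [f for f in resp_facts if llm_entails(f.text, ground_truth)]
    return {
        "coverage": len(covered) / max(len(gt_facts), 1),
        "correctness": len(correct) / max(len(resp_facts), 1),
        # Safety-critical facts are tracked separately, since missing one
        # matters more than missing an optional detail.
        "safety_coverage": (
            sum(f.label == "safety-critical" for f in covered)
            / max(sum(f.label == "safety-critical" for f in gt_facts), 1)
        ),
    }
```

Scoring the two directions separately mirrors the abstract's distinction: coverage asks how much of the reference answer the assistant conveyed (recall-like), while correctness asks how much of what the assistant said is supported by the reference (precision-like), with safety-critical facts reported on their own.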