How Accurate Are LLMs at Multi-Question Answering on Conversational Transcripts?

Xiliang Zhu, Shi Zong, David Rossouw


Abstract
Deploying Large Language Models (LLMs) for question answering (QA) over lengthy contexts is a significant challenge. In industrial settings, this process is often hindered by high computational costs and latency, especially when multiple questions must be answered based on the same context. In this work, we explore the capabilities of LLMs to answer multiple questions based on the same conversational context. We conduct extensive experiments and benchmark a range of both proprietary and public models on this challenging task. Our findings highlight that while strong proprietary LLMs like GPT-4o achieve the best overall performance, fine-tuned public LLMs with up to 8 billion parameters can surpass GPT-4o in accuracy, which demonstrates their potential for transparent and cost-effective deployment in real-world applications.
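The multi-question setup the abstract describes, answering several questions in a single pass over one shared transcript rather than issuing one request per question, can be sketched as simple prompt packing. The template wording and the numbered-answer reply format below are illustrative assumptions, not the paper's actual prompt design:

```python
# Illustrative sketch of multi-question prompt packing: all questions about
# one transcript go into a single prompt, and the model's reply is expected
# to contain one numbered answer per question. The template and the
# numbered-answer convention are assumptions for illustration only.

def build_multi_question_prompt(transcript: str, questions: list[str]) -> str:
    """Pack a transcript and several questions into one prompt string."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, start=1))
    return (
        "Read the conversation transcript below and answer every question.\n"
        "Reply with one numbered answer per question.\n\n"
        f"Transcript:\n{transcript}\n\n"
        f"Questions:\n{numbered}\n"
    )

def parse_numbered_answers(reply: str, n_questions: int) -> list[str]:
    """Extract answers of the form '1. ...' from a model reply (n < 10)."""
    answers = [""] * n_questions
    for line in reply.splitlines():
        line = line.strip()
        for i in range(1, n_questions + 1):
            prefix = f"{i}."
            if line.startswith(prefix):
                answers[i - 1] = line[len(prefix):].strip()
                break
    return answers
```

The cost and latency benefit comes from encoding the lengthy transcript once per batch of questions instead of once per question; the actual prompt format, batch size, and answer-parsing logic used in the paper's experiments may differ.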
Anthology ID:
2025.emnlp-industry.129
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Saloni Potdar, Lina Rojas-Barahona, Sebastien Montella
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
1848–1855
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-industry.129/
Cite (ACL):
Xiliang Zhu, Shi Zong, and David Rossouw. 2025. How Accurate Are LLMs at Multi-Question Answering on Conversational Transcripts?. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1848–1855, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
How Accurate Are LLMs at Multi-Question Answering on Conversational Transcripts? (Zhu et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-industry.129.pdf