Elad Kravi
2025
Ambiguity Detection and Uncertainty Calibration for Question Answering with Large Language Models
Zhengyan Shi | Giuseppe Castellucci | Simone Filice | Saar Kuzi | Elad Kravi | Eugene Agichtein | Oleg Rokhlenko | Shervin Malmasi
Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)
Large Language Models (LLMs) have demonstrated excellent capabilities in Question Answering (QA) tasks, yet their ability to identify and address ambiguous questions remains underdeveloped. Ambiguities in user queries often lead to inaccurate or misleading answers, undermining user trust in these systems. Prior prompt-based attempts have performed roughly at the level of random guessing, leaving a significant gap in effective ambiguity detection. To address this, we propose a novel framework for detecting ambiguous questions within LLM-based QA systems. We first prompt an LLM to generate multiple answers to a question, and then analyze them to infer whether the question is ambiguous. Ambiguity is then classified by a lightweight Random Forest model trained on a dataset bootstrapped and shuffled from six labeled examples. Experimental results on the ASQA, PACIFIC, and ABG-COQA datasets demonstrate the effectiveness of our approach, with accuracy up to 70.8%. Furthermore, our framework enhances the confidence calibration of LLM outputs, leading to more trustworthy QA systems able to handle complex questions.
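The pipeline the abstract describes (sample multiple answers, featurize their disagreement, classify with a small Random Forest) can be sketched in a few lines. In this minimal sketch the answers are toy pre-sampled strings standing in for LLM generations, and the two features (mean pairwise Jaccard similarity and the number of distinct answers) are illustrative assumptions, not the paper's actual feature set.

```python
# Minimal sketch of the sampled-answers -> Random Forest ambiguity detector.
# The feature set and toy data are assumptions for illustration only.
from itertools import combinations
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def answer_set_features(answers: list[str]) -> list[float]:
    # Disagreement among sampled answers as a (hypothesized) ambiguity signal:
    # low mean similarity and many distinct answers suggest ambiguity.
    # Assumes at least two sampled answers per question.
    sims = [jaccard(a, b) for a, b in combinations(answers, 2)]
    return [
        sum(sims) / len(sims),                              # mean pairwise similarity
        float(len({a.strip().lower() for a in answers})),   # distinct answers
    ]

# Toy 6-shot training set: each item is (sampled answers, is_ambiguous).
shots = [
    (["Paris", "Paris", "Paris"], 0),
    (["1998", "1998", "1998"], 0),
    (["Blue", "Blue whale", "Blue"], 0),
    (["Jordan the river", "Michael Jordan", "Jordan the country"], 1),
    (["the 2001 film", "the 2010 remake", "the novel"], 1),
    (["the album", "the band", "the 1979 tour"], 1),
]
X = [answer_set_features(answers) for answers, _ in shots]
y = [label for _, label in shots]

# Bootstrap (sample with replacement) and shuffle to enlarge the tiny set,
# loosely following the 6-shot setup described in the abstract.
X_boot, y_boot = resample(X, y, replace=True, n_samples=60, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_boot, y_boot)

# At inference time, sample several answers for a new question and classify.
new_answers = ["the 1994 film", "the 2019 film", "the TV series"]
print(clf.predict([answer_set_features(new_answers)]))  # expect [1] (ambiguous)
```

Bootstrapping with replacement is one simple way to turn six labeled questions into a usable training set, and a variance-reducing ensemble like a Random Forest is a natural fit for such a small, resampled dataset.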
2023
Multi Document Summarization Evaluation in the Presence of Damaging Content
Avshalom Manevich | David Carmel | Nachshon Cohen | Elad Kravi | Ori Shapira
Findings of the Association for Computational Linguistics: EMNLP 2023
In the multi-document summarization (MDS) task, a summary is produced for a given set of documents. A recent line of research introduced the concept of damaging documents, denoting documents that should not be exposed to readers for various reasons. In the presence of damaging documents, a summarizer is ideally expected to exclude damaging content from its output. Existing metrics evaluate a summary on aspects such as relevance and consistency with the source documents. We propose to additionally measure the ability of MDS systems to properly handle damaging documents in their input set. To that end, we offer two novel metrics based on lexical similarity and language model likelihood. A set of experiments demonstrates the effectiveness of our metrics in measuring the ability of MDS systems to summarize a set of documents while eliminating damaging content from their summaries.
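The lexical-similarity metric lends itself to a compact illustration. The sketch below scores a summary by its worst-case token overlap with any damaging document (lower is better). This exact formulation and the helper names are assumptions for illustration, not the paper's definition, and the companion language-model-likelihood metric is not shown.

```python
# A minimal sketch of a lexical-similarity-style damage metric, assuming the
# score is the summary's maximum token overlap with any damaging document.

def token_overlap(summary: str, document: str) -> float:
    # Fraction of the summary's unique tokens that also appear in the document.
    s = set(summary.lower().split())
    d = set(document.lower().split())
    return len(s & d) / len(s) if s else 0.0

def damage_score(summary: str, damaging_docs: list[str]) -> float:
    # Worst-case leakage: overlap with the most-similar damaging document.
    return max(token_overlap(summary, doc) for doc in damaging_docs)

damaging = [
    "the ceo was arrested for fraud in 2019",
    "leaked memo admits the product is unsafe",
]
summary = "the company launched its product in 2019 and expanded to europe"
print(f"damage score: {damage_score(summary, damaging):.2f}")  # lower is better
```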