Ishan Upadhyay
2025
GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration
Yoo Yeon Sung
|
Eve Fleisig
|
Yu Hou
|
Ishan Upadhyay
|
Jordan Lee Boyd-Graber
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Language models are often miscalibrated, leading to confidently incorrect answers. We introduce GRACE, a benchmark for language model calibration that incorporates comparison with human calibration. GRACE consists of question-answer pairs, in which each question contains a series of clues that gradually become easier, all leading to the same answer; models must answer correctly as early as possible as the clues are revealed. This setting permits granular measurement of model calibration based on how early, accurately, and confidently a model answers. After collecting these questions, we host live human vs. model competitions to gather 1,749 data points on human and model teams’ timing, accuracy, and confidence. We propose a metric, CalScore, that uses GRACE to analyze model calibration errors and identify types of model miscalibration that differ from human behavior. We find that although humans are less accurate than models, humans are generally better calibrated. Since state-of-the-art models struggle on GRACE, it effectively evaluates progress on improving model calibration.
2021
TEASER: Towards Efficient Aspect-based SEntiment Analysis and Recognition
Vaibhav Bajaj
|
Kartikey Pant
|
Ishan Upadhyay
|
Srinath Nair
|
Radhika Mamidi
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Sentiment analysis aims to detect the overall sentiment, i.e., the polarity of a sentence, paragraph, or text span, without considering the entities mentioned and their aspects. Aspect-based sentiment analysis aims to extract the aspects of the given target entities and their respective sentiments. Prior works formulate this as a sequence tagging problem or solve this task using a span-based extract-then-classify framework where first all the opinion targets are extracted from the sentence, and then with the help of span representations, the targets are classified as positive, negative, or neutral. The sequence tagging problem suffers from issues like sentiment inconsistency and colossal search space. Whereas, Span-based extract-then-classify framework suffers from issues such as half-word coverage and overlapping spans. To overcome this, we propose a similar span-based extract-then-classify framework with a novel and improved heuristic. Experiments on the three benchmark datasets (Restaurant14, Laptop14, Restaurant15) show our model consistently outperforms the current state-of-the-art. Moreover, we also present a novel supervised movie reviews dataset (Movie20) and a pseudo-labeled movie reviews dataset (moviesLarge) made explicitly for this task and report the results on the novel Movie20 dataset as well.
Search
Fix author
Co-authors
- Vaibhav Bajaj 1
- Jordan Lee Boyd-Graber 1
- Eve Fleisig 1
- Yu Hou 1
- Radhika Mamidi 1
- show all...