Hang Gao

Other people with similar names: Hang Gao

Unverified author pages with similar names: Hang Gao


2025

pdf bib
UBench: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions
Xunzhi Wang | Zhuowei Zhang | Gaonan Chen | Qiongyu Li | Bitong Luo | Zhixin Han | Haotian Wang | Zhiyu Li | Hang Gao | Mengting Hu
Findings of the Association for Computational Linguistics: ACL 2025

Despite recent progress in systematic evaluation frameworks, benchmarking the uncertainty of large language models (LLMs) remains a highly challenging task. Existing methods for benchmarking the uncertainty of LLMs face three key challenges: the need for internal model access, additional training, or high computational costs. This is particularly unfavorable for closed-source models. To this end, we introduce UBench, a new benchmark for evaluating the uncertainty of LLMs. Unlike other benchmarks, UBench is based on confidence intervals. It encompasses 11,978 multiple-choice questions spanning knowledge, language, understanding, and reasoning capabilities. Based on this, we conduct extensive experiments. This includes comparisons with other advanced uncertainty estimation methods, the assessment of the uncertainty of 20 LLMs, and an exploration of the effects of Chain-of-Thought (CoT) prompts, role-playing (RP) prompts, and temperature on model uncertainty. Our analysis reveals several crucial insights: 1) Our confidence interval-based methods are highly effective for uncertainty quantification; 2) Regarding uncertainty, outstanding open-source models show competitive performance versus closed-source models; 3) CoT and RP prompts present potential ways to improve model reliability, while the influence of temperature changes follows no universal rule. Our implementation is available at https://github.com/Cyno2232/UBENCH.

pdf bib
Towards Robust Few-Shot Relation Classification: Incorporating Relation Description with Agreement
Mengting Hu | Jianfeng Wu | Ming Jiang | Yalan Xie | Zhunheng Wang | Rui Ying | Xiaoyi Liu | Ruixuan Xu | Hang Gao | Renhong Cheng
Findings of the Association for Computational Linguistics: EMNLP 2025

Few-shot relation classification aims to recognize the relation between two mentioned entities, with the help of only a few support samples. However, a few samples tend to be limited for tackling unlimited queries. If a query cannot find references from the support samples, it is defined as none-of-the-above (NOTA). Previous works mainly focus on how to distinguish N+1 categories, including N known relations and one NOTA class, to accurately recognize relations. However, the robustness towards various NOTA rates, i.e. the proportion of NOTA among queries, is under investigation. In this paper, we target the robustness and propose a simple but effective framework. Specifically, we introduce relation descriptions as external knowledge to enhance the model’s comprehension of the relation semantics. Moreover, we further promote robustness by proposing a novel agreement loss. It is designed for seeking decision consistency between the instance-level decision, i.e. support samples, and relation-level decision, i.e. relation descriptions. Extensive experimental results demonstrate that the proposed framework outperforms strong baselines while being robust against various NOTA rates. The code is released on GitHub at https://github.com/Pisces-29/RoFRC.

2024

pdf bib
ECoK: Emotional Commonsense Knowledge Graph for Mining Emotional Gold
Zhunheng Wang | Xiaoyi Liu | Mengting Hu | Rui Ying | Ming Jiang | Jianfeng Wu | Yalan Xie | Hang Gao | Renhong Cheng
Findings of the Association for Computational Linguistics: ACL 2024

The demand for understanding and expressing emotions in the field of natural language processing is growing rapidly. Knowledge graphs, as an important form of knowledge representation, have been widely utilized in various emotion-related tasks. However, existing knowledge graphs mainly focus on the representation and reasoning of general factual knowledge, while there are still significant deficiencies in the understanding and reasoning of emotional knowledge. In this work, we construct a comprehensive and accurate emotional commonsense knowledge graph, ECoK. We integrate cutting-edge theories from multiple disciplines such as psychology, cognitive science, and linguistics, and combine techniques such as large language models and natural language processing. By mining a large amount of text, dialogue, and sentiment analysis data, we construct rich emotional knowledge and establish the knowledge generation model COMET-ECoK. Experimental results show that ECoK contains high-quality emotional reasoning knowledge, and the performance of our knowledge generation model surpasses GPT-4-Turbo, which can help downstream tasks better understand and reason about emotions. Our data and code is available from https://github.com/ZornWang/ECoK.

pdf bib
Is Compound Aspect-Based Sentiment Analysis Addressed by LLMs?
Yinhao Bai | Zhixin Han | Yuhua Zhao | Hang Gao | Zhuowei Zhang | Xunzhi Wang | Mengting Hu
Findings of the Association for Computational Linguistics: EMNLP 2024

Aspect-based sentiment analysis (ABSA) aims to predict aspect-based elements from the given text, mainly including four elements, i.e., aspect category, sentiment polarity, aspect term, and opinion term. Extracting pair, triple, or quad of elements is defined as compound ABSA. Due to its challenges and practical applications, such a compound scenario has become an emerging topic. Recently, large language models (LLMs), e.g. ChatGPT and LLaMA, present impressive abilities in tackling various human instructions. In this work, we are particularly curious whether LLMs still possess superior performance in handling compound ABSA tasks. To assess the performance of LLMs, we design a novel framework, called ChatABSA. Concretely, we design two strategies: constrained prompts, to automatically organize the returned predictions; post-processing, to better evaluate the capability of LLMs in recognition of implicit information. The overall evaluation involves 5 compound ABSA tasks and 8 publicly available datasets. We compare LLMs with few-shot supervised baselines and fully supervised baselines, including corresponding state-of-the-art (SOTA) models on each task. Experimental results show that ChatABSA exhibits excellent aspect-based sentiment analysis capabilities and overwhelmingly beats few-shot supervised methods under the same few-shot settings. Surprisingly, it can even outperform fully supervised methods in some cases. However, in most cases, it underperforms fully supervised methods, and there is still a huge gap between its performance and the SOTA method. Moreover, we also conduct more analyses to gain a deeper understanding of its sentiment analysis capabilities.