Yu Huang
2024
Detection, Diagnosis, and Explanation: A Benchmark for Chinese Medial Hallucination Evaluation
Chengfeng Dou | Ying Zhang | Yanyuan Chen | Zhi Jin | Wenpin Jiao | Haiyan Zhao | Yu Huang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Large Language Models (LLMs) have made significant progress recently. However, their practical use in healthcare is hindered by their tendency to generate hallucinations. One specific type, called snowballing hallucination, occurs when LLMs encounter misleading information, and poses a security threat to LLMs. To understand how well LLMs can resist these hallucinations, we create the Chinese Medical Hallucination Evaluation benchmark (CMHE). This benchmark can be used to evaluate LLMs’ ability to detect medical hallucinations, make accurate diagnoses in noisy conditions, and provide plausible explanations. The creation of this benchmark involves a combination of manual and model-based approaches. In addition, we use ICD-10 and MeSH, two specialized glossaries, to aid in the evaluation. Our experiments show that LLMs struggle to identify fake medical terms and make poor diagnoses in distracting environments. However, improving a model’s understanding of medical concepts can help it resist interference to some extent.
2023
CSS: A Large-scale Cross-schema Chinese Text-to-SQL Medical Dataset
Hanchong Zhang | Jieyu Li | Lu Chen | Ruisheng Cao | Yunyan Zhang | Yu Huang | Yefeng Zheng | Kai Yu
Findings of the Association for Computational Linguistics: ACL 2023
The cross-domain text-to-SQL task aims to build a system that can parse user questions into SQL on completely unseen databases, and the single-domain text-to-SQL task evaluates the performance on identical databases. Both of these setups confront unavoidable difficulties in real-world applications. To this end, we introduce the cross-schema text-to-SQL task, where the databases of the evaluation data are different from those in the training data but come from the same domain. Furthermore, we present CSS, a large-scale CrosS-Schema Chinese text-to-SQL dataset, to support corresponding studies. CSS originally consisted of 4,340 question/SQL pairs across 2 databases. In order to generalize models to different medical systems, we extend CSS and create 19 new databases along with 29,280 corresponding dataset examples. Moreover, CSS is also a large corpus for single-domain Chinese text-to-SQL studies. We present the data collection approach and a series of analyses of the data statistics. To show the potential and usefulness of CSS, we run and report benchmark baselines. Our dataset is publicly available at https://huggingface.co/datasets/zhanghanchong/css.