2025
Self-Instructed Derived Prompt Generation Meets In-Context Learning: Unlocking New Potential of Black-Box LLMs
Zhuo Li | Yuhao Du | Jinpeng Hu | Xiang Wan | Anningzhe Gao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Improving prompt quality is crucial for enhancing the performance of large language models (LLMs), particularly for black-box models like GPT-4. Existing prompt refinement methods, while effective, often suffer from semantic inconsistencies between refined and original prompts and fail to preserve users’ real intent. To address these challenges, we propose a self-instructed in-context learning framework that generates reliable derived prompts while maintaining semantic consistency with the original prompts. Specifically, our framework incorporates a reinforcement learning mechanism, enabling direct interaction with the response model during prompt generation to better align with human preferences. We then formulate querying as an in-context learning task, combining responses from LLMs with derived prompts to create a contextual demonstration for the original prompt. This approach effectively enhances alignment, reduces semantic discrepancies, and activates the LLM’s in-context learning ability to generate more beneficial responses. Extensive experiments demonstrate that the proposed method not only generates better derived prompts but also significantly enhances LLMs’ ability to deliver more effective responses, particularly for black-box models like GPT-4.
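To make the in-context step above concrete, here is a minimal, purely illustrative sketch (not taken from the paper; `query_llm` and the prompt template are assumptions): the derived prompt is answered first, and that prompt-response pair is prepended as a demonstration for the original prompt.

```python
# Illustrative sketch only: packaging a derived prompt and its response as an
# in-context demonstration for the original prompt.
# `query_llm` is a hypothetical stand-in for a black-box LLM call (e.g. GPT-4).

def query_llm(prompt: str) -> str:
    """Placeholder for a black-box LLM call."""
    return f"<response to: {prompt!r}>"

def build_icl_query(original_prompt: str, derived_prompt: str) -> str:
    # 1. Answer the derived (refined) prompt first.
    derived_response = query_llm(derived_prompt)
    # 2. Use that (prompt, response) pair as a demonstration preceding the
    #    original prompt, so the final answer stays aligned with the user's
    #    original intent while benefiting from the refinement.
    return (
        f"Example:\nPrompt: {derived_prompt}\nResponse: {derived_response}\n\n"
        f"Now answer the following prompt in the same spirit.\n"
        f"Prompt: {original_prompt}\nResponse:"
    )

if __name__ == "__main__":
    print(build_icl_query("Summarize photosynthesis.",
                          "Explain photosynthesis concisely for a high-school student."))
```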
LLMs for Mathematical Modeling: Towards Bridging the Gap between Natural and Mathematical Languages
Xuhan Huang | Qingning Shen | Yan Hu | Anningzhe Gao | Benyou Wang
Findings of the Association for Computational Linguistics: NAACL 2025
Atoxia: Red-teaming Large Language Models with Target Toxic Answers
Yuhao Du | Zhuo Li | Pengyu Cheng | Xiang Wan | Anningzhe Gao
Findings of the Association for Computational Linguistics: NAACL 2025
Despite the substantial advancements in artificial intelligence, large language models (LLMs) remain challenged by generation safety. With adversarial jailbreaking prompts, one can effortlessly induce LLMs to output harmful content, causing unexpected negative social impacts. This vulnerability highlights the necessity for robust LLM red-teaming strategies to identify and mitigate such risks before large-scale application. To detect specific types of risks, we propose a novel red-teaming method that **A**ttacks LLMs with **T**arget **Toxi**c **A**nswers (**Atoxia**). Given a particular harmful answer, Atoxia generates a corresponding user query and a misleading answer opening to examine the internal defects of a given LLM. The proposed attacker is trained within a reinforcement learning scheme, with the target LLM’s output probability of the target answer as the reward. We verify the effectiveness of our method on various red-teaming benchmarks, such as AdvBench and HH-Harmless. The empirical results demonstrate that Atoxia can successfully detect safety risks not only in open-source models but also in state-of-the-art black-box models such as GPT-4o.
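The reward signal described above can be illustrated with a minimal sketch, assuming a HuggingFace causal LM as the target model (here `gpt2` purely as a small stand-in, not a model attacked in the paper): the reward is the target model’s log-probability of a given target answer conditioned on the attacker-generated query.

```python
# Illustrative sketch only (not the authors' code): reward = the target LLM's
# log-probability of a fixed target answer, conditioned on an attacker query.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def target_answer_logprob(query: str, target_answer: str) -> float:
    """Average log-probability of `target_answer` tokens given `query`."""
    prompt_ids = tokenizer(query, return_tensors="pt").input_ids
    answer_ids = tokenizer(target_answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position t predict the token at position t + 1,
    # so the answer tokens are scored by the slice starting at P - 1.
    answer_logits = logits[0, prompt_ids.size(1) - 1 : -1]
    log_probs = torch.log_softmax(answer_logits, dim=-1)
    token_logps = log_probs.gather(1, answer_ids[0].unsqueeze(1)).squeeze(1)
    return token_logps.mean().item()

# A higher value means the model is more prone to produce the target answer.
print(target_answer_logprob("How do I reset my password?", "Click 'Forgot password'."))
```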
Huatuo-26M, a Large-scale Chinese Medical QA Dataset
Xidong Wang | Jianquan Li | Shunian Chen | Yuxuan Zhu | Xiangbo Wu | Zhiyi Zhang | Xiaolong Xu | Junying Chen | Jie Fu | Xiang Wan | Anningzhe Gao | Benyou Wang
Findings of the Association for Computational Linguistics: NAACL 2025
Large language models infuse newfound vigor into the advancement of the medical domain, yet the scarcity of data poses a significant bottleneck hindering community progress. In this paper, we release the largest medical Question Answering (QA) dataset to date, Huatuo-26M, with 26 million QA pairs. We benchmark many existing approaches on our dataset in terms of both retrieval and generation. We also experimentally show the benefit of the proposed dataset in many aspects: (i) it serves as fine-tuning data for training medical large language models (LLMs); (ii) it works as an external knowledge source for retrieval-augmented generation (RAG); (iii) it demonstrates transferability by enhancing zero-shot performance on other QA datasets; and (iv) it serves as a pre-training corpus for biomedical models. Our empirical findings substantiate the dataset’s utility in these domains, thereby confirming its significance as a resource in the medical QA landscape.
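A minimal sketch of use case (ii), RAG with a QA corpus as an external knowledge source; the three QA pairs and the TF-IDF retriever below are toy stand-ins for illustration only, not the paper’s setup or the 26M-pair dataset itself.

```python
# Illustrative sketch only: retrieval-augmented generation over a medical QA corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

qa_pairs = [
    ("What are common symptoms of anemia?", "Fatigue, pale skin, and shortness of breath."),
    ("How is type 2 diabetes managed?", "Diet, exercise, metformin, and glucose monitoring."),
    ("What causes iron deficiency?", "Blood loss, low dietary iron, or poor absorption."),
]

questions = [q for q, _ in qa_pairs]
vectorizer = TfidfVectorizer().fit(questions)
question_vectors = vectorizer.transform(questions)

def retrieve(query: str, k: int = 2):
    """Return the k QA pairs whose questions are most similar to the query."""
    scores = cosine_similarity(vectorizer.transform([query]), question_vectors)[0]
    top = scores.argsort()[::-1][:k]
    return [qa_pairs[i] for i in top]

def build_rag_prompt(query: str) -> str:
    # Retrieved QA pairs are injected as reference context for the generator.
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in retrieve(query))
    return f"Reference QA pairs:\n{context}\n\nPatient question: {query}\nAnswer:"

print(build_rag_prompt("Why do I feel tired and look pale?"))
```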
CoD, Towards an Interpretable Medical Agent using Chain of Diagnosis
Junying Chen | Chi Gui | Anningzhe Gao | Ke Ji | Xidong Wang | Xiang Wan | Benyou Wang
Findings of the Association for Computational Linguistics: ACL 2025
The field of AI healthcare has undergone a significant transformation with the advent of large language models (LLMs), yet the challenges of interpretability within these models remain largely unaddressed. This study introduces **Chain-of-Diagnosis (CoD)** to enhance the interpretability of medical automatic diagnosis. CoD transforms the diagnostic process into a diagnostic chain that mirrors a physician’s thought process, providing a transparent reasoning pathway. Additionally, CoD outputs the disease confidence distribution to ensure transparency in decision-making. This interpretability makes model diagnostics controllable and aids in identifying critical symptoms for inquiry through entropy reduction of the confidence distribution. With CoD, we developed **DiagnosisGPT**, capable of diagnosing 9,604 diseases, to validate CoD. Experimental results demonstrate that DiagnosisGPT outperforms other LLMs on automatic diagnostic tasks across three real-world benchmarks. Moreover, DiagnosisGPT provides interpretability while ensuring controllability in diagnostic rigor.
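A minimal sketch of the entropy-reduction idea, with toy confidence values and hypothetical posteriors (none of these numbers come from the paper): the next symptom to ask about is the one with the largest expected reduction in the entropy of the disease confidence distribution.

```python
# Illustrative sketch only (not the authors' implementation): pick the next
# symptom to inquire about by expected entropy reduction over disease confidences.
import math

def entropy(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

# Current disease confidence distribution produced by the diagnostic chain.
confidence = {"flu": 0.5, "covid": 0.3, "cold": 0.2}

# Hypothetical posteriors if the patient answers yes/no to each candidate symptom,
# together with an assumed probability of a "yes" answer.
candidates = {
    "fever": {"p_yes": 0.6,
              "yes": {"flu": 0.60, "covid": 0.35, "cold": 0.05},
              "no":  {"flu": 0.20, "covid": 0.20, "cold": 0.60}},
    "runny_nose": {"p_yes": 0.5,
                   "yes": {"flu": 0.45, "covid": 0.25, "cold": 0.30},
                   "no":  {"flu": 0.55, "covid": 0.35, "cold": 0.10}},
}

def expected_entropy_reduction(symptom: str) -> float:
    c = candidates[symptom]
    expected = c["p_yes"] * entropy(c["yes"]) + (1 - c["p_yes"]) * entropy(c["no"])
    return entropy(confidence) - expected

best = max(candidates, key=expected_entropy_reduction)
print(f"Ask about: {best}")  # the symptom expected to shrink uncertainty the most
```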
Unlocking LLMs’ Self-Improvement Capacity with Autonomous Learning for Domain Adaptation
Ke Ji | Junying Chen | Anningzhe Gao | Wenya Xie | Xiang Wan | Benyou Wang
Findings of the Association for Computational Linguistics: ACL 2025
Self-supervised pre-training and instruction fine-tuning demonstrate the potential of large language models (LLMs) for domain adaptation (DA). In pursuit of superhuman performance, LLMs have demonstrated significant potential in math and coding through self-improvement algorithms that rely on iterative training with self-generated data. This success stems from the clear reward signals in these environments, which provide a solid foundation for self-improvement. However, when it comes to general DA scenarios, two main challenges emerge: 1) ambiguous self-improvement reward signals and 2) lack of high-quality instruction fine-tuning datasets. This motivates us to address how LLMs can adapt autonomously to new domains using only a large amount of unlabeled target-domain corpora. Inspired by the human practice of self-reflection through open- and closed-book exercises to achieve domain generalization, we propose autonomous learning, which creates a self-improvement learning environment for DA. Here, the model generates questions from documents and conducts two explorations—one with the original document and one with a masked version. By comparing these explorations, the LLM can independently identify knowledge gaps and refine its policy to reduce them. Experiments across various DA tasks demonstrate that autonomous learning enhances the DA performance of existing models, outperforming traditional fine-tuning and self-improvement methods. Our code is publicly available at https://github.com/FreedomIntelligence/AL.
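A minimal sketch of one autonomous-learning step, assuming a hypothetical `ask_llm` helper standing in for the model being adapted; question generation, document masking, and answer comparison are reduced to placeholders and do not reflect the released implementation.

```python
# Illustrative sketch only: open-/closed-book exploration over an unlabeled document.

def ask_llm(prompt: str) -> str:
    """Placeholder for a call to the model under adaptation."""
    return f"<answer to: {prompt!r}>"

def autonomous_learning_step(document: str):
    # 1. The model writes questions about the unlabeled target-domain document.
    questions = ask_llm(f"Write three questions answerable from:\n{document}").split("\n")
    training_pairs = []
    for q in questions:
        # 2. Open-book exploration: answer with the original document in context.
        open_book = ask_llm(f"Document:\n{document}\n\nQuestion: {q}")
        # 3. Closed-book exploration: answer with the document masked
        #    (here simply withheld for brevity).
        closed_book = ask_llm(f"Question: {q}")
        # 4. Disagreement marks a knowledge gap; keep the open-book answer
        #    as a self-generated target for further training.
        if open_book != closed_book:
            training_pairs.append((q, open_book))
    return training_pairs

print(autonomous_learning_step("Toy target-domain document."))
```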
MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria
Wentao Ge | Shunian Chen | Hardy Chen | Nuo Chen | Junying Chen | Zhihong Chen | Wenya Xie | Shuo Yan | Chenghao Zhu | Ziyue Lin | Dingjie Song | Xidong Wang | Anningzhe Gao | Zhang Zhiyi | Jianquan Li | Xiang Wan | Benyou Wang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Multimodal large language models (MLLMs) have broadened the scope of AI applications. Existing automatic evaluation methodologies for MLLMs are mainly limited to objective queries, ignoring real-world user experience and inadequately addressing the nuances of creative and associative multimodal tasks. However, the open-ended and subjective nature of such tasks poses a significant challenge to evaluation methodology, as it is difficult to define ground-truth answers for them. To this end, we propose a new evaluation paradigm for MLLMs: evaluating MLLMs with per-sample criteria, using a potent MLLM as the judge. To validate the feasibility and effectiveness of this paradigm, we design a benchmark, dubbed MLLM-Bench, by curating evaluation samples across six comprehensive cognitive levels. We benchmark 26 popular MLLMs in a pairwise-comparison fashion, showing diverse performance across models. Moreover, the validity of our benchmark is reflected in its 88.02% agreement with human evaluation. We contend that the proposed paradigm explores the potential of MLLMs as effective evaluation tools with the help of per-sample criteria.
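A minimal sketch of the per-sample-criteria judging step, with a hypothetical `judge_mllm` placeholder standing in for the strong multimodal judge; the prompt template is an assumption, not the benchmark’s actual template.

```python
# Illustrative sketch only: pairwise comparison with per-sample evaluation criteria.

def judge_mllm(prompt: str) -> str:
    """Placeholder for a call to the judge MLLM; returns 'A', 'B', or 'Tie'."""
    return "A"

def pairwise_judge(question: str, criteria: str, answer_a: str, answer_b: str) -> str:
    # Each evaluation sample carries its own criteria, so the judge grades the
    # two candidate answers against sample-specific requirements rather than a
    # single global rubric.
    prompt = (
        f"Question (with attached image): {question}\n"
        f"Per-sample evaluation criteria: {criteria}\n\n"
        f"Answer A: {answer_a}\nAnswer B: {answer_b}\n\n"
        "Judge which answer better satisfies the criteria. Reply with A, B, or Tie."
    )
    return judge_mllm(prompt)

verdict = pairwise_judge(
    question="Describe the mood this painting conveys.",
    criteria="Reward associative, well-grounded descriptions; penalize hallucinated objects.",
    answer_a="A calm, melancholic evening scene ...",
    answer_b="A busy market with many people ...",
)
print(verdict)
```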
2024
Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale
Junying Chen | Chi Gui | Ruyi Ouyang | Anningzhe Gao | Shunian Chen | Guiming Hardy Chen | Xidong Wang | Zhenyang Cai | Ke Ji | Xiang Wan | Benyou Wang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed’s large-scale, de-identified medical image-text pairs to address these limitations, they often fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an ‘unblinded’ capacity to denoise and reformat the data, resulting in the creation of the **PubMedVision** dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of MLLMs, with clear gains on benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM, **HuatuoGPT-Vision**, which shows superior performance in medical multimodal scenarios among open-source MLLMs. Our code and data are available at https://github.com/FreedomIntelligence/HuatuoGPT-Vision.
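A minimal sketch of the caption-to-VQA reformatting idea, with a hypothetical `mllm` placeholder standing in for a model such as GPT-4V that also sees the image; the prompt and JSON schema are assumptions, not the paper’s actual pipeline.

```python
# Illustrative sketch only: turning a noisy PubMed image-caption pair into a VQA sample.
import json

def mllm(prompt: str, image_path: str) -> str:
    """Placeholder for a multimodal model call; expected to return JSON."""
    return json.dumps({"question": "What does the arrow indicate?",
                       "answer": "A lesion in the left lung field."})

def caption_to_vqa(image_path: str, caption: str) -> dict:
    # The model sees the image ('unblinded') and rewrites the noisy caption
    # into a clean question-answer pair grounded in the image content.
    prompt = (
        "You can see the attached medical image. Its original caption is noisy:\n"
        f"{caption}\n"
        "Rewrite it as one clear question-answer pair about the image, "
        'returned as JSON with keys "question" and "answer". '
        "Discard content not supported by the image."
    )
    return json.loads(mllm(prompt, image_path))

print(caption_to_vqa("fig3.png", "Fig. 3. CT scan, arrow: lesion (see text)."))
```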
OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning
Fei Yu | Anningzhe Gao | Benyou Wang
Findings of the Association for Computational Linguistics: NAACL 2024