Rujing Yao

2025

Active Domain Knowledge Acquisition with 100-Dollar Budget: Enhancing LLMs via Cost-Efficient, Expert-Involved Interaction in Sensitive Domains
Yang Wu | Raha Moraffah | Rujing Yao | Jinhong Yu | Zhimin Tao | Xiaozhong Liu
Findings of the Association for Computational Linguistics: EMNLP 2025

Large Language Models (LLMs) have demonstrated an impressive level of general knowledge. However, they often struggle in highly specialized and sensitive domains such as drug discovery and rare disease research due to the lack of expert knowledge, which is often costly to obtain. In this paper, we propose a novel framework (PU-ADKA) designed to efficiently enhance domain-specific LLMs by actively engaging domain experts within a fixed budget. Unlike traditional fine-tuning approaches, PU-ADKA proactively identifies and queries the most appropriate expert from a team, taking into account each expert’s availability, competency, knowledge boundaries, and consultation cost. We train PU-ADKA using simulations on PubMed publication data and validate it through domain expert interactions, showing promising improvements in LLM domain knowledge acquisition. Furthermore, our experiments with a real-world drug development team validate that PU-ADKA can significantly enhance LLM performance in specialized domains while adhering to strict budget constraints. In addition to outlining our methodological innovations and experimental results, we release a new benchmark dataset, CKAD, for cost-effective LLM domain knowledge acquisition to foster further research in this challenging area.


Elevating Legal LLM Responses: Harnessing Trainable Logical Structures and Semantic Knowledge with Legal Reasoning
Rujing Yao | Yang Wu | Chenghao Wang | Jingwei Xiong | Fang Wang | Xiaozhong Liu
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large Language Models (LLMs) have achieved impressive results across numerous domains, yet they experience notable deficiencies in legal question-answering tasks. LLMs often generate generalized responses that lack the logical specificity required for expert legal advice and are prone to hallucination, providing answers that appear correct but are unreliable. Retrieval-Augmented Generation (RAG) techniques offer partial solutions to address this challenge, but existing approaches typically focus only on semantic similarity, neglecting the logical structure essential to legal reasoning. In this paper, we propose the Logical-Semantic Integration Model (LSIM), a novel supervised framework that bridges semantic and logical coherence. LSIM comprises three components: reinforcement learning predicts a structured fact-rule chain for each question, a trainable Deep Structured Semantic Model (DSSM) retrieves the most relevant candidate questions by integrating semantic and logical features, and in-context learning generates the final answer using the retrieved content. Our experiments on a real-world legal QA dataset, validated through both automated metrics and human evaluation, demonstrate that LSIM significantly enhances accuracy and reliability compared to existing methods.

Co-authors

Fang Wang 1

Jingwei Xiong 1

Jinhong Yu 1

Venues

findings 1
naacl 1
