Yinghao Hu
2026
SplitThenMerge: Token-Level Skill-Compositional Sparse Mixture-of-Experts for Complex Domain-Specific Tasks
Yuting Huang | Jiawen Zhang | Yiquan Wu | Yinghao Hu | Fei Wu | Kun Kuang
Findings of the Association for Computational Linguistics: ACL 2026
Large language models have demonstrated strong performance on general-purpose tasks but often fail to satisfy the accuracy requirements of knowledge-intensive domains such as law, medicine, and finance. Complex domain-specific generation is inherently compositional, involving multiple atomic skills such as reasoning, knowledge grounding, and numerical computation that are frequently interleaved at the token level. Existing domain adaptation methods typically train these heterogeneous skills jointly within a single objective, which makes it difficult for models to reliably coordinate multiple skills when solving complex tasks. In this work, we explicitly incorporate atomic skills into domain-specific model training and propose SplitThenMerge, a framework that decomposes domain competence into atomic skills, trains them independently, and composes them dynamically during generation. SplitThenMerge adopts a token-level sparse Mixture-of-Experts architecture to enable fine-grained skill routing and coordination while implementing each skill as a lightweight LoRA expert to achieve parameter-efficient specialization. Experimental results demonstrate that our method consistently achieves superior performance in both legal and medical domains under the same training parameter budget.
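The token-level routing over LoRA skill experts described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the router, the number of active experts, or how expert outputs are combined, so the linear router `router_W`, the `top_k=2` default, the softmax over the selected logits, and all function names below are assumptions made for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def matvec(M, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def lora_delta(A, B, x, scale=1.0):
    # Low-rank update B @ (A @ x); A is r x d_in, B is d_out x r.
    return [scale * y for y in matvec(B, matvec(A, x))]

def route_token(x, router_W, experts, top_k=2):
    # Assumed design: a linear router scores every skill expert for this
    # token, the top_k experts are kept, and their LoRA deltas are mixed
    # with softmax weights computed over the selected logits only.
    logits = matvec(router_W, x)
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    weights = softmax([logits[i] for i in top])
    out = [0.0] * len(experts[0][1])  # d_out = number of rows in B
    for w, i in zip(weights, top):
        A, B = experts[i]
        delta = lora_delta(A, B, x)
        out = [o + w * d for o, d in zip(out, delta)]
    return top, weights, out
```

Calling `route_token` once per token hidden state returns the selected expert indices, their mixing weights, and the combined low-rank update that would be added to a frozen base layer's output. Because only `top_k` experts run per token, compute stays sparse while different tokens in the same sequence can draw on different skills.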
2025
CoEvo: Coevolution of LLM and Retrieval Model for Domain-Specific Information Retrieval
Ang Li | Yiquan Wu | Yinghao Hu | Lizhi Qing | Shihang Wang | Chengyuan Liu | Tao Wu | Adam Jatowt | Ming Cai | Fei Wu | Kun Kuang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Information retrieval in specialized domains (e.g., legal and medical) faces the challenge of aligning user queries, often expressed in colloquial language, with highly structured, terminology-rich documents. This discrepancy creates a distribution gap in the text representation. Recent methods aim to enhance queries by using large language models (LLMs) to generate intermediary elements (e.g., keywords, pseudo-documents) before performing retrieval. However, by treating LLMs and retrievers separately, these approaches risk producing unreliable or irrelevant intermediaries, which can significantly degrade retrieval performance. To address this issue, we propose CoEvo, an alternating optimization framework that facilitates the coevolution of LLMs and retrieval models. CoEvo operates through two key steps: the L-step directs the LLM in generating intermediaries by leveraging an archive of historical examples known to enhance retrieval, and the R-step trains the retriever using contrastive learning on the intermediaries produced by the LLM. Finally, we evaluate and flexibly leverage content generated by the LLM to amplify the effectiveness of coevolution. Experimental results demonstrate significant improvements in retrieval performance across both legal and medical domains.
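The contrastive objective used in the R-step is not spelled out in this abstract. As a hedged illustration, the sketch below uses the common InfoNCE form, in which a query enhanced by an LLM-generated intermediary is pulled toward its relevant document and pushed away from negatives. The choice of InfoNCE, the dot-product similarity, the `temperature` default, and the function names are all assumptions, not details confirmed by the source.

```python
import math

def dot(u, v):
    # Dot-product similarity between two embedding vectors.
    return sum(a * b for a, b in zip(u, v))

def info_nce(query, positive, negatives, temperature=0.05):
    # Assumed objective: cross-entropy over similarities, with the
    # positive document at index 0 and the negatives after it.
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    # Stable log-sum-exp for the softmax denominator.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability assigned to the positive document.
    return log_denom - logits[0]
```

In an alternating scheme of this kind, a loss like `info_nce` would be minimized over the retriever's parameters while the LLM is held fixed, and intermediaries that improve retrieval would then be added to the archive used in the next L-step.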
Evaluating Test-Time Scaling LLMs for Legal Reasoning: OpenAI o1, DeepSeek-R1, and Beyond
Yinghao Hu | Yaoyao Yu | Leilei Gan | Bin Wei | Kun Kuang | Fei Wu
Findings of the Association for Computational Linguistics: EMNLP 2025
Recent advances in test-time scaling of large language models (LLMs), exemplified by DeepSeek-R1 and OpenAI’s o1, show that extending the chain of thought during inference can significantly improve general reasoning performance. However, the impact of this paradigm on legal reasoning remains insufficiently explored. To address this gap, we present the first systematic evaluation of 12 LLMs, including both reasoning-focused and general-purpose models, across 17 Chinese and English legal tasks spanning statutory and case-law traditions. In addition, we curate a bilingual chain-of-thought dataset for legal reasoning through distillation from DeepSeek-R1 and develop Legal-R1, an open-source model specialized for the legal domain. Experimental results show that Legal-R1 delivers competitive performance across diverse tasks. DeepSeek-R1 exhibits clear advantages in Chinese legal reasoning, while OpenAI’s o1 achieves comparable results on English tasks. We further conduct a detailed error analysis, which reveals recurring issues such as outdated legal knowledge, limited capacity for legal interpretation, and susceptibility to factual hallucinations. These findings delineate the main obstacles confronting legal-domain LLMs and suggest promising directions for future research. We release the dataset and model at https://github.com/YinghaoHu/Legal-R1-14B.
Fine-tuning Large Language Models for Improving Factuality in Legal Question Answering
Yinghao Hu | Leilei Gan | Wenyi Xiao | Kun Kuang | Fei Wu
Proceedings of the 31st International Conference on Computational Linguistics
Hallucination, the generation of incorrect or fabricated information, remains a critical challenge for large language models (LLMs), particularly in high-stakes domains such as legal question answering (QA). To reduce hallucination in legal QA, we first introduce a benchmark, LegalHalBench, and three automatic metrics for evaluating common hallucinations when LLMs answer legal questions. We then propose a hallucination mitigation method that integrates behavior cloning with a novel Hard Sample-aware Iterative Direct Preference Optimization (HIPO). Extensive experiments on real data validate the effectiveness of our approach, showing marked improvements on the newly proposed metrics (Non-Hallucinated Statute Rate, Statute Relevance Rate, and Legal Claim Truthfulness) as well as on traditional metrics such as METEOR, BERTScore, ROUGE-L, and win rates.
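HIPO builds on Direct Preference Optimization. The sketch below shows only the standard DPO loss for a single preference pair, computed from sequence log-probabilities under the policy and a frozen reference model. The abstract does not say how hard samples are identified, so `select_hard` is a purely hypothetical stand-in that keeps pairs whose implicit reward margin is non-positive; it should not be read as the paper's criterion. The `beta` default and all names are likewise assumptions.

```python
import math

def dpo_margin(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Implicit reward margin: how much more the policy prefers the chosen
    # answer over the rejected one, relative to the reference model.
    return beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Standard DPO loss: -log(sigmoid(margin)), written in a stable form.
    m = dpo_margin(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta)
    if m >= 0:
        return math.log1p(math.exp(-m))
    return -m + math.log1p(math.exp(m))

def select_hard(pairs, beta=0.1):
    # Hypothetical hardness rule (an assumption, not from the source):
    # keep pairs the current policy does not yet rank correctly.
    return [p for p in pairs if dpo_margin(*p, beta=beta) <= 0.0]
```

Each pair is a tuple of four log-probabilities in the order `(policy_chosen, policy_rejected, ref_chosen, ref_rejected)`. An iterative scheme could recompute the margins after each round and retrain on the pairs that `select_hard` returns.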