This is an internal, incomplete preview of a proposed change to the ACL Anthology.
For efficiency reasons, we don't generate MODS or Endnote formats, and the preview may be incomplete in other ways, or contain mistakes.
Do not treat this content as an official publication.
Yao-Ching Yu
Fixing paper assignments
Please select all papers that belong to the same person.
Indicate below which author they should be assigned to.
Large Language Models (LLMs) have made remarkable advances in specialized fields such as finance, law, and medicine. In cybersecurity, however, open-source datasets are scarce, and high-quality pretraining corpora are especially lacking, even though much research indicates that LLMs acquire their knowledge during pretraining. To address this, we present a comprehensive suite of datasets covering all major training stages, including pretraining, instruction fine-tuning, and reasoning distillation with cybersecurity-specific self-reflection data. Extensive ablation studies demonstrate their effectiveness on public cybersecurity benchmarks. In particular, continued pretraining on our dataset yields a **15.9%** improvement in the aggregate score, while reasoning distillation leads to a **15.8%** gain on the security certification benchmark (CISSP). We will release all datasets and trained cybersecurity LLMs under the ODC-BY and MIT licenses to encourage further research in the community.
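As a rough illustration of the reasoning-distillation stage with self-reflection, the sketch below lays out what one training example might look like; the field names, tags, and chat format are assumptions for illustration, not the released datasets' actual schema.

```python
# Hypothetical layout of one reasoning-distillation example with a
# self-reflection step; field names and tags are illustrative assumptions.
import json

example = {
    "stage": "reasoning_distillation",
    "messages": [
        {"role": "user", "content": "Which control best mitigates credential stuffing?"},
        {
            "role": "assistant",
            "content": (
                "<think>Credential stuffing reuses breached passwords, so rate "
                "limiting alone is insufficient; MFA breaks the reuse chain.</think>\n"
                "<reflect>Check: does the reasoning address automated credential "
                "reuse rather than brute force? Yes.</reflect>\n"
                "Multi-factor authentication."
            ),
        },
    ],
}

print(json.dumps(example, indent=2))
```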
Conventional retrieval-augmented generation (RAG) systems employ rigid retrieval strategies that create (1) knowledge blind spots across domain boundaries, (2) reasoning fragmentation when processing interdependent concepts, and (3) contradictions from conflicting evidence sources. Motivated by these limitations, we introduce PathwiseRAG, which addresses these challenges with intent-aware strategy selection to eliminate blind spots, dynamic reasoning networks that capture sub-problem interdependencies to overcome fragmentation, and parallel path exploration with adaptive refinement to resolve conflicts. The framework models query intent along semantic and reasoning dimensions, constructs a directed acyclic graph of interconnected sub-problems, and explores multiple reasoning trajectories while continuously adapting to emerging evidence. Evaluation on challenging benchmarks demonstrates significant improvements over state-of-the-art RAG systems, with average accuracy gains of 4.9% and up to 6.9% on complex queries. By transforming static retrieval into dynamic, multi-dimensional exploration, PathwiseRAG establishes a new paradigm for knowledge-intensive reasoning.
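To make the sub-problem DAG idea concrete, here is a minimal sketch under stated assumptions: `decompose()`, `retrieve()`, and `answer()` are hypothetical stand-ins for the framework's intent-aware decomposition, retrieval, and per-node reasoning, and the real system additionally explores and refines multiple parallel paths.

```python
# Minimal sketch: resolve interdependent sub-problems in topological order
# over a directed acyclic graph, feeding earlier answers into later nodes.
from graphlib import TopologicalSorter

def decompose(query: str) -> dict[str, list[str]]:
    """Hypothetical decomposition: sub-problem -> its prerequisite sub-problems."""
    return {
        "A: identify key entities": [],
        "B: gather evidence per entity": ["A: identify key entities"],
        "C: synthesize final answer": ["A: identify key entities", "B: gather evidence per entity"],
    }

def retrieve(sub_problem: str, context: dict[str, str]) -> list[str]:
    """Hypothetical retrieval call for one sub-problem."""
    return [f"doc relevant to '{sub_problem}'"]

def answer(sub_problem: str, docs: list[str], context: dict[str, str]) -> str:
    """Hypothetical per-node reasoning step (e.g., an LLM call over the docs)."""
    return f"answer({sub_problem}) using {len(docs)} docs and {len(context)} prior answers"

def pathwise_rag_sketch(query: str) -> str:
    dag = decompose(query)  # directed acyclic graph of interconnected sub-problems
    context: dict[str, str] = {}
    # Dependency order ensures each node sees the answers it depends on.
    for node in TopologicalSorter(dag).static_order():
        context[node] = answer(node, retrieve(node, context), context)
    return context["C: synthesize final answer"]

print(pathwise_rag_sketch("example multi-hop question"))
```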
The Enterprise Intelligence Platform must integrate logs from numerous third-party vendors to support various downstream tasks. However, vendor documentation is often unavailable at test time: it may be misplaced, mismatched, poorly formatted, or incomplete, which makes schema mapping challenging. We introduce a reinforcement learning agent that can self-improve without labeled examples or model weight updates. During inference, the agent first identifies ambiguous field-mapping attempts, then generates targeted web-search queries to gather external evidence, and finally applies a confidence-based reward to iteratively refine its mappings. To demonstrate this concept, we converted Microsoft Defender for Endpoint logs into a common schema. Our method increased mapping accuracy from 56.4% (LLM-only) to 72.73% (RAG) to 93.94% over 100 iterations using GPT-4o, while reducing the number of low-confidence mappings requiring expert review by 85%. This evidence-driven, transparent approach to similar industry problems paves the way for more robust, accountable, scalable, and adaptable solutions.
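The following is a minimal sketch of the inference-time refinement loop described above, under stated assumptions: `propose_mapping()`, `web_search()`, and the confidence threshold are hypothetical stand-ins for the GPT-4o call, the external-evidence step, and the confidence-based reward; the actual agent and thresholds may differ.

```python
# Sketch: for each vendor field, propose a mapping; while confidence is low,
# gather web evidence and re-propose, keeping the highest-confidence result.
from dataclasses import dataclass

CONF_THRESHOLD = 0.8  # assumed cut-off separating accepted vs. ambiguous mappings

@dataclass
class Mapping:
    source_field: str
    target_field: str
    confidence: float

def propose_mapping(field: str, evidence: list[str]) -> Mapping:
    """Hypothetical LLM call mapping a vendor log field to the common schema."""
    # Placeholder logic; a real agent would prompt the LLM with the evidence.
    return Mapping(field, field.lower(), confidence=0.5 + 0.1 * len(evidence))

def web_search(query: str) -> str:
    """Hypothetical web-search step returning a snippet of external evidence."""
    return f"snippet for: {query}"

def map_schema(fields: list[str], max_iters: int = 5) -> dict[str, Mapping]:
    mappings: dict[str, Mapping] = {}
    for field in fields:
        evidence: list[str] = []
        mapping = propose_mapping(field, evidence)
        # Iteratively gather evidence only for low-confidence (ambiguous) fields.
        for _ in range(max_iters):
            if mapping.confidence >= CONF_THRESHOLD:
                break
            evidence.append(web_search(f"{field} field meaning in vendor log schema"))
            mapping = propose_mapping(field, evidence)
        mappings[field] = mapping
    return mappings

print(map_schema(["DeviceName", "InitiatingProcessFileName"]))
```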
Ensembling multiple models has long been an effective way to push the limits of existing performance, and it is widely used in classification tasks, where simply averaging the classification probability vectors from multiple classifiers yields better accuracy. However, in the thriving open-source Large Language Model (LLM) community, ensembling methods are rare and typically limited to ensembling LLMs' full-text outputs, such as selecting the best output with a ranker, which underutilizes token-level probability information. In this paper, we treat the **G**eneration of each token by LLMs **a**s a **C**lassification (**GaC**) for ensembling. This approach fully exploits the probability information at each generation step and better prevents LLMs from producing early incorrect tokens that lead to snowballing errors. In experiments, we ensemble state-of-the-art LLMs on several benchmarks, including exams, mathematics, and reasoning, and observe that our method breaks the existing community performance ceiling. Furthermore, we observe that most tokens in an answer are simple and do not affect the correctness of the final answer. Therefore, we also experiment with ensembling only key tokens, which yields better performance with lower latency across benchmarks.
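To illustrate the core idea of token-level ensembling, here is a minimal sketch: at each decoding step, average the next-token distributions from several causal LMs and pick the argmax. It assumes all models share the same tokenizer and vocabulary and uses greedy decoding; the paper's actual method may additionally align vocabularies and ensemble only key tokens. The model names are placeholders.

```python
# GaC-style sketch: average per-step next-token probabilities across models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_names = ["model-a", "model-b"]  # placeholders for the ensembled LLMs
tokenizer = AutoTokenizer.from_pretrained(model_names[0])
models = [AutoModelForCausalLM.from_pretrained(n).eval() for n in model_names]

@torch.no_grad()
def gac_generate(prompt: str, max_new_tokens: int = 64) -> str:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        # One next-token probability vector per model (a "classification" step).
        probs = [torch.softmax(m(input_ids).logits[:, -1, :], dim=-1) for m in models]
        avg = torch.stack(probs).mean(dim=0)        # ensemble by averaging
        next_id = avg.argmax(dim=-1, keepdim=True)  # greedy pick from the ensemble
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```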