Kaike Zhang

2026

Reinforcement Learning with Verifiable Rewards (RLVR) has driven substantial progress in reasoning-intensive domains like mathematics. However, optimizing open-ended generation remains challenging due to the lack of ground truth. While rubric-based evaluation offers a structured proxy for verification, existing methods suffer from scalability bottlenecks and coarse criteria, resulting in a supervision ceiling effect. To address this, we propose an automated Coarse-to-Fine Rubric Generation framework. By synergizing principle-guided synthesis, multi-model aggregation, and difficulty evolution, our approach produces comprehensive and highly discriminative criteria capable of capturing the subtle nuances. Based on this framework, we introduce RubricHub, a large-scale (110k) and multi-domain dataset. We validate its utility through a two-stage post-training pipeline comprising Rubric-based Rejection Sampling Fine-Tuning (RuFT) and Reinforcement Learning (RuRL). Experimental results demonstrate that RubricHub unlocks significant performance gains: our post-trained Qwen3-14B achieves state-of-the-art (SOTA) results on HealthBench (69.3), surpassing proprietary frontier models such as GPT-5.

2024

pdf bib abs

PKAD: Pretrained Knowledge is All You Need to Detect and Mitigate Textual Backdoor Attacks
Yu Chen | Qi Cao | Kaike Zhang | Xuchao Liu | Huawei Shen
Findings of the Association for Computational Linguistics: EMNLP 2024

In textual backdoor attacks, attackers insert poisoned samples with triggered inputs and target labels into training datasets to manipulate model behavior, threatening the model’s security and reliability. Current defense methods can generally be categorized into inference-time and training-time ones. The former often requires a part of clean samples to set detection thresholds, which may be hard to obtain in practical application scenarios, while the latter usually requires an additional retraining or unlearning process to get a clean model, significantly increasing training costs. To avoid these drawbacks, we focus on developing a practical defense method before model training without using any clean samples. Our analysis reveals that with the help of a pre-trained language model (PLM), poisoned samples, different from clean ones, exhibit mismatched relationship and shared characteristics. Based on these observations, we further propose a two-stage poison detection strategy solely leveraging insights from PLM before model training. Extensive experiments confirm our approach’s effectiveness, achieving better performance than current leading methods more swiftly. Our code is available at https://github.com/Ascian/PKAD.

Co-authors

Venues

ACL1
Findings1

Fix author