Jialin Liu


2026

Reinforcement learning for open-ended text generation is constrained by the lack of verifiable rewards, which forces reliance on judge models that require either annotated data or powerful closed-source models. Inspired by recent work on unsupervised reinforcement learning for mathematical reasoning using confidence-based endogenous rewards, we investigate whether this principle can be adapted to open-ended writing tasks. We find that directly applying confidence rewards leads to Triviality Bias: the policy collapses toward high-probability outputs, reducing diversity and meaningful content. We propose TCER (Triviality Corrected Endogenous Reward), which addresses this bias by rewarding the relative information gain between a specialist policy and a generalist reference policy, modulated by a probability-dependent correction mechanism. Across multiple writing benchmarks and model architectures, TCER achieves consistent improvements without external supervision. TCER also transfers effectively to mathematical reasoning, validating the generality of our approach across different generation tasks.
Although the effectiveness of Large Language Models as judges has been validated, their performance remains limited in open-ended tasks, particularly in story evaluation. Accurate story evaluation is crucial not only for assisting human quality judgment but also for providing reward signals to guide story generation. However, existing methods face a dilemma: prompt engineering for closed-source models suffers from poor adaptability, while fine-tuning approaches for open-source models lack the reasoning capabilities essential for story evaluation. To address this, we propose the Self-Evolving Pairwise Reasoning (EvolvR) framework. Grounded in pairwise comparison, the framework first self-synthesizes score-aligned Chain-of-Thought (CoT) data via a multi-persona strategy. To ensure data quality, these raw CoTs then undergo a self-filtering process in which multiple agents check their logical rigor and robustness. Finally, the evaluator trained on the refined data is deployed as a reward model to guide the story generation task. Experimental results demonstrate that our framework achieves state-of-the-art performance on three evaluation benchmarks: StoryER, HANNA, and OpenMEVA. Furthermore, when used as a reward model, it enhances the quality of generated stories, thereby validating the superiority of our self-evolving approach.

2025

Recent studies have increasingly explored the combination of existing LoRA modules for effective adaptation to unseen tasks in data-scarce scenarios. However, current LoRA selection methods typically rely on a few task samples, making it difficult to capture the full scope of task-relevant information. Furthermore, even after selection, a knowledge gap remains between the selected LoRA modules and the target task, which existing coarse-grained LoRA aggregation strategies struggle to bridge. To address these challenges, we propose Selection and Convolution for LoRA aggregation (SC-LoRA), a two-stage framework that first selects appropriate LoRA modules based on parameter clustering and then aggregates them using a convolutional LoRA aggregator. Our LoRA selection strategy ensures comprehensive coverage of task-relevant LoRA modules by leveraging their distance in the parameter space. Building on this, the convolutional LoRA aggregator extracts useful knowledge in a fine-grained manner, seamlessly bridging the gap to the target task. Our experiments demonstrate that SC-LoRA excels in aggregating multiple LoRA modules for effective adaptation to unseen tasks.