The reasoning capabilities of large reasoning models (LRMs), such as OpenAI's o1 and DeepSeek-R1, have advanced substantially through deep thinking. However, these gains come with significant resource demands, underscoring the need to train effective small reasoning models. A critical challenge is that small models possess different reasoning capacities and cognitive trajectories than their larger counterparts; hence, directly distilling chain-of-thought (CoT) results from large LRMs to smaller ones can be ineffective and often requires a substantial amount of annotated data. In this paper, we first introduce a novel Critique-Rethink-Verify (CRV) system designed for training smaller yet powerful LRMs. The CRV system consists of multiple LLM agents, each specializing in a distinct ability: (i) critiquing CoT quality with respect to the cognitive capabilities of smaller models, (ii) rethinking and refining these CoTs based on the critiques, and (iii) verifying the correctness of the refined results. Building on the CRV system, we further propose the Cognitive Preference Optimization (CogPO) algorithm, which continuously enhances the reasoning abilities of smaller models by aligning their reasoning processes with their cognitive capacities. Comprehensive evaluations on challenging reasoning benchmarks demonstrate the efficacy of our CRV+CogPO framework, which outperforms other methods by a large margin.
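To make the pipeline concrete, below is a minimal sketch of the critique-rethink-verify loop under stated assumptions: `call_llm`, the agent prompts, the capacity profile, and the YES/NO acceptance test are all illustrative placeholders rather than the paper's actual implementation.

```python
# Minimal sketch of the Critique-Rethink-Verify (CRV) loop, assuming a generic
# chat-completion helper `call_llm`; prompts, the capacity profile, and the
# acceptance test are illustrative placeholders, not the paper's implementation.

def call_llm(system: str, user: str) -> str:
    """Placeholder for a chat-completion call to the underlying LLM agent."""
    raise NotImplementedError

def critique(question: str, cot: str, capacity_profile: str) -> str:
    # Agent (i): judge whether the teacher CoT suits the student's capacity.
    return call_llm(
        system=f"You critique reasoning traces for a student model with: {capacity_profile}",
        user=f"Question:\n{question}\n\nChain of thought:\n{cot}\n\nList the issues.",
    )

def rethink(question: str, cot: str, critique_text: str) -> str:
    # Agent (ii): rewrite the CoT so that it addresses the critique.
    return call_llm(
        system="You revise chains of thought to resolve a given critique.",
        user=f"Question:\n{question}\n\nCoT:\n{cot}\n\nCritique:\n{critique_text}\n\nRevised CoT:",
    )

def verify(question: str, refined_cot: str, gold_answer: str) -> bool:
    # Agent (iii): check that the refined CoT still reaches the correct answer.
    verdict = call_llm(
        system="You verify whether a reasoning trace yields the given answer.",
        user=f"Question:\n{question}\n\nCoT:\n{refined_cot}\n\n"
             f"Reference answer: {gold_answer}\n\nReply YES or NO.",
    )
    return verdict.strip().upper().startswith("YES")

def crv(question: str, cot: str, gold_answer: str,
        capacity_profile: str, max_rounds: int = 3):
    """Iterate critique -> rethink -> verify; return an accepted CoT or None."""
    for _ in range(max_rounds):
        cot = rethink(question, cot, critique(question, cot, capacity_profile))
        if verify(question, cot, gold_answer):
            return cot
    return None
```

Accepted and rejected CoTs collected this way would naturally form the chosen/rejected pairs that a DPO-style preference objective such as CogPO could then optimize over, though the exact pair-construction scheme is not specified in the abstract.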
Recently, the demand for small, efficient reasoning models in real-world applications has driven the development of knowledge distillation techniques that balance reasoning performance and inference speed. In this paper, we further extend the DistilQwen model family, initialized from the Qwen models, by introducing four model series designed to meet industrial requirements: (1) slow-thinking models, optimized for reasoning tasks that require high accuracy; (2) two series of adaptive-thinking models, which dynamically adjust their reasoning strategy to the input task to maximize efficiency across diverse scenarios; and (3) distilled reward models, which enable further reinforcement learning of reasoning models using distilled knowledge. Comprehensive evaluations across multiple benchmarks demonstrate both high inference efficiency and strong reasoning performance for these models, as well as the practical utility of the distilled reward models. We further show that these models support industry practitioners by providing scalable training and inference functionality on Alibaba Cloud's Platform for Artificial Intelligence (PAI).
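As a hedged illustration of how a distilled reward model can drive further training of a reasoning model, the sketch below samples candidates from a policy and ranks them with a scalar reward head, as one might do for rejection sampling or as a building block for RL. Both model identifiers are hypothetical placeholders, and the reward model's pair-encoding convention is an assumption.

```python
# Hedged sketch: scoring sampled answers from a distilled reasoning model with a
# distilled reward model. The model IDs below are hypothetical placeholders;
# substitute the actual DistilQwen checkpoints you intend to use.
import torch
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

POLICY_ID = "your-org/distilqwen-adaptive-thinking"  # placeholder name
REWARD_ID = "your-org/distilqwen-reward-model"       # placeholder name

tok = AutoTokenizer.from_pretrained(POLICY_ID)
policy = AutoModelForCausalLM.from_pretrained(POLICY_ID, torch_dtype=torch.bfloat16)

rm_tok = AutoTokenizer.from_pretrained(REWARD_ID)
reward = AutoModelForSequenceClassification.from_pretrained(REWARD_ID, num_labels=1)

prompt = "If 3x + 5 = 20, what is x?"
inputs = tok(prompt, return_tensors="pt")
# Sample several candidates; the reward model then ranks them for rejection
# sampling or reinforcement learning.
outs = policy.generate(**inputs, do_sample=True,
                       num_return_sequences=4, max_new_tokens=256)
candidates = tok.batch_decode(outs[:, inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)

scores = []
for cand in candidates:
    rm_in = rm_tok(prompt, cand, return_tensors="pt", truncation=True)
    with torch.no_grad():
        scores.append(reward(**rm_in).logits.squeeze().item())  # scalar reward
best = candidates[max(range(len(scores)), key=scores.__getitem__)]
```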
Hierarchical text classification (HTC) is a significant but challenging task in natural language processing (NLP) due to its complex taxonomic label hierarchy. Recently, a number of approaches have applied prompt learning to HTC problems with impressive efficacy. Most prompt-based studies emphasize global hierarchical features, employing graph networks to represent the hierarchical structure as a whole; little research has addressed maintaining path consistency within the internal hierarchy of the structure. In this paper, we formulate prompt-based HTC as a named entity recognition (NER) task and introduce conditional random fields (CRF) and Global Pointer to establish hierarchical dependencies. Specifically, we treat single- and multi-path HTC as flat and nested entity recognition tasks, respectively, and model them with token- and span-based methods. By narrowing the gap between HTC and NER, we maintain the consistency of internal paths within the hierarchical structure in a simple yet effective way. Extensive experiments on three public datasets show that our method achieves state-of-the-art (SoTA) performance.
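To illustrate the span-based side of this formulation, here is a minimal sketch of a Global Pointer-style span scorer for the nested (multi-path) setting. The hidden size, head dimension, and the omitted rotary position bias are simplifying assumptions, not the paper's exact head.

```python
# Minimal sketch of a Global Pointer-style span scorer: every (start, end) pair
# gets a score per label via a bilinear query/key product. Hyperparameters are
# illustrative; the rotary position bias of the original Global Pointer is omitted.
import torch
import torch.nn as nn

class GlobalPointerHead(nn.Module):
    """Scores every (start, end) span for each hierarchy label."""
    def __init__(self, hidden: int, num_labels: int, head_dim: int = 64):
        super().__init__()
        self.num_labels, self.head_dim = num_labels, head_dim
        # One query/key projection pair per label, fused into a single linear layer.
        self.proj = nn.Linear(hidden, num_labels * head_dim * 2)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, hidden) encoder states, e.g. from BERT.
        b, n, _ = h.shape
        qk = self.proj(h).view(b, n, self.num_labels, 2, self.head_dim)
        q, k = qk[..., 0, :], qk[..., 1, :]            # (b, n, labels, head_dim)
        # Span score for (i, j) under label c: q_i^T k_j, scaled for stability.
        logits = torch.einsum("bmch,bnch->bcmn", q, k) / self.head_dim ** 0.5
        # Mask invalid spans where start > end (strict lower triangle).
        mask = torch.tril(torch.ones(n, n, dtype=torch.bool, device=h.device),
                          diagonal=-1)
        return logits.masked_fill(mask, float("-inf"))  # (b, labels, seq, seq)
```

In this reading, each label path in the hierarchy corresponds to an entity span, and nested spans capture documents assigned to multiple paths; the single-path case would instead use a token-level tag sequence decoded with a CRF.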