2025
Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond
Liang Wen | Yunke Cai | Fenrui Xiao | Xin He | Qi An | Zhenyu Duan | Yimin Du | Junchen Liu | Tanglifu Tanglifu | Xiaowei Lv | Haosheng Zou | Yongchao Deng | Shousheng Jia | Xiangzheng Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
This paper introduces Light-R1, an open-source suite for training long reasoning models using reproducible and cost-effective methodology. Given the proprietary nature of data used in the DeepSeek-R1 series, we develop an alternative approach leveraging exclusively public data and models. Our curriculum training progressively increases data difficulty, combined with multi-staged post-training. Our Light-R1-32B model, trained from Qwen2.5-32B-Instruct, outperforms DeepSeek-R1-Distill-Qwen-32B in math reasoning. Experimental results show that this curriculum approach becomes more effective when distinct, diverse datasets are available for different training stages: fine-tuning DeepSeek-R1-Distilled models (pre-tuned by the DeepSeek team on proprietary data) with 3,000 challenging examples from our curriculum dataset yielded state-of-the-art 7B and 14B models, while the 32B model, Light-R1-32B-DS, performed comparably to QwQ-32B and DeepSeek-R1. Furthermore, we extend our work by applying GRPO on long reasoning models. Our final Light-R1-14B-DS achieves SOTA performance among 14B models in math, with AIME24 & 25 scores of 74.0 and 60.2 respectively, surpassing many 32B models and DeepSeek-R1-Distill-Llama-70B. Despite math-focused training, Light-R1-14B-DS demonstrates strong cross-domain generalization. Light-R1 represents a significant advancement in making sophisticated reasoning models more accessible and implementable in real-world applications. Our models, training data and code have been made available.
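The abstract's core idea of curriculum training, with stages of progressively increasing difficulty, can be sketched briefly. This is an illustrative sketch only, not the authors' code: the `pass_rate` field (fraction of sampled solutions that are correct, as a difficulty proxy) and the threshold values are hypothetical.

```python
# Illustrative curriculum staging: split a dataset into progressively
# harder subsets, assuming each example carries a hypothetical
# "pass_rate" field where a lower value means a harder problem.

def curriculum_stages(examples, thresholds=(0.7, 0.3)):
    """Return [all examples, then one harder subset per threshold]."""
    stages = [[ex for ex in examples if ex["pass_rate"] <= t]
              for t in thresholds]
    return [list(examples)] + stages

data = [
    {"id": 1, "pass_rate": 0.9},  # easy
    {"id": 2, "pass_rate": 0.5},  # medium
    {"id": 3, "pass_rate": 0.1},  # hard
]
stage1, stage2, stage3 = curriculum_stages(data)
print(len(stage1), len(stage2), len(stage3))  # 3 2 1
```

Each stage would then drive one phase of post-training, with later, harder stages (like the 3,000 challenging examples mentioned above) reserved for the final fine-tuning passes.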
2020
Factorized Transformer for Multi-Domain Neural Machine Translation
Yongchao Deng | Hongfei Yu | Heng Yu | Xiangyu Duan | Weihua Luo
Findings of the Association for Computational Linguistics: EMNLP 2020
Multi-Domain Neural Machine Translation (NMT) aims at building a single system that performs well on a range of target domains. However, along with the extreme diversity of cross-domain wording and phrasing style, the imperfections of training data distribution and the inherent defects of the current sequential learning process all contribute to making the task of multi-domain NMT very challenging. To mitigate these problems, we propose the Factorized Transformer, which consists of an in-depth factorization of the parameters of an NMT model, namely Transformer in this paper, into two categories: domain-shared ones that encode common cross-domain knowledge and domain-specific ones that are private for each constituent domain. We experiment with various designs of our model and conduct extensive validations on English to French open multi-domain dataset. Our approach achieves state-of-the-art performance and opens up new perspectives for multi-domain and open-domain applications.
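The shared/specific parameter split described in the abstract can be sketched with a single linear projection. This is a minimal illustration under stated assumptions: the class name, the additive combination of shared and per-domain weights, and the dimensions are all hypothetical, not the paper's exact factorization.

```python
# Sketch of factorizing a layer's parameters into domain-shared and
# domain-specific parts (illustrative, not the paper's implementation).
import numpy as np

class FactorizedLinear:
    def __init__(self, d_in, d_out, domains, seed=0):
        rng = np.random.default_rng(seed)
        # Domain-shared parameters: encode common cross-domain knowledge.
        self.shared = rng.normal(0.0, 0.02, (d_in, d_out))
        # Domain-specific parameters: one private matrix per domain.
        self.specific = {d: rng.normal(0.0, 0.02, (d_in, d_out))
                         for d in domains}

    def forward(self, x, domain):
        # Combine shared knowledge with the domain's private parameters.
        return x @ (self.shared + self.specific[domain])

layer = FactorizedLinear(4, 3, domains=["news", "medical"])
x = np.ones((2, 4))
print(layer.forward(x, "news").shape)  # (2, 3)
```

At training time, only the active domain's private parameters would receive gradients alongside the shared ones, which is what keeps the per-domain capacity private.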
2018
Alibaba’s Neural Machine Translation Systems for WMT18
Yongchao Deng | Shanbo Cheng | Jun Lu | Kai Song | Jingang Wang | Shenglan Wu | Liang Yao | Guchun Zhang | Haibo Zhang | Pei Zhang | Changfeng Zhu | Boxing Chen
Proceedings of the Third Conference on Machine Translation: Shared Task Papers
This paper describes the submission systems of Alibaba for WMT18 shared news translation task. We participated in 5 translation directions including English ↔ Russian, English ↔ Turkish in both directions and English → Chinese. Our systems are based on Google’s Transformer model architecture, into which we integrated the most recent features from the academic research. We also employed most techniques that have been proven effective during the past WMT years, such as BPE, back translation, data selection, model ensembling and reranking, at industrial scale. For some morphologically-rich languages, we also incorporated linguistic knowledge into our neural network. For the translation tasks in which we have participated, our resulting systems achieved the best case sensitive BLEU score in all 5 directions. Notably, our English → Russian system outperformed the second reranked system by 5 BLEU score.
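Among the techniques listed, back-translation can be sketched in a few lines. This is a hedged illustration of the general idea, not Alibaba's pipeline: `reverse_translate` is a stand-in for a real target-to-source NMT model.

```python
# Back-translation sketch: translate monolingual target-side text back
# into the source language with a reverse model, then pair the synthetic
# source with the real target to create extra parallel training data.

def reverse_translate(sentence):
    # Placeholder: a real system would run a target->source NMT model.
    return "<synthetic source for: %s>" % sentence

def back_translate(target_monolingual):
    """Build synthetic (source, target) pairs for training."""
    return [(reverse_translate(t), t) for t in target_monolingual]

pairs = back_translate(["Sentence one.", "Sentence two."])
for src, tgt in pairs:
    print(src, "->", tgt)
```

The synthetic pairs are typically mixed with genuine parallel data so the forward model sees fluent, real target-side sentences during training.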
2017
Conception d’une solution de détection d’événements basée sur Twitter (Design of a solution for event detection from Twitter)
Christophe Servan | Catherine Kobus | Yongchao Deng | Cyril Touffet | Jungi Kim | Inès Kapp | Djamel Mostefa | Josep Crego | Aurélien Coquard | Jean Senellart
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 3 - Démonstrations
This article presents an alert system built on the mass of data coming from Twitter. The goal of the tool is to monitor the news across several demonstration domains, including sporting events and natural disasters. This monitoring is delivered to the user through a web interface containing the list of events located on a map.
SYSTRAN Purely Neural MT Engines for WMT2017
Yongchao Deng | Jungi Kim | Guillaume Klein | Catherine Kobus | Natalia Segal | Christophe Servan | Bo Wang | Dakun Zhang | Josep Crego | Jean Senellart
Proceedings of the Second Conference on Machine Translation