Shuhan Zhou


ODE Transformer: An Ordinary Differential Equation-Inspired Model for Sequence Generation
Bei Li | Quan Du | Tao Zhou | Yi Jing | Shuhan Zhou | Xin Zeng | Tong Xiao | JingBo Zhu | Xuebo Liu | Min Zhang
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Residual networks are an Euler discretization of solutions to Ordinary Differential Equations (ODE). This paper explores a deeper relationship between Transformer and numerical ODE methods. We first show that a residual block of layers in Transformer can be described as a higher-order solution to ODE. Inspired by this, we design a new architecture, ODE Transformer, which is analogous to the Runge-Kutta method that is well motivated in ODE. As a natural extension to Transformer, ODE Transformer is easy to implement and efficient to use. Experimental results on the large-scale machine translation, abstractive summarization, and grammar error correction tasks demonstrate the high genericity of ODE Transformer. It can gain large improvements in model performance over strong baselines (e.g., 30.77 and 44.11 BLEU scores on the WMT’14 English-German and English-French benchmarks) at a slight cost in inference efficiency.

Multi-Path Transformer is Better: A Case Study on Neural Machine Translation
Ye Lin | Shuhan Zhou | Yanyang Li | Anxiang Ma | Tong Xiao | Jingbo Zhu
Findings of the Association for Computational Linguistics: EMNLP 2022

For years the model performance in machine learning obeyed a power-law relationship with the model size. For the consideration of parameter efficiency, recent studies focus on increasing model depth rather than width to achieve better performance. In this paper, we study how model width affects the Transformer model through a parameter-efficient multi-path structure. To better fuse features extracted from different paths, we add three additional operations to each sublayer: a normalization at the end of each path, a cheap operation to produce more features, and a learnable weighted mechanism to fuse all features flexibly. Extensive experiments on 12 WMT machine translation tasks show that, with the same number of parameters, the shallower multi-path model can achieve similar or even better performance than the deeper model. It reveals that we should pay more attention to the multi-path structure, and there should be a balance between the model depth and width to train a better large-scale Transformer.


The NiuTrans Machine Translation Systems for WMT21
Shuhan Zhou | Tao Zhou | Binghao Wei | Yingfeng Luo | Yongyu Mu | Zefan Zhou | Chenglong Wang | Xuanjun Zhou | Chuanhao Lv | Yi Jing | Laohu Wang | Jingnan Zhang | Canan Huang | Zhongxiang Yan | Chi Hu | Bei Li | Tong Xiao | Jingbo Zhu
Proceedings of the Sixth Conference on Machine Translation

This paper describes NiuTrans neural machine translation systems of the WMT 2021 news translation tasks. We made submissions to 9 language directions, including English2Chinese, Japanese, Russian, Icelandic and English2Hausa tasks. Our primary systems are built on several effective variants of Transformer, e.g., Transformer-DLCL, ODE-Transformer. We also utilize back-translation, knowledge distillation, post-ensemble, and iterative fine-tuning techniques to enhance the model performance further.


The NiuTrans Machine Translation Systems for WMT20
Yuhao Zhang | Ziyang Wang | Runzhe Cao | Binghao Wei | Weiqiao Shan | Shuhan Zhou | Abudurexiti Reheman | Tao Zhou | Xin Zeng | Laohu Wang | Yongyu Mu | Jingnan Zhang | Xiaoqian Liu | Xuanjun Zhou | Yinqiao Li | Bei Li | Tong Xiao | Jingbo Zhu
Proceedings of the Fifth Conference on Machine Translation

This paper describes NiuTrans neural machine translation systems of the WMT20 news translation tasks. We participated in Japanese<->English, English->Chinese, Inuktitut->English and Tamil->English total five tasks and rank first in Japanese<->English both sides. We mainly utilized iterative back-translation, different depth and widen model architectures, iterative knowledge distillation and iterative fine-tuning. And we find that adequately widened and deepened the model simultaneously, the performance will significantly improve. Also, iterative fine-tuning strategy we implemented is effective during adapting domain. For Inuktitut->English and Tamil->English tasks, we built multilingual models separately and employed pretraining word embedding to obtain better performance.