Hui Zhang


2022

CSL: A Large-scale Chinese Scientific Literature Dataset
Yudong Li | Yuqing Zhang | Zhe Zhao | Linlin Shen | Weijie Liu | Weiquan Mao | Hui Zhang
Proceedings of the 29th International Conference on Computational Linguistics

Scientific literature serves as a high-quality corpus that supports a great deal of Natural Language Processing (NLP) research. However, existing datasets are centered on English, which restricts the development of Chinese scientific NLP. In this work, we present CSL, a large-scale Chinese Scientific Literature dataset containing the titles, abstracts, keywords, and academic fields of 396k papers. To our knowledge, CSL is the first scientific document dataset in Chinese. CSL can serve as a Chinese corpus, and its semi-structured fields provide natural annotations from which many supervised NLP tasks can be constructed. Based on CSL, we present a benchmark to evaluate model performance on scientific-domain tasks, i.e., summarization, keyword generation, and text classification. We analyze the behavior of existing text-to-text models on these evaluation tasks and reveal the challenges of Chinese scientific NLP, providing a valuable reference for future research. Data and code will be publicly available.
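
As a rough illustration of how such semi-structured records might be turned into the three benchmark tasks, the sketch below loads JSON-lines records and pairs fields into (input, target) examples. The file path and the field names (title, abstract, keywords, discipline) are assumptions for illustration, not the dataset's documented schema.

```python
# Minimal sketch: derive supervised (input, target) pairs from CSL-style records.
# The field names and file path are hypothetical placeholders.
import json

def load_records(path):
    """Yield one paper record per line of a JSON-lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

def to_tasks(record):
    """Map a single record to the three benchmark-style tasks."""
    return {
        # Summarization: abstract -> title
        "summarization": (record["abstract"], record["title"]),
        # Keyword generation: abstract -> keywords
        "keywords": (record["abstract"], record["keywords"]),
        # Text classification: abstract -> academic field
        "classification": (record["abstract"], record["discipline"]),
    }

# Usage (path is hypothetical):
# for record in load_records("csl_data.jsonl"):
#     print(to_tasks(record)["classification"])
```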

PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit
Hui Zhang | Tian Yuan | Junkun Chen | Xintong Li | Renjie Zheng | Yuxin Huang | Xiaojie Chen | Enlei Gong | Zeyu Chen | Xiaoguang Hu | Dianhai Yu | Yanjun Ma | Liang Huang
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations

PaddleSpeech is an open-source all-in-one speech toolkit. It aims to facilitate the development and research of speech processing technologies by providing an easy-to-use command-line interface and a simple code structure. This paper describes the design philosophy and core architecture of PaddleSpeech to support several essential speech-to-text and text-to-speech tasks. PaddleSpeech achieves competitive or state-of-the-art performance on various speech datasets and implements the most popular methods. It also provides recipes and pretrained models to quickly reproduce the experimental results in this paper. PaddleSpeech is publicly available at https://github.com/PaddlePaddle/PaddleSpeech.
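
For orientation, the sketch below shows the kind of high-level Python usage the toolkit advertises for speech-to-text and text-to-speech. The import paths and call signatures follow the project README at the time of writing and may differ between releases, so treat them as assumptions rather than a stable API.

```python
# Sketch of PaddleSpeech's high-level executors; paths/signatures assumed from the README.
from paddlespeech.cli.asr.infer import ASRExecutor
from paddlespeech.cli.tts.infer import TTSExecutor

# Speech-to-text: transcribe a 16 kHz Mandarin WAV file.
asr = ASRExecutor()
text = asr(audio_file="zh.wav")
print(text)

# Text-to-speech: synthesize the recognized text back to audio.
tts = TTSExecutor()
tts(text=text, output="synthesized.wav")
```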

2021

基于多层次预训练策略和多任务学习的端到端蒙汉语音翻译(End-to-end Mongolian-Chinese Speech Translation Based on Multi-level Pre-training Strategies and Multi-task Learning)
Ningning Wang (王宁宁) | Long Fei (飞龙) | Hui Zhang (张晖)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

End-to-end speech translation translates source-language speech directly into target-language text. It requires "source-language speech - target-language text" pairs as training data, yet such data are extremely scarce. This paper proposes a training method that combines a multi-level pre-training strategy with multi-task learning. First, the individual modules of the speech recognition and machine translation models are pre-trained at multiple levels; next, the speech recognition and machine translation models are connected to form a speech translation model; then transfer learning is used to fine-tune the pre-trained model in multiple steps. During this process, multi-task learning is applied, organizing training with speech recognition as an auxiliary task for speech translation, so that the various existing forms of data are fully exploited to train the end-to-end model. This is the first application of end-to-end techniques to Mongolian-Chinese speech translation under resource-constrained conditions, and it builds the first practically usable end-to-end Mongolian-Chinese speech translation system with relatively high translation quality.
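
The sketch below illustrates the generic form of such an auxiliary-task objective, with speech recognition as the auxiliary task for speech translation. The weighting scheme and the 0.3 value are illustrative assumptions, not the paper's actual configuration; the multi-step schedule is only hinted at in the comments.

```python
# Generic multi-task objective: speech-translation loss plus a weighted ASR auxiliary loss.
import torch

def multitask_loss(st_loss: torch.Tensor, asr_loss: torch.Tensor,
                   aux_weight: float = 0.3) -> torch.Tensor:
    """Combine the main speech-translation loss with the ASR auxiliary loss."""
    return st_loss + aux_weight * asr_loss

# Rough shape of the overall recipe (as described in the abstract):
#   1. Pre-train the ASR and MT modules separately (multi-level pre-training).
#   2. Connect them into one end-to-end speech-translation model.
#   3. Fine-tune in multiple steps, minimizing multitask_loss at each step.
```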

2018

A LSTM Approach with Sub-Word Embeddings for Mongolian Phrase Break Prediction
Rui Liu | Feilong Bao | Guanglai Gao | Hui Zhang | Yonghe Wang
Proceedings of the 27th International Conference on Computational Linguistics

In this paper, we apply word embeddings that focus on sub-word units to the Mongolian Phrase Break (PB) prediction task using a Long Short-Term Memory (LSTM) model. Mongolian is an agglutinative language: each root can be followed by several suffixes to form potentially millions of words, but the existing Mongolian corpus is not large enough to build robust whole-word embeddings. This causes a serious data-sparsity problem and makes Mongolian PB prediction difficult. To address it, we look at the sub-word units within each Mongolian word, encode their information into a meaningful representation, and feed it to an LSTM to decode the best corresponding PB label. Experimental results show that the proposed model significantly outperforms a traditional CRF model with hand-crafted features, obtaining a 7.49% F-measure gain.
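
The sketch below illustrates the general idea in PyTorch: compose each word's representation from the embeddings of its sub-word units, then tag phrase breaks over the word sequence with a bidirectional LSTM. Vocabulary sizes, the sum-pooling composition, and all dimensions are illustrative assumptions, not the paper's configuration.

```python
# Sub-word composition + LSTM sequence tagging for phrase-break prediction (illustrative).
import torch
import torch.nn as nn

class SubwordPBTagger(nn.Module):
    def __init__(self, subword_vocab=5000, emb_dim=64, hidden=128, num_labels=2):
        super().__init__()
        self.subword_emb = nn.Embedding(subword_vocab, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_labels)  # break / no-break

    def forward(self, subword_ids):
        # subword_ids: (batch, words, subwords_per_word); index 0 pads unused slots.
        word_vecs = self.subword_emb(subword_ids).sum(dim=2)  # compose sub-word units
        states, _ = self.lstm(word_vecs)                      # contextualize word sequence
        return self.classifier(states)                        # PB label scores per word

# Example: one sentence of 3 words, each split into up to 2 sub-word units.
tagger = SubwordPBTagger()
ids = torch.tensor([[[1, 2], [3, 0], [4, 5]]])
print(tagger(ids).shape)  # torch.Size([1, 3, 2])
```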

2014

Kneser-Ney Smoothing on Expected Counts
Hui Zhang | David Chiang
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Observational Initialization of Type-Supervised Taggers
Hui Zhang | John DeNero
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2013

Beyond Left-to-Right: Multiple Decomposition Structures for SMT
Hui Zhang | Kristina Toutanova | Chris Quirk | Jianfeng Gao
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2012

An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?
Hui Zhang | David Chiang
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2010

Convolution Kernel over Packed Parse Forest
Min Zhang | Hui Zhang | Haizhou Li
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

Non-Isomorphic Forest Pair Translation
Hui Zhang | Min Zhang | Haizhou Li | Eng Siong Chng
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

2009

Forest-based Tree Sequence to String Translation Model
Hui Zhang | Min Zhang | Haizhou Li | Aiti Aw | Chew Lim Tan
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

Fast Translation Rule Matching for Syntax-based Statistical Machine Translation
Hui Zhang | Min Zhang | Haizhou Li | Chew Lim Tan
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

K-Best Combination of Syntactic Parsers
Hui Zhang | Min Zhang | Chew Lim Tan | Haizhou Li
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

I2R’s machine translation system for IWSLT 2009
Xiangyu Duan | Deyi Xiong | Hui Zhang | Min Zhang | Haizhou Li
Proceedings of the 6th International Workshop on Spoken Language Translation: Evaluation Campaign

In this paper, we describe the system and approach used by the Institute for Infocomm Research (I2R) for the IWSLT 2009 spoken language translation evaluation campaign. Two kinds of machine translation systems are applied: a phrase-based system and a syntax-based system. To test the syntax-based system on spoken language translation, several system variants are explored. On top of both the phrase-based and syntax-based single systems, we further use rescoring to improve the performance of each individual system and system combination to combine the strengths of the different individual systems. Rescoring is applied to each single system's output, and system combination is applied to all of the rescored outputs. Finally, our system combination framework shows better performance on the Chinese-English BTEC task.
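
The sketch below shows the shape of that two-stage pipeline: rescore each system's n-best list, then select across the rescored systems. The weighted-feature rescoring and the simple selection step are illustrative stand-ins; the abstract does not specify the actual rescoring features or combination method (e.g., confusion-network decoding).

```python
# Schematic rescoring + system-combination pipeline (illustrative stand-in).
def rescore(nbest, weights):
    """Re-rank one system's n-best list by a weighted sum of its feature scores."""
    for hyp in nbest:
        hyp["rescored"] = sum(weights[name] * value
                              for name, value in hyp["features"].items())
    return sorted(nbest, key=lambda hyp: hyp["rescored"], reverse=True)

def combine(rescored_lists):
    """Pick a final output across all rescored systems (a stand-in for a real
    combination method such as confusion-network decoding)."""
    best_per_system = [lst[0] for lst in rescored_lists if lst]
    return max(best_per_system, key=lambda hyp: hyp["rescored"])
```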