ParaZh-22M: A Large-Scale Chinese Parabank via Machine Translation

Wenjie Hao, Hongfei Xu, Deyi Xiong, Hongying Zan, Lingling Mu


Abstract
Paraphrasing, i.e., restating the same meaning in different ways, is an important data augmentation approach for natural language processing (NLP). Zhang et al. (2019b) propose to extract sentence-level paraphrases from multiple Chinese translations of the same source texts, and construct the PKU Paraphrase Bank of 0.5M sentence pairs. However, although it is the largest Chinese parabank to date, the PKU Paraphrase Bank is limited in size by the availability of one-to-many sentence translation data, and cannot adequately support the training of large Chinese paraphrasers. In this paper, we remove the reliance on one-to-many sentence translation data and construct ParaZh-22M, a larger Chinese parabank of 22M sentence pairs, based on one-to-one bilingual sentence translation data and machine translation (MT). In our data augmentation experiments, we show that paraphrasing based on ParaZh-22M brings consistent and significant improvements over several strong baselines on a wide range of Chinese NLP tasks, including a number of Chinese natural language understanding benchmarks (CLUE) and low-resource machine translation.
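The abstract does not spell out the construction pipeline, so the following is only a minimal sketch of one plausible reading: the non-Chinese side of each one-to-one bilingual sentence pair is machine-translated into Chinese, and the MT output is paired with the human Chinese reference as a paraphrase candidate. The function name `build_paraphrase_pairs`, the `translate_to_zh` callable, and the filtering rules are hypothetical and are not taken from the paper.

```python
from typing import Callable, Iterable, List, Tuple


def build_paraphrase_pairs(
    bitext: Iterable[Tuple[str, str]],
    translate_to_zh: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each Chinese reference with an MT rendering of its source sentence.

    bitext: (source_sentence, chinese_reference) pairs from one-to-one parallel data.
    translate_to_zh: any source-to-Chinese MT function (an assumption; plug in a real model).
    """
    pairs = []
    for src, zh_ref in bitext:
        zh_ref = zh_ref.strip()
        zh_mt = translate_to_zh(src).strip()
        # Skip empty outputs: they carry no paraphrase signal.
        if not zh_ref or not zh_mt:
            continue
        # Skip exact copies: an identical pair is not a useful paraphrase.
        if zh_ref == zh_mt:
            continue
        pairs.append((zh_ref, zh_mt))
    return pairs


if __name__ == "__main__":
    # Toy stand-in for an MT system, used only to make the sketch runnable.
    toy_mt = {"I like tea.": "我爱喝茶。", "Hello.": "你好。"}
    toy_bitext = [("I like tea.", "我喜欢茶。"), ("Hello.", "你好。")]
    for ref, mt in build_paraphrase_pairs(toy_bitext, lambda s: toy_mt[s]):
        print(ref, "|||", mt)
```

Because this scheme needs only one Chinese reference per source sentence, any ordinary parallel corpus can serve as input, which is what would allow the resource to grow beyond a one-to-many collection.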
Anthology ID:
2022.coling-1.341
Volume:
Proceedings of the 29th International Conference on Computational Linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Editors:
Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, Seung-Hoon Na
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
3885–3897
URL:
https://aclanthology.org/2022.coling-1.341
Cite (ACL):
Wenjie Hao, Hongfei Xu, Deyi Xiong, Hongying Zan, and Lingling Mu. 2022. ParaZh-22M: A Large-Scale Chinese Parabank via Machine Translation. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3885–3897, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
ParaZh-22M: A Large-Scale Chinese Parabank via Machine Translation (Hao et al., COLING 2022)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2022.coling-1.341.pdf
Data:
CLUE, CMNLI, ParaBank, ParaCrawl, United Nations Parallel Corpus