ParaZh-22M: A Large-Scale Chinese Parabank via Machine Translation
Wenjie Hao, Hongfei Xu, Deyi Xiong, Hongying Zan, Lingling Mu
Abstract
Paraphrasing, i.e., restating the same meaning in different ways, is an important data augmentation approach for natural language processing (NLP). Zhang et al. (2019b) propose to extract sentence-level paraphrases from multiple Chinese translations of the same source texts, constructing the PKU Paraphrase Bank of 0.5M sentence pairs. However, despite being the largest Chinese parabank to date, the PKU parabank is limited in size by the availability of one-to-many sentence translation data and cannot adequately support the training of large Chinese paraphrasers. In this paper, we lift this reliance on one-to-many sentence translation data and construct ParaZh-22M, a larger Chinese parabank of 22M sentence pairs, built from one-to-one bilingual sentence translation data and machine translation (MT). In our data augmentation experiments, we show that paraphrasing based on ParaZh-22M brings consistent and significant improvements over several strong baselines on a wide range of Chinese NLP tasks, including several Chinese natural language understanding benchmarks (CLUE) and low-resource machine translation.
- Anthology ID:
- 2022.coling-1.341
- Volume:
- Proceedings of the 29th International Conference on Computational Linguistics
- Month:
- October
- Year:
- 2022
- Address:
- Gyeongju, Republic of Korea
- Venue:
- COLING
- Publisher:
- International Committee on Computational Linguistics
- Pages:
- 3885–3897
- URL:
- https://aclanthology.org/2022.coling-1.341
- Cite (ACL):
- Wenjie Hao, Hongfei Xu, Deyi Xiong, Hongying Zan, and Lingling Mu. 2022. ParaZh-22M: A Large-Scale Chinese Parabank via Machine Translation. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3885–3897, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
- Cite (Informal):
- ParaZh-22M: A Large-Scale Chinese Parabank via Machine Translation (Hao et al., COLING 2022)
- PDF:
- https://aclanthology.org/2022.coling-1.341.pdf
- Data
- CLUE, ParaBank, ParaCrawl, United Nations Parallel Corpus
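
The abstract describes building the parabank from one-to-one bilingual sentence data plus MT. A minimal sketch of that general idea is shown below: translate the English side of a Zh-En parallel corpus into Chinese with an off-the-shelf MT model and pair each output with the original Chinese sentence. The OPUS-MT checkpoint, beam settings, and filtering used here are illustrative assumptions, not the authors' released pipeline.

```python
# Sketch: mining Chinese paraphrase pairs from one-to-one Zh-En parallel data
# by machine-translating the English side back into Chinese. Illustrative only;
# the paper's actual MT system and filtering steps are not reproduced here.
from transformers import MarianMTModel, MarianTokenizer

MODEL_NAME = "Helsinki-NLP/opus-mt-en-zh"  # assumed off-the-shelf checkpoint
tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
model = MarianMTModel.from_pretrained(MODEL_NAME)

def mine_paraphrase_pairs(parallel_data):
    """parallel_data: iterable of (zh_sentence, en_sentence) pairs."""
    zh_sents, en_sents = zip(*parallel_data)
    batch = tokenizer(list(en_sents), return_tensors="pt",
                      padding=True, truncation=True)
    outputs = model.generate(**batch, num_beams=4, max_length=128)
    zh_hyps = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # The original Chinese sentence and its MT-generated counterpart form a
    # candidate paraphrase pair; identical outputs are discarded.
    return [(zh, hyp) for zh, hyp in zip(zh_sents, zh_hyps) if zh != hyp]

if __name__ == "__main__":
    data = [("今天天气很好。", "The weather is nice today.")]
    print(mine_paraphrase_pairs(data))
```

In practice, candidate pairs mined this way would typically be filtered further (e.g., by length ratio or translation quality) before being used to train a paraphraser or for data augmentation.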