A Large-Scale Benchmark for Vietnamese Sentence Paraphrases

Sang Quang Nguyen, Kiet Van Nguyen


Abstract
This paper presents ViSP, a high-quality Vietnamese dataset for sentence paraphrasing, consisting of 1.2M original–paraphrase pairs collected from various domains. The dataset was constructed using a hybrid approach that combines automatic paraphrase generation with manual evaluation to ensure high quality. We conducted experiments with methods such as back-translation and EDA, baseline models such as BART and T5, and large language models (LLMs), including GPT-4o, Gemini-1.5, Aya, Qwen-2.5, and Meta-Llama-3.1 variants. To the best of our knowledge, this is the first large-scale study on Vietnamese paraphrasing. We hope that our dataset and findings will serve as a valuable foundation for future research and applications in Vietnamese paraphrase tasks. The dataset is available for research purposes at https://github.com/ngwgsang/ViSP.
Anthology ID: 2025.findings-naacl.59
Volume: Findings of the Association for Computational Linguistics: NAACL 2025
Month: April
Year: 2025
Address: Albuquerque, New Mexico
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 1045–1060
URL: https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.59/
Cite (ACL): Sang Quang Nguyen and Kiet Van Nguyen. 2025. A Large-Scale Benchmark for Vietnamese Sentence Paraphrases. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1045–1060, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): A Large-Scale Benchmark for Vietnamese Sentence Paraphrases (Nguyen & Nguyen, Findings 2025)
PDF: https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.59.pdf