Synthetic Data Generation for Training Diversified Commonsense Reasoning Models

Tianhui Zhang, Bei Peng, Danushka Bollegala


Abstract
Conversational agents must respond to their users not only with high-quality (i.e., commonsense-bearing) responses, but also with responses that consider multiple plausible alternative scenarios and are therefore diverse. Despite the growing need to train diverse commonsense generators, progress in this line of work has been significantly hindered by the lack of large-scale, high-quality, diverse commonsense training datasets. Because annotation costs are high, existing Generative Commonsense Reasoning (GCR) datasets are created by a small number of human annotators and cover only a narrow set of commonsense scenarios. To address this training resource gap, we propose a two-stage method to create CommonSyn, the first synthetic dataset for diversified GCR. Models fine-tuned on our synthetic data jointly improve both generation diversity and quality compared with vanilla models and with models fine-tuned on a human-crafted dataset, across Large Language Models (LLMs) of different sizes.
Anthology ID:
2026.acl-long.1520
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
32916–32937
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1520/
Cite (ACL):
Tianhui Zhang, Bei Peng, and Danushka Bollegala. 2026. Synthetic Data Generation for Training Diversified Commonsense Reasoning Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 32916–32937, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Synthetic Data Generation for Training Diversified Commonsense Reasoning Models (Zhang et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1520.pdf
Checklist:
 2026.acl-long.1520.checklist.pdf