Dually Self-Improved Counterfactual Data Augmentation Using Large Language Model

Luhao Zhang, Xinyu Zhang, Linmei Hu, Dandan Song, Liqiang Nie


Abstract
Counterfactual data augmentation, which makes minimal token-level edits that alter an example's label, has become a key approach to improving model robustness in natural language processing (NLP). It is usually implemented by first identifying the causal terms and then modifying those terms to create counterfactual candidates. The emergence of large language models (LLMs) has greatly facilitated counterfactual data augmentation. However, existing LLM-based approaches still face challenges in 1) accurately extracting task-specific causal terms, and 2) ensuring the quality of LLM-generated counterfactuals. To address these issues, we propose a dually self-improved counterfactual data augmentation method using LLMs for the Natural Language Inference (NLI) task. On the one hand, we design a self-improved strategy that employs the attention distribution of the task model to identify causal terms, making the identification both lightweight and task-specific. On the other hand, a second self-improved strategy based on direct preference optimization is used to refine the LLM-generated counterfactuals, yielding high-quality augmentation data. Finally, the task model is retrained on the fusion of the existing data and the generated counterfactuals with a balanced loss that prevents over-emphasis on the augmented data. Extensive experiments on NLI benchmarks demonstrate that our method generates high-quality counterfactuals and improves task performance.
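The abstract only gestures at how the task model's attention distribution is used to surface causal terms. As a rough, hypothetical illustration (not the authors' implementation), the sketch below ranks tokens of an NLI pair by averaged [CLS] attention from a generic Transformer classifier; the model name, function name, and top-k scoring rule are all assumptions made for exposition.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical task model; the abstract does not specify which classifier is used.
MODEL_NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, output_attentions=True)
model.eval()

def candidate_causal_terms(premise: str, hypothesis: str, top_k: int = 3):
    """Rank tokens by [CLS]-to-token attention mass, averaged over layers and heads."""
    enc = tokenizer(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    # out.attentions: tuple of (batch, heads, seq, seq) tensors, one per layer.
    attn = torch.stack(out.attentions).mean(dim=(0, 2))  # -> (batch, seq, seq)
    cls_row = attn[0, 0]                                  # [CLS] query attention over all tokens
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    scores, indices = cls_row.topk(min(top_k, cls_row.size(0)))
    return [(tokens[i], float(s)) for i, s in zip(indices.tolist(), scores.tolist())]

print(candidate_causal_terms("A man is playing a guitar.", "A person plays an instrument."))
```

In such a setup, the highest-attention tokens would be handed to the LLM as the spans to edit when generating counterfactual candidates; the actual selection criterion in the paper may differ.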
Anthology ID:
2025.acl-long.260
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
5216–5227
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.260/
Cite (ACL):
Luhao Zhang, Xinyu Zhang, Linmei Hu, Dandan Song, and Liqiang Nie. 2025. Dually Self-Improved Counterfactual Data Augmentation Using Large Language Model. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5216–5227, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Dually Self-Improved Counterfactual Data Augmentation Using Large Language Model (Zhang et al., ACL 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.260.pdf