Rule Discovery for Natural Language Inference Data Generation Using Out-of-Distribution Detection

Juyoung Han, Hyunsun Hwang, Changki Lee


Abstract
Natural Language Inference (NLI) is a fundamental task in Natural Language Processing (NLP), yet adapting NLI models to new domains remains challenging due to the high cost of collecting domain-specific training data. While prior work proposed 15 sentence transformation rules to automate training data generation, these rules do not sufficiently capture the diversity of natural language. We propose a novel framework that combines Out-of-Distribution (OOD) detection and BERT-based clustering to identify premise–hypothesis pairs in the SNLI dataset that are not covered by existing rules and to discover four new transformation rules from them. Using these rules with Chain-of-Thought (CoT) prompting and Large Language Models (LLMs), we generate high-quality training data and augment the SNLI dataset. Our method yields consistent performance improvements across dataset sizes, improving accuracy by 0.85 percentage points on 2k samples and by 0.15 percentage points on 550k samples. Furthermore, a distribution-aware augmentation strategy enhances performance across all scales. Beyond manual explanations, we extend our framework to automatically generated explanations (CoT-Ex), demonstrating that they provide a scalable alternative to human-written explanations and enable reliable rule discovery.
Anthology ID:
2025.emnlp-main.1319
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
25982–26002
URL:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.emnlp-main.1319/
DOI:
10.18653/v1/2025.emnlp-main.1319
Cite (ACL):
Juyoung Han, Hyunsun Hwang, and Changki Lee. 2025. Rule Discovery for Natural Language Inference Data Generation Using Out-of-Distribution Detection. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25982–26002, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Rule Discovery for Natural Language Inference Data Generation Using Out-of-Distribution Detection (Han et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.emnlp-main.1319.pdf
Checklist:
2025.emnlp-main.1319.checklist.pdf