GALA: Geometric Data Selection with Strategic Prospecting for Large Language Model Self-training

Zhongwei Xie, Ruihao Liao, Zimo Wang, Chong Chen, Xian-Sheng Hua, Xiao Luo


Abstract
Self-training has emerged as a promising direction for autonomously improving large language models (LLMs). Existing approaches typically adopt a generate-and-filter paradigm based on rejection sampling, which can suffer from inefficiency and low-quality reasoning paths. To address this, this paper proposes a novel framework named Geometric Data Selection with Strategic Prospecting (GALA) for LLM self-training. The core of GALA is to identify diverse and informative samples from redundant data and exploit them more strategically. In particular, GALA first clusters latent sentence embeddings and then selects an anchor sample from each cluster based on geometric distance to reduce data redundancy. To further exploit these samples, we conduct strategic brainstorming and reflection to prospect for high-quality reasoning trajectories. In addition, we introduce a lightweight dynamic validation module that checks the reliability of mini-batches to ensure the overall quality of the data. Extensive experiments on various benchmarks validate the effectiveness of GALA against several competing baselines.
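The selection step described in the abstract (cluster the sentence embeddings, then keep one anchor per cluster) might be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not name the clustering algorithm or the distance, so plain k-means and Euclidean distance to the cluster centroid are stand-ins, the function names `kmeans` and `select_anchors` are hypothetical, and the random vectors merely take the place of real sentence embeddings.

```python
import numpy as np


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means (an assumed stand-in); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    # Recompute labels so they match the final centroids.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, dists.argmin(axis=1)


def select_anchors(X, k, seed=0):
    """Return one index per non-empty cluster: the point nearest its centroid."""
    centroids, labels = kmeans(X, k, seed=seed)
    anchors = []
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        if len(idx) == 0:
            continue
        d = np.linalg.norm(X[idx] - centroids[j], axis=1)
        anchors.append(int(idx[d.argmin()]))
    return anchors


# Hypothetical data: 200 random 16-dimensional vectors in place of embeddings.
embeddings = np.random.default_rng(42).normal(size=(200, 16))
anchors = select_anchors(embeddings, k=8)
print(len(anchors), "anchor samples kept out of", len(embeddings))
```

Keeping only the returned indices shrinks the pool to at most `k` samples while covering each cluster once, which is one simple way to read "reduce data redundancy"; the criterion actually used by the authors may differ.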
Anthology ID:
2026.findings-acl.500
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
10281–10293
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.500/
Cite (ACL):
Zhongwei Xie, Ruihao Liao, Zimo Wang, Chong Chen, Xian-Sheng Hua, and Xiao Luo. 2026. GALA: Geometric Data Selection with Strategic Prospecting for Large Language Model Self-training. In Findings of the Association for Computational Linguistics: ACL 2026, pages 10281–10293, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
GALA: Geometric Data Selection with Strategic Prospecting for Large Language Model Self-training (Xie et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.500.pdf
Checklist:
 2026.findings-acl.500.checklist.pdf