The Data Frontier for Large Language Models: Selection, Synthesis, and Tools

Lijun Wu; Wentao Zhang; Conghui He

Abstract

As the development of Large Language Models (LLMs) matures, the research community's focus is shifting from a purely model-centric paradigm to a data-centric one. It is now evident that the quality, diversity, and composition of training data, not merely its scale, are the primary drivers of a model’s advanced capabilities, from complex reasoning to reliable instruction following. However, acquiring and curating such high-quality data remains a significant bottleneck. This tutorial provides a comprehensive and practical guide to state-of-the-art data research for LLMs. We structure the tutorial around the two core pillars of modern data strategy: intelligent data selection and advanced data synthesis. The first part covers methods for curating the most valuable information from vast, noisy datasets, including LLM-as-a-judge for automated quality filtering and active learning for maximizing annotation efficiency. The second part explores synthetic data, detailing paradigms that range from generating complex reasoning traces (e.g., Chain-of-Thought) to deploying multi-agent workflows that can autonomously create high-quality, diverse instruction data from raw seeds. We conclude with a practical overview of open-source tools and platforms that facilitate these data-centric workflows, empowering researchers and practitioners to build better models through better data. Attendees will leave with a principled framework and actionable insights for designing and implementing the data strategies required to build the next generation of powerful, specialized, and aligned LLMs.

Anthology ID: 2026.acl-1.2
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 5: Tutorial Abstracts)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Jacob Andreas, Kenton Murray
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 3–4
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-1.2/
Cite (ACL): Lijun Wu, Wentao Zhang, and Conghui He. 2026. The Data Frontier for Large Language Models: Selection, Synthesis, and Tools. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 5: Tutorial Abstracts), pages 3–4, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): The Data Frontier for Large Language Models: Selection, Synthesis, and Tools (Wu et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-1.2.pdf
