@inproceedings{wu-etal-2026-data,
title = "The Data Frontier for Large Language Models: Selection, Synthesis, and Tools",
author = "Wu, Lijun and
Zhang, Wentao and
He, Conghui",
editor = "Andreas, Jacob and
Murray, Kenton",
booktitle = "Proceedings of the 64th Annual Meeting of the {A}ssociation for {C}omputational {L}inguistics (Volume 5: Tutorial Abstracts)",
month = jul,
year = "2026",
address = "San Diego, California, USA",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/ingest-acl/2026.acl-1.2/",
pages = "3--4",
ISBN = "979-8-89176-394-4",
abstract = "As the development of Large Language Models (LLMs) matures, the focus of the research community is undergoing a critical shift from a purely model-centric to a data-centric paradigm. It is now evident that the quality, diversity, and composition of training data{---}not merely its scale{---}are the primary drivers of a model{'}s advanced capabilities, from complex reasoning to reliable instruction following. However, acquiring and curating such high-quality data remains a significant bottleneck. This tutorial provides a comprehensive and practical guide to the state-of-the-art in data research directions for LLMs. We structure the tutorial around the two core pillars of modern data strategy: intelligent data selection and advanced data synthesis. In the first part, we delve into methods for curating the most valuable information from vast, noisy datasets, covering techniques like LLM-as-a-judge for automated quality filtering and active learning for maximizing annotation efficiency. The second part explores the synthetic data revolution, detailing paradigms that range from generating complex reasoning traces (e.g., Chain-of-Thought) to deploying sophisticated multi-agent workflows that can autonomously create high-quality, diverse instruction data from raw seeds. Finally, we will conclude with a practical overview of open-source tools and platforms that facilitate these data-centric workflows, empowering researchers and practitioners to build better models through better data. Attendees will leave with a principled framework and actionable insights for designing and implementing the advanced data strategies required to build the next generation of powerful, specialized, and aligned LLMs."
}