Collaborative Co-Design Practices for Supporting Synthetic Data Generation in Large Language Models: A Pilot Study
Heloisa Candello, Raya Horesh, Aminat Adebiyi, Muneeza Azmat, Rogério Abreu de Paula, Lamogha Chiazor
Abstract
Large language models (LLMs) are increasingly embedded in development pipelines and the daily workflows of AI practitioners. However, their effectiveness depends on access to high-quality datasets that are sufficiently large, diverse, and contextually relevant. Existing datasets often fall short of these requirements, prompting the use of synthetic data (SD) generation. A critical step in this process is the creation of human seed examples, which guide the generation of SD tailored to specific tasks. We propose a participatory methodology for seed example generation, involving multidisciplinary teams in structured workshops to co-create examples aligned with Responsible AI principles. In a pilot study with a Responsible AI team, we facilitated hands-on activities to produce seed examples and evaluated the resulting data across three dimensions: diversity, sensibility, and relevance. Our findings suggest that participatory approaches can enhance the representativeness and contextual fidelity of synthetic datasets. We provide a reproducible framework to support NLP practitioners in generating high-quality seed data for LLM development and deployment.
- Anthology ID:
- 2025.hcinlp-1.11
- Volume:
- Proceedings of the Fourth Workshop on Bridging Human-Computer Interaction and Natural Language Processing (HCI+NLP)
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Su Lin Blodgett, Amanda Cercas Curry, Sunipa Dev, Siyan Li, Michael Madaio, Jack Wang, Sherry Tongshuang Wu, Ziang Xiao, Diyi Yang
- Venues:
- HCINLP | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 129–147
- URL:
- https://preview.aclanthology.org/ingest-emnlp/2025.hcinlp-1.11/
- Cite (ACL):
- Heloisa Candello, Raya Horesh, Aminat Adebiyi, Muneeza Azmat, Rogério Abreu de Paula, and Lamogha Chiazor. 2025. Collaborative Co-Design Practices for Supporting Synthetic Data Generation in Large Language Models: A Pilot Study. In Proceedings of the Fourth Workshop on Bridging Human-Computer Interaction and Natural Language Processing (HCI+NLP), pages 129–147, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- Collaborative Co-Design Practices for Supporting Synthetic Data Generation in Large Language Models: A Pilot Study (Candello et al., HCINLP 2025)
- PDF:
- https://preview.aclanthology.org/ingest-emnlp/2025.hcinlp-1.11.pdf