Automating Dataset Production Using Generative Text and Image Models

Christopher Thierauf; Mitchell Abrams; Matthias Scheutz

Abstract

Practical and ethical dataset collection remains a challenge blocking many empirical methods in natural language processing, resulting in a lack of benchmarks or data on which to test hypotheses. We address part of this problem by presenting a pipeline that reduces the research burden of producing image and text datasets where none exist. Our approach, with accompanying software tools, involves (1) generating text with LLMs; (2) creating accompanying image vignettes with text-to-image transformers; and (3) low-cost human validation. Drawing on existing literature that has struggled with quantitative evaluation (due to the difficulty of data collection), we present the creation of three relevant datasets and conduct a user study demonstrating that this approach can help researchers obtain previously challenging datasets. We provide sample data generated with this technique and the source code used to produce it, and we discuss applicability and limitations.

Anthology ID: 2024.lrec-main.179
Volume: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month: May
Year: 2024
Address: Torino, Italia
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues: LREC | COLING
Publisher: ELRA and ICCL
Pages: 1988–1995
URL: https://preview.aclanthology.org/jlcl-multiple-ingestion/2024.lrec-main.179/
Cite (ACL): Christopher Thierauf, Mitchell Abrams, and Matthias Scheutz. 2024. Automating Dataset Production Using Generative Text and Image Models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 1988–1995, Torino, Italia. ELRA and ICCL.
Cite (Informal): Automating Dataset Production Using Generative Text and Image Models (Thierauf et al., LREC-COLING 2024)
PDF: https://preview.aclanthology.org/jlcl-multiple-ingestion/2024.lrec-main.179.pdf
