Abstract
The quality of artificially generated texts has improved considerably with the advent of transformers. The question naturally arises of using these models to generate training data for supervised learning tasks, especially when the original language resource cannot be distributed or is small. In this article, this question is explored under three aspects: (i) are artificial data an efficient complement? (ii) can they replace the original data when those are not available or cannot be distributed for confidentiality reasons? (iii) can they improve the explainability of classifiers? Different experiments are carried out on classification tasks, namely sentiment analysis on product reviews and Fake News detection, using data artificially generated by fine-tuned GPT-2 models. The results show that such artificial data can be used to a certain extent, but require pre-processing to significantly improve performance. We also show that bag-of-words approaches benefit the most from such data augmentation.
- Anthology ID:
- 2022.lrec-1.453
- Volume:
- Proceedings of the Thirteenth Language Resources and Evaluation Conference
- Month:
- June
- Year:
- 2022
- Address:
- Marseille, France
- Venue:
- LREC
- Publisher:
- European Language Resources Association
- Pages:
- 4260–4269
- URL:
- https://aclanthology.org/2022.lrec-1.453
- Cite (ACL):
- Vincent Claveau, Antoine Chaffin, and Ewa Kijak. 2022. Generating Artificial Texts as Substitution or Complement of Training Data. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4260–4269, Marseille, France. European Language Resources Association.
- Cite (Informal):
- Generating Artificial Texts as Substitution or Complement of Training Data (Claveau et al., LREC 2022)
- PDF:
- https://preview.aclanthology.org/remove-xml-comments/2022.lrec-1.453.pdf
- Data
- FLUE