@inproceedings{palavalli-etal-2024-taxonomy,
    title = "A Taxonomy for Data Contamination in Large Language Models",
    author = "Palavalli, Medha  and
      Bertsch, Amanda  and
      Gormley, Matthew",
    editor = "Sainz, Oscar  and
      Garc{\'i}a Ferrero, Iker  and
      Agirre, Eneko  and
      Ander Campos, Jon  and
      Jacovi, Alon  and
      Elazar, Yanai  and
      Goldberg, Yoav",
    booktitle = "Proceedings of the 1st Workshop on Data Contamination (CONDA)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.conda-1.3/",
    doi = "10.18653/v1/2024.conda-1.3",
    pages = "22--40",
    abstract = "Large language models pretrained on extensive web corpora demonstrate remarkable performance across a wide range of downstream tasks. However, a growing concern is data contamination, where evaluation datasets may unintentionally be contained in the pretraining corpus, inflating model performance. Decontamination, the process of detecting and removing such data, is a potential solution; yet these contaminants may originate from altered versions of the test set, evading detection during decontamination. How different types of contamination impact the performance of language models on downstream tasks is not fully understood. We present a taxonomy that categorizes the various types of contamination encountered by LLMs during the pretraining phase and identify which types pose the highest risk. We analyze the impact of contamination on two key NLP tasks{---}summarization and question answering{---}revealing how different types of contamination influence task performance during evaluation."
}