DriftWatch: A Tool that Automatically Detects Data Drift and Extracts Representative Examples Affected by Drift
Myeongjun Jang, Antonios Georgiadis, Yiyun Zhao, Fran Silavong
Abstract
Data drift, which denotes a misalignment between the distribution of reference (i.e., training) and production data, constitutes a significant challenge for AI applications, as it undermines the generalisation capacity of machine learning (ML) models. Therefore, it is imperative to proactively identify data drift before users meet with performance degradation. Moreover, to ensure the successful execution of AI services, endeavours should be directed not only toward detecting the occurrence of drift but also toward effectively addressing this challenge. In this work, we introduce a tool designed to detect data drift in text data. In addition, we propose an unsupervised sampling technique for extracting representative examples from drifted instances. This approach bestows a practical advantage by significantly reducing expenses associated with annotating the labels for drifted instances, an essential prerequisite for retraining the model to sustain its performance on production data.
- Anthology ID:
- 2024.naacl-industry.28
- Volume:
- Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track)
- Month:
- June
- Year:
- 2024
- Address:
- Mexico City, Mexico
- Editors:
- Yi Yang, Aida Davani, Avi Sil, Anoop Kumar
- Venue:
- NAACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 335–346
- URL:
- https://aclanthology.org/2024.naacl-industry.28
- Cite (ACL):
- Myeongjun Jang, Antonios Georgiadis, Yiyun Zhao, and Fran Silavong. 2024. DriftWatch: A Tool that Automatically Detects Data Drift and Extracts Representative Examples Affected by Drift. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track), pages 335–346, Mexico City, Mexico. Association for Computational Linguistics.
- Cite (Informal):
- DriftWatch: A Tool that Automatically Detects Data Drift and Extracts Representative Examples Affected by Drift (Jang et al., NAACL 2024)
- PDF:
- https://preview.aclanthology.org/ingestion-checklist/2024.naacl-industry.28.pdf