A Data-centric Framework for Improving Domain-specific Machine Reading Comprehension Datasets
Iva Bojic, Josef Halim, Verena Suharman, Sreeja Tar, Qi Chwen Ong, Duy Phung, Mathieu Ravaut, Shafiq Joty, Josip Car
Abstract
Low-quality data can cause downstream problems in high-stakes applications. The data-centric approach emphasizes improving dataset quality to enhance model performance. High-quality datasets are needed for training general-purpose Large Language Models (LLMs) as well as domain-specific models. Domain-specific datasets are usually small because it is costly to engage a large number of domain experts to create them. Thus, it is vital to ensure high-quality domain-specific training data. In this paper, we propose a framework for enhancing the data quality of original datasets (code and dataset are available at https://github.com/IvaBojic/framework). We applied the proposed framework to four biomedical datasets and showed a relative improvement of up to 33% and 40% for fine-tuning retrieval and reader models, respectively, on the BioASQ dataset when using back translation to enhance the original dataset quality.
- Anthology ID:
- 2023.insights-1.3
- Volume:
- Proceedings of the Fourth Workshop on Insights from Negative Results in NLP
- Month:
- May
- Year:
- 2023
- Address:
- Dubrovnik, Croatia
- Editors:
- Shabnam Tafreshi, Arjun Akula, João Sedoc, Aleksandr Drozd, Anna Rogers, Anna Rumshisky
- Venues:
- insights | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 19–32
- URL:
- https://preview.aclanthology.org/add_missing_videos/2023.insights-1.3/
- DOI:
- 10.18653/v1/2023.insights-1.3
- Cite (ACL):
- Iva Bojic, Josef Halim, Verena Suharman, Sreeja Tar, Qi Chwen Ong, Duy Phung, Mathieu Ravaut, Shafiq Joty, and Josip Car. 2023. A Data-centric Framework for Improving Domain-specific Machine Reading Comprehension Datasets. In Proceedings of the Fourth Workshop on Insights from Negative Results in NLP, pages 19–32, Dubrovnik, Croatia. Association for Computational Linguistics.
- Cite (Informal):
- A Data-centric Framework for Improving Domain-specific Machine Reading Comprehension Datasets (Bojic et al., insights 2023)
- PDF:
- https://preview.aclanthology.org/add_missing_videos/2023.insights-1.3.pdf