Abstract
Most transformer models are trained on English-language corpora drawn from sources such as Wikipedia and Reddit. Although these models are increasingly applied in specialized domains such as scientific peer review, law, and healthcare, their performance there is subpar because their training data lacks domain-specific information. One way to improve performance on a specialized domain is to collect labeled data from that domain and fine-tune the transformer model of choice on it; however, collecting large amounts of labeled data requires significant manual effort. An alternative is to first pre-train the transformer model on unlabeled domain-specific data and then fine-tune it on labeled data. We evaluate how transformer models perform when fine-tuned on labeled data after such initial pre-training on unlabeled data, and compare them against a transformer model fine-tuned on labeled data alone. We perform this comparison on a dataset of scientific peer reviews provided by the organizers of the PragTag-2023 Shared Task and observe that a transformer model fine-tuned on labeled data after initial pre-training on unlabeled data with Masked Language Modelling outperforms one fine-tuned on labeled data without such pre-training.
- Anthology ID: 2023.argmining-1.26
- Volume: Proceedings of the 10th Workshop on Argument Mining
- Month: December
- Year: 2023
- Address: Singapore
- Editors: Milad Alshomary, Chung-Chi Chen, Smaranda Muresan, Joonsuk Park, Julia Romberg
- Venues: ArgMining | WS
- Publisher: Association for Computational Linguistics
- Pages: 218–222
- URL: https://aclanthology.org/2023.argmining-1.26
- DOI: 10.18653/v1/2023.argmining-1.26
- Cite (ACL): Kunal Suri, Prakhar Mishra, and Albert Nanda. 2023. SuryaKiran at PragTag 2023 - Benchmarking Domain Adaptation using Masked Language Modeling in Natural Language Processing For Specialized Data. In Proceedings of the 10th Workshop on Argument Mining, pages 218–222, Singapore. Association for Computational Linguistics.
- Cite (Informal): SuryaKiran at PragTag 2023 - Benchmarking Domain Adaptation using Masked Language Modeling in Natural Language Processing For Specialized Data (Suri et al., ArgMining-WS 2023)
- PDF: https://aclanthology.org/2023.argmining-1.26.pdf
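
As an illustration of the two-stage recipe described in the abstract, the sketch below uses the Hugging Face Transformers and Datasets libraries: an encoder first receives continued pre-training with Masked Language Modelling on unlabeled peer-review text, and the adapted weights are then fine-tuned for sequence classification on labeled data. The base checkpoint, label count, hyperparameters, output paths, and toy sentences are assumptions for illustration only, not the authors' released code.

```python
# Minimal sketch of the two-stage recipe from the abstract:
# (1) continued pre-training with Masked Language Modelling on unlabeled
#     domain text, (2) fine-tuning the adapted encoder on labeled data.
# Model name, label count, hyperparameters, and toy data are assumptions.
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "roberta-base"  # assumption: any encoder checkpoint could be used
NUM_LABELS = 6               # assumption: one id per pragmatic tag in the labeled data

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

# --- Stage 1: domain-adaptive pre-training with MLM on unlabeled reviews ---
unlabeled = Dataset.from_dict({"text": [
    "The experimental section lacks an ablation study.",      # toy stand-ins for
    "Results are promising but the baselines are outdated.",  # unlabeled peer reviews
]}).map(tokenize, batched=True, remove_columns=["text"])

mlm_model = AutoModelForMaskedLM.from_pretrained(BASE_MODEL)
Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="mlm-out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=unlabeled,
    # Randomly masks 15% of tokens and builds the MLM labels on the fly.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
).train()
mlm_model.save_pretrained("domain-adapted")   # encoder weights adapted to review text
tokenizer.save_pretrained("domain-adapted")

# --- Stage 2: fine-tune the adapted encoder on the labeled task ---
labeled = Dataset.from_dict({
    "text": ["Please add error bars to Figure 2.", "The paper is clearly written."],
    "labels": [1, 0],                          # toy pragmatic-tag ids
}).map(tokenize, batched=True, remove_columns=["text"])

# A fresh classification head is attached on top of the adapted encoder.
clf_model = AutoModelForSequenceClassification.from_pretrained(
    "domain-adapted", num_labels=NUM_LABELS)
Trainer(
    model=clf_model,
    args=TrainingArguments(output_dir="clf-out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=labeled,
    data_collator=DataCollatorWithPadding(tokenizer),
).train()
```

Reusing the same tokenizer across both stages keeps the vocabulary consistent, and loading the adapted checkpoint with AutoModelForSequenceClassification attaches a freshly initialized classification head on top of the domain-adapted encoder.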