Domain-Specific Japanese ELECTRA Model Using a Small Corpus

Youki Itoh; Hiroyuki Shinnou

Abstract

Domain shift, in which differences between source-domain and target-domain data degrade accuracy, has recently become a serious issue when machine learning methods are applied to natural language processing tasks. Pretrained models such as BERT (Bidirectional Encoder Representations from Transformers) can address this issue through additional pretraining and fine-tuning on a target-domain corpus. However, additional pretraining of BERT is difficult because it requires significant computing resources. ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) replaces the masked language modeling objective used in BERT pretraining with a method called replaced token detection, which improves computational efficiency and makes additional pretraining practical. Herein, we propose a method that addresses the computational cost of pretrained models under domain shift: we construct an ELECTRA pretraining model on a Japanese dataset and then additionally pretrain it on a target-domain corpus for the downstream task. We constructed a Japanese ELECTRA pretraining model and conducted experiments on a document classification task using Japanese news articles. The results show that even a model smaller than the pretrained model performs equally well.
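
The replaced token detection objective mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `make_rtd_example`, the toy vocabulary, and the uniform random sampler standing in for ELECTRA's generator network are all assumptions made to keep the example self-contained. It shows only how the discriminator's per-token targets are formed: some input tokens are swapped for sampled ones, and every position is labeled as original (0) or replaced (1).

```python
import random


def make_rtd_example(tokens, vocab, mask_prob=0.15, seed=0):
    """Build a toy replaced-token-detection example.

    A uniform random sampler stands in for ELECTRA's generator network
    (an assumption made only to keep this sketch self-contained).
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            new_tok = rng.choice(vocab)  # stand-in for a generator sample
        else:
            new_tok = tok
        corrupted.append(new_tok)
        # Label 1 only if the token actually changed; a sample that happens
        # to equal the original token counts as "original" (label 0).
        labels.append(int(new_tok != tok))
    return corrupted, labels


# Hypothetical toy data, for illustration only.
tokens = ["the", "chef", "cooked", "the", "meal"]
vocab = ["the", "chef", "cooked", "ate", "meal", "a"]
corrupted, labels = make_rtd_example(tokens, vocab, mask_prob=0.3, seed=1)
print(corrupted, labels)
```

Because every position receives a label, the discriminator gets a training signal from all input tokens, whereas masked language modeling predicts only the masked subset. This is the usual explanation for the efficiency gain the abstract refers to.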

Anthology ID: 2021.ranlp-1.72
Volume: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Month: September
Year: 2021
Address: Held Online
Venue: RANLP
Publisher: INCOMA Ltd.
Pages: 640–646
URL: https://aclanthology.org/2021.ranlp-1.72
Cite (ACL): Youki Itoh and Hiroyuki Shinnou. 2021. Domain-Specific Japanese ELECTRA Model Using a Small Corpus. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 640–646, Held Online. INCOMA Ltd.
Cite (Informal): Domain-Specific Japanese ELECTRA Model Using a Small Corpus (Itoh & Shinnou, RANLP 2021)
PDF: https://preview.aclanthology.org/update-css-js/2021.ranlp-1.72.pdf
