Cross-domain Analysis on Japanese Legal Pretrained Language Models

Keisuke Miyazaki, Hiroaki Yamada, Takenobu Tokunaga


Abstract
This paper investigates pretrained language models (PLMs) specialised for the Japanese legal domain. We create PLMs using different pretraining strategies and investigate their performance across multiple domains. Our findings are: (i) a PLM built with general-domain data can be improved by further pretraining with domain-specific data; (ii) domain-specific PLMs can learn domain-specific and general word meanings simultaneously and can distinguish between them; (iii) domain-specific PLMs work better on their target domain, yet they retain the information learnt in the original PLM even after being further pretrained with domain-specific data; (iv) PLMs sequentially pretrained with corpora from different domains show high performance on the domains learnt later.
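The central technique examined in the paper is further (domain-adaptive) pretraining of a general-domain PLM on legal text. As an illustrative sketch only, not the authors' actual configuration, the snippet below continues masked-language-model pretraining of a publicly available general-domain Japanese BERT on a legal corpus using Hugging Face Transformers and Datasets. The checkpoint name, the corpus file `legal_corpus.txt`, and all hyperparameters are placeholder assumptions.

```python
# Minimal sketch of further (domain-adaptive) pretraining with an MLM objective.
# Checkpoint, corpus path, and hyperparameters are placeholders, not the paper's setup.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder general-domain Japanese PLM (its tokenizer needs fugashi + a MeCab dictionary installed).
checkpoint = "cl-tohoku/bert-base-japanese"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Hypothetical legal-domain raw text, one document per line.
dataset = load_dataset("text", data_files={"train": "legal_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Masked-language-modelling collator: randomly masks 15% of tokens per batch.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="legal-bert-ja",
    per_device_train_batch_size=16,
    num_train_epochs=1,
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()  # continues pretraining; the adapted model can then be fine-tuned per domain
```

The same recipe, applied with corpora from different domains in sequence, corresponds to the sequential-pretraining setting the abstract refers to.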
Anthology ID:
2022.findings-aacl.26
Volume:
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022
Month:
November
Year:
2022
Address:
Online only
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
274–281
URL:
https://aclanthology.org/2022.findings-aacl.26
Cite (ACL):
Keisuke Miyazaki, Hiroaki Yamada, and Takenobu Tokunaga. 2022. Cross-domain Analysis on Japanese Legal Pretrained Language Models. In Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022, pages 274–281, Online only. Association for Computational Linguistics.
Cite (Informal):
Cross-domain Analysis on Japanese Legal Pretrained Language Models (Miyazaki et al., Findings 2022)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2022.findings-aacl.26.pdf