@inproceedings{nguyen-tuan-nguyen-2020-phobert,
  title     = {{PhoBERT}: Pre-trained language models for {Vietnamese}},
  author    = {Nguyen, Dat Quoc and
               Tuan Nguyen, Anh},
  editor    = {Cohn, Trevor and
               He, Yulan and
               Liu, Yang},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
  month     = nov,
  year      = {2020},
  address   = {Online},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2020.findings-emnlp.92/},
  doi       = {10.18653/v1/2020.findings-emnlp.92},
  pages     = {1037--1042},
  abstract  = {We present PhoBERT with two versions, PhoBERT-base and PhoBERT-large, the first public large-scale monolingual language models pre-trained for Vietnamese. Experimental results show that PhoBERT consistently outperforms the recent best pre-trained multilingual model XLM-R (Conneau et al., 2020) and improves the state-of-the-art in multiple Vietnamese-specific NLP tasks including Part-of-speech tagging, Dependency parsing, Named-entity recognition and Natural language inference. We release PhoBERT to facilitate future research and downstream applications for Vietnamese NLP. Our PhoBERT models are available at \url{https://github.com/VinAIResearch/PhoBERT}},
}
@comment{
  Markdown (Informal):
  [PhoBERT: Pre-trained language models for Vietnamese](https://aclanthology.org/2020.findings-emnlp.92/) (Nguyen & Tuan Nguyen, Findings 2020)
  ACL
}