AlephBERT: Language Model Pre-training and Evaluation from Sub-Word to Sentence Level

Amit Seker; Elron Bandel; Dan Bareket; Idan Brusilovsky; Refael Greenfeld; Reut Tsarfaty

doi:10.18653/v1/2022.acl-long.4

Abstract

Large Pre-trained Language Models (PLMs) have become ubiquitous in the development of language understanding technology and lie at the heart of many artificial intelligence advances. While advances reported for English using PLMs are unprecedented, reported advances using PLMs for Hebrew are few and far between. The problem is twofold. First, so far, Hebrew resources for training large language models are not of the same magnitude as their English counterparts. Second, most benchmarks available to evaluate progress in Hebrew NLP require morphological boundaries which are not available in the output of standard PLMs. In this work we remedy both aspects. We present AlephBERT, a large PLM for Modern Hebrew, trained on a larger vocabulary and a larger dataset than any Hebrew PLM before. Moreover, we introduce a novel neural architecture that recovers the morphological segments encoded in contextualized embedding vectors. Based on this new morphological component we offer an evaluation suite consisting of multiple tasks and benchmarks that cover sentence-level, word-level and sub-word-level analyses. On all tasks, AlephBERT obtains state-of-the-art results beyond contemporary Hebrew baselines. We make our AlephBERT model, the morphological extraction model, and the Hebrew evaluation suite publicly available, for evaluating future Hebrew PLMs.

Anthology ID: 2022.acl-long.4
Volume: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: May
Year: 2022
Address: Dublin, Ireland
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 46–56
URL: https://aclanthology.org/2022.acl-long.4
DOI: 10.18653/v1/2022.acl-long.4
Cite (ACL): Amit Seker, Elron Bandel, Dan Bareket, Idan Brusilovsky, Refael Greenfeld, and Reut Tsarfaty. 2022. AlephBERT: Language Model Pre-training and Evaluation from Sub-Word to Sentence Level. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 46–56, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal): AlephBERT: Language Model Pre-training and Evaluation from Sub-Word to Sentence Level (Seker et al., ACL 2022)
PDF: https://preview.aclanthology.org/ingestion-script-update/2022.acl-long.4.pdf
Video: https://preview.aclanthology.org/ingestion-script-update/2022.acl-long.4.mp4
Data: OSCAR