Abstract
While there has been a surge of large language models for Norwegian in recent years, we lack any tool to evaluate their understanding of grammaticality. We present two new Norwegian datasets for this task. NoCoLA-class is a supervised binary classification task where the goal is to discriminate between acceptable and non-acceptable sentences. On the other hand, NoCoLA-zero is a purely diagnostic task for evaluating the grammatical judgement of a language model in a completely zero-shot manner, i.e. without any further training. In this paper, we describe both datasets in detail, show how to use them for different flavors of language models, and conduct a comparative study of the existing Norwegian language models.

- Anthology ID:
- 2023.nodalida-1.60
- Volume:
- Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
- Month:
- May
- Year:
- 2023
- Address:
- Tórshavn, Faroe Islands
- Editors:
- Tanel Alumäe, Mark Fishel
- Venue:
- NoDaLiDa
- Publisher:
- University of Tartu Library
- Pages:
- 610–617
- URL:
- https://preview.aclanthology.org/icon-24-ingestion/2023.nodalida-1.60/
- Cite (ACL):
- Matias Jentoft and David Samuel. 2023. NoCoLA: The Norwegian Corpus of Linguistic Acceptability. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 610–617, Tórshavn, Faroe Islands. University of Tartu Library.
- Cite (Informal):
- NoCoLA: The Norwegian Corpus of Linguistic Acceptability (Jentoft & Samuel, NoDaLiDa 2023)
- PDF:
- https://preview.aclanthology.org/icon-24-ingestion/2023.nodalida-1.60.pdf