CEPOC: The Cambridge Exams Publishing Open Cloze dataset

Mariano Felice, Shiva Taslimipoor, Øistein E. Andersen, Paula Buttery


Abstract
Open cloze tests are a standard type of exercise where examinees must complete a text by filling in the gaps without any given options to choose from. This paper presents the Cambridge Exams Publishing Open Cloze (CEPOC) dataset, a collection of open cloze tests from world-renowned English language proficiency examinations. The tests in CEPOC have been expertly designed and validated using standard principles in language research and assessment. They are prepared for language learners at different proficiency levels and are hence classified into different CEFR levels (A2, B1, B2, C1, C2). This resource can be a valuable testbed for various NLP tasks. We perform a complete set of experiments on three tasks: gap filling, gap prediction, and CEFR text classification. We implement transformer-based systems built on pre-trained language models to model each task and use our dataset as a test set, providing promising benchmark results.
Anthology ID: 2022.lrec-1.456
Volume: Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month: June
Year: 2022
Address: Marseille, France
Venue: LREC
Publisher: European Language Resources Association
Pages: 4285–4290
URL: https://aclanthology.org/2022.lrec-1.456
Cite (ACL): Mariano Felice, Shiva Taslimipoor, Øistein E. Andersen, and Paula Buttery. 2022. CEPOC: The Cambridge Exams Publishing Open Cloze dataset. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4285–4290, Marseille, France. European Language Resources Association.
Cite (Informal): CEPOC: The Cambridge Exams Publishing Open Cloze dataset (Felice et al., LREC 2022)
PDF: https://preview.aclanthology.org/author-url/2022.lrec-1.456.pdf
Code: cambridgealta/cepoc