Ralph Rose


2023

A common way of assessing language learners’ mastery of vocabulary is via multiple-choice cloze (i.e., fill-in-the-blank) questions. However, creating test items can be laborious, both for individual teachers and for large-scale language programs. In this paper, we evaluate a new method for automatically generating these types of questions using large language models (LLMs). The VocaTT (vocabulary teaching and training) engine is written in Python and comprises three basic steps: pre-processing target word lists, generating sentences and candidate word options using GPT, and selecting suitable word options. To test the efficiency of this system, 60 questions were generated targeting academic words. The generated items were reviewed by expert reviewers, who judged the well-formedness of the sentences and word options and added comments to items judged not well-formed. Results showed a 75% rate of well-formedness for sentences and a 66.85% rate for suitable word options. This is a marked improvement over the generator used earlier in our research, which did not take advantage of GPT’s capabilities. Post-hoc qualitative analysis reveals several points for improvement in future work, including cross-referencing part-of-speech tagging, better sentence validation, and improved GPT prompts.
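The three steps named in the abstract can be illustrated with a minimal Python sketch. This is not the VocaTT implementation: the function names, the option-selection rule, and the item format are all hypothetical, and the GPT call is replaced by a stub (`stub_generate`) that returns canned text so the example runs offline.

```python
import re
from typing import Callable, Dict, List, Tuple

# A generator takes a target word and returns (sentence, candidate options).
# In a real system this would wrap an LLM call; here it is an assumption.
Generator = Callable[[str], Tuple[str, List[str]]]


def preprocess_word_list(words: List[str]) -> List[str]:
    """Step 1: normalise case and whitespace, drop blanks and duplicates."""
    seen = set()
    cleaned = []
    for word in words:
        w = word.strip().lower()
        if w and w not in seen:
            seen.add(w)
            cleaned.append(w)
    return cleaned


def select_options(target: str, candidates: List[str], n: int = 3) -> List[str]:
    """Step 3: keep the first n distinct candidates that differ from the target."""
    chosen: List[str] = []
    for cand in candidates:
        c = cand.strip().lower()
        if c and c != target and c not in chosen:
            chosen.append(c)
        if len(chosen) == n:
            break
    return chosen


def build_item(target: str, generate: Generator, n_distractors: int = 3) -> Dict[str, object]:
    """Steps 2 and 3 for one target word: generate, blank the target, pick options."""
    sentence, candidates = generate(target)
    pattern = re.compile(r"\b" + re.escape(target) + r"\b", re.IGNORECASE)
    stem, n_subs = pattern.subn("_____", sentence, count=1)
    if n_subs == 0:
        raise ValueError(f"target {target!r} not found in generated sentence")
    distractors = select_options(target, candidates, n_distractors)
    # Sorted only to keep the sketch deterministic; a real quiz would shuffle.
    return {"stem": stem, "answer": target, "options": sorted(distractors + [target])}


def stub_generate(target: str) -> Tuple[str, List[str]]:
    """Hypothetical stand-in for the LLM call; returns canned text."""
    return (
        f"The committee will {target} the proposal next week.",
        ["Analyze", "derive", target, "derive", "interpret", "allocate"],
    )


for word in preprocess_word_list([" Evaluate ", "evaluate", ""]):
    print(build_item(word, stub_generate))
```

Passing the generator in as a function keeps the selection logic testable without network access. The review findings in the abstract (part-of-speech cross-referencing, sentence validation) would correspond to extra checks inside `select_options` and `build_item`.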

2020

Multiple-choice cloze (fill-in-the-blank) questions are widely used in knowledge testing, particularly for testing vocabulary knowledge. Word Quiz Constructor (WQC) is a Java application designed to produce such test items automatically from the Academic Word List (Coxhead, 2000) using various online and offline resources. The present work evaluates recently added features of WQC to see whether they improve the production quality and well-formedness of vocabulary quiz items over previously implemented features. Results of a production test and a well-formedness survey using Amazon Mechanical Turk show that the newly introduced features (the Linsear Write readability formula and the Google Books NGrams frequency list) significantly improve the production quality of items over the previous features (the Automated Readability Index and a frequency list derived from the British Academic Written English corpus). Items are produced faster and stem sentences are shorter, without any degradation in their well-formedness. Approximately 90% of such items are judged well-formed, surpassing the rate of manually produced items.
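For readers unfamiliar with the readability measures mentioned, the Automated Readability Index is computed as 4.71 × (characters per word) + 0.5 × (words per sentence) − 21.43. The Python sketch below illustrates that formula only; the tokenisation and sentence-splitting rules are simplifying assumptions and may differ from those in WQC, which is a Java application.

```python
import re


def automated_readability_index(text: str) -> float:
    """Automated Readability Index with naive tokenisation (an assumption)."""
    # Words: runs of letters/digits, allowing internal apostrophes or hyphens.
    words = re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", text)
    # Sentences: split on terminal punctuation; abbreviations are not handled.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    # Characters: letters and digits only, so punctuation is excluded.
    chars = sum(ch.isalnum() for w in words for ch in w)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


print(automated_readability_index("The cat sat. The dog ran."))
```

Lower scores indicate easier text, so a quiz generator could use a score like this to reject candidate stem sentences that are too difficult for the target learners.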