Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)
Ekaterina Kochmar | Jill Burstein | Andrea Horbach | Ronja Laarmann-Quante | Nitin Madnani | Anaïs Tack | Victoria Yaneva | Zheng Yuan | Torsten Zesch
LFTK: Handcrafted Features in Computational Linguistics
Bruce W. Lee | Jason Lee
Past research has identified a rich set of handcrafted linguistic features that can potentially assist various tasks. However, their extensive number makes it difficult to effectively select and utilize existing handcrafted features. Coupled with the problem of inconsistent implementation across research works, there has been no categorization scheme or generally-accepted feature names. This creates unwanted confusion. Also, no actively-maintained open-source library extracts a wide variety of handcrafted features. The current handcrafted feature extraction practices have several inefficiencies, and a researcher often has to build such an extraction system from the ground up. We collect and categorize more than 220 popular handcrafted features grounded on past literature. Then, we conduct a correlation analysis study on several task-specific datasets and report the potential use cases of each feature. Lastly, we devise a multilingual handcrafted linguistic feature extraction system in a systematically expandable manner. We open-source our system to give the community a rich set of pre-implemented handcrafted features.
Improving Mathematics Tutoring With A Code Scratchpad
Shriyash Upadhyay | Etan Ginsberg | Chris Callison-Burch
Large language models can solve reasoning tasks (like math problems) more effectively when they are allowed to generate rationales. However, a good tutoring system should not just generate solutions, but should also generate explanations and should be able to correct and guide students. We show that providing a code scratchpad improves performance on each tutoring step with a grade-school mathematics dataset. On these tutoring tasks, GPT-3 models provided with a code scratchpad significantly outperform those given only a language scratchpad (77.7% vs 48.7% cumulative accuracy).
A Transfer Learning Pipeline for Educational Resource Discovery with Application in Survey Generation
Irene Li | Thomas George | Alex Fabbri | Tammy Liao | Benjamin Chen | Rina Kawamura | Richard Zhou | Vanessa Yan | Swapnil Hingmire | Dragomir Radev
Effective human learning depends on a wide selection of educational materials that align with the learner’s current understanding of the topic. While the Internet has revolutionized human learning or education, a substantial resource accessibility barrier still exists. Namely, the excess of online information can make it challenging to navigate and discover high-quality learning materials in a given subject area. In this paper, we propose an automatic pipeline for building an educational resource discovery system for new domains. The pipeline consists of three main steps: resource searching, feature extraction, and resource classification. We first collect frequent queries from a set of seed documents, and search the web with these queries to obtain candidate resources such as lecture slides and introductory blog posts. Then, we process these resources for BERT-based features and meta-features. Next, we train a tree-based classifier to decide whether they are suitable learning materials. The pipeline achieves F1 scores of 0.94 and 0.82 when evaluated on two similar but novel domains. Finally, we demonstrate how this pipeline can benefit two applications: prerequisite chain learning and leading paragraph generation for surveys. We also release a corpus of 39,728 manually labeled web resources and 659 queries from NLP, Computer Vision (CV), and Statistics (STATS).
Using Learning Analytics for Adaptive Exercise Generation
Tanja Heck | Detmar Meurers
Single Choice exercises constitute a central exercise type for language learning in a learner’s progression from mere implicit exposure through input enhancement to productive language use in open exercises. Distractors that support learning in the individual zone of proximal development should not be derived from static analyses of learner corpora, but rely on dynamic learning analytics based on half-open exercises. We demonstrate how a system’s error diagnosis module can be re-used for automatic and dynamic generation and adaptation of distractors, as well as to inform exercise generation in terms of relevant learning goals and reasonable chunking in Jumbled Sentences exercises.
Reviewriter: AI-Generated Instructions For Peer Review Writing
Xiaotian Su | Thiemo Wambsganss | Roman Rietsche | Seyed Parsa Neshaei | Tanja Käser
Large Language Models (LLMs) offer novel opportunities for educational applications that have the potential to transform traditional learning for students. Despite AI-enhanced applications having the potential to provide personalized learning experiences, more studies are needed on the design of generative AI systems and evidence for using them in real educational settings. In this paper, we design, implement and evaluate Reviewriter, a novel tool to provide students with AI-generated instructions for writing peer reviews in German. Our study identifies three key aspects: a) we provide insights into student needs when writing peer reviews with generative models, which we then use to develop a novel system to provide adaptive instructions, b) we fine-tune three German language models on a selected corpus of 11,925 student-written peer review texts in German and choose German-GPT2 based on quantitative measures and human evaluation, and c) we evaluate our tool with fourteen students, revealing positive technology acceptance based on quantitative measures. Additionally, the qualitative feedback presents the benefits and limitations of generative AI in peer review writing.
Towards L2-friendly pipelines for learner corpora: A case of written production by L2-Korean learners
Hakyung Sung | Gyu-Ho Shin
We introduce the Korean-Learner-Morpheme (KLM) corpus, a manually annotated dataset consisting of 129,784 morphemes from second language (L2) learners of Korean, featuring morpheme tokenization and part-of-speech (POS) tagging. We evaluate the performance of four Korean morphological analyzers in tokenization and POS tagging on the L2-Korean corpus. Results highlight the analyzers’ reduced performance on L2 data, indicating the limitation of advanced deep-learning models when dealing with L2-Korean corpora. We further show that fine-tuning one of the models with the KLM corpus improves its tokenization and POS-tagging accuracy on the L2-Korean dataset.
ChatBack: Investigating Methods of Providing Grammatical Error Feedback in a GUI-based Language Learning Chatbot
Kai-Hui Liang | Sam Davidson | Xun Yuan | Shehan Panditharatne | Chun-Yen Chen | Ryan Shea | Derek Pham | Yinghua Tan | Erik Voss | Luke Fryer
The increasing use of AI chatbots as conversation partners for second-language learners highlights the importance of providing effective feedback. To ensure a successful learning experience, it is essential for researchers and practitioners to understand the optimal timing, methods of delivery, and types of feedback that are most beneficial to learners. Synchronous grammar corrective feedback (CF) has been shown to be more effective than asynchronous methods in online writing tasks. Additionally, self-correction by language learners has proven more beneficial than teacher-provided correction, particularly for spoken language skills and non-novice learners. However, existing language-learning AI chatbots often lack synchronous CF and self-correction capabilities. To address this, we propose a synchronous conversational corrective feedback (CCF) method, which allows self-correction and provides metalinguistic explanations (ME). Our study suggests that in chatbot-driven language-learning tools, corrective feedback is more effectively delivered through means other than the social chatbot, such as a GUI interface. Furthermore, we found that guided self-correction offers a superior learning experience compared to providing explicit corrections, particularly for learners with high learning motivation or lower linguistic ability.
Enhancing Video-based Learning Using Knowledge Tracing: Personalizing Students’ Learning Experience with ORBITS
Shady Shehata | David Santandreu Calonge | Philip Purnell | Mark Thompson
As the world regains its footing following the COVID-19 pandemic, academia is striving to consolidate the gains made in students’ education experience. New technologies such as video-based learning have shown some early improvement in student learning and engagement. In this paper, we present the ORBITS predictive engine at YOURIKA, a video-based student support platform powered by knowledge tracing. In an exploratory case study of one master’s level Speech Processing course at the Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) in Abu Dhabi, half the students used the system while the other half did not. Student qualitative feedback was universally positive and compared the system favorably against currently available methods. These findings support the use of artificial intelligence techniques to improve the student learning experience.
Enhancing Human Summaries for Question-Answer Generation in Education
Hannah Gonzalez | Liam Dugan | Eleni Miltsakaki | Zhiqi Cui | Jiaxuan Ren | Bryan Li | Shriyash Upadhyay | Etan Ginsberg | Chris Callison-Burch
We address the problem of generating high-quality question-answer pairs for educational materials. Previous work on this problem showed that using summaries as input improves the quality of question generation (QG) over original textbook text and that human-written summaries result in higher quality QG than automatic summaries. In this paper, a) we show that advances in Large Language Models (LLMs) are not yet sufficient to generate quality summaries for QG and b) we introduce a new methodology for enhancing bullet point student notes into fully fledged summaries and find that our methodology yields higher quality QG. We conducted a large-scale human annotation study of generated question-answer pairs for the evaluation of our methodology. In order to aid in future research, we release a new dataset of 9.2K human annotations of generated questions.
Difficulty-Controllable Neural Question Generation for Reading Comprehension using Item Response Theory
Masaki Uto | Yuto Tomikawa | Ayaka Suzuki
Question generation (QG) for reading comprehension, a technology for automatically generating questions related to given reading passages, has been used in various applications, including in education. Recently, QG methods based on deep neural networks have succeeded in generating fluent questions that are pertinent to given reading passages. One example of how QG can be applied in education is a reading tutor that automatically offers reading comprehension questions related to various reading materials. In such an application, QG methods should provide questions with difficulty levels appropriate for each learner’s reading ability in order to improve learning efficiency. Several difficulty-controllable QG methods have been proposed for doing so. However, conventional methods focus only on generating questions and cannot generate answers to them. Furthermore, they ignore the relation between question difficulty and learner ability, making it hard to determine an appropriate difficulty for each learner. To resolve these problems, we propose a new method for generating question–answer pairs that considers their difficulty, estimated using item response theory. The proposed difficulty-controllable generation is realized by extending two pre-trained transformer models: BERT and GPT-2.
Evaluating Classroom Potential for Card-it: Digital Flashcards for Studying and Learning Italian Morphology
Mariana Shimabukuro | Jessica Zipf | Shawn Yama | Christopher Collins
This paper presents Card-it, a web-based application for learning Italian verb conjugation. Card-it integrates a large-scale finite-state morphological (FSM) analyzer and a flashcard application as a user-friendly way for learners to utilize the analyzer. While Card-it can be used by individual learners, to support classroom adoption, we implemented simple classroom management functionalities such as sharing flashcards with a class and tracking students’ progression. We evaluated Card-it with teachers of Italian. Card-it was reported as engaging and supportive, especially by featuring two different quiz types combined with a verb form look-up feature. Teachers were optimistic about the potential of Card-it as a classroom supplementary tool for learners of Italian as L2. Future work includes sample sentences and a complete learner evaluation.
Scalable and Explainable Automated Scoring for Open-Ended Constructed Response Math Word Problems
Scott Hellman | Alejandro Andrade | Kyle Habermehl
Open-ended constructed response math word problems (“math plus text”, or MPT) are a powerful tool in the assessment of students’ abilities to engage in mathematical reasoning and creative thinking. Such problems ask the student to compute a value or construct an expression and then explain, potentially in prose, what steps they took and why they took them. MPT items can be scored against highly structured rubrics, and we develop a novel technique for the automated scoring of MPT items that leverages these rubrics to provide explainable scoring. We show that our approach can be trained automatically and performs well on a large dataset of 34,417 responses across 14 MPT items.
Gender-Inclusive Grammatical Error Correction through Augmentation
Gunnar Lund | Kostiantyn Omelianchuk | Igor Samokhin
In this paper we show that GEC systems display gender bias related to the use of masculine and feminine terms and the gender-neutral singular “they”. We develop parallel datasets of texts with masculine and feminine terms, and singular “they”, and use them to quantify gender bias in three competitive GEC systems. We contribute a novel data augmentation technique for singular “they” leveraging linguistic insights about its distribution relative to plural “they”. We demonstrate that both this data augmentation technique and a refinement of a similar augmentation technique for masculine and feminine terms can generate training data that reduces bias in GEC systems, especially with respect to singular “they” while maintaining the same level of quality.
ReadAlong Studio Web Interface for Digital Interactive Storytelling
Aidan Pine | David Huggins-Daines | Eric Joanis | Patrick Littell | Marc Tessier | Delasie Torkornoo | Rebecca Knowles | Roland Kuhn | Delaney Lothian
We develop an interactive web-based user interface for performing text-speech alignment and creating digital interactive “read-along” audio books that highlight words as they are spoken and allow users to replay individual words when clicked. We build on an existing Python library for zero-shot multilingual text-speech alignment (Littell et al., 2022), extend it by exposing its functionality through a RESTful API, and rewrite the underlying speech recognition engine to run in the browser. The ReadAlong Studio Web App is open-source, user-friendly, prioritizes privacy and data sovereignty, allows for a variety of standard export formats, and is designed to work for the majority of the world’s languages.
Labels are not necessary: Assessing peer-review helpfulness using domain adaptation based on self-training
Chengyuan Liu | Divyang Doshi | Muskaan Bhargava | Ruixuan Shang | Jialin Cui | Dongkuan Xu | Edward Gehringer
A peer-assessment system allows students to provide feedback on each other’s work. An effective peer assessment system urgently requires helpful reviews to facilitate students to make improvements and progress. Automated evaluation of review helpfulness, with the help of deep learning models and natural language processing techniques, gains much interest in the field of peer assessment. However, collecting labeled data with the “helpfulness” tag to build these prediction models remains challenging. A straightforward solution would be using a supervised learning algorithm to train a prediction model on a similar domain and apply it to our peer review domain for inference. But naively doing so can degrade the model performance in the presence of the distributional gap between domains. Such a distributional gap can be effectively addressed by Domain Adaptation (DA). Self-training has recently been shown as a powerful branch of DA to address the distributional gap. The first goal of this study is to evaluate the performance of self-training-based DA in predicting the helpfulness of peer reviews as well as the ability to overcome the distributional gap. Our second goal is to propose an advanced self-training framework to overcome the weakness of the existing self-training by tailoring knowledge distillation and noise injection, to further improve the model performance and better address the distributional gap.
Generating Dialog Responses with Specified Grammatical Items for Second Language Learning
Yuki Okano | Kotaro Funakoshi | Ryo Nagata | Manabu Okumura
This paper proposes a new second language learning task of generating a response including specified grammatical items. We consider two approaches: 1) fine-tuning a pre-trained language model (DialoGPT) by reinforcement learning and 2) providing a few-shot prompt to a large language model (GPT-3). For reinforcement learning, we examine combinations of three reward functions that consider grammatical items, diversity, and fluency. Our experiments confirm that both approaches can generate responses including the specified grammatical items and that it is crucial to consider fluency rather than diversity as the reward function.
UKP-SQuARE: An Interactive Tool for Teaching Question Answering
Haishuo Fang | Haritz Puerto | Iryna Gurevych
The exponential growth of question answering (QA) has made it an indispensable topic in any Natural Language Processing (NLP) course. Additionally, the breadth of QA derived from this exponential growth makes it an ideal scenario for teaching related NLP topics such as information retrieval, explainability, and adversarial attacks among others. In this paper, we introduce UKP-SQuARE as a platform for QA education. This platform provides an interactive environment where students can run, compare, and analyze various QA models from different perspectives, such as general behavior, explainability, and robustness. Therefore, students can get a first-hand experience in different QA techniques during the class. Thanks to this, we propose a learner-centered approach for QA education in which students proactively learn theoretical concepts and acquire problem-solving skills through interactive exploration, experimentation, and practical assignments, rather than solely relying on traditional lectures. To evaluate the effectiveness of UKP-SQuARE in teaching scenarios, we adopted it in a postgraduate NLP course and surveyed the students after the course. Their positive feedback shows the platform’s effectiveness in their course and invites a wider adoption.
Exploring Effectiveness of GPT-3 in Grammatical Error Correction: A Study on Performance and Controllability in Prompt-Based Methods
Mengsay Loem | Masahiro Kaneko | Sho Takase | Naoaki Okazaki
Large-scale pre-trained language models such as GPT-3 have shown remarkable performance across various natural language processing tasks. However, applying prompt-based methods with GPT-3 for Grammatical Error Correction (GEC) tasks and their controllability remains underexplored. Controllability in GEC is crucial for real-world applications, particularly in educational settings, where the ability to tailor feedback according to learner levels and specific error types can significantly enhance the learning process. This paper investigates the performance and controllability of prompt-based methods with GPT-3 for GEC tasks using zero-shot and few-shot settings. We explore the impact of task instructions and examples on GPT-3’s output, focusing on controlling aspects such as minimal edits, fluency edits, and learner levels. Our findings demonstrate that GPT-3 could effectively perform GEC tasks, outperforming existing supervised and unsupervised approaches. We also show that GPT-3 could achieve controllability when appropriate task instructions and examples are given.
A Closer Look at k-Nearest Neighbors Grammatical Error Correction
Justin Vasselli | Taro Watanabe
In various natural language processing tasks, such as named entity recognition and machine translation, example-based approaches have been used to improve performance by leveraging existing knowledge. However, the effectiveness of this approach for Grammatical Error Correction (GEC) is unclear. In this work, we explore how an example-based approach affects the accuracy and interpretability of the output of GEC systems and the trade-offs involved. The approach we investigate has shown great promise in machine translation by using the k-nearest translation examples to improve the results of a pretrained Transformer model. We find that using this technique increases precision by reducing the number of false positives, but recall suffers as the model becomes more conservative overall. Increasing the number of example sentences in the datastore does lead to better-performing systems, but with diminishing returns and a high decoding cost. Synthetic data can be used as examples, but the effectiveness varies depending on the base model. Finally, we find that fine-tuning on a set of data may be more effective than using that data during decoding as examples.
Towards Extracting and Understanding the Implicit Rubrics of Transformer Based Automatic Essay Scoring Models
James Fiacco | David Adamson | Carolyn Rose
By aligning the functional components derived from the activations of transformer models trained for AES with external knowledge such as human-understandable feature groups, the proposed method improves the interpretability of a Longformer Automatic Essay Scoring (AES) system and provides tools for performing such analyses on further neural AES systems. The analysis focuses on models trained to score essays based on organization, main idea, support, and language. The findings provide insights into the models’ decision-making processes, biases, and limitations, contributing to the development of more transparent and reliable AES systems.
Analyzing Bias in Large Language Model Solutions for Assisted Writing Feedback Tools: Lessons from the Feedback Prize Competition Series
Perpetual Baffour | Tor Saxberg | Scott Crossley
This paper analyzes winning solutions from the Feedback Prize competition series hosted from 2021-2022. The competition sought to improve Assisted Writing Feedback Tools (AWFTs) by crowdsourcing Large Language Model (LLM) solutions for evaluating student writing. The winning models are freely available for incorporation into educational applications, but the models need to be assessed for performance and other factors. This study reports the performance accuracy of Feedback Prize-winning models based on demographic factors such as student race/ethnicity, economic disadvantage, and English Language Learner status. Two competitions are analyzed. The first, which focused on identifying discourse elements, demonstrated minimal bias based on students’ demographic factors. However, the second competition, which aimed to predict discourse effectiveness, exhibited moderate bias.
Improving Reading Comprehension Question Generation with Data Augmentation and Overgenerate-and-rank
Nischal Ashok Kumar | Nigel Fernandez | Zichao Wang | Andrew Lan
Reading comprehension is a crucial skill in many aspects of education, including language learning, cognitive development, and fostering early literacy skills in children. Automated answer-aware reading comprehension question generation has significant potential to scale up learner support in educational activities. One key technical challenge in this setting is that there can be multiple questions, sometimes very different from each other, with the same answer; a trained question generation method may not necessarily know which question human educators would prefer. To address this challenge, we propose 1) a data augmentation method that enriches the training dataset with diverse questions given the same context and answer and 2) an overgenerate-and-rank method to select the best question from a pool of candidates. We evaluate our method on the FairytaleQA dataset, showing a 5% absolute improvement in ROUGE-L over the best existing method. We also demonstrate the effectiveness of our method in generating harder, “implicit” questions, where the answers are not contained in the context as text spans.
Assisting Language Learners: Automated Trans-Lingual Definition Generation via Contrastive Prompt Learning
Hengyuan Zhang | Dawei Li | Yanran Li | Chenming Shang | Chufan Shi | Yong Jiang
The standard definition generation task requires automatically producing mono-lingual definitions (e.g., English definitions for English words), but ignores that the generated definitions may also consist of unfamiliar words for language learners. In this work, we propose a novel task of Trans-Lingual Definition Generation (TLDG), which aims to generate definitions in another language, i.e., the native speaker’s language. Initially, we explore the unsupervised manner of this task and build up a simple implementation of fine-tuning the multi-lingual machine translation model. Then, we develop two novel methods, Prompt Combination and Contrastive Prompt Learning, for further enhancing the quality of the generation. Our methods are evaluated against the baseline Pipeline method in both rich- and low-resource settings, and we empirically establish their superiority in generating higher-quality trans-lingual definitions.
Predicting the Quality of Revisions in Argumentative Writing
Zhexiong Liu | Diane Litman | Elaine Wang | Lindsay Matsumura | Richard Correnti
The ability to revise in response to feedback is critical to students’ writing success. In the case of argument writing specifically, identifying whether an argument revision (AR) is successful or not is a complex problem because AR quality is dependent on the overall content of an argument. For example, adding the same evidence sentence could strengthen or weaken existing claims in different argument contexts (ACs). To address this issue, we developed Chain-of-Thought prompts to facilitate ChatGPT-generated ACs for AR quality predictions. The experiments on two corpora, our annotated elementary essays and an existing college essay benchmark, demonstrate the superiority of the proposed ACs over baselines.
Reconciling Adaptivity and Task Orientation in the Student Dashboard of an Intelligent Language Tutoring System
Leona Colling | Tanja Heck | Detmar Meurers
In intelligent language tutoring systems, student dashboards should display the learning progress and performance and support the navigation through the learning content. Designing an interface that transparently offers information on students’ learning in relation to specific learning targets while linking to the overarching functional goal, that motivates and organizes the practice in current foreign language teaching, is challenging. This becomes even more difficult in systems that adaptively expose students to different learning material and individualize system interactions. If such a system is used in an ecologically valid setting of blended learning, this generates additional requirements to incorporate the needs of students and teachers for control and customizability. We present the conceptual design of a student dashboard for a task-based, user-adaptive intelligent language tutoring system intended for use in real-life English classes in secondary schools. We highlight the key challenges and spell out open questions for future research.
GrounDialog: A Dataset for Repair and Grounding in Task-oriented Spoken Dialogues for Language Learning
Xuanming Zhang | Rahul Divekar | Rutuja Ubale | Zhou Yu
Improving conversational proficiency is a key target for students learning a new language. While acquiring conversational proficiency, students must learn the linguistic mechanisms of Repair and Grounding (R&G) to negotiate meaning and find common ground with their interlocutor so conversational breakdowns can be resolved. Task-oriented Spoken Dialogue Systems (SDS) have long been sought as a tool to hone conversational proficiency. However, the R&G patterns for language learners interacting with a task-oriented spoken dialogue system are not reflected explicitly in any existing datasets. Therefore, to move the needle in Spoken Dialogue Systems for language learning, we present GrounDialog: an annotated dataset of spoken conversations where we elicit a rich set of R&G patterns.
SIGHT: A Large Annotated Dataset on Student Insights Gathered from Higher Education Transcripts
Rose Wang | Pawan Wirawarn | Noah Goodman | Dorottya Demszky
Lectures are a learning experience for both students and teachers. Students learn from teachers about the subject material, while teachers learn from students about how to refine their instruction. Unfortunately, online student feedback is unstructured and abundant, making it challenging for teachers to learn and improve. We take a step towards tackling this challenge. First, we contribute a dataset for studying this problem: SIGHT is a large dataset of 288 math lecture transcripts and 15,784 comments collected from the Massachusetts Institute of Technology OpenCourseWare (MIT OCW) YouTube channel. Second, we develop a rubric for categorizing feedback types using qualitative analysis. Qualitative analysis methods are powerful in uncovering domain-specific insights; however, they are costly to apply to large data sources. To overcome this challenge, we propose a set of best practices for using large language models (LLMs) to cheaply classify the comments at scale. We observe a striking correlation between the model’s and humans’ annotations: Categories with consistent human annotations (0.9 inter-rater reliability, IRR) also display higher human-model agreement (0.7), while categories with less consistent human annotations (0.7-0.8 IRR) correspondingly demonstrate lower human-model agreement (0.3-0.5). These techniques uncover useful student feedback from thousands of comments, costing around $0.002 per comment. We conclude by discussing exciting future directions on using online student feedback and improving automated annotation techniques for qualitative research.
Recognizing Learner Handwriting Retaining Orthographic Errors for Enabling Fine-Grained Error Feedback
Christian Gold | Ronja Laarmann-Quante | Torsten Zesch
This paper addresses the problem of providing automatic feedback on orthographic errors in handwritten text. Despite the availability of automatic error detection systems, the practical problem of digitizing the handwriting remains. Current handwriting recognition (HWR) systems produce highly accurate transcriptions but normalize away the very errors that are essential for providing useful feedback, e.g., orthographic errors. Our contribution is twofold: First, we create a comprehensive dataset of handwritten text with transcripts retaining orthographic errors by transcribing 1,350 pages from the German learner dataset FD-LEX. Second, we train a simple HWR system on our dataset, allowing it to transcribe words with orthographic errors. Thereby, we evaluate the effect of different dictionaries on recognition output, highlighting the importance of addressing spelling errors in these dictionaries.
ExASAG: Explainable Framework for Automatic Short Answer Grading
Maximilian Tornqvist | Mosleh Mahamud | Erick Mendez Guzman | Alexandra Farazouli
As in other NLP tasks, Automatic Short Answer Grading (ASAG) systems have evolved from using rule-based and interpretable machine learning models to utilizing deep learning architectures to boost accuracy. Since proper feedback is critical to student assessment, explainability will be crucial for deploying ASAG in real-world applications. This paper proposes a framework to generate explainable outcomes for assessing question-answer pairs of a Data Mining course in a binary manner. Our framework utilizes a fine-tuned Transformer-based classifier and an explainability module using SHAP or Integrated Gradients to generate language explanations for each prediction. We assess the outcome of our framework by calculating accuracy-based metrics for classification performance. Furthermore, we evaluate the quality of the explanations by measuring their agreement with human-annotated justifications using Intersection-Over-Union at a token level to derive a plausibility score. Despite the relatively limited sample, results show that our framework derives explanations that are, to some degree, aligned with domain-expert judgment. Furthermore, both explainability methods perform similarly in their agreement with human-annotated explanations. A natural progression of our work is to analyze the use of our explainable ASAG framework on a larger sample to determine the feasibility of implementing a pilot study in a real-world setting.
You’ve Got a Friend in ... a Language Model? A Comparison of Explanations of Multiple-Choice Items of Reading Comprehension between ChatGPT and Humans
George Duenas | Sergio Jimenez | Geral Mateus Ferro
Creating high-quality multiple-choice items requires careful attention to several factors, including ensuring that there is only one correct option, that options are independent of each other, that there is no overlap between options, and that each option is plausible. This attention is reflected in the explanations provided by human item-writers for each option. This study aimed to compare the creation of explanations of multiple-choice item options for reading comprehension by ChatGPT with those created by humans. We used two context-dependent multiple-choice item sets created based on Evidence-Centered Design. Results indicate that ChatGPT is capable of producing explanations with different types of information that are comparable to those created by humans, so that humans could benefit from the additional information provided to enhance their explanations. We conclude that ChatGPT’s ability to generate explanations for multiple-choice item options in reading comprehension tests is comparable to that of humans.
Automatically Generated Summaries of Video Lectures May Enhance Students’ Learning Experience
Hannah Gonzalez | Jiening Li | Helen Jin | Jiaxuan Ren | Hongyu Zhang | Ayotomiwa Akinyele | Adrian Wang | Eleni Miltsakaki | Ryan Baker | Chris Callison-Burch
We introduce a novel technique for automatically summarizing lecture videos using large language models such as GPT-3 and we present a user study investigating the effects on the studying experience when automatic summaries are added to lecture videos. We test students under different conditions and find that the students who are shown a summary next to a lecture video perform better on quizzes designed to test the course materials than the students who have access only to the video or the summary. Our findings suggest that adding automatic summaries to lecture videos enhances the learning experience. Qualitatively, students preferred summaries when studying under time constraints.
Automated evaluation of written discourse coherence using GPT-4
Ben Naismith | Phoebe Mulcaire | Jill Burstein
The popularization of large language models (LLMs) such as OpenAI’s GPT-3 and GPT-4 has led to numerous innovations in the field of AI in education. With respect to automated writing evaluation (AWE), LLMs have reduced challenges associated with assessing writing quality characteristics that are difficult to identify automatically, such as discourse coherence. In addition, LLMs can provide rationales for their evaluations (ratings), which increases score interpretability and transparency. This paper investigates one approach to producing ratings by training GPT-4 to assess discourse coherence in a manner consistent with expert human raters. The findings of the study suggest that GPT-4 has strong potential to produce discourse coherence ratings that are comparable to human ratings, accompanied by clear rationales. Furthermore, the GPT-4 ratings outperform traditional NLP coherence metrics with respect to agreement with human ratings. These results have implications for advancing AWE technology for learning and assessment.
ALEXSIS+: Improving Substitute Generation and Selection for Lexical Simplification with Information Retrieval
Kai North | Alphaeus Dmonte | Tharindu Ranasinghe | Matthew Shardlow | Marcos Zampieri
Lexical simplification (LS) automatically replaces words that are deemed difficult to understand for a given target population with simpler alternatives, whilst preserving the meaning of the original sentence. The TSAR-2022 shared task on LS provided participants with a multilingual lexical simplification test set. It contained nearly 1,200 complex words in English, Portuguese, and Spanish and presented multiple candidate substitutions for each complex word. The competition did not make training data available; therefore, teams had to use either off-the-shelf pre-trained large language models (LLMs) or out-domain data to develop their LS systems. As such, participants were unable to fully explore the capabilities of LLMs by re-training and/or fine-tuning them on in-domain data. To address this important limitation, we present ALEXSIS+, a multilingual dataset in the aforementioned three languages, and ALEXSIS++, an English monolingual dataset, which together contain more than 50,000 unique sentences retrieved from news corpora and annotated with cosine similarities to the original complex word and sentence. Using these additional contexts, we are able to generate new high-quality candidate substitutions that improve LS performance on the TSAR-2022 test set regardless of the language or model.
Generating Better Items for Cognitive Assessments Using Large Language Models
Antonio Laverghetta Jr. | John Licato
Writing high-quality test questions (items) is critical to building educational measures but has traditionally also been a time-consuming process. One promising avenue for alleviating this is automated item generation, whereby methods from artificial intelligence (AI) are used to generate new items with minimal human intervention. Researchers have explored using large language models (LLMs) to generate new items with equivalent psychometric properties to human-written ones. But can LLMs generate items with improved psychometric properties, even when existing items have poor validity evidence? We investigate this using items from a natural language inference (NLI) dataset. We develop a novel prompting strategy based on selecting items with both the best and worst properties to use in the prompt and use GPT-3 to generate new NLI items. We find that the GPT-3 items show improved psychometric properties in many cases, whilst also possessing good content, convergent and discriminant validity evidence. Collectively, our results demonstrate the potential of employing LLMs to ease the item development process and suggest that the careful use of prompting may allow for iterative improvement of item quality.
Span Identification of Epistemic Stance-Taking in Academic Written English
Masaki Eguchi | Kristopher Kyle
Responding to the increasing need for automated writing evaluation (AWE) systems to assess language use beyond lexis and grammar (Burstein et al., 2016), we introduce a new approach to identify rhetorical features of stance in academic English writing. Drawing on the discourse-analytic framework of engagement in the Appraisal analysis (Martin & White, 2005), we manually annotated 4,688 sentences (126,411 tokens) for eight rhetorical stance categories (e.g., PROCLAIM, ATTRIBUTION) and additional discourse elements. We then report an experiment to train machine learning models to identify and categorize the spans of these stance expressions. The best-performing model (RoBERTa + LSTM) achieved macro-averaged F1 of .7208 in the span identification of stance-taking expressions, slightly outperforming the intercoder reliability estimates before adjudication (F1 = .6629).
ACTA: Short-Answer Grading in High-Stakes Medical Exams
King Yiu Suen | Victoria Yaneva | Le An Ha | Janet Mee | Yiyun Zhou | Polina Harik
This paper presents the ACTA system, which performs automated short-answer grading in the domain of high-stakes medical exams. The system builds upon previous work on neural similarity-based grading approaches by applying these to the medical domain and utilizing contrastive learning as a means to optimize the similarity metric. ACTA is evaluated against three strong baselines and is developed in alignment with operational needs, where low-confidence responses are flagged for human review. Learning curves are explored to understand the effects of training data on performance. The results demonstrate that ACTA leads to a substantially lower number of responses being flagged for human review, while maintaining high classification accuracy.
Hybrid Models for Sentence Readability Assessment
Fengkai Liu | John Lee
Automatic readability assessment (ARA) predicts how difficult it is for the reader to understand a text. While ARA has traditionally been performed at the passage level, there has been increasing interest in ARA at the sentence level, given its applications in downstream tasks such as text simplification and language exercise generation. Recent research has suggested the effectiveness of hybrid approaches for ARA, but they have yet to be applied on the sentence level. We present the first study that compares neural and hybrid models for sentence-level ARA. We conducted experiments on graded sentences from the Wall Street Journal (WSJ) and a dataset derived from the OneStopEnglish corpus. Experimental results show that both neural and hybrid models outperform traditional classifiers trained on linguistic features. Hybrid models obtained the best accuracy on both datasets, surpassing the previous best result reported on the WSJ dataset by almost 13% absolute.
Training for Grammatical Error Correction Without Human-Annotated L2 Learners’ Corpora
Mikio Oda
Grammatical error correction (GEC) is a challenging task for non-native second language (L2) learners and learning machines. Data-driven GEC learning requires as much human-annotated genuine training data as possible. However, it is difficult to produce larger-scale human-annotated data, and synthetically generated large-scale parallel training data is valuable for GEC systems. In this paper, we propose a method for rebuilding a corpus of synthetic parallel data using target sentences predicted by a GEC model to improve performance. Experimental results show that our proposed pre-training outperforms that on the original synthetic datasets. Moreover, it is also shown that our proposed training without human-annotated L2 learners’ corpora is as practical as conventional full pipeline training with both synthetic datasets and L2 learners’ corpora in terms of accuracy.
Exploring a New Grammatico-functional Type of Measure as Part of a Language Learning Expert System
Cyriel Mallart | Andrew Simpkin | Rémi Venant | Nicolas Ballier | Bernardo Stearns | Jen Yu Li | Thomas Gaillat
This paper explores the use of L2-specific grammatical microsystems as elements of the domain knowledge of an Intelligent Computer-assisted Language Learning (ICALL) system. We report on the design of new grammatico-functional measures and their association with proficiency. We illustrate the approach with the design of the IT, THIS, THAT proform microsystem. The measures rely on the paradigmatic relations between words of the same linguistic functions. They are operationalised with one frequency-based and two probabilistic methods, i.e., the relative proportions of the forms and their likelihood of occurrence. Ordinal regression models show that the measures are significant in terms of association with CEFR levels, paving the way for their introduction in a specific proform microsystem expert model.
Japanese Lexical Complexity for Non-Native Readers: A New Dataset
Yusuke Ide | Masato Mita | Adam Nohejl | Hiroki Ouchi | Taro Watanabe
Lexical complexity prediction (LCP) is the task of predicting the complexity of words in a text on a continuous scale. It plays a vital role in simplifying or annotating complex words to assist readers. To study lexical complexity in Japanese, we construct the first Japanese LCP dataset. Our dataset provides separate complexity scores for Chinese/Korean annotators and others to address the readers’ L1-specific needs. In the baseline experiment, we demonstrate the effectiveness of a BERT-based system for Japanese LCP.
Grammatical Error Correction for Sentence-level Assessment in Language Learning
Anisia Katinskaia | Roman Yangarber
The paper presents experiments on using a Grammatical Error Correction (GEC) model to assess the correctness of answers that language learners give to grammar exercises. We explored whether a GEC model can be applied in the language learning context for a language with complex morphology. We empirically check a hypothesis that a GEC model corrects only errors and leaves correct answers unchanged. We perform a test on assessing learner answers in a real but constrained language-learning setup: the learners answer only fill-in-the-blank and multiple-choice exercises. For this purpose, we use ReLCo, a publicly available manually annotated learner dataset in Russian (Katinskaia et al., 2022). In this experiment, we fine-tune a large-scale T5 language model for the GEC task and estimate its performance on the RULEC-GEC dataset (Rozovskaya and Roth, 2019) to compare with top-performing models. We also release an updated version of the RULEC-GEC test set, manually checked by native speakers. Our analysis shows that the GEC model performs reasonably well in detecting erroneous answers to grammar exercises and potentially can be used for best-performing error types in a real learning setup. However, it struggles to assess answers which were tagged by human annotators as alternative-correct using the aforementioned hypothesis. This is in large part due to a still low recall in correcting errors, and the fact that the GEC model may modify even correct words—it may generate plausible alternatives, which are hard to evaluate against the gold-standard reference.
“Geen makkie”: Interpretable Classification and Simplification of Dutch Text Complexity
Eliza Hobo | Charlotte Pouw | Lisa Beinborn
An inclusive society needs to facilitate access to information for all of its members, including citizens with low literacy and with non-native language skills. We present an approach to assess Dutch text complexity on the sentence level and conduct an interpretability analysis to explore the link between neural models and linguistic complexity features. Building on these findings, we develop the first contextual lexical simplification model for Dutch and publish a pilot dataset for evaluation. We go beyond previous work, which primarily targeted lexical substitution, and propose strategies for adjusting the model’s linguistic register to generate simpler candidates. Our results indicate that continual pre-training and multi-task learning with conceptually related tasks are promising directions for ensuring the simplicity of the generated substitutions.
CEFR-based Contextual Lexical Complexity Classifier in English and French
Desislava Aleksandrova | Vincent Pouliot
This paper describes a CEFR-based classifier of single-word and multi-word lexical complexity in context from a second language learner perspective in English and in French, developed as an analytical tool for the pedagogical team of the language learning application Mauril. We provide an overview of the required corpora and the way we transformed them into rich contextual representations that allow the disambiguation and accurate labelling in context of polysemous occurrences of a given lexical item. We report evaluation results for all models, including two multi-lingual lexical classifiers evaluated on novel French datasets created for this experiment. Finally, we share the perspective of Mauril’s pedagogical team on the limitations of such systems.
The NCTE Transcripts: A Dataset of Elementary Math Classroom Transcripts
Dorottya Demszky | Heather Hill
Classroom discourse is a core medium of instruction; analyzing it can provide a window into teaching and learning as well as drive the development of new tools for improving instruction. We introduce the largest dataset of mathematics classroom transcripts available to researchers, and demonstrate how this data can help improve instruction. The dataset consists of 1,660 45-60 minute long 4th and 5th grade elementary mathematics observations collected by the National Center for Teacher Effectiveness (NCTE) between 2010-2013. The anonymized transcripts represent data from 317 teachers across 4 school districts that serve largely historically marginalized students. The transcripts come with rich metadata, including turn-level annotations for dialogic discourse moves, classroom observation scores, demographic information, survey responses and student test scores. We demonstrate that our natural language processing model, trained on our turn-level annotations, can learn to identify dialogic discourse moves and these moves are correlated with better classroom observation scores and learning outcomes. This dataset opens up several possibilities for researchers, educators and policymakers to learn about and improve K-12 instruction. The dataset can be found at
https://github.com/ddemszky/classroom-transcript-analysis.
Auto-req: Automatic detection of pre-requisite dependencies between academic videos
Rushil Thareja | Ritik Garg | Shiva Baghel | Deep Dwivedi | Mukesh Mohania | Ritvik Kulshrestha
Online learning platforms offer a wealth of educational material, but as the amount of content on these platforms grows, students may struggle to determine the most efficient order in which to cover the material to achieve a particular learning objective. In this paper, we propose a feature-based method for identifying pre-requisite dependencies between academic videos. Our approach involves using a transcript engine with a language model to transcribe domain-specific terms and then extracting novel similarity-based features to determine pre-requisite dependencies between video transcripts. This approach succeeds due to the development of a novel corpus of K-12 academic text, which was created using a proposed feature-based document parser. We evaluate our method on hand-annotated datasets for transcript extraction, video pre-requisites determination, and textbook parsing, which we have released. Our method for pre-requisite edge determination shows significant improvement (+4.7%-10.24% F1-score) compared to existing methods.
Transformer-based Hebrew NLP models for Short Answer Scoring in Biology
Abigail Gurin Schleifer | Beata Beigman Klebanov | Moriah Ariely | Giora Alexandron
Pre-trained large language models (PLMs) are adaptable to a wide range of downstream tasks by fine-tuning their rich contextual embeddings to the task, often without requiring much task-specific data. In this paper, we explore the use of a recently developed Hebrew PLM aleph-BERT for automated short answer grading of high school biology items. We show that the alephBERT-based system outperforms a strong CNN-based baseline, and that it generalizes unexpectedly well in a zero-shot paradigm to items on an unseen topic that address the same underlying biological concepts, opening up the possibility of automatically assessing new items without item-specific fine-tuning.
Comparing Neural Question Generation Architectures for Reading Comprehension
E. Margaret Perkoff | Abhidip Bhattacharyya | Jon Cai | Jie Cao
In recent decades, there has been a significant push to leverage technology to aid both teachers and students in the classroom. Language processing advancements have been harnessed to provide better tutoring services, automated feedback to teachers, improved peer-to-peer feedback mechanisms, and measures of student comprehension for reading. Automated question generation systems have the potential to significantly reduce teachers’ workload in the latter. In this paper, we compare three different neural architectures for question generation across two types of reading material: narratives and textbooks. For each architecture, we explore the benefits of including question attributes in the input representation. Our models show that a T5 architecture has the best overall performance, with a RougeL score of 0.536 on a narrative corpus and 0.316 on a textbook corpus. We break down the results by attribute and discover that the attribute can improve the quality of some types of generated questions, including Action and Character, but this is not true for all models.
A dynamic model of lexical experience for tracking of oral reading fluency
Beata Beigman Klebanov | Michael Suhan | Zuowei Wang | Tenaha O’reilly
We present research aimed at solving a problem in assessment of oral reading fluency using children’s oral reading data from our online book reading app. It is known that properties of the passage being read aloud impact fluency estimates; therefore, passage-based measures are used to remove passage-related variance when estimating growth in oral reading fluency. However, passage-based measures reported in the literature tend to treat passages as independent events, without explicitly modeling accumulation of lexical experience as one reads through a book. We propose such a model and show that it helps explain additional variance in the measurements of children’s fluency as they read through a book, improving over a strong baseline. These results have implications for measuring growth in oral reading fluency.
Rating Short L2 Essays on the CEFR Scale with GPT-4
Kevin P. Yancey | Geoffrey Laflair | Anthony Verardi | Jill Burstein
Essay scoring is a critical task used to evaluate second-language (L2) writing proficiency on high-stakes language assessments. While automated scoring approaches are mature and have been around for decades, human scoring is still considered the gold standard, despite its high costs and well-known issues such as human rater fatigue and bias. The recent introduction of large language models (LLMs) brings new opportunities for automated scoring. In this paper, we evaluate how well GPT-3.5 and GPT-4 can rate short essay responses written by L2 English learners on a high-stakes language assessment, computing inter-rater agreement with human ratings. Results show that when calibration examples are provided, GPT-4 can perform almost as well as modern Automatic Writing Evaluation (AWE) methods, but agreement with human ratings can vary depending on the test-taker’s first language (L1).
Towards automatically extracting morphosyntactical error patterns from L1-L2 parallel dependency treebanks
Arianna Masciolini | Elena Volodina | Dana Dannélls
L1-L2 parallel dependency treebanks are UD-annotated corpora of learner sentences paired with correction hypotheses. Automatic morphosyntactical annotation has the potential to remove the need for explicit manual error tagging and improve interoperability, but makes it more challenging to locate grammatical errors in the resulting datasets. We therefore propose a novel method for automatically extracting morphosyntactical error patterns and perform a preliminary bilingual evaluation of its first implementation through a similar example retrieval task. The resulting pipeline is also available as a prototype CALL application.
Learning from Partially Annotated Data: Example-aware Creation of Gap-filling Exercises for Language Learning
Semere Kiros Bitew
|
Johannes Deleu
|
A. Seza Doğruöz
|
Chris Develder
|
Thomas Demeester
Since performing exercises (including, e.g., practice tests) forms a crucial component of learning, and creating such exercises requires non-trivial effort from the teacher, there is great value in automatic exercise generation in digital tools for education. In this paper, we particularly focus on the automatic creation of gap-filling exercises for language learning, specifically grammar exercises. Since providing any annotation in this domain requires human expert effort, we aim to avoid it entirely and explore the task of converting existing texts into new gap-filling exercises, purely based on an example exercise, without explicit instruction or detailed annotation of the intended grammar topics. We contribute (i) a novel neural network architecture specifically designed for the aforementioned gap-filling exercise generation task, and (ii) a real-world benchmark dataset for French grammar. We show that our model for this French grammar gap-filling exercise generation outperforms a competitive baseline classifier by 8 F1 percentage points, achieving an average F1 score of 82%. Our model implementation and the dataset are made publicly available to foster future research, thus offering a standardized evaluation and baseline solution for the proposed partially annotated data prediction task in grammar exercise creation.
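The gap-filling task can be illustrated with a crude heuristic that blanks out tokens matching the part of speech targeted by an example gap; this is only a sketch, not the example-aware neural model contributed by the paper, and the spaCy French model name is an assumption.

```python
# Heuristic sketch of example-driven gap creation (not the paper's neural model):
# if the example exercise gaps, say, finite verbs, blank out tokens with the same
# coarse POS tag in a new text. Assumes the spaCy model "fr_core_news_sm" is installed.
import spacy

nlp = spacy.load("fr_core_news_sm")

def make_gaps(text: str, example_gap_pos: str = "VERB"):
    doc = nlp(text)
    exercise, answers = [], []
    for token in doc:
        if token.pos_ == example_gap_pos:
            exercise.append("____")      # gap to be filled by the learner
            answers.append(token.text)   # gold answer for the teacher
        else:
            exercise.append(token.text)
    return " ".join(exercise), answers

print(make_gaps("Elle mange une pomme et boit de l'eau."))
```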
pdf
abs
Evaluating Reading Comprehension Exercises Generated by LLMs: A Showcase of ChatGPT in Education Applications
Changrong Xiao
|
Sean Xin Xu
|
Kunpeng Zhang
|
Yufang Wang
|
Lei Xia
The recent advancement of pre-trained Large Language Models (LLMs), such as OpenAI’s ChatGPT, has led to transformative changes across fields. For example, developing intelligent systems in the educational sector that leverage the linguistic capabilities of LLMs demonstrates a visible potential. Though researchers have recently explored how ChatGPT could possibly assist in student learning, few studies have applied these techniques to real-world classroom settings involving teachers and students. In this study, we implement a reading comprehension exercise generation system that provides high-quality and personalized reading materials for middle school English learners in China. Extensive evaluations of the generated reading passages and corresponding exercise questions, conducted both automatically and manually, demonstrate that the system-generated materials are suitable for students and even surpass the quality of existing human-written ones. By incorporating first-hand feedback and suggestions from experienced educators, this study serves as a meaningful pioneering application of ChatGPT, shedding light on the future design and implementation of LLM-based systems in the educational context.
pdf
abs
Is ChatGPT a Good Teacher Coach? Measuring Zero-Shot Performance For Scoring and Providing Actionable Insights on Classroom Instruction
Rose Wang
|
Dorottya Demszky
Coaching, which involves classroom observation and expert feedback, is a widespread and fundamental part of teacher training. However, the majority of teachers do not have access to consistent, high-quality coaching due to limited resources and access to expertise. We explore whether generative AI could become a cost-effective complement to expert feedback by serving as an automated teacher coach. In doing so, we propose three teacher coaching tasks for generative AI: (A) scoring transcript segments based on classroom observation instruments, (B) identifying highlights and missed opportunities for good instructional strategies, and (C) providing actionable suggestions for eliciting more student reasoning. We recruit expert math teachers to evaluate the zero-shot performance of ChatGPT on each of these tasks for elementary math classroom transcripts. Our results reveal that ChatGPT generates responses that are relevant to improving instruction, but they are often not novel or insightful. For example, 82% of the model’s suggestions point to places in the transcript where the teacher is already implementing that suggestion. Our work highlights the challenges of producing insightful, novel and truthful feedback for teachers while paving the way for future research to address these obstacles and improve the capacity of generative AI to coach teachers.
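A rough sketch of the zero-shot prompting setup follows; the rubric text, prompt wording, and model choice are placeholders rather than the authors' materials.

```python
# Rough sketch: ask a chat model to score one transcript segment on one
# classroom-observation item, zero-shot. Prompt wording, rubric item, and model
# name are placeholders, not the study's instruments.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

segment = "Teacher: Who can explain why 3/4 is bigger than 2/3? ..."
rubric_item = "Teacher elicits and builds on student mathematical reasoning (score 1-5)."

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system",
         "content": "You are an experienced elementary math instructional coach."},
        {"role": "user",
         "content": f"Rubric item: {rubric_item}\n\nTranscript segment:\n{segment}\n\n"
                    "Give a score and one actionable suggestion for the teacher."},
    ],
    temperature=0,
)
print(response.choices[0].message.content)
```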
pdf
abs
Does BERT Exacerbate Gender or L1 Biases in Automated English Speaking Assessment?
Alexander Kwako
|
Yixin Wan
|
Jieyu Zhao
|
Mark Hansen
|
Kai-Wei Chang
|
Li Cai
In English speaking assessment, pretrained large language models (LLMs) such as BERT can score constructed response items as accurately as human raters. Less research has investigated whether LLMs perpetuate or exacerbate biases, which would pose problems for the fairness and validity of the test. This study examines gender and native language (L1) biases in human and automated scores, using an off-the-shelf (OOS) BERT model. Analyses focus on a specific type of bias known as differential item functioning (DIF), which compares examinees of similar English language proficiency. Results show that there is a moderate amount of DIF, based on examinees’ L1 background in grade band 9–12. DIF is higher when scored by an OOS BERT model, indicating that BERT may exacerbate this bias; however, in practical terms, the degree to which BERT exacerbates DIF is very small. Additionally, there is more DIF for longer speaking items and for older examinees, but BERT does not exacerbate these patterns of DIF.
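As one way to make the DIF notion concrete, the sketch below runs a logistic-regression DIF check on toy data; the study itself may rely on a different (e.g. IRT-based) procedure, so this is purely illustrative.

```python
# Sketch of a common DIF check (logistic-regression DIF) on toy data: after
# conditioning on proficiency, a large group effect on item performance suggests
# differential item functioning. Illustrative only; not the study's procedure.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "item_correct": [1, 0, 0, 1, 1, 1, 0, 1, 1, 0],   # dichotomized item score
    "proficiency":  [2.1, 0.3, 1.8, 2.5, 0.1, 1.2, 0.4, 0.2, 2.0, 1.5],
    "group":        [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],   # e.g. L1 background
})

model = smf.logit("item_correct ~ proficiency + group", data=df).fit(disp=0)
print(model.params["group"])  # a large, significant coefficient indicates DIF
```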
pdf
abs
MultiQG-TI: Towards Question Generation from Multi-modal Sources
Zichao Wang
|
Richard Baraniuk
We study the new problem of automatic question generation (QG) from multi-modal sources containing images and texts, significantly expanding the scope of most of the existing work that focuses exclusively on QG from only textual sources. We propose a simple solution for our new problem, called MultiQG-TI, which enables a text-only question generator to process visual input in addition to textual input. Specifically, we leverage an image-to-text model and an optical character recognition model to obtain the textual description of the image and extract any texts in the image, respectively, and then feed them together with the input texts to the question generator. We only fine-tune the question generator while keeping the other components fixed. On the challenging ScienceQA dataset, we demonstrate that MultiQG-TI significantly outperforms ChatGPT with few-shot prompting, despite having hundreds of times fewer trainable parameters. Additional analyses empirically confirm the necessity of both visual and textual signals for QG and show the impact of various modeling choices. Code is available at https://anonymous.4open.science/r/multimodal-QG-47F2/
pdf
abs
Inspecting Spoken Language Understanding from Kids for Basic Math Learning at Home
Eda Okur
|
Roddy Fuentes Alba
|
Saurav Sahay
|
Lama Nachman
Enriching the quality of early childhood education with interactive math learning at home systems, empowered by recent advances in conversational AI technologies, is slowly becoming a reality. With this motivation, we implement a multimodal dialogue system to support play-based learning experiences at home, guiding kids to master basic math concepts. This work explores the Spoken Language Understanding (SLU) pipeline within a task-oriented dialogue system developed for Kid Space, with cascading Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) components evaluated on our home deployment data with kids going through gamified math learning activities. We validate the advantages of a multi-task architecture for NLU and experiment with a diverse set of pretrained language representations for Intent Recognition and Entity Extraction tasks in the math learning domain. To recognize kids’ speech in realistic home environments, we investigate several ASR systems, including the commercial Google Cloud and the latest open-source Whisper solutions with varying model sizes. We evaluate the SLU pipeline by testing our best-performing NLU models on noisy ASR output to inspect the challenges of understanding children for math learning in authentic homes.
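A minimal sketch of such a cascaded SLU pipeline, using open-source Whisper for ASR followed by a placeholder fine-tuned intent classifier (not the system evaluated in the paper):

```python
# Minimal sketch of a cascaded SLU setup: open-source Whisper for ASR, then an
# intent classifier over the transcript. The audio file and intent-model path
# are hypothetical placeholders, not the paper's system.
import whisper
from transformers import pipeline

asr = whisper.load_model("base")
transcript = asr.transcribe("kid_utterance.wav")["text"]  # hypothetical audio file

# Placeholder: a text-classification model fine-tuned on math-learning intents.
intent_classifier = pipeline("text-classification", model="path/to/intent-model")
print(transcript, "->", intent_classifier(transcript))
```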
pdf
abs
Socratic Questioning of Novice Debuggers: A Benchmark Dataset and Preliminary Evaluations
Erfan Al-Hossami
|
Razvan Bunescu
|
Ryan Teehan
|
Laurel Powell
|
Khyati Mahajan
|
Mohsen Dorodchi
Socratic questioning is a teaching strategy where the student is guided towards solving a problem on their own, instead of being given the solution directly. In this paper, we introduce a dataset of Socratic conversations where an instructor helps a novice programmer fix buggy solutions to simple computational problems. The dataset is then used for benchmarking the Socratic debugging abilities of GPT-based language models. While GPT-4 is observed to perform much better than GPT-3.5, its precision and recall still fall short of human expert abilities, motivating further work in this area.
pdf
abs
Beyond Black Box AI generated Plagiarism Detection: From Sentence to Document Level
Ali Quidwai
|
Chunhui Li
|
Parijat Dube
The increasing reliance on large language models (LLMs) in academic writing has led to a rise in plagiarism. Existing AI-generated text classifiers have limited accuracy and often produce false positives. We propose a novel approach using natural language processing (NLP) techniques, offering quantifiable metrics at both sentence and document levels for easier interpretation by human evaluators. Our method employs a multi-faceted approach, generating multiple paraphrased versions of a given question and inputting them into the LLM to generate answers. By using a contrastive loss function based on cosine similarity, we match generated sentences with those from the student’s response. Our approach achieves up to 94% accuracy in classifying human and AI text, providing a robust and adaptable solution for plagiarism detection in academic settings. This method improves with LLM advancements, reducing the need for new model training or reconfiguration, and offers a more transparent way of evaluating and detecting AI-generated text.
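The sentence-level matching step can be sketched with off-the-shelf sentence embeddings and cosine similarity; the encoder and the threshold below are illustrative choices, not the paper's configuration.

```python
# Sketch of sentence-level matching: embed sentences from LLM-generated answers
# and from the student's response, then flag student sentences whose best cosine
# similarity to a generated sentence exceeds a threshold. Encoder and threshold
# are illustrative choices only.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

llm_sentences = ["The mitochondria produce ATP through cellular respiration."]
student_sentences = ["Mitochondria generate ATP via cellular respiration.",
                     "I verified this in our lab experiment last week."]

sim = util.cos_sim(encoder.encode(student_sentences, convert_to_tensor=True),
                   encoder.encode(llm_sentences, convert_to_tensor=True))

for sent, row in zip(student_sentences, sim):
    flagged = float(row.max()) > 0.8
    print(f"{'AI-like' if flagged else 'human-like'} ({float(row.max()):.2f}): {sent}")
```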
pdf
abs
Enhancing Educational Dialogues: A Reinforcement Learning Approach for Generating AI Teacher Responses
Thomas Huber
|
Christina Niklaus
|
Siegfried Handschuh
Reinforcement Learning remains an underutilized method of training and fine-tuning Language Models (LMs) despite recent successes. This paper presents a simple approach to fine-tuning a language model with Reinforcement Learning to achieve competitive performance on the BEA 2023 Shared Task, whose goal is to automatically generate teacher responses in educational dialogues. We utilized the novel NLPO algorithm, which masks out tokens during generation to direct the model towards generations that maximize a reward function. We report results for two models: the t5-base model (220 million parameters) from the HuggingFace repository, which was submitted to the leaderboard and, despite its comparatively small size, achieved good performance on both the test and dev sets, and GPT-2 (124 million parameters). The presented results show that, despite maximizing only one of the evaluation metrics as a reward function, our model scores highly on the other metrics as well.
pdf
abs
Assessing the efficacy of large language models in generating accurate teacher responses
Yann Hicke
|
Abhishek Masand
|
Wentao Guo
|
Tushaar Gangavarapu
Tack et al. (2023) organized the shared task hosted by the 18th Workshop on Innovative Use of NLP for Building Educational Applications on the generation of teacher language in educational dialogues. Following the structure of the shared task, in this study, we attempt to assess the generative abilities of large language models in providing informative and helpful insights to students, thereby simulating the role of a knowledgeable teacher. To this end, we present an extensive evaluation of several benchmarking generative models, including GPT-4 (few-shot, in-context learning), fine-tuned GPT-2, and fine-tuned DialoGPT. Additionally, to optimize for pedagogical quality, we fine-tuned the Flan-T5 model using reinforcement learning. Our experimental findings on the Teacher-Student Chatroom Corpus subset indicate the efficacy of GPT-4 over other fine-tuned models, measured using BERTScore and DialogRPT. We hypothesize that several dataset characteristics, including sampling, representativeness, and dialog completeness, pose significant challenges to fine-tuning, thus contributing to the poor generalizability of the fine-tuned models. Finally, we note the need for these generative models to be evaluated with a metric that relies not only on dialog coherence and matched language modeling distribution but also on the model’s ability to showcase pedagogical skills.
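The BERTScore part of such an evaluation can be reproduced in a few lines; the candidate and reference teacher replies below are toy examples.

```python
# Small sketch of BERTScore-based evaluation of generated teacher replies
# against reference teacher replies. The strings are toy examples.
from bert_score import score

candidates = ["Good try! What happens if you check the verb tense in that sentence?"]
references = ["Almost there. Have another look at the tense of the verb."]

P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")
```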
pdf
abs
RETUYT-InCo at BEA 2023 Shared Task: Tuning Open-Source LLMs for Generating Teacher Responses
Alexis Baladón
|
Ignacio Sastre
|
Luis Chiruzzo
|
Aiala Rosá
This paper presents the results of our participation in the BEA 2023 shared task, which focuses on generating AI teacher responses in educational dialogues. We conducted experiments using several Open-Source Large Language Models (LLMs) and explored fine-tuning techniques along with prompting strategies, including Few-Shot and Chain-of-Thought approaches. Our best model was ranked 4.5 in the competition with a BertScore F1 of 0.71 and a DialogRPT final (avg) of 0.35. Nevertheless, our internal results did not exactly correlate with those obtained in the competition, which showed the difficulty in evaluating this task. Other challenges we faced were data leakage on the train set and the irregular format of the conversations.
pdf
abs
Empowering Conversational Agents using Semantic In-Context Learning
Amin Omidvar
|
Aijun An
Language models are one of the biggest game changers in downstream NLP applications, especially in conversational agents. In spite of their impressive ability to generate responses that resolve inquiries, there are still some big challenges to using them. One challenge is how to enable LLMs to use private internal data to resolve inquiries. Another is how to keep LLMs updated with newly incoming data without the burden of fine-tuning, which is not only expensive but also unavailable for some commercial LLMs, such as ChatGPT. In this work, we propose Semantic In-Context Learning (S-ICL) to address the aforementioned challenges. Our approach participated in the BEA 2023 shared task and placed fourth in both the development and evaluation phases.
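A minimal sketch of the general semantic in-context learning idea, with hypothetical knowledge-base entries and an off-the-shelf sentence encoder (not necessarily the S-ICL implementation):

```python
# Sketch of semantic in-context learning: retrieve the most semantically similar
# solved examples from an internal knowledge base and place them in the prompt.
# Knowledge-base entries, encoder, and prompt format are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

knowledge_base = [
    {"inquiry": "How do I reset my course password?",
     "answer": "Use the 'Forgot password' link on the portal."},
    {"inquiry": "Where can I find last year's exam?",
     "answer": "Past exams are in the Resources tab."},
]
kb_embeddings = encoder.encode([item["inquiry"] for item in knowledge_base],
                               convert_to_tensor=True)

def build_prompt(new_inquiry: str, k: int = 1) -> str:
    query = encoder.encode(new_inquiry, convert_to_tensor=True)
    top = util.cos_sim(query, kb_embeddings)[0].topk(k)
    examples = "\n".join(
        f"Q: {knowledge_base[i]['inquiry']}\nA: {knowledge_base[i]['answer']}"
        for i in top.indices.tolist()
    )
    return f"{examples}\nQ: {new_inquiry}\nA:"

print(build_prompt("I forgot my password for the course site."))
```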
pdf
abs
NAISTeacher: A Prompt and Rerank Approach to Generating Teacher Utterances in Educational Dialogues
Justin Vasselli
|
Christopher Vasselli
|
Adam Nohejl
|
Taro Watanabe
This paper presents our approach to the BEA 2023 shared task of generating teacher responses in educational dialogues, using the Teacher-Student Chatroom Corpus. Our system prompts GPT-3.5-turbo to generate initial suggestions, which are then subjected to reranking. We explore multiple strategies for candidate generation, including prompting for multiple candidates and employing iterative few-shot prompts with negative examples. We aggregate all candidate responses and rerank them based on DialogRPT scores. To handle consecutive turns in the dialogue data, we divide the task of generating teacher utterances into two components: teacher replies to the student and teacher continuations of previously sent messages. Through our proposed methodology, our system achieved the top score on both automated metrics and human evaluation, surpassing the reference human teachers on the latter.
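A sketch of a generate-then-rerank loop in this spirit, sampling several candidates from GPT-3.5-turbo and keeping the reply that DialogRPT scores highest; the prompt wording and the specific DialogRPT head are illustrative choices rather than the system's exact configuration.

```python
# Sketch of generate-then-rerank: sample several candidate teacher replies from a
# chat model, then keep the one with the highest DialogRPT score. Prompt text and
# the DialogRPT head are illustrative choices, not the paper's exact setup.
import torch
from openai import OpenAI
from transformers import AutoTokenizer, AutoModelForSequenceClassification

client = OpenAI()
rpt_tok = AutoTokenizer.from_pretrained("microsoft/DialogRPT-updown")
rpt_model = AutoModelForSequenceClassification.from_pretrained("microsoft/DialogRPT-updown")

def dialogrpt_score(context: str, reply: str) -> float:
    # DialogRPT expects "context <|endoftext|> response" as a single sequence.
    ids = rpt_tok.encode(context + "<|endoftext|>" + reply, return_tensors="pt")
    return torch.sigmoid(rpt_model(ids).logits).item()

context = "Student: I always mix up 'since' and 'for'. Can you help?"
candidates = [
    client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": "You are a supportive English teacher."},
                  {"role": "user", "content": context}],
        temperature=0.9,
    ).choices[0].message.content
    for _ in range(5)
]
best = max(candidates, key=lambda reply: dialogrpt_score(context, reply))
print(best)
```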
pdf
abs
The BEA 2023 Shared Task on Generating AI Teacher Responses in Educational Dialogues
Anaïs Tack
|
Ekaterina Kochmar
|
Zheng Yuan
|
Serge Bibauw
|
Chris Piech
This paper describes the results of the first shared task on generation of teacher responses in educational dialogues. The goal of the task was to benchmark the ability of generative language models to act as AI teachers, replying to a student in a teacher-student dialogue. Eight teams participated in the competition hosted on CodaLab and experimented with a wide variety of state-of-the-art models, including Alpaca, Bloom, DialoGPT, DistilGPT-2, Flan-T5, GPT-2, GPT-3, GPT-4, LLaMA, OPT-2.7B, and T5-base. Their submissions were automatically scored using BERTScore and DialogRPT metrics, and the top three among them were further manually evaluated in terms of pedagogical ability based on Tack and Piech (2022). The NAISTeacher system, which ranked first in both automated and human evaluation, generated responses with GPT-3.5 Turbo using an ensemble of prompts and DialogRPT-based ranking of responses for given dialogue contexts. Despite promising achievements of the participating teams, the results also highlight the need for evaluation metrics better suited to educational contexts.
pdf
abs
The ADAIO System at the BEA-2023 Shared Task: Shared Task Generating AI Teacher Responses in Educational Dialogues
Adaeze Adigwe
|
Zheng Yuan
This paper presents the ADAIO team’s system entry in the Building Educational Applications (BEA) 2023 Shared Task on Generating AI Teacher Responses in Educational Dialogues. The task aims to assess the performance of state-of-the-art generative models as AI teachers in producing suitable responses within a student-teacher dialogue. Our system comprises evaluating various baseline models using OpenAI GPT-3 and designing diverse prompts to prompt the OpenAI models for teacher response generation. After the challenge, our system achieved second place by employing a few-shot prompt-based approach with the OpenAI text-davinci-003 model. The results highlight the few-shot learning capabilities of large-language models, particularly OpenAI’s GPT-3, in the role of AI teachers.