Richard Correnti
This study explores the use of ChatGPT-4.1 as a formative assessment tool for identifying revision patterns in young adolescents’ argumentative writing. ChatGPT-4.1 shows moderate agreement with human coders on identifying evidence-related revision patterns and fair agreement on explanation-related ones. Implications for LLM-assisted formative assessment of young adolescent writing are discussed.
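The agreement levels reported above correspond to chance-corrected statistics such as Cohen's kappa. Below is a minimal sketch of how model-versus-human agreement on revision labels could be quantified; the label names and data are hypothetical illustrations, not from the study:

from sklearn.metrics import cohen_kappa_score

# Hypothetical sentence-level revision-pattern labels (placeholders, not the
# study's codebook): one entry per revised sentence.
human_labels = ["add_evidence", "no_change", "add_explanation", "add_evidence"]
model_labels = ["add_evidence", "no_change", "add_evidence", "add_evidence"]

# Chance-corrected agreement between the human coder and the LLM.
kappa = cohen_kappa_score(human_labels, model_labels)
print(f"Cohen's kappa: {kappa:.2f}")  # 0.41-0.60 is conventionally read as moderate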
Recent advancements in natural language processing, particularly large language models (LLMs), are making the automated evaluation of classroom discussions more achievable. In this work, we propose a method to improve the performance of LLMs on classroom discussion quality assessment by utilizing in-context learning (ICL) example retrieval. Specifically, we leverage example re-ranking and label ratio regulation, which enforces a specific ratio of different example types within the set of ICL examples. While a standard ICL example retrieval approach shows inferior performance compared to using a predetermined set of examples, our approach improves performance in all tested dimensions. We also conducted experiments to examine why the generic ICL example retrieval approach is ineffective and found that the lack of positive and hard negative examples is a potential cause. Our analyses emphasize the importance of maintaining a balanced distribution of classes (positive, non-hard negative, and hard negative examples) when creating a good set of ICL examples, especially when educational knowledge can be used to identify hard negative examples.
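The label ratio regulation step can be read as a per-class quota applied to a similarity-ranked candidate pool. The sketch below illustrates that reading and is not the authors' implementation; the function name, label names, and default quota are placeholders:

def select_icl_examples(ranked_candidates, quota, k=8):
    """Pick ICL examples from a retrieval ranking while enforcing a label ratio.

    ranked_candidates: list of (example, label) pairs, best match first.
    quota: target counts per label, e.g. {"positive": 3, "hard_negative": 3, "non_hard_negative": 2}.
    Labels are assumed to be assigned beforehand (e.g. via educational-knowledge rules).
    """
    remaining = dict(quota)
    selected = []
    for example, label in ranked_candidates:
        if remaining.get(label, 0) > 0:
            selected.append((example, label))
            remaining[label] -= 1
        if len(selected) == k:
            break
    return selected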
The ability to revise essays in response to feedback is important for students’ writing success. An automated writing evaluation (AWE) system that supports students in revising their essays is thus essential. We present eRevise+RF, an enhanced AWE system for assessing student essay revisions (i.e., changes made to an essay to improve its quality in response to essay feedback) and providing revision feedback. We deployed the system with 6 teachers and 406 students across 3 schools in Pennsylvania and Louisiana. The results confirmed its effectiveness in (1) assessing student essays in terms of evidence usage, (2) extracting evidence and reasoning revisions across essays, and (3) determining revision success in responding to feedback. The evaluation also suggested eRevise+RF is a helpful system for young students to improve their argumentative writing skills through revision and formative feedback.
Although effective revision is a crucial component of writing instruction, few automated writing evaluation (AWE) systems specifically focus on the quality of the revisions students undertake. In this study, we investigate the use of a large language model (GPT-4) with Chain-of-Thought (CoT) prompting for assessing the quality of young students’ essay revisions aligned with the automated feedback messages they received. Results indicate that GPT-4 has significant potential for evaluating revision quality, particularly when detailed rubrics are included that describe common revision patterns shown by young writers. However, the addition of CoT prompting did not significantly improve performance. Further examination of GPT-4’s scoring performance across various levels of student writing proficiency revealed variable agreement with human ratings. The implications for improving AWE systems focusing on young students are discussed.
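As an illustration of rubric-based scoring with an LLM, the sketch below assembles a prompt from a placeholder rubric and optionally appends a CoT instruction. The rubric text, model name, and prompt wording are assumptions for illustration, not the prompts used in the paper:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = """Score the revision from 1 to 3:
1 = the revision does not address the feedback,
2 = the revision partially addresses the feedback,
3 = the revision fully addresses the feedback with relevant evidence or reasoning."""

def score_revision(feedback, draft1, draft2, use_cot=True):
    # Placeholder rubric plus an optional chain-of-thought instruction.
    instructions = RUBRIC
    if use_cot:
        instructions += "\nThink step by step before giving the final score."
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": (
                f"Feedback given to the student: {feedback}\n"
                f"First draft: {draft1}\n"
                f"Revised draft: {draft2}"
            )},
        ],
    )
    return response.choices[0].message.content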
The ability to revise in response to feedback is critical to students’ writing success. In the case of argument writing specifically, identifying whether an argument revision (AR) is successful is a complex problem because AR quality depends on the overall content of an argument. For example, adding the same evidence sentence could strengthen or weaken existing claims in different argument contexts (ACs). To address this issue, we developed Chain-of-Thought prompts to facilitate ChatGPT-generated ACs for AR quality prediction. Experiments on two corpora, our annotated elementary essays and an existing college essay benchmark, demonstrate the superiority of the proposed ACs over baselines.
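One way to picture the role of the generated ACs is a two-step prompt chain: first elicit the argument context around the revised sentence, then condition the success judgment on it. The sketch below is illustrative only; llm stands for any chat-completion call that returns a string, and the prompt wording is not the paper's:

def predict_revision_success(llm, essay_before, essay_after, revised_sentence):
    # Step 1: elicit an argument context (AC) describing the claims the
    # revised sentence supports or weakens.
    ac = llm(
        "Summarize the claims in this essay that the sentence below supports or weakens.\n"
        f"Essay: {essay_before}\nSentence: {revised_sentence}"
    )
    # Step 2: judge the revision in light of that argument context.
    return llm(
        "Given the argument context, did the revision strengthen the essay? "
        "Answer 'successful' or 'unsuccessful'.\n"
        f"Argument context: {ac}\nOriginal essay: {essay_before}\nRevised essay: {essay_after}"
    )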
Automated writing evaluation systems can improve students’ writing insofar as students attend to the feedback provided and revise their essay drafts in ways aligned with such feedback. Existing research on revision of argumentative writing in such systems, however, has focused on the types of revisions students make (e.g., surface vs. content) rather than the extent to which revisions actually respond to the feedback provided and improve the essay. We introduce an annotation scheme to capture the nature of sentence-level revisions of evidence use and reasoning (the ‘RER’ scheme) and apply it to 5th- and 6th-grade students’ argumentative essays. We show that reliable manual annotation can be achieved and that revision annotations correlate with a holistic assessment of essay improvement in line with the feedback provided. Furthermore, we explore the feasibility of automatically classifying revisions according to our scheme.
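To make the automatic classification step concrete, one baseline formulation treats each aligned original/revised sentence pair as a single text instance and trains a standard classifier over revision labels. The snippet below is a toy sketch under that formulation, with made-up data and placeholder labels rather than the RER corpus or the paper's model:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy aligned sentence pairs and placeholder revision labels.
pairs = [
    "ORIGINAL: The park is good. REVISED: The park is good because the article says attendance rose.",
    "ORIGINAL: I think we should help. REVISED: I think we should help.",
]
labels = ["add_reasoning", "no_revision"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(pairs, labels)
print(clf.predict([
    "ORIGINAL: Dogs help people. REVISED: Dogs help people because the text says they reduce stress.",
]))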