Jill Burstein

2024

pdf bib
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)
Ekaterina Kochmar | Marie Bexte | Jill Burstein | Andrea Horbach | Ronja Laarmann-Quante | Anaïs Tack | Victoria Yaneva | Zheng Yuan
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

2023

pdf abs
Automated evaluation of written discourse coherence using GPT-4
Ben Naismith | Phoebe Mulcaire | Jill Burstein
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)

The popularization of large language models (LLMs) such as OpenAI’s GPT-3 and GPT-4 have led to numerous innovations in the field of AI in education. With respect to automated writing evaluation (AWE), LLMs have reduced challenges associated with assessing writing quality characteristics that are difficult to identify automatically, such as discourse coherence. In addition, LLMs can provide rationales for their evaluations (ratings) which increases score interpretability and transparency. This paper investigates one approach to producing ratings by training GPT-4 to assess discourse coherence in a manner consistent with expert human raters. The findings of the study suggest that GPT-4 has strong potential to produce discourse coherence ratings that are comparable to human ratings, accompanied by clear rationales. Furthermore, the GPT-4 ratings outperform traditional NLP coherence metrics with respect to agreement with human ratings. These results have implications for advancing AWE technology for learning and assessment.

pdf abs
Rating Short L2 Essays on the CEFR Scale with GPT-4
Kevin P. Yancey | Geoffrey Laflair | Anthony Verardi | Jill Burstein
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)

Essay scoring is a critical task used to evaluate second-language (L2) writing proficiency on high-stakes language assessments. While automated scoring approaches are mature and have been around for decades, human scoring is still considered the gold standard, despite its high costs and well-known issues such as human rater fatigue and bias. The recent introduction of large language models (LLMs) brings new opportunities for automated scoring. In this paper, we evaluate how well GPT-3.5 and GPT-4 can rate short essay responses written by L2 English learners on a high-stakes language assessment, computing inter-rater agreement with human ratings. Results show that when calibration examples are provided, GPT-4 can perform almost as well as modern Automatic Writing Evaluation (AWE) methods, but agreement with human ratings can vary depending on the test-taker’s first language (L1).

2022

2021

Writing Mentor is a free Google Docs add-on designed to provide feedback to struggling writers and help them improve their writing in a self-paced and self-regulated fashion. Writing Mentor uses natural language processing (NLP) methods and resources to generate feedback in terms of features that research into post-secondary struggling writers has classified as developmental (Burstein et al., 2016b). These features span many writing sub-constructs (use of sources, claims, and evidence; topic development; coherence; and knowledge of English conventions). Prelimi- nary analysis indicates that users have a largely positive impression of Writing Mentor in terms of usability and potential impact on their writing.

pdf bib
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications
Joel Tetreault | Jill Burstein | Ekaterina Kochmar | Claudia Leacock | Helen Yannakoudakis
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

2017

pdf abs
Building Better Open-Source Tools to Support Fairness in Automated Scoring
Nitin Madnani | Anastassia Loukina | Alina von Davier | Jill Burstein | Aoife Cahill
Proceedings of the First ACL Workshop on Ethics in Natural Language Processing

Automated scoring of written and spoken responses is an NLP application that can significantly impact lives especially when deployed as part of high-stakes tests such as the GRE® and the TOEFL®. Ethical considerations require that automated scoring algorithms treat all test-takers fairly. The educational measurement community has done significant research on fairness in assessments and automated scoring systems must incorporate their recommendations. The best way to do that is by making available automated, non-proprietary tools to NLP researchers that directly incorporate these recommendations and generate the analyses needed to help identify and resolve biases in their scoring systems. In this paper, we attempt to provide such a solution.

pdf bib
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications
Joel Tetreault | Jill Burstein | Claudia Leacock | Helen Yannakoudakis
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

pdf abs
Exploring Relationships Between Writing & Broader Outcomes With Automated Writing Evaluation
Jill Burstein | Dan McCaffrey | Beata Beigman Klebanov | Guangming Ling
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

Writing is a challenge, especially for at-risk students who may lack the prerequisite writing skills required to persist in U.S. 4-year postsecondary (college) institutions. Educators teaching postsecondary courses requiring writing could benefit from a better understanding of writing achievement and its role in postsecondary success. In this paper, novel exploratory work examined how automated writing evaluation (AWE) can inform our understanding of the relationship between postsecondary writing skill and broader success outcomes. An exploratory study was conducted using test-taker essays from a standardized writing assessment of postsecondary student learning outcomes. Findings showed that for the essays, AWE features were found to be predictors of broader outcomes measures: college success and learning outcomes measures. Study findings illustrate AWE’s potential to support educational analytics – i.e., relationships between writing skill and broader outcomes – taking a step toward moving AWE beyond writing assessment and instructional use cases.

2016

pdf bib
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications
Joel Tetreault | Jill Burstein | Claudia Leacock | Helen Yannakoudakis
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications

pdf
Enhancing STEM Motivation through Personal and Communal Values: NLP for Assessment of Utility Value in Student Writing
Beata Beigman Klebanov | Jill Burstein | Judith Harackiewicz | Stacy Priniski | Matthew Mulholland
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications

pdf
Language Muse: Automated Linguistic Activity Generation for English Language Learners
Nitin Madnani | Jill Burstein | John Sabatini | Kietha Biggers | Slava Andreyev
Proceedings of ACL-2016 System Demonstrations

2015

pdf bib
Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications
Joel Tetreault | Jill Burstein | Claudia Leacock
Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications

pdf
Scoring Persuasive Essays Using Opinions and their Targets
Noura Farra | Swapna Somasundaran | Jill Burstein
Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications

2014

pdf bib
Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications
Joel Tetreault | Jill Burstein | Claudia Leacock
Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications

pdf
Finding your “Inner-Annotator”: An Experiment in Annotator Independence for Rating Discourse Coherence Quality in Essays
Jill Burstein | Swapna Somasundaran | Martin Chodorow
Proceedings of LAW VIII - The 8th Linguistic Annotation Workshop

pdf
Content Importance Models for Scoring Writing From Sources
Beata Beigman Klebanov | Nitin Madnani | Jill Burstein | Swapna Somasundaran
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf
Lexical Chaining for Measuring Discourse Coherence Quality in Test-taker Essays
Swapna Somasundaran | Jill Burstein | Martin Chodorow
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

pdf
The Far Reach of Multiword Expressions in Educational Technology
Jill Burstein
Proceedings of the 9th Workshop on Multiword Expressions

pdf bib
A User Study: Technology to Increase Teachers’ Linguistic Awareness to Improve Instructional Language Support for English Language Learners
Jill Burstein | John Sabatini | Jane Shore | Brad Moulder | Jennifer Lentini
Proceedings of the Workshop on Natural Language Processing for Improving Textual Accessibility

pdf bib
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications
Joel Tetreault | Jill Burstein | Claudia Leacock
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications

pdf
Automated Scoring of a Summary-Writing Task Designed to Measure Reading Comprehension
Nitin Madnani | Jill Burstein | John Sabatini | Tenaha O’Reilly
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications

pdf abs
Using Pivot-Based Paraphrasing and Sentiment Profiles to Improve a Subjectivity Lexicon for Essay Data
Beata Beigman Klebanov | Nitin Madnani | Jill Burstein
Transactions of the Association for Computational Linguistics, Volume 1

We demonstrate a method of improving a seed sentiment lexicon developed on essay data by using a pivot-based paraphrasing system for lexical expansion coupled with sentiment profile enrichment using crowdsourcing. Profile enrichment alone yields up to 15% improvement in the accuracy of the seed lexicon on 3-way sentence-level sentiment polarity classification of essay data. Using lexical expansion in addition to sentiment profiles provides a further 7% improvement in performance. Additional experiments show that the proposed method is also effective with other subjectivity lexicons and in a different domain of application (product reviews).