Brian Riordan


Using PRMSE to evaluate automated scoring systems in the presence of label noise
Anastassia Loukina | Nitin Madnani | Aoife Cahill | Lili Yao | Matthew S. Johnson | Brian Riordan | Daniel F. McCaffrey
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications

The effect of noisy labels on the performance of NLP systems has been studied extensively for system training. In this paper, we focus on the effect that noisy labels have on system evaluation. Using automated scoring as an example, we demonstrate that the quality of human ratings used for system evaluation has a substantial impact on traditional performance metrics, making it impossible to compare system evaluations on labels with different quality. We propose that a new metric, PRMSE, developed within the educational measurement community, can help address this issue, and provide practical guidelines on using PRMSE.

An empirical investigation of neural methods for content scoring of science explanations
Brian Riordan | Sarah Bichler | Allison Bradford | Jennifer King Chen | Korah Wiley | Libby Gerard | Marcia C. Linn
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications

With the widespread adoption of the Next Generation Science Standards (NGSS), science teachers and online learning environments face the challenge of evaluating students’ integration of different dimensions of science learning. Recent advances in representation learning in natural language processing have proven effective across many natural language processing tasks, but a rigorous evaluation of the relative merits of these methods for scoring complex constructed response formative assessments has not previously been carried out. We present a detailed empirical investigation of feature-based, recurrent neural network, and pre-trained transformer models on scoring content in real-world formative assessment data. We demonstrate that recent neural methods can rival or exceed the performance of feature-based methods. We also provide evidence that different classes of neural models take advantage of different learning cues, and pre-trained transformer models may be more robust to spurious, dataset-specific learning cues, better reflecting scoring rubrics.

Context-based Automated Scoring of Complex Mathematical Responses
Aoife Cahill | James H Fife | Brian Riordan | Avijit Vajpayee | Dmytro Galochkin
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications

The tasks of automatically scoring either textual or algebraic responses to mathematical questions have both been well-studied, albeit separately. In this paper we propose a method for automatically scoring responses that contain both text and algebraic expressions. Our method not only achieves high agreement with human raters, but also links explicitly to the scoring rubric – essentially providing explainable models and a way to potentially provide feedback to students in the future.

Don’t take “nswvtnvakgxpm” for an answer – The surprising vulnerability of automatic content scoring systems to adversarial input
Yuning Ding | Brian Riordan | Andrea Horbach | Aoife Cahill | Torsten Zesch
Proceedings of the 28th International Conference on Computational Linguistics

Automatic content scoring systems are widely used on short answer tasks to save human effort. However, the use of these systems can invite cheating strategies, such as students writing irrelevant answers in the hopes of gaining at least partial credit. We generate adversarial answers for benchmark content scoring datasets based on different methods of increasing sophistication and show that even simple methods lead to a surprising decrease in content scoring performance. As an extreme example, up to 60% of adversarial answers generated from random shuffling of words in real answers are accepted by a state-of-the-art scoring system. In addition to analyzing the vulnerabilities of content scoring systems, we examine countermeasures such as adversarial training and show that these measures improve system robustness against adversarial answers considerably but do not suffice to completely solve the problem.


How to account for mispellings: Quantifying the benefit of character representations in neural content scoring models
Brian Riordan | Michael Flor | Robert Pugh
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

Character-based representations in neural models have been claimed to be a tool to overcome spelling variation in word token-based input. We examine this claim in neural models for content scoring. We formulate precise hypotheses about the possible effects of adding character representations to word-based models and test these hypotheses on large-scale real world content scoring datasets. We find that, while character representations may provide small performance gains in general, their effectiveness in accounting for spelling variation may be limited. We show that spelling correction can provide larger gains than character representations, and that spelling correction improves the performance of models with character representations. With these insights, we report a new state of the art on the ASAP-SAS content scoring dataset.


Atypical Inputs in Educational Applications
Su-Youn Yoon | Aoife Cahill | Anastassia Loukina | Klaus Zechner | Brian Riordan | Nitin Madnani
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)

In large-scale educational assessments, the use of automated scoring has recently become quite common. While the majority of student responses can be processed and scored without difficulty, there are a small number of responses that have atypical characteristics that make it difficult for an automated scoring system to assign a correct score. We describe a pipeline that detects and processes these kinds of responses at run-time. We present the most frequent kinds of what are called non-scorable responses along with effective filtering models based on various NLP and speech processing technologies. We give an overview of two operational automated scoring systems, one for essay scoring and one for speech scoring, and describe the filtering models they use. Finally, we present an evaluation and analysis of filtering models used for spoken responses in an assessment of language proficiency.

A Semantic Role-based Approach to Open-Domain Automatic Question Generation
Michael Flor | Brian Riordan
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

We present a novel rule-based system for automatic generation of factual questions from sentences, using semantic role labeling (SRL) as the main form of text analysis. The system is capable of generating both wh-questions and yes/no questions from the same semantic analysis. We present an extensive evaluation of the system and compare it to a recent neural network architecture for question generation. The SRL-based system outperforms the neural system in both average quality and variety of generated questions.


Investigating neural architectures for short answer scoring
Brian Riordan | Andrea Horbach | Aoife Cahill | Torsten Zesch | Chong Min Lee
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

Neural approaches to automated essay scoring have recently shown state-of-the-art performance. The automated essay scoring task typically involves a broad notion of writing quality that encompasses content, grammar, organization, and conventions. This differs from the short answer content scoring task, which focuses on content accuracy. The inputs to neural essay scoring models – ngrams and embeddings – are arguably well-suited to evaluate content in short answer scoring tasks. We investigate how several basic neural approaches similar to those used for automated essay scoring perform on short answer scoring. We show that neural architectures can outperform a strong non-neural baseline, but performance and optimal parameter settings vary across the more diverse types of prompts typical of short answer scoring.


Automatically Scoring Tests of Proficiency in Music Instruction
Nitin Madnani | Aoife Cahill | Brian Riordan
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications

Evaluating Argumentative and Narrative Essays using Graphs
Swapna Somasundaran | Brian Riordan | Binod Gyawali | Su-Youn Yoon
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

This work investigates whether the development of ideas in writing can be captured by graph properties derived from the text. Focusing on student essays, we represent the essay as a graph, and encode a variety of graph properties including PageRank as features for modeling essay scores related to quality of development. We demonstrate that our approach improves on a state-of-the-art system on the task of holistic scoring of persuasive essays and on the task of scoring narrative essays along the development dimension.


Detecting Sociostructural Beliefs about Group Status Differences in Online Discussions
Brian Riordan | Heather Wade | Afzal Upal
Proceedings of the Joint Workshop on Social Dynamics and Personal Attributes in Social Media