Paula Buttery


The Specificity and Helpfulness of Peer-to-Peer Feedback in Higher Education
Roman Rietsche | Andrew Caines | Cornelius Schramm | Dominik Pfütze | Paula Buttery
Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)

With the growth of online learning through MOOCs and other educational applications, it has become increasingly difficult for course providers to offer personalized feedback to students. Therefore asking students to provide feedback to each other has become one way to support learning. This peer-to-peer feedback has become increasingly important whether in MOOCs to provide feedback to thousands of students or in large-scale classes at universities. One of the challenges when allowing peer-to-peer feedback is that the feedback should be perceived as helpful, and an import factor determining helpfulness is how specific the feedback is. However, in classes including thousands of students, instructors do not have the resources to check the specificity of every piece of feedback between students. Therefore, we present an automatic classification model to measure sentence specificity in written feedback. The model was trained and tested on student feedback texts written in German where sentences have been labelled as general or specific. We find that we can automatically classify the sentences with an accuracy of 76.7% using a conventional feature-based approach, whereas transfer learning with BERT for German gives a classification accuracy of 81.1%. However, the feature-based approach comes with lower computational costs and preserves human interpretability of the coefficients. In addition we show that specificity of sentences in feedback texts has a weak positive correlation with perceptions of helpfulness. This indicates that specificity is one of the ingredients of good feedback, and invites further investigation.

ALEN App: Argumentative Writing Support To Foster English Language Learning
Thiemo Wambsganss | Andrew Caines | Paula Buttery
Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)

This paper introduces a novel tool to support and engage English language learners with feedback on the quality of their argument structures. We present an approach which automatically detects claim-premise structures and provides visual feedback to the learner to prompt them to repair any broken argumentation structures.To investigate, if our persuasive feedback on language learners’ essay writing tasks engages and supports them in learning better English language, we designed the ALEN app (Argumentation for Learning English). We leverage an argumentation mining model trained on texts written by students and embed it in a writing support tool which provides students with feedback in their essay writing process. We evaluated our tool in two field-studies with a total of 28 students from a German high school to investigate the effects of adaptive argumentation feedback on their learning of English. The quantitative results suggest that using the ALEN app leads to a high self-efficacy, ease-of-use, intention to use and perceived usefulness for students in their English language learning process. Moreover, the qualitative answers indicate the potential benefits of combining grammar feedback with discourse level argumentation mining.

Towards an open-domain chatbot for language practice
Gladys Tyen | Mark Brenchley | Andrew Caines | Paula Buttery
Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)

State-of-the-art chatbots for English are now able to hold conversations on virtually any topic (e.g. Adiwardana et al., 2020; Roller et al., 2021). However, existing dialogue systems in the language learning domain still use hand-crafted rules and pattern matching, and are much more limited in scope. In this paper, we make an initial foray into adapting open-domain dialogue generation for second language learning. We propose and implement decoding strategies that can adjust the difficulty level of the chatbot according to the learner’s needs, without requiring further training of the chatbot. These strategies are then evaluated using judgements from human examiners trained in language education. Our results show that re-ranking candidate outputs is a particularly effective strategy, and performance can be further improved by adding sub-token penalties and filtering.

CEPOC: The Cambridge Exams Publishing Open Cloze dataset
Mariano Felice | Shiva Taslimipoor | Øistein E. Andersen | Paula Buttery
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Open cloze tests are a standard type of exercise where examinees must complete a text by filling in the gaps without any given options to choose from. This paper presents the Cambridge Exams Publishing Open Cloze (CEPOC) dataset, a collection of open cloze tests from world-renowned English language proficiency examinations. The tests in CEPOC have been expertly designed and validated using standard principles in language research and assessment. They are prepared for language learners at different proficiency levels and hence classified into different CEFR levels (A2, B1, B2, C1, C2). This resource can be a valuable testbed for various NLP tasks. We perform a complete set of experiments on three tasks: gap filling, gap prediction, and CEFR text classification. We implement transformer-based systems based on pre-trained language models to model each task and use our dataset as a test set, providing promising benchmark results.

Probing for targeted syntactic knowledge through grammatical error detection
Christopher Davis | Christopher Bryant | Andrew Caines | Marek Rei | Paula Buttery
Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL)

Targeted studies testing knowledge of subject-verb agreement (SVA) indicate that pre-trained language models encode syntactic information. We assert that if models robustly encode subject-verb agreement, they should be able to identify when agreement is correct and when it is incorrect. To that end, we propose grammatical error detection as a diagnostic probe to evaluate token-level contextual representations for their knowledge of SVA. We evaluate contextual representations at each layer from five pre-trained English language models: BERT, XLNet, GPT-2, RoBERTa and ELECTRA. We leverage public annotated training data from both English second language learners and Wikipedia edits, and report results on manually crafted stimuli for subject-verb agreement. We find that masked language models linearly encode information relevant to the detection of SVA errors, while the autoregressive models perform on par with our baseline. However, we also observe a divergence in performance when probes are trained on different training sets, and when they are evaluated on different syntactic constructions, suggesting the information pertaining to SVA error detection is not robustly encoded.

The Teacher-Student Chatroom Corpus version 2: more lessons, new annotation, automatic detection of sequence shifts
Andrew Caines | Helen Yannakoudakis | Helen Allen | Pascual Pérez-Paredes | Bill Byrne | Paula Buttery
Proceedings of the 11th Workshop on NLP for Computer Assisted Language Learning

Constructing Open Cloze Tests Using Generation and Discrimination Capabilities of Transformers
Mariano Felice | Shiva Taslimipoor | Paula Buttery
Findings of the Association for Computational Linguistics: ACL 2022

This paper presents the first multi-objective transformer model for generating open cloze tests that exploits generation and discrimination capabilities to improve performance. Our model is further enhanced by tweaking its loss function and applying a post-processing re-ranking algorithm that improves overall test structure. Experiments using automatic and human evaluation show that our approach can achieve up to 82% accuracy according to experts, outperforming previous work and baselines. We also release a collection of high-quality open cloze tests along with sample system output and human annotations that can serve as a future benchmark.


Efficient Unsupervised NMT for Related Languages with Cross-Lingual Language Models and Fidelity Objectives
Rami Aly | Andrew Caines | Paula Buttery
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects

The most successful approach to Neural Machine Translation (NMT) when only monolingual training data is available, called unsupervised machine translation, is based on back-translation where noisy translations are generated to turn the task into a supervised one. However, back-translation is computationally very expensive and inefficient. This work explores a novel, efficient approach to unsupervised NMT. A transformer, initialized with cross-lingual language model weights, is fine-tuned exclusively on monolingual data of the target language by jointly learning on a paraphrasing and denoising autoencoder objective. Experiments are conducted on WMT datasets for German-English, French-English, and Romanian-English. Results are competitive to strong baseline unsupervised NMT models, especially for closely related source languages (German) compared to more distant ones (Romanian, French), while requiring about a magnitude less training time.


pdf bib
The Teacher-Student Chatroom Corpus
Andrew Caines | Helen Yannakoudakis | Helena Edmondson | Helen Allen | Pascual Pérez-Paredes | Bill Byrne | Paula Buttery
Proceedings of the 9th Workshop on NLP for Computer Assisted Language Learning

Investigating the effect of auxiliary objectives for the automated grading of learner English speech transcriptions
Hannah Craighead | Andrew Caines | Paula Buttery | Helen Yannakoudakis
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We address the task of automatically grading the language proficiency of spontaneous speech based on textual features from automatic speech recognition transcripts. Motivated by recent advances in multi-task learning, we develop neural networks trained in a multi-task fashion that learn to predict the proficiency level of non-native English speakers by taking advantage of inductive transfer between the main task (grading) and auxiliary prediction tasks: morpho-syntactic labeling, language modeling, and native language identification (L1). We encode the transcriptions with both bi-directional recurrent neural networks and with bi-directional representations from transformers, compare against a feature-rich baseline, and analyse performance at different proficiency levels and with transcriptions of varying error rates. Our best performance comes from a transformer encoder with L1 prediction as an auxiliary task. We discuss areas for improvement and potential applications for text-only speech scoring.

Detecting Trending Terms in Cybersecurity Forum Discussions
Jack Hughes | Seth Aycock | Andrew Caines | Paula Buttery | Alice Hutchings
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

We present a lightweight method for identifying currently trending terms in relation to a known prior of terms, using a weighted log-odds ratio with an informative prior. We apply this method to a dataset of posts from an English-language underground hacking forum, spanning over ten years of activity, with posts containing misspellings, orthographic variation, acronyms, and slang. Our statistical approach supports analysis of linguistic change and discussion topics over time, without a requirement to train a topic model for each time interval for analysis. We evaluate the approach by comparing the results to TF-IDF using the discounted cumulative gain metric with human annotations, finding our method outperforms TF-IDF on information retrieval.

REPROLANG 2020: Automatic Proficiency Scoring of Czech, English, German, Italian, and Spanish Learner Essays
Andrew Caines | Paula Buttery
Proceedings of the Twelfth Language Resources and Evaluation Conference

We report on our attempts to reproduce the work described in Vajjala & Rama 2018, ‘Experiments with universal CEFR classification’, as part of REPROLANG 2020: this involves featured-based and neural approaches to essay scoring in Czech, German and Italian. Our results are broadly in line with those from the original paper, with some differences due to the stochastic nature of machine learning and programming language used. We correct an error in the reported metrics, introduce new baselines, apply the experiments to English and Spanish corpora, and generate adversarial data to test classifier robustness. We conclude that feature-based approaches perform better than neural network classifiers for text datasets of this size, though neural network modifications do bring performance closer to the best feature-based models.

Grammatical error detection in transcriptions of spoken English
Andrew Caines | Christian Bentz | Kate Knill | Marek Rei | Paula Buttery
Proceedings of the 28th International Conference on Computational Linguistics

We describe the collection of transcription corrections and grammatical error annotations for the CrowdED Corpus of spoken English monologues on business topics. The corpus recordings were crowdsourced from native speakers of English and learners of English with German as their first language. The new transcriptions and annotations are obtained from different crowdworkers: we analyse the 1108 new crowdworker submissions and propose that they can be used for automatic transcription post-editing and grammatical error correction for speech. To further explore the data we train grammatical error detection models with various configurations including pre-trained and contextual word representations as input, additional features and auxiliary objectives, and extra training data from written error-annotated corpora. We find that a model concatenating pre-trained and contextual word representations as input performs best, and that additional information does not lead to further performance gains.


Entropy as a Proxy for Gap Complexity in Open Cloze Tests
Mariano Felice | Paula Buttery
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

This paper presents a pilot study of entropy as a measure of gap complexity in open cloze tests aimed at learners of English. Entropy is used to quantify the information content in each gap, which can be used to estimate complexity. Our study shows that average gap entropy correlates positively with proficiency levels while individual gap entropy can capture contextual complexity. To the best of our knowledge, this is the first unsupervised information-theoretical approach to evaluating the quality of cloze tests.

CAMsterdam at SemEval-2019 Task 6: Neural and graph-based feature extraction for the identification of offensive tweets
Guy Aglionby | Chris Davis | Pushkar Mishra | Andrew Caines | Helen Yannakoudakis | Marek Rei | Ekaterina Shutova | Paula Buttery
Proceedings of the 13th International Workshop on Semantic Evaluation

We describe the CAMsterdam team entry to the SemEval-2019 Shared Task 6 on offensive language identification in Twitter data. Our proposed model learns to extract textual features using a multi-layer recurrent network, and then performs text classification using gradient-boosted decision trees (GBDT). A self-attention architecture enables the model to focus on the most relevant areas in the text. In order to enrich input representations, we use node2vec to learn globally optimised embeddings for hashtags, which are then given as additional features to the GBDT classifier. Our best model obtains 78.79% macro F1-score on detecting offensive language (subtask A), 66.32% on categorising offence types (targeted/untargeted; subtask B), and 55.36% on identifying the target of offence (subtask C).


Aggressive language in an online hacking forum
Andrew Caines | Sergio Pastrana | Alice Hutchings | Paula Buttery
Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)

We probe the heterogeneity in levels of abusive language in different sections of the Internet, using an annotated corpus of Wikipedia page edit comments to train a binary classifier for abuse detection. Our test data come from the CrimeBB Corpus of hacking-related forum posts and we find that (a) forum interactions are rarely abusive, (b) the abusive language which does exist tends to be relatively mild compared to that found in the Wikipedia comments domain, and tends to involve aggressive posturing rather than hate speech or threats of violence. We observe that the purpose of conversations in online forums tend to be more constructive and informative than those in Wikipedia page edit comments which are geared more towards adversarial interactions, and that this may explain the lower levels of abuse found in our forum data than in Wikipedia comments. Further work remains to be done to compare these results with other inter-domain classification experiments, and to understand the impact of aggressive language in forum conversations.


A Text Normalisation System for Non-Standard English Words
Emma Flint | Elliot Ford | Olivia Thomas | Andrew Caines | Paula Buttery
Proceedings of the 3rd Workshop on Noisy User-generated Text

This paper investigates the problem of text normalisation; specifically, the normalisation of non-standard words (NSWs) in English. Non-standard words can be defined as those word tokens which do not have a dictionary entry, and cannot be pronounced using the usual letter-to-phoneme conversion rules; e.g. lbs, 99.3%, #EMNLP2017. NSWs pose a challenge to the proper functioning of text-to-speech technology, and the solution is to spell them out in such a way that they can be pronounced appropriately. We describe our four-stage normalisation system made up of components for detection, classification, division and expansion of NSWs. Performance is favourabe compared to previous work in the field (Sproat et al. 2001, Normalization of non-standard words), as well as state-of-the-art text-to-speech software. Further, we update Sproat et al.’s NSW taxonomy, and create a more customisable system where users are able to input their own abbreviations and specify into which variety of English (currently available: British or American) they wish to normalise.

Parsing transcripts of speech
Andrew Caines | Michael McCarthy | Paula Buttery
Proceedings of the Workshop on Speech-Centric Natural Language Processing

We present an analysis of parser performance on speech data, comparing word type and token frequency distributions with written data, and evaluating parse accuracy by length of input string. We find that parser performance tends to deteriorate with increasing length of string, more so for spoken than for written texts. We train an alternative parsing model with added speech data and demonstrate improvements in accuracy on speech-units, with no deterioration in performance on written text.

Collecting fluency corrections for spoken learner English
Andrew Caines | Emma Flint | Paula Buttery
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

We present crowdsourced collection of error annotations for transcriptions of spoken learner English. Our emphasis in data collection is on fluency corrections, a more complete correction than has traditionally been aimed for in grammatical error correction research (GEC). Fluency corrections require improvements to the text, taking discourse and utterance level semantics into account: the result is a more naturalistic, holistic version of the original. We propose that this shifted emphasis be reflected in a new name for the task: ‘holistic error correction’ (HEC). We analyse crowdworker behaviour in HEC and conclude that the method is useful with certain amendments for future work.


Crowdsourcing a Multi-lingual Speech Corpus: Recording, Transcription and Annotation of the CrowdIS Corpora
Andrew Caines | Christian Bentz | Calbert Graham | Tim Polzehl | Paula Buttery
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We announce the release of the CROWDED CORPUS: a pair of speech corpora collected via crowdsourcing, containing a native speaker corpus of English (CROWDED_ENGLISH), and a corpus of German/English bilinguals (CROWDED_BILINGUAL). Release 1 of the CROWDED CORPUS contains 1000 recordings amounting to 33,400 tokens collected from 80 speakers and is freely available to other researchers. We recruited participants via the Crowdee application for Android. Recruits were prompted to respond to business-topic questions of the type found in language learning oral tests. We then used the CrowdFlower web application to pass these recordings to crowdworkers for transcription and annotation of errors and sentence boundaries. Finally, the sentences were tagged and parsed using standard natural language processing tools. We propose that crowdsourcing is a valid and economical method for corpus collection, and discuss the advantages and disadvantages of this approach.

Predicting Author Age from Weibo Microblog Posts
Wanru Zhang | Andrew Caines | Dimitrios Alikaniotis | Paula Buttery
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Binary file summaries/958.html matches

Automated speech-unit delimitation in spoken learner English
Russell Moore | Andrew Caines | Calbert Graham | Paula Buttery
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In order to apply computational linguistic analyses and pass information to downstream applications, transcriptions of speech obtained via automatic speech recognition (ASR) need to be divided into smaller meaningful units, in a task we refer to as ‘speech-unit (SU) delimitation’. We closely recreate the automatic delimitation system described by Lee and Glass (2012), ‘Sentence detection using multiple annotations’, Proceedings of INTERSPEECH, which combines a prosodic model, language model and speech-unit length model in log-linear fashion. Since state-of-the-art natural language processing (NLP) tools have been developed to deal with written text and its characteristic sentence-like units, SU delimitation helps bridge the gap between ASR and NLP, by normalising spoken data into a more canonical format. Previous work has focused on native speaker recordings; we test the system of Lee and Glass (2012) on non-native speaker (or ‘learner’) data, achieving performance above the state-of-the-art. We also consider alternative evaluation metrics which move away from the idea of a single ‘truth’ in SU delimitation, and frame this work in the context of downstream NLP applications.


Towards a computational model of grammaticalization and lexical diversity
Christian Bentz | Paula Buttery
Proceedings of the 5th Workshop on Cognitive Aspects of Computational Language Learning (CogACLL)

The effect of disfluencies and learner errors on the parsing of spoken learner language
Andrew Caines | Paula Buttery
Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages


Reclassifying subcategorization frames for experimental analysis and stimulus generation
Paula Buttery | Andrew Caines
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Researchers in the fields of psycholinguistics and neurolinguistics increasingly test their experimental hypotheses against probabilistic models of language. VALEX (Korhonen et al., 2006) is a large-scale verb lexicon that specifies verb usage as probability distributions over a set of 163 verb SUBCATEGORIZATION FRAMES (SCFs). VALEX has proved to be a popular computational linguistic resource and may also be used by psycho- and neurolinguists for experimental analysis and stimulus generation. However, a probabilistic model based upon a set of 163 SCFs often proves too fine grained for experimenters in these fields. Our goal is to simplify the classification by grouping the frames into genera―explainable clusters that may be used as experimental parameters. We adopted two methods for reclassification. One was a manual linguistic approach derived from verb argumentation and clause features; the other was an automatic, computational approach driven from a graphical representation of SCFs. The premise was not only to compare the results of two quite different methods for our own interest, but also to enable other researchers to choose whichever reclassification better suited their purpose (one being grounded purely in theoretical linguistics and the other in practical language engineering). The various classifications are available as an online resource to researchers.

Annotating progressive aspect constructions in the spoken section of the British National Corpus
Andrew Caines | Paula Buttery
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present a set of stand-off annotations for the ninety thousand sentences in the spoken section of the British National Corpus (BNC) which feature a progressive aspect verb group. These annotations may be matched to the original BNC text using the supplied document and sentence identifiers. The annotated features mostly relate to linguistic form: subject type, subject person and number, form of auxiliary verb, and clause type, tense and polarity. In addition, the sentences are classified for register, the formality of recording context: three levels of `spontaneity' with genres such as sermons and scripted speech at the most formal level and casual conversation at the least formal. The resource has been designed so that it may easily be augmented with further stand-off annotations. Expert linguistic annotations of spoken data, such as these, are valuable for improving the performance of natural language processing tools in the spoken language domain and assist linguistic research in general.


You Talking to Me? A Predictive Model for Zero Auxiliary Constructions
Andrew Caines | Paula Buttery
Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground

LIPS: A Tool for Predicting the Lexical Isolation Point of a Word
Andrew Thwaites | Jeroen Geertzen | William D. Marslen-Wilson | Paula Buttery
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present LIPS (Lexical Isolation Point Software), a tool for accurate lexical isolation point (IP) prediction in recordings of speech. The IP is the point in time in which a word is correctly recognised given the acoustic evidence available to the hearer. The ability to accurately determine lexical IPs is of importance to work in the field of cognitive processing, since it enables the evaluation of competing models of word recognition. IPs are also of importance in the field of neurolinguistics, where the analyses of high-temporal-resolution neuroimaging data require a precise time alignment of the observed brain activity with the linguistic input. LIPS provides an attractive alternative to costly multi-participant perception experiments by automatically computing IPs for arbitrary words. On a test set of words, the LIPS system predicts IPs with a mean difference from the actual IP of within 1ms. The difference from the predicted and actual IP approximate to a normal distribution with a standard deviation of around 80ms (depending on the model used).

The Cambridge Cookie-Theft Corpus: A Corpus of Directed and Spontaneous Speech of Brain-Damaged Patients and Healthy Individuals
Caroline Williams | Andrew Thwaites | Paula Buttery | Jeroen Geertzen | Billi Randall | Meredith Shafto | Barry Devereux | Lorraine Tyler
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Investigating differences in linguistic usage between individuals who have suffered brain injury (hereafter patients) and those who haven’t can yield a number of benefits. It provides a better understanding about the precise way in which impairments affect patients’ language, improves theories of how the brain processes language, and offers heuristics for diagnosing certain types of brain damage based on patients’ speech. One method for investigating usage differences involves the analysis of spontaneous speech. In the work described here we construct a text corpus consisting of transcripts of individuals’ speech produced during two tasks: the Boston-cookie-theft picture description task (Goodglass and Kaplan, 1983) and a spontaneous speech task, which elicits a semi-prompted monologue, and/or free speech. Interviews with patients from 19yrs to 89yrs were transcribed, as were interviews with a comparable number of healthy individuals (20yrs to 89yrs). Structural brain images are available for approximately 30% of participants. This unique data source provides a rich resource for future research in many areas of language impairment and has been constructed to facilitate analysis with natural language processing and corpus linguistics techniques.


Biomedical Event Extraction without Training Data
Andreas Vlachos | Paula Buttery | Diarmuid Ó Séaghdha | Ted Briscoe
Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task


pdf bib
Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition
Paula Buttery | Aline Villavicencio | Anna Korhonen
Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition

I will shoot your shopping down and you can shoot all my tins—Automatic Lexical Acquisition from the CHILDES Database
Paula Buttery | Anna Korhonen
Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition


pdf bib
A Quantitative Evaluation of Naturalistic Models of Language Acquisition; the Efficiency of the Triggering Learning Algorithm Compared to a Categorial Grammar Learner
Paula Buttery
Proceedings of the Workshop on Psycho-Computational Models of Human Language Acquisition