Matthew Shardlow


2021

Investigating Text Simplification Evaluation
Laura Vásquez-Rodríguez | Matthew Shardlow | Piotr Przybyła | Sophia Ananiadou
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

SemEval-2021 Task 1: Lexical Complexity Prediction
Matthew Shardlow | Richard Evans | Gustavo Henrique Paetzold | Marcos Zampieri
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This paper presents the results and main findings of SemEval-2021 Task 1: Lexical Complexity Prediction. We provided participants with an augmented version of the CompLex Corpus (Shardlow et al. 2020). CompLex is an English multi-domain corpus in which words and multi-word expressions (MWEs) were annotated with respect to their complexity using a five-point Likert scale. SemEval-2021 Task 1 featured two sub-tasks: Sub-task 1 focused on single words and Sub-task 2 focused on MWEs. The competition attracted 198 teams in total, of which 54 teams submitted official runs on the test data to Sub-task 1 and 37 to Sub-task 2.

Manchester Metropolitan at SemEval-2021 Task 1: Convolutional Networks for Complex Word Identification
Robert Flynn | Matthew Shardlow
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

We present two convolutional neural networks for predicting the complexity of words and phrases in context on a continuous scale. Both models utilize word and character embeddings alongside lexical features as inputs. Our system displays reasonable results with a Pearson correlation of 0.7754 on the task as a whole. We highlight the limitations of this method in properly assessing the context of the target text, and explore the effectiveness of both systems across a range of genres. Both models were submitted as part of LCP 2021, which focuses on the identification of complex words and phrases as a context-dependent, regression-based task.

2020

CompLex — A New Corpus for Lexical Complexity Prediction from Likert Scale Data
Matthew Shardlow | Michael Cooper | Marcos Zampieri
Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI)

Predicting which words are considered hard to understand for a given target population is a vital step in many NLP applications such as text simplification. This task is commonly referred to as Complex Word Identification (CWI). With a few exceptions, previous studies have approached the task as a binary classification task in which systems predict a complexity value (complex vs. non-complex) for a set of target words in a text. This choice is motivated by the fact that all CWI datasets compiled so far have been annotated using a binary annotation scheme. Our paper addresses this limitation by presenting the first English dataset for continuous lexical complexity prediction. We use a 5-point Likert scale scheme to annotate complex words in texts from three sources/domains: the Bible, Europarl, and biomedical texts. This resulted in a corpus of 9,476 sentences each annotated by around 7 annotators.

Multi-Word Lexical Simplification
Piotr Przybyła | Matthew Shardlow
Proceedings of the 28th International Conference on Computational Linguistics

In this work we propose the task of multi-word lexical simplification, in which a sentence in natural language is made easier to understand by replacing its fragment with a simpler alternative, both of which can consist of many words. In order to explore this new direction, we contribute a corpus (MWLS1), including 1462 sentences in English from various sources with 7059 simplifications provided by human annotators. We also propose an automatic solution (Plainifier) based on a purpose-trained neural language model and evaluate its performance, comparing to human and resource-based baselines.

Detecting Multiword Expression Type Helps Lexical Complexity Assessment
Ekaterina Kochmar | Sian Gooding | Matthew Shardlow
Proceedings of the 12th Language Resources and Evaluation Conference

Multiword expressions (MWEs) represent lexemes that should be treated as single lexical units due to their idiosyncratic nature. Multiple NLP applications have been shown to benefit from MWE identification; however, research on the lexical complexity of MWEs is still an under-explored area. In this work, we re-annotate the Complex Word Identification Shared Task 2018 dataset of Yimam et al. (2017), which provides complexity scores for a range of lexemes, with the types of MWEs. We release the MWE-annotated dataset with this paper, and we believe this dataset represents a valuable resource for the text simplification community. In addition, we investigate which types of expressions are most problematic for native and non-native readers. Finally, we show that a lexical complexity assessment system benefits from the information about MWE types.

CombiNMT: An Exploration into Neural Text Simplification Models
Michael Cooper | Matthew Shardlow
Proceedings of the 12th Language Resources and Evaluation Conference

This work presents a replication study of Exploring Neural Text Simplification Models (Nisioi et al., 2017). We were able to successfully replicate and extend the methods presented in the original paper. Alongside the replication results, we present our improvements, dubbed CombiNMT, which use an updated implementation of OpenNMT, incorporate the Newsela corpus alongside the original Wikipedia dataset (Hwang et al., 2016), and refine both datasets to select high-quality training examples. Our work presents two new systems: CombiNMT995, which is a result of matched sentences with a cosine similarity of 0.995 or less, and CombiNMT98, which, similarly, runs on a cosine similarity of 0.98 or less. We extend the human evaluation presented within the original paper, increasing both the number of annotators and the number of sentences annotated, with the intention of increasing the quality of the results. In this evaluation, CombiNMT998 shows significant improvement over any of the Neural Text Simplification (NTS) systems from the original paper in terms of both the number of changes and the percentage of correct changes made.

2019

Neural Text Simplification of Clinical Letters with a Domain Specific Phrase Table
Matthew Shardlow | Raheel Nawaz
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Clinical letters are infamously impenetrable for the lay patient. This work uses neural text simplification methods to automatically improve the understandability of clinical letters for patients. We take existing neural text simplification software and augment it with a new phrase table that links complex medical terminology to simpler vocabulary by mining SNOMED-CT. In an evaluation task using crowdsourcing, we show that the results of our new system are ranked easier to understand (average rank 1.93) than using the original system (2.34) without our phrase table. We also show improvement against baselines including the original text (2.79) and using the phrase table without the neural text simplification software (2.94). Our methods can easily be transferred outside of the clinical domain by using domain-appropriate resources to provide effective neural text simplification for any domain without the need for costly annotation.

2018

Manchester Metropolitan at SemEval-2018 Task 2: Random Forest with an Ensemble of Features for Predicting Emoji in Tweets
Luciano Gerber | Matthew Shardlow
Proceedings of The 12th International Workshop on Semantic Evaluation

We present our submission to the SemEval-2018 task on emoji prediction. We used a random forest, with an ensemble of bag-of-words, sentiment and psycholinguistic features. Although we performed well on the trial dataset (attaining a macro f-score of 63.185 for English and 81.381 for Spanish), our approach did not perform as well on the test data. We describe our features and classification protocol, as well as initial experiments, concluding with a discussion of the discrepancy between our trial and test results.

A New Corpus to Support Text Mining for the Curation of Metabolites in the ChEBI Database
Matthew Shardlow | Nhung Nguyen | Gareth Owen | Claire O’Donovan | Andrew Leach | John McNaught | Steve Turner | Sophia Ananiadou
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

NaCTeM at SemEval-2016 Task 1: Inferring sentence-level semantic similarity from an ensemble of complementary lexical and sentence-level features
Piotr Przybyła | Nhung T. H. Nguyen | Matthew Shardlow | Georgios Kontonatsios | Sophia Ananiadou
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2014

Out in the Open: Finding and Categorising Errors in the Lexical Simplification Pipeline
Matthew Shardlow
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Lexical simplification is the task of automatically reducing the complexity of a text by identifying difficult words and replacing them with simpler alternatives. Whilst this is a valuable application of natural language generation, rudimentary lexical simplification systems suffer from a high error rate which often results in nonsensical, non-simple text. This paper seeks to characterise and quantify the errors which occur in a typical baseline lexical simplification system. We expose 6 distinct categories of error and propose a classification scheme for these. We also quantify these errors for a moderate size corpus, showing the magnitude of each error type. We find that for 183 identified simplification instances, only 19 (10.38%) result in a valid simplification, with the rest causing errors of varying gravity.

2013

The CW Corpus: A New Resource for Evaluating the Identification of Complex Words
Matthew Shardlow
Proceedings of the Second Workshop on Predicting and Improving Text Readability for Target Reader Populations

A Comparison of Techniques to Automatically Identify Complex Words
Matthew Shardlow
51st Annual Meeting of the Association for Computational Linguistics Proceedings of the Student Research Workshop