Sandaru Seneviratne


2024

pdf
ANU at MLSP-2024: Prompt-based Lexical Simplification for English and Sinhala
Sandaru Seneviratne | Hanna Suominen
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

Lexical simplification, the process of simplifying complex content in text without any modifications to the syntactical structure of text, plays a crucial role in enhancing comprehension and accessibility. This paper presents an approach to lexical simplification that relies on the capabilities of generative Artificial Intelligence (AI) models to predict the complexity of words and substitute complex words with simpler alternatives. Early lexical simplification methods predominantly relied on rule-based approaches, transitioning gradually to machine learning and deep learning techniques, leveraging contextual embeddings from large language models. However, the the emergence of generative AI models revolutionized the landscape of natural language processing, including lexical simplification. In this study, we proposed a straightforward yet effective method that employs generative AI models for both predicting lexical complexity and generating appropriate substitutions. To predict lexical complexity, we adopted three distinct types of prompt templates, while for lexical substitution, we employed three prompt templates alongside an ensemble approach. Extending our experimentation to include both English and Sinhala data, our approach demonstrated comparable performance across both languages, with particular strengths in lexical substitution.

2023

pdf
TextSimplifier: A Modular, Extensible, and Context Sensitive Simplification Framework for Improved Natural Language Understanding
Sandaru Seneviratne | Eleni Daskalaki | Hanna Suominen
Proceedings of the Second Workshop on Text Simplification, Accessibility and Readability

Natural language understanding is fundamental to knowledge acquisition in today’s information society. However, natural language is often ambiguous with frequent occurrences of complex terms, acronyms, and abbreviations that require substitution and disambiguation, for example, by “translation” from complex to simpler text for better understanding. These tasks are usually difficult for people with limited reading skills, second language learners, and non-native speakers. Hence, the development of text simplification systems that are capable of simplifying complex text is of paramount importance. Thus, we conducted a user study to identify which components are essential in a text simplification system. Based on our findings, we proposed an improved text simplification framework, covering a broader range of aspects related to lexical simplification — from complexity identification to lexical substitution and disambiguation — while supplementing the simplified outputs with additional information for better understandability. Based on the improved framework, we developed TextSimplifier, a modularised, context-sensitive, end-to-end simplification framework, and engineered its web implementation. This system targets lexical simplification that identifies complex terms and acronyms followed by their simplification through substitution and disambiguation for better understanding of complex language.

2022

pdf
Improving Text Simplification with Factuality Error Detection
Yuan Ma | Sandaru Seneviratne | Elena Daskalaki
Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022)

In the past few years, the field of text simplification has been dominated by supervised learning approaches thanks to the appearance of large parallel datasets such as Wikilarge and Newsela. However, these datasets suffer from sentence pairs with factuality errors which compromise the models’ performance. So, we proposed a model-independent factuality error detection mechanism, considering bad simplification and bad alignment, to refine the Wikilarge dataset through reducing the weight of these samples during training. We demonstrated that this approach improved the performance of the state-of-the-art text simplification model TST5 by an FKGL reduction of 0.33 and 0.29 on the TurkCorpus and ASSET testing datasets respectively. Our study illustrates the impact of erroneous samples in TS datasets and highlights the need for automatic methods to improve their quality.

pdf
CILS at TSAR-2022 Shared Task: Investigating the Applicability of Lexical Substitution Methods for Lexical Simplification
Sandaru Seneviratne | Elena Daskalaki | Hanna Suominen
Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022)

Lexical simplification — which aims to simplify complex text through the replacement of difficult words using simpler alternatives while maintaining the meaning of the given text — is popular as a way of improving text accessibility for both people and computers. First, lexical simplification through substitution can improve the understandability of complex text for, for example, non-native speakers, second language learners, and people with low literacy. Second, its usefulness has been demonstrated in many natural language processing problems like data augmentation, paraphrase generation, or word sense induction. In this paper, we investigated the applicability of existing unsupervised lexical substitution methods based on pre-trained contextual embedding models and WordNet, which incorporate Context Information, for Lexical Simplification (CILS). Although the performance of this CILS approach has been outstanding in lexical substitution tasks, its usefulness was limited at the TSAR-2022 shared task on lexical simplification. Consequently, a minimally supervised approach with careful tuning to a given simplification task may work better than unsupervised methods. Our investigation also encouraged further work on evaluating the simplicity of potential candidates and incorporating them into the lexical simplification methods.

pdf
m-Networks: Adapting the Triplet Networks for Acronym Disambiguation
Sandaru Seneviratne | Elena Daskalaki | Artem Lenskiy | Hanna Suominen
Proceedings of the 4th Clinical Natural Language Processing Workshop

Acronym disambiguation (AD) is the process of identifying the correct expansion of the acronyms in text. AD is crucial in natural language understanding of scientific and medical documents due to the high prevalence of technical acronyms and the possible expansions. Given that natural language is often ambiguous with more than one meaning for words, identifying the correct expansion for acronyms requires learning of effective representations for words, phrases, acronyms, and abbreviations based on their context. In this paper, we proposed an approach to leverage the triplet networks and triplet loss which learns better representations of text through distance comparisons of embeddings. We tested both the triplet network-based method and the modified triplet network-based method with m networks on the AD dataset from the SDU@AAAI-21 AD task, CASI dataset, and MeDAL dataset. F scores of 87.31%, 70.67%, and 75.75% were achieved by the m network-based approach for SDU, CASI, and MeDAL datasets respectively indicating that triplet network-based methods have comparable performance but with only 12% of the number of parameters in the baseline method. This effective implementation is available at https://github.com/sandaruSen/m_networks under the MIT license.

pdf
CILex: An Investigation of Context Information for Lexical Substitution Methods
Sandaru Seneviratne | Elena Daskalaki | Artem Lenskiy | Hanna Suominen
Proceedings of the 29th International Conference on Computational Linguistics

Lexical substitution, which aims to generate substitutes for a target word given a context, is an important natural language processing task useful in many applications. Due to the paucity of annotated data, existing methods for lexical substitution tend to rely on manually curated lexical resources and contextual word embedding models. Methods based on lexical resources are likely to miss relevant substitutes whereas relying only on contextual word embedding models fails to provide adequate information on the impact of a substitute in the entire context and the overall meaning of the input. We proposed CILex, which uses contextual sentence embeddings along with methods that capture additional context information complimenting contextual word embeddings for lexical substitution. This ensured the semantic consistency of a substitute with the target word while maintaining the overall meaning of the sentence. Our experimental comparisons with previously proposed methods indicated that our solution is now the state-of-the-art on both the widely used LS07 and CoInCo datasets with P@1 scores of 55.96% and 57.25% for lexical substitution. The implementation of the proposed approach is available at https://github.com/sandaruSen/CILex under the MIT license.