2025
Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages
Ruvan Weerasinghe | Isuri Anuradha | Deshan Sumanathilaka
IndoNLP 2025 Shared Task: Romanized Sinhala to Sinhala Reverse Transliteration Using BERT
Sandun Sameera Perera | Lahiru Prabhath Jayakodi | Deshan Koshala Sumanathilaka | Isuri Anuradha
Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages
Romanized text has become popular with the growth of digital communication platforms, largely due to familiarity with English keyboards. In Sri Lanka, Romanized Sinhala, commonly referred to as “Singlish”, is widely used in digital communication. This paper introduces a novel context-aware back-transliteration system designed to address the ad-hoc typing patterns and lexical ambiguity inherent in Singlish. The proposed system combines dictionary-based mapping for Singlish words, rule-based transliteration for out-of-vocabulary words, and a BERT-based language model for resolving lexical ambiguities. Evaluation results demonstrate the robustness of the proposed approach, achieving high BLEU scores along with low Word Error Rate (WER) and Character Error Rate (CER) across test datasets. This study provides an effective solution for Romanized Sinhala back-transliteration and establishes a foundation for improving NLP tools for similar low-resourced languages.
Toponym Resolution: Will Prompt Engineering Change Expectations?
Isuri Anuradha | Deshan Koshala Sumanathilaka | Ruslan Mitkov | Paul Rayson
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
Large Language Models (LLMs) have revolutionised the field of artificial intelligence and have been successfully employed in many disciplines, capturing widespread attention and enthusiasm. Many previous studies have established that domain-specific deep learning models perform competitively with general-purpose LLMs (Maatouk et al., 2024; Lu et al., 2024). However, a suitable prompt which provides direct instructions and background information is expected to yield improved results (Kamruzzaman and Kim, 2024). The present study focuses on utilising LLMs for the Toponym Resolution task by incorporating Retrieval-Augmented Generation (RAG) and prompting techniques to surpass the results of traditional deep learning models. Moreover, this study demonstrates that promising results can be achieved without relying on large amounts of labelled, domain-specific data. In a descriptive comparison of open-source and proprietary LLMs across different prompt engineering techniques, the GPT-4o model performs best among the evaluated LLMs for the Toponym Resolution task.
HoloBERT: Pre-Trained Transformer Model for Historical Narratives
Isuri Anuradha | Le An Ha | Ruslan Mitkov
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
Oral texts often contain spontaneous, unstructured language with features such as disfluencies, colloquialisms, and non-standard syntax. In this paper, we investigate how further pretraining language models with specialised learning objectives for oral and transcribed texts enhances Named Entity Recognition (NER) performance in Holocaust-related discourse. To evaluate our models, we compare the extracted named entities (NEs) against those from other models pretrained on historical texts and from generative AI models such as GPT. Furthermore, we demonstrate practical applications of the recognised NEs by linking them to a knowledge base as structured metadata and representing them in a graph format. With these contributions, our work illustrates how the further-pretrain-and-fine-tune paradigm in Natural Language Processing advances research in Digital Humanities.
2024
Proceedings of the First Workshop on Holocaust Testimonies as Language Resources (HTRes) @ LREC-COLING 2024
Isuri Anuradha | Martin Wynne | Francesca Frontini | Alistair Plum
CHAMUÇA: Towards a Linked Data Language Resource of Portuguese Borrowings in Asian Languages
Anas Fahad Khan | Ana Salgado | Isuri Anuradha | Rute Costa | Chamila Liyanage | John P. McCrae | Atul Kr. Ojha | Priya Rani | Francesca Frontini
Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024
This paper presents the development of CHAMUÇA, a novel lexical resource designed to document the influence of the Portuguese language on various Asian languages, with an initial focus on the languages of South Asia. Through the use of linked open data and the OntoLex vocabulary, CHAMUÇA offers structured insights into the linguistic characteristics and cultural ramifications of Portuguese borrowings across multiple languages. The article outlines CHAMUÇA’s potential contributions to the linguistic linked data community, emphasising its role in addressing the scarcity of resources for lesser-resourced languages and serving as a test case for organising etymological data in a queryable format. CHAMUÇA emerges as an initiative towards the comprehensive cataloguing and analysis of Portuguese borrowings, one based on linked data technology, offering valuable insights into language contact dynamics, historical evolution, and cultural exchange in Asia.
2023
Evaluating of Large Language Models in Relationship Extraction from Unstructured Data: Empirical Study from Holocaust Testimonies
Isuri Anuradha | Le An Ha | Ruslan Mitkov | Vinita Nahar
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing
Relationship extraction from unstructured data remains one of the most challenging tasks in the field of Natural Language Processing (NLP). The complexity of relationship extraction arises from the need to comprehend the underlying semantics, syntactic structures, and contextual dependencies within the text. Unstructured data poses challenges through diverse linguistic patterns, implicit relationships, and contextual nuances, complicating accurate relationship identification and extraction. The emergence of Large Language Models (LLMs), such as GPT (Generative Pre-trained Transformer), has marked a significant advancement in the field of NLP. In this work, we assess and evaluate the effectiveness of LLMs for relationship extraction from Holocaust testimonies within the historical domain. By delving into this domain-specific context, we aim to gain deeper insights into the performance and capabilities of LLMs in accurately capturing and extracting relationships within the Holocaust domain, developing a novel knowledge graph to visualise the relationships of the Holocaust. To the best of our knowledge, there is no existing study which discusses relationship extraction in Holocaust testimonies. The majority of current approaches to Information Extraction (IE) in historical documents are either manual or OCR-based. Moreover, in this study, we found that Subject-Object-Verb extraction using GPT-3-based relations produced more meaningful results compared to Semantic Role Labeling-based triple extraction.