2024
pdf
abs
Detecting Cybercrimes in Accordance with Pakistani Law: Dataset and Evaluation Using PLMs
Faizad Ullah
|
Ali Faheem
|
Ubaid Azam
|
Muhammad Sohaib Ayub
|
Faisal Kamiran
|
Asim Karim
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Cybercrime is a serious and growing threat affecting millions of people worldwide. Detecting cybercrimes from text messages is challenging, as it requires understanding the linguistic and cultural nuances of different languages and regions. Roman Urdu is a widely used language in Pakistan and other South Asian countries, however, it lacks sufficient resources and tools for natural language processing and cybercrime detection. To address this problem, we make three main contributions in this paper. (1) We create and release CRU, a benchmark dataset for text-based cybercrime detection in Roman Urdu, which covers a number of cybercrimes as defined by the Prevention of Electronic Crimes Act (PECA) of Pakistan. This dataset is annotated by experts following a standardized procedure based on Pakistan’s legal framework. (2) We perform experiments on four pre-trained language models (PLMs) for cybercrime text classification in Roman Urdu. Our results show that xlm-roberta-base is the best model for this task, achieving the highest performance on all metrics. (3) We explore the utility of prompt engineering techniques, namely prefix and cloze prompts, for enhancing the performance of PLMs for low-resource languages such as Roman Urdu. We analyze the impact of different prompt shapes and k-shot settings on the performance of xlm-roberta-base and bert-base-multilingual-cased. We find that prefix prompts are more effective than cloze prompts for Roman Urdu classification tasks, as they provide more contextually relevant completions for the models. Our work provides useful insights and resources for future research on cybercrime detection and text classification in low-resource languages.
pdf
abs
UrduMASD: A Multimodal Abstractive Summarization Dataset for Urdu
Ali Faheem
|
Faizad Ullah
|
Muhammad Sohaib Ayub
|
Asim Karim
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
In this era of multimedia dominance, the surge of multimodal content on social media has transformed our methods of communication and information exchange. With the widespread use of multimedia content, the ability to effectively summarize this multimodal content is crucial for enhancing consumption, searchability, and retrieval. The scarcity of such training datasets has been a barrier to research in this area, especially for low-resource languages like Urdu. To address this gap, this paper introduces “UrduMASD”, a video-based Urdu multimodal abstractive text summarization dataset. The dataset contains 15,374 collections of videos, audio, titles, transcripts, and corresponding text summaries. To ensure the quality of the dataset, intrinsic evaluation metrics such as Abstractivity, Compression, Redundancy, and Semantic coherence have been employed. It was observed that our dataset surpasses existing datasets on numerous key quality metrics. Additionally, we present baseline results achieved using both text-based and state-of-the-art multimodal summarization models. On adding visual information, an improvement of 2.6% was observed in the ROUGE scores, highlighting the efficacy of utilizing multimodal inputs for summarization. To the best of our knowledge, this is the first dataset in Urdu that provides video-based multimodal data for abstractive text summarization, making it a valuable resource for advancing research in this field.
2023
pdf
abs
Comparing Prompt-Based and Standard Fine-Tuning for Urdu Text Classification
Faizad Ullah
|
Ubaid Azam
|
Ali Faheem
|
Faisal Kamiran
|
Asim Karim
Findings of the Association for Computational Linguistics: EMNLP 2023
Recent advancements in natural language processing have demonstrated the efficacy of pre-trained language models for various downstream tasks through prompt-based fine-tuning. In contrast to standard fine-tuning, which relies solely on labeled examples, prompt-based fine-tuning combines a few labeled examples (few shot) with guidance through prompts tailored for the specific language and task. For low-resource languages, where labeled examples are limited, prompt-based fine-tuning appears to be a promising alternative. In this paper, we compare prompt-based and standard fine-tuning for the popular task of text classification in Urdu and Roman Urdu languages. We conduct experiments using five datasets, covering different domains, and pre-trained multilingual transformers. The results reveal that significant improvement of up to 13% in accuracy is achieved by prompt-based fine-tuning over standard fine-tuning approaches. This suggests the potential of prompt-based fine-tuning as a valuable approach for low-resource languages with limited labeled data.
2022
pdf
abs
Exploring Data Augmentation Strategies for Hate Speech Detection in Roman Urdu
Ubaid Azam
|
Hammad Rizwan
|
Asim Karim
Proceedings of the Thirteenth Language Resources and Evaluation Conference
In an era where social media platform users are growing rapidly, there has been a marked increase in hateful content being generated; to combat this, automatic hate speech detection systems are a necessity. For this purpose, researchers have recently focused their efforts on developing datasets, however, the vast majority of them have been generated for the English language, with only a few available for low-resource languages such as Roman Urdu. Furthermore, what few are available have small number of samples that pertain to hateful classes and these lack variations in topics and content. Thus, deep learning models trained on such datasets perform poorly when deployed in the real world. To improve performance the option of collecting and annotating more data can be very costly and time consuming. Thus, data augmentation techniques need to be explored to exploit already available datasets to improve model generalizability. In this paper, we explore different data augmentation techniques for the improvement of hate speech detection in Roman Urdu. We evaluate these augmentation techniques on two datasets. We are able to improve performance in the primary metric of comparison (F1 and Macro F1) as well as in recall, which is impertinent for human-in-the-loop AI systems.
2020
pdf
abs
Hate-Speech and Offensive Language Detection in Roman Urdu
Hammad Rizwan
|
Muhammad Haroon Shakeel
|
Asim Karim
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
The task of automatic hate-speech and offensive language detection in social media content is of utmost importance due to its implications in unprejudiced society concerning race, gender, or religion. Existing research in this area, however, is mainly focused on the English language, limiting the applicability to particular demographics. Despite its prevalence, Roman Urdu (RU) lacks language resources, annotated datasets, and language models for this task. In this study, we: (1) Present a lexicon of hateful words in RU, (2) Develop an annotated dataset called RUHSOLD consisting of 10,012 tweets in RU with both coarse-grained and fine-grained labels of hate-speech and offensive language, (3) Explore the feasibility of transfer learning of five existing embedding models to RU, (4) Propose a novel deep learning architecture called CNN-gram for hate-speech and offensive language detection and compare its performance with seven current baseline approaches on RUHSOLD dataset, and (5) Train domain-specific embeddings on more than 4.7 million tweets and make them publicly available. We conclude that transfer learning is more beneficial as compared to training embedding from scratch and that the proposed model exhibits greater robustness as compared to the baselines.
2015
pdf
An Unsupervised Method for Discovering Lexical Variations in Roman Urdu Informal Text
Abdul Rafae
|
Abdul Qayyum
|
Muhammad Moeenuddin
|
Asim Karim
|
Hassan Sajjad
|
Faisal Kamiran
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing
2012
pdf
MIKE: An Interactive Microblogging Keyword Extractor using Contextual Semantic Smoothing
Osama Khan
|
Asim Karim
Proceedings of COLING 2012: Demonstration Papers