Rui Sousa-Silva

2025

MOZ-Smishing: A Benchmark Dataset for Detecting Mobile Money Frauds
Felermino D. M. A. Ali | Henrique Lopes Cardoso | Rui Sousa-Silva | Saide.saide@unilurio.ac.mz
Proceedings of the Sixth Workshop on African Natural Language Processing (AfricaNLP 2025)

Despite the increasing prevalence of smishing attacks targeting Mobile Money Transfer systems, there is a notable lack of publicly available SMS phishing datasets in this domain. This study seeks to address this gap by creating a specialized dataset designed to detect smishing attacks aimed at Mobile Money Transfer users. The data set consists of crowd-sourced text messages from Mozambican mobile users, meticulously annotated into two categories: legitimate messages (ham) and fraudulent smishing attempts (spam). The messages are written in Portuguese, often incorporating microtext styles and linguistic nuances unique to the Mozambican context. We also investigate the effectiveness of LLMs in detecting smishing. Using in-context learning approaches, we evaluate the models’ ability to identify smishing attempts without requiring extensive task-specific training. The data set is released under an open license at the following link: huggingface-Anonymous

Evaluating machine translation (MT) quality for under-resourced African languages remains a significant challenge, as existing metrics often suffer from limited language coverage and poor performance in low-resource settings. While recent efforts, such as AfriCOMET, have addressed some of the issues, they are still constrained by small evaluation sets, a lack of publicly available training data tailored to African languages, and inconsistent performance in extremely low-resource scenarios. In this work, we introduce SSA-MTE, a large-scale human-annotated MT evaluation (MTE) dataset covering 13 African language pairs from the News domain, with over 63,000 sentence-level annotations from a diverse set of MT systems. Based on this data, we develop SSA-COMET and SSA-COMET-QE, improved reference-based and reference-free evaluation metrics. We also benchmark prompting-based approaches using state-of-the-art LLMs like GPT-4o and Claude. Our experimental results show that SSA-COMET models significantly outperform AfriCOMET and are competitive with the strongest LLM (Gemini 2.5 Pro) evaluated in our study, particularly on low-resource languages such as Twi, Luo, and Yoruba. All resources are released under open licenses to support future research.

Leveraging Loanword Constraints for Improving Machine Translation in a Low-Resource Multilingual Context
Felermino D. M. A. Ali | Henrique Lopes Cardoso | Rui Sousa-Silva
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

This research investigates how to improve machine translation systems for low-resource languages by integrating loanword constraints as external linguistic knowledge. Focusing on the Portuguese-Emakhuwa language pair, which exhibits significant lexical borrowing, we address the challenge of effectively adapting loanwords during the translation process. To tackle this, we propose a novel approach that augments source sentences with loanword constraints, explicitly linking source-language loanwords to their target-language equivalents. Then, we perform supervised fine-tuning on multilingual neural machine translation models and multiple Large Language Models of different sizes. Our results demonstrate that incorporating loanword constraints leads to significant improvements in translation quality as well as in handling loanword adaptation correctly in target languages, as measured by different machine translation metrics. This approach offers a promising direction for improving machine translation performance in low-resource settings characterized by frequent lexical borrowing.

This paper presents the evaluation of submissions to the WMT 2025 Metrics Shared Task on the SSA-MTE challenge set, a large-scale benchmark for machine translation evaluation (MTE) in Sub-Saharan African languages. The SSA-MTE test sets contain over 12,768 human-annotated adequacy scores across 11 language pairs sourced from English, French, and Portuguese, spanning 6 commercial and open-source MT systems. Results show that correlations with human judgments remain generally low, with most systems falling below the 0.4 Spearman threshold for medium-level agreement. Performance varies widely across language pairs, with most correlations under 0.4; in some extremely low-resource cases, such as Portuguese–Emakhuwa, correlations drop to around 0.1, underscoring the difficulty of evaluating MT for very low-resource African languages. These findings highlight the urgent need for more research on robust, generalizable MT evaluation methods tailored for African languages.

2024

Building Resources for Emakhuwa: Machine Translation and News Classification Benchmarks
Felermino D. M. A. Ali | Henrique Lopes Cardoso | Rui Sousa-Silva
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

This paper introduces a comprehensive collection of NLP resources for Emakhuwa, Mozambique’s most widely spoken language. The resources include the first manually translated news bitext corpus between Portuguese and Emakhuwa, news topic classification datasets, and monolingual data. We detail the process and challenges of acquiring this data and present benchmark results for machine translation and news topic classification tasks. Our evaluation examines the impact of different data types—originally clean text, post-corrected OCR, and back-translated data—and the effects of fine-tuning from pre-trained models, including those focused on African languages. Our benchmarks demonstrate good performance in news topic classification and promising results in machine translation. We fine-tuned multilingual encoder-decoder models using real and synthetic data and evaluated them on our test set and the FLORES evaluation sets. The results highlight the importance of incorporating more data and the potential for future improvements. All models, code, and datasets are available in the https://huggingface.co/LIACC repository under the CC BY 4.0 license.

Detecting Loanwords in Emakhuwa: An Extremely Low-Resource Bantu Language Exhibiting Significant Borrowing from Portuguese
Felermino Dario Mario Ali | Henrique Lopes Cardoso | Rui Sousa-Silva
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The accurate identification of loanwords within a given text holds significant potential as a valuable tool for addressing data augmentation and mitigating data sparsity issues. Such identification can improve the performance of various natural language processing tasks, particularly in the context of low-resource languages that lack standardized spelling conventions. This research proposes a supervised method to identify loanwords in Emakhuwa, borrowed from Portuguese. Our methodology encompasses a two-fold approach. Firstly, we employ traditional machine learning algorithms incorporating handcrafted features, including language-specific and similarity-based features. We build upon prior studies to extract similarity features and propose utilizing two external resources: a Sequence-to-Sequence model and a dictionary. This innovative approach allows us to identify loanwords solely by analyzing the target word without prior knowledge about its donor counterpart. Furthermore, we fine-tune the pre-trained CANINE model for the downstream task of loanword detection, which culminates in the impressive achievement of the F1-score of 93%. To the best of our knowledge, this study is the first of its kind focusing on Emakhuwa, and the preliminary results are promising as they pave the way to further advancements.

Proceedings of the First LUHME Workshop
Rui Sousa-Silva | Henrique Lopes Cardoso | Maarit Koponen | Antonio Pareja Lora | Márta Seresi
Proceedings of the First LUHME Workshop

Fighting Cyber-malice: A Forensic Linguistics Approach to Detecting AI-generated Malicious Texts
Rui Sousa-Silva
Proceedings of the First International Conference on Natural Language Processing and Artificial Intelligence for Cyber Security

Technology has long been used for criminal purposes, but the technological developments of the last decades have allowed users to remain anonymous online, which in turn increased the volume and heterogeneity of cybercrimes and made it more difficult for law enforcement agencies to detect and fight them. However, as they ignore the very nature of language, cybercriminals tend to overlook the potential of linguistic analysis to positively identify them by the language that they use. Forensic linguistics research and practice has therefore proven reliable in fighting cybercrime, either by analysing authorship to confirm or reject the law enforcement agents’ suspicions, or by sociolinguistically profiling the author of the cybercriminal communications to provide the investigators with sociodemographic information to help guide the investigation. However, large language models and generative AI have raised new challenges: not only has cybercrime increased as a result of AI-generated texts, but also generative AI makes it more difficult for forensic linguists to attribute the authorship of the texts to the perpetrators. This paper argues that, although a shift of focus is required, forensic linguistics plays a core role in detecting and fighting cybercrime. A focus on deep linguistic features, rather than low-level and purely stylistic elements, has the potential to discriminate between human- and AI-generated texts and provide the investigation with vital information. We conclude by discussing the foreseeable future limitations, especially resulting from the developments expected from language models.

Network-based Approach for Stopwords Detection
Felermino D. M. A. Ali | Gabriel de Jesus | Henrique Lopes Cardoso | Sérgio Nunes | Rui Sousa-Silva
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 2

Expanding FLORES+ Benchmark for More Low-Resource Settings: Portuguese-Emakhuwa Machine Translation Evaluation
Felermino Dario Mario Ali | Henrique Lopes Cardoso | Rui Sousa-Silva
Proceedings of the Ninth Conference on Machine Translation

As part of the Open Language Data Initiative shared tasks, we have expanded the FLORES+ evaluation set to include Emakhuwa, a low-resource language widely spoken in Mozambique. We translated the dev and devtest sets from Portuguese into Emakhuwa, and we detail the translation process and quality assurance measures used. Our methodology involved various quality checks, including post-editing and adequacy assessments. The resulting datasets consist of multiple reference sentences for each source. We present baseline results from training a Neural Machine Translation system and fine-tuning existing multilingual translation models. Our findings suggest that spelling inconsistencies remain a challenge in Emakhuwa. Additionally, the baseline models underperformed on this evaluation set, underscoring the necessity for further research to enhance machine translation quality for Emakhuwa. The data is publicly available at https://huggingface.co/datasets/LIACC/Emakhuwa-FLORES

2022

Interest in argument mining has resulted in an increasing number of argument annotated corpora. However, most focus on English texts with explicit argumentative discourse markers, such as persuasive essays or legal documents. Conversely, we report on the first extensive and consolidated Portuguese argument annotation project focused on opinion articles. We briefly describe the annotation guidelines based on a multi-layered process and analyze the manual annotations produced, highlighting the main challenges of this textual genre. We then conduct a comprehensive inter-annotator agreement analysis, including argumentative discourse units, their classes and relations, and resulting graphs. This analysis reveals that each of these aspects tackles very different kinds of challenges. We observe differences in annotator profiles, motivating our aim of producing a non-aggregated corpus containing the insights of every annotator. We note that the interpretation and identification of token-level arguments is challenging; nevertheless, tasks that focus on higher-level components of the argument structure can obtain considerable agreement. We lay down perspectives on corpus usage, exploiting its multi-faceted nature.

2019

Team Fernando-Pessa at SemEval-2019 Task 4: Back to Basics in Hyperpartisan News Detection
André Cruz | Gil Rocha | Rui Sousa-Silva | Henrique Lopes Cardoso
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper describes our submission to the SemEval 2019 Hyperpartisan News Detection task. Our system aims for a linguistics-based document classification from a minimal set of interpretable features, while maintaining good performance. To this end, we follow a feature-based approach and perform several experiments with different machine learning classifiers. Additionally, we explore feature importances and distributions among the two classes. On the main task, our model achieved an accuracy of 71.7%, which was improved after the task’s end to 72.9%. We also participate in the meta-learning sub-task, for classifying documents with the binary classifications of all submitted systems as input, achieving an accuracy of 89.9%.
