Martim Brandao


2026

Large Language Models (LLMs) have been shown to be vulnerable to various issues of bias and safety, for which new safety alignment techniques have been proposed. In this paper, we investigate the degree to which such techniques improve safety in a non-English language, specifically in Italian, both when they have and don’t have access to safety training data in that language. We evaluate standard mitigation techniques and assess cross-lingual safety transfer by comparing English-only versus bilingual Supervised Fine-Tuning (SFT), on several open-source small LLMs: Qwen3, Llama3.2, and Gemma3. Results confirm a significant cross-lingual safety gap, with most models performing worse in Italian. We find that while prompt engineering is generally effective, the impact of SFT is highly inconsistent. English-only SFT occasionally failed to transfer safety improvements into Italian and even deteriorated the performance of some models. Furthermore, bilingual SFT repeatedly underperformed other mitigation methods. These findings demonstrate that safety alignment does not always generalize across languages and models, and standard mitigation strategies can lead to unpredictable effects. We thus highlight the critical necessity for language-specific evaluation and dedicated multilingual safety research to ensure AI is developed equitably and safely for a global audience.

2025

This research investigates the detection of covert sales tactics in human-chatbot interactions with a focus on the classification of solicited and unsolicited product recommendations. A custom dataset of 630 conversations was generated using a Large Language Model (LLM) to simulate chatbot-user interactions in various contexts, such as when interacting with users from different age groups, recommending different types of products and using different types of sales tactics. We then employ various approaches, including BiLSTM-based classification with sentence and word-level embeddings, as well as zero-shot, few-shot and CoT classification on large state-of-the-art LLMs. Our results show that few-shot GPT4 (86.44%) is the most accurate model on our dataset, followed by our compact SBERT+BiLSTM model (78.63%) - despite its small size.Our work demonstrates the feasibility of implementing oversight algorithms for monitoring chatbot conversations for undesired practices and that such monitoring could potentially be implemented locally on-device to mitigate privacy concerns. This research thus lays the groundwork for the development of auditing and oversight methods for virtual assistants such as chatbots, allowing consumer protection agencies to monitor the ethical use of conversational AI.