Namrata Mukhija


2026

AI ethics guidelines for humanitarian settings have grown in number and scope. Whether they produce their intended outcomes depends on which deployers are expected to follow them. These guidelines respond to documented risks: surveillance, data misuse, and discriminatory outcomes affecting refugee populations. For high-risk applications such as biometric identification and asylum adjudication, the concerns they address are genuine. Many differentiate risk tiers in principle, yet the compliance expectations they establish (staff capacity, technical infrastructure, formal evaluation) reflect the organizational contexts in which they were developed. Many nonprofits providing frontline services to refugees operate with limited administrative capacity. When compliance requirements exceed what these organizations can meet, formal AI adoption stalls, while informal adoption proceeds without oversight or recourse. Current guidelines also tend to treat non-adoption as a neutral default, without accounting for the service gaps that follow when AI-assisted language access is unavailable. Drawing on collaboration with refugee-serving practitioners, we show that this gap between governance design and organizational reality has consequences for the people these guidelines are meant to protect. Evaluating AI guidelines, we argue, requires the same realist logic that evaluation research has long applied to social programs: not "does this guideline exist?" but "for which deployers, and under what conditions, does it produce its intended protective outcomes?"

2023

While paraphrasing is a promising approach for data augmentation in classification tasks, its effect on named entity recognition (NER) has not been investigated systematically because of the difficulty of preserving span-level labels. In this paper, we use simple strategies to annotate entity spans in generated paraphrases and compare established and novel paraphrasing methods, including back translation, specialized encoder-decoder models such as Pegasus, and GPT-3 variants, for their effectiveness in improving downstream NER performance across different levels of gold annotations and paraphrasing strength on five datasets. We thoroughly explore the influence of the paraphraser, and the dynamics between paraphrasing strength and gold dataset size, on NER performance with visualizations and statistical testing. We find that the choice of paraphraser greatly impacts NER performance: one of the larger GPT-3 variants is particularly capable of generating high-quality paraphrases, yielding statistically significant improvements in NER performance with increasing paraphrasing strength, while other paraphrasers show more mixed results. Additionally, inline auto-annotations generated by the larger GPT-3 variant are strictly better than heuristic-based annotations. We also find diminishing benefits of paraphrasing as gold annotations increase for most datasets. Furthermore, while most paraphrasers promote entity memorization in NER, the proposed GPT-3 configuration performs most favorably among the compared paraphrasers when tested on unseen entities, with memorization decreasing further as paraphrasing strength increases. Finally, we explore mention replacement using GPT-3, which provides additional benefits over base paraphrasing for specific datasets.
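The abstract does not spell out how entity spans are re-annotated in a paraphrase, so the following is only an illustrative sketch of one possible heuristic, not the authors' implementation. It assumes whitespace-tokenized text, BIO tags, and exact case-insensitive matching of the gold mentions; the function name `project_entities` and the example sentences are hypothetical.

```python
from typing import List, Tuple


def project_entities(
    paraphrase_tokens: List[str],
    entities: List[Tuple[str, str]],
) -> List[str]:
    """Assign BIO tags to paraphrase tokens by exact mention matching.

    `entities` holds (mention text, label) pairs taken from the gold
    sentence. This is an assumed heuristic for illustration only.
    """
    tags = ["O"] * len(paraphrase_tokens)
    lowered = [tok.lower() for tok in paraphrase_tokens]

    # Try longer mentions first so a shorter mention cannot claim their tokens.
    for mention, label in sorted(entities, key=lambda e: -len(e[0].split())):
        mention_tokens = mention.lower().split()
        n = len(mention_tokens)
        if n == 0:
            continue
        for i in range(len(lowered) - n + 1):
            window_is_free = all(tag == "O" for tag in tags[i:i + n])
            if lowered[i:i + n] == mention_tokens and window_is_free:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"
    return tags


# Hypothetical gold sentence: "Acme Corp is headquartered in New York"
gold_entities = [("Acme Corp", "ORG"), ("New York", "LOC")]
paraphrase = "The headquarters of Acme Corp is in New York".split()
paraphrase_tags = project_entities(paraphrase, gold_entities)
```

A heuristic of this kind silently loses the label whenever the paraphraser rewords a mention (for example, "Paris" becoming "the French capital"), which is one way to read the difficulty of span-level label preservation that the abstract describes.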