Stephanie Hirmer

2021

Building Representative Corpora from Illiterate Communities: A Review of Challenges and Mitigation Strategies for Developing Countries
Stephanie Hirmer | Alycia Leonard | Josephine Tumwesige | Costanza Conforti
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Most well-established data collection methods currently adopted in NLP depend on the assumption of speaker literacy. Consequently, the collected corpora largely fail to represent swathes of the global population, which tend to be some of the most vulnerable and marginalised people in society, and often live in rural developing areas. Such underrepresented groups are thus not only ignored when making modeling and system design decisions, but also prevented from benefiting from development outcomes achieved through data-driven NLP. This paper aims to address the under-representation of illiterate communities in NLP corpora: we identify potential biases and ethical issues that might arise when collecting data from rural communities with high illiteracy rates in Low-Income Countries, and propose a set of practical mitigation strategies to help future work.

2020

Natural Language Processing for Achieving Sustainable Development: the Case of Neural Labelling to Enhance Community Profiling
Costanza Conforti | Stephanie Hirmer | Dai Morgan | Marco Basaldella | Yau Ben Or
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

In recent years, there has been an increasing interest in the application of Artificial Intelligence – and especially Machine Learning – to the field of Sustainable Development (SD). However, until now, NLP has not been systematically applied in this context. In this paper, we show the high potential of NLP to enhance project sustainability. In particular, we focus on the case of community profiling in developing countries, where, in contrast to the developed world, a notable data gap exists. Here, NLP could help to address the cost and time barrier of structuring qualitative data that prohibits its widespread use and associated benefits. We propose the new extreme multi-class multi-label Automatic User-Perceived Value classification task. We release Stories2Insights, an expert-annotated dataset of interviews carried out in Uganda, we provide a detailed corpus analysis, and we implement a number of strong neural baselines to address the task. Experimental results show that the problem is challenging, and leaves considerable room for future research at the intersection of NLP and SD.

Co-authors

Yau Ben Or 1

Venues

EACL 1
EMNLP 1