Akintunde Oladipo

2023

Noticing the urgent need to provide tools for fast and user-friendly qualitative analysis of large-scale textual corpora of the modern NLP, we propose to turn to the mature and well-tested methods from the domain of Information Retrieval (IR) - a research field with a long history of tackling TB-scale document collections. We discuss how Pyserini - a widely used toolkit for reproducible IR research can be integrated with the Hugging Face ecosystem of open-source AI libraries and artifacts. We leverage the existing functionalities of both platforms while proposing novel features further facilitating their integration. Our goal is to give NLP researchers tools that will allow them to develop retrieval-based instrumentation for their data analytics needs with ease and agility.We include a Jupyter Notebook-based walk through the core interoperability features, available on GitHub: https://github.com/huggingface/gaia.We then demonstrate how the ideas we present can be operationalized to create a powerful tool for qualitative data analysis in NLP. We present GAIA Search - a search engine built following previously laid out principles, giving access to four popular large-scale text collections. GAIA serves a dual purpose of illustrating the potential of methodologies we discuss but also as a standalone qualitative analysis tool that can be leveraged by NLP researchers aiming to understand datasets prior to using them in training. GAIA is hosted live on Hugging Face Spaces: https://huggingface.co/spaces/spacerini/gaia.

2022

pdf abs
AfriTeVA: Extending ?Small Data? Pretraining Approaches to Sequence-to-Sequence Models
Odunayo Jude Ogundepo | Akintunde Oladipo | Mofetoluwa Adeyemi | Kelechi Ogueji | Jimmy Lin
Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing

Pretrained language models represent the state of the art in NLP, but the successful construction of such models often requires large amounts of data and computational resources. Thus, the paucity of data for low-resource languages impedes the development of robust NLP capabilities for these languages. There has been some recent success in pretraining encoderonly models solely on a combination of lowresource African languages, exemplified by AfriBERTa. In this work, we extend the approach of “small data” pretraining to encoder– decoder models. We introduce AfriTeVa, a family of sequence-to-sequence models derived from T5 that are pretrained on 10 African languages from scratch. With a pretraining corpus of only around 1GB, we show that it is possible to achieve competitive downstream effectiveness for machine translation and text classification, compared to larger models trained on much more data. All the code and model checkpoints described in this work are publicly available at https://github.com/castorini/afriteva.

Co-authors

Hailey Schoelkopf 1

Stella Biderman 1

Martin Potthast 1

Odunayo Jude Ogundepo 1

Mofetoluwa Adeyemi 1

Kelechi Ogueji 1

Venues

acl1
deeplo1