AfriCLIRMatrix: Enabling Cross-Lingual Information Retrieval for African Languages
Odunayo Ogundepo, Xinyu Zhang, Shuo Sun, Kevin Duh, Jimmy Lin
Abstract
Language diversity in NLP is critical in enabling the development of tools for a wide range of users.However, there are limited resources for building such tools for many languages, particularly those spoken in Africa.For search, most existing datasets feature few or no African languages, directly impacting researchers’ ability to build and improve information access capabilities in those languages.Motivated by this, we created AfriCLIRMatrix, a test collection for cross-lingual information retrieval research in 15 diverse African languages.In total, our dataset contains 6 million queries in English and 23 million relevance judgments automatically mined from Wikipedia inter-language links, covering many more African languages than any existing information retrieval test collection.In addition, we release BM25, dense retrieval, and sparse–dense hybrid baselines to provide a starting point for the development of future systems.We hope that these efforts can spur additional work in search for African languages.AfriCLIRMatrix can be downloaded at https://github.com/castorini/africlirmatrix.- Anthology ID:
- 2022.emnlp-main.597
- Volume:
- Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
- Month:
- December
- Year:
- 2022
- Address:
- Abu Dhabi, United Arab Emirates
- Editors:
- Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
- Venue:
- EMNLP
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 8721–8728
- Language:
- URL:
- https://aclanthology.org/2022.emnlp-main.597
- DOI:
- 10.18653/v1/2022.emnlp-main.597
- Cite (ACL):
- Odunayo Ogundepo, Xinyu Zhang, Shuo Sun, Kevin Duh, and Jimmy Lin. 2022. AfriCLIRMatrix: Enabling Cross-Lingual Information Retrieval for African Languages. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8721–8728, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Cite (Informal):
- AfriCLIRMatrix: Enabling Cross-Lingual Information Retrieval for African Languages (Ogundepo et al., EMNLP 2022)
- PDF:
- https://preview.aclanthology.org/naacl24-info/2022.emnlp-main.597.pdf