Sunaya Upadhyay
2025
NativQA: Multilingual Culturally-Aligned Natural Query for LLMs
Md. Arid Hasan | Maram Hasanain | Fatema Ahmad | Sahinur Rahman Laskar | Sunaya Upadhyay | Vrunda N Sukhadia | Mucahid Kutlu | Shammur Absar Chowdhury | Firoj Alam
Findings of the Association for Computational Linguistics: ACL 2025
Natural Question Answering (QA) datasets play a crucial role in evaluating the capabilities of large language models (LLMs) and ensuring their effectiveness in real-world applications. Although numerous QA datasets have been developed, and some related work has been done in parallel, there remains a notable lack of both a framework and large-scale, region-specific datasets built from queries by native users in their own languages. This gap hinders effective benchmarking and the development of models fine-tuned for regional and cultural specificities. In this study, we propose a scalable, language-independent framework, NativQA, to seamlessly construct culturally and regionally aligned QA datasets in native languages for LLM evaluation and tuning. We demonstrate the efficacy of the proposed framework by designing a multilingual natural QA dataset, MultiNativQA, consisting of approximately 64K manually annotated QA pairs in seven languages, ranging from high- to extremely low-resource, based on queries from native speakers from 9 regions covering 18 topics. We benchmark both open- and closed-source LLMs using the MultiNativQA dataset. The dataset and related experimental scripts are publicly available for the community at https://huggingface.co/datasets/QCRI/MultiNativQA and https://gitlab.com/nativqa/multinativqa.