MultiWikiQA: A Reading Comprehension Benchmark in 300+ Languages

Dan Saattrup Smart


Abstract
We introduce a new reading comprehension dataset, dubbed MultiWikiQA, which covers 306 languages and has 1,220,757 samples in total. We start with Wikipedia articles, which also provide the context for the dataset samples, and use an LLM to generate question/answer pairs related to the Wikipedia article, ensuring that the answer appears verbatim within the article. Next, the question is then rephrased to hinder simple word matching methods from performing well on the dataset. We conduct a crowdsourced human evaluation of the fluency of the generated questions, which included 156 respondents across 30 of the languages (both low- and high-resource). All 30 languages received a mean fluency rating above “mostly natural”, showing that the samples are of good quality. We evaluate 6 different language models, both decoder and encoder models of varying sizes, showing that the benchmark is sufficiently difficult and that there is a large performance discrepancy amongst the languages. Both the dataset and survey evaluations are publicly available.
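As an illustration of the verbatim-answer constraint described in the abstract, the following is a minimal sketch of how such a filter could work: a generated question/answer pair is kept only if the answer string occurs verbatim in the source article, with its character offset recorded SQuAD-style. The function name and output fields are illustrative assumptions, not the authors' actual pipeline.

```python
def validate_qa_pair(context: str, question: str, answer: str) -> dict | None:
    """Keep a generated QA pair only if `answer` appears verbatim in `context`."""
    start = context.find(answer)
    if start == -1:
        return None  # answer not found verbatim; discard the sample
    return {
        "context": context,
        "question": question,
        "answers": {"text": [answer], "answer_start": [start]},
    }


if __name__ == "__main__":
    article = "Palma de Mallorca is the capital of the Balearic Islands."
    sample = validate_qa_pair(
        article,
        "Which archipelago has Palma de Mallorca as its capital?",
        "the Balearic Islands",
    )
    print(sample)
```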
Anthology ID:
2026.lrec-main.499
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
SIG:
Publisher:
ELRA Language Resources Association
Note:
Pages:
6298–6311
Language:
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.499/
DOI:
Bibkey:
Cite (ACL):
Dan Saattrup Smart. 2026. MultiWikiQA: A Reading Comprehension Benchmark in 300+ Languages. In Proceedings of the Fifteenth Language Resources and Evaluation Conference, pages 6298–6311, Palma de Mallorca, Spain. ELRA Language Resources Association.
Cite (Informal):
MultiWikiQA: A Reading Comprehension Benchmark in 300+ Languages (Smart, LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.499.pdf