A Topic-Aligned Multilingual Corpus of Wikipedia Articles for Studying Information Asymmetry in Low Resource Languages

Dwaipayan Roy, Sumit Bhatia, Prateek Jain


Abstract
Wikipedia is the largest web-based open encyclopedia covering more than three hundred languages. However, different language editions of Wikipedia differ significantly in terms of their information coverage. We present a systematic comparison of information coverage in English Wikipedia (most exhaustive) and Wikipedias in eight other widely spoken languages (Arabic, German, Hindi, Korean, Portuguese, Russian, Spanish and Turkish). We analyze the content present in the respective Wikipedias in terms of the coverage of topics as well as the depth of coverage of topics included in these Wikipedias. Our analysis quantifies and provides useful insights about the information gap that exists between different language editions of Wikipedia and offers a roadmap for the IR community to bridge this gap.
Anthology ID:
2020.lrec-1.289
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
SIG:
Publisher:
European Language Resources Association
Note:
Pages:
2373–2380
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.289
DOI:
Bibkey:
Cite (ACL):
Dwaipayan Roy, Sumit Bhatia, and Prateek Jain. 2020. A Topic-Aligned Multilingual Corpus of Wikipedia Articles for Studying Information Asymmetry in Low Resource Languages. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2373–2380, Marseille, France. European Language Resources Association.
Cite (Informal):
A Topic-Aligned Multilingual Corpus of Wikipedia Articles for Studying Information Asymmetry in Low Resource Languages (Roy et al., LREC 2020)
Copy Citation:
PDF:
https://preview.aclanthology.org/auto-file-uploads/2020.lrec-1.289.pdf