# Multilingual Wikipedia Corpus (MWC)

MWC is a multilingual extension of the WikiText-2 corpus released by MetaMind. The texts are collected from Wikipedia in seven languages.

To keep the topic distribution approximately the same across the corpora, we extracted articles about entities that are described in all seven languages, keeping only articles that consist of more than 1,000 words in every language, for a total of 797 articles.

In our paper, we report results on the small version, which uses 360 randomly sampled articles. The articles are split into sets of 300 / 30 / 30: the first 300 are used for training, and the remaining two sets are used for development and testing, respectively.

## Structure

wiki_{lang}
- `ptb_format`: dataset with 360 articles, split into 300 / 30 / 30
- `ptb_format_large`: dataset with 720 articles, split into 600 / 60 / 60
- `tr` / `va` / `te`: raw articles; each file name is the article's Wikidata ID
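The PTB-style splits above can be read with a minimal loader like the following sketch. The exact file layout inside `ptb_format` is an assumption here (PTB-format corpora conventionally ship one whitespace-tokenized text file per split); adjust the path to the actual files in the release.

```python
from pathlib import Path

def load_split(path):
    """Read a whitespace-tokenized PTB-style file into a list of tokens.

    `path` is assumed to point at one split file, e.g.
    wiki_en/ptb_format/<train file> -- the file name is hypothetical.
    """
    text = Path(path).read_text(encoding="utf-8")
    return text.split()
```

A loader this simple is enough for the statistics below, since both word types and token counts are defined over whitespace tokens.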

## Statistics (ptb_format)

| Lang | Char types (tr / va / te) | Word types (tr / va / te) | OOV rate (va / te) | Tokens (tr / va / te) | Chars (tr / va / te) |
|------|---------------------------|----------------------------|--------------------|------------------------|-----------------------|
| EN   | 307 / 160 / 157 | 193,808 / 38,826 / 35,093 | 6.60% / 5.46% | 2.5M / 0.2M / 0.2M | 15.6M / 1.5M / 1.3M |
| FR   | 272 / 141 / 155 | 166,354 / 34,991 / 38,323 | 6.70% / 6.96% | 2.0M / 0.2M / 0.2M | 12.4M / 1.3M / 1.6M |
| DE   | 298 / 162 / 183 | 238,703 / 40,848 / 41,962 | 7.07% / 7.01% | 1.9M / 0.2M / 0.2M | 13.6M / 1.2M / 1.3M |
| ES   | 307 / 164 / 176 | 160,574 / 31,358 / 34,999 | 6.61% / 7.35% | 1.8M / 0.2M / 0.2M | 11.0M / 1.0M / 1.3M |
| CS   | 238 / 128 / 144 | 167,886 / 23,959 / 29,638 | 5.06% / 6.44% | 0.9M / 0.1M / 0.1M |  6.1M / 0.4M / 0.5M |
| FI   | 246 / 123 / 135 | 190,595 / 32,899 / 31,109 | 8.33% / 7.39% | 0.7M / 0.1M / 0.1M |  6.4M / 0.7M / 0.6M |
| RU   | 273 / 184 / 196 | 236,834 / 46,663 / 44,772 | 7.76% / 7.20% | 1.3M / 0.1M / 0.1M |  9.3M / 1.0M / 0.9M |
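The OOV rates in the table follow the standard definition: the fraction of dev/test tokens whose word type never occurs in the training split. A minimal sketch of that computation (assuming whitespace-tokenized splits, as in the PTB format above):

```python
def oov_rate(train_tokens, eval_tokens):
    """Fraction of eval tokens whose type is absent from the training vocabulary."""
    vocab = set(train_tokens)
    oov = sum(1 for tok in eval_tokens if tok not in vocab)
    return oov / len(eval_tokens)
```

For example, `oov_rate` over an evaluation set where two of four tokens are unseen in training returns 0.5.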

Reference: Kazuya Kawakami, Chris Dyer, and Phil Blunsom. "Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling." ACL 2017.
URL: k-kawakami.com/pdf/acl2017-cache.pdf
Contact: kazuya.kawakami@gmail.com
