WIKIPARQ: A Tabulated Wikipedia Resource Using the Parquet Format

Marcus Klang; Pierre Nugues

Abstract

Wikipedia has become one of the most popular resources in natural language processing and is used in a large number of applications. However, Wikipedia requires a substantial pre-processing step before it can be used. For instance, its set of nonstandardized annotations, referred to as the wiki markup, is language-dependent and requires a specific parser for each language: English, French, Italian, etc. In addition, the different Wikipedia resources (main article text, categories, Wikidata, infoboxes) are scattered within the article document or across different files, which makes it difficult to get a global view of this outstanding resource. In this paper, we describe WikiParq, a unified format based on the Parquet standard to tabulate and package the Wikipedia corpora. In combination with Spark, a map-reduce computing framework, and the SQL query language, WikiParq makes it much easier to write database queries that extract specific information or subcorpora from Wikipedia, such as all the first paragraphs of the articles in French, all the articles on persons in Spanish, or all the articles on persons that have versions in French, English, and Spanish. WikiParq is available in six language versions and is potentially extendible to all the languages of Wikipedia. The WikiParq files are downloadable as tarball archives from this location: http://semantica.cs.lth.se/wikiparq/.
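
To make the kind of query mentioned in the abstract concrete, here is a minimal sketch of the "persons with versions in French, English, and Spanish" selection. It is not the WikiParq schema or the authors' code: the table name `wiki`, its four columns, the `type`/`person` encoding, and the sample rows are all hypothetical, and an in-memory SQLite database stands in for Spark SQL over Parquet files so that the example runs with the Python standard library alone.

```python
import sqlite3

# Hypothetical tabulated rows: (page identifier, language, predicate, value).
# These names and values are illustrative assumptions, not the WikiParq schema.
ROWS = [
    ("page_a", "en", "type", "person"),
    ("page_a", "fr", "type", "person"),
    ("page_a", "es", "type", "person"),
    ("page_b", "en", "type", "person"),
    ("page_b", "fr", "type", "person"),
    ("page_c", "en", "type", "city"),
    ("page_c", "fr", "type", "city"),
    ("page_c", "es", "type", "city"),
]


def build_table(rows):
    """Load the rows into an in-memory SQLite table (a stand-in for Parquet + Spark)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE wiki (page TEXT, lang TEXT, predicate TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO wiki VALUES (?, ?, ?, ?)", rows)
    return conn


def persons_in_all(conn, langs):
    """Return the pages typed as 'person' in every language listed in langs."""
    placeholders = ", ".join("?" for _ in langs)
    query = f"""
        SELECT page
        FROM wiki
        WHERE predicate = 'type' AND value = 'person'
          AND lang IN ({placeholders})
        GROUP BY page
        HAVING COUNT(DISTINCT lang) = ?
        ORDER BY page
    """
    return [row[0] for row in conn.execute(query, (*langs, len(langs)))]


if __name__ == "__main__":
    conn = build_table(ROWS)
    print(persons_in_all(conn, ["fr", "en", "es"]))
```

The `GROUP BY` / `HAVING COUNT(DISTINCT lang)` pattern is standard SQL, so under a comparable column layout the same statement should carry over to a Spark SQL session with little change.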

Anthology ID: L16-1654
Volume: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month: May
Year: 2016
Address: Portorož, Slovenia
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 4141–4148
URL: https://aclanthology.org/L16-1654
Cite (ACL): Marcus Klang and Pierre Nugues. 2016. WIKIPARQ: A Tabulated Wikipedia Resource Using the Parquet Format. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4141–4148, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal): WIKIPARQ: A Tabulated Wikipedia Resource Using the Parquet Format (Klang & Nugues, LREC 2016)
PDF: https://preview.aclanthology.org/ingestion-script-update/L16-1654.pdf
