Docria: Processing and Storing Linguistic Data with Wikipedia

Marcus Klang; Pierre Nugues

Abstract

The availability of user-generated content has increased significantly over time. Wikipedia is one example of a corpus that spans a huge range of topics and is freely available. Storing and processing such corpora requires flexible document models, as they may contain malicious and incorrect data. Docria is a library that attempts to address this issue by providing a solution that can be used with small to large corpora, from laptops using Python interactively in a Jupyter notebook to clusters running map-reduce frameworks with optimized compiled code. Docria is available as open-source code.

Anthology ID: W19-6148
Volume: Proceedings of the 22nd Nordic Conference on Computational Linguistics
Month: September–October
Year: 2019
Address: Turku, Finland
Editors: Mareike Hartmann, Barbara Plank
Venue: NoDaLiDa
Publisher: Linköping University Electronic Press
Pages: 400–405
URL: https://aclanthology.org/W19-6148
Cite (ACL): Marcus Klang and Pierre Nugues. 2019. Docria: Processing and Storing Linguistic Data with Wikipedia. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, pages 400–405, Turku, Finland. Linköping University Electronic Press.
Cite (Informal): Docria: Processing and Storing Linguistic Data with Wikipedia (Klang & Nugues, NoDaLiDa 2019)
PDF: https://preview.aclanthology.org/ingest-bitext-workshop/W19-6148.pdf