Abstract
The availability of user-generated content has increased significantly over time. Wikipedia is one example of a corpus that spans a huge range of topics and is freely available. Storing and processing such corpora requires flexible document models, as the corpora may contain malicious and incorrect data. Docria is a library that attempts to address this issue by providing a solution that can be used with small to large corpora, from laptops using Python interactively in a Jupyter notebook to clusters running map-reduce frameworks with optimized compiled code. Docria is available as open-source code.
- Anthology ID:
- W19-6148
- Volume:
- Proceedings of the 22nd Nordic Conference on Computational Linguistics
- Month:
- September–October
- Year:
- 2019
- Address:
- Turku, Finland
- Editors:
- Mareike Hartmann, Barbara Plank
- Venue:
- NoDaLiDa
- Publisher:
- Linköping University Electronic Press
- Pages:
- 400–405
- URL:
- https://aclanthology.org/W19-6148
- Cite (ACL):
- Marcus Klang and Pierre Nugues. 2019. Docria: Processing and Storing Linguistic Data with Wikipedia. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, pages 400–405, Turku, Finland. Linköping University Electronic Press.
- Cite (Informal):
- Docria: Processing and Storing Linguistic Data with Wikipedia (Klang & Nugues, NoDaLiDa 2019)
- PDF:
- https://preview.aclanthology.org/ingest-bitext-workshop/W19-6148.pdf