Pimlico: A toolkit for corpus-processing pipelines and reproducible experiments

Mark Granroth-Wilding


Abstract
We present Pimlico, an open source toolkit for building pipelines for processing large corpora. It is especially focused on processing linguistic corpora and provides wrappers around existing, widely used NLP tools. A particular goal is to ease distribution of reproducible and extensible experiments by making it easy to document and re-run all steps involved, including data loading, pre-processing, model training and evaluation. Once a pipeline is released, it is easy to adapt, for example, to run on a new dataset, or to re-run an experiment with different parameters. The toolkit takes care of many common challenges in writing and distributing corpus-processing code, such as managing data between the steps of a pipeline, installing required software and combining existing toolkits with new, task-specific code.
Anthology ID:
2020.nlposs-1.14
Volume:
Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)
Month:
November
Year:
2020
Address:
Online
Venue:
NLPOSS
Publisher:
Association for Computational Linguistics
Pages:
101–109
URL:
https://aclanthology.org/2020.nlposs-1.14
DOI:
10.18653/v1/2020.nlposs-1.14
Cite (ACL):
Mark Granroth-Wilding. 2020. Pimlico: A toolkit for corpus-processing pipelines and reproducible experiments. In Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS), pages 101–109, Online. Association for Computational Linguistics.
Cite (Informal):
Pimlico: A toolkit for corpus-processing pipelines and reproducible experiments (Granroth-Wilding, NLPOSS 2020)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2020.nlposs-1.14.pdf
Video:
https://slideslive.com/38939753