Abstract
Many natural language processing tasks, including the most advanced ones, routinely start with several basic processing steps: tokenization and segmentation, most likely also POS tagging and lemmatization, and commonly parsing as well. A multilingual pipeline performing these steps can be trained using the Universal Dependencies project, which contains annotations of the described tasks for 50 languages in the latest release, UD 2.0. We present an update to UDPipe, a simple-to-use pipeline processing CoNLL-U version 2.0 files, which performs these tasks for multiple languages without requiring additional external data. We provide models for all 50 languages of UD 2.0, and furthermore, the pipeline can be trained easily using data in CoNLL-U format. UDPipe is a standalone application in C++, with bindings available for Python, Java, C# and Perl. In the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, UDPipe was the eighth best system, while achieving low running times and moderately sized models.
- Anthology ID:
- K17-3009
- Volume:
- Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
- Month:
- August
- Year:
- 2017
- Address:
- Vancouver, Canada
- Editors:
- Jan Hajič, Dan Zeman
- Venue:
- CoNLL
- SIG:
- SIGNLL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 88–99
- URL:
- https://aclanthology.org/K17-3009
- DOI:
- 10.18653/v1/K17-3009
- Cite (ACL):
- Milan Straka and Jana Straková. 2017. Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88–99, Vancouver, Canada. Association for Computational Linguistics.
- Cite (Informal):
- Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe (Straka & Straková, CoNLL 2017)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-3/K17-3009.pdf
- Data
- Universal Dependencies
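
The abstract refers to CoNLL-U files as both the input and the output of the pipeline. As a rough illustration of that format only, the sketch below reads CoNLL-U text into per-sentence token records using the ten standard tab-separated columns. It is not UDPipe code and does not use the UDPipe bindings; the function name `parse_conllu`, the `FIELDS` list, and the hand-written `SAMPLE` sentence are assumptions made for this example, and the sample annotation is not actual UDPipe output.

```python
from typing import Dict, List

# The ten tab-separated columns of a CoNLL-U token line, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts.

    Hypothetical helper for illustration; comment lines are skipped
    and blank lines are treated as sentence boundaries.
    """
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..." are ignored here.
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(columns)}")
        current.append(dict(zip(FIELDS, columns)))
    if current:
        # Handle input that lacks a trailing blank line.
        sentences.append(current)
    return sentences


# Hand-written sample sentence (an assumption, not produced by UDPipe).
SAMPLE = (
    "# text = UDPipe parses text.\n"
    "1\tUDPipe\tUDPipe\tPROPN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tparses\tparse\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\ttext\ttext\tNOUN\t_\t_\t2\tobj\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for sentence in parse_conllu(SAMPLE):
        for token in sentence:
            # Show the surface form, universal POS tag, and head index.
            print(token["form"], token["upos"], token["head"])
```

Each token record keeps every column as a string, so values such as `head` would need an explicit conversion before being used as indices; multiword-token ranges and empty nodes are passed through unchanged rather than interpreted.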