Data Augmentation via Subtree Swapping for Dependency Parsing of Low-Resource Languages

Mathieu Dehouck, Carlos Gómez-Rodríguez


Abstract
The lack of annotated data is a major obstacle to building reliable NLP systems for most of the world’s languages, but this problem can be alleviated by automatic data generation. In this paper, we present a new data augmentation method that artificially creates new dependency-annotated sentences. The main idea is to swap subtrees between annotated sentences while enforcing strong constraints on those trees to ensure maximal grammaticality of the new sentences. We also propose a method for running low-resource experiments on resource-rich languages: we mimic a low-resource language by sampling sentences under a low-resource distribution. In a series of experiments, we show that our data augmentation method outperforms previous proposals that use the same basic inputs.
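To make the core idea concrete, here is a minimal Python sketch of swapping subtrees between two dependency-annotated sentences. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not spell out the constraints, so the sketch assumes two subtrees are swappable when their roots share the same POS tag and dependency relation and both subtrees cover a contiguous span. The names `Token`, `subtree_ids`, `compatible`, `graft`, and `swap` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    form: str
    upos: str
    head: int    # 1-based id of the head token, 0 for the sentence root
    deprel: str


def subtree_ids(sent: List[Token], root: int) -> List[int]:
    """Sorted 1-based ids of `root` and all of its descendants."""
    ids = {root}
    changed = True
    while changed:
        changed = False
        for i, tok in enumerate(sent, start=1):
            if tok.head in ids and i not in ids:
                ids.add(i)
                changed = True
    return sorted(ids)


def compatible(a: Token, b: Token) -> bool:
    # Assumed constraint for this sketch: same POS tag and same relation.
    return a.upos == b.upos and a.deprel == b.deprel


def graft(host: List[Token], host_root: int,
          donor: List[Token], donor_root: int) -> List[Token]:
    """Replace the subtree at `host_root` with the donor subtree at `donor_root`."""
    h_ids = subtree_ids(host, host_root)
    d_ids = subtree_ids(donor, donor_root)
    for ids in (h_ids, d_ids):
        if ids[-1] - ids[0] + 1 != len(ids):
            raise ValueError("sketch only handles contiguous subtrees")
    start, end = h_ids[0], h_ids[-1]
    shift = len(d_ids) - len(h_ids)      # length change of the host sentence
    offset = start - d_ids[0]            # maps donor ids into the host span

    def remap(h: int) -> int:
        # Heads of tokens outside the removed span never point inside it.
        return h if h < start else h + shift

    attach = remap(host[host_root - 1].head)
    before = [Token(t.form, t.upos, remap(t.head), t.deprel)
              for t in host[:start - 1]]
    middle = [Token(donor[j - 1].form, donor[j - 1].upos,
                    attach if j == donor_root else donor[j - 1].head + offset,
                    donor[j - 1].deprel)
              for j in d_ids]
    after = [Token(t.form, t.upos, remap(t.head), t.deprel)
             for t in host[end:]]
    return before + middle + after


def swap(a: List[Token], root_a: int,
         b: List[Token], root_b: int) -> Tuple[List[Token], List[Token]]:
    """Swap two compatible subtrees, yielding two new annotated sentences."""
    if not compatible(a[root_a - 1], b[root_b - 1]):
        raise ValueError("subtree roots are not compatible")
    return graft(a, root_a, b, root_b), graft(b, root_b, a, root_a)


# Toy example: swap the subject subtrees of two sentences.
sent_a = [Token("The", "DET", 2, "det"), Token("dog", "NOUN", 3, "nsubj"),
          Token("chased", "VERB", 0, "root"), Token("a", "DET", 5, "det"),
          Token("cat", "NOUN", 3, "obj")]
sent_b = [Token("An", "DET", 3, "det"), Token("old", "ADJ", 3, "amod"),
          Token("man", "NOUN", 4, "nsubj"), Token("reads", "VERB", 0, "root"),
          Token("books", "NOUN", 4, "obj")]
new_a, new_b = swap(sent_a, 2, sent_b, 3)
print(" ".join(t.form for t in new_a))
print(" ".join(t.form for t in new_b))
```

Both output sentences keep a complete dependency annotation, because head indices are recomputed after the splice. Stricter compatibility checks (for example on morphological features) could be added inside `compatible` without changing the rest of the sketch.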
Anthology ID:
2020.coling-main.339
Volume:
Proceedings of the 28th International Conference on Computational Linguistics
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
3818–3830
URL:
https://aclanthology.org/2020.coling-main.339
DOI:
10.18653/v1/2020.coling-main.339
Cite (ACL):
Mathieu Dehouck and Carlos Gómez-Rodríguez. 2020. Data Augmentation via Subtree Swapping for Dependency Parsing of Low-Resource Languages. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3818–3830, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Cite (Informal):
Data Augmentation via Subtree Swapping for Dependency Parsing of Low-Resource Languages (Dehouck & Gómez-Rodríguez, COLING 2020)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2020.coling-main.339.pdf