Understanding Compositional Data Augmentation in Typologically Diverse Morphological Inflection

Farhan Samir, Miikka Silfverberg


Abstract
Data augmentation techniques are widely used in low-resource automatic morphological inflection to address the issue of data sparsity. However, the full implications of these techniques remain poorly understood. In this study, we aim to shed light on the theoretical aspects of the data augmentation strategy StemCorrupt, a method that generates synthetic examples by randomly substituting stem characters in existing gold standard training examples. Our analysis uncovers that StemCorrupt brings about fundamental changes in the underlying data distribution, revealing inherent compositional concatenative structure. To complement our theoretical analysis, we investigate the data-efficiency of StemCorrupt. Through evaluation across a diverse set of seven typologically distinct languages, we demonstrate that selecting a subset of datapoints with both high diversity and high predictive uncertainty significantly enhances the data-efficiency of StemCorrupt compared to competitive baselines. Furthermore, we explore the impact of typological features on the choice of augmentation strategy and find that languages incorporating non-concatenativity, such as morphonological alternations, derive less benefit from synthetic examples with high predictive uncertainty. We attribute this effect to phonotactic violations induced by StemCorrupt, emphasizing the need for further research to ensure optimal performance across the entire spectrum of natural language morphology.
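
As an illustration of the stem-substitution idea described in the abstract, the following is a minimal sketch rather than the authors' implementation. It assumes that the stem can be approximated by the longest common substring of the lemma and the inflected form, and that each stem character is independently replaced with probability p by a character drawn uniformly from a given alphabet. The function name stem_corrupt and its parameters are hypothetical and chosen for this example only.

```python
import random
from difflib import SequenceMatcher


def stem_corrupt(lemma, form, alphabet, rng, p=0.5):
    """Return a synthetic (lemma, form) pair with a randomly perturbed stem.

    Assumption (not taken from the abstract): the stem is approximated by the
    longest common substring shared by the lemma and the inflected form.
    """
    matcher = SequenceMatcher(None, lemma, form, autojunk=False)
    match = matcher.find_longest_match(0, len(lemma), 0, len(form))
    if match.size == 0:
        # No shared material, so there is nothing to substitute.
        return lemma, form

    stem = lemma[match.a:match.a + match.size]
    # Replace each stem character with probability p (hypothetical parameter).
    new_stem = "".join(
        rng.choice(alphabet) if rng.random() < p else ch for ch in stem
    )

    # Write the same perturbed stem back into both strings, so the affixes
    # outside the stem are left untouched.
    new_lemma = lemma[:match.a] + new_stem + lemma[match.a + match.size:]
    new_form = form[:match.b] + new_stem + form[match.b + match.size:]
    return new_lemma, new_form


if __name__ == "__main__":
    rng = random.Random(0)
    print(stem_corrupt("walk", "walked", "abcdefghijklmnopqrstuvwxyz", rng))
```

Because the same perturbed stem is written into both the lemma and the inflected form, the affixal material is preserved across the synthetic pair. Under the stated assumptions, this is also why substituted characters may violate the phonotactics of languages with stem alternations, as the abstract notes.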
Anthology ID:
2023.emnlp-main.19
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
277–291
URL:
https://aclanthology.org/2023.emnlp-main.19
DOI:
10.18653/v1/2023.emnlp-main.19
Cite (ACL):
Farhan Samir and Miikka Silfverberg. 2023. Understanding Compositional Data Augmentation in Typologically Diverse Morphological Inflection. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 277–291, Singapore. Association for Computational Linguistics.
Cite (Informal):
Understanding Compositional Data Augmentation in Typologically Diverse Morphological Inflection (Samir & Silfverberg, EMNLP 2023)
PDF:
https://preview.aclanthology.org/nschneid-patch-4/2023.emnlp-main.19.pdf
Video:
https://preview.aclanthology.org/nschneid-patch-4/2023.emnlp-main.19.mp4