Abstract
Scaling dialogue systems to a multitude of domains, tasks and languages relies on costly and time-consuming data annotation for different domain-task-language configurations. The annotation effort might be substantially reduced by methods that generalise well in zero- and few-shot scenarios and effectively leverage external unannotated data sources (e.g., Web-scale corpora). To this aim, we propose two methods that improve dialogue natural language understanding (NLU) across multiple languages: 1) Multi-SentAugment, and 2) LayerAgg. Multi-SentAugment is a self-training method which augments available (typically few-shot) training data with similar (automatically labelled) in-domain sentences from large monolingual Web-scale corpora. LayerAgg learns to select and combine useful semantic information scattered across different layers of a Transformer model (e.g., mBERT); it is especially suited for zero-shot scenarios, as semantically richer representations should strengthen the model's cross-lingual capabilities. Applying the two methods with state-of-the-art NLU models yields consistent improvements across two standard multilingual NLU datasets covering 16 diverse languages. The gains are observed in zero-shot, few-shot, and even full-data scenarios. The results also suggest that the two methods achieve a synergistic effect: the best overall performance in few-shot setups is attained when they are used together.
- Anthology ID: 2022.findings-acl.160
- Volume: Findings of the Association for Computational Linguistics: ACL 2022
- Month: May
- Year: 2022
- Address: Dublin, Ireland
- Editors: Smaranda Muresan, Preslav Nakov, Aline Villavicencio
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 2017–2033
- URL: https://aclanthology.org/2022.findings-acl.160
- DOI: 10.18653/v1/2022.findings-acl.160
- Cite (ACL): Evgeniia Razumovskaia, Ivan Vulić, and Anna Korhonen. 2022. Data Augmentation and Learned Layer Aggregation for Improved Multilingual Language Understanding in Dialogue. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2017–2033, Dublin, Ireland. Association for Computational Linguistics.
- Cite (Informal): Data Augmentation and Learned Layer Aggregation for Improved Multilingual Language Understanding in Dialogue (Razumovskaia et al., Findings 2022)
- PDF: https://preview.aclanthology.org/naacl24-info/2022.findings-acl.160.pdf
- Data: CC100, xSID
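The core idea behind LayerAgg, as described in the abstract, is to learn how to weight and combine the representations produced by each Transformer layer rather than using only the final layer. A minimal sketch of this idea is shown below, assuming softmax-normalised scalar weights over layers; the function and variable names are illustrative, and the paper's actual aggregation mechanism may differ.

```python
import numpy as np

def layer_aggregate(hidden_states, layer_logits):
    """Combine per-layer Transformer representations via learned softmax weights.

    hidden_states: array of shape (num_layers, seq_len, dim),
        e.g. the stacked outputs of all mBERT layers for one sentence.
    layer_logits: learnable vector of shape (num_layers,); in a real model
        these would be trained jointly with the NLU task (fixed here for
        illustration).
    """
    # Softmax over the layer axis so the weights form a distribution
    w = np.exp(layer_logits - layer_logits.max())
    w = w / w.sum()
    # Weighted sum across layers -> a single (seq_len, dim) representation
    return np.tensordot(w, hidden_states, axes=(0, 0))

# Toy example: 12 layers (as in mBERT base), 4 tokens, hidden size 8
rng = np.random.default_rng(0)
states = rng.normal(size=(12, 4, 8))
logits = np.zeros(12)  # uniform weights, i.e. before any training
agg = layer_aggregate(states, logits)
assert agg.shape == (4, 8)
```

With all-zero logits the result equals the plain mean over layers; training the logits lets the model emphasise whichever layers carry the most task-relevant semantic information.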