Chasing the Tail with Domain Generalization: A Case Study on Frequency-Enriched Datasets

Manoj Kumar, Anna Rumshisky, Rahul Gupta


Abstract
Natural language understanding (NLU) tasks are typically defined by creating an annotated dataset in which each utterance is encountered once. Such data does not resemble real-world natural language interactions in which certain utterances are encountered frequently, others rarely. For deployed NLU systems this is a vital problem, since the underlying machine learning (ML) models are often fine-tuned on typical NLU data, and then applied to real-world data with a very different distribution. Such systems need to maintain interpretation consistency for both high-frequency utterances and low-frequency utterances. We propose an alternative strategy that explicitly uses utterance frequency in training data to learn models that are more robust to unknown distributions. We present a methodology to simulate utterance usage in two public NLU corpora and create new corpora with head, body and tail segments. We evaluate several methods for joint intent classification and named entity recognition (IC-NER), and use two domain generalization approaches that we adapt to NER. The proposed approaches demonstrate upto 7.02% relative improvement in semantic accuracy over baselines on the tail data. We provide insights as to why the proposed approaches work and show that the reasons for observed improvements do not align with those reported in previous work.
Anthology ID:
2022.aacl-main.1
Volume:
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Month:
November
Year:
2022
Address:
Online only
Editors:
Yulan He, Heng Ji, Sujian Li, Yang Liu, Chua-Hui Chang
Venues:
AACL | IJCNLP
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
1–11
Language:
URL:
https://aclanthology.org/2022.aacl-main.1
DOI:
Bibkey:
Cite (ACL):
Manoj Kumar, Anna Rumshisky, and Rahul Gupta. 2022. Chasing the Tail with Domain Generalization: A Case Study on Frequency-Enriched Datasets. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–11, Online only. Association for Computational Linguistics.
Cite (Informal):
Chasing the Tail with Domain Generalization: A Case Study on Frequency-Enriched Datasets (Kumar et al., AACL-IJCNLP 2022)
Copy Citation:
PDF:
https://preview.aclanthology.org/add_acl24_videos/2022.aacl-main.1.pdf