Towards Building More Robust NER datasets: An Empirical Study on NER Dataset Bias from a Dataset Difficulty View
Ruotian Ma, Xiaolei Wang, Xin Zhou, Qi Zhang, Xuanjing Huang
Abstract
Recently, many studies have illustrated the robustness problem of Named Entity Recognition (NER) systems: the NER models often rely on superficial entity patterns for predictions, without considering evidence from the context. Consequently, even state-of-the-art NER models generalize poorly to out-of-domain scenarios when out-of-distribution (OOD) entity patterns are introduced. Previous research attributes the robustness problem to the existence of NER dataset bias, where simpler and regular entity patterns induce shortcut learning. In this work, we bring new insights into this problem by comprehensively investigating the NER dataset bias from a dataset difficulty view. We quantify the entity-context difficulty distribution in existing datasets and explain their relationship with model robustness. Based on our findings, we explore three potential ways to de-bias the NER datasets by altering entity-context distribution, and we validate the feasibility with intensive experiments. Finally, we show that the de-biased datasets can transfer to different models and even benefit existing model-based robustness-improving methods, indicating that building more robust datasets is fundamental for building more robust NER systems.
- Anthology ID: 2023.emnlp-main.281
- Volume: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
- Month: December
- Year: 2023
- Address: Singapore
- Editors: Houda Bouamor, Juan Pino, Kalika Bali
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 4616–4630
- URL: https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2023.emnlp-main.281/
- DOI: 10.18653/v1/2023.emnlp-main.281
- Cite (ACL): Ruotian Ma, Xiaolei Wang, Xin Zhou, Qi Zhang, and Xuanjing Huang. 2023. Towards Building More Robust NER datasets: An Empirical Study on NER Dataset Bias from a Dataset Difficulty View. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4616–4630, Singapore. Association for Computational Linguistics.
- Cite (Informal): Towards Building More Robust NER datasets: An Empirical Study on NER Dataset Bias from a Dataset Difficulty View (Ma et al., EMNLP 2023)
- PDF: https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2023.emnlp-main.281.pdf