@inproceedings{tan-etal-2020-mind,
title = "Mind Your Inflections! {I}mproving {NLP} for Non-Standard {E}nglishes with {B}ase-{I}nflection {E}ncoding",
author = "Tan, Samson and
Joty, Shafiq and
Varshney, Lav and
Kan, Min-Yen",
editor = "Webber, Bonnie and
Cohn, Trevor and
He, Yulan and
Liu, Yang",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/fix-sig-urls/2020.emnlp-main.455/",
doi = "10.18653/v1/2020.emnlp-main.455",
pages = "5647--5663",
abstract = "Inflectional variation is a common feature of World Englishes such as Colloquial Singapore English and African American Vernacular English. Although comprehension by human readers is usually unimpaired by non-standard inflections, current NLP systems are not yet robust. We propose Base-Inflection Encoding (BITE), a method to tokenize English text by reducing inflected words to their base forms before reinjecting the grammatical information as special symbols. Fine-tuning pretrained NLP models for downstream tasks using our encoding defends against inflectional adversaries while maintaining performance on clean data. Models using BITE generalize better to dialects with non-standard inflections without explicit training and translation models converge faster when trained with BITE. Finally, we show that our encoding improves the vocabulary efficiency of popular data-driven subword tokenizers. Since there has been no prior work on quantitatively evaluating vocabulary efficiency, we propose metrics to do so."
}
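
A minimal, self-contained sketch of the Base-Inflection Encoding (BITE) idea described in the abstract: each inflected word is reduced to its base form, and the grammatical information is re-injected as a special symbol. The tiny lemma table, tag symbols, and function name below are illustrative assumptions, not the authors' implementation, which uses a full lemmatizer and a richer symbol inventory.

```python
# Illustrative sketch of Base-Inflection Encoding (BITE).
# word -> (base form, inflection symbol); symbols loosely follow Penn Treebank tags.
LEMMA_TABLE = {
    "dogs": ("dog", "[NNS]"),     # plural noun
    "walked": ("walk", "[VBD]"),  # past-tense verb
    "has": ("have", "[VBZ]"),     # 3rd-person singular present
}

def bite_encode(sentence: str) -> list[str]:
    """Encode a sentence as base forms plus special inflection symbols."""
    encoded = []
    for word in sentence.lower().split():
        base, symbol = LEMMA_TABLE.get(word, (word, None))
        encoded.append(base)
        if symbol is not None:        # word was inflected: re-inject the grammar
            encoded.append(symbol)
    return encoded

print(bite_encode("she has two dogs and walked them"))
# -> ['she', 'have', '[VBZ]', 'two', 'dog', '[NNS]', 'and', 'walk', '[VBD]', 'them']
```

Because non-standard inflections collapse to the same base form, a downstream model fine-tuned on this encoding sees the same base tokens regardless of dialectal inflection, which is the robustness property the abstract describes.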