Abstract
Inflectional variation is a common feature of World Englishes such as Colloquial Singapore English and African American Vernacular English. Although comprehension by human readers is usually unimpaired by non-standard inflections, current NLP systems are not yet robust. We propose Base-Inflection Encoding (BITE), a method to tokenize English text by reducing inflected words to their base forms before reinjecting the grammatical information as special symbols. Fine-tuning pretrained NLP models for downstream tasks using our encoding defends against inflectional adversaries while maintaining performance on clean data. Models using BITE generalize better to dialects with non-standard inflections without explicit training, and translation models converge faster when trained with BITE. Finally, we show that our encoding improves the vocabulary efficiency of popular data-driven subword tokenizers. Since there has been no prior work on quantitatively evaluating vocabulary efficiency, we propose metrics to do so.
- Anthology ID:
- 2020.emnlp-main.455
- Original:
- 2020.emnlp-main.455v1
- Version 2:
- 2020.emnlp-main.455v2
- Volume:
- Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
- Month:
- November
- Year:
- 2020
- Address:
- Online
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 5647–5663
- URL:
- https://aclanthology.org/2020.emnlp-main.455
- DOI:
- 10.18653/v1/2020.emnlp-main.455
- Cite (ACL):
- Samson Tan, Shafiq Joty, Lav Varshney, and Min-Yen Kan. 2020. Mind Your Inflections! Improving NLP for Non-Standard Englishes with Base-Inflection Encoding. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5647–5663, Online. Association for Computational Linguistics.
- Cite (Informal):
- Mind Your Inflections! Improving NLP for Non-Standard Englishes with Base-Inflection Encoding (Tan et al., EMNLP 2020)
- PDF:
- https://preview.aclanthology.org/remove-xml-comments/2020.emnlp-main.455.pdf
- Code
- salesforce/bite
- Data
- BookCorpus, MultiNLI
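
The encoding idea described in the abstract (reduce each inflected word to its base form, then reinject the grammatical information as a special symbol) can be illustrated with a toy sketch. This is not the authors' implementation: the lookup table `TOY_LEXICON`, the function name `encode`, and the bracketed symbol names are all hypothetical, and a hand-written table stands in for whatever morphological analysis a real system would use.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: inflected form -> (base form, inflection symbol).
# A real system would derive these from a morphological analyzer, not a table.
TOY_LEXICON: Dict[str, Tuple[str, str]] = {
    "took": ("take", "[VBD]"),
    "takes": ("take", "[VBZ]"),
    "taking": ("take", "[VBG]"),
    "dogs": ("dog", "[NNS]"),
}


def encode(sentence: str) -> List[str]:
    """Replace each known inflected word with its base form plus a symbol."""
    tokens: List[str] = []
    for word in sentence.lower().split():
        if word in TOY_LEXICON:
            base, symbol = TOY_LEXICON[word]
            # Emit the base form first, then the symbol carrying the inflection.
            tokens.extend([base, symbol])
        else:
            # Words not in the toy lexicon pass through unchanged.
            tokens.append(word)
    return tokens


if __name__ == "__main__":
    print(encode("She took the dogs home"))
    # ['she', 'take', '[VBD]', 'the', 'dog', '[NNS]', 'home']
```

In this sketch every inflection of "take" maps to the same base token, so a downstream subword tokenizer would see one shared form plus a small set of inflection symbols, which is the intuition behind the vocabulary-efficiency claim in the abstract.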