Maciej Rybinski


2021

Cross-Domain Language Modeling: An Empirical Investigation
Vincent Nguyen | Sarvnaz Karimi | Maciej Rybinski | Zhenchang Xing
Proceedings of the 19th Annual Workshop of the Australasian Language Technology Association

Transformer encoder models exhibit strong performance in single-domain applications. In a cross-domain setting, however, a shared sub-word vocabulary leads to sub-word overlap between domains. This becomes a problem when the overlapping sub-words share no semantic similarity across domains. We hypothesize that alleviating this overlap allows for more effective modeling of multi-domain tasks; in this paper we consider the biomedical and general domains. We present a study on reducing sub-word overlap by scaling the vocabulary size of a Transformer encoder model pretrained on multiple domains. We observe a significant increase in downstream performance in the general-biomedical cross-domain setting as a result of reduced sub-word overlap.
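
To make the notion of sub-word overlap concrete, the following is a minimal sketch and not the authors' method or metric. It assumes a simplified greedy longest-match tokenizer (WordPiece-like, without the usual "##" continuation markers) and measures overlap as the Jaccard ratio of sub-word types between two corpora. The vocabularies, the example sentences, and the helper names `greedy_subwords`, `subword_types`, and `overlap_ratio` are all hypothetical and chosen only for illustration.

```python
def greedy_subwords(word, vocab):
    """Split a word by greedy longest-match against vocab (simplified, WordPiece-like)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing matched: fall back to a single character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces


def subword_types(corpus, vocab):
    """Collect the set of distinct sub-words used to tokenize a corpus."""
    return {
        piece
        for sentence in corpus
        for word in sentence.lower().split()
        for piece in greedy_subwords(word, vocab)
    }


def overlap_ratio(types_a, types_b):
    """Jaccard overlap between two sets of sub-word types."""
    union = types_a | types_b
    return len(types_a & types_b) / len(union) if union else 0.0


# Hypothetical toy data: one general-domain and one biomedical sentence.
general = ["the cardboard"]
biomedical = ["the cardiology"]

# A small vocabulary forces both domains to share the piece "card".
small_vocab = {"the", "card", "board", "io", "logy"}
# A larger vocabulary (assumed here) keeps the domain-specific words whole.
large_vocab = small_vocab | {"cardboard", "cardiology"}

small = overlap_ratio(subword_types(general, small_vocab),
                      subword_types(biomedical, small_vocab))
large = overlap_ratio(subword_types(general, large_vocab),
                      subword_types(biomedical, large_vocab))

print(f"small vocabulary overlap: {small:.2f}")  # 0.40
print(f"large vocabulary overlap: {large:.2f}")  # 0.33
```

In this toy case the small vocabulary makes "cardboard" and "cardiology" share the piece "card" even though the words are semantically unrelated, while the larger vocabulary removes that shared piece. This only illustrates the intuition behind the hypothesis; the paper's actual vocabularies, tokenizer, and overlap measure may differ.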