Cross-Domain Language Modeling: An Empirical Investigation
Vincent Nguyen, Sarvnaz Karimi, Maciej Rybinski, Zhenchang Xing
Abstract
Transformer encoder models exhibit strong performance in single-domain applications. In cross-domain settings, however, using a sub-word vocabulary model results in sub-word overlap between domains. This becomes a problem when the overlapping sub-words share no semantic similarity across domains. We hypothesize that alleviating this overlap allows for more effective modeling of multi-domain tasks; we consider the biomedical and general domains in this paper. We present a study on reducing sub-word overlap by scaling the vocabulary size of a Transformer encoder model while pretraining on multiple domains. We observe a significant increase in downstream performance in the general-biomedical cross-domain setting resulting from the reduction in sub-word overlap.
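The abstract summarizes the idea without implementation detail; as a rough illustration of the quantity involved, the sketch below compares the sub-word vocabularies of a general-domain and a biomedical WordPiece tokenizer and reports their overlap. The model identifiers and the use of the HuggingFace `transformers` library are assumptions made for illustration, not details taken from the paper.

```python
# Illustrative sketch (not the authors' implementation): measure how many
# sub-words two domain tokenizers share, i.e. the "sub-word overlap" the paper
# reduces by scaling the vocabulary size.
# Assumes the HuggingFace `transformers` library; the model identifiers below
# are example choices for a general-domain and a biomedical tokenizer.
from transformers import AutoTokenizer

general = AutoTokenizer.from_pretrained("bert-base-uncased")
biomedical = AutoTokenizer.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"
)

# get_vocab() maps each sub-word token to its id; only the token strings matter here.
general_vocab = set(general.get_vocab())
biomedical_vocab = set(biomedical.get_vocab())

shared = general_vocab & biomedical_vocab
jaccard = len(shared) / len(general_vocab | biomedical_vocab)

print(f"general vocabulary size:    {len(general_vocab)}")
print(f"biomedical vocabulary size: {len(biomedical_vocab)}")
print(f"shared sub-words:           {len(shared)}")
print(f"Jaccard overlap:            {jaccard:.3f}")
```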
- Anthology ID: 2021.alta-1.22
- Volume: Proceedings of the 19th Annual Workshop of the Australasian Language Technology Association
- Month: December
- Year: 2021
- Address: Online
- Editors: Afshin Rahimi, William Lane, Guido Zuccon
- Venue: ALTA
- Publisher: Australasian Language Technology Association
- Pages: 192–200
- URL: https://aclanthology.org/2021.alta-1.22
- Cite (ACL): Vincent Nguyen, Sarvnaz Karimi, Maciej Rybinski, and Zhenchang Xing. 2021. Cross-Domain Language Modeling: An Empirical Investigation. In Proceedings of the 19th Annual Workshop of the Australasian Language Technology Association, pages 192–200, Online. Australasian Language Technology Association.
- Cite (Informal): Cross-Domain Language Modeling: An Empirical Investigation (Nguyen et al., ALTA 2021)
- PDF: https://preview.aclanthology.org/nschneid-patch-4/2021.alta-1.22.pdf
- Data: BLUE, GLUE, QNLI