Abstract
Several NLP tasks need an effective representation of text documents. Arora et al. (2017) demonstrate that simple weighted averaging of word vectors frequently outperforms neural models. SCDV (Mekala et al., 2017) further extends this from sentences to documents by employing soft and sparse clustering over pre-computed word vectors. However, both techniques ignore the polysemy and contextual character of words. In this paper, we address this issue by proposing SCDV+BERT(ctxd), a simple and effective unsupervised representation that combines contextualized BERT (Devlin et al., 2019) based word embeddings for word sense disambiguation with the SCDV soft clustering approach. We show that our embeddings outperform the original SCDV, pre-trained BERT, and several other baselines on many classification datasets. We also demonstrate our embeddings' effectiveness on other tasks, such as concept matching and sentence similarity. In addition, we show that SCDV+BERT(ctxd) outperforms fine-tuned BERT and different embedding approaches in scenarios with limited data and only few-shot examples.
- Anthology ID:
- 2021.sustainlp-1.17
- Volume:
- Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing
- Month:
- November
- Year:
- 2021
- Address:
- Virtual
- Venue:
- sustainlp
- Publisher:
- Association for Computational Linguistics
- Pages:
- 166–173
- URL:
- https://aclanthology.org/2021.sustainlp-1.17
- DOI:
- 10.18653/v1/2021.sustainlp-1.17
- Cite (ACL):
- Ankur Gupta and Vivek Gupta. 2021. Unsupervised Contextualized Document Representation. In Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing, pages 166–173, Virtual. Association for Computational Linguistics.
- Cite (Informal):
- Unsupervised Contextualized Document Representation (Gupta & Gupta, sustainlp 2021)
- PDF:
- https://preview.aclanthology.org/remove-xml-comments/2021.sustainlp-1.17.pdf
- Code:
- vgupta123/contextualize_scdv
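
The abstract describes building a document vector by soft clustering word vectors and then sparsifying the result. The following is a minimal sketch of that general idea, not the authors' implementation. It makes several assumptions: the function names `soft_assign` and `scdv_style_doc_vector` are hypothetical, a softmax over negative squared distances to fixed centroids stands in for the posterior of a fitted mixture model, the per-token weights stand in for a term-weighting scheme such as idf, and the sparsity rule is a simplified per-document threshold. The token vectors are toy values; in the contextualized setting they would presumably come from a BERT-style encoder.

```python
import math


def soft_assign(vector, centroids):
    """Return soft cluster probabilities for one vector.

    Assumption: a softmax over negative squared distances is used here
    as a simple stand-in for the posterior of a fitted mixture model.
    """
    logits = [
        -sum((v - c) ** 2 for v, c in zip(vector, centroid))
        for centroid in centroids
    ]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scdv_style_doc_vector(token_vectors, token_weights, centroids, sparsity=0.04):
    """Build a (num_clusters * dim) document vector from token vectors.

    Each token vector is scaled by its probability of belonging to each
    cluster, the scaled copies are laid out cluster by cluster, and the
    weighted results are summed over all tokens. Entries whose magnitude
    falls below `sparsity` times the largest magnitude are zeroed
    (a simplified, hypothetical sparsity rule).
    """
    dim = len(centroids[0])
    doc = [0.0] * (len(centroids) * dim)
    for vector, weight in zip(token_vectors, token_weights):
        probs = soft_assign(vector, centroids)
        for k, prob in enumerate(probs):
            for j, value in enumerate(vector):
                doc[k * dim + j] += weight * prob * value
    threshold = sparsity * max(abs(x) for x in doc)
    return [x if abs(x) >= threshold else 0.0 for x in doc]


# Toy inputs: three 2-dimensional token vectors and two cluster centroids.
token_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
token_weights = [1.0, 1.0, 2.0]
centroids = [[1.0, 0.0], [0.0, 1.0]]

doc_vec = scdv_style_doc_vector(token_vectors, token_weights, centroids)
print(len(doc_vec))
```

Because the cluster probabilities for each token sum to one, adding the per-cluster blocks of the unsparsified vector recovers the plain weighted sum of the token vectors. The clustered layout therefore carries the weighted average plus information about which cluster each contribution came from.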