Unsupervised Contextualized Document Representation

Ankur Gupta, Vivek Gupta


Abstract
Several NLP tasks need effective representation of text documents. Arora et al. (2017) demonstrate that simple weighted averaging of word vectors frequently outperforms neural models. SCDV (Mekala et al., 2017) further extends this from sentences to documents by employing soft and sparse clustering over pre-computed word vectors. However, both techniques ignore the polysemy and contextual character of words. In this paper, we address this issue by proposing SCDV+BERT(ctxd), a simple and effective unsupervised representation that combines contextualized BERT (Devlin et al., 2019) based word embeddings for word sense disambiguation with the SCDV soft clustering approach. We show that our embeddings outperform the original SCDV, pre-trained BERT, and several other baselines on many classification datasets. We also demonstrate our embeddings' effectiveness on other tasks, such as concept matching and sentence similarity. In addition, we show that SCDV+BERT(ctxd) outperforms fine-tuned BERT and other embedding approaches in scenarios with limited data and only few-shot examples.
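For readers unfamiliar with SCDV, the sketch below illustrates the soft-clustering and sparse-composition steps over word vectors that the paper builds on; in SCDV+BERT(ctxd), the input vectors would instead be sense-disambiguated contextualized BERT embeddings. All names and values here (the sense-tagged vocabulary, random vectors, uniform IDF weights, sparsity threshold) are illustrative placeholders rather than the authors' implementation; see the linked vgupta123/contextualize_scdv repository for that.

import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder inputs: in the paper, each vocabulary entry would be a
# disambiguated word sense whose vector is derived from the contextualized
# BERT embeddings of that word's occurrences; random vectors stand in here.
rng = np.random.default_rng(0)
vocab = ["bank_0", "bank_1", "river", "money", "loan"]  # hypothetical sense-tagged vocab
dim, n_clusters = 16, 3
X = rng.normal(size=(len(vocab), dim))                  # stand-in for BERT-derived vectors
idf = {w: 1.0 for w in vocab}                           # stand-in IDF weights

# SCDV step 1: soft-cluster the word vectors with a Gaussian mixture model.
gmm = GaussianMixture(n_components=n_clusters, covariance_type="diag",
                      random_state=0).fit(X)
probs = gmm.predict_proba(X)                            # P(cluster k | word), shape (V, K)

# SCDV step 2: word-topic vectors. Concatenate the word vector scaled by
# each cluster probability, then weight by IDF, giving shape (K * dim,).
wtv = {w: idf[w] * np.concatenate([probs[i, k] * X[i] for k in range(n_clusters)])
       for i, w in enumerate(vocab)}

def scdv(tokens, sparsity=0.04):
    """Average the word-topic vectors of a document, then zero out
    near-zero entries (a simplified form of SCDV's sparsification)."""
    v = np.mean([wtv[t] for t in tokens if t in wtv], axis=0)
    v[np.abs(v) < sparsity * np.abs(v).max()] = 0.0
    return v

print(scdv(["bank_0", "money", "loan"]).shape)          # (48,) = n_clusters * dim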
Anthology ID:
2021.sustainlp-1.17
Volume:
Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing
Month:
November
Year:
2021
Address:
Virtual
Venues:
EMNLP | sustainlp
Publisher:
Association for Computational Linguistics
Pages:
166–173
URL:
https://aclanthology.org/2021.sustainlp-1.17
DOI:
10.18653/v1/2021.sustainlp-1.17
Cite (ACL):
Ankur Gupta and Vivek Gupta. 2021. Unsupervised Contextualized Document Representation. In Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing, pages 166–173, Virtual. Association for Computational Linguistics.
Cite (Informal):
Unsupervised Contextualized Document Representation (Gupta & Gupta, sustainlp 2021)
PDF:
https://preview.aclanthology.org/update-css-js/2021.sustainlp-1.17.pdf
Code
 vgupta123/contextualize_scdv