CBOW-tag: a Modified CBOW Algorithm for Generating Embedding Models from Annotated Corpora

Attila Novák, László Laki, Borbála Novák


Abstract
In this paper, we present a modified version of the CBOW algorithm implemented in the fastText framework. Our modified algorithm, CBOW-tag, builds a vector space model that represents the original word forms and their annotations at the same time. We illustrate the results by presenting a model built from a corpus that includes morphological and syntactic annotations. Because unannotated elements and different types of annotation are present in the model simultaneously, nearest neighbour queries can be constrained to specific types of elements. The model can thus efficiently answer questions such as What do we eat?, What can we do with a skeleton?, What else do we do with what we eat?, etc. Error analysis reveals that the model can highlight errors introduced into the annotation by the tagger and parser we used to generate the annotations, as well as lexical peculiarities in the corpus itself, especially if we do not limit the vocabulary of the model to frequent items.
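The idea of constraining nearest neighbour queries to one type of element can be illustrated with a minimal sketch. It is not the authors' implementation and does not use fastText: the vectors are invented toy numbers, and the bracketed key convention for annotation elements (`[obj]`, `[N]`), the `VECTORS` table, and the helpers `is_tag` and `nearest` are all hypothetical names chosen for this example. The sketch covers only the query side (filtering candidates by type before ranking by cosine similarity), not the CBOW-tag training procedure.

```python
import math

# Toy embedding table. The numbers are invented for illustration and are
# NOT taken from the paper's model.
# Assumption: word forms and annotation elements share one vector space,
# and annotation elements are marked by a hypothetical "[...]" key format.
VECTORS = {
    "eat":      [0.9, 0.1, 0.0],
    "bread":    [0.8, 0.2, 0.1],
    "soup":     [0.7, 0.3, 0.0],
    "skeleton": [0.0, 0.9, 0.3],
    "[obj]":    [0.6, 0.4, 0.2],
    "[N]":      [0.3, 0.6, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def is_tag(key):
    """True if the key follows the hypothetical annotation-key convention."""
    return key.startswith("[") and key.endswith("]")


def nearest(query, k=3, keep=lambda key: True):
    """Return the k keys most similar to `query` among those accepted by `keep`."""
    q = VECTORS[query]
    scored = [
        (cosine(q, vec), key)
        for key, vec in VECTORS.items()
        if key != query and keep(key)
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]


# Restrict the same query to word forms only, then to annotation elements only.
words_only = nearest("eat", k=2, keep=lambda key: not is_tag(key))
tags_only = nearest("eat", k=1, keep=is_tag)
```

With a trained model, the same filtering step would run over the model's real vocabulary; how annotation elements are actually named and stored there is described in the paper, not here.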
Anthology ID:
2020.lrec-1.590
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4798–4801
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.590
Cite (ACL):
Attila Novák, László Laki, and Borbála Novák. 2020. CBOW-tag: a Modified CBOW Algorithm for Generating Embedding Models from Annotated Corpora. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4798–4801, Marseille, France. European Language Resources Association.
Cite (Informal):
CBOW-tag: a Modified CBOW Algorithm for Generating Embedding Models from Annotated Corpora (Novák et al., LREC 2020)
PDF:
https://preview.aclanthology.org/remove-xml-comments/2020.lrec-1.590.pdf