Embedding Space Correlation as a Measure of Domain Similarity

Anne Beyer, Göran Kauermann, Hinrich Schütze


Abstract
Prior work has determined domain similarity using text-based features of a corpus. However, when pre-trained word embeddings are used, the underlying text corpus might no longer be accessible. Therefore, we propose the CCA measure, a new measure of domain similarity based directly on the dimension-wise correlations between corresponding embedding spaces. Our results suggest that an inherent notion of domain can be captured this way, as we are able to reproduce our findings for different domain comparisons for English, German, Spanish and Czech as well as in cross-lingual comparisons. By applying permutation tests, we further find a threshold at which the CCA measure indicates that two corpora come from the same domain in a monolingual setting. Evaluating the usability of the CCA measure in a domain adaptation application, we also show that it can be used to determine which corpora are more similar to each other in a cross-domain sentiment detection task.
Anthology ID:
2020.lrec-1.296
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
2431–2439
Language:
English
URL:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2020.lrec-1.296/
Cite (ACL):
Anne Beyer, Göran Kauermann, and Hinrich Schütze. 2020. Embedding Space Correlation as a Measure of Domain Similarity. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2431–2439, Marseille, France. European Language Resources Association.
Cite (Informal):
Embedding Space Correlation as a Measure of Domain Similarity (Beyer et al., LREC 2020)
PDF:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2020.lrec-1.296.pdf