Hubless Nearest Neighbor Search for Bilingual Lexicon Induction

Jiaji Huang, Qiang Qiu, Kenneth Church


Abstract
Bilingual Lexicon Induction (BLI) is the task of translating words from corpora in two languages. Recent advances in BLI work by aligning the two word embedding spaces. Following that, a key step is to retrieve the nearest neighbor (NN) in the target space given the source word. However, a phenomenon called hubness often degrades the accuracy of NN. Hubness appears as some data points, called hubs, being extraordinarily close to many of the other data points. Reducing hubness is necessary for retrieval tasks. One successful example is Inverted SoFtmax (ISF), recently proposed to improve NN. This work proposes a new method, Hubless Nearest Neighbor (HNN), to mitigate hubness. HNN differs from NN by imposing an additional equal preference assumption. Moreover, the HNN formulation explains why ISF works as well as it does. Empirical results demonstrate that HNN outperforms NN, ISF and other state-of-the-art methods. For reproducibility and follow-ups, we have published all code.
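To make the hubness problem concrete, the following is a minimal sketch, not the authors' released code. It assumes a precomputed source-by-target similarity matrix (the toy values in `SIM` are invented for illustration) and compares plain NN retrieval with an inverted-softmax-style rule that normalizes each target column over all source words. The function names and the temperature `beta` are hypothetical choices for this illustration. HNN itself is not implemented here, since the abstract does not spell out its optimization.

```python
import math

# Toy similarity matrix: rows are source words, columns are target words.
# Values are made up for illustration; target 0 plays the role of a hub.
SIM = [
    [0.90, 0.80, 0.10],
    [0.90, 0.20, 0.85],
    [0.95, 0.30, 0.40],
]

def nearest_neighbor(sim):
    """Plain NN: for each source row, return the index of the most similar target."""
    return [max(range(len(row)), key=row.__getitem__) for row in sim]

def inverted_softmax(sim, beta=10.0):
    """ISF-style retrieval (assumed form): softmax over source words per target column,
    followed by a row-wise argmax. `beta` is an illustrative temperature."""
    n_src, n_tgt = len(sim), len(sim[0])
    # Subtract the column maximum before exponentiating, for numerical stability.
    col_max = [max(sim[i][j] for i in range(n_src)) for j in range(n_tgt)]
    col_sum = [
        sum(math.exp(beta * (sim[i][j] - col_max[j])) for i in range(n_src))
        for j in range(n_tgt)
    ]
    probs = [
        [math.exp(beta * (sim[i][j] - col_max[j])) / col_sum[j] for j in range(n_tgt)]
        for i in range(n_src)
    ]
    return nearest_neighbor(probs)

def k_occurrence(retrieved, n_targets):
    """Count how often each target is retrieved; a large count marks a hub."""
    counts = [0] * n_targets
    for j in retrieved:
        counts[j] += 1
    return counts

nn_result = nearest_neighbor(SIM)
isf_result = inverted_softmax(SIM)
print(nn_result, k_occurrence(nn_result, 3))
print(isf_result, k_occurrence(isf_result, 3))
```

On this toy matrix, plain NN sends every source word to target 0, while the column-normalized rule spreads the retrievals evenly across the three targets. That even spread is the kind of outcome the equal preference assumption describes.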
Anthology ID:
P19-1399
Volume:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2019
Address:
Florence, Italy
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
4072–4080
URL:
https://aclanthology.org/P19-1399
DOI:
10.18653/v1/P19-1399
Cite (ACL):
Jiaji Huang, Qiang Qiu, and Kenneth Church. 2019. Hubless Nearest Neighbor Search for Bilingual Lexicon Induction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4072–4080, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Hubless Nearest Neighbor Search for Bilingual Lexicon Induction (Huang et al., ACL 2019)
PDF:
https://preview.aclanthology.org/auto-file-uploads/P19-1399.pdf
Supplementary:
P19-1399.Supplementary.pdf
Code:
baidu-research/HNN