Abstract
We present a word-sense induction method based on pre-trained masked language models (MLMs), which can cheaply scale to large vocabularies and large corpora. The result is a corpus that is sense-tagged according to a corpus-derived sense inventory, in which each sense is associated with indicative words. Evaluation on English Wikipedia that was sense-tagged using our method shows that both the induced senses and the per-instance sense assignments are of high quality, even compared to WSD methods such as Babelfy. Furthermore, by training a static word embedding algorithm on the sense-tagged corpus, we obtain high-quality static senseful embeddings. These outperform existing senseful embedding methods on the WiC dataset and on a new outlier detection dataset we developed. The data-driven nature of the algorithm makes it possible to induce corpus-specific senses, which may not appear in standard sense inventories, as we demonstrate in a case study on the scientific domain.
- Anthology ID:
- 2022.acl-long.325
- Volume:
- Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- May
- Year:
- 2022
- Address:
- Dublin, Ireland
- Editors:
- Smaranda Muresan, Preslav Nakov, Aline Villavicencio
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 4738–4752
- URL:
- https://aclanthology.org/2022.acl-long.325
- DOI:
- 10.18653/v1/2022.acl-long.325
- Cite (ACL):
- Matan Eyal, Shoval Sadde, Hillel Taub-Tabib, and Yoav Goldberg. 2022. Large Scale Substitution-based Word Sense Induction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4738–4752, Dublin, Ireland. Association for Computational Linguistics.
- Cite (Informal):
- Large Scale Substitution-based Word Sense Induction (Eyal et al., ACL 2022)
- PDF:
- https://preview.aclanthology.org/ingest-acl-2023-videos/2022.acl-long.325.pdf
- Data
- CoarseWSD-20, WiC