A Short Survey on Sense-Annotated Corpora

Tommaso Pasini, Jose Camacho-Collados


Abstract
Large sense-annotated datasets are increasingly necessary for training deep supervised systems in Word Sense Disambiguation. However, gathering high-quality sense-annotated data for as many instances as possible is a laborious and expensive task. This has led to the proliferation of automatic and semi-automatic methods for overcoming the so-called knowledge-acquisition bottleneck. In this short survey we present an overview of sense-annotated corpora, annotated either manually or (semi-)automatically, that are currently available for different languages and that use distinct lexical resources as sense inventories, i.e. WordNet, Wikipedia, and BabelNet. Furthermore, we provide the reader with general statistics for each dataset and an analysis of their specific features.
Anthology ID:
2020.lrec-1.706
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
5759–5765
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.706
Cite (ACL):
Tommaso Pasini and Jose Camacho-Collados. 2020. A Short Survey on Sense-Annotated Corpora. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 5759–5765, Marseille, France. European Language Resources Association.
Cite (Informal):
A Short Survey on Sense-Annotated Corpora (Pasini & Camacho-Collados, LREC 2020)
PDF:
https://preview.aclanthology.org/ingest-2024-clasp/2020.lrec-1.706.pdf
Data
FrameNet, Word Sense Disambiguation: a Unified Evaluation Framework and Empirical Comparison