Measuring Similarity by Linguistic Features rather than Frequency

Rodolfo Delmonte, Nicolò Busetto


Abstract
In the use and creation of current Deep Learning models, the only number used for the overall computation is the frequency value associated with the current word form in the corpus, which is used to substitute it. Frequency values come in two forms: absolute and relative. Absolute frequency is used indirectly when selecting the vocabulary against which the word embeddings are created: the cutoff threshold is usually fixed at 30/50K entries of the most frequent words. Relative frequency comes in directly when computing word embeddings based on co-occurrence values of the tokens included in a window of 2/5 adjacent tokens. The latter values are then used to compute similarity, mostly based on cosine distance. In this paper we evaluate the impact of these two frequency parameters on a small corpus of Italian sentences with two main features: the presence of very rare words and of non-canonical structures. Rather than basing our evaluation on the cosine measure alone, we propose a graded scale of linguistically motivated scores. The results, computed on the basis of a perusal of BERT’s raw embeddings, show that the two parameters conspire to decide the level of predictability.
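The two frequency parameters named in the abstract can be illustrated with a toy sketch. This is not the authors' method and not how BERT builds its embeddings: it only shows, on an invented nine-token Italian sequence, how a frequency cutoff removes rare words from the vocabulary and how window-based co-occurrence counts feed a cosine comparison. The function names, the cutoff of 3 and the window of 2 are all assumptions chosen for illustration.

```python
from collections import Counter
import math


def build_vocab(tokens, max_size):
    """Keep only the max_size most frequent word forms (absolute-frequency cutoff)."""
    counts = Counter(tokens)
    return {word for word, _ in counts.most_common(max_size)}


def cooccurrence_vectors(tokens, vocab, window):
    """For each in-vocabulary word, count in-vocabulary neighbours within `window` tokens."""
    vectors = {word: Counter() for word in vocab}
    for i, word in enumerate(tokens):
        if word not in vocab:
            continue
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i and tokens[j] in vocab:
                vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Invented toy data (assumption): "cane" and "mangia" occur once each.
tokens = "il gatto dorme il cane dorme il gatto mangia".split()
vocab = build_vocab(tokens, max_size=3)        # hypothetical cutoff
vectors = cooccurrence_vectors(tokens, vocab, window=2)  # hypothetical window
similarity = cosine(vectors["gatto"], vectors["dorme"])
```

With this cutoff the rare forms "cane" and "mangia" fall outside the vocabulary and receive no vector at all, which is the kind of effect on rare words that the paper sets out to evaluate.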
Anthology ID:
2022.isa-1.6
Volume:
Proceedings of the 18th Joint ACL - ISO Workshop on Interoperable Semantic Annotation within LREC2022
Month:
June
Year:
2022
Address:
Marseille, France
Venue:
ISA
Publisher:
European Language Resources Association
Pages:
42–52
URL:
https://aclanthology.org/2022.isa-1.6
Cite (ACL):
Rodolfo Delmonte and Nicolò Busetto. 2022. Measuring Similarity by Linguistic Features rather than Frequency. In Proceedings of the 18th Joint ACL - ISO Workshop on Interoperable Semantic Annotation within LREC2022, pages 42–52, Marseille, France. European Language Resources Association.
Cite (Informal):
Measuring Similarity by Linguistic Features rather than Frequency (Delmonte & Busetto, ISA 2022)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2022.isa-1.6.pdf
Optional supplementary material:
 2022.isa-1.6.OptionalSupplementaryMaterial.pdf