A comparison of statistical association measures for identifying dependency-based collocations in various languages.

Marcos Garcia, Marcos García Salido, Margarita Alonso-Ramos


Abstract
This paper explores different statistical association measures for automatically identifying collocations in English, Portuguese, and Spanish corpora. To evaluate the impact of the association metrics, we manually annotated corpora with three different syntactic patterns of collocations (adjective-noun, verb-object and nominal compounds). We took advantage of the PARSEME 1.1 Shared Task corpora by selecting a subset of 155k tokens in these three languages, in which we annotated 1,526 collocations with the corresponding Lexical Functions according to the Meaning-Text Theory. Using the resulting gold standard, we compared frequency data with several well-known association measures, both symmetric and asymmetric. The results show that the combination of dependency triples with raw frequency information is as powerful as the best association measures in most syntactic patterns and languages. Furthermore, despite the asymmetric behaviour of collocations, directional approaches perform worse than symmetric ones in the extraction of these phraseological combinations.
Anthology ID:
W19-5107
Volume:
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)
Month:
August
Year:
2019
Address:
Florence, Italy
Editors:
Agata Savary, Carla Parra Escartín, Francis Bond, Jelena Mitrović, Verginica Barbu Mititelu
Venue:
MWE
SIG:
SIGLEX
Publisher:
Association for Computational Linguistics
Pages:
49–59
URL:
https://aclanthology.org/W19-5107
DOI:
10.18653/v1/W19-5107
Cite (ACL):
Marcos Garcia, Marcos García Salido, and Margarita Alonso-Ramos. 2019. A comparison of statistical association measures for identifying dependency-based collocations in various languages. In Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019), pages 49–59, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
A comparison of statistical association measures for identifying dependency-based collocations in various languages. (Garcia et al., MWE 2019)
PDF:
https://preview.aclanthology.org/naacl-24-ws-corrections/W19-5107.pdf