2023
Annotation of lexical bundles with discourse functions in a Spanish academic corpus
Eleonora Guzzi | Margarita Alonso-Ramos | Marcos Garcia | Marcos García Salido
Proceedings of the 19th Workshop on Multiword Expressions (MWE 2023)
This paper describes the annotation of 996 lexical bundles (LBs) assigned to 39 different discourse functions in a Spanish academic corpus. The purpose of the annotation is to obtain a new Spanish gold-standard corpus of 1,800,000 words that is useful for training and evaluating computational models capable of automatically identifying LBs for each context in new corpora, as well as for linguistic analysis of the role of LBs in academic discourse. The annotation process revealed that the correspondence between LBs and discourse functions is not one-to-one and that the degree of ambiguity is high, so linguists’ contribution has been essential for improving the automatic assignment of tags.
2019
A Method to Automatically Identify Diachronic Variation in Collocations.
Marcos Garcia | Marcos García Salido
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change
This paper introduces a novel method to track collocational variation in diachronic corpora. The method can identify several changes undergone by these phraseological combinations and propose alternative solutions found in later periods. The strategy consists of extracting syntactically related collocation candidates and ranking them using statistical association measures. Then, starting from the first period of the corpus, the system tracks each combination over time, verifying different types of historical variation such as the loss of one or both lemmas, the disappearance of the collocation, or its diachronic frequency trend. Using a distributional semantics strategy, it also proposes other linguistic structures that convey meanings similar to those of the extinct collocations. A case study on historical corpora of Portuguese and Spanish shows that the system speeds up and facilitates the finding of some diachronic changes and phraseological shifts that are harder to identify without automated methods.
A comparison of statistical association measures for identifying dependency-based collocations in various languages.
Marcos Garcia | Marcos García Salido | Margarita Alonso-Ramos
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)
This paper explores different statistical association measures for automatically identifying collocations in corpora of English, Portuguese, and Spanish. To evaluate the impact of the association metrics, we manually annotated corpora with three different syntactic patterns of collocations (adjective-noun, verb-object, and nominal compounds). We took advantage of the PARSEME 1.1 Shared Task corpora by selecting a subset of 155k tokens in these three languages, in which we annotated 1,526 collocations with the corresponding Lexical Functions according to the Meaning-Text Theory. Using the resulting gold standard, we carried out a comparison between frequency data and several well-known association measures, both symmetric and asymmetric. The results show that the combination of dependency triples with raw frequency information is as powerful as the best association measures in most syntactic patterns and languages. Furthermore, despite the asymmetric behaviour of collocations, directional approaches perform worse than symmetric ones in the extraction of these phraseological combinations.
Pay Attention when you Pay the Bills. A Multilingual Corpus with Dependency-based and Semantic Annotation of Collocations.
Marcos Garcia | Marcos García Salido | Susana Sotelo | Estela Mosqueira | Margarita Alonso-Ramos
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
This paper presents a new multilingual corpus with semantic annotation of collocations in English, Portuguese, and Spanish. The whole resource contains 155k tokens and 1,526 collocations labeled in context. The annotated examples belong to three syntactic relations (adjective-noun, verb-object, and nominal compounds), and represent 58 lexical functions in the Meaning-Text Theory (e.g., Oper, Magn, Bon). Each collocation was annotated by three linguists, and the final resource was revised by a team of experts. The resulting corpus can serve as a basis for evaluating different approaches to collocation identification, which in turn can be useful for NLP tasks such as natural language understanding or natural language generation.
2018
A Lexical Tool for Academic Writing in Spanish based on Expert and Novice Corpora
Marcos García Salido | Marcos García | Milka Villayandre-Llamazares | Margarita Alonso-Ramos
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2017
Using bilingual word-embeddings for multilingual collocation extraction
Marcos Garcia | Marcos García-Salido | Margarita Alonso-Ramos
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)
This paper presents a new strategy for multilingual collocation extraction which takes advantage of parallel corpora to learn bilingual word-embeddings. Monolingual collocation candidates are retrieved using Universal Dependencies, and the distributional models are then applied to search for equivalents of the elements of each collocation in the target languages. The proposed method extracts not only collocation equivalents with direct translation between languages, but also cases where the collocations in the two languages are not literal translations of each other. Several experiments in English, Spanish, and Portuguese, evaluating collocations with three syntactic patterns, show that our approach can effectively extract large numbers of bilingual equivalent pairs with an average precision of about 90%. Moreover, preliminary results on comparable corpora suggest that the distributional models can be applied to identify new bilingual collocations in different domains.