Margarita Alonso Ramos
Also published as:
Margarita Alonso-Ramos
This paper describes the annotation of 996 lexical bundles (LBs) assigned to 39 different discourse functions in a Spanish academic corpus. The purpose of the annotation is to obtain a new Spanish gold-standard corpus of 1,800,000 words, useful both for training and evaluating computational models that automatically identify LBs in context in new corpora and for linguistic analysis of the role of LBs in academic discourse. The annotation process revealed that the correspondence between LBs and discourse functions is not one-to-one and that the degree of ambiguity is high, so the linguists’ contribution has been essential for improving the automatic assignment of tags.
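To make the ambiguity concrete, here is a minimal sketch of a many-to-many bundle-to-function lookup; the bundle entries and function labels are invented for illustration and do not come from the paper's 39-function inventory.

```python
# Hypothetical mapping: a single lexical bundle can realize several
# discourse functions, so the correspondence is not one-to-one.
BUNDLE_FUNCTIONS = {
    "por otra parte": ["contrast", "topic-shift"],      # ambiguous
    "en este trabajo": ["framing"],
    "como se puede ver": ["evidential", "reader-guidance"],
}

def candidate_functions(bundle: str) -> list[str]:
    """Return the discourse functions a bundle may express; more than
    one candidate signals the ambiguity that required linguists' review."""
    return BUNDLE_FUNCTIONS.get(bundle, [])

print(candidate_functions("por otra parte"))  # ['contrast', 'topic-shift']
```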
This paper presents a new multilingual corpus with semantic annotation of collocations in English, Portuguese, and Spanish. The whole resource contains 155k tokens and 1,526 collocations labeled in context. The annotated examples belong to three syntactic relations (adjective-noun, verb-object, and nominal compounds) and represent 58 lexical functions in Meaning-Text Theory (e.g., Oper, Magn, Bon). Each collocation was annotated by three linguists, and the final resource was revised by a team of experts. The resulting corpus can serve as a basis for evaluating different approaches to collocation identification, which in turn can be useful for NLP tasks such as natural language understanding or natural language generation.
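As an illustration of what a labeled example might look like, the sketch below defines a hypothetical annotation record; the schema and field names are assumptions, and only the relation types and the lexical-function inventory come from the abstract.

```python
from dataclasses import dataclass

@dataclass
class CollocationAnnotation:
    sentence: str    # the context in which the collocation is labeled
    base: str        # the semantically autonomous element
    collocate: str   # the element selected by the base
    relation: str    # adjective-noun, verb-object, or nominal compound
    function: str    # lexical function in Meaning-Text Theory

# Oper1 is the standard Meaning-Text Theory function for support-verb
# constructions such as "make a mistake".
example = CollocationAnnotation(
    sentence="She made a serious mistake.",
    base="mistake",
    collocate="make",
    relation="verb-object",
    function="Oper1",
)
```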
This paper presents an exploration of different statistical association measures to automatically identify collocations in corpora of English, Portuguese, and Spanish. To evaluate the impact of the association measures, we manually annotated corpora with three syntactic patterns of collocations (adjective-noun, verb-object, and nominal compounds). We took advantage of the PARSEME 1.1 Shared Task corpora by selecting a subset of 155k tokens in the three languages, in which we annotated 1,526 collocations with the corresponding lexical functions according to Meaning-Text Theory. Using the resulting gold standard, we carried out a comparison between frequency data and several well-known association measures, both symmetric and asymmetric. The results show that the combination of dependency triples with raw frequency information is as powerful as the best association measures in most syntactic patterns and languages. Furthermore, and despite the asymmetric behaviour of collocations, directional approaches perform worse than symmetric ones in the extraction of these phraseological combinations.
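For readers unfamiliar with the measures being compared, the sketch below implements two well-known symmetric association measures, pointwise mutual information (PMI) and the Dice coefficient, over co-occurrence counts; the counts are invented and the code is only illustrative, not the paper's implementation.

```python
import math

def pmi(pair_count: int, w1_count: int, w2_count: int, total: int) -> float:
    """PMI: compares the observed joint probability of a word pair
    with the probability expected if the words were independent."""
    return math.log2((pair_count / total) /
                     ((w1_count / total) * (w2_count / total)))

def dice(pair_count: int, w1_count: int, w2_count: int) -> float:
    """Dice coefficient: a symmetric measure of association."""
    return 2 * pair_count / (w1_count + w2_count)

# Hypothetical counts for a verb-object dependency triple such as
# ("make", "decision") in a corpus of 155k tokens.
print(pmi(pair_count=42, w1_count=900, w2_count=300, total=155_000))
print(dice(pair_count=42, w1_count=900, w2_count=300))
```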
This paper presents a new strategy for multilingual collocation extraction that takes advantage of parallel corpora to learn bilingual word embeddings. Monolingual collocation candidates are retrieved using Universal Dependencies, and the distributional models are then applied to search for equivalents of the elements of each collocation in the target languages. The proposed method extracts not only collocation equivalents with a direct translation between languages, but also cases where the collocations in the two languages are not literal translations of each other. Several experiments evaluating collocations with three syntactic patterns in English, Spanish, and Portuguese show that our approach can effectively extract a large number of bilingual equivalent pairs with an average precision of about 90%. Moreover, preliminary results on comparable corpora suggest that the distributional models can be applied to identify new bilingual collocations in different domains.
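A minimal sketch of the retrieval step, assuming a shared bilingual embedding space has already been learned from parallel corpora; the toy vectors and the helper names are hypothetical.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest_equivalent(word, source_emb, target_emb):
    """Return the target-language word whose vector is closest to the
    source word's vector in the shared bilingual space."""
    query = source_emb[word]
    return max(target_emb, key=lambda w: cosine(query, target_emb[w]))

# Toy shared space: English source, Spanish target (invented vectors).
en = {"take": np.array([0.9, 0.1, 0.2]), "decision": np.array([0.1, 0.8, 0.3])}
es = {"tomar": np.array([0.88, 0.12, 0.18]),
      "decisión": np.array([0.12, 0.79, 0.31]),
      "mesa": np.array([0.40, 0.10, 0.90])}

# Search for an equivalent of each element of the candidate collocation,
# which allows non-literal pairings to surface as well.
for w in ("take", "decision"):
    print(w, "->", nearest_equivalent(w, en, es))
```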
Collocations play a significant role in second language acquisition. In order to offer efficient support to learners, an NLP-based CALL environment for learning collocations should be based on a representative learner corpus annotated with collocation errors. However, so far, no theoretically motivated collocation error tag set has been available. Existing learner corpora tag collocation errors simply as lexical errors, which is clearly insufficient given the wide range of collocation errors that learners make. In this paper, we present a fine-grained three-dimensional typology of collocation errors derived in an empirical study from the learner corpus CEDEL2, compiled by a team at the Autonomous University of Madrid. The first dimension captures whether the error concerns the collocation as a whole or one of its elements; the second captures the language-oriented error analysis; and the third reflects the interpretative error analysis. To facilitate smooth annotation along this typology, we adapted Knowtator, a flexible off-the-shelf annotation tool implemented as a Protégé plugin.
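One hedged way to picture the three dimensions is as fields of an error tag; the label sets below are illustrative placeholders, not the typology's actual categories.

```python
# Hypothetical collocation error tag with one value per dimension.
error_tag = {
    "scope": "collocate",                    # whole collocation vs. one element
    "language_analysis": "lexical",          # language-oriented dimension
    "interpretive_analysis": "L1 transfer",  # interpretative dimension
}
```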
We present an experiment in extracting collocations from the FrameNet corpus, specifically support verbs such as direct in Environmentalists directed strong criticism at world leaders. Support verbs contribute no meaning of their own; the meaning of the construction is provided by the noun, so the recognition of support verbs is useful in text understanding. Having access to a list of support verbs is also useful in applications that can benefit from paraphrasing, such as generation (where paraphrasing can provide variety). This paper starts with a brief presentation of support verbs in Meaning-Text Theory, where they fall under the notion of lexical function, and then discusses how the relevant information is encoded in the FrameNet corpus. Finally, we describe the resource extracted from the FrameNet corpus.
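To show how such a list could be used downstream, here is a hedged sketch of a support-verb check; the lookup table and its format are hypothetical, not the extracted resource's actual layout.

```python
# Hypothetical lookup built from a support-verb resource: for each
# noun, the verbs that merely "support" it without adding meaning.
SUPPORT_VERBS = {
    "criticism": {"direct", "level", "voice"},
    "decision": {"make", "take"},
}

def is_support_construction(verb_lemma: str, noun_lemma: str) -> bool:
    """True if the verb is a support verb for the noun, i.e. the noun,
    not the verb, carries the meaning of the construction."""
    return verb_lemma.lower() in SUPPORT_VERBS.get(noun_lemma.lower(), set())

# "directed ... criticism" -> lemmatize the verb first, then look it up.
print(is_support_construction("direct", "criticism"))  # True
```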
A variety of automatic query expansion techniques have been proposed in Information Retrieval. To estimate how suitable a document term is as an expansion term, the most popular of these techniques measure how frequently the term co-occurs with one or several query terms. The benefit of using the linguistic relations that hold between query terms is often questioned; when a linguistic phenomenon is taken into account, it is usually phrase structure or lexical compounding. We propose a technique based on the restricted lexical co-occurrence (collocation) of query terms. We use knowledge of the collocations formed by query terms for two tasks: (i) document relevance clustering in the first stage of local query expansion, and (ii) choice of suitable expansion terms from the relevant document cluster. In this paper, we describe the first task, providing evidence from preliminary experiments on Spanish material that local relevance clustering benefits considerably from knowledge of collocations.
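To make the two tasks concrete, the sketch below clusters first-pass documents by whether query terms occur as a known collocation, then picks expansion terms from the relevant cluster; the collocation lexicon, documents, and bigram check are all simplifying assumptions.

```python
from collections import Counter

# Hypothetical lexicon of collocations formed by query terms.
COLLOCATIONS = {("energía", "renovable"), ("cambio", "climático")}

def relevance_cluster(docs, query_terms):
    """Task (i): keep documents in which query terms occur as a known
    collocation (approximated here as an adjacent bigram)."""
    query = set(query_terms)
    relevant, rest = [], []
    for doc in docs:
        tokens = doc.lower().split()
        bigrams = set(zip(tokens, tokens[1:]))
        if any(pair in bigrams and set(pair) <= query for pair in COLLOCATIONS):
            relevant.append(doc)
        else:
            rest.append(doc)
    return relevant, rest

def expansion_terms(relevant_docs, query_terms, k=5):
    """Task (ii): pick the k most frequent non-query terms from the
    relevant cluster as expansion candidates."""
    counts = Counter(t for doc in relevant_docs for t in doc.lower().split()
                     if t not in query_terms)
    return [term for term, _ in counts.most_common(k)]

docs = ["La energía renovable crece con rapidez.",
        "Un informe sobre turbinas y paneles solares."]
relevant, _ = relevance_cluster(docs, ["energía", "renovable"])
print(expansion_terms(relevant, ["energía", "renovable"], k=3))
```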