Andrea Dömötör
The goal of annotation standards is to ensure consistency across different corpora and languages. But do they succeed? In our paper we experiment with morphologically annotated Hungarian corpora of different sizes (ELTE DH gold standard corpus, NYTK-NerKor, and Szeged Treebank) to assess their compatibility as a merged training corpus for morphological analysis and disambiguation. Our results show that combining any two corpora not only failed to improve the results of the trained tagger but even degraded them due to the inconsistent annotations. Further analysis of the annotation differences among the corpora revealed inconsistencies from several sources: different theoretical approaches, lack of consensus, and tagset conversion issues.
Classical models of the mental lexicon represented the lexicon as a list of words. Usage-based models describe the mental lexicon more dynamically, but they do not capture the real-time operation of speech production. In the linguistic model of Boris Gasparov, the notions of communicative fragment and contour can provide a comprehensive description of the diversity of linguistic experience. Fragments and contours form larger linguistic structures than words, and speakers recognize them as whole units through their communicative profile. Fragments are prefabricated units that can be added to or merged with each other during speech production. Contours serve as templates for utterances by combining specific and abstract linguistic elements. Based on this theoretical framework, our tool applies remix n-grams (combinations of word forms, lemmas and POS tags) to identify similar linguistic structures in different texts that form the basic units of the mental lexicon.
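The idea of remix n-grams can be illustrated with a minimal sketch. It assumes that each token is available as a (word form, lemma, POS tag) triple and that a remix n-gram picks one of these three layers at every position; the function name, the input format, and the English toy data are hypothetical and are not taken from the tool described in the abstract.

```python
from itertools import product

def remix_ngrams(tokens, n):
    """Return the set of n-grams in which every position is filled by
    the word form, the lemma, or the POS tag of the token at that position.

    tokens: list of (form, lemma, pos) triples (assumed input format).
    """
    result = set()
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        # product(*window) picks one layer per token, in every combination
        for combo in product(*window):
            result.add(combo)
    return result

# Toy data, invented for illustration only
text_a = [("dogs", "dog", "NOUN"), ("bark", "bark", "VERB")]
text_b = [("cats", "cat", "NOUN"), ("bark", "bark", "VERB")]

# Structures shared by the two texts at some level of abstraction
shared = remix_ngrams(text_a, 2) & remix_ngrams(text_b, 2)
```

In this toy case the two texts share no bigram of plain word forms, but they do share the partly abstract pattern `("NOUN", "bark")` and the fully abstract pattern `("NOUN", "VERB")`, which is the kind of similarity a contour-like template is meant to capture.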
The research presented in this paper concerns zero copulas in Hungarian, i.e. the phenomenon that nominal predicates lack an explicit verbal copula in the default present tense 3rd person indicative case. We created a tool based on the state-of-the-art transformer architecture implemented in the Marian NMT framework that can identify and mark the location of zero copulas, i.e. the position where an overt copula would appear in the non-default cases. Our primary aim was to support quantitative corpus-based linguistic research by creating a tool that can be used to compile a corpus of significant size containing examples of nominal predicates including the location of the zero copulas. We created the training corpus for our system by transforming sentences containing overt copulas into ones containing zero copula labels. However, we first needed to disambiguate occurrences of the massively ambiguous verb van ‘exist/be/have’. We performed this using a rule-based classifier relying on English translations in the English-Hungarian parallel subcorpus of the OpenSubtitles corpus. We created several NMT-based models using different sampling methods and optionally using our baseline model to synthesize additional training data. Our best model obtains almost 90% precision and 80% recall on an in-domain test set.
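The step of turning overt-copula sentences into labeled training data might look roughly like the sketch below. It assumes that the sentence is already tokenized, that the index of a token already classified as a copula is known, and that the zero copula is marked with a placeholder token `<cop>`; the label, the function name, and the source/target pair format are all assumptions for illustration, not the format used by the authors.

```python
ZERO_COPULA_LABEL = "<cop>"  # hypothetical label token, not from the paper

def make_training_pair(tokens, copula_index):
    """Build one (source, target) pair from a sentence with an overt copula.

    source: the sentence with the copula removed (what the model sees)
    target: the same sentence with a label at the copula's position
    """
    before = tokens[:copula_index]
    after = tokens[copula_index + 1:]
    source = before + after
    target = before + [ZERO_COPULA_LABEL] + after
    return " ".join(source), " ".join(target)

# "A ház nagy volt" ('The house was big'): the past tense copula is overt
pair = make_training_pair(["A", "ház", "nagy", "volt"], 3)
# pair == ("A ház nagy", "A ház nagy <cop>")
```

A sequence-to-sequence model trained on such pairs would then learn to insert the label into sentences that have no overt copula at all.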
In this article, we present ongoing research whose immediate goal is to create a corpus annotated with semantic role labels for Hungarian that can be used to train a parser-based system capable of formulating relevant questions about the text it processes. We briefly describe the objectives of our research and our efforts at eliminating errors in the Hungarian Universal Dependencies corpus, which we use as the base of our annotation effort, at creating a Hungarian verbal argument database annotated with thematic roles, at classifying adjuncts, and at matching verbal argument frames to specific occurrences of verbs and participles in the corpus.