In the research presented in this article, we created a new question answering benchmark database for Hungarian called MILQA. When creating the dataset, we followed the principles of the English SQuAD 2.0; however, like some more recent English question answering datasets, we introduced a number of innovations beyond SQuAD: yes/no questions, list-like answers consisting of several text spans, long answers, questions requiring calculation, and other question types where the answer cannot simply be copied from the text. For all these non-extractive question types, the pragmatically adequate form of the answer was also added to make the training of generative models possible. We implemented and evaluated a set of baseline retrieval and answer span extraction models on the dataset. BM25 performed better than any vector-based solution for retrieval, while cross-lingual transfer from English significantly improved the span extraction models.
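As an illustration, below is a minimal sketch of a BM25 passage-retrieval baseline of the kind evaluated above. The rank_bm25 package, the whitespace tokenization, and the toy passages are assumptions made for the sketch, not the authors' actual implementation.

```python
# Minimal BM25 retrieval sketch (illustrative; not the MILQA baseline code).
from rank_bm25 import BM25Okapi

passages = [
    "Budapest is the capital and most populous city of Hungary.",
    "The Danube river flows through Budapest.",
    "Hungarian is a Uralic language spoken in Hungary.",
]

# Whitespace tokenization is enough for a sketch; a real baseline would use
# a proper Hungarian tokenizer.
tokenized = [p.lower().split() for p in passages]
bm25 = BM25Okapi(tokenized)

query = "what language is spoken in hungary".split()
best = bm25.get_top_n(query, passages, n=1)
print(best[0])  # the passage most likely to contain the answer span
```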
In this paper, we present an upgraded version of the Hungarian NYTK-NerKor named entity corpus, which contains about twice as many annotated spans and seven times as many distinct entity types as the original version. We used an extended version of the OntoNotes 5 annotation scheme that also covers time and numerical expressions. NerKor is the newest and largest NER corpus for Hungarian, covering diverse domains. To preannotate the corpus, we applied cross-lingual transfer of NER models trained for other languages on top of multilingual contextual language models. We then corrected the annotation semi-automatically and manually. Zero-shot preannotation proved very effective, with an F1 score of about 0.82 for the best model. We also added a 12,000-token subcorpus on cars and other motor vehicles. We trained and released a transformer-based NER tagger for Hungarian using the annotation in the new corpus version, which provides performance similar to that of an identical model trained on the original version of the corpus.
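A minimal sketch of zero-shot preannotation with a multilingual NER model, in the spirit of the workflow described above, is given below. The checkpoint name is a hypothetical placeholder and the Hugging Face pipeline is an assumed tool, not necessarily the setup used by the authors.

```python
# Zero-shot NER preannotation sketch (illustrative assumptions throughout).
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="your-org/multilingual-ontonotes-ner",  # hypothetical checkpoint
    aggregation_strategy="simple",
)

sentence = "Kovács János 2021-ben Budapesten vásárolt egy Teslát."
for ent in ner(sentence):
    # Each prediction becomes a candidate span for later manual correction.
    print(ent["entity_group"], ent["word"], round(ent["score"], 2))
```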
While abstractive summarization in certain languages, like English, has already reached fairly good results due to the availability of trend-setting resources, like the CNN/Daily Mail dataset, and considerable progress in generative neural models, progress in abstractive summarization for Arabic, the fifth most-spoken language globally, is still in its infancy. While some resources for extractive summarization have been available for some time, in this paper we present the first corpus of human-written abstractive news summaries in Arabic, hoping to lay the foundation of this line of research for this important language. The dataset consists of more than 21 thousand items. We used this dataset to train a set of neural abstractive summarization systems for Arabic by fine-tuning pre-trained language models such as multilingual BERT, AraBERT, and multilingual BART-50. As the Arabic dataset is much smaller than, e.g., the CNN/Daily Mail dataset, we also applied cross-lingual knowledge transfer to significantly improve the performance of our baseline systems. The setups included two M-BERT-based summarization models originally trained for Hungarian/English and a similar system based on M-BART-50 originally trained for Russian, all of which were further fine-tuned for Arabic. The models were evaluated in terms of ROUGE, and a manual evaluation of their fluency and adequacy was also carried out.
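As a sketch of how such a fine-tuned mBART-50 summarizer would be used at inference time, the snippet below loads a hypothetical Arabic checkpoint (the path is a placeholder; no released model is assumed) and generates a summary with the standard Hugging Face API.

```python
# Inference sketch for an mBART-50-based Arabic summarizer (illustrative).
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_name = "path/to/finetuned-mbart50-arabic"  # hypothetical checkpoint
tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="ar_AR")
model = MBartForConditionalGeneration.from_pretrained(model_name)

article = "..."  # an Arabic news article
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["ar_AR"],  # decode in Arabic
    num_beams=4,
    max_length=128,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```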
In this paper, we present a major update to the first Hungarian named entity dataset, the Szeged NER corpus. We used zero-shot cross-lingual transfer to initialize the enrichment of the entity types annotated in the corpus, using three neural NER models fine-tuned from multilingual neural language models: two based on the English OntoNotes corpus and one based on the Czech Named Entity Corpus. The output of the models was automatically merged with the original NER annotation, then automatically and manually corrected and further enriched with additional annotation, such as qualifiers for various entity types. We present the evaluation of the zero-shot performance of the two OntoNotes-based models and of a new transformer-based NER model trained on the training part of the final corpus. We release the corpus and the trained model.
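The snippet below is a toy sketch of the kind of automatic merging described above: model-predicted spans are added to the original gold annotation unless they overlap a gold span. The span format and the priority rule are illustrative assumptions, not the authors' exact procedure.

```python
# Toy merge of gold and model-predicted entity spans (start, end, label).
def merge_annotations(gold_spans, predicted_spans):
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    merged = list(gold_spans)  # the original annotation always wins
    for pred in predicted_spans:
        if not any(overlaps(pred, g) for g in gold_spans):
            merged.append(pred)
    return sorted(merged)

gold = [(0, 12, "PER")]
pred = [(0, 12, "PERSON"), (20, 28, "DATE")]
print(merge_annotations(gold, pred))  # keeps the gold PER, adds the new DATE span
```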
In this paper, we present the results of our experiments concerning the zero-shot cross-lingual performance of the PERIN sentence-to-graph semantic parser. We applied the PTG model, trained with the PERIN parser on a 740k-token Czech newspaper corpus, to Hungarian, and evaluated its performance using the official evaluation tool of the MRP 2020 shared task. The gold standard Hungarian annotation was created by manually correcting the output of the parser following the annotation manual of the tectogrammatical level of the Prague Dependency Treebank. An English model trained on a larger, one-million-token English newspaper corpus is also available; however, we found that the Czech model performed significantly better on Hungarian input, as Hungarian is typologically more similar to Czech than to English. We conclude that zero-shot transfer of the PTG meaning representation across typologically not-too-distant languages, using a neural parser model based on a multilingual contextual language model and followed by manual correction by expert linguists, is a viable scenario.
In this paper, we present a modified version of the CBOW algorithm implemented in the fastText framework. Our modified algorithm, CBOW-tag, builds a vector space model that includes representations of the original word forms and of their annotation at the same time. We illustrate the results by presenting a model built from a corpus that includes morphological and syntactic annotations. The presence of both unannotated elements and different annotations in the same model makes it possible to constrain nearest neighbour queries to specific types of elements. The model can thus efficiently answer questions such as What do we eat?, What can we do with a skeleton?, or What else do we do with what we eat? Error analysis reveals that the model can highlight errors introduced into the annotation by the tagger and parser we used to generate the annotations, as well as lexical peculiarities in the corpus itself, especially if we do not limit the vocabulary of the model to frequent items.
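The constrained nearest-neighbour queries can be approximated, very roughly, with a standard CBOW model trained on tokens that carry their annotation as a suffix; the sketch below uses gensim for this approximation (the actual CBOW-tag model modifies fastText itself), and the toy corpus and tag format are assumptions.

```python
# Rough approximation of tag-constrained nearest-neighbour queries.
from gensim.models import Word2Vec

sentences = [
    ["we|PRON", "eat|VERB", "bread|NOUN"],
    ["we|PRON", "eat|VERB", "apples|NOUN"],
    ["they|PRON", "bake|VERB", "bread|NOUN"],
]
# sg=0 selects CBOW; the corpus is far too small for meaningful vectors and
# serves only to show the query pattern.
model = Word2Vec(sentences, sg=0, vector_size=50, window=3, min_count=1)

# "What do we eat?" -> nearest neighbours of the verb 'eat', kept only if NOUN
neighbours = model.wv.most_similar("eat|VERB", topn=20)
print([w for w, _ in neighbours if w.endswith("|NOUN")])
```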
The research presented in this paper concerns zero copulas in Hungarian, i.e. the phenomenon that nominal predicates lack an explicit verbal copula in the default present tense 3rd person indicative case. We created a tool based on the state-of-the-art transformer architecture implemented in the Marian NMT framework that can identify and mark the location of zero copulas, i.e. the position where an overt copula would appear in the non-default cases. Our primary aim was to support quantitative corpus-based linguistic research by creating a tool that can be used to compile a corpus of significant size containing examples of nominal predicates, including the location of the zero copulas. We created the training corpus for our system by transforming sentences containing overt copulas into ones containing zero copula labels. However, we first needed to disambiguate occurrences of the massively ambiguous verb van ‘exist/be/have’. We performed this using a rule-based classifier relying on English translations in the English-Hungarian parallel subcorpus of the OpenSubtitles corpus. We created several NMT-based models using different sampling methods, optionally using our baseline model to synthesize additional training data. Our best model achieves almost 90% precision and 80% recall on an in-domain test set.
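Below is a toy illustration of the training-data transformation described above: a sentence with an overt copula is turned into a source/target pair in which the source drops the copula and the target marks its position with a label. The marker string, the copula inventory, and the token-level matching are illustrative assumptions, not the exact procedure of the paper.

```python
# Toy generation of zero-copula training pairs (illustrative only).
COPULAS = {"van", "volt"}       # toy inventory of overt copula forms
ZERO_LABEL = "[ZERO_COP]"

def make_training_pair(tokens):
    source, target = [], []
    for tok in tokens:
        if tok.lower() in COPULAS:
            target.append(ZERO_LABEL)   # mark where the copula would surface
            continue                    # and omit it from the source side
        source.append(tok)
        target.append(tok)
    return " ".join(source), " ".join(target)

print(make_training_pair("Péter mérnök volt .".split()))
# -> ('Péter mérnök .', 'Péter mérnök [ZERO_COP] .')
```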
In this article, ongoing research is presented whose immediate goal is to create a corpus annotated with semantic role labels for Hungarian that can be used to train a parser-based system capable of formulating relevant questions about the text it processes. We briefly describe the objectives of our research and our efforts at eliminating errors in the Hungarian Universal Dependencies corpus, which we use as the basis of our annotation effort; at creating a Hungarian verbal argument database annotated with thematic roles; at classifying adjuncts; and at matching verbal argument frames to specific occurrences of verbs and participles in the corpus.
The goal of a Hungarian research project has been to create an integrated Hungarian natural language processing framework. This infrastructure includes tools for analyzing Hungarian texts, integrated into a standardized environment. The morphological analyzer is one of the core components of the framework. The goal of this paper is to describe a fast and customizable morphological analyzer and its development framework, which synthesizes and further enriches the morphological knowledge implemented in previously existing tools for Hungarian. In addition, we present the method we applied to add semantic knowledge to the lexical database of the morphology. The method utilizes neural word embedding models as well as morphological and shallow syntactic knowledge.
MorphoLogic’s Humor morphological analyzer engine has been used to develop several high-quality computational morphologies, among them ones for complex agglutinative languages. However, Humor’s closed-source licensing scheme has been an obstacle to making these resources widely available. Moreover, the rule-based Humor engine has other limitations: it supports neither morphological guessing nor the integration of frequency information or other weighting of the models. We solved these problems by converting the databases to a finite-state representation that allows for morphological guessing and the addition of weights, and for which open-source implementations exist.
This article presents a set of morphological tools for six small, endangered minority languages belonging to the Uralic language family: Udmurt, Komi, Eastern Mari, Northern Mansi, Tundra Nenets, and Nganasan. Following an introduction to the languages, the two sets of tools used in the project (MorphoLogic's Humor tools and the Xerox Finite State Tool) are described and compared. The article concludes with a comparison of the six computational morphologies.
Highly inflectional/agglutinative languages like Hungarian typically have such a huge number of possible word forms that automatic methods providing morphosyntactic annotation on the basis of a training corpus often face the problem of data sparseness. A possible solution to this problem is to apply a comprehensive morphological analyzer, which is able to analyze almost all word forms, alleviating the problem of unseen tokens. However, although fewer in number, there will still remain forms that are unknown even to the morphological analyzer and need to be handled by some guesser mechanism. The paper describes a hybrid method that combines symbolic and statistical information to provide lemmatization and suffix analyses for unknown word forms. Evaluation is carried out with respect to the induction of possible analyses and their respective lexical probabilities for unknown word forms in a part-of-speech tagging system.
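The statistical side of such a guesser can be illustrated with a toy suffix-based model that estimates tag probabilities from known word forms and backs off over suffix lengths for an unseen form; the training data and fixed suffix lengths below are assumptions and do not reproduce the paper's hybrid symbolic/statistical method.

```python
# Toy suffix-based tag guesser with lexical probabilities (illustrative).
from collections import Counter, defaultdict

known = [("házakban", "NOUN"), ("kertekben", "NOUN"), ("olvasnak", "VERB"),
         ("futnak", "VERB"), ("szépekben", "ADJ")]

suffix_tags = defaultdict(Counter)
for word, tag in known:
    for k in range(1, 4):                      # collect 1..3-character suffixes
        suffix_tags[word[-k:]][tag] += 1

def guess(word):
    for k in range(3, 0, -1):                  # back off to shorter suffixes
        counts = suffix_tags.get(word[-k:])
        if counts:
            total = sum(counts.values())
            return {tag: c / total for tag, c in counts.items()}
    return {}

print(guess("falvakban"))   # unseen form; the suffix '-ban' suggests NOUN
```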