Csaba Oravecz


2022

pdf
eTranslation’s Submissions to the WMT22 General Machine Translation Task
Csaba Oravecz | Katina Bontcheva | David Kolovratník | Bogomil Kovachev | Christopher Scott
Proceedings of the Seventh Conference on Machine Translation (WMT)

The paper describes the NMT models for French–German, English–Ukrainian and English–Russian submitted by the eTranslation team to the WMT22 general machine translation shared task. In last year’s WMT news task, multilingual systems with deep and complex architectures, utilizing immense amounts of data and resources, were dominant. This year, with the task extended to cover less domain-specific text, we expected such systems to dominate even more. Hoping to produce competitive (constrained) systems despite our limited resources, this time we selected only medium-resource language pairs, all of which are served by the European Commission’s eTranslation system. We explored less resource-intensive strategies, focusing on data selection and filtering to improve the performance of baseline systems. According to the automatic rankings, our submitted systems scored competitively, except for the English–Russian model, where our submission was only a baseline reference model developed as a by-product of the multilingual setup we built primarily for the English–Ukrainian language pair.
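The data selection and filtering strategy is not spelled out in the abstract; as a rough illustration of the kind of rule-based parallel-corpus filtering commonly used in such pipelines, the sketch below applies simple length, length-ratio and copy heuristics. The thresholds and rules are illustrative assumptions, not the paper’s actual pipeline.

```python
# Minimal sketch of rule-based parallel-data filtering; the specific
# thresholds and rules are illustrative assumptions, not the method
# actually used in the paper.

def keep_pair(src: str, tgt: str,
              max_len: int = 250, max_ratio: float = 2.5) -> bool:
    """Return True if a sentence pair passes simple quality heuristics."""
    s, t = src.split(), tgt.split()
    if not s or not t:
        return False                      # drop empty segments
    if len(s) > max_len or len(t) > max_len:
        return False                      # drop overly long segments
    ratio = max(len(s), len(t)) / min(len(s), len(t))
    if ratio > max_ratio:
        return False                      # drop likely misaligned pairs
    if src.strip() == tgt.strip():
        return False                      # drop untranslated copies
    return True

pairs = [
    ("Das ist ein Test .", "This is a test ."),
    ("Hallo", "Hallo"),                   # untranslated copy
    ("Kurz", "A very long unrelated sentence that should be dropped now"),
]
filtered = [p for p in pairs if keep_pair(*p)]
```

Real systems layer further signals on top of such rules (language identification, model-based scoring), but the filter-and-keep structure is the same.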

2021

pdf
eTranslation’s Submissions to the WMT 2021 News Translation Task
Csaba Oravecz | Katina Bontcheva | David Kolovratník | Bhavani Bhaskar | Michael Jellinghaus | Andreas Eisele
Proceedings of the Sixth Conference on Machine Translation

The paper describes the three NMT models submitted by the eTranslation team to the WMT 2021 news translation shared task. We developed systems for language pairs that are actively used in the European Commission’s eTranslation service. In the WMT news task, recent years have seen a steady increase in the computational resources needed to train the deep and complex architectures of competitive systems. We took a different approach and explored alternative strategies, focusing on data selection and filtering to improve the performance of baseline systems. In the domain-constrained task for the French–German language pair, our approach produced the best system by a significant margin in BLEU. For the other two systems (English–German and English–Czech) we tried to build competitive models using standard best practices.

2020

pdf
eTranslation’s Submissions to the WMT 2020 News Translation Task
Csaba Oravecz | Katina Bontcheva | László Tihanyi | David Kolovratník | Bhavani Bhaskar | Adrien Lardilleux | Szymon Klocek | Andreas Eisele
Proceedings of the Fifth Conference on Machine Translation

The paper describes the submissions of the eTranslation team to the WMT 2020 news translation shared task. Leveraging the experience from the team’s participation last year, we developed systems for five language pairs with various strategies. Compared to last year, for some language pairs we dedicated considerably more resources to training and tried to follow standard best practices to build competitive systems that could achieve good results in the rankings. By using deep and complex architectures we sacrificed direct re-usability of our systems in production environments, but evaluation showed that this approach could result in better models that significantly outperform baseline architectures. We also submitted two systems to the zero-shot robustness task; these submissions are described briefly in this paper as well.

2019

pdf
eTranslation’s Submissions to the WMT 2019 News Translation Task
Csaba Oravecz | Katina Bontcheva | Adrien Lardilleux | László Tihanyi | Andreas Eisele
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper describes the submissions of the eTranslation team to the WMT 2019 news translation shared task. The systems were developed with the aim of identifying and following, rather than establishing, best practices, under the constraints imposed by the low-resource training and decoding environment normally used for our production systems. Most of the findings and results are therefore transferable to systems used in the eTranslation service. Evaluations suggest that this approach can produce solid models with good performance and speed, without the overhead of prohibitively deep and complex architectures.

2016

pdf
A New Integrated Open-source Morphological Analyzer for Hungarian
Attila Novák | Borbála Siklósi | Csaba Oravecz
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The goal of a Hungarian research project has been to create an integrated Hungarian natural language processing framework. This infrastructure includes tools for analyzing Hungarian texts, integrated into a standardized environment. The morphological analyzer is one of the core components of the framework. The goal of this paper is to describe a fast and customizable morphological analyzer and its development framework, which synthesizes and further enriches the morphological knowledge implemented in previous tools for Hungarian. In addition, we present the method we applied to add semantic knowledge to the lexical database of the morphology. The method utilizes neural word embedding models together with morphological and shallow syntactic knowledge.
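The abstract does not detail how embedding models contribute semantic knowledge; a common building block is nearest-neighbour lookup by cosine similarity over lemma vectors, sketched below with invented toy vectors. The lemmas and values are illustrative assumptions, not data from the paper.

```python
# Toy sketch of embedding-based semantic similarity between lemmas,
# in the spirit of the enrichment method the abstract describes.
# The vectors and lemma set are made up for the example.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# hypothetical embeddings for a few Hungarian lemmas
emb = {
    "kutya":  [0.90, 0.10, 0.00],   # 'dog'
    "macska": [0.85, 0.15, 0.05],   # 'cat'
    "asztal": [0.00, 0.20, 0.95],   # 'table'
}

def nearest(word):
    """Return the semantically closest other lemma by cosine similarity."""
    return max((w for w in emb if w != word),
               key=lambda w: cosine(emb[word], emb[w]))
```

With real embeddings trained on a large corpus, such neighbourhoods can suggest semantic classes for lexical database entries.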

2014

pdf
The Hungarian Gigaword Corpus
Csaba Oravecz | Tamás Váradi | Bálint Sass
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The paper reports on the development of the Hungarian Gigaword Corpus (HGC), an extended new edition of the Hungarian National Corpus, with upgraded and redesigned linguistic annotation and an increased size of 1.5 billion tokens. Issues concerning the standard steps of corpus collection and preparation are discussed, with special emphasis on linguistic analysis and annotation, Hungarian having some challenging characteristics with respect to computational processing. As the HGC is designed to serve as a resource for a wide range of linguistic research as well as for the interested public, a number of issues had to be resolved in trying to find a balance between these two application areas. The following main objectives were defined for the development of the HGC, focusing on the pivotal concept of increase in: - size: extending the corpus to at least 1 billion words; - quality: using new technology for development and analysis; - coverage and representativity: taking new samples of language use and including further variants (in particular, transcribed spoken language data and user-generated content (social media) from the internet).

2007

pdf
Poster paper: HunPos – an open source trigram tagger
Péter Halácsy | András Kornai | Csaba Oravecz
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

2006

pdf bib
Web-based frequency dictionaries for medium density languages
András Kornai | Péter Halácsy | Viktor Nagy | Csaba Oravecz | Viktor Trón | Dániel Varga
Proceedings of the 2nd International Workshop on Web as Corpus

pdf
Using a morphological analyzer in high precision POS tagging of Hungarian
Péter Halácsy | András Kornai | Csaba Oravecz | Viktor Trón | Dániel Varga
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The paper presents an evaluation of maxent POS disambiguation systems that incorporate an open-source morphological analyzer to constrain the probabilistic models. The experiments show that the best proposed architecture, which is the first application of the maximum entropy framework in a Hungarian NLP task, outperforms comparable state-of-the-art tagging methods and is able to handle out-of-vocabulary items robustly, allowing for efficient analysis of large (web-based) corpora.
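The core idea of constraining a probabilistic tagger with a morphological analyzer can be sketched as follows: the analyzer licenses a tag set per word, and the statistical model only chooses among those candidates. The toy analyzer, tags and scores below are illustrative assumptions, not the paper’s actual components.

```python
# Sketch: a morphological analyzer restricts the candidate tags the
# statistical disambiguator may choose from. The lexicon, tags and
# scores here are invented for illustration.

def analyze(word):
    """Hypothetical morphological analyzer: word -> licensed tags."""
    lexicon = {"vár": {"NOUN", "VERB"}, "a": {"DET"}}
    # for out-of-vocabulary items, fall back to an open candidate set
    return lexicon.get(word, {"NOUN", "VERB", "ADJ"})

def tag(word, scores):
    """Pick the highest-scoring tag among those the analyzer licenses."""
    candidates = analyze(word)
    return max(candidates, key=lambda t: scores.get(t, 0.0))

# toy model scores for one context (stand-in for maxent probabilities)
scores = {"NOUN": 0.2, "VERB": 0.7, "ADJ": 0.1, "DET": 0.9}
best = tag("vár", scores)   # DET is ruled out, leaving NOUN vs VERB
```

Even though DET has the highest raw score, the analyzer never offers it for "vár", which is exactly how the constraint improves robustness.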

2004

pdf
Combining Symbolic and Statistical Methods in Morphological Analysis and Unknown Word Guessing
Attila Novák | Viktor Nagy | Csaba Oravecz
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Highly inflectional/agglutinative languages like Hungarian typically feature so many possible word forms that automatic methods providing morphosyntactic annotation on the basis of some training corpus often face the problem of data sparseness. A possible solution to this problem is to apply a comprehensive morphological analyzer, which is able to analyze almost all word forms, alleviating the problem of unseen tokens. However, although fewer in number, there will still remain forms that are unknown even to the morphological analyzer and should be handled by some guesser mechanism. The paper describes a hybrid method that combines symbolic and statistical information to provide lemmatization and suffix analyses for unknown word forms. Evaluation is carried out with respect to the induction of possible analyses and their respective lexical probabilities for unknown word forms in a part-of-speech tagging system.
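One standard realization of such a guesser is longest-suffix matching against a table that pairs suffixes with candidate analyses and lexical probabilities; the sketch below illustrates that shape. The suffix table, lemma-recovery rules and probabilities are invented for the example, not taken from the paper.

```python
# Sketch of a suffix-based guesser for unknown word forms, combining
# symbolic suffix analyses with lexical probabilities, as the abstract
# describes in outline. The table and probabilities are illustrative.

SUFFIX_TABLE = {
    # suffix -> list of (tag, lemma-recovery rule, probability)
    "ban": [("NOUN.INE", lambda w: w[:-3], 0.8)],     # inessive case
    "tok": [("VERB.PRES.2PL", lambda w: w[:-3], 0.6)],
    "":    [("NOUN", lambda w: w, 0.3)],              # fallback guess
}

def guess(word):
    """Return (tag, lemma, prob) analyses for the longest matching suffix."""
    for suffix in sorted(SUFFIX_TABLE, key=len, reverse=True):
        if word.endswith(suffix):
            return [(tag, rule(word), p)
                    for tag, rule, p in SUFFIX_TABLE[suffix]]

analyses = guess("házban")   # 'in the house' -> lemma 'ház'
```

A real guesser would induce the suffix table and probabilities from corpus data rather than hand-code them, but the lookup structure is the same.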

2002

pdf
Efficient Stochastic Part-of-Speech Tagging for Hungarian
Csaba Oravecz | Péter Dienes
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

2000

pdf
Bottom-Up Tagset Design from Maximally Reduced Tagset
Péter Dienes | Csaba Oravecz
Proceedings of the COLING-2000 Workshop on Linguistically Interpreted Corpora

pdf
Principled Hidden Tagset Design for Tiered Tagging of Hungarian
Dan Tufiş | Péter Dienes | Csaba Oravecz | Tamás Váradi
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)