HunOr: A Hungarian—Russian Parallel Corpus

Martina Katalin Szabó, Veronika Vincze, István Nagy T.


Abstract
In this paper, we present HunOr, the first multi-domain Hungarian-Russian parallel corpus. Some of the corpus texts have been manually split into sentences and aligned, and named entities have also been annotated in them; the remaining texts are automatically aligned at the sentence level and POS-tagged. The corpus contains texts from the domains of literature, official language use, and science; we would also like to add texts from the news domain. In the future, we plan to carry out a syntactic annotation of the HunOr corpus, which will further enhance its usability in various NLP fields such as transfer-based machine translation and cross-lingual information retrieval.
Anthology ID:
L12-1106
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2453–2458
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/262_Paper.pdf
Cite (ACL):
Martina Katalin Szabó, Veronika Vincze, and István Nagy T. 2012. HunOr: A Hungarian—Russian Parallel Corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2453–2458, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
HunOr: A Hungarian—Russian Parallel Corpus (Szabó et al., LREC 2012)