Abstract
In this paper, we present HunOr, the first multi-domain Hungarian–Russian parallel corpus. Part of the corpus has been manually split into sentences and aligned, and its named entities have been annotated; the remaining texts are automatically aligned at the sentence level and POS-tagged. The corpus currently contains texts from the literary, official and scientific domains, and we plan to add texts from the news domain. In the future, we plan to annotate the HunOr corpus syntactically, which will further enhance its usability in NLP fields such as transfer-based machine translation and cross-lingual information retrieval.
- Anthology ID:
- L12-1106
- Volume:
- Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
- Month:
- May
- Year:
- 2012
- Address:
- Istanbul, Turkey
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 2453–2458
- URL:
- http://www.lrec-conf.org/proceedings/lrec2012/pdf/262_Paper.pdf
- Cite (ACL):
- Martina Katalin Szabó, Veronika Vincze, and István Nagy T.. 2012. HunOr: A Hungarian—Russian Parallel Corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2453–2458, Istanbul, Turkey. European Language Resources Association (ELRA).
- Cite (Informal):
- HunOr: A Hungarian—Russian Parallel Corpus (Szabó et al., LREC 2012)