Building the Macedonian-Croatian Parallel Corpus

Ines Cebović, Marko Tadić


Abstract
In this paper we present a newly created parallel corpus of two under-resourced languages, the Macedonian-Croatian Parallel Corpus (mk-hr_pcorp), which was collected during 2015 at the Faculty of Humanities and Social Sciences, University of Zagreb. The mk-hr_pcorp is a unidirectional (mk→hr) parallel corpus composed of synchronic fictional prose texts, received already in digital form, with over 500 thousand words in each language. The corpus was sentence-segmented and provides 39,735 aligned sentences. The alignment was done automatically and then post-corrected manually. The order of the alignments was shuffled, which enabled the corpus to be made available under a CC-BY license through META-SHARE. However, this prevents research on language units above the sentence level.
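The shuffling step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the aligned corpus is held in memory as a list of (Macedonian, Croatian) sentence pairs; the function name `shuffle_alignments`, the `seed` parameter, and the placeholder strings are hypothetical and are not taken from the paper, which does not describe its tooling. The point is only that each pair stays intact while the original text order is discarded.

```python
import random


def shuffle_alignments(pairs, seed=None):
    """Return a shuffled copy of aligned (mk, hr) sentence pairs.

    Each pair is moved as a unit, so sentence-level alignment is kept,
    while the original document order is no longer recoverable from
    the output alone.
    """
    rng = random.Random(seed)   # local generator, does not touch global state
    shuffled = list(pairs)      # copy so the input list is left unchanged
    rng.shuffle(shuffled)
    return shuffled


# Placeholder pairs standing in for real aligned sentences (hypothetical data).
pairs = [
    ("mk_sentence_1", "hr_sentence_1"),
    ("mk_sentence_2", "hr_sentence_2"),
    ("mk_sentence_3", "hr_sentence_3"),
    ("mk_sentence_4", "hr_sentence_4"),
]

shuffled = shuffle_alignments(pairs, seed=0)
for mk, hr in shuffled:
    print(mk, "\t", hr)
```

Because only whole pairs are permuted, sentence-level tasks are unaffected, whereas anything that depends on neighbouring sentences (discourse or paragraph-level analysis) is no longer possible, which is the limitation the abstract notes.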
Anthology ID:
L16-1671
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
4241–4244
URL:
https://aclanthology.org/L16-1671
Cite (ACL):
Ines Cebović and Marko Tadić. 2016. Building the Macedonian-Croatian Parallel Corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4241–4244, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Building the Macedonian-Croatian Parallel Corpus (Cebović & Tadić, LREC 2016)
PDF:
https://preview.aclanthology.org/nschneid-patch-1/L16-1671.pdf