The ELAN Slovene-English aligned corpus

Tomaz Erjavec


Abstract
Multilingual parallel corpora are a basic resource for research and development in MT. Such corpora are still scarce, especially for lower-diffusion languages. The paper presents a sentence-aligned, tokenised Slovene-English corpus, developed in the scope of the EU ELAN project. The corpus contains 1 million words from fifteen recent terminology-rich texts and is encoded according to the Guidelines for Text Encoding and Interchange (TEI). Our document type definition is a parametrisation of the TEI which directly encodes translation units of the bi-texts, in a manner similar to that of translation memories. The corpus is intended as a widely distributable dataset for language engineering and for translation and terminology studies. The paper describes the compilation of the corpus, its composition, encoding and availability. We highlight the corpus acquisition and distribution bottlenecks and present our solutions. These have to do with the workflow in the project and, not unrelatedly, with the encoding scheme for the corpus.
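To illustrate what directly encoding translation units can look like in practice, here is a minimal sketch of reading aligned sentence pairs from a translation-memory-style XML fragment. The element and attribute names (`tu`, `seg`, `lang`, `id`), the sample sentences and the function `read_translation_units` are assumptions made for illustration; the abstract does not specify the corpus's actual markup, and the real corpus is additionally tokenised.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: each <tu> holds one aligned Slovene-English pair.
# Tag and attribute names are illustrative assumptions, not the corpus DTD.
SAMPLE = """<body>
  <tu id="demo.1">
    <seg lang="sl">Dober dan.</seg>
    <seg lang="en">Good day.</seg>
  </tu>
  <tu id="demo.2">
    <seg lang="sl">Hvala lepa.</seg>
    <seg lang="en">Thank you very much.</seg>
  </tu>
</body>"""


def read_translation_units(xml_text):
    """Return a list of (unit id, Slovene text, English text) tuples."""
    root = ET.fromstring(xml_text)
    pairs = []
    for tu in root.iter("tu"):
        # Map each segment's language code to its text content.
        segs = {seg.get("lang"): (seg.text or "").strip() for seg in tu.iter("seg")}
        pairs.append((tu.get("id"), segs.get("sl", ""), segs.get("en", "")))
    return pairs


pairs = read_translation_units(SAMPLE)
for unit_id, slovene, english in pairs:
    print(unit_id, slovene, english, sep="\t")
```

Keeping both language versions inside one unit, rather than in two separate files linked by pointers, means a reader can recover the alignment with a single pass over the document.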
Anthology ID:
1999.mtsummit-1.51
Volume:
Proceedings of Machine Translation Summit VII
Month:
September 13-17
Year:
1999
Address:
Singapore, Singapore
Venue:
MTSummit
Pages:
349–357
URL:
https://aclanthology.org/1999.mtsummit-1.51
Cite (ACL):
Tomaz Erjavec. 1999. The ELAN Slovene-English aligned corpus. In Proceedings of Machine Translation Summit VII, pages 349–357, Singapore, Singapore.
Cite (Informal):
The ELAN Slovene-English aligned corpus (Erjavec, MTSummit 1999)
PDF:
https://preview.aclanthology.org/ingestion-script-update/1999.mtsummit-1.51.pdf