Transliteration and alignment of parallel texts from Cyrillic to Latin

Mircea Petic, Daniela Gîfu


Abstract
This article describes a methodology for recovering and preserving old Romanian texts and the problems related to their recognition. Our focus is on creating a gold corpus for the Romanian language (the novella Sania) in both alphabets used in Transnistria: Cyrillic and Latin. The resource is available for similar research. The technology is based on transliteration and semiautomatic alignment of parallel texts at the level of letters, lexemes and multiword expressions. We analysed every text segment in the corpus and discovered further writing conventions at the level of transliteration, academic norms and editorial interventions. These conventions allowed us to elaborate and implement new heuristics that make the automatic transliteration process correct. Sometimes words in the Latin script are modified in the Cyrillic script for semantic reasons (for instance, the editor’s interpretation). Semantic transliteration is seen as good practice for rendering multiword expressions from Cyrillic into Latin. Not only does it preserve how a multiword expression sounds in the source script, but it also enables the translator to modify the original text (here, by choosing the most common sense of an expression). Such a technology could be of interest to lexicographers, and also to specialists in computational linguistics seeking to improve current transliteration standards.
Anthology ID:
L14-1290
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
1819–1823
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/328_Paper.pdf
Cite (ACL):
Mircea Petic and Daniela Gîfu. 2014. Transliteration and alignment of parallel texts from Cyrillic to Latin. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 1819–1823, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
Transliteration and alignment of parallel texts from Cyrillic to Latin (Petic & Gîfu, LREC 2014)