Expanding Parallel Resources for Medium-Density Languages for Free

Georgi Iliev, Angel Genov


Abstract
We discuss a previously proposed method for augmenting parallel corpora of limited size for the purposes of machine translation through monolingual paraphrasing of the source language. We develop a three-stage shallow paraphrasing procedure to be applied to the Swedish-Bulgarian language pair, for which limited parallel resources exist. The source language exhibits specifics not typical of high-density languages already studied in a similar setting. Paraphrases of a highly productive type of compound noun in Swedish are generated by a corpus-based technique. Certain Swedish noun-phrase types are paraphrased using basic heuristics. Further, we introduce noun-phrase morphological variations for better wordform coverage. We evaluate the performance of a phrase-based statistical machine translation system trained on a baseline parallel corpus and on three stages of artificial enlargement of the source-language training data. Paraphrasing is shown to have no effect on performance for the Swedish-English translation task. We show a small, yet consistent, increase in the BLEU score of Swedish-Bulgarian translations of larger token spans on the first enlargement stage. A small improvement in the overall BLEU score of Swedish-Bulgarian translation is achieved on the second enlargement stage. We find that both improvements justify further research into the method for the Swedish-Bulgarian translation task.
Anthology ID:
L12-1434
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3937–3943
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/743_Paper.pdf
Cite (ACL):
Georgi Iliev and Angel Genov. 2012. Expanding Parallel Resources for Medium-Density Languages for Free. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3937–3943, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Expanding Parallel Resources for Medium-Density Languages for Free (Iliev & Genov, LREC 2012)