Abstract
We discuss a previously proposed method for augmenting parallel corpora of limited size for the purposes of machine translation through monolingual paraphrasing of the source language. We develop a three-stage shallow paraphrasing procedure to be applied to the Swedish-Bulgarian language pair for which limited parallel resources exist. The source language exhibits specifics not typical of high-density languages already studied in a similar setting. Paraphrases of a highly productive type of compound nouns in Swedish are generated by a corpus-based technique. Certain Swedish noun-phrase types are paraphrased using basic heuristics. Further, we introduce noun-phrase morphological variations for better wordform coverage. We evaluate the performance of a phrase-based statistical machine translation system trained on a baseline parallel corpus and on three stages of artificial enlargement of the source-language training data. Paraphrasing is shown to have no effect on performance for the Swedish-English translation task. We show a small, yet consistent, increase in the BLEU score of Swedish-Bulgarian translations of larger token spans on the first enlargement stage. A small improvement in the overall BLEU score of Swedish-Bulgarian translation is achieved on the second enlargement stage. We find that both improvements justify further research into the method for the Swedish-Bulgarian translation task.
- Anthology ID:
- L12-1434
- Volume:
- Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
- Month:
- May
- Year:
- 2012
- Address:
- Istanbul, Turkey
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 3937–3943
- URL:
- http://www.lrec-conf.org/proceedings/lrec2012/pdf/743_Paper.pdf
- Cite (ACL):
- Georgi Iliev and Angel Genov. 2012. Expanding Parallel Resources for Medium-Density Languages for Free. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3937–3943, Istanbul, Turkey. European Language Resources Association (ELRA).
- Cite (Informal):
- Expanding Parallel Resources for Medium-Density Languages for Free (Iliev & Genov, LREC 2012)