Bilingual dictionaries for all EU languages

Ahmet Aker, Monica Paramita, Mārcis Pinnis, Robert Gaizauskas


Abstract
Bilingual dictionaries can be generated automatically with the GIZA++ tool. However, these dictionaries contain a lot of noise, which degrades the output quality of tools that rely on them. In this work we present three methods for cleaning noise from automatically generated bilingual dictionaries: LLR-based, pivot-based and transliteration-based approaches. We have applied these approaches to GIZA++ dictionaries covering the official EU languages in order to remove noise. Our evaluation showed that all three methods help to reduce noise, with the best performance achieved by the transliteration-based approach. We provide all bilingual dictionaries (the original GIZA++ dictionaries and the cleaned ones), as well as the cleaning tools and scripts, free for download.
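As a rough illustration of the LLR idea named in the abstract, the sketch below scores a candidate dictionary entry with Dunning's log-likelihood ratio over a 2x2 co-occurrence table and keeps entries above a cutoff. It assumes that "LLR" refers to the log-likelihood ratio; the abstract does not give the paper's exact procedure, so the function names `llr` and `filter_entries`, the input layout, and the threshold of 10.83 (the chi-square critical value for p < 0.001 with one degree of freedom) are all illustrative assumptions rather than the authors' implementation.

```python
import math


def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: sentence pairs containing both the source and the target word
    k12: pairs containing the source word but not the target word
    k21: pairs containing the target word but not the source word
    k22: pairs containing neither word
    """
    def h(*counts):
        # Sum of c * ln(c / total); zero counts contribute nothing.
        total = sum(counts)
        return sum(c * math.log(c / total) for c in counts if c > 0)

    return 2.0 * (h(k11, k12, k21, k22)
                  - h(k11 + k12, k21 + k22)   # row totals
                  - h(k11 + k21, k12 + k22))  # column totals


def filter_entries(entries, threshold=10.83):
    """Keep (source, target) pairs whose LLR score reaches the threshold.

    `entries` is a hypothetical input format: an iterable of
    (source_word, target_word, (k11, k12, k21, k22)) tuples.
    The default threshold is an assumption, not a value from the paper.
    """
    return [(src, tgt) for src, tgt, table in entries
            if llr(*table) >= threshold]
```

A pair whose words co-occur no more often than chance scores close to zero and is dropped, while a pair that almost always co-occurs scores high and is kept. The cutoff would need tuning per language pair.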
Anthology ID:
L14-1623
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2839–2845
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/803_Paper.pdf
Cite (ACL):
Ahmet Aker, Monica Paramita, Mārcis Pinnis, and Robert Gaizauskas. 2014. Bilingual dictionaries for all EU languages. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2839–2845, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
Bilingual dictionaries for all EU languages (Aker et al., LREC 2014)
PDF:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/803_Paper.pdf