Bilingual dictionaries for all EU languages
Ahmet Aker, Monica Paramita, Mārcis Pinnis, Robert Gaizauskas
Abstract
Bilingual dictionaries can be automatically generated using the GIZA++ tool. However, these dictionaries contain a lot of noise, which degrades the output quality of tools that rely on them. In this work we present three methods for cleaning noise from automatically generated bilingual dictionaries: a log-likelihood ratio (LLR), a pivot and a transliteration based approach. We applied these approaches to the GIZA++ dictionaries, which cover the official EU languages, in order to remove noise. Our evaluation showed that all methods help to reduce noise, but the best performance is achieved with the transliteration based approach. We provide all bilingual dictionaries (the original GIZA++ dictionaries and the cleaned ones), together with the cleaning tools and scripts, free for download.
- Anthology ID:
- L14-1623
- Volume:
- Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
- Month:
- May
- Year:
- 2014
- Address:
- Reykjavik, Iceland
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 2839–2845
- URL:
- http://www.lrec-conf.org/proceedings/lrec2014/pdf/803_Paper.pdf
- Cite (ACL):
- Ahmet Aker, Monica Paramita, Mārcis Pinnis, and Robert Gaizauskas. 2014. Bilingual dictionaries for all EU languages. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2839–2845, Reykjavik, Iceland. European Language Resources Association (ELRA).
- Cite (Informal):
- Bilingual dictionaries for all EU languages (Aker et al., LREC 2014)