Abstract
In this paper we describe and compare two techniques for the automatic diacritization of Arabic text. First, we treat diacritization as a monotone machine translation problem, proposing and evaluating several translation and language models, including word-based and character-based models, used separately and in combination, as well as a model that uses statistical machine translation (SMT) to post-edit a rule-based diacritization system. Then we explore a more traditional view of diacritization as a sequence labeling problem and propose a solution using conditional random fields (Lafferty et al., 2001). All techniques are compared by word error rate and diacritization error rate, both for full diacritization and when vowel endings are ignored. In our experiments, the machine translation approaches achieved lower error rates than the sequence labeling approaches.
- Anthology ID: 2008.amta-srw.5
- Volume: Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Student Research Workshop
- Month: October 21-25
- Year: 2008
- Address: Waikiki, USA
- Venue: AMTA
- Publisher: Association for Machine Translation in the Americas
- Pages: 270–278
- URL: https://preview.aclanthology.org/icon-24-ingestion/2008.amta-srw.5/
- Cite (ACL): Tim Schlippe, ThuyLinh Nguyen, and Stephan Vogel. 2008. Diacritization as a Machine Translation and as a Sequence Labeling Problem. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Student Research Workshop, pages 270–278, Waikiki, USA. Association for Machine Translation in the Americas.
- Cite (Informal): Diacritization as a Machine Translation and as a Sequence Labeling Problem (Schlippe et al., AMTA 2008)
- PDF: https://preview.aclanthology.org/icon-24-ingestion/2008.amta-srw.5.pdf