Diacritization as a Machine Translation and as a Sequence Labeling Problem

Tim Schlippe, ThuyLinh Nguyen, Stephan Vogel


Abstract
In this paper we describe and compare two techniques for the automatic diacritization of Arabic text. First, we treat diacritization as a monotone machine translation problem, proposing and evaluating several translation and language models: word-based and character-based models, used separately and in combination, as well as a model that uses statistical machine translation (SMT) to post-edit a rule-based diacritization system. Second, we explore a more traditional view of diacritization as a sequence labeling problem and propose a solution using conditional random fields (Lafferty et al., 2001). All techniques are compared by word error rate and diacritization error rate, both for full diacritization and when ignoring vowel endings. In our experiments, the machine translation approaches achieve lower error rates than the sequence labeling approaches.
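The two evaluation measures named in the abstract can be illustrated with a minimal Python sketch. This listing does not give the paper's exact metric definitions, so the following are assumptions: word error rate is taken as the fraction of words with at least one wrongly diacritized letter, diacritization error rate as the fraction of letters carrying the wrong diacritics, and "ignoring vowel endings" as dropping the last letter of each word before comparing. The function names `split_word` and `error_rates` are hypothetical, and the sketch assumes that reference and hypothesis contain the same words and base letters in the same order.

```python
# Arabic diacritic code points, fathatan (U+064B) through sukun (U+0652).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def split_word(word):
    """Return a list of [base_letter, diacritics] pairs for one word."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            pairs[-1][1] += ch  # attach the diacritic to the preceding letter
        else:
            pairs.append([ch, ""])
    return pairs


def error_rates(reference, hypothesis, ignore_endings=False):
    """Return (word error rate, diacritization error rate) as fractions.

    Assumption: both strings have the same words and base letters,
    and differ only in their diacritics.
    """
    word_errors = char_errors = words = chars = 0
    for ref_w, hyp_w in zip(reference.split(), hypothesis.split()):
        ref, hyp = split_word(ref_w), split_word(hyp_w)
        if ignore_endings:
            # Assumed reading of "ignoring vowel endings": skip the final letter.
            ref, hyp = ref[:-1], hyp[:-1]
        # Sort so that the order of stacked marks (e.g. shadda + vowel) is ignored.
        wrong = sum(sorted(r[1]) != sorted(h[1]) for r, h in zip(ref, hyp))
        char_errors += wrong
        chars += len(ref)
        word_errors += wrong > 0
        words += 1
    wer = word_errors / words if words else 0.0
    der = char_errors / chars if chars else 0.0
    return wer, der
```

Under these assumptions, a hypothesis that differs from the reference only in the final vowel of a word counts as an error in the full evaluation but not in the evaluation that ignores vowel endings.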
Anthology ID:
2008.amta-srw.5
Volume:
Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Student Research Workshop
Month:
October 21-25
Year:
2008
Address:
Waikiki, USA
Venue:
AMTA
Publisher:
Association for Machine Translation in the Americas
Pages:
270–278
URL:
https://preview.aclanthology.org/icon-24-ingestion/2008.amta-srw.5/
Cite (ACL):
Tim Schlippe, ThuyLinh Nguyen, and Stephan Vogel. 2008. Diacritization as a Machine Translation and as a Sequence Labeling Problem. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Student Research Workshop, pages 270–278, Waikiki, USA. Association for Machine Translation in the Americas.
Cite (Informal):
Diacritization as a Machine Translation and as a Sequence Labeling Problem (Schlippe et al., AMTA 2008)
PDF:
https://preview.aclanthology.org/icon-24-ingestion/2008.amta-srw.5.pdf