Heuristic Word Alignment with Parallel Phrases

Maria Holmqvist


Abstract
We present a heuristic method for word alignment, the task of identifying corresponding words in parallel text. The method is based on parallel phrases extracted from manually word-aligned sentence pairs. Word alignment is performed by matching parallel phrases to new sentence pairs and adding word links from each parallel phrase to the words in the matching sentence segment. Experiments on an English-Swedish parallel corpus showed that the heuristic phrase-based method produced word alignments with high precision but low recall. To improve recall, phrases were generalized by replacing words with part-of-speech categories. The generalization improved recall, but at the expense of precision. Two filtering strategies were investigated to prune the large set of generalized phrases. Finally, the phrase-based method was compared to statistical word alignment with Giza++. Although statistical alignment based on large datasets outperforms phrase-based word alignment, a combination of phrase-based and statistical word alignment outperformed pure statistical alignment in terms of Alignment Error Rate (AER).
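The matching step described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `ParallelPhrase`, `find_spans`, `align`, and `aer` are hypothetical, matching is plain exact token matching (no part-of-speech generalization or filtering), and the toy sentence pair is invented for illustration. The AER function follows the standard definition over sure and possible gold links, AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|).

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Sequence, Set, Tuple

Link = Tuple[int, int]  # (source token index, target token index)


@dataclass(frozen=True)
class ParallelPhrase:
    """Hypothetical container: a phrase pair with phrase-internal word links."""
    src: Tuple[str, ...]
    tgt: Tuple[str, ...]
    links: FrozenSet[Link]  # indices are relative to the start of each phrase


def find_spans(tokens: Sequence[str], pattern: Sequence[str]) -> List[int]:
    """Return every start offset where `pattern` occurs contiguously in `tokens`."""
    n = len(pattern)
    return [i for i in range(len(tokens) - n + 1)
            if tuple(tokens[i:i + n]) == tuple(pattern)]


def align(src_tokens: Sequence[str], tgt_tokens: Sequence[str],
          phrases: Iterable[ParallelPhrase]) -> Set[Link]:
    """Add the links of every phrase whose two sides both match the sentence pair."""
    links: Set[Link] = set()
    for phrase in phrases:
        for i in find_spans(src_tokens, phrase.src):
            for j in find_spans(tgt_tokens, phrase.tgt):
                # Shift phrase-internal links to sentence positions.
                links.update((i + a, j + b) for a, b in phrase.links)
    return links


def aer(predicted: Set[Link], sure: Set[Link], possible: Set[Link]) -> float:
    """Alignment Error Rate; sure links are treated as a subset of possible links."""
    possible = possible | sure
    denominator = len(predicted) + len(sure)
    if denominator == 0:
        return 0.0
    return 1.0 - (len(predicted & sure) + len(predicted & possible)) / denominator


# Toy example (invented data, not from the paper's corpus).
src = "the red car is fast".split()
tgt = "den röda bilen är snabb".split()
phrase = ParallelPhrase(("red", "car"), ("röda", "bilen"),
                        frozenset({(0, 0), (1, 1)}))
predicted = align(src, tgt, [phrase])
gold_sure = {(k, k) for k in range(5)}
score = aer(predicted, gold_sure, set())
```

In this toy case both predicted links are correct but only two of the five gold links are found, so the sketch shows the same pattern the abstract reports: high precision, low recall.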
Anthology ID:
L10-1353
Volume:
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month:
May
Year:
2010
Address:
Valletta, Malta
Editors:
Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
URL:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/508_Paper.pdf
Cite (ACL):
Maria Holmqvist. 2010. Heuristic Word Alignment with Parallel Phrases. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
Cite (Informal):
Heuristic Word Alignment with Parallel Phrases (Holmqvist, LREC 2010)
PDF:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/508_Paper.pdf