Statistical Models for Unsupervised, Semi-Supervised, and Supervised Transliteration Mining
Hassan Sajjad, Helmut Schmid, Alexander Fraser, Hinrich Schütze
Abstract
We present a generative model that efficiently mines transliteration pairs in a consistent fashion in three different settings: unsupervised, semi-supervised, and supervised transliteration mining. The model interpolates two sub-models, one for the generation of transliteration pairs and one for the generation of non-transliteration pairs (i.e., noise). The model is trained on noisy unlabeled data using the EM algorithm. During training the transliteration sub-model learns to generate transliteration pairs and the fixed non-transliteration model generates the noise pairs. After training, the unlabeled data is disambiguated based on the posterior probabilities of the two sub-models. We evaluate our transliteration mining system on data from a transliteration mining shared task and on parallel corpora. For three out of four language pairs, our system outperforms all semi-supervised and supervised systems that participated in the NEWS 2010 shared task. On word pairs extracted from parallel corpora with fewer than 2% transliteration pairs, our system achieves up to 86.7% F-measure with 77.9% precision and 97.8% recall.
- Anthology ID:
- J17-2003
- Volume:
- Computational Linguistics, Volume 43, Issue 2 - June 2017
- Month:
- June
- Year:
- 2017
- Address:
- Cambridge, MA
- Venue:
- CL
- Publisher:
- MIT Press
- Pages:
- 349–375
- URL:
- https://aclanthology.org/J17-2003
- DOI:
- 10.1162/COLI_a_00286
- Cite (ACL):
- Hassan Sajjad, Helmut Schmid, Alexander Fraser, and Hinrich Schütze. 2017. Statistical Models for Unsupervised, Semi-Supervised, and Supervised Transliteration Mining. Computational Linguistics, 43(2):349–375.
- Cite (Informal):
- Statistical Models for Unsupervised, Semi-Supervised, and Supervised Transliteration Mining (Sajjad et al., CL 2017)
- PDF:
- https://preview.aclanthology.org/emnlp22-frontmatter/J17-2003.pdf
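
The interpolation and posterior-based disambiguation described in the abstract can be illustrated with a heavily simplified sketch. It is not the authors' implementation: it assumes both sub-model probabilities are already given and fixed for each word pair, and uses EM only to estimate the interpolation weight, whereas the paper's transliteration sub-model is itself learned during training. The function names, the toy scores, and the 0.5 threshold are all hypothetical.

```python
def noise_posterior(p_translit, p_noise, lam):
    """Posterior probability that a word pair was generated by the noise
    sub-model, given mixture weight `lam` on the noise component."""
    return lam * p_noise / ((1.0 - lam) * p_translit + lam * p_noise)


def fit_noise_weight(scores, lam=0.5, iterations=100):
    """Estimate the interpolation weight with EM.

    Assumption: `scores` holds fixed (p_translit, p_noise) pairs, so only
    the mixture weight is re-estimated here.
    """
    for _ in range(iterations):
        # E-step: noise posterior of every pair under the current weight.
        posteriors = [noise_posterior(pt, pn, lam) for pt, pn in scores]
        # M-step: the new weight is the average noise posterior.
        lam = sum(posteriors) / len(posteriors)
    return lam


def mine(scores, lam, threshold=0.5):
    """Label a pair as a transliteration when its noise posterior is
    below `threshold` (a hypothetical cut-off)."""
    return [noise_posterior(pt, pn, lam) < threshold for pt, pn in scores]


# Hypothetical sub-model probabilities for five word pairs:
# (probability under transliteration sub-model, probability under noise sub-model)
scores = [(0.9, 0.01), (0.8, 0.02), (0.001, 0.05), (0.002, 0.04), (0.0005, 0.03)]

lam = fit_noise_weight(scores)
mined = mine(scores, lam)
print(lam, mined)
```

In this toy setting the first two pairs score far higher under the transliteration sub-model than under the noise sub-model, so they are the ones kept after thresholding the posterior.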