Extraction of name and transliteration in monolingual and parallel corpora

Tracy Lin, Jian-Cheng Wu, Jason S. Chang


Abstract
Named entities in free text pose a challenge to text analysis in Machine Translation and Cross-Language Information Retrieval. These phrases are often transliterated into another language with a different sound inventory and writing system, and they are often not listed in bilingual dictionaries. Although it is possible to identify and translate named entities on the fly without a list of proper names and transliterations, an extensive list of existing transliterations will certainly ensure a high precision rate. We use a seed list of proper names and transliterations to train a Machine Transliteration Model. With this model, it is possible to extract proper names and their transliterations from monolingual or parallel corpora with high precision and recall.
Anthology ID:
2004.amta-papers.20
Volume:
Proceedings of the 6th Conference of the Association for Machine Translation in the Americas: Technical Papers
Month:
September 28 - October 2
Year:
2004
Address:
Washington, USA
Venue:
AMTA
Publisher:
Springer
Pages:
177–186
URL:
https://link.springer.com/chapter/10.1007/978-3-540-30194-3_20
Cite (ACL):
Tracy Lin, Jian-Cheng Wu, and Jason S. Chang. 2004. Extraction of name and transliteration in monolingual and parallel corpora. In Proceedings of the 6th Conference of the Association for Machine Translation in the Americas: Technical Papers, pages 177–186, Washington, USA. Springer.
Cite (Informal):
Extraction of name and transliteration in monolingual and parallel corpora (Lin et al., AMTA 2004)
PDF:
https://link.springer.com/chapter/10.1007/978-3-540-30194-3_20