2022
Cross-lingual Feature Extraction from Monolingual Corpora for Low-resource Unsupervised Bilingual Lexicon Induction
Zihao Feng | Hailong Cao | Tiejun Zhao | Weixuan Wang | Wei Peng
Proceedings of the 29th International Conference on Computational Linguistics
Despite their progress in high-resource language settings, unsupervised bilingual lexicon induction (UBLI) models often fail on corpora of low-resource, distant language pairs due to insufficient initialization. In this work, we propose a cross-lingual feature extraction (CFE) method that learns cross-lingual features from monolingual corpora for low-resource UBLI, enabling representations of words with the same meaning to be leveraged by the initialization step. By integrating cross-lingual representations with pre-trained word embeddings in a fully unsupervised UBLI initialization, the proposed method outperforms existing state-of-the-art methods on low-resource language pairs (EN-VI, EN-TH, EN-ZH, EN-JA). An ablation study further shows that the learned cross-lingual features enhance the representational ability and robustness of the existing embedding model.
2016
A Distribution-based Model to Learn Bilingual Word Embeddings
Hailong Cao | Tiejun Zhao | Shu Zhang | Yao Meng
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
We introduce a distribution-based model to learn bilingual word embeddings from monolingual data. It is simple and effective, and requires neither parallel data nor a seed lexicon. We take advantage of the fact that word embeddings usually take the form of dense, real-valued, low-dimensional vectors, so their distribution can be accurately estimated. We propose a novel cross-lingual learning objective that directly matches the distribution of word embeddings in one language with that in the other. During joint learning, we dynamically estimate the distributions of word embeddings in the two languages and minimize the dissimilarity between them through standard back-propagation. The learned bilingual word embeddings group each word together with its translations in the shared vector space. We demonstrate the utility of the learned embeddings on the task of finding word-to-word translations from monolingual corpora, achieving encouraging performance on both related and substantially different language pairs.
2014
A Lexicalized Reordering Model for Hierarchical Phrase-based Translation
Hailong Cao | Dongdong Zhang | Mu Li | Ming Zhou | Tiejun Zhao
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers
Soft Dependency Matching for Hierarchical Phrase-based Machine Translation
Hailong Cao | Dongdong Zhang | Ming Zhou | Tiejun Zhao
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers
2012
The HIT-LTRC machine translation system for IWSLT 2012
Xiaoning Zhu | Yiming Cui | Conghui Zhu | Tiejun Zhao | Hailong Cao
Proceedings of the 9th International Workshop on Spoken Language Translation: Evaluation Campaign
In this paper, we describe HIT-LTRC's participation in the IWSLT 2012 evaluation campaign. This year, we took part in the Olympics Task, which required participants to translate Chinese to English with limited data. Our system is based on Moses [1], an open-source machine translation system. We mainly used phrase-based models in our experiments, and factored models were also evaluated for comparison. All the tools involved are freely available. In the evaluation campaign, we focused on data selection, a comparison of phrase extraction methods, and phrase table combination.
Expected Error Minimization with Ultraconservative Update for SMT
Lemao Liu | Tiejun Zhao | Taro Watanabe | Hailong Cao | Conghui Zhu
Proceedings of COLING 2012: Posters
Locally Training the Log-Linear Model for SMT
Lemao Liu | Hailong Cao | Taro Watanabe | Tiejun Zhao | Mo Yu | Conghui Zhu
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
2011
A Unified and Discriminative Soft Syntactic Constraint Model for Hierarchical Phrase-based Translation
Lemao Liu | Tiejun Zhao | Chao Wang | Hailong Cao
Proceedings of Machine Translation Summit XIII: Papers
2010
Syntactic Constraints on Phrase Extraction for Phrase-Based Machine Translation
Hailong Cao | Andrew Finch | Eiichiro Sumita
Proceedings of the 4th Workshop on Syntax and Structure in Statistical Translation
Filtering Syntactic Constraints for Statistical Machine Translation
Hailong Cao | Eiichiro Sumita
Proceedings of the ACL 2010 Conference Short Papers
2008
The NICT/ATR speech translation system for IWSLT 2008.
Masao Utiyama | Andrew Finch | Hideo Okuma | Michael Paul | Hailong Cao | Hirofumi Yamamoto | Keiji Yasuda | Eiichiro Sumita
Proceedings of the 5th International Workshop on Spoken Language Translation: Evaluation Campaign
This paper describes the National Institute of Information and Communications Technology/Advanced Telecommunications Research Institute International (NICT/ATR) statistical machine translation (SMT) system used for the IWSLT 2008 evaluation campaign. We participated in the Chinese–English (Challenge Task), English–Chinese (Challenge Task), Chinese–English (BTEC Task), Chinese–Spanish (BTEC Task), and Chinese–English–Spanish (PIVOT Task) translation tasks. In the English–Chinese Challenge Task, we explored various factors for English–Chinese translation, since research in this direction is scarce compared to the opposite one. In the Chinese–English Challenge Task, we employed a novel clustering method in which training sentences similar to the development data, in terms of word error rate, form a cluster. In the pivot translation task, we integrated two pivot-translation strategies by linear interpolation.