Jia Xu


2021

pdf bib
Grouping Words with Semantic Diversity
Karine Chubarian | Abdul Rafae Khan | Anastasios Sidiropoulos | Jia Xu
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Deep Learning-based NLP systems can be sensitive to unseen tokens and hard to learn with high-dimensional inputs, which critically hinder learning generalization. We introduce an approach by grouping input words based on their semantic diversity to simplify input language representation with low ambiguity. Since the semantically diverse words reside in different contexts, we are able to substitute words with their groups and still distinguish word meanings relying on their contexts. We design several algorithms that compute diverse groupings based on random sampling, geometric distances, and entropy maximization, and we prove formal guarantees for the entropy-based algorithms. Experimental results show that our methods generalize NLP models and demonstrate enhanced accuracy on POS tagging and LM tasks and significant improvements on medium-scale machine translation tasks, up to +6.5 BLEU points. Our source code is available at https://github.com/abdulrafae/dg.

2020

pdf bib
Coding Textual Inputs Boosts the Accuracy of Neural Networks
Abdul Rafae Khan | Jia Xu | Weiwei Sun
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Natural Language Processing (NLP) tasks are usually performed word by word on textual inputs. We can use arbitrary symbols to represent the linguistic meaning of a word and use these symbols as inputs. As “alternatives” to a text representation, we introduce Soundex, MetaPhone, NYSIIS, logogram to NLP, and develop fixed-output-length coding and its extension using Huffman coding. Each of those codings combines different character/digital sequences and constructs a new vocabulary based on codewords. We find that the integration of those codewords with text provides more reliable inputs to Neural-Network-based NLP systems through redundancy than text-alone inputs. Experiments demonstrate that our approach outperforms the state-of-the-art models on the application of machine translation, language modeling, and part-of-speech tagging. The source code is available at https://github.com/abdulrafae/coding_nmt.

2019

pdf bib
CUNY-PKU Parser at SemEval-2019 Task 1: Cross-Lingual Semantic Parsing with UCCA
Weimin Lyu | Sheng Huang | Abdul Rafae Khan | Shengqiang Zhang | Weiwei Sun | Jia Xu
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper describes the systems of the CUNY-PKU team in SemEval 2019 Task 1: Cross-lingual Semantic Parsing with UCCA. We introduce a novel model by applying a cascaded MLP and BiLSTM model. Then, we ensemble multiple system-outputs by reparsing. In particular, we introduce a new decoding algorithm for building the UCCA representation. Our system won the first place in one track (French-20K-Open), second places in four tracks (English-Wiki-Open, English-20K-Open, German-20K-Open, and German-20K-Closed), and third place in one track (English-20K-Closed), among all seven tracks.

2018

pdf bib
Assessing Quality Estimation Models for Sentence-Level Prediction
Hoang Cuong | Jia Xu
Proceedings of the 27th International Conference on Computational Linguistics

This paper provides an evaluation of a wide range of advanced sentence-level Quality Estimation models, including Support Vector Regression, Ride Regression, Neural Networks, Gaussian Processes, Bayesian Neural Networks, Deep Kernel Learning and Deep Gaussian Processes. Beside the accurateness, our main concerns are also the robustness of Quality Estimation models. Our work raises the difficulty in building strong models. Specifically, we show that Quality Estimation models often behave differently in Quality Estimation feature space, depending on whether the scale of feature space is small, medium or large. We also show that Quality Estimation models often behave differently in evaluation settings, depending on whether test data come from the same domain as the training data or not. Our work suggests several strong candidates to use in different circumstances.

pdf bib
Hunter NMT System for WMT18 Biomedical Translation Task: Transfer Learning in Neural Machine Translation
Abdul Khan | Subhadarshi Panda | Jia Xu | Lampros Flokas
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper describes the submission of Hunter Neural Machine Translation (NMT) to the WMT’18 Biomedical translation task from English to French. The discrepancy between training and test data distribution brings a challenge to translate text in new domains. Beyond the previous work of combining in-domain with out-of-domain models, we found accuracy and efficiency gain in combining different in-domain models. We conduct extensive experiments on NMT with transfer learning. We train on different in-domain Biomedical datasets one after another. That means parameters of the previous training serve as the initialization of the next one. Together with a pre-trained out-of-domain News model, we enhanced translation quality with 3.73 BLEU points over the baseline. Furthermore, we applied ensemble learning on training models of intermediate epochs and achieved an improvement of 4.02 BLEU points over the baseline. Overall, our system is 11.29 BLEU points above the best system of last year on the EDP 2017 test set.

2017

pdf bib
Hunter MT: A Course for Young Researchers in WMT17
Jia Xu | Yi Zong Kuang | Shondell Baijoo | Jacob Hyun Lee | Uman Shahzad | Mir Ahmed | Meredith Lancaster | Chris Carlan
Proceedings of the Second Conference on Machine Translation

2014

pdf bib
Query Lattice for Translation Retrieval
Meiping Dong | Yong Cheng | Yang Liu | Jia Xu | Maosong Sun | Tatsuya Izuha | Jie Hao
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2011

pdf bib
Generating Virtual Parallel Corpus: A Compatibility Centric Method
Jia Xu | Weiwei Sun
Proceedings of Machine Translation Summit XIII: Papers

pdf bib
Parallel Corpus Refinement as an Outlier Detection Algorithm
Kaveh Taghipour | Shahram Khadivi | Jia Xu
Proceedings of Machine Translation Summit XIII: Papers

pdf bib
Enhancing Chinese Word Segmentation Using Unlabeled Data
Weiwei Sun | Jia Xu
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

pdf bib
DFKI Hybrid Machine Translation System for WMT 2011 - On the Integration of SMT and RBMT
Jia Xu | Hans Uszkoreit | Casey Kennington | David Vilar | Xiaojun Zhang
Proceedings of the Sixth Workshop on Statistical Machine Translation

2010

pdf bib
Further Experiments with Shallow Hybrid MT Systems
Christian Federmann | Andreas Eisele | Yu Chen | Sabine Hunsicker | Jia Xu | Hans Uszkoreit
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

2008

pdf bib
Bayesian Semi-Supervised Chinese Word Segmentation for Statistical Machine Translation
Jia Xu | Jianfeng Gao | Kristina Toutanova | Hermann Ney
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf bib
Phrase Table Training for Precision and Recall: What Makes a Good Phrase and a Good Phrase Pair?
Yonggang Deng | Jia Xu | Yuqing Gao
Proceedings of ACL-08: HLT

2007

pdf bib
Domain dependent statistical machine translation
Jia Xu | Yonggang Deng | Yuqing Gao | Hermann Ney
Proceedings of Machine Translation Summit XI: Papers

2006

pdf bib
Error Analysis of Statistical Machine Translation Output
David Vilar | Jia Xu | Luis Fernando D’Haro | Hermann Ney
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Evaluation of automatic translation output is a difficult task. Several performance measures like Word Error Rate, Position Independent Word Error Rate and the BLEU and NIST scores are widely use and provide a useful tool for comparing different systems and to evaluate improvements within a system. However the interpretation of all of these measures is not at all clear, and the identification of the most prominent source of errors in a given system using these measures alone is not possible. Therefore some analysis of the generated translations is needed in order to identify the main problems and to focus the research efforts. This area is however mostly unexplored and few works have dealt with it until now. In this paper we will present a framework for classification of the errors of a machine translation system and we will carry out an error analysis of the system used by the RWTH in the first TC-STAR evaluation.

pdf bib
Partitioning Parallel Documents Using Binary Segmentation
Jia Xu | Richard Zens | Hermann Ney
Proceedings on the Workshop on Statistical Machine Translation

2005

pdf bib
Sentence segmentation using IBM word alignment model 1
Jia Xu | Richard Zens | Hermann Ney
Proceedings of the 10th EAMT Conference: Practical applications of machine translation

pdf bib
Integrated Chinese Word Segmentation in Statistical Machine Translation
Jia Xu | Evgeny Matusov | Richard Zens | Hermann Ney
Proceedings of the Second International Workshop on Spoken Language Translation

pdf bib
The RWTH Phrase-based Statistical Machine Translation System
Richard Zens | Oliver Bender | Sasa Hasan | Shahram Khadivi | Evgeny Matusov | Jia Xu | Yuqi Zhang | Hermann Ney
Proceedings of the Second International Workshop on Spoken Language Translation

2004

pdf bib
Do We Need Chinese Word Segmentation for Statistical Machine Translation?
Jia Xu | Richard Zens | Hermann Ney
Proceedings of the Third SIGHAN Workshop on Chinese Language Processing