Lidia S. Chao


2024

pdf bib
3AM: An Ambiguity-Aware Multi-Modal Machine Translation Dataset
Xinyu Ma | Xuebo Liu | Derek F. Wong | Jun Rao | Bei Li | Liang Ding | Lidia S. Chao | Dacheng Tao | Min Zhang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Multimodal machine translation (MMT) is a challenging task that seeks to improve translation quality by incorporating visual information. However, recent studies have indicated that the visual information provided by existing MMT datasets is insufficient, causing models to disregard it and overestimate their capabilities. This issue presents a significant obstacle to the development of MMT research. This paper presents a novel solution to this issue by introducing 3AM, an ambiguity-aware MMT dataset comprising 26,000 parallel sentence pairs in English and Chinese, each with corresponding images. Our dataset is specifically designed to include more ambiguity and a greater variety of both captions and images than other MMT datasets. We utilize a word sense disambiguation model to select ambiguous data from vision-and-language datasets, resulting in a more challenging dataset. We further benchmark several state-of-the-art MMT models on our proposed dataset. Experimental results show that MMT models trained on our dataset exhibit a greater ability to exploit visual information than those trained on other MMT datasets. Our work provides a valuable resource for researchers in the field of multimodal learning and encourages further exploration in this area. The data, code and scripts are freely available at https://github.com/MaxyLee/3AM.

pdf
A Two-Stage Prediction-Aware Contrastive Learning Framework for Multi-Intent NLU
Guanhua Chen | Yutong Yao | Derek F. Wong | Lidia S. Chao
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Multi-intent natural language understanding (NLU) presents a formidable challenge due to the model confusion arising from multiple intents within a single utterance. While previous works train the model contrastively to increase the margin between different multi-intent labels, they are less suited to the nuances of multi-intent NLU. They ignore the rich information between the shared intents, which is beneficial to constructing a better embedding space, especially in low-data scenarios. We introduce a two-stage Prediction-Aware Contrastive Learning (PACL) framework for multi-intent NLU to harness this valuable knowledge. Our approach capitalizes on shared intent information by integrating word-level pre-training and prediction-aware contrastive fine-tuning. We construct a pre-training dataset using a word-level data augmentation strategy. Subsequently, our framework dynamically assigns roles to instances during contrastive fine-tuning while introducing a prediction-aware contrastive loss to maximize the impact of contrastive learning. We present experimental results and empirical analysis conducted on three widely used datasets, demonstrating that our method surpasses the performance of three prominent baselines on both low-data and full-data scenarios.
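
As a rough illustration of how shared-intent information can shape a contrastive objective, below is a minimal PyTorch sketch in which utterances sharing at least one intent label are treated as positives. It is an illustrative stand-in, not the paper's prediction-aware loss or its dynamic role assignment; all names are hypothetical.

```python
# Illustrative only: a contrastive loss whose positives are utterances that share
# at least one intent, reflecting the idea of exploiting shared-intent information.
import torch
import torch.nn.functional as F

def shared_intent_contrastive_loss(embeddings, intent_sets, temperature=0.1):
    """embeddings: (batch, d) utterance embeddings; intent_sets: list of sets of intent labels."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t() / temperature                     # pairwise similarities (batch, batch)
    batch = z.size(0)
    loss = 0.0
    for i in range(batch):
        positives = [j for j in range(batch) if j != i and intent_sets[i] & intent_sets[j]]
        if not positives:
            continue
        others = [j for j in range(batch) if j != i]
        denom = torch.logsumexp(sim[i, others], dim=0)
        # Standard supervised-contrastive form: pull shared-intent pairs together.
        loss = loss + sum(denom - sim[i, j] for j in positives) / len(positives)
    return loss / batch
```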

pdf
MoNMT: Modularly Leveraging Monolingual and Bilingual Knowledge for Neural Machine Translation
Jianhui Pang | Baosong Yang | Derek F. Wong | Dayiheng Liu | Xiangpeng Wei | Jun Xie | Lidia S. Chao
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The effective use of monolingual and bilingual knowledge represents a critical challenge within the neural machine translation (NMT) community. In this paper, we propose a modular strategy that facilitates the cooperation of these two types of knowledge in translation tasks, while avoiding the issue of catastrophic forgetting and exhibiting superior model generalization and robustness. Our model is comprised of three functionally independent modules: an encoding module, a decoding module, and a transferring module. The former two acquire large-scale monolingual knowledge via self-supervised learning, while the latter is trained on parallel data and responsible for transferring latent features between the encoding and decoding modules. Extensive experiments in multi-domain translation tasks indicate our model yields remarkable performance, with up to 7 BLEU improvements in out-of-domain tests over the conventional pretrain-and-finetune approach. Our codes are available at https://github.com/NLP2CT/MoNMT.
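
A minimal sketch of the three-module composition described above, assuming the encoder and decoder carry monolingual knowledge and only the transfer module is trained on parallel data; the module interfaces and the freezing choice are illustrative assumptions rather than the actual MoNMT design (see https://github.com/NLP2CT/MoNMT).

```python
# Illustrative composition of encoding, transferring, and decoding modules.
import torch.nn as nn

class ModularNMT(nn.Module):
    def __init__(self, encoder, transfer, decoder, freeze_mono=True):
        super().__init__()
        self.encoder, self.transfer, self.decoder = encoder, transfer, decoder
        if freeze_mono:
            # Keep the monolingual knowledge intact while training the bridge.
            for p in list(self.encoder.parameters()) + list(self.decoder.parameters()):
                p.requires_grad = False

    def forward(self, src_tokens, tgt_tokens):
        source_states = self.encoder(src_tokens)   # monolingual source knowledge
        bridged = self.transfer(source_states)     # learned on parallel data
        return self.decoder(tgt_tokens, bridged)   # monolingual target knowledge
```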

2023

pdf
Test-time Adaptation for Machine Translation Evaluation by Uncertainty Minimization
Runzhe Zhan | Xuebo Liu | Derek F. Wong | Cuilian Zhang | Lidia S. Chao | Min Zhang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Neural metrics have recently received considerable attention from the research community for the automatic evaluation of machine translation. Unlike text-based metrics that have interpretable and consistent evaluation mechanisms for various data sources, the reliability of neural metrics in assessing out-of-distribution data remains a concern due to the disparity between training data and real-world data. This paper aims to address the inference bias of neural metrics through uncertainty minimization during test time, without requiring additional data. Our proposed method comprises three steps: uncertainty estimation, test-time adaptation, and inference. Specifically, the model employs the prediction uncertainty of the current data as a signal to update a small fraction of parameters during test time and subsequently refine the prediction through optimization. To validate our approach, we apply the proposed method to three representative models and conduct experiments on the WMT21 benchmarks. The results obtained from both in-domain and out-of-distribution evaluations consistently demonstrate improvements in correlation performance across different models. Furthermore, we provide evidence that the proposed method effectively reduces model uncertainty. The code is publicly available at https://github.com/NLP2CT/TaU.
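
A minimal sketch of the three steps (uncertainty estimation, test-time adaptation, inference), assuming a PyTorch metric model, Monte Carlo dropout variance as the uncertainty signal, and LayerNorm parameters as the "small fraction of parameters"; these specifics are assumptions, not the released implementation at https://github.com/NLP2CT/TaU.

```python
# Illustrative test-time adaptation by uncertainty minimization.
import torch

def adapt_and_score(model: torch.nn.Module, batch, n_samples: int = 8,
                    lr: float = 1e-4, steps: int = 1):
    # Select a small fraction of parameters to update (here: LayerNorm only, an assumption).
    adapt_params = [p for m in model.modules()
                    if isinstance(m, torch.nn.LayerNorm)
                    for p in m.parameters()]
    optimizer = torch.optim.Adam(adapt_params, lr=lr)

    model.train()  # keep dropout active for stochastic forward passes
    for _ in range(steps):
        # Uncertainty estimation: variance over stochastic predictions.
        preds = torch.stack([model(**batch) for _ in range(n_samples)], dim=0)
        uncertainty = preds.var(dim=0).mean()
        # Test-time adaptation: minimize the uncertainty signal.
        optimizer.zero_grad()
        uncertainty.backward()
        optimizer.step()

    # Inference with the adapted parameters.
    model.eval()
    with torch.no_grad():
        return model(**batch)
```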

pdf
kNN-TL: k-Nearest-Neighbor Transfer Learning for Low-Resource Neural Machine Translation
Shudong Liu | Xuebo Liu | Derek F. Wong | Zhaocong Li | Wenxiang Jiao | Lidia S. Chao | Min Zhang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Transfer learning has been shown to be an effective technique for enhancing the performance of low-resource neural machine translation (NMT). This is typically achieved through either fine-tuning a child model with a pre-trained parent model, or by utilizing the output of the parent model during the training of the child model. However, these methods do not make use of the parent knowledge during the child inference, which may limit the translation performance. In this paper, we propose a k-Nearest-Neighbor Transfer Learning (kNN-TL) approach for low-resource NMT, which leverages the parent knowledge throughout the entire developing process of the child model. Our approach includes a parent-child representation alignment method, which ensures consistency in the output representations between the two models, and a child-aware datastore construction method that improves inference efficiency by selectively distilling the parent datastore based on relevance to the child model. Experimental results on four low-resource translation tasks show that kNN-TL outperforms strong baselines. Extensive analyses further demonstrate the effectiveness of our approach. Code and scripts are freely available at https://github.com/NLP2CT/kNN-TL.
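
A minimal sketch of the kNN-augmented decoding step that kNN-TL builds on: the decoder's hidden state queries a datastore of (representation, target token) pairs, and the retrieved distribution is interpolated with the NMT distribution. The parent-child representation alignment and the child-aware datastore pruning are not shown; names and hyperparameters are illustrative, not from https://github.com/NLP2CT/kNN-TL.

```python
# Illustrative kNN-interpolated next-token distribution.
import numpy as np

def knn_interpolated_probs(query, keys, values, model_probs,
                           k=8, temperature=10.0, lam=0.3):
    """query: (d,) decoder hidden state; keys: (N, d) datastore keys;
    values: (N,) target token ids; model_probs: (V,) NMT distribution."""
    dists = np.sum((keys - query) ** 2, axis=1)          # squared L2 distances
    idx = np.argsort(dists)[:k]                          # k nearest neighbours
    weights = np.exp(-dists[idx] / temperature)
    weights /= weights.sum()

    knn_probs = np.zeros_like(model_probs)
    for w, tok in zip(weights, values[idx]):
        knn_probs[tok] += w                              # aggregate by retrieved token

    return lam * knn_probs + (1.0 - lam) * model_probs   # interpolate with the model
```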

pdf
TransGEC: Improving Grammatical Error Correction with Translationese
Tao Fang | Xuebo Liu | Derek F. Wong | Runzhe Zhan | Liang Ding | Lidia S. Chao | Dacheng Tao | Min Zhang
Findings of the Association for Computational Linguistics: ACL 2023

Data augmentation is an effective way to improve model performance of grammatical error correction (GEC). This paper identifies a critical side-effect of GEC data augmentation, which is due to the style discrepancy between the data used in GEC tasks (i.e., texts produced by non-native speakers) and data augmentation (i.e., native texts). To alleviate this issue, we propose to use an alternative data source, translationese (i.e., human-translated texts), as input for GEC data augmentation, which 1) is easier to obtain and usually has better quality than non-native texts, and 2) has a more similar style to non-native texts. Experimental results on the CoNLL14 and BEA19 English, NLPCC18 Chinese, Falko-MERLIN German, and RULEC-GEC Russian GEC benchmarks show that our approach consistently improves correction accuracy over strong baselines. Further analyses reveal that our approach is helpful for overcoming mainstream correction difficulties such as the corrections of frequent words, missing words, and substitution errors. Data, code, models and scripts are freely available at https://github.com/NLP2CT/TransGEC.

pdf
Improving Grammatical Error Correction with Multimodal Feature Integration
Tao Fang | Jinpeng Hu | Derek F. Wong | Xiang Wan | Lidia S. Chao | Tsung-Hui Chang
Findings of the Association for Computational Linguistics: ACL 2023

Grammatical error correction (GEC) is a promising task aimed at correcting errors in a text. Many methods have been proposed to facilitate this task with remarkable results. However, most of them only focus on enhancing textual feature extraction without exploring the usage of other modalities’ information (e.g., speech), which can also provide valuable knowledge to help the model detect grammatical errors. To shore up this deficiency, we propose a novel framework that integrates both speech and text features to enhance GEC. In detail, we create new multimodal GEC datasets for English and German by generating audio from text using advanced text-to-speech models. Subsequently, we extract acoustic and textual representations using a multimodal encoder that consists of a speech and a text encoder. A mixture-of-experts (MoE) layer is employed to selectively align representations from the two modalities, and then a dot attention mechanism is used to fuse them as final multimodal representations. Experimental results on CoNLL14, BEA19 English, and Falko-MERLIN German show that our multimodal GEC models achieve significant improvements over strong baselines and achieve a new state-of-the-art result on the Falko-MERLIN test set.
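
A minimal sketch of the fusion stage described above: a small mixture-of-experts layer aligns the speech representations, and dot-product attention fuses them with the text representations. Shapes, the expert design, and the residual combination are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative speech-text fusion with an MoE alignment layer and dot attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechTextFusion(nn.Module):
    def __init__(self, d_model, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(num_experts)])
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, text_states, speech_states):
        """text_states: (batch, len_t, d); speech_states: (batch, len_s, d)."""
        gate = F.softmax(self.gate(speech_states), dim=-1)                # (batch, len_s, E)
        expert_out = torch.stack([e(speech_states) for e in self.experts], dim=-2)
        aligned = (gate.unsqueeze(-1) * expert_out).sum(dim=-2)           # MoE-aligned speech
        attn = F.softmax(text_states @ aligned.transpose(1, 2), dim=-1)   # dot attention
        return text_states + attn @ aligned                               # fused representations
```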

pdf
Towards Zero-Shot Multilingual Poetry Translation
Wai Lei Song | Haoyun Xu | Derek F. Wong | Runzhe Zhan | Lidia S. Chao | Shanshan Wang
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track

The application of machine translation in the field of poetry has always presented significant challenges. Conventional machine translation techniques are inadequate for capturing and translating the unique style of poetry. The absence of a parallel poetry corpus and the distinctive structure of poetry further restrict the effectiveness of traditional methods. This paper introduces a zero-shot method that is capable of translating poetry style without the need for a large-scale training corpus. Specifically, we treat poetry translation as a standard machine translation problem and subsequently inject the poetry style upon completion of the translation process. Our injection model only requires back-translation and easily obtainable monolingual data, making it a low-cost solution. We conducted experiments on three translation directions and presented automatic and human evaluations, demonstrating that our proposed method outperforms existing online systems and other competitive baselines. These results validate the feasibility and potential of our proposed approach and provide new prospects for poetry translation.

pdf
Human-in-the-loop Machine Translation with Large Language Model
Xinyi Yang | Runzhe Zhan | Derek F. Wong | Junchao Wu | Lidia S. Chao
Proceedings of Machine Translation Summit XIX, Vol. 2: Users Track

The large language model (LLM) has garnered significant attention due to its in-context learning mechanisms and emergent capabilities. The research community has conducted several pilot studies to apply LLMs to machine translation tasks and evaluate their performance from diverse perspectives. However, previous research has primarily focused on the LLM itself and has not explored human intervention in the inference process of LLM. The characteristics of LLM, such as in-context learning and prompt engineering, closely mirror human cognitive abilities in language tasks, offering an intuitive solution for human-in-the-loop generation. In this study, we propose a human-in-the-loop pipeline that guides LLMs to produce customized outputs with revision instructions. The pipeline initiates by prompting the LLM to produce a draft translation, followed by the utilization of automatic retrieval or human feedback as supervision signals to enhance the LLM’s translation through in-context learning. The human-machine interactions generated in this pipeline are also stored in an external database to expand the in-context retrieval database, enabling us to leverage human supervision in an offline setting. We evaluate the proposed pipeline using the GPT-3.5-turbo API on five domain-specific benchmarks for German-English translation. The results demonstrate the effectiveness of the pipeline in tailoring in-domain translations and improving translation performance compared to direct translation instructions. Additionally, we discuss the experimental results from the following perspectives: 1) the effectiveness of different in-context retrieval methods; 2) the construction of a retrieval database under low-resource scenarios; 3) the observed differences across selected domains; 4) the quantitative analysis of sentence-level and word-level statistics; and 5) the qualitative analysis of representative translation cases.
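
A minimal sketch of the draft-then-revise loop with an expanding interaction database. `call_llm` is a hypothetical wrapper around a chat-completion API (e.g. the GPT-3.5-turbo endpoint used in the paper), and the retrieval logic is a placeholder; none of this reproduces the authors' pipeline.

```python
# Illustrative human-in-the-loop translation loop: draft, retrieve/collect feedback, revise, store.
from typing import Optional

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wrap your chat-completion API call here")

interaction_db = []  # stores (source, feedback, revised) triples for later retrieval

def retrieve_feedback(source: str, k: int = 3):
    # Placeholder retrieval: in practice, rank interaction_db entries by similarity to `source`.
    return interaction_db[:k]

def translate_with_feedback(source: str, human_feedback: Optional[str] = None) -> str:
    draft = call_llm(f"Translate the following German sentence into English:\n{source}")
    feedback = human_feedback or "\n".join(
        f"Source: {s}\nFeedback: {f}\nRevision: {r}" for s, f, r in retrieve_feedback(source)
    )
    if not feedback:
        return draft
    revised = call_llm(
        f"Source: {source}\nDraft translation: {draft}\n"
        f"Revision instructions / examples:\n{feedback}\n"
        "Produce an improved translation."
    )
    interaction_db.append((source, feedback, revised))  # expand the in-context retrieval database
    return revised
```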

2022

pdf
ConsistTL: Modeling Consistency in Transfer Learning for Low-Resource Neural Machine Translation
Zhaocong Li | Xuebo Liu | Derek F. Wong | Lidia S. Chao | Min Zhang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Transfer learning is a simple and powerful method that can be used to boost model performance of low-resource neural machine translation (NMT). Existing transfer learning methods for NMT are static, which simply transfer knowledge from a parent model to a child model once via parameter initialization. In this paper, we propose a novel transfer learning method for NMT, namely ConsistTL, which can continuously transfer knowledge from the parent model during the training of the child model. Specifically, for each training instance of the child model, ConsistTL constructs the semantically-equivalent instance for the parent model and encourages prediction consistency between the parent and child for this instance, which is equivalent to the child model learning each instance under the guidance of the parent model. Experimental results on five low-resource NMT tasks demonstrate that ConsistTL results in significant improvements over strong transfer learning baselines, with a gain up to 1.7 BLEU over the existing back-translation model on the widely-used WMT17 Turkish-English benchmark. Further analysis reveals that ConsistTL can improve the inference calibration of the child model. Code and scripts are freely available at https://github.com/NLP2CT/ConsistTL.
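
A minimal sketch of the prediction-consistency term: the child model is encouraged to match the (frozen) parent model's distribution on a semantically-equivalent instance via a KL penalty added to the usual translation loss. The loss weighting and tensor shapes are illustrative assumptions (see https://github.com/NLP2CT/ConsistTL for the actual code).

```python
# Illustrative parent-child prediction-consistency loss.
import torch
import torch.nn.functional as F

def consistency_loss(child_logits, parent_logits, alpha=0.5):
    """child_logits, parent_logits: (batch, seq_len, vocab) over the same target tokens."""
    child_log_probs = F.log_softmax(child_logits, dim=-1)
    parent_probs = F.softmax(parent_logits.detach(), dim=-1)  # parent provides guidance, no gradient
    kl = F.kl_div(child_log_probs, parent_probs, reduction="batchmean")
    return alpha * kl  # added to the standard cross-entropy translation loss
```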

pdf
GuoFeng: A Benchmark for Zero Pronoun Recovery and Translation
Mingzhou Xu | Longyue Wang | Derek F. Wong | Hongye Liu | Linfeng Song | Lidia S. Chao | Shuming Shi | Zhaopeng Tu
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

The phenomenon of zero pronoun (ZP) has attracted increasing interest in the machine translation (MT) community due to its importance and difficulty. However, previous studies generally evaluate the quality of translating ZPs with BLEU scores on MT testsets, which is not expressive or sensitive enough for accurate assessment. To bridge the data and evaluation gaps, we propose a benchmark testset for target evaluation on Chinese-English ZP translation. The human-annotated testset covers five challenging genres, which reveal different characteristics of ZPs for comprehensive evaluation. We systematically revisit eight advanced models on ZP translation and identify current challenges for future exploration. We release data, code, models and annotation guidelines, which we hope can significantly promote research in this field (https://github.com/longyuewangdcu/mZPRT).

pdf
Alibaba-Translate China’s Submission for WMT2022 Metrics Shared Task
Yu Wan | Keqin Bao | Dayiheng Liu | Baosong Yang | Derek F. Wong | Lidia S. Chao | Wenqiang Lei | Jun Xie
Proceedings of the Seventh Conference on Machine Translation (WMT)

In this report, we present our submission to the WMT 2022 Metrics Shared Task. We build our system based on the core idea of UNITE (Unified Translation Evaluation), which unifies source-only, reference-only, and source-reference-combined evaluation scenarios into one single model. Specifically, during the model pre-training phase, we first apply the pseudo-labeled data examples to continuously pre-train UNITE. Notably, to reduce the gap between pre-training and fine-tuning, we use data cropping and a ranking-based score normalization strategy. During the fine-tuning phase, we use both Direct Assessment (DA) and Multidimensional Quality Metrics (MQM) data from past years’ WMT competitions. In particular, we collect the results from models with different pre-trained language model backbones, and use different ensembling strategies for involved translation directions.

2021

pdf
Difficulty-Aware Machine Translation Evaluation
Runzhe Zhan | Xuebo Liu | Derek F. Wong | Lidia S. Chao
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

The high-quality translation results produced by machine translation (MT) systems still pose a huge challenge for automatic evaluation. Current MT evaluation pays the same attention to each sentence component, while the questions of real-world examinations (e.g., university examinations) have different difficulties and weightings. In this paper, we propose a novel difficulty-aware MT evaluation metric, expanding the evaluation dimension by taking translation difficulty into consideration. A translation that fails to be predicted by most MT systems will be treated as a difficult one and assigned a large weight in the final score function, and vice versa. Experimental results on the WMT19 English-German Metrics shared tasks show that our proposed method outperforms commonly used MT metrics in terms of human correlation. In particular, our proposed method performs well even when all the MT systems are very competitive, which is when most existing metrics fail to distinguish between them. The source code is freely available at https://github.com/NLP2CT/Difficulty-Aware-MT-Evaluation.
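
A minimal sketch of the weighting idea: a reference token that most MT systems fail to reproduce is treated as difficult and receives a larger weight in the final score. The matching and weighting functions here are simplified assumptions, not the released metric (https://github.com/NLP2CT/Difficulty-Aware-MT-Evaluation).

```python
# Illustrative difficulty-weighted scoring over reference tokens.
import numpy as np

def difficulty_weights(hits_per_system):
    """hits_per_system: (num_systems, num_ref_tokens) binary matrix,
    1 if a system's output matched that reference token."""
    recall = hits_per_system.mean(axis=0)   # how often each token is matched across systems
    return 1.0 - recall                     # rarely matched tokens are "difficult"

def difficulty_aware_score(system_hits, weights):
    """system_hits: (num_ref_tokens,) binary vector for one system."""
    return float((system_hits * weights).sum() / (weights.sum() + 1e-9))
```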

pdf
RoBLEURT Submission for WMT2021 Metrics Task
Yu Wan | Dayiheng Liu | Baosong Yang | Tianchi Bi | Haibo Zhang | Boxing Chen | Weihua Luo | Derek F. Wong | Lidia S. Chao
Proceedings of the Sixth Conference on Machine Translation

In this paper, we present our submission to the Shared Metrics Task: RoBLEURT (Robustly Optimizing the training of BLEURT). After investigating the recent advances of trainable metrics, we identify several aspects of vital importance for obtaining a well-performing metric model: 1) jointly leveraging the advantages of the source-included model and the reference-only model, 2) continuously pre-training the model with massive synthetic data pairs, and 3) fine-tuning the model with a data denoising strategy. Experimental results show that our model reaches state-of-the-art correlations with the WMT2020 human annotations on 8 out of 10 to-English language pairs.

pdf
On the Copying Behaviors of Pre-Training for Neural Machine Translation
Xuebo Liu | Longyue Wang | Derek F. Wong | Liang Ding | Lidia S. Chao | Shuming Shi | Zhaopeng Tu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf
On the Complementarity between Pre-Training and Back-Translation for Neural Machine Translation
Xuebo Liu | Longyue Wang | Derek F. Wong | Liang Ding | Lidia S. Chao | Shuming Shi | Zhaopeng Tu
Findings of the Association for Computational Linguistics: EMNLP 2021

Pre-training (PT) and back-translation (BT) are two simple and powerful methods to utilize monolingual data for improving the model performance of neural machine translation (NMT). This paper takes the first step to investigate the complementarity between PT and BT. We introduce two probing tasks for PT and BT respectively and find that PT mainly contributes to the encoder module while BT brings more benefits to the decoder. Experimental results show that PT and BT are nicely complementary to each other, establishing state-of-the-art performances on the WMT16 English-Romanian and English-Russian benchmarks. Through extensive analyses on sentence originality and word frequency, we also demonstrate that combining Tagged BT with PT is more helpful to their complementarity, leading to better translation quality. Source code is freely available at https://github.com/SunbowLiu/PTvsBT.

pdf
Document Graph for Neural Machine Translation
Mingzhou Xu | Liangyou Li | Derek F. Wong | Qun Liu | Lidia S. Chao
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Previous works have shown that contextual information can improve the performance of neural machine translation (NMT). However, most existing document-level NMT methods fail to leverage contexts beyond a small set of previous sentences. How to make use of the whole document as global context is still a challenge. To address this issue, we hypothesize that a document can be represented as a graph that connects relevant contexts regardless of their distances. We employ several types of relations, including adjacency, syntactic dependency, lexical consistency, and coreference, to construct the document graph. Then, we incorporate both source and target graphs into the conventional Transformer architecture with graph convolutional networks. Experiments on various NMT benchmarks, including IWSLT English-French, Chinese-English, WMT English-German and Opensubtitle English-Russian, demonstrate that using document graphs can significantly improve the translation quality. Extensive analysis verifies that the document graph is beneficial for capturing discourse phenomena.
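
A minimal sketch of constructing one simplified document graph, with sentences as nodes, adjacency edges between neighboring sentences, and lexical-consistency edges between sentences sharing a content word. Dependency and coreference edges, and the graph convolutional network that consumes the graph, are omitted; this is an illustration, not the paper's graph definition.

```python
# Illustrative document graph over sentences (adjacency + lexical consistency edges).
import itertools
import numpy as np

def build_document_graph(sentences, stopwords=frozenset()):
    n = len(sentences)
    adj = np.eye(n)                                 # self-loops
    for i in range(n - 1):
        adj[i, i + 1] = adj[i + 1, i] = 1           # adjacency edges between neighbors
    content = [set(s.lower().split()) - stopwords for s in sentences]
    for i, j in itertools.combinations(range(n), 2):
        if content[i] & content[j]:
            adj[i, j] = adj[j, i] = 1               # lexical consistency edges
    return adj
```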

2020

pdf
Norm-Based Curriculum Learning for Neural Machine Translation
Xuebo Liu | Houtim Lai | Derek F. Wong | Lidia S. Chao
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

A neural machine translation (NMT) system is expensive to train, especially with high-resource settings. As the NMT architectures become deeper and wider, this issue gets worse and worse. In this paper, we aim to improve the efficiency of training an NMT model by introducing a novel norm-based curriculum learning method. We use the norm (aka length or modulus) of a word embedding as a measure of 1) the difficulty of the sentence, 2) the competence of the model, and 3) the weight of the sentence. The norm-based sentence difficulty takes advantage of both linguistically motivated and model-based sentence difficulties. It is easy to determine and contains learning-dependent features. The norm-based model competence makes NMT learn the curriculum in a fully automated way, while the norm-based sentence weight further enhances the learning of the vector representation of the NMT. Experimental results for the WMT’14 English-German and WMT’17 Chinese-English translation tasks demonstrate that the proposed method outperforms strong baselines in terms of BLEU score (+1.17/+1.56) and training speedup (2.22x/3.33x).
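
A minimal sketch of the norm-based quantities: sentence difficulty from word-embedding norms and a competence schedule that gradually admits harder sentences. The square-root competence function and the percentile-based selection are common competence-based curriculum choices assumed here for illustration, not necessarily the paper's exact formulation.

```python
# Illustrative norm-based difficulty and competence-based selection.
import torch

def sentence_difficulty(token_ids, embedding: torch.nn.Embedding):
    norms = embedding(token_ids).norm(dim=-1)    # per-word embedding norms
    return norms.mean().item()                   # one difficulty scalar per sentence

def competence(step, total_steps, c0=0.1):
    # Square-root competence schedule (an assumption).
    return min(1.0, (c0 ** 2 + (1 - c0 ** 2) * step / total_steps) ** 0.5)

def select_batch(candidates, difficulties, step, total_steps):
    """Keep only examples whose difficulty percentile is within the current competence."""
    ranks = torch.tensor(difficulties).argsort().argsort().float()
    percentiles = (ranks + 1) / len(difficulties)
    keep = percentiles <= competence(step, total_steps)
    return [ex for ex, k in zip(candidates, keep.tolist()) if k]
```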

pdf
Uncertainty-Aware Curriculum Learning for Neural Machine Translation
Yikai Zhou | Baosong Yang | Derek F. Wong | Yu Wan | Lidia S. Chao
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Neural machine translation (NMT) has proven to be facilitated by curriculum learning, which presents examples in an easy-to-hard order at different training stages. The keys lie in the assessment of data difficulty and model competence. We propose uncertainty-aware curriculum learning, which is motivated by the intuition that: 1) the higher the uncertainty in a translation pair, the more complex and rarer the information it contains; and 2) the end of the decline in model uncertainty indicates the completeness of the current training stage. Specifically, we use the cross-entropy of an example as its data difficulty and exploit the variance of distributions over the weights of the network to represent the model uncertainty. Extensive experiments on various translation tasks reveal that our approach outperforms the strong baseline and related methods on both translation quality and convergence speed. Quantitative analyses reveal that the proposed strategy offers NMT the ability to automatically govern its learning schedule.
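
A minimal sketch of the two measurements the method relies on: per-example cross-entropy as data difficulty, and the variance of predictions under Monte Carlo dropout as a stand-in for "the variance of distributions over the weights of the network". Both are illustrative simplifications.

```python
# Illustrative data-difficulty and model-uncertainty estimates.
import torch
import torch.nn.functional as F

def data_difficulty(logits, targets):
    """logits: (batch, seq, vocab); targets: (batch, seq) -> per-example difficulty (batch,)."""
    ce = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    return ce.mean(dim=-1)

def model_uncertainty(model, batch, n_samples=8):
    model.train()  # keep dropout active so forward passes are stochastic
    with torch.no_grad():
        probs = torch.stack(
            [F.softmax(model(**batch), dim=-1) for _ in range(n_samples)], dim=0
        )
    return probs.var(dim=0).mean()  # scalar uncertainty estimate
```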

pdf
Self-Paced Learning for Neural Machine Translation
Yu Wan | Baosong Yang | Derek F. Wong | Yikai Zhou | Lidia S. Chao | Haibo Zhang | Boxing Chen
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Recent studies have proven that the training of neural machine translation (NMT) can be facilitated by mimicking the learning process of humans. Nevertheless, the gains of such curriculum learning rely on the quality of an artificial schedule drawn up with handcrafted features, e.g. sentence length or word rarity. We ameliorate this procedure in a more flexible manner by proposing self-paced learning, where the NMT model is allowed to 1) automatically quantify the learning confidence over training examples; and 2) flexibly govern its learning via regulating the loss in each iteration step. Experimental results over multiple translation tasks demonstrate that the proposed model yields better performance than strong baselines and those models trained with human-designed curricula on both translation quality and convergence speed.
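
A minimal sketch of self-paced weighting: each example's loss is scaled by a confidence the model itself assigns, so confident examples dominate the update. Using the exponentiated negative loss as "confidence" is an illustrative simplification of the paper's formulation.

```python
# Illustrative self-paced, confidence-weighted training loss.
import torch

def self_paced_loss(per_token_nll, detach_confidence=True):
    """per_token_nll: (batch, seq) negative log-likelihoods of the gold tokens."""
    per_example_nll = per_token_nll.mean(dim=-1)        # (batch,)
    confidence = torch.exp(-per_example_nll)            # in (0, 1]; high for easy examples
    if detach_confidence:
        confidence = confidence.detach()                # act as weights, not gradient paths
    return (confidence * per_example_nll).mean()
```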

2019

pdf
Learning Deep Transformer Models for Machine Translation
Qiang Wang | Bei Li | Tong Xiao | Jingbo Zhu | Changliang Li | Derek F. Wong | Lidia S. Chao
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Transformer is the state-of-the-art model in recent machine translation evaluations. Two strands of research are promising to improve models of this kind: the first uses wide networks (a.k.a. Transformer-Big) and has been the de facto standard for development of the Transformer system, and the other uses deeper language representation but faces the difficulty arising from learning deep networks. Here, we continue the line of research on the latter. We claim that a truly deep Transformer model can surpass the Transformer-Big counterpart by 1) proper use of layer normalization and 2) a novel way of passing the combination of previous layers to the next. On WMT’16 English-German and NIST OpenMT’12 Chinese-English tasks, our deep system (30/25-layer encoder) outperforms the shallow Transformer-Big/Base baseline (6-layer encoder) by 0.4-2.4 BLEU points. As another bonus, the deep model is 1.6X smaller in size and 3X faster in training than Transformer-Big.
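
A minimal sketch of the two ingredients highlighted above: pre-norm residual blocks and feeding each layer a learned linear combination of all previous layer outputs. The layer constructor, the softmax-normalized combination weights, and other details are illustrative assumptions rather than the paper's exact formulation.

```python
# Illustrative deep encoder with pre-norm blocks and dense combination of previous layers.
import torch
import torch.nn as nn

class DenselyConnectedEncoder(nn.Module):
    def __init__(self, layer_ctor, num_layers, d_model):
        super().__init__()
        self.layers = nn.ModuleList([layer_ctor() for _ in range(num_layers)])
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(num_layers)])
        # One learnable weight per (layer, previous output) pair.
        self.combine = nn.ParameterList(
            [nn.Parameter(torch.zeros(i + 1)) for i in range(num_layers)]
        )
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        outputs = [x]                                       # start from the embeddings
        for i, (layer, norm) in enumerate(zip(self.layers, self.norms)):
            weights = torch.softmax(self.combine[i], dim=0)
            mixed = sum(w * o for w, o in zip(weights, outputs))
            outputs.append(mixed + layer(norm(mixed)))      # pre-norm residual block
        return self.final_norm(outputs[-1])
```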

pdf
Leveraging Local and Global Patterns for Self-Attention Networks
Mingzhou Xu | Derek F. Wong | Baosong Yang | Yue Zhang | Lidia S. Chao
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Self-attention networks have received increasing research attention. By default, the hidden states of each word are hierarchically calculated by attending to all words in the sentence, which assembles global information. However, several studies have pointed out that taking all signals into account may lead to overlooking neighboring information (e.g. phrase patterns). To address this issue, we propose a hybrid attention mechanism that dynamically leverages both local and global information. Specifically, our approach uses a gating scalar for integrating both sources of information, which is also convenient for quantifying their contributions. Experiments on various neural machine translation tasks demonstrate the effectiveness of the proposed method. The extensive analyses verify that the two types of contexts are complementary to each other, and our method gives highly effective improvements in their integration.

pdf
Shared-Private Bilingual Word Embeddings for Neural Machine Translation
Xuebo Liu | Derek F. Wong | Yang Liu | Lidia S. Chao | Tong Xiao | Jingbo Zhu
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Word embedding is central to neural machine translation (NMT), which has attracted intensive research interest in recent years. In NMT, the source embedding plays the role of the entrance while the target embedding acts as the terminal. These layers occupy most of the model parameters for representation learning. Furthermore, they indirectly interface via a soft-attention mechanism, which makes them comparatively isolated. In this paper, we propose shared-private bilingual word embeddings, which give a closer relationship between the source and target embeddings, and which also reduce the number of model parameters. For similar source and target words, their embeddings tend to share a part of the features and they cooperatively learn these common representation units. Experiments on 5 language pairs belonging to 6 different language families and written in 5 different alphabets demonstrate that the proposed model provides a significant performance boost over the strong baselines with dramatically fewer model parameters.

pdf
Assessing the Ability of Self-Attention Networks to Learn Word Order
Baosong Yang | Longyue Wang | Derek F. Wong | Lidia S. Chao | Zhaopeng Tu
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Self-attention networks (SAN) have attracted a lot of interest due to their high parallelization and strong performance on a variety of NLP tasks, e.g. machine translation. Due to the lack of a recurrence structure such as that of recurrent neural networks (RNN), SAN is assumed to be weak at learning positional information of words for sequence modeling. However, this speculation has neither been empirically confirmed, nor have explanations for SAN’s strong performance on machine translation tasks despite “lacking positional information” been explored. To this end, we propose a novel word reordering detection task to quantify how well the word order information is learned by SAN and RNN. Specifically, we randomly move one word to another position, and examine whether a trained model can detect both the original and inserted positions. Experimental results reveal that: 1) SAN trained on word reordering detection indeed has difficulty learning positional information even with the position embedding; and 2) SAN trained on machine translation learns better positional information than its RNN counterpart, in which position embedding plays a critical role. Although the recurrence structure makes the model more universally effective at learning word order, learning objectives matter more in downstream tasks such as machine translation.
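
A minimal sketch of constructing one word-reordering-detection example: a word is removed from a random position and re-inserted at another, and both positions serve as labels. The data format is an illustrative assumption, not the authors' pipeline.

```python
# Illustrative construction of a word-reordering-detection instance.
import random

def make_reordering_example(tokens, rng=random):
    tokens = list(tokens)
    src = rng.randrange(len(tokens))          # original position of the moved word
    word = tokens.pop(src)
    tgt = rng.randrange(len(tokens) + 1)      # inserted position in the perturbed sequence
    tokens.insert(tgt, word)
    return tokens, {"original": src, "inserted": tgt}
```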

pdf
Convolutional Self-Attention Networks
Baosong Yang | Longyue Wang | Derek F. Wong | Lidia S. Chao | Zhaopeng Tu
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Self-attention networks (SANs) have drawn increasing interest due to their high parallelization in computation and flexibility in modeling dependencies. SANs can be further enhanced with multi-head attention by allowing the model to attend to information from different representation subspaces. In this work, we propose novel convolutional self-attention networks, which offer SANs the abilities to 1) strengthen dependencies among neighboring elements, and 2) model the interaction between features extracted by multiple attention heads. Experimental results of machine translation on different language pairs and model settings show that our approach outperforms both the strong Transformer baseline and other existing models on enhancing the locality of SANs. Compared with prior studies, the proposed model is parameter-free in that it introduces no additional parameters.

2018

pdf
Modeling Localness for Self-Attention Networks
Baosong Yang | Zhaopeng Tu | Derek F. Wong | Fandong Meng | Lidia S. Chao | Tong Zhang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Self-attention networks have proven to be of profound value for their strength in capturing global dependencies. In this work, we propose to model localness for self-attention networks, which enhances the ability to capture useful local context. We cast localness modeling as a learnable Gaussian bias, which indicates the center and scope of the local region to which more attention should be paid. The bias is then incorporated into the original attention distribution to form a revised distribution. To maintain the strength of capturing long-distance dependencies while enhancing the ability to capture short-range dependencies, we only apply localness modeling to the lower layers of self-attention networks. Quantitative and qualitative analyses on Chinese-English and English-German translation tasks demonstrate the effectiveness and universality of the proposed approach.
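
A minimal sketch of adding a learnable Gaussian bias to the attention logits so that positions near a predicted center receive more attention. Predicting the center and window from the query with single linear layers is an illustrative simplification of the paper's parameterization.

```python
# Illustrative Gaussian localness bias applied to attention scores.
import torch
import torch.nn as nn

class LocalnessBias(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.center_proj = nn.Linear(d_model, 1)
        self.window_proj = nn.Linear(d_model, 1)

    def forward(self, query, scores):
        """query: (batch, len_q, d_model); scores: (batch, len_q, len_k) raw attention logits."""
        len_k = scores.size(-1)
        center = torch.sigmoid(self.center_proj(query)) * (len_k - 1)        # predicted center
        window = torch.sigmoid(self.window_proj(query)) * (len_k - 1) + 1e-3  # predicted scope
        positions = torch.arange(len_k, device=scores.device).view(1, 1, -1).float()
        gaussian_bias = -((positions - center) ** 2) / (2 * (window / 2) ** 2)
        return torch.softmax(scores + gaussian_bias, dim=-1)                 # revised distribution
```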

2017

pdf
Towards Bidirectional Hierarchical Representations for Attention-based Neural Machine Translation
Baosong Yang | Derek F. Wong | Tong Xiao | Lidia S. Chao | Jingbo Zhu
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

This paper proposes a hierarchical attentional neural translation model which focuses on enhancing source-side hierarchical representations by covering both local and global semantic information using a bidirectional tree-based encoder. To maximize the predictive likelihood of target words, a weighted variant of an attention mechanism is used to balance the attentive information between lexical and phrase vectors. Using a tree-based rare word encoding, the proposed model is extended to sub-word level to alleviate the out-of-vocabulary (OOV) problem. Empirical results reveal that the proposed model significantly outperforms sequence-to-sequence attention-based and tree-based neural translation models in English-Chinese translation tasks.

2015

pdf
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
Aaron Li-Feng Han | Xiaodong Zeng | Derek F. Wong | Lidia S. Chao
Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing

2014

pdf
UM-Corpus: A Large English-Chinese Parallel Corpus for Statistical Machine Translation
Liang Tian | Derek F. Wong | Lidia S. Chao | Paulo Quaresma | Francisco Oliveira | Yi Lu | Shuo Li | Yiming Wang | Longyue Wang
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

A parallel corpus is a valuable resource for cross-language information retrieval and data-driven natural language processing systems, especially for Statistical Machine Translation (SMT). However, most existing parallel corpora involving Chinese are restricted to in-house use, while others are domain-specific and limited in size. To a certain degree, this limits SMT research. This paper describes the acquisition of a large-scale and high-quality parallel corpus for English and Chinese. The corpus constructed in this paper contains about 15 million English-Chinese (E-C) parallel sentences, and more than 2 million training sentence pairs and 5,000 testing sentences are made publicly available. Different from previous work, the corpus is designed to embrace eight different domains. Some of them are further categorized into different topics. The corpus will be released to the research community, which is available at the NLP2CT website.

pdf
Factored Statistical Machine Translation for Grammatical Error Correction
Yiming Wang | Longyue Wang | Xiaodong Zeng | Derek F. Wong | Lidia S. Chao | Yi Lu
Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task

pdf
Domain Adaptation for Medical Text Translation using Web Resources
Yi Lu | Longyue Wang | Derek F. Wong | Lidia S. Chao | Yiming Wang
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf
Combining Domain Adaptation Approaches for Medical Text Translation
Longyue Wang | Yi Lu | Derek F. Wong | Lidia S. Chao | Yiming Wang | Francisco Oliveira
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf
Toward Better Chinese Word Segmentation for SMT via Bilingual Constraints
Xiaodong Zeng | Lidia S. Chao | Derek F. Wong | Isabel Trancoso | Liang Tian
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2013

pdf
Graph-based Semi-Supervised Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging
Xiaodong Zeng | Derek F. Wong | Lidia S. Chao | Isabel Trancoso
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf
Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation
Xiaodong Zeng | Derek F. Wong | Lidia S. Chao | Isabel Trancoso
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf
Quality Estimation for Machine Translation Using the Joint Method of Evaluation Criteria and Statistical Modeling
Aaron Li-Feng Han | Yi Lu | Derek F. Wong | Lidia S. Chao | Liangye He | Junwen Xing
Proceedings of the Eighth Workshop on Statistical Machine Translation

pdf
A Description of Tunable Machine Translation Evaluation Systems in WMT13 Metrics Task
Aaron Li-Feng Han | Derek F. Wong | Lidia S. Chao | Yi Lu | Liangye He | Yiming Wang | Jiaji Zhou
Proceedings of the Eighth Workshop on Statistical Machine Translation

pdf
Experiments with POS-based restructuring and alignment-based reordering for statistical machine translation
Shuo Li | Derek F. Wong | Lidia S. Chao
Proceedings of the Second Workshop on Hybrid Approaches to Translation

pdf
UM-Checker: A Hybrid System for English Grammatical Error Correction
Junwen Xing | Longyue Wang | Derek F. Wong | Lidia S. Chao | Xiaodong Zeng
Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task

pdf
Augmented Parsing of Unknown Word by Graph-Based Semi-Supervised Learning
Qiuping Huang | Derek F. Wong | Lidia S. Chao | Xiaodong Zeng | Liangye He
Proceedings of the 27th Pacific Asia Conference on Language, Information, and Computation (PACLIC 27)

pdf
Edit Distance: A New Data Selection Criterion for Domain Adaptation in SMT
Longyue Wang | Derek F. Wong | Lidia S. Chao | Junwen Xing | Yi Lu | Isabel Trancoso
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

pdf
Influence of Part-of-Speech and Phrasal Category Universal Tag-set in Tree-to-Tree Translation Models
Francisco Oliveira | Derek F. Wong | Lidia S. Chao | Liang Tian | Liangye He
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Language-independent Model for Machine Translation Evaluation with Reinforced Factors
Aaron Li-Feng Han | Derek F. Wong | Lidia S. Chao | Liangye He | Yi Lu | Junwen Xing | Xiaodong Zeng
Proceedings of Machine Translation Summit XIV: Posters

2012

pdf
An Improvement in Cross-Language Document Retrieval Based on Statistical Models
Longyue Wang | Derek F. Wong | Lidia S. Chao
Proceedings of the 24th Conference on Computational Linguistics and Speech Processing (ROCLING 2012)

pdf bib
TQDL: Integrated Models for Cross-Language Document Retrieval
Long-Yue Wang | Derek F. Wong | Lidia S. Chao
International Journal of Computational Linguistics & Chinese Language Processing, Volume 17, Number 4, December 2012-Special Issue on Selected Papers from ROCLING XXIV

pdf
CRFs-Based Chinese Word Segmentation for Micro-Blog with Small-Scale Data
Longyue Wang | Derek F. Wong | Lidia S. Chao | Junwen Xing
Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing

pdf
Rules Design in Word Segmentation of Chinese Micro-Blog
Hao Zong | Derek F. Wong | Lidia S. Chao
Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing

pdf
A Template Based Hybrid Model for Chinese Personal Name Disambiguation
Hao Zong | Derek F. Wong | Lidia S. Chao
Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing

pdf
A Joint Chinese Named Entity Recognition and Disambiguation System
Longyue Wang | Shuo Li | Derek F. Wong | Lidia S. Chao
Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing

pdf
A Simplified Chinese Parser with Factored Model
Qiuping Huang | Liangye He | Derek F. Wong | Lidia S. Chao
Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing

pdf
Adapting Multilingual Parsing Models to Sinica Treebank
Liangye He | Derek F. Wong | Lidia S. Chao
Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing

pdf
LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors
Aaron L. F. Han | Derek F. Wong | Lidia S. Chao
Proceedings of COLING 2012: Posters