Yiming Wang


Noise-injected Consistency Training and Entropy-constrained Pseudo Labeling for Semi-supervised Extractive Summarization
Yiming Wang | Qianren Mao | Junnan Liu | Weifeng Jiang | Hongdong Zhu | Jianxin Li
Proceedings of the 29th International Conference on Computational Linguistics

Labeling large amounts of extractive summarization data is often prohibitive expensive due to time, financial, and expertise constraints, which poses great challenges to incorporating summarization system in practical applications. This limitation can be overcome by semi-supervised approaches: consistency-training and pseudo-labeling to make full use of unlabeled data. Researches on the two, however, are conducted independently, and very few works try to connect them. In this paper, we first use the noise-injected consistency training paradigm to regularize model predictions. Subsequently, we propose a novel entropy-constrained pseudo labeling strategy to obtain high-confidence labels from unlabeled predictions, which can obtain high-confidence labels from unlabeled predictions by comparing the entropy of supervised and unsupervised predictions. By combining consistency training and pseudo-labeling, this framework enforce a low-density separation between classes, which decently improves the performance of supervised learning over an insufficient labeled extractive summarization dataset.


pdf bib
Extracting Topics with Simultaneous Word Co-occurrence and Semantic Correlation Graphs: Neural Topic Modeling for Short Texts
Yiming Wang | Ximing Li | Xiaotang Zhou | Jihong Ouyang
Findings of the Association for Computational Linguistics: EMNLP 2021

Short text nowadays has become a more fashionable form of text data, e.g., Twitter posts, news titles, and product reviews. Extracting semantic topics from short texts plays a significant role in a wide spectrum of NLP applications, and neural topic modeling is now a major tool to achieve it. Motivated by learning more coherent and semantic topics, in this paper we develop a novel neural topic model named Dual Word Graph Topic Model (DWGTM), which extracts topics from simultaneous word co-occurrence and semantic correlation graphs. To be specific, we learn word features from the global word co-occurrence graph, so as to ingest rich word co-occurrence information; we then generate text features with word features, and feed them into an encoder network to get topic proportions per-text; finally, we reconstruct texts and word co-occurrence graph with topical distributions and word features, respectively. Besides, to capture semantics of words, we also apply word features to reconstruct a word semantic correlation graph computed by pre-trained word embeddings. Upon those ideas, we formulate DWGTM in an auto-encoding paradigm and efficiently train it with the spirit of neural variational inference. Empirical results validate that DWGTM can generate more semantically coherent topics than baseline topic models.


pdf bib
Robust Document Representations for Cross-Lingual Information Retrieval in Low-Resource Settings
Mahsa Yarmohammadi | Xutai Ma | Sorami Hisamoto | Muhammad Rahman | Yiming Wang | Hainan Xu | Daniel Povey | Philipp Koehn | Kevin Duh
Proceedings of Machine Translation Summit XVII: Research Track


An Empirical Study of Machine Translation for the Shared Task of WMT18
Chao Bei | Hao Zong | Yiming Wang | Baoyong Fan | Shiqi Li | Conghu Yuan
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper describes the Global Tone Communication Co., Ltd.’s submission of the WMT18 shared news translation task. We participated in the English-to-Chinese direction and get the best BLEU (43.8) scores among all the participants. The submitted system focus on data clearing and techniques to build a competitive model for this task. Unlike other participants, the submitted system are mainly relied on the data filtering to obtain the best BLEU score. We do data filtering not only for provided sentences but also for the back translated sentences. The techniques we apply for data filtering include filtering by rules, language models and translation models. We also conduct several experiments to validate the effectiveness of training techniques. According to our experiments, the Annealing Adam optimizing function and ensemble decoding are the most effective techniques for the model training.


Learning Polylingual Topic Models from Code-Switched Social Media Documents
Nanyun Peng | Yiming Wang | Mark Dredze
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Factored Statistical Machine Translation for Grammatical Error Correction
Yiming Wang | Longyue Wang | Xiaodong Zeng | Derek F. Wong | Lidia S. Chao | Yi Lu
Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task

Domain Adaptation for Medical Text Translation using Web Resources
Yi Lu | Longyue Wang | Derek F. Wong | Lidia S. Chao | Yiming Wang
Proceedings of the Ninth Workshop on Statistical Machine Translation

Combining Domain Adaptation Approaches for Medical Text Translation
Longyue Wang | Yi Lu | Derek F. Wong | Lidia S. Chao | Yiming Wang | Francisco Oliveira
Proceedings of the Ninth Workshop on Statistical Machine Translation

UM-Corpus: A Large English-Chinese Parallel Corpus for Statistical Machine Translation
Liang Tian | Derek F. Wong | Lidia S. Chao | Paulo Quaresma | Francisco Oliveira | Yi Lu | Shuo Li | Yiming Wang | Longyue Wang
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Parallel corpus is a valuable resource for cross-language information retrieval and data-driven natural language processing systems, especially for Statistical Machine Translation (SMT). However, most existing parallel corpora to Chinese are subject to in-house use, while others are domain specific and limited in size. To a certain degree, this limits the SMT research. This paper describes the acquisition of a large scale and high quality parallel corpora for English and Chinese. The corpora constructed in this paper contain about 15 million English-Chinese (E-C) parallel sentences, and more than 2 million training data and 5,000 testing sentences are made publicly available. Different from previous work, the corpus is designed to embrace eight different domains. Some of them are further categorized into different topics. The corpus will be released to the research community, which is available at the NLP2CT website.


A Description of Tunable Machine Translation Evaluation Systems in WMT13 Metrics Task
Aaron Li-Feng Han | Derek F. Wong | Lidia S. Chao | Yi Lu | Liangye He | Yiming Wang | Jiaji Zhou
Proceedings of the Eighth Workshop on Statistical Machine Translation