Yiming Wang


2025

Make Good Use of GujiRoBERTa to Identify Entities in Ancient Chinese
Lihan Lin | Yiming Wang | Jiachen Li | Huan Ouyang | Si Li
Proceedings of the Second Workshop on Ancient Language Processing

This report describes our model submitted for the EvaHan 2025 shared task on named entity recognition for ancient Chinese literary works. Since we participated in the closed modality of the task, our method is based on the designated pretrained language model GujiRoBERTa_jian_fan and uses only the designated datasets. We carried out experiments on decoding strategies and schedulers to verify the effect of our method. In the final test, our method outperformed the official baseline, demonstrating its effectiveness. Finally, this report analyzes the results from the perspective of data composition.

GL-GAN: Perceiving and Integrating Global and Local Styles for Handwritten Text Generation with Mamba
Yiming Wang | Hongxi Wei | Heng Wang | Shiwen Sun | Chao He
Proceedings of the 31st International Conference on Computational Linguistics

Handwritten text generation (HTG) aims to synthesize handwritten samples by imitating a specific writer, which has a wide range of applications and thus has significant research value. However, current studies on HTG are confronted with a main bottleneck: dominant models lack the ability to perceive and integrate handwriting styles, which affects the realism of the synthesized samples. In this paper, we propose GL-GAN, which effectively captures and integrates global and local styles. Specifically, we propose a Hybrid Style Encoder (HSE) that combines a state space model (SSM) and convolution to capture multilevel style features through various receptive fields. The captured style features are then fed to the proposed Dynamic Feature Enhancement Module (DFEM), which integrates these features by adaptively modeling the entangled relationships between multilevel styles and removing redundant details. Extensive experiments on two widely used handwriting datasets demonstrate that our GL-GAN is an effective HTG model and outperforms state-of-the-art models remarkably. Our code is publicly available at: https://github.com/Fyzjym/GL-GAN.

Transformer-based Speech Model Learns Well as Infants and Encodes Abstractions through Exemplars in the Poverty of the Stimulus Environment
Yi Yang | Yiming Wang | Jiahong Yuan
Proceedings of the 31st International Conference on Computational Linguistics

Infants are capable of learning language, predominantly through speech and associations, in impoverished environments—a phenomenon known as the Poverty of the Stimulus (POS). Is this ability uniquely human, as an innate linguistic predisposition, or can it be empirically learned through potential linguistic structures from sparse and noisy exemplars? As an early exploratory work, we systematically designed a series of tasks, scenarios, and metrics to simulate the POS. We found that the emerging speech model wav2vec2.0 with pretrained weights from an English corpus can learn well in noisy and sparse Mandarin environments. We then tested various hypotheses and observed three pieces of evidence for abstraction: label correction, categorical patterns, and clustering effects. We concluded that models can encode hierarchical linguistic abstractions through exemplars in POS environments. We hope this work offers new insights into language acquisition from a speech perspective and inspires further research.

2024

Automated Tone Transcription and Clustering with Tone2Vec
Yi Yang | Yiming Wang | ZhiQiang Tang | Jiahong Yuan
Findings of the Association for Computational Linguistics: EMNLP 2024

Lexical tones play a crucial role in Sino-Tibetan languages. However, current phonetic fieldwork relies on manual effort, resulting in substantial time and financial costs. This is especially challenging for the numerous endangered languages that are rapidly disappearing, often compounded by limited funding. In this paper, we introduce pitch-based similarity representations for tone transcription, named Tone2Vec. Experiments on dialect clustering and variance show that Tone2Vec effectively captures fine-grained tone variation. Utilizing Tone2Vec, we develop the first automatic approach for tone transcription and clustering by presenting a novel representation transformation for transcriptions. Additionally, these algorithms are systematically integrated into an open-sourced and easy-to-use package, ToneLab, which facilitates automated fieldwork and cross-regional, cross-lexical analysis for tonal languages. Extensive experiments were conducted to demonstrate the effectiveness of our methods.

CSLM: A Framework for Question Answering Dataset Generation through Collaborative Small Language Models
Yiming Wang | Yang Liu | Lingchen Wang | An Xiao
Findings of the Association for Computational Linguistics: EMNLP 2024

Collecting high-quality question-answer (QA) pairs is vital for the training of large language models (LLMs), yet this process is traditionally laborious and time-intensive. With the rapid evolution of LLMs, the potential for leveraging these models to autonomously generate QA pairs has become apparent, particularly through the use of large-scale models like GPT-4. However, the computational demands and associated costs often render such approaches prohibitive for the average researcher. Addressing this gap, we introduce the Collaborative Small Language Model Framework (CSLM), an innovative solution that combines a group of small-scaled, open-source LLMs to collaboratively produce QA pairs. Experiments on datasets of various domains show that CSLM unleashes the full potential of diverse small models to generate high-quality QA pairs, making it accessible to a broader range of researchers.

2022

Noise-injected Consistency Training and Entropy-constrained Pseudo Labeling for Semi-supervised Extractive Summarization
Yiming Wang | Qianren Mao | Junnan Liu | Weifeng Jiang | Hongdong Zhu | Jianxin Li
Proceedings of the 29th International Conference on Computational Linguistics

Labeling large amounts of extractive summarization data is often prohibitively expensive due to time, financial, and expertise constraints, which poses great challenges to incorporating summarization systems in practical applications. This limitation can be overcome by semi-supervised approaches, namely consistency training and pseudo-labeling, which make full use of unlabeled data. Research on the two, however, has been conducted independently, and very few works try to connect them. In this paper, we first use the noise-injected consistency training paradigm to regularize model predictions. Subsequently, we propose a novel entropy-constrained pseudo-labeling strategy that obtains high-confidence labels from unlabeled predictions by comparing the entropy of supervised and unsupervised predictions. By combining consistency training and pseudo-labeling, this framework enforces a low-density separation between classes, which decently improves the performance of supervised learning on an insufficiently labeled extractive summarization dataset.

2021

Extracting Topics with Simultaneous Word Co-occurrence and Semantic Correlation Graphs: Neural Topic Modeling for Short Texts
Yiming Wang | Ximing Li | Xiaotang Zhou | Jihong Ouyang
Findings of the Association for Computational Linguistics: EMNLP 2021

Short text has become an increasingly common form of text data, e.g., Twitter posts, news titles, and product reviews. Extracting semantic topics from short texts plays a significant role in a wide spectrum of NLP applications, and neural topic modeling is now a major tool to achieve it. Motivated by learning more coherent and semantic topics, in this paper we develop a novel neural topic model named Dual Word Graph Topic Model (DWGTM), which extracts topics from simultaneous word co-occurrence and semantic correlation graphs. To be specific, we learn word features from the global word co-occurrence graph, so as to ingest rich word co-occurrence information; we then generate text features from the word features and feed them into an encoder network to get per-text topic proportions; finally, we reconstruct the texts and the word co-occurrence graph with topical distributions and word features, respectively. In addition, to capture the semantics of words, we also apply word features to reconstruct a word semantic correlation graph computed from pre-trained word embeddings. Building on these ideas, we formulate DWGTM in an auto-encoding paradigm and efficiently train it in the spirit of neural variational inference. Empirical results validate that DWGTM can generate more semantically coherent topics than baseline topic models.

2019

Robust Document Representations for Cross-Lingual Information Retrieval in Low-Resource Settings
Mahsa Yarmohammadi | Xutai Ma | Sorami Hisamoto | Muhammad Rahman | Yiming Wang | Hainan Xu | Daniel Povey | Philipp Koehn | Kevin Duh
Proceedings of Machine Translation Summit XVII: Research Track

2018

An Empirical Study of Machine Translation for the Shared Task of WMT18
Chao Bei | Hao Zong | Yiming Wang | Baoyong Fan | Shiqi Li | Conghu Yuan
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper describes Global Tone Communication Co., Ltd.’s submission to the WMT18 shared news translation task. We participated in the English-to-Chinese direction and obtained the best BLEU score (43.8) among all participants. The submitted system focuses on data cleaning and on techniques to build a competitive model for this task. Unlike other participants, the submitted system relies mainly on data filtering to obtain the best BLEU score. We apply data filtering not only to the provided sentences but also to the back-translated sentences. The techniques we apply for data filtering include filtering by rules, language models, and translation models. We also conduct several experiments to validate the effectiveness of training techniques. According to our experiments, the Annealing Adam optimizing function and ensemble decoding are the most effective techniques for model training.

2014

UM-Corpus: A Large English-Chinese Parallel Corpus for Statistical Machine Translation
Liang Tian | Derek F. Wong | Lidia S. Chao | Paulo Quaresma | Francisco Oliveira | Yi Lu | Shuo Li | Yiming Wang | Longyue Wang
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Parallel corpora are a valuable resource for cross-language information retrieval and data-driven natural language processing systems, especially for Statistical Machine Translation (SMT). However, most existing parallel corpora involving Chinese are restricted to in-house use, while others are domain specific and limited in size. To a certain degree, this limits SMT research. This paper describes the acquisition of a large-scale, high-quality parallel corpus for English and Chinese. The corpus constructed in this paper contains about 15 million English-Chinese (E-C) parallel sentences, and more than 2 million training sentences and 5,000 testing sentences are made publicly available. Different from previous work, the corpus is designed to embrace eight different domains. Some of them are further categorized into different topics. The corpus will be released to the research community and made available at the NLP2CT website.

Learning Polylingual Topic Models from Code-Switched Social Media Documents
Nanyun Peng | Yiming Wang | Mark Dredze
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Factored Statistical Machine Translation for Grammatical Error Correction
Yiming Wang | Longyue Wang | Xiaodong Zeng | Derek F. Wong | Lidia S. Chao | Yi Lu
Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task

Domain Adaptation for Medical Text Translation using Web Resources
Yi Lu | Longyue Wang | Derek F. Wong | Lidia S. Chao | Yiming Wang
Proceedings of the Ninth Workshop on Statistical Machine Translation

Combining Domain Adaptation Approaches for Medical Text Translation
Longyue Wang | Yi Lu | Derek F. Wong | Lidia S. Chao | Yiming Wang | Francisco Oliveira
Proceedings of the Ninth Workshop on Statistical Machine Translation

2013

A Description of Tunable Machine Translation Evaluation Systems in WMT13 Metrics Task
Aaron Li-Feng Han | Derek F. Wong | Lidia S. Chao | Yi Lu | Liangye He | Yiming Wang | Jiaji Zhou
Proceedings of the Eighth Workshop on Statistical Machine Translation