Huiming Wang


2025

pdf bib
Re3Syn: A Dependency-Based Data Synthesis Framework for Long-Context Post-training
Zhiyang Zhang | Ziqiang Liu | Huiming Wang | Renke Shan | Li Kuang | Lu Wang | De Wen Soh
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

An important trend in the realm of large language models (LLMs) is the development of longer context windows. However, training LLMs with long context windows to acquire the capability of effectively modeling lengthy inputs is often hindered by the scarcity of naturally long-context data. Existing methods for constructing long-context data by concatenating short documents have overlooked a crucial characteristic of long-context data quality, namely semantic dependency. In this paper, we propose a novel framework called Retrieval, Dependency Recognition, and Reorder for data synthesis (Re3Syn), which leverages semantic similarity to retrieve relevant documents and form several batches. Within each batch, the framework comprehensively recognizes dependency and utilizes them, along with a reorder algorithm, to organize the short documents into coherent long-context data. Comprehensive experiment on multiple benchmarks indicate that the data generated by the Re3Syn has longer dependencies and significantly enhances the model’s long-context capabilities. For reproducibility, we will release our codebase upon acceptance.

pdf bib
CLaSp: In-Context Layer Skip for Self-Speculative Decoding
Longze Chen | Renke Shan | Huiming Wang | Lu Wang | Ziqiang Liu | Run Luo | Jiawei Wang | Hamid Alinejad-Rokny | Min Yang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Speculative decoding (SD) is a promising method for accelerating the decoding process of Large Language Models (LLMs). The efficiency of SD primarily hinges on the consistency between the draft model and the verify model. However, existing drafting approaches typically require additional modules to be trained, which can be challenging to implement and ensure compatibility across various LLMs. In this paper, we propose CLaSp, an in-context layer-skipping strategy for self-speculative decoding. Unlike prior methods, CLaSp does not require additional drafting modules or extra training. Instead, it employs a plug-and-play mechanism by skipping intermediate layers of the verify model to construct a compressed draft model. Specifically, we develop a dynamic programming algorithm that optimizes the layer-skipping process by leveraging the complete hidden states from the last verification stage as an objective. This enables CLaSp to dynamically adjust its layer-skipping strategy after each verification stage, without relying on pre-optimized sets of skipped layers. Experimental results across diverse downstream tasks demonstrate that CLaSp achieves a speedup of 1.3× ∼ 1.7× on LLaMA3 series models without altering the original distribution of the generated text.

pdf bib
AdaMergeX: Cross-Lingual Transfer with Large Language Models via Adaptive Adapter Merging
Yiran Zhao | Wenxuan Zhang | Huiming Wang | Kenji Kawaguchi | Lidong Bing
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

2024

pdf bib
Order-Agnostic Data Augmentation for Few-Shot Named Entity Recognition
Huiming Wang | Liying Cheng | Wenxuan Zhang | De Wen Soh | Lidong Bing
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Data augmentation (DA) methods have been proven to be effective for pre-trained language models (PLMs) in low-resource settings, including few-shot named entity recognition (NER). However, existing NER DA techniques either perform rule-based manipulations on words that break the semantic coherence of the sentence, or exploit generative models for entity or context substitution, which requires a substantial amount of labeled data and contradicts the objective of operating in low-resource settings. In this work, we propose order-agnostic data augmentation (OaDA), an alternative solution that exploits the often overlooked order-agnostic property in the training data construction phase of sequence-to-sequence NER methods for data augmentation. To effectively utilize the augmented data without suffering from the one-to-many issue, where multiple augmented target sequences exist for one single sentence, we further propose the use of ordering instructions and an innovative OaDA-XE loss. Specifically, by treating each permutation of entity types as an ordering instruction, we rearrange the entity set accordingly, ensuring a distinct input-output pair, while OaDA-XE assigns loss based on the best match between the target sequence and model predictions. We conduct comprehensive experiments and analyses across three major NER benchmarks and significantly enhance the few-shot capabilities of PLMs with OaDA.

pdf bib
Large Language Models can Contrastively Refine their Generation for Better Sentence Representation Learning
Huiming Wang | Zhaodonghui Li | Liying Cheng | De Wen Soh | Lidong Bing
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Recently, large language models (LLMs) have emerged as a groundbreaking technology and their unparalleled text generation capabilities have sparked interest in their application to the fundamental sentence representation learning task. Existing methods have explored utilizing LLMs as data annotators to generate synthesized data for training contrastive learning based sentence embedding models such as SimCSE. However, since contrastive learning models are sensitive to the quality of sentence pairs, the effectiveness of these methods is largely influenced by the content generated from LLMs, highlighting the need for more refined generation in the context of sentence representation learning. Building upon this premise, we propose MultiCSR, a multi-level contrastive sentence representation learning framework that decomposes the process of prompting LLMs to generate a corpus for training base sentence embedding models into three stages (i.e., sentence generation, sentence pair construction, in-batch training) and refines the generated content at these three distinct stages, ensuring only high-quality sentence pairs are utilized to train a base contrastive learning model. Our extensive experiments reveal that MultiCSR enables a less advanced LLM to surpass the performance of ChatGPT, while applying it to ChatGPT achieves better state-of-the-art results. Comprehensive analyses further underscore the potential of our framework in various application scenarios and achieving better sentence representation learning with LLMs.