Gang Chen

2025

Instruction tuning is vital for aligning large language models (LLMs) with human intent, but current methods typically rely on costly human-annotated seed data or powerful external teacher models. While instruction back-translation techniques reduce this dependency, they remain fundamentally tethered to an initial seed set, which limits full automation, introduces biases, and can lead to inefficient use of unlabeled corpora. In this paper, we propose Cycle-Instruct, a novel framework that achieves fully seed-free instruction tuning. Inspired by cycle consistency, Cycle-Instruct employs a dual self-training loop where two models—an answer generator and a question generator—are bootstrapped solely from raw, unlabeled text. These models mutually supervise each other by reconstructing original text segments from their counterpart’s generated pseudo-labels, effectively learning from the intrinsic structure of the data without any human-provided seeds. We demonstrate Cycle-Instruct’s efficacy across four diverse data tracks, including general instruction-following, domain-specific tasks, dialogue logs, and plain text. Our extensive experiments show that Cycle-Instruct not only outperforms seed-driven back-translation baselines but also achieves performance comparable to strongly supervised methods.

Video large language models (Vid-LLMs) have shown strong capabilities in understanding video content. However, their reliance on dense video token representations introduces substantial memory and computational overhead in both prefilling and decoding. To mitigate the information loss of recent video token reduction methods and accelerate the decoding stage of Vid-LLMs losslessly, we introduce SpecVLM, a training-free speculative decoding (SD) framework tailored for Vid-LLMs that incorporates staged video token pruning. Building on our novel finding that the draft model’s speculation exhibits low sensitivity to video token pruning, SpecVLM prunes up to 90% of video tokens to enable efficient speculation without sacrificing accuracy. To achieve this, we perform a two-stage pruning process: Stage I selects highly informative tokens guided by attention signals from the verifier (target model), while Stage II prunes the remaining redundant ones in a spatially uniform manner. Extensive experiments on four video understanding benchmarks demonstrate the effectiveness and robustness of SpecVLM, which achieves up to 2.68× decoding speedup for LLaVA-OneVision-72B and 2.11× speedup for Qwen2.5-VL-32B. Code is available at https://github.com/zju-jiyicheng/SpecVLM.
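The two-stage pruning described in this abstract can be illustrated with a minimal sketch. It is not the SpecVLM implementation: the function name `two_stage_prune`, the parameters `keep_ratio` and `stage1_fraction`, and the treatment of "spatially uniform" as even spacing over a flattened 1-D token index are all assumptions made for illustration. The attention scores are taken to be one precomputed value per video token from the verifier.

```python
def two_stage_prune(attn_scores, keep_ratio=0.1, stage1_fraction=0.5):
    """Return sorted indices of video tokens to keep (illustrative only).

    attn_scores     -- one verifier attention score per token (assumed input)
    keep_ratio      -- fraction of tokens kept overall (0.1 = prune 90%)
    stage1_fraction -- hypothetical share of the budget given to Stage I
    """
    n = len(attn_scores)
    budget = max(1, round(n * keep_ratio))
    k1 = min(budget, round(budget * stage1_fraction))

    # Stage I: keep the k1 tokens with the highest attention scores.
    ranked = sorted(range(n), key=lambda i: attn_scores[i], reverse=True)
    stage1 = set(ranked[:k1])

    # Stage II: from the tokens not chosen in Stage I, keep evenly spaced
    # ones (a 1-D stand-in for spatially uniform pruning).
    remaining = [i for i in range(n) if i not in stage1]
    k2 = budget - k1
    stage2 = set()
    if k2 > 0 and remaining:
        step = len(remaining) / k2
        stage2 = {remaining[int(j * step)] for j in range(k2)}

    return sorted(stage1 | stage2)


# Toy usage: 20 tokens, two of which receive high attention.
scores = [0.0] * 20
scores[3], scores[17] = 0.9, 0.8
kept = two_stage_prune(scores, keep_ratio=0.2)
```

In this toy run, Stage I keeps the two high-attention tokens and Stage II fills the rest of the budget with evenly spaced tokens from the remainder.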

Ensembling Prompting Strategies for Zero-Shot Hierarchical Text Classification with Large Language Models
Mingxuan Xia | Zhijie Jiang | Haobo Wang | Junbo Zhao | Tianlei Hu | Gang Chen
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Hierarchical text classification (HTC) aims to classify documents into multiple labels within a hierarchical taxonomy, making it an essential yet challenging task in natural language processing. Recently, using large language models (LLMs) to tackle hierarchical text classification in a zero-shot manner has attracted increasing attention due to their cost-efficiency and flexibility. Given the challenges of understanding the hierarchy, various HTC prompting strategies have been explored to elicit the best performance from LLMs. However, our empirical study reveals that LLMs are highly sensitive to these prompting strategies: (i) within a task, different strategies yield substantially different results, and (ii) across various tasks, the relative effectiveness of a given strategy varies significantly. To address this, we propose a novel ensemble method, HiEPS, which integrates the results of diverse prompting strategies to improve LLMs’ reliability. We also introduce a path-valid voting mechanism for ensembling, which selects a valid result with the highest path frequency score. Extensive experiments on three benchmark datasets show that HiEPS boosts the performance of single prompting strategies and achieves SOTA results. The source code is available at https://github.com/MingxuanXia/HiEPS.
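As a rough illustration of the path-valid voting idea in this abstract, the sketch below ensembles label paths predicted by several prompting strategies. The abstract does not define the path frequency score, so the scoring used here (the sum, over the labels on a path, of how many predictions contain each label) is an assumption, as are the function names and the toy taxonomy. It should not be read as the HiEPS implementation.

```python
from collections import Counter


def is_valid_path(path, parent):
    """True if the path starts at a top-level label and follows parent links.

    parent maps every label to its parent label (None for top-level labels).
    """
    if not path or path[0] not in parent or parent[path[0]] is not None:
        return False
    return all(parent.get(child) == par for par, child in zip(path, path[1:]))


def path_valid_vote(predictions, parent):
    """Pick the valid predicted path with the highest (assumed) frequency score."""
    # Count, for each label, how many predictions mention it.
    label_votes = Counter(label for path in predictions for label in set(path))
    # Only paths that respect the taxonomy are eligible.
    candidates = {tuple(p) for p in predictions if is_valid_path(p, parent)}
    if not candidates:
        return None
    # Sorting first makes tie-breaking deterministic.
    return max(sorted(candidates),
               key=lambda p: sum(label_votes[label] for label in p))


# Hypothetical two-level taxonomy and four strategy outputs.
parent = {"Science": None, "Sports": None,
          "Physics": "Science", "Biology": "Science", "Soccer": "Sports"}
predictions = [["Science", "Physics"], ["Science", "Biology"],
               ["Sports", "Physics"], ["Science", "Physics"]]
winner = path_valid_vote(predictions, parent)
```

The prediction `["Sports", "Physics"]` still contributes label votes but cannot win, because "Physics" is not a child of "Sports" in the toy taxonomy. Summing votes favors longer paths, so a length-normalized score would be a reasonable alternative under the same assumptions.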

Multimodal learning is garnering significant attention for its capacity to represent diverse human perceptions (e.g., linguistic, acoustic, and visual signals), achieving more natural and intuitive interactions with technology. However, the frequent occurrence of incomplete data, either within a single modality (intra-modality) or across different modalities (inter-modality), presents substantial challenges in reliable semantic interpretation and model reasoning. Furthermore, there is currently no robust representation learning mechanism capable of managing both intra-modality and inter-modality real-data deficiencies. To address this challenge, we present T²DR, a two-tier deficiency-resistant framework for incomplete multimodal learning, which comprises two main modules: (1) Intra-Modal Deficiency-Resistant module (IADR): To address fine-grained deficiencies, we introduce Intra-Attn to focus on the available data while avoiding excessive suppression of the missing regions. (2) Inter-Modal Deficiency-Resistant module (IEDR): To handle coarse-grained deficiencies, we propose the shared feature prediction (SFP) to leverage cross-modal shared features for preliminary data imputation. Subsequently, we apply Inter-Attn to allocate appropriate attention to each modality based on the results from the capability-aware scorer (CAS). Extensive experiments are performed on two well-known multimodal benchmarks, CMU-MOSI and CMU-MOSEI, across various missing scenarios for sentiment analysis. Experimental results show that T²DR significantly outperforms the SOTA models. Code is available at https://github.com/LH019/T2DR.

We introduce LongTableBench, a benchmark for evaluating long-context reasoning over semi-structured tables across diverse formats, tasks, and domains. It comprises 5,950 QA instances spanning 7 table formats (e.g., Markdown, HTML, SQL), 18 domains, and input lengths up to 128K tokens, including multi-turn and multi-table settings. To ensure data quality, we combine symbolic supervision, cross-model validation, and human review. Evaluating 52 LLMs—including general-purpose, table-specific, and reasoning-enhanced models—reveals that only the strongest models maintain robust performance under increasing context lengths and format diversity. We further show that end-to-end models outperform compression-based approaches, especially on tasks requiring semantic integration. LongTableBench provides a rigorous, scalable testbed for advancing long-context tabular understanding and highlights key limitations in current LLMs’ structural and reasoning capabilities. The code and data are available at https://github.com/liyaooi/LongTableBench.