Junyang Zhang

2026

LightMoE: Task-Aware Expert Availability Management for Memory-Efficient MoE-LLM Inference
Puhan Luo | Yunhao Yao | Junyang Wang | Junyang Zhang | Xiangyang Li
Findings of the Association for Computational Linguistics: ACL 2026

Mixture-of-Experts (MoE) models offer a promising path for scaling model capacity, yet their massive memory footprint poses significant challenges for deployment on resource-constrained edge devices. Existing solutions, such as static pruning or dynamic offloading, often struggle to balance model accuracy with inference latency due to irreversible information loss or prohibitive I/O overhead. In this paper, we propose LightMoE, a novel framework for memory-efficient MoE inference that exploits the inherent functional redundancy and temporal locality of expert activation. LightMoE employs a frequency-aware expert initialization strategy to retain a compact core of resident experts and introduces a similarity-based redirection mechanism to compensate for missing experts without incurring I/O costs. Furthermore, it incorporates a lightweight runtime manager that performs coarse-grained, task-level expert replacement to adapt to shifting data distributions. Empirical evaluations on representative edge platforms demonstrate that LightMoE achieves a superior accuracy-efficiency trade-off, improving average accuracy by 4.3% over static pruning and 2.4% over dynamic swapping methods, while maintaining inference latency comparable to strictly pruned models.
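
The two mechanisms named in the abstract, frequency-aware initialization of resident experts and similarity-based redirection for missing ones, can be illustrated with a toy sketch. This is not the authors' implementation. The abstract does not say how expert similarity is measured, so the cosine similarity, the per-expert `signatures` vectors, and the function names `pick_resident` and `redirect` are all assumptions made here for illustration.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pick_resident(activation_log, budget):
    """Frequency-aware initialization: keep the `budget` most activated experts."""
    counts = Counter(activation_log)
    return {expert for expert, _ in counts.most_common(budget)}


def redirect(expert_id, resident, signatures):
    """Return expert_id if it is resident, else the most similar resident expert.

    `signatures` is a hypothetical per-expert feature vector (an assumption);
    no I/O is performed, the request is simply rerouted in memory.
    """
    if expert_id in resident:
        return expert_id
    return max(resident, key=lambda r: cosine(signatures[expert_id], signatures[r]))


# Toy data: four experts, memory budget for two of them.
signatures = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [0.1, 0.9]}
activation_log = [0, 0, 0, 2, 2, 1, 3, 0, 2]

resident = pick_resident(activation_log, budget=2)
routed = [redirect(e, resident, signatures) for e in [0, 1, 2, 3]]
print(resident, routed)
```

In this toy run, experts 0 and 2 stay resident because they are activated most often, and requests for experts 1 and 3 are rerouted to their nearest resident neighbor instead of being loaded from storage.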

2022

We study the problem of extracting N-ary relation tuples from scientific articles. This task is challenging because the target knowledge tuples can reside in multiple parts and modalities of the document. Our proposed method ReSel decomposes this task into a two-stage procedure that first retrieves the most relevant paragraph/table and then selects the target entity from the retrieved component. For the high-level retrieval stage, ReSel designs a simple and effective feature set, which captures multi-level lexical and semantic similarities between the query and components. For the low-level selection stage, ReSel designs a cross-modal entity correlation graph along with a multi-view architecture, which models both semantic and document-structural relations between entities. Our experiments on three scientific information extraction datasets show that ReSel outperforms state-of-the-art baselines significantly.

Co-authors

Le Song 1

Yue Yu 1

Venues

EMNLP 1
Findings 1