Lemeng Wu
2026
MobileLLM-Flash: Latency-Guided On-Device LLM Design for Industry Scale Deployment
Hanxian Huang | Igor Fedorov | Andrey Gromov | Bernard Beckerman | Naveen Suda | David Eriksson | Maximilian Balandat | Rylan Conway | Patrick Huber | Chinnadhurai Sankar | Ayushi Dalmia | Zechun Liu | Lemeng Wu | Tarek Elgamal | Adithya Sagar | Vikas Chandra | Raghuraman Krishnamoorthi
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Real-time AI experiences call for on-device large language models (OD-LLMs) optimized for efficient deployment on resource-constrained hardware. The most useful OD-LLMs produce near-real-time responses and exhibit broad hardware compatibility, maximizing user reach. We present a methodology for designing such models using hardware-in-the-loop architecture search under mobile latency constraints. This system is amenable to industry-scale deployment: it generates models deployable without custom kernels and compatible with standard mobile runtimes like ExecuTorch. Our methodology avoids specialized attention mechanisms and instead uses attention skipping for long-context acceleration. Our approach jointly optimizes model architecture (layers, dimensions) and attention pattern. To efficiently evaluate candidates, we treat each as a pruned version of a pretrained backbone with inherited weights, thereby achieving high accuracy with minimal continued pretraining. We leverage the low cost of latency evaluation in a staged process: learning an accurate latency model first, then searching for the Pareto frontier across latency and quality. This yields MobileLLM-Flash, a family of foundation models (350M, 650M, 1.4B) for efficient on-device use with strong capabilities, supporting up to 8k context length. MobileLLM-Flash delivers up to 1.8x and 1.6x faster prefill and decode on mobile CPUs with comparable or superior quality. Our analysis of Pareto-frontier design choices offers actionable principles for OD-LLM design.
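As a rough illustration of the final search stage mentioned in the abstract, the sketch below filters a set of candidates down to those on a latency/quality Pareto frontier. It is not the authors' implementation: the `Candidate` tuple layout, the `pareto_front` helper, and all numbers are hypothetical and chosen only to show the idea that a candidate is kept when no other candidate is both at least as fast and at least as good.

```python
from typing import List, Tuple

# Hypothetical candidate record: (name, latency_ms, quality_score).
Candidate = Tuple[str, float, float]


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep candidates that are not dominated (lower latency, higher quality)."""
    front: List[Candidate] = []
    best_quality = float("-inf")
    # Visit candidates from fastest to slowest; on a latency tie, the
    # higher-quality candidate is visited first.
    for cand in sorted(candidates, key=lambda c: (c[1], -c[2])):
        # A slower candidate is only worth keeping if it improves quality.
        if cand[2] > best_quality:
            front.append(cand)
            best_quality = cand[2]
    return front


# Made-up values for illustration only.
candidates = [
    ("a", 10.0, 0.50),
    ("b", 12.0, 0.48),  # slower and worse than "a"
    ("c", 15.0, 0.60),
    ("d", 15.0, 0.55),  # same latency as "c" but worse
    ("e", 20.0, 0.60),  # same quality as "c" but slower
]
front = pareto_front(candidates)
front_names = [name for name, _, _ in front]
```

In a staged setup like the one the abstract describes, the latency values would presumably come from a learned latency model rather than being hard-coded as they are here.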
2024
LanguageFlow: Advancing Diffusion Language Generation with Probabilistic Flows
Shujian Zhang | Lemeng Wu | Chengyue Gong | Xingchao Liu
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Recent works have demonstrated success in controlling sentence attributes (e.g., sentiment) and structure (e.g., syntactic structure) based on the diffusion language model. A key component that drives the impressive performance for generating high-quality samples from noise is iterative denoising over thousands of steps. While beneficial, the complexity of starting from noise and the learning steps has limited its use in many real-world NLP applications. This paper proposes Language Rectified Flow (LF). Our method is based on a reformulation of the standard probabilistic flow models. Language rectified flow learns (neural) ordinary differential equation models to transport between the source distribution and the target distribution, hence providing a unified and effective solution to generative modeling and domain transfer. From the source distribution, our language rectified flow yields fast simulation and effectively decreases the inference time. Experiments on three challenging fine-grained control tasks and multiple high-quality text editing tasks show that our method consistently outperforms its baselines. Extensive experiments and ablation studies demonstrate that our method can be general, effective, and beneficial for many NLP tasks.
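To make the ODE-transport idea in the abstract more concrete, here is a minimal sketch of Euler simulation of dx/dt = v(x, t) from t = 0 to t = 1. It is an assumption-laden toy, not the paper's method: the velocity field is a closed-form stand-in for a learned neural network, and `euler_transport`, `straight_velocity`, `source`, and `target` are hypothetical names. The example uses an idealized straight-line field, for which even a single Euler step lands on the target, which hints at why straight transport paths can reduce simulation cost.

```python
from typing import Callable, List

Vector = List[float]


def euler_transport(
    x0: Vector,
    velocity: Callable[[Vector, float], Vector],
    num_steps: int,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with forward Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        # One Euler update: move along the current velocity for time dt.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy source and target points (made up for illustration).
source = [0.0, 1.0]
target = [2.0, -1.0]


def straight_velocity(x: Vector, t: float) -> Vector:
    # Stand-in for a learned model: a constant field pointing from the
    # source point to the target point (an idealized straight path).
    return [b - a for a, b in zip(source, target)]


one_step = euler_transport(source, straight_velocity, num_steps=1)
ten_steps = euler_transport(source, straight_velocity, num_steps=10)
```

With a curved velocity field, a coarse step count would generally introduce discretization error, so the number of steps becomes a speed/accuracy trade-off.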