Lei Liu
2026
Perplexity-Aware Data Scaling Law: Perplexity Landscapes Predict Performance for Continual Pre-training
Lei Liu | Hao Zhu | Xiaoyan Yang | Yue Shen | Zhixuan Chu | Jian Wang | Jinjie Gu | Kui Ren
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Continual Pre-training (CPT) serves as a fundamental approach for adapting foundation models to domain-specific applications. Scaling laws for pre-training define a power-law relationship between dataset size and the test loss of an LLM. However, the marginal gains from simply increasing data for CPT diminish rapidly, yielding suboptimal data utilization and inefficient training. To address this challenge, we propose a novel perplexity-aware data scaling law to establish a predictive relationship between the perplexity landscape of domain-specific data and the test loss. Our approach leverages the pre-trained model’s own perplexity on domain data as a proxy for estimating the knowledge gap, effectively quantifying the informational perplexity landscape of candidate training samples. By fitting this scaling law across diverse perplexity regimes, we enable adaptive selection of high-utility data subsets, prioritizing content that maximizes knowledge absorption while minimizing redundancy and noise. Extensive experiments on both medical and general-domain benchmarks demonstrate that our method consistently identifies near-optimal training subsets, achieving superior performance with significantly reduced data consumption.
2025
Formalizing Feature Inheritance
Gregory Kobele | Lei Liu
Proceedings of the Society for Computation in Linguistics 2025
Left-corner Minimalist parsing of mixed word order preferences
Lei Liu
Proceedings of the Society for Computation in Linguistics 2025
2023
Processing Advantages of End-weight
Lei Liu
Proceedings of the Society for Computation in Linguistics 2023
2021
A Dialogue-based Information Extraction System for Medical Insurance Assessment
Shuang Peng | Mengdi Zhou | Minghui Yang | Haitao Mi | Shaosheng Cao | Zujie Wen | Teng Xu | Hongbin Wang | Lei Liu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
A Systematic Investigation of KB-Text Embedding Alignment at Scale
Vardaan Pahuja | Yu Gu | Wenhu Chen | Mehdi Bahrami | Lei Liu | Wei-Peng Chen | Yu Su
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Knowledge bases (KBs) and text often contain complementary knowledge: KBs store structured knowledge that can support long-range reasoning, while text stores more comprehensive and timely knowledge in an unstructured way. Separately embedding the individual knowledge sources into vector spaces has demonstrated tremendous successes in encoding the respective knowledge, but how to jointly embed and reason with both knowledge sources to fully leverage the complementary information is still largely an open problem. We conduct a large-scale, systematic investigation of aligning KB and text embeddings for joint reasoning. We set up a novel evaluation framework with two evaluation tasks, few-shot link prediction and analogical reasoning, and evaluate an array of KB-text embedding alignment methods. We also demonstrate how such alignment can infuse textual information into KB embeddings for more accurate link prediction on emerging entities and events, using COVID-19 as a case study.
2018
Minimalist Parsing of Heavy NP Shift
Lei Liu
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation