Fengze Liu
2026
TiKMiX: Efficient Semi-Dynamic Data Mixture via Data Influence for LLM Pre-training
Yifan Wang | Binbinliu | Fengze Liu | Yuanfan Guo | Jiyao Deng | Xuecheng Wu | Weidong Zhou | Xiaohuan Zhou | Taifeng Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The data mixture used in the pre-training of a language model is a cornerstone of its final performance. Static data mixing strategies in Large Language Model (LLM) pre-training are often suboptimal as they fail to adapt to the model’s evolving learning states. Conversely, fully online dynamic updates, while adaptive, incur prohibitive computational costs. To bridge this gap, we propose TiKMiX, an efficient semi-dynamic data mixing framework. Our approach is grounded in a key observation of influence ranking invariance: the relative importance of data domains exhibits strong temporal stability over long training intervals. Leveraging this insight, we propose Group Influence, an efficient approach for quantifying domain impact, and formulate data mixing as a periodic, low-overhead influence maximization problem. Compared with REGMIX, the proposed method reduces computational overhead by 80% and achieves an average performance gain of 2% across nine downstream benchmarks, thereby effectively mitigating data under-digestion.
2025
AutoCT: Automating Interpretable Clinical Trial Prediction with LLM Agents
Fengze Liu | Haoyu Wang | Joonhyuk Cho | Dan Roth | Andrew Lo
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Clinical trials are critical for advancing medical treatments but remain prohibitively expensive and time-consuming. Accurate prediction of clinical trial outcomes can significantly reduce research and development costs and accelerate drug discovery. While recent deep learning models have shown promise by leveraging unstructured data, their black-box nature, lack of interpretability, and vulnerability to label leakage limit their practical use in high-stakes biomedical contexts. In this work, we propose AutoCT, a novel framework that combines the reasoning capabilities of large language models with the explainability of classical machine learning. AutoCT autonomously generates, evaluates, and refines tabular features based on public information without human input. Our method uses Monte Carlo Tree Search to iteratively optimize predictive performance. Experimental results show that AutoCT performs on par with or better than SOTA methods on clinical trial prediction tasks within only a limited number of self-refinement iterations, establishing a new paradigm for scalable, interpretable, and cost-efficient clinical trial prediction.
2024
Event Causality Identification with Synthetic Control
Haoyu Wang | Fengze Liu | Jiayao Zhang | Dan Roth | Kyle Richardson
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Event causality identification (ECI), a process that extracts causal relations between events from text, is crucial for distinguishing causation from correlation. Traditional approaches to ECI have primarily utilized linguistic patterns and multi-hop relational inference, risking false causality identification due to informal usage of causality and specious graphical inference. In this paper, we adopt the Rubin Causal Model to identify event causality: given two temporally ordered events, we see the first event as the treatment and the second one as the observed outcome. Determining their causality involves manipulating the treatment and estimating the resultant change in the likelihood of the outcome. Given that it is only possible to implement manipulation conceptually in the text domain, as a work-around, we try to find a twin for the protagonist from existing corpora. This twin should have life experiences identical to the protagonist’s before the treatment but undergo an intervention on the treatment. However, the practical difficulty of locating such a match limits its feasibility. Addressing this issue, we use the synthetic control method to generate such a twin from relevant historical data, leveraging text embedding synthesis and inversion techniques. This approach allows us to identify causal relations more robustly than previous methods, including GPT-4, as demonstrated on the causality benchmark COPES-hard.
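The core idea of the classical synthetic control method mentioned in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: it assumes each candidate "donor" and the protagonist are already represented as fixed-length numeric vectors of pre-treatment features (for example, text embeddings), and it uses a plain Frank-Wolfe solver chosen here only for brevity. The function name synthetic_control_weights and all data are hypothetical. The sketch looks for non-negative weights summing to one so that the weighted mix of donors matches the protagonist as closely as possible before the treatment.

```python
from typing import List, Sequence


def synthetic_control_weights(
    donors: Sequence[Sequence[float]],
    target: Sequence[float],
    iterations: int = 500,
) -> List[float]:
    """Return convex weights w (w_i >= 0, sum(w) == 1) that approximately
    minimize || sum_i w_i * donors[i] - target ||^2 via Frank-Wolfe.

    Hypothetical helper for illustration; not taken from the paper.
    """
    n, dim = len(donors), len(target)
    weights = [1.0 / n] * n  # start from the uniform mixture
    for _ in range(iterations):
        # Current synthetic unit and its residual against the target.
        synth = [sum(weights[i] * donors[i][k] for i in range(n)) for k in range(dim)]
        residual = [synth[k] - target[k] for k in range(dim)]
        # Gradient of the squared error with respect to each weight.
        grad = [2.0 * sum(donors[i][k] * residual[k] for k in range(dim)) for i in range(n)]
        best = min(range(n), key=grad.__getitem__)
        # Direction from the synthetic unit toward the best single donor.
        direction = [donors[best][k] - synth[k] for k in range(dim)]
        denom = sum(d * d for d in direction)
        if denom == 0.0:
            break
        # Exact line search, clipped to [0, 1] so weights stay on the simplex.
        step = -sum(residual[k] * direction[k] for k in range(dim)) / denom
        step = max(0.0, min(1.0, step))
        if step == 0.0:
            break  # no improving direction remains
        weights = [(1.0 - step) * w for w in weights]
        weights[best] += step
    return weights


# Toy usage: the target coincides with the first donor,
# so all weight should land on that donor.
toy_weights = synthetic_control_weights([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
print(toy_weights)  # → [1.0, 0.0]
```

In a full synthetic control analysis, the fitted weights would then be applied to the donors' post-treatment outcomes to estimate what would have happened to the protagonist without the treatment; that step, and the embedding inversion the abstract mentions, are outside this sketch.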