Ruilin Wang
2026
Optimizing Packing and Shuffling Strategies for Enhanced Performance in Generative Language Models
Yanbing Chen | Ruilin Wang | Zihao Yang | Lavender Yao Jiang | Eric Karl Oermann
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Packing and shuffling tokens is a common practice in training auto-regressive language models to prevent overfitting and improve efficiency. Documents are typically concatenated into chunks of maximum sequence length (MSL) and then shuffled in fixed-size chunks of tokens (atoms, whose length is the atom size), which may break context within documents. An alternative approach is padding, which includes only one document per chunk. To optimize both packing strategies (concatenation vs. padding), we explored the optimal atom size for shuffling and compared the performance and efficiency of the two strategies. We found that in the most common setup (where average document length is greater than MSL), matching atom size to MSL yields the lowest perplexity, controlling for dataset. In addition, padding yields lower final perplexity than concatenation, at the cost of lower efficiency. This trade-off informs the choice of shuffling and packing methods in training LMs.
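As a rough illustration of the two packing strategies and of atom-size shuffling described in the abstract, here is a minimal Python sketch. The function names, the `EOS` and `PAD` token ids, the truncation of over-long documents under padding, and the dropping of a trailing partial chunk are all assumptions made for illustration, not details taken from the paper.

```python
import random
from typing import List

EOS = 0   # hypothetical end-of-document token id
PAD = -1  # hypothetical padding token id


def flatten(docs: List[List[int]]) -> List[int]:
    """Join tokenized documents into one stream, appending EOS after each."""
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(EOS)
    return stream


def pack_concat(docs: List[List[int]], msl: int) -> List[List[int]]:
    """Concatenation: cut the joined stream into chunks of exactly MSL tokens.

    Assumption: a trailing partial chunk is dropped.
    """
    stream = flatten(docs)
    return [stream[i:i + msl] for i in range(0, len(stream) - msl + 1, msl)]


def pack_pad(docs: List[List[int]], msl: int) -> List[List[int]]:
    """Padding: one document per chunk.

    Assumption: documents longer than MSL are truncated; shorter ones are
    right-padded with PAD.
    """
    chunks = []
    for doc in docs:
        head = doc[:msl]
        chunks.append(head + [PAD] * (msl - len(head)))
    return chunks


def shuffle_atoms(stream: List[int], atom_size: int, seed: int = 0) -> List[int]:
    """Split the stream into atoms of atom_size tokens and shuffle the atoms.

    Token order inside each atom is preserved; only atom order changes.
    """
    atoms = [stream[i:i + atom_size] for i in range(0, len(stream), atom_size)]
    random.Random(seed).shuffle(atoms)
    return [tok for atom in atoms for tok in atom]


docs = [[1, 2, 3, 4, 5], [6, 7, 8]]
concat_chunks = pack_concat(docs, msl=4)
pad_chunks = pack_pad(docs, msl=4)
shuffled = shuffle_atoms(flatten(docs), atom_size=2)
```

In this sketch, setting `atom_size` equal to `msl` means each shuffled atom is one intact MSL-length chunk, while a smaller atom size splits chunks into pieces that can land in different training sequences.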
2025
VenusFactory: An Integrated System for Protein Engineering with Data Retrieval and Language Model Fine-Tuning
Yang Tan | Chen Liu | Jingyuan Gao | Banghao Wu | Mingchen Li | Ruilin Wang | Lingrong Zhang | Huiqun Yu | Guisheng Fan | Liang Hong | Bingxin Zhou
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Natural language processing (NLP) has significantly influenced scientific domains beyond human language, including protein engineering, where pre-trained protein language models (PLMs) have demonstrated remarkable success. However, interdisciplinary adoption remains limited due to challenges in data collection, task benchmarking, and application. This work presents VenusFactory, a versatile engine that integrates biological data retrieval, standardized task benchmarking, and modular fine-tuning of PLMs. VenusFactory supports both the computer science and biology communities by offering command-line execution as well as a Gradio-based no-code interface, and it integrates 40+ protein-related datasets and 40+ popular PLMs. All implementations are open-sourced at https://github.com/ai4protein/VenusFactory. A video introduction is available at https://www.youtube.com/watch?v=MT6lPH5kgCc.