Zihao Yang
2026
Optimizing Packing and Shuffling Strategies for Enhanced Performance in Generative Language Models
Yanbing Chen | Ruilin Wang | Zihao Yang | Lavender Yao Jiang | Eric Karl Oermann
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Packing and shuffling tokens is a common practice in training auto-regressive language models to prevent overfitting and improve efficiency. Documents are typically concatenated into chunks of maximum sequence length (MSL) and shuffled in chunks of tokens (atom-size chunks), which can break context within documents. An alternative approach is padding, which includes only one document per chunk. To optimize both packing strategies (concatenation vs. padding), we explored the optimal atom size for shuffling and compared performance and efficiency. We found that in the most common setup (where average document length is greater than MSL), matching atom size to MSL yields the lowest perplexity, controlling for dataset. Padding also yields lower final perplexity than concatenation, at the cost of lower efficiency. This trade-off informs the choice of shuffling and packing methods in training LMs.
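The two packing strategies and atom-size shuffling described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `PAD` token id, the truncation of over-long documents in the padding variant, and the padding of the final concatenated chunk are all assumptions made for illustration.

```python
import random
from typing import List

PAD = 0  # hypothetical padding token id (assumption, not from the paper)


def pack_concat(docs: List[List[int]], msl: int) -> List[List[int]]:
    """Concatenate all documents into one token stream and cut it into
    chunks of length msl. Chunks may span document boundaries."""
    stream = [tok for doc in docs for tok in doc]
    chunks = [stream[i:i + msl] for i in range(0, len(stream), msl)]
    # Assumption: the last, possibly short, chunk is padded to msl.
    if chunks and len(chunks[-1]) < msl:
        chunks[-1] = chunks[-1] + [PAD] * (msl - len(chunks[-1]))
    return chunks


def pack_pad(docs: List[List[int]], msl: int) -> List[List[int]]:
    """Place exactly one document in each chunk and pad it to msl.
    Assumption: documents longer than msl are truncated."""
    return [doc[:msl] + [PAD] * max(0, msl - len(doc)) for doc in docs]


def shuffle_atoms(stream: List[int], atom_size: int, seed: int = 0) -> List[int]:
    """Split a token stream into contiguous atoms of atom_size tokens,
    shuffle the atoms, and flatten them back into one stream.
    Token order inside each atom is preserved."""
    atoms = [stream[i:i + atom_size] for i in range(0, len(stream), atom_size)]
    random.Random(seed).shuffle(atoms)
    return [tok for atom in atoms for tok in atom]


if __name__ == "__main__":
    docs = [[1, 2, 3], [4, 5]]
    print(pack_concat(docs, msl=4))  # chunks can mix tokens from both documents
    print(pack_pad(docs, msl=4))     # one document per chunk, rest is padding
    print(shuffle_atoms([1, 2, 3, 4, 5, 6], atom_size=2))
```

In this sketch, setting `atom_size` equal to `msl` means each shuffled unit is exactly one training chunk, which corresponds to the matched setting the abstract reports as giving the lowest perplexity. The padding variant spends part of each chunk on `PAD` tokens, which corresponds to the efficiency cost noted above.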
2023
Intriguing Effect of the Correlation Prior on ICD-9 Code Assignment
Zihao Yang | Chenkang Zhang | Muru Wu | Xujin Liu | Lavender Jiang | Kyunghyun Cho | Eric Oermann
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
The Ninth Revision of the International Classification of Diseases (ICD-9) is a standardized coding system used to classify health conditions. It is used for billing, tracking individual patient conditions, and for epidemiology. The highly detailed and technical nature of the codes and their associated medical conditions makes it difficult for humans to record them accurately. Researchers have explored the use of neural networks, particularly language models, for automated ICD-9 code assignment. However, the imbalanced distribution of ICD-9 codes leads to poor performance. One solution is to use domain knowledge to incorporate a useful prior. This paper evaluates the usefulness of the correlation bias: we hypothesize that correlations between ICD-9 codes and other medical codes could help improve language models’ performance. We showed that while the correlation bias worsens overall performance, its effect on individual classes can be negative or positive. Performance on classes that are more imbalanced and less correlated with other codes is more sensitive to incorporating the correlation bias. This suggests that while the correlation bias has the potential to improve ICD-9 code assignment in certain cases, the applicability criteria need to be studied more carefully.