Geng Zhang


2025

Data-Efficient Selection via Grammatical Complexity in Continual Pre-training of Domain-Specific LLMs
Yizhou Ying | Geng Zhang | Cui Danxin | Chengyu Du | Guanglei Yue | Sihang Jiang | Jiaqing Liang | Yifei Fu | Hailin Hu | Yanghua Xiao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Data efficiency is crucial in domain-specific continual pre-training (CPT) of large language models (LLMs), especially under resource constraints. Aiming for “small data, big impact,” this work addresses the limitations of existing domain-specific data selection strategies, which often rely on scarce labeled data or computationally expensive LLMs. We introduce CDF Sampling with Grammatical Complexity (CDF-GC), an annotation-independent, efficient, and interpretable data selection framework for CPT. Our approach evaluates grammatical complexity through both lexical diversity and syntactic complexity, and employs a cumulative distribution function (CDF)-based sampling strategy to balance complexity and diversity. To validate the effectiveness of CDF-GC, we conducted experiments on a financial dataset. The results demonstrate that CDF-GC significantly outperforms the baselines, achieving a 2.0% improvement in financial QA at the same selection ratio and even surpassing full-data training by 1.7% while using only 20% of the data.
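
The abstract leaves the sampling rule at a high level. As a minimal sketch of one plausible reading (the complexity proxy, weighting scheme, and function names below are illustrative assumptions, not the paper's actual formulation), one could score each document for grammatical complexity and then sample without replacement with probabilities proportional to each document's empirical CDF value, so that complex texts are favored while simpler ones remain reachable:

    import random

    def grammatical_complexity(text):
        # Stand-in score: type-token ratio (lexical diversity) times mean
        # sentence length (a crude syntactic-complexity proxy); the paper's
        # actual metrics are presumably richer than this.
        tokens = text.split()
        ttr = len(set(tokens)) / max(len(tokens), 1)
        sentences = [s for s in text.split(".") if s.strip()]
        return ttr * len(tokens) / max(len(sentences), 1)

    def cdf_sample(texts, ratio=0.2, seed=0):
        # Rank documents by complexity; a document's weight is its empirical
        # CDF value (rank / N), so no document's probability drops to zero.
        rng = random.Random(seed)
        scores = [grammatical_complexity(t) for t in texts]
        order = sorted(range(len(texts)), key=lambda i: scores[i])
        cdf = {doc: rank / len(texts) for rank, doc in enumerate(order, 1)}
        pool, chosen = list(range(len(texts))), []
        for _ in range(int(ratio * len(texts))):
            pick = rng.choices(pool, weights=[cdf[i] for i in pool], k=1)[0]
            pool.remove(pick)
            chosen.append(pick)
        return [texts[i] for i in chosen]

Weighting by the CDF value rather than the raw score makes the selection pressure depend only on rank, which would keep such a sampler robust to the scale of whichever complexity metric is plugged in.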

From Remembering to Metacognition: Do Existing Benchmarks Accurately Evaluate LLMs?
Geng Zhang | Yizhou Ying | Sihang Jiang | Jiaqing Liang | Guanglei Yue | Yifei Fu | Hailin Hu | Yanghua Xiao
Findings of the Association for Computational Linguistics: EMNLP 2025

Despite the rapid development of large language models (LLMs), existing benchmark datasets often focus on low-level cognitive tasks, such as factual recall and basic comprehension, while providing limited coverage of higher-level reasoning skills, including analysis, evaluation, and creation. In this work, we systematically assess the cognitive depth of popular LLM benchmarks using Bloom’s Taxonomy, evaluating both the cognitive and knowledge dimensions. Our analysis reveals a pronounced imbalance: most datasets concentrate on “Remembering” and “Understanding”, while metacognitive and creative reasoning are largely underrepresented. We also find that incorporating higher-level cognitive instructions into the current instruction fine-tuning process improves model performance. These findings highlight the need for future benchmarks to incorporate metacognitive evaluations in order to more accurately assess and enhance model performance.
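
The level labels themselves would have to come from human annotators or an LLM judge, which the abstract does not specify; the bookkeeping behind the kind of imbalance analysis described above is simple. A hypothetical sketch, assuming the labels are assigned upstream:

    from collections import Counter

    BLOOM_LEVELS = ["Remembering", "Understanding", "Applying",
                    "Analyzing", "Evaluating", "Creating"]

    def cognitive_profile(labeled_items):
        # labeled_items: (question, bloom_level) pairs, with levels assigned
        # upstream by an annotator or an LLM judge (assumed, not shown here).
        counts = Counter(level for _, level in labeled_items)
        total = sum(counts.values()) or 1
        return {level: counts[level] / total for level in BLOOM_LEVELS}

A profile heavily skewed toward the first two levels is exactly the imbalance the paper reports for popular benchmarks.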