Guanglei Yue


2026

Quotation recommendation enriches writing by suggesting quotations that fit a given context, but prior systems largely focus on topical relevance and overlook what makes quotes memorable. Based on a user study, we find that preferred quotations are often unexpected yet rational, motivating the goal of selecting quotes that are contextually novel while semantically coherent. We propose NovelQR, which (1) uses a generative label agent to map quotations and contexts into multi-dimensional deep-meaning labels for label-enhanced retrieval, and (2) reranks candidates with a token-level novelty estimator that mitigates auto-regressive continuation bias. Experiments on bilingual datasets across diverse domains show that NovelQR is preferred by human judges and improves overall recommendation quality over strong baselines, while achieving competitive novelty estimation.

2025

Despite the rapid development of large language models (LLMs), existing benchmark datasets often focus on low-level cognitive tasks, such as factual recall and basic comprehension, while providing limited coverage of higher-level reasoning skills, including analysis, evaluation, and creation. In this work, we systematically assess the cognitive depth of popular LLM benchmarks using Bloom’s Taxonomy to evaluate both the cognitive and knowledge dimensions. Our analysis reveals a pronounced imbalance: most datasets concentrate on “Remembering” and “Understanding”, with metacognitive and creative reasoning largely underrepresented. We also find that incorporating higher-level cognitive instructions into the current instruction fine-tuning process improves model performance. These findings highlight the importance of future benchmarks incorporating metacognitive evaluations to more accurately assess and enhance model performance.

Data efficiency is crucial in domain-specific continual pre-training (CPT) of large language models (LLMs), especially under resource constraints. Aiming for “small data, big impact,” this work addresses the limitations of existing domain-specific data selection strategies, which often rely on scarce labeled data or computationally expensive LLMs. We introduce CDF Sampling with Grammatical Complexity (CDF-GC), an annotation-independent, efficient, and interpretable data selection framework for CPT. Our approach comprehensively evaluates grammatical complexity using lexical diversity and syntactic complexity, and employs a cumulative distribution function (CDF)-based sampling strategy to balance complexity and diversity. To validate the effectiveness of CDF-GC, we conducted experiments on a financial dataset. The results demonstrate that CDF-GC significantly outperforms baselines, achieving a 2.0% improvement in financial QA at the same selection ratio and even surpassing full-data training by 1.7% using only 20% of the data.
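
The abstract above does not spell out how grammatical complexity is scored or how the CDF drives selection, so the following is only a minimal sketch of one possible reading, not the authors' implementation. It assumes type-token ratio as the lexical diversity measure, mean sentence length as a crude stand-in for syntactic complexity, an equal-weight combination of the two, and selection of documents at evenly spaced quantiles of the empirical CDF of the combined score. All function names, the `max_len` cap, and the weights are hypothetical.

```python
import re


def lexical_diversity(text):
    """Type-token ratio: unique word forms divided by total words (assumed measure)."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mean_sentence_length(text):
    """Average words per sentence, a rough proxy for syntactic complexity (assumed)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


def complexity_score(text, max_len=40.0):
    """Equal-weight mix of the two signals; weights and max_len are illustrative."""
    syntactic = min(mean_sentence_length(text) / max_len, 1.0)
    return 0.5 * lexical_diversity(text) + 0.5 * syntactic


def cdf_sample(texts, ratio):
    """Pick about ratio * len(texts) documents at evenly spaced quantiles
    of the empirical CDF of the complexity score.

    Returns the selected indices into `texts`, in ascending order.
    """
    n = len(texts)
    if n == 0:
        return []
    k = min(n, max(1, round(n * ratio)))
    # Sorting by score makes rank r correspond to empirical CDF value (r + 1) / n.
    order = sorted(range(n), key=lambda i: complexity_score(texts[i]))
    chosen = []
    for j in range(k):
        target = (j + 0.5) / k              # midpoint of the j-th quantile band
        rank = min(n - 1, int(target * n))  # rank whose CDF value covers the target
        chosen.append(order[rank])
    return sorted(chosen)
```

Sampling at evenly spaced quantiles spreads the selection across the whole range of complexity scores instead of taking only the top-scoring documents, which is one simple way to trade off complexity against diversity. For example, `cdf_sample(corpus, 0.2)` on a ten-document `corpus` returns two indices, one from the lower half and one from the upper half of the score distribution.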