Yihan Chen


2026

The growing use of large language models (LLMs) in peer review threatens scholarly integrity. Recent conference policies allow AI tools for language polishing but prohibit their use for generating substantive content. However, existing detectors rely mainly on stylistic cues, which makes it difficult to distinguish surface-level language refinement from genuine content generation. To address this, we advocate a content-based detection paradigm and introduce CoCoNUTS, a comprehensive benchmark containing 315,535 reviews covering leading AI conferences and six human-AI collaboration modes. Our evaluation shows that current detectors struggle to handle these nuanced settings. Consequently, we propose CoCoDet, an AI review detector designed to identify substantive AI generation. Experiments demonstrate that CoCoDet achieves a macro F1-score of 98.24%. Crucially, on permissible machine-polished reviews, it maintains a low false positive rate of 3.89%, substantially outperforming the strongest baseline (7.84%). Applying CoCoDet to real-world reviews reveals an escalating trend of substantive AI generation. Our work exposes the inadequacy of current detectors, underscoring the importance of domain-specific solutions.
We deploy large language models (LLMs) as business development (BD) agents for persuasive price negotiation in online travel agencies (OTAs). The agent must follow a multi-stage Standard Operating Procedure (SOP) and strict guardrails (no over-promising and no hallucinations), while remaining human-like and effective over long, multi-turn dialogues. We propose Reward-Enhanced Policy Optimization (REPO), a reinforcement learning post-training method that combines heterogeneous rewards: a preference-trained reward model (RM), an LLM-as-a-judge (RJ) for nuanced behaviors (e.g., emotional value and SOP compliance), and rule-based reward functions (RF) (mainly regex-based) for deterministic checks on numerics, formatting, and guardrails. In expert consensus evaluation (three human experts; 30 online conversations and 45 curated bad cases), REPO improves average dialogue rating to 4.63 (+0.33 over GRPO) and raises the share of conversations with at least one excellent response to 66.67% (+23.34 pp over GRPO), while achieving a 93.33% bad-case fix rate with 75.56% clean fixes. In a production A/B test on 9,653 real customer conversations (vs. an intent-driven dialogue system), REPO improves response rate by +12.14 pp and task success rate by +5.94 pp (p<0.001).

2024

Philology, the study of ancient manuscripts, demands years of professional training in extensive knowledge memorization and manual textual retrieval. Although these requirements align closely with the strengths of recent Large Language Models (LLMs), the scarcity of high-quality, specialized training data has hindered direct applications. To bridge this gap, we curated PhiloCorpus-ZH, a rich collection of ancient Chinese texts spanning a millennium and covering 30 diverse topics, including firsthand folk copies. This corpus facilitated the development of PhiloGPT, the first LLM tailored for discovering ancient Chinese manuscripts. To effectively tackle complex philological tasks such as restoration, attribution, and linguistic analysis, we introduced the PhiloCoP framework. Modeled on the analytical patterns of philologists, PhiloCoP enhances the LLM’s handling of historical linguistic peculiarities such as phonetic loans, polysemy, and syntactic inversions. We further integrated these tasks into the PhiloBenchmark, establishing a new standard for evaluating ancient Chinese LLMs on philology tasks. Deploying PhiloGPT in practical scenarios has enabled Dunhuang specialists to resolve philology tasks, such as identifying duplication of copied text and assisting archaeologists with text completion, demonstrating its potential in real-world applications.