Yihan Ma

2026

PeerCheck: Enhancing LLM-Generated Academic Reviews Towards Human-Level Quality
Zeyuan Chen | Ziqing Yang | Yihan Ma | Michael Backes | Yang Zhang
Findings of the Association for Computational Linguistics: ACL 2026

As academic submissions grow, the traditional peer review process struggles to keep up, raising concerns about quality and fairness. A trend of using large language models (LLMs) for assistance has emerged. In this work, we take a critical step toward improving the quality of LLM-generated reviews. We propose the PeerCheck framework, which investigates LLM-human review differences (RQ1) and explores methods to increase LLM-human similarity (RQ2). We first analyzed the human-written reviews with reviews generated by GPT-4o, Claude-3.7-Sonnet, and DeepSeek-V3 and found that LLMs and humans focus on different terms, e.g., LLMs prioritize theory while humans emphasize methodology and experiments. We further adopt prompt engineering, such as Chain-of-Thought (CoT), and utilize retrieval-augmented generation (RAG) to enhance the LLM-generated reviews towards human-level quality. We find CoT significantly improves the human similarity of LLM reviews, while we also discover an unexpected “RAG paradox,” i.e., experiments with RAG produce different results for various LLMs and, in some cases, even reduce review quality. Our comprehensive analysis of LLM-generated academic reviews illustrates both possibilities and limitations, contributing to a more effective, human-aligned review system.

2024

The Death and Life of Great Prompts: Analyzing the Evolution of LLM Prompts from the Structural Perspective
Yihan Ma | Xinyue Shen | Yixin Wu | Boyang Zhang | Michael Backes | Yang Zhang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Effective utilization of large language models (LLMs), such as ChatGPT, relies on the quality of input prompts. This paper explores prompt engineering, specifically focusing on the disparity between experimentally designed prompts and real-world “in-the-wild” prompts. We analyze 10,538 in-the-wild prompts collected from various platforms and develop a framework that decomposes the prompts into eight key components. Our analysis shows that Role and Requirement are the most prevalent two components. Roles specified in the prompts, along with their capabilities, have become increasingly varied over time, signifying a broader range of application scenarios for LLMs. However, from the response of GPT-4, there is a marginal improvement with a specified role, whereas leveraging less prevalent components such as Capability and Demonstration can result in a more satisfying response. Overall, our work sheds light on the essential components of in-the-wild prompts and the effectiveness of these components on the broader landscape of LLM prompt engineering, providing valuable guidelines for the LLM community to optimize high-quality prompts.

Co-authors

Ziqing Yang 1

Boyang Zhang 1

Venues

EMNLP 1
Findings 1
