Ji Yong Cho


2025

Mind the Blind Spots: A Focus-Level Evaluation Framework for LLM Reviews
Hyungyu Shin | Jingyu Tang | Yoonjoo Lee | Nayoung Kim | Hyunseung Lim | Ji Yong Cho | Hwajung Hong | Moontae Lee | Juho Kim
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Peer review underpins scientific progress, but it is increasingly strained by reviewer shortages and growing workloads. Large Language Models (LLMs) can now draft reviews automatically, but determining whether LLM-generated reviews are trustworthy requires systematic evaluation. Researchers have evaluated LLM reviews at either the surface level (e.g., BLEU and ROUGE) or the content level (e.g., specificity and factual accuracy). Yet it remains uncertain whether LLM-generated reviews attend to the same critical facets that human experts weigh: the strengths and weaknesses that ultimately drive an accept-or-reject decision. We introduce a focus-level evaluation framework that operationalizes focus as a normalized distribution of attention across predefined facets in paper reviews. Building on this framework, we developed an automatic focus-level evaluation pipeline with two sets of facets: target (e.g., problem, method, and experiment) and aspect (e.g., validity, clarity, and novelty), leveraging 676 paper reviews from OpenReview that contain 3,657 strengths and weaknesses identified by human experts. Comparing the focus distributions of LLMs and human experts showed that off-the-shelf LLMs consistently focus more heavily on technical validity while significantly overlooking novelty assessment when criticizing papers. Dataset: https://figshare.com/s/d5adf26c802527dd0f62
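
The idea of focus as a normalized distribution over facets can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the toy labels, and the use of total variation distance to compare distributions are all assumptions made for illustration, and the abstract does not say which comparison measure the paper uses.

```python
from collections import Counter

# Aspect facets named in the abstract; the paper's full facet set may be larger.
ASPECTS = ["validity", "clarity", "novelty"]


def focus_distribution(labels, facets):
    """Return the share of review points assigned to each facet.

    `labels` holds one facet label per strength or weakness. How labels are
    assigned is outside this sketch (hypothetical upstream annotation step).
    """
    counts = Counter(labels)
    total = sum(counts[f] for f in facets)
    if total == 0:
        return {f: 0.0 for f in facets}
    return {f: counts[f] / total for f in facets}


def total_variation(p, q):
    """Half the L1 distance between two distributions over the same facets.

    Assumption: chosen here only as a simple way to quantify a focus gap.
    """
    return 0.5 * sum(abs(p[f] - q[f]) for f in p)


# Toy labels invented for illustration; they are not data from the paper.
human = focus_distribution(["validity", "novelty", "novelty", "clarity"], ASPECTS)
llm = focus_distribution(["validity", "validity", "validity", "clarity"], ASPECTS)
gap = total_variation(human, llm)
```

In this toy case the LLM-side distribution places most of its mass on validity and none on novelty, which is the kind of skew the abstract describes; real values would depend on the annotated reviews.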

Learning to Explore and Select for Coverage-Conditioned Retrieval-Augmented Generation
Takyoung Kim | Kyungjae Lee | Young Rok Jang | Ji Yong Cho | Gangwoo Kim | Minseok Cho | Moontae Lee
Findings of the Association for Computational Linguistics: NAACL 2025

Interactions with large language models (LLMs) often yield long and detailed responses, leveraging both parametric knowledge and retrieval-augmented generation (RAG). While these responses can provide rich insights, they often include redundant or less engaging content not aligned with user interests. This issue becomes apparent when users specify particular subtopics to include or exclude – termed **coverage-conditioned (C2)** queries – as LLMs often struggle to provide tailored responses. To address this challenge, we investigate the role of query outlines, sequences of subqueries designed to guide LLMs in generating responses that meet specific user requirements. To systematically create and evaluate these outlines, we introduce **QTree**, a dataset of 10K hierarchical sets of information-seeking subqueries that define structured boundaries for outline creation and evaluation in C2 scenarios. Additionally, we develop **QPlanner**, a 7B language model trained to generate customized outlines within the boundaries of QTree. We evaluate the effectiveness of the generated outlines through automatic and human judgements, focusing on their impact within RAG systems. Experimental results demonstrate that QPlanner, especially when trained with alignment techniques like DPO, generates higher-quality outlines that better fulfill diverse user needs.

The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
Seungone Kim | Juyoung Suk | Ji Yong Cho | Shayne Longpre | Chaeeun Kim | Dongkeun Yoon | Guijin Son | Yejin Cho | Sheikh Shafayat | Jinheon Baek | Sue Hyun Park | Hyeonbin Hwang | Jinkyung Jo | Hyowon Cho | Haebin Shin | Seongyun Lee | Hanseok Oh | Noah Lee | Namgyu Ho | Se June Joo | Miyoung Ko | Yoonjoo Lee | Hyungjoo Chae | Jamin Shin | Joel Jang | Seonghyeon Ye | Bill Yuchen Lin | Sean Welleck | Graham Neubig | Moontae Lee | Kyungjae Lee | Minjoon Seo
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

As language models (LMs) become capable of handling a wide range of tasks, their evaluation is becoming as challenging as their development. Most generation benchmarks currently assess LMs using abstract evaluation criteria, such as helpfulness and harmlessness, which often lack the flexibility and granularity of human assessment. Additionally, these benchmarks tend to focus disproportionately on specific capabilities such as instruction following, leading to coverage bias. To overcome these limitations, we introduce the BiGGen Bench, a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation. We apply this benchmark to assess 100 frontier LMs using five evaluator LMs. Our code, data, and evaluation results are all publicly available at https://github.com/prometheus-eval/prometheus-eval.