2025
Mind the Blind Spots: A Focus-Level Evaluation Framework for LLM Reviews
Hyungyu Shin | Jingyu Tang | Yoonjoo Lee | Nayoung Kim | Hyunseung Lim | Ji Yong Cho | Hwajung Hong | Moontae Lee | Juho Kim
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Peer review underpins scientific progress, but it is increasingly strained by reviewer shortages and growing workloads. Large Language Models (LLMs) can now draft reviews automatically, but determining whether LLM-generated reviews are trustworthy requires systematic evaluation. Researchers have evaluated LLM reviews at either the surface level (e.g., BLEU and ROUGE) or the content level (e.g., specificity and factual accuracy). Yet it remains uncertain whether LLM-generated reviews attend to the same critical facets that human experts weigh—the strengths and weaknesses that ultimately drive an accept-or-reject decision. We introduce a focus-level evaluation framework that operationalizes focus as a normalized distribution of attention across predefined facets in paper reviews. Building on this framework, we developed an automatic focus-level evaluation pipeline that uses two sets of facets: target (e.g., problem, method, and experiment) and aspect (e.g., validity, clarity, and novelty), leveraging 676 paper reviews from OpenReview that comprise 3,657 strengths and weaknesses identified by human experts. Comparing the focus distributions of LLMs and human experts showed that off-the-shelf LLMs consistently skew toward examining technical validity while significantly overlooking novelty assessment when criticizing papers. Dataset: https://figshare.com/s/d5adf26c802527dd0f62
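As a rough illustration of the focus-level idea in this abstract, the sketch below normalizes counts of facet-labeled review points into a focus distribution and compares two distributions. The facet set, the toy counts, and the use of Jensen-Shannon divergence are assumptions made only for illustration; the actual pipeline and facet definitions are those described in the paper and dataset.

```python
# Minimal sketch of a "focus distribution" over aspect facets (toy data).
# The aspect list, counts, and JS divergence are illustrative assumptions,
# not the paper's actual pipeline or comparison metric.
from collections import Counter
import math

ASPECTS = ["validity", "clarity", "novelty", "significance"]

def focus_distribution(facet_labels):
    """Normalize facet counts into a focus distribution over ASPECTS."""
    counts = Counter(facet_labels)
    total = sum(counts[a] for a in ASPECTS) or 1
    return [counts[a] / total for a in ASPECTS]

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(x, y):
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy comparison: an LLM review leaning heavily on validity vs. a human review.
llm_focus = focus_distribution(["validity"] * 6 + ["clarity"] * 2)
human_focus = focus_distribution(["validity"] * 3 + ["novelty"] * 3 + ["clarity"] * 2)
print("LLM focus:", llm_focus)
print("Human focus:", human_focus)
print("JS divergence:", round(js_divergence(llm_focus, human_focus), 3))
```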
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
Seungone Kim | Juyoung Suk | Ji Yong Cho | Shayne Longpre | Chaeeun Kim | Dongkeun Yoon | Guijin Son | Yejin Cho | Sheikh Shafayat | Jinheon Baek | Sue Hyun Park | Hyeonbin Hwang | Jinkyung Jo | Hyowon Cho | Haebin Shin | Seongyun Lee | Hanseok Oh | Noah Lee | Namgyu Ho | Se June Joo | Miyoung Ko | Yoonjoo Lee | Hyungjoo Chae | Jamin Shin | Joel Jang | Seonghyeon Ye | Bill Yuchen Lin | Sean Welleck | Graham Neubig | Moontae Lee | Kyungjae Lee | Minjoon Seo
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
As language models (LMs) become capable of handling a wide range of tasks, their evaluation is becoming as challenging as their development. Most generation benchmarks currently assess LMs using abstract evaluation criteria, like helpfulness and harmlessness, which often lack the flexibility and granularity of human assessment. Additionally, these benchmarks tend to focus disproportionately on specific capabilities such as instruction following, leading to coverage bias. To overcome these limitations, we introduce the BiGGen Bench, a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation. We apply this benchmark to assess 100 frontier LMs using five evaluator LMs. Our code, data, and evaluation results are all publicly available at https://github.com/prometheus-eval/prometheus-eval.
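As a hedged sketch of what "instance-specific evaluation criteria" can look like in practice, the snippet below pairs each benchmark instance with its own rubric and assembles an evaluator-LM prompt. The field names and prompt wording are hypothetical, not the actual BiGGen Bench schema; the real data format and evaluation code live in the linked repository.

```python
# Hypothetical data structure for an instance that carries its own rubric.
# Field names and the judge-prompt template are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class BenchmarkInstance:
    capability: str        # one of the nine capabilities (e.g., reasoning)
    task: str              # one of the 77 tasks
    prompt: str            # input given to the evaluated LM
    scoring_rubric: str    # criterion written for this specific instance
    reference_answer: str  # reference used by the evaluator LM

def build_judge_prompt(instance: BenchmarkInstance, response: str) -> str:
    """Assemble an evaluator-LM prompt that scores a response against the
    instance's own rubric rather than a generic criterion."""
    return (
        f"Task input:\n{instance.prompt}\n\n"
        f"Response to evaluate:\n{response}\n\n"
        f"Reference answer:\n{instance.reference_answer}\n\n"
        f"Score rubric (specific to this instance):\n{instance.scoring_rubric}\n\n"
        "Write feedback, then give a score from 1 to 5."
    )
```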
2024
ArxivDIGESTables: Synthesizing Scientific Literature into Tables using Language Models
Benjamin Newman | Yoonjoo Lee | Aakanksha Naik | Pao Siangliulue | Raymond Fok | Juho Kim | Daniel S Weld | Joseph Chee Chang | Kyle Lo
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
When conducting literature reviews, scientists often create literature review tables—tables whose rows are publications and whose columns constitute a schema, a set of aspects used to compare and contrast the papers. Can we automatically generate these tables using language models (LMs)? In this work, we introduce a framework that leverages LMs to perform this task by decomposing it into separate schema and value generation steps. To enable experimentation, we address two main challenges: First, we overcome a lack of high-quality datasets to benchmark table generation by curating and releasing arxivDIGESTables, a new dataset of 2,228 literature review tables extracted from ArXiv papers that synthesize a total of 7,542 research papers. Second, to support scalable evaluation of model generations against human-authored reference tables, we develop DecontextEval, an automatic evaluation method that aligns elements of tables with the same underlying aspects despite differing surface forms. Given these tools, we evaluate LMs’ abilities to reconstruct reference tables, finding this task benefits from additional context to ground the generation (e.g. table captions, in-text references). Finally, through a human evaluation study we find that even when LMs fail to fully reconstruct a reference table, their generated novel aspects can still be useful.
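A minimal sketch of the schema/value decomposition mentioned in the abstract appears below. The prompts and the generic call_lm() helper are hypothetical placeholders standing in for whatever LM interface is used; the paper's actual prompting setup and the DecontextEval metric are not reproduced here.

```python
# Two-step decomposition (sketch): generate a schema of comparison aspects,
# then fill in one value per (paper, aspect) cell. call_lm() and the prompt
# wording are hypothetical placeholders, not the paper's code.
from typing import Callable

def generate_table(papers: list[str], caption: str,
                   call_lm: Callable[[str], str]) -> dict[str, dict[str, str]]:
    # Step 1: schema generation -- propose comparison aspects (columns)
    schema_prompt = (
        f"Table caption: {caption}\n"
        "Papers:\n" + "\n".join(papers) + "\n"
        "List the aspects (columns) a literature review table should use to "
        "compare these papers, one per line."
    )
    columns = [c.strip() for c in call_lm(schema_prompt).splitlines() if c.strip()]

    # Step 2: value generation -- fill one cell per (paper, column)
    table = {}
    for paper in papers:
        table[paper] = {
            col: call_lm(f"For the paper '{paper}', summarize its '{col}'.")
            for col in columns
        }
    return table
```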
Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)
Tirthankar Ghosal | Amanpreet Singh | Anita Waard | Philipp Mayr | Aakanksha Naik | Orion Weller | Yoonjoo Lee | Shannon Shen | Yanxia Qin
Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)
Overview of the Fourth Workshop on Scholarly Document Processing
Tirthankar Ghosal | Amanpreet Singh | Anita De Waard | Philipp Mayr | Aakanksha Naik | Orion Weller | Yoonjoo Lee | Zejiang Shen | Yanxia Qin
Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)
The workshop on Scholarly Document Processing (SDP) started in 2020 to accelerate research, inform policy, and educate the public on natural language processing for scientific text. The fourth iteration of the workshop, SDP24, was held at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL24) as a hybrid event. The SDP workshop saw a great increase in interest, with 57 submissions, of which 28 were accepted. The program consisted of a research track, four invited talks, and two shared tasks: 1) DAGPap24: Detecting automatically generated scientific papers and 2) Context24: Multimodal Evidence and Grounding Context Identification for Scientific Claims. The program was geared towards NLP, information extraction, information retrieval, and data mining for scholarly documents, with an emphasis on identifying and providing solutions to open challenges.
2022
Interactive Children’s Story Rewriting Through Parent-Children Interaction
Yoonjoo Lee | Tae Soo Kim | Minsuk Chang | Juho Kim
Proceedings of the First Workshop on Intelligent and Interactive Writing Assistants (In2Writing 2022)
Storytelling in early childhood provides significant benefits in language and literacy development, relationship building, and entertainment. To maximize these benefits, it is important to empower children with more agency. Interactive story rewriting through parent-children interaction can boost children’s agency and help build the relationship between parent and child as they collaboratively create changes to an original story. However, for children with limited proficiency in reading and writing, parents must carry out multiple tasks to guide the rewriting process, which can incur a high cognitive load. In this work, we introduce an interface design that aims to support children and parents to rewrite stories together with the help of AI techniques. We describe three design goals determined by a review of prior literature in interactive storytelling and existing educational activities. We also propose a preliminary prompt-based pipeline that uses GPT-3 to realize the design goals and enable the interface.
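As a hedged illustration of what a prompt-based rewriting step could look like, the sketch below composes a rewriting prompt from a story passage and a child's proposed change. The prompt wording and the complete() placeholder are assumptions for illustration; the paper's actual GPT-3 pipeline and design goals are described in the paper itself.

```python
# Hypothetical sketch of one prompt-based rewriting step. GPT-3 is accessed
# through a generic complete() callable rather than a specific API, and the
# prompt text is illustrative, not the paper's actual prompt.
def build_rewrite_prompt(original_passage: str, child_change: str) -> str:
    """Compose a prompt asking the model to rewrite a story passage so that it
    incorporates the change a child proposed, in child-friendly language."""
    return (
        "Rewrite the following children's story passage so that it reflects "
        "the requested change, using simple, child-friendly language.\n\n"
        f"Passage: {original_passage}\n"
        f"Requested change: {child_change}\n"
        "Rewritten passage:"
    )

def rewrite_passage(original_passage: str, child_change: str, complete) -> str:
    """complete is any text-completion callable (e.g., a GPT-3 wrapper)."""
    return complete(build_rewrite_prompt(original_passage, child_change))
```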