Jongwon Lee
2026
Evaluating the Pre-Consultation Ability of LLMs using Diagnostic Guidelines
Jean Seo | Gibaeg Kim | Kihun Shin | Seungseop Lim | Hyunkyung Lee | Wooseok Han | Jongwon Lee | Eunho Yang
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)
We introduce EPAG, a benchmark dataset and framework for evaluating the pre-consultation ability of LLMs using diagnostic guidelines. LLMs are evaluated directly by comparing the elicited HPI (History of Present Illness) against diagnostic guidelines, and indirectly through disease diagnosis. In our experiments, we observe that small open-source models fine-tuned on a well-curated, task-specific dataset can outperform frontier LLMs in pre-consultation. Additionally, we find that a larger amount of HPI does not necessarily lead to better diagnostic performance. Further experiments reveal that the language of pre-consultation influences the characteristics of the dialogue. By open-sourcing our dataset and evaluation pipeline at https://github.com/seemdog/EPAG, we aim to contribute to the evaluation and further development of LLM applications in real-world clinical settings.
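As a rough illustration of the direct evaluation described above, the sketch below scores an HPI against a diagnostic-guideline checklist by the fraction of guideline items it covers. The guideline items, keywords, and matching heuristic are all illustrative assumptions, not EPAG's actual metric or data.

```python
# Minimal sketch of a guideline-coverage score for an HPI (hypothetical,
# not EPAG's actual metric): an HPI "covers" a guideline item if it
# mentions any of that item's keywords.

def guideline_coverage(hpi: str, guideline_items: dict[str, list[str]]) -> float:
    """Fraction of guideline items whose keywords appear in the HPI."""
    hpi_lower = hpi.lower()
    covered = sum(
        any(keyword in hpi_lower for keyword in keywords)
        for keywords in guideline_items.values()
    )
    return covered / len(guideline_items)

# Illustrative guideline items for a headache work-up (made-up keywords).
guideline = {
    "onset": ["onset", "started", "began"],
    "location": ["temple", "forehead", "occipital", "one side"],
    "severity": ["severity", "scale", "/10"],
    "associated_symptoms": ["nausea", "photophobia", "aura"],
}

hpi = (
    "Headache began two days ago, localized to the left temple, "
    "rated 7/10, accompanied by nausea."
)

print(f"Coverage: {guideline_coverage(hpi, guideline):.2f}")  # -> 1.00
```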
FINEST: Improving LLM Responses to Sensitive Topics Through Fine-Grained Evaluation
Juhyun Oh | Nayeon Lee | Chani Jung | Jiho Jin | Junho Myung | Jongwon Lee | Taieui Song | Alice Oh
Findings of the Association for Computational Linguistics: EACL 2026
Large Language Models (LLMs) often default to overly cautious and vague responses when handling sensitive topics, sacrificing helpfulness for safety. Existing evaluation frameworks lack systematic methods to identify and address specific weaknesses in responses to sensitive topics, making it difficult to improve both safety and helpfulness simultaneously. To address this, we introduce FINEST, a FINE-grained response evaluation taxonomy for Sensitive Topics, which breaks down helpfulness and harmlessness into errors across three main categories: Content, Logic, and Appropriateness. Experiments on a Korean dataset of sensitive questions demonstrate that our score- and error-based improvement pipeline, guided by FINEST, significantly improves model responses across all three categories, outperforming refinement without guidance. Notably, score-based improvement, which provides category-specific scores and justifications, yields the most significant gains, reducing the error sentence ratio for Appropriateness by up to 33.09%. This work lays the foundation for more explainable and comprehensive evaluation and improvement of LLM responses to sensitive questions.
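To make the reported metric concrete, here is a minimal sketch of how a per-category error sentence ratio could be computed from per-sentence error tags. The function, record layout, and example tags are assumptions for illustration, not the paper's released code.

```python
# Minimal sketch of an "error sentence ratio" computation (hypothetical;
# only the three top-level category names follow FINEST).

from collections import Counter

CATEGORIES = ("Content", "Logic", "Appropriateness")

def error_sentence_ratio(annotations: list[set[str]]) -> dict[str, float]:
    """annotations[i] is the set of error categories tagged on sentence i.
    Returns, per category, the fraction of sentences carrying that error."""
    counts = Counter(cat for tags in annotations for cat in tags)
    total = len(annotations)
    return {cat: counts[cat] / total for cat in CATEGORIES}

# A four-sentence response with illustrative error tags.
response_annotations = [
    set(),                # sentence 1: no errors
    {"Appropriateness"},  # sentence 2: e.g., overly evasive hedging
    {"Content", "Logic"}, # sentence 3: factual and reasoning errors
    {"Appropriateness"},  # sentence 4
]

print(error_sentence_ratio(response_annotations))
# {'Content': 0.25, 'Logic': 0.25, 'Appropriateness': 0.5}
```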
2022
You Only Need One Model for Open-domain Question Answering
Haejun Lee | Akhil Kedia | Jongwon Lee | Ashwin Paranjape | Christopher Manning | Kyoung-Gu Woo
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Recent approaches to open-domain question answering consult an external knowledge base using a retriever model, optionally rerank passages with a separate reranker model, and generate an answer using yet another reader model. Despite performing related tasks, these models have separate parameters and are only weakly coupled during training. We propose casting the retriever and the reranker as internal passage-wise attention mechanisms applied sequentially within the transformer architecture and feeding the computed representations to the reader, with the hidden representations progressively refined at each stage. This allows us to use a single question answering model trained end-to-end, which makes more efficient use of model capacity and also leads to better gradient flow. We present a pre-training method to effectively train this architecture and evaluate our model on the Natural Questions and TriviaQA open datasets. For a fixed parameter budget, our model outperforms the previous state-of-the-art model by 1.0 and 0.7 exact-match points, respectively.
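A minimal sketch of the core idea, attention computed over passages rather than tokens so that retrieval lives inside the same differentiable network, is given below. The shapes, scoring rule, and pooling step are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative sketch of passage-wise attention inside a single model:
# score passages against the question from their hidden representations,
# then softmax over the passage axis (rather than the token axis).

import numpy as np

rng = np.random.default_rng(0)
d_model, n_passages = 64, 5

# Stand-ins for hidden representations produced by earlier transformer
# layers: one vector for the question, one per passage.
question_repr = rng.standard_normal(d_model)
passage_reprs = rng.standard_normal((n_passages, d_model))

# Passage-wise attention: scaled dot-product relevance, softmaxed
# over passages so retrieval is a differentiable weighting.
scores = passage_reprs @ question_repr / np.sqrt(d_model)
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# A reader stage consuming relevance-weighted passage representations
# lets retrieval gradients flow through the same network end-to-end.
pooled = weights @ passage_reprs
print("passage weights:", np.round(weights, 3))
print("pooled representation shape:", pooled.shape)
```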
KOLD: Korean Offensive Language Dataset
Younghoon Jeong | Juhyun Oh | Jongwon Lee | Jaimeen Ahn | Jihyung Moon | Sungjoon Park | Alice Oh
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Recent directions for offensive language detection include hierarchical modeling, identifying the type and target of offensive language, and interpretability through offensive span annotation and prediction. These improvements are focused on English and do not transfer well to other languages because of cultural and linguistic differences. In this paper, we present the Korean Offensive Language Dataset (KOLD), comprising 40,429 comments annotated hierarchically with the type and target of offensive language, accompanied by annotations of the corresponding text spans. We collect the comments from NAVER news and YouTube and provide the titles of the articles and videos as context information during annotation. We use these annotated comments as training data for Korean BERT and RoBERTa models and find that they are effective at offensiveness detection, target classification, and target span detection, while leaving room for improvement on target group classification and offensive span detection. We discover that the target group distribution differs drastically from existing English datasets, and observe that providing the context information improves model performance on offensiveness detection (+0.3), target classification (+1.5), and target group classification (+13.1). We publicly release the dataset and baseline models.
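To illustrate the hierarchical annotation scheme described above, the sketch below models a KOLD-style record with type, target, and span annotations. The field names and values are assumptions for illustration and do not reproduce the released dataset's exact schema.

```python
# Minimal sketch of a KOLD-style hierarchical annotation record
# (hypothetical field names, not the released dataset's schema).

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class KoldExample:
    context: str                        # article/video title given as context
    comment: str
    offensive: bool                     # level 1: offensiveness
    off_type: Optional[str] = None      # level 2: e.g. "targeted" / "untargeted"
    target: Optional[str] = None        # level 3: e.g. "group" / "individual"
    target_group: Optional[str] = None  # e.g. "gender", "religion", ...
    offensive_spans: list[tuple[int, int]] = field(default_factory=list)
    target_spans: list[tuple[int, int]] = field(default_factory=list)

example = KoldExample(
    context="News article title shown to annotators",
    comment="An offensive comment targeting a group...",
    offensive=True,
    off_type="targeted",
    target="group",
    target_group="gender",
    offensive_spans=[(3, 12)],
)
print(example.offensive, example.target_group)  # True gender
```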