Seong-Jin Park


2025

Measuring and Mitigating Media Outlet Name Bias in Large Language Models
Seong-Jin Park | Kang-Min Kim
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) have achieved remarkable performance across diverse natural language processing tasks, but concerns persist regarding their potential political biases. While prior research has extensively explored political biases in LLMs’ text generation and perception, limited attention has been devoted to biases associated with media outlet names. In this study, we systematically investigate the presence of media outlet name biases in LLMs and evaluate their impact on downstream tasks, such as political bias prediction and news summarization. Our findings demonstrate that LLMs consistently exhibit biases toward the known political leanings of media outlets, with variations across model families and scales. We propose a novel metric to quantify media outlet name biases in LLMs and leverage this metric to develop an automated prompt optimization framework. Our framework effectively mitigates media outlet name biases, offering a scalable approach to enhancing the fairness of LLMs in news-related applications.
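
The metric itself is not specified in the abstract; purely as an illustration, the Python sketch below scores outlet-name bias as the average KL shift in a model's predicted leaning distribution when the same article is attributed to different outlets. `query_leaning_distribution` and the unsmoothed KL are hypothetical stand-ins, not the paper's proposed metric.

```python
# Hypothetical sketch: score outlet-name bias as the divergence between
# leaning predictions for the same article under different outlet
# attributions. `query_leaning_distribution` stands in for an LLM call
# returning P(left), P(center), P(right); this is NOT the paper's metric.
from math import log

LEANINGS = ("left", "center", "right")

def query_leaning_distribution(article: str, outlet: str | None) -> dict[str, float]:
    """Placeholder: ask an LLM for the leaning distribution of `article`,
    optionally prefixed with an outlet attribution."""
    raise NotImplementedError("wire up your LLM client here")

def kl_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    # Assumes strictly positive q; a real metric would smooth the estimates.
    return sum(p[k] * log(p[k] / q[k]) for k in LEANINGS if p[k] > 0)

def outlet_name_bias(article: str, outlets: list[str]) -> float:
    """Mean shift from the no-attribution baseline across outlets."""
    baseline = query_leaning_distribution(article, outlet=None)
    shifts = [kl_divergence(query_leaning_distribution(article, o), baseline)
              for o in outlets]
    return sum(shifts) / len(shifts)
```

Under this reading, a score near zero would mean naming the outlet barely moves the model's judgment, while larger scores flag name-driven bias.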

Leveraging Knowledge Graph-Enhanced LLMs for Context-Aware Medical Consultation
Su-Hyeong Park | Ho-Beom Kim | Seong-Jin Park | Dinara Aliyeva | Kang-Min Kim
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Recent advancements in large language models have significantly influenced the field of online medical consultation. However, critical challenges remain, such as the generation of hallucinated information and the integration of up-to-date medical knowledge. To address these issues, we propose Informatics Llama (ILlama), a novel framework that combines retrieval-augmented generation with a structured medical knowledge graph. ILlama incorporates relevant medical knowledge by transforming subgraphs of the knowledge graph into text for retrieval-augmented generation. By generating these subgraphs in advance, focusing specifically on diseases and symptoms, ILlama enhances the accuracy and relevance of its medical reasoning. The framework effectively incorporates causal relationships between symptoms and diseases and delivers context-aware consultations aligned with user queries. Experimental results on two medical consultation datasets demonstrate that ILlama outperforms strong baselines, achieving a semantic similarity F1-score of 0.884 against ground-truth consultation answers. Furthermore, qualitative analysis of ILlama’s responses reveals significant improvements in hallucination reduction and clinical usefulness. These results suggest that ILlama has strong potential as a reliable tool for real-world medical consultation environments.
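
The subgraph-to-text step is described only at a high level above; the following sketch illustrates one plausible realization, verbalizing disease-symptom triples into passages and retrieving them with a naive lexical scorer. The triple schema, helper names, and retrieval method are illustrative assumptions, not ILlama's actual implementation.

```python
# Hypothetical sketch: verbalize disease-symptom subgraphs into text
# passages for retrieval-augmented generation. The triple schema and the
# lexical retriever are illustrative assumptions, not ILlama's code.

# (head, relation, tail) triples from a medical knowledge graph
TRIPLES = [
    ("influenza", "has_symptom", "fever"),
    ("influenza", "has_symptom", "muscle ache"),
    ("migraine", "has_symptom", "headache"),
]

def subgraph_to_text(disease: str, triples) -> str:
    """Flatten all triples rooted at `disease` into one retrievable passage."""
    symptoms = [t for h, r, t in triples if h == disease and r == "has_symptom"]
    return f"{disease} commonly presents with: {', '.join(symptoms)}."

def build_passages(triples) -> dict[str, str]:
    return {d: subgraph_to_text(d, triples) for d in {h for h, _, _ in triples}}

def retrieve(query: str, passages: dict[str, str], k: int = 1) -> list[str]:
    """Naive word-overlap retrieval; a real system would use embeddings."""
    def score(p: str) -> int:
        return len(set(query.lower().split()) & set(p.lower().split()))
    return sorted(passages.values(), key=score, reverse=True)[:k]

context = retrieve("patient reports fever and muscle ache", build_passages(TRIPLES))
prompt = f"Context: {' '.join(context)}\nQuestion: What condition might this be?"
```

Pre-verbalizing the subgraphs keeps the retrieval index in plain text, so off-the-shelf RAG components can consume graph knowledge without graph-specific machinery.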

Conflict and Overlap Classification in Construction Standards Using a Large Language Model
Seong-Jin Park | Youn-Gyu Jin | Hyun-Young Moon | Choi Bong-Hyuck | Lee Seung Hwan | Ohjoon Kwon | Kang-Min Kim
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)

Construction standards across different countries provide technical guidelines to ensure the quality and safety of buildings and facilities, with periodic revisions to accommodate advances in construction technology. However, these standards often contain overlapping or conflicting content owing to their broad scope and interdependence, complicating the revision process and creating public inconvenience. Although current expert-driven manual approaches aim to mitigate these issues, they are time-consuming, costly, and error-prone. To address these challenges, we propose conflict and overlap classification in construction standards using a large language model (COSLLM), a framework that leverages a construction domain-adapted large language model for the semantic comparison of sentences in construction standards. COSLLM utilizes a two-step reasoning process that adaptively employs chain-of-thought reasoning for the in-depth analysis of sentences suspected of overlaps or conflicts, ensuring computational and temporal efficiency while maintaining high classification accuracy. The framework achieved an accuracy of 97.9% and a macro F1-score of 0.907 in classifying real-world sentence pairs derived from Korean construction standards as overlapping, conflicting, or neutral. Furthermore, we develop and deploy a real-time web-based system powered by COSLLM to facilitate the efficient establishment and revision of construction standards.
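
The abstract outlines the two-step cascade but not its mechanics; as a hedged illustration, the sketch below shows one way such a pipeline could be wired: a cheap similarity screen first, with chain-of-thought classification reserved for suspicious pairs. The threshold, `quick_similarity`, and `cot_classify` are placeholders, not COSLLM's components.

```python
# Hypothetical sketch of a two-step cascade: a cheap first pass screens
# sentence pairs from construction standards, and only suspected overlaps
# or conflicts receive the costlier chain-of-thought LLM pass. The
# threshold and both helpers are placeholders, not COSLLM's components.
from enum import Enum

class Label(Enum):
    NEUTRAL = "neutral"
    OVERLAP = "overlap"
    CONFLICT = "conflict"

def quick_similarity(a: str, b: str) -> float:
    """Placeholder for a lightweight score (e.g. embedding cosine)."""
    raise NotImplementedError

def cot_classify(a: str, b: str) -> Label:
    """Placeholder for a chain-of-thought LLM call that reasons step by
    step about the pair before emitting a label."""
    raise NotImplementedError

def classify_pair(a: str, b: str, threshold: float = 0.6) -> Label:
    # Step 1: cheap screen; clearly unrelated pairs skip the LLM entirely,
    # which is where the computational and temporal savings come from.
    if quick_similarity(a, b) < threshold:
        return Label.NEUTRAL
    # Step 2: in-depth chain-of-thought analysis for suspicious pairs only.
    return cot_classify(a, b)
```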

2024

Large Language Models are Students at Various Levels: Zero-shot Question Difficulty Estimation
Jae-Woo Park | Seong-Jin Park | Hyun-Sik Won | Kang-Min Kim
Findings of the Association for Computational Linguistics: EMNLP 2024

Recent advancements in educational platforms have emphasized the importance of personalized education. Accurately estimating question difficulty based on the ability of the student group is essential for personalized question recommendation. Several studies have focused on predicting question difficulty using students’ question-solving records or textual information about the questions. However, these approaches require large amounts of question-solving records and fail to account for the subjective difficulty perceived by different student groups. To address these limitations, we propose LLaSA, a framework that utilizes large language models to represent students at various levels. Our proposed methods, LLaSA and zero-shot LLaSA, can estimate question difficulty with and without students’ question-solving records, respectively. In evaluations on the DBE-KT22 and ASSISTments 2005–2006 benchmarks, zero-shot LLaSA achieved performance comparable to that of strong baseline models even without any training. When evaluated with the classification method, LLaSA outperformed the baseline models, achieving state-of-the-art performance. In addition, zero-shot LLaSA’s estimates showed a high correlation with the IRT curve regressed from students’ question-solving records, highlighting its potential for real-world applications.
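
How simulated students translate into a difficulty estimate is not spelled out above; the sketch below shows one straightforward reading, approximating difficulty as the error rate of LLM personas at several ability levels. The personas, the `ask_llm` helper, and exact-match grading are illustrative assumptions, not LLaSA's actual procedure.

```python
# Hypothetical sketch: approximate zero-shot question difficulty as the
# error rate of LLM-simulated students at several ability levels. The
# personas, `ask_llm`, and exact-match grading are illustrative
# assumptions, not LLaSA's actual procedure.
PERSONAS = [
    "You are a struggling student with weak fundamentals.",
    "You are an average student.",
    "You are a top student who rarely makes mistakes.",
]

def ask_llm(persona: str, question: str) -> str:
    """Placeholder for an LLM call answering `question` in persona."""
    raise NotImplementedError("wire up your LLM client here")

def estimate_difficulty(question: str, answer_key: str,
                        trials_per_persona: int = 5) -> float:
    """Estimated difficulty in [0, 1]: share of simulated attempts wrong."""
    attempts = wrong = 0
    for persona in PERSONAS:
        for _ in range(trials_per_persona):
            wrong += ask_llm(persona, question).strip() != answer_key
            attempts += 1
    return wrong / attempts
```

Because the simulated cohort stands in for real students, no question-solving records are needed, mirroring the zero-shot setting described above.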