Soohyeong Kim
2026
TH-RAG: Topic-Based Hierarchical Knowledge Graphs for Robust Multi-hop Reasoning in Graph-based RAG Systems
JungHyoun Kim | Soohyeong Kim | Seok Jun Hwang | Jeonghyeon Park | Yong Suk Choi
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Retrieval-augmented generation (RAG) enables large language models (LLMs) to incorporate external knowledge at inference. Graph-based RAG extends this by organizing corpora into knowledge graphs, improving multi-hop reasoning and offering a global understanding of the corpus. However, triplet-based graphs generated by LLMs are often fragmented and sparsely connected, which reduces coherence and hinders reasoning. Prior enrichment methods such as clustering, community detection, or approximate graph algorithms attempt to restore connectivity but incur high computational cost and risk semantic distortion. To address these issues, we propose TH-RAG, a hierarchical framework that organizes triplets into subtopics and topics, enhancing connectivity, integrating dispersed information, and supporting robust multi-hop reasoning. Experiments on abstractive and specific QA benchmarks show that TH-RAG outperforms strong baselines in accuracy and robustness while remaining efficient, providing a scalable foundation for graph-based RAG.
2025
ReGraphRAG: Reorganizing Fragmented Knowledge Graphs for Multi-Perspective Retrieval-Augmented Generation
Soohyeong Kim | Seok Jun Hwang | JungHyoun Kim | Jeonghyeon Park | Yong Suk Choi
Findings of the Association for Computational Linguistics: EMNLP 2025
Recent advancements in Retrieval-Augmented Generation (RAG) have improved large language models (LLMs) by incorporating external knowledge at inference time. Graph-based RAG systems have emerged as promising approaches, enabling multi-hop reasoning by organizing retrieved information into structured graphs. However, when knowledge graphs are constructed from unstructured documents using LLMs, they often suffer from fragmentation, resulting in disconnected subgraphs that limit inferential coherence and undermine the advantages of graph-based retrieval. To address these limitations, we propose ReGraphRAG, a novel framework designed to reconstruct and enrich fragmented knowledge graphs through three core components: Graph Reorganization, Perspective Expansion, and Query-aware Reranking. Experiments on four benchmarks show that ReGraphRAG outperforms state-of-the-art baselines, achieving an average diversity win rate of over 80%. Ablation studies highlight the key contributions of graph reorganization and especially perspective expansion to performance gains. Our code is available at: https://anonymous.4open.science/r/ReGraphRAG-7B73
2023
Bidirectional Masked Self-attention and N-gram Span Attention for Constituency Parsing
Soohyeong Kim | Whanhee Cho | Minji Kim | Yong Choi
Findings of the Association for Computational Linguistics: EMNLP 2023
Attention mechanisms have become a crucial aspect of deep learning, particularly in natural language processing (NLP) tasks. However, in tasks such as constituency parsing, attention mechanisms can lack the directional information needed to form sentence spans. To address this issue, we propose a Bidirectional masked and N-gram span Attention (BNA) model, which is designed by modifying the attention mechanisms to capture the explicit dependencies between each word and enhance the representation of the output span vectors. The proposed model achieves state-of-the-art performance on the Penn Treebank and Chinese Penn Treebank datasets, with F1 scores of 96.47 and 94.15, respectively. Ablation studies and analysis show that our proposed BNA model effectively captures sentence structure by contextualizing each word in a sentence through bidirectional dependencies and enhancing span representation.