Gayeon Jung


2025

Can LLMs Truly Plan? A Comprehensive Evaluation of Planning Capabilities
Gayeon Jung | HyeonSeok Lim | Minjun Kim | Joon-ho Lim | KyungTae Lim | Hansaem Kim
Findings of the Association for Computational Linguistics: EMNLP 2025

Existing assessments of the planning capabilities of large language models (LLMs) remain largely limited to a single language or a specific representation format. To address this gap, we introduce the Multi-Plan benchmark, comprising 204 multilingual, multi-format travel planning scenarios. In experiments with state-of-the-art LLMs, the Multi-Plan benchmark effectively highlights performance disparities among models, with reasoning-specialized models achieving notably superior results. Interestingly, language differences had minimal impact, whereas mathematically structured representations significantly improved planning accuracy for most models, underscoring the crucial role of input format. These findings deepen our understanding of the planning abilities of LLMs, offer valuable insights for future research, and emphasize the need for more sophisticated AI evaluation methods. The dataset is publicly available at http://huggingface.co/datasets/Bllossom/Multi-Plan.
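
Since the abstract notes the benchmark is publicly available on the Hugging Face Hub, a minimal sketch of loading it with the `datasets` library is shown below; only the repository id comes from the abstract, and the splits and columns printed are whatever the repository actually exposes (no schema is assumed).

```python
# Minimal sketch: loading the Multi-Plan benchmark from the Hugging Face Hub.
# The repository id is taken from the abstract; split names and record fields
# are not assumed and are simply inspected at load time.
from datasets import load_dataset

dataset = load_dataset("Bllossom/Multi-Plan")

# Print each available split, its size, and its column names.
for split_name, split in dataset.items():
    print(split_name, len(split), split.column_names)
```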

Enhancing Coreference Resolution with LLM-driven Data Augmentation and Adversarial Filtering
Dohyeon Kim | Gayeon Jung | Jeongseon Cho | Jihoon Yang
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Coreference resolution is a fundamental task in natural language processing that involves linking different references to the same entity within a text. However, existing models often struggle to reliably identify referential relationships in long contexts or in the presence of complex modifiers. To address these challenges, this study proposes a data augmentation technique that adds adjective phrases, combined with a prompt-based adversarial filtering pipeline. Specifically, we generated and inserted contextually appropriate adjective phrases through the interaction between GPT-4o-mini-based few-shot prompting and a discriminative language model. The grammatical and semantic consistency of these phrases was validated via human evaluation and inter-annotator agreement (IAA) procedures. Integrating the resulting synthetic dataset with the existing data improved model performance: on the LitBank dataset, the CoNLL-F1 score increased by up to 1.7%, while the synthetic data also increased linguistic diversity and the complexity of referential structures. The proposed pipeline is a significant step towards coreference resolution models that better capture linguistic variety and remain robust under challenging conditions.
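
A minimal sketch of what such an LLM-driven augmentation step with adversarial filtering could look like follows, assuming an OpenAI-style client for GPT-4o-mini and a placeholder discriminator; the prompt wording, function names, and acceptance threshold are illustrative assumptions, not the paper's actual pipeline.

```python
# Illustrative sketch of LLM-driven augmentation with adversarial filtering.
# The prompt text, helper names, and threshold are assumptions for illustration,
# not details taken from the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = (
    "Insert a contextually appropriate adjective phrase before the marked "
    "entity mention without changing its coreference links.\n"
    "Sentence: {sentence}\nEntity mention: {mention}\nAugmented sentence:"
)

def augment(sentence: str, mention: str) -> str:
    """Ask GPT-4o-mini to add an adjective phrase around an entity mention."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(sentence=sentence, mention=mention),
        }],
    )
    return response.choices[0].message.content.strip()

def discriminator_score(text: str) -> float:
    """Placeholder for a discriminative language model that scores whether an
    augmented sentence is grammatical and semantically consistent."""
    raise NotImplementedError("plug in a fine-tuned classifier here")

def filter_augmented(candidates: list[str], threshold: float = 0.8) -> list[str]:
    """Adversarial filtering: keep only candidates the discriminator accepts."""
    return [c for c in candidates if discriminator_score(c) >= threshold]
```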

2024

Are large language models affected by politeness? Focusing on request speech acts in Korean
Gayeon Jung | Joeun Kang | Fei Li | Hansaem Kim
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation