2025
DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues
Kyochul Jang | Donghyeon Lee | Kyusik Kim | Dongseok Heo | Taewhoo Lee | Woojeong Kim | Bongwon Suh
Findings of the Association for Computational Linguistics: ACL 2025
Existing function-calling benchmarks focus on single-turn interactions and overlook the complexity of real-world scenarios. To quantify how well existing benchmarks reflect practical applications, we introduce DICE-SCORE, a metric that evaluates the dispersion of tool-related information, such as function names and parameter values, throughout a dialogue. Analyzing existing benchmarks with DICE-SCORE reveals notably low scores, highlighting the need for more realistic scenarios. To address this gap, we present DICE-BENCH, a framework that constructs practical function-calling datasets by synthesizing conversations through a tool graph that maintains dependencies across rounds and a multi-agent system with distinct personas that enhances dialogue naturalness. The final dataset comprises 1,607 high-DICE-SCORE instances. Our experiments on 19 LLMs with DICE-BENCH show that significant advances are still required before such models can be deployed effectively in real-world settings. Our code and data are publicly available.
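The abstract does not give DICE-SCORE's exact formula, so the sketch below is only a minimal illustration under an assumed definition: dispersion measured as the fraction of dialogue turns that each contribute at least one required tool-related item (a function name or parameter value). The function name and example data here are hypothetical, not taken from the paper.

# Illustrative sketch only: DICE-SCORE's actual definition is in the paper.
# Assumed proxy: the fraction of dialogue turns across which the
# tool-related items (function name, parameter values) are scattered.

def dispersion_score(dialogue: list[str], tool_items: list[str]) -> float:
    """Fraction of turns that each contribute at least one tool-related item."""
    turns_with_items = {
        i
        for i, turn in enumerate(dialogue)
        for item in tool_items
        if item in turn
    }
    return len(turns_with_items) / len(dialogue) if dialogue else 0.0

dialogue = [
    "Book me a flight to Paris.",           # destination parameter
    "Sure, when would you like to leave?",  # no tool information
    "Next Friday, and use book_flight.",    # date parameter + function name
]
print(dispersion_score(dialogue, ["Paris", "Friday", "book_flight"]))  # ~0.67

Under this reading, a single-turn benchmark where all tool information appears in one utterance scores low, while a conversation that scatters the same information across many turns scores high.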
ETHIC: Evaluating Large Language Models on Long-Context Tasks with High Information Coverage
Taewhoo Lee | Chanwoong Yoon | Kyochul Jang | Donghyeon Lee | Minju Song | Hyunjae Kim | Jaewoo Kang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Recent advancements in large language models (LLMs) capable of processing extremely long texts highlight the need for a dedicated evaluation benchmark to assess their long-context capabilities. However, existing methods, like the needle-in-a-haystack test, do not effectively assess whether these models fully utilize contextual information, raising concerns about the reliability of current evaluation techniques. To thoroughly examine the effectiveness of existing benchmarks, we introduce a new metric called information coverage (IC), which quantifies the proportion of the input context necessary for answering queries. Our findings indicate that current benchmarks exhibit low IC; although the input context may be extensive, the actual usable context is often limited. To address this, we present ETHIC, a novel benchmark designed to assess LLMs’ ability to leverage the entire context. Our benchmark comprises 1,986 test instances spanning four long-context tasks with high IC scores in the domains of books, debates, medicine, and law. Our evaluations reveal significant performance drops in contemporary LLMs, highlighting a critical challenge in managing long contexts. Our benchmark is available at https://github.com/dmis-lab/ETHIC.
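As a rough illustration of information coverage (IC), the following minimal sketch assumes one plausible definition: the ratio of evidence tokens required to answer a query to the total tokens in the input context. The helper and example below are hypothetical and not the paper's implementation.

# Illustrative sketch only: the paper defines IC precisely; this assumes
# IC = (tokens in the evidence needed for the answer) / (tokens in the context).

def information_coverage(context: str, evidence_spans: list[str]) -> float:
    """Proportion of the context (in whitespace tokens) needed to answer a query."""
    context_tokens = context.split()
    evidence_tokens = sum(len(span.split()) for span in evidence_spans)
    return evidence_tokens / len(context_tokens) if context_tokens else 0.0

context = (
    "The meeting was moved to Thursday . "
    "Lunch will be catered . "
    "The budget report is due Friday ."
)
evidence = ["The budget report is due Friday ."]
print(information_coverage(context, evidence))  # ~0.37: only part of the context is needed

A needle-in-a-haystack test has very low IC (one short span suffices regardless of context length), whereas a task requiring evidence spread across the full input approaches an IC of 1.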
2013
Counseling Dialog System with 5W1H Extraction
Sangdo Han | Kyusong Lee | Donghyeon Lee | Gary Geunbae Lee
Proceedings of the SIGDIAL 2013 Conference
2012
A Hierarchical Domain Model-Based Multi-Domain Selection Framework for Multi-Domain Dialog Systems
Seonghan Ryu | Donghyeon Lee | Injae Lee | Sangdo Han | Gary Geunbae Lee | Myungjae Kim | Kyungduk Kim
Proceedings of COLING 2012: Posters
2008
Transformation-based Sentence Splitting method for Statistical Machine Translation
Jonghoon Lee | Donghyeon Lee | Gary Geunbae Lee
Proceedings of the Workshop on Technologies and Corpora for Asia-Pacific Speech Translation (TCAST)
2007
POSSLT: A Korean to English Spoken Language Translation System
Donghyeon Lee | Jonghoon Lee | Gary Geunbae Lee
Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT)