Subin An
2025
FLUID QA: A Multilingual Benchmark for Figurative Language Usage in Dialogue across English, Chinese, and Korean
Seoyoon Park | Hyeji Choi | Minseon Kim | Subin An | Xiaonan Wang | Gyuri Choi | Hansaem Kim
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Figurative language conveys stance, emotion, and social nuance, making its appropriate use essential in dialogue. While large language models (LLMs) often succeed in recognizing figurative expressions at the sentence level, their ability to use them coherently in conversation remains uncertain. We introduce FLUID QA, the first multilingual benchmark that evaluates figurative usage in dialogue across English, Korean, and Chinese. Each item embeds figurative choices into multi-turn contexts. To support interpretation, we include FLUTE-bi, a sentence-level diagnostic task. Results reveal a persistent gap: models that perform well on FLUTE-bi frequently fail on FLUID QA, especially in sarcasm and metaphor. These errors reflect systematic rhetorical confusion and limited discourse reasoning. FLUID QA provides a scalable framework for assessing usage-level figurative competence across languages.
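The sketch below is only a hypothetical illustration of how a usage-level multiple-choice dialogue item and its accuracy scoring might be represented in Python; the field names (context, options, answer_idx) and the predict callback are assumptions for illustration, not the benchmark's released schema.

```python
# Toy representation of a usage-level figurative-choice item and accuracy scoring.
# Field names and the predict() interface are illustrative assumptions only.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DialogueItem:
    context: List[str]   # preceding dialogue turns
    options: List[str]   # candidate figurative continuations
    answer_idx: int      # index of the contextually appropriate option


def accuracy(items: List[DialogueItem],
             predict: Callable[[List[str], List[str]], int]) -> float:
    """Fraction of items where the model selects the appropriate figurative reply."""
    if not items:
        return 0.0
    correct = sum(predict(it.context, it.options) == it.answer_idx for it in items)
    return correct / len(items)
```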
Transparent Reference-free Automated Evaluation of Open-Ended User Survey Responses
Subin An | Yugyeong Ji | Junyoung Kim | Heejin Kook | Yang Lu | Josh Seltzer
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Open-ended survey responses provide valuable insights in marketing research, but low-quality responses not only burden researchers with manual filtering but also risk producing misleading conclusions, underscoring the need for effective evaluation. Existing automatic evaluation methods target LLM-generated text and inadequately assess human-written responses, which have distinct characteristics. To address these characteristics, we propose a two-stage evaluation framework designed specifically for human survey responses. First, gibberish filtering removes nonsensical responses. Then, three dimensions (effort, relevance, and completeness) are evaluated using LLM capabilities, grounded in empirical analysis of real-world survey data. Validation on English and Korean datasets shows that our framework not only outperforms existing metrics but also demonstrates strong practical applicability in real-world tasks such as response quality prediction and response rejection, correlating strongly with expert assessment.
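As a rough illustration of the two-stage idea (gibberish filtering followed by LLM-based scoring of effort, relevance, and completeness), the Python sketch below uses an assumed word-character heuristic for stage one and a stubbed llm_judge callable for stage two; neither reflects the authors' actual filters, prompts, or scoring implementation.

```python
# Minimal two-stage response-quality pipeline in the spirit of the described framework.
# The gibberish heuristic and the llm_judge stub are illustrative assumptions only.
import re
from typing import Callable, Dict, Optional


def looks_like_gibberish(text: str) -> bool:
    """Stage 1: cheap filter for nonsensical responses (assumed heuristic)."""
    stripped = text.strip()
    if len(stripped) < 2:
        return True
    # Reject strings that are mostly non-word characters (e.g. "...." or "!!!!").
    word_chars = len(re.findall(r"\w", stripped))
    return word_chars / len(stripped) < 0.5


def evaluate_response(
    question: str,
    response: str,
    llm_judge: Callable[[str], Dict[str, float]],
) -> Optional[Dict[str, float]]:
    """Stage 2: score effort, relevance, and completeness with an LLM judge.

    Returns None for responses rejected by the gibberish filter.
    """
    if looks_like_gibberish(response):
        return None
    prompt = (
        "Rate the survey response on effort, relevance, and completeness from 1 to 5.\n"
        f"Question: {question}\nResponse: {response}"
    )
    return llm_judge(prompt)  # e.g. a wrapper around a chat-completion API call


if __name__ == "__main__":
    # Dummy judge standing in for a real LLM call.
    dummy_judge = lambda prompt: {"effort": 3.0, "relevance": 4.0, "completeness": 3.5}
    print(evaluate_response("What did you like about the product?", "asdfgh", dummy_judge))
    print(evaluate_response("What did you like about the product?",
                            "The battery lasts all day and setup was easy.", dummy_judge))
```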