2025
DREsS: Dataset for Rubric-based Essay Scoring on EFL Writing
Haneul Yoo | Jieun Han | So-Yeon Ahn | Alice Oh
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Automated essay scoring (AES) is a useful tool in English as a Foreign Language (EFL) writing education, offering real-time essay scores for students and instructors. However, previous AES models were trained on essays and scores irrelevant to the practical scenarios of EFL writing education and usually provided a single holistic score due to the lack of appropriate datasets. In this paper, we release DREsS, a large-scale, standard dataset for rubric-based automated essay scoring with 48.9K samples in total. DREsS comprises three sub-datasets: DREsS_New, DREsS_Std., and DREsS_CASE. We collect DREsS_New, a real-classroom dataset with 2.3K essays authored by EFL undergraduate students and scored by English education experts. We also standardize existing rubric-based essay scoring datasets as DREsS_Std. We suggest CASE, a corruption-based augmentation strategy for essays, which generates 40.1K synthetic samples of DREsS_CASE and improves the baseline results by 45.44%. DREsS will enable further research to provide a more accurate and practical AES system for EFL writing education.
From Selection to Generation: A Survey of LLM-based Active Learning
Yu Xia | Subhojyoti Mukherjee | Zhouhang Xie | Junda Wu | Xintong Li | Ryan Aponte | Hanjia Lyu | Joe Barrow | Hongjie Chen | Franck Dernoncourt | Branislav Kveton | Tong Yu | Ruiyi Zhang | Jiuxiang Gu | Nesreen K. Ahmed | Yu Wang | Xiang Chen | Hanieh Deilamsalehy | Sungchul Kim | Zhengmian Hu | Yue Zhao | Nedim Lipka | Seunghyun Yoon | Ting-Hao Kenneth Huang | Zichao Wang | Puneet Mathur | Soumyabrata Pal | Koyel Mukherjee | Zhehao Zhang | Namyong Park | Thien Huu Nguyen | Jiebo Luo | Ryan A. Rossi | Julian McAuley
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Active Learning (AL) has been a powerful paradigm for improving model efficiency and performance by selecting the most informative data points for labeling and training. In recent active learning frameworks, Large Language Models (LLMs) have been employed not only for selection but also for generating entirely new data instances and providing more cost-effective annotations. Motivated by the increasing importance of high-quality data and efficient model training in the era of LLMs, we present a comprehensive survey on LLM-based Active Learning. We introduce an intuitive taxonomy that categorizes these techniques and discuss the transformative roles LLMs can play in the active learning loop. We further examine the impact of AL on LLM learning paradigms and its applications across various domains. Finally, we identify open challenges and propose future research directions. This survey aims to serve as an up-to-date resource for researchers and practitioners seeking to gain an intuitive understanding of LLM-based AL techniques and deploy them to new applications.
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation
Shivalika Singh | Angelika Romanou | Clémentine Fourrier | David Ifeoluwa Adelani | Jian Gang Ngui | Daniel Vila-Suero | Peerat Limkonchotiwat | Kelly Marchisio | Wei Qi Leong | Yosephine Susanto | Raymond Ng | Shayne Longpre | Sebastian Ruder | Wei-Yin Ko | Antoine Bosselut | Alice Oh | Andre Martins | Leshem Choshen | Daphne Ippolito | Enzo Ferrante | Marzieh Fadaee | Beyza Ermis | Sara Hooker
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reliable multilingual evaluation is difficult, and culturally appropriate evaluation is even harder to achieve. A common practice to fill this gap is to machine-translate English evaluation sets. However, translation introduces language bias and carries over cultural and regional assumptions from the original questions – often testing knowledge irrelevant to the target audience. In this work, we highlight the extent and impact of these biases and present a multilingual evaluation framework that aims to mitigate them through improved translations and annotation practices. Through a large-scale study involving professional and community translators and annotators, we show that state-of-the-art models excel primarily by learning Western-centric concepts. Notably, we find that model rankings on the full MMLU change when evaluated on a subset of questions explicitly marked as culturally sensitive. We release Global MMLU, a multilingual extension of MMLU across 42 languages, featuring improved translation quality, expanded language coverage, and designated subsets labeled as culturally sensitive and culturally agnostic to enable a more comprehensive and equitable benchmark for evaluating language models across diverse linguistic and cultural contexts.
XDAC: XAI-Driven Detection and Attribution of LLM-Generated News Comments in Korean
Wooyoung Go | Hyoungshick Kim | Alice Oh | Yongdae Kim
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) generate human-like text, raising concerns about their misuse in creating deceptive content. Detecting LLM-generated comments (LGC) in online news is essential for preserving online discourse integrity and preventing opinion manipulation. However, effective detection faces two key challenges: the brevity and informality of news comments limit traditional methods, and the absence of a publicly available LGC dataset hinders model training, especially for languages other than English. To address these challenges, we propose a twofold approach. First, we develop an LGC generation framework to construct a high-quality dataset with diverse and complex examples. Second, we introduce XDAC (XAI-Driven Detection and Attribution of LLM-Generated Comments), a framework utilizing explainable AI, designed for the detection and attribution of short-form LGC in Korean news articles. XDAC leverages XAI to uncover distinguishing linguistic patterns at both token and character levels. We present the first large-scale benchmark dataset, comprising 1.3M human-written comments from Korean news platforms and 1M LLM-generated comments from 14 distinct models. XDAC outperforms existing methods, achieving a 98.5% F1 score in LGC detection with a relative improvement of 68.1%, and an 84.3% F1 score in attribution. To validate real-world applicability, we analyze 5.24M news comments from Naver, South Korea’s leading online news platform, identifying 27,029 potential LLM-generated comments.
Diffusion Models Through a Global Lens: Are They Culturally Inclusive?
Zahra Bayramli | Ayhan Suleymanzade | Na Min An | Huzama Ahmad | Eunsu Kim | Junyeong Park | James Thorne | Alice Oh
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Text-to-image diffusion models have recently enabled the creation of visually compelling, detailed images from textual prompts. However, their ability to accurately represent various cultural nuances remains an open question. In our work, we introduce the CULTDIFF benchmark, evaluating whether state-of-the-art diffusion models can generate culturally specific images spanning ten countries. By conducting a fine-grained analysis of different similarity aspects, we show that these models often fail to generate cultural artifacts in architecture, clothing, and food, especially for underrepresented regions, revealing significant disparities in cultural relevance, description fidelity, and realism compared to real-world reference images. With the collected human evaluations, we develop a neural-based image-image similarity metric, namely CULTDIFF-S, to predict human judgment on real and generated images with cultural artifacts. Our work highlights the need for more inclusive generative AI systems and equitable dataset representation across a wide range of cultures.
SlimLM: An Efficient Small Language Model for On-Device Document Assistance
Thang M. Pham | Phat T. Nguyen | Seunghyun Yoon | Viet Dac Lai | Franck Dernoncourt | Trung Bui
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
While small language models (SLMs) show promise for mobile deployment, their real-world performance and applications on smartphones remain underexplored. We present SlimLM, a series of SLMs optimized for document assistance tasks on mobile devices. Through extensive experiments on a Samsung Galaxy S24, we identify the sweet spot between model size (ranging from 125M to 8B parameters), context length, and inference time for efficient on-device processing. SlimLM is pretrained on SlimPajama-627B and fine-tuned on DocAssist, our constructed dataset for summarization, question answering, and suggestion tasks. Our smallest model demonstrates efficient performance on the S24, while larger variants offer enhanced capabilities within mobile constraints. We evaluate SlimLM against existing SLMs, showing comparable or superior performance and offering a benchmark for future research in on-device language models. We provide an Android application allowing users to experience SlimLM’s document assistance capabilities, offering valuable insights for mobile developers, researchers, and companies seeking privacy-preserving on-device alternatives to server-based language models.
LLM-C3MOD: A Human-LLM Collaborative System for Cross-Cultural Hate Speech Moderation
Junyeong Park | Seogyeong Jeong | Seyoung Song | Yohan Lee | Alice Oh
Proceedings of the 3rd Workshop on Cross-Cultural Considerations in NLP (C3NLP 2025)
Content moderation platforms concentrate resources on English content despite serving predominantly non-English-speaking users. Moreover, given the scarcity of native moderators for low-resource languages, non-native moderators must bridge this gap in moderation tasks such as hate speech moderation. Through a user study, we identify that non-native moderators struggle with understanding culturally specific knowledge, sentiment, and internet culture in hate speech. To assist non-native moderators, we present LLM-C3MOD, a human-LLM collaborative pipeline with three steps: (1) RAG-enhanced cultural context annotations; (2) initial LLM-based moderation; and (3) targeted human moderation for cases lacking LLM consensus. Evaluated on a Korean hate speech dataset with Indonesian and German participants, our system achieves 78% accuracy (surpassing GPT-4o’s 71% baseline) while reducing human workload by 83.6%. In addition, cultural context annotations improved non-native moderator accuracy from 22% to 61%, with humans notably excelling at nuanced tasks where LLMs struggle. Our findings demonstrate that non-native moderators, when properly supported by LLMs, can effectively contribute to cross-cultural hate speech moderation.
WHEN TOM EATS KIMCHI: Evaluating Cultural Awareness of Multimodal Large Language Models in Cultural Mixture Contexts
Jun Seong Kim | Kyaw Ye Thu | Javad Ismayilzada | Junyeong Park | Eunsu Kim | Huzama Ahmad | Na Min An | James Thorne | Alice Oh
Proceedings of the 3rd Workshop on Cross-Cultural Considerations in NLP (C3NLP 2025)
In a highly globalized world, it is important for multimodal large language models (MLLMs) to recognize and respond correctly to mixed-cultural inputs. For example, a model should correctly identify kimchi (Korean food) in an image both when an Asian woman is eating it and when an African man is eating it. However, current MLLMs show an over-reliance on the visual features of the person, leading to misclassification of the entities. To examine the robustness of MLLMs to different ethnicities, we introduce MIXCUBE, a cross-cultural bias benchmark, and study elements from five countries and four ethnicities. Our findings reveal that MLLMs achieve both higher accuracy and lower sensitivity to such perturbations for high-resource cultures, but not for low-resource cultures. GPT-4o, the best-performing model overall, shows up to a 58% difference in accuracy between the original and perturbed cultural settings in low-resource cultures.
FaithUn: Toward Faithful Forgetting in Language Models by Investigating the Interconnectedness of Knowledge
Nakyeong Yang | Minsung Kim | Seunghyun Yoon | Joongbo Shin | Kyomin Jung
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Various studies have attempted to remove sensitive or private knowledge from a language model to prevent its unauthorized exposure. However, prior studies have overlooked the inherent complexity and interconnectedness of knowledge, which requires careful examination. To resolve this problem, we first define a new concept called superficial unlearning, which refers to the phenomenon where an unlearning method either fails to erase the interconnected knowledge it should remove or unintentionally erases irrelevant knowledge. Based on the definition, we introduce a novel benchmark, FaithUn, to analyze and evaluate the faithfulness of unlearning in real-world knowledge QA settings. Furthermore, we propose a novel unlearning method, KLUE, which updates only knowledge-related neurons to achieve faithful unlearning. KLUE leverages a regularized explainability method to localize contextual knowledge neurons, updating only these neurons using carefully selected unforgotten samples. Experimental results demonstrate that existing unlearning methods fail to ensure faithful unlearning, while our method shows significant effectiveness in real-world QA unlearning.
Team HUMANE at AVeriTeC 2025: HerO 2 for Efficient Fact Verification
Yejun Yoon | Jaeyoon Jung | Seunghyun Yoon | Kunwoo Park
Proceedings of the Eighth Fact Extraction and VERification Workshop (FEVER)
This paper presents HerO 2, Team HUMANE’s system for the AVeriTeC shared task at the FEVER-25 workshop. HerO 2 is an enhanced version of HerO, the best-performing open-source model from the previous year’s challenge. It improves evidence quality through document summarization and answer reformulation, optimizes veracity prediction via post-training quantization under computational constraints, and enhances overall system performance by integrating updated language model (LM) backbones. HerO 2 ranked second on the leaderboard while achieving the shortest runtime among the top three systems, demonstrating both high efficiency and strong potential for real-world fact verification. The code is available at https://github.com/ssu-humane/HerO2.
VLind-Bench: Measuring Language Priors in Large Vision-Language Models
Kang-il Lee | Minbeom Kim | Seunghyun Yoon | Minsung Kim | Dongryeol Lee | Hyukhun Koh | Kyomin Jung
Findings of the Association for Computational Linguistics: NAACL 2025
Large Vision-Language Models (LVLMs) have demonstrated outstanding performance across various multimodal tasks. However, they suffer from a problem known as language prior, where responses are generated based solely on textual patterns while disregarding image information. Addressing the issue of language prior is crucial, as it can lead to undesirable biases or hallucinations when dealing with images that are out of training distribution. Despite its importance, current methods for accurately measuring language priors in LVLMs are poorly studied. Although existing benchmarks based on counterfactual or out-of-distribution images can partially be used to measure language priors, they fail to disentangle language priors from other confounding factors. To this end, we propose a new benchmark called VLind-Bench, which is the first benchmark specifically designed to measure the language priors, or blindness, of LVLMs. It not only includes tests on counterfactual images to assess language priors but also involves a series of tests to evaluate more basic capabilities such as commonsense knowledge, visual perception, and commonsense biases. For each instance in our benchmark, we ensure that all these basic tests are passed before evaluating the language priors, thereby minimizing the influence of other factors on the assessment. The evaluation and analysis of recent LVLMs in our benchmark reveal that almost all models exhibit a significant reliance on language priors, presenting a strong challenge in the field.
Code-Switching Curriculum Learning for Multilingual Transfer in LLMs
Haneul Yoo | Cheonbok Park | Sangdoo Yun | Alice Oh | Hwaran Lee
Findings of the Association for Computational Linguistics: ACL 2025
Large language models (LLMs) now exhibit near human-level performance in various tasks, but their performance drops drastically beyond a handful of high-resource languages due to the imbalance in pre-training data. Inspired by the human process of second language acquisition, particularly code-switching, the practice of alternating languages within a conversation, we propose code-switching curriculum learning (CSCL) to enhance cross-lingual transfer for LLMs. CSCL mimics the stages of human language learning by progressively training models with a curriculum consisting of 1) token-level code-switching, 2) sentence-level code-switching, and 3) monolingual corpora. Using Qwen 2 as our underlying model, we demonstrate the efficacy of CSCL in improving language transfer to Korean, achieving significant performance gains compared to monolingual continual pre-training methods. Ablation studies reveal that both token- and sentence-level code-switching significantly enhance cross-lingual transfer and that curriculum learning amplifies these effects. We also extend our findings to various languages, including Japanese (high-resource) and Indonesian (low-resource), and to two additional models (Gemma 2 and Phi 3.5). We further show that CSCL mitigates spurious correlations between language resources and safety alignment, presenting a robust, efficient framework for more equitable language transfer in LLMs. We observe that CSCL is especially effective in low-resource settings where high-quality monolingual corpora for language transfer are hardly available.
Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations
Jiho Jin | Woosung Kang | Junho Myung | Alice Oh
Findings of the Association for Computational Linguistics: ACL 2025
Measuring social bias in large language models (LLMs) is crucial, but existing bias evaluation methods struggle to assess bias in long-form generation. We propose a Bias Benchmark for Generation (BBG), an adaptation of the Bias Benchmark for QA (BBQ), designed to evaluate social bias in long-form generation by having LLMs generate continuations of story prompts. Building our benchmark in English and Korean, we measure the probability of neutral and biased generations across ten LLMs. We also compare our long-form story generation evaluation results with multiple-choice BBQ evaluation, showing that the two approaches produce inconsistent results.
Hypothetical Documents or Knowledge Leakage? Rethinking LLM-based Query Expansion
Yejun Yoon | Jaeyoon Jung | Seunghyun Yoon | Kunwoo Park
Findings of the Association for Computational Linguistics: ACL 2025
Query expansion methods powered by large language models (LLMs) have demonstrated effectiveness in zero-shot retrieval tasks. These methods assume that LLMs can generate hypothetical documents that, when incorporated into a query vector, enhance the retrieval of real evidence. However, we challenge this assumption by investigating whether knowledge leakage in benchmarks contributes to the observed performance gains. Using fact verification as a testbed, we analyze whether the generated documents contain information entailed by ground-truth evidence and assess their impact on performance. Our findings indicate that, on average, performance improvements consistently occurred for claims whose generated documents included sentences entailed by gold evidence. This suggests that knowledge leakage may be present in fact-verification benchmarks, potentially inflating the perceived performance of LLM-based query expansion methods.
GUI Agents: A Survey
Dang Nguyen | Jian Chen | Yu Wang | Gang Wu | Namyong Park | Zhengmian Hu | Hanjia Lyu | Junda Wu | Ryan Aponte | Yu Xia | Xintong Li | Jing Shi | Hongjie Chen | Viet Dac Lai | Zhouhang Xie | Sungchul Kim | Ruiyi Zhang | Tong Yu | Mehrab Tanjim | Nesreen K. Ahmed | Puneet Mathur | Seunghyun Yoon | Lina Yao | Branislav Kveton | Jihyung Kil | Thien Huu Nguyen | Trung Bui | Tianyi Zhou | Ryan A. Rossi | Franck Dernoncourt
Findings of the Association for Computational Linguistics: ACL 2025
Graphical User Interface (GUI) agents, powered by Large Foundation Models, have emerged as a transformative approach to automating human-computer interaction. These agents autonomously interact with digital systems via GUIs, emulating human actions such as clicking, typing, and navigating visual elements across diverse platforms. Motivated by the growing interest and fundamental importance of GUI agents, we provide a comprehensive survey that categorizes their benchmarks, evaluation metrics, architectures, and training methods. We propose a unified framework that delineates their perception, reasoning, planning, and acting capabilities. Furthermore, we identify important open challenges and discuss key future directions. Finally, this work serves as a basis for practitioners and researchers to gain an intuitive understanding of current progress, techniques, benchmarks, and critical open problems that remain to be addressed.
Spotting Out-of-Character Behavior: Atomic-Level Evaluation of Persona Fidelity in Open-Ended Generation
Jisu Shin | Juhyun Oh | Eunsu Kim | Hoyun Song | Alice Oh
Findings of the Association for Computational Linguistics: ACL 2025
Ensuring persona fidelity in large language models (LLMs) is essential for maintaining coherent and engaging human-AI interactions. However, LLMs often exhibit Out-of-Character (OOC) behavior, where generated responses deviate from an assigned persona, leading to inconsistencies that affect model reliability. Existing evaluation methods typically assign single scores to entire responses, struggling to capture subtle persona misalignment, particularly in long-form text generation. To address this limitation, we propose an atomic-level evaluation framework that quantifies persona fidelity at a finer granularity. Our three key metrics measure the degree of persona alignment and consistency within and across generations. Our approach enables a more precise and realistic assessment of persona fidelity by identifying subtle deviations that real users would encounter. Through our experiments, we demonstrate that our framework effectively detects persona inconsistencies that prior methods overlook. By analyzing persona fidelity across diverse tasks and personality types, we reveal how task structure and persona desirability influence model adaptability, highlighting challenges in maintaining consistent persona expression.
LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation
Eunsu Kim | Juyoung Suk | Seungone Kim | Niklas Muennighoff | Dongkwan Kim | Alice Oh
Findings of the Association for Computational Linguistics: ACL 2025
We introduce LLM-as-an-Interviewer, a novel paradigm for evaluating large language models (LLMs). This approach leverages multi-turn interactions where the LLM interviewer actively provides feedback on responses and poses follow-up questions to the evaluated LLM. At the start of the interview, the LLM interviewer dynamically modifies datasets to generate initial questions, mitigating data contamination. We apply the LLM-as-an-Interviewer framework to evaluate six models on reasoning, factuality, and instruction-following tasks. Our results show that the framework effectively provides insights into LLM performance, including the quality of initial responses, adaptability to feedback, and ability to address follow-up queries such as clarification or additional knowledge requests. The framework also addresses key limitations of conventional methods like LLM-as-a-Judge, including verbosity bias and inconsistency across runs. Finally, we propose the Interview Report, which aggregates insights from the interview process, providing examples and a comprehensive analysis of the LLM’s strengths and weaknesses. This report offers a detailed snapshot of the model’s real-world applicability.
Culture is Everywhere: A Call for Intentionally Cultural Evaluation
Juhyun Oh | Inha Cha | Michael Saxon | Hyunseung Lim | Shaily Bhatt | Alice Oh
Findings of the Association for Computational Linguistics: EMNLP 2025
The prevailing “trivia-centered paradigm” for evaluating the cultural alignment of large language models (LLMs) is increasingly inadequate as these models become more advanced and widely deployed. Existing approaches typically reduce culture to static facts or values, testing models via multiple-choice or short-answer questions that treat culture as isolated trivia. Such methods neglect the pluralistic and interactive realities of culture, and overlook how cultural assumptions permeate even ostensibly “neutral” evaluation settings. In this position paper, we argue for intentionally cultural evaluation: an approach that systematically examines the cultural assumptions embedded in all aspects of evaluation, not just in explicitly cultural tasks. We systematically characterize what culturally contingent considerations arise in evaluation, how they arise, and under which circumstances, and we emphasize the importance of researcher positionality for fostering inclusive, culturally aligned NLP research. Finally, we discuss implications and future directions for moving beyond current benchmarking practices, discovering important applications that we don’t know exist, and involving communities in evaluation design through HCI-inspired participatory methodologies.
Uncovering Factor-Level Preference to Improve Human-Model Alignment
Juhyun Oh | Eunsu Kim | Jiseon Kim | Wenda Xu | Inha Cha | William Yang Wang | Alice Oh
Findings of the Association for Computational Linguistics: EMNLP 2025
Large language models (LLMs) often exhibit tendencies that diverge from human preferences, such as favoring certain writing styles or producing overly verbose outputs. While crucial for improvement, identifying the factors driving these misalignments remains challenging due to existing evaluation methods’ reliance on coarse-grained comparisons and lack of explainability. To address this, we introduce PROFILE, an automated framework to uncover and measure factor-level preference alignment of humans and LLMs. Using PROFILE, we analyze preference alignment across three key tasks: summarization, instruction-following, and document-based QA. We find a significant discrepancy: while LLMs show poor factor-level alignment with human preferences when generating texts, they demonstrate strong alignment in discrimination tasks. We demonstrate how the identified generation-discrimination gap can be leveraged to improve LLM alignment through multiple approaches, including fine-tuning with self-guidance. Our work highlights the value of factor-level analysis for identifying hidden misalignments and provides a practical framework for improving LLM-human preference alignment.
MUG-Eval: A Proxy Evaluation Framework for Multilingual Generation Capabilities in Any Language
Seyoung Song | Seogyeong Jeong | Eunsu Kim | Jiho Jin | Dongkwan Kim | Jay Shin | Alice Oh
Findings of the Association for Computational Linguistics: EMNLP 2025
Evaluating text generation capabilities of large language models (LLMs) is challenging, particularly for low-resource languages where methods for direct assessment are scarce. We propose MUG-Eval, a novel framework that evaluates LLMs’ multilingual generation capabilities by transforming existing benchmarks into conversational tasks and measuring the LLMs’ accuracies on those tasks. We specifically designed these conversational tasks to require effective communication in the target language. Then, we simply use task success rate as a proxy for successful conversation generation. Our approach offers two key advantages: it is independent of language-specific NLP tools or annotated datasets, which are limited for most languages, and it does not rely on LLMs-as-judges, whose evaluation quality degrades outside a few high-resource languages. We evaluate 8 LLMs across 30 languages spanning high, mid, and low-resource categories, and we find that MUG-Eval correlates strongly with established benchmarks (r > 0.75) while enabling standardized comparisons across languages and models. Our framework provides a robust and resource-efficient solution for evaluating multilingual generation that can be extended to thousands of languages.
PapersPlease: A Benchmark for Evaluating Motivational Values of Large Language Models Based on ERG Theory
Junho Myung | Yeon Su Park | Sunwoo Kim | Shin Yoo | Alice Oh
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
Evaluating the performance and biases of large language models (LLMs) through role-playing scenarios is becoming increasingly common, as LLMs often exhibit biased behaviors in these contexts. Building on this line of research, we introduce PapersPlease, a benchmark consisting of 3,700 moral dilemmas designed to investigate LLMs’ decision-making in prioritizing various levels of human needs. In our setup, LLMs act as immigration inspectors deciding whether to approve or deny entry based on the short narratives of people. These narratives are constructed using the Existence, Relatedness, and Growth (ERG) theory, which categorizes human needs into three hierarchical levels. Our analysis of six LLMs reveals statistically significant patterns in decision-making, suggesting that LLMs encode implicit preferences. Additionally, our evaluation of the impact of incorporating social identities into the narratives shows varying responsiveness based on both motivational needs and identity cues, with some models exhibiting higher denial rates for marginalized identities. All data is publicly available at https://github.com/yeonsuuuu28/papers-please.
Proceedings of the 1st Workshop on Language Models for Underserved Communities (LM4UC 2025)
Sang Truong | Rifki Afina Putri | Duc Nguyen | Angelina Wang | Daniel Ho | Alice Oh | Sanmi Koyejo
Proceedings of the 1st Workshop on Language Models for Underserved Communities (LM4UC 2025)
MUG-Eval: A Proxy Evaluation Framework for Multilingual Generation Capabilities in Any Language
Seyoung Song | Seogyeong Jeong | Eunsu Kim | Jiho Jin | Dongkwan Kim | Jamin Shin | Alice Oh
Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)
Evaluating text generation capabilities of large language models (LLMs) is challenging, particularly for low-resource languages where methods for direct assessment are scarce. We propose MUG-Eval, a novel framework that evaluates LLMs’ multilingual generation capabilities by transforming existing benchmarks into conversational tasks and measuring the LLMs’ accuracies on those tasks. We specifically designed these conversational tasks to require effective communication in the target language. Then, we simply use task success rate as a proxy for successful conversation generation. Our approach offers two key advantages: it is independent of language-specific NLP tools or annotated datasets, which are limited for most languages, and it does not rely on LLMs-as-judges, whose evaluation quality degrades outside a few high-resource languages. We evaluate 8 LLMs across 30 languages spanning high, mid, and low-resource categories, and we find that MUG-Eval correlates strongly with established benchmarks (r > 0.75) while enabling standardized comparisons across languages and models. Our framework provides a robust and resource-efficient solution for evaluating multilingual generation that can be extended to thousands of languages.
WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines
Genta Indra Winata | Frederikus Hudi | Patrick Amadeus Irawan | David Anugraha | Rifki Afina Putri | Wang Yutong | Adam Nohejl | Ubaidillah Ariq Prathama | Nedjma Ousidhoum | Afifa Amriani | Anar Rzayev | Anirban Das | Ashmari Pramodya | Aulia Adila | Bryan Wilie | Candy Olivia Mawalim | Cheng Ching Lam | Daud Abolade | Emmanuele Chersoni | Enrico Santus | Fariz Ikhwantri | Garry Kuwanto | Hanyang Zhao | Haryo Akbarianto Wibowo | Holy Lovenia | Jan Christian Blaise Cruz | Jan Wira Gotama Putra | Junho Myung | Lucky Susanto | Maria Angelica Riera Machin | Marina Zhukova | Michael Anugraha | Muhammad Farid Adilazuarda | Natasha Christabelle Santosa | Peerat Limkonchotiwat | Raj Dabre | Rio Alexander Audino | Samuel Cahyawijaya | Shi-Xiong Zhang | Stephanie Yulia Salim | Yi Zhou | Yinxuan Gui | David Ifeoluwa Adelani | En-Shiun Annie Lee | Shogo Okada | Ayu Purwarianti | Alham Fikri Aji | Taro Watanabe | Derry Tanti Wijaya | Alice Oh | Chong-Wah Ngo
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark to date. It includes tasks for identifying dish names and their origins. We provide evaluation datasets in two sizes (12k and 60k instances) alongside a training dataset (1 million instances). Our findings show that while VLMs perform better with correct location context, they struggle with adversarial contexts and predicting specific regional cuisines and languages. To support future research, we release a knowledge base with annotated food entries and images along with the VQA data.
CORG: Generating Answers from Complex, Interrelated Contexts
Hyunji Lee | Franck Dernoncourt | Trung Bui | Seunghyun Yoon
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
In a real-world corpus, knowledge frequently recurs across documents but often contains inconsistencies due to ambiguous naming, outdated information, or errors, leading to complex interrelationships between contexts. Previous research has shown that language models struggle with these complexities, typically focusing on single factors in isolation. We classify these relationships into four types: distracting, ambiguous, counterfactual, and duplicated. Our analysis reveals that no single approach effectively addresses all these interrelationships simultaneously. Therefore, we introduce Context Organizer (CORG), a framework that organizes multiple contexts into independently processed groups. This design allows the model to efficiently find all relevant answers while ensuring disambiguation. CORG consists of three key components: a graph constructor, a reranker, and an aggregator. Our results demonstrate that CORG balances performance and efficiency effectively, outperforming existing grouping methods and achieving comparable results to more computationally intensive, single-context approaches.
Generating Diverse Hypotheses for Inductive Reasoning
Kang-il Lee | Hyukhun Koh | Dongryeol Lee | Seunghyun Yoon | Minsung Kim | Kyomin Jung
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Inductive reasoning, the process of inferring general rules from a small number of observations, is a fundamental aspect of human intelligence. Recent works suggest that large language models (LLMs) can engage in inductive reasoning by sampling multiple hypotheses about the rules and selecting the one that best explains the observations. However, due to IID sampling, semantically redundant hypotheses are frequently generated, leading to significant waste of compute. In this paper, we 1) demonstrate that increasing the temperature to enhance diversity is limited by the text degeneration issue, and 2) propose a novel method to improve diversity while maintaining text quality. We first analyze the effect of increasing the temperature parameter, which is regarded as the LLM’s diversity control, on IID hypotheses. Our analysis shows that as temperature rises, the diversity and accuracy of hypotheses increase up to a certain point, but this trend saturates due to text degeneration. To generate hypotheses that are more semantically diverse and of higher quality, we propose a novel approach inspired by human inductive reasoning, which we call Mixture of Concepts (MoC). When applied to several inductive reasoning benchmarks, MoC demonstrated significant performance improvements compared to standard IID sampling and other approaches.
Team ACK at SemEval-2025 Task 2: Beyond Word-for-Word Machine Translation for English-Korean Pairs
Daniel Lee | Harsh Sharma | Jieun Han | Sunny Jeong | Alice Oh | Vered Shwartz
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)
Translating knowledge-intensive and entity-rich text between English and Korean requires transcreation to preserve language-specific and cultural nuances beyond literal, phonetic, or word-for-word conversion. We evaluate 13 models (LLMs and MT systems) using automatic metrics and human assessment by bilingual annotators. Our findings show that LLMs outperform traditional MT systems but struggle with entity translation requiring cultural adaptation. By constructing an error taxonomy, we identify incorrect responses and entity name errors as key issues, with performance varying by entity type and popularity level. This work exposes gaps in automatic evaluation metrics and aims to enable future work on culturally nuanced machine translation.
2024
LLM-as-a-tutor in EFL Writing Education: Focusing on Evaluation of Student-LLM Interaction
Jieun Han | Haneul Yoo | Junho Myung | Minsun Kim | Hyunseung Lim | Yoonsu Kim | Tak Yeon Lee | Hwajung Hong | Juho Kim | So-Yeon Ahn | Alice Oh
Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U)
In the context of English as a Foreign Language (EFL) writing education, LLM-as-a-tutor can assist students by providing real-time feedback on their essays. However, challenges arise in assessing LLM-as-a-tutor due to differing standards between educational and general use cases. To bridge this gap, we integrate pedagogical principles to assess student-LLM interaction. First, we explore how LLMs can function as English tutors, providing effective essay feedback tailored to students. Second, we propose three criteria to evaluate LLM-as-a-tutor specifically designed for EFL writing education, emphasizing pedagogical aspects. In this process, EFL experts evaluate the feedback from LLM-as-a-tutor regarding (1) quality and (2) characteristics. On the other hand, EFL learners assess their (3) learning outcomes from interaction with LLM-as-a-tutor. This approach lays the groundwork for developing LLM-as-a-tutor systems tailored to the needs of EFL learners, advancing the effectiveness of writing education in this context.
The Generative AI Paradox in Evaluation: “What It Can Solve, It May Not Evaluate”
Juhyun Oh | Eunsu Kim | Inha Cha | Alice Oh
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop
This paper explores the assumption that Large Language Models (LLMs) skilled in generation tasks are equally adept as evaluators. We assess the performance of three LLMs and one open-source LM in Question-Answering (QA) and evaluation tasks using the TriviaQA (Joshi et al., 2017) dataset. Results indicate a significant disparity, with LLMs exhibiting lower performance in evaluation tasks compared to generation tasks. Intriguingly, we discover instances of unfaithful evaluation where models accurately evaluate answers in areas where they lack competence, underscoring the need to examine the faithfulness and trustworthiness of LLMs as evaluators. This study contributes to the understanding of “the Generative AI Paradox” (West et al., 2023), highlighting a need to explore the correlation between generative excellence and evaluation proficiency, and the necessity to scrutinize the faithfulness aspect in model evaluations.
FIZZ: Factual Inconsistency Detection by Zoom-in Summary and Zoom-out Document
Joonho Yang | Seunghyun Yoon | ByeongJeong Kim | Hwanhee Lee
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
With the advent of pre-trained language models, there have been notable advancements in abstractive summarization systems. Simultaneously, a considerable number of novel methods for evaluating factual consistency in abstractive summarization systems have been developed. However, these evaluation approaches have substantial limitations, especially in refinement and interpretability. In this work, we propose FIZZ (Factual Inconsistency Detection by Zoom-in Summary and Zoom-out Document), a highly effective and interpretable factual inconsistency detection method for abstractive summarization systems based on fine-grained atomic facts decomposition. Moreover, we align atomic facts decomposed from the summary with the source document through adaptive granularity expansion. These atomic facts represent a more fine-grained unit of information, facilitating detailed understanding and interpretability of the summary’s factual inconsistency. Experimental results demonstrate that our proposed factual consistency checking system significantly outperforms existing systems. We release the code at https://github.com/plm3332/FIZZ.
Perceptions to Beliefs: Exploring Precursory Inferences for Theory of Mind in Large Language Models
Chani Jung | Dongkwan Kim | Jiho Jin | Jiseon Kim | Yeon Seonwoo | Yejin Choi | Alice Oh | Hyunwoo Kim
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
While humans naturally develop theory of mind (ToM), the capability to understand other people’s mental states and beliefs, state-of-the-art large language models (LLMs) underperform on simple ToM benchmarks. We posit that we can extend our understanding of LLMs’ ToM abilities by evaluating key human ToM precursors, perception inference and perception-to-belief inference, in LLMs. We introduce two datasets, Percept-ToMi and Percept-FANToM, to evaluate these precursory inferences for ToM in LLMs by annotating characters’ perceptions on ToMi and FANToM, respectively. Our evaluation of eight state-of-the-art LLMs reveals that the models generally perform well in perception inference while exhibiting limited capability in perception-to-belief inference (e.g., lack of inhibitory control). Based on these results, we present PercepToM, a novel ToM method leveraging LLMs’ strong perception inference capability while supplementing their limited perception-to-belief inference. Experimental results demonstrate that PercepToM significantly enhances LLMs’ performance, especially in false belief scenarios.
Towards Enhancing Coherence in Extractive Summarization: Dataset and Experiments with LLMs
Mihir Parmar | Hanieh Deilamsalehy | Franck Dernoncourt | Seunghyun Yoon | Ryan A. Rossi | Trung Bui
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Extractive summarization plays a pivotal role in natural language processing due to its wide-ranging applications in summarizing diverse content efficiently while remaining faithful to the original content. Despite significant advancements achieved in extractive summarization by Large Language Models (LLMs), these summaries frequently exhibit incoherence. An important aspect of a coherent summary is its readability for intended users. Although many datasets and benchmarks have been proposed for creating coherent extractive summaries, none of them currently incorporate user intent to improve coherence in extractive summarization. Motivated by this, we propose a systematically created human-annotated dataset consisting of coherent summaries for five publicly available datasets and natural language user feedback, offering valuable insights into how to improve coherence in extractive summaries. We utilize this dataset for aligning LLMs through supervised fine-tuning with natural language human feedback to enhance the coherence of their generated summaries. Preliminary experiments with Falcon-40B and Llama-2-13B show significant performance improvements (~10% Rouge-L) in terms of producing coherent summaries. We further utilize human feedback to benchmark results over instruction-tuned models such as FLAN-T5, which resulted in several interesting findings.
Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese
Rifki Afina Putri | Faiz Ghifari Haznitrama | Dea Adhista | Alice Oh
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large Language Models (LLMs) are increasingly being used to generate synthetic data for training and evaluating models. However, it is unclear whether they can generate a good quality of question answering (QA) dataset that incorporates knowledge and cultural nuance embedded in a language, especially for low-resource languages. In this study, we investigate the effectiveness of using LLMs in generating culturally relevant commonsense QA datasets for Indonesian and Sundanese languages. To do so, we create datasets for these languages using various methods involving both LLMs and human annotators, resulting in 4.5K questions per language (9K in total), making our dataset the largest of its kind. Our experiments show that automatic data adaptation from an existing English dataset is less effective for Sundanese. Interestingly, using the direct generation method on the target language, GPT-4 Turbo can generate questions with adequate general knowledge in both languages, albeit not as culturally ‘deep’ as humans. We also observe a higher occurrence of fluency errors in the Sundanese dataset, highlighting the discrepancy between medium- and lower-resource languages.
PDFTriage: Question Answering over Long, Structured Documents
Jon Saad-Falcon | Joe Barrow | Alexa Siu | Ani Nenkova | Seunghyun Yoon | Ryan A. Rossi | Franck Dernoncourt
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Large Language Models (LLMs) have issues with document question answering (QA) in situations where the document is unable to fit in the small context length of an LLM. To overcome this issue, most existing works focus on retrieving the relevant context from the document, representing them as plain text. However, documents such as PDFs, web pages, and presentations are naturally structured with different pages, tables, sections, and so on. Representing such structured documents as plain text is incongruous with the user’s mental model of these documents with rich structure. When a system has to query the document for context, this incongruity is brought to the fore, and seemingly trivial questions can trip up the QA system. To bridge this fundamental gap in handling structured documents, we propose an approach called PDFTriage that enables models to retrieve the context based on either structure or content. Our experiments demonstrate the effectiveness of the proposed PDFTriage-augmented models across several classes of questions where existing retrieval-augmented LLMs fail. To facilitate further research on this fundamental problem, we release our benchmark dataset consisting of 900+ human-generated questions over 80 structured documents from 10 different categories of question types for document QA. Our code and datasets will be released soon on GitHub.
HerO at AVeriTeC: The Herd of Open Large Language Models for Verifying Real-World Claims
Yejun Yoon | Jaeyoon Jung | Seunghyun Yoon | Kunwoo Park
Proceedings of the Seventh Fact Extraction and VERification Workshop (FEVER)
To tackle the AVeriTeC shared task hosted by FEVER-24, we introduce a system that only employs publicly available large language models (LLMs) for each step of automated fact-checking, dubbed the Herd of Open LLMs for verifying real-world claims (HerO). HerO employs multiple LLMs for each step of automated fact-checking. For evidence retrieval, a language model is used to enhance a query by generating hypothetical documents that check the veracity of a claim. We fine-tune LLMs for question generation and veracity prediction by crafting prompts with retrieved in-context samples. HerO achieved 2nd place on the leaderboard with an AVeriTeC score of 0.57, suggesting the potential of open LLMs for verifying real-world claims. For future research, we make our code publicly available at https://github.com/ssu-humane/HerO.
Fine-tuning CLIP Text Encoders with Two-step Paraphrasing
Hyunjae Kim | Seunghyun Yoon | Trung Bui | Handong Zhao | Quan Tran | Franck Dernoncourt | Jaewoo Kang
Findings of the Association for Computational Linguistics: EACL 2024
Contrastive language-image pre-training (CLIP) models have demonstrated considerable success across various vision-language tasks, such as text-to-image retrieval, where the model is required to effectively process natural language input to produce an accurate visual output. However, current models still face limitations in dealing with linguistic variations in input queries, such as paraphrases, making it challenging to handle a broad range of user queries in real-world applications. In this study, we introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases. Our approach involves a two-step paraphrase generation process, where we automatically create two categories of paraphrases from web-scale image captions by leveraging large language models. Subsequently, we fine-tune the CLIP text encoder using these generated paraphrases while freezing the image encoder. Our resulting model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks, including paraphrased retrieval (with rank similarity scores improved by up to 7.6% and 9.6%), Visual Genome Relation and Attribution, as well as seven semantic textual similarity tasks.
PEEB: Part-based Image Classifiers with an Explainable and Editable Language Bottleneck
Thang Pham | Peijie Chen | Tin Nguyen | Seunghyun Yoon | Trung Bui | Anh Nguyen
Findings of the Association for Computational Linguistics: NAACL 2024
CLIP-based classifiers rely on the prompt containing a class name that is known to the text encoder. Therefore, they perform poorly on new classes or the classes whose names rarely appear on the Internet (e.g., scientific names of birds). For fine-grained classification, we propose PEEB – an explainable and editable classifier to (1) express the class name into a set of text descriptors that describe the visual parts of that class; and (2) match the embeddings of the detected parts to their textual descriptors in each class to compute a logit score for classification. In a zero-shot setting where the class names are unknown, PEEB outperforms CLIP by a huge margin (∼10× in top-1 accuracy). Compared to part-based classifiers, PEEB is not only the state-of-the-art (SOTA) on the supervised-learning setting (88.80% and 92.20% accuracy on CUB-200 and Stanford Dogs-120, respectively) but also the first to enable users to edit the text descriptors to form a new classifier without any re-training. Compared to concept bottleneck models, PEEB is also the SOTA in both zero-shot and supervised-learning settings.
BEnQA: A Question Answering Benchmark for Bengali and English
Sheikh Shafayat | H M Quamran Hasan | Minhajur Rahman Chowdhury Mahim | Rifki Afina Putri | James Thorne | Alice Oh
Findings of the Association for Computational Linguistics: ACL 2024
In this study, we introduce BEnQA, a dataset comprising parallel Bengali and English exam questions for middle and high school levels in Bangladesh. Our dataset consists of approximately 5K questions covering several subjects in science with different types of questions, including factual, application, and reasoning-based questions. We benchmark several Large Language Models (LLMs) with our parallel dataset and observe a notable performance disparity between the models in Bengali and English. We also investigate some prompting methods, and find that Chain-of-Thought prompting is beneficial mostly on reasoning questions, but not so much on factual ones. We also find that appending English translation helps to answer questions in Bengali. Our findings point to promising future research directions for improving the performance of LLMs in Bengali and more generally in low-resource languages.
Assessing News Thumbnail Representativeness: Counterfactual text can enhance the cross-modal matching ability
Yejun Yoon | Seunghyun Yoon | Kunwoo Park
Findings of the Association for Computational Linguistics: ACL 2024
This paper addresses the critical challenge of assessing the representativeness of news thumbnail images, which often serve as the first visual engagement for readers when an article is disseminated on social media. We focus on whether a news image represents the actors discussed in the news text. To serve the challenge, we introduce NewsTT, a manually annotated dataset of 1000 news thumbnail images and text pairs. We found that the pretrained vision and language models, such as BLIP-2, struggle with this task. Since news subjects frequently involve named entities or proper nouns, the pretrained models could have a limited capability to match news actors’ visual and textual appearances. We hypothesize that learning to contrast news text with its counterfactual, of which named entities are replaced, can enhance the cross-modal matching ability of vision and language models. We propose CFT-CLIP, a contrastive learning framework that updates vision and language bi-encoders according to the hypothesis. We found that our simple method can boost the performance for assessing news thumbnail representativeness, supporting our assumption. Code and data can be accessed at https://github.com/ssu-humane/news-images-acl24.
Multi-hop Database Reasoning with Virtual Knowledge Graph
Juhee Son | Yeon Seonwoo | Seunghyun Yoon | James Thorne | Alice Oh
Proceedings of the 1st Workshop on Knowledge Graphs and Large Language Models (KaLLM 2024)
Application of LLMs to database queries over natural language sentences has demonstrated impressive results in both single- and multi-hop scenarios. In existing methodologies, the requirement to re-encode query vectors at each stage for processing multi-hop queries presents a significant bottleneck to inference speed. This paper proposes VKGFR (Virtual Knowledge Graph based Fact Retriever), which leverages large language models to extract representations corresponding to a sentence’s knowledge graph, significantly enhancing inference speed for multi-hop reasoning without performance loss. Given that both the queries and natural language database sentences can be structured as a knowledge graph, we suggest extracting a Virtual Knowledge Graph (VKG) representation from sentences with an LLM. Over the pre-constructed VKG, our VKGFR conducts retrieval with a tiny model structure, showing performance improvements with higher computational efficiency. We evaluate VKGFR on the WikiNLDB and MetaQA datasets, designed for multi-hop database reasoning over text. The results indicate 13x faster inference speed on the WikiNLDB dataset without performance loss.
pdf
bib
abs
KaPQA: Knowledge-Augmented Product Question-Answering
Swetha Eppalapally
|
Daksh Dangi
|
Chaithra Bhat
|
Ankita Gupta
|
Ruiyi Zhang
|
Shubham Agarwal
|
Karishma Bagga
|
Seunghyun Yoon
|
Nedim Lipka
|
Ryan Rossi
|
Franck Dernoncourt
Proceedings of the 3rd Workshop on Knowledge Augmented Methods for NLP
Question answering for domain-specific applications has recently attracted much interest due to the latest advancements in large language models (LLMs). However, accurately assessing the performance of these applications remains a challenge, mainly due to the lack of suitable benchmarks that effectively simulate real-world scenarios. To address this challenge, we introduce two product question-answering (QA) datasets focused on Adobe Acrobat and Photoshop products to help evaluate the performance of existing models on domain-specific product QA tasks. Additionally, we propose a novel knowledge-driven RAG-QA framework to enhance model performance on the product QA task. Our experiments demonstrate that inducing domain knowledge through query reformulation increases retrieval and generative performance compared to standard RAG-QA methods. This improvement, however, is slight, illustrating the challenge posed by the introduced datasets.
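A hedged sketch of what query reformulation with domain knowledge might look like inside a RAG-QA loop; the retriever and generator interfaces and the glossary are hypothetical stand-ins, not the paper's framework.

```python
def answer(query, product, glossary, retriever, generator):
    # Inject domain cues (e.g., canonical feature names for the product)
    # into the query before dense retrieval.
    cues = [term for term in glossary.get(product, [])
            if term.lower() in query.lower()]
    reformulated = f"[{product}] {query}" + (" | " + "; ".join(cues) if cues else "")
    passages = retriever(reformulated, top_k=5)   # hypothetical retriever API
    return generator(question=query, context=passages)
```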
pdf
bib
abs
CLIcK: A Benchmark Dataset of Cultural and Linguistic Intelligence in Korean
Eunsu Kim
|
Juyoung Suk
|
Philhoon Oh
|
Haneul Yoo
|
James Thorne
|
Alice Oh
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Despite the rapid development of large language models (LLMs) for the Korean language, there remains an obvious lack of benchmark datasets that test the requisite Korean cultural and linguistic knowledge. Because many existing Korean benchmark datasets are derived from English counterparts through translation, they often overlook the differing cultural contexts. For the few benchmark datasets sourced from Korean data capturing cultural knowledge, only narrow tasks such as hate speech detection are offered. To address this gap, we introduce a benchmark of Cultural and Linguistic Intelligence in Korean (CLIcK), a dataset comprising 1,995 QA pairs. CLIcK sources its data from official Korean exams and textbooks, partitioning the questions into eleven categories under the two main categories of language and culture. For each instance in CLIcK, we provide fine-grained annotation of which cultural and linguistic knowledge is required to correctly answer the question. Using CLIcK, we test 13 language models to assess their performance. Our evaluation uncovers insights into their performance across the categories, as well as the diverse factors affecting their comprehension. CLIcK offers the first large-scale, comprehensive, Korean-centric analysis of LLMs’ proficiency in the Korean language and culture.
pdf
bib
abs
RECIPE4U: Student-ChatGPT Interaction Dataset in EFL Writing Education
Jieun Han
|
Haneul Yoo
|
Junho Myung
|
Minsun Kim
|
Tak Yeon Lee
|
So-Yeon Ahn
|
Alice Oh
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
The integration of generative AI in education is expanding, yet empirical analyses of large-scale, real-world interactions between students and AI systems remain limited. Addressing this gap, we present RECIPE4U (RECIPE for University), a dataset sourced from a semester-long experiment with 212 college students in English as a Foreign Language (EFL) writing courses. During the study, students engaged in dialogues with ChatGPT to revise their essays. RECIPE4U includes comprehensive records of these interactions, including conversation logs, students’ intents, students’ self-rated satisfaction, and students’ essay edit histories. In particular, we annotate the students’ utterances in RECIPE4U with 13 intention labels based on our coding schemes. We establish baseline results for two subtasks in task-oriented dialogue systems within educational contexts: intent detection and satisfaction estimation. As a foundational step, we explore student-ChatGPT interaction patterns through RECIPE4U and analyze them by focusing on students’ dialogues, essay data statistics, and students’ essay edits. We further illustrate potential applications of the RECIPE4U dataset for enhancing the incorporation of LLMs in educational frameworks. RECIPE4U is publicly available at https://zeunie.github.io/RECIPE4U/.
pdf
bib
abs
Exploring Cross-Cultural Differences in English Hate Speech Annotations: From Dataset Construction to Analysis
Nayeon Lee
|
Chani Jung
|
Junho Myung
|
Jiho Jin
|
Jose Camacho-Collados
|
Juho Kim
|
Alice Oh
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Most hate speech datasets neglect the cultural diversity within a single language, resulting in a critical shortcoming in hate speech detection. To address this, we introduce CREHate, a CRoss-cultural English Hate speech dataset. To construct CREHate, we follow a two-step procedure: 1) cultural post collection and 2) cross-cultural annotation. We sample posts from the SBIC dataset, which predominantly represents North America, and collect posts from four geographically diverse English-speaking countries (Australia, the United Kingdom, Singapore, and South Africa) using culturally hateful keywords retrieved from our survey. Annotations are collected from the four countries plus the United States to establish representative labels for each country. Our analysis highlights statistically significant disparities across countries in hate speech annotations. Only 56.2% of the posts in CREHate achieve consensus among all countries, with the highest pairwise label difference rate at 26%. Qualitative analysis shows that label disagreement occurs mostly due to differing interpretations of sarcasm and annotators’ personal biases on divisive topics. Lastly, we evaluate large language models (LLMs) in a zero-shot setting and show that current LLMs tend to achieve higher accuracy on Anglosphere country labels in CREHate. Our dataset and code are available at: https://github.com/nlee0212/CREHate
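The two headline statistics above (56.2% all-country consensus; a 26% pairwise label difference rate) are straightforward to compute from per-country labels. A small sketch, assuming `labels` maps each post id to a {country: 0/1} dict:

```python
def consensus_rate(labels):
    """Fraction of posts on which every country assigns the same label."""
    return sum(len(set(post.values())) == 1 for post in labels.values()) / len(labels)

def pairwise_difference_rate(labels, c1, c2):
    """Fraction of posts labeled differently by countries c1 and c2."""
    posts = [p for p in labels.values() if c1 in p and c2 in p]
    return sum(p[c1] != p[c2] for p in posts) / len(posts)
```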
pdf
bib
abs
KoBBQ: Korean Bias Benchmark for Question Answering
Jiho Jin
|
Jiseon Kim
|
Nayeon Lee
|
Haneul Yoo
|
Alice Oh
|
Hwaran Lee
Transactions of the Association for Computational Linguistics, Volume 12
Warning: This paper contains examples of stereotypes and biases. The Bias Benchmark for Question Answering (BBQ) is designed to evaluate social biases of language models (LMs), but it is not simple to adapt this benchmark to cultural contexts other than the US because social biases depend heavily on the cultural context. In this paper, we present KoBBQ, a Korean bias benchmark dataset, and we propose a general framework that addresses considerations for cultural adaptation of a dataset. Our framework includes partitioning the BBQ dataset into three classes—Simply-Transferred (can be used directly after cultural translation), Target-Modified (requires localization in target groups), and Sample-Removed (does not fit Korean culture)—and adding four new categories of bias specific to Korean culture. We conduct a large-scale survey to collect and validate the social biases and the targets of the biases that reflect the stereotypes in Korean culture. The resulting KoBBQ dataset comprises 268 templates and 76,048 samples across 12 categories of social bias. We use KoBBQ to measure the accuracy and bias scores of several state-of-the-art multilingual LMs. The results clearly show differences in the bias of LMs as measured by KoBBQ and a machine-translated version of BBQ, demonstrating the need for and utility of a well-constructed, culturally aware social bias benchmark.
2023
pdf
bib
abs
SQuARe: A Large-Scale Dataset of Sensitive Questions and Acceptable Responses Created through Human-Machine Collaboration
Hwaran Lee
|
Seokhee Hong
|
Joonsuk Park
|
Takyoung Kim
|
Meeyoung Cha
|
Yejin Choi
|
Byoungpil Kim
|
Gunhee Kim
|
Eun-Ju Lee
|
Yong Lim
|
Alice Oh
|
Sangchul Park
|
Jung-Woo Ha
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The potential social harms that large language models pose, such as generating offensive content and reinforcing biases, are steeply rising. Existing works focus on coping with this concern while interacting with ill-intentioned users, such as those who explicitly make hate speech or elicit harmful responses. However, discussions on sensitive issues can become toxic even if the users are well-intentioned. For safer models in such scenarios, we present the Sensitive Questions and Acceptable Response (SQuARe) dataset, a large-scale Korean dataset of 49k sensitive questions with 42k acceptable and 46k non-acceptable responses. The dataset was constructed leveraging HyperCLOVA in a human-in-the-loop manner based on real news headlines. Experiments show that acceptable response generation significantly improves for HyperCLOVA and GPT-3, demonstrating the efficacy of this dataset.
pdf
bib
abs
Towards standardizing Korean Grammatical Error Correction: Datasets and Annotation
Soyoung Yoon
|
Sungjoon Park
|
Gyuwan Kim
|
Junhee Cho
|
Kihyo Park
|
Gyu Tae Kim
|
Minjoon Seo
|
Alice Oh
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Research on Korean grammatical error correction (GEC) is limited compared to other major languages such as English. We attribute this problematic circumstance to the lack of a carefully designed evaluation benchmark for Korean GEC. In this work, we collect three datasets from different sources (Kor-Lang8, Kor-Native, and Kor-Learner) that cover a wide range of Korean grammatical errors. Considering the nature of Korean grammar, we then define 14 error types for Korean and provide KAGAS (Korean Automatic Grammatical error Annotation System), which can automatically annotate error types from parallel corpora. We use KAGAS on our datasets to build an evaluation benchmark for Korean and present baseline models trained on our datasets. We show that a model trained with our datasets significantly outperforms the currently used statistical Korean GEC system (Hanspell) on a wider range of error types, demonstrating the diversity and usefulness of the datasets. The implementations and datasets are open-sourced.
pdf
bib
abs
Automatic Creation of Named Entity Recognition Datasets by Querying Phrase Representations
Hyunjae Kim
|
Jaehyo Yoo
|
Seunghyun Yoon
|
Jaewoo Kang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Most weakly supervised named entity recognition (NER) models rely on domain-specific dictionaries provided by experts. This approach is infeasible in many domains where dictionaries do not exist. While a phrase retrieval model was used to construct pseudo-dictionaries with entities retrieved from Wikipedia automatically in a recent study, these dictionaries often have limited coverage because the retriever is likely to retrieve popular entities rather than rare ones. In this study, we present a novel framework, HighGEN, that generates NER datasets with high-coverage pseudo-dictionaries. Specifically, we create entity-rich dictionaries with a novel search method, called phrase embedding search, which encourages the retriever to search a space densely populated with various entities. In addition, we use a new verification process based on the embedding distance between candidate entity mentions and entity types to reduce the false-positive noise in weak labels generated by high-coverage dictionaries. We demonstrate that HighGEN outperforms the previous best model by an average F1 score of 4.7 across five NER benchmark datasets.
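The two HighGEN ingredients named above, phrase-embedding search and embedding-distance verification, can be sketched as follows. The embedding inputs and the threshold are assumptions for illustration, not the paper's exact components.

```python
import numpy as np

def phrase_embedding_search(query_vec, phrase_vecs, phrases, top_k=50):
    """Retrieve candidate entities near a phrase embedding (cosine similarity)."""
    sims = phrase_vecs @ query_vec / (
        np.linalg.norm(phrase_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8)
    return [phrases[i] for i in np.argsort(-sims)[:top_k]]

def verify_weak_label(mention_vec, type_vec, threshold=0.5):
    """Keep a weak label only if the mention embedding is close to its entity type."""
    cos = mention_vec @ type_vec / (
        np.linalg.norm(mention_vec) * np.linalg.norm(type_vec) + 1e-8)
    return cos >= threshold
```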
pdf
bib
abs
Rethinking Annotation: Can Language Learners Contribute?
Haneul Yoo
|
Rifki Afina Putri
|
Changyoon Lee
|
Youngin Lee
|
So-Yeon Ahn
|
Dongyeop Kang
|
Alice Oh
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Researchers have traditionally recruited native speakers to provide annotations for the widely used benchmark datasets. But there are languages for which recruiting native speakers is difficult, and it would help to get learners of those languages to annotate the data. In this paper, we investigate whether language learners can contribute annotations to the benchmark datasets. In a carefully controlled annotation experiment, we recruit 36 language learners, provide two types of additional resources (dictionaries and machine-translated sentences), and perform mini-tests to measure their language proficiency. We target three languages, English, Korean, and Indonesian, and four NLP tasks, sentiment analysis, natural language inference, named entity recognition, and machine reading comprehension. We find that language learners, especially those with intermediate or advanced language proficiency, are able to provide fairly accurate labels with the help of additional resources. Moreover, we show that data annotation improves learners’ language proficiency in terms of vocabulary and grammar. The implication of our findings is that broadening the annotation task to include language learners can open up the opportunity to build benchmark datasets for languages for which it is difficult to recruit native speakers.
pdf
bib
abs
MeetingQA: Extractive Question-Answering on Meeting Transcripts
Archiki Prasad
|
Trung Bui
|
Seunghyun Yoon
|
Hanieh Deilamsalehy
|
Franck Dernoncourt
|
Mohit Bansal
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
With the ubiquitous use of online meeting platforms and robust automatic speech recognition systems, meeting transcripts have emerged as a promising domain for natural language tasks. Most recent works on meeting transcripts primarily focus on summarization and extraction of action items. However, meeting discussions also have a useful question-answering (QA) component, crucial to understanding the discourse or meeting content, and can be used to build interactive interfaces on top of long transcripts. Hence, in this work, we leverage this inherent QA component of meeting discussions and introduce MeetingQA, an extractive QA dataset comprising questions asked by meeting participants and corresponding responses. As a result, questions can be open-ended and actively seek discussions, while the answers can be multi-span and distributed across multiple speakers. Our comprehensive empirical study of several robust baselines, including long-context language models and recent instruction-tuned models, reveals that models perform poorly on this task (F1 = 57.3) and severely lag behind human performance (F1 = 84.6), thus presenting a challenging new task for the community to improve upon.
pdf
bib
abs
Ranking-Enhanced Unsupervised Sentence Representation Learning
Yeon Seonwoo
|
Guoyin Wang
|
Changmin Seo
|
Sajal Choudhary
|
Jiwei Li
|
Xiang Li
|
Puyang Xu
|
Sunghyun Park
|
Alice Oh
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Unsupervised sentence representation learning has progressed through contrastive learning and data augmentation methods such as dropout masking. Despite this progress, sentence encoders are still limited to using only an input sentence when predicting its semantic vector. In this work, we show that the semantic meaning of a sentence is also determined by nearest-neighbor sentences that are similar to the input sentence. Based on this finding, we propose a novel unsupervised sentence encoder, RankEncoder. RankEncoder predicts the semantic vector of an input sentence by leveraging its relationship with other sentences in an external corpus, as well as the input sentence itself. We evaluate RankEncoder on semantic textual benchmark datasets. From the experimental results, we verify that 1) RankEncoder achieves 80.07% Spearman’s correlation, a 1.1% absolute improvement compared to the previous state-of-the-art performance, 2) RankEncoder is universally applicable to existing unsupervised sentence embedding methods, and 3) RankEncoder is specifically effective for predicting the similarity scores of similar sentence pairs.
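A rough sketch of the neighborhood idea behind RankEncoder: refine an input sentence vector with its nearest neighbors from an external corpus. The mixing scheme below (mean of the top-k neighbors with a fixed interpolation weight) is an illustrative assumption rather than the paper's formulation.

```python
import numpy as np

def rank_enhanced_vector(v, corpus_vecs, k=16, alpha=0.5):
    """v: (dim,) input sentence vector; corpus_vecs: (N, dim); all L2-normalized."""
    neighbors = corpus_vecs[np.argsort(-(corpus_vecs @ v))[:k]]   # top-k by cosine
    refined = alpha * v + (1 - alpha) * neighbors.mean(axis=0)    # mix in neighborhood
    return refined / np.linalg.norm(refined)
```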
pdf
bib
abs
Hate Speech Classifiers are Culturally Insensitive
Nayeon Lee
|
Chani Jung
|
Alice Oh
Proceedings of the First Workshop on Cross-Cultural Considerations in NLP (C3NLP)
Increasingly, language models and machine translation are becoming valuable tools to help people communicate with others from diverse cultural backgrounds. However, current language models lack cultural awareness because they are trained on data representing only the culture within the dataset. This presents a problem in the context of hate speech classification, where cultural awareness is especially critical. This study aims to quantify the cultural insensitivity of three monolingual (Korean, English, Arabic) hate speech classifiers by evaluating their performance on translated datasets from the other two languages. Our research has revealed that hate speech classifiers evaluated on datasets from other cultures yield significantly lower F1 scores, up to almost 50%. In addition, they produce considerably higher false negative rates, with a magnitude up to five times greater, demonstrating the extent of the cultural gap. The study highlights the severity of cultural insensitivity of language models in hate speech classification.
pdf
bib
abs
PiC: A Phrase-in-Context Dataset for Phrase Understanding and Semantic Search
Thang Pham
|
Seunghyun Yoon
|
Trung Bui
|
Anh Nguyen
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
While contextualized word embeddings have become a de-facto standard, learning contextualized phrase embeddings is less explored and is hindered by the lack of a human-annotated benchmark that tests machine understanding of phrase semantics given a context sentence or paragraph (instead of phrases alone). To fill this gap, we propose PiC, a dataset of ∼28K noun phrases accompanied by their contextual Wikipedia pages and a suite of three tasks for training and evaluating phrase embeddings. Training on PiC improves ranking models’ accuracy and remarkably pushes span selection (SS) models (i.e., predicting the start and end index of the target phrase) near human accuracy, which is 95% Exact Match (EM) on semantic search given a query phrase and a passage. Interestingly, we find evidence that such impressive performance arises because the SS models learn to better capture the common meaning of a phrase regardless of its actual context. SotA models perform poorly in distinguishing two senses of the same phrase in two contexts (∼60% EM) and in estimating the similarity between two different phrases in the same context (∼70% EM).
pdf
bib
abs
Time-Aware Representation Learning for Time-Sensitive Question Answering
Jungbin Son
|
Alice Oh
Findings of the Association for Computational Linguistics: EMNLP 2023
Time is one of the crucial factors in real-world question answering (QA) problems. However, language models have difficulty understanding the relationships between time specifiers, such as ‘after’ and ‘before’, and numbers, since existing QA datasets do not include sufficient time expressions. To address this issue, we propose a Time-Context aware Question Answering (TCQA) framework. We suggest a Time-Context dependent Span Extraction (TCSE) task and build a time-context dependent data generation framework for model training. Moreover, we present a metric to evaluate the time awareness of a QA model using TCSE. The TCSE task consists of a question and four sentence candidates classified as correct or incorrect based on time and context. The model is trained to extract the answer span from the sentence that is correct in both time and context. The model trained with TCQA outperforms baseline models by up to 8.5 F1 points on the TimeQA dataset. Our dataset and code are available at https://github.com/sonjbin/TCQA
pdf
bib
abs
PR-MCS: Perturbation Robust Metric for MultiLingual Image Captioning
Yongil Kim
|
Yerin Hwang
|
Hyeongu Yun
|
Seunghyun Yoon
|
Trung Bui
|
Kyomin Jung
Findings of the Association for Computational Linguistics: EMNLP 2023
Vulnerability to lexical perturbation is a critical weakness of automatic evaluation metrics for image captioning. This paper proposes Perturbation Robust Multi-Lingual CLIPScore (PR-MCS), which exhibits robustness to such perturbations, as a novel reference-free image captioning metric applicable to multiple languages. To achieve perturbation robustness, we fine-tune the text encoder of CLIP with our language-agnostic method to distinguish perturbed text from the original text. To verify the robustness of PR-MCS, we introduce a new fine-grained evaluation dataset consisting of detailed captions, critical objects, and the relationships between the objects for 3,000 images in five languages. In our experiments, PR-MCS significantly outperforms baseline metrics in capturing lexical noise across all perturbation types in all five languages, while maintaining a strong correlation with human judgments.
2022
pdf
bib
abs
Virtual Knowledge Graph Construction for Zero-Shot Domain-Specific Document Retrieval
Yeon Seonwoo
|
Seunghyun Yoon
|
Franck Dernoncourt
|
Trung Bui
|
Alice Oh
Proceedings of the 29th International Conference on Computational Linguistics
Domain-specific documents cover terminologies and specialized knowledge, which has been the main challenge for domain-specific document retrieval systems. Previous approaches propose domain-adaptation and transfer-learning methods to alleviate this problem. However, these approaches still follow the same document representation method as earlier work: a document is embedded into a single vector. In this study, we propose VKGDR. VKGDR represents a given corpus as a graph of entities and their relations (known as a virtual knowledge graph) and computes the relevance between queries and documents based on this graph representation. We conduct three experiments: 1) domain-specific document retrieval, 2) a comparison of our virtual knowledge graph construction method with previous approaches, and 3) an ablation study on each component of our virtual knowledge graph. The results show that unsupervised VKGDR outperforms baselines in a zero-shot setting and even outperforms a fully supervised bi-encoder. We also verify that our virtual knowledge graph construction method yields better retrieval performance than previous approaches.
pdf
bib
abs
Medical Question Understanding and Answering with Knowledge Grounding and Semantic Self-Supervision
Khalil Mrini
|
Harpreet Singh
|
Franck Dernoncourt
|
Seunghyun Yoon
|
Trung Bui
|
Walter W. Chang
|
Emilia Farcas
|
Ndapa Nakashole
Proceedings of the 29th International Conference on Computational Linguistics
Current medical question answering systems have difficulty processing long, detailed and informally worded questions submitted by patients, called Consumer Health Questions (CHQs). To address this issue, we introduce a medical question understanding and answering system with knowledge grounding and semantic self-supervision. Our system is a pipeline that first summarizes a long, medical, user-written question, using a supervised summarization loss. Then, our system performs a two-step retrieval to return answers. The system first matches the summarized user question with an FAQ from a trusted medical knowledge base, and then retrieves a fixed number of relevant sentences from the corresponding answer document. In the absence of labels for question matching or answer relevance, we design 3 novel, self-supervised and semantically-guided losses. We evaluate our model against two strong retrieval-based question answering baselines. Evaluators ask their own questions and rate the answers retrieved by our baselines and our system according to their relevance. They find that our system retrieves more relevant answers, while achieving speeds 20 times faster. Our self-supervised losses also help the summarizer achieve higher scores in ROUGE, as well as in human evaluation metrics.
pdf
bib
abs
MACRONYM: A Large-Scale Dataset for Multilingual and Multi-Domain Acronym Extraction
Amir Pouran Ben Veyseh
|
Nicole Meister
|
Seunghyun Yoon
|
Rajiv Jain
|
Franck Dernoncourt
|
Thien Huu Nguyen
Proceedings of the 29th International Conference on Computational Linguistics
Acronym extraction (AE) is the task of identifying acronyms and their expanded forms in text, which is necessary for various NLP applications. Despite major progress on this task in recent years, one limitation of existing AE research is that it is restricted to English and certain domains (i.e., scientific and biomedical). The challenges of AE in other languages and domains remain largely unexplored, and the lack of annotated datasets in multiple languages and domains has been a major barrier to research in this direction. To address this limitation, we propose a new dataset for multilingual and multi-domain AE. Specifically, 27,200 sentences in 6 different languages and 2 new domains, i.e., legal and scientific, are manually annotated for AE. Our experiments on the dataset show that AE in different languages and learning settings has unique challenges, emphasizing the necessity of further research on multilingual and multi-domain AE.
pdf
bib
abs
Offensive Content Detection via Synthetic Code-Switched Text
Cesa Salaam
|
Franck Dernoncourt
|
Trung Bui
|
Danda Rawat
|
Seunghyun Yoon
Proceedings of the 29th International Conference on Computational Linguistics
The prevalent use of offensive content in social media has become an important concern for online platforms (customer service chatboxes, social media platforms, etc.). Classifying offensive and hate-speech content in online settings is an essential task for many applications that needs to be addressed accordingly. However, online text can contain code-switching, a combination of more than one language. The unavailability of labeled code-switched data for low-resource code-switching combinations adds to the difficulty of this problem. To overcome this, we release a real-world dataset containing around 10k samples for testing, covering three language combinations (en-fr, en-es, and en-de), and a synthetic code-switched textual dataset containing ~30k samples for training. In this paper, we describe the process of gathering the human-generated data and our algorithm for creating synthetic code-switched offensive content data. We also present the results of a keyword-classification baseline and a multilingual transformer-based classification model.
pdf
bib
abs
Keyphrase Prediction from Video Transcripts: New Dataset and Directions
Amir Pouran Ben Veyseh
|
Quan Hung Tran
|
Seunghyun Yoon
|
Varun Manjunatha
|
Hanieh Deilamsalehy
|
Rajiv Jain
|
Trung Bui
|
Walter W. Chang
|
Franck Dernoncourt
|
Thien Huu Nguyen
Proceedings of the 29th International Conference on Computational Linguistics
Keyphrase Prediction (KP) is an established NLP task, aiming to yield representative phrases to summarize the main content of a given document. Despite major progress in recent years, existing works on KP have mainly focused on formal texts such as scientific papers or weblogs. The challenges of KP in informal-text domains are not yet fully studied. To this end, this work studies new challenges of KP in transcripts of videos, an understudied domain for KP that involves informal texts and non-cohesive presentation styles. A bottleneck for KP research in this domain involves the lack of high-quality and large-scale annotated data that hinders the development of advanced KP models. To address this issue, we introduce a large-scale manually-annotated KP dataset in the domain of live-stream video transcripts obtained by automatic speech recognition tools. Concretely, transcripts of 500+ hours of videos streamed on the behance.net platform are manually labeled with important keyphrases. Our analysis of the dataset reveals the challenging nature of KP in transcripts. Moreover, for the first time in KP, we demonstrate the idea of improving KP for long documents (i.e., transcripts) by feeding models with paragraph-level keyphrases, i.e., hierarchical extraction. To foster future research, we will publicly release the dataset and code.
pdf
bib
abs
How does fake news use a thumbnail? CLIP-based Multimodal Detection on the Unrepresentative News Image
Hyewon Choi
|
Yejun Yoon
|
Seunghyun Yoon
|
Kunwoo Park
Proceedings of the Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situations
This study investigates how fake news uses thumbnail images for news articles. We aim to capture the degree of semantic incongruity between news text and image by using the pretrained CLIP representation. Motivated by the stylistic distinctiveness of fake news text, we examine whether fake news tends to use images irrelevant to the news content. Results show that fake news tends to have a higher degree of semantic incongruity than general news. We further attempt to detect such image-text incongruity by training classification models on a newly generated dataset. A manual evaluation suggests our method can find news articles whose thumbnail images are semantically irrelevant to the news text with an accuracy of 0.8. We also release a new dataset of image and news text pairs with incongruity labels, facilitating future studies in this direction.
pdf
bib
abs
Simple Questions Generate Named Entity Recognition Datasets
Hyunjae Kim
|
Jaehyo Yoo
|
Seunghyun Yoon
|
Jinhyuk Lee
|
Jaewoo Kang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Recent named entity recognition (NER) models often rely on human-annotated datasets requiring the vast engagement of professional knowledge on the target domain and entities. This work introduces an ask-to-generate approach, which automatically generates NER datasets by asking simple natural language questions to an open-domain question answering system (e.g., “Which disease?”). Despite using fewer training resources, our models solely trained on the generated datasets largely outperform strong low-resource models by 19.5 F1 score across six popular NER benchmarks. Our models also show competitive performance with rich-resource models that additionally leverage in-domain dictionaries provided by domain experts. In few-shot NER, we outperform the previous best model by 5.2 F1 score on three benchmarks and achieve new state-of-the-art performance.
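The ask-to-generate recipe can be pictured as a small weak-labeling loop: pose one simple typed question per entity type to an open-domain QA system and convert its extractive answers into NER spans. The `qa_system` interface below is a hypothetical stand-in for any extractive open-domain QA model, not the authors' exact setup.

```python
def generate_ner_data(passages, type_questions, qa_system):
    """type_questions: e.g., {"DISEASE": "Which disease?", "DRUG": "Which drug?"}."""
    examples = []
    for passage in passages:
        spans = []
        for entity_type, question in type_questions.items():
            for answer in qa_system(question=question, context=passage):
                start = passage.find(answer)   # map the answer string back to a span
                if start != -1:
                    spans.append((start, start + len(answer), entity_type))
        examples.append({"text": passage, "spans": spans})
    return examples
```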
pdf
bib
abs
IDK-MRC: Unanswerable Questions for Indonesian Machine Reading Comprehension
Rifki Afina Putri
|
Alice Oh
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Machine Reading Comprehension (MRC) has become one of the essential tasks in Natural Language Understanding (NLU), as it is often included in several NLU benchmarks (Liang et al., 2020; Wilie et al., 2020). However, most MRC datasets only contain answerable questions, overlooking the importance of unanswerable ones. MRC models trained only on answerable questions will select the span that is most likely to be the answer, even when the answer does not actually exist in the given passage (Rajpurkar et al., 2018). This problem especially remains in medium- to low-resource languages like Indonesian. Existing Indonesian MRC datasets (Purwarianti et al., 2007; Clark et al., 2020) are still inadequate because of their small size and limited question types, i.e., they only cover answerable questions. To fill this gap, we build a new Indonesian MRC dataset called I(n)don’tKnow-MRC (IDK-MRC) by combining automatic and manual unanswerable question generation to minimize the cost of manual dataset construction while maintaining dataset quality. Combined with the existing answerable questions, IDK-MRC consists of more than 10K questions in total. Our analysis shows that our dataset significantly improves the performance of Indonesian MRC models, with a large improvement on unanswerable questions.
pdf
bib
abs
KOLD: Korean Offensive Language Dataset
Younghoon Jeong
|
Juhyun Oh
|
Jongwon Lee
|
Jaimeen Ahn
|
Jihyung Moon
|
Sungjoon Park
|
Alice Oh
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Recent directions for offensive language detection are hierarchical modeling, identifying the type and the target of offensive language, and interpretability with offensive span annotation and prediction. These improvements are focused on English and do not transfer well to other languages because of cultural and linguistic differences. In this paper, we present the Korean Offensive Language Dataset (KOLD), comprising 40,429 comments annotated hierarchically with the type and the target of offensive language, accompanied by annotations of the corresponding text spans. We collect the comments from NAVER News and the YouTube platform and provide the titles of the articles and videos as context information for the annotation process. We use these annotated comments as training data for Korean BERT and RoBERTa models and find that they are effective at offensiveness detection, target classification, and target span detection, while having room for improvement in target group classification and offensive span detection. We discover that the target group distribution differs drastically from existing English datasets, and observe that providing the context information improves model performance in offensiveness detection (+0.3), target classification (+1.5), and target group classification (+13.1). We publicly release the dataset and baseline models.
pdf
bib
abs
Two-Step Question Retrieval for Open-Domain QA
Yeon Seonwoo
|
Juhee Son
|
Jiho Jin
|
Sang-Woo Lee
|
Ji-Hoon Kim
|
Jung-Woo Ha
|
Alice Oh
Findings of the Association for Computational Linguistics: ACL 2022
The retriever-reader pipeline has shown promising performance in open-domain QA but suffers from a very slow inference speed. Recently proposed question retrieval models tackle this problem by indexing question-answer pairs and searching for similar questions. These models have shown a significant increase in inference speed, but at the cost of lower QA performance compared to the retriever-reader models. This paper proposes a two-step question retrieval model, SQuID (Sequential Question-Indexed Dense retrieval) and distant supervision for training. SQuID uses two bi-encoders for question retrieval. The first-step retriever selects top-k similar questions, and the second-step retriever finds the most similar question from the top-k questions. We evaluate the performance and the computational efficiency of SQuID. The results show that SQuID significantly increases the performance of existing question retrieval models with a negligible loss on inference speed.
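The two-step retrieval in SQuID reduces to a coarse top-k pass followed by re-scoring only the k survivors with the second bi-encoder. A minimal sketch with precomputed question indexes; the encoder interfaces are assumptions.

```python
import numpy as np

def squid_retrieve(query, questions, encode_q1, index1, encode_q2, index2, k=100):
    """index1/index2: (N, dim) precomputed question embeddings, one per bi-encoder."""
    cand = np.argsort(-(index1 @ encode_q1(query)))[:k]   # step 1: coarse top-k
    scores = index2[cand] @ encode_q2(query)              # step 2: re-score only k
    return questions[cand[int(np.argmax(scores))]]
```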
pdf
bib
abs
Multimodal Intent Discovery from Livestream Videos
Adyasha Maharana
|
Quan Tran
|
Franck Dernoncourt
|
Seunghyun Yoon
|
Trung Bui
|
Walter Chang
|
Mohit Bansal
Findings of the Association for Computational Linguistics: NAACL 2022
Individuals, educational institutions, and businesses are prolific at generating instructional video content such as “how-to” and tutorial guides. While significant progress has been made in basic video understanding tasks, identifying procedural intent within these instructional videos is a challenging and important task that remains unexplored but essential to video summarization, search, and recommendations. This paper introduces the problem of instructional intent identification and extraction from software instructional livestreams. We construct and present a new multimodal dataset consisting of software instructional livestreams and containing manual annotations for both detailed and abstract procedural intent that enable training and evaluation of joint video and text understanding models. We then introduce a multimodal cascaded cross-attention model to efficiently combine the weaker and noisier video signal with the more discriminative text signal. Our experiments show that our proposed model brings significant gains compared to strong baselines, including large-scale pretrained multimodal models. Our analysis further identifies that the task benefits from spatial as well as motion features extracted from videos, and provides insight on how the video signal is preferentially used for intent discovery. We also show that current models struggle to comprehend the nature of abstract intents, revealing important gaps in multimodal understanding and paving the way for future work.
pdf
bib
abs
Fine-grained Image Captioning with CLIP Reward
Jaemin Cho
|
Seunghyun Yoon
|
Ajinkya Kale
|
Franck Dernoncourt
|
Trung Bui
|
Mohit Bansal
Findings of the Association for Computational Linguistics: NAACL 2022
Modern image captioning models are usually trained with text similarity objectives. However, since reference captions in public datasets often describe the most salient common objects, models trained with text similarity objectives tend to ignore specific and detailed aspects of an image that distinguish it from others. Towards more descriptive and distinctive caption generation, we propose to use CLIP, a multimodal encoder trained on huge image-text pairs from the web, to calculate multimodal similarity and use it as a reward function. We also propose a simple finetuning strategy for the CLIP text encoder to improve grammar that does not require extra text annotation. This completely eliminates the need for reference captions during the reward computation. To comprehensively evaluate descriptive captions, we introduce FineCapEval, a new dataset for caption evaluation with fine-grained criteria: overall, background, object, and relations. In our experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model. We also show that our unsupervised grammar finetuning of the CLIP text encoder alleviates the degeneration problem of the naive CLIP reward. Lastly, we present a human analysis in which the annotators strongly prefer the CLIP reward to the CIDEr and MLE objectives on diverse criteria.
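At a high level, using CLIP similarity as a reward amounts to a REINFORCE-style update on sampled captions. A hedged sketch; the baseline choice and the upstream CLIP-scoring step are assumptions, not the paper's exact training recipe.

```python
import torch

def clip_reward_loss(log_probs, clip_scores, baseline_scores):
    """log_probs: (B,) summed token log-probs of sampled captions.
    clip_scores: (B,) CLIP image-text similarity of the sampled captions.
    baseline_scores: (B,) e.g., CLIP scores of greedy-decoded captions."""
    advantage = (clip_scores - baseline_scores).detach()
    return -(advantage * log_probs).mean()   # maximize expected CLIP reward
```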
pdf
bib
abs
HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea
Haneul Yoo
|
Jiho Jin
|
Juhee Son
|
JinYeong Bak
|
Kyunghyun Cho
|
Alice Oh
Findings of the Association for Computational Linguistics: NAACL 2022
Historical records in Korea before the 20th century were primarily written in Hanja, an extinct language based on Chinese characters and not understood by modern Korean or Chinese speakers. Historians with expertise in this time period have been analyzing the documents, but that process is very difficult and time-consuming, and language models would significantly speed it up. Toward building and evaluating language models for Hanja, we release the Hanja Understanding Evaluation dataset, consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks. We also present BERT-based models further trained on the two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and Diaries of the Royal Secretariats. We compare the models with several baselines on all tasks and show that there are significant improvements gained by training on the two corpora. Additionally, we run zero-shot experiments on the Daily Records of the Royal Court and Important Officials (DRRI). The DRRI dataset has not been studied much by historians, and not at all by the NLP community.
pdf
bib
abs
Translating Hanja Historical Documents to Contemporary Korean and English
Juhee Son
|
Jiho Jin
|
Haneul Yoo
|
JinYeong Bak
|
Kyunghyun Cho
|
Alice Oh
Findings of the Association for Computational Linguistics: EMNLP 2022
The Annals of Joseon Dynasty (AJD) contain the daily records of the Kings of Joseon, the 500-year kingdom preceding the modern nation of Korea. The Annals were originally written in an archaic Korean writing system, ‘Hanja’, and were translated into Korean from 1968 to 1993. The resulting translation was, however, too literal and contained many archaic Korean words; thus, a new expert translation effort began in 2012. Since then, the records of only one king have been completed in a decade. In parallel, expert translators are working on an English translation, also at a slow pace, and have produced only one king’s records in English so far. Thus, we propose H2KE, a neural machine translation model that translates historical documents in Hanja to more easily understandable Korean and to English. Built on top of multilingual neural machine translation, H2KE learns to translate a historical document written in Hanja from both a full dataset of outdated Korean translations and a small dataset of more recently translated contemporary Korean and English. We compare our method against two baselines: a recent model that simultaneously learns to restore and translate Hanja historical documents, and a Transformer-based model trained only on newly translated corpora. The experiments reveal that our method significantly outperforms the baselines in terms of BLEU scores for both contemporary Korean and English translations. We further conduct an extensive human evaluation, which shows that our translations are preferred over the original expert translations by both experts and non-expert Korean speakers.
pdf
bib
abs
Why Knowledge Distillation Amplifies Gender Bias and How to Mitigate from the Perspective of DistilBERT
Jaimeen Ahn
|
Hwaran Lee
|
Jinhwa Kim
|
Alice Oh
Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP)
Knowledge distillation is widely used to transfer the language understanding of a large model to a smaller model. However, after knowledge distillation, the smaller model was found to be more gender-biased than the source large model. This paper studies what causes gender bias to increase through the knowledge distillation process. Moreover, we suggest applying a variant of mixup to knowledge distillation, used to increase generalizability during the distillation process rather than for augmentation. By doing so, we can significantly reduce the amplification of gender bias after knowledge distillation. We also conduct an experiment on the GLUE benchmark to demonstrate that even when mixup is applied, it does not have a significant adverse effect on the model’s performance.
pdf
bib
abs
Factual Error Correction for Abstractive Summaries Using Entity Retrieval
Hwanhee Lee
|
Cheoneum Park
|
Seunghyun Yoon
|
Trung Bui
|
Franck Dernoncourt
|
Juae Kim
|
Kyomin Jung
Proceedings of the Second Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)
Despite recent advancements in abstractive summarization systems leveraging large-scale datasets and pre-trained language models, the factual correctness of summaries is still insufficient. One line of work to mitigate this problem includes a post-editing process that can detect and correct factual errors in the summary. In building such a system, it is strongly required that the process 1) have a high success rate and interpretability and 2) run quickly. Previous approaches focus on regenerating the summary, resulting in low interpretability and high computational cost. In this paper, we propose RFEC, an efficient factual error correction system based on entity retrieval. RFEC first retrieves evidence sentences from the original document by comparing the sentences with the target summary, reducing the length of the text to analyze. Next, RFEC detects entity-level errors in the summaries using the evidence sentences and substitutes the wrong entities with accurate entities from the evidence sentences. Experimental results show that our proposed error correction system achieves more competitive performance than baseline methods in correcting factual errors, with a much faster speed.
pdf
bib
abs
CS1QA: A Dataset for Assisting Code-based Question Answering in an Introductory Programming Course
Changyoon Lee
|
Yeon Seonwoo
|
Alice Oh
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
We introduce CS1QA, a dataset for code-based question answering in the programming education domain. CS1QA consists of 9,237 question-answer pairs gathered from chat logs in an introductory programming class using Python, plus 17,698 unannotated chat entries with code. Each question is accompanied by the student’s code and the portion of the code relevant to answering the question. We carefully design the annotation process to construct CS1QA and analyze the collected dataset in detail. The tasks for CS1QA are to predict the question type, to identify the relevant code snippet given the question and the code, and to retrieve an answer from the annotated corpus. Results for experiments on several baseline models are reported and thoroughly analyzed. The tasks for CS1QA challenge models to understand both code and natural language. This unique dataset can be used as a benchmark for source code comprehension and question answering in the educational setting.
2021
pdf
bib
abs
A Gradually Soft Multi-Task and Data-Augmented Approach to Medical Question Understanding
Khalil Mrini
|
Franck Dernoncourt
|
Seunghyun Yoon
|
Trung Bui
|
Walter Chang
|
Emilia Farcas
|
Ndapa Nakashole
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Users of medical question answering systems often submit long and detailed questions, making it hard to achieve high recall in answer retrieval. To alleviate this problem, we propose a novel Multi-Task Learning (MTL) method with data augmentation for medical question understanding. We first establish an equivalence between the tasks of question summarization and Recognizing Question Entailment (RQE) using their definitions in the medical domain. Based on this equivalence, we propose a data augmentation algorithm to use just one dataset to optimize for both tasks, with a weighted MTL loss. We introduce gradually soft parameter-sharing: a constraint for decoder parameters to be close, that is gradually loosened as we move to the highest layer. We show through ablation studies that our proposed novelties improve performance. Our method outperforms existing MTL methods across 4 datasets of medical question pairs, in ROUGE scores, RQE accuracy and human evaluation. Finally, we show that our method fares better than single-task learning under 4 low-resource settings.
pdf
bib
abs
UMIC: An Unreferenced Metric for Image Captioning via Contrastive Learning
Hwanhee Lee
|
Seunghyun Yoon
|
Franck Dernoncourt
|
Trung Bui
|
Kyomin Jung
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
Despite the success of various text generation metrics such as BERTScore, it is still difficult to evaluate image captions without enough reference captions due to the diversity of the descriptions. In this paper, we introduce UMIC, an Unreferenced Metric for Image Captioning that does not require reference captions to evaluate image captions. Based on Vision-and-Language BERT, we train UMIC to discriminate negative captions via contrastive learning. We also observe critical problems with the previous benchmark dataset of human annotations for image captioning metrics, and introduce a new collection of human annotations on generated captions. We validate UMIC on four datasets, including our new dataset, and show that UMIC achieves a higher correlation than all previous metrics that require multiple references.
pdf
bib
abs
UCSD-Adobe at MEDIQA 2021: Transfer Learning and Answer Sentence Selection for Medical Summarization
Khalil Mrini
|
Franck Dernoncourt
|
Seunghyun Yoon
|
Trung Bui
|
Walter Chang
|
Emilia Farcas
|
Ndapa Nakashole
Proceedings of the 20th Workshop on Biomedical Language Processing
In this paper, we describe our approach to question summarization and multi-answer summarization in the context of the 2021 MEDIQA shared task (Ben Abacha et al., 2021). We propose two kinds of transfer learning for the abstractive summarization of medical questions. First, we train on HealthCareMagic, a large question summarization dataset collected from an online healthcare service platform. Second, we leverage the ability of the BART encoder-decoder architecture to model both generation and classification tasks to train on the task of Recognizing Question Entailment (RQE) in the medical domain. We show that both transfer learning methods combined achieve the highest ROUGE scores. Finally, we cast the question-driven extractive summarization of multiple relevant answer documents as an Answer Sentence Selection (AS2) problem. We show how we can preprocess the MEDIQA-AnS dataset such that it can be trained in an AS2 setting. Our AS2 model is able to generate extractive summaries achieving high ROUGE scores.
pdf
bib
abs
Mitigating Language-Dependent Ethnic Bias in BERT
Jaimeen Ahn
|
Alice Oh
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
In this paper, we study ethnic bias and how it varies across languages by analyzing and mitigating ethnic bias in monolingual BERT for English, German, Spanish, Korean, Turkish, and Chinese. To observe and quantify ethnic bias, we develop a novel metric called Categorical Bias score. Then we propose two methods for mitigation; first using a multilingual model, and second using contextual word alignment of two monolingual models. We compare our proposed methods with monolingual BERT and show that these methods effectively alleviate the ethnic bias. Which of the two methods works better depends on the amount of NLP resources available for that language. We additionally experiment with Arabic and Greek to verify that our proposed methods work for a wider variety of languages.
pdf
bib
abs
Efficient Contrastive Learning via Novel Data Augmentation and Curriculum Learning
Seonghyeon Ye
|
Jiseon Kim
|
Alice Oh
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
We introduce EfficientCL, a memory-efficient continual pretraining method that applies contrastive learning with novel data augmentation and curriculum learning. For data augmentation, we stack two types of operations sequentially: cutoff and PCA jittering. As pretraining proceeds, we apply curriculum learning by incrementing the augmentation degree at each difficulty step. After data augmentation, contrastive learning is applied to projected embeddings of the original and augmented examples. When finetuned on the GLUE benchmark, our model outperforms baseline models, especially on sentence-level tasks. Moreover, this improvement is achieved with only 70% of the computational memory of the baseline model.
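The two stacked augmentations named above can be sketched under simple assumptions: "cutoff" zeroes a contiguous span of token embeddings, and "PCA jittering" perturbs embeddings along principal components. The curriculum then increments `ratio` and `strength` at each difficulty step; the exact operators are assumptions about the general recipe.

```python
import numpy as np

def cutoff(token_embs, ratio):
    """Zero out a random contiguous span covering `ratio` of the sequence.
    token_embs: (seq_len, dim) array."""
    out = token_embs.copy()
    span = max(1, int(len(out) * ratio))
    start = np.random.randint(0, len(out) - span + 1)
    out[start:start + span] = 0.0
    return out

def pca_jitter(token_embs, components, strength):
    """Add noise along principal directions; components: (k, dim)."""
    noise = np.random.normal(0.0, strength, size=components.shape[0])
    return token_embs + noise @ components   # broadcast over tokens
```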
pdf
bib
abs
Few-Shot Intent Detection via Contrastive Pre-Training and Fine-Tuning
Jianguo Zhang
|
Trung Bui
|
Seunghyun Yoon
|
Xiang Chen
|
Zhiwei Liu
|
Congying Xia
|
Quan Hung Tran
|
Walter Chang
|
Philip Yu
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
In this work, we focus on a more challenging few-shot intent detection scenario where many intents are fine-grained and semantically similar. We present a simple yet effective few-shot intent detection schema via contrastive pre-training and fine-tuning. Specifically, we first conduct self-supervised contrastive pre-training on collected intent datasets, which implicitly learns to discriminate semantically similar utterances without using any labels. We then perform few-shot intent detection together with supervised contrastive learning, which explicitly pulls utterances from the same intent closer and pushes utterances across different intents farther. Experimental results show that our proposed method achieves state-of-the-art performance on three challenging intent detection datasets under 5-shot and 10-shot settings.
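The supervised contrastive step described above corresponds to the generic SupCon objective: same-intent utterances are positives, all others negatives. A minimal sketch, not the authors' exact implementation; it assumes each batch contains at least one positive pair.

```python
import torch
import torch.nn.functional as F

def supcon_loss(embeddings, labels, tau=0.1):
    """embeddings: (B, dim); labels: (B,) intent ids."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.T / tau
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))          # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    has_pos = pos.any(dim=1)                                 # anchors with a positive
    pos_log_prob = log_prob.masked_fill(~pos, 0.0)
    loss = -pos_log_prob.sum(1)[has_pos] / pos.sum(1)[has_pos]
    return loss.mean()
```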
pdf
bib
abs
Dimensional Emotion Detection from Categorical Emotion
Sungjoon Park
|
Jiseon Kim
|
Seonghyeon Ye
|
Jaeyeol Jeon
|
Hee Young Park
|
Alice Oh
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
We present a model to predict fine-grained emotions along the continuous dimensions of valence, arousal, and dominance (VAD) using a corpus with categorical emotion annotations. Our model is trained by minimizing the EMD (Earth Mover’s Distance) loss between the predicted VAD score distribution and the categorical emotion distributions sorted along VAD, and it can simultaneously classify the emotion categories and predict the VAD scores for a given sentence. We use pre-trained RoBERTa-Large, fine-tune on three different corpora with categorical labels, and evaluate on the EmoBank corpus with VAD scores. We show that our approach reaches comparable performance to that of the state-of-the-art classifiers in categorical emotion classification and shows significant positive correlations with the ground-truth VAD scores. Also, further training with supervision of VAD labels leads to improved performance, especially when the dataset is small. We also present examples of predictions of appropriate emotion words that are not part of the original annotations.
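Because the emotion categories are sorted along a VAD dimension, the EMD between two discrete distributions reduces to a distance between their cumulative distributions. A sketch of that 1-D case; the exact variant used (e.g., absolute vs. squared CDF difference) is an assumption.

```python
import torch

def emd_loss_1d(pred_probs, target_probs):
    """(batch, n_categories) distributions over categories pre-sorted along a
    VAD dimension; rows sum to 1."""
    cdf_pred = torch.cumsum(pred_probs, dim=1)
    cdf_target = torch.cumsum(target_probs, dim=1)
    return (cdf_pred - cdf_target).abs().mean()   # 1-D Wasserstein via CDFs
```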
pdf
bib
abs
Learning Bill Similarity with Annotated and Augmented Corpora of Bills
Jiseon Kim
|
Elden Griggs
|
In Song Kim
|
Alice Oh
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Bill writing is a critical element of representative democracy. However, it is often overlooked that most legislative bills are derived, or even directly copied, from other bills. Despite the significance of bill-to-bill linkages for understanding the legislative process, existing approaches fail to address semantic similarities across bills, let alone the reordering and paraphrasing that are prevalent in legal document writing. In this paper, we overcome these limitations by proposing a 5-class classification task that closely reflects the nature of the bill generation process. In doing so, we construct a human-labeled dataset of 4,721 bill-to-bill relationships at the subsection level and release this annotated dataset to the research community. To augment the dataset, we generate synthetic data with varying degrees of similarity, mimicking the complex bill writing process. We use BERT variants and apply multi-stage training, sequentially fine-tuning our models with the synthetic and human-labeled datasets. We find that predictive performance improves significantly when training with both human-labeled and synthetic data. Finally, we apply our trained model to infer section- and bill-level similarities. Our analysis shows that the proposed methodology successfully captures similarities across legal documents at various levels of aggregation.
pdf
bib
Weakly Supervised Pre-Training for Multi-Hop Retriever
Yeon Seonwoo
|
Sang-Woo Lee
|
Ji-Hoon Kim
|
Jung-Woo Ha
|
Alice Oh
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
pdf
bib
abs
Knowledge-Enhanced Evidence Retrieval for Counterargument Generation
Yohan Jo
|
Haneul Yoo
|
JinYeong Bak
|
Alice Oh
|
Chris Reed
|
Eduard Hovy
Findings of the Association for Computational Linguistics: EMNLP 2021
Finding counterevidence to statements is key to many tasks, including counterargument generation. We build a system that, given a statement, retrieves counterevidence from diverse sources on the Web. At the core of this system is a natural language inference (NLI) model that determines whether a candidate sentence is valid counterevidence. Most NLI models to date, however, lack the reasoning abilities necessary to find counterevidence that involves complex inference. Thus, we present a knowledge-enhanced NLI model that aims to handle causality- and example-based inference by incorporating knowledge graphs. Our NLI model outperforms baselines on NLI tasks, especially for instances that require the targeted inference. In addition, this NLI model further improves the counterevidence retrieval system, notably finding complex counterevidence better.
pdf
bib
abs
QACE: Asking Questions to Evaluate an Image Caption
Hwanhee Lee
|
Thomas Scialom
|
Seunghyun Yoon
|
Franck Dernoncourt
|
Kyomin Jung
Findings of the Association for Computational Linguistics: EMNLP 2021
In this paper, we propose QACE, a new metric based on Question Answering for Caption Evaluation, which evaluates image captions using Question Generation (QG) and Question Answering (QA) systems. QACE generates questions about the evaluated caption and checks its content by asking those questions against either the reference caption or the source image. We first develop QACE_Ref, which compares the answers derived from the evaluated caption to those from its reference, and report results competitive with state-of-the-art metrics. To go further, we propose QACE_Img, which asks the questions directly of the image instead of the reference. A visual-QA system is necessary for QACE_Img; unfortunately, standard VQA models are framed as classification over only a few thousand categories. Instead, we propose Visual-T5, an abstractive VQA system. The resulting metric, QACE_Img, is multi-modal, reference-less, and explainable. Our experiments show that QACE_Img compares favorably with other reference-less metrics.
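The QACE_Ref procedure can be summarized as a short pipeline. The helpers generate_questions and answer below are hypothetical stand-ins for the QG and QA systems, and the exact string matching of answers is a simplification of a softer answer-similarity comparison.

    def qace_ref(candidate_caption, reference_caption, generate_questions, answer):
        # generate_questions(text) -> list of questions about the text's content
        # answer(question, text)   -> answer string extracted from the text
        questions = generate_questions(candidate_caption)
        if not questions:
            return 0.0
        agree = sum(
            answer(q, candidate_caption) == answer(q, reference_caption)
            for q in questions
        )
        return agree / len(questions)

QACE_Img follows the same shape, with answer(q, reference_caption) replaced by a visual-QA call on the source image.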
pdf
bib
abs
KPQA: A Metric for Generative Question Answering Using Keyphrase Weights
Hwanhee Lee
|
Seunghyun Yoon
|
Franck Dernoncourt
|
Doo Soon Kim
|
Trung Bui
|
Joongbo Shin
|
Kyomin Jung
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
In the automatic evaluation of generative question answering (GenQA) systems, it is difficult to assess the correctness of generated answers due to the free-form nature of the answers. In particular, widely used n-gram similarity metrics often fail to discriminate incorrect answers because they weight all tokens equally. To alleviate this problem, we propose the KPQA metric, a new metric for evaluating the correctness of GenQA. Specifically, our metric assigns a different weight to each token via keyphrase prediction, thereby judging whether a generated answer sentence captures the key meaning of the reference answer. To evaluate our metric, we create high-quality human judgments of correctness on two GenQA datasets. Using these human-evaluation datasets, we show that our proposed metric correlates significantly better with human judgments than existing metrics. Code for the KPQA metric will be available at
https://github.com/hwanheelee1993/KPQA.
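The core idea, weighting token overlap by predicted keyphrase importance, can be sketched as a weighted F1. Here keyphrase_weight is a hypothetical stand-in for the learned keyphrase predictor, and the overall combination is an illustration rather than the exact KPQA formulation.

    def kpqa_like_score(generated_tokens, reference_tokens, keyphrase_weight):
        # keyphrase_weight(token, context) -> importance weight in [0, 1]
        ref_set, gen_set = set(reference_tokens), set(generated_tokens)
        p_num = sum(keyphrase_weight(t, generated_tokens)
                    for t in generated_tokens if t in ref_set)
        p_den = sum(keyphrase_weight(t, generated_tokens) for t in generated_tokens)
        r_num = sum(keyphrase_weight(t, reference_tokens)
                    for t in reference_tokens if t in gen_set)
        r_den = sum(keyphrase_weight(t, reference_tokens) for t in reference_tokens)
        precision = p_num / p_den if p_den else 0.0
        recall = r_num / r_den if r_den else 0.0
        denom = precision + recall
        return 2 * precision * recall / denom if denom else 0.0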
2020
pdf
bib
abs
Fast and Accurate Deep Bidirectional Language Representations for Unsupervised Learning
Joongbo Shin
|
Yoonhyung Lee
|
Seunghyun Yoon
|
Kyomin Jung
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Even though BERT has achieved successful performance improvements on various supervised learning tasks, it is still limited by repetitive inference on unsupervised tasks when computing contextual language representations. To resolve this limitation, we propose a novel deep bidirectional language model called the Transformer-based Text Autoencoder (T-TA). The T-TA computes contextual language representations without repetition and displays the benefits of a deep bidirectional architecture such as BERT's. In computation-time experiments in a CPU environment, the proposed T-TA performs over six times faster than the BERT-like model on a reranking task and twelve times faster on a semantic similarity task. Furthermore, the T-TA shows competitive or even better accuracy than BERT on the above tasks. Code is available at
https://github.com/joongbo/tta.
pdf
bib
abs
Speaker Sensitive Response Evaluation Model
JinYeong Bak
|
Alice Oh
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Automatic evaluation of open-domain dialogue response generation is very challenging because there are many appropriate responses for a given context. Existing evaluation models merely compare the generated response with the ground truth response and rate many of the appropriate responses as inappropriate if they deviate from the ground truth. One approach to resolving this problem is to consider the similarity of the generated response to the conversational context. In this paper, we propose an automatic evaluation model based on that idea and learn the model parameters from an unlabeled conversation corpus. Our approach considers the speakers in defining different levels of similar context. We use a Twitter conversation corpus that contains many speakers and conversations to test our evaluation model. Experiments show that our model outperforms existing evaluation metrics in terms of correlation with human annotation scores. We also show that our model trained on Twitter can be applied to movie dialogues without any additional training. We provide our code and the learned parameters so that they can be used for automatic evaluation of dialogue response generation models.
pdf
bib
abs
Context-Aware Answer Extraction in Question Answering
Yeon Seonwoo
|
Ji-Hoon Kim
|
Jung-Woo Ha
|
Alice Oh
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Extractive QA models have shown very promising performance in predicting the correct answer to a question for a given passage. However, they sometimes predict the correct answer text in a context irrelevant to the given question. This discrepancy becomes especially important as the number of occurrences of the answer text in a passage increases. To resolve this issue, we propose BLANC (BLock AttentioN for Context prediction) based on two main ideas: context prediction as an auxiliary task in a multi-task learning manner, and a block attention method that learns the context prediction task. With experiments on reading comprehension, we show that BLANC outperforms state-of-the-art QA models, and the performance gap increases as the number of answer text occurrences increases. We also train the models on SQuAD and predict the supporting facts on HotpotQA, showing that BLANC outperforms all baseline models in this zero-shot setting.
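The multi-task setup, span prediction plus an auxiliary context-prediction objective, can be sketched as a combined loss. The mixing weight and tensor shapes below are illustrative assumptions, not the paper's reported hyperparameters.

    import torch.nn.functional as F

    def blanc_style_loss(start_logits, end_logits, context_logits,
                         start_pos, end_pos, context_labels, aux_weight=0.5):
        # Main task: standard extractive-QA answer-span prediction.
        span_loss = F.cross_entropy(start_logits, start_pos) \
                    + F.cross_entropy(end_logits, end_pos)
        # Auxiliary task: per-token prediction of whether each token lies in
        # the context block supporting the answer (the block-attention idea).
        # context_logits: (batch, seq, 2); context_labels: (batch, seq).
        context_loss = F.cross_entropy(context_logits.transpose(1, 2), context_labels)
        return span_loss + aux_weight * context_loss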
pdf
bib
abs
Suicidal Risk Detection for Military Personnel
Sungjoon Park
|
Kiwoong Park
|
Jaimeen Ahn
|
Alice Oh
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
We analyze social media for detecting the suicidal risk of military personnel, which is especially crucial for countries with compulsory military service such as the Republic of Korea. From a widely used Korean social Q&A site, we collect posts containing military-relevant content written by active-duty military personnel. We then annotate the posts with two groups of experts: military experts and mental health experts. Our dataset includes 2,791 posts with 13,955 corresponding expert annotations of suicidal risk levels, and this dataset is available to researchers who consent to a research ethics agreement. Using various fine-tuned state-of-the-art language models, we predict the level of suicide risk, reaching an F1 score of 0.88 for classifying the risks.
pdf
bib
abs
ViLBERTScore: Evaluating Image Caption Using Vision-and-Language BERT
Hwanhee Lee
|
Seunghyun Yoon
|
Franck Dernoncourt
|
Doo Soon Kim
|
Trung Bui
|
Kyomin Jung
Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems
In this paper, we propose an evaluation metric for image captioning systems that uses both image and text information. Unlike previous methods that rely on textual representations in evaluating the caption, our approach uses visiolinguistic representations. The proposed method generates image-conditioned embeddings for each token using ViLBERT from both the generated and reference texts. Then, the contextual embeddings from each sentence of the pair are compared to compute a similarity score. Experimental results on three benchmark datasets show that our method correlates significantly better with human judgments than all existing metrics.
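Comparing the two sets of image-conditioned token embeddings follows the same greedy-matching pattern as BERTScore. A minimal sketch, assuming the embeddings have already been produced by ViLBERT and combining precision and recall into an F-score:

    import torch
    import torch.nn.functional as F

    def greedy_match_score(cand_embs, ref_embs):
        # cand_embs: (n_cand, dim), ref_embs: (n_ref, dim): image-conditioned
        # token embeddings of the generated and reference captions.
        cand = F.normalize(cand_embs, dim=-1)
        ref = F.normalize(ref_embs, dim=-1)
        sim = cand @ ref.t()                      # pairwise cosine similarities
        precision = sim.max(dim=1).values.mean()  # best match per candidate token
        recall = sim.max(dim=0).values.mean()     # best match per reference token
        return 2 * precision * recall / (precision + recall)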
pdf
bib
abs
Propagate-Selector: Detecting Supporting Sentences for Question Answering via Graph Neural Networks
Seunghyun Yoon
|
Franck Dernoncourt
|
Doo Soon Kim
|
Trung Bui
|
Kyomin Jung
Proceedings of the Twelfth Language Resources and Evaluation Conference
In this study, we propose a novel graph neural network called Propagate-Selector (PS), which propagates information over sentences to understand information that cannot be inferred when considering sentences in isolation. First, we design a graph structure in which each node represents an individual sentence and some pairs of nodes are selectively connected based on the text structure. Then, we develop an iterative attentive aggregation method and a skip-combine method in which a node interacts with its neighboring nodes to accumulate the necessary information. To evaluate the performance of the proposed approach, we conduct experiments on the standard HotpotQA dataset. The empirical results demonstrate the superiority of our proposed approach, which obtains the best performance compared with widely used answer-selection models that do not consider intersentential relationships.
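One propagation hop, attentive aggregation over neighboring sentence nodes followed by a skip-combine with the node's own state, can be sketched as below. The dot-product attention and the concatenation-based combine are illustrative choices, not necessarily the paper's exact parameterization.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AttentiveHop(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.combine = nn.Linear(2 * dim, dim)

        def forward(self, h, adj):
            # h: (n_nodes, dim) sentence-node states; adj: (n_nodes, n_nodes) 0/1.
            scores = (h @ h.t()).masked_fill(adj == 0, float('-inf'))
            attn = torch.nan_to_num(F.softmax(scores, dim=-1))  # isolated nodes -> 0
            neighbor_msg = attn @ h                             # attentive aggregation
            return F.relu(self.combine(torch.cat([h, neighbor_msg], dim=-1)))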
2019
pdf
bib
abs
Variational Hierarchical User-based Conversation Model
JinYeong Bak
|
Alice Oh
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
Generating appropriate conversation responses requires careful modeling of the utterances and speakers together. Some recent approaches to response generation model both the utterances and the speakers, but these approaches tend to generate responses that are overly tailored to the speakers. To overcome this limitation, we propose a new model with a stochastic variable designed to capture the speaker information and deliver it to the conversational context. An important part of this model is the network of speakers, in which each speaker is connected to one or more conversational partners, and this network is then used to model the speakers better. To test whether our model generates more appropriate conversation responses, we build a new conversation corpus containing approximately 27,000 speakers and 770,000 conversations. With this corpus, we run experiments on generating conversational responses and compare our model with other state-of-the-art models. Through automatic evaluation metrics and human evaluation, we show that our model outperforms other models in generating appropriate responses. An additional advantage of our model is that it generates better responses for various new user scenarios, for example when one of the speakers is a known user in our corpus but the partner is a new user. For replicability, we make available all our code and data.
pdf
bib
abs
Additive Compositionality of Word Vectors
Yeon Seonwoo
|
Sungjoon Park
|
Dongkwan Kim
|
Alice Oh
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)
The additive compositionality of word embedding models has been studied from empirical and theoretical perspectives. Prior work justifying the additive compositionality of existing word embedding models requires the rather strong assumption of a uniform word distribution. In this paper, we relax that assumption and propose more realistic conditions for proving additive compositionality, and we develop a novel word and sub-word embedding model that satisfies additive compositionality under those conditions. We then empirically show our model’s improved semantic representation performance on word similarity and noisy sentence similarity.
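The property under study can be stated compactly. This is the standard formulation of additive compositionality rather than the paper's exact theorem:

    v(w_1 \circ w_2) \approx v(w_1) + v(w_2)

where w_1 \circ w_2 denotes the composed phrase and v(\cdot) is the embedding map; the paper's contribution is establishing when this approximation holds without assuming a uniform word distribution.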
pdf
bib
abs
Conversation Model Fine-Tuning for Classifying Client Utterances in Counseling Dialogues
Sungjoon Park
|
Donghyun Kim
|
Alice Oh
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
The recent surge of text-based online counseling applications enables us to collect and analyze interactions between counselors and clients. A dataset of those interactions can be used to learn to automatically classify the client utterances into categories that help counselors in diagnosing client status and predicting counseling outcome. With proper anonymization, we collect counselor-client dialogues, define meaningful categories of client utterances with professional counselors, and develop a novel neural network model for classifying the client utterances. The central idea of our model, ConvMFiT, is a pre-trained conversation model which consists of a general language model built from an out-of-domain corpus and two role-specific language models built from unlabeled in-domain dialogues. The classification result shows that ConvMFiT outperforms state-of-the-art comparison models. Further, the attention weights in the learned model confirm that the model finds expected linguistic patterns for each category.
pdf
bib
abs
Surf at MEDIQA 2019: Improving Performance of Natural Language Inference in the Clinical Domain by Adopting Pre-trained Language Model
Jiin Nam
|
Seunghyun Yoon
|
Kyomin Jung
Proceedings of the 18th BioNLP Workshop and Shared Task
While deep learning techniques have shown promising results in many natural language processing (NLP) tasks, they have not been widely applied to the clinical domain. The lack of large datasets and the pervasive use of domain-specific language (i.e., abbreviations and acronyms) cause slower progress in clinical NLP than in general-domain NLP. To fill this gap, we employ word/subword-level models that adopt large-scale data-driven methods such as pre-trained language models and transfer learning for analyzing clinical text. Empirical results demonstrate the superiority of the proposed methods, achieving 90.6% accuracy on a medical-domain natural language inference task. Furthermore, we inspect the independent strengths of the proposed approaches in quantitative and qualitative manners. This analysis will help researchers select the necessary components when building models for the medical domain.
2018
pdf
bib
abs
Conversational Decision-Making Model for Predicting the King’s Decision in the Annals of the Joseon Dynasty
JinYeong Bak
|
Alice Oh
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Styles of leaders when they make decisions in groups vary, and the different styles affect the performance of the group. To understand the key words and speakers associated with decisions, we initially formalize the problem as one of predicting leaders’ decisions from discussions with group members. As a dataset, we introduce conversational meeting records from a historical corpus, and we develop a hierarchical RNN structure with attention and pre-trained speaker embeddings in the form of a Conversational Decision-Making Model (CDMM). The CDMM outperforms other baselines in predicting leaders’ final decisions from the data. We explain why the CDMM works better than other methods by showing the key words and speakers discovered from the attention weights as evidence.
pdf
bib
abs
Hierarchical Dirichlet Gaussian Marked Hawkes Process for Narrative Reconstruction in Continuous Time Domain
Yeon Seonwoo
|
Alice Oh
|
Sungjoon Park
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
In news and discussions, many articles and posts are provided without their related previous articles or posts. Hence, it is difficult to understand the context from which the articles and posts have arisen. In this paper, we propose the Hierarchical Dirichlet Gaussian Marked Hawkes process (HD-GMHP) for reconstructing the narratives and thread structures of news articles and discussion posts. HD-GMHP unifies three modeling strategies from previous research: temporal characteristics, triggering event relations, and meta information of text in news articles and discussion threads. To show the effectiveness of the model, we perform experiments in narrative reconstruction and thread reconstruction with real-world datasets: articles from the New York Times and a corpus of Wikipedia conversations. The experimental results show that HD-GMHP outperforms the LDA, HDP, and HDHP baselines on both tasks.
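The temporal backbone of any marked Hawkes model is its conditional intensity. A generic form (not the paper's full hierarchical Dirichlet-Gaussian specification) is

    \lambda(t) = \mu + \sum_{t_i < t} \alpha \, \kappa(t - t_i)

where \mu is the base rate, the sum runs over past events, and \kappa is a decay kernel such as \kappa(\Delta) = \beta e^{-\beta \Delta}; each past article or post thus transiently raises the rate of triggered follow-ups.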
pdf
bib
abs
Learning to Rank Question-Answer Pairs Using Hierarchical Recurrent Encoder with Latent Topic Clustering
Seunghyun Yoon
|
Joongbo Shin
|
Kyomin Jung
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)
In this paper, we propose a novel end-to-end neural architecture for ranking candidate answers that adapts a hierarchical recurrent neural network and a latent topic clustering module. With our proposed model, a text is encoded into a vector representation from the word level to the chunk level to effectively capture its entire meaning. In particular, by adopting the hierarchical structure, our model shows very small performance degradation on longer text comprehension, while other state-of-the-art recurrent neural network models suffer from it. Additionally, the latent topic clustering module extracts semantic information from target samples. This clustering module is useful for any text-related task, as it allows each data sample to find its nearest topic cluster, helping the neural network model analyze the entire dataset. We evaluate our models on the Ubuntu Dialogue Corpus and a consumer-electronics-domain question answering dataset related to Samsung products. The proposed model achieves state-of-the-art results for ranking question-answer pairs.
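The latent topic clustering module can be sketched as a small learnable memory: each encoded text is softly assigned to its nearest topic vectors, and the resulting topic summary is appended to its representation. The number of topics and the soft assignment below are illustrative assumptions.

    import torch
    import torch.nn as nn

    class LatentTopicClustering(nn.Module):
        def __init__(self, dim, n_topics=8):
            super().__init__()
            self.topics = nn.Parameter(torch.randn(n_topics, dim))

        def forward(self, x):
            # x: (batch, dim) encoded texts.
            weights = torch.softmax(x @ self.topics.t(), dim=-1)  # (batch, n_topics)
            topic_vec = weights @ self.topics                     # soft cluster summary
            return torch.cat([x, topic_vec], dim=-1)              # enriched representation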
pdf
bib
abs
Subword-level Word Vector Representations for Korean
Sungjoon Park
|
Jeongmin Byun
|
Sion Baek
|
Yongseok Cho
|
Alice Oh
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Research on distributed word representations has focused on widely used languages such as English. Although the same methods can be used for other languages, language-specific knowledge can enhance the accuracy and richness of word vector representations. In this paper, we look at improving distributed word representations for Korean using knowledge about the unique linguistic structure of Korean. Specifically, we decompose Korean words to the jamo level, beyond the character level, allowing a systematic use of subword information. To evaluate the vectors, we develop Korean test sets for word similarity and analogy and make them publicly available. The results show that our simple method outperforms word2vec and character-level Skip-Grams on semantic and syntactic similarity and analogy tasks and contributes positively to downstream NLP tasks such as sentiment analysis.
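The jamo-level decomposition exploits the regular layout of composed Hangul syllables in Unicode, so it needs no external resources. A minimal sketch (the handling of the empty final consonant is a simplification; the paper may use a placeholder symbol for it):

    # Composed syllables occupy U+AC00..U+D7A3 with
    # code = 0xAC00 + (cho * 21 + jung) * 28 + jong.
    CHO = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")
    JUNG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")
    JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

    def to_jamo(word):
        jamos = []
        for ch in word:
            code = ord(ch) - 0xAC00
            if 0 <= code < 11172:              # a composed Hangul syllable
                cho, rest = divmod(code, 588)  # 588 = 21 * 28
                jung, jong = divmod(rest, 28)
                jamos += [CHO[cho], JUNG[jung]]
                if JONG[jong]:
                    jamos.append(JONG[jong])
            else:
                jamos.append(ch)               # pass non-Hangul through
        return jamos

    # to_jamo("한국") -> ['ㅎ', 'ㅏ', 'ㄴ', 'ㄱ', 'ㅜ', 'ㄱ']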
pdf
bib
abs
Comparative Studies of Detecting Abusive Language on Twitter
Younghun Lee
|
Seunghyun Yoon
|
Kyomin Jung
Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)
The context-dependent nature of online aggression makes annotating large collections of data extremely difficult. Previously studied datasets in abusive language detection have been insufficient in size to efficiently train deep learning models. Recently, Hate and Abusive Speech on Twitter, a dataset much greater in size and reliability, has been released. However, this dataset has not yet been studied to its full potential. In this paper, we conduct the first comparative study of various learning models on Hate and Abusive Speech on Twitter and discuss the possibility of using additional features and context data for improvements. Experimental results show that a bidirectional GRU network trained on word-level features, with latent topic clustering modules, is the most accurate model, scoring an F1 of 0.805.
2017
pdf
bib
abs
Rotated Word Vector Representations and their Interpretability
Sungjoon Park
|
JinYeong Bak
|
Alice Oh
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
Vector representations of words improve performance in various NLP tasks, but high-dimensional word vectors are very difficult to interpret. We apply several rotation algorithms to the vector representations of words to improve their interpretability. Unlike previous approaches that induce sparsity, the rotated vectors are interpretable while preserving the expressive performance of the original vectors. Furthermore, any prebuilt word vector representation can be rotated for improved interpretability. We apply rotation to Skip-Gram and GloVe vectors and compare their expressive power and interpretability with the original vectors and with sparse overcomplete vectors. The results show that the rotated vectors outperform the original and the sparse overcomplete vectors on interpretability and expressiveness tasks.
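Varimax is a representative member of the rotation family explored here: it finds an orthogonal rotation that concentrates each dimension's mass on a few words, which is what makes the dimensions readable. A minimal NumPy sketch of the standard varimax iteration, offered as an illustration rather than the paper's exact procedure:

    import numpy as np

    def varimax(Phi, gamma=1.0, max_iter=100, tol=1e-6):
        # Phi: (n_words, n_dims) embedding matrix; returns the rotated matrix.
        p, k = Phi.shape
        R = np.eye(k)
        d = 0.0
        for _ in range(max_iter):
            Lam = Phi @ R
            u, s, vt = np.linalg.svd(
                Phi.T @ (Lam ** 3 - (gamma / p) * Lam @ np.diag((Lam ** 2).sum(axis=0)))
            )
            R = u @ vt
            d_new = s.sum()
            if d_new < d * (1 + tol):          # converged
                break
            d = d_new
        return Phi @ R

Because the rotation is orthogonal, pairwise dot products among rotated vectors are preserved, which is consistent with the claim that expressiveness is not sacrificed for interpretability.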
pdf
bib
abs
Joint Modeling of Topics, Citations, and Topical Authority in Academic Corpora
Jooyeon Kim
|
Dongwoo Kim
|
Alice Oh
Transactions of the Association for Computational Linguistics, Volume 5
Much of scientific progress stems from previously published findings, but searching through the vast sea of scientific publications is difficult. We often rely on metrics of scholarly authority to find prominent authors, but these authority indices do not differentiate authority based on research topics. We present Latent Topical-Authority Indexing (LTAI) for jointly modeling the topics, citations, and topical authority in a corpus of academic papers. Compared to previous models, LTAI differs in two main aspects. First, it explicitly models the generative process of the citations, rather than treating the citations as given. Second, it models each author’s influence on citations of a paper based on the topics of the cited papers, as well as the citing papers. We fit LTAI to four academic corpora: CORA, Arxiv Physics, PNAS, and Citeseer. We compare the performance of LTAI against various baselines, from latent Dirichlet allocation to more advanced models, including the author-link topic model and the dynamic author citation topic model. The results show that LTAI achieves improved accuracy over similar models when predicting the words, citations, and authors of publications.
2016
pdf
bib
Proceedings of the First Workshop on NLP and Computational Social Science
David Bamman
|
A. Seza Doğruöz
|
Jacob Eisenstein
|
Dirk Hovy
|
David Jurgens
|
Brendan O’Connor
|
Alice Oh
|
Oren Tsur
|
Svitlana Volkova
Proceedings of the First Workshop on NLP and Computational Social Science
2015
pdf
bib
Five Centuries of Monarchy in Korea: Mining the Text of the Annals of the Joseon Dynasty
JinYeong Bak
|
Alice Oh
Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)
2014
pdf
bib
Self-disclosure topic model for classifying and analyzing Twitter conversations
JinYeong Bak
|
Chin-Yew Lin
|
Alice Oh
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)
pdf
bib
Proceedings of the Joint Workshop on Social Dynamics and Personal Attributes in Social Media
Alice Oh
|
Benjamin Van Durme
|
David Yarowsky
|
Oren Tsur
|
Svitlana Volkova
Proceedings of the Joint Workshop on Social Dynamics and Personal Attributes in Social Media
pdf
bib
Self-disclosure topic model for Twitter conversations
JinYeong Bak
|
Chin-Yew Lin
|
Alice Oh
Proceedings of the Joint Workshop on Social Dynamics and Personal Attributes in Social Media
2012
pdf
bib
Self-Disclosure and Relationship Strength in Twitter Conversations
JinYeong Bak
|
Suin Kim
|
Alice Oh
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
2008
pdf
bib
Generating Baseball Summaries from Multiple Perspectives by Reordering Content
Alice Oh
|
Howard Shrobe
Proceedings of the Fifth International Natural Language Generation Conference
2000
pdf
bib
Stochastic Language Generation for Spoken Dialogue Systems
Alice H. Oh
|
Alexander I. Rudnicky
ANLP-NAACL 2000 Workshop: Conversational Systems