Hyo Jin Do


2025

Multi-Level Explanations for Generative Language Models
Lucas Monteiro Paes | Dennis Wei | Hyo Jin Do | Hendrik Strobelt | Ronny Luss | Amit Dhurandhar | Manish Nagireddy | Karthikeyan Natesan Ramamurthy | Prasanna Sattigeri | Werner Geyer | Soumya Ghosh
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Despite the increasing use of large language models (LLMs) for context-grounded tasks like summarization and question answering, understanding what makes an LLM produce a certain response is challenging. We propose Multi-Level Explanations for Generative Language Models (MExGen), a technique to provide explanations for context-grounded text generation. MExGen assigns scores to parts of the context to quantify their influence on the model's output. It extends attribution methods like LIME and SHAP to LLMs used in context-grounded tasks where (1) inference cost is high, (2) input text is long, and (3) the output is itself text. We conduct a systematic evaluation, both automated and human, of perturbation-based attribution methods for summarization and question answering. The results show that our framework can provide more faithful explanations of generated output than available alternatives, including LLM self-explanations. We open-source code for MExGen as part of the ICX360 toolkit: https://github.com/IBM/ICX360.
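
To make the perturbation-based idea concrete, here is a minimal leave-one-out sketch: each context part is ablated in turn, and its score is the drop in similarity between the original output and the output generated without it. The function names (generate, similarity) and the leave-one-out scheme are illustrative assumptions, not the actual MExGen algorithm or the ICX360 API.

```python
# Hypothetical sketch of perturbation-based attribution for context-grounded
# generation, in the spirit of LIME/SHAP-style scoring. Not the MExGen code.
from typing import Callable, List

def attribute_context_parts(
    context_parts: List[str],                  # e.g. sentences of the context
    query: str,                                # the question or instruction
    generate: Callable[[str], str],            # wraps an LLM call: prompt -> text
    similarity: Callable[[str, str], float],   # compares two output texts, in [0, 1]
) -> List[float]:
    """Score each context part by how much removing it changes the output."""
    full_prompt = " ".join(context_parts) + "\n" + query
    reference = generate(full_prompt)
    scores = []
    for i in range(len(context_parts)):
        ablated = " ".join(p for j, p in enumerate(context_parts) if j != i)
        perturbed = generate(ablated + "\n" + query)
        # A large drop in similarity means part i strongly influenced the output.
        scores.append(1.0 - similarity(reference, perturbed))
    return scores
```

The "multi-level" in the name suggests scoring at more than one granularity of the context (coarser spans before finer ones), which matters when inference cost is high; the sketch above shows only a single flat level, the simplest instance of the family.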

Synthetic Data for Evaluation: Supporting LLM-as-a-Judge Workflows with EvalAssist
Martín Santillán Cooper | Zahra Ashktorab | Hyo Jin Do | Erik Miehling | Werner Geyer | Jasmina Gajcin | Elizabeth M. Daly | Qian Pan | Michael Desmond
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We present a synthetic data generation tool integrated into EvalAssist, a web-based application that supports human-centered evaluation of language model outputs by letting users refine LLM-as-a-Judge evaluation criteria. The synthetic data generation tool in EvalAssist is tailored for evaluation contexts and informed by findings from user studies with AI practitioners. Participants identified key pain points in current workflows, including circularity risks (where models are judged by criteria the models themselves derived), compounded bias (amplification of biases across multiple stages of a pipeline), and poor support for edge cases, and they expressed a strong preference for real-world grounding and fine-grained control. In response, our tool supports flexible prompting, RAG-based grounding, persona diversity, and iterative generation workflows. We also incorporate features for quality assurance and edge case discovery.
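
As an illustration of how grounding, persona diversity, and iteration can combine in such a workflow, here is a minimal sketch. The prompt template, the retrieve() hook, and the persona list are assumptions made for illustration; this is not the EvalAssist implementation or its API.

```python
# Illustrative persona-diverse, RAG-grounded, iterative generation loop for
# producing synthetic examples to test an LLM-as-a-Judge criterion.
from typing import Callable, List

def generate_eval_examples(
    criterion: str,                        # the judge criterion under test
    personas: List[str],                   # e.g. "terse expert", "confused novice"
    retrieve: Callable[[str], List[str]],  # RAG hook: query -> grounding passages
    generate: Callable[[str], str],        # wraps an LLM call: prompt -> text
    rounds: int = 2,                       # iterative refinement passes
) -> List[str]:
    """Produce grounded, persona-diverse synthetic outputs for evaluation."""
    examples = []
    grounding = "\n".join(retrieve(criterion))  # real-world grounding via retrieval
    for persona in personas:
        prompt = (
            f"Grounding documents:\n{grounding}\n\n"
            f"Writing as a {persona}, produce a response that exercises the "
            f"evaluation criterion: {criterion}. Include borderline cases."
        )
        example = generate(prompt)
        for _ in range(rounds - 1):
            # Iterative workflow: push the example toward an edge case
            # instead of regenerating it from scratch.
            example = generate(f"Make this a harder edge case:\n{example}")
        examples.append(example)
    return examples
```

Generating from an LLM other than the one being judged, as the loop above permits via the caller-supplied generate(), is one way to mitigate the circularity risk the user studies identified.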

2015

Korean Twitter Emotion Classification Using Automatically Built Emotion Lexicons and Fine-Grained Features
Hyo Jin Do | Ho-Jin Choi
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation: Posters