Heondeuk Lee

2026

E-star 12B: Reliable Rubric-Following and Domain-Adaptive SLM Evaluator for Korean Industrial Settings
Yonghoon Kwon | Heondeuk Lee | Barom Kang
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)

Automatic evaluation in industrial settings requires models to interpret and apply natural language rubrics reliably under language and domain shift. This challenge is compounded when reference answers are unavailable and proprietary models cannot be deployed due to data-governance constraints. We present E-Star-12B, a 12B-parameter evaluator for Korean industrial environments that jointly addresses rubric following and domain adaptation. Our approach combines a structured evaluation format (feedback, highlight, and decision) with a 6K high-confidence training set built via multi-stage consensus-based filtering. We introduce two benchmarks: Ko Feedback Bench for rubric-following evaluation under Korean language transfer, and RAG Quality Bench for domain-specific evaluation in financial and legal settings. E-Star-12B achieves the strongest rubric alignment among small language models on Ko Feedback Bench, improving Pearson correlation by +0.173 over its base model. On RAG Quality Bench, the domain-adapted variant approaches frontier-model performance with more stable adaptation than general instruct models. These results suggest that strong rubric-following capability serves as a reliable scaffold for subsequent domain adaptation.

2025

CAC-CoT: Connector-Aware Compact Chain-of-Thought for Efficient Reasoning Data Synthesis Across Dual-System Cognitive Tasks
Sunguk Choi | Yonghoon Kwon | Heondeuk Lee
Findings of the Association for Computational Linguistics: EMNLP 2025

Long chain-of-thought (CoT) prompting helps Large Language Models (LLMs) solve difficult problems, but very long traces often slow or even degrade performance on fast, intuitive “System-1” tasks. We introduce Connector-Aware Compact CoT (CAC-CoT), a method that deliberately restricts reasoning to a small, fixed set of connector phrases, steering the model toward concise and well-structured explanations. Despite its simplicity, our synthetic method with general-purpose LLMs yields high-quality training data. CAC-CoT achieves ≈ 85% on GSM8K and ≈ 40% on GPQA (System-2) while also achieving ≈ 85% on S1-Bench (System-1), surpassing the baseline by over 20%. Its reasoning traces average ≈ 300 tokens (ART), about one-third the length of baseline traces, delivering higher efficiency without loss of accuracy.
