Yulong He
2026
Semantic vs. Structural Signals: Log-Probability and LLM-as-a-Judge for Reference-Free Code Evaluation
Dmitriy Fedrushkov | Yulong He | Ivan Smirnov | Artem Aliev | Sergey Kovalchuk
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Reference-free evaluation of LLM-generated code is essential when execution-based testing is unavailable or costly. We compare two paradigms: explicit LLM-as-a-Judge scoring, which assigns a quality score to a solution, and log-probability scoring, which uses log P_θ(code | task) as an instruction-free signal. Across HumanEval-X, we find that the two approaches capture qualitatively different aspects of code correctness. Explicit judges, particularly larger models, perform strongly on generated code, reflecting their ability to reason about task-solution alignment, but fail to distinguish correct solutions from minimally mutated ones. Log-probability exhibits the opposite pattern: weaker performance on generated code, but consistent pairwise separation of canonical from mutated solutions. These results reveal a discrimination-ranking dissociation and show that the two paradigms provide complementary, non-interchangeable signals: explicit judges capture semantic correctness, while log-probability captures local structural consistency.
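To make the log-probability signal concrete, the sketch below shows one way a sequence-level score could be formed from per-token log-probabilities of the code tokens, conditioned on the task prompt. By the chain rule, the sequence log-probability is the sum of the per-token terms. The function names, the optional length normalization, and the numeric values are illustrative assumptions, not the authors' implementation; the abstract does not say whether scores are normalized.

```python
import math
from typing import Sequence


def sequence_log_prob(token_log_probs: Sequence[float], normalize: bool = False) -> float:
    """Combine per-token log-probabilities of the code tokens into one score.

    token_log_probs holds log P(c_t | task, c_<t) for each code token, as
    returned by some language model (obtaining them is outside this sketch).
    With normalize=True the sum is divided by the token count; whether the
    paper normalizes is not stated, so this flag is an assumption.
    """
    if not token_log_probs:
        raise ValueError("need at least one code token")
    total = math.fsum(token_log_probs)
    return total / len(token_log_probs) if normalize else total


def prefers_first(first: Sequence[float], second: Sequence[float],
                  normalize: bool = False) -> bool:
    """Pairwise check: does the first solution score strictly higher?"""
    return sequence_log_prob(first, normalize) > sequence_log_prob(second, normalize)


# Hypothetical per-token log-probabilities, invented for illustration only.
canonical = [-0.25, -0.5, -0.25]
mutated = [-0.25, -2.0, -0.25]  # one token made much less likely by a mutation

print(sequence_log_prob(canonical))
print(prefers_first(canonical, mutated))
```

A pairwise comparison of this kind corresponds to the canonical-versus-mutated separation described in the abstract: a small local edit lowers the likelihood of the affected tokens, which lowers the overall score even when no task-level reasoning is involved.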