Dmitriy Fedrushkov
2026
Semantic vs. Structural Signals: Log-Probability and LLM-as-a-Judge for Reference-Free Code Evaluation
Dmitriy Fedrushkov | Yulong He | Ivan Smirnov | Artem Aliev | Sergey Kovalchuk
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Reference-free evaluation of LLM-generated code is essential when execution-based testing is unavailable or costly. We compare two paradigms: explicit LLM-as-a-Judge scoring, which assigns a quality score to a solution, and log-probability scoring, which uses log P_θ(code | task) as an instruction-free signal. Across HumanEval-X, we find that the two approaches capture qualitatively different aspects of code correctness. Explicit judges, particularly larger models, perform strongly on generated code, reflecting their ability to reason about task-solution alignment, but fail to distinguish correct solutions from minimally mutated ones. Log-probability exhibits the opposite pattern: weaker performance on generated code, but consistent pairwise separation of canonical from mutated solutions. These results reveal a discrimination-ranking dissociation and show that the two paradigms provide complementary, non-interchangeable signals: explicit judges capture semantic correctness, while log-probability captures local structural consistency.
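As a rough illustration of the log-probability signal and the pairwise comparison the abstract mentions, the sketch below assumes the per-token log-probabilities of each solution's code tokens (conditioned on the task prompt) have already been obtained from some language model. The function names `sequence_log_prob` and `pairwise_separation`, and the optional length normalization, are hypothetical choices made for this example and are not taken from the paper.

```python
from typing import Sequence


def sequence_log_prob(token_log_probs: Sequence[float], normalize: bool = False) -> float:
    """Sum the log-probabilities of the code tokens, i.e. log P(code | task).

    `token_log_probs` is assumed to hold one log-probability per code token,
    each conditioned on the task prompt and the preceding code tokens.
    Length normalization is an illustrative option, not the paper's stated method.
    """
    if not token_log_probs:
        raise ValueError("need at least one code token")
    total = sum(token_log_probs)
    return total / len(token_log_probs) if normalize else total


def pairwise_separation(canonical: Sequence[float], mutated: Sequence[float]) -> float:
    """Fraction of (canonical, mutated) pairs where the canonical score is strictly higher."""
    pairs = list(zip(canonical, mutated))
    if not pairs:
        raise ValueError("need at least one pair")
    wins = sum(1 for c, m in pairs if c > m)
    return wins / len(pairs)
```

A value of `pairwise_separation` near 1.0 would indicate that the score consistently ranks each canonical solution above its mutated counterpart; how the paper itself aggregates these comparisons is not specified in the abstract.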
2025
Predictive Modeling of Human Developers’ Evaluative Judgment of Generated Code as a Decision Process
Sergey Kovalchuk | Yanyu Li | Dmitriy Fedrushkov
Proceedings of the Fourth Workshop on Bridging Human-Computer Interaction and Natural Language Processing (HCI+NLP)
The paper presents early results in the development of an approach to predictive modeling of how human developers perceive code generated in question-answering scenarios with Large Language Model (LLM) applications. The study focuses on building a model that describes and predicts implicit human behavior during evaluative judgment of generated code, based on its consistency, correctness, and usefulness as subjectively perceived characteristics. We use a Markov Decision Process (MDP) as the basic framework to describe the human developer and their perception. We consider two approaches (regression-based and classification-based) to identifying the MDP parameters so that the model can be used as an “artificial” developer for human-centered code evaluation. An experimental evaluation of the proposed approach was performed with survey data previously collected for several code generation LLMs in a question-answering scenario. The results show overall good performance of the proposed model in acceptance rate prediction (accuracy 0.82) and offer promising prospects for further development and application.