Sergey Kovalchuk
2026
Semantic vs. Structural Signals: Log-Probability and LLM-as-a-Judge for Reference-Free Code Evaluation
Dmitriy Fedrushkov | Yulong He | Ivan Smirnov | Artem Aliev | Sergey Kovalchuk
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Reference-free evaluation of LLM-generated code is essential when execution-based testing is unavailable or costly. We compare two paradigms: explicit LLM-as-a-Judge scoring, which assigns a quality score to a solution, and log-probability scoring, which uses log P_θ(code | task) as an instruction-free signal. Across HumanEval-X, we find that the two approaches capture qualitatively different aspects of code correctness. Explicit judges, particularly larger models, perform strongly on generated code, reflecting their ability to reason about task-solution alignment, but fail to distinguish correct solutions from minimally mutated ones. Log-probability exhibits the opposite pattern: weaker performance on generated code, but consistent pairwise separation of canonical from mutated solutions. These results reveal a discrimination-ranking dissociation and show that the two paradigms provide complementary, non-interchangeable signals: explicit judges capture semantic correctness, while log-probability captures local structural consistency.
2025
Predictive Modeling of Human Developers' Evaluative Judgment of Generated Code as a Decision Process
Sergey Kovalchuk | Yanyu Li | Dmitriy Fedrushkov
Proceedings of the Fourth Workshop on Bridging Human-Computer Interaction and Natural Language Processing (HCI+NLP)
The paper presents early results in the development of an approach to predictive modeling of human developer perceiving of code generated in question-answering scenarios with Large Language Model (LLM) applications. The study is focused on building a model for the description and prediction of human implicit behavior during evaluative judgment of generated code through evaluation of its consistency, correctness, and usefulness as subjective perceiving characteristics. We used Markov Decision Process (MDP) as a basic framework to describe the human developer and his/her perceiving. We consider two approaches (regression-based and classification-based) to identify MDP parameters so it can be used as an "artificial" developer for human-centered code evaluation. An experimental evaluation of the proposed approach was performed with survey data previously collected for several code generation LLMs in a question-answering scenario. The results show overall good performance of the proposed model in acceptance rate prediction (accuracy 0.82) and give promising perspectives for further development and application.
2022
Human perceiving behavior modeling in evaluation of code generation models
Sergey Kovalchuk | Vadim Lomshakov | Artem Aliev
Proceedings of the Second Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)
Within this study, we evaluated a series of code generation models based on CodeGen and GPTNeo to compare the metric-based performance and human evaluation. For a deeper analysis of human perceiving within the evaluation procedure we've implemented a 5-level Likert scale assessment of the model output using a perceiving model based on the Theory of Planned Behavior (TPB). Through such analysis, we showed an extension of model assessment as well as a deeper understanding of the quality and applicability of generated code for practical question answering. The approach was evaluated with several model settings in order to assess diversity in quality and style of answer. With the TPB-based model, we showed a different level of perceiving the model result, namely personal understanding, agreement level, and readiness to use the particular code. With such analysis, we investigate a series of issues in code generation as natural language generation (NLG) problems observed in a practical context of programming question-answering with code.