Chanjin Zheng

2026

PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback
Wei Xia | Jin Wu | Haoran Shi | Xiangyu Wang | Chanjin Zheng
Findings of the Association for Computational Linguistics: ACL 2026

Effective Automated Essay Scoring (AES) are expected to support both reliable assessment and actionable instructional feedback. However, existing approaches often treat scoring and feedback as separate components: neural scoring models provide limited interpretability, while Large Language Model (LLM)-based feedback is typically insensitive to learners’ proficiency levels. To address this fragmentation, this work proposes PsyScore, a psychometrically-aware framework that integrates diagnostic assessment with instructional scaffolding through a shared latent ability representation. PsyScore comprises three key modules: a Trait-Adaptive Neural IRT Scorer that incorporates the Graded Partial Credit Model (GPCM) into a neural architecture, enabling the precise estimation of student ability while maintaining psychometric interpretability, a ZPD-Scaffolded Feedback Generator, which conditions multi-agent feedback strategies on the diagnosed ability parameter to adapt instructional focus across different proficiency levels, and a Multi-Perspective Feedback Evaluation Strategy that assesses feedback quality via pairwise preference judgments and student revision simulations. Experiments on the ASAP++ dataset demonstrate that PsyScore achieves competitive scoring performance while providing more pedagogically aligned feedback.

pdf bib abs

KidsArtBench: Multi-Dimensional Children’s Art Evaluation with Attribute-Aware MLLMs
Mingrui Ye | Chanjin Zheng | Zengyi Yu | Chenyu Xiang | Zhixue Zhao | Zheng Yuan | Helen Yannakoudakis
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Multimodal Large Language Models (MLLMs) show progress across many visual–language tasks; however, their capacity to evaluate artistic expression remains limited: aesthetic concepts are inherently abstract and open-ended, and multimodal artwork annotations are scarce. We introduce KidsArtBench, a new benchmark of over 1k children’s artworks (ages 5-15) annotated by 12 expert educators across 9 rubric-aligned dimensions, together with expert comments for feedback. Unlike prior aesthetic datasets that provide single scalar scores on adult imagery, KidsArtBench targets children’s artwork and pairs multi-dimensional annotations with comment supervision to enable both ordinal assessment and formative feedback. Building on this resource, we propose an attribute-specific multi-LoRA approach – where each attribute corresponds to a distinct evaluation dimension (e.g., Realism, Imagination) in the scoring rubric – with Regression-Aware Fine-Tuning (RAFT) to align predictions with ordinal scales. On Qwen2.5-VL-7B, our method increases correlation from 0.468 to 0.653, with the largest gains on perceptual dimensions and narrowed gaps on higher-order attributes. Our results show that educator-aligned supervision and attribute-aware training yield pedagogically meaningful evaluations and establish a rigorous testbed for sustained progress in educational AI. We release data and code with ethics documentation.

2025

pdf bib abs

Decoding LLM Personality Measurement: Forced-Choice vs. Likert
Xiaoyu Li | Haoran Shi | Zengyi Yu | Yukun Tu | Chanjin Zheng
Findings of the Association for Computational Linguistics: ACL 2025

Recent research has focused on investigating the psychological characteristics of Large Language Models (LLMs), emphasizing the importance of comprehending their behavioral traits. Likert scale personality questionnaires have become the primary tool for assessing these characteristics in LLMs. However, such scales can be skewed by factors such as social desirability, distorting the assessment of true personality traits. To address this issue, we firstly incorporate the forced-choice test, a method known for reducing response bias in human personality assessments, into the evaluation of LLM. Specifically, we evaluated six LLMs: Llama-3.1-8B, GLM-4-9B, GPT-3.5-turbo, GPT-4o, Claude-3.5-sonnet, and Deepseek-V3. We compared the Likert scale and forced-choice test results for LLMs’ Big Five personality scores, as well as their reliability. In addition, we looked at how temperature parameter and language affected LLM personality scores. The results show that the forced-choice test better captures differences between LLMs across various personality dimensions and is less influenced by temperature parameters. Furthermore, we found both broad trends and specific variations in personality scores across models and languages.

Co-authors

Jin Wu 1

Wei Xia 1

Chenyu Xiang 1

Helen Yannakoudakis 1

Mingrui Ye 1

Zheng Yuan 1

Zhixue Zhao 1

Venues

Findings2
EACL1

Fix author