Zhi Li


2026

Reinforcement learning with verifiable rewards (RLVR) is a standard post-training paradigm for large language models (LLMs), typically relying on group-wise reward and advantage normalization for stability. In set-valued multi-answer tasks, where multiple outputs may be simultaneously correct, this normalization can over-amplify a small number of early high-reward samples, suppressing learning signals from other valid generations and leading to overly concentrated updates. We propose Entropy-Aware Reshaping of Reinforcement Signals (EARS), a framework that reshapes how learning signals are normalized and aggregated. EARS uses token-level predictive entropy as an uncertainty cue to compute entropy-weighted reward statistics for advantage normalization, encouraging broader exploration and more balanced learning-signal allocation early in training. An adaptive decay schedule then anneals uncertainty-aware reweighting back to standard group normalization to ensure stable convergence. EARS further incorporates a correctness-gated multi-head process reward that provides auxiliary supervision on reasoning traces while remaining aligned with verifiable correctness. Experiments on MCTACO and MMLU-Multi using Qwen2.5-7B and Llama-3.1-8B-Instruct demonstrate consistent improvements in exact-set accuracy, training stability, and cross-dataset transfer performance on set-valued multi-answer reasoning.
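The normalization idea described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the per-sample uncertainty is taken to be the mean token entropy, the weights are blended linearly with uniform weights through a coefficient `lam`, and a fixed exponential schedule (`decay`, with hypothetical `lam0` and `tau`) stands in for the adaptive decay the abstract mentions. The one property it is meant to show is that, as `lam` goes to 0, the entropy-weighted statistics reduce to standard group normalization.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(token_dists):
    """Average token-level entropy over one sampled response (assumed uncertainty cue)."""
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)

def blended_weights(entropies, lam):
    """Mix entropy-proportional weights with uniform weights.

    lam = 1 gives purely entropy-proportional weights;
    lam = 0 gives uniform weights (standard group statistics).
    The linear blend is an illustrative assumption.
    """
    n = len(entropies)
    total = sum(entropies)
    ent_w = [h / total for h in entropies] if total > 0 else [1.0 / n] * n
    return [lam * w + (1.0 - lam) / n for w in ent_w]

def advantages(rewards, weights, eps=1e-6):
    """Normalize rewards with a weighted group mean and standard deviation."""
    mu = sum(w * r for w, r in zip(weights, rewards))
    var = sum(w * (r - mu) ** 2 for w, r in zip(weights, rewards))
    return [(r - mu) / (math.sqrt(var) + eps) for r in rewards]

def decay(step, lam0=1.0, tau=100.0):
    """Hypothetical fixed exponential schedule annealing lam toward 0."""
    return lam0 * math.exp(-step / tau)

# One group of four sampled responses: rewards and mean token entropies.
rewards = [1.0, 0.0, 0.0, 0.0]
entropies = [0.5, 1.0, 1.5, 2.0]

early = advantages(rewards, blended_weights(entropies, decay(0)))
late = advantages(rewards, blended_weights(entropies, 0.0))
```

Here `early` uses fully entropy-weighted statistics, while `late` matches plain group-wise mean and standard-deviation normalization, which is the regime the abstract says training is annealed back to.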

2025

In recent years, large language models (LLMs) have achieved remarkable success across diverse natural language processing tasks. Nevertheless, their capacity to process and reflect core human experiences remains underexplored. Current benchmarks for LLM evaluation typically focus on a single aspect of linguistic understanding and thus fail to capture the full breadth of a model's abstract reasoning about the world. To address this gap, we propose a multidimensional paradigm for investigating the capacity of LLMs to perceive the world through temporal, spatial, sentimental, and causal aspects. We conduct extensive experiments by partitioning datasets according to different distributions and employing various prompting strategies. Our findings reveal significant differences and shortcomings in how LLMs handle temporal granularity, multi-hop spatial reasoning, subtle sentiments, and implicit causal relationships. While sophisticated prompting approaches can mitigate some of these limitations, substantial challenges persist in effectively capturing human abstract perception, highlighting the discrepancy between model reasoning and human behavior. We hope that this work, which assesses LLMs from multiple perspectives of human understanding of the world, will guide further research on LLM perception and cognition.