Hangliang Ren



2025

LSRL: Process-Supervised GRPO on Latent Recurrent States Improves Mathematical Reasoning
Hangliang Ren
Findings of the Association for Computational Linguistics: EMNLP 2025

Latent-recurrent language models solve tasks by iteratively refining hidden states rather than emitting chain-of-thought tokens, yet the opacity of those hidden trajectories hinders credit assignment and limits mathematical reasoning accuracy. We propose Latent-State Supervised Reinforcement Learning (LSRL), a process-supervised variant of Group Relative Policy Optimization (GRPO) that delivers dense rewards at every latent step. We decode each recurrent depth of a 3.5-billion-parameter Huginn model and score the partial solutions with a GPT-4.1-nano grader aligned to final-answer correctness. Using LoRA adapters, we update the policy on a single NVIDIA L40S GPU with only 500 GSM-8K training problems. Relative to the depth-8 supervised Huginn baseline, LSRL improves absolute accuracy by +4.27 points on GSM-8K and +2.06 points on MathQA. These results demonstrate that rewarding latent steps provides an efficient route to stronger mathematical reasoning in latent-recurrent language models.
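
The abstract's core mechanism can be illustrated compactly: GRPO-style group-relative advantages computed per latent depth, so each recurrent step receives a dense reward rather than a single terminal one. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: `group_relative_advantages` is a hypothetical helper name, and the random rewards stand in for the GPT-4.1-nano grades of partial solutions decoded from each recurrent depth.

    # Minimal sketch, not the paper's code. Assumes PyTorch; the random
    # rewards below stand in for grader scores of the partial solutions
    # decoded at each recurrent depth of the model.
    import torch

    def group_relative_advantages(step_rewards: torch.Tensor) -> torch.Tensor:
        """GRPO-style normalization applied per latent depth.

        step_rewards: (G, D) tensor -- G rollouts of the same problem,
        D recurrent depths, one grader score per (rollout, depth) pair.
        Each depth's rewards are centered and scaled by the group's
        statistics at that depth, so every latent step carries its own
        dense credit-assignment signal instead of one terminal reward.
        """
        mean = step_rewards.mean(dim=0, keepdim=True)
        std = step_rewards.std(dim=0, keepdim=True).clamp_min(1e-6)
        return (step_rewards - mean) / std

    # Toy usage: 4 rollouts of one GSM-8K problem at depth-8 recurrence.
    rewards = torch.rand(4, 8)          # stand-in grader scores in [0, 1]
    advantages = group_relative_advantages(rewards)
    print(advantages.shape)             # torch.Size([4, 8])

Normalizing within the group at each depth is what turns per-step grader scores into a usable dense signal: steps that outperform their group peers at the same depth are reinforced, without requiring a learned value model.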