Energy Matching based Preference Learning for Diffusion Language Models

Shiv Shankar


Abstract
Policy-gradient reinforcement learning (RL) is widely used to improve language model reasoning, but existing methods are not compatible with diffusion language models (dLLMs). The primary reason is the difficulty of likelihood estimation with such models. We propose EMBR, a scalable off-policy framework that reformulates KL-regularized RL as an energy-based distribution matching problem. By aligning policy updates with reward signals through energy matching, EMBR avoids the overhead of on-policy learning and the variance of importance weighting. We further derive a principled upper bound for the energy matching objective that can be used to fine-tune dLLMs. Experiments on multiple benchmarks in both online and offline settings show that EMBR matches or surpasses the performance of diffu-GRPO and related baselines in the online case, and of DPO in the offline case. Our approach provides a practical alternative for post-training of diffusion LMs.
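For context, the reformulation mentioned in the abstract can be related to a standard textbook result: the KL-regularized RL objective has a closed-form optimum that is a reward-tilted version of the reference policy, and this optimum can be read as an energy-based distribution. The sketch below states only that generic result. The symbols r (reward), \beta (regularization strength), \pi_{\mathrm{ref}} (reference policy), Z (partition function), and E (energy) are notation assumed here for illustration; the paper's own objective and its upper bound are not reproduced and may be parameterized differently.

```latex
% Generic KL-regularized RL objective (assumed notation, not taken from the paper)
\max_{\pi}\;
  \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  \;-\; \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% Its well-known closed-form maximizer
\pi^{*}(y \mid x)
  \;=\; \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
        \exp\!\big( r(x, y) / \beta \big),
\qquad
Z(x) \;=\; \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( r(x, y) / \beta \big)

% Equivalent energy-based form: \pi^{*} \propto \exp(-E)
E(x, y) \;=\; -\tfrac{1}{\beta}\, r(x, y) \;-\; \log \pi_{\mathrm{ref}}(y \mid x),
\qquad
\pi^{*}(y \mid x) \;=\; \frac{\exp\!\big( -E(x, y) \big)}{Z(x)}
```

Under this view, fitting a policy to \pi^{*} is a distribution matching problem against a known (unnormalized) energy, which is one way to read the abstract's claim that the approach avoids on-policy sampling and importance weighting.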
Anthology ID:
2026.eacl-srw.57
Volume:
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Selene Baez Santamaria, Sai Ashish Somayajula, Atsuki Yamaguchi
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
776–786
URL:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-srw.57/
Cite (ACL):
Shiv Shankar. 2026. Energy Matching based Preference Learning for Diffusion Language Models. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 776–786, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Energy Matching based Preference Learning for Diffusion Language Models (Shankar, EACL 2026)
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-srw.57.pdf