MARD: Module-Aware Reasoning Distillation for Language Models with Adaptive Supervision

Wenqi Yang, Jianjun Li, Zhibo Zhang, Mingqian Ding, Yushen Fang


Abstract
Multi-step reasoning remains challenging for language models with limited capacity. Recent reasoning distillation approaches transfer chain-of-thought supervision from large teacher models, but they typically apply uniform supervision across all Transformer components, overlooking that different modules contribute unequally to reasoning. We propose Module-Aware Reasoning Distillation (MARD), a parameter-efficient framework that explicitly targets key Transformer components for effective reasoning transfer. Through systematic analysis, we identify the feed-forward network projections and the output projection of self-attention as the primary bottlenecks for reasoning. Based on these findings, we introduce lightweight adapter modules at these components while freezing the backbone parameters, enabling focused and efficient distillation. Our approach adopts an offline distillation setting, in which a strong teacher model provides reasoning trajectories in advance, and incorporates an adaptive supervision strategy that adjusts the strength of reasoning-related losses according to problem difficulty. Experiments on mathematical reasoning benchmarks demonstrate consistent improvements over strong baselines, and ablation studies confirm the importance of both module-aware placement and adaptive supervision.
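The two ideas named in the abstract (adapters attached to frozen projections, and a reasoning loss scaled by problem difficulty) can be illustrated with a small sketch. Everything below is an assumption made for illustration: the low-rank form of the adapter, the function names `adapted_projection` and `adaptive_loss`, the `max_weight` parameter, and the linear, increasing dependence on a difficulty score in [0, 1] are hypothetical and are not taken from the paper, which does not specify these details in the abstract.

```python
# Illustrative sketch only: names and formulas are assumptions, not the paper's method.

def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def adapted_projection(frozen_w, down, up, x):
    """Frozen projection plus a low-rank residual: W x + up (down x)."""
    base = matvec(frozen_w, x)            # backbone weights, never updated
    delta = matvec(up, matvec(down, x))   # adapter path, the only trainable part
    return [b + d for b, d in zip(base, delta)]


def adaptive_loss(answer_loss, reasoning_loss, difficulty, max_weight=1.0):
    """Add the reasoning loss scaled by a difficulty score clipped to [0, 1]."""
    d = min(max(difficulty, 0.0), 1.0)
    return answer_loss + max_weight * d * reasoning_loss


frozen_w = [[1.0, 0.0], [0.0, 1.0]]   # 2x2 frozen projection
down = [[1.0, 1.0]]                   # 1x2 down-projection (rank 1)
up = [[0.0], [0.0]]                   # 2x1 up-projection, zero-initialised
x = [2.0, 3.0]

# With a zero-initialised up-projection the adapter leaves the output unchanged.
y = adapted_projection(frozen_w, down, up, x)

# Under the assumed weighting, harder problems receive more reasoning supervision.
easy = adaptive_loss(1.0, 2.0, difficulty=0.0)
hard = adaptive_loss(1.0, 2.0, difficulty=1.0)
```

In this sketch the zero-initialised up-projection means training starts from the unmodified backbone behaviour, a common choice for residual adapters. Whether the paper weights harder problems more or less heavily is not stated in the abstract, so the direction used here is only one possibility.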
Anthology ID:
2026.acl-long.1749
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
37680–37695
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1749/
Cite (ACL):
Wenqi Yang, Jianjun Li, Zhibo Zhang, Mingqian Ding, and Yushen Fang. 2026. MARD: Module-Aware Reasoning Distillation for Language Models with Adaptive Supervision. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 37680–37695, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
MARD: Module-Aware Reasoning Distillation for Language Models with Adaptive Supervision (Yang et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1749.pdf
Checklist:
 2026.acl-long.1749.checklist.pdf