Component Transfer Can Exceed Full Model Performance: Investigating Post-Trained Mixture-of-Experts

Rabin Tiwari


Abstract

Post-training methods such as supervised fine-tuning and preference optimization are widely used to align large language models, yet how their benefits distribute across architectural components and transfer across tasks and prompts remains unclear. In this work, we analyze component-level transfer in a Mixture-of-Experts language model by selectively replacing routers, attention modules, and expert networks between two Mixture-of-Experts models post-trained with different recipes and dataset mixtures. Starting from an SFT+DPO checkpoint, we systematically replace its components (routers, attention, experts) with those from a Tulu3 checkpoint and evaluate the impact of each replacement, and of their combinations, on mathematical and scientific reasoning and on a general-purpose classification task under zero-shot, few-shot, and chain-of-thought prompting. We find strong component-specific specialization: expert networks account for most gains on mathematical and scientific reasoning, while attention transfer consistently outweighs expert transfer on general tasks. Router transfer alone provides minimal benefit or harms performance. Prompting strategy further modulates these effects, with expert transfer degrading zero-shot science performance but improving few-shot reasoning. Strategically combining components from different model versions can in some cases match or exceed the performance of the best available model, motivating principled techniques for composing post-trained models into task- and prompt-specific systems without additional training.
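The component replacement described in the abstract can be illustrated with a minimal sketch. It assumes both checkpoints share an architecture and are available as flat mappings from parameter name to weights, and that routers, attention modules, and experts can be told apart by substrings of the parameter name. The substrings, the function name `transfer_components`, and the toy checkpoints are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from typing import Any, Dict, Iterable

# Hypothetical name fragments for each component type. Real checkpoints may
# use different parameter names, so these must be adapted to the model.
COMPONENT_PATTERNS = {
    "router": (".mlp.gate.",),
    "attention": (".self_attn.",),
    "experts": (".mlp.experts.",),
}


def transfer_components(
    base: Dict[str, Any], donor: Dict[str, Any], components: Iterable[str]
) -> Dict[str, Any]:
    """Return a copy of `base` whose selected components come from `donor`."""
    patterns = tuple(p for c in components for p in COMPONENT_PATTERNS[c])
    merged = dict(base)  # shallow copy; the base mapping is left untouched
    for name, value in donor.items():
        # Only overwrite parameters that exist in both checkpoints.
        if name in base and any(p in name for p in patterns):
            merged[name] = value
    return merged


# Toy stand-ins for the two checkpoints (strings instead of tensors).
names = [
    "embed.weight",
    "layers.0.self_attn.q_proj.weight",
    "layers.0.mlp.gate.weight",
    "layers.0.mlp.experts.0.up_proj.weight",
]
base_ckpt = {n: "base" for n in names}
donor_ckpt = {n: "donor" for n in names}

# Experts only: everything else stays from the base checkpoint.
experts_only = transfer_components(base_ckpt, donor_ckpt, ["experts"])

# Every non-empty combination of the three component types (7 variants).
variants = {
    combo: transfer_components(base_ckpt, donor_ckpt, combo)
    for r in range(1, 4)
    for combo in combinations(COMPONENT_PATTERNS, r)
}
```

Each entry in `variants` would then be loaded into the model and evaluated separately; matching on parameter names keeps the swap independent of any particular framework, at the cost of relying on a naming convention.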

Anthology ID: 2026.gem-main.7
Volume: Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
Venues: GEM | WS
Publisher: Association for Computational Linguistics
Pages: 77–83
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.7/
Cite (ACL): Rabin Tiwari. 2026. Component Transfer Can Exceed Full Model Performance: Investigating Post-Trained Mixture-of-Experts. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 77–83, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): Component Transfer Can Exceed Full Model Performance: Investigating Post-Trained Mixture-of-Experts (Tiwari, GEM 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.7.pdf
