Gary Ushaw


2026

Credit assignment is a fundamental challenge in cooperative multi-agent reinforcement learning, particularly in embodied AI settings characterized by limited and delayed feedback as well as dynamically changing numbers of active agents. We propose MARS-RA, a framework that reformulates credit assignment as a rank aggregation problem using contribution-based pairwise comparisons among agents generated by large multimodal models. This shift from absolute to relative estimation provides robustness against noise and dynamic agent participation, and the comparison results are converted into contribution scores for potential-based reward shaping. We provide theoretical justification for the convergence and robustness of the proposed framework, and show that Shapley values can be used as an interpretive reference. Experimental results on challenging tasks of different types indicate that MARS-RA can guide agents toward effective cooperation.
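The pipeline in this abstract (pairwise comparisons, then contribution scores, then potential-based reward shaping) can be illustrated with a minimal sketch. It is not the MARS-RA implementation. The win-rate aggregation rule, the function names, and the use of an agent's score as its potential are all assumptions made here for illustration; the abstract does not specify them. Only the shaping form r + gamma * Phi(s') - Phi(s) is the standard potential-based definition.

```python
from collections import defaultdict


def win_rate_scores(comparisons, agents):
    """Aggregate pairwise outcomes into per-agent scores in [0, 1].

    comparisons: iterable of (winner, loser) pairs, e.g. judged by a model.
    agents: every agent that needs a score, including ones never compared.
    The win-rate rule is an illustrative stand-in, not the paper's method.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    # Scores depend only on comparisons an agent took part in, so agents
    # that join or leave do not distort the others. Unseen agents get 0.5.
    return {a: (wins[a] / games[a] if games[a] else 0.5) for a in agents}


def shaped_reward(reward, phi_s, phi_next, gamma=0.99):
    """Standard potential-based shaping: r + gamma * Phi(s') - Phi(s)."""
    return reward + gamma * phi_next - phi_s


# Hypothetical usage: three judged comparisons among agents a, b, c;
# agent d is active but was never compared.
scores = win_rate_scores([("a", "b"), ("a", "c"), ("b", "c")], ["a", "b", "c", "d"])
```

Because each score is relative to the comparisons an agent actually participated in, this toy aggregation tolerates a changing set of active agents, which is the property the abstract attributes to relative estimation.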
Reasoning capability is fundamental in enabling Large Language Models (LLMs) to perform complex multi-step inference. Self Consistency (SC), which samples multiple reasoning paths and selects the most frequent answer, remains highly effective but fails on challenging tasks where incorrect answers dominate the majority. Inspired by Metropolis Light Transport in physically-based rendering, where discovered high-contribution light paths guide subsequent sampling toward illumination sources, we propose Metropolis Self Consistency (MSC) and its multi-LLM extension, Metropolis Cross Consistency (MCC), a probabilistic self- and cross-consistency verification framework for mathematical reasoning. Our approach employs an accept-reject mechanism to encourage high-quality reasoning paths, concentrating sampling in regions more likely to yield correct answers. Experiments on 9 LLMs across 4 challenging mathematical benchmarks demonstrate consistent improvements over SC. Even when combining models of vastly different capabilities, MCC maintains performance virtually matching the most capable model while significantly reducing computational cost compared to SC with the strongest model alone. Our implementation is training-free, adds minimal token overhead beyond SC, and requires no external reward model; more generally, the approach provides a flexible paradigm that can accommodate any scalar reward representing path correctness.
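The contrast between plain SC voting and an accept-reject sampler can be sketched as follows. This is a generic Metropolis rule under stated assumptions, not the authors' algorithm: `propose` and `score` are hypothetical callables (for example an LLM call and a positive scalar reward), the proposal is assumed symmetric, and the abstract does not say how paths are proposed or scored.

```python
import random
from collections import Counter


def self_consistency(answers):
    """Plain SC: return the most frequent answer among sampled paths."""
    return Counter(answers).most_common(1)[0][0]


def metropolis_chain(propose, score, steps, rng):
    """Generic Metropolis accept-reject over reasoning paths.

    propose(current) -> candidate path (hypothetical, e.g. an LLM call;
                        receives None for the first sample).
    score(path)      -> strictly positive scalar reward (hypothetical).
    rng              -> object with a random() method, e.g. random.Random.
    """
    current = propose(None)
    chain = [current]
    for _ in range(steps - 1):
        candidate = propose(current)
        # Accept with probability min(1, score(candidate) / score(current)).
        accept_prob = min(1.0, score(candidate) / score(current))
        if rng.random() < accept_prob:
            current = candidate
        # A rejected candidate repeats the current path in the chain.
        chain.append(current)
    return chain


# Hypothetical usage with a toy proposal and reward in place of an LLM.
toy_rewards = {"4": 0.9, "5": 0.1}
toy_rng = random.Random(0)
paths = metropolis_chain(
    propose=lambda current: toy_rng.choice(["4", "5"]),
    score=lambda path: toy_rewards[path],
    steps=20,
    rng=toy_rng,
)
answer = self_consistency(paths)
```

Since rejected candidates repeat the current path, higher-scoring paths occupy more of the chain, and the final vote is taken over a sample that is concentrated on them rather than over independent draws.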