tldr: Finetuning many agents end-to-end offers a workaround to the continual-learning problem, since different agents can specialize without catastrophic forgetting. But doing so is hard due to credit assignment and sample efficiency. We found that using AI feedback as per-action process rewards holds promise for addressing both challenges and unlocks a new axis for scaling post-training.
Paper: https://arxiv.org/abs/2601.23228
Code: https://github.com/ltjed/multiagent-coaching
Blog: https://ltjed.github.io/MAPPA/
---
What we did
We have an off-the-shelf LLM act as a coach that watches multi-agent systems during training. It scores each action as it happens—process supervision rather than just outcome rewards at the end. The coach sees what each agent produced plus tool outputs (stdout, stderr, errors).
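A minimal sketch of this per-action coaching loop. The coach here is a stubbed heuristic standing in for a real LLM call, and the `ActionRecord` fields and `coach_score` logic are illustrative assumptions, not our actual implementation — the point is the shape of the supervision: one reward per action per agent, conditioned on the agent's output plus tool outputs.

```python
from dataclasses import dataclass

@dataclass
class ActionRecord:
    agent: str    # which agent took the action
    output: str   # what the agent produced
    stdout: str   # tool output the coach also sees
    stderr: str   # tool errors the coach also sees

def coach_score(record: ActionRecord) -> float:
    """Stand-in for the off-the-shelf LLM coach.

    A real coach would be prompted with the action and its tool
    outputs and asked for a scalar score; here we use a trivial
    heuristic so the sketch runs: penalize errors, reward output.
    """
    if record.stderr:
        return -1.0
    return 1.0 if record.stdout else 0.0

def process_rewards(trajectory: list[ActionRecord]) -> dict[str, list[float]]:
    """Per-action rewards grouped by agent.

    This is what makes credit assignment tractable: each agent is
    scored on its own actions as they happen, rather than sharing
    one outcome reward at the end of the episode.
    """
    rewards: dict[str, list[float]] = {}
    for rec in trajectory:
        rewards.setdefault(rec.agent, []).append(coach_score(rec))
    return rewards

traj = [
    ActionRecord("planner", "write tests first", "", ""),
    ActionRecord("coder", "def add(a, b): return a + b", "ok", ""),
    ActionRecord("coder", "broken_call()", "", "NameError"),
]
print(process_rewards(traj))  # {'planner': [0.0], 'coder': [1.0, -1.0]}
```

The per-agent reward lists then feed whatever policy-gradient update each agent uses, so a failing action by one agent no longer drags down the others.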
The parts I find interesting
Credit assignment just... works?