year: 2021/9
paper: https://arxiv.org/abs/2109.11251
website:
code:
connections: PPO, TRPO, MAPPO, MARL
HAPPO vs MAPPO
MAPPO is a straightforward application of PPO to multi-agent settings, often forcing all agents to share one set of policy parameters. This can be suboptimal when agents need different behaviors.
HAPPO relaxes this constraint by:
- Sequentially updating each agent in a randomly drawn permutation $i_{1:n}$, where agent $i_m$ updates using the newly updated policies of agents $i_1, \dots, i_{m-1}$.
- Leveraging the Multi-Agent Advantage Decomposition Theorem, which yields a valid trust-region update for each agent's local advantage $A^{i_m}_{\pi}(s, \mathbf{a}^{i_{1:m-1}}, a^{i_m})$ (written out after this list). Specifically, HAPPO applies the standard PPO (clipped) objective to each agent while treating the previously updated agents' policies as fixed. Because the decomposition theorem ensures the local advantages sum to the joint advantage, any PPO-style step that respects the trust region (i.e., does not deviate too far from the old policy) guarantees a monotonic improvement for the entire multi-agent policy.
- Maintaining distinct policy parameters for truly heterogeneous capabilities.
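For reference, the decomposition theorem (notation adapted from the paper) states that the joint advantage splits exactly into per-agent local advantages, each conditioned on the actions of the agents earlier in the ordering:

$$
A_{\pi}^{i_{1:n}}\!\left(s, \mathbf{a}^{i_{1:n}}\right) \;=\; \sum_{m=1}^{n} A_{\pi}^{i_m}\!\left(s, \mathbf{a}^{i_{1:m-1}}, a^{i_m}\right)
$$

so improving each local advantage in sequence, while holding the earlier agents' updated policies fixed, improves the joint advantage.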
This gives a monotonic improvement guarantee for the joint return – something MAPPO's parameter-sharing approach cannot provide. As a result, HAPPO typically outperforms MAPPO in scenarios where agents must learn different skill sets.
However, one drawback of HAPPO is that the agents' policies have to be updated sequentially (following the permutation), so the per-agent updates cannot be run in parallel; a minimal sketch of the sequential loop is given below.
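The sketch below illustrates one HAPPO iteration in PyTorch-style code. It is an assumption-laden outline, not the authors' implementation: the `agents` objects, the stored `old_log_prob`s, and the batch layout are hypothetical, and the joint advantages are assumed to come from a shared critic (e.g., via GAE).

```python
import torch

def happo_update(agents, batch, clip_eps=0.2):
    """One HAPPO iteration over a collected batch (illustrative sketch).

    agents : list of objects with
        .policy(obs)  -> torch.distributions.Distribution over that agent's action
        .old_log_prob -> log pi_old(a^i | o^i) stored at collection time, shape [B]
        .optimizer    -> optimizer over that agent's policy parameters
    batch  : dict with
        batch["obs"][i], batch["act"][i] -> per-agent observations / actions
        batch["adv"]                     -> joint advantage estimates A(s, a), shape [B]
    """
    # Running product of new/old ratios of previously updated agents:
    # this is the correction factor coming from the advantage decomposition.
    m_factor = torch.ones_like(batch["adv"])

    # Draw a fresh random permutation of agents for this iteration.
    order = torch.randperm(len(agents)).tolist()

    for i in order:
        agent = agents[i]
        obs, act = batch["obs"][i], batch["act"][i]
        old_logp = agent.old_log_prob          # fixed, from data collection

        # Advantage seen by this agent: joint advantage scaled by the
        # ratios of the agents updated earlier in the permutation.
        adv = (m_factor * batch["adv"]).detach()

        # Standard PPO clipped surrogate, applied to this agent only.
        dist = agent.policy(obs)
        logp = dist.log_prob(act)
        ratio = torch.exp(logp - old_logp)
        surr = torch.min(ratio * adv,
                         torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
        loss = -surr.mean()

        agent.optimizer.zero_grad()
        loss.backward()
        agent.optimizer.step()

        # Fold this agent's new/old ratio into the factor used by later agents.
        with torch.no_grad():
            new_logp = agent.policy(obs).log_prob(act)
            m_factor = m_factor * torch.exp(new_logp - old_logp)
```

The loop makes the drawback visible: each agent's surrogate depends on `m_factor`, which is only known after all earlier agents in the permutation have finished their updates, so the per-agent optimisation steps cannot be parallelised.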