year: 2021
paper: https://arxiv.org/pdf/2103.01955
website: https://sites.google.com/view/mappo
code: https://github.com/marlbenchmark/on-policy
connections: PPO, MARL
MAPPO objective
The objective of MAPPO is equivalent to that of PPO, except that every agent is equipped with the same shared set of parameters, i.e. $\theta_i = \theta$ for all agents $i$. The combined trajectories of all agents are used for the shared policy's update.
So the objective for optimizing the policy parameters $\theta$ at iteration $k$ is:

$$\theta_{k+1} = \arg\max_{\theta} \; \mathbb{E}_{o, a \sim \pi_{\theta_k}} \left[ \sum_{i=1}^{n} \min\left( \frac{\pi_\theta(a_i \mid o_i)}{\pi_{\theta_k}(a_i \mid o_i)} \, A_i^{\pi_{\theta_k}},\; \operatorname{clip}\!\left(\frac{\pi_\theta(a_i \mid o_i)}{\pi_{\theta_k}(a_i \mid o_i)},\, 1-\epsilon,\, 1+\epsilon\right) A_i^{\pi_{\theta_k}} \right) \right]$$

The constraint on the joint policy space imposed by the shared parameters can lead to an exponentially worse sub-optimal outcome (details; page 4).
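As a concrete illustration, here is a minimal PyTorch sketch of the shared-parameter policy loss, assuming a shared `actor` network that maps per-agent observations to a `torch.distributions.Categorical` over discrete actions, and that all agents' transitions have already been concatenated into one batch. The function name `mappo_policy_loss` and the tensor layout are my own choices, not taken from the official repo; the value loss, entropy bonus, and advantage (GAE) computation are omitted.

```python
import torch

def mappo_policy_loss(actor, obs, actions, old_log_probs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate over the pooled transitions of all agents.

    Because every agent shares the same actor parameters theta, the per-agent
    trajectories are simply concatenated along the batch dimension before
    this loss is computed.

    obs:           (batch, obs_dim)  per-agent observations, all agents stacked
    actions:       (batch,)          discrete action indices taken under pi_{theta_k}
    old_log_probs: (batch,)          log pi_{theta_k}(a_i | o_i), stored at rollout time
    advantages:    (batch,)          per-agent advantage estimates A_i
    """
    dist = actor(obs)                               # hypothetical: returns a Categorical
    log_probs = dist.log_prob(actions)              # log pi_theta(a_i | o_i)

    ratio = torch.exp(log_probs - old_log_probs)    # importance ratio per transition
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # Maximizing the clipped objective == minimizing the negative of its mean
    return -torch.min(unclipped, clipped).mean()
```

In practice, "combined trajectories" just means flattening the rollout buffer across agents, e.g. reshaping tensors of shape `(T, n_agents, ...)` to `(T * n_agents, ...)` before computing this loss.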