year: 2022/05
paper: https://arxiv.org/abs/2205.14953
website: https://www.reddit.com/r/MachineLearning/comments/v2af3k/r_multiagent_reinforcement_learning_can_now_be/ | https://sites.google.com/view/multi-agent-transformer
code: https://github.com/PKU-MARL/Multi-Agent-Transformer
connections: MARL, transformer


Preliminaries

Multi-Agent Advantage Function

Let $i_{1:m}$ and $j_{1:h}$ denote two disjoint ordered subsets of agents. The multi-agent advantage function measures how much better (or worse) the joint value will be when the agents $i_{1:m}$ take actions $\mathbf{a}^{i_{1:m}}$, after the agents $j_{1:h}$ have already committed to actions $\mathbf{a}^{j_{1:h}}$:

$$A_{\boldsymbol{\pi}}^{i_{1:m}}\big(\mathbf{o},\, \mathbf{a}^{j_{1:h}},\, \mathbf{a}^{i_{1:m}}\big) = Q_{\boldsymbol{\pi}}^{j_{1:h},\, i_{1:m}}\big(\mathbf{o},\, \mathbf{a}^{j_{1:h}},\, \mathbf{a}^{i_{1:m}}\big) - Q_{\boldsymbol{\pi}}^{j_{1:h}}\big(\mathbf{o},\, \mathbf{a}^{j_{1:h}}\big)$$

- $Q_{\boldsymbol{\pi}}^{j_{1:h},\, i_{1:m}}$: the value when both groups of agents ($j_{1:h}$ and $i_{1:m}$) take their respective actions
- $Q_{\boldsymbol{\pi}}^{j_{1:h}}$: the baseline value considering only the actions of agents $j_{1:h}$

When $h = 0$ (no prior agents have committed actions), $Q_{\boldsymbol{\pi}}^{j_{1:0}}(\mathbf{o})$ is simply the state value $V_{\boldsymbol{\pi}}(\mathbf{o})$, so this reduces to comparing the value of actions $\mathbf{a}^{i_{1:m}}$ against the baseline value of the whole team (the expected return under the current joint policy).

Note: The advantage function doesn’t model some agents being “inactive” - rather, it measures the relative value of specific action choices by a subset of agents compared to letting all agents act according to their default policy behavior, i.e. the baseline value.

Example

Consider a team of 3 agents where:
Agent 1 commits to action $a^1$ first, and we want to evaluate the advantage of agents 2 and 3’s joint action $(a^2, a^3)$:

$$A_{\boldsymbol{\pi}}^{\{2,3\}}\big(\mathbf{o},\, a^1,\, (a^2, a^3)\big) = Q_{\boldsymbol{\pi}}^{\{1,2,3\}}\big(\mathbf{o},\, (a^1, a^2, a^3)\big) - Q_{\boldsymbol{\pi}}^{\{1\}}\big(\mathbf{o},\, a^1\big)$$

This measures how much better/worse agents 2 and 3’s chosen actions are compared to the expected value after only agent 1’s action.
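A toy numeric sketch of this example, with made-up Q-values and a hypothetical lookup just to make the arithmetic concrete:

```python
# Toy illustration with hypothetical Q-values (numbers are made up).
# The joint observation o is fixed and left implicit.

# Q(o, a^1, a^2, a^3): value after all three agents have acted.
q_joint = {("left", "up", "up"): 4.0, ("left", "up", "down"): 2.5}
# Q(o, a^1): value after only agent 1 commits, with agents 2 and 3
# still following their current policies.
q_prefix = {("left",): 3.0}

def advantage_23(a1, a23):
    """A^{2,3}(o, a^1, (a^2, a^3)) = Q(o, a^1, a^2, a^3) - Q(o, a^1)."""
    return q_joint[(a1, *a23)] - q_prefix[(a1,)]

print(advantage_23("left", ("up", "up")))    #  1.0 -> better than default behaviour
print(advantage_23("left", ("up", "down")))  # -0.5 -> worse than default behaviour
```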

Multi-Agent Advantage Decomposition Theorem

Let $i_{1:n}$ be a permutation of the $n$ agents. For any joint observation $\mathbf{o}$ and joint action $\mathbf{a}^{i_{1:n}}$:

$$A_{\boldsymbol{\pi}}^{i_{1:n}}\big(\mathbf{o},\, \mathbf{a}^{i_{1:n}}\big) = \sum_{m=1}^{n} A_{\boldsymbol{\pi}}^{i_m}\big(\mathbf{o},\, \mathbf{a}^{i_{1:m-1}},\, a^{i_m}\big)$$

The total advantage of a joint action can be decomposed into a sum of individual advantages, where each agent’s advantage is conditioned on the actions chosen by previous agents in the permutation.

Instead of searching the entire joint action space $\prod_{i=1}^{n} \mathcal{A}^i$, whose size grows exponentially with the number of agents, we can search each agent’s action space $\mathcal{A}^{i_m}$ sequentially.
→ Each agent can make decisions based on information about what the previous agents have done, incrementally improving the joint action (see the sketch below).
→ The computational savings become more dramatic as more agents are added.
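A minimal sketch of the resulting search procedure, assuming a hypothetical `local_advantage(obs, prefix, agent, action)` oracle for the per-agent advantage $A_{\boldsymbol{\pi}}^{i_m}$ (not the paper's code, just an illustration of the cost difference):

```python
from itertools import product

ACTIONS = ["up", "down", "left", "right"]

def joint_search(obs, n_agents, joint_advantage):
    """Naive search: evaluates all |A|^n joint actions."""
    return max(product(ACTIONS, repeat=n_agents),
               key=lambda joint: joint_advantage(obs, joint))

def sequential_search(obs, n_agents, local_advantage):
    """Decomposition-based search: n * |A| evaluations instead of |A|^n.
    Each agent picks the action with the highest local advantage,
    conditioned on the actions already chosen by the earlier agents."""
    prefix = []
    for agent in range(n_agents):
        best = max(ACTIONS,
                   key=lambda a: local_advantage(obs, tuple(prefix), agent, a))
        prefix.append(best)  # later agents condition on this choice
    return tuple(prefix)
```

By the decomposition theorem, each positive local advantage contributes a positive term to the joint advantage, so the incremental choices jointly improve the team's action.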

Link to original

MAPPO objective

The objective of MAPPO is equivalent to that of PPO, but it equips all agents with one shared set of policy parameters, i.e. $\pi^{i} = \pi_{\theta}$ for every agent $i$, and uses the agents’ combined trajectories for the shared policy’s update.
So the objective for optimizing the policy parameters $\theta_{k+1}$ at iteration $k+1$ is:

$$\mathcal{L}(\theta) = \mathbb{E}_{\mathbf{o} \sim \rho_{\boldsymbol{\pi}_{\theta_k}},\, \mathbf{a} \sim \boldsymbol{\pi}_{\theta_k}}\left[ \frac{1}{n} \sum_{i=1}^{n} \min\!\left( \frac{\pi_{\theta}(a^i \mid o^i)}{\pi_{\theta_k}(a^i \mid o^i)}\, \hat{A}(\mathbf{o}, \mathbf{a}),\ \mathrm{clip}\!\left( \frac{\pi_{\theta}(a^i \mid o^i)}{\pi_{\theta_k}(a^i \mid o^i)},\, 1 \pm \epsilon \right) \hat{A}(\mathbf{o}, \mathbf{a}) \right) \right]$$

The constraint on the joint policy space imposed by shared parameters can lead to an exponentially-worse sub-optimal outcome (details; page 4).
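A PyTorch-style sketch of this shared-parameter clipped objective, assuming per-agent log-probabilities and a joint advantage estimate have already been collected (the tensor names are my own, not from the official repo):

```python
import torch

def mappo_policy_loss(new_logp, old_logp, advantage, clip_eps=0.2):
    """Clipped PPO objective with one policy shared across all agents.

    new_logp, old_logp: (batch, n_agents) log pi_theta(a^i | o^i)
    advantage:          (batch, 1) joint advantage estimate, shared by all agents
    """
    ratio = torch.exp(new_logp - old_logp)                       # (batch, n_agents)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    # Average over agents and batch; negate because optimizers minimize.
    return -torch.min(unclipped, clipped).mean()
```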

Link to original

HAPPO vs MAPPO

MAPPO is a straightforward application of PPO to multi-agent settings, often forcing all agents to share one set of policy parameters. This can be suboptimal when agents need different behaviors.
HAPPO relaxes this constraint by:

  1. Sequentially updating the agents in a random permutation $i_{1:n}$, where agent $i_m$ updates using the newly updated policies of agents $i_{1:m-1}$.
  2. Leveraging the Multi-Agent Advantage Decomposition Theorem, which yields a valid trust-region update for each agent’s local advantage $A_{\boldsymbol{\pi}}^{i_m}\big(\mathbf{o}, \mathbf{a}^{i_{1:m-1}}, a^{i_m}\big)$. Specifically, HAPPO applies the standard PPO (clipped) objective to each agent while treating previously updated agents’ policies as “fixed.” Because the decomposition theorem ensures the local advantage aligns with the global return, any PPO-style step that respects the trust region (i.e., doesn’t deviate too far from the old policy) guarantees a monotonic improvement for the entire multi-agent policy.
  3. Maintaining distinct policy parameters for truly heterogeneous capabilities.

This ensures a monotonic improvement guarantee for the joint return – something MAPPO’s parameter-sharing approach can’t guarantee. As a result, HAPPO typically performs better than MAPPO in scenarios where agents must learn different skillsets.
However, one drawback of HAPPO is that the agents’ policies have to follow the sequential update scheme (in the order given by the permutation), so the updates cannot be run in parallel.
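A schematic of the sequential update scheme, where `ppo_step` is a hypothetical helper that runs one clipped-PPO update on a single agent and returns its new log-probabilities on the batch (a sketch of the idea, not HAPPO’s actual implementation):

```python
import random
import torch

def happo_iteration(policies, old_logps, obs, actions, joint_advantage, ppo_step):
    """One HAPPO iteration over a collected batch.

    policies:        list of n per-agent policies (distinct parameters)
    old_logps:       (batch, n) log-probs of the taken actions under the old policies
    obs, actions:    (batch, n, ...) per-agent observations / actions
    joint_advantage: (batch,) advantage estimate from the joint critic
    """
    order = random.sample(range(len(policies)), k=len(policies))
    weight = torch.ones_like(joint_advantage)  # ratios of already-updated agents
    for m in order:
        # Agent i_m optimizes its *local* advantage: the joint advantage
        # re-weighted by the probability ratios of its predecessors.
        local_adv = weight * joint_advantage
        new_logp = ppo_step(policies[m], obs[:, m], actions[:, m], local_adv)
        # Fold this agent's (new / old) ratio into the weight so that the
        # remaining agents condition on its updated behaviour.
        weight = weight * torch.exp(new_logp - old_logps[:, m]).detach()
    # The loop is inherently sequential: agent i_m needs the updates of i_{1:m-1}.
```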

Link to original


Implementation

Algorithm

Steps:

  1. Generate encoded representations of the joint agents’ observations: $\big(\hat{o}^{i_1}, \dots, \hat{o}^{i_n}\big) = \mathrm{Encoder}\big(o^{i_1}, \dots, o^{i_n}\big)$
  2. Generate state-value estimates from the encoded observations: $V\big(\hat{o}^{i_m}\big)$, via a value head on the encoder output
  3. Autoregressively decode agent actions, while cross-attending the encoded observations: $a^{i_m} \sim \pi_{\theta}^{i_m}\big(\cdot \mid \hat{\mathbf{o}}^{i_{1:n}}, \mathbf{a}^{i_{1:m-1}}\big)$

Slap PPO on top… that’s it!
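A shape-level PyTorch sketch of these three steps using stock transformer modules, with made-up module names and sizes; the real MAT architecture differs in details (embeddings, masking, continuous-action support), so treat this purely as an illustration:

```python
import torch
import torch.nn as nn

class MATSketch(nn.Module):
    def __init__(self, obs_dim, n_actions, d_model=64):
        super().__init__()
        self.obs_embed = nn.Linear(obs_dim, d_model)
        self.act_embed = nn.Embedding(n_actions + 1, d_model)  # +1 for a start token
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.value_head = nn.Linear(d_model, 1)
        self.policy_head = nn.Linear(d_model, n_actions)

    def forward(self, obs):                       # obs: (batch, n_agents, obs_dim)
        # 1. Encode the joint observations.
        enc = self.encoder(self.obs_embed(obs))   # (batch, n_agents, d_model)
        # 2. Per-agent state-value estimates from the encoded observations.
        values = self.value_head(enc).squeeze(-1)              # (batch, n_agents)
        # 3. Autoregressively decode actions, cross-attending to `enc`.
        batch, n_agents, _ = obs.shape
        actions = torch.zeros(batch, 1, dtype=torch.long)      # start token
        for _ in range(n_agents):
            tgt = self.act_embed(actions)                       # (batch, t, d_model)
            causal = torch.triu(torch.full((tgt.size(1), tgt.size(1)), float("-inf")), 1)
            dec = self.decoder(tgt, enc, tgt_mask=causal)       # masked self-attn + cross-attn
            logits = self.policy_head(dec[:, -1])               # next agent's action logits
            next_act = torch.distributions.Categorical(logits=logits).sample()
            actions = torch.cat([actions, next_act.unsqueeze(-1) + 1], dim=-1)
        return values, actions[:, 1:] - 1          # drop start token, shift ids back
```

The per-agent values feed a standard PPO-style value loss, and the decoder’s left-to-right ordering is what realizes the sequential structure from the decomposition theorem.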
