year: 2022/05
paper: https://arxiv.org/abs/2205.14953
website: https://www.reddit.com/r/MachineLearning/comments/v2af3k/r_multiagent_reinforcement_learning_can_now_be/ | https://sites.google.com/view/multi-agent-transformer
code: https://github.com/PKU-MARL/Multi-Agent-Transformer
connections: MARL, transformer
Preliminaries
Multi-Agent Advantage Function
The multi-agent advantage function measures how much better (or worse) the joint value becomes when a specific subset of agents $i_{1:h}$ takes actions $\mathbf{a}^{i_{1:h}}$, after another subset of agents $j_{1:k}$ has already committed to actions $\mathbf{a}^{j_{1:k}}$:

$$A_{\boldsymbol{\pi}}^{i_{1:h}}\left(\mathbf{o}, \mathbf{a}^{j_{1:k}}, \mathbf{a}^{i_{1:h}}\right) = Q_{\boldsymbol{\pi}}^{j_{1:k}, i_{1:h}}\left(\mathbf{o}, \mathbf{a}^{j_{1:k}}, \mathbf{a}^{i_{1:h}}\right) - Q_{\boldsymbol{\pi}}^{j_{1:k}}\left(\mathbf{o}, \mathbf{a}^{j_{1:k}}\right)$$

- $Q_{\boldsymbol{\pi}}^{j_{1:k}, i_{1:h}}$: the value when both groups of agents ($j_{1:k}$ and $i_{1:h}$) take their respective actions
- $Q_{\boldsymbol{\pi}}^{j_{1:k}}$: the baseline value considering only the actions of agents $j_{1:k}$

When $k = 0$ (no prior agents have committed actions), this reduces to comparing the value of actions $\mathbf{a}^{i_{1:h}}$ against the baseline value $V_{\boldsymbol{\pi}}(\mathbf{o})$ of the whole team (the expected return under the current joint policy).
Note: The advantage function doesn’t model some agents being “inactive” - rather, it measures the relative value of specific action choices by a subset of agents compared to letting all agents act according to their default policy behavior, i.e. the baseline value.
Example
Consider a team of 3 agents where:
Agent 1 commits to action $a^1$ first (so $j_{1:k} = \{1\}$), and we want to evaluate the advantage of agents 2 and 3’s joint action $(a^2, a^3)$ (so $i_{1:h} = \{2, 3\}$).
The advantage function $A_{\boldsymbol{\pi}}^{\{2,3\}}\left(\mathbf{o}, a^1, (a^2, a^3)\right)$ measures how much better/worse agents 2 and 3’s chosen actions are compared to the expected value after only agent 1’s action.
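A minimal sketch of this quantity in code, assuming a hypothetical centralized action-value estimator `q(obs, actions)` (not part of the paper's implementation):

```python
def multi_agent_advantage(q, obs, committed, candidate):
    """A(obs, committed, candidate) = Q(obs, committed + candidate) - Q(obs, committed).

    `committed` and `candidate` are tuples of actions for the two agent subsets.
    """
    return q(obs, committed + candidate) - q(obs, committed)

# For the 3-agent example: advantage of agents 2 and 3's joint action (a2, a3),
# given that agent 1 has already committed to a1:
# adv = multi_agent_advantage(q, obs, committed=(a1,), candidate=(a2, a3))
```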
Multi-Agent Advantage Decomposition Theorem
Let $i_{1:n}$ be a permutation of the $n$ agents. For any joint observation $\mathbf{o}$ and joint action $\mathbf{a}^{i_{1:n}}$:

$$A_{\boldsymbol{\pi}}^{i_{1:n}}\left(\mathbf{o}, \mathbf{a}^{i_{1:n}}\right) = \sum_{m=1}^{n} A_{\boldsymbol{\pi}}^{i_m}\left(\mathbf{o}, \mathbf{a}^{i_{1:m-1}}, a^{i_m}\right)$$
The total advantage of a joint action can be decomposed into a sum of individual advantages, where each agent’s advantage is conditioned on the actions chosen by previous agents in the permutation.
This enables sequential decision making.
Agent $i_1$ goes first and picks an action $a^{i_1}$ aiming for a positive advantage $A_{\boldsymbol{\pi}}^{i_1}(\mathbf{o}, a^{i_1}) > 0$
Agent $i_2$, knowing $a^{i_1}$, chooses $a^{i_2}$ for a positive $A_{\boldsymbol{\pi}}^{i_2}(\mathbf{o}, a^{i_1}, a^{i_2})$
Agent $i_3$, knowing $(a^{i_1}, a^{i_2})$, … and so on for the remaining agents.

Instead of searching the entire joint action space $\prod_{m=1}^{n} \mathcal{A}^{i_m}$ (exponential in the number of agents), we can search each agent’s action space $\mathcal{A}^{i_m}$ separately (linear in the number of agents); see the sketch below.
→ Each agent can make decisions based on information about what the previous agents have done, incrementally improving the joint action.
→ The computational savings get more dramatic as you add more agents.
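A minimal sketch of this sequential selection, assuming a hypothetical per-agent local-advantage estimator `local_advantage(obs, prev_actions, action, agent)`:

```python
def select_joint_action(local_advantage, obs, action_spaces, order):
    """Greedily pick each agent's action given the actions already chosen.

    Searches sum_m |A^{i_m}| candidate actions instead of prod_m |A^{i_m}| joint actions.
    """
    chosen = []
    for agent in order:  # permutation i_1, ..., i_n
        best = max(action_spaces[agent],
                   key=lambda a: local_advantage(obs, tuple(chosen), a, agent))
        chosen.append(best)  # agent i_m conditions on a^{i_{1:m-1}}
    return chosen
```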
MAPPO objective
The objective of MAPPO is equivalent to that of PPO, but it equips all agents with one shared set of parameters, i.e. $\theta^i = \theta$ for all $i$, and uses the agents’ combined trajectories for the shared policy’s update.
So the objective for optimizing the policy parameters $\theta$ at iteration $k+1$ is:

$$\theta_{k+1} = \arg\max_{\theta} \; \mathbb{E}_{\mathbf{o} \sim \rho_{\boldsymbol{\pi}_{\theta_k}},\, \mathbf{a} \sim \boldsymbol{\pi}_{\theta_k}} \left[ \frac{1}{n} \sum_{i=1}^{n} \min\!\left( \frac{\pi_{\theta}(a^i \mid o^i)}{\pi_{\theta_k}(a^i \mid o^i)} A_{\boldsymbol{\pi}_{\theta_k}}(\mathbf{o}, \mathbf{a}),\; \operatorname{clip}\!\left( \frac{\pi_{\theta}(a^i \mid o^i)}{\pi_{\theta_k}(a^i \mid o^i)}, 1 \pm \epsilon \right) A_{\boldsymbol{\pi}_{\theta_k}}(\mathbf{o}, \mathbf{a}) \right) \right]$$

The constraint on the joint policy space imposed by shared parameters can lead to an exponentially-worse sub-optimal outcome (details: page 4).
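A minimal PyTorch sketch of this clipped surrogate with a shared policy (tensor names are assumptions; log-probs and advantages come from rollouts of the shared policy):

```python
import torch

def mappo_clip_loss(new_logp, old_logp, advantages, eps=0.2):
    """new_logp, old_logp, advantages: tensors of shape (batch, n_agents)."""
    ratio = torch.exp(new_logp - old_logp)                 # pi_theta / pi_theta_k, per agent
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # maximize the clipped surrogate, averaged over agents and the batch
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```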
HAPPO vs MAPPO
MAPPO is a straightforward application of PPO to multi-agent settings, often forcing all agents to share one set of policy parameters. This can be suboptimal when agents need different behaviors.
HAPPO relaxes this constraint by:
- Sequentially updating each agent in a random permutation $i_{1:n}$, where agent $i_m$ updates its policy using the newly updated policies of agents $i_{1:m-1}$.
- Leveraging the Multi-Agent Advantage Decomposition Theorem, which yields a valid trust-region update for each agent’s local advantage $A_{\boldsymbol{\pi}}^{i_m}\left(\mathbf{o}, \mathbf{a}^{i_{1:m-1}}, a^{i_m}\right)$. Specifically, HAPPO applies the standard PPO (clipped) objective to each agent while treating previously updated agents’ policies as “fixed.” Because the decomposition theorem ensures the local advantage aligns with the global return, any PPO-style step that respects the trust region (i.e., doesn’t deviate too far from the old policy) guarantees a monotonic improvement for the entire multi-agent policy.
- Maintaining distinct policy parameters for truly heterogeneous capabilities.
This ensures a monotonic improvement guarantee for the joint return – something MAPPO’s parameter-sharing approach can’t guarantee. As a result, HAPPO typically performs better than MAPPO in scenarios where agents must learn different skillsets.
However, one drawback of HAPPO is that the agents’ policies have to follow the sequential update scheme (over the permutation), so the updates cannot be run in parallel.
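A minimal sketch of the sequential update loop, assuming hypothetical `Agent` objects (each with its own `policy` and `optimizer`) and a rollout `batch` from a centralized critic; the compounding factor `M` is how the "treat previously updated agents as fixed" correction is commonly implemented:

```python
import random
import torch

def happo_update(agents, batch, eps=0.2):
    order = random.sample(range(len(agents)), len(agents))   # random permutation i_{1:n}
    # M accumulates prod_{j<m} pi^{i_j}_new / pi^{i_j}_old, so each agent's objective
    # accounts for the agents already updated before it in the permutation.
    M = torch.ones_like(batch["advantage"])
    for i in order:
        old_logp = batch["old_logp"][:, i]
        new_logp = agents[i].policy.log_prob(batch["obs"][:, i], batch["actions"][:, i])
        ratio = torch.exp(new_logp - old_logp)
        adv = M * batch["advantage"]
        loss = -torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()
        agents[i].optimizer.zero_grad()
        loss.backward()
        agents[i].optimizer.step()
        # fold agent i's updated (detached) ratio into M for the agents still to come
        with torch.no_grad():
            updated_logp = agents[i].policy.log_prob(batch["obs"][:, i], batch["actions"][:, i])
            M = M * torch.exp(updated_logp - old_logp)
```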
Implementation
Algorithm
Steps:
- Generate encoded joint agent observations: $(\hat{o}^{i_1}, \dots, \hat{o}^{i_n}) = \operatorname{Encoder}(o^{i_1}, \dots, o^{i_n})$
- Generate state-value estimates from the encoded observations: $V(\hat{o}^{i_m})$, via an output head on the encoder
- Autoregressively decode agent actions, while cross-attending the encoded observations: $a^{i_m} \sim \pi\!\left(\cdot \mid \hat{o}^{i_{1:n}}, a^{i_{1:m-1}}\right)$ for $m = 1, \dots, n$
Slap PPO on top… that’s it!
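A minimal PyTorch sketch of the three steps above, using stock transformer modules and made-up sizes (the real architecture is in the linked PKU-MARL repo):

```python
import torch
import torch.nn as nn

n_agents, obs_dim, act_dim, d_model = 3, 16, 5, 64

obs_embed = nn.Linear(obs_dim, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
value_head = nn.Linear(d_model, 1)
act_embed = nn.Linear(act_dim, d_model)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
policy_head = nn.Linear(d_model, act_dim)

obs = torch.randn(1, n_agents, obs_dim)        # joint observation o^{i_1:n}
obs_rep = encoder(obs_embed(obs))              # step 1: encoded observations
values = value_head(obs_rep)                   # step 2: per-agent value estimates

# step 3: autoregressive decoding, cross-attending the encoded observations
actions, prev = [], torch.zeros(1, 1, act_dim)  # start token for agent i_1
for m in range(n_agents):
    tgt = act_embed(prev)                       # embed actions chosen so far
    L = tgt.size(1)
    causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
    h = decoder(tgt, obs_rep, tgt_mask=causal)  # cross-attention to obs_rep
    logits = policy_head(h[:, -1])              # action distribution for agent i_{m+1}
    a = torch.distributions.Categorical(logits=logits).sample()
    actions.append(a)
    onehot = nn.functional.one_hot(a, act_dim).float().unsqueeze(1)
    prev = torch.cat([prev, onehot], dim=1)     # condition the next agent on a^{i_{1:m}}
```

The PPO loss is then applied on top of these per-agent log-probs and values, as in the MAPPO objective above.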