year: 2020
paper: https://arxiv.org/pdf/1912.01603.pdf
website: https://danijar.com/project/dreamer/
code: https://github.com/danijar/dreamer
connections: World Models - Topic, RL, danijar hafner


YK YT

(figure: Dreamer overview with panels (a), (b), (c))

TLDR:

  • Learning to plan from an encoded latent space (previous models already do that, as in (a))
  • This model predicts actions ahead purely in latent space, without seeing further observations (b).
  • In contrast to MuZero, Dreamer does this without MCTS, by learning a one-shot policy that maps (encoded) observation to action, as in (c) (inference).
  • In contrast to World Models - Paper, we don’t learn only from data collected by a random policy, but do the following in a loop (see the sketch after this list):
    • (a) learning the encoding function by sampling from the environment using the policy (initially random)
    • (b) learning a good policy for the environment (in latent space)
    • going back to (a) with the new policy
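
Roughly, that loop looks like this. A minimal sketch; the helper names (collect_episode, update_world_model, update_actor_critic) are hypothetical placeholders, not functions from the official repo.

```python
def collect_episode(env, policy):
    """(a) Gather experience in the real environment with the current policy."""
    return []  # list of (observation, action, reward) tuples

def update_world_model(world_model, dataset):
    """(a) Learn the encoder, latent transition, and reward predictor from replayed data."""
    pass

def update_actor_critic(world_model, policy, value):
    """(b) Learn policy and value purely by imagining rollouts in latent space."""
    pass

def train(env, world_model, policy, value, iterations=1000):
    dataset = collect_episode(env, policy)               # policy starts out (effectively) random
    for _ in range(iterations):
        update_world_model(world_model, dataset)         # (a)
        update_actor_critic(world_model, policy, value)  # (b)
        dataset += collect_episode(env, policy)          # back to (a) with the improved policy
    return policy
```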

(figure: dynamics learning (a) and behaviour learning (b))

In the dynamics learning phase (a), past observations, actions, and rewards from the dataset are given, and we want to learn a representation (latent state), a transition to the next latent state, and a predictor for the reward.
The encoder (blue) can be a CNN and the hidden state (green) can be an LSTM, for example. Any representation-learning method can be used; the example given in the paper/image is an encoder+decoder trained by reconstruction like in a VAE, but as opposed to a VAE, we are interested in the encoder and not the decoder, of course.
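
A rough PyTorch sketch of these pieces, with a deterministic latent for brevity (the paper uses a stochastic RSSM and a KL term, omitted here); all sizes and module choices are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim, latent_dim, batch = 64, 4, 32, 16

encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ELU(), nn.Linear(128, latent_dim))
transition = nn.GRUCell(latent_dim + act_dim, latent_dim)  # next latent from (latent, action)
decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ELU(), nn.Linear(128, obs_dim))
reward_head = nn.Linear(latent_dim, 1)

obs = torch.randn(batch, obs_dim)       # would be images plus a CNN encoder in practice
action = torch.randn(batch, act_dim)
next_obs = torch.randn(batch, obs_dim)
reward = torch.randn(batch, 1)

latent = encoder(obs)
next_latent = transition(torch.cat([latent, action], -1), latent)
loss = (
    F.mse_loss(decoder(next_latent), next_obs)      # reconstruction signal (VAE-like)
    + F.mse_loss(reward_head(next_latent), reward)  # reward prediction
)
loss.backward()  # trains encoder, transition, decoder, and reward head jointly
```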

In the behaviour learning phase (b), we want to learn the action (policy) and the value parameters. We imagine future latent states and rewards while using the policy to choose actions. This is all like regular RL, but backprop now only has to go through the small recurrent latent network: the encoder is used at most once, to embed the real starting observation, and the imagined rollout never leaves latent space (the world-model weights are also held fixed during this phase).
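
A sketch of what such an imagination rollout looks like: only the (frozen) transition and reward head plus the policy are touched, never the encoder/decoder or the real environment. Names and sizes are illustrative.

```python
import torch
import torch.nn as nn

latent_dim, act_dim, horizon, batch = 32, 4, 15, 8
transition = nn.GRUCell(latent_dim + act_dim, latent_dim)
reward_head = nn.Linear(latent_dim, 1)
policy = nn.Sequential(nn.Linear(latent_dim, 64), nn.ELU(), nn.Linear(64, act_dim), nn.Tanh())

for p in list(transition.parameters()) + list(reward_head.parameters()):
    p.requires_grad_(False)               # world model is held fixed in this phase

latent = torch.randn(batch, latent_dim)   # start states: encoded once from real observations
imagined_rewards = []
for _ in range(horizon):
    action = policy(latent)               # the policy acts on latent states, not on pixels
    latent = transition(torch.cat([latent, action], -1), latent)
    imagined_rewards.append(reward_head(latent))

# Maximizing the imagined return backprops through this short latent rollout
# into the policy parameters only.
(-torch.stack(imagined_rewards).sum()).backward()
```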

Here, to learn the policy net, we maximize the value target (the expected total future reward): $\max_\phi \mathrm{E}\big[\sum_{\tau=t}^{t+H} V_\lambda(s_\tau)\big]$.
Here, to learn the value net, we minimize the squared difference between the predicted value and the (ominous) value target: $\min_\psi \mathrm{E}\big[\sum_{\tau=t}^{t+H} \tfrac{1}{2}\lVert v_\psi(s_\tau) - V_\lambda(s_\tau)\rVert^2\big]$.
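
In code, the two updates boil down to something like this toy sketch; `value_target` is a stand-in for the $V_\lambda$ defined below, and in the real model it depends on the policy through the imagined states and rewards.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, horizon, batch = 32, 15, 8
value_net = nn.Sequential(nn.Linear(latent_dim, 64), nn.ELU(), nn.Linear(64, 1))

imagined_latents = torch.randn(horizon, batch, latent_dim, requires_grad=True)
value_target = imagined_latents.sum(-1, keepdim=True)   # placeholder for V_lambda(s_tau)

actor_loss = -value_target.mean()                       # policy: push the value target up
critic_loss = F.mse_loss(value_net(imagined_latents), value_target.detach())  # value net: regress onto it
(actor_loss + critic_loss).backward()
```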

value target: (workhorse of this paper, according to YK)

  • many environments have far more steps than the couple of dozen (or few hundred) an LSTM could reasonably backprop through
  • the value target is the main component for extending this effective time range beyond the imagined horizon (see the definition below)
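
Concretely, the value target from the paper (eq. 6), as far as I can tell, is an exponentially weighted mixture of k-step value estimates, each bootstrapped with the learned value net:

$$
V_N^k(s_\tau) = \mathrm{E}\Big[\sum_{n=\tau}^{h-1} \gamma^{\,n-\tau} r_n \;+\; \gamma^{\,h-\tau} v_\psi(s_h)\Big],
\qquad h = \min(\tau + k,\; t + H)
$$
$$
V_\lambda(s_\tau) = (1-\lambda) \sum_{n=1}^{H-1} \lambda^{\,n-1}\, V_N^n(s_\tau) \;+\; \lambda^{\,H-1}\, V_N^H(s_\tau)
$$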

Finish / Update Note

Explanation left as exercise to the reader.
Seems similar to bootstrapping. Too tired, heading out now. TODO
Apparently:

This is called TD(λ), which is essentially a mixture of the 1-step lookahead, 2-step lookahead, …, H-step lookahead returns, where H is the horizon; the exact same method is described in Sutton & Barto (eq. 7.6).
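
A hand-rolled sketch of that mixture, computed backwards over an imagined trajectory with the usual recursion (variable names are mine, not the repo's):

```python
def lambda_returns(rewards, values, bootstrap, gamma=0.99, lam=0.95):
    """rewards[t], values[t] for t = 0..H-1; bootstrap = value estimate at step H."""
    returns = [0.0] * len(rewards)
    next_return = bootstrap   # the H-step tail is the bootstrapped value estimate
    next_value = bootstrap
    for t in reversed(range(len(rewards))):
        # Mix the 1-step target with the longer-horizon return, weighted by lambda.
        returns[t] = rewards[t] + gamma * ((1 - lam) * next_value + lam * next_return)
        next_return = returns[t]
        next_value = values[t]
    return returns

# Example: 3 imagined steps with constant reward 1 and value estimates of 0.
print(lambda_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], bootstrap=0.0))
```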

World Models - Topic