RL is useful if you want to adjust your world model while you are operating in the real world. The way to adjust your world model for the situation at hand is to explore parts of the space where the world model is inaccurate → curiosity / play.

[Figure: backup diagrams for MC, TD, and DP]

The picture shows the value propagation (“backup”) of a single update step for the three methods: Monte Carlo (MC), temporal-difference (TD) learning, and dynamic programming (DP).

MC methods learn from complete episodes, using actual returns to update value estimates. They’re unbiased (as long as there is exploration) but have high variance, so they need more samples to reach an accurate estimate, which makes them a bad fit for expensive environments; truncated returns can also work.

TD bootstraps from existing value estimates after single steps, combining actual rewards with predicted future values. This introduces bias but reduces variance and enables online learning.

DP methods use a full model of the environment to consider all possible next states simultaneously, computing expected values directly. This is the most sample-efficient approach (no sampling is needed to approximate values, at the cost of exhaustive sweeps), but it requires a complete model of the environment and becomes impractical for large state spaces even with a perfect model (curse of dimensionality).
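
To make the contrast concrete, here is a minimal tabular sketch of the three update rules (illustrative only; `V`, `alpha`, `gamma`, and the model dicts `P`/`R` are hypothetical names, not from any particular library):

```python
# Illustrative tabular value updates; V is a dict mapping states to value estimates.

def mc_update(V, episode, alpha=0.1, gamma=0.99):
    """Monte Carlo: wait for the full episode, then move toward the actual return."""
    G = 0.0
    for state, reward in reversed(episode):   # episode = [(s_t, r_{t+1}), ...]
        G = reward + gamma * G                # actual return from this state onward
        V[state] += alpha * (G - V[state])    # unbiased target, high variance

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """TD(0): update after a single step, bootstrapping from the current estimate."""
    target = reward + gamma * V[next_state]   # biased (uses V itself), lower variance
    V[state] += alpha * (target - V[state])

def dp_update(V, state, P, R, gamma=0.99):
    """DP expected update: needs the model, P[state] = {next_state: prob}, R[state] = expected reward."""
    V[state] = R[state] + gamma * sum(p * V[s_next] for s_next, p in P[state].items())
```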

[Figure: MC (left) vs. TD (right) backups]

Types of policies

Deterministic policy: $a_t = \mu_\theta(s_t)$

Stochastic policy: $a_t \sim \pi_\theta(\cdot \mid s_t)$

Two major types are categorical policies for discrete action spaces and diagonal Gaussian policies for continuous action spaces.

Categorical Policy

A categorical policy is just a classifier over discrete actions.
So, as usual, you have feature extraction → logits → softmax → sample an action.
Get the log-likelihood for an action $a$ at state $s$ by indexing into the log of the output probability vector: $\log \pi_\theta(a \mid s) = \log [P_\theta(s)]_a$.
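
As a minimal sketch (assuming PyTorch; the network sizes and names here are placeholders, not from a specific codebase):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Hypothetical sizes: 8-dim observation, 4 discrete actions.
logits_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4))

def act(obs):
    logits = logits_net(obs)               # feature extraction -> logits
    dist = Categorical(logits=logits)      # softmax happens inside Categorical
    action = dist.sample()                 # sample an action
    return action, dist.log_prob(action)   # log pi(a|s), i.e. indexing the log-softmax output

obs = torch.randn(8)
action, logp = act(obs)
```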

Diagonal Gaussian Policy

Diagonal Gaussian policies map from observations to mean actions $\mu_\theta(s)$.
There are two common ways to represent the covariance matrix:

A parameter vector of log standard deviations $\log \sigma$, which is not a function of state.

Network layers mapping from states to log standard deviations $\log \sigma_\theta(s)$, which may share parameters with $\mu_\theta(s)$.

We can then just sample from the distribution to get an action.

Log standard deviations are used so we don’t have to constrain the network output to be nonnegative; we can simply exponentiate the log outputs to obtain $\sigma$, without losing anything.
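
A minimal sketch of the first variant (state-independent log stds), again assuming PyTorch with placeholder sizes and names:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

# Hypothetical sizes: 8-dim observation, 2-dim continuous action.
mu_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))
log_std = nn.Parameter(-0.5 * torch.ones(2))   # log stds as free parameters, not a function of state

def act(obs):
    mu = mu_net(obs)                        # mean action mu_theta(s)
    std = torch.exp(log_std)                # exponentiating keeps std positive automatically
    dist = Normal(mu, std)                  # diagonal Gaussian = independent Normal per action dim
    action = dist.sample()
    return action, dist.log_prob(action).sum(-1)  # sum log-probs over action dimensions

obs = torch.randn(8)
action, logp = act(obs)
```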


While IRL organisms receive intrinsic rewards, in contemporary RL this is turned on its head: the agent receives extrinsic rewards from the environment, which tell it what’s good and what’s bad. If we want to create agentic, open-ended AI / ALIFE, we should think about how to make agents intrinsically motivated.

Best paper I’ve read all year (2024) on this: Embracing curiosity eliminates the exploration-exploitation dilemma.

Messy stuff below.


Difficulties in RL with data: 1

  • non-stationary (the data distribution changes as the policy changes)
  • depends on your model (the policy generates its own training data)
  • credit assignment (unclear which actions were good)

Stuff that makes or breaks training runs in practice:
  • keeping state/state-action values in a reasonable range (equation 3.10 in Sutton & Barto can help you upper-bound the magnitudes; an upper bound on the order of 1e1 usually works for me)
  • balancing exploration and exploitation right
  • effective learning rates - the value of n in n-step returns (5 to 20 is usually good; see the sketch below)
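
For reference, a minimal sketch of the n-step return from that last point (the function and argument names are just illustrative):

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """n-step return: n discounted rewards plus a bootstrapped tail value V(s_{t+n})."""
    G = bootstrap_value
    for r in reversed(rewards):        # rewards = [r_{t+1}, ..., r_{t+n}]
        G = r + gamma * G
    return G

# e.g. a 5-step return with rewards [1, 0, 0, 1, 0] and V(s_{t+5}) = 2.0
G5 = n_step_return([1, 0, 0, 1, 0], 2.0)
```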

Danijar Hafner (YT):
Deep Hierarchical Planning from Pixels
Dream to Control: Learning Behaviors by Latent Imagination
v2 (also talks about the code and more!)


Introduction to RL

Deep Reinforcement Learning: Pong from Pixels
Lessons Learned Reproducing a Deep Reinforcement Learning Paper
https://lilianweng.github.io/posts/2018-02-19-rl-overview/

Types of RL

https://www.reddit.com/r/reinforcementlearning/comments/utnhia/what_is_offline_reinforcement_learning/?rdt=45931
Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

online learning
off-policy
on-policy

Resources

Intro

Code

The stable-baselines blog is good (it only covers policy gradients), but papers are the best resource.
Policy gradients: https://stable-baselines3.readthedocs.io/en/master/guide/algos.html; also SSL and data augmentation along with policy gradients, Dreamer and other world models, Decision Transformer, sim2real, and many more.
The book got boring after some chapters, so I just implemented papers from this blog and its references: https://lilianweng.github.io/posts/2018-02-19-rl-overview/

Environments

CRAFTAX

Evals

bsuite (Google DeepMind) - gym-compatible?

Footnotes

  1. George Hotz | Researching | RL is dumb and doesn’t work (theory) | Reinforcement Learning | Part 3