jurgen-sucess-story-algorithm.mp4 via X
Banger slides: https://people.idsia.ch/~juergen/ssa/sld001.htm


Motivation

  • RL with a single lifelong trial
  • Most RL trains with the assumption of repeatable trials, reset every time
  • Humans are not (hard) reset, can discover patterns across trials
  • General RL must assume a single life-long trial, which may or may not be divisible into sub-trials
  • Must explicitly take into account that policy changes in early stages of life may affect properties of later policy changes and later trials near the end of the life (recursively: … → meta meta learning → meta learning → learning)
  • The policy should, in principle, be able to execute any (meta-)learning algorithm
  • To do that, include actions that the policy itself can trigger and that explicitly change the policy
  • You need a framework that tracks which self-modifications were useful for obtaining more reward per time, so you can recursively establish a chain of self-modifications, where each outperforms all previous ones.
  • Modifications include the modification strategy itself.
  • Beneficial mutations are statistically more likely to stick around, bad ones get reverted, even if it takes a long time to discover past mistakes. Greedy “reward acceleration”.
  • Credit assignment is much coarser than in the traditional RL sense: it never asks which modification was the good one, only “has everything that happened since the checkpoint been net positive?” (see the sketch after this list)
  • No structural assumptions needed (states, transitions, rewards, markov property/stationarity).
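A minimal sketch of that bookkeeping, under stated assumptions (the names and structure are mine, not Schmidhuber’s pseudocode): the policy registers an undo closure for each self-modification, and each checkpoint enforces the success-story criterion, i.e. reward per time measured since each surviving modification must strictly increase along the stack, otherwise modifications are reverted newest-first.

```python
# A sketch of SSA's stack-based bookkeeping (my reading of the idea,
# not Schmidhuber's original pseudocode). Time and reward units are
# abstract; assumes checkpoints happen at t > 0 and strictly after
# the modifications they judge.

class SSA:
    def __init__(self):
        self.t = 0.0    # lifetime clock (one single life-long trial)
        self.R = 0.0    # cumulative reward so far
        # Stack of still-valid self-modifications:
        # (time it happened, cumulative reward back then, undo closure).
        self.stack = []

    def step(self, dt, reward):
        """Advance the lifetime clock and accumulate reward."""
        self.t += dt
        self.R += reward

    def record_modification(self, undo):
        """Called by the policy itself when it self-modifies; `undo`
        restores the policy to its state just before the change."""
        self.stack.append((self.t, self.R, undo))

    def _ssc_holds(self):
        """Success-story criterion: reward per time since birth, since
        the 1st surviving modification, since the 2nd, ... must be
        strictly increasing, evaluated at the current time."""
        rates = [self.R / self.t] + [
            (self.R - R_i) / (self.t - t_i) for t_i, R_i, _ in self.stack
        ]
        return all(a < b for a, b in zip(rates, rates[1:]))

    def checkpoint(self):
        """Undo modifications (newest first) until every survivor has
        been followed by faster reward intake than all predecessors:
        the recursive chain of 'success stories'."""
        while self.stack and not self._ssc_holds():
            _, _, undo = self.stack.pop()
            undo()  # revert the policy; the modification never happened
```

Note how coarse the credit assignment is: checkpoint() never asks which instruction inside a surviving modification did the work, and because modifications are themselves triggered by policy actions, the modification strategy sits on the same stack as everything else.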

Combine SSA with population-based methods?

It’s similar to evolutionary search, but more general/principled, in that it is self-referential — it can modify any part of itself, including the modification strategy.
It’s greedy local search with (a conservative) undo, so it’s susceptible to the same issues that population-based/quality-diversity algorithms address. Could combine? SSA-style lifetime learning with population-based updates of the… yeah, never mind: this is the DGM! See below.

But first a small detour:

… how did Schmidhuber develop the idea?

  1. SSA (1994-97): Greedy self-improvement with undo. Works but limited by the instruction set and the conservatism of “revert if not better.”
  2. Richer instruction sets: don’t change SSA itself, just give it more powerful primitives to work with. If the primitives include actual learning algorithms (not just simple parameter tweaks), SSA can compose and select among them. The “Shifting Inductive Bias” paper (1997) combines SSA with adaptive Levin search to get systematic exploration instead of random mutations (Levin search is sketched after this list).
  3. OOPS (2004, Optimal Ordered Problem Solver): takes the “search more systematically” idea further. Uses optimal universal search to find provably fastest programs for solving sequences of tasks, reusing solutions from earlier tasks to speed up later ones. Still empirical evaluation, but with optimality guarantees on the search process itself.
  4. Gödel Machine (2003-2009): the logical endpoint. Replaces empirical “did reward improve?” with formal proof search. The machine only self-modifies when it can prove the modification will improve its expected future performance (the target theorem is sketched after this list). Provably optimal in the strongest possible sense: no other self-improving algorithm can do better, given the same axiom system. In practice, finding such proofs is astronomically hard, so it’s more of a theoretical ceiling than a practical system.
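The systematic-exploration ingredient in step 2 is Levin search. A minimal sketch under stated assumptions: `run(prog, max_steps)` is a hypothetical interpreter that returns the program’s output or None on timeout, and `is_solution` is a task-specific test; neither is from the original papers. The essential part is the schedule: in phase i, each program p with l(p) ≤ i gets 2^(i − l(p)) steps, so total effort per program is roughly proportional to 2^(−l(p)) and short programs are tried exponentially harder.

```python
from itertools import product

def levin_search(run, is_solution, alphabet="01"):
    """Universal (Levin) search: enumerate programs in phases, giving
    shorter programs exponentially larger time budgets. `run` and
    `is_solution` are assumed, task-specific callables."""
    phase = 1
    while True:
        for length in range(1, phase + 1):
            budget = 2 ** (phase - length)  # 2**(phase - l(p)) steps
            for symbols in product(alphabet, repeat=length):
                prog = "".join(symbols)
                out = run(prog, max_steps=budget)
                if out is not None and is_solution(out):
                    return prog
        phase += 1
```

OOPS (step 3) keeps this kind of scheduling but, on each new task, spends half its time on programs extending the solver frozen after earlier tasks and half on fresh programs, which is how earlier solutions speed up later ones.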
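And for step 4, the shape of the Gödel machine’s target theorem, paraphrased from memory of Schmidhuber’s formulation (my notation, approximate): utility is expected remaining lifetime reward, and a proposed self-rewrite (switchprog) executes only once the built-in proof searcher proves it beats keeping the current machine.

```latex
% Utility of machine state s in environment Env: expected reward over
% the remaining lifetime (T = end of life, \mu = prior over Env).
u(s, \mathrm{Env}) = \mathbb{E}_{\mu}\!\left[\ \sum_{\tau = \mathrm{now}}^{T} r(\tau)\ \middle|\ s, \mathrm{Env} \right]

% Target theorem (schematic): execute the self-rewrite iff provable that
u(s_{\mathrm{switch}}, \mathrm{Env}) > u(s_{\mathrm{keep}}, \mathrm{Env})
```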

Empirical selection (SSA) → systematic search (OOPS) → provable self-improvement (Gödel machine). Each step trades practical tractability for stronger theoretical guarantees.

In comes… the Darwin Gödel Machine, which replaces the provably optimal with the empirical but practical.


Jürgen Schmidhuber