jurgen-sucess-story-algorithm.mp4 via X
Banger slides: https://people.idsia.ch/~juergen/ssa/sld001.htm
Motivation
- RL with a single lifelong trial
- Most RL trains with the assumption of repeatable trials, reset every time
- Humans are not (hard) reset, can discover patterns across trials
- General RL must assume a single life-long trial, which may or may not be divisible into sub-trials
- Must explicitly take into account that policy changes in early stages of life may affect properties of later policy changes and later trials near the end of the life (recursively: … → meta meta learning → meta learning → learning)
- The policy should in principle be able to execute any (meta-)learning algorithm
- To do that, you have to include actions that can be triggered by the policy itself, which change the policy (explicitly).
- You need a framework that tracks which self-modifications were useful for obtaining more reward per time, so you can recursively establish a chain of self-modifications, where each outperforms all previous ones.
Success-Story Algorithm
The learner lives from time 0 to some unknown time T, executing actions that may modify its own policy (a self-modifying policy, SMP). Policy modifications are stored on a stack so they can be undone. Modifications are made by primitive learning algorithms (PLAs): actions in the instruction set that change the policy. What the PLAs are is unspecified: could be “flip a weight,” could be “run Q-learning,” could be an LLM rewriting the codebase (→ DGM). The richer the PLAs, the more powerful the self-modifications SSA can evaluate.
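For concreteness, a toy rendition of an SMP whose instruction set contains a PLA. All class and method names here are my own invention, not the paper’s: the policy is a probability table over actions, and one instruction, `inc_prob`, modifies that table and hands back an undo record for the stack.

```python
import random

class SelfModifyingPolicy:
    """Toy SMP: a stochastic policy whose own instruction set can edit it."""

    def __init__(self, n_cells, n_actions):
        # table[i][a] = probability of picking action a in policy cell i
        self.table = [[1.0 / n_actions] * n_actions for _ in range(n_cells)]

    def act(self, cell, rng):
        """Sample an action from the current (possibly self-modified) policy."""
        probs = self.table[cell]
        return rng.choices(range(len(probs)), weights=probs)[0]

    def inc_prob(self, cell, action, factor=1.2):
        """A PLA: boost one action's probability and renormalize.
        Returns an undo record so SSA can revert this modification later."""
        old_row = list(self.table[cell])
        self.table[cell][action] *= factor
        total = sum(self.table[cell])
        self.table[cell] = [p / total for p in self.table[cell]]
        return (cell, old_row)

    def undo(self, record):
        """Revert a modification using its undo record."""
        cell, old_row = record
        self.table[cell] = old_row
```

Note that `inc_prob` is itself an action the policy can choose, which is what makes the policy self-referential: learning how to learn is just more behavior.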
Checkpoints partition the learner’s lifetime. Between checkpoints, the policy is protected from evaluation. When to set checkpoints is itself determined by the policy (so the system learns when to evaluate itself).
V(t): time-varying set of surviving checkpoints (the “success history”). v_i: the i-th element of V(t). R(t): cumulative reward up to time t.
Success-Story Criterion (SSC) — satisfied at time t if V(t) is empty or:

R(t)/t < (R(t) − R(v_1))/(t − v_1) < (R(t) − R(v_2))/(t − v_2) < … < (R(t) − R(v_k))/(t − v_k)
Each surviving checkpoint must mark the start of a reward rate acceleration (not just total reward increase). The time consumed by modifications counts against them.
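A minimal sketch of that check (a hypothetical helper of my own, assuming surviving checkpoints are stored as (v_i, R(v_i)) pairs, oldest first):

```python
def ssc_satisfied(t, R_t, history):
    """Check the Success-Story Criterion at time t.

    history: list of (v_i, R_vi) pairs for surviving checkpoints, oldest
    first. R_t: cumulative reward up to time t (assumes t > 0).
    SSC holds iff the reward rate measured from each checkpoint to now
    strictly increases as the checkpoints get more recent:
        R(t)/t < (R(t)-R(v_1))/(t-v_1) < ... < (R(t)-R(v_k))/(t-v_k)
    """
    rates = [R_t / t]  # baseline: reward rate over the whole life so far
    for v_i, R_vi in history:
        rates.append((R_t - R_vi) / (t - v_i))
    return all(a < b for a, b in zip(rates, rates[1:]))
```

An empty history is trivially satisfied, matching the “V(t) is empty” clause of the criterion.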
SSA, invoked at every checkpoint:
- While SSC not satisfied: undo all policy modifications since the most recent checkpoint in V(t), remove that checkpoint from V(t).
- Add the current checkpoint to V(t).
The while loop can undo multiple epochs in a single call, cascading back through the stack until what remains is a consistent history of accelerations.
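The checkpoint loop can be sketched as a runnable toy (my own rendition, not the paper’s code): the policy is a dict, every modification pushes an undo record, and each checkpoint remembers the stack depth at its creation, so popping a checkpoint rolls back everything made after it.

```python
class SSALearner:
    """Toy Success-Story Algorithm: stack of undoable policy modifications
    plus a history of surviving checkpoints."""

    def __init__(self):
        self.policy = {}
        self.t = 0.0
        self.R = 0.0       # cumulative reward
        self.history = []  # surviving checkpoints: (v_i, R(v_i), stack depth)
        self.stack = []    # undo records: (key, previous value or None)

    def modify(self, key, new_value):
        """A primitive learning algorithm: change one policy entry,
        remembering how to undo it."""
        self.stack.append((key, self.policy.get(key)))
        self.policy[key] = new_value

    def _ssc_holds(self):
        # Reward rates must strictly increase from oldest to newest checkpoint.
        rates = [self.R / self.t]
        for v, R_v, _ in self.history:
            rates.append((self.R - R_v) / (self.t - v))
        return all(a < b for a, b in zip(rates, rates[1:]))

    def ssa_call(self):
        """Invoke SSA at a checkpoint (assumes self.t > 0): pop checkpoints,
        undoing their modifications, until SSC holds; then record the
        current time as a new surviving checkpoint."""
        while self.history and not self._ssc_holds():
            _, _, depth = self.history.pop()
            while len(self.stack) > depth:  # undo everything since that checkpoint
                key, old = self.stack.pop()
                if old is None:
                    self.policy.pop(key, None)
                else:
                    self.policy[key] = old
        self.history.append((self.t, self.R, len(self.stack)))
```

One SSA call can cascade through several checkpoints: each pop rewinds the stack to that checkpoint’s recorded depth, and the while loop keeps popping until the remaining history is a consistent chain of accelerations.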
Checkpoint timing is learned. Reward delays are unknown, so there’s no a priori good time to evaluate. Since checkpoint timing depends on SMP, the policy learns when to evaluate itself.
No modification is safe forever. SSC re-evaluates every surviving checkpoint against the current time at every SSA call. If modifications after some old checkpoint v_i go badly, the rate from v_i to now drops, SSC breaks, and the while loop reverts everything since v_i, removes it, then checks v_{i−1}, and so on. Old checkpoints aren’t protected by age; they’re continuously re-tested. What age buys: the rate is averaged over more data, so a lucky fluke is less likely to survive many SSA calls.
- Modifications include the modification strategy itself.
- Beneficial mutations are statistically more likely to stick around, bad ones get reverted, even if it takes a long time to discover past mistakes. Greedy “reward acceleration”.
- Credit assignment is much coarser than in the traditional RL sense: it never asks which modification was the good one, only “was everything that happened since the checkpoint net positive?”
- No structural assumptions needed (states, transitions, rewards, markov property/stationarity).
Combine SSA with population-based methods?
It’s similar to evolutionary search, but more general/principled, in that it is self-referential — it can modify any part of itself, including the modification strategy.
It’s greedy local search with (a conservative) undo – so it’s susceptible to the same issues … that population-based/quality-diversity algorithms address. Could combine? SSA-style lifetime learning… population-based updates of the — yeah nvm. this is the DGM! See below.
But first a small detour:
… how did Schmidhuber develop the idea?
- SSA (1994-97): Greedy self-improvement with undo. Works but limited by the instruction set and the conservatism of “revert if not better.”
- Richer instruction sets: don’t change SSA itself, just give it more powerful primitives to work with. If your primitive learning algorithms include actual learning algorithms (not just simple parameter tweaks), SSA can compose and select among them. The “Shifting Inductive Bias” paper (1997) combines SSA with adaptive Levin search to get systematic exploration instead of random mutations.
- OOPS (2004, Optimal Ordered Problem Solver): takes the “search more systematically” idea further. Uses optimal universal search to find provably fastest programs for solving sequences of tasks, reusing solutions from earlier tasks to speed up later ones. Still empirical evaluation, but with optimality guarantees on the search process itself.
- Gödel Machine (2003-2009): the logical endpoint. Replaces empirical “did reward improve?” with formal proof search. The machine only self-modifies when it can prove the modification will improve its expected future performance. Provably optimal in the strongest possible sense: no other self-improving algorithm can do better, given the same axiom system. In practice, finding such proofs is astronomically hard, so it’s more of a theoretical ceiling than a practical system.
Empirical selection (SSA) → systematic search (OOPS) → provable self-improvement (Gödel machine). Each step trades practical tractability for stronger theoretical guarantees.
In comes… the Darwin Gödel Machine … replaces the provably optimal with the empirical but practical.