general value functions

→ general_value_functions_sutton_slides.pdf ←

GVFs haven't clicked yet — build this note up properly

From the $(π, γ, c)$ triple, show the main value function, subproblem values, and option models as instances of one prediction type; what the off-policy learning algorithms (GTD, emphatic TD, retrace) buy; Horde (one behavior stream, many predictions in parallel). Then fix the “anything that can be learned can also be planned” gloss in oak architecture.

$$V^{\pi, \gamma, c}(s) = \mathbb{E}_{\pi}\left[\, \sum_{k=0}^{\infty} \left( \prod_{j=0}^{k} \gamma(S_{t+j}) \right) c(S_{t+k+1}) \;\middle|\; S_t = s \right]$$
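To make the indexing concrete, here is a minimal sketch that evaluates the quantity inside the expectation for one sampled, finite trajectory. The function name `gvf_return` and the toy cumulant and termination functions are made up for illustration. It follows this note's indexing, where the product starts at $j = 0$, so even the first cumulant is weighted by $γ(S_t)$; some presentations start the product at $j = 1$ instead, which would change only that first factor.

```python
from typing import Callable, Sequence


def gvf_return(
    states: Sequence[int],
    cumulant: Callable[[int], float],
    gamma: Callable[[int], float],
) -> float:
    """Sampled GVF return from states[0], truncated at the end of the trajectory.

    states[k] plays the role of S_{t+k}. Each term is
    (prod_{j=0..k} gamma(S_{t+j})) * c(S_{t+k+1}).
    """
    total = 0.0
    discount = 1.0  # running product of gamma(S_{t+j}) for j = 0..k
    for k in range(len(states) - 1):
        discount *= gamma(states[k])
        if discount == 0.0:
            break  # gamma hit 0: every later term is zero as well
        total += discount * cumulant(states[k + 1])
    return total


# Constant discount, cumulant fires only on reaching state 3.
reach_3 = gvf_return(
    [0, 1, 2, 3],
    cumulant=lambda s: 1.0 if s == 3 else 0.0,
    gamma=lambda s: 0.9,
)

# State-dependent termination: gamma is 0 in state 2, so the sum stops there
# even though the trajectory itself keeps going.
count_until_2 = gvf_return(
    [0, 1, 2, 3, 4],
    cumulant=lambda s: 1.0,
    gamma=lambda s: 0.0 if s == 2 else 0.9,
)
```

The second call shows the point of a state-dependent $γ$: the prediction is cut off at state 2 while the underlying trajectory continues.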

Cumulant $c$ … the signal being accumulated and predicted; it replaces the reward in a traditional value function. It can be anything measurable: sensor readings, state features, auxiliary rewards, or even the outputs of other value functions.
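Policy $π$ … the target policy the prediction is about: what would happen to the cumulant if actions were chosen by $π$. It can differ from the behavior policy that actually generates the data, which is what makes the prediction off-policy.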
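Termination function $γ$ … a state-dependent discount factor with $γ(s) ∈ [0, 1]$. Setting $γ(s) = 0$ ends the prediction at $s$ without ending the behavior stream, so it models episodic boundaries more flexibly than a constant discount.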
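A rough sketch of the Horde idea with these three ingredients: one behavior stream, several $(π, γ, c)$ predictions updated in parallel. Everything here is an assumption for illustration: the five-state chain, the `GVF` class, and the helper names. The learner is plain tabular TD(0) with an importance-sampling ratio, swapped in for simplicity; it is not GTD, emphatic TD, or Retrace. The TD target uses the one-step form implied by this note's equation, $V(s) = γ(s)\, \mathbb{E}_π[c(S') + V(S')]$.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List

N_STATES = 5
LEFT, RIGHT = 0, 1


def step(s: int, a: int) -> int:
    """Toy deterministic chain with walls at both ends."""
    return max(0, s - 1) if a == LEFT else min(N_STATES - 1, s + 1)


@dataclass
class GVF:
    target_policy: Callable[[int, int], float]  # pi(a | s)
    cumulant: Callable[[int], float]            # c(s')
    gamma: Callable[[int], float]               # gamma(s)
    values: List[float] = field(default_factory=lambda: [0.0] * N_STATES)

    def update(self, s: int, a: int, s_next: int,
               behavior_prob: float, alpha: float = 0.1) -> None:
        # Importance ratio reweights behavior data toward the target policy.
        rho = self.target_policy(s, a) / behavior_prob
        target = self.gamma(s) * (self.cumulant(s_next) + self.values[s_next])
        self.values[s] += alpha * rho * (target - self.values[s])


def make_goal_gvf(goal: int, action: int) -> GVF:
    """'If I always took `action`, how soon (discounted) would I reach `goal`?'"""
    return GVF(
        target_policy=lambda s, a: 1.0 if a == action else 0.0,
        cumulant=lambda s_next: 1.0 if s_next == goal else 0.0,
        gamma=lambda s: 0.0 if s == goal else 0.9,  # terminate at the goal
    )


def run_horde(n_steps: int = 20000, seed: int = 0) -> List[GVF]:
    rng = random.Random(seed)
    horde = [make_goal_gvf(N_STATES - 1, RIGHT), make_goal_gvf(0, LEFT)]
    s = N_STATES // 2
    for _ in range(n_steps):
        a = rng.choice((LEFT, RIGHT))  # behavior policy: uniform random
        s_next = step(s, a)
        for gvf in horde:              # every GVF learns from the same step
            gvf.update(s, a, s_next, behavior_prob=0.5)
        s = s_next                     # the stream never resets
    return horde


horde = run_horde()
```

Neither target policy is ever followed, yet both predictions are learned from the same random walk. Under these assumptions the "always right" values should approach `[0.6561, 0.729, 0.81, 0.9, 0.0]`, and the "always left" values the mirror image.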

References

value function
curiosity
off-policy
prediction

reinforcement learning

http://incompleteideas.net/Talks/luganoreduced.pdf

Fun comparison in explanation between 3.5 Sonnet (I think) and 4.0 Opus:
https://claude.ai/chat/96e338c0-ba8b-4e53-aaac-ce4b5abc2b54
https://claude.ai/chat/f5b96432-c579-48ac-b8b1-a0c60d7cbcfb
