→ general_value_functions_sutton_slides.pdf ←
Cumulant … the signal being predicted; it replaces the reward signal in traditional value functions. It could be anything measurable: sensor readings, state features, auxiliary rewards, or even the outputs of other value functions.
Policy … the target policy whose consequences are being predicted. It can differ from the behavior policy that actually generates the data, in which case the prediction is learned off-policy.
Termination function … a state-dependent discount factor γ(s) that can model episodic boundaries more flexibly than a constant discount: γ(s) = 0 ends the prediction in state s, while γ(s) = 1 continues it undiscounted.
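To make the three pieces concrete, here is a minimal tabular sketch, not taken from the slides. It assumes the GVF return is defined recursively as G_t = C_{t+1} + γ(S_{t+1}) · G_{t+1}, and that the trajectory was generated by the target policy itself (the on-policy case; the off-policy case would additionally need importance-sampling corrections). The corridor, the names `gvf_returns` and `td0_update`, and the "steps until the wall" question are all made up for illustration.

```python
from typing import Callable, Dict, List, Sequence

State = int


def gvf_returns(
    states: Sequence[State],
    cumulant: Callable[[State], float],
    continuation: Callable[[State], float],
) -> List[float]:
    """Compute G_t = C_{t+1} + gamma(S_{t+1}) * G_{t+1} for each step.

    Assumption: the return after the final state is 0, which is exact
    only when the continuation is 0 at that final state.
    """
    returns = [0.0] * len(states)
    g = 0.0
    # Work backwards so each return reuses the one that follows it.
    for t in range(len(states) - 2, -1, -1):
        nxt = states[t + 1]
        g = cumulant(nxt) + continuation(nxt) * g
        returns[t] = g
    return returns


def td0_update(
    values: Dict[State, float],
    s: State,
    s_next: State,
    cumulant: Callable[[State], float],
    continuation: Callable[[State], float],
    alpha: float = 0.5,
) -> None:
    """One on-policy TD(0) step towards C + gamma(s') * V(s')."""
    v_s = values.get(s, 0.0)
    target = cumulant(s_next) + continuation(s_next) * values.get(s_next, 0.0)
    values[s] = v_s + alpha * (target - v_s)


# Hypothetical corridor: states 0..4, with a wall at state 4.
WALL = 4
trajectory = [0, 1, 2, 3, 4]  # the target policy always moves right


def steps_cumulant(s: State) -> float:
    return 1.0  # count one unit per transition


def wall_termination(s: State) -> float:
    return 0.0 if s == WALL else 1.0  # the prediction ends at the wall


# Question asked by this GVF: "how many steps until I hit the wall?"
monte_carlo = gvf_returns(trajectory, steps_cumulant, wall_termination)

values: Dict[State, float] = {}
for _ in range(200):  # replay the same trajectory many times
    for s, s_next in zip(trajectory, trajectory[1:]):
        td0_update(values, s, s_next, steps_cumulant, wall_termination)
```

Swapping in a different cumulant (say, a sensor reading) or a different termination function changes the question being asked without touching the learning rule, which is the point of the GVF formulation.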
References
value function
curiosity
off-policy
prediction
http://incompleteideas.net/Talks/luganoreduced.pdf
Fun comparison in explanation between 3.5 Sonnet (I think) and 4.0 Opus:
https://claude.ai/chat/96e338c0-ba8b-4e53-aaac-ce4b5abc2b54
https://claude.ai/chat/f5b96432-c579-48ac-b8b1-a0c60d7cbcfb