Value function

Through this recursive definition (Bellman Equation), we can break problems up into sub-problems which we can optimize locally and still have an optimal solution (dynamic programming):
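One standard form, assuming a discount factor $\gamma$ and a known transition model $p(s', r \mid s, a)$, is the Bellman optimality equation for state values:

$$ V^{*}(s) \;=\; \max_{a} \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, V^{*}(s') \,\bigr] $$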

So when we know either the value function or the optimal policy, we can break an optimization problem up into a sequence of steps, like decomposing a chess game into a series of next-step predictions; we then only need to learn to assign values to states.
In dynamic programming (where the model of the environment is fully known), we keep track of which sub-problems have already been solved and recursively fill up a table of sub-problems with their exact values, whereas in chess we (need to) approximate values with neural networks.
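A minimal sketch of that tabular dynamic-programming idea, assuming a small, fully known MDP given as arrays of transition probabilities and expected rewards (all names and shapes here are illustrative):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Tabular dynamic programming on a fully known MDP.

    P: transition probabilities, shape (S, A, S), with P[s, a, s'].
    R: expected immediate rewards, shape (S, A).
    Returns optimal state values and a greedy policy.
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)  # the table of sub-problem values
    while True:
        # Bellman optimality backup: Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V, Q.argmax(axis=1)
```

In the chess case the table `V` would not fit in memory, so the exact backups are replaced by a neural network that approximates the values.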

Credit assignment problem

The typical problem formulation in reinforcement learning is to maximize the expected total reward of a policy. A key source of difficulty is the long time delay between actions and their positive or negative effect on rewards; this issue is called the credit assignment problem in the reinforcement learning literature (Minsky, 1961; Sutton & Barto, 1998), and the distal reward problem in the behavioral literature (Hull, 1943).

Value functions offer an elegant solution to the credit assignment problem – they allow us to estimate the goodness of an action before the delayed reward arrives.
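For example, with an estimated value function each action can be scored as soon as the next state is observed, via the temporal-difference error, instead of waiting for the episode's final reward. A minimal sketch (the tabular TD(0) update shown here is a standard illustration, not taken from the quoted paper):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) step on a tabular value estimate V (dict or array).

    The TD error scores the just-taken action immediately: reward observed
    now plus the discounted value of the next state, minus the value of the
    state we were in. V[s] is then nudged toward that bootstrapped target,
    propagating credit backwards without waiting for the delayed reward.
    """
    delta = r + gamma * V[s_next] - V[s]  # immediate credit signal
    V[s] += alpha * delta
    return delta
```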


The value function is always an estimate (one that matches the true value only if it is computed under a perfect policy / estimator), so the hat is often dropped.
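In symbols, with the hat marking the estimate (notation assumed here):

$$ \hat{V}(s) \approx V^{\pi}(s), $$

and once the context is clear both are typically written simply as $V(s)$.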