In standard RL, actions are single-step: observe, act, observe again. Options extend this to temporally extended actions. An option like “navigate to the hallway” runs its own internal policy for many steps before terminating and handing control back.

Option (Sutton, Precup & Singh, 1999)

A 3-tuple $\langle \mathcal{I}, \pi, \beta \rangle$:
$\mathcal{I} \subseteq \mathcal{S}$: initiation set, states where the option can start
$\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$: intra-option policy, how to act while the option runs
$\beta : \mathcal{S} \to [0, 1]$: termination condition, probability of stopping in each state

Available in state $s$ iff $s \in \mathcal{I}$. Once initiated, actions follow $\pi$ until the option terminates stochastically according to $\beta$.

A primitive action $a$ is a degenerate option: $\mathcal{I} = \mathcal{S}$ (available everywhere), $\pi$ always selects $a$, and $\beta(s) = 1$, so it terminates after one step. An agent with options learns a policy over the set of options rather than over primitive actions directly.
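The tuple can be sketched as a small Python class. This is a minimal illustration, not code from the paper; the names `Option` and `primitive_option` are my own.

```python
import random
from dataclasses import dataclass
from typing import Callable, Hashable, Set


@dataclass
class Option:
    """An option <I, pi, beta> over a discrete state/action space."""
    initiation_set: Set[Hashable]             # I: states where the option may start
    policy: Callable[[Hashable], Hashable]    # pi: state -> action while running
    termination: Callable[[Hashable], float]  # beta: state -> P(terminate here)

    def can_start(self, state) -> bool:
        return state in self.initiation_set

    def terminates(self, state, rng=random) -> bool:
        # Stochastic termination: stop with probability beta(state).
        return rng.random() < self.termination(state)


def primitive_option(action, states):
    """A primitive action as a degenerate option: available everywhere,
    always picks `action`, terminates after one step (beta = 1)."""
    return Option(
        initiation_set=set(states),
        policy=lambda s: action,
        termination=lambda s: 1.0,
    )
```

With `beta = 1` everywhere, `terminates` always returns `True`, which recovers ordinary one-step RL.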

Because options have variable duration, the resulting decision process is a semi-MDP: transitions take variable amounts of time. The Bellman update generalizes to a multi-step TD target. For an option $o$ initiated in state $s_t$ and lasting $k$ steps:

$$Q(s_t, o) \leftarrow Q(s_t, o) + \alpha \Big[ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} + \gamma^{k} \max_{o'} Q(s_{t+k}, o') - Q(s_t, o) \Big]$$

The discounted rewards collected while the option ran, plus the discounted value of the state you land in. Same structure as a regular TD update, just spanning $k$ steps instead of one.
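As a sketch, the multi-step target can be computed directly from the rewards collected while the option ran. This assumes a tabular Q stored as a dict keyed by (state, option) pairs; the function name is illustrative.

```python
def smdp_q_update(Q, state, option, rewards, next_state, options,
                  alpha=0.1, gamma=0.99):
    """One SMDP Q-learning update for an option that ran k = len(rewards)
    steps, collected `rewards`, and landed in `next_state`."""
    k = len(rewards)
    # Discounted return accumulated while the option was executing.
    g = sum(gamma**i * r for i, r in enumerate(rewards))
    # Bootstrap from the best option in the landing state, discounted by
    # gamma^k because k steps of real time elapsed.
    bootstrap = gamma**k * max(Q[(next_state, o)] for o in options)
    target = g + bootstrap
    Q[(state, option)] += alpha * (target - Q[(state, option)])
    return Q[(state, option)]
```

Setting every option to a one-step primitive action (`k = 1`) collapses this to the ordinary Q-learning update.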

In practice, options are often implemented by expanding the action space. If the original game has $|\mathcal{A}|$ actions and you offer $D$ possible durations, the agent picks from $|\mathcal{A}| \times D$ options (e.g. “move left for 4 steps”, “fire for 1 step”). The intra-option policy is trivial here (repeat the chosen action), but the framework is general: $\pi$ can be an arbitrary policy, like a learned sub-network that handles a whole subtask.
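The expanded action space amounts to crossing the primitive actions with a duration set. A self-contained sketch, with hypothetical action names and durations, and an assumed `env_step(state, action) -> (next_state, reward)` environment interface:

```python
# Hypothetical primitive actions and duration choices.
ACTIONS = ["left", "right", "fire"]
DURATIONS = [1, 4]

# The expanded option set: one (action, duration) pair per option,
# |A| x D entries in total.
OPTIONS = [(a, n) for a in ACTIONS for n in DURATIONS]


def run_option(env_step, state, option):
    """Execute a repeat-action option: apply `action` for `n` steps.

    Returns the landing state and the per-step rewards, which are what
    the discounted SMDP target needs."""
    action, n = option
    rewards = []
    for _ in range(n):
        state, r = env_step(state, action)
        rewards.append(r)
    return state, rewards
```

The agent's top-level policy then chooses among `OPTIONS` exactly as it would among primitive actions.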

Automatically discovering good options (reusable sub-behaviors that transfer across tasks) is a core problem in hierarchical RL.