In standard RL, actions are single-step: observe, act, observe again. Options extend this to temporally extended actions. An option like “navigate to the hallway” runs its own internal policy for many steps before terminating and handing control back.

Option (Sutton, Precup & Singh, 1999)

A 3-tuple $\langle \mathcal{I}, \pi, \beta \rangle$:
$\mathcal{I} \subseteq \mathcal{S}$: initiation set, states where the option can start
$\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$: intra-option policy, how to act while the option runs
$\beta : \mathcal{S} \to [0, 1]$: termination condition, probability of stopping in each state

Available in state $s$ iff $s \in \mathcal{I}$. Once initiated, actions follow $\pi$ until the option terminates stochastically according to $\beta$.

A primitive action $a$ is a degenerate option: $\mathcal{I} = \mathcal{S}$ (available everywhere), $\pi$ always selects $a$, and $\beta(s) = 1$, so it terminates after one step. An agent with options learns a policy over the set of options rather than over primitive actions directly.
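The tuple can be sketched as a small Python class. This is a minimal illustration, not code from the paper; the names `Option` and `primitive_option` are my own.

```python
import random
from dataclasses import dataclass
from typing import Callable, Hashable, Set


@dataclass
class Option:
    """An option <I, pi, beta> over a discrete state/action space."""
    initiation_set: Set[Hashable]             # I: states where the option may start
    policy: Callable[[Hashable], Hashable]    # pi: state -> action while running
    termination: Callable[[Hashable], float]  # beta: state -> P(terminate here)

    def can_start(self, state) -> bool:
        return state in self.initiation_set

    def terminates(self, state, rng=random) -> bool:
        # Stochastic termination: stop with probability beta(state).
        return rng.random() < self.termination(state)


def primitive_option(action, states):
    """A primitive action as a degenerate option: available everywhere,
    always picks `action`, terminates after one step (beta = 1)."""
    return Option(
        initiation_set=set(states),
        policy=lambda s: action,
        termination=lambda s: 1.0,
    )
```

With `beta = 1` everywhere, `terminates` always returns `True`, which recovers ordinary one-step RL.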

Because options have variable duration, the resulting decision process is a semi-MDP: transitions take variable amounts of time. The Bellman update generalizes to a multi-step TD target. For an option $o$ initiated in state $s_t$ and lasting $k$ steps:

$$Q(s_t, o) \leftarrow Q(s_t, o) + \alpha \Big[ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} + \gamma^{k} \max_{o'} Q(s_{t+k}, o') - Q(s_t, o) \Big]$$

The discounted rewards collected while the option ran, plus the discounted value of the state you land in. Same structure as a regular TD update, just spanning $k$ steps instead of one.
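As a sketch, the multi-step target can be computed directly from the rewards collected while the option ran. This assumes a tabular Q stored as a dict keyed by (state, option) pairs; the function name is illustrative.

```python
def smdp_q_update(Q, state, option, rewards, next_state, options,
                  alpha=0.1, gamma=0.99):
    """One SMDP Q-learning update for an option that ran k = len(rewards)
    steps, collected `rewards`, and landed in `next_state`."""
    k = len(rewards)
    # Discounted return accumulated while the option was executing.
    g = sum(gamma**i * r for i, r in enumerate(rewards))
    # Bootstrap from the best option in the landing state, discounted by
    # gamma^k because k steps of real time elapsed.
    bootstrap = gamma**k * max(Q[(next_state, o)] for o in options)
    target = g + bootstrap
    Q[(state, option)] += alpha * (target - Q[(state, option)])
    return Q[(state, option)]
```

Setting every option to a one-step primitive action (`k = 1`) collapses this to the ordinary Q-learning update.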

In practice, options are often implemented by expanding the action space. If the original game has $|\mathcal{A}|$ actions and you offer $D$ possible durations, the agent picks from $|\mathcal{A}| \times D$ options (e.g. “move left for 4 steps”, “fire for 1 step”). The intra-option policy is trivial here (repeat the chosen action), but the framework is general: $\pi$ can be an arbitrary policy, like a learned sub-network that handles a whole subtask.
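The expanded action space amounts to crossing the primitive actions with a duration set. A self-contained sketch, with hypothetical action names and durations, and an assumed `env_step(state, action) -> (next_state, reward)` environment interface:

```python
# Hypothetical primitive actions and duration choices.
ACTIONS = ["left", "right", "fire"]
DURATIONS = [1, 4]

# The expanded option set: one (action, duration) pair per option,
# |A| x D entries in total.
OPTIONS = [(a, n) for a in ACTIONS for n in DURATIONS]


def run_option(env_step, state, option):
    """Execute a repeat-action option: apply `action` for `n` steps.

    Returns the landing state and the per-step rewards, which are what
    the discounted SMDP target needs."""
    action, n = option
    rewards = []
    for _ in range(n):
        state, r = env_step(state, action)
        rewards.append(r)
    return state, rewards
```

The agent's top-level policy then chooses among `OPTIONS` exactly as it would among primitive actions.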

Automatically discovering good options (reusable sub-behaviors that transfer across tasks) is a core problem in hierarchical RL.