Markov Decision Process
A 5-tuple $(S, A, P, R, \gamma)$:
- $S$: set of states
- $A$: set of actions
- $P(s' \mid s, a)$: transition probabilities
- $R(s, a)$: reward function
- $\gamma \in [0, 1)$: discount factor
Extends a Markov chain by adding actions (the agent chooses transitions rather than just observing them) and rewards (there’s something to optimize).
The MDP framework rests on the Markov property: the next state depends only on the current state and action, not on the history of how the agent got there.
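As a concrete sketch, here is value iteration on a tiny hand-made MDP. All states, actions, and probabilities below are made-up illustrations, not from the text:

```python
import numpy as np

# A made-up 2-state, 2-action MDP (numbers are illustrative only).
gamma = 0.9
# P[a, s, s'] = probability of landing in s' after taking action a in state s
P = np.array([
    [[0.8, 0.2],
     [0.1, 0.9]],   # action 0
    [[0.5, 0.5],
     [0.3, 0.7]],   # action 1
])
# R[s, a] = expected immediate reward for taking action a in state s
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

# Value iteration: V(s) <- max_a [ R(s, a) + gamma * sum_{s'} P(s'|s, a) V(s') ]
V = np.zeros(2)
for _ in range(10_000):
    Q = R + gamma * np.einsum('ast,t->sa', P, V)  # Q[s, a]
    V_new = Q.max(axis=1)
    if np.abs(V_new - V).max() < 1e-10:
        break
    V = V_new

policy = Q.argmax(axis=1)   # greedy policy w.r.t. the converged values
print(V, policy)
```

The loop converges because the Bellman backup is a $\gamma$-contraction; the greedy policy read off the converged values is optimal for this MDP.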
POMDP
Partially observable MDP: the environment is still Markovian, but the agent can’t see the full state, only observations $o$ that depend probabilistically on it. The fix is to maintain a belief state $b$ (a distribution over “which true state am I in?”), updated via Bayes’ rule after each action and observation. The belief state is itself Markovian, so in principle this reduces the POMDP back to a regular MDP over beliefs. In practice, the state space of that MDP is a continuous probability simplex, which makes it intractable for anything but small problems.
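A minimal sketch of the Bayes belief update, on a made-up two-state example where a “listen” action leaves the state unchanged and the observation is right 85% of the time (all probabilities are assumptions for illustration):

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """Bayes' rule: b'(s') ∝ Z[a, s', o] * sum_s T[a, s, s'] * b(s)."""
    predicted = b @ T[a]             # predict: where might I be after acting?
    unnorm = Z[a, :, o] * predicted  # correct: weight by observation likelihood
    return unnorm / unnorm.sum()     # renormalize to a valid distribution

# Hypothetical model: one "listen" action (index 0).
T = np.array([[[1.0, 0.0],
               [0.0, 1.0]]])         # T[0] = identity: listening doesn't move the state
Z = np.array([[[0.85, 0.15],
               [0.15, 0.85]]])       # Z[0, s', o]: observation correct 85% of the time

b = np.array([0.5, 0.5])             # start maximally uncertain
b = belief_update(b, a=0, o=0, T=T, Z=Z)
print(b)                             # → [0.85 0.15]: belief shifts toward state 0
```

Each update folds one observation into the belief; repeated consistent observations sharpen it toward the true state, which is exactly the Markovian summary the POMDP-to-MDP reduction relies on.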