Markov Decision Process
A 5-tuple $(S, A, P, R, \gamma)$:
- $S$: set of states
- $A$: set of actions
- $P(s' \mid s, a)$: transition probabilities
- $R(s, a)$: reward function
- $\gamma \in [0, 1)$: discount factor
Extends a Markov chain by adding actions (the agent chooses transitions rather than just observing them) and rewards (there’s something to optimize).
The MDP framework rests on the Markov property: the next state depends only on the current state and action, not on the history of how the agent got there.
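As a concrete sketch, here is value iteration on a tiny hand-made MDP. All states, actions, and probabilities below are made-up illustrations, not from the text:

```python
import numpy as np

# A made-up 2-state, 2-action MDP (numbers are illustrative only).
gamma = 0.9
# P[a, s, s'] = probability of landing in s' after taking action a in state s
P = np.array([
    [[0.8, 0.2],
     [0.1, 0.9]],   # action 0
    [[0.5, 0.5],
     [0.3, 0.7]],   # action 1
])
# R[s, a] = expected immediate reward for taking action a in state s
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

# Value iteration: V(s) <- max_a [ R(s, a) + gamma * sum_{s'} P(s'|s, a) V(s') ]
V = np.zeros(2)
for _ in range(10_000):
    Q = R + gamma * np.einsum('ast,t->sa', P, V)  # Q[s, a]
    V_new = Q.max(axis=1)
    if np.abs(V_new - V).max() < 1e-10:
        break
    V = V_new

policy = Q.argmax(axis=1)   # greedy policy w.r.t. the converged values
print(V, policy)
```

The loop converges because the Bellman backup is a $\gamma$-contraction; the greedy policy read off the converged values is optimal for this MDP.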
POMDP
Partially observable MDP: the environment is still Markovian, but the agent can’t see the full state, only observations $o$ that depend probabilistically on it. The fix is to maintain a belief state $b$ (a distribution over “which true state am I in?”), updated via Bayes’ rule after each action and observation. The belief state is itself Markovian, so in principle this reduces the POMDP back to a regular MDP over beliefs. In practice, the state space of that MDP is a continuous probability simplex, which makes it intractable for anything but small problems.
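A minimal sketch of the Bayes belief update, on a made-up two-state example where a “listen” action leaves the state unchanged and the observation is right 85% of the time (all probabilities are assumptions for illustration):

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """Bayes' rule: b'(s') ∝ Z[a, s', o] * sum_s T[a, s, s'] * b(s)."""
    predicted = b @ T[a]             # predict: where might I be after acting?
    unnorm = Z[a, :, o] * predicted  # correct: weight by observation likelihood
    return unnorm / unnorm.sum()     # renormalize to a valid distribution

# Hypothetical model: one "listen" action (index 0).
T = np.array([[[1.0, 0.0],
               [0.0, 1.0]]])         # T[0] = identity: listening doesn't move the state
Z = np.array([[[0.85, 0.15],
               [0.15, 0.85]]])       # Z[0, s', o]: observation correct 85% of the time

b = np.array([0.5, 0.5])             # start maximally uncertain
b = belief_update(b, a=0, o=0, T=T, Z=Z)
print(b)                             # → [0.85 0.15]: belief shifts toward state 0
```

Each update folds one observation into the belief; repeated consistent observations sharpen it toward the true state, which is exactly the Markovian summary the POMDP-to-MDP reduction relies on.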