Markov property

The probability of the next state and reward depends only on the current state and action.
Neither the next state nor the reward depends on the earlier history of states and actions.
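In symbols (writing S_t, A_t, R_t for the state, action, and reward at time t):

```latex
P(S_{t+1} = s', R_{t+1} = r \mid S_t, A_t, S_{t-1}, A_{t-1}, \dots, S_0, A_0)
  = P(S_{t+1} = s', R_{t+1} = r \mid S_t, A_t)
```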

Stationary dynamics: the transition probabilities P(s' | s, a) and the reward function R(s, a) are fixed functions. The environment is assumed not to change over time.

If the "state" is rich enough (the full history, the agent's memory, the world's hidden variables), isn't anything Markovian?

Yes, but then you have a huge state space and all the nice properties of MDPs (tractable value functions, convergence guarantees) evaporate.

MDP-based RL algorithms assume you can condense everything relevant into a manageable state representation.

In an open-ended real-world setting (an embodied agent, a lifelong learner, a scientist), the “state” that would make things Markov is the entire history of everything that ever happened.

It also assumes the world doesn’t change: the Bellman equation only has a fixed point to converge to because P and R are stationary.
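A minimal sketch of that fixed point, assuming a made-up 2-state, 2-action MDP (the numbers in P and R are arbitrary): because P and R never change, the Bellman backup is a contraction, and repeated application converges to a unique V*.

```python
import numpy as np

# Toy MDP: P[s, a, s'] = transition probability, R[s, a] = expected reward.
# These are illustrative numbers, not from any real problem.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

# Value iteration: apply the Bellman optimality backup until V stops moving.
# Stationarity of P and R is exactly what makes this loop converge.
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * P @ V        # Q[s, a] = R[s, a] + gamma * sum_s' P[s,a,s'] V[s']
    V_new = Q.max(axis=1)
    if np.abs(V_new - V).max() < 1e-10:
        break                    # reached the fixed point V*
    V = V_new
print(V)
```

If P or R drifted between iterations (a non-stationary world), there would be no fixed target for this loop to settle on.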

MDP algorithm families, roughly

Value-based: learn a value function and derive a policy from it (Q-learning, DQN, SARSA)
Actor-critic: learn both a policy and a value function (PPO, SAC, A3C)
Model-based: learn a model of the environment and plan with it (Dyna, MuZero, world models)