Given the [[State-Action Paradigm]], the Markov Decision Process ("MDP") additionally satisfies the [[Markov Chain|Markov Property]]: transition probabilities $T$ depend only on the current state and action. The goal of an MDP is to optimize the agent's behavior. We define a utility function $U$ and maximize its expectation.

**Finite horizon:** The utility function is the sum of rewards after acting for a fixed number of steps $n$.

$
U(s_0, s_1, \dots, s_n) = \sum_{t=0}^n R(s_t)
$

>[!note:]
>Note that this setup might violate the Markov property, as actions could also depend on the fact that only $k$ steps are left until the final step $n$ (e.g. closer rewards will be preferred over rewards that are unreachable within $k$ steps).

**Infinite horizon:** An MDP that runs forever might have an infinite sum of rewards. In this case we cannot compare the rewards of different strategies. Therefore we introduce a discount factor $\gamma$ with $0 \le \gamma < 1$ for future rewards.

$
\begin{align}
U &= U(s_0, s_1, \dots, s_\infty)\\[8pt]
&= R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \dots \\[2pt]
&= \sum_{t=0}^\infty \gamma^t R(s_t)
\end{align}
$

We find an upper bound on this infinite sum by assuming that every reward $R(s_t)$ equals the maximum achievable reward $R_\text{max}$ over all states, so $U$ can never exceed the resulting value.

$
U \le R_\text{max} \sum_{t=0}^\infty \gamma^t = \frac{R_\text{max}}{1-\gamma}
$

>[!note:]
>We used the closed form of the geometric series, $\sum_{t=0}^\infty \gamma^t = \frac{1}{1-\gamma}$, which holds for $0 \le \gamma < 1$.
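
As a quick illustration of the discounted utility and its upper bound, here is a minimal Python sketch. The reward values and the helper `discounted_utility` are hypothetical and only for illustration; they are not part of the note above.

```python
# Sketch: compute the discounted utility of a reward sequence and
# check it against the R_max / (1 - gamma) bound from the note.

def discounted_utility(rewards, gamma):
    """Sum of gamma^t * R(s_t) over a (finite prefix of a) trajectory."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

gamma = 0.9
rewards = [1.0, 0.0, 5.0, 2.0, 3.0]   # hypothetical R(s_t) values

U = discounted_utility(rewards, gamma)
R_max = max(rewards)
bound = R_max / (1 - gamma)           # geometric-series upper bound

print(f"U = {U:.3f}, upper bound = {bound:.3f}")
assert U <= bound
```

Because each term is at most $\gamma^t R_\text{max}$, the finite prefix computed here is also below the infinite-horizon bound $\frac{R_\text{max}}{1-\gamma}$.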