## State-Value Iteration
We express $V_{k+1}^\star (s)$ as the maximum expected [[State-Action Paradigm#^01e643|reward]] obtainable from [[State-Action Paradigm#^045824|state]] $s$ within $k+1$ steps. It consists of the expected reward for the first step, plus the discounted value of the new state, $\gamma V_k^\star(s^\prime)$, which only looks $k$ steps into the future.
$ V_{k+1}^\star(s)=\max_a \left[ \sum_{s^\prime} T(s,a,s^\prime)*\Big(R(s,a,s^\prime) + \gamma V_k^\star(s^\prime)\Big) \right] $
For the value iteration algorithm we initialize all state values at zero.
$ V_0^\star(s) = 0 $
Now we loop over $k$, starting with $k=0$, which corresponds to computing $V_{1}^\star(s)$. We continue the iterations until the state values of all states $s \in S$ have converged. The underlying assumption is that $V_k^\star(s) \to V^\star(s)$ as $k \to \infty$. Convergence is guaranteed because the discount factor $\gamma < 1$ weights rewards that lie further in the future less and less.
$ V_k^\star (s) \simeq V_{k+1}^\star(s) \, \,\forall s $
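The iteration itself is straightforward to implement. Below is a minimal sketch in Python, assuming the MDP is given as NumPy arrays `T[s, a, s']` and `R[s, a, s']` for the transition probabilities and rewards (these array names and shapes are an assumption for illustration, not part of the note):

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-6):
    """Repeat the Bellman backup until the state values converge.

    Assumes T and R are arrays of shape (num_states, num_actions, num_states):
    T[s, a, s'] is a transition probability, R[s, a, s'] the associated reward.
    """
    num_states, num_actions, _ = T.shape
    V = np.zeros(num_states)                     # V_0*(s) = 0 for all s
    while True:
        # Q(s, a) = sum_s' T(s,a,s') * (R(s,a,s') + gamma * V_k(s'))
        Q = np.sum(T * (R + gamma * V[None, None, :]), axis=2)
        V_new = Q.max(axis=1)                    # V_{k+1}*(s) = max_a Q(s, a)
        if np.max(np.abs(V_new - V)) < tol:      # V_k(s) ≈ V_{k+1}(s) for all s
            return V_new
        V = V_new
```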
Once we have all the state values, we can compute all Q-values. These allow us to identify the optimal action for each state: it is simply the $\arg \max_a$ over the Q-values of all actions available in that state.
$ \pi^\star(s)= \arg \max_a Q^\star(s,a) $
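Continuing the sketch above (same assumed `T`, `R` arrays), extracting the Q-values and the greedy policy from the converged state values takes one additional backup:

```python
import numpy as np

def extract_policy(T, R, V, gamma=0.9):
    """Compute Q*(s, a) from the converged V* and return the greedy policy."""
    Q = np.sum(T * (R + gamma * V[None, None, :]), axis=2)
    return Q.argmax(axis=1)                      # pi*(s) = argmax_a Q*(s, a)
```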
## Q-Value Iteration
Instead of the state values, we can also define the [[Bellman Equations#^f47c81|Q-Values]] recursively.
We define $Q_{k+1}^\star(s,a)$ as a weighted average, where the weights are the [[State-Action Paradigm#^c23f52|transition probabilities]] to all possible states $s^\prime$. For each $s^\prime$ we obtain a reward, and use $\gamma*V(s^\prime)$ to account for future rewards assuming optimal behavior thereafter.
However, the value of the new state, $V(s^\prime)$, can itself be expressed in terms of Q-values by maximizing over the next action $a^\prime$.
$ Q_{k+1}^\star(s, a) = \sum_{s^\prime} T(s,a,s^\prime)* \Big(R(s,a,s^\prime)+\gamma* \overbrace{\max_{a^\prime} Q_{k}^\star(s^\prime, a^\prime)}^{V(s^\prime)}\Big) $
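A corresponding sketch for Q-value iteration, under the same assumptions about `T` and `R` as above, replaces the vector of state values with a table of Q-values:

```python
import numpy as np

def q_value_iteration(T, R, gamma=0.9, tol=1e-6):
    """Repeat the Q-value backup until the Q-table converges.

    Assumes T and R have shape (num_states, num_actions, num_states).
    """
    num_states, num_actions, _ = T.shape
    Q = np.zeros((num_states, num_actions))      # Q_0*(s, a) = 0
    while True:
        V = Q.max(axis=1)                        # max_a' Q_k(s', a') plays the role of V(s')
        Q_new = np.sum(T * (R + gamma * V[None, None, :]), axis=2)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
```

The optimal policy then follows exactly as before, as the $\arg \max_a$ over the final Q-table.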