**Policy:** $\pi^\star$ is a set of instructions for the agent on how to act optimally from every [[State-Action Paradigm#^045824|state]]. Optimality means maximizing the utility function.

$$ \pi^\star: s \mapsto a $$

**Q-value:** $Q^\star(s,a)$ is the expected [[State-Action Paradigm#^01e643|reward]] when starting in state $s$, taking a (potentially non-optimal) [[State-Action Paradigm#^d84ec0|action]] $a$, and acting optimally from then on. ^f47c81

**State value:** $V^\star(s)$ quantifies how good the current state is. It is measured as the expected reward when starting in that state and acting optimally from then on. We can express $V^\star(s)$ in terms of the Q-value, since a state's value is governed by the optimal action $\max_a$, i.e. $\pi^\star(s)$:

$$ V^\star(s) = \max_a Q^\star(s,a) = Q^\star\big(s, \pi^\star(s)\big) \tag{1}$$

The Q-value takes a weighted sum over all possible destination states $s^\prime$, where the weights come from the transition matrix $T$. Thus, $Q^\star$ is the weighted sum of the reward for the next step $R(s,a,s^\prime)$ plus the next state value $V^\star(s^\prime)$, which reflects the discounted future rewards of all subsequent steps.

$$ Q^\star(s,a)=\sum_{s^\prime} T(s,a,s^\prime)\Big(R(s,a,s^\prime) + \gamma V^\star(s^\prime)\Big) \tag{2}$$

Plugging the definition from equation $(2)$ into equation $(1)$ for $V^\star$:

$$ V^\star(s)=\max_a \left[ \sum_{s^\prime} T(s,a,s^\prime)\Big(R(s,a,s^\prime) + \gamma V^\star(s^\prime)\Big) \right] $$

Plugging the definition from equation $(1)$ into equation $(2)$ for $Q^\star$:

$$ Q^\star(s,a) = \sum_{s^\prime} T(s,a,s^\prime)\Big(R(s,a,s^\prime) + \gamma \max_{a^\prime} Q^\star(s^\prime,a^\prime)\Big) $$
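
As a minimal sketch of how equations $(1)$ and $(2)$ can be turned into a fixed-point computation (value iteration), the snippet below repeatedly applies both updates to a small hypothetical MDP. The arrays `T`, `R`, the discount `gamma`, and the 2-state/2-action setup are made-up example values, not anything defined in this note.

```python
import numpy as np

# Hypothetical toy MDP: T[s, a, s'] are transition probabilities,
# R[s, a, s'] the immediate rewards, gamma the discount factor.
gamma = 0.9

T = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # transitions out of state 0
    [[0.7, 0.3], [0.0, 1.0]],   # transitions out of state 1
])
R = np.array([
    [[ 1.0, 0.0], [0.0, 2.0]],  # rewards when leaving state 0
    [[-1.0, 0.0], [0.0, 0.5]],  # rewards when leaving state 1
])

n_states, n_actions, _ = T.shape
V = np.zeros(n_states)

for _ in range(1000):
    # Equation (2): Q*(s,a) = sum_s' T(s,a,s') * (R(s,a,s') + gamma * V*(s'))
    Q = np.einsum("sap,sap->sa", T, R + gamma * V[None, None, :])
    # Equation (1): V*(s) = max_a Q*(s,a)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:   # stop once the values converge
        V = V_new
        break
    V = V_new

pi_star = Q.argmax(axis=1)   # greedy policy: pi*(s) = argmax_a Q*(s,a)
print("V* =", V)
print("pi* =", pi_star)
```

The iteration alternates the two Bellman updates until $V^\star$ stops changing; reading off $\arg\max_a Q^\star(s,a)$ then recovers the optimal policy $\pi^\star$.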