In a real-world setup we usually know the set of states and available actions ([[State-Action Paradigm]]). However, the transition probabilities $T(s,a,s^\prime)$ and rewards $R(s,a,s^\prime)$ are unknown to us.

$$
Q_{k+1}^\star(s, a) = \sum_{s^\prime} \overbrace{T(s,a,s^\prime)}^{\text{unknown}} \Big( \overbrace{R(s,a,s^\prime)}^{\text{unknown}}+\gamma \overbrace{\max_{a^\prime} Q_{k}^\star(s^\prime, a^\prime)}^{V(s^\prime)}\Big)
$$

We would need to estimate $\hat T, \hat R$ to set up a [[Markov Decision Process]]. The simplest approach is to empirically observe the agent in the real world and collect samples.

$$
\begin{align}
\text{sample}_1 &: R(s,a, s_1^\prime) + \gamma \max_{a^\prime} Q(s_1^\prime, a^\prime) \\
&\ \ \vdots \\
\text{sample}_k &: R(s,a, s_k^\prime) + \gamma \max_{a^\prime} Q(s_k^\prime, a^\prime)
\end{align}
$$

Suppose we have observed the outcome of a given state-action pair $k$ times, bringing us to (possibly different) target states $s_1^\prime, \dots , s_k^\prime$. In the long run, the relative frequencies of these target states $s^\prime$ converge to the transition probabilities $T(s,a,s^\prime)$. $Q(s,a)$ is then estimated as the average outcome over all samples collected for the state-action pair $(s,a)$; each sample is, again, the reward plus the discounted value of the next state (i.e. its Q-value under the optimal action).

$$
Q(s,a)= \frac{1}{k} \sum_{i=1}^k \Big( R(s,a,s_i^\prime)+\gamma \max_{a^\prime} Q(s_i^\prime, a^\prime) \Big)
$$

> [!note]
> In reality we usually cannot repeat the state-action combination $(s,a)$ multiple times in a row, because the agent has already moved on to $s^\prime$ and we have to wait until it reaches $s$ again.

> [!note]
> We also want to update $Q$ recursively while the samples are being collected, rather than waiting until the sample collection has finished. For that purpose we use the exponential running average.

## Exponential Running Average

$$
\bar x_n = \frac {x_n+(1-\alpha)\,x_{n-1}+(1-\alpha)^2\,x_{n-2}+\dots} {1+(1-\alpha)+(1-\alpha)^2+\dots}
$$

We can also write it in recursive form, where the new running average $\bar x_n$ equals the latest sample $x_n$ weighted by $\alpha$, plus the previous running average $\bar x_{n-1}$ weighted by $(1-\alpha)$. We will see this update rule in the algorithm.

$$
\bar x_n = \alpha x_n+(1-\alpha)\bar x_{n-1}
$$

## Algorithm

1. Initialize $Q(s,a)=0 \quad \forall s,a$
2. Iterate until convergence:
    1. Collect a sample $(s,a, s^\prime)$ together with its reward $R(s,a,s^\prime)$
    2. Update $Q(s,a)$

$$
\begin{align}
Q_{i+1}(s,a)&=\alpha\,\text{sample}+(1-\alpha)\,Q_i(s,a) \\[4pt]
Q_{i+1}(s,a)&=\alpha \left(R(s,a,s^\prime)+\gamma \max_{a^\prime} Q_i(s^\prime, a^\prime)\right)+(1-\alpha)\,Q_i(s,a)
\end{align}
$$
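As a rough sketch of how these two steps translate into code, the function below applies the exponential-running-average update to each observed sample. The environment interface (`reset()`, `step(s, a)`, `env.actions`), the epsilon-greedy action choice, and the hyperparameter values are assumptions made for illustration; the note itself does not fix a sampling policy.

```python
import random
from collections import defaultdict

def sample_based_q_values(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Sample-based Q-value estimation with an exponential running average.

    `env` is a hypothetical environment with `reset() -> s`,
    `step(s, a) -> (s_next, reward, done)`, and a list `env.actions`.
    """
    Q = defaultdict(float)  # 1. Initialization: Q(s, a) = 0 for all s, a

    for _ in range(episodes):  # 2. Iterate (here: a fixed budget of episodes)
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action choice (an assumption, see lead-in)
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda a_: Q[(s, a_)])

            # 2.1 Collect one sample (s, a, s') and its reward R(s, a, s')
            s_next, reward, done = env.step(s, a)

            # sample = R(s, a, s') + gamma * max_a' Q(s', a')
            sample = reward + gamma * max(Q[(s_next, a_)] for a_ in env.actions)

            # 2.2 Exponential running average:
            # Q(s, a) <- alpha * sample + (1 - alpha) * Q(s, a)
            Q[(s, a)] = alpha * sample + (1 - alpha) * Q[(s, a)]

            s = s_next

    return Q
```

A smaller $\alpha$ averages over more samples (smoother estimates, slower to adapt), while a larger $\alpha$ weights recent samples more heavily.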