In a real-world setup we usually know the set of states and available actions ([[State-Action Paradigm]]). However, the transition probabilities $T(s,a,s^\prime)$ and rewards $R(s,a,s^\prime)$ are unknown to us.
$
Q_{k+1}^\star(s, a) = \sum_{s^\prime} \overbrace{T(s,a,s^\prime)}^{\text{unknown}} \Big( \overbrace{R(s,a,s^\prime)}^{\text{unknown}}+\gamma \overbrace{\max_{a^\prime} Q_{k}^\star(s^\prime, a^\prime)}^{V(s^\prime)}\Big)
$
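For contrast, here is a minimal sketch of one synchronous backup of this update, assuming $T$ and $R$ were known and stored as plain dictionaries (the data layout and all names are hypothetical, chosen only for illustration). Without the model, this computation is exactly what we cannot perform.

```python
def q_value_iteration_step(Q, states, actions, T, R, gamma=0.9):
    """One synchronous backup Q_{k+1}(s, a) from Q_k, assuming the model is known.

    T[(s, a)] is assumed to map next states s2 to probabilities, and
    R[(s, a, s2)] to hold the reward -- purely illustrative names."""
    Q_next = {}
    for s in states:
        for a in actions:
            Q_next[(s, a)] = sum(
                p * (R[(s, a, s2)] + gamma * max(Q[(s2, a2)] for a2 in actions))
                for s2, p in T[(s, a)].items()
            )
    return Q_next
```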
We need to estimate $\hat T, \hat R$ to set up a [[Markov Decision Process]]. The simplest approach is to empirically observe the agent in the real world and collect samples.
$
\begin{align}
\text{sample}_1 &: R(s,a, s_1^\prime) + \gamma \max_{a^\prime} Q(s_1^\prime, a^\prime) \\
&\ \ \vdots \\
\text{sample}_k &: R(s,a, s_k^\prime) + \gamma \max_{a^\prime} Q(s_k^\prime, a^\prime)
\end{align}
$
We have observed the outcome of a given state-action pair $k$ times, bringing us to different target states $[s_1^\prime, \dots , s_k^\prime]$. In the long run, the frequencies of these target states $s^\prime$ should converge to the transition probabilities $T(s,a,s^\prime)$.
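As a rough sketch of this counting estimate (assuming samples are stored as $(s, a, s^\prime, r)$ tuples; the function and variable names are hypothetical), $\hat T$ and $\hat R$ could be computed like this:

```python
from collections import defaultdict

def estimate_model(transitions):
    """Estimate T_hat and R_hat from observed (s, a, s_next, r) tuples
    via relative frequencies and average rewards (illustrative sketch)."""
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s_next: count}
    reward_sum = defaultdict(float)                  # (s, a, s_next) -> summed reward
    for s, a, s2, r in transitions:
        counts[(s, a)][s2] += 1
        reward_sum[(s, a, s2)] += r

    T_hat, R_hat = {}, {}
    for (s, a), next_counts in counts.items():
        total = sum(next_counts.values())
        for s2, c in next_counts.items():
            T_hat[(s, a, s2)] = c / total            # observed frequency of s'
            R_hat[(s, a, s2)] = reward_sum[(s, a, s2)] / c
    return T_hat, R_hat
```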
$Q(s,a)$ is then defined as the average outcome over all samples collected for the state-action pair $(s,a)$. As before, each sample is the observed reward plus the discounted value of the next state (i.e. its Q-value under the optimal action).
$ Q(s,a)= \frac{1}{k} \sum_{i=1}^k \Big( R(s,a,s_i^\prime)+\gamma \max_{a^\prime} Q(s_i^\prime, a^\prime) \Big) $
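A small sketch of this sample average for one fixed $(s,a)$; the list-of-pairs layout of the samples is an assumption made here for illustration:

```python
def q_from_samples(samples, Q, actions, gamma=0.9):
    """Estimate Q(s, a) as the average of r + gamma * max_a' Q(s', a')
    over k observed samples for one fixed (s, a).
    `samples` is assumed to be a list of (s_next, reward) pairs."""
    targets = [r + gamma * max(Q[(s2, a2)] for a2 in actions) for s2, r in samples]
    return sum(targets) / len(targets)
```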
>[!note]
>In reality we usually cannot repeat the state-action combination $s,a$ multiple times, because the agent has moved on to $s^\prime$ and we have to wait until it reaches $s$ again.
>[!note]
>We also want to update $Q$ recursively during sample collection, rather than waiting until the collection has finished. For that purpose we use the exponential running average.
## Exponential Running Average
$
\bar x_n = \frac
{x_n+(1-\alpha)\,x_{n-1}+(1-\alpha)^2\,x_{n-2}+\dots}
{1+(1-\alpha)+(1-\alpha)^2+\dots}
$
We can also write it in recursive form, where the new running average $\bar x_n$ equals the latest sample $x_n$ weighted by $\alpha$, plus the previous running average $\bar x_{n-1}$ weighted by $(1-\alpha)$. We will see this update rule in the algorithm.
$ \bar x_n = \alpha x_n+(1-\alpha)\bar x_{n-1} $
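A minimal sketch of this recursive update over a stream of samples (initializing the average with the first observation is a choice made here, not prescribed above):

```python
def exponential_running_average(samples, alpha=0.1):
    """Apply x_bar_n = alpha * x_n + (1 - alpha) * x_bar_{n-1} over a stream
    of samples; recent samples receive exponentially more weight."""
    x_bar = samples[0]            # initialize with the first observation
    for x in samples[1:]:
        x_bar = alpha * x + (1 - alpha) * x_bar
    return x_bar
```

With $\alpha$ close to $1$ the average tracks the newest samples quickly; with $\alpha$ close to $0$ it changes slowly and smooths out noise.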
## Algorithm
1. Initialize $Q(s,a)=0 \quad \forall s,a$
2. Iterate until convergence:
    1. Collect a sample $(s,a, s^\prime)$ with its reward $R(s,a,s^\prime)$
    2. Update $Q(s,a)$
$
\begin{align}
Q_{i+1}(s,a)&=\alpha \cdot \text{sample}+(1-\alpha)\cdot Q_i(s,a) \\[4pt]
Q_{i+1}(s,a)&=\alpha \left(R(s,a,s^\prime)+\gamma \max_{a^\prime} Q_i(s^\prime, a^\prime)\right)+(1-\alpha)\, Q_i(s,a)
\end{align}
$
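Putting the two steps together, a minimal tabular sketch of the algorithm could look as follows. The environment interface (`reset()` returning a state, `step(a)` returning `(s_next, r, done)`) and the $\epsilon$-greedy exploration are assumptions added for the sketch, not part of the notes above.

```python
import random

def q_learning(env, states, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning sketch: update Q with the exponential running
    average every time a sample (s, a, s', r) is observed."""
    Q = {(s, a): 0.0 for s in states for a in actions}   # step 1: Q(s,a) = 0 for all s,a
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action choice while collecting samples (an assumption here)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(a)                    # step 2.1: observe (s, a, s', r)
            # step 2.2: running-average update with the latest sample
            future = 0.0 if done else max(Q[(s2, a2)] for a2 in actions)
            sample = r + gamma * future
            Q[(s, a)] = alpha * sample + (1 - alpha) * Q[(s, a)]
            s = s2
    return Q
```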