The proof of convergence of Expected Sarsa is presented in A Theoretical & Empirical Analysis of Expected Sarsa. This proof is similar to the proof of convergence of Sarsa, presented in Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms.
In both proofs, the authors rely on the following lemma, reproduced here for clarity:
Lemma 1
Consider a stochastic process $(\zeta_t, \Delta_t, F_t)$, where $\zeta_t, \Delta_t, F_t : \mathcal{X} \rightarrow \mathbb{R}$ satisfy the equations
$$ \Delta_{t+1}(x_t) = (1-\zeta_t(x_t))\Delta_t(x_t)+\zeta_t(x_t)F_t(x_t), $$
where $x_t \in X$ and $t=0, 1, 2, \ldots$. Let $P_t$ be a sequence of increasing $\sigma$-fields ($\sigma$-algebras) such that $\zeta_0$ and $\Delta_0$ are $P_0$-measurable and $\zeta_t$, $\Delta_t$, and $F_{t-1}$ are $P_t$-measurable, with $t\ge 1$. Assume that the following hold:
- The $\mathcal{X}$ set is finite,
- $\zeta_t(x_t) \in [0, 1]$, $\sum_t \zeta_t(x_t) = \infty, \sum_t(\zeta_t(x_t))^2<\infty$ with probability one and $\forall x \neq x_t : \zeta_t(x) = 0$,
- $\|\mathbb{E}\{F_t|P_t\| \le \kappa\|\Delta_t\| + c_t$, where $\kappa \in [0, 1)$ and $c_t$ converges to zero with probability one,
- $\mathrm{Var}\{F_t(x_t)|P_t\} \le K(1+\kappa\|\Delta_t\|)^2$, where $K$ is some constant,
where $\|\cdot\|$ denotes a maximum norm. Then, $\Delta_t$ converges to zero with probability one.
The proof
The authors then go on to apply Lemma 1 with $\mathcal{X} = \mathcal{S} \times \mathcal{A}$ ($\mathcal{S}$ are the states, $\mathcal{A}$ are the actions), $P_t = \{Q_0, s_0, a_0, r_0, \alpha_0, s_1, a_1, \ldots, s_t, a_t\}$, $x_t=(s_t, a_t)$ (a pair of sampled state, action), $\zeta(x_t)=\alpha_t(s_t, a_t)$ ($\alpha_t$ is the step size) and $\Delta_t(x) = Q_t(s_t, a_t) - Q^*(s_t, a_t)$. In this case, maximum norm of $\Delta_t$ is:
$$ \|\Delta_t\| = \max_s\max_a |Q_t(s, a) - Q^*(s, a)| $$
And they go on to map the assumptions of Lemma 1 to the setting of the Expected Sarsa algorithm. ($\mathcal{S}$ and $\mathcal{A}$ are finite, the sum of learning rates is unbounded but its square is bounded, the policy is greedy in the limit with infinite exploration and the reward function is bounded.)
My question (the confusing step)
The authors state that they can derive $F_t$ (which is the only missing step to use Lemma 1) as follows:
$$ F_t = \frac{(\Delta_{t+1}-(1-\zeta_t(x_t))\Delta_t(x_t)}{\zeta_t(x_t)} = \frac{\left(\Delta_{t+1}-(1-\alpha_t)\Delta_t\right)}{\alpha_t} $$
The confusing step is that they conclude:
$$ F_t = r_t + \gamma\sum_a\pi_t(s_{t+1}, a)Q_t(s_{t+1}, a)-Q^*(s_t, a_t) $$
How did the authors get to that equality? They mention that "all the values are taken over the state action pair $(s_t, a_t)$, except when specified differently". I don't see any explicit mention of a different state-action pair.