Confusing step in the proof of convergence of Expected Sarsa

Question

The proof of convergence of Expected Sarsa is presented in A Theoretical & Empirical Analysis of Expected Sarsa. This proof is similar to the proof of convergence of Sarsa, presented in Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms.

In both proofs, the authors rely on the following lemma, reproduced here for clarity:

Lemma 1

Consider a stochastic process $(\zeta_t, \Delta_t, F_t)$, where $\zeta_t, \Delta_t, F_t : \mathcal{X} \rightarrow \mathbb{R}$ satisfy the equations

$$ \Delta_{t+1}(x_t) = (1-\zeta_t(x_t))\Delta_t(x_t)+\zeta_t(x_t)F_t(x_t), $$

where $x_t \in X$ and $t=0, 1, 2, \ldots$. Let $P_t$ be a sequence of increasing $\sigma$-fields ($\sigma$-algebras) such that $\zeta_0$ and $\Delta_0$ are $P_0$-measurable and $\zeta_t$, $\Delta_t$, and $F_{t-1}$ are $P_t$-measurable, with $t\ge 1$. Assume that the following hold:

The $\mathcal{X}$ set is finite,
$\zeta_t(x_t) \in [0, 1]$, $\sum_t \zeta_t(x_t) = \infty, \sum_t(\zeta_t(x_t))^2<\infty$ with probability one and $\forall x \neq x_t : \zeta_t(x) = 0$,
$\|\mathbb{E}\{F_t|P_t\| \le \kappa\|\Delta_t\| + c_t$, where $\kappa \in [0, 1)$ and $c_t$ converges to zero with probability one,
$\mathrm{Var}\{F_t(x_t)|P_t\} \le K(1+\kappa\|\Delta_t\|)^2$, where $K$ is some constant,

where $\|\cdot\|$ denotes a maximum norm. Then, $\Delta_t$ converges to zero with probability one.

The proof

The authors then go on to apply Lemma 1 with $\mathcal{X} = \mathcal{S} \times \mathcal{A}$ ($\mathcal{S}$ are the states, $\mathcal{A}$ are the actions), $P_t = \{Q_0, s_0, a_0, r_0, \alpha_0, s_1, a_1, \ldots, s_t, a_t\}$, $x_t=(s_t, a_t)$ (a pair of sampled state, action), $\zeta(x_t)=\alpha_t(s_t, a_t)$ ($\alpha_t$ is the step size) and $\Delta_t(x) = Q_t(s_t, a_t) - Q^*(s_t, a_t)$. In this case, maximum norm of $\Delta_t$ is:

$$ \|\Delta_t\| = \max_s\max_a |Q_t(s, a) - Q^*(s, a)| $$

And they go on to map the assumptions of Lemma 1 to the setting of the Expected Sarsa algorithm. ($\mathcal{S}$ and $\mathcal{A}$ are finite, the sum of learning rates is unbounded but its square is bounded, the policy is greedy in the limit with infinite exploration and the reward function is bounded.)

My question (the confusing step)

The authors state that they can derive $F_t$ (which is the only missing step to use Lemma 1) as follows:

$$ F_t = \frac{(\Delta_{t+1}-(1-\zeta_t(x_t))\Delta_t(x_t)}{\zeta_t(x_t)} = \frac{\left(\Delta_{t+1}-(1-\alpha_t)\Delta_t\right)}{\alpha_t} $$

The confusing step is that they conclude:

$$ F_t = r_t + \gamma\sum_a\pi_t(s_{t+1}, a)Q_t(s_{t+1}, a)-Q^*(s_t, a_t) $$

How did the authors get to that equality? They mention that "all the values are taken over the state action pair $(s_t, a_t)$, except when specified differently". I don't see any explicit mention of a different state-action pair.

score 1 · Accepted Answer · answered Mar 18 '18 at 14:00

I've figured this out. The key here is to use the Expected Sarsa update:

$$ Q(s_t, a_t)\leftarrow Q(s_t, a_t) + \alpha[r_{t+1}+\gamma\sum_a\pi(s_{t+1},a)-Q(s_t, a_t)] $$

Notice that we can define this update as $Q_{t+1}$, and this is key to the proof. For simplicity, let us define the target of the Expected Sarsa update as

$$ \tau_{t+1}=r_{t+1}+\gamma\sum_a\pi(s_{t+1}, a)Q(s_{t+1}, a) $$

Now, we can expand the confusing step as follows:

$$ F_t = \frac{(\Delta_{t+1}-(1-\alpha_t)\Delta_t}{\alpha_t} = \frac{Q_{t+1}(s_t, a_t)-Q^*(s_t, a_t) - (1-\alpha_t)(Q_t(s_t, a_t)-Q*(s_t, a_t))}{\alpha_t} =\\ = \frac{Q_{t+1}(s_t,a_t)-Q_t(s_t,a_t)+\alpha_t(Q_t(s_t,a_t)-Q^*(s_t,a_t)}{\alpha_t} $$

Now, substituting $Q_{t+1}(s_t, a_t)$ by the Expected Sarsa update:

$$ F_t = \frac{Q_t(s_t,a_t)+\alpha[\tau_{t+1}-Q(s_t, a_t)]-Q_t(s_t,a_t)+\alpha_t(Q_t(s_t,a_t)-Q^*(s_t,a_t))}{\alpha_t} = \\ = \tau_{t+1}-Q^*(s_t,a_t) $$

Which gives us

$$ F_t = r_{t+1}+\gamma\sum_a\pi_t(s_{t+1}, a_t)Q_t(s_{t+1}, a_t) - Q^*(s_t, a_t) $$

As you might have noticed, this expansion uses $r_{t+1}$ instead of $r_t$, as in the proof. I'm assuming that's a typo (since they dropped the $t$ subscript in the action $a$ as well), since the convention used in the paper is that the immediate reward of action $a_t$ on state $s_t$ is represented by $r_{t+1}$.

Confusing step in the proof of convergence of Expected Sarsa

Lemma 1

The proof

My question (the confusing step)

1 Answers1