Under the policy $\pi(\phi,a)$, the sequence of loss functions \begin{equation} L_i(\theta_i) = \mathbb{E}_{\phi,a\sim \pi(.)}[(y_i - Q_i)^2], \end{equation} is minimized, in order to train the Q-network.

How do I read the $\mathbb{E}_{\phi,a\sim \pi(.)}$ part? Is it the expected value of $\phi,a$ under the policy $\pi$?

1 Answer

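After talking with a colleague, we determined that it reads as:

The expected value of $(y_i - Q_i)^2$, where the pairs $\phi,a$ are sampled from the policy $\pi$.

In other words, the subscript names the distribution the expectation is taken over, not the quantity being averaged. The quantity being averaged is the squared error inside the brackets.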
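To make that reading concrete, here is a minimal Python sketch. It is only an illustration under made-up assumptions: the three $(\phi, a)$ pairs, their probabilities under $\pi$, and the `y` and `q` values are all hypothetical and do not come from the paper. It computes the expectation exactly as a probability-weighted sum, and then approximates it by sampling pairs from $\pi$ and averaging the squared error.

```python
import random

# Hypothetical toy setup (not from the paper): three (phi, a) pairs,
# the probability pi assigns to each, and made-up target/prediction values.
pairs = [("s0", "left"), ("s0", "right"), ("s1", "left")]
probs = [0.5, 0.3, 0.2]
y = {pairs[0]: 1.0, pairs[1]: 0.0, pairs[2]: 2.0}  # targets y_i
q = {pairs[0]: 0.5, pairs[1]: 0.5, pairs[2]: 1.0}  # predictions Q_i


def squared_error(pair):
    """The quantity inside the expectation: (y_i - Q_i)^2."""
    return (y[pair] - q[pair]) ** 2


# Exact expectation: weight each squared error by its probability under pi.
exact_loss = sum(p * squared_error(pair) for pair, p in zip(pairs, probs))

# Monte Carlo estimate: sample (phi, a) from pi, then average the squared errors.
rng = random.Random(0)
samples = rng.choices(pairs, weights=probs, k=100_000)
mc_loss = sum(squared_error(pair) for pair in samples) / len(samples)

print(exact_loss, mc_loss)
```

With these toy numbers the exact value is 0.5 · 0.25 + 0.3 · 0.25 + 0.2 · 1.0 = 0.4 (up to floating-point rounding), and the sampled average should land close to it. In both cases it is the squared error that gets averaged; $\phi, a$ only determine how often each term appears.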