I am following the course CS 285 Deep Reinforcement Learning from UC Berkeley. In lecture 4, part 1 (around 17:00), the professor introduces the expected total reward of a trajectory under a policy. He first writes the probability of a trajectory under the policy parameters $\theta$ as follows:

$$\begin{aligned} \underbrace{p_{\theta}\left(\mathbf{s}_{1}, \mathbf{a}_{1}, \ldots, \mathbf{s}_{T}, \mathbf{a}_{T}\right)}_{p_{\theta}(\tau)}=p\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} \underbrace{\pi_{\theta}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right) p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)}_{\text {Markov chain on }(\mathbf{s}, \mathbf{a})} \end{aligned}$$

Then he says that we can use linearity of expectation at this point and writes the following equality.
$$\begin{aligned} \theta^{\star} &=\arg \max _{\theta} E_{\tau \sim p_{\theta}(\tau)}\left[\sum_{t} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right] \\ &=\arg \max _{\theta} \sum_{t=1}^{T} E_{\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right) \sim p_{\theta}\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)}\left[r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right] \end{aligned}$$
He then says that $p_{\theta}(\mathbf{s}_t, \mathbf{a}_t)$ is the state-action marginal. I am trying to understand this step. Even with the Markov assumption, the distribution of $(\mathbf{s}_t, \mathbf{a}_t)$ still depends on $(\mathbf{s}_{t-1}, \mathbf{a}_{t-1})$, so the state-action pairs in the joint distribution $p_\theta(\tau)$ are not independent. Don't we need conditional probabilities to expand the joint distribution this way in the second equation? Isn't this roughly the same as claiming $E_{x,y \sim p(x,y)}[f(x,y)] = E_x[f(x,y)] + E_y[f(x,y)]$? Maybe this works when $f(x,y)$ has an additive form such as $x+y$, which I believe is similar to the sum of rewards here.
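
To make the question concrete, here is a minimal numerical sketch I would use to compare the two sides of the equality. Everything in it is an assumption made up for illustration and is not from the lecture: the toy MDP with two states and two actions, the horizon `T = 3`, the probability tables `p0`, `pi`, `P`, the reward table `r`, and the function names. The first function takes the expectation over whole trajectories using the factorization of $p_\theta(\tau)$; the second sums per-step expectations under the marginals $p_\theta(\mathbf{s}_t, \mathbf{a}_t)$, which it obtains by propagating the state distribution forward one step at a time.

```python
import itertools

# Hypothetical toy MDP (all numbers are made up for illustration).
T = 3
STATES = (0, 1)
ACTIONS = (0, 1)

p0 = [0.6, 0.4]                      # p(s_1)
pi = [[0.7, 0.3], [0.2, 0.8]]        # pi[s][a]    = pi_theta(a | s)
P = [[[0.9, 0.1], [0.5, 0.5]],
     [[0.3, 0.7], [0.6, 0.4]]]       # P[s][a][s2] = p(s2 | s, a)
r = [[1.0, 0.0], [-0.5, 2.0]]        # r[s][a]     = r(s, a)


def expected_return_by_enumeration():
    """E_{tau ~ p(tau)}[sum_t r(s_t, a_t)] by summing over every trajectory."""
    total = 0.0
    for states in itertools.product(STATES, repeat=T):
        for actions in itertools.product(ACTIONS, repeat=T):
            prob = p0[states[0]]
            for t in range(T):
                prob *= pi[states[t]][actions[t]]
                # The transition out of the last step sums to one over
                # s_{T+1}, so it is left out of the product here.
                if t + 1 < T:
                    prob *= P[states[t]][actions[t]][states[t + 1]]
            ret = sum(r[states[t]][actions[t]] for t in range(T))
            total += prob * ret
    return total


def expected_return_by_marginals():
    """sum_t E_{(s_t, a_t) ~ p(s_t, a_t)}[r(s_t, a_t)] using marginals only."""
    total = 0.0
    state_dist = list(p0)            # p(s_t), starting at t = 1
    for _ in range(T):
        # State-action marginal: p(s_t, a_t) = p(s_t) * pi(a_t | s_t).
        joint = [[state_dist[s] * pi[s][a] for a in ACTIONS] for s in STATES]
        total += sum(joint[s][a] * r[s][a] for s in STATES for a in ACTIONS)
        # Marginalize over (s_t, a_t) to get p(s_{t+1}).
        state_dist = [
            sum(joint[s][a] * P[s][a][s2] for s in STATES for a in ACTIONS)
            for s2 in STATES
        ]
    return total


if __name__ == "__main__":
    print("trajectory expectation:", expected_return_by_enumeration())
    print("sum over marginals:    ", expected_return_by_marginals())
```

Running the script prints both quantities side by side, so they can be compared for this toy case even though consecutive state-action pairs are clearly dependent through `P`.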