I am not sure why you have pairs of states. Having more than one player does not mean that you get more states. At each state, every agent observes the same state; it's just that only one of the players gets to decide on an action in any particular state.
So, I have come up with the following: An $l$-player MDP is $ ((S_i)_{i\in[l]},P,A,(R_i)_{i\in[l]},\gamma) $ where the pieces mean the following:
- $S_i$ is the set of states where it's player $i$'s turn; all $S_i$ are pairwise disjoint
- $P(s'|s,a)$ is the probability of moving from $s$ to $s'$ when action $a$ is taken
- $A$ is the set of actions; we assume the same action set for each state
- $R_i(s,a)$ is the real-valued reward that player $i$ receives when action $a$ is taken in state $s$, regardless of whose turn it is
- $0 \leq \gamma < 1$ is the discount factor
Now, let's define the value function $v_i$ for each player $i$, where $v_i(s)$ is the expected discounted sum of rewards for player $i$ from state $s$ onwards. If $s\in S_j$, then
$$ v_i(s) = R_i(s,a(s)) + \sum_{s'} P(s'|s, a(s) )\cdot \gamma v_i(s'), $$ where
$$a(s) = \operatorname*{argmax}_{a}\; \Big[ R_j(s,a) + \sum_{s'} P(s'|s,a )\cdot \gamma v_j(s') \Big]. $$
So basically, the value of state $s$ for player $i$ is player $i$'s immediate reward plus the discounted expected value of the next state, where the action is chosen by the player $j$ whose turn it is in state $s$, and that player maximizes their own value. Note that if $i=j$, i.e. it is player $i$'s turn, then the above formula becomes:
$$ v_i(s) = \max_{a} \Big[ R_i(s,a) + \sum_{s'} P(s'|s, a )\cdot \gamma v_i(s') \Big] $$
which is exactly as in a $1$-player MDP.
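
To make the recursion concrete, here is a minimal sketch in Python that repeatedly applies the equations above as a synchronous update, starting from $v_i \equiv 0$. Everything in it is an assumption made for illustration: the names `backup` and `value_iteration`, the dictionary layout for $P$ and $R_i$, the `owner` map (which encodes the partition into the $S_i$), and the tiny two-state, two-player example. Ties in the argmax are broken by action order. I am not claiming that this iteration converges for every multi-player instance; it simply settles quickly on this toy example.

```python
def backup(states, owner, actions, P, R, gamma, v):
    """One synchronous sweep of the value equations.

    owner[s]   -> index j of the player whose turn it is in s
    P[s][a]    -> dict mapping next state s' to its probability
    R[i][s][a] -> reward for player i when a is taken in s
    v[i][s]    -> current value estimate of s for player i
    """
    n_players = len(R)
    new_v = [{} for _ in range(n_players)]
    policy = {}
    for s in states:
        j = owner[s]

        def q(i, a):
            # Reward for player i plus discounted expected next value.
            return R[i][s][a] + gamma * sum(
                p * v[i][s2] for s2, p in P[s][a].items()
            )

        # The player to move picks the action that is best for themselves.
        a_star = max(actions, key=lambda a: q(j, a))
        policy[s] = a_star
        # Every player evaluates the state under that chosen action.
        for i in range(n_players):
            new_v[i][s] = q(i, a_star)
    return new_v, policy


def value_iteration(states, owner, actions, P, R, gamma, sweeps=200):
    v = [{s: 0.0 for s in states} for _ in R]
    policy = {}
    for _ in range(sweeps):
        v, policy = backup(states, owner, actions, P, R, gamma, v)
    return v, policy


# Hypothetical toy instance: player 0 moves in "a", player 1 moves in "b".
states = ["a", "b"]
owner = {"a": 0, "b": 1}
actions = ["stay", "go"]
P = {
    "a": {"stay": {"a": 1.0}, "go": {"b": 1.0}},
    "b": {"stay": {"b": 1.0}, "go": {"a": 1.0}},
}
R = [
    {"a": {"stay": 1.0, "go": 0.0}, "b": {"stay": 0.0, "go": 0.0}},  # R_0
    {"a": {"stay": 0.0, "go": 0.0}, "b": {"stay": 2.0, "go": 0.0}},  # R_1
]
values, policy = value_iteration(states, owner, actions, P, R, gamma=0.5)
print(policy)
print(values)
```

Note that in state `"b"` player 0's value is computed with player 1's chosen action, which is exactly the case $i \neq j$ in the formula for $v_i(s)$.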