Limit theorem of Markov chains applied to higher order Markov chains

Question

I have a second order Markov chain with 4 states {A,T,C,G} (the 4 DNA nucleotides).

the transition matrix looks like this:

    A    T    C    G
AA[0.1, 0.6, 0.2, 0.1]
AT[0.3, 0.1, 0.5, 0.1]
AC[0.5, 0.3,  0,  0.2]
AG[..., ..., ..., ...]
TA[..., ..., ..., ...]
TT[..., ..., ..., ...]
TC[..., ..., ..., ...]
TG[..., ..., ..., ...]
CA[..., ..., ..., ...]
CT[..., ..., ..., ...]
CG[..., ..., ..., ...]
GA[..., ..., ..., ...]
GT[..., ..., ..., ...]
GC[..., ..., ..., ...]
GG[..., ..., ..., ...]

I wanted to calculate the stationary probability vector for the 4 states to which this matrix converges. The Markov chain is regular.

In case of first order Markov chains this is easily done by calculating the limit of $P^n$ with $n\rightarrow \infty$.

I do not know how to approach the problem in case of second order Markov chains.

Also, having a limited dataset from which to determine the transition matrix, can I consider the stationary distribution of the 4 nucleotides as being the theoretical distribution I would have if I had a much larger pool from which to draw (with the same transition matrix)?

In other words, can I consider the stationary distribution like an estimation of the theoretical nucleotide frequency given the transition matrix obtained from limited data?

Did · Accepted Answer · 2013-05-06T15:07:35.317

4

A second order Markov chain is a random process $(X_n)_n$ on an alphabet $A$, whose distribution is specified by its transition probabilities $Q(x\mid y,z)=P[X_n=x\mid X_{n-1}=y,X_{n-2}=z]$, for every $(x,y,z)$ in $A\times A\times A$ (and by an initial distribution on $A\times A$). A stationary distribution of $(X_n)$ is a probability measure $\pi$ on $A\times A$ such that, if $\pi(x,y)=P[X_n=x,X_{n-1}=y]$ for every $(x,y)$ in $A\times A$ and some $n$, then $\pi(x,y)=P[X_{n+1}=x,X_{n}=y]$ for every $(x,y)$ in $A\times A$.

Thus, one asks that, for every $(x,y)$ in $A\times A$, $$ \pi(x,y)=\sum_{z\in A}Q(x\mid y,z)\pi(y,z). $$ As in the first order case, this linear system, together with the normalizing condition $$ \sum_{(x,y)\in A\times A}\pi(x,y)=1, $$ fully determines $\pi$ as soon as $(X_n)_n$ is irreducible. A new feature, absent of the first order case, is that every stationary distribution $\pi$ has identical marginals, that is, for every $x$ in $A$, $$ \varrho(x)=\sum_{y\in A}\pi(x,y)=\sum_{y\in A}\pi(y,x). $$ Finally, the MLE of $\pi$ based on $(X_k)_{0\leqslant k\leqslant n}$ is $\hat\pi_n$ defined by $$ \hat\pi_n(x,y)=\frac1n\sum_{k=1}^n\mathbf 1_{X_k=x,X_{k-1}=y}. $$ The MLE is consistent, that is, $\hat\pi_n(x,y)\to\pi(x,y)$ almost surely, for every $(x,y)$ in $A\times A$, when $n\to\infty$. In particular, the frequency of $x$ in $A$ stabilizes, since $$ \frac1n\sum_{k=1}^n\mathbf 1_{X_k=x}=\sum_{y\in A}\hat\pi_n(x,y)\to\varrho(x). $$

edited May 06 '13 at 15:07

answered May 06 '13 at 11:29

Did

279,727

Would it be correct to think of the stationary distribution pi as the theoretical frequency distribution, of every couple (x,y) in AxA, in an infinitely long chain generated from the starting transition matrix? Also, would there be a way to extrapolate the frequencies of x in A instead of (x,y) in AxA? – Tito Candelli May 06 '13 at 13:29
Would it be correct to... Yes. // extrapolate the frequencies of x in A... See Edit. – Did May 06 '13 at 15:04
i apologize but i am not a mathematician and i have trouble understanding the 2 last formulas, in particular i am not familiar with the rho notation. if you could elaborate a little bit it would be awesome. Also, do you think that extrapolating transition matrix from limited data and using it to calculate the limit frequencies of each state is more effective than directly using the frequencies calculated from the limited data? – Tito Candelli May 06 '13 at 15:24
There is no rho formula, rho(x) is just a notation for the sum on the RHS of the equation introducing the notation. // Sorry but I do not really understand the alternative you explain in the last sentence. Whichever estimation procedure one uses, in the end the MLE for pi(x,y) is the fraction of time the observed sequence spends at x-then-y and the MLE for rho(x) is the fraction of time the observed sequence spends at x. – Did May 06 '13 at 15:44
my question stemmed from an experimental problem with which i am faced. given a limited amount of data (in the form of a markov chain of finite length) and wanting give the best possible estimate of the frequency distribution of the states of the chain; is it better to empirically extrapolate the transition matrix from this data and then calculate the theoretical distribution? or is it better to directly extrapolate the frequency of the states from the data? – Tito Candelli May 06 '13 at 15:59
@Did: Thanks a lot, that is quite interesting. Can you suggest a reference that I could cite? In particular, I would be interested under which scenarios the stationary distribution is unique. – Bernhard Apr 25 '16 at 13:50
1

@Bernhard http://tata-box-blog.blogspot.fr/2012/04/introduction-to-markov-chains-and.html https://en.wikipedia.org/wiki/Models_of_DNA_evolution https://books.google.fr/books?id=ci7dZGe6oWkC – Did Apr 25 '16 at 14:01
@Did: Thanks a lot for the references - but I still could not find a reference for your claim that $\pi(x,y)$ is uniquely determined by $Q$ if $Q$ is an irreducble tensor. Do you have any math textbook that you could refer me to? – Bernhard Aug 03 '16 at 15:44
@Bernhard Any textbook on finite Markov chains would do, say, Norris Markov chains available on the web. – Did Aug 03 '16 at 18:28
@Did: Unfortunately, I could neither find the required result in Norris, Kemeny & Snell, Gallager, or Papoulis. They all treat only first-order Markov chains, which is quite simple. For higher-order chains, I'm not sure about the uniqueness property, to be honest, and that is why I ask for a reference. I will post my concerns as an answer. – Bernhard Aug 04 '16 at 07:52
1

@Bernhard Of course they do not (this was not clear to me from your previous comment that this is what you were after) since every second order Markov chain $(X_n)$ is associated to a first order Markov chain $(Y_n)$ by defining $Y_n=(X_n,X_{n+1})$. Thus, take the result you are interested in in one of the canonical sources you cite, for $(Y_n)$, and trivially deduce its version for $(X_n)$. For example, $(Y_n)$ has a stationary distribution $\pi$, hence $(X_n)$ has a stationary distribution $\pi'$, which one gets by projecting $\pi$, that is, $$\pi'(x)=\sum_y\pi(x,y)=\sum_y\pi(y,x).$$ – Did Aug 04 '16 at 09:13
Thanks for your comment. I was thinking about that, too, but I am still hesitant. The transition chain $Y_n=(X_n,X_{n+1})$ might have states that cannot be reached, i.e., transient states. How shall I treat those? Moreover, for some state pairs $(x_1,x_2)$ the original $Q$ might not be defined, because the pair is assumed to be impossible (see answer below). Can these cases all be excluded easily, just by assuming irreducibility of $Q$? – Bernhard Aug 04 '16 at 09:46
"The transition chain" No "transition chain" here, whatever "transition chain" may mean. // "states that cannot be reached, i.e., transient states" Ouch... This is not what "transient states" are. // Anyway, states "that cannot be reached" are not a problem, one way of "treating them" is to define the state space of $(Y_n)$ as ${(x,x')\mid\exists n,P(X_n=x,X_{n+1}=x')\ne0}$. – Did Aug 04 '16 at 09:50
"Transition chain": One name for the Markov chain with states $Y_1=(X_1,X_2)$, $Y_2=(X_2,X_3)$, etc. (b/c states are transitions). "Ouch": Tuples that do not occur (see answer below), still need outgoing transition probabilities assigned --> the tuple becomes a transient state in the transition chain (or a separate ergodic set/communicating class). "Treating them": Yes, that is one way, but that only gives you an invariant distribution on the reduced state space, not on the product space $A\times A$. Don't get me wrong, I believe your results, but I'd like to see a rigorous proof. – Bernhard Aug 04 '16 at 11:14
Let us continue this discussion in chat. – Bernhard Aug 04 '16 at 11:16
" believe your results, but I'd like to see a rigorous proof." Belief? Not a matter of "belief", I hope. Get a book. (Unrelated: Rather than hijacking an answer from three years ago, you might want to post a separate question. This would allow you to be extra specific on the setting you have in mind and the precise problem you want to solve, and to explain in details how the basics I recalled are not answering it (something I cannot fathom at present, but perhaps others would).) – Did Aug 04 '16 at 13:48
Exactly, that's what I wanted to have: A book (or another reference). But again, thanks a lot for your comments, I really appreciate your effort! – Bernhard Aug 04 '16 at 14:31
@Did: As a follow-up: I got a book and I found a counterexample. The issue for higher-order Markov chains is not as simple as one would guess. See my answer below. Still, thanks for the discussion! – Bernhard Oct 20 '16 at 08:37
@Bernhard "I got a book and I found a counterexample" Yeah, another option would be to think about the problem. The chain in your post is reducible. // Let me repeat: If you have a problem, counterexample, whatever, in mind, then post a question explaining it in details and we will study it. Writing a detailed and explicit text would be a learning experience, forcing you to be precise and rigorous about the statements you want to dis/prove, thus eliminating the ambiguities that, at present, are clouding your thinking process. – Did Oct 20 '16 at 08:54
@Did: Before responding like you did, you should consider the references I linked; there you will find a definition of irreducibility for higher-order Markov chains, and you will see that my example is not reducible. Posting a new question is not appropriate in this case, as THIS is the question we are dealing with. – Bernhard Oct 20 '16 at 09:12
@Bernhard OK, you found one published book (that at least some of us cannot access to through your link, by the way) which may or may not have (I cannot say, see previous parenthesis) a careless definition, and from this you see fit to wax on and on... Sorry if I do not consider this endeavour very productive but note that I indicated clearly from the start that (and why) one should analyze second order Markov chains $(X_n)$ on some state space $S$ through the Markov chain $(Y_n)$ on $S\times S$ defined by $Y_n=(X_n,X_{n+1})$. Thus, obviously, ... – Did Oct 20 '16 at 09:58
... the irreducibility of $(X_n)$ should be defined in terms of the communicating classes of $(Y_n)$ in $S\times S$. And, naturally, if one strays away from this correct definition, then so-called "counterexamples" may arise. – Did Oct 20 '16 at 09:58
@Did: It is not obvious how one should define irreducibility for second-order chains: That is one of the reasons why I asked you for a reference. You could not - or did not want to - present one, so I had to look for one myself. Instead of trying to understand my arguments, you became offensive. In fact, I found several references that define irreducibility on the state space $S$, and not on its Cartesian product: See Definition 2.1 in http://www.math.hkbu.edu.hk/~mng/tensor-research/highermarkov.pdf and eq. (5) in https://arxiv.org/pdf/1307.6919.pdf. I found no (published) definition... – Bernhard Oct 20 '16 at 11:02
...that supports your claim. – Bernhard Oct 20 '16 at 11:04
My impression at this point is that the maths are clear (and yes, the correct definition is clear to any moderately trained probabilist). Re the papers you cite (since unfortunately the second reference in your post is unavailable, just like the first), if I read them correctly, they assert that a stationary distribution of a second order Markov chain $(X_t)$ is a measure $\mu$ such that if $X_{t-2}$ and $X_{t-1}$ are independent with distribution $\mu$, then $X_t$ has distribution $\mu$. This might make sense algebraically but probabilistically, this is nonsense. Anyway... – Did Oct 20 '16 at 11:24
The papers are cited only because of the definition of irreducibility; I agree that this kind of invariant distribution is not what we are after. I also believe the maths are clear, although we may not agree. After all, it's about definitions, and they can't be wrong. I was open-minded when I looked for these definitions, and these are what I found. You might not agree, but that's no reason to be offensive. Moreover, if you think your definition is so obvious, you should be able to back it with references. Shouldn't you? – Bernhard Oct 20 '16 at 11:36
@Bernhard Sorry to be blunt but the fact that the references you cite in support (which apparently all follow each other) define stationarity as I described in my previous comment suffices to make me consider their (related) definition of irreducibility as not very relevant to the problem at hand. Once again, these authors might have other, algebraic, motivations, but in terms of the underlying stochastic processes their definition is crap. No need of references to back up this statement. Don't you agree? – Did Oct 20 '16 at 11:46
@Did: No, because these are the only definitions for irreducibility I found in the literature. The two books I cite (esp the first) are on Markov chains, i.e., probabilistic in nature. Your definition might be obvious for you (and in this case, as you pointed out, the problem becomes trivial), but unless other probabilists agree openly (i.e., in publications), that does not mean a thing. I think we should move this to "Philosophy", right ;)? – Bernhard Oct 20 '16 at 11:53
Oh, and downvoting my answer was not nice, given that it is only a matter of definitions... – Bernhard Oct 20 '16 at 11:54

Bernhard · Answer 2 · 2016-10-20T11:22:37.140

Since your second-order Markov chain is regular, you can still compute $\lim_{n\to \infty} Q^n$, where

$$ Q^n(x|y,z) = \mathbb{P}(X_{n+1}=x|X_1=y,X_0=z). $$

In the limit, you will have $\lim_{n\to \infty}Q^n(x|y,z)=\varrho(x)$ for every pair $(y,z)$. More information can be found in Chapter 7 of this book (especially equation 7.1.3).

Note however, that regularity or irreducibility of $Q$ do not imply that a unique stationary distribution $\pi(x,y)$ exists:

Consider a second-order Markov chain on $\{1,2,3,4\}$. Consider further, that there are two possible classes of cycles this Markov chain may go through: 1-2-3-4-1 and 1-2-3-1 (to break periodicity), or 1-4-3-2-1 and 1-3-2-1. From all pairs of states, the Markov chain moves to any of these two cycles and remains in them. One can show that this second-order Markov chain is regular and that $Q^n$ converges to a matrix with identical rows with entries $\varrho(x)$. But there is no unique invariant distribution $\pi(x,y)$, since there are two recurrent classes of tuples $(x,y)$ of states.

More generally, as mentioned in this book on page 173, if $x$ is a recurrent state for $(X_n)$, it need not be the case that a state $(x,y)$ for $(X_n,X_{n+1})$ is recurrent.

The transition Q given in the example is obviously reducible, two communication classes being {12,23,34,41,31} and {14,43,32,21,13} (the status of states 24 and 42 being unclear). — Did, Oct 20 '16 at 08:49
A second-order Markov chain is irreducible if, for every $z,y,x$ there is a $n=n(z,y,x)$ such that $Q^n(x|y,z)>0$. Irreducibility is defined on the states of $X$, i.e., on $A$, and not on tuples of states (on $A\times A$). The example I give is clearly irreducible. — Bernhard, Oct 20 '16 at 09:15
See comment on my answer. (Once again, and for the last time, next time you want to explain your own theories à propos a question, please do so by first posting an answer, not by polluting the comment thread of another answer. Thanks in advance.) — Did, Oct 20 '16 at 10:00

Limit theorem of Markov chains applied to higher order Markov chains

2 Answers2