I'm reading about statistical learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning) and for some reason it seems to be treated as trivial that $E[XX^T]$ is non-singular (with $X \in \mathbb{R}^P$ a random real-valued vector), but I don't really think it's that easy. Am I missing something here?

rarwoan
  • 926
  • as long as $\forall i \; E[X_i] \neq 0$... – Memming Apr 01 '15 at 17:18
  • I see, but there is no condition on $X$. I'm trying to see if $E[XX^T]$ is positive definite, which I think it is, but I'm only running in circles with that. – rarwoan Apr 01 '15 at 17:24
  • if X is always 0, you won't include it in your statistical analysis, so there's virtually no danger in assuming the correlation matrix is positive definite. (It's always at least positive semi-definite.) – Memming Apr 01 '15 at 17:40

1 Answer

This will only be a problem if $X$ has a really pathological distribution. First, let's remap $X$ to have zero mean, i.e. $X \leftarrow X - \mathbb{E}[X]$; then $C=\mathbb{E}[XX^T]$ is exactly the covariance matrix of $X$, which should tell you that it is at least positive semi-definite. We can show this directly: for any non-zero $v\in\mathbb{R}^P$, $$ v^TCv= v^T\mathbb{E}[XX^T ]v = \mathbb{E}[v^T X X^T v] = \mathbb{E}[v^T X (v^T X)^T] = \mathbb{E}[\alpha^T\alpha] = \mathbb{E}[\alpha^2] \geq 0 $$ where $\alpha = v^T X= v\cdot X$ is a scalar. So it's positive semi-definite.
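The identity $v^TCv = \mathbb{E}[\alpha^2]$ can be checked numerically on a sample. This is only a minimal sketch, assuming NumPy and synthetic Gaussian data; the helper name `quadratic_form_two_ways`, the sample size, and the seed are illustrative choices, not anything from the book.

```python
import numpy as np

def quadratic_form_two_ways(X, v):
    """Return (v^T C v, mean((v . x)^2)) for the sample second-moment matrix C."""
    C = X.T @ X / X.shape[0]      # sample estimate of E[X X^T]
    alpha = X @ v                 # alpha_i = v . x_i, one scalar per sample
    return float(v @ C @ v), float(np.mean(alpha ** 2))

rng = np.random.default_rng(0)    # assumed seed, for reproducibility only
X = rng.normal(size=(1000, 4))    # hypothetical data: 1000 samples, P = 4
X = X - X.mean(axis=0)            # center, mirroring X <- X - E[X]
v = rng.normal(size=4)            # an arbitrary non-zero direction

lhs, rhs = quadratic_form_two_ways(X, v)
print(lhs, rhs)                   # the two values agree up to round-off
```

Since the second value is an average of squares, it cannot be negative, which is the sample version of the semi-definiteness argument.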

But when is $\mathbb{E}[\alpha^2] = 0$, i.e. when is $\alpha = 0$ with probability one (the problem case)?

Well, $v$ is non-zero, but $X$ might be zero; how often does that happen? Assuming $X$ is not a Dirac delta distribution at zero, in practice probably never, especially if $P$ is large. This is because the probability that a continuous random variable takes a specific value is zero, under mild assumptions.

Of course, the real issue here is that $X$ must not be orthogonal to $v$ with probability one. That failure occurs when all the probability mass (density) for $X$ lies on a hyperplane that is orthogonal to $v$, i.e. the data lies on $$ X^Tv=0 $$ In other words, our distribution cannot lie on a low-dimensional subspace of the data space. This is exactly the problem people sometimes find with degenerate features in their dataset: it results in sample covariance matrices with zero determinant (i.e., your matrix would be singular). Check out these posts: [1], [2], [3].
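To make the degenerate case concrete, here is a small sketch, assuming NumPy and made-up Gaussian data, in which one feature exactly duplicates another. Every sample then satisfies $X^Tv = 0$ for $v = (1, 0, 0, -1)$, and the sample second-moment matrix loses rank. The helper `second_moment` and all sizes are hypothetical.

```python
import numpy as np

def second_moment(X):
    """Sample estimate of E[X X^T] from rows of X."""
    return X.T @ X / X.shape[0]

rng = np.random.default_rng(1)           # assumed seed
base = rng.normal(size=(500, 3))         # three well-behaved features

# Degenerate design: the fourth feature exactly duplicates the first,
# so all samples lie on the hyperplane x . v = 0 with v below.
degenerate = np.column_stack([base, base[:, 0]])
v = np.array([1.0, 0.0, 0.0, -1.0])

C_ok = second_moment(base)
C_bad = second_moment(degenerate)

rank_ok = np.linalg.matrix_rank(C_ok)    # full rank for the 3-feature data
rank_bad = np.linalg.matrix_rank(C_bad)  # rank-deficient 4 x 4 matrix
max_alpha = np.max(np.abs(degenerate @ v))  # largest |x . v| over the samples

print(rank_ok, rank_bad, max_alpha)
```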

Again, however, for any reasonable distribution, this cannot happen. The reason is that hyperplanes have measure zero, meaning a randomly drawn vector will essentially never land on the plane (for the same reason that the probability of a continuous random variable hitting an exact value is zero).

In any case, you would have to choose $X$ to adhere very carefully to a hyperplane in data space, which doesn't occur in practice unless you do something like exactly duplicate your features. So ruling that out, yes, our $C$ is safe (non-singular) and positive definite with probability $1$, in theory.

One caveat is when working with sample covariance matrices $\widehat{C}$ on a computer. A combination of small sample size and floating-point round-off error (e.g., when dealing with data of small magnitude) can lead to the occasional singularity if the distribution $p(X)$ is close to being degenerate; this often shows up when trying to compute $\log \det C$, for instance. One can fix this by adding a tiny bit of noise to the data, or by adding a tiny perturbation $\epsilon I_P$ to $\widehat{C}$.
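The $\epsilon I_P$ fix might look like the following. This is a sketch under stated assumptions: NumPy, synthetic data with fewer samples than features so that $\widehat{C}$ is singular by construction, and an arbitrary $\epsilon = 10^{-6}$ that would need tuning to the scale of real data. The function name `regularized_logdet` is made up for illustration.

```python
import numpy as np

def regularized_logdet(C_hat, eps=1e-6):
    """Return log det(C_hat + eps * I), computed with slogdet for stability."""
    P = C_hat.shape[0]
    sign, logdet = np.linalg.slogdet(C_hat + eps * np.eye(P))
    if sign <= 0:
        raise ValueError("matrix is not positive definite even after regularization")
    return logdet

rng = np.random.default_rng(2)           # assumed seed
X = rng.normal(size=(5, 10))             # hypothetical: n = 5 samples, P = 10 features
X = X - X.mean(axis=0)                   # center the data
C_hat = X.T @ X / X.shape[0]             # sample covariance, rank at most n - 1

rank = np.linalg.matrix_rank(C_hat)      # below P, so C_hat is singular
logdet = regularized_logdet(C_hat, eps=1e-6)
min_eig = np.linalg.eigvalsh(C_hat + 1e-6 * np.eye(10)).min()

print(rank, logdet, min_eig)
```

Using `np.linalg.slogdet` rather than `np.log(np.linalg.det(...))` avoids underflow in the determinant itself. The perturbation shifts every eigenvalue up by $\epsilon$, so the result depends on that choice.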

user3658307
  • 10,433