Some statistical (learning) issues

Question

I'm reading about statistical learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning), and for some reason it seems to be treated as trivial that $E[XX^T]$ is non-singular (with $X \in \mathbb{R}^P$ a random real-valued vector). I don't think it's that easy. Am I missing something here?

I see, but there is no condition on $X$. I'm trying to show that $E[XX^T]$ is positive definite, which I think it is, but I'm only running in circles with that. — rarwoan, Apr 01 '15 at 17:24
If $X$ is always 0, you won't include it in your statistical analysis, so there's virtually no danger in assuming the correlation matrix is positive definite. (It's always at least positive semi-definite.) — Memming, Apr 01 '15 at 17:40

score 0 · Accepted Answer · answered Jul 08 '19 at 14:04

This will only be a problem if $X$ has a really pathological distribution. Firstly, let's remap $X$ to have zero mean, i.e. $X \leftarrow X - \mathbb{E}[X]$; then $C=\mathbb{E}[XX^T]$ is exactly the covariance matrix of $X$, which should tell you that it is at least positive semi-definite. We can show this by taking any $v\in\mathbb{R}^P$ with $v\ne \vec{0}$ and writing $\alpha = v^T X= v\cdot X$, which is a scalar. Then $$ v^TCv= v^T\mathbb{E}[XX^T ]v = \mathbb{E}[v^T X X^T v] = \mathbb{E}[v^T X (v^T X)^T] = \mathbb{E}[\alpha^T\alpha] = \mathbb{E}[\alpha^2] \geq 0 $$ So it's positive semi-definite. (The same computation goes through without centering, so the uncentered second-moment matrix is positive semi-definite as well.)
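The identity $v^TCv = \mathbb{E}[\alpha^2]$ can be checked numerically on a sample. The following is a minimal sketch, assuming Gaussian data and NumPy; the dimension, sample size, seed, and variable names are illustrative choices, not anything from the book or the question.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed, arbitrary choice

P, n = 4, 10_000                      # illustrative dimension and sample size
X = rng.normal(size=(n, P))           # each row is one draw of X (assumed Gaussian)
X = X - X.mean(axis=0)                # center the sample, as in the answer

C_hat = X.T @ X / n                   # sample estimate of E[X X^T]

v = rng.normal(size=P)                # an arbitrary non-zero direction
alpha = X @ v                         # alpha_i = v^T x_i for every sample

quad_form = v @ C_hat @ v             # v^T C v
mean_alpha_sq = np.mean(alpha ** 2)   # sample mean of alpha^2

# Smallest eigenvalue of the symmetric matrix C_hat.
min_eig = np.linalg.eigvalsh(C_hat).min()

print(quad_form, mean_alpha_sq, min_eig)
```

The two quantities agree up to round-off because the identity holds exactly for the sample average too, and the smallest eigenvalue is non-negative up to round-off for the same reason.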

But when is $\mathbb{E}[\alpha^2] = 0$ (the problem case)? Since $\alpha^2 \geq 0$, this happens only when $\alpha = 0$ with probability 1.

Well, $v$ is non-zero, but $X$ might be zero. How often does that happen? Assuming $X$ is not a Dirac delta distribution at zero, in practice probably never, especially if $P$ is large. This is because, under mild assumptions, the probability that a continuous random variable takes a specific value is zero.

Of course, the real issue here is that we need to ensure $X$ is not orthogonal to $v$ with probability 1. That happens when all the probability mass (density) for $X$ lies on the hyperplane orthogonal to $v$, i.e. the data lies on $$ X^Tv=0 $$ In other words, the distribution cannot be concentrated on a lower-dimensional subspace of the data space. This is exactly the problem people sometimes find with degenerate features in their dataset: it results in sample covariance matrices with zero determinant (i.e., the matrix is singular). Check out these posts: [1], [2], [3].
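A small illustration of the degenerate case, as a sketch assuming NumPy and synthetic Gaussian features (all names and sizes here are hypothetical): duplicating one feature exactly puts every sample on the hyperplane $X^Tv=0$ for $v=(1,0,-1)$, and the sample covariance loses rank.

```python
import numpy as np

rng = np.random.default_rng(1)       # fixed seed, arbitrary choice

n = 500                              # illustrative sample size
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([x1, x2, x1])    # third feature duplicates the first exactly
X = X - X.mean(axis=0)               # center the sample

C_hat = X.T @ X / n                  # 3 x 3 sample covariance

v = np.array([1.0, 0.0, -1.0])       # direction orthogonal to every sample
alpha = X @ v                        # x1 - x1 = 0 for each row

rank = np.linalg.matrix_rank(C_hat)  # numerical rank of C_hat
quad_form = v @ C_hat @ v            # v^T C v, zero up to round-off

print(rank, quad_form)
```

Here $v$ is non-zero but $v^T\widehat{C}v$ vanishes, so the sample covariance is only positive semi-definite, not positive definite.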

Again, however, for any reasonable distribution, this cannot happen. The reason is that hyperplanes have Lebesgue measure zero, so a random vector drawn from a distribution with a density lands on the plane with probability zero (for the same reason that the probability of a continuous random variable hitting an exact value is zero).

In any case, $X$ would have to adhere very carefully to a hyperplane in data space, which doesn't occur in practice unless you do something like exactly duplicate your features. So, ruling that out, yes: our $C$ is safe (non-singular) and positive definite, in theory.

One caveat is when working with sample covariance matrices $\widehat{C}$ on a computer. A combination of small sample size and floating-point round-off error (e.g., when dealing with data of small magnitude) can lead to the occasional singularity if the distribution $p(X)$ is close to being degenerate; this often shows up when trying to compute $\log \det \widehat{C}$, for instance. One can fix this by adding a tiny bit of noise to the data, or by adding a tiny perturbation $\epsilon I_P$ to $\widehat{C}$.
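The perturbation fix can be sketched as follows, assuming NumPy. The small-sample setting is forced here by taking fewer samples than dimensions, in which case the centered sample covariance has rank at most $n-1$. The value of `eps` and all variable names are hypothetical; a suitable $\epsilon$ depends on the scale of the data.

```python
import numpy as np

rng = np.random.default_rng(2)       # fixed seed, arbitrary choice

P, n = 5, 3                          # fewer samples than dimensions
X = rng.normal(size=(n, P))
X = X - X.mean(axis=0)               # center the sample
C_hat = X.T @ X / n                  # rank at most n - 1, hence singular

eps = 1e-6                           # hypothetical perturbation size
C_reg = C_hat + eps * np.eye(P)      # C_hat + eps * I_P

# Smallest eigenvalues before and after the perturbation.
min_eig_raw = np.linalg.eigvalsh(C_hat).min()
min_eig_reg = np.linalg.eigvalsh(C_reg).min()

# slogdet returns (sign, log|det|) and avoids overflow/underflow in det itself.
sign_reg, logdet_reg = np.linalg.slogdet(C_reg)

print(min_eig_raw, min_eig_reg, sign_reg, logdet_reg)
```

Adding $\epsilon I_P$ shifts every eigenvalue up by $\epsilon$, so the perturbed matrix is positive definite and its log-determinant is finite, at the cost of a small bias in the estimate.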
