Why does modeling the joint distribution of many continuous random variables generalize more easily?

Question

In the paper "A Neural Probabilistic Language Model" by Yoshua Bengio et al., there is the following paragraph:

A fundamental problem that makes language modeling and other learning problems difficult is the curse of dimensionality. It is particularly obvious in the case when one wants to model the joint distribution between many discrete random variables (such as words in a sentence, or discrete attributes in a data-mining task). For example, if one wants to model the joint distribution of 10 consecutive words in a natural language with a vocabulary V of size 100,000, there are potentially $100000^{10} − 1 = 10^{50} − 1$ free parameters. When modeling continuous variables, we obtain generalization more easily (e.g. with smooth classes of functions like multi-layer neural networks or Gaussian mixture models) because the function to be learned can be expected to have some local smoothness properties. For discrete spaces, the generalization structure is not as obvious: any change of these discrete variables may have a drastic impact on the value of the function to be estimated, and when the number of values that each discrete variable can take is large, most observed objects are almost maximally far from each other in hamming distance.

Can someone explain, in simple words, why it is that "When modeling continuous variables, we obtain generalization more easily"?

score 2 · Accepted Answer · answered Aug 16 '17 at 23:23

If one wants to model the joint distribution of 10 consecutive words in a natural language with a vocabulary V of size 100,000, there are potentially $100000^{10} − 1 = 10^{50} − 1$ free parameters

Why is this? Well, specifying the joint distribution of 2 words requires a table of $|V|^2$ numbers (the probabilities of each pair appearing together). Each additional word adds 1 new dimension to the table. Hence, for a sequence of $n$ words you need to specify $|V|^n$ values, minus $1$ (because the probabilities must sum to $1$, so the last value is determined by the others). So, ouch! That's a lot.
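As a quick sanity check of that count, here is a minimal Python sketch. The function name `joint_table_free_params` is my own for illustration; it simply evaluates $|V|^n - 1$ with exact integer arithmetic.

```python
def joint_table_free_params(vocab_size: int, n_words: int) -> int:
    """Free parameters of a full joint probability table over n_words
    discrete variables, each taking vocab_size values.

    The table has vocab_size ** n_words entries; one is fixed by the
    constraint that all probabilities sum to 1.
    """
    return vocab_size ** n_words - 1


# The example from the paper: 10 consecutive words, vocabulary of 100,000.
paper_example = joint_table_free_params(100_000, 10)
print(paper_example == 10**50 - 1)

# Each extra word multiplies the table size by |V|.
for n in range(1, 4):
    print(n, joint_table_free_params(100_000, n))
```

Python integers are arbitrary precision, so the count stays exact rather than being rounded to a float.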

When modeling continuous variables, we obtain generalization more easily (e.g. with smooth classes of functions like multi-layer neural networks or Gaussian mixture models) because the function to be learned can be expected to have some local smoothness properties.

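The thing about discrete distributions is that they can be exceptionally "jagged"; i.e. the probability can change drastically from one value to the next. In language, for instance, there is no reason why one word should statistically appear in similar contexts to, say, the one next to it alphabetically. So nothing learned about one value carries over to its "neighbors", and each combination needs its own parameter; hence the explosion of parameters above. Continuous distributions, by assumption, don't have this issue: nearby points are expected to have similar densities.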
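To make the contrast concrete, here is a toy sketch in Python. Everything in it is an illustrative assumption rather than something from the paper: the tiny word-pair data set, the 1-D points, and the choice of a Gaussian kernel density estimate with bandwidth `0.5` as the "smooth" model. It only shows that a count table gives zero probability to an unseen combination, while a smooth estimator gives sensible density to an unseen point near the observed data.

```python
import math
from collections import Counter

# Discrete case: a count-based table only knows the exact tuples it has seen.
observed_pairs = [("the", "cat"), ("the", "dog"), ("a", "cat")]
counts = Counter(observed_pairs)


def table_prob(pair):
    """Maximum-likelihood estimate from raw counts (no smoothing)."""
    return counts[pair] / len(observed_pairs)


# Continuous case: a Gaussian kernel density estimate spreads probability
# mass to the neighbourhood of each observed point.
observed_points = [1.0, 1.2, 0.9]


def kde_density(x, bandwidth=0.5):
    """1-D Gaussian kernel density estimate (bandwidth is an arbitrary choice)."""
    norm = 1.0 / (len(observed_points) * bandwidth * math.sqrt(2 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in observed_points
    )


unseen_pair_prob = table_prob(("a", "dog"))  # never observed, so exactly zero
nearby_density = kde_density(1.1)            # unseen, but close to observed points
far_density = kde_density(10.0)              # unseen and far from everything

print(unseen_pair_prob, nearby_density, far_density)
```

The pair `("a", "dog")` looks plausible to a human, but the table has no notion of "close to something I have seen", so it cannot generalize to it. The smooth model gets that notion for free from the geometry of the real line.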

More concretely, our problem above had $|V|^n-1$ parameters to characterize the joint distribution of $n$ variables in the discrete case. Now suppose each RV $X_i$ takes values in $\mathbb{R}^d$ rather than in $V$. At first glance, this seems to be harder, since $\mathbb{R}^d$ contains infinitely many values, far more than $|V|$ (even for $d=1$). However, what if we think that the joint distribution of $X=(X_1,\ldots,X_n)\in\mathbb{R}^{nd}$ is well approximated by a Gaussian mixture model? Then we need only specify $k$ (the number of Gaussians), $W$ (the vector of weights, $|W|=k$), the means $\mu_j\in\mathbb{R}^{nd}$, and the covariances $\Sigma_j\in\mathbb{R}^{nd\times nd}$, for $j=1,\ldots,k$. This is only on the order of $k+k(nd+n^2d^2)$ parameters, roughly speaking. This is comparatively quite small! Much of the reason for this is that large patches of space are assumed to have probabilities that vary smoothly compared to their neighbors; hence, one requires many fewer parameters to characterize large patches of space. (Even the largest deep neural networks have nowhere near ${\sim}10^{50}$ parameters! This is why we prefer to do NLP in "continuous spaces", by embedding the words.)
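For a sense of scale, the rough count $k+k(nd+n^2d^2)$ can be evaluated directly. This is only a sketch: the function name `gmm_params_rough` and the values $k=100$ and $d=50$ are hypothetical choices of mine, and the formula is the order-of-magnitude count from above, not an exact one (an exact count would use the symmetry of each covariance matrix and the fact that the weights sum to 1).

```python
def gmm_params_rough(k: int, n: int, d: int) -> int:
    """Rough parameter count of a k-component Gaussian mixture over R^(n*d):
    k weights, plus per component a mean (n*d numbers) and a full
    covariance matrix ((n*d)**2 numbers)."""
    dim = n * d
    return k + k * (dim + dim * dim)


# Discrete table for 10 words over a 100,000-word vocabulary.
table = 100_000 ** 10 - 1

# Hypothetical continuous setup: 10 variables, each in R^50, 100 Gaussians.
gmm = gmm_params_rough(k=100, n=10, d=50)

print(f"discrete table: {table:.3e} parameters")
print(f"GMM (rough):    {gmm:,} parameters")
print(f"ratio:          {table / gmm:.3e}")
```

Even with a full covariance matrix per component, the mixture needs tens of millions of numbers, which is dozens of orders of magnitude fewer than the discrete table.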
