I have some questions about this article on the Dirichlet process. The beginning of Section 2.1 gives three equations, (2.1), (2.2), and (2.3). I don't understand what exactly those probabilities represent or why we need them. One thing that confuses me is that the article drops the subscripts in equations (2.1) and (2.2). It says that $L_i=k$ means that $X_i$ belongs to cluster $k$, so if $L_1 = 1$, then $X_1$ belongs to cluster $1$. How should I interpret these two equations? Also, can anyone suggest articles about Bayesian nonparametrics, the Dirichlet process, and the Indian buffet process? I am trying to understand unsupervised clustering methods based on these processes. Thank you!
1 Answer
Under this model, the data are assumed to be generated in the following way.
- First, pick a cluster according to the distribution $c_k:=\mathbb{P}\{L=k\}$ (for example, maybe there are $6$ clusters, and you roll a biased die to choose the cluster).
- Having chosen the cluster $L=k$, pick a point $X$ by drawing from the distribution $P_k(\cdot):= \mathbb{P}[X \in \cdot \mid L=k]$ (a classic choice is a multivariate Gaussian, so that most points fall near the mean and form a "cluster"). Each cluster has its own distribution (so maybe one cluster is a Gaussian centered over here, another is centered over there, and so on).
- Plot this point. Repeat these steps over and over to generate a dataset. You can replace the $L$ and $X$ with $L_1$ and $X_1$, then $L_2$ and $X_2$ for the next point, and so on.
These are our assumptions, that is, somebody had distributions $\mathbb{P}\{L=k\}$ and $\mathbb{P}[X \in \cdot \mid L=k]$, and followed this procedure to create a dataset.
The usual task is to go backwards: take a given dataset and, assuming it was created in this manner, figure out what these two distributions are.
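The generative procedure above can be sketched in a few lines of Python. This is only an illustration under assumed values: the three clusters, their weights, and the one-dimensional Gaussian parameters below are made up for the example and do not come from the article. Labels are 0-indexed here, whereas the article numbers clusters from $1$.

```python
import random

# Hypothetical parameters, chosen only for illustration (not from the article).
weights = [0.5, 0.3, 0.2]   # c_k = P{L = k}, the "biased die"
means = [-4.0, 0.0, 5.0]    # mean of the Gaussian P_k for each cluster
stds = [1.0, 0.5, 1.5]      # standard deviation of P_k for each cluster


def sample_dataset(n, seed=0):
    """Draw n pairs (L_i, X_i) by following the two-step procedure."""
    rng = random.Random(seed)
    labels, points = [], []
    for _ in range(n):
        # Step 1: pick a cluster label L = k with probability c_k.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        # Step 2: draw X from P_k, here a 1-D Gaussian.
        x = rng.gauss(means[k], stds[k])
        labels.append(k)
        points.append(x)
    return labels, points


labels, points = sample_dataset(1000)
```

In the "backwards" problem you would be handed only `points`, without `labels`, `weights`, `means`, or `stds`, and asked to recover them.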
angryavian
-
Thanks for your comment. So the second step in the list means that we choose a cluster $L=k$ and then find the probability $P_k$ of the data $X$ given the cluster $L=k$? And does "going backwards" mean clustering the data? – eChung00 Aug 19 '14 at 03:30
-
@eChung00 The list of steps that I wrote describes how to create a brand new dataset when all you have are the two distributions. "Going backwards" means that you have a dataset already, but you want to find out what the two distributions are. – angryavian Aug 19 '14 at 03:38
-
I have one more question. In equation 2.3, they multiply these two probabilities and call the result a mixture distribution. Can you explain this part too? – eChung00 Aug 19 '14 at 18:37
-
@eChung00 This is the marginal distribution of $X$ (don't worry about the word "marginal", it's just the probability distribution that the data follow). You can derive it using marginalization and the definition of conditional probability: $$\mathbb{P}\{X \in \cdot\} = \sum_k \mathbb{P}\{X \in \cdot,\, L=k\} = \sum_k \mathbb{P}[X \in \cdot \mid L = k]\, \mathbb{P}\{L=k\}.$$ – angryavian Aug 19 '14 at 21:05
-
@eChung00 This is how a mixture distribution is defined. If you use the Bernoulli/Gaussian model that I described in my post, it would look like a bunch of "hills," since you are adding a bunch of Gaussian distributions together. There's a picture in Figure 2.2 in the file. – angryavian Aug 19 '14 at 21:07
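To make the "bunch of hills" picture concrete, here is a minimal sketch that evaluates the mixture density $\sum_k c_k\, p_k(x)$ for one-dimensional Gaussian components. The weights, means, and standard deviations are hypothetical values picked for illustration, not values from the article, and `normal_pdf` and `mixture_pdf` are helper names introduced only for this example.

```python
import math

# Hypothetical parameters, chosen only for illustration (not from the article).
weights = [0.5, 0.3, 0.2]   # c_k = P{L = k}
means = [-4.0, 0.0, 5.0]    # mean of each Gaussian component
stds = [1.0, 0.5, 1.5]      # standard deviation of each component


def normal_pdf(x, mean, std):
    """Density of a 1-D Gaussian, playing the role of P[X in . | L = k]."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_pdf(x):
    """Marginal density of X: sum over k of c_k times the k-th component density."""
    return sum(c * normal_pdf(x, m, s) for c, m, s in zip(weights, means, stds))


# Sanity check: a Riemann sum of the mixture density over [-20, 20) should be close to 1.
step = 0.01
total = sum(mixture_pdf(-20.0 + i * step) * step for i in range(4000))
```

Plotting `mixture_pdf` over a grid of `x` values would show one hill per component, with each hill's height scaled by its weight $c_k$.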