How to cluster words using Jensen-Shannon Divergence?

Question

I'm working on a project that requires text analyzing. I'm currently trying to extract keywords from a single document. I referred to this thesis(Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information by Matsuo) and tried to implement its algorithm myself. In the algorithm there's a step of clustering the similar words. It recommends Jensen-Shannon divergence, according to this essay(Similarity-Based Models of Word Cooccurrence Probabilities by Degan). The original Jensen-Shannon divergence formula is: $J\left(w_{1}, w_{1}^{\prime}\right)=\frac{1}{2}\left[D\left(w_{1} \| \frac{w_{1}+w_{1}^{\prime}}{2}\right)+D\left(w_{1}^{\prime} \| \frac{w_{1}+w_{1}^{\prime}}{2}\right)\right]$and the essay gives another version: enter image description here where h(x) = -x*log(x)

Now I've been using the second formula, where w1 and w1' being two words from a word set, and w2 being other words in said word set. But I don't know what to do next. I don't know how to calculate the probability distribution for words in a text! Can you help?

@JussiPiitulainen sorry, typo mistake... Do you have any idea how to solve this? — Joe Wong, Feb 03 '16 at 10:39
I'm not sure how concrete advice you need. I think the key is that the things inside the formula are probability mass functions that are obtained from co-occurrence counts in text, so this is what I mainly tried to explain in my answer. I can expand or clarify if that seems useful. I've used this formula on word similarity myself. — , Feb 03 '16 at 11:31

score 1 · Accepted Answer · 2016-02-03T11:15:19.990

A general keyword you could search for is "distributional similarity". My formulation: Words are similar to the extent they occur together with the same words.

For this purpose, word co-occurrence statistics mean basically the number of times that other words occur together with the words of interest. First you build a "vector" that tells how often any of the "other words" occurred together with a word. Then you normalize these counts so that their sum is $1$. That is formally a probability mass function: the "other words" are the elementary outcomes and the normalized counts are their probabilities. The "words" in the formula in your image are these probability mass functions.

Then you apply the Jensen-Shannon divergence, aka information radius, aka mean divergence to the mean, $(D(f\| (f + g)/2) + D(g\|(f + g)/2))/2$, to the pairs of such functions, $f$ and $g$, and cluster the words based on the numbers you obtain, using them as if they were distances of the words.

It's possible to apply weights to the counts according to their informativeness. Formally the crucial things are that the mean of probability mass functions is again a probability mass function, and the mean is only $0$ where both of $f$ and $g$ are $0$, so relative entropy aka Kullback-Leibler divergence can be applied without smoothing.

The group where Dagan was at that time (I think) did some empirical experiments where this formula compared favourably to a number of others. You might also want to look up some of the papers of Lillian Lee from that time, including her PhD thesis.

(Some use the name "information radius" for a formula that is twice "Jensen-Shannon", but there is an earlier paper on information radius that develops the same general formula under that name. When comparing two equally weighted distributions, the difference does not matter.)

The paper I mentioned is Robin Sibson, 1969, "Information Radius", Probability Theory and Related Fields, 14(2):149-160. In 1969 the journal was Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete. It develops the formula, not the application to word co-occurrences. (Jianhua Lin's paper where he names the formula after Jensen and Shannon seems to be as late as 1991. Is that right?)

Thank you I really appreciate your answer! I now believe I've done it the right way. — Joe Wong, Feb 04 '16 at 10:24
Excuse me I have another question right now. If I got a very small Jensen-Shannon Divergence value, does it mean the two distributions(words) are close enough to be clustered? — Joe Wong, Feb 14 '16 at 07:43

How to cluster words using Jensen-Shannon Divergence?

1 Answers1