A general keyword you could search for is "distributional similarity". My formulation: Words are similar to the extent they occur together with the same words.
For this purpose, word co-occurrence statistics are basically the number of times that other words occur together with the words of interest. First you build, for each word of interest, a "vector" that tells how often each of the "other words" occurred together with it. Then you normalize these counts so that their sum is $1$. That is formally a probability mass function: the "other words" are the elementary outcomes and the normalized counts are their probabilities. The "words" in the formula in your image are these probability mass functions.
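As a minimal sketch of the counting and normalizing step: the function name `cooccurrence_pmf`, the symmetric window of two tokens, and the toy sentence are all my own assumptions for illustration. What counts as "occurring together" (window, sentence, syntactic relation) is a design choice the description above leaves open.

```python
from collections import Counter

def cooccurrence_pmf(tokens, target, window=2):
    """Normalized counts of the words seen within `window` positions of `target`.

    Hypothetical helper: a fixed symmetric window is only one possible
    definition of "occurring together".
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target occurrence itself
                counts[tokens[j]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}  # target never seen: no distribution can be built
    # Dividing by the total makes the values sum to 1.
    return {w: c / total for w, c in counts.items()}

tokens = "the cat sat on the mat the dog sat on the rug".split()
cat = cooccurrence_pmf(tokens, "cat")
dog = cooccurrence_pmf(tokens, "dog")
```

Here `cat` and `dog` are dictionaries from context words to probabilities, which is the form the divergence is applied to.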
Then you apply the Jensen-Shannon divergence (aka information radius, aka mean divergence to the mean), $(D(f\| (f + g)/2) + D(g\|(f + g)/2))/2$, where $D$ denotes relative entropy, to pairs $f$ and $g$ of such functions. Finally you cluster the words based on the numbers you obtain, using them as if they were distances between the words.
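A small sketch of the formula itself, assuming the distributions are stored as dictionaries from outcomes to probabilities. The names `kl` and `jensen_shannon` and the choice of base-2 logarithms are mine; the clustering step is not shown.

```python
import math

def kl(p, q):
    """Relative entropy D(p || q) in bits.

    Terms with p(x) = 0 contribute nothing and are skipped; q must be
    positive wherever p is positive.
    """
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

def jensen_shannon(f, g):
    """Mean of the divergences of f and g to their mean (f + g) / 2."""
    support = set(f) | set(g)
    m = {x: (f.get(x, 0.0) + g.get(x, 0.0)) / 2 for x in support}
    return (kl(f, m) + kl(g, m)) / 2

# Toy distributions that share one outcome ("sat") and differ on the other.
f = {"the": 0.5, "sat": 0.5}
g = {"sat": 0.5, "on": 0.5}
print(jensen_shannon(f, g))  # 0.5
```

With base-2 logarithms the value ranges from $0$ for identical distributions to $1$ for distributions with disjoint support.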
It's possible to apply weights to the counts according to their informativeness. Formally, the crucial points are that the mean of two probability mass functions is again a probability mass function, and that the mean is $0$ only where both $f$ and $g$ are $0$. So relative entropy (aka Kullback-Leibler divergence) from $f$ or $g$ to the mean is always finite and can be applied without smoothing.
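To spell out why no smoothing is needed, write $m = (f + g)/2$ (the symbol $m$ is my shorthand). Wherever $f(x) > 0$ we have $m(x) \ge f(x)/2 > 0$, so every term of the sum is defined and the ratio inside the logarithm is at most $2$:

$$D(f\|m) = \sum_{x:\, f(x) > 0} f(x) \log\frac{f(x)}{m(x)} \le \sum_{x} f(x) \log 2 = \log 2.$$

The same bound holds for $D(g\|m)$, so the mean of the two is at most $\log 2$, which is $1$ bit with base-2 logarithms. By contrast, $D(f\|g)$ taken directly is infinite as soon as $g(x) = 0$ for some $x$ with $f(x) > 0$, which happens all the time with sparse co-occurrence counts.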
The group where Dagan was at that time (I think) did some empirical experiments where this formula compared favourably to a number of others. You might also want to look up some of the papers of Lillian Lee from that time, including her PhD thesis.
(Some use the name "information radius" for a formula that is twice "Jensen-Shannon", but there is an earlier paper on information radius that develops the same general formula under that name. When comparing two equally weighted distributions, the difference does not matter.)
The paper I mentioned is Robin Sibson, 1969, "Information Radius", Probability Theory and Related Fields, 14(2):149-160. In 1969 the journal was Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete. It develops the formula, not the application to word co-occurrences. (Jianhua Lin's paper where he names the formula after Jensen and Shannon seems to be as late as 1991. Is that right?)