I am carrying out analysis on a corpus of data and I am currently investigating the frequency of words appearing in that corpus.

What I am looking for is a function that penalises both large and small values so that, instead of a graph of decreasing values as words become more infrequent, I am left with an approximation of a bell-shaped curve.

Any help would be greatly appreciated.

Patrick

[Figure: Original and Transformed data]

  • Why do you want a bell shaped curve? I wouldn't expect word frequencies to have anything like a normal distribution, and I don't see any reason to force the data into that distribution. – Robert Israel Aug 06 '13 at 15:18
  • What I'm trying to do is to cluster documents and assign relevant labels. I wouldn't want to have a cluster titled with very common words but similarly I wouldn't want very obscure names. By transforming it I hope to be able to view the central range and identify labels from this data – Pdycassidy Aug 06 '13 at 15:50

1 Answer

Well, you can play with the Gaussian (normal) density. Suppose your smallest value is $x_{\min}$ and your largest is $x_{\max}$. Then the midrange would be $(x_{\max}+x_{\min})/2$, which you could set to be $\mu$. The normal density is $\displaystyle f(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}.$ The $\sigma$ parameter controls the width of the bell, so you could play around with that. I'd recommend coding this all up in Excel: put your histogram data in one column, compute the max and min, and then code this function up, referencing a changeable cell as the $\sigma$.
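For anyone who prefers a script to a spreadsheet, the same recipe can be sketched in Python. This is only an illustration of the formula above with $\mu$ set to $(x_{\max}+x_{\min})/2$; the function names, the sample frequencies, and the value of $\sigma$ are all made up for the example and are not from the question's data.

```python
import math


def gaussian_weight(x, mu, sigma):
    """Normal density f(x) with centre mu and spread sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def weight_values(values, sigma):
    """Weight each value by the density centred on (max + min) / 2."""
    mu = (max(values) + min(values)) / 2.0
    return [gaussian_weight(x, mu, sigma) for x in values]


# Hypothetical word frequencies, for illustration only.
frequencies = [120, 80, 45, 30, 12, 5, 2, 1]

# sigma plays the role of the "changeable cell": adjust it to widen or
# narrow the band of values that receive a high weight.
weights = weight_values(frequencies, sigma=25.0)

for freq, weight in zip(frequencies, weights):
    print(f"{freq:>4}  {weight:.5f}")
```

Values near the centre of the range get the largest weights, and both the very large and the very small values are pushed towards zero. A smaller $\sigma$ makes that penalty sharper.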

Adrian Keister
  • Thanks Adrian, this function does generate a peak in the middle of the graph. The problem is that my data is extremely right skewed and, as a result, the peak occurs far to the left as well. Can you think of any solutions to achieve a central peak? I can supply graphs if required. – Pdycassidy Aug 14 '13 at 10:45
  • @Pdycassidy: I'm not sure I quite follow. If the data is right-skewed, why does it peak to the left as well? The peak of the normal distribution I gave you occurs at $\mu$. I'm not following your distinction between "peak in the middle of the graph" and "central peak". A graph would be nice, definitely. – Adrian Keister Aug 14 '13 at 12:11
  • I have attached the graphs to the question body. The rows are the original and transformed data and the columns are the first 250 terms and the first 2000 terms (Data set has >250,000 terms). As you can see, there is a peak at the very beginning of the transformed data whereas I was looking for something more central. However, this is not a fault of the solution you provided so thank you. – Pdycassidy Aug 15 '13 at 13:33