Finding Mode from Histogram

Question

In high school mathematics, we often find the mode of grouped data from a histogram. I am not asking how to carry out that process. My question is: why does the graphical process find the mode?

I am not asking about the formula. I know the formula can be derived from geometry. But in that geometry, we first have to assume the mode from the histogram: join the last point of the previous bar to the second point of the modal class, and join the first point of the next bar to the first point of the modal class. Then, from the point of intersection, we draw a perpendicular line to the x-axis. — Swapnil MZS, Feb 09 '21 at 06:19
https://tutors4you.com/modegraphically.htm This link might help. I wanted to know why this procedure works. — Swapnil MZS, Feb 09 '21 at 06:22

BruceET · Answer 1 · 2021-02-09T11:22:37.423

The answer depends on what you mean by the mode of grouped data. Sometimes this is based on a histogram: the modal interval is defined as the one with the tallest histogram bar. Then there are various rules for picking one point within the modal interval to designate as the mode of the grouped data.

Often the ultimate goal of such a procedure is to estimate the mode of the population, which is at the highest point of the density function. For example, the mode of a normal distribution is always at the mean. Suppose you are trying to use a histogram to find the mode of the normal distribution from which the data in the histogram were sampled. If you have the $n$ individual observations that were grouped to make the histogram, then the best estimate of the mode of that normal distribution is just the sample mean of the observations, $\bar X = \frac 1n \sum_{i=1}^n X_i.$

Example 1: Suppose I have $n = 500$ observations from the normal distribution $\mathsf{Norm}(\mu = 50, \sigma=7).$ Below I use R statistical software to make such a sample, rounding the observations to 2 decimal places:

set.seed(2021)
x = round(rnorm(500, 50, 7), 2)
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  31.09   44.87   50.10   50.03   54.88   71.15

We know that the mode of the normal distribution is at $\mu=50.$ From the data summary we would estimate this by $\bar X = 50.03.$

Below is one possible histogram of the data. (I have shown the density function of the population in the background.) The modal interval of the histogram is $(50, 55).$ So any formula for finding 'the mode' from the histogram will pick some point in that interval.

hist(x, prob=T, ylim = c(0,.06), col="skyblue2", main="n=500: NORM(50, 7)")
 curve(dnorm(x, 50, 7), add=T, lwd=2, col="orange")

If I just have the grouped data, then the frequencies $f_j$ of the intervals and the midpoints $m_j$ of the intervals are as shown below.

f = hist(x,plot=F)$counts;  f
[1]   4  41  82 118 131  74  42   7   1
m = hist(x,plot=F)$mids;  m
[1] 32.5 37.5 42.5 47.5 52.5 57.5 62.5 67.5 72.5

Then I can use the formula $\bar X \approx \frac 1 n \sum_{j=1}^9 f_jm_j = 50.12$ to estimate the sample mean from the interval frequencies and midpoints.

sum(f*m)/500
[1] 50.12

I don't know what formula you are using to find the mode from the histogram, but whatever it is, for normal data it won't estimate the mode of the population more reliably than $\bar X.$
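As an illustration only: one commonly taught rule, which may or may not be the one in the question, places the mode at $L + h\,\frac{d_1}{d_1+d_2},$ where $L$ is the lower edge of the modal interval, $h$ is its width, and $d_1, d_2$ are the amounts by which the modal frequency exceeds the frequencies of the bars to its left and right. The sketch below applies that assumed rule to the counts from Example 1, and also intersects the two diagonals of the graphical construction, to show that the drawing and the formula give the same point. The variable names are mine.

```r
# Grouped data from Example 1 (counts and bin edges copied from the output above)
f      <- c(4, 41, 82, 118, 131, 74, 42, 7, 1)
breaks <- seq(30, 75, by = 5)

k  <- which.max(f)          # index of the modal bar (assumed not first or last)
L  <- breaks[k]             # lower edge of the modal interval
h  <- breaks[k + 1] - L     # bar width
d1 <- f[k] - f[k - 1]       # excess over the bar to the left
d2 <- f[k] - f[k + 1]       # excess over the bar to the right

# Assumed interpolation rule (not necessarily the rule used in the question)
mode_formula <- L + h * d1 / (d1 + d2)

# Same point from the drawing. With t = (x - L)/h, the diagonals are
#   A: y = f[k]     - t * d2   (top-left of modal bar to top-left of next bar)
#   B: y = f[k - 1] + t * d1   (top-right of previous bar to top-right of modal bar)
# Solve the 2 x 2 linear system for (t, y).
sol <- solve(rbind(c(d2, 1), c(-d1, 1)), c(f[k], f[k - 1]))
mode_graph <- L + h * sol[1]

round(c(formula = mode_formula, graph = mode_graph), 2)
```

Both values come to about $50.93,$ inside the modal interval from 50 to 55. The diagonals cross nearer to whichever neighboring bar is taller, so the rule leans toward the side where the histogram suggests more density. That is a heuristic, not a guarantee that the result is close to the population mode.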

I have used the default histogram in R. I could have used R along with additional instructions to make a different histogram. Then estimates using the histogram would be similar to the above, but a little different. One possibility is shown below. Now the modal bar is in the interval $(50,52).$

hist(x, prob=T, br = 20, col="skyblue2", main="n=500: NORM(50, 7)")
 curve(dnorm(x, 50, 7), add=T, lwd=2, col="orange")

Here, the parameter br = 20 is only a rough suggestion; R chooses 21 instead in order to make a histogram with 'round' numbers as interval boundaries. [I could pick my own exact interval endpoints, and R would obey, if possible.]
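A minimal sketch of the difference between suggesting a number of bars and supplying exact endpoints. The cut points in my_breaks are a hypothetical choice of mine (width 3), not ones used above. The sample is regenerated with the same seed so that the block runs on its own.

```r
set.seed(2021)
x <- round(rnorm(500, 50, 7), 2)

# A single number for 'breaks' is treated only as a suggestion
h20 <- hist(x, breaks = 20, plot = FALSE)
length(h20$counts)     # number of bars R actually used
h20$breaks             # the 'round' boundaries it chose

# A vector of cut points is used exactly as given; it must span range(x)
my_breaks <- seq(27, 75, by = 3)     # hypothetical endpoints
h3 <- hist(x, breaks = my_breaks, plot = FALSE)
h3$mids[which.max(h3$counts)]        # midpoint of the modal bar
```

Changing the endpoints can move the modal bar, which is one reason a mode read from a histogram depends on how the data were binned.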

Example 2. Now let's look briefly at a sample of $n = 1000$ observations from a non-normal population (gamma distribution), for which the population mode $40$ is not the same as the population mean $\mu = 50.$

set.seed(2011)
y = round(rgamma(1000, 5, 1/10), 2)
summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   7.34   32.76   46.02   49.65   63.59  199.30 
hist(y, prob=T, br=15, col="skyblue2", main="n=1000, GAMMA(5, .1)")
 curve(dgamma(x, 5, .1), add=T, lwd=2, col="orange")
 lines(density(y), lwd=2, lty="dotted")

For this histogram, the modal interval is $(30,40)$ and your method of finding the sample mode would choose a point in that interval. Knowing the formula for this density function (orange curve), I could use calculus to find its maximum, which is the population mode. If I have only the sample I can estimate the mode from the histogram.
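For reference, the calculus step is short. Writing $\alpha$ for the shape and $\lambda$ for the rate (symbols introduced here for illustration), the gamma density and its log-derivative are

$$f(x) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\lambda x}, \quad x > 0, \qquad \frac{d}{dx}\log f(x) = \frac{\alpha-1}{x} - \lambda .$$

Setting the derivative to zero gives

$$x_{\text{mode}} = \frac{\alpha-1}{\lambda} = \frac{5-1}{1/10} = 40 \quad (\alpha > 1),$$

and the second derivative of $\log f,$ namely $-(\alpha-1)/x^2,$ is negative, so this point is a maximum.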

Or, ignoring the histogram, I can use more advanced methods to estimate the shape and rate parameters of the distribution and estimate the mode from them. Alternatively, for moderately large samples, a kernel density estimator (KDE) can estimate the density from the sample and we can use that to estimate the mode. The KDE for the sample above is shown as a dotted curve.
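Both ideas can be sketched in a few lines. This is a sketch under assumptions: the parameter fit uses the method of moments as a simple stand-in for the more advanced fitting methods mentioned (maximum likelihood would be a usual alternative), and the KDE uses the default bandwidth of density(). The sample is regenerated with the same seed so that the block runs on its own.

```r
set.seed(2011)
y <- round(rgamma(1000, 5, 1/10), 2)

# (a) Method of moments (a stand-in, not the only way to fit a gamma):
#     mean = shape/rate and variance = shape/rate^2
rate_hat  <- mean(y) / var(y)
shape_hat <- mean(y)^2 / var(y)
mode_mom  <- (shape_hat - 1) / rate_hat   # gamma mode; valid when shape_hat > 1

# (b) Kernel density estimate: take the x-value at which the estimate peaks
d <- density(y)                           # default bandwidth
mode_kde <- d$x[which.max(d$y)]

c(moments = mode_mom, kde = mode_kde)
```

Neither estimate depends on how the data are binned, although the KDE result does depend on the bandwidth.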
