I am having a discussion with a co-worker who insists on computing the mode of a dataset of continuous data, with values ranging from 0 to 3000. I say the result is irrelevant, because the number of repeated values will be too small compared to the total amount of data.

Can you point me to some literature that proves or disproves my point? And what percentage of the total data can be accepted as a valid mode?

Thanks.

JhonDoe
    Kernel density estimation with a suitable bandwidth may give a reasonable estimate of the mode of the underlying distribution. – heropup Dec 01 '20 at 07:14

1 Answer

Suppose you have 1000 observations from a gamma distribution with shape 5 and rate 0.01. The following random sample is chosen using R.

set.seed(2020)
x = rgamma(1000, 5, .01)
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  66.37  339.38  463.70  499.28  619.42 1636.80 

Because this is an (unrounded) sample from a continuous distribution, all 1000 observations are different, so the sample has no meaningful mode.
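A quick way to check this on the simulated sample is sketched below. The variable names (`n_distinct`, `max_count`, `rounded_modes`) are only illustrative, and the rounding step at the end is my own addition to show that a sample "mode" only appears once the data are coarsened, and then depends on how they are rounded.

```r
set.seed(2020)
x <- rgamma(1000, shape = 5, rate = 0.01)

# Number of distinct values; for unrounded continuous data this
# should match the sample size.
n_distinct <- length(unique(x))

# Largest number of times any single value occurs in the sample.
max_count <- max(table(x))

cat("observations:", length(x),
    " distinct:", n_distinct,
    " max repeats:", max_count, "\n")

# Illustration only: after rounding to whole numbers, ties appear,
# so a most frequent value (or several tied ones) exists.
tab <- table(round(x))
rounded_modes <- names(tab)[tab == max(tab)]
rounded_modes
```

If `max_count` is 1, every value occurs exactly once and any observation is as much a "mode" as any other.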

However, the continuous distribution of the population has a mode. Sometimes the population mode is approximated by looking at a histogram of the data, and using some kind of interpolation to say where in the tallest bar the population mode may lie. (The assessment of the mode depends on the choice of histogram bins used. Sometimes the tallest bar of a histogram, if there is one, is called its modal bar.)
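To see how much the histogram-based assessment depends on the bins, here is a minimal sketch that reports the midpoint of the tallest bar for a few bin counts. The helper `modal_mid` is a hypothetical name of mine, not a base R function, and it uses the bar midpoint rather than any interpolation within the bar.

```r
set.seed(2020)
x <- rgamma(1000, shape = 5, rate = 0.01)

# Midpoint of the tallest histogram bar. 'breaks' is only a suggested
# number of bins; hist() may adjust it to get "pretty" break points.
# which.max() returns the first bar if several are tied for tallest.
modal_mid <- function(x, breaks) {
  h <- hist(x, breaks = breaks, plot = FALSE)
  h$mids[which.max(h$counts)]
}

# Compare the modal-bar midpoint across several bin counts.
mids <- sapply(c(10, 20, 40), function(b) modal_mid(x, b))
mids
```

If the three values differ noticeably, that is the bin dependence mentioned above.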

As @heropup suggests, a better way of finding the shape (hence mode) of the population distribution is to use a kernel density estimator (KDE).

The figure below shows a histogram of the 1000 observations. The actual population density for $\mathsf{Gamma}(5, 0.01)$ is shown as a black curve. (We are able to plot this curve because this is a simulation in which we know the exact population distribution.)

If the population distribution is unknown, then a KDE often does a better job of estimating the mode of the population. A KDE of the sample is shown as a dotted red line in the figure.

hist(x, prob=TRUE, breaks=20, col="skyblue2",
     ylim=c(0,.002), main="1000 Obs. from GAMMA(5, .01)")
curve(dgamma(x, 5, .01), add=TRUE)                  # true population density (black)
lines(density(x), col="red", lwd=2, lty="dotted")   # KDE of the sample (dotted red)

[Figure: histogram of the 1000 observations, with the Gamma(5, 0.01) density as a black curve and the KDE as a dotted red line]

Notes: (1) By differentiating this gamma density function, one can find that the mode of its distribution (unique maximum of its density function) is at $400.$
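For completeness, the differentiation can be sketched as follows, writing $\alpha = 5$ for the shape and $\lambda = 0.01$ for the rate (these symbols are my notation, not used above). For $x > 0$ the density is

$$f(x) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\lambda x}, \qquad \frac{d}{dx}\log f(x) = \frac{\alpha-1}{x} - \lambda .$$

Setting the derivative to zero gives

$$x = \frac{\alpha-1}{\lambda} = \frac{5-1}{0.01} = 400,$$

and since the second derivative of $\log f$ is $-(\alpha-1)/x^{2} < 0$ when $\alpha > 1$, this is the unique maximum.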

(2) The following R code shows that the maximum of the KDE is at $374.5.$

d = density(x)          # compute the KDE once
d$x[which.max(d$y)]     # grid point where the estimated density is highest
[1] 374.5053
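The location of the KDE maximum depends on the bandwidth, which is why the comment under the question speaks of a "suitable" bandwidth. As a rough sketch, one might rerun the computation with the `adjust` argument of `density()`, which multiplies the default bandwidth. The helper `kde_mode` is a hypothetical name of mine, and the multipliers 0.5, 1 and 2 are arbitrary choices for illustration.

```r
set.seed(2020)
x <- rgamma(1000, shape = 5, rate = 0.01)

# Location of the KDE maximum on the evaluation grid.
# 'adjust' scales the default bandwidth: values below 1 give a
# wigglier estimate, values above 1 a smoother one.
kde_mode <- function(x, adjust = 1) {
  d <- density(x, adjust = adjust)
  d$x[which.max(d$y)]
}

# Compare the estimated mode for three bandwidth multipliers.
kde_modes <- sapply(c(0.5, 1, 2), function(a) kde_mode(x, adjust = a))
kde_modes
```

If the estimates stay close to each other, the mode estimate is not very sensitive to the bandwidth for this sample; a large spread would suggest treating any single value with caution.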
BruceET