
I have a question about boxplots.
I'll state what I know first and then my question.

  • $x=\{x_1,x_2,\dots,x_n\}$: the set of samples
  • $q_1$,$q_3$: the first and third quartiles
  • $w_l$,$w_u$: the lower and upper whiskers
  • $IQR = q_3 - q_1$
  • box extends from $q_1$ to $q_3$
  • $w_l = \max(\min(x),\, q_1 - 1.5\cdot IQR)$
  • $w_u = \min(\max(x),\, q_3 + 1.5\cdot IQR)$
  • $\text{outliers} = \{ x_i \in x \; | \;\; x_i < w_l \vee x_i > w_u\}$

Observations:

  • $\text{the whiskers' distances from the box are not symmetric} \\ \iff (w_l = \min(x) \vee w_u = \max(x)) $
  • $w_u - q_3 < q_1-w_l \;\; \implies \nexists\, x_i : x_i \in \text{outliers} \wedge x_i > w_u$
  • $w_u - q_3 > q_1-w_l \;\; \implies \nexists\, x_i : x_i \in \text{outliers} \wedge x_i < w_l$

My question: if everything I stated above is correct, how do you explain the presence of outliers in this speed of light boxplot (third experiment, lower outliers) and in this plot (see Wednesday, lower outliers)? In both, the whisker on the side with the outliers is the shorter one.
If my reasoning is wrong, please provide a simple numeric counterexample.

HAL9000
  • I see it now, you mean how is it possible that there are simultaneously outliers both above and below. And, OK, but in that case the whiskers are determined by q3 + 1.5 IQR and q1 - 1.5 IQR, so how is it possible that they are not symmetric? That is your objection, isn't it? – Jimmy R. Feb 28 '14 at 19:25
  • @Stefanos Yes, that is exactly my question. – HAL9000 Feb 28 '14 at 19:26
  • Ok, I see. Interesting observation. In general they do not have to be symmetrical (you know this, as I see) but in that special case (outliers in both directions) they should! Sorry, I have never seen that, I will think it over! Interesting observation +1 – Jimmy R. Feb 28 '14 at 19:28
  • @Stefanos The only case where I've seen it is on a log scale, which is not the case in the examples I linked. I simply think they are wrong, or possibly they used a method different from the one described on Wikipedia. – HAL9000 Feb 28 '14 at 19:30

2 Answers

1

Consider the data $$\{0,4,5,5,5,6,6,6,6,7,20\}.$$ The median is $6$, the first quartile is $5$, and the third quartile is $6$. So the IQR is $1$, the limits of the 1.5 IQR rule are $5 - 1.5 = 3.5$ and $6 + 1.5 = 7.5$, and it easily follows that $0$ is a lower outlier and $20$ is an upper outlier. What you need to take into account is that the box shows you where the middle 50% of the data lies, so if the box is particularly narrow, then the IQR is small, and any values outside the range determined by the 1.5 IQR rule are outliers. There can be many outliers, or none at all.
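The arithmetic above can be checked with a short Python sketch. It assumes quartiles are taken as the medians of the lower and upper halves of the sorted data (leaving out the median itself when the sample size is odd), which reproduces the values quoted in this answer; other quartile conventions can give different numbers. The helper names `quartiles` and `fences` are made up for illustration.

```python
from statistics import median

def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves.

    Assumption: when n is odd, the median itself is left out of both halves.
    """
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]
    upper = s[half + 1:] if len(s) % 2 else s[half:]
    return median(lower), median(upper)

def fences(values, k=1.5):
    """Limits of the k * IQR rule (these are not the whiskers themselves)."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = [0, 4, 5, 5, 5, 6, 6, 6, 6, 7, 20]
q1, q3 = quartiles(data)
low, high = fences(data)
outliers = [v for v in data if v < low or v > high]
print(q1, q3, low, high, outliers)  # 5 6 3.5 7.5 [0, 20]
```

Both $0$ and $20$ fall outside $[3.5, 7.5]$, so the sample has outliers on both sides.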

heropup
  • Thank you for the answer, but you are missing something: in your example the whiskers have the same length. My question is why, in the plots I linked, the whiskers have different lengths while both upper and lower outliers are present. – HAL9000 Feb 28 '14 at 19:55
  • It should not take much thought to realize that whiskers don't have to be the same length, either. Try $\{0, 1, 30, 31, 31, 32, 32, 32, 32, 32, 32, 33, 39, 39, 50, 500, 1000\}$. $Q_1 = 31$, $Q_2 = M = 32$, $Q_3 = 39$, $IQR = 8$, $1.5\, IQR = 12$, and it is easy to see that the whiskers are not the same length, there are multiple outliers, etc. – heropup Feb 28 '14 at 20:24
  • There is still something wrong, maybe in my definition of whiskers: if you calculate $w_l = q_1-1.5\cdot IQR$ you obtain $w_l = 31-12 = 19$. However, in MATLAB $w_l$ is $30$. So must $w_l$ always correspond to the value in the set nearest to $q_1-1.5\cdot IQR$? – HAL9000 Feb 28 '14 at 20:34
  • $19$ is not a data point. The lower whisker corresponds to the smallest data point that is still at or above $Q_1 - 1.5\, IQR$, not the value of $Q_1 - 1.5\, IQR$ itself. A similar idea applies to the upper whisker: it is the largest data point that is still at or below $Q_3 + 1.5\, IQR$. Your confusion arises because you think that the whiskers are the calculated limits of the 1.5 IQR rule, when they are not. – heropup Feb 28 '14 at 20:41
  • I have accepted your answer, thanks. – HAL9000 Feb 28 '14 at 20:46
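
As a minimal sketch of the convention described in the comments above, the following Python snippet places each whisker at the most extreme data point that still lies within 1.5 IQR of the box, using the second data set from this thread. It assumes quartiles are the medians of the lower and upper halves of the sorted data, which matches the quoted $Q_1 = 31$ and $Q_3 = 39$; plotting software may use a different quartile rule. The function name `tukey_whiskers` is made up for illustration.

```python
from statistics import median

def tukey_whiskers(values, k=1.5):
    """Return (q1, q3, w_l, w_u, outliers).

    Assumption: quartiles are medians of the lower and upper halves,
    excluding the median itself when n is odd.
    """
    s = sorted(values)
    half = len(s) // 2
    q1 = median(s[:half])
    q3 = median(s[half + 1:] if len(s) % 2 else s[half:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    # Whiskers are data points: the extremes among values inside the fences.
    inside = [v for v in s if low_fence <= v <= high_fence]
    w_l, w_u = inside[0], inside[-1]
    outliers = [v for v in s if v < low_fence or v > high_fence]
    return q1, q3, w_l, w_u, outliers

data = [0, 1, 30, 31, 31, 32, 32, 32, 32, 32, 32, 33, 39, 39, 50, 500, 1000]
q1, q3, w_l, w_u, outliers = tukey_whiskers(data)
print(q1 - w_l, w_u - q3, outliers)  # 1.0 11.0 [0, 1, 500, 1000]
```

The lower whisker sits at $30$ (length $1$) and the upper whisker at $50$ (length $11$), so the whiskers are asymmetric even though outliers exist on both sides.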
0

Ok I got the answer:

The definitions of $w_l$ and $w_u$ in my question were wrong. Referring to Wikipedia:

"whiskers can represent several possible alternative values" such as "the minimum and maximum of all of the data" or "the lowest datum still within 1.5 IQR of the lower quartile, and the highest datum still within 1.5 IQR of the upper quartile", or even "one standard deviation above and below the mean of the data" and finally "the 9th percentile and the 91st percentile" or "the 2nd percentile and the 98th percentile".

HAL9000