I want to plot a histogram of some timing data. The timing data, represented by a continuous variable t, is binned as follows:

t=0
0<t<=1
1<t<=2
2<t<=3
3<t<=4

I have frequency data for each bin. To plot this as a histogram, I understand that I ought to use frequency density; that is, the frequency divided by the bin width. But my first bin has zero width! How can one cope with this?

  • Why can't the first bin include $t = 0$? – K. Miller Oct 29 '16 at 14:06
  • @K.Miller $t$ represents some sort of delay. I'm particularly interested in showing explicitly the situations where there is no delay whatsoever. – John Wickerson Oct 29 '16 at 14:08
  • You could add a bin $[-1,0]$ with the understanding that since $t \geq 0$, it corresponds to observations at $t = 0$. – K. Miller Oct 29 '16 at 14:25
  • @K.Miller Mm I thought of that too and it's quite tempting. Still feels like a bit of a hack though! – John Wickerson Oct 29 '16 at 15:34
  • I find it curious that you can detect a delay of exactly zero. How do you know when it occurs? What does it even mean? – David K Oct 29 '16 at 16:52
  • @DavidK It's certainly a good question! It's because the data is obtained analytically rather than experimentally. – John Wickerson Oct 30 '16 at 21:02
  • If the data are all integers then set the bins to $(n-\frac12,n+\frac12]$ for integers $n$. If the other data really are spread randomly within each of the intervals $(0,1], (1,2], \ldots$, it sounds like a model of a mixed probability distribution (or something that works like such a distribution), in which case maybe a cumulative distribution function would be a better representation. Or hack the histogram as already suggested; histograms aren't really designed to do mixed distributions. – David K Oct 30 '16 at 21:16
  • @DavidK Thanks. The data is indeed real-valued within those intervals. I will look into a cumulative version. Feel free to upgrade your comment to an answer that I can accept. – John Wickerson Oct 30 '16 at 21:32

1 Answer

For data that are analytically derived, where some positive percentage of the data occur at a single exact value and others may be found throughout some interval(s) on the real line, a cumulative distribution function (CDF) is one way to clearly graph the data.

If this actually is a probability distribution of a random variable $X$, the CDF is given by $F(t) = P(X \leq t)$. For the situation described in the question, where only values $t \geq 0$ can occur, you would have $F(t) = 0$ for all $t < 0$ and $F(0) = P_0$, where $P_0$ is the fraction of data that fall at exactly $t = 0$, so the graph jumps from $0$ to $P_0$ at $t = 0$. For $t > 0$, $F(t)$ is increasing wherever the probability density at $t$ is positive and constant everywhere else.
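Written out, and assuming the part of the distribution on $t > 0$ has a density (called $f$ here; that symbol is my own notation, not something given in the question), the description above amounts to

$$
F(t) =
\begin{cases}
0, & t < 0, \\
P_0 + \displaystyle\int_0^t f(s)\,ds, & t \geq 0,
\end{cases}
\qquad \text{with } \int_0^\infty f(s)\,ds = 1 - P_0 .
$$

The point mass at $t = 0$ shows up as the jump of height $P_0$, which is exactly the feature a histogram has trouble displaying.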

This also works for data that are not random but that act like a probability distribution: in this example, a certain percentage at one exact value, a certain percentage distributed in the interval $(0,1]$, a certain percentage in the interval $(1,2]$, and so forth. If all you had available (or all you wanted to determine) were the frequencies for each of these bins and for the value $t=0$, you could interpolate a straight line segment from $(0,P_0)$ to $(1,P_0 + P_1)$, where $P_1$ is the fraction of data falling in the interval $(0,1]$, and continue in the same way through the remaining bins.
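As a rough sketch of that interpolation, the Python below builds the knots of a piecewise-linear CDF from a count at $t = 0$ and counts for unit-width bins, then evaluates it at any $t$. The function names `cdf_knots` and `cdf` and the example counts are made up for illustration, and the straight-line interpolation assumes the data are spread evenly within each bin.

```python
from bisect import bisect_right
from itertools import accumulate


def cdf_knots(count_at_zero, bin_counts):
    """Return (ts, Fs), the knots of a piecewise-linear CDF.

    count_at_zero: number of observations with t == 0 exactly.
    bin_counts: counts for the unit-width bins (0,1], (1,2], ...
    """
    total = count_at_zero + sum(bin_counts)
    if total <= 0:
        raise ValueError("need at least one observation")
    cumulative = list(accumulate([count_at_zero, *bin_counts]))
    ts = list(range(len(cumulative)))     # knot positions 0, 1, 2, ...
    Fs = [c / total for c in cumulative]  # P_0, P_0 + P_1, ...
    return ts, Fs


def cdf(t, ts, Fs):
    """Evaluate the interpolated CDF at t."""
    if t < ts[0]:
        return 0.0  # nothing can occur below t = 0
    if t >= ts[-1]:
        return 1.0  # all data lie at or below the last bin edge
    i = bisect_right(ts, t) - 1  # index of the knot at or just left of t
    frac = (t - ts[i]) / (ts[i + 1] - ts[i])
    return Fs[i] + frac * (Fs[i + 1] - Fs[i])


# Hypothetical counts: 20 observations at t = 0, then four unit bins.
ts, Fs = cdf_knots(20, [30, 25, 15, 10])
```

With these made-up counts, `cdf(0, ts, Fs)` gives $P_0 = 0.2$ and `cdf(1, ts, Fs)` gives $P_0 + P_1 = 0.5$. To draw the graph, plot a flat segment at height $0$ for $t < 0$ followed by the line through the points `(ts, Fs)`; the vertical gap at $t = 0$ is the zero-delay fraction.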

David K