
I am beginning to study statistics and probability and am trying to understand what a probability density function (PDF) is and what it is used for.

My current interpretation is:

The name "function" suggests something that returns an output that depends on the input I give it. Take, for example, the PDF of the standard normal distribution:

$$ p(x) = \mathcal{N}(x;0,1) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2} $$

In my mind, the above equation describes the probability/likelihood that a continuous random variable $x$ takes on a value in its sample space (i.e. the set of all possible values).

So let's say this PDF (a normal distribution) describes the time taken for men to run a marathon (the real average is about 4 hours). If I plotted this PDF, the $y$-axis would show non-negligible values for marathon times from around 2 hours (on the extreme left) to 6 hours (on the extreme right), with the mean centered at 4 hours.

If I programmed the PDF equation above into a computer and then ran a script that requested an input $x$, I could provide any real-valued input from $-\infty$ to $+\infty$, and the output of the PDF equation would give me the probability that a man would finish the race in that time?

Why is this useful? If I'm standing at the start line before the race begins and a competitor walks over and bets me $20 that he can finish the race in exactly 3 hours, and I know nothing else about him (his training regime, etc.), I can quickly take out my phone, run the script, and enter the value 3 hours, and the output can be interpreted as the probability that the man will finish the race in exactly 3 hours? If I fancy the odds, I might decide it is a good idea to accept his bet.

Questions related to my current understanding are as follows:

(1) Is the above interpretation correct, wholly or partially?

(1.1) If partially, where exactly am I getting my wires crossed?

(2) Bonus Question: How would you link an understanding of standard deviation and/or variance into this example?

MarkMark
  • 171

2 Answers

2

A couple of things that you may find useful

Continuous vs. Discrete

The distribution you use as an example (the normal distribution) is a continuous distribution, in the sense that the set of values the random variable can take is uncountable. Other examples are the $\beta$-distribution, the logit distribution, $\dots$ Here's a comprehensive list of continuous distributions. The key point about these distributions is that the probability that the variable takes any particular value is exactly zero. What is meaningful instead is the probability of getting a value in some measurable set. In your example, this would mean telling the script to calculate the probability of finishing in a time $t$ between two given numbers $t_1$ and $t_2$, with $t_1<t_2$, by integrating the density $f(t)$:

$$ P(t_1 < t < t_2) = \int_{t_1}^{t_2}{\rm d}t~f(t) $$

This is in contrast with discrete distributions, where the possible values the variable can take are countable. Typical examples are the result of rolling a die or flipping a coin. Here you can find other examples.
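
The interval probability for the continuous case can be sketched numerically. The snippet below is only an illustration: it assumes marathon times follow a normal distribution with mean 4 h and standard deviation 1 h (the standard deviation is an assumption, not something given in the question), and the helper names `normal_pdf` and `interval_probability` are made up for this example.

```python
import math

def normal_pdf(t, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (t - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def interval_probability(t1, t2, mu, sigma, steps=10_000):
    """Approximate P(t1 < t < t2) with a midpoint Riemann sum of the density."""
    width = (t2 - t1) / steps
    total = sum(normal_pdf(t1 + (i + 0.5) * width, mu, sigma) for i in range(steps))
    return total * width

MU, SIGMA = 4.0, 1.0  # assumed mean and standard deviation, in hours

density_at_3 = normal_pdf(3.0, MU, SIGMA)             # a density value, not a probability
p_window = interval_probability(2.9, 3.1, MU, SIGMA)  # finishing within 6 minutes of 3 h
p_point = interval_probability(3.0, 3.0, MU, SIGMA)   # zero-width interval

print(f"density at 3 h:       {density_at_3:.4f}")
print(f"P(2.9 < t < 3.1):     {p_window:.4f}")
print(f"P(t is exactly 3 h):  {p_point:.4f}")
```

The value returned by `normal_pdf` is a density (probability per hour), so it only becomes a probability once it is integrated over an interval; shrinking the interval to a single point gives zero.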

Why is this useful?

The list of cases where knowing the probability distribution of a random variable is useful is rather long, and each field has its own applications. I can give you a couple of examples that some people may consider useful.

Imagine you want to invest in the stock market. Stock prices fluctuate, and you're not sure whether the stock will lose value (losing you money) or go up. If you knew the probability distribution of the stock's price at a given time, you could ask and answer the question "what is the probability of losing a fraction $x$ of my investment?"

There are many other very interesting applications like this! Here's another one: quantum mechanics is, in essence, a theory that describes the statistical nature of subatomic entities. There, knowing the probability distribution associated with a given physical system is knowing how the system behaves.

Meaning of variance

In your example of the racers, imagine two situations. In both of them, competitors cross the finish line after 4 h on average:

  1. 99% of the racers cross the line between 3:50 h and 4:10 h

  2. 99% of the racers cross the line between 2 h and 6 h

This tells you something about these two distributions: clearly they are different. In the second case, you need a longer interval to account for the same fraction of racers, so in a sense that distribution is broader, i.e. it has a larger variance than the first one.
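
To put rough numbers on the two situations, one can assume (and this is an extra assumption, not part of the example itself) that both distributions are normal with mean 4 h. The 99% interval then fixes the standard deviation. The helper `sigma_from_central_interval` below is a hypothetical name used only for this sketch.

```python
from statistics import NormalDist

def sigma_from_central_interval(half_width, coverage=0.99):
    """Standard deviation of a normal distribution whose central `coverage`
    fraction lies within +/- half_width of the mean (normality is assumed)."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)  # about 2.576 for 99%
    return half_width / z

# Situation 1: 99% finish between 3:50 h and 4:10 h (half-width of 10 minutes).
sigma_narrow = sigma_from_central_interval(10.0 / 60.0)
# Situation 2: 99% finish between 2 h and 6 h (half-width of 2 hours).
sigma_wide = sigma_from_central_interval(2.0)

print(f"situation 1: sigma = {sigma_narrow:.3f} h, variance = {sigma_narrow**2:.4f} h^2")
print(f"situation 2: sigma = {sigma_wide:.3f} h, variance = {sigma_wide**2:.4f} h^2")
print(f"ratio of variances: {(sigma_wide / sigma_narrow) ** 2:.0f}")
```

Under this assumption the second interval is 12 times wider, so its standard deviation is 12 times larger and its variance 144 times larger, even though both distributions share the same mean.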

caverac
  • 19,345
  • 1
    Thanks, so there are many, many distributions for different use cases. But the PDFs describing each distribution simply return probabilities associated with events (whether those events are discrete or continuous). However as far as continuous probability distributions are concerned, the probability of getting a specific value is 0, ergo we need to take the integral between two points in the domain in order to get the probability that the variable of interest will be above/or below some value. – MarkMark Sep 27 '17 at 11:29
  • 1
    @MarkMark That's pretty much it – caverac Sep 27 '17 at 11:31
  • Thanks @caverac your explanation cleared that up a lot for me, I appreciate the help. – MarkMark Sep 27 '17 at 11:32
  • @MarkMark BTW I completely overlooked the last part of your question before. I edited the answer to include it now – caverac Sep 27 '17 at 11:38
  • That explanation of variance has been extremely helpful. I have been reading about it all morning but couldn't really get a grasp on it until it was explained in these terms. So basically a distribution's variance is a metric which describes the spread of values around the mean (the distance of data points from the mean in the distribution of interest). From an analytical standpoint, I can decrease the distribution's variance by increasing the number of samples taken, by standing at the finishing line of many, many marathons with an infinitely precise stopwatch and recording the race times of every man. – MarkMark Sep 27 '17 at 11:53
  • @MarkMark The first conclusion is right: it just measures spread around the mean. But this is intrinsic to the distribution; it does not depend on how many samples you get from it. What will definitely change is how well you estimate it. Say, for example, that the variance of the distribution of arrival times is $\sigma^2$, with $\sigma = 1$ h. If you measure just two samples and try to determine the spread, you will probably get a number that is very different from $\sigma$, but if you wait for many racers to finish, you will start to see that the spread is $1$ h. – caverac Sep 27 '17 at 12:03
  • @caverac Thanks, I see now how it is intrinsic to the distribution, since my observation of samples from the underlying distribution does not affect the distribution in any way. If the actual distribution is unknown to me before I begin, I can only begin to develop a sense of the underlying distribution by collecting as many data points as possible; as N (the number of samples I take) increases, the recorded variance will approach the variance of the underlying distribution. Thanks again for your help, I would never have understood this without it. – MarkMark Sep 27 '17 at 12:12
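
The point reached in the comments above, that the spread belongs to the distribution while only its estimate depends on the number of samples, can be illustrated with a small simulation. This is a rough sketch assuming normally distributed arrival times with mean 4 h and $\sigma = 1$ h (the value used in the comment); the function name `sample_std` and the fixed seed are choices made only for this example.

```python
import random
import statistics

MU, SIGMA = 4.0, 1.0  # assumed true mean and standard deviation, in hours

def sample_std(n, rng):
    """Draw n simulated finishing times and return their sample standard deviation."""
    times = [rng.gauss(MU, SIGMA) for _ in range(n)]
    return statistics.stdev(times)

rng = random.Random(0)  # fixed seed so repeated runs give the same numbers

# The true sigma never changes; only the quality of the estimate does.
for n in (2, 10, 100, 10_000):
    print(f"n = {n:>6}: estimated sigma = {sample_std(n, rng):.3f} h")
```

With only a handful of samples the estimate can land far from 1 h, while for large `n` it settles close to the true value.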
1

The probability is given by integrating the density function over a range, so the probability of finishing in exactly 3 hours is zero. This of course assumes a stopwatch with infinite precision. The cumulative distribution function, which is the integral of the density from $-\infty$ to $x$, is often more useful, as you would use it to find the probability of finishing in 3 hours or less.
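
As a minimal sketch of the cumulative function in practice, Python's standard library provides `statistics.NormalDist` (Python 3.8+). The mean of 4 h comes from the question; the standard deviation of 1 h is an assumption made purely for illustration.

```python
from statistics import NormalDist

# Assumed model of finishing times: mean 4 h, standard deviation 1 h.
finish_time = NormalDist(mu=4.0, sigma=1.0)

# Probability of finishing in 3 hours or less: the integral of the
# density from -infinity to 3, i.e. the cumulative function at 3.
p_three_or_less = finish_time.cdf(3.0)

# Probability of finishing between 2.9 h and 3.1 h: a difference of two
# cumulative values, which equals the integral of the density over that range.
p_near_three = finish_time.cdf(3.1) - finish_time.cdf(2.9)

print(f"P(t <= 3 h)        = {p_three_or_less:.4f}")
print(f"P(2.9 < t < 3.1 h) = {p_near_three:.4f}")
```

Under these assumed parameters the first value is about 0.159, so roughly one runner in six would finish in 3 hours or less.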

user121049
  • 1,599
  • 1
    Thanks for your response. What I get from this is: it is impossible to assign a probability to a specific value in the domain. However, the PDF itself does give probabilities for each value in the domain. In order to win my $20, I will need to take the integral from -∞ to 3 hours to assess the probability that the competitor will finish the marathon in <= 3 hours. – MarkMark Sep 27 '17 at 11:08