
I am analyzing data with a geometric distribution. Using maximum likelihood estimation, I can estimate $p$ to be $\displaystyle \hat p_{MLE} = \frac{N}{\sum_{i=1}^N x_i}$, where $N$ is the number of datapoints and each $x_i$ is the number of trials needed for the first success in the $i$-th separate experiment.
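
For reference, the estimator can be recovered as follows. This is only a sketch, assuming the $x_i$ are independent and each counts the trials up to and including the first success; $L$ and $\ell$ (likelihood and log-likelihood) are notation introduced here, not taken from the post.

$$
\begin{aligned}
L(p) &= \prod_{i=1}^N (1-p)^{x_i-1}\,p,\\
\ell(p) &= N\log p + \Big(\sum_{i=1}^N x_i - N\Big)\log(1-p),\\
\frac{d\ell}{dp} &= \frac{N}{p} - \frac{\sum_{i=1}^N x_i - N}{1-p} = 0
\;\Longrightarrow\; \hat p_{MLE} = \frac{N}{\sum_{i=1}^N x_i}.
\end{aligned}
$$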

However, it is not clear to me how 'accurate' this value is, and thus I'd like to construct a $95\,\%$ confidence interval. But my searches have been rather unsuccessful, as I can't find any worked out versions of a suitable confidence interval for this particular distribution. I'm pretty sure there has to be something out there, and I would be very thankful if someone could guide me towards it!

1 Answer


In frequentist inference, the proper confidence interval depends on what you consider your sample space: if you were to repeat your experiment or observational scheme, what would remain constant and what would be allowed to change? Specifically, did you always intend to observe $N$ geometric RVs, or was that just when the experiment ended? Are you always observing the same number of trials? If neither is controlled, then it's hard to specify what type of process you've observed and how an interval would behave under repeated sampling.

Here are a few suggestions:

  1. Turn your observed geometric variables into an equivalent Bernoulli series and apply binomial inference procedures to it. This assumes you had no pre-specified number of successes you needed to achieve (i.e. the sample size was not specified in advance).
  2. Apply negative binomial confidence procedures to it, with a known number of successes, i.e. $r=N$, to get an interval for $p$ given $N$. This assumes you pre-specified the sample size.
  3. Perform bootstrap resampling on your observed data vector, each time recalculating the estimate for $p$, and look at the distribution of the resampled estimates relative to the estimate from the original sample. Look up bootstrap confidence intervals (percentile and BCa methods).
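
Suggestion 3 could be sketched roughly as below. This is a minimal illustration of the percentile method only, not a definitive implementation: the helper names `geometric_mle` and `bootstrap_percentile_ci`, the number of replicates, and the simulated data with true $p = 0.3$ are all assumptions made for the example.

```python
import numpy as np

def geometric_mle(x):
    """MLE of p when each x counts trials up to and including the first success."""
    x = np.asarray(x)
    return x.size / x.sum()

def bootstrap_percentile_ci(x, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for p (hypothetical helper for illustration)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n = x.size
    p_star = np.empty(n_boot)
    for b in range(n_boot):
        # Resample the geometric observations with replacement, then re-estimate p.
        sample = rng.choice(x, size=n, replace=True)
        p_star[b] = n / sample.sum()
    lo, hi = np.quantile(p_star, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Simulated data purely for illustration; the true p = 0.3 is an assumption.
rng = np.random.default_rng(42)
x = rng.geometric(0.3, size=2000)

p_hat = geometric_mle(x)
lo, hi = bootstrap_percentile_ci(x)
print(f"p_hat = {p_hat:.4f}, 95% percentile interval = ({lo:.4f}, {hi:.4f})")
```

More replicates reduce the simulation error in the interval endpoints, at the cost of computation time.
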
  • Thank you for your elaborate answer. I'll dive into all of this today, and see if I can get anywhere from that. – user3183724 Jan 23 '14 at 09:10
  • I was too late to edit my previous comment, but I'm not sure I understand the distinction between N geometric RVs and the number of trials. Isn't the outcome of each trial a random variable? In any case, I do not know beforehand the number of trials I will end up with, no. So I'll start looking at number 1 and 3! – user3183724 Jan 23 '14 at 09:15
  • Though, now that I think of it, I might not know the number of trials before obtaining the data, but I do know them before I start calculating what P is. So I'll look at 2 instead. – user3183724 Jan 23 '14 at 09:32
  • Alright, so about the negative binomial distribution. If I were to go with that one, I would need X samples of size N, would I not? If this is what you meant by pre-specified sample size, then this is indeed not the case, and I was mistaken. I have X samples of variable size N. So I'll go back to the other methods. – user3183724 Jan 23 '14 at 09:54
  • Alright, the bootstrapping seems to work. I'm still a little uncertain when it comes to the number of bootstrap samples used. Should I just put this as high as computation time permits? My dataset will be relatively large, on the order of ~60000 datapoints, so I'm assuming that is positive. Once I figure this out, I can close the question, and thank you very much for your time and patience! – user3183724 Jan 23 '14 at 10:11
  • Or would you say that the first method is more accurate, scientifically? It is very hard for me to assess how 'good' bootstrapping is, compared to for example the binomial inference procedure you mentioned. – user3183724 Jan 23 '14 at 10:38
  • @user3183724 For the non-bootstrapping approaches, when I say "trials" I do not mean the geometric RVs, but the string of 1's and 0's that corresponds to your observed RVs. Then, you either treat this as one large negative binomial sample or one large binomial sample...depending on which parameter you assume to control for future replications (time periods observed or number of transitions). Also, with 60,000 data points (is that time periods or geometric RVs?), you have a great basis for bootstrapping. Of course, use as many replications as possible to minimize simulation error :) –  Jan 23 '14 at 17:13
  • @user3183724 As an example for non-bootstrapping methods: if you have the geometric RVs {2,2,1} then the sample you will apply negative binomial or regular binomial methods to is {01011}...see what I mean...you are aggregating into the raw observed data string, then applying a method to the entire string. –  Jan 23 '14 at 17:14
  • Hm, I see. The parameter I control is the time periods, so that would be negative binomial? And the 60000 data points in this case was the number of transitions. That was simulated data; from the experiment I'm applying this to, I get 250,000 timepoints per dataset, and about 10 datasets. So it's a lot of data, which I suppose is good when you talk about bootstrapping. Then again, from what I could find on bootstrapping, it's more something you do when you don't have enough data, am I correct? – user3183724 Jan 23 '14 at 19:51
  • Also, I don't completely follow how {2,2,1} = {01011}. The way my data works is that it'll be like {1110011110110001000}. Here I'll only look at the time periods spent in '1', so then my corresponding array would be {3,4,2,1}. Does that make sense? – user3183724 Jan 23 '14 at 19:51
  • @user3183724 Since you are only focusing on P(S1 to S2), the time spent in S2 is irrelevant for your purposes. Therefore, when I say "reconstruct" the series, I only mean the series relevant to your problem, which would be {1110,11110,110,10}. This would correspond to geometric RVs {4,5,3,2} since you include the first success. The other 0's are irrelevant for determining "P". Since you are controlling for the number of time periods observed (this should be "time periods spent in S1"!! Not total time periods) you would use a binomial model on the equivalent Bernoulli dataset {00010000100101}. –  Jan 23 '14 at 20:05
  • @user3183724 You are correct that bootstrapping comes to the fore for small datasets that defy parameterization. With so many datapoints, your point estimate should be very, very accurate! The bootstrap is the most DIRECT way to get the CIs, but with so much data, you may get away with forming a normal confidence interval using the sample mean, variance, and sample size, i.e. $\bar x \;\dot\sim\; N\!\left(\frac{1}{p},\frac{1-p}{N p^2}\right)$, where the second argument is the variance of $\bar x$. Or use the Wilson score interval on your estimator for $p$, $\hat p$, with $n=\sum x_i$. –  Jan 23 '14 at 20:14
  • @user3183724 Finally, there may be a nomenclature issue between us...I am defining a geometric RV as the TOTAL number of trials needed for a success = $n_{failures}+1$, whereas you may be thinking of the alternate form that only counts failures. My estimator (from previous post) assumes the former. –  Jan 23 '14 at 20:17
  • No, luckily I am thinking of the same nomenclature, being the total number of trials needed for a success. I quickly went through the derivation of the MLE and I obtained the same expression as in your previous post, so that is good. In any case, I see what you mean with the binomial model, and I also understand the argument for a normal confidence interval. I'll think about it for a while, and decide what fits my purposes best. In any case, I'd like to thank you very much, for your extensive amount of help! It has been very valuable. – user3183724 Jan 23 '14 at 20:37
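
The non-bootstrap routes discussed in the comments could be sketched as follows. This is an illustrative sketch under stated assumptions rather than a definitive implementation: the function names are hypothetical, each geometric value is taken to count trials up to and including the first success, and the normal interval is built for the mean $1/p$ from the sample mean and variance and then inverted, which is only sensible for large $N$.

```python
import math

def to_bernoulli(geom):
    """Expand trials-to-first-success counts into a 0/1 string, e.g. [2, 2, 1] -> '01011'."""
    return "".join("0" * (k - 1) + "1" for k in geom)

def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z defaults to the 95% level)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

def normal_interval_for_p(geom, z=1.959964):
    """Normal interval for the mean 1/p, inverted to give an interval for p."""
    n = len(geom)
    mean = sum(geom) / n
    var = sum((k - mean) ** 2 for k in geom) / (n - 1)  # sample variance
    se = math.sqrt(var / n)
    # If 1/p lies in [mean - z*se, mean + z*se], p lies between the reciprocals.
    return 1 / (mean + z * se), 1 / (mean - z * se)

# The tiny example from the comments; real use would involve far more data.
geom = [4, 5, 3, 2]
bits = to_bernoulli(geom)                      # Bernoulli string for the binomial route
p_hat = bits.count("1") / len(bits)            # equals N / sum(x_i)
w_lo, w_hi = wilson_interval(bits.count("1"), len(bits))
n_lo, n_hi = normal_interval_for_p(geom)
print(bits, p_hat, (w_lo, w_hi), (n_lo, n_hi))
```

With only four observations both intervals are very wide; the point of the example is the mechanics, not the numbers.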