1

Suppose that you have $n$ observations, $y_1, y_2, ...y_n$ that are the result of the process $aX$ where $a$ is a real constant and $X$ is a random variable such that $X\sim\mathrm{Poisson}(-b/a)$. It is the case that $0>a>1$ and $b>0$. I am trying to use maximum likelihood to estimate $a$ and $b$. I can do it for $b$, and have made some progress for $a$, but need help.

To illustrate, here's a density plot for two simulated data sets. The blue one is $b=-\ln 0.9, a=0.1$ and the pink one is $b=-\ln 0.9, a=0.001$.

enter image description here

The approach, suggested by spaceisdarkgreen, is that we set $x_i = y_i/a$ at the beginning, so that the log-likelihood function is:

$$\ell(\theta) = \ln \big(\frac{(b/a)^{\sum y_i / a} e^{-nb/a}}{\prod (y_i / a)!} \big)$$

$$\ell(\theta) = \sum{(y_i / a)}\ln{b/a} - nb/a - \sum{\ln [(y_i / a)!]}$$

Take the partial derivative w.r.t $b$:

$$ \frac{\partial \ell}{\partial b} = \frac{\sum y_i}{ab} - n/a$$

Setting to zero and multiplying through by $a$ we get

$$ \hat b = \frac{\sum y_i}{n} $$

I can veryify by simulation that this is a good estimator for $b$.

With that in the bag, we try to do the same for a:

$$ \frac{\partial \ell}{\partial a} = -\frac{\sum{y_i}\log{b/a}}{a^2} - \frac{\sum y_i}{a^2} + \frac{nb}{a^2} - \sum \frac{d}{ds}\log [(y_i / a)!]$$

At this point I replaced the factorial with the gamma function. I learned that the logarithmic derivative of the gamma function has its own name, the digamma function, denoted by $\psi()$, so the last term becomes $ -\frac{y_i}{a^2} \psi(y_i/a) $

Multiplying through by $a^2$, setting $b = \sum y_i / n$ which we already know, cancelling terms and distributing the log we get:

$$ \frac{\partial l}{\partial a} = -\sum{y_i}\log{\sum y_i} + \sum{y_i}\log{na} + \sum y_i \psi(y_i/a)$$

Set it equal to zero and we have

$$ \sum{y_i}\log{\sum y_i} =\sum{y_i}\log{na} + \sum y_i \psi(y_i/a)$$

I can check that this equality holds, which is does for relatively small values of $a$ and relatively large values of $b$:

n = 10000; a = 0.02; b = 0.1
x = rpois(n, b/a)
y = a*x

# Left side
sum(y)*log(sum(y)) # Evaluates to 6958.385 on my test run

# Right side
# There are a few NaNs in the digamma(y/s)...these get multiplied by 0 
# anyway so it actually doesn't matter what they get set to
dig = digamma(y/a)
dig[is.na(dig)] <- 0
sum(y) * log(n*a) + sum(y*dig) # Evaluates to 6958.799 on my test run

That's very good; but if you flip the values of $a$ and $b$ this equation is much further from holding.

My question (which has by now undergone extensive editing!) is how to proceed from here. Does what I've done look justifiable/right? Is there a way to use the last equation to solve analytically or numerically for $a$? Why does it only seem to work well for a range of values? Thanks!

Ben S.
  • 505
  • Since you don't actually observe $X_i,$ but only $aX_i,$ how do you know what function of $a$ and $b$ you're trying to maximize? – Michael Hardy Feb 28 '18 at 23:09
  • Not sure I totally understand your question? I'm supposing that for a given series of observations there's an $a$ and a $b$ that are most likely. It's definitely true for $b$ because I can see that the ML works in that case. By simulation, varying $a$ while keeping $b$ constant sure changes the shape of the distribution, so it at least seems like at first glance it should be doable? – Ben S. Feb 28 '18 at 23:22
  • @BenS. I think Michael's point is how are you even determining what $\sum_i x_i$ is if you don't know $a$? You don't observe the $x_i$ directly. – spaceisdarkgreen Mar 01 '18 at 00:26
  • If $Y_i = aX_i$ then we have for an observation of $Y_i$, $$ P(Y_i = y_i) = 1(\mbox{$y_i/a$ is a natural number})e^{-b/a} \frac{(b/a)^{y_i/a}}{(y_i/a)!}$$ – spaceisdarkgreen Mar 01 '18 at 00:35
  • @spaceisdarkgreen Oh I see, thanks for explaining. Right, we don't know $\sum_i x_i$ so it doesn't make sense to use it. Your second comment, is that a suggestion for how to proceed? – Ben S. Mar 01 '18 at 02:15
  • @BenS. It was more intended as a correction to what you had for the likelihood. I don't actually know how to proceed all the way to the finish line, as it seems a strange problem. The MLE will occur at one of the discrete values of $a$ that make all of the $y_i/a$ natural numbers. You have a largest value for $a^$ that will do that and then $a^/n$ for $n=2,3,4$ etc will also. One of these others could be the true value of $a$ if you, say happened to have a sample where the $x_i's$ were all even. For each $a^*/n$ you can maximize over $b.$ – spaceisdarkgreen Mar 01 '18 at 02:55
  • @BenS. You would see the optimum likelihood decrease eventually as you increase $n$ (probably very soon) and the biggest likelihood at that point will be the max. The problem is there's no practical way I can think of to find $a^*.$ In theory, you scan upward through $a$ until you hit something where the $y_i/a$ have no common divisor. – spaceisdarkgreen Mar 01 '18 at 02:58
  • @BenS. It also seems like we have an identification problem: Say your sample is all zeros. Then your likelihood reduces to $e^{-b/a}$ which is maximized for $b=0$ and $a$ anything, or $a\to\infty$ and $b$ anything. In other words, the zeros could be because $a$ is very large or $b$ is very small. – spaceisdarkgreen Mar 01 '18 at 03:10
  • @spaceisdarkgreen, I see, thank you. All good points. In practice, I don't think relying on something like no common divisor would end up working because there would be some noise in the observations. I actually have a pretty good idea of the range of the values that $a$ could fall into (about -0.001 to -0.1). Given that $b$ can be estimated from the data, is there some accepted way to simply try a range of $a$'s, generate the predicted distribution for each one, calculate some sort of fit-to-the-data score for each trial value of $a$, and pick the best one? – Ben S. Mar 01 '18 at 04:13
  • @spaceisdarkgreen I have used your suggestion and I think I made a bit of progress. Question has been fairly heavily edited. – Ben S. Mar 01 '18 at 07:04
  • @BenS Your work for the estimator for $b$ is right. As for replacing with the Gamma function, remember that $n!=\Gamma(n+1)...$ I don't see any plus ones in your work. But I don't think you will get anywhere this way, anyway. When you set $b$ to its uniformly optimum value you found, with very high probability, the likelihood increases as $a$ gets large and has a maximum at $a=\infty .$ The integer quantization is crucial, and you can't just forget about it. – spaceisdarkgreen Mar 01 '18 at 18:56
  • @BenS But, moving away from likelihood, consider that $Var(Y)=ab.$ So you can estimate $a$ based on the variance. (However, you said your data "has noise" i.e. your true data generating distribution is not the one you specified, so the variance of the data might be higher that the true variance of $Y.$ Just a caveat.) – spaceisdarkgreen Mar 01 '18 at 19:12
  • @spaceisdarkgreen Wow, that is a very simple thing that I missed. Yes, $var(Y) = var(aX) = a^2 Var(b/a) = ab$. Thanks for your help. If you want to submit that as an answer I'll accept it. – Ben S. Mar 01 '18 at 23:58

1 Answers1

1

Per our discussion in the comments, MLE does not seem like a pragmatic choice for estimating these parameters. Because it works using the whole probability distribution, the MLE must hone in on the fact that there are a discrete set of $a$'s for which the data $\{y_i\}$ has a nonzero probability, i.e. $a$'s such that $y_i/a\in\mathbb N$ for all $i .$ But it is hard to use this information in practice, especially if there's any measurement error or misspecification. The fact that MLEs can be very tailored to the probability distribution assumed tends to make them less robust even as it improves their theoretical performance assuming the model is true.

On the other hand, method of moments might work here. We have $E(Y)=b$ and $Var(Y) =ab,$ so we can estimate $\hat b$ as the sample mean and then $\hat a$ as the sample variance divided by the sample mean. The theoretical performance is poor: it will almost always give you a value of $\hat a$ that you can theoretically tell from the data is impossible (since $y_i/\hat a$ are not all integers). But this might not be important from a practical standpoint, given that the model isn't perfect and you probably just want a distribution that fits decently.

But notice this isn't a one-size-fits-all thing. In the limit where $b/a$ is small so only a handful of values for $X$ are common, your mean and variance will be fairly noisy, but your data should be pretty clearly clustered, so estimating $a$ using the clustering would probably be better than the moments.