0

If I have a data set of x,y values that I believe resemble a cumulative gaussian distribution, is there an easy way for me to programmatically derive the mean/stddev of the CDF?

For example, imagine I have the following x/y value pairs:

1.380211242 0.06984097
1.62324929  0.09854223
1.84509804  0.15019463
2.133538908 0.227332755
2.423245874 0.352771719
2.73239376  0.481883449

I believe this roughly corresponds to a CDF gaussian function with mean=2.8 and stddev=0.9, or at least that roughly matches when I plot it:

data set and model plot

I'd like to programmatically derive mean/stddev from the x/y values (instead of trial and error in excel) - is that possible?

1 Answers1

1

What I would suggest is to look at square differences between your given $y_i$s and the value implied by a normal distribution at the corresponding $x_i$ value.

Lets call your pairs $(x_i,y_i)_{i=1}^N$, so for example $(x_1,y_1) = (1.380211242, 0.06984097)$. I'll also denote by $\Phi$ the cdf of a standard normal distribution, while $F_{\mu,\sigma}$ is the cdf of a normal distribution with parameters $\mu$ and $\sigma$. Then we have the relationship $F_{\mu,\sigma}(x) = \Phi(\frac{x - \mu}{\sigma})$. Now I suggest to minimize the sum of square differences, i.e. define \begin{align} &D(\mu,\sigma) = \sum_{i=1}^N (y_i - F_{\mu,\sigma}(x_i))^2, \mbox{ and}\\ &\min_{\mu,\sigma} D(\mu,\sigma) \end{align}

The minimization is carried out over the two parameters $\mu \in \mathbb{R}$ and $\sigma > 0$, and the optimal parameters (i.e the parameters that give the lowest value) will be your best (in the least square sense) estimates.

If you want to go there you can calculate the derivative of $D$ with respect to $\mu$ and $\sigma$, see https://stats.stackexchange.com/questions/154133/how-to-get-the-derivative-of-a-normal-distribution-w-r-t-its-parameters


Edit: running your values through the R-script below with your starting values of $2.8$ and $0.9$ gives me $\mu = 2.7750658, \sigma = 0.8981213$.


XX = c( 1.380211242, 1.62324929, 1.84509804, 2.133538908, 2.423245874, 2.73239376)

YY = c( 0.06984097, 0.09854223, 0.15019463, 0.227332755, 0.352771719, 0.481883449)

squareDiff <- function(param,dat){

x = dat\$x

y = dat\$y

m=param[1]

s=param[2]

D = sum((y - pnorm(x,mean=m,sd=s))^2)

return(D)

}

data = list(x=XX,y=YY)

result = optim(par=c(2.8,0.9), squareDiff, dat=data)

print(result)

user3456032
  • 1,456
  • Thanks! Just wondering, is there a way to optimize this? Your solution would indeed solve the problem, but requires many iterations, I was hoping there would be some closed-form expression. – Ronald S. Bultje Dec 02 '16 at 00:58
  • In terms of close form expressions I think you are out of luck - you can always try to compute the gradient of $D$ to find a stationary point but... Once you embrace a numerical solution you can use different optimizers (for example gradient based such as Newton or BFGS) that possibly work better. – user3456032 Dec 02 '16 at 01:17