I am trying to improve my implementation of the maximum likelihood (ML) estimator for the multiple squared correlation (https://onlinelibrary.wiley.com/doi/10.1111/j.1467-842X.1985.tb00559.x).
The ML estimator takes as input the normal $R^2$ value, the sample size $N$, and the number of predictors $p$. It returns an estimate for the multiple squared correlation ($\hat{\rho}^2$) by finding $\rho^2$ that maximizes the likelihood of the normal $R^2$ value. In (https://onlinelibrary.wiley.com/doi/10.1111/j.1467-842X.1985.tb00559.x), it is shown that instead of the likelihood, one can maximize the effective likelihood
$$L\left(p,N,R^2;\rho^2\right)=\left(1-\rho^2\right)^{N/2}{}_2F_1\left(\frac{1}{2}N,\frac{1}{2}N,\frac{1}{2}p;\rho^2R^2\right)$$
where ${}_2F_1$ denotes the Gaussian hypergeometric function. The ML estimate is thus $argmax_{\rho^2}L(p,N,R^2;\rho^2)$, with $p$,$N$,$R^2$, all being fixed values.
My current R implementation is quite rudimentary and looks like this
effectiveLH <- function(Rsquared, N, p, rhoSquared) {
-1 * (1 - rhoSquared)^(N / 2) * gsl::hyperg_2F1(0.5 * N, 0.5 * N, 0.5 * p, rhoSquared * Rsquared)
}
mlEstimator <- function(Rsquared, N, p, llfunction = effectiveLH) {
if (Rsquared <= p / N) {
return(0)
} else {
likelihoodFunction <- purrr::partial(llfunction, Rsquared = Rsquared, N = N, p = p)
res <- stats::optim(Rsquared, likelihoodFunction, method = "Brent", lower = 0, upper = 1)
return(res$par)
}
}
This works well in many cases. However, it fails if both $R^2$ and $N$ are big. For example, for $N=500, p=2, R^2=.9$, it returns $0.64$, which is wrong, since according to the Corollary on page 177 in (https://onlinelibrary.wiley.com/doi/10.1111/j.1467-842X.1985.tb00559.x) $\hat{\rho}^2 \in (0.89960,0.89963)$. The reason for this is that the output for gsl::hyperg_2F1 becomes very big as rhoSquared becomes bigger in this case, which leads to an overflow and thus gsl::hyperg_2F1 returning Inf. I am looking for a solution for this. Possible ideas:
- Find a transformation of the effective likelihood that avoids overflow.
- Instead of finding the maximum of the effective likelihood numerically, try to work out an analytical solution.
- Increase the precision of the floating-point operations.
- Switch the library for the calculation of the hypergeometric function.
- Simply run the numerical optimization with the relatively tight bounds from the Corollary. I would prefer not to fall back to this solution, but it might be good enough in practice.
Regarding 1, I tried Euler's transformation, but this does not seem to work. Regarding 2, I realized that the derivative of the hypergeometric function is a very similar hypergeometric function. Thus, the same problem should occur. 3. seems impossible without changing the gsl library, which is beyond my capabilities. For 4., I found the hypergeo package in R, which leads to similar problems. However, Mathematica does not seem to suffer from the overflow problems. I guess because it does computations symbolically. But Mathematica is, of course, not free and thus does not mix too well with R.
gslpackage. – Julian Karch Feb 17 '22 at 13:09