Testing whether forecasting models differ significantly in their forecasts

Question

I cannot find my question on here, which I find hard to believe unless I am simply on the wrong Stack Exchange board. If it has already been asked here, kindly direct me to it and I will be grateful. Otherwise: I have two models used to forecast the same data set. I want to find out whether my models are producing significantly different forecasts. How would I go about doing this?

Null: Models do not differ. Alt: Models do differ from one another.

Models are used to forecast demand.

This is too vague. Do the models yield binary results (as in "Rain" vs. "No Rain") or is there a range of possible forecasts, in which case, what function do you want to use to gauge how close two predictors are? Also, you have to consider reality. One expects (hopes?) that both models will be correct rather a lot, in which case you must expect that they agree rather a lot. — lulu, Jan 18 '21 at 18:39
The output is non-binary. They are used to predict demand for some product. I simply want to know whether the values that the models forecast differ significantly from each other. That is, does it matter which model I use, or is one actually better than the other? — Sven, Jan 18 '21 at 19:13
But "better" requires some quantification. You'll need to specify some function to tell you how close two models are to each other (or how close they are to reality). — lulu, Jan 18 '21 at 19:16
One useful thing that might be easy is to look at the cases when model $A$ is wrong. Do those cases predict that model $B$ will also be wrong, or are the errors of $A$ uncorrelated with those of $B$? That is very important when looking at things like tests for some disease. Two fairly inaccurate models might produce a better model in tandem, if you expect them to have uncorrelated errors. — lulu, Jan 18 '21 at 19:17
I should add; often times, in practice you can learn a lot just by plotting the three data series (Model $A$, model $B$, Reality). The graphs might suggest persistent patterns which you can then quantify and test for. — lulu, Jan 18 '21 at 19:19

BruceET · Accepted Answer · 2021-01-18T21:31:20.553

Comments are correct that this question is quite vague, somewhat less so in view of your response. But I will try to give an answer that shows various statistical procedures that may be useful.

Suppose you have 100 paired forecasts in vectors x1 and x2 (simulated with the R code in the Note at the end of this answer).

How well do they agree? We can look at descriptive statistics of their differences:

d = x2 - x1
summary(d); length(d);  sd(d)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.2626  1.3748  2.0057  1.9440  2.5832  5.0812 
[1] 100        # number of differences
[1] 0.9689753  # standard deviation
hist(d, prob=T, col="skyblue2")

So it appears that forecasts $X_2$ are mostly larger than forecasts $X_1.$ However, the two forecasts are highly correlated, so it may not make much difference which forecast is used.

cor(x1,x2)
[1] 0.9989661
plot(x1, x2);  abline(0, 1, col="green2")

Nevertheless, we know that the forecasts are not exactly the same, and we can do a formal test of whether the two differ significantly in a statistical sense. The histogram of differences looks roughly normal, so we use a paired t test. The tiny P-value, very near $0,$ shows that the two forecasts are significantly different. (Because the call below passes x1 first, it analyzes x1 - x2, so the signs are reversed relative to d above.)

t.test(x1, x2, paired=TRUE)

        Paired t-test

data:  x1 and x2
t = -20.063, df = 99, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.136304 -1.751773
sample estimates:
mean of the differences 
              -1.944039
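
The "roughly normal" judgment behind the paired t test can also be checked more directly. The following is only a sketch: it re-creates x1 and x2 from the simulation in the Note at the end, then draws a normal Q-Q plot of the differences and runs a Shapiro-Wilk test. With 100 pairs the t test is fairly robust to mild non-normality, so this is a sanity check rather than a requirement.

```r
# Re-create the simulated forecasts from the Note at the end of this answer
set.seed(121)
x1 <- rgamma(100, 5, .1)
x2 <- x1 + rnorm(100, 2, 1)
d  <- x2 - x1

# Normal Q-Q plot: points lying close to the line suggest approximate normality
qqnorm(d); qqline(d, col = "red")

# Shapiro-Wilk test: a small p-value would cast doubt on normality
sw <- shapiro.test(d)
sw$p.value
```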

A nonparametric Wilcoxon paired (signed rank) test also shows a highly significant difference with a P-value very near $0:$

wilcox.test(x1, x2, paired=TRUE)$p.value
[1] 6.407145e-18

However, statistical significance is not the same thing as practical importance. If it is not important that $X_2$ forecasts average about $2$ higher than $X_1$ forecasts, then, because the two never seem to be far apart, it may not make any practical difference which forecast is used.
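
One rough way to judge practical importance is to express each difference relative to the size of the forecast. This is a sketch only: the 10% tolerance `tol` is a hypothetical threshold chosen for illustration, and the right value depends entirely on the application.

```r
# Re-create the simulated forecasts from the Note at the end of this answer
set.seed(121)
x1 <- rgamma(100, 5, .1)
x2 <- x1 + rnorm(100, 2, 1)
d  <- x2 - x1

rel <- d / x1   # difference relative to forecast x1 (x1 is positive here)
tol <- 0.10     # HYPOTHETICAL tolerance: a 10% gap counts as "unimportant"

summary(100 * rel)                # relative differences, in percent
share_within <- mean(abs(rel) <= tol)
share_within                      # fraction of pairs within the tolerance
```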

By contrast, if a difference of $2$ is of practical importance, we should look at the record of past performance and use the forecasting method that has been more accurate.
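
If actual demand values are available, past performance can be compared along these lines. This is a sketch under stated assumptions: the vector `y` of actual demand is invented here purely for illustration (nothing like it appears in the question), and mean absolute error (MAE) and root mean squared error (RMSE) are just two common accuracy measures.

```r
# Re-create the simulated forecasts from the Note at the end of this answer
set.seed(121)
x1 <- rgamma(100, 5, .1)
x2 <- x1 + rnorm(100, 2, 1)

# HYPOTHETICAL actual demand, made up for this illustration only
y <- x1 + rnorm(100, 1, 3)

e1 <- y - x1   # forecast errors of model 1
e2 <- y - x2   # forecast errors of model 2

# Accuracy summary: smaller values mean forecasts closer to actual demand
acc <- rbind(
  x1 = c(MAE = mean(abs(e1)), RMSE = sqrt(mean(e1^2))),
  x2 = c(MAE = mean(abs(e2)), RMSE = sqrt(mean(e2^2)))
)
acc

# Paired t test on absolute errors: is one model's typical error larger?
loss_test <- t.test(abs(e1), abs(e2), paired = TRUE)
loss_test$p.value
```

For forecasts ordered in time, the errors are often autocorrelated, and a Diebold-Mariano test (for example, `dm.test` in the `forecast` package) may be more appropriate than a plain paired t test on the losses.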

Note: The following R code was used to simulate x1 and x2:

set.seed(121)
x1 = rgamma(100, 5, .1)
x2 = x1 + rnorm(100, 2, 1)
