## Wednesday, May 03, 2006

Via Steve McIntyre's blog (posting by John A.)

Dave Stockwell has created a script whose source is found here (mirror) and described here. If you click the image below, it opens in a separate window: you probably need to click because the image does not quite fit here. Every time you reload the image, the calculation starts from the beginning.

Although Dave offers an explanation, let me offer you mine, too.

The blue graph shows temperatures from 1856 to 1994 or so, measured by the CRU thermometers (the array is called "cru"). These real numbers are used to make predictions for 1995-2094 with substantial help from a random generator: the predicted temperatures for the period 1995-2094 depend on random numbers as well as on the CRU data from the past.

The eleven temperatures from the period 1995-2005 are known from the CRU data, but they are also predicted using the random forecasting algorithm. These eleven years are used to calculate the verification statistics - a kind of score that is used to evaluate how much you should believe the prediction: statistical skill.

How are the random predictions made?

Weighted random data

The temperatures predicted for the years 1995-2094 are calculated using the array called "fcser" that later becomes the second part of the array "graphValues"; the role of the "fcser" array is to emulate the temperature persistence. These "predictions" are "calculated" from the "series" as follows: the temperatures in the "temp" array are calculated as

• the CRU temperature from 1994 plus a random number between -0.5 and +0.5 plus "fcser" for the given year

where fcser for the given year is a weighted average of the values of "temp" from previous years (for the years up to 1994, the real CRU data are used): the weights, defined in the array "weight", are a particular decreasing function of the time delay. If you care, "weight[y]" for the delay of "y" years is recursively calculated by

• weight[1] = 1/2
• weight[y] = weight[y-1] * (y-1.5)/y
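To make the recipe above concrete, here is a minimal sketch in Python. It is not Dave's script. The function names are mine, and I assume that the weighted average is taken over deviations from the 1994 value, so that the forecast oscillates around that baseline; the original script may organize this differently.

```python
import random

def make_weights(n):
    """Return w with w[y] = weight for a delay of y years (w[0] is unused)."""
    w = [0.0] * (n + 1)
    w[1] = 0.5
    for y in range(2, n + 1):
        w[y] = w[y - 1] * (y - 1.5) / y
    return w

def random_forecast(history, n_years, rng=None):
    """Extend `history` by n_years of persistent random "predictions".

    Assumption: the weighted average acts on deviations from the last
    observed value (the 1994 temperature in the post).
    """
    rng = rng or random.Random()
    base = history[-1]
    dev = [t - base for t in history]
    w = make_weights(len(history) + n_years)
    for _ in range(n_years):
        t = len(dev)
        # fcser: weighted average of all earlier values, nearest years count most
        fcser = sum(w[y] * dev[t - y] for y in range(1, t + 1))
        dev.append(rng.uniform(-0.5, 0.5) + fcser)
    return [base + x for x in dev[len(history):]]

# Made-up stand-in for the "cru" array; these are NOT real CRU temperatures.
stand_in = [0.01 * i for i in range(20)]
forecast = random_forecast(stand_in, 100, random.Random(1))
```

Passing a seeded `random.Random` makes a run reproducible; without it, every call gives a new "prediction", much like reloading the image.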

For a long time delay, you see that "dw/dy = -1.5 w/y" which means that the weight goes like "y^{-1.5}", a power law. All the numerical constants are variables in the script that can be modified if you wish. The formula for the weights has the interesting feature that they automatically sum to one, in fact for a general value of "d":

• weight[1] = d
• weight[y] = weight[y-1] * [1 - (1+d)/y]

I leave you the proof as a homework exercise. The value of "d" leading to the most reasonable color of the noise is clearly related to the critical exponents encoding the temperature autocorrelation.
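If you prefer a numerical check to a proof, the following sketch sums the weights and estimates the power-law exponent. The helper names and the cutoffs (100,000 terms, delay 1,000) are my own choices, not something from the script. The partial sums approach one only slowly, especially for small "d", so expect a small remainder.

```python
import math

def weights_for(d, n):
    """Return [weight[1], ..., weight[n]] for the general recursion with parameter d."""
    w = [d]
    for y in range(2, n + 1):
        w.append(w[-1] * (1 - (1 + d) / y))
    return w

def local_exponent(w, y):
    """Estimate the log-log slope of the weights between delays y and 2y."""
    return math.log(w[2 * y - 1] / w[y - 1]) / math.log(2)

w = weights_for(0.5, 100_000)
print("partial sum of weights:", sum(w))                  # close to 1, from below
print("estimated exponent:", local_exponent(w, 1_000))    # close to -(1 + d) = -1.5
```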

Verification statistics

Two verification statistics are calculated to quantify the agreement between the observed CRU temperatures and the randomly predicted temperatures in the 1995-2005 interval:
• r2 - or "r squared"
• re - or "reduction of error"

Here, "r2" is the usual correlation coefficient squared - something that measures the correlation between eleven numbers "x_i" (CRU temperatures) and eleven numbers "y_i" (randomly predicted temperatures). The correlation coefficient is a number between -1 and 1 calculated as follows:

• [Sum(xy) - 11 Average(x) Average(y)] / sqrt[Variance(x) Variance(y)]

where "Variance(x) = Sum[(x_i-Average(x))^2]" and similarly for "y". This "r2" statistic is normally used to evaluate statistical skill, and you may see that it is extremely close to zero whenever you reload the picture: it is much smaller than one. This smallness tells you that the random predictions are (of course) statistically insignificant and that the prediction is not trustworthy. The "hockey stick graph" of the past temperatures gives you a tiny "r2", too.
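The "r2" formula above translates directly into a few lines of Python. This is a sketch with names of my own choosing, written for any number of points rather than just eleven; it is not taken from the script.

```python
import math

def r_squared(x, y):
    """Squared correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: Sum(xy) - n * Average(x) * Average(y)
    cov = sum(a * b for a, b in zip(x, y)) - n * mean_x * mean_y
    # "Variance" here is the plain sum of squared deviations, as in the text
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    r = cov / math.sqrt(var_x * var_y)
    return r * r
```

Note that the function fails with a division by zero if either sequence is constant, because the corresponding "Variance" vanishes.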

On the other hand, "re" is the reduction of error. You usually get high numbers around 0.5; the Mann-Bradley-Hughes reconstruction gives a rather high "re" verification statistic, too. Because "re" is high in this experiment even though the prediction is based on random data - i.e. on complete garbage - it shows that a high "re" can't be trusted. This "re" is calculated as follows:

• re = 1 - SumVariances/SumVariancesRef

where "SumVariances" is the sum of "(cru-predictedtemp)^2" over the eleven years while "SumVariancesRef" is the sum of "(cru-averagecru)^2" where "cru" are the actually measured temperatures in the eleven years of the verification period. In other words, "re" is a number that is at most 1, and negative when the prediction does worse than the reference. It tells you by how much your prediction is better than the simple "null hypothesis" that the temperature is constant over the 11-year period.
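Here is the same "re" formula as a short Python sketch. The function name and the optional `reference` argument are mine: by default I assume that "averagecru" is the mean of the observed values you pass in, and you can supply another constant if the script uses a different one.

```python
def reduction_of_error(observed, predicted, reference=None):
    """re = 1 - SumVariances / SumVariancesRef.

    Assumption: `reference` defaults to the mean of `observed`.
    """
    if reference is None:
        reference = sum(observed) / len(observed)
    # SumVariances: squared misfit of the prediction
    sum_var = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # SumVariancesRef: squared misfit of the constant reference
    sum_var_ref = sum((o - reference) ** 2 for o in observed)
    return 1 - sum_var / sum_var_ref

perfect = reduction_of_error([1, 2, 3], [1, 2, 3])   # prediction matches exactly
trivial = reduction_of_error([1, 2, 3], [2, 2, 2])   # prediction equals the reference
worse = reduction_of_error([1, 2, 3], [3, 2, 1])     # prediction worse than the reference
```

The three calls cover the cases worth remembering: a perfect prediction gives 1, a prediction no better than the constant reference gives 0, and a prediction worse than the reference gives a negative number.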

This particular program predicts the 1995-2094 temperatures as random data with a particular power law encoding the noise at different time scales, but otherwise oscillating around constant data (the 1994 temperature). You could modify the "predictions" by any kind of bias you want - global warming or global cooling - and the statistical significance of your results would not change much. Also, the M&M mining effect is still not included: if you allow your algorithm to choose the "best trees", you can increase your verification statistics even though the data is still noise.

The punch line is that the reconstructions that imply that the 20th century climate was unprecedented are as statistically trustworthy as sequences of random numbers. If you want to verify the hypotheses, you must actually pay attention to the "r2" statistic. With this method, you can see that the randomly generated predictions are garbage, much like various existing "hockey stick" graphs whose goal was to "prove" that the 20th century climate was unprecedented.