Saturday, March 28, 2009

Spatial correlations in Antarctica

This is a continuation of a TRF analysis called
Eigenvalues in Antarctica
The main goal is to qualitatively confirm a plot by Steve McIntyre.

[Figure: density plot of station-pair correlation vs. distance, raw data. Click to zoom in. See it in different colors.]

The picture shows the joint distribution of the correlation coefficients (a priori between -1 and +1) and the distances for pairs of the 5,509 Antarctic stations. It shows how the correlation tends to disappear as the distance increases.

The chart has been calculated from the temperature anomalies over a period of 300 months (25 years): by the temperature anomaly, I mean the temperature for a given month minus the average temperature for the same station and the same calendar month, computed over those 25 years.
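In case you prefer something freer than Mathematica, the anomaly step could look like this in Python with NumPy. This is a hypothetical sketch, not the code actually used; it assumes the data arrive as a stations × months matrix with the months in calendar order:

```python
import numpy as np

def monthly_anomalies(temps):
    """Subtract each station's mean for each calendar month.

    temps: array of shape (n_stations, n_months), n_months divisible by 12.
    Returns anomalies of the same shape.
    """
    n_stations, n_months = temps.shape
    # Reshape to (stations, years, 12) so axis 1 runs over the years.
    by_month = temps.reshape(n_stations, n_months // 12, 12)
    # Per-station, per-calendar-month mean over the 25 years.
    climatology = by_month.mean(axis=1, keepdims=True)
    return (by_month - climatology).reshape(n_stations, n_months)
```

A series that repeats the same annual cycle every year has zero anomaly everywhere, which is a quick sanity check on the reshaping.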

Yes, McPete, the colors show the density, i.e. the number of pairs of stations that fit into a small pixel or box: yellow means "a lot".

The raw data were taken from the last, cloud* file on Steig's website. I didn't use any of Steve's code, which increases the independence of the calculation. The colors, which have made the picture slightly more attractive and more informative than Steve's graphs, are pretty much the only visible contribution from your humble correspondent. ;-)

Only half a million (randomly chosen) pairs are drawn out of the 5,509 × 5,509 possible pairs, in order to save some time. The pairs were clustered into small boxes in which the correlation jumps by 0.01 and the distance jumps by 50 km; a grid that combines 2 × 2 small boxes has been added to the picture, too.
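The sampling and binning could be sketched like this in Python/NumPy. Again a hypothetical re-implementation, not my notebook: the function name, the haversine distance formula, and the array layout are my assumptions, and for the full half a million pairs you'd want to process the pairs in chunks to save memory:

```python
import numpy as np

def pair_density(anoms, lat, lon, n_pairs=500_000, seed=0):
    """Sample random station pairs; bin (distance, correlation) counts.

    anoms: (n_stations, n_months) anomalies; lat, lon in degrees.
    Returns (counts, dist_edges, corr_edges).
    """
    rng = np.random.default_rng(seed)
    n = anoms.shape[0]
    i = rng.integers(0, n, n_pairs)
    j = rng.integers(0, n, n_pairs)

    # Pearson correlation of each sampled pair, via standardized series.
    z = anoms - anoms.mean(axis=1, keepdims=True)
    z /= z.std(axis=1, keepdims=True)
    corr = (z[i] * z[j]).mean(axis=1)
    corr = np.clip(corr, -1.0, 1.0)  # guard against tiny fp overshoot

    # Great-circle distance (haversine), Earth radius ~6371 km.
    p1, p2 = np.radians(lat[i]), np.radians(lat[j])
    dl = np.radians(lon[j] - lon[i])
    a = np.sin((p2 - p1) / 2)**2 + np.cos(p1) * np.cos(p2) * np.sin(dl / 2)**2
    dist = 2 * 6371.0 * np.arcsin(np.sqrt(a))

    # Boxes of 50 km in distance and 0.01 in correlation.
    counts, dist_edges, corr_edges = np.histogram2d(
        dist, corr,
        bins=[np.arange(0, 6001, 50), np.linspace(-1.0, 1.0, 201)])
    return counts, dist_edges, corr_edges
```

The resulting `counts` matrix is exactly what gets color-coded in the figure: yellow wherever a box collects many pairs.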
Download the Steig2 notebook for Mathematica 7.0.1
Download its zipped PDF preview

If you focus on the "bulk" of the picture with the intense colors, you can see that the typical correlation coefficient drops from 1.0 to 0.5 or so for stations that are approximately 2,000 km apart. The correlation goes pretty much to zero for stations that are 5,000 km apart, as you can see on the right side of the picture.

This observation, extracted from reality, heavily disagrees with Steig's assumptions/claims/approximations. The latter imply that for distances below 1,500 km or so, the vast majority of Steig's correlation coefficients are almost exactly equal to +1.0. Also, his reconstructed temperatures at distant places in Antarctica seem to be almost always correlated.

[Figure: the same density plot for Steig's reconstructed data. Click to zoom in. White background.]

This image follows the same conventions as the first one but uses Steig's reconstructed data (last 25 years) as input. Can you spot the difference between this image and the first one on this page?

This effect occurs because the continental temperature is universally assumed to be a time-dependent linear combination of three or four spatial patterns. Amusingly enough, many pairs of stations have a huge negative correlation of Steig's temperatures (that's because the spatial patterns are often combined with negative coefficients - a type of anticorrelation between two places that rarely takes place in reality). While the real-world correlation never drops below -0.4, Steig offers many distant pairs whose correlation is below -0.6.
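You can see this mechanism in a toy demonstration. The assumption below (mine, purely for illustration) is that the raw field is independent noise; truncating it to 3 patterns via an SVD then pushes the pairwise correlations away from zero, toward ±1:

```python
import numpy as np

rng = np.random.default_rng(42)
n_cells, n_months, k = 200, 300, 3

# Raw "field": independent noise, so pairwise correlations cluster near 0.
raw = rng.standard_normal((n_cells, n_months))

# Rank-3 truncation via SVD, mimicking a reconstruction from 3 patterns.
u, s, vt = np.linalg.svd(raw, full_matrices=False)
recon = u[:, :k] * s[:k] @ vt[:k]

def offdiag_corr(x):
    """Absolute pairwise correlations between rows (off-diagonal only)."""
    c = np.corrcoef(x)
    return np.abs(c[np.triu_indices_from(c, k=1)])

raw_mean = offdiag_corr(raw).mean()      # small, near 0
recon_mean = offdiag_corr(recon).mean()  # much larger
```

With only 3 temporal patterns available, every pair of cells is forced to have the correlation of two vectors in a 3-dimensional coefficient space, so strong correlations and anticorrelations become generic rather than exceptional.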

Just visually, imagine that Photoshop "compressed" the first "photograph" on this page and created a JPEG file that looked like the second "photograph". Would you be satisfied with the accuracy of the picture? I guess that you wouldn't. Is climate science supposed to be less demanding?

It follows that these approximations are unacceptable for learning anything about the regional climate. They are also unacceptable as a tool to deduce something about the global climate from limited regional data, or vice versa. In other words, these crude oversimplifications make Steig's paper pretty much worthless.

General message

The method of autocorrelations, either spatial or temporal ones, is a very effective tool to find out whether an approximation captures the correct behavior of the climate (or other systems) at various length scales or time scales.

In Nature, the typical autocorrelations tend to decrease with the separation (spatial or temporal - see our discussion of the frequency of weather records), but they tend to decrease gradually. Many oversimplified models may look OK to an untrained eye. However, their statistical patterns usually show either excessive spatial autocorrelation, insufficient temporal autocorrelation at long time intervals (at least when the hypothetical monotonic, linear trends are removed), or a discontinuous, non-gradual dependence of the autocorrelation on the separation.
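To see the temporal side of this concretely, compare the sample autocorrelation function of white noise with that of a persistent AR(1) "red" noise. This is a toy sketch, with the persistence parameter chosen purely for illustration:

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation of a 1-D series at lags 0..max_lag."""
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:x.size - lag], x[lag:]) / denom
                     for lag in range(max_lag + 1)])

rng = np.random.default_rng(1)
n = 5000
white = rng.standard_normal(n)  # no memory: ACF drops to ~0 at lag 1

# AR(1) "red" noise with persistence 0.9: its theoretical ACF decays
# gradually, as 0.9**lag, rather than collapsing immediately.
e = rng.standard_normal(n)
red = np.empty(n)
red[0] = e[0]
for t in range(1, n):
    red[t] = 0.9 * red[t - 1] + e[t]
```

A model that fits the trend but produces the white ACF when the data show the red one (or vice versa) is getting the character of the noise wrong, regardless of how good the fit looks.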

These models have serious problems that strongly suggest that whatever rough agreement one gets - a "trend" plus a "noise" (which is how the naive climatologists think of the system) - is likely to be a coincidence, because the detailed physical laws (and the various power laws that determine the character of the noise) that can be extracted from the observations don't agree with the simplified theories.

The gradual decrease of the correlation with the separation observed in Nature implies that there is no "universal" number of spatial patterns or PCs that would fully capture the system. The number of PCs or patterns (or distinct types of physical phenomena) needed to reproduce the behavior of the climate increases with the number of spatial or temporal cells that you want to describe and distinguish: it increases with the number of "decades" (orders of magnitude) you want to cover, i.e. both with the size or duration you want to capture and with the fineness of your resolution.

The typical distance at which the autocorrelation calculated from several decades of data drops to 1/2 is comparable to 2,000 km. This also implies that if you care about the regional weather and its change over several decades, you have to compute averages over 2,000-km or larger regions in order to filter the local weather fluctuations out. However, once you do so, the "continental" or "global trend" becomes invisible, anyway.

In science, hypotheses can often be falsified, even if their proponents think that it must be hard

It is surely true that the climate is a mixture of effects that are regular and can be understood, and effects that seem random (at least at our current level of understanding). However, the task of dividing the effects into these two groups is very subtle, and only a small fraction of the methods for dividing them are scientifically sensible (or true or accurate).

And more detailed statistical methods can actually show that a particular way to describe the climate - a "regular" effect plus a "noise" of a particular type (e.g. a linear global warming trend plus a white noise with 3-4 spatial patterns over a continent) - doesn't work well. It seems that most existing climate models and approximations predict statistical correlations that can be falsified.

Importing CDF.GZ files

Steve McIntyre is also working on another project whose first step is importing some Wisconsin CDF.GZ datasets into R. Unzipping the GZIP files seems to be the main task so far. ;-) Mathematica unzips imported files automatically.
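For the record, the GZIP step is also trivial in Python's standard library. A sketch of just the unzip step (the function name is mine); reading the decompressed CDF contents would still require a format-aware reader:

```python
import gzip
import shutil

def gunzip(src, dst):
    """Decompress src (a .gz file) into dst - the step that has been
    the main obstacle in R; Mathematica does it transparently on Import."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
```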

However, the internal structure of a CDF file seems nontrivial, and if R is not equipped with gadgets to decode CDF files, it may be pretty hard to do so manually. For example, the climate files under consideration contain 25 pieces of data (25 matrices, if you wish), each of which comes with a different vector length, different number formats, and different annotations: see the PDF printout of a Mathematica notebook clarifying the structure of the CDF file.

Wolfram's software seems to be very convenient for such things - but yes, R is free.


Comments

  1. There's no color key. Am I correct in assuming it is a density (# of items) plot?

    I.e., the bright yellow area is where the bulk of the items are found?

  2. Your analysis is very interesting, thanks!

    (Side-note: MATLAB can load CDF files directly! :) )