## Thursday, February 19, 2009

### Eigenvalues in Antarctica

I went through "Principal components in Antarctica (part 2)" and reproduced some of the basic findings. Let me start with some results; I will explain what we're doing later in the text.

This is the PC1 visualized by my favorite Voronoi diagrams. Finally, you should also see two more diagrams of this kind, namely PC2 and PC3. There are no other PCs! What the hell is your humble correspondent talking about?

#### Fifty megabytes of junk

A few weeks ago, Nature printed a paper by Steig et al. about the temperatures in Antarctica. Michael "hockey stick" Mann is among the authors. Curiously enough, the data used in the paper are publicly available on Steig's Nature 2009 website. That page contains links to various files such as:
• Tir_lats.txt (latitudes)
• Tir_lons.txt (longitudes)
• ant_recon.txt (temperatures)
The first two files are very short and describe the latitudes and longitudes (in degrees) of 5509 points in Antarctica. The third file is much bigger: it is about 50 MB long and it describes the temperatures for each "station" (5509 columns) in each month (600 rows corresponding to a recent 50-year period).

A priori, I would be completely overwhelmed by such a huge file with seemingly uncontrollable numbers. Fortunately, some people around ClimateAudit.ORG are more courageous. It turns out that exactly 99.5% of the figures in the file are redundant junk. The file pretends to describe 600 x 5509 independent numbers but it can actually be calculated from 3 x 5509 numbers!
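
The 99.5% figure is simple arithmetic. As a rough check, counting only the three spatial vectors against the full matrix and ignoring the comparatively tiny set of monthly weights:

$$
1 - \frac{3 \times 5509}{600 \times 5509} = 1 - \frac{3}{600} = 0.995
$$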

How is it possible? If you want to follow me, you should first click the logo below and buy a full-fledged Mathematica 7 Home Edition for only USD 295:

Alternatively, you may try some free Mathematica players that won't work with my notebooks. Or you can download and learn a free programming language called "R" - Steve McIntyre's favorite choice - which is largely impenetrable unless you are a statistical geek. :-)
... (or PDF preview)
Fine. So what I did was to download the three files into my special Antarctica directory. I imported the files as matrices into Mathematica: it was smoother than ever before. No problems with the exponential notation, either. Then I converted the 600 x 5509 matrix into the principal components, using a function that is aptly called PrincipalComponents[...].

It took about 10 seconds to calculate the principal components. The program is really fast.
Update: See Spatial correlations in Antarctica for another, "future" TRF article about these topics
I looked at the result. And indeed, only 3 rows out of 600 were nonzero - when you looked at a large number of initial entries (close enough to the beginning of the measurements in 1957). More precisely, lots of the initial entries in the 4th row were about one billion times smaller than those in the 3rd row. Wow. Clearly, it was meant to be zero and the nonzero values came from some rounding errors and other negligible effects. The three principal components were drawn, using Voronoi diagrams and the TemperatureMap color scheme, with positions on Earth projected stereographically on the plane touching the South Pole.
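
For readers who want to reproduce the map layout without Mathematica, here is a minimal Python sketch of a stereographic projection onto the plane touching the South Pole. The function name `south_polar_stereographic`, the unit sphere radius, and the orientation of the x and y axes are my assumptions for illustration; they are not taken from the notebook.

```python
import math

def south_polar_stereographic(lat_deg, lon_deg, radius=1.0):
    """Project (latitude, longitude) in degrees onto the plane tangent
    to the sphere at the South Pole, projecting from the North Pole.

    The axis orientation (longitude 0 along +y) is an arbitrary choice.
    """
    # Angular distance from the South Pole: 0 at the pole, 90 degrees at the equator.
    colatitude = math.radians(90.0 + lat_deg)
    # Distance from the pole in the tangent plane.
    r = 2.0 * radius * math.tan(colatitude / 2.0)
    lon = math.radians(lon_deg)
    return r * math.sin(lon), r * math.cos(lon)

pole_xy = south_polar_stereographic(-90.0, 0.0)
equator_xy = south_polar_stereographic(0.0, 0.0)
```

The South Pole lands at the origin and the equator would land on a circle of radius `2 * radius`, so all 5509 Antarctic points end up in a small disk around the origin where a Voronoi diagram can be drawn.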

So actually all the 600 rows (months) are 5509-dimensional vectors that belong to a 3-dimensional subspace instead of an expected 600-dimensional space. All these vectors are heavily linearly dependent: they are linear combinations of three 5509-component vectors (one coordinate for each reference point in Antarctica) and these three vectors are graphically represented as PC1, PC2, PC3 in the diagrams above.
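
To see what such a collapse to three directions looks like numerically, here is a minimal Python sketch on a tiny synthetic matrix (8 "months", 5 "points") built from three spatial patterns. It is neither the Antarctic data nor Mathematica's PrincipalComponents: the helper names (`center_columns`, `gram`, `top_eigenvalues`), the toy numbers, and the choice of plain power iteration with deflation are all my assumptions for illustration.

```python
import math

def center_columns(rows):
    """Subtract each column's mean (taken over the rows, i.e. over months)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[value - mean for value, mean in zip(row, means)] for row in rows]

def gram(rows):
    """Return X^T X (points x points) for a matrix X given as a list of rows."""
    cols = list(zip(*rows))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def top_eigenvalues(sym, count, iterations=200):
    """Leading eigenvalues of a symmetric positive semidefinite matrix,
    found by power iteration followed by deflation."""
    m = [list(row) for row in sym]
    size = len(m)
    eigenvalues = []
    for _ in range(count):
        v = [float(i + 1) for i in range(size)]  # fixed, generic start vector
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iterations):
            w = [sum(m[i][j] * v[j] for j in range(size)) for i in range(size)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break  # the remaining matrix is exactly zero
            v = [x / norm for x in w]
        mv = [sum(m[i][j] * v[j] for j in range(size)) for i in range(size)]
        lam = sum(a * b for a, b in zip(v, mv))  # Rayleigh quotient, |v| = 1
        eigenvalues.append(lam)
        for i in range(size):  # deflate: remove the component just found
            for j in range(size):
                m[i][j] -= lam * v[i] * v[j]
    return eigenvalues

# Three hypothetical spatial patterns and their weights in each of 8 months.
patterns = [[1, 1, 1, 1, 1], [1, -1, 0, 0, 0], [0, 0, 1, -1, 0]]
weights = [[3, 2, 1], [-3, 2, 1], [3, -2, 1], [-3, -2, 1],
           [3, 2, -1], [-3, 2, -1], [3, -2, -1], [-3, -2, -1]]
offsets = [-30.0, -25.0, -20.0, -15.0, -10.0]  # made-up mean temperatures

data = [[offsets[x] + sum(w * p[x] for w, p in zip(row, patterns))
         for x in range(len(offsets))] for row in weights]

eigenvalues = top_eigenvalues(gram(center_columns(data)), 4)
```

In this toy case the first three eigenvalues are of order 10 to 100 while the fourth is pure rounding noise, which is the same qualitative signature as a fourth row that is a billion times smaller than the third.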

Instead of the file with 600 vectors, each of which has 5509 components, they could offer a file with 3 vectors only, each of which has 5509 components, plus 600 x 3 numbers telling you the weights of PC1, PC2, PC3 for each month. OK, I mean 600 x 4 numbers because I also need an additive shift (another time series) for the linear fit.

At any rate, the latter 2,400 numbers would give you a file of a negligible size. These four time series look pretty chaotic and are included in the notebook. The notebook also verifies that the fit works perfectly at the available accuracy.
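
The bookkeeping can be sketched in a few lines of Python. The function names `reconstruct` and `stored_numbers` and the toy inputs are hypothetical; the sketch only assumes the form of the fit described above, namely a per-month shift plus three per-month weights multiplying three spatial vectors.

```python
def reconstruct(shift, weights, pcs):
    """Rebuild the months x points matrix from the compact description.

    shift:   one additive constant per month
    weights: one list of component weights per month
    pcs:     one spatial vector (a value per point) per component
    """
    n_points = len(pcs[0])
    return [
        [shift[t] + sum(w * pc[x] for w, pc in zip(weights[t], pcs))
         for x in range(n_points)]
        for t in range(len(shift))
    ]

def stored_numbers(months, points, components):
    """Count the numbers in the full matrix and in the compact description."""
    full = months * points
    compact = components * points + months * (components + 1)
    return full, compact

# Toy example: 2 months, 2 points, 3 components.
toy = reconstruct(
    shift=[1.0, 2.0],
    weights=[[1.0, 0.0, 0.0], [0.0, 2.0, 1.0]],
    pcs=[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
)

full, compact = stored_numbers(600, 5509, 3)
```

For 600 months, 5509 points and 3 components this gives 3,305,400 numbers in the full file against 18,927 in the compact form, i.e. a bit under 0.6% of the original.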

Transposed matrices

I am still somewhat mystified about what this tells us about the transformations that were done to create the files. One thing is clear: the files are not the actual exact observed data from each month and each point. They're the result of a simplified fit. But what can look strange at first is that the principal components are vectors whose 5509 coordinates are associated with points in space rather than individual months!

One is used to a truncation to a few principal components, each of which looks like a time series. But what Steig et al. (or someone else) did was a different step. They wanted to simplify the dependence of the temperatures on space. The truncation to the three PC pictures means that if two points in Antarctica happen to have (almost) the same color on the PC1 diagram, (almost) the same color on the PC2 diagram, and (almost) the same color on the PC3 diagram, they must have (almost) the same temperature in every single month in the past and in the future. ;-)
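
In formulas, with notation that is mine rather than the paper's: write $T_t(x)$ for the temperature at point $x$ in month $t$, $a_0(t)$ for the monthly shift, and $a_k(t)$ for the monthly weight of $\mathrm{PC}_k$. The truncated fit then reads

$$
T_t(x) = a_0(t) + \sum_{k=1}^{3} a_k(t)\,\mathrm{PC}_k(x),
$$

so for any two points $x$ and $y$

$$
T_t(x) - T_t(y) = \sum_{k=1}^{3} a_k(t)\,\bigl(\mathrm{PC}_k(x) - \mathrm{PC}_k(y)\bigr),
$$

which vanishes in every month $t$ as soon as $\mathrm{PC}_k(x) = \mathrm{PC}_k(y)$ for $k = 1, 2, 3$.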

I hope that the three principal components were at least chosen in some sensible way, i.e. as real principal components of a more complete matrix. You can see that PC1 - the picture included in this blog text - resembles the elevation graphs. It means that something similar to the altitude is the most important factor that determines the character of temperatures (and their change) at a given point.

Nevertheless, I am not getting the point of any of these simplifications. It's not hard to see that all the calculations could be quickly done with the exact 600 x 5509 matrices of temperatures. Why are they simplifying in this brutal way, by erasing 597 dimensions of the 600-dimensional space? Why don't they clearly say that they are doing so?

It is hard not to feel that the details in these papers are rubbish that is prepared well enough so that the laymen who read these papers - such as the typical referees - won't be able to tell. But the ClimateAudit.ORG readers clearly can tell real measured numbers from artificial, calculated ones. A possible "innocent" explanation is that all the data from the beginning, right after 1957, actually come from 3 stations, and someone tried to "retrodict" the temperature at all 5509 places out of these 3 numbers. Later, the number of stations was increasing. But the old enough data actually never display any satellite-based continuous maps.

Even if these "innocent" explanations are correct, I am afraid that the existence of bright bloggers such as Steve and a few of his commenters is enough to reveal that a large portion of the stuff done by the climate scientists is extremely sloppy.

And you know, Steve loves principal components. I love them too, like anything in linear algebra. On the other hand, I don't love unnecessary truncation of the data. It is useful for visualizing things but I believe that such a truncation should only be the last step. The bulk of all calculations should use the exact data and the full matrices. It is easy to see that the existing computers are clearly capable of doing all such procedures very quickly.