## Saturday, March 20, 2010

### Defending statistical methods

There surely exist propositions by the skeptics - and opinions liked by many skeptics - that I find highly unreasonable. I don't know whether they're as frequent as the alarmists' delusions but they certainly do exist. A whole article of this sort was written by Tom Siegfried in Science News: "Odds Are, It's Wrong".

The article, whose subtitle is "Science fails to face the shortcomings of statistics" - it sounds serious, doesn't it? - was promoted at Anthony Watts' blog. The most characteristic quote in the article is the following:

"It’s science’s dirtiest secret: The ‘scientific method’ of testing hypotheses by statistical analysis stands on a flimsy foundation."

So I gather that the idea is that one should throw away all statistical methods which are a "mutant form of math". That's a rather radical conclusion.

There surely exist whole scientific disciplines that are trying to find tiny, homeopathic signals that can be hugely overinterpreted and hyped because the researchers are usually rewarded for such statements, regardless of their validity (or at least, they don't pay any significant price if the claims turn out to be wrong). The science of health impacts of XY is the classic example - and environmental and climate sciences may have become another.

People in those disciplines are usually led by their environment to "find" effects even if they don't really exist. The average ethical and intellectual qualities of the people who work in these disciplines are poor. But it's just preposterous to imagine that the right cure could be to throw away or ban all statistical methods.

Richard Feynman about the probabilistic character of most scientific insights. Taken from 37:45, "Seeking New Laws", the last Messenger Lecture at Cornell University (1964).

Statistical methods are crucial and omnipresent

In fact, statistical methods have always been essential in any empirically based science. In the simplest situation, a theory predicts a quantity to be "P" and it is observed to be "O". The idea is that if the theory is right, "O" equals "P". In the real world, neither "O" nor "P" is known infinitely accurately.

Why? Because observations are never accurate, so "O" always has some error, at least if it is a continuous quantity. And "P" is almost always calculated by a formula that depends on other values that had to be previously measured, too. So even the predictions "P" have errors. There are various kinds of errors that contribute and they would deserve a separate lecture.

Moreover, quantum mechanics implies that all observations ever made have some uncertainty and all of them are statistical in character. The most complete possible theories can only predict the probabilities of individual outcomes. Clearly, all observations you can ever make have a statistical nature. In particular, experimental particle physics would be impossible without statistics. If you can't deal with the statistical nature of the empirical evidence, you simply can't do empirical science.

Now, if "O" and "P" are known with some errors, how do you determine whether the theory passes or fails the test? The errors are never "strict": there is always a nonzero probability that a very big error, much bigger than the expected one, is accumulated, so you should never imagine that the intervals "(O - error, O + error)" are "strictly certain". Nothing is certain. If you pick 1,000 random people, the number of women will typically deviate from 500 by about 16 (one standard deviation) and a deviation around 30 is not unusual; it is extremely unlikely, but not impossible, that there will be 950 women and 50 men.
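The numbers in the 1,000-people example can be checked with a few lines of Python. This is only a sketch: it assumes each person is independently a woman with probability 1/2, and the function names are made up for the illustration. One standard deviation comes out near 16, so a deviation of 30 is roughly a two-sigma fluctuation, while 950 women is possible but absurdly improbable.

```python
import math

def binomial_sd(n: int, p: float = 0.5) -> float:
    """Standard deviation of the number of successes in n independent trials."""
    return math.sqrt(n * p * (1.0 - p))

def prob_at_least(n: int, k_min: int) -> float:
    """Exact P(X >= k_min) for X ~ Binomial(n, 1/2), using integer arithmetic."""
    favorable = sum(math.comb(n, k) for k in range(k_min, n + 1))
    return favorable / 2**n  # big-integer division, rounded to a float at the end

sd = binomial_sd(1000)
p_extreme = prob_at_least(1000, 950)
print(f"one standard deviation: {sd:.1f} people")
print(f"P(at least 950 women out of 1000): {p_extreme:.3g}")
```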

The answer to the question that started the previous paragraph is, of course, that if "O" and "P" are (much) farther from one another than the errors of both "O" and "P", the theory is falsified. It's proven wrong. If they're close enough to one another, the theory may pass the test: we failed to disprove it. But as always in science, it doesn't mean that the theory has been proven valid. Theories are never proven valid "permanently". They're only temporarily valid until a better, more accurate, newer, or more complete test finds a discrepancy and falsifies them.

In the distant past, people wanted to learn "approximate", qualitatively correct theories. So the hypotheses that would eventually be ruled out used to be "very wrong". Their predicted "P" was so far from the observations "O" that you could have called the disagreement "qualitative" in character. However, strictly speaking, it was never qualitative. It was just quantitative - and large.

But as our theories of anything in the physical Universe are getting more accurate, it is completely natural that the differences between "O" and "P" of the viable candidate hypotheses are getting smaller, in units of the errors of "O" or "P". In some sense, the new scientific findings at the "cutting edge" or the "frontier" almost always emerge from the "mud" in which "O" and "P" looked compatible. When the accuracy of "O" or "P" increases, we can suddenly see that there's a discrepancy.

We should always ask how big a discrepancy between "O" and "P" is needed for us to claim that we have falsified a theory. This is a delicate problem because there will always be a nonzero probability that the discrepancy has occurred by chance. We don't want to make mistakes. So we want to be e.g. 99.9% sure that if we say that a theory has been falsified, it's really wrong.

The required separation between "O" and "P" can be calculated from the figure above, from 99.9%. If you don't know the magic of statistical distributions, especially the normal one, I won't be teaching you about them in this particular text. But it's true that the probability that the falsification "shouldn't have been done" because the disagreement was just due to chance is decreasing more quickly than exponentially with the separation - as the Gaussian.
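For the curious, here is a minimal sketch of how the separation follows from the confidence level. It assumes Gaussian errors and a two-sided criterion (a discrepancy in either direction counts); the helper name is hypothetical.

```python
from statistics import NormalDist

def sigmas_for_confidence(confidence: float) -> float:
    """Two-sided separation, in standard deviations, matching a confidence level."""
    tail = (1.0 - confidence) / 2.0          # probability left in each tail
    return NormalDist().inv_cdf(1.0 - tail)  # quantile of the standard normal

for confidence in (0.90, 0.95, 0.999):
    print(f"{confidence:.1%} -> {sigmas_for_confidence(confidence):.2f} sigma")
```

With these assumptions, 99.9% corresponds to a separation of roughly 3.3 standard deviations.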

So e.g. particle physics typically expects "new theories" to be supported by "5 sigma signals" - in some sense, the distance between "O" and "P" is at least 5 times their error. The probability that this takes place by chance is smaller than one in one million. Particle physicists choose such a big separation - and huge confidence level - because they don't want to flood their discipline with lots of poorly justified speculations. They want to rely upon solid foundations so statistical tests have to be really convincing.
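The "one in one million" figure can be reproduced from the Gaussian tail. The sketch below assumes purely Gaussian noise; whether one quotes the one-sided or the two-sided number is a convention, so both are printed.

```python
import math

def one_sided_p(sigmas: float) -> float:
    """Probability that Gaussian noise alone exceeds the given number of sigmas."""
    return 0.5 * math.erfc(sigmas / math.sqrt(2.0))

def two_sided_p(sigmas: float) -> float:
    """Same, counting fluctuations in either direction."""
    return 2.0 * one_sided_p(sigmas)

for s in (1, 2, 3, 5):
    print(f"{s} sigma: one-sided {one_sided_p(s):.3g}, two-sided {two_sided_p(s):.3g}")
```

Either way, the 5-sigma probability stays below one in a million.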

Softer disciplines typically choose less than 5 sigma to be enough: 2 or even 1 sigma is sometimes presented as a signal that matters. Of course, this is because they actually want to produce lots of results even though they may be (and, sometimes, are likely to be) rubbish. But a simple fix is that they should raise the required confidence level for their assertions - e.g. from 2 sigma to 5 sigma. They don't have to immediately throw statistics as a tool away.

Be sure that a 5-sigma confidence level is also enough for tests of dozens of drugs at the same moment. I agree with Siegfried that if you're making very many tests, there will surely be some "false positives" - statistics happens. But with an appropriately chosen confidence level, depending on the context, one can keep any errors in one's research "very rare".
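To see how the multiple-testing problem depends on the threshold, here is a sketch that assumes the tests are independent, that none of the drugs actually works, and that 50 drugs are tested at once (a number I made up for the illustration).

```python
import math

def two_sided_p(sigmas: float) -> float:
    """False-positive probability of one test at a given sigma threshold."""
    return math.erfc(sigmas / math.sqrt(2.0))

def prob_any_false_positive(n_tests: int, p_single: float) -> float:
    """P(at least one false positive) among n independent tests of true nulls."""
    # 1 - (1 - p)^n, computed with log1p/expm1 so that a tiny p is not rounded away
    return -math.expm1(n_tests * math.log1p(-p_single))

N_TESTS = 50  # hypothetical number of drugs tested at once
for s in (2, 5):
    risk = prob_any_false_positive(N_TESTS, two_sided_p(s))
    print(f"{s} sigma threshold, {N_TESTS} tests: P(any false positive) = {risk:.3g}")
```

At 2 sigma, at least one fake "discovery" is nearly guaranteed; at 5 sigma the chance remains negligible.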

A problem is that many of these researchers actually don't want to do it - e.g. to improve the confidence level. They don't want their science to work right. They have other interests.

In fact, while the confidence level is dramatically increased if we go from 2 sigma to 5 sigma (something like from 95% to 99.9999%), the amount of data we need to collect to get the 5-sigma accuracy is just "several times" bigger than for the 2-sigma accuracy. So if there's some effect, it's not such a huge sin to demand that published "discoveries" should be supported by 5-sigma signals. Once again, the soft scientists - who propose various theories of health (what is healthy for you) - are choosing low confidence levels deliberately because they like to present new results even though they're mostly bogus. They still get famous along the way.

If some key statements about AGW are only claimed to be established at the 90% confidence level, it's just extremely poor evidence (and may be overstated or depend on the methods, anyway). In principle, it shouldn't be hard for the evidence for such a hypothesis, assuming it's true, to be strengthened to 99.9% or more. That's what "hard sciences" deserving the name require, anyway.

The laymen usually misunderstand how little "90%" is as a confidence level - and some traders in fear masterfully abuse this ignorance. 90% vs 10% is not that "qualitatively" far from 50% vs 50% - and one can transform one to the other by a "slight" pressure in the methodology and the formulae. If you want to be scientifically confident about a conclusion, you should really demand 99.9% or more. And it's actually not that hard to obtain such stronger evidence assuming that your hypothesis is actually correct and the "signal" exists.

Falsifying a null hypothesis

I must explain some basic points of the statistical methods. Typically, we want to find out whether a new effect exists. So we have two competing hypotheses: I will call them the null hypothesis and the alternative hypothesis.

The null hypothesis says that no new effect exists - everything is explained by the old theories that have been temporarily established and any pattern is due to chance. When I say "chance", it's important to realize that one must specify the exact character of the "random generator" that produces these random data, including the deviations, correlations, autocorrelations, persistence, color of the noise etc. There's not just one "chance": there are infinitely many "chances" given by "statistical distributions" and we must be damn accurate about what the null hypothesis actually says. (Often, we mean the "white noise" and "independent random numbers" etc.)

The alternative hypothesis says that a new effect is needed: the old explanations and the null hypothesis are not enough.

How do you decide between these two? Well, you calculate the probability that the apparently observed "pattern" could have occurred by chance assuming the null hypothesis. If the probability of something like that is sufficiently high, e.g. 1% or 5%, you say that your data don't contain evidence for the alternative hypothesis.

If the calculated probability that the "pattern" in the data could have been explained by chance - and by the null hypothesis - is really tiny, e.g. 10^{-6}, then your data give you strong evidence that the null hypothesis is wrong. If you say that it's wrong, your risk of having made a wrong conclusion - the so-called "false positive" or "type I error" - is only 10^{-6}. So it's sensible to take this risk. In my example, we falsified the null hypothesis at the 99.9999% level. It's very likely that a new effect has to exist.
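A toy version of this procedure, as a sketch only: the null hypothesis is a fair coin (independent flips, so the "random generator" is fully specified) and the data, 560 heads in 1,000 flips, are hypothetical. The p-value is the exact probability that chance alone produces a deviation at least as large as the observed one.

```python
import math

def two_sided_binomial_p(n: int, observed: int) -> float:
    """Exact P(|X - n/2| >= |observed - n/2|) for X ~ Binomial(n, 1/2)."""
    deviation = abs(2 * observed - n)  # twice the distance from n/2, kept as an integer
    favorable = sum(math.comb(n, k) for k in range(n + 1)
                    if abs(2 * k - n) >= deviation)
    return favorable / 2**n

# Hypothetical data: 560 heads in 1,000 flips; null hypothesis: a fair coin.
p_value = two_sided_binomial_p(1000, 560)
print(f"p-value under the null: {p_value:.2e}")
for alpha in (0.05, 0.01, 1e-6):
    verdict = "reject the null" if p_value <= alpha else "no significant evidence"
    print(f"required significance {alpha:g}: {verdict}")
```

The same data falsify the null hypothesis by the soft standards and fail to do so by the 10^{-6} standard, which is the whole point about choosing the threshold.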

You're expected to have an alternative hypothesis that actually describes the data more accurately and gives a higher probability that the data could have occurred according to the alternative hypothesis, with its new understanding of "chance".

However, if the probability of getting the pattern by chance, from the null hypothesis, is substantial, e.g. 10%, then your data only provide you with a very weak hint that a new effect could exist. If you use the standards of hard sciences, you should say that your data can't settle the question either way.

Of course, it is always possible that if you make such a conclusion, you have made another kind of error, the "type II error", also known as the false negative. But what Tom Siegfried seems to misunderstand is that this is a common situation that you simply can't avoid in most cases. The data, with their limited volume and limited accuracy (and assuming a small size of the new effect), simply can't settle the question either way.

So when you say that you don't have enough evidence to confirm the "pattern", i.e. that the data don't contain statistically significant evidence for the alternative hypothesis, i.e. the new effect, it is not the ultimate proof that the alternative hypothesis is wrong. It is not the final proof that the new effect can't exist.

It's just evidence that the new effect is small and unimportant enough so that it couldn't have been detected in the particular sample or experiment. You can't make a final decision here. While hypotheses can be kind of "completely killed" in science, they can never be "completely proved". Even though the null hypothesis can be pretty much safely killed, no one can ever guarantee to you that your particular generalization, your alternative hypothesis, is the most correct one. It could have been better than the null hypothesis in passing this particular test but the next one may falsify your alternative hypothesis, too.

There's no straightforward way to construct better hypotheses! Creativity and intuition are needed before your viable attempts are tested against the data.

And quite often, your data simply don't contain enough information to decide. This is not a bug that you should blame on the statistical method. The statistical method is innocent. It is telling you the truth and the truth is that we don't know. The laymen may often be scared by the idea that we don't know something - and they often prefer fake and wrong knowledge over admitting that we don't know - but it's their vice, their inability to live with what the actual science is telling us (or not telling us, in this case), not a bug of the statistical method.

Misinterpretations, errors, lousy scientists

Of course, the picture above assumes that one actually learns how the statistical method works and what it exactly allows us to claim in particular situations. That has nothing to do with the journalists' or laymen's interpretations. The journalists and other laymen usually don't understand statistics well - and sometimes they want to mislead others deliberately.

But again, it would be ludicrous to blame this fact on the statistical method.

Analogously, bad scientists may calculate confidence levels incorrectly. They may choose unrealistic null and/or alternative hypotheses: in systems theory, a wrong choice of the null hypothesis is sometimes referred to as the "type III error". And they may misinterpret what their test has really demonstrated and what it hasn't. They may hold completely unrealistic beliefs about the odds that a "generic" hypothesis would pass a similar test so they can't place their calculation in any proper context. Sometimes, they think that by falsifying the null hypothesis, they're proving the first alternative hypothesis that they find convenient to believe (one can't prove it in such a way, you would have to falsify all other possible alternative theories first!). Quite typically, such people only blindly follow some statistical recipes that they don't quite understand. So it's not shocking that they can end up with mistakes.

This fact is not specific to statistics. People who are lousy scientists often make errors in non-statistical scientific methodologies, too. That's not a reason to abandon science, is it?

The proper statistical method gives us the best tool to study the incomplete or inaccurate empirical information - and in the real world, all empirical information is incomplete or inaccurate, at least to some extent. And one can actually prove that the probability of a "false positive" is as small as the significance level: it's true pretty much by definition. Well, the p-value is not quite the same thing as the probability of a "false positive", i.e. the significance level (one minus the confidence level), but it's pretty close: if a calculated p-value is at most equal to the required significance level, the test may be used to reject the null hypothesis.
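The claim that the false-positive probability equals the significance level "by definition" can also be checked by brute force. The sketch below is a Monte Carlo under stated assumptions: the null hypothesis is exactly true (unit-variance white noise with zero mean), the test is a two-sided z-test on the sample mean, and the sample sizes and the seed are arbitrary choices of mine.

```python
import math
import random
from statistics import NormalDist

def false_positive_rate(alpha: float, n_samples: int = 20,
                        n_experiments: int = 20_000, seed: int = 0) -> float:
    """Fraction of null-only experiments that a two-sided z-test wrongly rejects."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    rejections = 0
    for _ in range(n_experiments):
        # The null hypothesis is true: unit-variance white noise with zero mean.
        mean = sum(rng.gauss(0.0, 1.0) for _ in range(n_samples)) / n_samples
        z = mean * math.sqrt(n_samples)  # the mean divided by its standard error
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_experiments

rate = false_positive_rate(alpha=0.05)
print(f"required significance 0.05, observed false-positive rate {rate:.3f}")
```

If the noise model in the null hypothesis were wrong (e.g. autocorrelated data treated as white noise), the observed rate would no longer match, which is why the null hypothesis must be specified so carefully.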

But "false negatives" can never be reliably cured. Whenever your experiment is not accurate enough, it will simply say "no pattern seen" even though a better experiment could see it.
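How often a real effect gets missed depends on its size and on the amount of data. Here is a rough sketch assuming Gaussian errors, a one-sided z-test, and a hypothetical effect equal to 0.1 of the single-measurement standard deviation; the function name is mine.

```python
import math
from statistics import NormalDist

def detection_probability(effect: float, n: int, sigmas: float) -> float:
    """Chance that a one-sided z-test at the given threshold sees a real effect.

    effect is the true shift in units of the single-measurement standard deviation.
    """
    expected_z = effect * math.sqrt(n)  # the signal grows like sqrt(n) in units of the error
    return 1.0 - NormalDist().cdf(sigmas - expected_z)

for n in (100, 2_500, 10_000):
    power = detection_probability(effect=0.1, n=n, sigmas=5.0)
    print(f"n = {n:>6}: detected with probability {power:.3g}, missed with {1 - power:.3g}")
```

With too few measurements the experiment says "no pattern seen" almost every time even though the effect is there; with enough of them the false negatives disappear.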

The solution to fight against the widespread errors is to require the soft disciplines to become harder - to calculate the confidence levels properly and to require higher confidence levels than those that have been enough for a "discovery" in the recent decades. This recommendation follows from common sense. If your field has been flooded by lots of beliefs in correlations and mechanisms that often turned out to be incorrect or non-existent, it's clear that you should make your standards more stringent.

Scientists, journalists, and laymen should do their best to be accurate and to learn what various tests actually imply.

But it will still be true that no science can be done "quite" without any statistical reasoning. And it's still true that the datasets and experiments will continue to be unable to give the "final answer" to many questions we would like to be answered. These are just facts. You may dislike them but that's the only thing you can do against facts.

So I would urge everyone to try to avoid bombshell statements such as "statistics is a dirty core of science that doesn't work and has to be abandoned". Lousy work of some people can't ever justify such far-reaching claims.

After all, much of the lousy work - and lousy presentation in the media - emerges because the people want to claim that the relevant research is "less statistical" in character than it actually is. In most cases, weak statistical signals are being promoted to a kind of "near certainty". So the right solution is for everyone to be more appreciative of the statistical method, not less so!

And that's the memo.

Tamino and 5 sigma

Tamino claims that "requiring five sigma is preposterous". Well, it's not. It's what disciplines of hard sciences require as a criterion for a discovery. See five sigma discovery at Google (2000 pages). In particular, discoveries of new particles by colliders do require 5 sigma. No one would have claimed a discovery of a top quark at 3 sigma - which would only be viewed as a suggestive yet vague hint.

Once again, this increase is needed because people often cook their results to make "discovery claims" that are bogus: it's easy to "improve" the tests. If you try 10 variations of the same test, on average one of them will show a (fake) effect at a 90% confidence level: that's what the 90% confidence level means, by definition. Unfortunately, many researchers are approaching the things in this way.

With a 5-sigma discovery, such cheating becomes virtually impossible because you would need a million variations of your paper - and only one of them would show a false positive. On the other hand, it's not "infinitely more difficult" to get 5-sigma results relative to 3-sigma results. Because the relative errors go like "1/sqrt(N)" where N is the number of events (whose average you're calculating, in a way), you only need to increase the number of events by a factor of "(5/3)^2 = 2.7778" to go from 3 sigma to 5 sigma.

Because of the amazing increase of the "purity" of your results and their immunity against errors and your own bias, it's surely worth paying this extra factor of 2.7778, isn't it?
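The arithmetic is trivial but worth writing down. The sketch assumes that the effect size is fixed and that the errors shrink like 1/sqrt(N); the function name is hypothetical.

```python
def extra_data_factor(current_sigmas: float, target_sigmas: float) -> float:
    """Factor by which the number of events must grow, assuming errors ~ 1/sqrt(N)."""
    return (target_sigmas / current_sigmas) ** 2

print(f"3 sigma -> 5 sigma: {extra_data_factor(3, 5):.4f} times more events")
print(f"2 sigma -> 5 sigma: {extra_data_factor(2, 5):.4f} times more events")
```

Even starting from 2 sigma, the price is a factor of 6.25 in the number of events, not an astronomical one.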

#### snail feedback (12) :

If you read the article, you might actually agree with it.

The problem is that most articles actually published accept 2 sigma evidence (P<0.05). That's the main thing the author objects to.

If articles could not be published without 5 sigma evidence, pretty clearly the author and the experts that he is channeling would be satisfied.

Hi Lubos

I am an academic who teaches empirical research to students. I would like to post the original article and your rebuttal for my students to read. Could you pls post a modified version which is sufficiently academic (ie. Without phrases such as 'Holy Cow'). :). Your columns are always entertaining, but sometimes may not be sufficiently academic for a classroom.

Dear Bob, I surely agree with you, as my text indicates - if you read it, I wrote your comment explicitly.

The only problem is that they are satisfied with a low confidence level. But that's not the impression I got from that other article: it wanted to throw away the whole method.

Dinesh, thanks for your interest. Couldn't you please do the editing yourself? This is a blog, not a textbook for a classroom that apparently tries to be as boring and dull as possible.

I give you all the permissions to fix the informalities. For example, "dumb" should be "irrational" while "Holy cow" should be "That seems as a rather bold statement." I am sure you can do it with others, too.

Thank you.

Hi Lubos,

I've been confused by some of the discussion on tropospheric temperature measurements, especially the Santer et al 2008 paper. As far as I understand it, the crux of that issue is to do with the way that statistics is used to interpret the dataset. Do you have any comments or analysis about that question from the perspective of your article?

As a professor of statistics I can say that it is common to see people bashing statistics. It is rare to see someone defending it! Thanks.

Larry Waserman
Dept of Statistics
Carnegie Mellon

Hi Lubos,

Off topic, but...

I would appreciate it if you could take a few minutes to look at this blog post...

http://climatesanity.wordpress.com/2010/03/21/rahmstorf-2009-off-the-mark-again-part-1/

It looks to me like choosing a temperature that looks like T=Cexp(-at/b) in Rahmstorf's 2009 sea level scare story model will always yield a sea level rise of 0. (note that b is negative, so this is an exponentially rising temperature)

This would seem to invalidate his relationship between sea level rise rate and temperature.

Best Regards,
Tom Moriarty

The article didn't bash statistics, it pointed out the big damn flaw with a lot of the work being done and published every day in the name of "science".

Statistics are wonderful so long as you are careful. Most of the studies being done these days tend to have really squishy data at the bottom. Used to be, you would see a number in the abstract. Apparently that isn't in fashion any more. Now we see "Statistically Significant". In order to get to the numbers you end up wading through the entire paper, then wondering if somebody left something out.

You don't need to throw statistics completely out, but you do have to be careful about its application.

If you are counting the number of protons hitting a target, you are probably still safe to use statistics.

If you are counting the number of people who get asthma in an area with increased pollution, on the other hand, finding statistical significance with an RR of 1.2 does not a tragedy make.

phillip_jr
The reason you aren't grasping what Santer does is that frequentist statistics should never be used for model outputs. The only papers on this subject have said that Bayesian stats should be used though they are still difficult to apply. The reason there are so few papers on this subject is that combining model outputs as if the combination made any more sense than any individual run is a pretty stupid idea in itself and without any foundation in reality.

In effect, Douglass et al. didn't need to do any stats on the paper because us modelers just compare every model run to the actual observed data and calculate a percentage error. Climate scientists don't do this because to them the model is better than the data. Hence the tendency to change the data in line with models.

Notwithstanding all of that, Santer's 3 sigma test fails anyway just by using up to date data. Steve McI tried to get a comment like that published but it was apparently too long and boring for the journal.

All science is based on measurement, and all measurements (excepting pure counting) involve an indeterminate error. Those errors propagate according to Gaussian statistics, and in fact the propagation of those errors demonstrates 1) whether the errors were truly indeterminate 2) the confidence of the measurements 3) and the confidence in those measurements in demonstrating the (null) hypothesis, assuming that has been formulated correctly around the measurements and their methods.

The application of those statistics to "model" results is meaningless, because there is no way to demonstrate that the errors PROVIDED BY THE COMPUTATION were indeterminate. There may have been a number of factors outside the control of the modeler that would make any "errors" of computation determinate, such as truncation, roundoff, insufficient parameters in a model to come to a meaningful result, etc and it is impossible to account for all the possibilities.

And that's the memo.

You and I seem to have read slightly different articles. His point was not that all statistics should be thrown out. I read that statistics is so seldom used appropriately by scientists and so poorly understood by nonstatisticians (even by most people who talk a good game of knowing proper use of stats) that the MISUSE of statistics leads to a lot of incorrect interpretations. He makes several points about how people use stats that I see in almost every scientific paper I read.

The point is not to throw out stats, but to not let the stats control your brain. The way the scientific establishment overuses bad stats is what needs to change and that is what the article was about to me.