Wednesday, March 24, 2010

Proliferation of wrong papers at 95% confidence level

Kilotons of breathtaking [unflattering noun referring to their IQ], mostly gathering around the environmentalist cult, claim that it is possible - or desirable - to build science upon 2-sigma observations.

If you don't know what it is, it is an observation where the "signal" is only twice as large as the "noise", using a proper definition of their magnitude - yet it is claimed that the "signal" is real and means something. The probability that it is not real and the deviation arose by chance is 5% or so. (In reality, it may be much higher because of various biases, but let's generously assume it's 5%.) The idealized two-sided Gaussian figure for exactly 2 sigma is actually about 4.6%, but let us use the conventional figure of 5%.
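For readers who want to check the sigma-to-probability conversion, here is a minimal sketch. It assumes purely Gaussian noise and a two-sided test, which is an idealization; the function name `two_sided_tail` is just an illustrative choice.

```python
import math


def two_sided_tail(n_sigma):
    """Chance that pure Gaussian noise deviates by more than n_sigma
    standard deviations in either direction (idealized assumption)."""
    return math.erfc(n_sigma / math.sqrt(2))


for n_sigma in (2, 3, 5):
    tail = two_sided_tail(n_sigma)
    print(f"{n_sigma} sigma: chance of a fluke = {tail:.3g}, "
          f"confidence = {1 - tail:.7f}")
```

Under these assumptions, 2 sigma comes out near 4.6%, 3 sigma near 0.27%, and 5 sigma near 6 in 10 million.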

So 5% of the statements claimed to be right because of statistical observations are wrong while 95% of them are right. Is it a good enough success rate to build science? Someone who is not familiar with science or rational thinking in general may think that it is good enough to make 95% of correct statements.

However, science is not about the search for completely isolated insights. The scientific insights, theories, and papers depend on each other. For example, a theory about the shrinking paths of the birds due to global warming depends on 3 observations about the bird behavior, 4 observations about the rise of temperatures, and so on.

Each of those depends on the union of several other papers, paleoclimatological reconstructions, the assumed validity of various computer models of the climate, and so on. Each climate model depends on 30 other assumptions. You need to rely on the work and conclusions by other scientists. In any science, this is a rule rather than an exception. Even Isaac Newton had to stand on the shoulders of giants.

Assume that your paper claims to be right at the 95% confidence level. But this figure only means that you are adding a 5% chance of an error: you must assume that the papers you're building upon are right. Some of them won't be - or they're at risk of being wrong.

A typical paper cites several dozen other papers. However, most of the citations are bogus: the new paper only depends on several older papers. The other papers are being cited but their results are not essential for your new paper. Let's be very modest and assume that your new paper - and every typical paper in your field - only depends on two additional papers written in the previous T years (imagine T=1 or T=2) that are also at a small risk of being wrong. The period of T years will be referred to as one generation (generation of papers).

In reality, a typical paper may depend on many more older papers and the propagation of the errors will be much faster, but let's just assume that it depends on two.

Now, let's assume that the paper is correct if the essential "ancestor" papers are correct, and if the new "added value" test is right. Of course, the conclusions of the newest paper may also be correct by "chance", despite the fact that the methodology or essential assumptions are wrong, but let's omit this whole branch of papers that are correct because the crucial errors cancel or because of complete chance. We're only interested in science where not only the results but also the arguments are at least qualitatively correct.

Let's use the symbol "P(t-T)" for the probability that a typical paper written T years earlier - the previous "generation" - is correct. We said that the new paper is correct if the two ancestor papers are correct and the new test works well - which has a 95% probability. You see that the probability that a paper of the new generation is correct equals
P(t) = 0.95 * P(t-T)^2
Let's assume that the first-generation papers didn't depend on anything. So they had the following success rate:
P(0) = 0.95
The second-generation papers, written T years later, have
P(T) = 0.95^3 = 0.857
The third-generation papers, written T years afterwards, have
P(2T) = 0.95 * P(T)^2 = 0.698
The fourth-generation papers, written 3T years after the initial papers, have
P(3T) = 0.95 * P(2T)^2 = 0.463
So these fourth-generation papers already have a higher chance of being wrong than right.
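The recurrence is easy to iterate by hand, but a few lines of code make it easier to play with the inputs. This is only a sketch of the simplified model described here (each paper needs all of its ancestors plus its own test to be right); the function and parameter names are illustrative.

```python
def generation_probabilities(test_confidence, ancestors, generations):
    """Return [P(0), P(T), P(2T), ...] for the simplified model
    P(t) = test_confidence * P(t - T) ** ancestors."""
    probs = [test_confidence]  # first generation depends on nothing
    for _ in range(generations - 1):
        probs.append(test_confidence * probs[-1] ** ancestors)
    return probs


two_sigma = generation_probabilities(0.95, ancestors=2, generations=4)
for n, p in enumerate(two_sigma):
    print(f"generation {n + 1}: P({n}T) = {p:.3f}")
# generation 1: P(0T) = 0.950
# generation 2: P(1T) = 0.857
# generation 3: P(2T) = 0.698
# generation 4: P(3T) = 0.463
```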

Because in reality, T can be as short as 1 or 2 years, it's clear that with these two-sigma "standards", your discipline deteriorates into complete noise and rubbish in (much) less than a decade. I really mean 10 years. In other words, you know that the logical arguments in science often (or usually) require four steps or so, so the four-floor model above is realistic and the valid papers become a minority.

This is clearly unacceptable. Note that my estimate of the dependence on the previous papers was very modest - just two previous papers were used. It was enough to see that 2-sigma standards can't be enough for a sustainable science - whatever the science studies.

But in fact, even 3 sigma fails to be enough. At least, it's surely not "safely" enough.

A 3-sigma result has a 99.7% confidence level, roughly a 1-in-300 chance of being wrong. Assume that a paper depends on 2 others. An analogous calculation of the validity of various generations of the papers gives
P(0) = 0.997
P(T) = 0.997^3 = 0.991
P(2T) = 0.997 * P(T)^2 = 0.979
P(3T) = 0.997 * P(2T)^2 = 0.956
P(4T) = 0.997 * P(3T)^2 = 0.911
P(5T) = ... = 0.828
P(6T) = ... = 0.683
P(7T) = ... = 0.465
Eight generations - which may still be within one decade - bring your scientific discipline into complete chaos in which most claims have nothing to do with the truth. If you assumed that you depended on 10 papers, it would be about 4 generations.
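To see how quickly the majority of papers turns wrong for other inputs, one can search for the first generation whose success probability drops below one half. The sketch below assumes the same simplified recurrence as in the text; the names and the `max_generations` guard are illustrative additions.

```python
def first_minority_generation(test_confidence, ancestors,
                              threshold=0.5, max_generations=1000):
    """Return the first generation (1 = papers that depend on nothing)
    whose probability of being correct is below `threshold`,
    or None if that does not happen within `max_generations`."""
    p = test_confidence
    generation = 1
    while p >= threshold:
        if generation >= max_generations:
            return None
        p = test_confidence * p ** ancestors  # one more floor of the pyramid
        generation += 1
    return generation


print(first_minority_generation(0.95, ancestors=2))    # 4
print(first_minority_generation(0.997, ancestors=2))   # 8
print(first_minority_generation(0.997, ancestors=10))  # 4
```

With ten essential ancestors per paper instead of two, even the 3-sigma standard gives out at the fourth generation in this toy model.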

Clearly, it's still unacceptable.

Mankind needed more than 5 generations of intense work before it degenerated into communist and socialist swine - otherwise it couldn't have developed the industrial civilization.

You need five sigma, and even with five sigma, you must be very certain that you're not building upon excessively long "trees" of arguments that depend on each other. You had better replace the other "5-sigma" papers you're depending upon by "10-sigma" or very "fundamental" papers.

Even with the 5-sigma standards, the calculation above could tell you that after 20 generations, the papers run amok. But as time goes by, the older papers are actually getting more certain. So some papers that seemed to be in the 10th generation - which would only have something like a 99.9% confidence level - actually become much more certain because the experimental evidence is either more accurate or it depends on a much shorter pyramid of older results.

Taming the infection

The simplified calculations above of course imply that the errors are propagating "exponentially", like an infection. And given the simplified assumptions, they do. But the calculation is idealized and the reality is not as bad. What you clearly need in science is to regulate any potential infection of this kind so that a better model replaces the hopeless infection in time.

It means that the time scale calculated from the algorithm above must be longer than the time scale after which the older results become "qualitatively" more certain than they were before (e.g. after which the confidence level in standard deviations doubles). If your scientific discipline doesn't have this property, it won't be able to resist the "infection of errors" because the latter proceeds exponentially.

If your confidence level is "P", a number slightly below 1, it can be seen that the probability of a valid paper in the G-th generation is something like "P^(2^G)" or so. For this to be safely above 50% or so, you need to take the logarithm and see that
-2^G * ln(P)
must be smaller than approximately one (so that its exponential is above 50% or so). It means that the number of generations after which the infection swallows your discipline is (not distinguishing "e" and "2" as the bases of the logarithm because it doesn't make much difference)
G = -ln(-ln(P))
Note that there are two logarithms embedded into one another. Just to be sure about the numbers,
-ln(-ln 0.95) = 2.97
-ln(-ln 0.997) = 5.81
-ln(-ln 0.9999994) = 14.33
Three generations of survival is clearly not enough for a viable scientific discipline because much more accurate "proofs" usually arrive after a much longer time than 2.97 times the separation between consecutive papers, i.e. 2.97 generations.
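The double logarithm is only a rough estimate, so it may help to compare it with an exact count in the same toy model. The sketch below uses the closed form P = p^(2^g - 1) for the g-th generation, which follows from iterating P(t) = p * P(t-T)^2 starting at P(0) = p; the function names are illustrative.

```python
import math


def double_log_estimate(p):
    """Rough survival estimate -ln(-ln p), natural logarithms throughout."""
    return -math.log(-math.log(p))


def exact_minority_generation(p, threshold=0.5):
    """First generation g (1 = no ancestors) with p**(2**g - 1) < threshold,
    assuming two essential ancestors per paper."""
    needed = 1 + math.log(threshold) / math.log(p)  # need 2**g > needed
    return math.floor(math.log2(needed)) + 1


for p in (0.95, 0.997, 0.9999994):
    print(f"p = {p}: estimate {double_log_estimate(p):.2f}, "
          f"exact generation {exact_minority_generation(p)}")
```

Because the exact count uses base 2 rather than base e, it comes out somewhat larger than the estimate (4, 8 and 21 generations for the three confidence levels), but the ordering and the qualitative conclusion are the same.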

Six generations - which is what you obtain from 3-sigma papers - is marginally enough.

However, 5-sigma confidence, with its 0.9999994 confidence level, gives you 14 generations of survival despite the infection which is more than enough to replace the older and convoluted arguments - assumptions or older papers you need - by more accurate or less convoluted ones.

Alternatively, you may claim that the "pyramids" of human knowledge and sensible theories that depend on each other never have more than 14 floors so that the "immunity" of the set of 5-sigma research papers against the pandemics of errors is sufficient.

However, you may see that the environmentalist sciences can be captured by an "ecological" model themselves. Because the reliance on previous 2-sigma papers is omnipresent in the discipline, you can see that by purely statistical arguments, the discipline would be inevitably plagued by an unstoppable infection of errors even if the people in it were honest and impartial.

Regardless of the character and interpretation of the hypotheses and theories, it's clear that a working scientific discipline requires at least the 5-sigma standards if its insights are going to be quasi-reliably reused in realistic, slightly longer chains of reasoning that can be as long as 6 steps or more.

People defending a 2-sigma science are loons, pseudointellectual weeds that are trying to infect not only their contaminated sub-world but all of science and all of modern civilization with a diarrhoea of bullshit.

And that's the memo.

Bonus: London Science Museum goes AGW neutral

The U.K. Times brings us some happy news: the London Science Museum goes AGW neutral, claiming that it has to respect the views of the people who disagree with the orthodoxy and those who remain unconvinced.

Also, the climate exhibition was renamed from "climate change" to "climate science" to remove the alarmist bias from the very title. This symbolic act is somewhat similar to the removal of the adjective "socialist" from the name of the Czecho(-)Slovak Republic in 1990.

Just a few months ago, the museum would ask its visitors to "Prove It", collecting votes to argue that the Copenhagen accord should be as tough as possible. As you know, the poll ended in a humiliating loss for the proponents of the climate panic and Copenhagen has been a blessed and total failure.

Even more importantly, the ClimateGate and the IPCC scandals have made it clear to pretty much everyone that the IPCC-linked scientists are not trustworthy and they can't boast any scientific integrity. So the people who lead the museum have actually learned their lesson. They have converged closer to the opinions of the staff e.g. at Pilsen's science museum/center, Techmania, which are somewhat more clearly against the AGW orthodoxy. I was there yesterday - it's a great place!

Via Climate Depot.

1. Civilizations thrive, then merely survive, and eventually die when they choke on their own bullshit.

2. Having worked in industrial quality assurance for many years, I am surprised at the demand for 5-sigma assurance.

Traditionally, QA uses "warning" at 2-sigma shifts of a mean value (e.g. mean diameter of 10 samples of a circular widget taken from a widget-producing process). 3-sigma is taken as an "action" limit - stop the process and seek a cause of the shift in the mean.

These limits have been used since Walter Shewhart first introduced Statistical Quality Control at Western Electric in the 1920s. See http://www.amazon.co.uk/Introduction-Statistical-Quality-Control-Montgomery/dp/0471656313/ref=sr_1_1?ie=UTF8&s=books&qid=1269460503&sr=8-1

If 5-sigma limits had been used in industrial processes, then the level of quality and reliability we see in modern mechanical and electronic equipment would be considerably lower.
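The warning and action limits described in this comment can be sketched as follows. This is a simplified illustration, not a full Shewhart chart: it assumes the process standard deviation is already known, and the widget numbers (25.0 mm target, 0.1 mm sigma) are made up for the example.

```python
import math
from statistics import mean


def control_limits(target_mean, process_sigma, sample_size):
    """Warning (2-sigma) and action (3-sigma) limits for the mean of a
    sample, using the standard error sigma / sqrt(n)."""
    sem = process_sigma / math.sqrt(sample_size)
    return {
        "warning": (target_mean - 2 * sem, target_mean + 2 * sem),
        "action": (target_mean - 3 * sem, target_mean + 3 * sem),
    }


def classify(sample, limits):
    """Return 'action', 'warning' or 'in control' for one sample."""
    m = mean(sample)
    low_action, high_action = limits["action"]
    low_warning, high_warning = limits["warning"]
    if m < low_action or m > high_action:
        return "action"  # stop the process and look for a cause
    if m < low_warning or m > high_warning:
        return "warning"
    return "in control"


# Hypothetical widget: 25.0 mm target diameter, 0.1 mm process sigma.
limits = control_limits(target_mean=25.0, process_sigma=0.1, sample_size=10)
print(classify([25.08] * 10, limits))  # warning
```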

3. There are other effects though.

If you are an optimist, you would hope that correct papers would be used as the basis of a new paper more frequently than incorrect papers, because correct papers are more in accord with reality and therefore more fruitful.

If you are a pessimist, you would worry that incorrect papers will be enriched as bases for new papers because flashy, attention-getting claims are more likely to be incorrect.

I'm not sure which of these effects is larger.

5. Brilliant as always. I'm a regular reader but have not commented before.
Soft sciences, like medicine, climate etc, use 0.05 because it is too 'hard' to reach 0.01. However, as a medical student in the 60s we were taught 0.05 was the requirement for further investigation only. How times have changed!
I suspect that many of the problems of the 'science' of AGW are due more to ignorance and hubris than to conspiracy. We all know from school that the really clever kids did physics and maths, not those who did medicine, geology or, heaven help, law or politics. TG Watkins

6. Eh eh eh.

I loved this. The amount of 97.5% (I always was told that 95% is only for the social 'sciences') confidence level excrement published in respectable peer-reviewed journals (and I don't mean Nature, New Scientist and the like) is amazing. I can only thank Zeus I didn't have to read it regularly. (Further good news to me is that I retired this month, so I no longer have to put up with the classic academic bull.)

When you think that people actually are treated with that "science" in a hospital or a clinic, you can only despair.

But most of what goes to the common people doesn't even reach that level of respectability. Recent admirable example, the opinion of some respectable UK doctors who intend to promote the prohibition of smoking in cars and open spaces.

So, when you ponder that laws are made with that lack of confidence level, you despair further.

As I said, people are treated in hospitals on this sort of statistical opinions. Stay healthy.

7. Oops. Forgot. You don't even have to worry further than generation 2 or something.

Nobody reads further than the last two years' abstracts. Yes, abstracts. Ever.

If you published something four or five years ago, it's as good as nothing. It vanished. I suppose you can even submit it again and no one will notice. (I'm talking about medicine.)

8. Dear Jeff, right, that's what civilizations do over the centuries. ;-)

Mike: I am a pessimist - in fields where the required confidence is low to start with. Wrong papers are more "exciting".

In other fields, one may be more optimistic because there are often independent papers that effectively improve the confidence level.

What's important is for the freedom to cherry-pick to be (much) greater than the risk of a false positive. If the confidence level is 95%, i.e. a 5% risk of a false positive, it's enough to repeat about 14 variations of the same experiment - or its interpretation - to be "more likely than not" to achieve the predetermined result; even 10 variations already give a 40% chance. With the possibility to cherry-pick the research with "interesting" conclusions only, on a 1-per-10 basis from the candidates, a 95% confidence level really means something like a 60% confidence level.

For 5 sigma, it's clearly not possible to cherry-pick in this way because you would need to find 1 million natural variations of your methodology and throw away 999,999/1,000,000 of your work before you get something "interesting". No one does such things because he would know damn well that he's cheating.
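The cherry-picking arithmetic can be checked directly. The sketch below assumes the variations are statistically independent, which real "variations of the same experiment" usually are not, so treat the numbers as an idealization; the function names are illustrative.

```python
import math


def chance_of_false_positive(tries, alpha=0.05):
    """Chance that at least one of `tries` independent null experiments
    passes a test whose false-positive risk is `alpha`."""
    return 1 - (1 - alpha) ** tries


def tries_for_even_odds(alpha=0.05):
    """Smallest number of independent tries that makes at least one
    false positive more likely than not."""
    return math.ceil(math.log(0.5) / math.log(1 - alpha))


print(f"{chance_of_false_positive(10):.3f}")  # 0.401
print(tries_for_even_odds(0.05))              # 14
print(tries_for_even_odds(6e-7))              # over a million tries at 5 sigma
```

At the 95% level, ten tries give about a 40% chance of a spurious "result" and 14 tries push it past one half, while at 5 sigma more than a million tries would be needed.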

9. Dear Toby, your "analogy" is not a real analogy. The "standard deviation" means something completely different than in the discussion of the scientific claims.

If the theories and observations disagree by "10 sigma", it really means "qualitatively" that the hypothesis is ruled out.

But if your quality control gives you even 10 sigma, it still doesn't mean much because the product (unlike the hypothesis) may still be operational when it's 10 sigma off.

It's just a completely different sigma, and the rejection means something else than right/wrong, too.

On Tamino's blog, Calabi made a correct comparison with the probability that a house will collapse - that's analogous to a hypothesis being wrong. But your "analogous" property for the products is simply not analogous.

Clearly, if there were a 1:300 (three sigma) risk that a component of a space shuttle is malfunctioning, it would be pretty much guaranteed that the space shuttle will "fail", too. You just don't understand what we're counting here.

Best wishes
Lubos

10. I'd say that this falls into the category of silly application of mathematics. The fundamental assumption is that the conclusion of a typical paper depends upon a chain of individual results, each of which was significant only at the p = 0.05 level. I don't think I've ever seen a paper for which that was true, and I've certainly never written one. I don't know anybody who works on the assumption that a single result at p = 0.05 is correct, even if all the statistics are perfectly correct. I would typically regard the evidence for a conclusion as strong if it had been independently replicated 3 times, so in the unlikely event that each of them had just barely met the p less than 0.05 criterion, that would give a net p value less than (0.05)^3. But even that is understating it, because if in 3 different studies nobody had managed to attain a significance level better than p = 0.05, I'd probably still be somewhat skeptical. Moreover, it would take more than replication of a single experiment to convince me that a conclusion is correct, because to be really convincing there needs to be convergent evidence obtained by different methodologies, all pointing at the same conclusion.

The suggestion that reliability of science can be improved by setting a more stringent p value standard is fundamentally misguided, because the most common reason why results turn out to be wrong in science is not because of false positives due to measurement errors, but rather because of incorrect assumptions--there turns out to be some other important variable that was unknown and uncontrolled, perhaps, or there was a fundamental error in the experimental design. Putting a lot of effort into obtaining results with lower p values would only slow down the progress of science, and hinder what really advances science--approaching questions from a variety of different directions.

In addition, setting a more stringent p value standard would tend to increase the problem of false negatives. Far more common than overvaluing a result obtained with p = 0.05 is the problem of undervaluing a result obtained with p = 0.1, because "no significant effect" can easily be mistaken for "no effect." A false positive usually gets corrected by follow-up studies, but people are much less likely to follow-up a negative result