Wednesday, January 23, 2013

Medical literature: do wrong results prevail?

A remotely related link: Scientific American surprisingly ran a story on the Liberals' War on Science. The content isn't quite accurate but it's impressive that this troubled magazine could have challenged the politically correct myth that the leftists are inherently pro-science. Hat tip: B Chimp Yen.
The Physics arXiv Blog wrote a comprehensible review of a new paper,
Empirical estimates suggest most published medical research is true,
which studies the same question as the 2005 paper by John Ioannidis that claimed that most published biomedical results are wrong – false positives.

Mathematician Leah Jager and her collaborator Jeffrey Leek now use similar techniques to estimate that among 77,000 papers in an ensemble they picked, the rate of false positives is about 14%. That's nearly 3 times higher than the naively expected percentage of false positives – 5% because the maximal tolerable \(p\)-value is 0.05 – but it's more than 3 times lower than what would be needed for the wrong results to represent a majority.

Even if I reproduced, verified, and/or corrected all steps in the two methodologies and all subtleties that may invalidate them, I wouldn't quite be sure whether the actual percentage of wrong biomedical papers exceeds 50%, as Ioannidis claimed in 2005, or whether it's significant but manageable.

But what the two papers – as well as impartial thought – agree about is the general principle that if the tolerated confidence level becomes too "soft", it's rather likely that various biases increase the actual rate of false positives well above the declared \(p\)-value threshold. When things are really bad, there is really nothing that could stop a discipline from producing papers most of which are wrong.

We don't know for sure whether it's the case in the biomedical research but I am sure it's the case in the "man-made climate change" subdiscipline of climatology.

It's important to realize that the \(p\)-value can't be naively interpreted as the actual percentage of wrong papers in the literature. Imagine that you start to test \[

N=N_{\rm OK}+N_{\rm bad}

\] hypotheses about medicine (claims about some significant impact of a cure or a harmful substance on the health or various diseases, most typically) which – we know but the actual researchers don't know – are composed of \(N_{\rm OK}\) valid claims and \(N_{\rm bad}\) invalid claims. What will happen in the tests?

Well, most of the claims that are OK will turn out to be OK, usually because the actual signal is strong enough that it will show up as a 2-sigma or larger excess anyway. Some of the true signals will fail to be found because of bad luck, but their fraction is negligible in practice; its exact value depends on what we mean by "significant impact" and on how large the experimenters' datasets are relative to what is needed. So let us assume that almost all \(N_{\rm OK}\) claims are detected as OK.

The situation is different for the wrong hypotheses. They have no reason to show a signal, so most of them will display themselves as "wrong" in the experiments, too. However, by pure chance, it is inevitable that a percentage, namely 5%, will show some excess that we know is a fluke but that is large enough for the hypothetical impact to seem real at the desired 95% confidence level (or higher). So \(0.05 N_{\rm bad}\) bad hypotheses will be tested as "right".

What I think is necessary for the bad development in the literature is the publication bias, namely the higher chance to publish the results that are "interesting" i.e. "positive". If the inequality\[

p_{\rm positive} \cdot 0.05 N_{\rm bad} \gt p_{\rm positive}\cdot N_{\rm OK} + p_{\rm negative}\cdot 0.95 N_{\rm bad}

\] is satisfied, then most of the literature will be wrong. Here, \(p_{\rm positive}\) and \(p_{\rm negative}\) are the probabilities that a paper is published if the research confirms or fails to confirm the hypothetical effect, respectively. The left-hand side counts the published false positives; the right-hand side counts the published true positives plus the published, correctly negative results. The inequality can only be satisfied if a sufficient number of wrong hypotheses is tested and if the publication bias (the ratio \(p_{\rm positive}/p_{\rm negative}\)) exceeds a factor of 19.
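The inequality is easy to check numerically. What follows is only a minimal Python sketch, not something taken from either paper: the function names `wrong_fraction` and `critical_bias` and the sample numbers are illustrative assumptions, and, like the text above, it assumes that every valid hypothesis is detected and that researchers make no other errors.

```python
def wrong_fraction(n_ok, n_bad, p_pos, p_neg, alpha=0.05):
    """Fraction of published papers that are wrong (published false positives).

    n_ok, n_bad  -- numbers of valid and invalid hypotheses being tested
    p_pos, p_neg -- publication probabilities for positive / negative results
    alpha        -- tolerated p-value threshold (0.05 in the text)
    """
    wrong = p_pos * alpha * n_bad                        # flukes that get published
    right = p_pos * n_ok + p_neg * (1 - alpha) * n_bad   # true positives + correct negatives
    return wrong / (wrong + right)


def critical_bias(n_ok, n_bad, alpha=0.05):
    """Ratio p_pos / p_neg above which wrong papers become a majority.

    Obtained by solving the inequality for p_pos / p_neg.
    Returns None when no publication bias, however large, is sufficient.
    """
    margin = alpha * n_bad - n_ok
    if margin <= 0:
        return None
    return (1 - alpha) * n_bad / margin


if __name__ == "__main__":
    # Hypothetical numbers, chosen only for illustration.
    print(critical_bias(0, 1000))     # limiting case: no valid hypotheses at all
    print(critical_bias(20, 1000))    # a few valid hypotheses raise the threshold
    print(critical_bias(100, 1000))   # too many valid hypotheses: None
    print(wrong_fraction(20, 1000, p_pos=1.0, p_neg=0.01))
```

In this toy model the threshold of 19 is the limiting case \(N_{\rm OK}\to 0\). Once \(N_{\rm OK}\geq 0.05 N_{\rm bad}\), no publication bias alone can make the false positives a majority, which is why "a sufficient number of wrong hypotheses" is a separate condition.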

This ratio seems very high and you could think that the literature must be fine. But it's not too high and this calculation of course assumes that the researchers don't make any other, real errors in their work. Ioannidis tried to incorporate some other possible sources of error in his estimate of the percentage of wrong papers. The effect I would emphasize – not sure whether he did – is the "infectious propagation of wrongness" through the literature, namely the fact that a new paper relies on (and totally believes in the validity of) many previous results, a fraction of which is inevitably wrong as well, and this mutual dependence likely increases the percentage of the wrong papers.

I feel it's not really possible to get a reliable estimate of the percentage of the wrong papers without knowing how many wrong papers there actually are in a large enough, representative ensemble – i.e. without testing the validity of individual papers. But given the biases that may easily send a discipline out of control, I do think that the biomedical research and other disciplines should adopt the particle physicists' 5-sigma standards for a discovery.

People in the softer disciplines often protest that it would be too expensive to make such tests, and so on. But this claim is ludicrous. Note that \(5/2=2.5\) and \(2.5^2=6.25\) so you only need about 6 times larger datasets to increase 2-sigma signals (\(p=0.05\) or so) to 5-sigma signals (\(p=0.0000003\) or so). This is such a dramatic increase of the reliability of the results – and reduction of the false positive rate and the risk that this rate actually exceeds 50% in your whole literature and the "wrongness infection" will spiral out of control – that anyone who has some passion for the truth should favor the transition to the 5-sigma standards.
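The numbers in this paragraph can be reproduced with the Gaussian tail formula. This is a small Python sketch under two stated assumptions: the test statistic is normally distributed, and for a fixed effect size the significance grows like the square root of the dataset size. The function names are illustrative only.

```python
import math


def p_one_sided(n_sigma):
    """Probability that a Gaussian fluctuation exceeds +n_sigma."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2))


def p_two_sided(n_sigma):
    """Probability that a Gaussian fluctuation exceeds n_sigma in either direction."""
    return math.erfc(n_sigma / math.sqrt(2))


def dataset_factor(sigma_from, sigma_to):
    """How many times more data is needed, assuming significance ~ sqrt(N)."""
    return (sigma_to / sigma_from) ** 2


if __name__ == "__main__":
    print(f"2 sigma, two-sided: p = {p_two_sided(2):.4f}")
    print(f"5 sigma, one-sided: p = {p_one_sided(5):.2e}")
    print(f"dataset factor 2 -> 5 sigma: {dataset_factor(2, 5)}")
```

The "\(p=0.05\) or so" quoted for 2 sigma corresponds to the two-sided tail (about 0.046), while the \(p\approx 0.0000003\) quoted for 5 sigma corresponds to the one-sided tail that is conventional in particle physics.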

Because papers in science depend on an increasing number of previous results and each of the previous results has some chance to be a "bad apple", the risk that the new papers are wrong is increasing, too. Even by their method, Jager and Leek found some increasing trend in the rate of false positives. This unfortunate but inevitable trend should be wrestled with by gradually increasing the required confidence level of the results. Alternatively, all papers should do the hard job of reducing their claimed confidence in their results by incorporating a calculation of the odds that the paper is invalidated because some of the previous insights it relies upon are invalid.
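As a rough illustration of the last suggestion, here is a toy Python calculation of how a paper's claimed confidence would shrink once its dependencies are taken into account. It rests on strong assumptions that are not in the papers discussed: the previous results fail independently of each other, and a single invalid dependency fully invalidates the new paper. The function name and the sample inputs are hypothetical.

```python
def effective_confidence(own_confidence, prior_error_rate, n_dependencies):
    """Confidence in a paper after discounting for possibly wrong dependencies.

    own_confidence   -- confidence claimed by the paper itself (e.g. 0.95)
    prior_error_rate -- assumed probability that any one cited result is wrong
    n_dependencies   -- number of earlier results the paper truly relies upon
    """
    # Probability that every dependency is valid, assuming independence.
    all_dependencies_ok = (1 - prior_error_rate) ** n_dependencies
    return own_confidence * all_dependencies_ok


if __name__ == "__main__":
    # Illustration: 95% own confidence, the 14% error rate quoted above,
    # and 0 to 5 essential dependencies.
    for k in range(6):
        print(k, round(effective_confidence(0.95, 0.14, k), 3))
```

Under these assumptions the confidence decays geometrically with the number of essential dependencies, which is the "wrongness infection" in its crudest form.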


  1. Bayesians like William Briggs regard the p-value business as patently absurd and reject the entire frequentist approach to stats.

    If he is right, what does that say about particle physics, which appears to me to be fundamentally frequentist in its approach? Is 6-sigma a delusion?

  2. Apologies, I have no clue what you're talking about. The concept of p-value appears both in the frequentist and Bayesian approaches to probability theory and there's nothing unusual about 6-sigma confidence - for example, this was the confidence of each LHC Collaboration about the existence of a new Higgs-like particle in Fall 2012.

  3. I completely agree that the 2 sigma test is ludicrous in biomedical research. It should at the very least be 3.5 or thereabouts. (Keep in mind it only takes a 2 sigma signal to falsify a claim.) This matters far more here than in particle physics, because here we are literally talking about arms, legs, lives and other important appendages that people seem to like.

  4. Pharmaceutical companies can also jeopardize their market credibility if they turn a blind eye to "bad apples" in their tests. This could give your zombies a volley of blows ;-)

  5. How about a p value in a medical paper of \(7\times 10^{-240}\)? I had never seen one this small until I read this paper on "Genetic determinants of plasma triglycerides."
    That's the confidence level that these researchers have that a mutation in a particular genetic SNP is related to plasma triglyceride levels. I wonder how they came up with that?

  6. Right. They usually say that *because* arms, legs, lives, and appendages are at stake, they can't *wait* for any certainty. But this argument is wrong because they mix up the waiting of a researcher with the waiting of a first-aid medic. The former *should* wait for a year so that the latter doesn't use ineffective, perhaps superstitious methods.

  7. Apparently, I have seriously misunderstood Briggs.

    Of course, it has always seemed to me that Bayesians become frequentists when the first data set shows up.

  8. Way back in the 60's Jack Cohen did a study on abnormal psychology studies and concluded it was impossible to obtain so many abnormal results, though the authors of DSM 5 may disagree. :) There is a serious problem with statistics in biomedicine. McCloskey, an economist, has written an interesting paper on this. Nice cheeky title too.

    The Unreasonable Ineffectiveness of Fisherian "Tests" in Biology, and Especially in Medicine

  9. A certain irony in this recent paper, they appear to have made their own mistakes. See: