## Saturday, March 23, 2019

### A strange "letter against statistical significance"

Anton wanted me to react to
Scientists rise up against statistical significance,
a letter written by 3 people and signed by 800 others (which may look high on the street but it's really an insignificant fraction of similar or better "scientists" in the world – surely millions). Two of the three authors wrote a similar manifesto in a Nature subjournal in 2017. The signatories mostly do things like psychology, human behavior, epidemiology – mostly soft sciences. I see only 4 signatories with some "physics" on their lines and 2 of them are "biophysicists".

First, I found that text to be largely incoherent, which indicates that the authors' thinking isn't really penetrating. There isn't any sequence of at least three sentences that I could fully subscribe to. If there is a seed of a possibly valid point, it's always conflated with some fuzzy negative attitudes to the very existence of "statistical significance" and I think that no competent scientist could agree with those assertions in their entirety.

Statistical significance may be misunderstood and used in incorrect sentences, including fallacies of frequently repeated types (I will discuss those later) and in this sense, it may be "abused", but the same is true for any other tool or concept in science (and outside science). One may "abuse" the wave function, quantum gravity, a doublet, a microscope, or a cucumber, too, and this website is full of clarifications of the abuses of most of these notions. But just because people abuse these things doesn't mean that we may or should throw the concepts (and gadgets) into the trash bin.

When it comes to the description of the "frequent abuse of statistical significance", I don't see a statistically significant positive correlation between their comments and my views – and the correlation is probably negative although I am not totally certain whether that correlation is statistically significant. ;-)

Clearly, I must start with this assertion that will also be the punch line of this blog post:
Sciences that have experimental portions and that are "hard sciences" at least to some extent simply cannot work without the concept.

A proof of why it's essential: All of science is about the search for the truth. One starts with guessing a hypothesis and testing it. Whether a hypothesis succeeds in describing data has to be determined. The process is known as hypothesis testing. The result of that test has to be quantitative. It's called the $$p$$-value (or similar, more advanced quantities). The term "statistical significance" is nothing else than a human name for a $$p$$-value or a qualitative description of whether the $$p$$-value is low enough for the hypothesis to get a passing grade. The very existence of science is really connected with the existence of the concept of statistical significance although a few centuries ago, the significance often used to be so high or low that the concept wasn't discussed explicitly at all.

This is a mostly theoretical physics blog but there are hundreds of comments about 3-sigma this and 4-sigma that. You couldn't really express these ideas "totally differently" (except for switching from sigmas to $$p$$-values or using synonyms). We simply need to quantify how reasonable it is to interpret an experiment as an experiment in which the Standard Model has apparently failed.

You may click on Statistical significance to see that Wikipedia provides us with a perfectly sound and comprehensible definition – which doesn't indicate that there's anything controversial about the concept itself. A statistically significant outcome is one that is unlikely to emerge according to the null hypothesis. That's why such a result makes it likely that there's something beyond the null hypothesis. This kind of interpretation of the empirical data is the building block of almost all the reasoning in quantitative enough empirical sciences!
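To make the definition concrete, here is a minimal sketch of a $$p$$-value for a toy null hypothesis. The coin-flip setup, the function name `binomial_p_value`, and the numbers 60, 80 and 100 are illustrative assumptions, not anything taken from the letter or from Wikipedia.

```python
from math import comb

def binomial_p_value(successes: int, trials: int, p_null: float = 0.5) -> float:
    """One-sided p-value: probability, under the null hypothesis, of seeing
    at least `successes` successes in `trials` independent trials."""
    return sum(
        comb(trials, k) * p_null**k * (1.0 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical data: heads counted in 100 flips of a coin assumed fair under the null.
p_mild = binomial_p_value(60, 100)    # a modest excess of heads
p_strong = binomial_p_value(80, 100)  # an excess the fair-coin null can barely produce

print(f"60 heads out of 100: p = {p_mild:.4f}")
print(f"80 heads out of 100: p = {p_strong:.2e}")
```

The smaller the number, the less plausible it is that the null hypothesis alone produced the data; where to put the threshold is a convention.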

The letter has been summarized under some real clickbait titles e.g. It’s time to talk about ditching statistical significance and We call for the entire concept of statistical significance to be abandoned (both have also appeared in Nature!). Well, these two titles seem to significantly contradict a sentence in the letter, "We are not calling for a ban on P values". Within a natural enough protocol, the significant contradiction could almost certainly be shown to be statistically significant, too. ;-) And yesterday, three days after the original "petition", Nature has also published more sensible responses to the letter such as
Retiring statistical significance would give bias a free pass (exactly, Mr John Ioannidis)

Raise the bar rather than retire significance (exactly, Mr Valen Johnson)

Retire significance, but still test hypotheses (well... Ms Haaf et al.)

OK, before I start to discuss the particular fallacies, abuses, non-fallacies, and non-abuses, let me say that I still don't quite understand where the authors are coming from. What is their "scientific culture", what is the real "motivation" of the letter? Just to be sure, it's possible that they come from different cultures and have different motivations.

The three authors, Valentin Amrhein, Sander Greenland, and Blake McShane, study birds' singing, epidemiology, and statistical methodology in marketing, respectively. I tried to read some abstracts of the bird spying Gentleman. There are comments like "birds sing when they touch the mate" etc. Clearly, these papers don't contain much real maths and I have real doubts whether the author has the capability of dealing with the statistical concepts quantitatively. The other two authors seem a bit more mathematically trained but I don't want to go into details.

Needless to say, it's true that you don't really need sophisticated statistical tools to listen to the birds (especially if you want to enjoy the sound just as the music to your ears) – and even say some things about the singing which are reliable enough. Some effects are "obvious" and statistical methodologies may be an excessively powerful weapon in some contexts. But at some level of rigor which is really needed in harder disciplines than "bird songs", the statistical treatment simply is essential for science.

The letter starts with this short paragraph:
When was the last time you heard a seminar speaker claim there was ‘no difference’ between two groups because the difference was ‘statistically non-significant’?

Indeed, that's an omnipresent comment during scientific seminars. What's your problem with it?

I can't be sure about their motivation but I think that I can feel where the wind is blowing from. They don't like papers that conclude that some correlation isn't supported by experimental tests! That this is their main problem is confirmed by many other sentences in the article. For example, the text inside one figure says:
Wrong interpretations: An analysis of 791 articles across 5 journals found that around half mistakenly assume non-significance means no effect.

Oh, really? So is it a "wrong interpretation" that makes one-half of papers wrong just because of that statistical fallacy? Does non-significance mean no effect? Not strictly. But non-significance of the outcome of a research paper means that the paper hasn't found any real evidence for the effect so it remains consistent with the data to assume that the effect doesn't exist. And because there has been a pre-existing chance that the research would find a statistically significant effect, the null result increases the probability that the effect is non-existent, indeed. This is roughly what they call the "wrong interpretation" although it's clearly the correct interpretation. And it seems likely that their summary is just a wrong interpretation of the actual interpretation in those papers.
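The claim that a null result raises the probability of "no effect" is just Bayes' theorem. Here is a minimal sketch, assuming a 50:50 prior, a 5% false-positive rate and 80% power; all three numbers and the function name are hypothetical choices for illustration only.

```python
def prob_no_effect_given_null_result(prior_no_effect: float,
                                     alpha: float,
                                     power: float) -> float:
    """Posterior probability that the effect is absent after a non-significant result.

    alpha: chance of a significant result when there is no effect
    power: chance of a significant result when the effect exists
    """
    nonsig_and_no_effect = (1.0 - alpha) * prior_no_effect
    nonsig_and_effect = (1.0 - power) * (1.0 - prior_no_effect)
    return nonsig_and_no_effect / (nonsig_and_no_effect + nonsig_and_effect)

# Assumed inputs: 50:50 prior, 5% false-positive rate, 80% power.
posterior = prob_no_effect_given_null_result(0.5, alpha=0.05, power=0.80)

# A study that had no real chance of seeing the effect (power equal to alpha)
# teaches us nothing: the posterior stays at the prior.
uninformative = prob_no_effect_given_null_result(0.5, alpha=0.05, power=0.05)

print(f"posterior after a well-powered null result: {posterior:.3f}")
print(f"posterior after an uninformative null result: {uninformative:.3f}")
```

The size of the update depends on how large the "pre-existing chance" of a significant finding was, which is why the power of the study matters.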

Why would they have trouble with the sentence "the effect was insignificant"? Well, some Twitter users provide us with the obvious possible answer:

Right. It seems almost impossible to avoid the conclusion that this is their actual agenda. They want to replace science by "feels" and give the biases a free pass, as one of the responses in Nature said in the very title. They may want to fill the scientific literature with thousands of fake discoveries, thousands of false positives, and make it impossible for scientists to publish opposing results.

You know, for example, there may be papers that conclude:
We have investigated the effect of a large number of petrol cars on the frequency and melodic diversity of the bird songs in our country and we have found no significant effect.

Or:

We have investigated the effect of the regular drinking of Gatorade (or Brawndo) on the ability to complete a physics PhD and we have found no significant influence.

Some "researcher" who has previously determined (maybe by his or her feelings or by being ideologically indoctrinated) that petrol cars are bad for the bird songs (and, similarly, a manager or stockholder of the Gatorade Corporation or even a TV viewer "persuaded" by the deceitful Gatorade ads; just to be sure, only Kofola, not Gatorade, really works because it has 50% more caffeine than Coke and that's a statistically significant difference) could find the result of such a paper inconvenient. And by banning the concept of "statistical significance", the conclusion could be banned, too. Well, indeed, that could make superstitions such as "cars are bad for bird songs" impossible to refute in scientific journals. More frustratingly, it would also end the scientific character of the research of the bird songs (and thousands of other topics). The study above may probably be repeated with the same result, it is correct, and the interpretation is correct, too! It's too bad if you don't like it because this is what the science says about that question.

Amrhein would probably call this result a "hyped claim" that "dismiss[es] possibly crucial effects", as the subtitle of the letter says. But on the contrary, the result of this form is a result that debunks some hype, in this case, hype about the connection between cars and bird songs. And the well-defined "crucial effects" were rightfully "dismissed" because their existence wasn't confirmed by the experimental data!

A song from a Czech fairy-tale that was shot in the 1980s. "Statistics is a boring thing but it produces valuable data; let us not get knocked down, statistics will quantify it for us" the somewhat out-of-tune boy-statistics-wizard explains while he statistically analyses the motion of the Sun and strategies to defeat a dragon. He takes the fraction of dragons facing princes who have survived up to the next breakfast in various fairy-tales into account. Also, too bad that non-Czechs can't understand this hilarious sketch about statistics that a man is trying to explain to a moron (comedian Felix Holzmann, in the glasses) using real world examples.

There are some other ideas in the letter. It's always a bit unclear "what they really mean" and whether "they are really right" or "they are wrong and the criticized papers are right". Again, I think it's mostly the latter – the authors of the letter are mostly wrong and the authors of the criticized papers are mostly right.

Let me wrap up with a list of possible fallacies and abuses related to the concept of statistical significance, starting with those that they could have meant but they weren't too clear. My bold face titles will be correct propositions (according to your humble correspondent).

**Statistical significance is a fuzzy characteristic that doesn't divide things into "black" and "white"**

They came pretty close to this point but I just don't understand why they can't express it clearly and crisply, without vague references to something else that is partially wrong. OK, "statistically significant" and "statistically insignificant" claims aren't separated by a sharp, thick, impenetrable, canonical, and unmovable wall.

When we divide the results into two classes, we do so according to the $$p$$-value that is either larger than or smaller than a conventional "decisive threshold".

Note that statistical significance is a somewhat elaborate concept within the probabilistic calculus and science cannot work without probabilities which are continuous variables. As Feynman said in the context of the flying saucers, it is scientific to only say what is more likely and what is less likely.

I want to brag that Czech, Slovak, Polish, and probably a few other Slavic languages have a nice word for "probability", namely "pravděpodobnost". Note that it's a bit longer than the already long "probability" but it's clever as you see if you deconstruct it. It means "pravdě podobnost". "Pravdě" means "to the truth", "podobnost" is "similarity". So instead of "probability", we say "similarity to the truth". It's so logical, isn't it? You could try to use it as well, it's great. Maybe you would gain a whole new Slavic dimension to look at the Universe!

**The number of sigmas required in softer sciences is too low, they should be harder**

I've discussed it many times. It's terrible that many sciences are satisfied with $$p\lt 0.05$$ i.e. with 2-sigma evidence. In particle physics, we need $$p\lt 0.000001$$ or so, a 5-sigma proof. This harder standard eliminates a huge majority of the "false positives" that the softer sciences are full of.
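The translation between sigmas and $$p$$-values can be sketched as follows, assuming a Gaussian distribution of the test statistic. The helper names are mine, and whether one quotes the one-sided or the two-sided tail is a convention that differs between fields.

```python
from math import erfc, sqrt

def one_sided_p(n_sigma: float) -> float:
    """Probability that a standard normal variable exceeds +n_sigma."""
    return 0.5 * erfc(n_sigma / sqrt(2.0))

def two_sided_p(n_sigma: float) -> float:
    """Probability that a standard normal variable deviates by more than n_sigma
    in either direction."""
    return erfc(n_sigma / sqrt(2.0))

for n_sigma in (2, 3, 5):
    print(f"{n_sigma} sigma: one-sided p = {one_sided_p(n_sigma):.2e}, "
          f"two-sided p = {two_sided_p(n_sigma):.2e}")
```

With the two-sided convention, 2 sigma sits close to the familiar 0.05; the 5-sigma tail is a few times $$10^{-7}$$ in either convention, i.e. the "0.000001 or so" quoted above.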

And it doesn't make proofs of genuine, important new effects too much harder. If the error is mostly statistical, the transition from 2 sigma to 5 sigma only requires increasing the sample by a factor of (5/2) squared, i.e. by a factor of 6.25 or so. By using six-fold larger samples, the disciplines could upgrade their degree of "certainty" from "1 in 20" to "1 in 1 million". Isn't it a great deal?
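The arithmetic behind the factor of 6.25 is the usual square-root law: for a purely statistical error, the significance of a fixed effect grows like the square root of the sample size. A minimal sketch, with a hypothetical helper name:

```python
def sample_size_factor(current_sigma: float, target_sigma: float) -> float:
    """Factor by which the sample must grow to lift a fixed effect from
    current_sigma to target_sigma, assuming the error is purely statistical
    so that significance scales like sqrt(sample size)."""
    return (target_sigma / current_sigma) ** 2

factor = sample_size_factor(2.0, 5.0)
print(f"2 sigma -> 5 sigma needs a sample {factor} times larger")
```

Systematic errors don't shrink with the sample size, so this estimate only holds under the stated assumption.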

I am afraid that these soft sciences are full of people who enjoy the freedom to produce bogus results and superstitions and the soft definition of the statistical significance, $$p\lt 0.05$$, makes it easier. Because it's possible that most people in many disciplines don't actually want to find the truth – they are doing activism or promoting some products or cures of some sort – they would vote against the switch to 5 sigma as the default standard for a discovery.

**Input parameters should be as precise as possible, instead of using "statistically indistinguishable" different ones**

They may have come close to this point as well but they have failed to make it, too. You know, according to some analysis, 330 million people and 350 million people could be statistically indistinguishable. But when you need to assume the U.S. population in your analysis, you should use the actual correct number – with the actual error margin (or more complicated probabilistic distribution) indicated by the relevant and accurate previous measurements.

330 million and 350 million could be statistically indistinguishable as outputs i.e. in conclusions. But if you substitute these two numbers as inputs, of course you may get different results. Incidentally, this reminds me of my discussions about the logical arrow of time. You don't care about the difference between different microstates in an ensemble defining the final state. But you need to care about the difference between the initial microstates. Some special ones among them could lead to qualitatively different final results. So instead, you need to understand that you don't know the precise initial microstate in most cases which is why your conclusions (predictions) have to be statistical in character.

My general discussion about initial and final microstates is mostly isomorphic to the discussion about the input and output data that are statistically indistinguishable (which is analogous to the indistinguishable microstates in the statistical physics discussions).

**The statistical indistinguishability shouldn't be assumed to be a "transitive property"**

You know, 330 million and 350 million people could be statistically indistinguishable. Similarly, 350 million is indistinguishable from 370 million. ... And then 970 million from 990 million. If you merge these steps, 330 million is indistinguishable from 990 million. Is it? It is not. Sensible people really understand that they're not. If someone makes this elementary mistake, it's by treating the "statistical indistinguishability" as a strict equivalence. But it is not a strict equivalence that could be used many times. It's only an "effective empirical equivalence" that may be used once while reading one set of conclusions.

Correct statistical reasoning will prevent you from saying that 330 million is indistinguishable from 990 million. If someone makes fallacious reasoning with this ludicrous outcome, it's because of a mistake. One mistake could be the one described in the previous section – namely the assumption that different inputs will yield equivalent results just because these different inputs appeared as statistically indistinguishable as outputs in some paper.
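The failure of transitivity can be checked numerically. In this sketch, the error margin of 20 million on each estimate and the 2-sigma cut are assumptions chosen purely for illustration; the chain 330, 350, ..., 990 (in millions) is the one from the text.

```python
from math import sqrt

SIGMA = 20.0      # assumed 1-sigma error on each estimate, in millions
THRESHOLD = 2.0   # assumed cut: below 2 sigma counts as "indistinguishable"

def z_score(a: float, b: float, sigma_a: float = SIGMA, sigma_b: float = SIGMA) -> float:
    """Difference of two independent estimates in units of its combined error."""
    return abs(a - b) / sqrt(sigma_a**2 + sigma_b**2)

def indistinguishable(a: float, b: float) -> bool:
    return z_score(a, b) < THRESHOLD

chain = list(range(330, 1000, 20))  # 330, 350, ..., 990 (millions)

neighbors_ok = all(indistinguishable(a, b) for a, b in zip(chain, chain[1:]))
ends_ok = indistinguishable(chain[0], chain[-1])

print("every neighboring pair indistinguishable:", neighbors_ok)
print("330 vs 990 indistinguishable:", ends_ok)
print(f"330 vs 990 differ by {z_score(chain[0], chain[-1]):.1f} sigma")
```

Each step is well below the cut but the deviations add up along the chain, which is why the "equivalence" may only be used once.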

**Statistical significance isn't the same as practical importance (or clinical significance)**

Now, this is also an important question but they have failed to discuss it correctly, too. When we say that a measured effect is "statistically significant", it means that the effect was seen to be "large" in some sense. But which sense and how large?

When we say that we need to be careful while doing some activity because a mistake could matter, it also means that the effect of that mistake is believed to be "large". But in what sense and how large? I think that sensible people – and authors of most articles in scientific journals – understand very well that the two types of "large" in this paragraph and the previous one are different.

"Statistically significant" effects are those that are "large" in the sense that they're visible through the particular research or "way of looking" that was employed in that paper. If we look in a certain way, we may calculate the probability that we're not seeing correctly (because some noise or the error of the apparatus has prevailed) and the probability may be calculated to be low. So we should believe our eyes that tell us that we are really seeing the effect.

But being visible is something else than being important for someone's life etc.

For example, one may calculate from the government's data that there is a difference between the average annual salary of the people who are born in the winter months and those who are born in the summer months. The difference in the annual salaries is some \$30 and because the salaries are averages over very many people, \$30 is actually "statistically significant", maybe over 5 sigma, you would have to search for details.

But \$30 a year is still a tiny difference that shouldn't tangibly affect the rational parents' planning, right?

This is quite a typical example: effects may be statistically significant but they're still too small to matter in practice. Why is it possible? Because we can see even things that don't matter. We may surely see things and effects that are too small to be dangerous.
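How a \$30 gap can reach 5 sigma is easy to sketch. The spread of salaries (\$30,000), the mean salary (\$50,000) and the group sizes below are hypothetical numbers that I picked to make the arithmetic transparent; only the \$30 difference comes from the example above.

```python
from math import sqrt

def z_of_mean_difference(diff: float, sd: float, n_per_group: int) -> float:
    """Significance (in sigmas) of a difference between two group means,
    assuming equal group sizes and the same standard deviation in both groups."""
    standard_error = sd * sqrt(2.0 / n_per_group)
    return diff / standard_error

SALARY_GAP = 30.0       # dollars per year, from the example in the text
SALARY_SD = 30_000.0    # assumed spread of individual salaries
MEAN_SALARY = 50_000.0  # assumed average salary

z_census = z_of_mean_difference(SALARY_GAP, SALARY_SD, 50_000_000)  # government-scale data
z_survey = z_of_mean_difference(SALARY_GAP, SALARY_SD, 1_000)       # small survey
relative_gap = SALARY_GAP / MEAN_SALARY

print(f"government-scale data: {z_census:.2f} sigma")
print(f"small survey:          {z_survey:.3f} sigma")
print(f"the gap is {relative_gap:.2%} of the assumed mean salary")
```

The same tiny gap is invisible in a small survey and highly significant in census-sized data, while its practical size never changes.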

What about the opposite "soft contradiction"? Can there be "statistically insignificant effects that are still practically important"? Well, in practice, there may be. But I am convinced that this option is much less likely and reasonable.

Why? If you have good enough, modern apparatuses that can observe and measure things, and large enough samples in surveys, they can simply see a lot. Their resolution is better than yours. So if they don't see anything substantial, it makes perfect sense to assume that it won't be important for you.

This comment is extremely important in all the environmental, healthy food, and similar soft sciences. You know, one can measure the correlation between the drinking of coffee and some kind of cancer. Coffee may increase the cancer rates by 10% and within a paper, it may be statistically significant or not.

If it is significant, you might say that in principle, you have a reason to avoid coffee. But 10% is still a rather small change. It may translate to the life expectancy that is shortened by a month. I think that most coffee lovers would prefer their lives to be 1 month shorter than to live their whole life without coffee. What they do if they're rational is some cost-and-benefit analysis, sometimes done instinctively or subconsciously. They're solving a more complex problem that also takes totally different things, like their well-being without coffee, into account. The mere statistical significance of some coffee-cancer link simply isn't the whole story.

On the other hand, if the effect of coffee is measured to be 10% and statistically insignificant, I think it is right for everyone to interpret the paper as "no scientific reason to avoid coffee". The paper has looked at the question and concluded with "it seems that there is no effect". This is a right interpretation of the statistical insignificance. 10% seems small both in some "absolute" terms and relative to the threshold needed to be certain that the effect exists at all. On top of that, there are lots of other effects of coffee, many of which may be (and almost certainly are) beneficial! So it's wrong to deduce your future coffee drinking habits from such a special cancer-coffee-link analysis. And if the result of that paper is "statistically insignificant", it means that "according to science, the effect is said not to exist in practice". This interpretation is really a top reason why the scientific research is done in the first place. If we couldn't make such conclusions, the research would be useless.

**Statistical significance may sometimes be too heavy, too mathematical a tool, but as soon as scientists' "obvious" conclusions differ, it is simply needed**

OK, the bird songs may be "obviously" correlated with some copulation cycles. A bird song researcher may hear it or see it – by eyeballing. When others do, even outside all the echo chambers, it's great (but even in the case of a universal agreement, it is still a less solid type of a "proof"). But what if another bird song researcher simply says that the correlation isn't there? As soon as any significant differences emerge in scientists' interpretations of what they observe, they simply have to become quantitative and use the $$p$$-values.

Observing all the relevant quantities and calculating the $$p$$-values is needed to quantify whose view is more supported by the actual empirical data. The procedure finding the $$p$$-values is the "measurement of who seems to be right". How could science live without this procedure, the so-called "hypothesis testing"? Isn't it obvious that the people who want to eliminate the statistical significance in general are those who don't have the science on their side?

**Feynman's point: You should test a null hypothesis that was formulated in advance, otherwise you're overfitting**

Of course, Richard Feynman has said a lot about the philosophy of science – and even the role of the probabilistic tools in the research. One slightly technical point he sometimes emphasized was that the quantification of the $$p$$-value – which is the probability that the null hypothesis predicts a result that is "at least as extreme as the observed one" – should depend on a null hypothesis that is not being adjusted after the experiment.
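A rough way to see why the choice must be fixed beforehand: assume (this is my illustrative assumption, not Feynman's wording) that an analyst can pick, after seeing the data, the most striking of several independent ways of looking at them. Then the chance that pure noise yields at least one "significant" pattern grows quickly with the number of looks.

```python
def chance_of_false_alarm(alpha: float, n_looks: int) -> float:
    """Probability that at least one of n_looks independent tests crosses the
    threshold alpha even though the null hypothesis holds in every one of them."""
    return 1.0 - (1.0 - alpha) ** n_looks

for n_looks in (1, 5, 20, 100):
    print(f"{n_looks:3d} looks at alpha = 0.05: "
          f"false alarm chance = {chance_of_false_alarm(0.05, n_looks):.3f}")
```

A quoted $$p$$-value means what it says only if the single test it belongs to was chosen before the data were examined.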

In particle physics, we test the Standard Model and we can really calculate what it predicts – unless some truly messy QCD phenomena are relevant. So particle physics really tends to obey Feynman's condition.

However, in other fields, you could claim that "there is no effect here" just by saying that "what you're seeing is actually basically the null hypothesis of yours". In particular, "what the global warming predicts concerning the cool weather" is an example of predictions that are being adjusted a posteriori. "Global warming" is a very flexible "null hypothesis" and all observations have been said to be consistent with "it".

I must say that there's a danger on the opposite side of Feynman's recommendation, too: You may use too special a null hypothesis in some research and it's easier to refute it at five sigma or more. But you shouldn't overinterpret this "new discovery" because there could be a better, less naive null hypothesis of the "qualitatively same kind" that would be compatible with the data. Again, examples exist in the climate debate. Some alarmists have claimed to "exclude natural variability" as the cause of the warming. But in reality, they only excluded a particular, rather limited null hypothesis that they called "natural variability". More realistic models of the natural variability are still doing fine and it's obvious why they do – some of such conceivable hypotheses may be made almost as indistinguishable from the man-made influence as you can get.

**Statistical significance doesn't immediately tell us which new theory is correct**

When the Standard Model is ruled out at 5 sigma, we will know – be reasonably certain – that "new physics" beyond the Standard Model exists. But there will still be infinitely many choices what the new physics is and what particles and equations it needs to be properly described. Something similar is true in science in general. Certain people are too fast while interpreting statistically significant results, of course. They say "something is statistically significant and therefore this favorite story of ours has been proven". But it's not quite correct in most cases. Just one "null story" has been disproven – which is something else than incorrectly saying that "one particular different story was proven".

**The most important punch line again: Science cannot work without statistical significance**

Statistical significance is a reasonably well-defined quantity measuring how well one theory or another succeeds in explaining the observed data. The whole scientific method is about the search for the correct explanations – which proceeds primarily by the elimination of the explanations that have failed – which means that the whole scientific method needs to compare the "success" of the theories. Because $$p$$-values are the quantities that decide which theories work and which don't, in the wake of some measurements, $$p$$-values – expressed in one language or another – are absolutely essential for science. Without that essential concept, science would degrade into a pile of feelings, beliefs, tricks, and superstitions – and competing feelings and superstitions that couldn't be compared according to decent rules.

So whether or not you feel like sharing some of their murky goals or agenda, it's simply wrong and unethical to fight against "statistical significance" in science because no real science would be left. The activity would degenerate into emotions and "biases that are getting a free pass" and such an activity would have nothing to do with science as we have known it for centuries.

And that's the memo.