Monday, July 11, 2016

The Jeffreys-Lindley paradox

On the probabilities of theories that are close to the measured answer but not quite right

In a discussion with Alessandro Strumia and others, Tommaso Dorigo has repeated some of his opinions about the Jeffreys-Lindley paradox (Wikipedia) which, in Dorigo's opinion, makes Bayesian thinking unusable in experimental particle physics (and probably everywhere else, because all other situations are analogous). He previously wrote about it in 2012 and the paradox was also discussed by W.M. Briggs and others.

Czechia: Off-topic, geography: the U.N. databases now list "Czechia" as an official name of my homeland. Czech report, Bloomberg. Its usage is not mandatory but those who use wrong names will be stabbed to death by Mr Ban Ki-moon. Despite widespread fearmongering, the Prague Castle hasn't collapsed yet.
The paradox is meant to be one in which frequentism and Bayesianism give opposite verdicts about the validity of a hypothesis. Just to be sure, the frequentist definition of the probability is that every probability should be measured (and measurable) as the ratio \(N_{\rm OK}/N_{\rm total}\) for a limiting, very large number of trials \(N_{\rm total}\). The Bayesian probability admits that probabilities are used to quantify the subjective belief in the validity of statements and there exists a rational method (involving Bayes' theorem) for correctly updating these probabilities (beliefs) even if we can't ever make measurements with \(N_{\rm total}\to\infty\).

What's the paradox? Wikipedia gives a simple example. Try to test whether one-half of the newborn infants are boys. You wait for the birth of some 1 or 1.4 or 2 million children, I don't know which it is, and the "one-half hypothesis" predicts that one-half of those kids will be boys, plus or minus 1,000 (the standard deviation). The distribution is binomial – almost exactly normal.

However, the measured number of boys in a nation will be some number approximately 3,000 below the exact "one-half prediction". The question is whether this evidence proves or disproves the "one-half hypothesis".

The frequentist only cares about the hypothesis, not about its negation, so he sees a deviation of the experimental fact from the prediction by 3 sigma and falsifies the "one-half hypothesis". In effect, the frequentist is satisfied with a small conditional probability \(P(E|H)\) – the probability of the observed evidence predicted by a hypothesis – to falsify \(H\). The probability that a newborn kid is male isn't exactly one-half; the 3-sigma deviation ruled this theory out.
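In concrete numbers – the post leaves the total number of births vague, so as an illustrative assumption I take \(N = 4\) million, which makes the standard deviation exactly 1,000 – the frequentist verdict may be sketched like this:

```python
import math

# The post leaves N vague; as an illustrative assumption I take
# N = 4 million births, which makes the standard deviation exactly 1,000.
N = 4_000_000
p0 = 0.5                                  # the "one-half hypothesis"
sigma = math.sqrt(N * p0 * (1 - p0))      # sigma of the boy count: 1000.0
observed = N // 2 - 3000                  # 3,000 boys below the prediction

z = (observed - N * p0) / sigma           # z = -3.0: a 3-sigma deficit
p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value

print(z, p_value)   # -3.0 and ~0.0027: the frequentist rejects the hypothesis
```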

On the other hand, the Bayesian calculates the probability of all hypotheses, including the "negations", and he insists on showing that \(P(H|E)\) is small – note the opposite order – if he wants to eliminate \(H\). The negation of the "one-half hypothesis" is basically saying that the percentage of the boys is an unknown number uniformly distributed between 0% and 100%. Because the uncertainty of the Bayesian prediction for the percentage of boys made by the "not 50% hypothesis" was so much higher, it is very unlikely to produce a number close to 50%. In other words, according to the Bayesian, the "not 50% hypothesis" has no explanation why the percentage was rather close to 50%. Bayes' theorem punishes this "not one-half hypothesis" for this vagueness of the prediction and the result may end up being that the posterior probability for the "not one-half hypothesis" is even (much) smaller than that of the "one-half hypothesis".

For the Bayesian, the "one-half hypothesis" did a lousy job but the opposite hypothesis did no job at all – it had no clue about the percentage, not even approximately – so the "one-half hypothesis" wins over the negation despite the lousy fit.
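The Bayesian side of the same toy example may be sketched as follows. A standard fact about the Beta integral is that a uniform prior on the boy fraction gives a marginal likelihood of exactly \(1/(N+1)\) for any observed count; the point null is evaluated with the normal approximation (\(N = 4\) million is again my illustrative assumption):

```python
import math

# Illustrative assumption (as above): N = 4 million births, observed boy
# count 3,000 (i.e. 3 sigma) below the exact one-half prediction.
N = 4_000_000
k = N // 2 - 3000

# Marginal likelihood under H0 (p exactly 1/2), normal approximation
# to Binomial(N, 1/2) with sigma = 1000:
sigma = math.sqrt(N * 0.25)
z = (k - N / 2) / sigma
like_H0 = math.exp(-z ** 2 / 2) / (sigma * math.sqrt(2 * math.pi))

# Marginal likelihood under H1 (p uniform on [0, 1]): the Beta integral
# of the binomial likelihood equals exactly 1/(N+1), whatever k is.
like_H1 = 1 / (N + 1)

bayes_factor = like_H0 / like_H1
print(bayes_factor)   # ~18: the data favor the 3-sigma-"excluded" point null
```

So the same data that let the frequentist reject the point null at 3 sigma give the Bayesian a Bayes factor of roughly 18 in its favor over the uniform "negation" – which is the whole paradox in one number.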

OK, is there a paradox? No. The very specific hypothesis that the fraction is 0.50000 was ruled out at some 99.7% level, indeed. But what was considered the "negation" of the hypothesis – that it's a random number between 0% and 100% – could have been falsified, too. My view is that the reason why there's no contradiction is that the purported "negation" isn't really a negation at all. More precisely, the uniform distribution on the interval from 0% to 100% in no way follows from the assumption that "the percentage of boys isn't 50%".

Why? Because the actual "negation of the one-half hypothesis" is an extremely vague hypothesis that doesn't really allow you to make any predictions at all. For this reason, you can't really calculate the "probabilities of an observation" predicted by this "it is not 50% of boys" hypothesis. And that's what prevents you from applying Bayes' theorem really accurately.

Bayesian inference works great but the competing hypotheses should be sufficiently well-defined so that the probabilities of various observations may be quantitatively predicted from these hypotheses. This is satisfied for the "uniform distribution between 0% and 100%" but that hypothesis may justifiably be heavily disfavored even relative to hypotheses that are ruled out. On the other hand, a better interpretation of the "not 50% of boys" hypothesis may give more reasonable predictions for the percentage (uncertain but close to 48-52 percent) but it's not clear what the predictions are and why.

Equivalently, the statement "the rule is something else than 50% for boys" may have a probability that is nearly 100% when the "50% theory" is ruled out at 3 sigma. This nearly 100% is the sum of the probabilities of all the detailed alternative, mutually exclusive theories. However, if we decide that this "not 50% for boys" should be one hypothesis, its probability is something else – it's a weighted average of the individual hypotheses it contains, and that can be low, even lower than the 3-sigma-excluded "50% theory".

How should we proceed in this case of the 49% of boys? Well, we may formulate another hypothesis, namely that the percentage of boys is 50% plus or minus 2%, and this hypothesis will beat both the "exactly 50% hypothesis" and the "uniform distribution between 0% and 100% hypothesis". It's common sense because 50% plus or minus 2% is what we normally get for boys (well, maybe 49% plus or minus 2%). But where does this better hypothesis fit? Does it belong to the "exactly 50% slice" or to "its negation"?

Well, in principle, it contradicts the "exactly 50% hypothesis". When you measure the percentage many times, you may decide whether the fraction is exactly 50% or not. But this better hypothesis is clearly inequivalent to the negation mentioned above, the negation that assumes the uniform distribution. The better hypothesis is a slice of the negation which is very close to the "exactly 50% hypothesis" in some metric.
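A sketch of this three-way contest, with the same illustrative \(N = 4\) million births and the prior width of 2% taken from the text; the grid integration is a crude but sufficient stand-in for the exact marginal likelihood:

```python
import math

N = 4_000_000                 # illustrative number of births (as above)
k = N // 2 - 3000             # observed boys: 3 sigma below one-half

def binom_like(p):
    # normal approximation to the Binomial(N, p) likelihood of k
    mu, sd = N * p, math.sqrt(N * p * (1 - p))
    return math.exp(-((k - mu) / sd) ** 2 / 2) / (sd * math.sqrt(2 * math.pi))

def prior_med(p):
    # "medium" hypothesis: boy fraction normal around 50% plus or minus 2%
    return math.exp(-((p - 0.5) / 0.02) ** 2 / 2) / (0.02 * math.sqrt(2 * math.pi))

# marginal likelihood of the medium hypothesis by grid integration over p
dp = 1e-5
like_med = sum(binom_like(0.4 + i * dp) * prior_med(0.4 + i * dp)
               for i in range(20001)) * dp

like_H0 = binom_like(0.5)     # the point "exactly 50%" hypothesis
like_H1 = 1 / (N + 1)         # the uniform 0%-100% "negation"

print(like_med > like_H0 > like_H1)   # True: the medium hypothesis beats both
```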

At any rate, the right way to deal with these situations is to have sufficiently well-defined yet actually viable or realistic hypotheses to run as competitors. When you believe that a quantity is close to 50% but it isn't quite there, you should say it, e.g. define a distribution preferring numbers close to 50% but not necessarily 50%. (This situation appears in particle physics all the time, with all the constants that are much smaller than one or the order-of-magnitude estimates if those are dimensionful.)

When you have such promising hypotheses, you may compare them in the Bayesian way. This contest will be fair and you should trust the results. So the better "medium" hypothesis will beat all the extreme ones. In principle, when lots of evidence is accumulated, the "medium" hypothesis is mutually exclusive with the theories on both sides. But one must always appreciate that if the total amount of data is known to be limited, hypotheses that sound different may still overlap – so they are not mutually exclusive.

For example, take the hypothesis that the percentage of boys follows a normal distribution around 50% plus or minus 1%; and the hypothesis that it is 50% plus or minus 1.01% – those are two "different" hypotheses. If the distribution may be measured (and you need a huge number of births for that), they may be strictly distinguished. But for any realistic finite number of births, the hypotheses are effectively equivalent. So when you demand that the total probability of all such hypotheses is 100%, you are doing something fishy. Because this pair of hypotheses is basically equivalent, you are really double- (or multiple-) counting the probability of this hypothesis if you compute the sum of their probabilities, so this sum should be allowed to be greater than 100%.
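A quick numerical check of this near-equivalence, with the same illustrative toy numbers as before: the Bayes factor between the 1% and 1.01% priors comes out almost exactly one, so the data cannot discriminate between them:

```python
import math

N = 4_000_000                 # illustrative number of births (as above)
k = N // 2 - 3000             # observed boys: 3 sigma below one-half

def marginal(prior_sd):
    # marginal likelihood of the data for a normal prior on the boy
    # fraction centered at 0.5 with width prior_sd, by grid integration
    def binom_like(p):
        mu, sd = N * p, math.sqrt(N * p * (1 - p))
        return math.exp(-((k - mu) / sd) ** 2 / 2) / (sd * math.sqrt(2 * math.pi))
    def prior(p):
        return math.exp(-((p - 0.5) / prior_sd) ** 2 / 2) / (prior_sd * math.sqrt(2 * math.pi))
    dp = 1e-5
    return sum(binom_like(0.4 + i * dp) * prior(0.4 + i * dp)
               for i in range(20001)) * dp

bf = marginal(0.01) / marginal(0.0101)
print(bf)   # ~1.01: the two "different" hypotheses are effectively equivalent
```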

Hypotheses that "compete" in the Bayesian reasoning should be sufficiently well-defined to predict some probabilities of observations; they should produce predictions that sufficiently differ from those of other hypotheses; and they should be sufficiently realistic not to predict completely wrong values most of the time.

Let's apply this to the cosmological constant. We may have the hypothesis that \(\Lambda=0\); the hypothesis that \(\Lambda\) is nonzero and uniformly distributed in an interval of numbers comparable to \(m_{\rm Planck}^4\); and a more realistic yet seemingly "artificial" hypothesis that\[
\Lambda=m_{\rm Planck}^4 \cdot 10^{-100E}
\] where \(E\) has a normal distribution around zero with the standard deviation one. Clearly, the latter hypothesis will win. (I could have invented nicer or more justified similar distributions but I wanted to keep things simple.) The experiments indicate that \(E\sim 1.23\) because the cosmological constant is 123 orders of magnitude away from the "Planckian estimate". And \(E\sim 1.23\) is perfectly likely according to the normal distribution.

The only objection you might have is that my hypothesis was really built after I learned the measured value of \(\Lambda\), so this hypothesis is "artificial". I was "cheating". I don't have any explanation for this form of the distribution for \(\Lambda\). Right. Except that even if I don't have an explanation for this distribution, one may exist and it's a legitimate possibility to assume that such an explanation exists – even if no one knows what it could be. The absence of an explanation for \(10^{-100E}\) is a disadvantage in the eyes of a theorist – but it shouldn't be a disadvantage in the eyes of an experimenter. An experimenter should be able to impartially compare well-defined hypotheses whether they sound motivated to him or not.
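As a toy numerical version of this contest (reading the exponent in base ten, so that \(E\approx 1.23\) corresponds to the quoted 123 orders of magnitude; all the numbers are illustrative), one may compare the likelihood densities that the two nonzero-\(\Lambda\) hypotheses assign to the observed value, in Planck units:

```python
import math

# Toy comparison in Planck units: the observed Lambda is taken to be
# ~10^(-123) times the Planckian estimate (an illustrative number).
log10_Lambda = -123.0

# Hypothesis A: Lambda uniform on [0, 1] -> likelihood density 1 everywhere.
log10_like_A = 0.0

# Hypothesis B: Lambda = 10^(-100 E) with E standard normal.
# Change of variables: p(Lambda) = phi(E) / (100 * ln(10) * Lambda).
E = -log10_Lambda / 100                  # E = 1.23, perfectly likely for N(0,1)
phi = math.exp(-E ** 2 / 2) / math.sqrt(2 * math.pi)
log10_like_B = math.log10(phi) - math.log10(100 * math.log(10)) - log10_Lambda

print(log10_like_B - log10_like_A)       # ~120: hypothesis B wins by ~10^120
```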

After all, Weinberg basically did calculate a similar distribution (but one with \(E\) more tightly focused on \(E\sim 1.23\)) using the anthropic observation (the existence of stars and the remarkable ability of the early cosmology to easily make stars impossible). When you compare "motivated" hypotheses about the cosmological constant – those with a plausible detailed explanation of their structure – you will surely see that Weinberg's distribution for \(\Lambda\) is among the best hypotheses, probably the best one.

But there may exist other explanations of the tiny cosmological constant which are perhaps (even) more quantitative than Weinberg's anthropic estimate. String theory might have an explanation why \(\Lambda\) is proportional to an exponential of something simpler; and why the exponent is naturally of the form \(100E\) where \(E\) is even simpler (and \(100\) is an estimate of the number of homology classes of a typical compactification manifold, for example). The fact that we don't know the precise logic does not mean that we may eliminate these possibilities a priori. These explanations might be right even if their details – the detailed reasons why they predict what they predict – are unknown at this moment.

(Analogous comments hold not only for the cosmological constant but also for the Higgs mass in the Planck units, CP-violating phases, Yukawa couplings, and several other "small" parameters we know in Nature.)

This "nonzero probability" of these "so far unknown" hypotheses is something totally different from the claim that physicists should work on them much of the time. These hypotheses of an unknown form may be so difficult that physicists might think that it's a waste of time to try to attack these difficult puzzles right now. But they may still believe that these hypotheses are likely. The time spent on a theory isn't quite proportional to the probability that the theory is correct – although the proportionality should probably be approximately true if "all other aspects are equal" (which they almost never are, however).

To summarize, there is no paradox. To avoid mistakes including the wrong claim that there's a paradox, we should:
  1. distinguish vague "I say nothing about the parameters" from the "parameter is distributed uniformly" hypotheses: the latter may easily be falsified which doesn't falsify the former (this warning of mine is at least morally equivalent to my criticism of the unjustifiable "typicality" assumption of the believers in the anthropic principle)
  2. consider hypotheses of the "medium" type that make small but nonzero values (e.g. modest deviations of boys from 50% or the cosmological constant) reasonably likely
  3. try to adjust the form of the hypotheses in such a way that they're distinguishable by realistic portions of the data we want to consider – that they are mutually exclusive even in practice (otherwise the given package of the empirical data is not useful for the discrimination, and whenever it's so, we should be aware of this fact and consider it a vice of the experiment in the context of the theories, not a vice of any competing hypothesis itself)
  4. appreciate that the typical scenario in all of science is that theories that predict quantities "pretty well" yet imprecisely are more useful than theories that predict nothing about these quantities; in this typical case, the "pretty good" yet imprecise theories may be falsified but they may still beat the "no prediction" theories with the uniform distribution; when it's so, the next step should be to try to stand on the shoulders of giants and define a "refined" or "corrected" theory that makes predictions similar to those of the specific theory but with some corrections that are said to be small
The last item is the longest one because I believe that the omission of this step – the construction of better hypotheses and theories – is the point missed by those who claim that there is a paradox. They imagine that the right set of competing hypotheses is there from the beginning and the rest of the history of science is to measure and pick the winner. But that's clearly not what science is all about. Science constantly updates, corrects, enriches, and rearranges the list of the actual competing hypotheses and theories. Those considered by physicists today are completely different from those considered 100 years or even 40 years ago. The drift towards more accurate, precise, universal, elegant, and justified theories is the bulk of the scientific progress. The experiments aren't just picking winners and losers in a static race that always has the same character; they are primarily telling theorists how to breed better horses for future races!

When one is careful, the paradox isn't there. In particular, it is not true, as Tommaso Dorigo tries to claim, that the Jeffreys-Lindley paradox shows that the Bayesian thinking about scientific theories is unusable.

In the end, the experiments' main purpose is to produce verdicts about the relative likelihood of various explanations and hypotheses. If experiments don't tell us whether some theory or hypothesis is viable or not, they're useless for theorists.

Dorigo's focus on frequentist thinking is ultimately unusable for all of science and for theorists because they simply need to know the probability that some theory or hypothesis or statement is true. But Dorigo's focus is legitimate to a limited extent: the "frequentist probabilities" that the experimenters produce may be viewed as an "isolated part" coming from the evaluation of an experiment. Theorists may combine this isolated part (a job that the experimenters are trained for and should be good at) with other reasoning – which also includes some Bayesian reasoning involving their prior probabilities of different hypotheses (vague but vital information that a theorist should always try to have some clue about) – and at this moment, the experimental data are useful for the theorists.

Dorigo has told us that he believes that the probability of any new physics at the LHC is \(10^{-10}\). So even if he found a 6-sigma evidence for a new BSM particle, he would believe that it's not real! This fact obviously means that Dorigo is a prejudiced bigot. But as long as he (and others at the LHC) won't cheat in the experiments or hide some inconvenient evidence for new physics and as long as they inform us about the correctly measured frequentist probabilities, we may view his claims about the "near certainty that physics is over" to be just some stupid hobby or religious ritual unrelated to his actual work.

In other words, Dorigo is a dog barking up the wrong tree but he may still be useful in biting an unwelcome guest.


snail feedback (1) :

reader Tommaso Dorigo said...

Hi Lubos,

nice post. I may have overlooked it in your long post (sorry am doing this with residuals of cpu while trying to follow a talk), but you do not focus on the important issue here, namely that "point nulls" do exist in HEP. The mass of the photon is EXACTLY zero, and the charge of the proton is equal to the charge of the positron. The cross section of NP is either >0 or EXACTLY zero. This makes the null hypothesis a "point null", and forces a Bayesian to place a non-null lump of probability in a single point of a continuous space. From this point on, and the existence of two other scales -a wider prior (make it a narrow one if you prefer, but things don't change) and a data-size-dependent "evidence" - make the paradox appear.
I don't think you can solve it with your recipe, but I think your post is useful anyway. Please have a look at the paper by Robert Cousins on this topic, you will find it very enlightening.