## Monday, August 19, 2013

### In defense of five standard deviations

Originally posted on August 12th. The second, third, and fourth (last) parts were added at the end.

Five standard deviations are cute.

However, Tommaso Dorigo wrote the first part of his two-part "tirade against the five sigma", "Demistifying The Five-Sigma Criterion", and I mostly disagree with his views. The disagreement begins with the first word of the title ;-) that I would personally write as "demystifying" because what we're removing is mystery rather than mist (although the two are related words for "fog").

He "regrets" that the popular science writers tried to explain the five-sigma criterion to the public – I think they should be praised for this particular thing because the very idea that the experimental data are uncertain and scientists must work hard and quantitatively to find out when the certainty is really sufficient is one of the most universal insights that people should know about the real-world science.

When I was a high school kid, I mostly disliked all this science about error margins, uncertainties, standard deviations, noise. This sentiment of mine must have been a rather general symptom of a theorist. Error margins are messy. They're the cup of tea of the sloppy experimenters while the pure and saintly theorist only works with perfect theories making perfect predictions about the perfectly behaving Universe.

Of course, sometime early in college, I was forced to get dirty a little bit, too. You can't really do any empirical research without some attention paid to error margins and probabilities that the disagreements are coincidental. As far as I know, the calculation of standard deviations was one of the things that I did not learn from any self-study – these topics just didn't previously look beautiful and important to me – and the official institutionalized education system had to improve my views. The introduction to error margins and probability distributions in physics was a theoretical introduction to our experimental lab courses. It was taught by experimenters and I suppose that it was no accident because they were more competent in this business than most of the typical theorists.

At any rate, I found out that the manipulations with the probability distributions were a nice and exact piece of maths by themselves – even though they were developed to describe other, real things, that were not certain or sharp – and I enjoyed finding my own derivations of the formulae (the standard deviations for the coefficients resulting from linear regression were the most complex outcomes of this fun research).

At any rate, hypotheses predict that a quantity $$X$$ should be equal to $$\bar X\pm \Delta X$$ if I use simplified semi-laymen's conventions. The error margin – well, the standard deviation – $$\Delta X$$ is never zero because our knowledge of the theory, its implications, or the values of the parameters we have to insert into the theory is never perfect.

Similarly, the experimenters measure the value to be $$X_{\rm obs}$$ where the subscript stands for "observed". The measurement also has its error margin. The error margin has two main components, the "statistical error" and the "systematic error". The "total error" for a single experiment may always be calculated (using the Pythagorean theorem) as the hypotenuse of the triangle whose legs are the statistical error and the systematic error, respectively.
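The Pythagorean combination above fits in a couple of lines. This is only a minimal sketch; the function name `total_error` and the numbers 3 and 4 are made up for illustration.

```python
import math

def total_error(statistical: float, systematic: float) -> float:
    """Hypotenuse of the triangle whose legs are the two error components."""
    return math.hypot(statistical, systematic)

# Hypothetical measurement with statistical error 3 and systematic error 4
# (same units): the total error is 5, not 7.
print(total_error(3.0, 4.0))
```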

The difference between the statistical error and the systematic error is that the statistical error contains all the contributions to the error that tend to "average out" when you're repeating the measurement many times. They're averaging out because they're not correlated with each other so about one-half of the situations are higher than the mean and one-half of them are lower than the mean etc. and most of the errors cancel. In particular, if you repeat the same dose of experiments $$N$$ times, the statistical error decreases $$\sqrt{N}$$ times. For example, the LHC has to collect many collisions because the certainty of its conclusions and discoveries is usually limited by the "statistics" – by their having an insufficient number of events that can only draw a noisy caricature of the exact graphs – so it has to keep on collecting new data. If you want the relative accuracy (or the number of sigmas) to be improved $$K$$ times, you have to collect $$K^2$$ times more collisions. It's that simple.
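A quick way to convince yourself of the $$\sqrt{N}$$ rule is a toy simulation. The sketch below assumes each "event" is a unit-variance Gaussian number (any distribution with a finite variance would do); the helper names `spread_of_mean` and `data_factor` are invented for this illustration.

```python
import math
import random

def spread_of_mean(n_events, n_repeats=1000, seed=0):
    """Standard deviation of the average of n_events noisy numbers,
    estimated by repeating the whole 'experiment' n_repeats times."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_repeats):
        total = sum(rng.gauss(0.0, 1.0) for _ in range(n_events))
        means.append(total / n_events)
    centre = sum(means) / n_repeats
    return math.sqrt(sum((m - centre) ** 2 for m in means) / n_repeats)

def data_factor(k):
    """How many times more events are needed to improve the accuracy k times."""
    return k ** 2

# Four times more events should shrink the statistical error about twice.
ratio = spread_of_mean(25, seed=1) / spread_of_mean(100, seed=2)
print(ratio)            # close to 2, up to simulation noise
print(data_factor(2))   # 4
```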

On the other hand, the systematic error is an error that always stays the same if you repeat the experiment. If the CERN folks had incorrectly measured the circumference of the LHC to be 27 kilometers rather than 24.5 kilometers, this will influence most of the calculations and the 10% error doesn't go away even after you perform quadrillions of collisions. All of them are affected in the same way. Averaging over many collisions doesn't help you. Even the opinions of two independent teams – ATLAS and CMS – are incapable of fixing the bug because the teams aren't really independent in this respect as both of them use the wrong circumference of the LHC. (This example is a joke, of course: the circumference of the LHC is known much much more accurately; but the universal message holds.)

When you're adding error margins from two "independent" experiments, like from the ATLAS collisions and the CMS collisions, you may add the statistical errors for "extensive" quantities (e.g. the total number of all collisions or collisions of some kind by both detectors) by the Pythagorean theorem. It means that the statistical errors in "intensive quantities" (like fractions of the events that have a property) decrease as $$1/\sqrt{N}$$ where $$N$$ is the number of "equal detectors". However, the systematic errors have to be added linearly, so the systematic errors of "intensive" quantities don't really drop and stay constant when you add more detectors. Only once you have calculated the total systematic and statistical errors in this non-uniform way may you add them (total statistical and total systematic) via the Pythagorean theorem (physicists say "add them in quadrature").
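The recipe may be sketched as follows. The assumptions are mine: the statistical errors of the detectors are independent, the systematic errors are fully shared (like the wrong circumference in the joke above), and the function name `combine` and the errors 3 and 2 are purely illustrative.

```python
import math

def combine(stat_errors, syst_errors):
    """Errors on the SUM of an extensive quantity over several detectors.

    Assumes independent statistical errors (added in quadrature) and
    fully correlated systematic errors (added linearly).
    """
    stat = math.sqrt(sum(s * s for s in stat_errors))
    syst = sum(syst_errors)
    return stat, syst, math.hypot(stat, syst)

n = 4  # four hypothetical "equal detectors"
stat, syst, total = combine([3.0] * n, [2.0] * n)

# Dividing by n gives the errors of the per-detector average, an
# "intensive" quantity: the statistical part drops from 3 to 3/sqrt(4),
# the systematic part stays at 2.
print(stat / n, syst / n)
```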

So far, all the mean values and standard deviations are given by universal formulae that don't depend at all on the character or shape of the probabilistic distribution. For a distribution $$\rho(X)$$, the normalization condition, the mean value, and the standard deviation are given by $\begin{aligned} 1 &= \int dX\,\rho(X) \\ \bar X &= \int dX\,X\cdot \rho(X) \\ (\Delta X)^2 &= \int dX\,(X-\bar X)^2\cdot\rho (X) \end{aligned}$ Note that the integral $$\int dX\,\rho(X)$$ with the extra insertion of any quadratic function of $$X$$ is a combination of these three quantities. The Pythagorean rules for the standard deviations may be shown to hold independently of the shape of $$\rho(X)$$ – it doesn't have to be Gaussian.
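To see that these three integrals make sense for a shape that is clearly not Gaussian, here is a small numerical sketch. The triangular density on the interval from 0 to 2 and the midpoint-rule integrator `moments` are my own illustrative choices.

```python
import math

def moments(rho, lo, hi, steps=20_000):
    """Normalization, mean and standard deviation of rho on [lo, hi],
    computed with a simple midpoint rule."""
    dx = (hi - lo) / steps
    xs = [lo + (k + 0.5) * dx for k in range(steps)]
    norm = sum(rho(x) for x in xs) * dx
    mean = sum(x * rho(x) for x in xs) * dx
    var = sum((x - mean) ** 2 * rho(x) for x in xs) * dx
    return norm, mean, math.sqrt(var)

def triangle(x):
    """Hypothetical non-Gaussian density: a triangle peaked at x = 1."""
    return max(0.0, 1.0 - abs(x - 1.0))

norm, mean, sigma = moments(triangle, 0.0, 2.0)
print(norm, mean, sigma)  # about 1, 1 and sqrt(1/6)
```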

However, we often want to calculate the probability that the difference between the theory and the experiment was "this high" (whether the probability is high enough so that it could appear by chance) – this is the ultimate reason why we talk about the standard deviations at all. And to translate the "number of sigmas" to "probabilities" or vice versa is something that requires us to know the shape of $$\rho(X)$$ – e.g. whether it is Gaussian.

For a Gaussian distribution, there's a 32% risk that the deviation from the central value exceeds 1 standard deviation (in either direction), a 5% risk that it exceeds 2 standard deviations, 0.27% that it exceeds 3 standard deviations, 0.0063% that it exceeds 4 standard deviations, and 0.000057%, which is about 1 part in 1.7 million, that it exceeds five standard deviations.
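These percentages are two-sided Gaussian tail probabilities and can be reproduced with the complementary error function from the standard library. The function name `two_sided_tail` is just a label chosen for this sketch.

```python
import math

def two_sided_tail(n_sigma: float) -> float:
    """Probability that a Gaussian variable deviates from its mean by more
    than n_sigma standard deviations, in either direction."""
    return math.erfc(n_sigma / math.sqrt(2.0))

for n in range(1, 6):
    p = two_sided_tail(n)
    print(f"{n} sigma: p = {p:.3g}, i.e. 1 in {1 / p:,.0f}")
```

If you only count fluctuations in one direction, every probability is halved.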

So far, Dorigo wrote roughly four criticisms against the five-sigma criterion:
• five sigma is a pure convention
• the systematic errors may be underestimated which results in a dramatic exaggeration of our certainty (we shouldn't be this sure!)
• the distributions are often non-Gaussian which also means that we should be less sure than we are
• systematic errors don't drop when datasets are combined and some people think that they do
You see that this set of complaints is a mixed bag, indeed.

Concerning the first one, yes, five sigma is a pure convention but an important point is that it is damn sensible to have a fixed convention. Particle physics and a few other hardcore hard disciplines of (usually) physical sciences require 5 sigma, i.e. the risk 1 in 1.7 million that we have a false positive, and that's a reasonably small risk that allows us to build on previous experimental insights.

The key point is that it's healthy to have the same standards for discoveries of anything (e.g. 1 in 1.7 million) so that we don't lower the requirements in the case of potential discoveries we would be happy about; the certainty can't be too small because the science would be flooded with wrong results obtained from noise and subsequent scientific work building on such wrong results would be ever more rotten; and the certainty can't ever be "quite" 100% because that would require an infinite number of infinitely large and accurate experiments and that's impossible in the Universe, too.

We're gradually getting certain that a claim is right but this "getting certain" is a vague subjective process. Science in the sociological or institutionalized sense has formalized it so that particle physics allows you to claim a discovery once your certainty surpasses a particular threshold. It's a sensible choice. If the convention were six sigma, many experiments would have to run for a time longer by 44% or so (the required amount of data grows as the square of the number of sigmas) before they would reach the discovery levels but the qualitative character of the scientific research wouldn't be too different. However, if the standard in high-energy physics were 30 sigma, we would still be waiting for the Higgs discovery today (even though almost everyone would know that we're waiting for a silly formality). If the standard were 2 sigma, particle physics would start to resemble soft sciences such as medical research or climatology and particle physicists would melt into stinky decaying jellyfish, too. (This isn't meant to be an insulting comparison of climatology to other scientific disciplines because this comparison can't be made at all; a more relevant comparison is the comparison of AGW to other religions and psychiatric diseases.)

Concerning Tommaso's second objection, namely that some people underestimate systematic errors, well, he is right and this blunder may shoot their certainty about a proposition through the roof even though the proposition is wrong. But you can't really blame this bad outcome – whenever it occurs – on the statistical methods and conventions themselves because you need some statistical methods and conventions. You must only blame it on the incorrect quantification of the systematic error.

The related point is the third one, namely that the systematic errors don't have to be normally distributed (i.e. with a distribution looking like the Gaussian). When the distribution has thick tails and you have several ways to calculate the standard deviations, you had better choose the largest one.

However, I need to say that Tommaso heavily underestimates the Gaussian, normal distribution. While he says that it has "some merit", he thinks that it is "just a guess". Well, this sentence of his is inconsistent and I will explain a part of the merit below – the central limit theorem that says that pretty much any sufficiently complicated quantity influenced by many factors will be normally distributed.

Concerning Tommaso's last point, well, yes, some people don't understand that the systematic errors don't become any better when many more events or datasets are merged. However, the right solution is to make them learn how to deal with the systematic errors; the right solution is not to abandon the essential statistical methods just because someone didn't learn them properly. Empirical science can't really be done without them. Moreover, while one may err on the side of hype – one may underestimate the error margins and overestimate his certainty – he may err on the opposite, cautious side, too. He may overstate the error margins and $$p$$-values and deny the evidence that is actually already available. Both errors may turn one into a bad scientist.

Now, let me return to the Gaussian, normal distribution. What I want to tell you about – if you haven't heard of it – is the central limit theorem. It says that if a quantity $$X$$ is a sum of many ($$M\to\infty$$) terms whose distribution is arbitrary (the distributions for individual terms may actually differ but I will only demonstrate a weaker theorem that assumes that the distributions coincide), then the distribution of $$X$$ is Gaussian i.e. normal i.e. $\rho(X) = C\exp\left( - \frac{(X-\bar X)^2}{2(\Delta X)^2} \right)$ i.e. the exponential of a quadratic function of $$X$$. If you need to know, the normalization factor is $$C=1/(\Delta X\sqrt{2\pi})$$. Why is this central limit theorem true?

Recall that we are assuming $X = \sum_{i=1}^M S_i.$ You may just add some bars (i.e. integrate both sides of the equation over $$X$$ with the measure $$dX\,\rho(X)$$: the integration is a linear operation) to see that $\bar X = \sum_{i=1}^M \bar S_i.$ It's almost equally straightforward (trivial manipulations with integrals whose measure is still $$dX\,\rho(X)$$ or similarly for $$S_i$$ and that have some extra insertions that are quadratic in $$S_i$$ or $$X$$) to prove that $(\Delta X)^2 = \sum_{i=1}^M (\Delta S_i)^2$ assuming that $$S_i,S_j$$ are independent of each other for $$i\neq j$$ i.e. that the probability distribution for all $$S_i$$ factorizes to the product of probability distributions for individual $$S_i$$ terms. Here we're assuming that the error included in $$S_i$$ is a "statistical error" in character.

So the mean value and the standard deviation of $$X$$, the sum, are easily determined from the mean values and the standard deviations of the terms $$S_i$$. These identities don't require any distribution to be Gaussian, I have to emphasize again.
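A toy Monte Carlo check of the additivity of variances, with two deliberately non-Gaussian terms. The choice of a uniform and an exponential variable, the sample size, and the helper `variance` are arbitrary choices of this sketch.

```python
import random

def variance(xs):
    """Plain population variance of a list of numbers."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

rng = random.Random(1)
n = 200_000
s1 = [rng.random() for _ in range(n)]          # uniform on [0, 1): variance 1/12
s2 = [rng.expovariate(1.0) for _ in range(n)]  # exponential: variance 1
x = [a + b for a, b in zip(s1, s2)]            # the sum X = S_1 + S_2

var_of_sum = variance(x)
sum_of_vars = variance(s1) + variance(s2)
print(var_of_sum, sum_of_vars)  # both near 13/12, up to sampling noise
```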

Without a loss of generality, we may linearly redefine all variables $$S_i$$ and $$X$$ so that their mean values are zero and the standard deviations of each $$S_i$$ are one. Recall that we are assuming that all $$S_i$$ have the same distribution that doesn't have to be Gaussian. We want to know the shape of the distribution of $$X$$.

An important fact to realize is that the probabilistic distribution for a sum is given by the convolution of the probability distributions of individual terms. Imagine that $$X=S_1+S_2$$; the arguments below hold for many terms, too. Then the probability that $$X$$ is between $$X_K$$ and $$X_K+dX$$ is given by the integral over $$S_1$$ of the probability that $$S_1$$ is in an infinitesimal interval and $$S_2$$ is in some other corresponding interval for which $$S_1+S_2$$ belongs to the desired interval for $$X$$. The overall probability distribution is given by $\rho(X_K) = \int dS_1 \rho_S(S_1) \rho_S(X_K-S_1).$ You should think why it's the case. At any rate, the integral on the right hand side is called the convolution. If you know some maths, you must have heard that there's a nice identity involving convolutions and the Fourier transform: the Fourier transform of a convolution is the product of the Fourier transforms!
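The discrete analogue of this convolution integral is easy to check on two dice. The sketch below replaces the integral with a sum over list indices; the function name `convolve` and the dice example are illustrative only.

```python
def convolve(p, q):
    """Distribution of S1 + S2 for independent S1 ~ p and S2 ~ q.
    Index k of each list stands for the value k (plus a fixed offset)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            out[i + j] += p_i * q_j
    return out

die = [1 / 6] * 6               # index k means the face k + 1
two_dice = convolve(die, die)   # index k means the total k + 2

# The most likely total is 7 (index 5), with probability 6/36.
print(two_dice[5])
```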

So instead of $$\rho(X_K)$$, we may calculate its Fourier transform and it will be given by a simple product (we return to the general case of $$M$$ terms immediately) $\tilde \rho(P) = \prod_{i=1}^M \tilde\rho_S(T_i).$ Here, $$P$$ and $$T_i$$ are the Fourier momentum-like dual variables to $$X$$ and $$S_i$$; because $$X$$ is the plain sum of the $$S_i$$, all the factors are evaluated at the same point, $$T_i=P$$. However, now we're almost finished because the product of many ($$M$$) equal factors may be rewritten in terms of an exponential. If $$\tilde\rho_S(T_i)=\exp(W_i)$$, then the product of $$M$$ equal factors is just $$\exp(MW_i)$$ and the funny thing is that this becomes totally negligible wherever $${\rm Re}\,W_i$$ is negative with $$M|{\rm Re}\,W_i|\gg 1$$. So we only need to know how the right hand side behaves in the vicinity of the maximum of $$\tilde\rho_S(T_i)$$ or $$W_i$$, which sits at $$T_i=0$$. A generic function $$W_i$$ may be approximated by a quadratic function over there which means that both sides of the equation above will be well approximated by $$C_1\exp(-MC_2 T_i^2)$$ for $$M\to\infty$$.

It's the Gaussian and if you make the full calculation, the Gaussian will inevitably come out as shifted, stretched or shrunk, and renormalized so that $$X$$, the sum, has the previously determined mean value, the standard deviation, and the probability distribution for $$X$$ is normalized. Just to be sure, the Fourier transform of a Gaussian is another Gaussian so the Gaussian shape rules regardless of the variables (or dual variables) we use.
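The same conclusion can be checked by brute force in "position space", without any Fourier transform: convolve a lopsided distribution with itself $$M$$ times and compare the result with the Gaussian of the same mean and variance. The skewed three-valued distribution `base` and all the function names are hypothetical choices for this sketch.

```python
import math

def convolve(p, q):
    """Distribution of the sum of two independent integer-valued variables."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            out[i + j] += p_i * q_j
    return out

def relative_gap_to_gaussian(base, m):
    """Largest difference between the distribution of a sum of m terms and
    the matching Gaussian, in units of the Gaussian's peak height."""
    mean1 = sum(k * p for k, p in enumerate(base))
    var1 = sum((k - mean1) ** 2 * p for k, p in enumerate(base))
    mean, var = m * mean1, m * var1  # means and variances simply add up
    dist = [1.0]
    for _ in range(m):
        dist = convolve(dist, base)
    gauss = [math.exp(-(k - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
             for k in range(len(dist))]
    return max(abs(a - b) for a, b in zip(dist, gauss)) / max(gauss)

base = [0.7, 0.2, 0.1]  # values 0, 1, 2 with a clearly non-Gaussian shape
for m in (10, 50, 200):
    print(m, relative_gap_to_gaussian(base, m))  # the gap shrinks as m grows
```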

So there's a very important – especially in the real world – class of situations in which the quantity $$X$$ may be assumed to be normally distributed. The normal distribution isn't just a random distribution chosen by some people who liked its bell-like shape or wanted to praise Gauss. It's the result of normal operations we experience in the real world – that's why it's called normal. The more complicated factors influencing $$X$$ you consider, and they may be theoretical or experimental factors of many kinds, the more likely it is that the Gaussian distribution becomes a rather accurate approximation for the distribution for $$X$$.

Whenever $$X$$ may be written as the sum of many terms with their error margin (even though the inner structure of these terms may have a different, nonlinear character etc.; and the sum itself may be replaced by a more general function because if it has many variables and the relevant vicinity of $$X$$ is narrow, the linearization becomes OK and the function may be well approximated by a linear combination i.e. effectively a sum, anyway), the normal distribution is probably legitimate. Only if the last operation to get $$X$$ is "nonlinear" – if $$X$$ is a nonlinear function of a sum of many terms etc. or if you have another specific reason to think that $$X$$ is not normally distributed, you should point this fact out and take it into account.

But Tommaso's fight against the normal distribution as the "default reaction" is completely misguided because pretty much no confidence levels in science could be calculated without the – mostly justifiable – assumption that the distribution is normal. Tommaso decided to throw the baby out with the bathwater. He doesn't want an essential technique to be used. He pretty much wants to discard some key methodology but as a typical whining leftist, he has nothing constructive or sensible to offer for the science he wants to ban.

Second part

Dorigo's second part of the article is insightful and less controversial.

He reviews a nice 1968 Arthur Rosenfeld paper showing that the number of fake discoveries pretty much agrees with the expectations – some false positives are bound to happen due to the number of histograms that people are looking at. Sometimes experimenters tend to improve their evidence by ad hoc cuts if they get excited by the idea that they have made a discovery. Too bad.

Dorigo argues that the five-sigma criterion should be replaced by a floating requirement. This has various arguments backing it. One of them is that people have differing prior subjective probabilities quantifying how much plausible or almost inevitable they consider a possible result. Of course that extraordinary claims require extraordinary evidence while almost robustly known and predicted ones require a weaker one. It's clear that people get convinced by some experimental claims at a lower number of sigmas than for other claims. But I wouldn't institutionalize this variability because due to the priors' intrinsically subjective character, it's extremely hard to agree on the "right priors".

He also mentions OPERA that made the ludicrous claim about the superluminal neutrinos that was called bogus on this blog from the very beginning. It was a six-sigma result (60 nanoseconds with a 10 nanoseconds error), we were told. Dorigo blames it on the five-sigma standards. But this is just silly. Whatever statistical criterion you will introduce for a "discovery", you will never fully protect physics against a silly human error that may introduce an arbitrary large discrepancy to the results – against stupid errors such as the loosened cable. I wouldn't even count it as a systematic error; it's just a stupid human error that can't really be quantified because nothing guarantees that it remains smaller than a bound. So I think it's irrational to mix the debate about the statistical standards with the debate about loosened cables and similar blunders that cripple the quality of an experimental work "qualitatively" – they have nothing to do with one another!

Third part

In the third part, Dorigo discusses three extra classes of effects and tricks that may lead to fake discoveries. I agree with everything he writes but none of these things implies that the 5-sigma standard is bad or that it could be replaced by something better.

The first effect is mismodeling (well, a systematic error on the theoretical side); the second effect is aposterioriness, the search for bumps in places where we originally didn't want to look (which is OK for discovering new physics but such unplanned situations may heavily and uncontrollably increase the number of discrepancies i.e. false positives we observe, and we shouldn't forget about it when we're getting excited about such a discrepancy); and dishonest manipulation of the data (there's no protection here, except to shoot the culprit; if someone wants to spit on the rules, he will be willing to spit on any rules).

Fourth, last part

In this fourth and final part, Dorigo continues a discussion of a bump near $$150\,{\rm GeV}$$. At the very end, he proposes ad hoc modifications of the five-sigma rule – from 3 sigmas for the B decay to two muons to 8 sigmas for gravitational waves. One could assign ad hoc requirements that differ for different phenomena but it's not clear how it would be determined for many other phenomena for which the holy oracle Dorigo hasn't specified his miraculous numbers. Moreover, the patterns in his numbers don't seem to make any sense. It is very bizarre that a certain exotic, not-guaranteed-to-exist decay of the B-mesons is OK with 3 sigmas while the gravitational waves that have to exist must pass 8 sigmas. Moreover, some other observed signatures, like "SUSY", aren't really signatures but whole frameworks that may manifest themselves in many experiments and each of them clearly requires a different assignment of the confidence levels if we decide that confidence levels should be variable. There would be some path from being sure that there's new physics to being reasonably sure that it's a SUSY effect – Dorigo seems to confuse these totally different levels of knowledge.

If this table with the variable confidence levels were the goal of Dorigo's series, then I must say that the key thesis of his soap opera is crap.

1. The disagreement begins with the first word of the title ;-) that I
would personally write as "demystifying" because what we're removing is
mystery rather than mist (although the two are related words for "fog").

Once again demonstrating your superb grasp of the English language! ;-)

2. 'While he says that it has "some merit", he thinks that it is "just a guess". '

But really it's just a Gauss!

3. Thanks for this big analysis, lucretius. But I didn't understand what financial quantities are non-normal and why should one expect the normal or any sensible distribution for them in the first place. Of course that the change of log(stockprice) in 5 years has totally no reasons to be normal. Growth by 2-3 orders of magnitude is the maximum but drops may be faster, when price goes to zero, so there is an asymmetry. And there's no good reason why it should resemble the Gaussian in between, I think. It seems bizarre to describe it as a failure of the central limit theorem because the assumptions of the theorem are totally violated.

The difference between your non-Gaussian examples and those in natural sciences is that the natural sciences actually do respect the assumptions of the central limit theorem in many cases so they must also respect and they do respect the conclusions of the theorem.

4. Dear Luboš, I am a big fan of these "middlebrow" articles that many of your readers will probably skip due to the subject matter being "old hat" for them, but for me it's new and I get more out of it than I would from reading a dry, dusty textbook or Wikipedia article, due to the lively presentation.

5. Well, the assumption of Gaussian distribution of both prices and returns was the "gold standard" in finance since Bachelier (student of Poincare) first made it until about the 1970s - and several Nobel prizes in economics have been awarded for analyses that crucially rely on this assumption (including, of course, Merton and Scholes for the Black-Scholes formula). So it could not be that stupid. Normality actually holds pretty well over long periods. The scaling properties that Mandelbrot suggested have been shown empirically not to hold (on the other hand, the assumption of discontinuous jumps, typical of Levy processes other than the Wiener process, has a strong statistical basis).

The analysis of the departure from normality that I gave is now considered uncontroversial. The central limit theorem is only a very special case of much more general theorems (see Jacod and Shiryayev Limit Theorems for Stochastic Processes) or http://en.wikipedia.org/wiki/Infinite_divisibility_(probability) for a simplified version.

6. Dear lucretius, tx. Surely the normal-distribution-based standards were tried because they were so helpful elsewhere.

However, the very only disclaimer against the usage that I mentioned - quantities that are nonlinearly applied at the very end - applies to vastly changing prices because the price is really exp(Y) where Y is a more natural quantity. So if Y changes by more than 1 or so, it cannot be that both delta Y and delta exp(Y) are normal-distributed.

A special argument, such as the Markovian one, would be needed to suggest that one of them, namely Y, is normally distributed. But because the changes aren't really Markovian, this proof can't be made.

7. I agree with what you are saying in general, but you have significantly misstated the central limit theorem. It does not apply to *arbitrary* distributions, only to those with finite mean and variance.

Here is an example of a simplified physics experiment in which the exceptions can bite you:

Suppose we have an isotropic light source (in two dimensions) and an infinite line of photon detectors. We wish to determine the location of the source in the direction parallel to the line by observing where the photons hit the line. The hitting position distribution will *not* have finite mean and variance, the sum over many photons will not have a normal distribution, and taking a sum over more photons will not give a better estimate. We would need to use the median instead of the mean.

Also, note that the central limit theorem does not apply to any quantity influenced by many *factors*, only to quantities that are sums of many *terms*; how the parts combine matters. For example, there is a *different* central limit theorem, with a different limit distribution (log normal), for the product of many factors.

Now, the noise in accelerator experiments is mostly a sum, so the Gaussian approximation is largely justified in that case, but that isn't true for *all* experiments, and it is important to know when it isn't.
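The photon example in this comment can be simulated in a few lines. The sketch assumes the source sits at unit distance from the detector line and that the photons reaching the line have a uniformly distributed angle, so that the hit positions follow the Cauchy distribution; the function name `photon_hits` and the source position 3.0 are made up.

```python
import math
import random
import statistics

def photon_hits(n, source_x=0.0, distance=1.0, seed=0):
    """Positions where n photons from an isotropic 2D source hit the line."""
    rng = random.Random(seed)
    return [source_x + distance * math.tan(rng.uniform(-math.pi / 2, math.pi / 2))
            for _ in range(n)]

hits = photon_hits(10_001, source_x=3.0)
print(statistics.median(hits))  # stays close to the true position 3.0
print(sum(hits) / len(hits))    # the mean may land far away and does not settle
```

Rerunning with different seeds shows the median barely moving while the mean jumps around, no matter how many photons are collected.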

8. No derivatives market assumes normality. The black-scholes model which assumes that log returns are normal is just used for quoting prices in units of volatility. It is not used by any dealers (investment banks). They use stochastic and local volatility models with maybe jump diffusion processes added on. They use these models to both price and risk-manage. The crisis was caused by shitty lending practices and cheap credit. Not models.

9. Well, I can tell you that I used to get paid (not my main job, of course) for helping to develop a computer model based on the normality assumption (for commercial mortgage pricing). I did not really like the normality assumption but it was what the clients wanted. So such things were certainly done before 2008 (the model is still up on the web in a demo version) but the company no longer exists (not surprisingly perhaps).

By the way, jump diffusion processes (I also used to work with these) and the others you mention are not fashionable these days (in academic finance) as they lack the property of self-decomposability. That means that their return distributions do not satisfy Sato's "class L property". A random variable has the class L property if it has the same distribution as the limit of some sequence of normalized sums of independent random variables - which is, of course, a generalization of the central limit theorem. Examples of processes that have this property are the Meixner, Variance Gamma, Generalized Hyperbolic and more general additive processes. I am not claiming that dealers know anything about this.

10. Yes, but the arguments were always applied to the distribution of returns, and of course the assumption that the process was Markovian was right at the heart. (That was essentially the "Efficient Market Hypothesis"). Actually, one can show that for option pricing it is not essential since the Markov property is not preserved by change of measure. Thus even if the underlying price process is not Markovian it is quite reasonable to use a Markovian process for pricing derivatives.

11. About Dorigo's third point about non-gaussian distributions: when the data fits a known non-gaussian distribution, I have often seen physicists calculate the appropriate p-value for their results and then convert that into a number of standard deviations for convenience (as much as maybe I shouldn't be, I'm much more comfortable with "5 sigma" than "p=0.00000057").
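The conversion described in this comment may be sketched with the standard library. It assumes the two-sided convention used in the post (deviations in either direction count); the function name `sigma_from_p` is hypothetical.

```python
from statistics import NormalDist

def sigma_from_p(p_two_sided: float) -> float:
    """Number of Gaussian standard deviations whose two-sided tail
    probability equals the given p-value."""
    return -NormalDist().inv_cdf(p_two_sided / 2.0)

print(sigma_from_p(0.0027))    # about 3
print(sigma_from_p(5.733e-7))  # about 5
```

With the one-sided convention, pass the p-value itself to `inv_cdf` instead of one half of it.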

12. Thanks, Eugene, for your kind words.

My guess is that the people have held misconceptions about everything. ;-)

But it's hard to imagine what would the normal distribution for these quantities spanning many (dozens of) orders of magnitude would even mean.

It makes sense to talk about Gaussians as a function of a "linear" variable X. But the strength of an earthquake S isn't naturally linear. You don't even know whether you should express it by the amplitude or its second power or its logarithm and such choices matter because the would-be Gaussian bell curve is wide - wider than the curvature radius.

So there were some people who realized that one should talk about the logarithms of the magnitude because it's more natural, and others who figured out how the frequency of an earthquake etc. depends on this logarithm. There's no natural way to write a Gaussian here so I think that no *important* people have ever done so. ;-)

13. Fred, I think you're wrong to say that the models didn't cause the problem, the only reason the cheap credit was available (to the masses) was because of the trading of convoluted derivative instruments based on flawed models.

14. Hi Luboš,
There are two problems I see with a heavy reliance on the normal distribution and its summary statistics.

• It is often necessary to make an inference or detection claim in the face of small number statistics.
• Generic distributions like the Γ and β can have significant skew despite high statistics.

In both of these cases, using the mean and variance to define confidence intervals, whether one-sided or two sided, is wrong. Well it probably is good enough for a first internal estimate, but not something that ought to go into a final results paper.

In addition, I cannot help but feel uneasy at the prominence of p-values as the arbiter of statistical significance. In the case of the Higgs, why use a test on the falsity of the null hypothesis and not one of the model selection methods from the Bayesian or Neyman/Pearson frameworks? What we ultimately care about is whether there is a Higgs boson, not whether background is well modeled by the measured data. But since I don't work on a particle detector, I shall have to wait for Tommaso's 'part 2'.

This is not to say I don't see the utility in null hypothesis tests, for I use them all the time in my work. I just share the concern of many others that the banner headline "5σ p-value == DETECTION" is prone to both misunderstanding and misuse.

Cheers,
hμν

15. Dear Hmunu, I know that what you write sounds logical to you but it really contradicts the whole logic of the scientific method, especially when it comes to your comment about the p-value.

You ask why we falsify the null hypothesis instead of confirming the new hypothesis. Because it has to be so. Science may only proceed by falsification. The point is that the new hypothesis, as long as it is sufficiently accurately defined, is never *quite* correct. It will be falsified in the future. So if you ever confirm it as an absolute truth, you may be sure that you are doing something wrong. One may only confirm theories as temporary truths which always really means that some simplified models without them, the null hypotheses, may be ruled out. We don't really know that the 126 GeV Higgs boson is "exactly" the Standard Model Higgs boson. We don't know that it is the exact boson from any other specific theory. We only know that it has approximately some properties and it's there - it means that theories that wildly deviate from the SM with the Higgs are ruled out. The definition of what we know about precise theories describing Nature is always negative.

Concerning the skewed distributions etc., your mistake isn't that serious but I still think you're wrong. You may invent some complicated functions, call them distributions, and claim that they're equally relevant for probability distributions as the normal distribution. But they're not.

In physics and science, we're ultimately interested in the sharp value of a quantity. As we're increasing our insights, we're converging towards the only right distribution for X which is delta(X-XR) where XR is the right value. All other functions are just temporary parametrizations of our incomplete knowledge. It makes no sense to develop a superconvoluted science with very unusual functions called distributions here because they're not real in any sense.

Whenever the deviations of a quantity seem to be very small so that the quantity may be linearized in the approximate interval where it probably sits according to the experiments, i.e. if this interval is much narrower than the "curvature radius" where things change qualitatively, it's always correct to measure the mean value and standard deviation from multiple measurements and assume that the quantity is normally distributed.

16. I forgot to add that one reason why Gaussian-based models dominated before 2008 was the difficulty of using any of the others in multi-factor situations. At that time there was no real tractable way to deal with dependence. So although more realistic models could be used in single-factor settings, almost all serious applications involve multiple dependent factors. Your claim that models had no role does not seem to be based on any actual experience of these things; in fact the situation as it was before 2008 is described fairly accurately at the end of this article:

http://en.wikipedia.org/wiki/Copula_(probability_theory)

17. Hi Luboš,

Let me first give an example of a situation where using a non-normal distribution is important for making logical and correct inferences. Suppose one wants to estimate the efficiency of an analysis code to recover a known type of event that is rare in nature. A large Monte Carlo simulation is run with 10^4 points and, behold, all but 1 of the events are recovered. What then is the 99.8% confidence interval? If one quotes (0.9996, 1.0002), which is the mean minus/plus three standard deviations, one would be wrong on two counts: quoting illogical efficiencies (above 1) and not correctly defining the area with 99.8% of the probability.

In problems analysts regularly deal with, the β/binomial distribution (above) or the Γ/Poisson distribution are better models of the parameter in question than the normal distribution when faced with limited data. These get heavy use in particle physics within searches for dark matter, extra-galactic neutrinos and even the Higgs.
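The efficiency example above can be checked numerically. What follows is a minimal Python sketch, not the commenter's own calculation: it compares the mean ± 3σ interval with an exact binomial (Clopper-Pearson) interval, which is one standard choice among several and an assumption of this sketch. The function names are hypothetical.

```python
import math

def binom_cdf(k, n, q):
    """P(X <= k) for X ~ Binomial(n, q), summed term by term (fine for small k)."""
    return sum(math.comb(n, j) * q**j * (1 - q)**(n - j) for j in range(k + 1))

def solve_increasing(f, target, lo=0.0, hi=1.0, iters=200):
    """Bisection for f(q) = target, where f is increasing on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def exact_efficiency_interval(n, failures, alpha):
    """Clopper-Pearson interval for the efficiency 1 - q, given the failure count."""
    k = failures
    # Smallest failure rate still compatible with the data: P(X >= k | q) = alpha/2
    q_lo = 0.0 if k == 0 else solve_increasing(
        lambda q: 1 - binom_cdf(k - 1, n, q), alpha / 2)
    # Largest failure rate still compatible with the data: P(X <= k | q) = alpha/2
    q_hi = 1.0 if k == n else solve_increasing(
        lambda q: -binom_cdf(k, n, q), -alpha / 2)
    return 1 - q_hi, 1 - q_lo

n, failures = 10**4, 1
eff = 1 - failures / n
sigma = math.sqrt(eff * (1 - eff) / n)   # binomial standard error of the efficiency
normal = (eff - 3 * sigma, eff + 3 * sigma)
exact = exact_efficiency_interval(n, failures, alpha=0.002)

print("mean +/- 3 sigma:", normal)
print("exact binomial  :", exact)
```

The Gaussian interval is symmetric and spills over 1, while the exact interval stays below 1 and extends further on the low side.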

On the subject of hypothesis tests and decision making, it seems there is a bit of a misunderstanding here. As scientists, we ask questions like "which is better supported by experimental evidence, the SM or the MSSM" all the time. The methods of the Neyman-Pearson and Bayesian frameworks allow one to answer these sorts of questions in a quantitative fashion. If the likelihood of the null (SM) is much larger than the alternative, one can safely say there is insufficient evidence the null is incorrect and that it therefore survives falsification. One is always free to go get more data or propose a better alternative model and try again.

But again, "how likely is the null true given the data" is just as reasonable and scientific a question as "how unlikely is the data given the null?"

Fisher's hypothesis tests are very useful tools, but not every situation requires a hammer. A wide variety of powerful methods have been developed in the hundred years since Neyman, Fisher and Jeffreys pondered these issues. It would be folly for a researcher to ignore them, particularly if the reason was based on the prejudices one gained as an undergraduate.

Cheers,
H\mu\nu

18. Dear H\mu\nu,

if you only observe 1 event and you want to state what is the exact expected number of events, it's clear that the error margin is of order 100%. It could have been almost 0 (p below one with probability p or so), it could have been one, but it could have been many, too.

Obviously, the normal distribution isn't good for this situation because the calculable 1-sigma interval is "nonlinear" and contains numbers for which the nonlinearities (such as forbidden negative values of the number of events) and quantization (the number is integer) are very important.

If you know it's a Poisson process, it's straightforward to calculate a better distribution based on that but it won't really help you with the generic fact that when only 1 event exists, then the statistics available for your empirical data is inadequate for most quantitative tests. Unless the event was predicted to occur with some "really impossible", tiny probability, you can't say much if you only observe one event.

Talking about extremely complex new distributions seems like an excessive case of "scientism" in the pejorative sense, if I use another recent topic. The detection of one event means that the true number of events is pretty much unknown - one may only eliminate hypotheses that predict 0 almost certainly and one may exclude very large numbers of events, too - but otherwise whether the right number is 1 or 3 is pretty much unknown, regardless of your attempts to replace the normal distribution by a better one.

Cheers
LM
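The single-event case in the comment above can be made concrete with a short Python sketch. It is an illustration that assumes a pure Poisson process with no background, not a calculation from the original discussion. It finds the range of true means μ for which observing n = 1 does not fall in either 16% tail (the analogue of a 1-sigma interval) and prints how likely exactly one event is for a few values of μ. The helper names are hypothetical.

```python
import math

def poisson_cdf(n, mu):
    """P(X <= n) for X ~ Poisson(mu)."""
    term, total = math.exp(-mu), 0.0
    for j in range(n + 1):
        total += term
        term *= mu / (j + 1)
    return total

def solve_increasing(f, target, lo, hi, iters=200):
    """Bisection for f(mu) = target, where f is increasing on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def central_interval(n_obs, cl=0.6827, mu_max=100.0):
    """Central interval for the Poisson mean given n_obs observed events."""
    tail = (1 - cl) / 2
    # Lower edge: P(X >= n_obs | mu) = tail
    lo = 0.0 if n_obs == 0 else solve_increasing(
        lambda mu: 1 - poisson_cdf(n_obs - 1, mu), tail, 0.0, mu_max)
    # Upper edge: P(X <= n_obs | mu) = tail
    hi = solve_increasing(lambda mu: -poisson_cdf(n_obs, mu), -tail, 0.0, mu_max)
    return lo, hi

mu_lo, mu_hi = central_interval(1)
print(f"68% central interval for mu given 1 event: ({mu_lo:.3f}, {mu_hi:.3f})")

for mu in (0.5, 1.0, 3.0):
    print(f"mu = {mu}: P(exactly 1 event) = {mu * math.exp(-mu):.3f}")
```

The interval runs from roughly 0.17 to 3.3, so a true mean of 1 and a true mean of 3 are both compatible with a single observed event, whichever distribution is used to describe it.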

19. So, were the deviant cats for Schrodinger's birthday? Or are they not there?

20. An example of a discovery declared and accepted by the community with one event (1) is the discovery of the omega-. "At left is a sketch of the bubble chamber photograph in which the omega-minus baryon was discovered." Summarized in http://hyperphysics.phy-astr.gsu.edu/hbase/particles/omega.html .

They went with the excellent probability of the chi-square fit to the hypothesis and the fact that it was expected to complete the baryon decuplet. They did not calculate a statistical probability for it to be background; it was assumed to be very small, as for all bubble chamber photos: when one can see the interaction points, and non-interacting beam particles can be counted as they come in, it is not possible to imagine confusion with other events.

Even in the dirty environment of the LHC, if something very unusual appeared in both experiments nobody would be waiting for five sigma. Example: suppose that one charged track enters the electromagnetic calorimeter and bursts into a shower of tracks, electromagnetic and hadron-like, with a mass of 100 GeV. Even ten of these in each experiment and a discovery would be in the news.

21. There is some logic to rethink whether the statistical errors we calculate are relevant to the problem at hand.

In the case of the omega minus discovery it was not only the probability of the chi-square fit that made it indisputable. It was the very small probability that it could be background, because in a bubble chamber the background for individual events is minuscule.

If one would qualify the five sigma rule one should qualify it on the side of the background. When the background is 100 events and the signal is composed of ten events on top, the statistics are absolutely significant. If the background is 0 ± 0.0000001 events, then even one event cannot be statistical background; one has to look for systematic errors.
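The two background scenarios can be compared with a rough Python sketch. It assumes a simple counting experiment with a precisely known Poisson background and no systematic errors, which is a simplification and not how a real analysis is done. `poisson_tail` and `z_score` are hypothetical helper names.

```python
import math
from statistics import NormalDist

def poisson_tail(n_obs, b):
    """P(X >= n_obs) for X ~ Poisson(b): chance that background alone gives n_obs or more."""
    term, below = math.exp(-b), 0.0
    for j in range(n_obs):
        below += term
        term *= b / (j + 1)
    return 1.0 - below

def z_score(p):
    """One-sided Gaussian significance equivalent to the tail probability p."""
    return -NormalDist().inv_cdf(p)

# 10 signal events on top of 100 background events
p_big = poisson_tail(110, 100.0)
# a single event when the expected background is 1e-7 events
p_single = poisson_tail(1, 1e-7)

print(f"110 observed, background 100 : p = {p_big:.3g}, Z = {z_score(p_big):.2f}")
print(f"1 observed, background 1e-7  : p = {p_single:.3g}, Z = {z_score(p_single):.2f}")
```

Under these assumptions, the single event on a nearly vanishing background already exceeds five sigma, while the ten extra events on a background of 100 amount to an excess of about one sigma.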

In everyday life, if an alien landed in Times Square, would there be any meaning in asking for a five sigma statistical existence of the alien?

22. I don't understand your arguments at all.

The strength of the background is surely reflected in the calculation of how many sigmas we have, isn't it? When done properly and when the normal distribution is OK, 5-sigma is the very same thing as 99.9999% confidence level and it's the same confidence, the same risk of a false discovery, in every situation, regardless of the size of the background.

LM
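The correspondence between a number of sigmas and a confidence level can be checked with a few lines of Python. This sketch assumes an exactly normal distribution and uses only the standard library; `one_sided_p` is a hypothetical helper name.

```python
import math

def one_sided_p(z):
    """P(X > z) for a standard normal X."""
    return 0.5 * math.erfc(z / math.sqrt(2))

for z in (1, 2, 3, 5):
    p1 = one_sided_p(z)
    p2 = 2 * p1  # two-sided tail probability
    print(f"{z} sigma: one-sided p = {p1:.3g}, "
          f"two-sided confidence = {100 * (1 - p2):.5f}%")
```

For five sigma the one-sided tail probability is about 2.9 × 10⁻⁷, so both the one-sided and the two-sided conventions round to the 99.9999% quoted above.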