Friday, July 22, 2011

Using all information from particle collisions

A maximally accurate Bayesian evaluation of hypotheses in particle physics

Experiments in particle physics usually repeat the same collisions - like the 3.5+3.5 TeV proton-proton collisions at the LHC - and study the final state. So far, each major LHC detector has recorded about \( C = \) 100 trillion of such collisions.

For each collision, \( i=1,2,\dots,C \), there are \(N_i\) particles in the final state. Each of the particles carries some 3-momentum \(\vec p_{i,j}\) where \(i\) is the subscript denoting the collision and \( j=1,2,\dots,N_i \) identifies the particle in the final state. The three-dimensional vector knows about the energy and direction (two angles) of each particle in each collision's final state.

On the other hand, we have \( M \) hypotheses \(H_1,H_2,\dots,H_M\) that compete to describe the data. The subscript \(k\) in \(H_k\) may hide the information about qualitatively different quantum field theories but also about its parameters. Let's only consider some of their values. For example, some hypotheses may represent the Standard Model with the Higgs mass between 100 and 300 GeV with a spacing equal to 1 GeV. But there can be many other theories, too.

We want to know which hypothesis - and which values of the parameters such as the Higgs mass - are favored by the \(C\) collisions. How do we do it?

Ad hoc binning

The usual practice is to divide the collisions into qualitatively different groups, apply various Yes/No filters, separate the collisions of the desired type into bins in which energies etc. belong to various intervals.

All these choices - filters, size of bins, and so on - depend on our choices. Some of them will allow us to extract more information (i.e. more quickly discriminate between the hypotheses) and some of them will allow us to extract less information. But you must surely feel that this "craft" of inventing filters and choosing the right bins is artificial. It's an art, all the discreteness is bogus, and there must surely exist a canonical way to find out which value of the parameters is favored, right?

Moreover, the collisions contain lots of information that we might be neglecting. Among 50 rare collisions, one hypothesis may "slightly" favor collisions with no extra jets. So events with jets still support the hypothesis but not too much. The same thing holds not only for the number of jets but also for relative angles between the particles and many other kinds of subtle information that the collisions contain. Playful physicists may invent various criteria that allow them to learn some things. But we must ask:

Isn't there a better, canonical way to incorporate all the information from all the collisions so that you will be able to pinpoint the right hypothesis and the right value of the parameters with a minimum number of collisions \(C\)? Can't all the artificial discrete filtering, vetoing, and binning be avoided altogether?

You bet.

And the right way to evaluate the validity of the hypothesis \(H_k\) is extremely natural and sort of obvious. You just calculate the probability that the \(C\) collisions you have observed are predicted according to the hypothesis \(H_k\). Probability is a continuous quantity and the selection process for the events becomes "continuous" - the probability counting allows you to punish a hypothesis for an event, highlight it, or anything in between.

Probability of all the collisions

How do you do that? It's simple. First, consider the \(i\)-th collision. You have some final particles. Each theory \(H_k\) allows you to calculate the probability of the observed final state. But because the final state is continuous - the final momenta are real 3-vectors - we must write the differential probability:
\[ dP_{i,k} = \rho_k (p_{i,1},p_{i,2},\dots,p_{i,N_i}) \quad\prod_{q=1}^{N_i} d^3 p_{i,q} \] The probability density \(\rho_k\) on the right hand side is calculated from the hypothesis \(H_k\) as the squared absolute value of the scattering amplitude that depends on all the momenta - including all the partons distribution functions and kinematic factors. At any rate, it's calculable.

The funny thing is that you may take the product of these infinitesimal probabilities over all collisions. In this case, you get the overall probability that \(H_k\) predicted all those \(C=\)100 trillion collisions you started with:
\[ dP_k = \prod_{i=1}^C dP_{i,k} \] It's simply a product over all the collisions labeled by \(i\). In effect, this expression is some kind of density: it contains the product of all the probability densities
\[ \rho_k (p_{i,1},p_{i,2},\dots,p_{i,N_i}) \] over \(i\) but it also includes the factor of \(d^3 p_{i,q}\) multiplied over all allowed values of \(i\) and \(q\). The density - the product of the \(\rho\) objects - is dimensionful and it has a huge dimension, something like mass to the power of minus 3 times the number of all particles that have ever appeared in the final states during the \(C\) collisions. It looks scary but it's not too difficult.

Of course, you shouldn't compare apples and oranges, so you may only compare the quantities \(dP_k\) with the same units. But it's OK because the dimension of \(dP_k\) only depends on the number of particles in the actual observed collisions: it doesn't depend on the hypothesis index \(k\). So the ratios \(dP_{k} / dP_{k'}\) are completely meaningful and independent of units and they really store the information about the relative likelihoods of hypotheses \(H_k\) and \(H_{k'}\).

According to Bayes' theorem, the posterior probability of \(H_k\) is given by the prior probability \(P(H_k)\) times the probability \(P(c_1\dots c_C|H_k)\) that \(H_k\) predicts the right final states of all the collisions. (Yes, all the probabilities must be normalized so that the total probability of all hypotheses equals one - but this normalization doesn't affect the ratios of probabilities.) And this posterior probability is exactly given by \(dP_k\) above. Don't forget that the probability is really a density in a quadrillion-dimensional multi-particle momentum space.

The product of the \(\rho\) factors may obviously be converted to the exponential of a sum:
\[ \log\left( \frac{dP_k }{ \mu(p_{i,q})}\right) = \sum_{i=1}^C \log \rho_k (p_{i,1},p_{i,2},\dots,p_{i,N_i}) \] So the logarithm of the (unnormalized) probability that a hypothesis, \(H_k\), agrees with the results of all the collisions may be written as a sum of a number over all collisions. And this number from each collision is nothing else than the logarithm of the cross section predicted for the right species of the particles and the observed values of the final momenta in a given collision (plus logarithms of kinematic factors that are shared by all the hypotheses, so we may neglect them, much like we may ignore the measure \(\mu\) on the left hand side).

It may not be instantly obvious (if you're used to think about the calculations in terms of many man-made algorithms) but this simple formula "knows" about all the statistically predicted features of the data. And it extracts as much information from the collisions as is possible.

For example, if 30% of the collisions satisfy a given condition, \(R\), the formula for \(\log dP_k\) will automatically favor hypotheses \(H_k\) that predict that approximately 30% of the collisions should obey the condition \(R\) - with some error margin, and the strength of the punishment will be exactly as severe as required. It's because the combinatorial factors will correctly combine with the factors of \(\rho\) to give you - as clear from a saddle point evaluation - a peak for the theories that have the right percentage.

However, the formula may fail if you only apply it to some non-randomly selected subset of events. It's actually crucial to include all collisions that have taken place. You may try to approximate the product of the \(\rho\) cross-section-like factors by various tricks but you should never omit them - unless all the hypotheses predict exactly the same probability of a given collision, in which case these events may be ignored, indeed.

The probability densities \(\rho\) in some ordinary units differ from one at least by a few dozen orders of magnitude, so \(\log\rho\) is of order one - if you strip \(\log\) of the measure. It follows that the sum of \(\log\rho\) over the 100 trillion collisions will be of order 100 trillion.

Note that this sum of order 100 trillion has to be exponentiated to get the actual probability. It's obviously very important that your hypotheses predict almost the same thing for almost all the collisions. If two of them predicted substantially different things for 1% of the collisions, the ratio of probabilities of these two theories would be of order \(\exp(1\,\,{\rm trillion})\) which is huge, indeed. This is also a hint that you may really need just a very small number of critical collisions to substantially discriminate between your hypotheses.

Of course, all these things describe well-known objects in statistics and similar concepts are being used in particle physics. For example, there's an article on TRF about the units of evidence and particle physicists and others use a related "LLR", the log likelihood ratio.

However, as far as I know, no one has tried to calculate the probability of all the 100 trillion collisions, including all their detailed features, according to a given hypothesis. It could turn out to be doable and the method could extract several times more information from the data than what is being currently done - because many features of the data, including angular information, number of jets etc., are just being completely neglected. Even properties of the data that are known and looked at are not treated optimally if people just divide them by arbitrary Yes/No criteria.

On a sunny day in the future, the calculation of the overall probability of all collisions could replace all the ad hoc methods to draw graphs and manually search for bumps etc. Maybe with this method, 1/fb of the CMS or ATLAS data already contains as much information as 15/fb of the data that are being treated by ad hoc, artificially invented, discrete, sub-optimal techniques.

Of course, the method has a lot of difficult parts. One needs to be able to evaluate the cross sections rather accurately - especially to be sure that the boring data are treated in the same way by all hypotheses etc. Also, one needs to know what to do with jets - whether a final state is first reclassified in terms of jets as final particles, or whether one tries to substitute all the individual hadrons and requires the hypothesis \(H_k\) to calculate the probability for the hadrons manually.

There are undoubtedly lots of computational and numerical hurdles. But the benefits could be high enough so that it could make sense to reformulate all the statistical calculations simply in terms of the calculation of probabilities that a given hypothesis predicts the right final state of all 100 trillion collisions. (In reality, the relevant information discriminating the hypotheses will only come from a subset of the "interesting collisions", just like you're used to.)

One should admit one more drawback of the method: if your computer quantifies the probability of a given theory "automatically" and you don't try to see any graphs in which you may manually find bumps and deviations, you will not have any feeling "why" a given hypothesis ended up being much more likely or much less likely than another one. But maybe even this thing could be automatized a little bit. However, there's no "unique answer" to such questions. You may only invent a couple of possible answers and a computer may tell you which one is the best one.

1 comment:

  1. I'm lost

    But there can be many other theories, too.....uhau

    An admission of many many theoretical worlds?