Tuesday, June 10, 2014

Basics of the ATLAS contest

Update 6/15: After several days, I returned to top three out of the 656 competitors (or teams). 3.74428 would be enough to lead a week ago but times are changing. We are dangerously approaching the 3.8 territory at which I am likely to lose a $100 bet that the final score won't surpass 3.8, and I am contributing to this potential loss myself. ;-)
...and some relativistic kinematics and statistics...

In the ATLAS machine learning contest, somebody jumped above me yesterday so I am at the fourth place (out of nearly 600 athletes) right now. Mathieu Cliche made Dorigo's kind article about me (yes, some lying anti-Lumo human trash has instantly and inevitably joined the comments) a little bit less justifiable. The leader's advantage is 0.02 relatively to my score. I actually believe that up to 0.1 or so may easily change by flukes so the first top ten if not top hundred could be in a statistical tie – which means that the final score, using a different part of the dataset, may bring anyone from the group to the top.

(Correction in the evening. It's the fifth place now, BlackMagic got an incredible 3.76 or so. I am close to giving up because the standard deviation in the final score is about 0.04, I was told.)

I have both "experimental" and theoretical reasons to think that 0.1 score difference may be noise. Please skip this paragraph if it becomes too technical. Concerning the "experimental" case, well, I have run several modified versions of my code which were extremely similar to my near-record at AMS=3.709 but which seemed locally better, faster, less overfitted. The expected improvement of the score was up to 0.05 but instead, I got 0.15 deterioration. Concerning the theoretical case, I believe that there may be around 5,000 false negatives among the 80,000+ or so (out of 550,000) that the leaders like me are probably labeling as "signal". The root mean square deviation for 5,000 is \(\sqrt{5,000}\sim 70\) so statistically, \(5,000\) really means \(5,000\pm 70\) which is \(1.5\%\). That translates to almost \(1\%\) error in \(\sqrt{b}\) i.e. \(1\%\) error in \(s/\sqrt{b}\) (the quantity \(s\) probably has a much smaller relative statistical error because it's taken from the 75,000 base) which is 0.04 difference in the score.

It may be a good time to try to review some basics of the contest. Because the contest is extremely close to what the statisticians among the experimental particle physicists are doing (it's likely that any programming breakthrough you would make would be directly applicable), this review is also a review of basic particle physics and special relativity.

The basic purpose of the content is simple to state. It combines particle physics with machine learning, something that computer programmers focusing on statistics and data mining know very well and that is arguably more important for you to win than particle physics. (A huge majority of the contestants are recidivists and mass Kagglers, often earning huge salaries in data-mining departments of banks and corporations. Some of the "similar people" came from experimental particle physics but it's less than 10% so I estimate that 5% of the world's "best statistical programmers" of this kind are working in experimental particle physics.) You download the data, especially two large files, "training" and "test".

The training file contains 250,000 events with weights (they are really telling you "how many times each event should be copied" for them to cover all the possibilities with the right measure, sort of). Each event is labeled as "s" (signal) or "b" (background). You and your computer are being told 250,000 times: look at this event, baby, and remember and learn and learn and learn, events like this (whatever it means) are "s" or events like that are "b". Then you are asked 550,000 times whether another event is "b" or "s" and you must make as many correct guesses as possible. It's easy, isn't it?

Well, your score isn't really the absolute number of correct guesses or the absolute number of wrong guesses or the ratio. None of these "simple" quantities is behaving nicely. Instead, your score is essentially\[

{\rm AMS}_2 = \frac{s}{\sqrt{b}}

\] It's the "signal over the square root of noise".

(The exact formula is more complicated and involves logarithms but I believe that the difference between the simplified formula above and the exact one doesn't impact any participant at all – and you couldn't even find out which one is being used "experimentally" – except that the exact formula with the logarithms is telling you to avoid tricks to try to guess just a few "s" events and make \(b=0\). That could give you an infinite \({\rm AMS}_2\) score but the regulated formula with the logarithms would punish you by effectively adding some false negatives, anyway. The exact formula reduces to my approximate one in the \(b\gg s\) limit which you can check by a Taylor expansion up to the second order. Check this PDF file with the detailed technical paper-like documentation for the contest which is an extended version of this blog post.)

The signal \(s\) is the number of true positives – events that you label as "s" and they are indeed "s"; the background \(b\) is the number of false positives – events that you label as "s" but they are disappointingly "b". More precisely, we are not counting the events as "one". Each event contributes to \(s\) or to \(b\) according to its weight. The weights of the training b-events are about \(2-5\) or so while the weights of the training \(s\) events are about \(0.001-0.01\) or so, i.e. two or three orders of magnitude smaller. These asymmetric weights are why the actual numbers \(s,b\) substituted to the score formula obey \(s\ll b\).

Off-topic: Konya in Central Turkey is as large as Prague, with its 1.1 million people, but it could afford to buy the Prague-15T-model-based streetcars with some extra bold design and some generous air-conditioning because they have enough energy, it seems (not to mention money for free Wi-Fi in each wagon). The passengers seem excited so far and yes, the streetcars were made in my hometown of Pilsen. Our city may be earning more from the trams (Prague, Bratislava, Miskolc in Hungary, license for China) than from beer.

The events that you label as \(b\) (false and true negatives) don't affect your score. Your task is really to find a region of the parameter space that is signal-rich. You identify everything outside this region as "background" and even though there may be some "signal" events over there, you are only punished for excluding them by having a smaller \(s\) than you could have otherwise. The exclusion of the events labeled as "b" has a simple justification: you could "extend" the region of candidate events so that it would contain totally different events that are "certainly" not "s". You would correctly guess that they are "b" but the number of such correct "b" guesses would depend on how much you would extend the candidate region. However, the score calculated from the "compact" events labeled as "s" is independent of the size of the candidate region as long as it contains almost all good "s" candidates (as long as it is large enough).

(Even though there naively seems to be another way to make the large "distant background" inconsequential for the score, you couldn't define a nicely behaving score that would only depend on the actual s-events – how many of the events known to be "s" to the organizers the contestant has rated as "s" or "b". The contestant could simply label all events as "s" to maximize this stupid score: there would be no "b" mistakes in this set. Such a cheap single-label strategy doesn't help him to maximize the score defined from the events he chose to be "s".)

Train to distinguish events

I must finally tell you what are the "objects" that you have to classify as "b" or "s". They are events i.e. individual proton-proton collisions. In the terminology of the machine learning experts, each event i.e. each collision is known as an "instance". So you get 250,000 training instances (with the known weights and the known b/s classification) and 550,000 (con)test instances (both weights and b/s are unknown for these). So these 250,000 instances are used to train your computer program to distinguish b/s. After your program has been trained in some way, it should offer its verdicts about the 550,000 con(test) instances.

By drawing various histograms, you may check that the relative distribution of pretty much any quantity in the 250,000 training events and the 550,000 test events is the same. So they were pretty clearly generated from the same uniform dataset of 800,000 events. Some of the 800,000 events were used for the training file of 250,000 collisions; others were reserved for the 550,000 collisions defining the (con)test.

We're told that the unknown weights of the 550,000 (con)test instances are equally distributed as the known weights of the 250,000 instances. The only difference is some overall rescaling of these weights that guarantees that the \({\rm AMS}\) score is expected to be the same.

Moreover, you are not rated according to the full sample of 550,000 test collisions. The preliminary score is only calculated from 100,000 (i.e. 18% or so) of them. The final score will be computed from your \(s\) and \(b\) in the remaining \(450,000\) (i.e. 82% or so) events. This may permute the contestants a little bit (or significantly, as I suggested at the beginning) and add some unexpected noise. I think it is a clever "twist" to the rules of the contest because you could take the strategy of trying to get a preliminary score that is high by chance. This may help you in the preliminary leaderboard but not in the final results because if you happen to get a high preliminary score calculated from the 100,000 "public" events purely by chance, this chance will probably not be repeated for the 450,000 "private" events used to calculate the final leaderboard in September.

So this is not supposed to be a game of luck. You only increase your chances if your improvements to the program have a good underlying reason to be "real improvements" of the principle. Improvements that are due to good luck won't help you – and they are discouraged by the way how the score is being calculated. However, good luck may still help some random people in September – but the idea is that good luck will be less important in September because the sample 450,000 events used to calculated the winners is larger and therefore has smaller relative fluctuations – less room for luck.

What are the events?

I have mentioned that each event carries a classification b/s (background or signal) and some weight. These are only known to the contestants for the 250,000 training events. However, both training events and the (con)test events contain 30 additional numbers – not quite independent from each other. One may say that essentially all of them are some functions of the 4-momenta \(p^\mu\) of 3-5 objects caught by the ATLAS detector (well, the collisions are only simulated but their 4-momenta's distributions are really, really close to the real ones; because they may simulate the background and background+signal separately, they know "exactly" which events are "b" and which events are "s" if they mix the "background" sample with some sample that "modifies b to b+s"):
a lepton coming from a tau decay, another reconstructed hadronically decaying tau, missing energy, leading jet, subleading jet
That's five 4-momenta. Each event contains 0,1,2 or 3 jets. Events with 4+ jets were manually eliminated and they can't occur. If there is only 1 jet, various quantities depending on the subleading jet (or at least two jets) are labeled as undefined, by the number \(-999.0\) that can't occur naturally. If there is no jet, all quantities that require at least one jet (except for one quantity that has been defined to be zero) are set to the undefined value, \(-999.0\), again.

So you see that some undefined entries among the 30 entries per event, \(-999.0\), appear because the event has fewer than two or fewer than one jet. You may rather easily find this correlation by yourself. (Most of the Kaggle recidivists who have a chance to win can do such things easily. If it's hard for you to find such correlations in the data or if you will continue to have no clue what to do with the "undefined" entries, and be sure that logical reasoning should be enough for you to decide what you may do although experience may be helpful, it's likely that you have no chance to win.) Aside from the absence of (at least one or two) jets, another reason for the appearance of the undefined \(-999.0\) value is that in the entry for the "reconstructed rest mass of the Higgs", some events seem to be so far from the Higgs-like events that it doesn't even make sense to calculate the reconstructed Higgs mass event from them. So they put \(-999.0\) over there, too.

OK, I said you that in each event, the ATLAS detector detects and tells you about 3-5 spacetime momenta:\[

p^\mu_{\rm lep}, p^\mu_{\rm tau}, p^\mu_{\rm MET}, p^\mu_{\text{leading jet}}, p^\mu_{\text{subleading jet}}

\] Again, the interesting process that sometimes takes place is\[

p+ p \to h+{\rm trash} \to \tau^+\tau^- + {\rm garbage}

\] where one tau is later decaying to leptons and another tau is decaying to hadrons. Garbage is similar to trash but may contain some new entries. That's how I understand it. The four-momenta of all these players (lepton, tau, each jet) obey\[

p^\mu p_\mu = 0.

\] It should be equal to \(m^2\), the squared mass of the particle (a quark or a gluon, in the jet's case) but this squared mass may be neglected relatively to the energy of the particles. Virtually all components of the 4-momenta in the events are (much) larger than \(5\GeV\) or so while the rest masses of the particles are much lower (the bottom quark would be an exception but it's so rare that you may assume that there are no bottom quarks in these 800,000 events; the charm quark is already OK within a \(\GeV\)).

So the LHC is really working in the "ultrarelativistic regime" – all the energies and momenta are much higher than the rest masses which means that the speed of almost all elementary particles in almost all these events is extremely close to the speed of light! Extreme special relativistic events are indeed the business-as-usual at high-energy particle accelerators such as the LHC. Special relativity is the LHC folks' bread and butter. They are the routine engineers seeing the \(v\sim c\) effects of relativity millions of times each second.

As you can imagine, this masslessness assumption simplifies the situation. Also, you are never told about the charges of the particles so you can't distinguish the final state \(\ket\psi\) from its charge-conjugated state \(C\ket\psi\). If you add the assumption that the events don't really see the third-generation quarks, it follows that they are not sensitive to the \(CP\) violation in the Standard Model (the phase in the CKM matrix). Because the events don't feel the \(CP\) violation and you are not told about the difference between events and their \(C\) conjugation, you may also assume the \(P\) symmetry, too.

Well, you may also assume the rotational symmetry except that it seems that there is a systematic dependence on the direction in the particle detector (a bias that may have been artificially inserted to the Monte Carlo-generated dataset!) which is why throwing away this information might be a bad idea.

I said that the elementary particles' momenta obey \(p^\mu p_\mu=0\) within a good approximation. A momentum vector, \(MET\) (missing transverse energy), is an exception. In fact, we don't know the full 4-vector \(p^\mu_{\rm MET}\). The missing transverse energy includes the momenta and energy of the particles that are almost invisible and remain undetected by the detector (neutrinos or, more excitingly, neutralinos and other dark matter candidates). This missing transverse energy is nothing else than the calculated "violation of the 4-momentum conservation law". One adds the total energy-momentum of the particles in the final state and subtracts those of the initial state.

This missing transverse energy 4-vector has no reason to be null, i.e. to obey \(p^\mu p_\mu=0\), because it isn't the 4-momentum of a single elementary particle. (Well, at least if there are at least two neutrinos, it isn't.) Moreover, we can't really measure the "longitudinal" component of the missing energy, either (that's why we add the adjective "transverse": the "longitudinal" component along the direction of the proton beams is unknown). We can't measure it because in this direction, the two protons at the LHC carry a huge energy and momentum, \(4\TeV\) per beam, and most of it continues to fly in the form of some hadrons in the nearly same direction. A few \(\GeV\) could be missing but we simply can't measure this deviation in the direction of the proton beams with this great relative accuracy – especially because the detectors can't have a pixel exactly in the direction where the protons need to be flying. So only 2 out of 4 components of the missing energy-momentum 4-vector, the components \(p^x_{\rm MET},p^y_{\rm MET}\), are known, and they are known as the missing transverse energy. Quantitatively speaking, the "missing transverse energy" is the length of this 2-vector. We are also told about its \(\phi\) direction.

Quite generally, the momentum 3-vectors are separated to the transverse \(x,y\) part that are expressed in the \(r,\phi\) polar coordinates (the total momentum – radius – and an azimuthal angle); and the longitudinal \(z\) component that is usually expressed in terms of the pseudorapidity. The pseudorapidity \(\eta\) (eta) is nothing else than the rapidity in the \(tz\)-plane only (\(z\) is the direction of the proton beams). And rapidity is the "hyperbolic" counterpart of an angle in the case of the Minkowski geometry. So just like the \((p_x,p_y)\) components are given by \((|\vec p| \cos\phi,|\vec p|\sin\phi)\), the \((E,p_z)=(p_t,p_z)\) components are given by \((|\vec p| \cosh\eta,|\vec p|\sinh\eta)\) where \(\eta\) is the pseudorapidity, the imaginary/hyperbolic angle "rotating" (boosting) the \(t,z\) axes into one another. Note that the 4-vector I just defined is null (I could write the generalized formulae for the case of a nonzero mass, too) because in the \(p_\mu p^\mu\), the factor \(|\vec p|^2\) is everywhere and because \[

\cosh^2\eta - \sinh^2\eta - \cos^2\phi-\sin^2\phi = 1 - 1 = 0.

\] Also note that the \(tz\)-boosts commute with the \(xy\)-rotations (around the \(z\) axis).

In the training and (con)test datasets, each event is represented by 30 real numbers. One of them is an integer – the number of jets (jets with too low energies are not jets, by definition; jets flying in too similar directions are unified into one jet, by definition). The others are effectively the 3-5 (depending on the number of jets) spacetime momenta in some spherical coordinates (the energy doesn't have to be specified, it's effectively \(E\approx |\vec p|c\) because of the masslessness). Note that you need 3 numbers to specify a massless momentum \(\vec p\). The longitudinal component of MET is non-existent so you have up to \(3\times 5-1=14\) pieces of "really raw data" in each event. They are labeled "PRI" for "primary". One parameter is the number of jets.

The 30 features of an event, along with the weight and the classification (the latter two are only known publicly for the training events). The 33rd feature numbered as "0" at the beginning is the event ID – simply 100,000-349,000 for the training events and 350,000-899,999 for the (con)test events

The remaining 15 (recall \(14+1+15=30\)) "derived" quantities (labeled as "DER") conveniently calculated from the \(14+1\) raw numbers – and perhaps from the 4-momentum of the third jet which is not reported separately in the (small fraction of) events that have exactly 3 jets (as I said, more than 3-jet events were eliminated). Well, out of these 15 "derived" quantities, two of them are actually labeled as "PRI" (met sumet, jet all pt) but that's a matter of terminology. The remaining 13 "purely derived" quantities are various invariants – invariant masses computed from some subsets of the particles, absolute values of some momenta, centrality ratios of various types, and so on. Check the appendix B of the technical documentation if you really want to know what these invariants are.

To summarize, each event – each instance – is a point in a 30-dimensional space \(\RR^{30}\) – or, if you want to remember that one of the parameters is the integer number of jets, it's \(\RR^{29}\times \{0,1,2,3\}\). However, because not all quantities are really independent from each other, the points are aligned in some \(14-17\) or so dimensional subspace, and points describing one-jet or jet-free events really live in an even lower-dimensional submanifold of that \(\RR^{30}\). The redundant invariants may be helpful for you and your software to "see the patterns" more quickly. But it is not guaranteed that they help. A good computer program may see the patterns even in the raw data and doesn't need the redundancy. There are millions of technical questions on whether some extra derived data or redundancies are helpful for the machine-learning software to do the job really well.

Why is there the square root of the background

As a contestant, you are labeling each of the 550,000 (con)test events either as a "b" or an "s". Your program typically starts by assigning a "continuous score" or "probability" to each event, and you choose the \(N_s\) events with the highest probability or score as "s".

The training file contains 250,000 events and about 1/3 of them are "s". You may easily count them. The (con)test file seems to be prepared from the same uniform package of 800,000 events – because they have the same shapes of histograms of pretty much everything which can't occur by chance – so you may be pretty sure that about 1/3 of the 550,000 of the (con)test events are signal, too! (Histograms mentioned in the previous sentence and used as evidence are histograms assigning the same weight – one – to each event. You can't verify the identical shapes of the histograms with the right weights because you don't know the weights of the (con)test events. If you knew them, you could separate b and s exactly. "b" events have 100-1,000 times greater weight than the "s" events LOL.)

But forget about being able to exactly guess which 1/3 of the events are "s". You have no chance to do that because many "b" events are almost identical to "s" events; they live in regions of the space \(\RR^{30}\) above where the number of b-events and s-events are comparable so there's just no way to tell "b" and "s" apart over there! In reality, you must remember that you want to maximize the \({\rm AMS}\) score. So you find out that there is some "sweet spot" in between zero and 1/3 for the percentage of the events that you want to classify as "s". Find this percentage e.g. experimentally by trying to maximize the actual score returned by the Kaggle server – or by maximizing your estimate of the score calculated from the training data only. I won't tell you the optimum percentage especially because I don't know it – and no one knows it. Its exact value isn't quite well-defined and will depend on the idiosyncrasies of your algorithm. Still, all the leaders are likely to use the percentage of "s" events in the same ballpark.

OK, so you choose some percentage you have decided to be the best one – say 20 percent – as "s". By this decision, you have effectively drawn a boundary in the space \(\RR^{30}\) defining the events. The events "inside" the boundary are labeled as "s" by you, it's the region \(S\); the events outside are labeled as "b", it's the region \(B\). That's what you must have done if you have attributed the labels b/s and your algorithm to do so had some rational justification (if you assigned the labels randomly, the region of "s" inside \(\RR^{30}\) is highly disconnected or ill-defined).

Finally, I can tell you why the score is calculated as\[

{\rm AMS}_2 = \frac{s}{\sqrt{b}}

\] You have the region in \(\RR^{30}\) that you called \(S\) and where you classify the events as s-events. If this is a good decision, \(S\) is a region rich in s-events, according to some measure, right? But it will still contain some background or b-events, too. It's predicted by the Standard Model and any realistic theory that pretty much any "signal" event may be emulated by the uninteresting (in this case, Higgs-less) "background" events with a nonzero probability. If you run the LHC for some period of time, \(N_b\) background events will make it to the region \(S\) of the \(\RR^{30}\) multi-momentum space. However, by the usual statistics of the random walk etc., or the Poisson distribution, the number \(N_b\) cannot be predicted accurately. There will be a statistical noise which turns this quantity to\[

N_b \to N_b \pm \sqrt{N_b}

\] The error margin is the standard deviation. For \(N_b\gg 1\), which is almost always satisfied, the distribution of \(N_b\) may be assumed to be Gaussian – at least if you ignore the "extremely unlikely" events in the tails.

If you want to announce a discovery of a new particle – or, in this modest case, just the Higgs particle decaying in a particular way – you must only work with the total number of events \(N\) that the ATLAS detector measures in your signal region \(S\). It's the region such that in your contest, you are eager to classify all the events in this region as s-events. You only measure the total number of events \(N\) in \(S\). The detector doesn't tell you which of the collisions were signal and which of them are noise.

(In fact, both processes interfere and I am always a bit worried whether the experimenters correctly account for the quantum interference between the higgsful and higgsless histories. But because they mention that some weights are sometimes negative in their "real deal" calculations, a slight difference from the contest, I am inclined to believe that they can't be doing the stupid error of denying a basic feature of quantum mechanics, the interference between different histories. One first sums the probability amplitudes and then squares the absolute value of the result. If you first compute the probabilities and then you square them, you are using the wrong classical intuition that neglects the mixed, interference term that are characteristic for quantum mechanics. Even in the correct QM formulae, you may interpret the difference of b+s probabilities and b-only probabilities as "a probability" that may be interpreted classically – as long as you realize that this "a probability" will contain mixed terms that will allow this "a probability" to go negative, just like you expect from the Wigner pseudodistribution and from the negative interference.)

OK, so you know that the background-only theory (the null hypothesis) predicts \[

N \sim N_b \pm \sqrt{N_b}

\] events in the region \(S\) while the prediction of the background-plus-signal theory predicts a number that is effectively higher by \(N_s\). \[

N\sim N_b + N_s \pm \sqrt{N_b}, \quad \text{assuming } N_b\gg N_s

\] Can you distinguish the null hypothesis with the new sexy hypothesis that also includes the decaying Higgs boson? Well, it depends on whether the addition of the "bold higgsful theory", \(N_s\), is sufficiently greater than the inevitable statistical error margin of the "boring null hypothesis", \(N_b\), and this error margin is \(\sqrt{N_b}\).

This exercise is nothing else than the usual counting of "sigmas" (standard deviations) of a statistical significance of an experimental insight. The number of sigmas is indeed\[

{\rm AMS}_{\rm here} = \frac{N_s}{\sqrt{N_b}}

\] You may also rescale the integers \(N_s,N_b\) by real numbers \(k,k^2\) and the ratio above stays the same. These rescaled numbers of signal-events and background-events in your preferred region \(S\) are the same quantities as the sums of the weights of the false positives (events) and the true positives (events). Up to some proportionality factors that may be chosen differently for the s-events and the b-events, the weights in the contest really are the number of real-world events that a given event in the contest represents.

At any rate, you see that \(s/\sqrt{b}\) will really measure the "number of sigmas" of an experimental claim. The ATLAS folks claim that they have normalized the weights so that these ratios are expected to agree with the significance you may get from the real 2012 run of the ATLAS experiment! So the score \(3.7\) really means that the experiment has a 3.7-sigma confidence level that it has seen a Higgs boson that decays into a tau lepton particle-antiparticle pair where one tau decays leptonically and one tau decays hadronically. It's pretty high. Just to be sure, the Higgs decaying to \(ZZ\) or \(\gamma\gamma\) is more visible and the significance level of the existence of a Higgs from these two combined \(S\) regions might be above 10 sigma now, not sure about the exact number. It is surely above 10 sigma if the ATLAS and CMS collisions are combined. Even though Elvis Presley's haters may find this claim counterintuitive, the idea that the Higgs observations at the LHC were a statistical fluke is currently less likely than the Elvis Presley living on the Moon. One in a quadrillion or more. It's much more likely that as a powerful U.S. government official, you're able to sufficiently extend a singer's life and secretly send him to a nearby satellite.

I have already discussed that it's optimal to classify less than 1/3 of the events as "s" to optimize the score – the significance level of the observation of the decay. You could perhaps construct a "more convoluted quantity" using several more or less inclusive regions \(S\) but if there is just one region, the contest really describes the essence of everything that the ATLAS Collaboration's statisticians are doing, anyway. Moreover, around the optimum percentage of the "s" events you choose to submit, the dependence on the percentage becomes nearly non-existent (functions are nearly constant near their maxima). However, you must realize that the actual \({\rm AMS}\) score will in no way be a smooth function of the parameters – such as the percentage of events you label as "s" or other adjustable parameters of your software. Instead, the preliminary score as well as the final score will depend on these quantities non-smoothly (with the Brownian-noise-like fluctuations) and perhaps sometimes discontinuously (because your software is likely to do some Yes/No decisions whose outcome also depends at the parameters discontinuously).

It's useful to emphasize once again that all "simpler formulae for the score" that wouldn't involve square roots could probably be increased by a trick which wouldn't help the physicists. The formula similar to \(s/\sqrt{b}\) really agrees with the calculation of the "number of sigmas" that the ATLAS physicists are calculating from the real events as well as the Monte Carlo simulations. The training and test events in this contest are generated purely by Monte Carlo but it's being done so well that you couldn't easily prove that they are not real data from the ATLAS detector.

Because of the \(s/\sqrt{b}\) formula, you don't want to label too many events as s-events because the number of false positives \(b\), and even its square root, would start to be way too high. But you don't want too narrow \(S\) regions i.e. too low values of \(s\) either because while you would make \(\sqrt{b}\) smaller, you would be punished by the numerator. Moreover, the exact formula for the score deviates for small \(b\) and even more discourages those who pick too small a region \(S\). There is some optimum range in between and in this range, the ratio \(s/\sqrt{b}\) isn't really too dependent on the size of the \(S\) region, anyway. But the percentage of the "s" events you choose is, in some sense, the smallest, easiest, and last task you have to choose. The bulk of your task is to find the boundary of a signal-rich region \(S\) inside a 15- or 30-dimensional space! ;-) That contains much more information than one number – in principle, infinitely many numbers.

Strategies to fine the \(S\) region

When I started with the contest, I submitted the standardized random sample with the usual score. Then I tried to check whether I can increase the score a little bit. My first 4 "really my" attempts were based on the idea of the nearest neighbors programmed in Mathematica. (The neighbor approach says that if the distance of a (con)test event from a (nearly?) nearest training s-event is sufficiently smaller than the distance from a (nearly?) nearest b-event, then the unclassified event is labeled as an s, too. You must choose some metric and you must give up the idea that for each of the 550,000 contest events, you calculate the distance from each of the 250,000 training events, because a million of millions is a trillion, a large number. Too much CPU time.) Several times, I made an error because I was assuming that the Mathematica function Ordering[x] is producing the inverse permutation to the permutation than it actually does! Thankfully, Ordering[Ordering[x]] does exactly what I thought that Ordering[x] should have been doing. Yes, that's a way to write down the inverse permutation.

With the fix, I could get easily above one, even with the superfast approximation of the "nearest neighbor". But I decided that Mathematica would be too slow and I should use some standardized software, anyway. None of the easily available ones was giving me the nearest-neighbor strategy so I switched to the method that seems to be a leader right now – the gradient boosted decision trees. (However, I am currently using lots of my overhead Python plus Mathematica code that goes beyond or works outside the XGboost package.) It's decision trees because the events are separated to two subgroups by a "Yes/No" question and each question is split to two subgroups by a new question (that may be different, and so on). The "Yes/No" questions should optimally split the dataset into s-richer and b-richer parts, and do it several times.

The gradient boosting is a "more continuous" upgrade of this algorithm in which you try to approximate the probability that a point in \(\RR^{30}\) is "b" or "s" ever more accurately by trying various estimates based on averages, gradients, and so on. This boosting paradigm isn't too different from various simple tricks to interpolate a function that you know "experimentally" from its value at isolated points. (The function you are trying to interpolate here is pretty much the probability that an event near a point is "s".) I can imagine systematically "wiser", more natural techniques, but this one may work well enough.

I have doubts whether the score may be increased much more than that – and I have made a bet that the final score won't surpass 3.8. Now I realize that the deviations of the exact score formula from \(s/\sqrt{b}\) may be normalized differently for the 100,000 events and the remaining 450,000 events so I may lose the bet for a really stupid reason. But if the scores are normalized to be the same even to the subleading order, I actually believe that the final scores from the "private" sample will be lower than 3.7 (the current leaders' public preliminary score). Perhaps, you may prove me wrong!

Summary: was that comprehensible?

I am not sure at all whether this blog post is in any way clearer than the official documentation of the contest. Maybe it is clearer. Maybe it is more technical, unclear, and demanding, and will motivate some TRF readers to read the whole official documentation instead. ;-)


  1. @Lubos--wow, impressive--left you a Facebook chat.

  2. Congratulations Lubos! Really awesome. Although I think this award should not be only cash, it should imply an open door for the CERN too... many of us would be glad that our generic "humble correspondent" also would become our "humble correspondent at CERN" :-D

  3. " I am at the fourth place"
    Motl Magnus! Motl Magnus!

    Being correct carries no weight within managed research. The only trusted employee is one whose sole marketable asset is loyalty. Intelligence does not beg permission (and is rarely forgiven). Will there be a large enough fraction of minorities within the top ten finishers? Social advocacy lawsuits must be pro bono filed.

  4. Impressive! I think I'll dedicate a few nights to this contest thing.

  5. @Luboš Motl So now you've given away a little of your strategy, does this mean you're leaving the contest, having achieved third place?

  6. RobotUnicornAttackJun 10, 2014, 10:50:00 PM

    Very helpful post. I had a go on Kaggle after reading about this in the previously on TRF. It is fun playing with the "Starter Kits" they reference although I think I have learned more about getting Python to work with all these various add-ons than I have about data science. I am getting about 18% signal as the "sweet spot" (and am now only about 180 places below our humble correspondent.) Interesting to read the explanation as to why this is much lower than 30%. I also naively thought that increasing the numTrees was a surefire way of making progress, but it works for a while and then stops. Why is this? Overfitting on the training set?

  7. If you like Python be sure to check out Cython, Numpy, and Pandas.
    Python is wonderful for prototyping and flow controlling. Doing somewhat complex computations on large data sets, however, it becomes _way_ too slow in many cases. This is no reason to abandon Python, though.
    Cython is an extension module for Python allowing you to compile Python code (with minor syntactic additions like static typing) into C-code. Typical speed-ups range between 100 and 1000 times usually with trivial effort.
    Numpy is an extension module for Python adding multi-dimensional arrays with lots of functionality (like advanced slicing). It also interfaces perfectly with Cython allowing you to access the arrays with C-speed.
    Pandas is an extension module for Python adding functionality to process large data sets, serialize and manipulate them. It allows you to read and write (large) text files easily and offers functionality like (un)stacking (analogous to MS Excel's Pivot tables), filtering, aggregating, and so on. Pandas also builds on Numpy and integrates with Cython, so everything fits together.
    The described software is incredibly powerful and useful. Check it out!

  8. Hi John, I have given away only 5% of my know-how about this stuff, nothing really about my code and algorithm, but yes, I am inclined to give up after I saw BlackMagic at 3.76 at the top. He's threatening my $100 bet about 3.8 not reached, too. ;-)

    I will probably stop my daily 5 attempts unless I see some improvements in a couple of days, then I will switch to waiting for a revolutionary idea.

  9. Uncle Al, your comments are so goddamn useless!

  10. On behalf of the BICEP2 collaboration, please do not compare our critics to Nazis or joke about suicide. We of course believe in our results but are open to constructive criticism. We would in no way wish to equate professional criticism to this sad episode of history. If you have questions on the matter, please contact us directly.

  11. Dear Mr Teplý, on behalf of the citizens of the free Wester world, please be aware that in 1989, I won the freedom of speech and I won't allow some politically correct random graduate students or their bosses to trample on it. If you have some questions what the freedom of speech means, please f*ck off.

  12. Yes, I am well aware, and we support free speech, hence the "please." Just let it be known to readers that this blog and the remarks therein are not to be associated with the views and opinions of the BICEP2 collaboration.

  13. Dear Mr Teplý,

    there's no rational reason why you should be associated with – i.e. take credit for – my texts that are demonstrably and obviously my personal blog posts.

    But if there are some people who like demagogy and similar nonsensical "associations", they will "associate" you, anyway. You can't really do anything against them.

    Were you instructed to write this intimidating comment or was it your personal initiative? I am pretty disgusted by this way of directing the discourse.

  14. No, the BICEP2 collaboration does not associate or take credit for your posts. As long as this is clear, you can post what you will.

  15. Clear or not, Lubos and anyone else can post what they will.

  16. "As long as this is clear, you can post what you will."

    That's very kind of you, adolf, but just who the hell do you think you are to tell him what he can and cannot do on his OWN blog?

    We've just been commemorating the 70th anniversary of D-Day and celebrating the subsequent defeat of the Nazis, but obviously prematurely — there's still a lot of mopping up to do. You jumped up little prick.

    Shove your advice and your fuckwit holier-than-thou poncy romper-stomper cheap totalitarian 'moral' creed right up your bleeding arsehole.

    To think that men died so that 'priests' like you could parade around breathing air. What a fucking waste of good lives. What a crime.

    You've no idea what freedom is. Go to Hell.

  17. I guess all readers here are aware of the fact that TRF is NOT associated with the BICEP2 collaboration ... ;-)

    Of course I know that you appreciate justified professional criticism, as it should be ...

    But I and probably others think that what happens even in public popular discussions, goes beyond (or more accurately below concerning the level) rational professional criticism...
    And professionaly criticizing the criticism (if it contains errors in the understanding of the scientific method for example), is also an important thing to do.

    All we do here is make use of our freedom to express our disagreement with the good work of the BICEP2 collaboration outright being bad-mouthed at some places in the online and real world in more or less strong words.


  18. A
    society that confiscates achievement to reward the smartless is not only
    insane, it is evil.

    Uncle Al was good to Europe, Patent EP0438043B1. Koenig was the master
    machinist. Engineer Healy pissed off Uncle Al to obtain some Swedlow dumpster
    fill (the "invention"). Yang and Krug were managers responsible for
    getting their names on the patent.

    Lumo and I disagree about exact vacuum isotropy toward
    hadronic matter. Perform a geometric Eötvös experiment. A good idea need only
    be testable.

  19. Perhaps you should direct your attention to the efforts of Dr. Steinhardt who portrays your collaboration as not merely wrong but also sloppy and inappropriate in method and in announcing prematurely. Or you should be offended that the press especially Nature that first announces without report of your caveats and makes it sound as though you were simply over eager to claim credit and were unprofessional etc.
    that is what I got out of it. Everyone who has enough intelligence to follow this realizes what Lubos said is correct. If you think kissing ass will help, fine , but when it turns out somewhere in the future that the possibility of dust has disappeared.

  20. Should the 30 parameters not be called primaries and descendents .... ;-)?

  21. "you must give up the idea that for each of the 550,000 contest events,
    you calculate the distance from each of the 250,000 training events.... "
    It is exactly the first thing I would do.
    "Too much CPU time.)" Who cares? One can optimize it after (and it seems that you're undervaluing the clock's speed.).

    "The training and test events in this contest are generated purely by Monte Carlo": so there emerges :-) ... that it's more a matter of gambling mathematics than of machine learning...

  22. ??

    Where that came from?

    What a bizarre pointless intervention…

  23. Congratulations!!!! Back to 3rd place with a 3.744.

    I still think the 3.81 limit will hold. The extra 0.01 is for stats errors. Why?? Because it is so close to (i^(-I)-1). ;-)

  24. I know you have better things to do, and I know that the answer to my question would be, if answered properly, too long for writing. But I am interested in what you may think about when you, for example, observe a phenomenon such as the rainbow. How do your thoughts flow? Do they start with human feelings of beauty or do they start with classical wave theory or do your thoughts wander into the modern domain straight away? Does your mind wander through all the historical contributions to the understanding of the photon or is it QED you see at first glance? Maybe string theory is too dominant in your thoughts to jump elsewhere? I hope I have asked the question in such a way that you understand me. I think it would be absolutely wonderful if you could account a personal narrative when it comes to how you think about seemingly everyday situations. It would not be often one could have the chance to "listen in" on a mind that knows so much about the place we live in. With kind regards, Rasmus.

  25. Dear Rasmus, I've loved rainbows since I was a kid, and was fascinated by double ones, and so on, and this still remains in me.

    Still, there is at least an equally large new counterpart of this naive excitement - the ability to compute the position of the rainbow, color by color.

    This was actually a standard question in my Rutgers PhD qualifying exams, one of thousands of things I mastered, and I think that I would still be able to calculate the angles. One must roughly know the transmission and reflection of the lightrays from spherical water droplets, and the detailed calculations according to the indices of refraction etc. that depend on the color nail it down.

    I find this quantitative part of the rainbow to be at least as beautiful as the naive feelings from the rainbow. It's not a normal beauty. It's mixed with some gospel-like belief. I think it is cool we may exactly nail this problem down and I feel that everyone should know how to calculate the direction of the rainbow. It's so cool that it works.

    You don't need string theory for the calculation of the rainbow but of course that all these calculations fit within Maxwell's electrodynamics and they're special cases of QED, Standard Model, and then string theory. So string theory is a theory of rainbows for me, too. I partially do think about rainbows (and thousands of different things) when I am thinking about QED, QFT, or string theory.

  26. Is Uncle Al a computer or a real person?

  27. "I am likely to lose a $100 bet that the final score won't surpass 3.8, and I am contributing to this potential loss myself." I'll front you the $(USD)100 if you exceed 4.0.

    No, no - don't thank me. Thank US Socialist Security stealing from the productive to reward the despised. I'd be but a conduit.

  28. Good to see Tommiseeothingamagig is rooting for you!

  29. Finally, the paper: "However, these models are not sufficiently constrained by external public data to exclude the possibility of dust emission bright enough to explain the entire excess signal."