Tuesday, July 21, 2015

A new LHC Kaggle contest: discover "\(\tau \to 3 \mu\)" decay

A year ago, the Kaggle.com machine learning contest server, along with the ATLAS Collaboration at the LHC, organized a contest in which you were asked to determine whether a collision of two protons involved a Higgs boson (which later decayed to a \(\tau^+\tau^-\) pair, with one tau decaying leptonically and the other hadronically). To make the story short, there's a new similar contest out there:
Identify an unknown decay phenomenon
Again, you will submit a file in which each "test" collision is labeled as either "interesting" or "uninteresting". But in this case, you may actually discover a phenomenon that is believed not to exist at the LHC, according to the state-of-the-art theory (the Standard Model)!

The Higgs contest was all about simulated data. The events looked real but they were not, and several technicalities were switched off in the simulation to simplify things. Incredibly enough, here you are going to work with real data from the relevant detector at the LHC, the LHCb detector: the LHCb collaboration is the co-organizer.

For each test event, you will have to announce a probability \(P_i\) that the event involved the following decay of a tau:\[

\tau^\pm \to \mu^\pm \mu^+\mu^-

\] The tau lepton decays to three muons. The charge is conserved but the individual lepton flavor numbers are not: among the decay products, the negative muon and the positive muon cancel but there's still another muon, and it was created from a tau. The \(L_\mu\) and \(L_\tau\) conservation laws are violated, even though the total lepton number is unchanged.
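
For concreteness, here is the bookkeeping for \(\tau^-\to\mu^-\mu^+\mu^-\), assuming the standard conventions (leptons carry \(+1\) and antileptons \(-1\) of their flavor number; \(L\) denotes the total lepton number):\[
\begin{aligned}
Q:&\quad -1 \;\to\; (-1)+(+1)+(-1) = -1\\
L_\tau:&\quad +1 \;\to\; 0\\
L_\mu:&\quad 0 \;\to\; (+1)+(-1)+(+1) = +1\\
L=L_e+L_\mu+L_\tau:&\quad +1 \;\to\; +1
\end{aligned}
\] So the charge and the total lepton number are unchanged, while \(L_\tau\) and \(L_\mu\) each change by one unit.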

In the leading orders of the Standard Model, the probability of such a decay is zero. I believe that the actual predicted rate is nonzero but unmeasurably tiny. New physics allows this "flavor-violating" process to take place, however.

To show you the unexpected relationships between different TRF blog posts, let me tell you that the blog post right before this one talked about the \(Z'\) boson and this new spin-one particle could actually cause this "so far non-existent" process.

In fact, this option appears in the logo of the contest! The \(\tau^\pm\) lepton decays to one \(\mu^\pm\) and a virtual \(Z'\), and the virtual \(Z'\) decays to \(\mu^+\mu^-\). The first vertex violates the flavor numbers but it's not so shocking for a new heavy particle to couple to leptons in this "non-diagonal" way.

The LHCb contest is harder than the Higgs contest in several respects such as
  1. lower prizes: $7k, $5k, $3k for the winner, silver medal, and bronze medal. It's harder to write difficult programs if you're less financially motivated. But LHCb is smaller than ATLAS so you should have expected that. ;-)
  2. no sharing of scripts: you won't be permitted to share your scripts for this contest so everyone has to start from "scratch". Sadly, you may still use your programs and experience from other projects so the machine learning folks will still have a huge advantage, perhaps a bigger one than in the Higgs contest.
  3. agreement and correlation pre-checks: to make things worse, your submission won't be counted at all if it fails to pass two tests: the agreement test and the correlation test. This feature of the contest, along with the previous one, will make the leaderboard much smaller than in the Higgs contest. The two tests reflect the fact that the dataset is composed of several groups of events – real collisions, simulated realistic ones, and simulated new-physics ones for verification purposes.
  4. larger files to download: in total, you have to download 400 MB worth of ZIP files that decompress to many gigabytes.
  5. messy details of the LHC are kept: lots of the technical details that make the real life of experimental physicists hard were kept, although translated to the machine-learning-friendly conventions. Also, the evaluation metric is more sophisticated: a weighted area under the ROC curve (the curve relating the true-positive rate to the false-positive rate as the probability threshold is varied).
  6. and I forgot about 3 more complications that have scared me...
An ambitious contestant may view all these vices as virtues (or at least some of them). After all, money corrupts and sucks; sharing encourages losers to accidentally mix with the skillful guys; it's good for the submissions to pass some extra tests so that one doesn't coincidentally submit garbage; all these difficulties will keep the leaderboard of true competitors shorter and easier to follow (instead of the 2,000 people in the Higgs contest); I vaguely guess that the final, private leaderboard will be much closer to the preliminary, public one (there was a substantial change in the Higgs contest, sadly for your humble correspondent LOL). The reason for this belief of mine is that the contestants submit a larger number of guesses, they're continuous numbers, and the evaluation metric is a more continuous function of those, too. So the room for overfitting will probably be much lower than in the Higgs contest.
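
To make the "area under the curve" idea a bit more tangible, here is a minimal sketch in plain Python of the ordinary, unweighted area under the ROC curve computed from the submitted probabilities. It is only an illustration: the function names are hypothetical, and the contest's actual weighting of different parts of the curve is not reproduced here.

```python
def roc_curve_points(labels, scores):
    """Return the (false-positive rate, true-positive rate) points of the
    ROC curve, running from (0, 0) to (1, 1).

    labels: 1 for signal events, 0 for background events.
    scores: the submitted probabilities P_i, one per event.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one signal and one background event")

    # Sort events from the most signal-like to the least signal-like.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Events with identical scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def roc_auc(labels, scores):
    """Unweighted area under the ROC curve (1.0 is a perfect ranking)."""
    return area_under_curve(roc_curve_points(labels, scores))


if __name__ == "__main__":
    # Toy example: one background event is ranked above one signal event.
    print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))  # 0.75
```

Only the ordering of the probabilities matters for this quantity, so the score shifts only when a signal event and a background event swap places in the ranking. With many test events, each such swap moves the score by a tiny amount.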

So far, there are only 13 people on the leaderboard and it's plausible that the total number will remain very low throughout the contest. If you write a single script that passes the tests at all, chances are high that you will be immediately placed very high on the leaderboard.

At any rate, you have 2 months left to win this contest and proudly announce it to the world on this blog and in The Wall Street Journal. Your solution may be much more useful than in the Higgs case; technicalities weren't eliminated, so your ideas may be used directly. And what you may discover is a genuinely new, surprising process – but one that may actually be already present in the LHCb data (as the hints of a \(Z'\) and flavor-violating Higgs decays suggest).

Good luck.

Correction: the Higgs money was just $7k, $4k, $2k, so this contest actually has better prizes. The money comes from CERN, Intel, two subdivisions of Yandex (a Russian Google competitor), and universities in Zurich, Warwick, Poland, and Russia.
