Sunday, August 10, 2014

Kaggle Higgs: approaching 3.85

If you follow the preliminary leaderboard of the Higgs ATLAS Kaggle contest, where 1,288 teams from various places of planet Earth are competing, you may have noticed that I have invited Christian Veelken of CERN to join my team. He kindly agreed. I believe he is one of the best programmers reconstructing the Higgs properties from the tau-tau decays in the CMS team, the other big collaboration at CERN besides ATLAS, whose folks organize the competition.

The current arrangement is that, because the viable scores have so far been obtained predominantly by me, I own 90% of the team, which is enough not to ask the minority shareholders whether they like the name of the team. ;-) Of course, that may change in the future. My belief is that the relative importance of members of such a team has to be based on the preliminary scores and their contributions to the high ones. It's not a perfect way to rate things but it's better than all others, for reasons I could explain. This question is analogous to the question whether managers' incomes in companies should depend on the profits, revenues, and the stock price. Even though there are risks and things can go wrong, I would answer Yes because this arrangement, rooted in imperfect yet measurable data, at least guarantees some correlation between the salary and the future of the company, and some motivation for the manager to fundamentally improve things.

For the first time in human history, Christian has applied the CMS methods for evaluating these tau-tau decays (SVFIT) to the ATLAS data, the data of his intra-CERN competitors. It works. So far, it doesn't produce detectable improvements in the AMS score by itself (or in combination with the ATLAS methods): SVFIT, although more sophisticated, behaves almost identically to ATLAS' MMC. Christian has some really professional ideas about what to do, and I also believe that if they fail to produce high scores, he will help me to professionalize the codes that I used to get where I/we seem to be because, as you can imagine, the codes have become messy.

Meanwhile, however, I kept on improving the score. Our best one currently stands at 3.83674, just 0.014 below the current leader, Gábor Melis. That gap is exactly equal to my last improvement and I got two such improvements in the last 24 hours, so feel free to estimate how much time it should take to take over. ;-)

There have been moments when my mood was one of resignation. It seemed impossible to reach the heights of the leaders and the progress was so slow (my jumping up by 1 place ahead of the marijuana guy is the only change in the top 16 during the last week). One simply couldn't have thought about beating Melis, Salimans, or even the marijuana guy – bright kids and men with years of experience in manipulating similar data and doing machine learning.

Without much kidding, my life's only experience with manipulating "big data" was the conversion of 80,000 Echo comments on this blog to the DISQUS platform when Echo went out of business three years ago or so.

But the mood is very different now. It seems that I can add 0.01 to the score more easily than I can prepare coffee. It's almost as easy as writing +1-1 at the end of a command y=f(x). ;-) Well, not quite, but it is almost mechanically straightforward and it has repeatedly (but not quite always) worked.

One of the proprietary ideas that I've been fond of from the beginning, and that I had already made more viable by refining certain functions, became even more effective when I realized which other conditions of the evaluation are probably needed for the idea to become truly efficient, to show its muscles.

Because this explanation seems to be justified by some abstract theoretical thinking as well as by the real-world empirical data, I will probably automate the system and try to prepare a submission without self-evident fine-tuning that could produce a very high score immediately.

Now it even seems plausible to me that even the final scores – which will be computed from 2 submissions per team compared against 450k test events not included in the 100k test events that are the basis of the preliminary leaderboard – could exceed 3.8 so that I will lose a $100 bet. But it's too early to tell. The bet is as open as it can get. Note that the "best score per team" is almost certainly an overestimate of the final score because the preliminary AMS scores contain some noise with a standard deviation of 0.08. So with 300+ submissions, like mine, the best preliminary score could actually be up to a 3-sigma, i.e. 0.24, overestimate of the genuine score. There are some reasons to think that the overestimates aren't this brutal but I don't want to go into technicalities that are partly speculative, anyway.
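The "3-sigma for 300+ submissions" estimate can be checked with a small sketch. It assumes, very crudely, that every submission has the same genuine score plus independent Gaussian noise with the standard deviation of 0.08 quoted above; real submissions are correlated and differ in quality, so this is closer to an upper bound on the selection effect than to a prediction. The function names are made up for this illustration.

```python
import math

NOISE_SIGMA = 0.08  # std. dev. of the preliminary AMS noise quoted in the post


def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def expected_max_sigma(n, lo=-10.0, hi=10.0, steps=20000):
    """Expected maximum of n independent standard normal draws.

    The maximum has density n * pdf(x) * cdf(x)**(n-1); its mean is
    computed by trapezoidal integration over [lo, hi].
    """
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0  # trapezoid end points
        total += weight * x * n * normal_pdf(x) * normal_cdf(x) ** (n - 1)
    return total * h


def expected_overestimate(n_submissions, sigma=NOISE_SIGMA):
    """Expected amount by which the best of n noisy scores exceeds the truth
    (ASSUMPTION: identical true score, independent Gaussian noise)."""
    return sigma * expected_max_sigma(n_submissions)


if __name__ == "__main__":
    for n in (20, 50, 300):
        print(n, expected_max_sigma(n), expected_overestimate(n))
```

Under these assumptions, the best of 300 draws lands a little below 3 sigma on average, so "up to 3 sigma" is a fair round number, and the gap relative to teams with 20-50 submissions is a fraction of a sigma to about one sigma.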


  1. I see that you have 410 submissions. Aren't you afraid of overfitting?

  2. Dear Nat, overfitting is everywhere and it's a primary culprit one fights against almost all the time. The overfitting is so strong that the locally estimated, self-rated AMS score (without separation of a validation set) of the entries that are currently most important for further developments is a whopping 30% too optimistic in comparison with the preliminary leaderboard score.

    But in comparison with those who have many fewer submissions, I think that there's ultimately not much difference.

    I am not computing accurate local estimates of the AMS score - and all the professional guys almost certainly are. So they are quite certainly dismissing lots of attempted submissions (with different codes and choices of parameters etc.) that give them low local AMS scores computed against validation sets in training.csv.

    I am similarly dismissing entries with too low AMS scores but they're AMS scores computed from the preliminary leaderboard. So my guess is that we ultimately do a similar amount of adjustments of the code. The professional guys are just making adjustments to get higher scores from the training.csv validation subset - with the risk that their entries are adjusted to special features of training.csv that are not shared by the 450k events used to compute the final score - while I am making adjustments to push the submissions closer to the 100k events in test.csv used to calculate the preliminary score. There's a risk (well, certainty) that the final 450k "decisive" events will differ in the fine details from the 100k preliminary events.

    My higher submission count - I have used 5 submissions per day (the limit) even on days when I was in the mountains LOL ;-) - just reflects the fact that I am using the Kaggle server for a more accurate estimate of the AMS score while the professional guys are doing it at home, so their count is lower.

    In some sense, I believe that my entries might be less overfitted because I am using the 250k events in training.csv - as well as, perhaps, the 100k events (part of the test.csv to make the preliminary table) i.e. 350k events in total to adjust the submissions while those guys are only using 250k if they pay no attention to the preliminary score.

    So formally, I have 300+ submissions which could pick 3-sigma "overly optimistic" flukes while some professional guys have just 20-50 submissions so they only reach 2 or 2.5 sigma. That's a 0.5 sigma, or 0.04, difference in their favor. But in reality, the difference is probably smaller because they secretly do the same amount of picking with the same risk of overfitting.

    Yesterday, I actually decided that I had been too afraid of overfitting (and had been choosing too high values of some parameters that are meant to fight overfitting) and that more overfitted calculations should be allowed in my attempts. I can't tell you the details, however.

  3. Great going, Dr. Motl!


  4. Where are diversity, privileged minorities, and gender/sexual preference Equal Opportunity? Value is wholly vested within Intent not effect.

  5. Lubos,

    it's looking *very* close between you and the other top 2 guys. It would be interesting to compare their positions in previous competitions using the partial and full data sets, and to see how successful they've been in not over-fitting.
