Thursday, July 31, 2014

Joining the Kaggle Higgs 3.8+ club

Briefly, some news on Friday, August 1st. As I expected (see the text below), Tim Salimans is now ahead of Gábor Melis although his advantage is infinitesimal. (Friday 4 pm update: Melis is at the top again.) With the so far minor help of incredible variables from Christian Veelken of CMS, I (or counting the promised 10% share for CV, we) joined the club of those with the score above 3.8, see the leaderboard. Every contestant who is not a complete loser must feel safely above 3.8, so the score associated with my name is now 3.80007. ;-) I am not selecting that submission for the contest because I don't have all the sources that produced it – it was very complicated.

The text below was originally posted on July 24th.

Gábor Melis' new formidable challenger

Tim Salimans makes the Terminator look like Pokémon

As recently as two hours ago, I thought it was conceivable that I would end up in the top three of the Higgs Kaggle challenge. See the leaderboard.

The top 5 contestants hadn't changed for a week. Gábor Melis was at the top followed by the Marijuana Hybrid guy, by your humble correspondent, and by 1,100+ other participants.

Terminator, Ironman, Batman, and a few Transformers as seen from the optics of a company in Utrecht.

Times are changing. For more than an hour, Tim Salimans of Utrecht, the Netherlands, has been the new #2 warrior. His 7th submission, with a score of 3.81888, catapulted him to that place and made Gábor Melis' victory uncertain.

Almost all contestants at the top are experienced machine learning software experts – your humble correspondent is a true rural bumpkin in this company (my experience with machine learning and computers is that I managed to jump a few trains on Subway Surfers and solve a hard level at Candy Crush Saga after 500 attempts) – but Tim Salimans makes even most of the urban contestants look like bumpkins.

Just to be sure, his Kaggle profile says that he has won (!) 4 previous Kaggle contests, including one on dark matter data, finished 2nd several times, placed in the top ten 10 times in total, and has hosted his own Kaggle contest, too. More shockingly, he is
[a f]ounding partner and data scientist at predictive analytics consulting firm Algoritmica, with a PhD in computational Econometrics and a strong academic background in Machine Learning.
The company's website explains that
Algoritmica combines machine-learning algorithms with the power of supercomputers to build unparalleled predictive models for marketing, risk, fraud, supply chains, and maintenance. We lead companies around the world from average business processes to a truly data-driven organization. Empowered with predictive models, these companies learn from data to stay ahead of the competition, cut waste, and delight customers.

Algoritmica also supervises the NSA and FBI and keeps track of all the data and patterns in the 2 trillion telephone calls and e-mails that they record every month.
OK, I added the last sentence but it may be true, anyway.

Salimans seems to have no specific training in physics but it's clear that he does care what the LHC collision data mean. In a question he had posted to the Kaggle forums, he was asking where he could find the algorithm used by the ATLAS Collaboration to estimate the Higgs mass from the candidate event. This is a rather difficult calculation whose result, the MMC mass, is the first "feature" describing each event and by far the most complicated "derived" quantity calculated from the raw collision data.

I am pretty sure that by today, he has incorporated the improved version of the MMC mass estimator into his supercomputer superprograms. In fact, I find it likely that he has added CMS's less frequently used alternative to the MMC estimator, the (N)SVFIT algorithm, as well, and adding the (N)SVFIT output as an extra feature may be enough to jump above 3.8 even if other things are lousy. I was thinking about adding (N)SVFIT but it's a rather complicated program that I would have to reverse-engineer and rewrite from scratch, and you know, two hours ago, I felt that I would be the only contestant to waste my time in this way.

Whatever Salimans has exactly done, I feel that it's ludicrous to try to compete with such a monster. My mobilization against him is only going to be as symbolic as the Czechoslovak army's mobilization against the Third Reich right after the Munich Betrayal, in September 1938. ;-) My codes and software infrastructure are based on several legs and lots of partial cute ideas. But I don't even have any systematic "quality control", like strictly dividing the training dataset into training and validation subsets. I am sure that he not only does so but does so dynamically, with some meta-machine-learning that adjusts the learning computer to make it learn better than the previous programs, and so on. The possibilities are endless.
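The systematic "quality control" mentioned above, a fixed train/validation split, is simple to set up even without any machine learning library; a minimal dependency-free sketch (the function and parameter names are mine, not from any contestant's code):

```python
import random

def split_train_validation(rows, validation_fraction=0.2, seed=0):
    """Shuffle once with a fixed seed and hold out a fraction of the
    training rows for validation, so scores are not read off the same
    data the model was fit on."""
    rows = rows[:]                     # don't mutate the caller's list
    random.Random(seed).shuffle(rows)
    n_val = int(len(rows) * validation_fraction)
    return rows[n_val:], rows[:n_val]  # (train, validation)

train, val = split_train_validation(list(range(100)))
```

Because the seed is fixed, the same split is reproduced on every run, which is the property that makes validation scores comparable across experiments.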

Of course, it's great if really powerful guys like this one do their job and switch from econometrics to particle physics at least once. I hope that it will be useful for the LHC research, too. If his (or other commercial professionals') methods are significantly more effective than those at the LHC, I believe that CERN should simply hire them or buy their software etc. to perform similar tasks. If the LHC experimenters are "clearly amateurs" in comparison, they should admit it and CERN should fire some of them and replace them with true professionals.

On the other hand, if his AMS score got stuck at the current level, just 0.03 above the score of your humble correspondent who is doing all these things with $0 software on a $500 laptop and with 0 pre-contest experience with machine learning software, it would be rather stupid to pay millions and millions of dollars to a special Dutch machine learning company designed to conquer the world. ;-)
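For context, the AMS scores quoted throughout this post come from a published formula: the challenge defines the Approximate Median Significance in terms of the weighted true-positive sum s and false-positive sum b, with a regularization constant of 10. A minimal sketch in Python:

```python
import math

def ams(s, b, b_reg=10.0):
    """Approximate Median Significance, the metric behind the scores above.
    Definition from the HiggsML challenge documentation:
        AMS = sqrt(2 * ((s + b + b_reg) * ln(1 + s / (b + b_reg)) - s))
    s     -- weighted sum of true positives (signal selected as signal)
    b     -- weighted sum of false positives (background selected as signal)
    b_reg -- regularization constant, fixed to 10 in the challenge
    """
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))

# Sanity check: for small s/b, AMS approaches the familiar s / sqrt(b + b_reg)
score = ams(10.0, 100.0)   # roughly 0.94, versus 10 / sqrt(110) ~ 0.95
```

Note that s and b are sums of the event weights provided with the training data, not raw event counts, which is why leaderboard scores sit near 3.8 rather than in the hundreds.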

My respect for the Dutchmen's sophistication is immense and they have my condolences after the downing of the MH17 flight. However, it's probably more natural for me to root for the fellow Austrian-Hungarian Visegrád guy now. Gábor, István, Balázs, Jánosz, go, go, go! ;-)


  1. Penalize any competitor backed by less than €100K in hardware. Somebody is ruining process, PERT chart, and budgeting by being competent. Obscuration is fundable. Solution is a one-trick pony.

  2. My guess is that Salimans is building a probabilistic model (at least that is what he has been doing in all the other competitions). Ultimately, I think that is a much more natural approach for this problem than any decision-tree-based algorithm.

    So, I think you might be on the right track with the SVFIT thing. It's a relatively straightforward probability model of the event, but at least the basic version is missing all the information from the jets etc.

  3. He wrote me that he doesn't want to blog about ideas because he's being conservative. Of course no technical details could be said but I would personally guess he is doing decision trees in the right way.

  4. Ooh, that's nasty, Al.

    Anyone would think you had something in general against state-sector operatives expanding their fiefdoms and feathering their own nests! What's wrong with you, man — don't you believe in the big rock candy mountain public titty?

    OK, I guess you're not exactly persuaded that all these people are needed. But just imagine what the employment figures would look like without those 'jobs'!


  5. I respect your efforts, so if you want to easily add the SVFIT to your toolbox, you can find a standalone version:

    You'll have to link with ROOT, and the HiggsML team does not give the MET covariance matrix, so you have to assume something.

  6. LOL, I have had this standalone version for 5 hours, and 3 minutes ago I managed to install ROOT correctly.

    Yes, I am just reading papers on how to calculate a reasonable covariance matrix. If I don't find a reasonably refined formula, I will use something like the unit matrix times 200 GeV^2, or something like that. ;-)
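The fallback described in the comment above is easy to write down explicitly; note that the 200 GeV^2 value is only the commenter's guess, not an official resolution. A minimal sketch (the function name is mine):

```python
def assumed_met_covariance(sigma2_gev2=200.0):
    """Placeholder MET covariance matrix when the real one is unavailable:
    uncorrelated x and y components, each with an assumed variance of
    200 GeV^2 (i.e. a missing-ET resolution of roughly 14 GeV per axis).
    Returns a 2x2 matrix as nested lists."""
    return [[sigma2_gev2, 0.0],
            [0.0, sigma2_gev2]]

cov = assumed_met_covariance()
```

In a real SVFit run this 2x2 matrix would be copied into the ROOT matrix object the algorithm expects; the diagonal form simply encodes "no correlation between the x and y missing-energy components".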

  7. Off-topic:

    Is Lawrence M. Krauss a crackpot?

  8. Not really off-topic: [Algoritmica builds] predictive models for marketing, risk, fraud, supply chains, and maintenance. Please note the absence of climate. Or weather. Apparently we know more about fraud than about climate.

  9. Did you really mean /usr/include/root or rather /usr/root/include? Anyway, does the path contain the ".h" files of the relevant libraries? If not, just find out where they are located and add those paths to your include statement. In general, it may be necessary to obtain and install the relevant libraries (including their source code for linking) first. Getting the includes to work is usually straightforward and should not take much time, though.

  10. Looks like you are well on your way to getting the software going.

    You can copy all the .h include files for the libraries into a single directory of your choosing and use them from there. You have to specify the libraries by name with a -l flag.

    You should be able to do 1000 records per second with floating point hardware. Just don't fry your laptop. They tend to get hot ;--(

    Post an error log if you need help ;--(

    Have fun. ;--)
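A concrete form of the compile-and-link advice in the two comments above, assuming a standard ROOT installation whose root-config script is on the PATH; the source and output file names here are hypothetical:

```shell
# root-config emits the -I include flags and the -L/-l linker flags
# for the local ROOT installation, so there is no need to hunt for
# header directories or library names by hand.
g++ -O2 standalone_svfit_driver.cc $(root-config --cflags --libs) \
    -o standalone_svfit_driver
```

This is just a build recipe: it requires ROOT to be installed and will fail cleanly with a linker error if any additional SVFit-specific library still needs its own -l flag.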

  11. "If the LHC experimenters are 'clearly amateurs' in comparison, they should admit it and CERN should fire some of them and replace them by true professionals."

    Experimenters in HEP are always amateurs in all this technical stuff, by construction. They spent their years learning physics, not the behavior of bits and bytes. They pick up knowledge as necessary for the specific setup, and work as best as they can, trying to squeeze the maximum information out of the data. And they are not hired by CERN. They belong to individual HEP groups as physicists, often professors, in their respective institutes.

    I do not think that if the LHC analysis were run as an engineering project, as you suggest (I hope jokingly), it would be more successful in finding new physics. The engineering mentality usually works toward fixed goals, not on searches for new phenomena that require thinking outside the box.

  12. Dear Anna, I wasn't really joking. Of course I think that true physicists who may be amateurs in most other things are *essential* in HEP experiments.

    On the other hand, lots of work is being done by highly specialized people who are really supposed to be good at something - not necessarily "physics" (and especially "theoretical physics") related - in engineering tasks. If someone is really doing data mining only, he should be fully exposed to the competition on the job market of other people who know how to do data mining.

  13. Haha! Oh very good! Point nicely made. :)

    (I assume it's genuine since it looks highly plausible and I have no reason to think otherwise. BTW I wouldn't have known your MSSM required so many parameters. That seems a lot.)


    "120 new parameters", so in fact it requires way over 120 in total.

  15. Gábor, István, Balázs, Jánosz, go, go, go! ;-)

    Lubos, what are you doing? This looks like a betrayal of everything we stood for in KuK! Did you forget sapper Vodička? :)

    "That would be the last straw," Vodička fumed, "for that Magyar to try to throw something at our heads. I'll grab him by the neck and throw him down the stairs from the first floor so that he flies like shrapnel. You have to go at those Magyar boys hard. No ceremonies with them."

  16. LOL, yes, sorry for that change, Tom! But haven't some things about the Hungarians and their relations to us changed over the century? ;-)

  17. I actually joined the challenge officially in the last couple of hours. My score is 3.41359 on 4 entries. I am almost done with some special juice. Consider this the Kaggle equivalent of a "roof knock". Ktahn should easily be top 5 once this juice is complete.

  18. What took you so long? ;--)