Tuesday, September 16, 2014

Kaggle Higgs: lessons from the critical development

Vaguely related: Mathematica Online (in the cloud) became available yesterday, see Wolfram's blog. Unfortunately, it's not quite free.
When the Kaggle Higgs contest switched from the 100k "public" (preliminary) collisions to the official 450k "private" (final) collisions a few hours ago, your humble correspondent's team (including Christian Velkeen of CMS, with a 15% share) dropped from the 1st place to the 9th place (correction: it's the 8th place now, Friday Sep 19th). This is consistent with the hypothesis that the changes of the AMS score would be comparable to the "full combined noise" of 0.1 or so.
Fun link: See all of my/our 589 submissions' preliminary scores, final scores, filenames, and comments (with somewhat funny and esoteric terminology). It's the last file over there, an HTML file; save it and open it locally.
Because the "random" changes of the score were generally this high, you could say that chance decided about the winners, and the 0.045 gap between the winner and my team was our bad luck. Still, I feel that the three guys at the top – Melis, Salimans, and the marijuana guy – didn't become winners by pure chance. All of them are professional big-data programmers of a sort. It's true of everyone in the top 10 and almost everyone in the top 100, too. I am not aware of anyone who was significantly helped by some physics wisdom.

I still think it's clear that much (well, almost exactly 1/2) of my/our improvements over the public xgboost demo – whose score actually increased from the preliminary 3.60003 to the final 3.64655 – were genuine. After all, those who only used the basic xgboost code ended up at the 357th-450th place, well below my 9th. But instead of an increase to 3.85060 as seen on the preliminary leaderboard, the actual increase was just to 3.76050. The efficiency is lousy because my methods to rate the improvements were amateurish: the preliminary AMS is much better than nothing but it isn't a good enough measure to safely beat the top players.

One may say that among my replacements of the "best code" by an "even better code" that looked like improvements from the viewpoint of the preliminary AMS score, 3/4 were genuine and 1/4 were false (they made things worse) simply because the preliminary AMS score and the final AMS score are correlated, but not perfectly. You may see that if you make 3 steps up and 1 step down, it's like making 2 steps up, which is 1/2 of the 4 steps up that you could have gotten otherwise.
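This "efficiency of 1/2" can be mimicked by a toy model: greedily accept any change whose noisy preliminary reading looks like an improvement, and compare the apparent gain with the real one. All numbers below are illustrative, not fitted to the contest; the selection bias alone makes the apparent gain systematically larger than the realized one.

```python
import random

def simulate(n_steps=40, true_sd=0.01, noise_sd=0.03, rng=None):
    # Each candidate change has a true effect on the final score and a
    # noisy preliminary reading; we greedily keep any change whose
    # preliminary reading beats the best reading seen so far.
    rng = rng or random.Random()
    best_prelim = 0.0   # apparent (preliminary) gain over the baseline
    true_total = 0.0    # realized (final) gain over the baseline
    for _ in range(n_steps):
        effect = rng.gauss(0.0, true_sd)                # genuine part
        prelim = true_total + effect + rng.gauss(0.0, noise_sd)
        if prelim > best_prelim:                        # "looks better"
            best_prelim = prelim
            true_total += effect                        # only this is real
    return best_prelim, true_total

rng = random.Random(0)
trials = [simulate(rng=rng) for _ in range(2000)]
apparent = sum(p for p, _ in trials) / len(trials)
realized = sum(t for _, t in trials) / len(trials)
print(apparent, realized)  # the realized gain is systematically smaller
```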

Of course, it's important to realize that it's easy to "rationalize" some answer once you know it. One could probably find a way to "rationalize" wrong answers, too. ;-) The experimental data – in this case, the observation of the final leaderboard – is much more important for me than non-rigorous theoretical predictions and philosophical speculations.

The best submission among my/our 589 submissions was a "megacombo" of 44 previous submissions with a clever but not really deeply justified "minimax" method to choose the weights; its final score of 3.77367 would have placed me/us at the 6th place (and the score was very close to the preliminary one for the same submission, 3.77496). Of course, there was a part of me that said that the "huge combos with some automatically calculated weights" were a "fundamentally solid idea", even more so than the slightly smaller combos with somewhat random weights, but I didn't have enough local technology that would have convinced me that such a submission with the preliminary score of 3.77+ was actually better than some of my other submissions that were already above 3.80. It's not too important; the 6th place wouldn't have been great, either. I had plans to create a program with really uniform gigacombos and automatic smart weights obtained by local validation but Christian and I didn't find the time for that.

Most of my submissions were "combos" – submissions where the rankorder of each collision is computed as a non-arithmetic average of the rankorder from previous 2-45 submissions. It takes less than 10 minutes for Mathematica to calculate such a combo so be sure that sending hundreds of files like that isn't stealing too much of your time (still tens of hours of the CPU time over the contest). An xgboost run takes longer, almost an hour (but only some of these submissions had "lots of human work" behind them – to modify the programs generating some of them sometimes only takes seconds). The idea of the combo is that it is supposed to reduce the "statistical noise". I've converged to a way to do the right "nonlinear" average – the resulting algorithm would be based on the weighted average of \(\exp(-12R)\) where \(R\) is the rankorder divided by \(550,000\), roughly speaking. This exponential is proportional, roughly speaking and approximately (according to my analysis of histograms), to the probability ratio that "it is a b" over "it is an s". The weights were a bit randomly assigned.

So this is "something in between" the arithmetic average (that just reduces the statistical noise, if the individual submissions' rankorders differ from those in the optimum submission by statistical noise) and a "method to veto collisions that look really bad in some submissions". This hybridization of the individual component submissions was referred to as the "emperor nose". I had submissions with codenames like "emperor nose 87b2", and so on, the last one was "emperor nose 92". ;-) The name refers to Richard Feynman's comment that you can't find a more accurate answer (to the question What is the length of the Chinese emperor's nose?) by averaging the answers of many people.

And of course, you often can't. But there is an issue. If the people were random and independent and their deviations from the right answer went in random directions uncorrelated to other people's directions, then the averaging would help. By averaging \(N\) submissions/people like that, you would reduce the statistical noise – the deviation of an individual rankorder from the "optimum, right rankorder" – by the factor of \(\sqrt{N}\).

However, in the real world, the people and the component submissions aren't that independent, so many of them or all of them tend to "err on the same side" and the averaging doesn't help. The error is a "systematic error". So the "emperor nose" method only works and produces better submissions if the component submissions are sufficiently different and diverse, cutting or grouping the 550,000 test events by completely different cuts, clumping them differently, and so on. Only the "statistical part" of the errors gets reduced by the hybridization of many submissions.
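A toy simulation (all numbers illustrative, nothing from the contest) shows both regimes: independent errors shrink like \(1/\sqrt{N}\) under averaging, while a shared systematic error survives untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
n_events, n_subs = 10000, 16
truth = rng.normal(size=n_events)   # the "optimum" score of each event

# Independent errors: averaging 16 submissions shrinks the noise ~4x.
indep = truth + rng.normal(0.0, 1.0, size=(n_subs, n_events))
err_single = np.std(indep[0] - truth)
err_averaged = np.std(indep.mean(axis=0) - truth)

# Shared (systematic) errors: the same bias sits in every submission,
# so averaging cannot remove it.
shared_bias = rng.normal(0.0, 1.0, size=n_events)
corr = truth + shared_bias + rng.normal(0.0, 0.1, size=(n_subs, n_events))
err_systematic = np.std(corr.mean(axis=0) - truth)

print(err_single, err_averaged, err_systematic)
```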

For example, the 3.85060 submission (well, 3.85058, but it's not too different: they differed by 5 events that were relabeled as an "s") that topped the preliminary leaderboard was a combo of 16 component submissions. They were xgboost runs that differed by the "usual characteristics". Five of them used the SVFIT variables from Christian (and CMS: their fancy way to reconstruct the Higgs mass from the \(\tau^+\tau^-\) final products – this method is a slightly superior counterpart of ATLAS' MMC), the rest didn't. Sometimes, one only used 1 SVFIT variable; sometimes, all 30 of them. Sometimes there were new simple features (nonlinear functions of primitive features); sometimes their number was limited. About 5 parameters of xgboost (number of rounds, eta, gamma, depth) had different values. The \(SO(2)\times SO(1,1)\times \ZZ_2\times \ZZ_2\), roughly speaking, unbroken Lorentz symmetry of the ATLAS detector was either gauge-fixed or not. If it was gauge-fixed, it was gauge-fixed in different ways (e.g. to make "lepeta-taueta" positive, or to make something else positive etc.), and so on, and so on.
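Just to illustrate the combinatorics of such a family of runs (every knob value and label below is made up, not one actually used), 16 distinct components can come out of a small grid over on/off switches and xgboost parameters:

```python
import itertools

# Hypothetical knob values -- illustrative only, not the real runs.
etas = [0.01, 0.03]
depths = [6, 9]
svfit_options = [True, False]            # include the SVFIT variables?
gauge_options = ["none", "lepeta>0"]     # how to gauge-fix the symmetry

configs = [
    {"eta": eta, "max_depth": depth, "gamma": 0.0,
     "use_svfit": svf, "gauge_fixing": gauge}
    for eta, depth, svf, gauge
    in itertools.product(etas, depths, svfit_options, gauge_options)
]
print(len(configs))  # 16 distinct component configurations
```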

Some of the component submissions also rescaled the weights of both "s" and "b" by some common function of the 30 features, and this "Weyl transformation" doesn't actually change the ranks in the perfect world because the density ratio of "b" and "s" (including the weights both in the numerator and the denominator) is unchanged! Well, I have played with "not quite Weyl" modifications of the weights, too, but they didn't make it into the officially submitted combos.
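A tiny numerical check of this invariance, with made-up data: a common positive rescaling of all event weights cancels in the per-bin b/s ratio. (The rescaling function is chosen piecewise constant per histogram bin so the cancellation is exact rather than approximate.)

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.uniform(0.0, 1.0, n)            # a single stand-in feature
is_s = rng.random(n) < 0.3              # ~30% signal events
w = rng.uniform(0.5, 2.0, n)            # original event weights

def bs_ratio(weights, bins=10):
    # Weighted b/s ratio per feature bin -- the quantity that fixes
    # the ideal rankorder.
    edges = np.linspace(0.0, 1.0, bins + 1)
    b, _ = np.histogram(x[~is_s], bins=edges, weights=weights[~is_s])
    s, _ = np.histogram(x[is_s], bins=edges, weights=weights[is_s])
    return b / s

# A "Weyl" rescaling: the same positive function of the feature is
# applied to signal and background weights alike.
f = np.where(x < 0.5, 2.0, 7.0)
ratio_before = bs_ratio(w)
ratio_after = bs_ratio(w * f)
print(np.max(np.abs(ratio_before - ratio_after)))  # ~0: ratio unchanged
```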

So I had something like 200 or so component submissions like that – individual runs of xgboost (the highest final AMS score among them surpassed 3.73 but they were mostly around 3.6+) – and combined them in the "emperor nose" combos in a rather spontaneous way. This combination procedure did indeed add about 0.1 to the score of the individual components, less than the 0.15 that was needed to win the contest. One may see that my methods to determine the "weights" of the individual submissions were more or less guesses – whatever increased the AMS score was considered good, and my "progress" towards an even better point resembled Brownian motion. One can see that the trend of the final score was indeed "up"; the slope was just about 1/2 of what it seemed to be according to the preliminary scores.

I've been applying the hybridization paradigm since the very beginning of the contest – more precisely, from May 24th, 2014 when my first 2-component combo produced the preliminary score of 3.58 which actually looks like a happier 3.66 as the final score. Also, I was increasingly convinced that it's pretty much the same thing that the xgboost program is doing internally, anyway – it's combining scores from many different ways to divide the dataset into decision trees. My method is superior to a single xgboost run because one includes diverse kinds of trees obtained from various ways to reorganize the collisions, features nonlinearly transformed relative to those of other trees, and so on. So one gets a greater diversity of the "types of errors" in the estimates of the rankorder of each collision, and therefore the error in the estimate of the "optimum rankorder" may be reduced more significantly by the averaging.

I also believe that the method of the ultimate winners must be de facto equivalent to what I was doing – using a combination of many classifiers at the same moment – they just used a more professional method, and not my manual "whatever works", to calculate the weights etc. As a guy with virtually no training in programming (a hot high school teacher told me not to attend the IT classes after the first one just because I was smarter than she was – so I wasn't exposed to virtually any "teaching" of programming in my whole life – well, since the age of 10 when I still attended a "Station of Young Technologists", and with the exception of a few undergraduate DOS-focused classes on how to use computers by Jiří Dolejší of ATLAS), I was too lazy to write the code that would evaluate the right combinations "locally", by dividing training.csv into validation sets etc. Dividing an array into pieces in dozens of different ways in Python (which I had never used before the contest) would be too much hassle. In fact, I haven't separated a validation set from training.csv once in the whole contest even though I know it's obviously needed to do these things professionally (I won't argue for a second if you tell me that it must be done to do things right). So of course all the "AMS scores" I saw locally were much more brutal overestimates of the final score (and even the preliminary ones). Without a special validation set (when you evaluate the success on a "test ensemble" that was used for the training as well), you are bound to get overoptimistic estimates due to "overfitting".

It seems to me that the winners de facto use many classifiers and calculate the right weights/combinations by looking at some good enough locally calculable "score" variable that doesn't suffer from as much noise as the preliminary AMS, perhaps because it's computed by averaging many "similar ways" and using different splittings of training.csv, too. It must be morally the same as what I was doing, except at the professional level. I think that I know how to produce a genuine 3.85 run but I didn't have the technological resources for that (e.g. grad students who would quickly create a program according to my dreams LOL, or at least a much less busy Christian).
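The contest's metric itself is public – \(AMS=\sqrt{2\left((s+b+b_{reg})\ln\left(1+\frac{s}{b+b_{reg}}\right)-s\right)}\) with \(b_{reg}=10\) – so the kind of local validation loop described above is easy to sketch. Everything below except the AMS formula is a hypothetical sketch: `train_model` and `predict_labels` stand in for whatever classifier is used, and the held-out weights are rescaled so the expected s and b counts match the full dataset.

```python
import math

def ams(s, b, b_reg=10.0):
    """The contest's Approximate Median Significance."""
    return math.sqrt(2.0 * ((s + b + b_reg)
                            * math.log(1.0 + s / (b + b_reg)) - s))

def local_ams(events, labels, weights, train_model, predict_labels, k=5):
    """Average AMS over k holdout folds of the training data.

    `train_model` and `predict_labels` are user-supplied placeholders;
    held-out weights are rescaled by n/len(holdout) so the expected
    s and b counts match the full dataset.
    """
    n = len(events)
    fold_scores = []
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        holdout = [i for i in range(n) if i % k == fold]
        model = train_model([events[i] for i in train],
                            [labels[i] for i in train])
        scale = n / len(holdout)
        s = b = 0.0
        for i in holdout:
            if predict_labels(model, events[i]) == "s":  # selected as signal
                if labels[i] == "s":
                    s += weights[i] * scale
                else:
                    b += weights[i] * scale
        fold_scores.append(ams(s, b))
    return sum(fold_scores) / k
```

Scoring a holdout slice this way gives a locally computable "strength" estimate without burning a Kaggle submission, which is presumably (morally) what the winners automated.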

Just to be sure, if I were not trying to be at the top of the preliminary leaderboard, it wouldn't have helped to increase my final score, of course.

There's been lots of amazing nonsense on the forums about "new variables". This has been pure hype. Now, when the data are out, everyone can see that those things have nothing whatsoever to do with the substance of the competition. Most people actually saw that the CakeA or CakeB variables, for example, worsened their score. Either they saw that a "run with cakes" worsened their score relative to what they consider "an analogous run without cakes" (which is a problematic concept by itself; a truly analogous run "without cakes" should perhaps have more iterations and other parameters); or they at least have submissions "without cakes" with a higher final score than their best submission "with cakes". And there are also people who may find a higher score with a cake-including submission, by 0.02 or so. Everything is compatible with the description that "the effect of cakes is noise".

In particular, the C.A.K.E. team ended up at the 484th place with the score 3.64007. It's actually less than the "xgboost initial demo free for everyone" – which, I remind you, produced 3.64655 in the final leaderboard (you find a whopping 94 contestants in the final leaderboard with this score as their best one) even though their preliminary score looked like a 0.12 improvement. They may have solved a highly simplified (yet still sort of nontrivial) version of the problem to "analytically calculate the probability that an event is signal out of the features" – you wouldn't need training.csv at all if you could solve the full analytic problem – but the simplification is so severe (by banning the dependence on many variables, and by ignoring 2 of the 3 backgrounds) that it's not useful for calculating whether a real-world event is a "b" or an "s". In the end, although they had solved a fun toy physics task that might be an OK exercise in a QFT textbook, adding their variables isn't too different from adding any other random nonlinear function of other features (I have known for months that manually redefining the variables to make it "easier" for the computer to see how to use them for discrimination is almost always useless).

It's useful to compare this hype to the general hype about some "science" in the media. The C.A.K.E. team ended up 484th. Some of the people above them only adjusted some parameters in the xgboost demo. But many of them, perhaps 1/2 of them – over 240 people or teams – had some genuinely new ideas, and many of those ideas, clearly including mine, were vastly more relevant for the discrimination and more powerful when dealing with the problem of the contest.

But just because the C.A.K.E. guys posted a comment on the forums with a lot of hype, lots of naive people started to think that the Cake variables are some kind of "holy grail" that changes the contest. Some were even arguing whether it was ethical for C.A.K.E. to post it. Of course it was ethical to share the ideas – it's explicitly allowed – and it didn't make any difference to the substance of the contest, so the people who knew what they were doing (and I count myself in this broadly defined camp) didn't pay any attention to the "cakes" that were served.

So in reality, the C.A.K.E. team differs from the 240+ idea-ful contestants above them by having achieved a worse score and by having posted an idea of theirs. But each of those 240+ contestants had some idea (or, sometimes, dozens of ideas) – not necessarily a new variable (new variables aren't really the smartest or deepest kind of knowledge – which is also why Abhay Ashtekar isn't among the top 100 quantum gravity theorists in the world) but a new idea – and except for the C.A.K.E. team, no one published their idea before the deadline. Now, just because it's the only thing that was published before the deadline, some people (including the C.A.K.E. team itself) blow it out of proportion and sell it as some game-changer or holy grail even though it's just irrelevant noise, a random one of the 200+ OK ideas that are not terribly good.

This is unfortunately analogous to the "positive" coverage of science in the "science media". We constantly read about bullšit breakthroughs involving "droplets and pilot waves simplifying the foundations of quantum mechanics" and about a dozen of persistently repeating "evergreens" all the time. In reality, these topics represent something like 1/240+ of the genuine research, and not exactly the most important or correct one (in the case of the pilot waves, it's a completely wrong research, of course, but even things that are not quite wrong are usually superficial, cheap, not profound, and irrelevant if they are positively hyped in the science media).

So there's been lots of nonsense that many gullible competitors have believed but otherwise it was a great contest and I have learned a lot from it.

At the end of May 2014 when the contest was getting started, I had some really naive ideas about what can be done, what can't be done, how much time something takes, how accurately the probability of an "s" for a collision may be calculated, how finely the dataset should be divided, how important the detector effects could be (I actually dealt with lots of special features of the muon spectrometer, the crack region for the electrons, and so on, and some of them improved the score but none of those things was enough to catch up with Melis, Salimans, and the marijuana guy – the statistical errors due to low statistics end up being vastly more important than a less-than-a-percent bias in the probability of an "s" in some directions where the lepton can be detected etc.), what the boosted tree programs are actually doing, whether the "nearest neighbor" strategies may get you very far (nope), what the range of the probability ratio of "b" vs "s" in the dataset is (it goes from \(\exp(9)\) to \(\exp(-5)\), more or less, with the exponent being more or less linear in the rankorder), and so on, and so on.

Now I have a much better idea about all of these topics – about the character of the data at the LHC, the patterns hiding in the data, some detector effects etc. (This \(\tau^+\tau^-\) decay channel of the Higgs is more complicated than the average analyses that the LHC collaborations do.) Eight other people/teams got above me – the winner ended up 0.045 above my team (this is not the first time I see the 0.045 gap between Melis and myself in this contest LOL! That's probably no coincidence – all my "reductions" of the gap from that point were spurious as our "final" scores only grew at the same rate) – but there are still 1,780+ people and teams below me, including all the world's physicists and all the CERN professionals (congratulations to Tommaso Dorigo's 750th place, for example), so rest assured that I don't feel too much pain. ;-)

Maybe the datasets could have been larger so that the partly random change of the score wouldn't be this large, of order almost 0.1. But that would have the disadvantage of slowing down all the programs and so on (very bad for me and my laptop!), so I don't actually have a clear idea how this aspect of the contest could have been improved. Of course, the preliminary leaderboard could have been declared the final one. But if this had been the rule, I guess that Melis and Salimans would have behaved differently and they (or someone else?) could have won anyway. Such a contest would be much less useful for CERN to learn something useful for the actual physics analyses, of course.

Three months ago or so, I made a $100 bet with a Finnish guy that the ultimate winner's final score would be below 3.8 (I said). I should have said 3.81! ;-) This is what you may call a "superfair way to define a bet". Because there's exactly one participant whose score is above 3.8, Melis with 3.805 (Salimans is already below 3.79), I will have to pay $100. Bad luck, indeed! I've already paid Heikki. Feel free to help me with PayPal if you agree with me that this kind of bad luck is something like a small natural catastrophe. :-)

Look-elsewhere effect

Tommaso Dorigo says that the public-to-private score drop is higher for users with many submissions because of the look-elsewhere effect. I agree with him if it is formulated in this way, of course: If you have many submissions with the same "underlying strength" and "white noise" on top of it, the maximum-scoring one among them will obviously have a higher score if there are many submissions. The subtlety is that one doesn't have to choose the "highest preliminary AMS score submissions". However, he also says that "one is deliberately choosing an upward fluctuation if he picks a submission with the highest preliminary score" among those of "equal strength". This is of course also true but completely unusable within the limited (due to my not being a programmer) software framework I had, because the only truly empirical way to measure the "strength" of a submission was to look at the preliminary AMS score. So for me, by definition, the high-AMS-score submissions were the "highest strength" submissions.
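The first half of Dorigo's point is easy to demonstrate numerically (with illustrative numbers only): the best of many noisy public readings of the same underlying strength overstates that strength more and more as the number of submissions grows.

```python
import random

rng = random.Random(42)
true_strength = 3.70    # identical underlying strength for every submission
noise_sd = 0.03         # illustrative magnitude of the public-score noise

def best_public_score(n_submissions):
    # The maximum of n noisy public readings overstates the true
    # strength increasingly with n -- the look-elsewhere effect.
    return max(true_strength + rng.gauss(0.0, noise_sd)
               for _ in range(n_submissions))

few = sum(best_public_score(5) for _ in range(2000)) / 2000
many = sum(best_public_score(500) for _ in range(2000)) / 2000
print(few, many)   # both exceed 3.70, and `many` exceeds `few`
```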

This is a rule of thumb that partially works – the preliminary AMS score is surely positively correlated with the final one, and you may indeed see that even the final scores of my submissions continued to grow with time. But the imperfection of this correlation was enough to reduce all the progress from "single xgboost runs" by 50 percent or so (or, to say it differently, it was enough for the increases of the preliminary AMS score to overestimate by a factor of two the improvements that were occurring "fundamentally", i.e. in the final score). With a more accurate way to compute the "strength" (e.g. to estimate the final AMS score) than the preliminary score, I or we would have chosen a better combination of the runs, of course. But without this complicated software meta-infrastructure, following the preliminary AMS score was really the only feasible, non-superstitious way to go. In particular, choosing "lower-preliminary-AMS submissions" from morally similar ones would surely do no good. As I have already said, the submission I picked was just 0.015 worse (by its final score) than my best one among the 589 submissions so I "approximately knew which submissions were the best ones".

If I look back, I would still consider it a waste of time to (try to?) develop the software for cross-validation and optimization etc., although I probably know exactly what the software would have to look like. These things have been more or less discovered before, and by others – and I am too conservative to constantly rediscover the wheel or America. The results clearly show that such a job belongs to the (machine learning) professionals. No one else could have safely competed – and I was one sigma below the winner, which you may interpret as a safe gap or just noise. Some of the gurus should perhaps be hired by CERN to make the future discovery claims of CERN a bit stronger although, again, the difference between Melis's 3.805 and my 3.76 (or even the 3.64 of the basic xgboost package) is too small to justify multi-million investments or analyses that no one else at CERN will understand.

So I am uncertain whether the winner's score justifies some radical changes in the way the LHC collaborations do their research. Boosted trees should almost certainly be used in similar situations, however. On the other hand, I think that using boosted trees without any MMC or SVFIT (and even without any "cakes") would be just enough. I sent about 2 or 3 submissions without any MMC, SVFIT, or "cakes", and the better or best one among those gave the final score 3.57711 (preliminary 3.57257), which is almost the same as the runs with MMC. Given the extra complications that MMC brings, I would probably recommend sacrificing 7% in the confidence level and avoiding MMC.
