Wednesday, August 27, 2014

Kaggle Higgs contest: the solution file for everyone

Well, an approximate one

Have you ever searched for the solution file for the Higgs Kaggle contest? Have you ever asked why the organizers don't just publish it so that everyone is smarter? ;-)

Did you ever want to be able to estimate your submission's score without sending it to the Kaggle server? Have you ever been confused by the normalization of the weights?

Because I just got permission from my teammate, and because contestants are allowed to be generous and share their wisdom with the whole Internet and all the competitors, I decided to help everyone – and, perhaps, to re-energize the contest a little bit.

The solution file you are going to get isn't quite the real one – it isn't the file that only the organizers possess (and hopefully protect with highly confidential behavior and a strong enough password). But the file below is "rather close" to the right one.

You unzip the file and extract the CSV file that is inside the archive. This file will look as follows:
EventId,FakeWeight,FakeClass
350000,0.9520000000000001,b
350001,0.8270000000000001,b
350002,0.884,b
350003,0.0039000000000000003,s
350004,1.1500000000000001,b
350005,0.8300000000000001,b
350006,0.786,b
350007,0.994,b
350008,0.925,b
350009,0.0049,s
350010,0.978,b
...
899997,0.886,b
899998,0.8280000000000001,b
899999,1.1500000000000001,b
The silly unrounded figures are due to a Mathematica bug – they have appeared despite the command Round[x,0.0001] etc.

At any rate, the file tells you which of the 550,000 events are "signal" and which of them are "background", and it tells you the weights, too!

You are supposed to use the whole file to rate your submission.

Take your candidate submission. Look at all the events you conjectured to be "s". Compute $$s$$ as the sum of the "fake weights" in my solution file over all the events that you labeled "s" and that actually are "s" there (true positives). Compute $$b$$ as the sum of the "fake weights" in my solution file over all the events that you guessed to be "s" but that are identified as "b" in my solution file (false positives). You are supposed to use all 550,000 events in the solution file (even though only 100,000 are used for the preliminary leaderboard and the remaining 450,000 will be used to determine the winners at the end).
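The two sums described above can be sketched in Python as follows. This is only a minimal illustration: the column names `EventId`, `FakeWeight` and `FakeClass` come from the file header shown earlier, while the helper names `load_solution` and `signal_background_sums` are hypothetical, and the demo uses a few rounded rows from the sample rather than the full 550,000-event file.

```python
import csv
import io


def load_solution(fileobj):
    """Map EventId -> (FakeWeight, FakeClass) from the solution CSV.

    Hypothetical helper; column names are taken from the header above.
    """
    reader = csv.DictReader(fileobj)
    return {
        int(row["EventId"]): (float(row["FakeWeight"]), row["FakeClass"])
        for row in reader
    }


def signal_background_sums(solution, predicted_signal_ids):
    """Return (s, b) for the events that a submission labeled "s".

    s: sum of fake weights of events that are "s" in the solution file.
    b: sum of fake weights of events that are "b" in the solution file.
    """
    s = 0.0
    b = 0.0
    for event_id in predicted_signal_ids:
        weight, true_class = solution[event_id]
        if true_class == "s":
            s += weight  # true positive
        else:
            b += weight  # false positive
    return s, b


# Tiny demo with a few (rounded) rows from the sample shown above.
sample = io.StringIO(
    "EventId,FakeWeight,FakeClass\n"
    "350000,0.952,b\n"
    "350001,0.827,b\n"
    "350003,0.0039,s\n"
    "350009,0.0049,s\n"
)
solution = load_solution(sample)

# Pretend the submission labeled these three events as "s".
s, b = signal_background_sums(solution, [350000, 350003, 350009])
print(s, b)
```

With the real file, you would pass an opened file handle to `load_solution` and the list of event IDs that your own submission marks as "s".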

Now, compute $${\rm AMS}_{\rm approx} = \frac{s}{\sqrt{b}}$$ or, more precisely, $${\rm AMS} = \sqrt{ 2\left( (s+b+10)\log\left( 1+\frac{s}{b+10} \right) -s \right) }$$ and you will get a pretty good estimate of the score that your submission will produce at the Kaggle server. For example, perform this procedure for a random submission file with 30% of the entries identified as "s" and 70% labeled as "b" and you get an AMS score slightly above 0.58, which you may find in the leaderboard repeatedly.
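For readers who prefer code to formulae, here is a minimal Python sketch of the two expressions. The function names `ams_approx` and `ams` and the parameter name `b_reg` are my own labels for illustration; the constant 10 is the one that appears in the formula above.

```python
import math


def ams_approx(s, b):
    """Rough estimate: s / sqrt(b)."""
    return s / math.sqrt(b)


def ams(s, b, b_reg=10.0):
    """More precise AMS with the constant 10 added to b, as in the formula."""
    return math.sqrt(
        2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s)
    )
```

Here `s` and `b` are the true-positive and false-positive weight sums computed from the solution file. When `s` is much smaller than `b`, the precise value is close to `s / sqrt(b + 10)`, so the rough formula is a decent sanity check.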

Invent a better submission and you may get closer to 3.8 – the ballpark of the scores achieved by the three or four semigods at the top of the leaderboard. ;-) I promise you that the file was created from a submission whose score earns the bronze medal on the preliminary leaderboard, which for 95+ percent of the competitors is almost equivalent to the "nearly perfect submission" they are dreaming about now, before their self-confidence reaches the heavens.

Finally, here is the OneDrive URL of the folder where you may download the zipped file:
Higgs Kaggle approximate solution file (CSV)
Happy kaggling and higgsing. Incidentally, Microsoft's OneDrive is a really handy place to get tens of gigabytes for free. I am particularly satisfied with how nicely and quickly it synchronizes photographs between my Lumia 520, which takes them, and the laptop.

You may create your OneDrive account, too. You will need to register a Microsoft (Hotmail-like) account if you don't have one.

1. Powers and abilities beyond those of mortal men!

2. It's really very simple. I am amused by the Kaggle forums, where numerous readers think that they may use this file to immediately jump to my score LOL. What cool naïveté. ;-) One may perhaps get a score around AMS=1 or 2 using such a file directly. It just happens to differ from the "right" solution file by so many tiny changes that they don't impact the AMS score calculated with them "too much".

3. But anyone can get a score around AMS=1 or 2 by simply running the python code in the starter kit (I got a miserable score of 1.54451), so your gift is quite stingy :-D

4. If there are confounded variables, addressing them individually cannot optimize them together. Sabotage the competition by giving them diversionary small successes. They redirect resources to not the solution.

Death is cheap. You want the opposition to have casualties.

5. Totally OT, but czech-out the new developments in Donbas.

6. Congrats on 500 entries. U have got that record.

What u gonna do with the prize?
Buy a new PC with quad graphics cards and a 3000 watt power supply????

That should keep u toasty warm this winter.

Cheers