Monday, May 04, 2015

How the Kaggle restaurant contest was hacked is a server that organized the ATLAS Higgs machine learning contest but it is organizing many others. One of them is the Turkish restaurant revenue prediction that is ending tonight.

You may download the information about 137 restaurants including the revenue – along with the values of 37 real-number parameters describing each "training" restaurant (plus some "discrete information" about each) as well as the revenue. And you should guess the revenue of 100,000 restaurants for which you are only told the 37 real-valued parameters (plus the "discrete information").

You are rated according to the root mean square average of the errors of your 100,000 predicted numbers. Well, the preliminary table is calculated from about 30% of the restaurants, and the final leaderboard deciding about $30,000 in prizes from the remaining 70% or so restaurants. If you open the current leaderboard, you will see that the BAYZ team has guessed the 30,000 numbers exactly.

They apologized for their omniscience on the Kaggle forums. How did they do it?

It seems utterly impossible to guess. However, I think that I know how they did it. You have several hours left to convert my recipe to a full algorithm – and try to send a perfect score and compete with those people, too. Note that with the perfect knowledge of 30,000 revenues, you get a delicious training sample of 30,137 or so restaurants instead of 30,000. That must help you to guess the remaining 70,000 accurately, too.

I think that they got all the exact information about the 30,000 restaurants in the public sample from the exact score that was returned once they submitted some very lousy but carefully designed submissions. I noticed that if you send a really lousy submission, you can get really insanely high scores like quadrillions etc. And those scares contain many digits of information!

A simple sketch how you may determine all the revenues.

Make a guess that the \(n\)-th restaurant has the revenue of \(10^{15n}\). So the initial restaurant has $1, the second one has a quadrillion, the third one has one quadrillion to the second, and so on. To be simple-minded, you could choose the revenues in this way up to the 100,000th restaurants. But it may produce too long, multi-gigabyte files. So you may divide the dataset to 100 pieces and only use these "exponentially huge" revenues for 1,000 restaurant in each piece, while you choose "finite" numbers close to the average revenue $4,400,000 or so for the remaining 99,000 restaurants. If you divide the job in this way, you will need 100 submissions.

The score effectively allows you to calculate\[

K = \sum_{i=1}^{100,000} \zav{ y_i - \hat y_i }^2

\] You need to undo the square root entering the score and you need to determine the normalization factor from a "small number of options" – only for the right normalization, you get \(K\) that literally looks like a power expansion described below!

If you look at \(K\), you will be able to reconstruct each \(y_i-\hat y_i\) separately, and because you know the estimated revenue, you may calculate the actual revenue, too. Just to be sure, if I simplify just a little bit, \(K\) will have the form similar to\[

\dots 14125 00000 25236 00000 32675 00000 93992

\] which is a sequence of "many zeroes" followed by "many generic numbers", and so on. You may attribute \(93992\) to the first restaurant, the \(32675 00000 00000\) to the second restaurant, and so on. The actual length of the clusters is different than 10 which I showed above but you will be able to do it right if you're smart enough.

Note that all the predicted and guessed scores only have 7, exceptionally 8 digits.

Consequently, the preliminary "very bad" score – if it can really be a hopeless sequence of thousands of digits – will give you enough information to reconstruct all the 30,000 precise revenues, and create a submission whose preliminary score is "perfect", namely zero.

Also, you will be able to use these 30,000 now "perfectly known" restaurant revenues as a much bigger training sample and create a good model that makes decent predictions for the remaining 70,000 restaurants, too. Does it make sense? ;-)

I am not 100% sure whether this is the kind of a trick that they have actually used. But whatever it was, I think that their trick is cooler than the actual work expected from the "down-to-earth" participants of the contest! In the actual contest, one is told 37 parameters describing each restaurants – such as the breast size of the waitress and similar things.

Of course, some correlation exists but it is very noisy and intrinsically acausal. So if you use the best models that people have – without the trick of the current leaders – you may only reduce the RMS average error in the revenue of $1.9 million (that you get if you simply guess that every revenue is $4.4 million) to something like $1.45 million – a rather small relative decrease for so much work.

The leaders were able to reduce this error to zero. Even though they clearly had to go "outside the box" and I doubt whether their solution will be accepted as "kosher" by the organizers for them to win the prize (if they were also able to make good enough guesses of the remaining 70,000 restaurants), their creativity makes me much more excited than the mindless uninspiring algorithms that the average "honest" users are supposed to exploit.

If their method does depend on "seeing very many digits" from the preliminary scores, there is an easy fix to protect the from similar hackers: just round the preliminary scores to 10 (or fewer) significant figures! ;-)

P.S.: You don't have 100 submissions left today anymore. If the basic paradigm above works, you may still get decent scores if you have 3 today's submissions left. Use my algorithm for the first 1,000 restaurants to determine their revenue. The submission file will contain 1,000 revenues of average length 10,000 digits or so – just some 10 extra megabytes or so. You will extend the training sample with 137 restaurants to 1,137, for example, and that should give you a nice enough improvement if you use some normal regression or sklearn methods to guess the revenues.

P.P.S.: A problem. Numbers such as "1e320" in the submission are already too high, producing errors during submissions. "1e300" seems OK. So the strategy above doesn't seem to work in the real world if you want to get all the data from one submission. Keeping the paradigm above and taking the numerical limitations into account, I think that I could only get about 300 digits of information from one preliminary score. The perfect leaders have about 100 submissions which would produce about 30,000 digits – barely enough to reconstruct "one digit of information" from each public restaurant. With this accuracy, I couldn't get their perfect accuracy – which requires to learn at least 6-7 digits from each of the 30,000 public restaurants. Where did they get these 200,000 digits of information?

A fellow Pilsner who knows much more about codes is telling me why this could be a doable exercise for cryptographers. Finding the 30,000 revenues is like finding a key in a "rather easy" cryptographic code returning 64 bits. But at this moment, I don't know how I would complete the task.

One more addition: I tried to send my submissions with 10 numbers like 1e30-1e300 for the first 10 restaurantsas well as the last 10 restaurants – and that didn't change my score at all! So it seems very likely to me that most of the restaurants are completely ignored in the preliminary score, and most of them will probably be ignored in the final score, too. So these folks only found "several" restaurants that are used in the preliminary score, and then those 100 submissions were enough to reconstruct their exact revenues from the preliminary scores.

Someone wrote that by looking at repetitions of dates in test.csv, there are just 321 "real" restaurants there, 96 of which are used to compute the preliminary score. If that's so, I think it would be easy to find out which 96 restaurants are used (just 96 x 5 = 500 or so digits of information, possible to get from 50 or 100 scores), and then find their scores by the methods above. That would also less-than-double the training sample. An improvement but not necessarily a dealbreaker.


  1. I guess the only point of this competition for the organizers was to look for some creative guys to hire (what else can you do with this 'dataset'?). So BAYZ got the point :).

  2. "Someone wrote that by looking at repetitions of dates in test.csv, there are just 321 "real" restaurants there"
    Do you have the source for this analysis?

  3. It was taken from a comment on the forums, search for 321 here (and the following pages):