Tuesday, December 22, 2020

The master of overfitting

Merry Christmas 2020!

Back in 2014, I competed in the
Higgs Boson Machine Learning Challenge (leaderboards).
At the end of the contest, when everyone could see that I was "apparently a frontrunner", I teamed up with Christian Veelken (giving him some 20% of "my team"), who, unlike me, was actually an active coder, and (while he probably didn't change things too much) we ended up as #1 among 1,784 teams on the public leaderboard. That nice ranking dropped to #8 on the private leaderboard, which was the one that mattered, so we ended up without any financial compensation (the winners got tens of thousands of dollars).

"With four parameters I can fit an elephant, and with five I can make him wiggle his trunk" - John von Neumann

Just to be clear, #8 was still far ahead of all the people who are supposed to do such things professionally. Only machine learning experts, i.e. "purely computer people" (plus me, perhaps), succeeded in the contest. The best CMS professional experimental physicist who also enjoyed this challenge (co-organized by ATLAS, the competitor of CMS) was Tommaso Dorigo at #640 (public leaderboard) and #743 (final, decisive, private leaderboard). With all my humility, CMS is in a different league than your humble correspondent, isn't it? ;-)

Recall that in the contest, you downloaded a huge amount of raw data describing the products of very many LHC collisions (energies and directions of the particles that were created). Your task was to divide the collisions into "background" and "decay of the Higgs boson to tau pairs". The organizers knew the "right" classification because the "signal" was included artificially in the dataset. Almost everyone had to use some form of "machine learning" or "neural network". Many participants used just a small modification of some "example code" that was given to everyone; I semi-belonged to this subset (because my modifications involved a dozen mini-revolutions executed in Python and in Wolfram Mathematica, my "programming language of choice").

Now, in late 2020, two dudes (David Rousseau of Paris-Saclay and Andrej Usťužanin of Moscow) who have been close to the organization of that contest (and others) contributed a chapter to a book about machine learning in particle physics:
Machine Learning scientific competitions and datasets
Some TRF readers are interested in my name. If you search for Motl in that paper, you will see:
...Particularly interesting curves are the ones from Lubos Motl’s Team who was number 1 on the public leaderboard but fell to number 8 on the final leaderboard. A sharp peak on the public test curve (with no counterpart on the private test curve) is due to public leaderboard overfitting as the team has claimed to “play” the public leaderboard, adjusting parameters in a semi-automatic fashion to improve their public score.
Here is the graph:

The x-axis is the percentage of rejected events – linked to the expectation about the percentage of the signal events – while the y-axis is the score (for any value of the percentage, it was calculated from the ranking of the events from the most signal-like to the most background-like that you had to submit as your "solution"). The red curve was my public leaderboard score – what the ranking looked like – and you may see the impressive peak in that curve. That red peak wasn't reproduced in the blue, private leaderboard. The "apparent" (public), red ranking was calculated from one part of the events; the "real" (private), blue ranking was calculated from the remaining events.
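For context, the score on the y-axis was the contest's Approximate Median Significance (AMS), which the organizers published as part of the challenge documentation. A minimal sketch in Python (s and b are the weighted counts of the signal and background events that a submission selected; the challenge used a regularization constant of 10):

```python
import math

def ams(s, b, b_reg=10.0):
    """Approximate Median Significance, the HiggsML score.
    s: weighted sum of true signal events the submission selected,
    b: weighted sum of background events it selected,
    b_reg: the regularization constant (10 in the challenge)."""
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))
```

Selecting more signal while keeping the background fixed raises the score, which is what the peak in the red curve "apparently" achieved on the public subset.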

My submission was the only one that showed this "red peak beating the blue one" (I was the only one who really fought for the Republican Party LOL). And the difference between the curves pretty much proves that I was doing something that others were not doing. And indeed, as we knew, it was "overfitting". I was picking the submissions that were producing "apparently good scores" in the public leaderboard and "breeding" better submissions out of them. This sounds like the perfect natural selection in action, doesn't it?
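The mechanism is easy to simulate. In this toy model (entirely hypothetical numbers, not the contest's data), every submission has the same true quality, and its public and private scores are the truth plus independent noise; "breeding" the submission with the best public score then systematically overstates its private performance:

```python
import random

def winner_gap(n_submissions=200, true_quality=3.76, noise=0.05, seed=1):
    """Toy model of public-leaderboard selection: each submission's
    public score is the true quality plus noise from the public subset,
    its private score the true quality plus fresh, independent noise.
    Picking the submission with the best public score overstates
    how good it will look on the private set."""
    rng = random.Random(seed)
    subs = [(true_quality + rng.gauss(0, noise),   # public score
             true_quality + rng.gauss(0, noise))   # private score
            for _ in range(n_submissions)]
    best_public, its_private = max(subs)  # select on the public score
    return best_public - its_private     # positive on average
```

Averaged over many seeds, the gap is clearly positive: the selected submission's public score beats its own private score, which is the regression to the mean behind a drop from #1 to #8.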

Well, not quite, up to one "detail". I relied on the public leaderboard scores as if they were my "experiment" in order to select features of the submissions and to determine the direction of my efforts in general. The funny thing is that all the truly successful machine learning guys didn't rely on the public leaderboard scores at all! How is it possible? Well, they calculated their own scores and that's how they were producing their own submissions (the ranking of the LHC events). A good metaphor might be that all of us (in the top 20, to say the least) relied on some "empirical data", but I had to rely on observations of distant objects through telescopes, while Gábor Melis and the other folks who beat me had their own lab where they did the experiments locally! ;-)

The Kaggle.com server was comparing the classification (signal/background) guessed by the contestant against the "real classification", either in the public subset or the private subset. But all the skillful machine learning and neural network guys were emulating this process locally. They basically developed algorithms that were producing an accurate enough classification – and the whole process of checking which algorithms were promising was done locally! They mostly evaluated their algorithms by computing their own local score on one subset of the Kaggle events while the model was trained on another subset.
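The local checking described above can be sketched in a few lines of Python (a minimal, hypothetical interface, not anyone's actual contest code): hold out a random slice of the labeled training data, fit on the rest, and score on the held-out slice, with no server involved.

```python
import random

def local_score(data, labels, fit, score, holdout_frac=0.3, seed=0):
    """Hold out a random fraction of the labeled training data,
    fit on the rest, and score on the holdout -- an entirely
    local substitute for the public leaderboard."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    cut = int(len(idx) * holdout_frac)
    hold_ix, train_ix = idx[:cut], idx[cut:]
    model = fit([data[i] for i in train_ix], [labels[i] for i in train_ix])
    preds = [model(data[i]) for i in hold_ix]
    return score(preds, [labels[i] for i in hold_ix])
```

Here `fit` returns a trained model (a callable) and `score` is any metric you like; both are placeholders for whatever the contestant's actual pipeline computed.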

Of course I have always known that this was the optimal, professional method to do these things. But not being an active coder, I found it difficult or too time-consuming to write down any code that would locally compute scores after separating the data into parts, among similar things. So all my submissions were modifications of the simple machine learning code that we could use; with additional pre-processing and post-processing of all the data; with the ranking being manipulated and averaged in various ways; and with some additional, rather clever, intellectually exciting tricks that I can't reproduce right now because it's been over 6 years. Many of the improvements were "natural mathematics and physics" in character (like better variables); I believe that they could be used as additions to the winners' strategies to improve their results further.
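One of those post-processing tricks, averaging the rankings of several submissions, is simple to sketch (a minimal illustration, not the actual contest code): each event's rank is averaged across the submissions and the events are then re-ranked by that mean.

```python
def average_rankings(rankings):
    """Blend several submissions by averaging each event's rank
    across them, then re-rank -- a simple ensembling trick that
    can smooth out the noise any single model overfits to.
    rankings: list of rank lists, one per submission; rank 0 is
    the most signal-like event."""
    n = len(rankings[0])
    mean_rank = [sum(r[i] for r in rankings) / len(rankings) for i in range(n)]
    order = sorted(range(n), key=lambda i: mean_rank[i])  # smallest mean first
    out = [0] * n
    for rank, i in enumerate(order):
        out[i] = rank
    return out
```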

But the point was that for me as a non-coder, certain machineries to "self-test" the algorithms were too difficult, which is why I avoided them; and the tricks that I did include had to rely on the "public leaderboard scores" as a guide because I "outsourced" the evaluation process.

The reasons why I did this "outsourcing" are easy to understand – I sucked as a coder. I had never written a program realizing neural networks or machine learning from scratch, which is why it was unlikely that I would do so along with all the complicated processing of the (possible) Higgs decay events' parameters. I was simply restricted to programs that others created, and they didn't give me a framework for the local separation of the data and similar things.

What's ironic is that I ended up doing things that were exactly of the character that I have criticized in a majority of the TRF blog posts in the past 16 years! ;-) So first of all, I did care about the "experimental data" and some people think that's great. But the data were really "what it looks like", the public leaderboard scores, and caring about such scores is similar to "caring what other people think"! It's about the appearances, not the reality. The reality about "how accurate an algorithm is" is something that a good coder+theorist may evaluate locally, without any outsourcing. In this sense, the successful participants didn't care about the "experimental" data (the public leaderboard scores) at all.

You may even say that my reliance on the public leaderboard scores was analogous to the Dominion voting machines that were used in the U.S. presidential elections. The Democrat Party was allowed to "outsource" the evaluation of votes to communist criminal organizations in Venezuela and what a surprise that the winners are someone whom Hugo Chávez would have preferred! A vibrant democratic country just shouldn't outsource the evaluation of votes; and a successful Kaggle Higgs contestant shouldn't outsource the evaluation of the quality of algorithms to a murky Kaggle.com calculation. (I could only learn several such scores a day, which was another source of the superiority of the people who did rate their algorithms locally.)

Again, my usage of the public leaderboard score as the "experimental data" could be praised by naive enough people as a nice example of "empirical science". But it was clearly (for me and the machine learning competitors) "inferior science" because the scores should be calculated locally. As a result, I unavoidably ended up overemphasizing particular features (and noise) present in the public leaderboard subset of the collisions. I was describing combined traits of their properties and my algorithms, not quite the properties of my algorithms. So I ended up optimizing the public leaderboard score, which isn't quite the same thing as optimizing the algorithm.

What you wanted to optimize was the private leaderboard score – which wasn't quite identical to the "precise quality rating of your algorithm", either, because the private leaderboard subset was just another random subset of the collisions. But the skillful guys rated their algorithms according to their performance on "arbitrary subsets", which is why they avoided my overfitting – my exaggerated dependence on the features of (and noise in) the public leaderboard subset of collisions. Gábor Melis, the winner, believes that he got to the top thanks to the "careful use of nested (!) cross-validation".

Another consequence of my framework was that the algorithm behind my apparently best submissions was a contrived hybrid of a sort. Up to some point, the averaging of several submissions may lower the amount of overfitting. However, once there are too many adjustable weights and other parameters (and my programs did have them), it's clear that they're ultimately adjusted to overfit many noisy features of the public dataset (which aren't present in the remaining collisions). Even in physics, a great theory just shouldn't have too many parameters, especially not parameters that were clearly not measured accurately enough (separately).
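The "nested cross-validation" that the winner credits can be unpacked with a sketch (a generic illustration with a hypothetical fit/score interface, not his actual code): the inner loop picks the hyperparameters, while the outer loop measures how the whole selection procedure generalizes, so the final estimate never touches the data that was used for tuning.

```python
import random

def nested_cv_score(data, labels, fit, score, param_grid,
                    outer_folds=3, inner_folds=3, seed=0):
    """Nested cross-validation: the inner CV chooses a parameter,
    the outer CV scores the resulting model on data the selection
    never saw, estimating the quality of the *procedure* itself."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)

    def folds(ix, k):
        # split ix into k roughly equal interleaved folds
        return [ix[i::k] for i in range(k)]

    outer_scores = []
    for test_ix in folds(idx, outer_folds):
        train_ix = [i for i in idx if i not in set(test_ix)]

        def inner_score(p):
            # cross-validate parameter p on the outer-training data only
            ss = []
            for val_ix in folds(train_ix, inner_folds):
                fit_ix = [i for i in train_ix if i not in set(val_ix)]
                model = fit([data[i] for i in fit_ix],
                            [labels[i] for i in fit_ix], p)
                ss.append(score([model(data[i]) for i in val_ix],
                                [labels[i] for i in val_ix]))
            return sum(ss) / len(ss)

        best_p = max(param_grid, key=inner_score)
        model = fit([data[i] for i in train_ix],
                    [labels[i] for i in train_ix], best_p)
        outer_scores.append(score([model(data[i]) for i in test_ix],
                                  [labels[i] for i in test_ix]))
    return sum(outer_scores) / len(outer_scores)
```

The point of the nesting is exactly the contrast drawn above: the parameter selection never gets to "peek" at the events used for the final score, so it cannot overfit them the way my public-leaderboard breeding did.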

OK, my strategy in the contest ended up being "do care what other people (and Kaggle-like servers) think" and "care about the appearances". Because the appearances (the public leaderboard scores) are rather highly correlated with the "truth", I ended up dropping from the 1st place "only" to the 8th place (which was still the biggest drop among the top 13 contestants).

But the main message of this blog post and thousands of other blog posts on this website is more dramatic. In many situations in real life and business, the correlation between "appearances" and "reality" is far weaker. Too many people and companies end up focusing on "what the things look like" and they are "overfitting" in the sense of doing things that "look good" or even those that are "politically correct". The difference is that the selection of winners in the Kaggle contest depended on a clever meritocratic procedure involving the private subset of events, and that's why my excessive attention paid to the "apparent scores" wasn't a safe path to the victory. And indeed, it was a sign of the quality of the contest that my approach didn't win (although many clever ideas underlying those efforts have remained unused and unappreciated).

Sadly, in the real world, people and companies are increasingly evaluated according to "what it looks like" and not according to the actual meritocracy, according to "how things actually are". And that's why we're already drowning in lies, bubbles, Ponzi schemes, hypocrisy, left-wing pseudoscientific superstitions, and excrements in general. My "apparent" score was 3.85 while the real one was just 3.76. In the real world, the differences between the truth and the appearances are deeper. For example, both Tesla and Bitcoin seem to be worth "half a trillion dollars or so" each while the real value is self-evidently zero (or very close to it). And the real world doesn't seem to have any adult in the room left, it doesn't have any "private leaderboard" that would produce the right results at the end, namely "dear Elon, Greta, and Satoši, appearances notwithstanding, you are worth just a piece of šit".

To be more precise, there is at least one adult left in the room, TRF, so let me say something that you may expect by now: Dear Elon, Greta, and Satoši, appearances notwithstanding, you are worth just a piece of šit.

And that's the memo.
