Thursday, September 18, 2014

A top 2% Kaggle Higgs solution

Guest blog by Hongliang Liu (UC Riverside)

This post describes my selected top submission to the Kaggle Higgs competition. It scored 3.75+ on the public leaderboard and 3.73+ on the private leaderboard, which ranked 26th. The solution uses a single classifier with feature work based on basic high-school physics plus a few advanced but calculable physical features.

1. The model

I chose XGBoost, a parallel implementation of gradient boosted trees. It is a great piece of work: parallel, fast, efficient, and with tunable parameters.

The parameter tuning is simple. I know that GBM can learn well with a small shrinkage value eta. Based on a simple brute-force grid search over the parameters and the cross-validation score, I chose:

  • eta = 0.01 (small shrinkage)
  • max_depth = 9 (default max_depth=6 which is not deep enough)
  • sub_sample = 0.9 (giving some randomness for preventing overfitting)
  • num_rounds = 3000 (because of slow shrinkage, many trees are needed)
and left the other parameters at their defaults. These parameters work well; however, since I have only limited knowledge of GBM, the model is not optimal:

  • The shrinkage is too slow and the training time too long: one training run takes about 30 minutes, and cross-validation takes even longer. Faster parameter sets published on the forum bring the training time under 5 minutes.
  • The sub_sample parameter introduces randomness, especially when complicated non-linear features are involved, so the submission AMS score is not stable and not 100% reproducible. I discuss this briefly in the feature selection section.
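A minimal sketch of how the settings listed above could be passed to XGBoost's Python interface. Several things here are assumptions rather than details from the submission: the library spells the row-sampling parameter `subsample`, the objective `binary:logitraw` is not stated in this post, and the names `HIGGS_PARAMS`, `NUM_ROUNDS` and `train_higgs` are hypothetical.

```python
# Sketch only: the four tuned values come from the list above,
# everything else (objective, helper names) is an assumption.
HIGGS_PARAMS = {
    "objective": "binary:logitraw",  # assumed; not stated in the post
    "eta": 0.01,                     # small shrinkage
    "max_depth": 9,                  # deeper than the default of 6
    "subsample": 0.9,                # written "sub_sample" in the text
}
NUM_ROUNDS = 3000  # many trees to compensate for the small eta


def train_higgs(features, labels, weights,
                params=HIGGS_PARAMS, num_rounds=NUM_ROUNDS):
    """Train a booster on array-like inputs (hypothetical helper)."""
    # Imported lazily so the constants above are usable without xgboost.
    import xgboost as xgb

    # The competition data marks undefined values with -999.0.
    dtrain = xgb.DMatrix(features, label=labels, weight=weights,
                         missing=-999.0)
    return xgb.train(params, dtrain, num_boost_round=num_rounds)
```

Because of the row subsampling, two runs of such a function will generally not give identical boosters unless a random seed is fixed.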
Cross-validation is done in the usual way. One reminder about the AMS: when \(S \ll B\), the AMS equation simplifies to approximately\[

{\rm AMS} \approx \frac{S}{\sqrt{B}}

\] which means that if we evenly partition the training set for K-fold cross-validation, so that \(S\) and \(B\) are both scaled by a factor of \((K-1)/K\), the AMS score from CV is artificially lowered by a factor of approximately \(\sqrt{(K-1)/K}\). The same applies when estimating the test AMS: the weights should be rescaled according to the number of test samples.
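The scaling argument can be checked numerically with a short sketch. The full AMS expression with the regularisation term of 10 is the competition's definition; the weight sums `s_full` and `b_full` are hypothetical numbers chosen for illustration, and `rescale_to_full` is an illustrative helper name.

```python
import math


def ams(s, b, b_reg=10.0):
    """Competition AMS with regularisation term b_reg."""
    return math.sqrt(
        2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s)
    )


def ams_approx(s, b):
    """Approximation S / sqrt(B), valid when S is much smaller than B."""
    return s / math.sqrt(b)


def rescale_to_full(s_part, b_part, fraction):
    """Undo the weight loss when only `fraction` of the events is used."""
    return s_part / fraction, b_part / fraction


# Hypothetical selected-signal and selected-background weight sums.
s_full, b_full = 250.0, 4000.0
k = 5
fraction = (k - 1) / k  # share of the events, as in the text

# AMS computed on the reduced weights is lowered by sqrt(fraction) ...
raw = ams_approx(fraction * s_full, fraction * b_full)
# ... and rescaling the weights first restores the full-sample value.
fixed = ams_approx(
    *rescale_to_full(fraction * s_full, fraction * b_full, fraction)
)
```

The same `rescale_to_full` step applies to any subset: divide the weights by the fraction of events the subset contains before computing the AMS.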

XGBoost also supports customized loss functions. There is some discussion in this thread, and I made a few attempts, but I did not use a customized loss function in my submission.

2. The features

The key to this solution is the feature engineering. With the basic physics features alone, I reach a public leaderboard score of around 3.71; the advanced physics features push it to 3.75 public and 3.73 private.

2.1 General idea for feature engineering

The signal (positive) samples are Higgs to tau-tau events, where the two taus come from the same Higgs boson, while the background (negative) samples are tau-tau-like events whose particles have no correlation with a Higgs boson. My general idea is therefore that the particles in signal events should show kinematic correlations with the Higgs, while the background particles should not. This kinematic correlation can be represented:
  • Simply, as the opening angles (see part 2.2)
  • In a more complicated way, as the CAKE features (see this link in the Higgs forum; these features were not developed by me and are not used in the submission/solution)
2.2 Basic physics features

Following the general idea of finding “correlations” in the kinematic system, the correlation between each pair of particles can be useful. The possible features are:
  • The opening angles in the transverse (phi) plane and in the longitudinal plane (eta). Note that the phi angle difference must be mapped into ±pi, and that the absolute values of these angles work better for xgboost.
  • A careful study of the opening angles shows that the tau-MET opening angle in the phi direction is more or less useless. This is understandable: in the tau-l-nu case, the identified tau angle has little useful correlation with the nu angle.
  • The Cartesian components of each momentum \((p_x, p_y, p_z)\): they work with xgboost.
  • The momentum correlation in the longitudinal (Z) direction: for example, the jet momentum in the Z direction versus the tau-lepton momentum in the Z direction is important. The momentum in the Z direction can be calculated from \(p_T\) (the transverse momentum) and the eta angle.
  • The longitudinal eta opening angle between the leading jet and the subleading jet: the physics reason comes from the jet production mechanism from tau, but it is easy to notice without physics knowledge by plotting PRI_jet_leading_eta - PRI_jet_subleading_eta.
  • The ratios of the tau-lep, jet-jet and MET transverse momenta to the total transverse momentum.
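The features in this list can be sketched with a few helper functions. The function names are hypothetical and the exact set of pairs used in the submission is not spelled out in this post; the formulas themselves are the standard relations between \((p_T, \eta, \phi)\) and Cartesian momentum. Inputs equal to the dataset's missing marker (-999.0) would need to be handled before calling the angle and momentum helpers.

```python
import math

MISSING = -999.0  # marker the competition data uses for undefined values


def delta_phi(phi_a, phi_b):
    """Absolute azimuthal opening angle, wrapped into [0, pi]."""
    d = (phi_a - phi_b + math.pi) % (2.0 * math.pi) - math.pi
    return abs(d)


def delta_eta(eta_a, eta_b):
    """Absolute pseudorapidity difference."""
    return abs(eta_a - eta_b)


def cartesian(pt, eta, phi):
    """(px, py, pz) from transverse momentum, pseudorapidity, azimuth."""
    return pt * math.cos(phi), pt * math.sin(phi), pt * math.sinh(eta)


def pt_ratio(pt_part, pt_total):
    """Share of the total transverse momentum carried by one object."""
    return pt_part / pt_total if pt_total > 0 else MISSING
```

For example, `delta_phi(3.0, -3.0)` gives about 0.283 rather than 6.0, which is the point of the ±pi mapping: two directions on either side of the phi = ±pi boundary are close together.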
Overlaying the feature distributions for the signal and the background is a common technique for visualizing features in experimental high energy physics. For example, the jet-lepton eta opening angle distributions for the positive and negative samples can be visualized as follows:

Jet-lepton open angle distributions

2.3 Advanced physics features

If one reads the CMS/ATLAS Higgs to tau-tau analysis papers, one can find some advanced features for eliminating particular backgrounds. For example, this CMS talk covers the fundamental points of the Higgs to tau-tau search.

In the Higgs to tau-tau search, one of the most important backgrounds is the Z boson, which can decay into lepton pairs (l-l, tau-tau) and mimic the Higgs signal. Moreover, since the known Higgs mass of 126 GeV is close to the Z mass of 91 GeV, the tau-tau kinematics can be very similar in Z and Higgs events.

The common technique for eliminating this background is to reconstruct the Z invariant mass peak, which is around 91 GeV. The provided ‘PRI’ features only include the tau momentum and the lepton momentum, from which we cannot precisely reconstruct the Z invariant mass. However, the invariant mass idea hints that a pseudo transverse invariant mass can be a good feature. Its distribution looks like this:

Tau-lepton pseudo invariant mass
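The exact definition of the pseudo invariant mass is not given in this post. As a sketch under that caveat, the functions below use the standard massless-particle formulas for the transverse mass and the invariant mass of a pair of objects described by \((p_T, \eta, \phi)\); the function names are hypothetical.

```python
import math


def transverse_mass(pt_a, phi_a, pt_b, phi_b):
    """Transverse mass of two massless objects:
    sqrt(2 * pT_a * pT_b * (1 - cos(dphi)))."""
    m2 = 2.0 * pt_a * pt_b * (1.0 - math.cos(phi_a - phi_b))
    return math.sqrt(max(m2, 0.0))  # guard against tiny negative rounding


def invariant_mass(pt_a, eta_a, phi_a, pt_b, eta_b, phi_b):
    """Invariant mass of two massless objects:
    sqrt(2 * pT_a * pT_b * (cosh(deta) - cos(dphi)))."""
    m2 = 2.0 * pt_a * pt_b * (
        math.cosh(eta_a - eta_b) - math.cos(phi_a - phi_b)
    )
    return math.sqrt(max(m2, 0.0))
```

Two back-to-back objects with 45 GeV of transverse momentum each give a transverse mass of 90 GeV, close to the Z peak, which is why such a feature can help separate Z events.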

The QCD and W+jet backgrounds, where a jet can be mis-identified as a lepton/tau, are also important. However, these two have much lower cross-sections, so feature work on these two backgrounds is not very important. Tau-jet pseudo invariant mass features are useful too.

Some other, less important features are:
  • For the tau-tau channel, the di-tau momentum sum can be roughly estimated from the tau-lep-jet combination, although it is far from the truth.
  • For the 2-jet VBF channel, the jet-jet invariant mass should be high for signal.
Some comments about SVFIT and CAKE-A:
  • I know some teams (e.g. Lubos’ team) are using SVFIT, which is CMS’ counterpart of ATLAS’ MMC method (the DER_mass_MMC feature). SVFIT takes more variables and features into account than MMC, so it can do a better Higgs invariant mass reconstruction. Deriving an SVFIT feature from the provided features is very hard, so I haven’t done it.
  • The CAKE-A feature is basically the likelihood of whether a mass peak belongs to the Z mass or the Higgs mass. The CAKE team says that it is a very complicated feature, and there are some positive reports on the leaderboard, so ATLAS should investigate this feature for their Higgs search model.
2.4 Feature selection

Gradient boosted trees are generally sensitive to confusing samples. In this particular problem of Higgs to tau-tau, many background events have features very similar to those of the signal, so a good understanding of the features, and careful feature selection, can help reduce confusion and build a better model.

The feature selection uses logic and/or domain knowledge; I haven’t applied any automatic feature selection techniques, e.g. PCA. The reason is that most automatic feature selection methods are designed around capturing the largest errors, while this competition is about maximizing the AMS score, so I am afraid that automatic selection could accidentally drop some important features.

Because the proton-proton collision happens along the longitudinal direction, the transverse plane is symmetric. The raw phi values of the particles are therefore mostly useless and should be removed to reduce confusion in the boosting process. This symmetry is also the reason why the ATLAS and CMS detectors are cylinders.

The raw eta values of the tau and the lepton don’t help much either. The reason lies in the Higgs to tau-tau production mechanism, but one can easily see it from the symmetric distributions of these variables.

The raw eta values of the jets are important. The physics reason comes from the jet production mechanism, in which the jets from the signal should be more central, but one can easily find this without physics knowledge by overlaying the distributions of PRI_jet_(sub)leading_eta for the signal and the background.
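A small sketch of this selection step. The column names are those of the competition data; the exact list dropped in the submission is an assumption based on the paragraphs above, and `DROP_COLUMNS` and `select_columns` are hypothetical names.

```python
# Raw columns that the text argues carry little information on their own:
# all raw phi values, plus the raw eta of the tau and the lepton.
DROP_COLUMNS = {
    "PRI_tau_phi", "PRI_lep_phi", "PRI_met_phi",
    "PRI_jet_leading_phi", "PRI_jet_subleading_phi",
    "PRI_tau_eta", "PRI_lep_eta",
}


def select_columns(header, rows, drop=DROP_COLUMNS):
    """Return (header, rows) without the columns named in `drop`."""
    keep = [i for i, name in enumerate(header) if name not in drop]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows
```

Derived features such as the opening angles need the raw phi and eta values, so a step like this would have to run after those features have been computed.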

2.5 Discussions

More advanced features or not? That is the question. I only kept physically meaningful features for the final submission. I experimented with some other tricky features, e.g. the weighted transverse invariant mass with the PT ratio, and some of them helped the score on the public LB. However, they didn’t show a significant improvement in my CV score. To be safe, I spent the last 3 days before the deadline removing these ‘tricky’ features and keeping only the basic physics features (linear combinations) and the pseudo invariant mass features, in order not to overfit the public LB. After checking the private LB scores, I find that some of them would have helped, but only a little. @Raising Spirit on the forum posted the feature DER_mass_MMC*DER_pt_ratio_lep_tau/DER_sum_pt, and Lubos has a nice comment on whether it is a good idea or not.

The effect of the CAKE features. I used 3 test submissions to test the CAKE-A and CAKE-B features. With CAKE A+B, both my public and private LB scores drop by around 0.01; with CAKE-A alone, my model score is almost unchanged (it drops by 0.0001); with CAKE-B alone, my model score improves by 0.01. I think this is because the CAKE-A feature may be strongly correlated with my current features, while CAKE-B is essentially the MT2 variable in physics, which can help for the hadronic (jet) final state. I haven’t included these submissions among my final scored ones, but thanks to the CAKE team for providing these features.
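For reference, the forum feature mentioned above is a one-liner. The treatment of undefined inputs is an assumption made here (DER_mass_MMC is -999.0 when the mass estimate is undefined), and the function name is hypothetical.

```python
MISSING = -999.0  # marker the competition data uses for undefined values


def mmc_ratio_feature(mass_mmc, pt_ratio_lep_tau, sum_pt):
    """DER_mass_MMC * DER_pt_ratio_lep_tau / DER_sum_pt.

    Assumption: propagate the missing marker instead of computing a
    meaningless number when the mass is undefined.
    """
    if mass_mmc == MISSING or sum_pt <= 0:
        return MISSING
    return mass_mmc * pt_ratio_lep_tau / sum_pt
```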

3. Conclusion and lessons learned

What I haven’t used:
  • A loss function using the AMS score: these two posts (post 1 and post 2) propose an AMS loss function. XGBoost has a good interface for such customized loss functions, but I just didn’t have the chance to tune the parameters.
  • Tricky non-linear non-physical features.
  • Vector PT sums of tau-lep, lep-tau and other particle pairs, and their opening angles with other particles. They are physically meaningful, but my model doesn’t pick them up :-(
  • Categorizing the jet multiplicity (PRI_jet_num). This categorizing technique usually works better since it increases the feature dimension for better separation, but not this time, maybe because of my model parameters.
  • Splitting the model by PRI_jet_num. In a common Higgs analysis, the analysis is divided into different number-of-jets categories, e.g. 0-jet, 1-jet, 2-jets, because each category can have a different physical meaning in the Higgs production. XGBoost captures this partition nicely from the features, and it handles the missing values in a nice way.
Lessons learned
  • Automate the process: a script for running CV, a script for the full training + testing workflow, a script for the parameter scan, and a library for adding features so that CV and training use consistent features. It saves a lot of time.
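The two PRI_jet_num ideas in this list, which did not make it into the submission, can be sketched as follows. PRI_jet_num takes the values 0, 1, 2 and 3 in the competition data, with 3 meaning three or more jets; the helper names are hypothetical.

```python
from collections import defaultdict

JET_BINS = (0, 1, 2, 3)  # 3 stands for "3 or more" jets


def one_hot_jet_num(jet_num):
    """Categorize the jet multiplicity as four indicator features."""
    return [1 if jet_num == b else 0 for b in JET_BINS]


def split_by_jet_num(rows, jet_num_index):
    """Group rows by jet multiplicity, e.g. to train one model per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[int(row[jet_num_index])].append(row)
    return dict(groups)
```

With separate models per group, the jet columns that are always undefined in the 0-jet and 1-jet groups could be dropped from those groups entirely.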
  • Discuss with the team members and check the forum.
  • Renting a cloud is easier than buying a machine.
  • I should learn to ensemble classifiers for a better score in the future.
  • Spend some time: studying the features and the model takes time. I paused my submissions for 2 months to prepare my O1-A visa application (I had no luck in this year’s H1B lottery, so I had to write ‘a lot of a lot’ for this visa instead) and only fully resumed about 2 weeks before the deadline, when my visa was approved. Since then my VM instance has run like crazy on feature work, model tuning and CV while I slept or was at my daily work. Fortunately, this work (plus the electricity bills to Google) paid off well in the rank.

4. Acknowledgments

I want to thank @Tianqi Chen (the author of xgboost), @yr, @Bing Xu and @Luboš Motl for their nice work and for the discussions with them. I also want to thank Google Cloud Engine for providing $500 of free credit for their cloud computers, so I didn’t have to buy my own computer but could just rent a 16-core VM.

5. Suggestions to ATLAS, ROOT and Kaggle

To ATLAS and ROOT: XGBoost is a great idea for parallelizing GBM learning. ROOT’s current TMVA uses a single thread, which is slow; ROOT should bring ideas similar to xgboost into the next version of TMVA.

To Kaggle: it might be a good idea to get sponsored computing credits from cloud computing providers, e.g. Google Cloud and Amazon AWS, and give them to the competitors. This would remove the obstacle of computing resources for competitors, and it would also be good advertising for the cloud computing providers.

6. Our background

My teammate @dlp78 and I are both data scientists now. We used to work on the CMS experiment. He worked on the \({\rm Higgs}\to ZZ\) and \({\rm Higgs}\to b\bar b\) discovery channels, and I worked on the \({\rm Higgs}\to \gamma\gamma\) discovery channel and some supersymmetry/exotic particle searches. My PhD dissertation is a search for long-lived particles decaying into photons, with a method inspired by the tau discovery method, and my advisor is one of the scientists who discovered the tau particle (well, she also discovered many others, for example: Psi, D0, Tau, jets, Higgs). My LinkedIn page is linked on my Kaggle profile page, and I would like to connect with you great Kaggle competitors.


  1. Dear phunter, thanks for this very interesting text about some intellectual adventures within a very similar software and math framework in which I was immersed over many hours since late May, too. ;-) My best final score of an individual run was also slightly above 3.73.

    There were many semi-conceptual questions about the optimization that I couldn't answer with the preliminary scores, and without a workable cross-validation local quality control. When one sees the final scores and combines it with the previous experimenting, it seems clear to me (aside from older insights) that:

    1. new features, although I have also spent a lot of time on some clever geometry and reparametrizations, don't systematically help. The number of features listed in the contest was already somewhat high. If some features help, it's those that better and more "linearly" describe corners of the parameter space that look singular in all the existing features (but even some additional eta or phi centrality I added seemed to have been inconsequential, and even the impact of the very MMC, the only non-elementary function of the PRI variables, was very limited). But whether such features are "canonically interpolated" to some other corners, or whether one uses a more "physically motivated" nonlinear transformation of these features seems irrelevant for the score.

    2. individual runs produce AMS that fluctuates by the full 0.08 (prelim.) or 0.04 (final); averages tend to produce smoother submissions whose AMS score tends to be much less fluctuating - and whose final score tends to be much closer to the preliminary one

    3. eta, the rate of learning, should better be much smaller than 0.1 that I used most of the time (default xgboost demo setting), perhaps closer to 0.01 - you clearly did learn that better than I did. This must be compensated by a much larger number of steps. These low-eta submissions seem to get better scores.

    4. combining (averaging) many submissions, as I was sure throughout the contest, indeed helps a big deal, to raise and stabilize the score

    5. the large number of diverse components in these ensemble submissions is more important for the final score than the preliminary score; I didn't know how many is "enough" and it's plausible that averages of many more than my typical 9-16 are still helpful. My best submission 3.773 was a "megacombo" averaging 44 single xgboost runs
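The averaging described in points 2, 4 and 5 can be sketched as below. This is not the commenter's actual code: the function names are hypothetical, and the fraction of events labelled as signal is a free parameter that would have to be chosen, for example by cross-validation.

```python
def average_scores(runs):
    """Element-wise mean of per-event scores from several runs.

    `runs` is a list of equally long score lists, one per trained model,
    with events in the same order in every list.
    """
    n_runs = len(runs)
    return [sum(event_scores) / n_runs for event_scores in zip(*runs)]


def label_top_fraction(scores, signal_fraction):
    """Mark the highest-scoring share of events 's', the rest 'b'."""
    n_signal = int(round(signal_fraction * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=True)
    labels = ["b"] * len(scores)
    for i in order[:n_signal]:
        labels[i] = "s"
    return labels
```

Averaging raw scores assumes that all runs produce scores on a comparable scale; averaging per-event ranks instead is a common alternative when they do not.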

    If I had known these things a month ago, it would have been trivial to safely beat Gábor Melis and pals, I think. But the lack of experience, rudimentary programming skills, lack of CPU time and RAM, and other things just made it impossible to learn all these lessons (among many lessons that I did learn during the contest and aren't listed here), so good luck (some upward, not downward, fluctuation of the final score) was a necessary condition to get a medal, and the good luck wasn't there.
