Sunday, September 21, 2014

Kaggle: quantifying the African soil

If the most important task for mankind and computerkind (man's fourth best friend, after puppies, books, and women) was to recognize tau-tau semileptonic decays of the Higgs boson, the second most important task was to predict the properties of the African soil.
Currently it's the only open Kaggle contest whose data aren't huge – in gigabytes – and that also offers the winners a few bucks.

You download a 13 MB training file and an 8 MB test file, files an order of magnitude smaller than those needed in the Higgs contest.

The training file contains 1157 soil samples; the test file has 727 soil samples. As always, the training samples come with both the quantities you are always given and the quantities you should predict. The quantities you should predict are missing in the test file, of course. You should calculate them and send them as a submission with 727 x 5 numbers (and IDs) in it; a sample submission may be downloaded, too.
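To make the shape of such a submission concrete, here is a minimal sketch that fills every test row with the training means of the five targets. The column names (`PIDN` for the ID, and `Ca`, `P`, `pH`, `SOC`, `Sand` for the targets) are my assumptions about the file headers, and the toy rows are made up; check the sample submission for the real layout.

```python
import csv
import io

# Assumed target column names; verify them against the sample submission.
TARGETS = ["Ca", "P", "pH", "SOC", "Sand"]


def column_means(train_rows):
    """Average each target over the training rows (dicts of strings)."""
    n = len(train_rows)
    return {t: sum(float(r[t]) for r in train_rows) / n for t in TARGETS}


def build_submission(test_ids, means, id_column="PIDN"):
    """Return CSV text: one header line, then one line per test sample."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([id_column] + TARGETS)
    for sample_id in test_ids:
        writer.writerow([sample_id] + [means[t] for t in TARGETS])
    return buf.getvalue()


# Toy stand-in for the training file (the real one has 1157 rows).
train_rows = [
    {"Ca": "0.0", "P": "1.0", "pH": "-1.0", "SOC": "2.0", "Sand": "0.5"},
    {"Ca": "1.0", "P": "3.0", "pH": "1.0", "SOC": "0.0", "Sand": "1.5"},
]
means = column_means(train_rows)
submission_csv = build_submission(["id_a", "id_b", "id_c"], means)
print(submission_csv)
```

With the real files, `test_ids` would hold the 727 IDs from the test file, and the constant means would be replaced by whatever your model predicts.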

Each soil sample comes (with an unreadable ID and) with quite some information about the infrared spectrum, namely 3,578 values of the absorbance for various intervals of the wavelengths somewhere in the infrared spectrum. Note that the absorbance is the coefficient in the exponent of the exponential that measures how the incoming light is weakened by absorption. Each soil sample also has something like 15 extra (albedo etc.) non-spectral numbers plus one binary-valued "string" (the "depth", topsoil or subsoil).
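In symbols, the description of the absorbance above amounts to the following, where $I_0$ is the incoming intensity and $I$ the intensity that survives the absorption. The symbols are mine, and I am assuming the natural-exponent convention; spectrometers often report the base-10 version instead, which differs only by a constant factor.

```latex
I = I_0 \, e^{-A}
\quad\Longleftrightarrow\quad
A = \ln\frac{I_0}{I},
\qquad
A_{10} = \log_{10}\frac{I_0}{I} = \frac{A}{\ln 10}.
```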

What you should predict are five quantities: the carbon content, the pH, and the calcium, phosphorus, and sand content. These numbers are listed in the training file.

It's fair to say that all the numbers – absorbances and the predicted quantities – take values in intervals whose width is comparable to 2-5. You should pretty much minimize the "root mean square" distance of all these guesses from the right ones. (This rating is much more algebraically instinctive than the sometimes counterintuitive AMS score in the Higgs contest – although you could have gotten used to the AMS, too.) A benchmark you may find in the forum – a very simple 1-page Python script – produces 0.43621.
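Here is a small sketch of that kind of score. I am assuming the variant where a root mean square error is computed separately for each of the five predicted columns and the five numbers are then averaged; the function names and the toy numbers are mine, so check the contest's evaluation page for the exact definition.

```python
import math


def columnwise_rmse(y_true, y_pred):
    """Root mean square error of each column; inputs are lists of rows."""
    n_rows = len(y_true)
    n_cols = len(y_true[0])
    rmses = []
    for j in range(n_cols):
        squared = sum((y_true[i][j] - y_pred[i][j]) ** 2 for i in range(n_rows))
        rmses.append(math.sqrt(squared / n_rows))
    return rmses


def mean_columnwise_rmse(y_true, y_pred):
    """Average of the per-column RMSE values (assumed contest-style score)."""
    per_column = columnwise_rmse(y_true, y_pred)
    return sum(per_column) / len(per_column)


# Toy example with two samples and two target columns.
y_true = [[0.0, 1.0], [2.0, 3.0]]
y_pred = [[0.0, 2.0], [2.0, 5.0]]
score = mean_columnwise_rmse(y_true, y_pred)
print(score)
```

Smaller is better, and a perfect prediction gives exactly zero. Holding out part of the 1157 training samples and scoring them this way is a cheap check before spending one of the daily submissions.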

After 14 submissions, which take about 5% of the Higgs time to prepare and send (and one-half of my submissions were clear mistakes), my best score is 0.40926, which puts me at 42nd place out of 653, in the top 10% of the leaderboard. I feel it should be easy to get much better (smaller) scores. The current leader has 0.38859.

Only 3 submissions per day are allowed and the contest closes in 30 days.

Other solvable problems in Africa

Some problems in Africa may have rather easy solutions. For example, today, Bill Gates announced big progress in the development of his super-thin condom. The female organizers of today's climate march in New York City will surely be happy to learn that the condoms on the market are too small for the Ugandan men (although they have been tested by Chinese workers!), which makes it harder to fight AIDS over there.

One shouldn't be surprised by these discrepancies if he realizes that much of this charity work was funded by the money and guided by the vision of Micro-Soft. After all, 640 kB (or centi-inch) should be enough for everyone.
