## Thursday, January 14, 2016

### A LIGO Kaggle contest would be fun

Vaguely related: my contribution to the rumor mongering was mentioned by SciAm and Nature (the same article) but also by PerfScience, The Indian Republic, and especially SkyAndTelescope, which seems like the best English article on the story to me. However, all of these texts are beaten by this Czech story, whose birth I fought for a little bit. Also, that Czech article has gotten over a hundred comments in a few hours and the count will keep growing – it is somewhat amazing that the Internet's key language, English, doesn't attract the same level of reader interest in LIGO as a Czech article.

One of the topics we touched on in some recent e-mails with Bill Z. was the question of how LIGO – or LIGO/VIRGO – will determine the statistical significance of their discoveries. Bill has had some interesting thoughts. They surely have some algorithms, too. (I have actually looked at some pages with them but I don't quite know how they're doing it.) They must have been thinking about it for quite some time.

A chirp

It would be nice if LIGO/VIRGO released their data publicly – maybe the rawest possible data you can get. People could play with them – maybe they would be able to discover gravitational waves from many more events than those that are visible to the LIGO/VIRGO teams. (Next month, we are pretty likely to be told about at most two events but the data taking was only completed on January 12th so maybe some events were added.) Why do I think so?

Because I am heavily intrigued by the standard picture above showing the inspiral gravitational waves from two objects orbiting each other. The number of periods that you can see is basically unlimited. But the intensity only becomes big enough at the end of their life. Even if each period of the wave may contribute a fraction of a sigma to your confidence level, they may combine to a very high confidence level.

You just wait for a sufficient amount of time – and your confidence level may grow. I think that it (the number of sigmas) grows like the square root of the time. Do they have some nice methods? Are they canonical or accepted?
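The square-root guess can be made explicit in an idealised setting. As assumptions that go beyond the text above, suppose each of the $N$ wave periods contributes the same signal amplitude $a$ and independent Gaussian noise of standard deviation $\sigma$, that the periods are added coherently, and that the frequency is fixed so that $N$ is proportional to the observation time $T$:

```latex
\text{signal} = N a, \qquad
\text{noise} = \sqrt{N}\,\sigma, \qquad
\frac{\text{signal}}{\text{noise}} = \sqrt{N}\,\frac{a}{\sigma}
\;\propto\; \sqrt{T} \quad \text{for } N \propto T .
```

With $a/\sigma = 0.1$, for example, 2,500 periods would already add up to 5 sigma. A real inspiral has a growing amplitude and frequency, so this is only the fixed-amplitude limit of the argument, not a statement about what LIGO actually sees.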

In September 2010, LIGO discovered some gravitational waves. Well, it was a fire drill. All the members had to work hard. Although they probably suspected it had to be a fire drill, they did their work up to the moment of publication when they were told "it was fun but it was a fire drill".

This picture shows one of the representations of the fake discovery. The left picture is from Washington state; the right picture is from Louisiana. They show the strength of the signal as a function of time (the horizontal axis covers 0.5 seconds of the run) and of frequency (the vertical axis is logarithmic and runs from 32 to 2,000 Hz or so).

You see that the resolution of the colorful chart is poorer at the bottom of the picture – because lower frequencies imply a greater uncertainty in the timing. The two detectors detect the wave with different amplitudes because they measure the plus/cross polarizations with respect to different pairs of axes. The Washington state detector (Google Maps) has its arms going in the northwest and southwest directions from their intersection, while the Louisiana detector has its arms going west-southwest and south-southeast from the intersection, if you understand me. Because the Earth isn't quite flat, the directions of the arms wouldn't quite coincide even if you tried hard (as long as the arms are horizontal).

Nevertheless, the frequency charts make it very clear that the same "arc" indicating a (fake but realistic) signal with an increasing frequency (according to some largely predictable law) is seen in both detectors. When the "scratches" with the signal are this localized, it may be that the discovery and the significance level obey some rather standard calculations.

However, my feeling is that weaker inspiral waves were coming to the detectors long before the intensity peaked, and the information contained in these weak but persistent waves may be pretty much wasted. It seems plausible to me (correct me if my guess is demonstrably wrong) that "most of the confidence" could actually come from the moments well before the peak, from thousands of maxima and minima when the wave wasn't too strong yet.

If you make a simple Fourier transform, you see the fixed-frequency components. But the inspiral waves don't have any fixed frequency. I feel that one should systematically search for the "signals with the increasing frequency" – increasing basically according to the most likely predictions of GR. If I simplify things just a little bit, calculate the inner product of the raw data from the detector with the first graph in this article – and do it for various/all choices of the frequency, its rate of increase (modeled realistically, perhaps only for circular orbits), and the peak time (end of the event).
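The inner-product search described here can be sketched in a few lines. This is only a toy under loud assumptions: the frequency rises linearly (GR predicts a steeper law), the starting phase is fixed, the noise is white with unit variance, and the names `chirp_template`, `search` and `demo` are made up for this illustration. It is not LIGO's pipeline.

```python
import numpy as np

def chirp_template(n_samples, fs, f0, rate):
    """Toy chirp whose frequency rises linearly from f0 (Hz) at `rate` (Hz/s).

    The linear sweep and zero starting phase are simplifying assumptions.
    """
    t = np.arange(n_samples) / fs
    h = np.sin(2 * np.pi * (f0 * t + 0.5 * rate * t**2))
    # Unit norm: for unit-variance white noise the inner product is then
    # measured directly in noise standard deviations ("sigmas").
    return h / np.linalg.norm(h)

def search(data, fs, n_samples, f0_grid, rate_grid):
    """Try every (f0, rate) template at every time offset.

    Returns (best_score, f0, rate, end_time_in_seconds).
    """
    best = (0.0, None, None, None)
    for f0 in f0_grid:
        for rate in rate_grid:
            h = chirp_template(n_samples, fs, f0, rate)
            # Inner product of the data with the template at each offset.
            scores = np.abs(np.correlate(data, h, mode="valid"))
            k = int(np.argmax(scores))
            if scores[k] > best[0]:
                best = (float(scores[k]), f0, rate, (k + n_samples) / fs)
    return best

def demo():
    """Hide a weak chirp (amplitude 0.5, noise sigma 1) in 4 s of noise."""
    fs, n = 1000.0, 1000
    rng = np.random.default_rng(0)
    data = rng.standard_normal(4000)
    t = np.arange(n) / fs
    # Injected signal: starts at 1.5 s, ends at 2.5 s, 50 Hz rising at 100 Hz/s.
    data[1500:2500] += 0.5 * np.sin(2 * np.pi * (50.0 * t + 0.5 * 100.0 * t**2))
    return search(data, fs, n, [50.0, 150.0], [60.0, 100.0, 140.0])
```

Calling `demo()` returns the best score together with the template parameters and the end time it was found at. The point of the toy is that every individual sample of the injected wave is weaker than the noise, yet the summed inner product over a thousand samples stands far above it.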

Isn't it possible that one could see many more events through the gravitational waves?

There could be a fun Kaggle contest

The task could be basically to take the raw data from several months of LIGO (fake or, which would be much more exciting, real data) and make as many discoveries of gravitational waves as possible. The criterion could be simple. You could get the data from both detectors but with the hours randomly permuted, or something like that, and you would try to identify all the choices of
(frequency, parameter labeling the rate of frequency increase, timing of the peak/end)
and your score would count, using some quantitative formula, all the events that you hypothesized and that overlapped between the two detectors sufficiently accurately in real time. The preliminary score could be calculated from some period of the LIGO run, the final score could be calculated from the rest, as you are used to in Kaggle contests.
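The "randomly permuted hours" idea could look roughly like the sketch below. Everything in it is hypothetical: the function names, the seed that plays the role of the organizers' secret key, and the default chunk lengths of one to two hours.

```python
import random

def scramble_plan(total_seconds, seed, min_len=3600, max_len=7200):
    """Cut [0, total_seconds) into consecutive chunks and shuffle them.

    Returns a list of (scrambled_start, real_start, length) tuples in the
    order the chunks appear in the released file. The last chunk in real
    time may be shorter than min_len.
    """
    rng = random.Random(seed)  # the seed is the organizers' secret
    chunks = []
    start = 0
    while start < total_seconds:
        length = min(rng.randint(min_len, max_len), total_seconds - start)
        chunks.append((start, length))
        start += length
    rng.shuffle(chunks)
    plan, offset = [], 0
    for real_start, length in chunks:
        plan.append((offset, real_start, length))
        offset += length
    return plan

def unscramble(t, plan):
    """Map a timestamp in the released file back to real time."""
    for scrambled_start, real_start, length in plan:
        if scrambled_start <= t < scrambled_start + length:
            return real_start + (t - scrambled_start)
    raise ValueError("timestamp outside the dataset")
```

Each detector would get its own seed, so that contestants cannot line up the two data streams themselves and only the organizers can check whether two candidates coincide in real time.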

There could be some room for "chance" but if a Kaggle user were really good, she could defeat the competition by a huge amount and persuade most of us that very many events may be discovered in the data.

I have no idea about the actual formats in which they store and send the LIGO observations, and so on. But to make the contest realistic and the download acceptably small, you could have 1 byte of information for the "change of the length" in each 0.001 second for each detector (one would search for the part of the signals where the frequency is below or well below 1 kHz; strictly, sampling every 0.001 second only captures frequencies below 500 Hz). Then you need just 86.4 megabytes per day per detector. That's some 5 gigabytes for two detectors per month. Maybe in this way, they could just give the contestants 20 GB to download, which is not extreme. And maybe this 20 GB file could be nicely compressed.
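The size estimate is easy to check. The snippet below just redoes the arithmetic, assuming decimal megabytes and gigabytes and a 30-day month; the function names are made up for the illustration.

```python
BYTES_PER_SAMPLE = 1
SAMPLES_PER_SECOND = 1_000   # one sample every 0.001 second
SECONDS_PER_DAY = 86_400

def dataset_bytes(days, detectors):
    """Raw size in bytes for the given number of days and detectors."""
    return (BYTES_PER_SAMPLE * SAMPLES_PER_SECOND * SECONDS_PER_DAY
            * days * detectors)

def days_that_fit(total_bytes, detectors):
    """How many days of data fit into a download of total_bytes."""
    return total_bytes / dataset_bytes(1, detectors)

print(dataset_bytes(1, 1) / 1e6)    # 86.4 megabytes per detector per day
print(dataset_bytes(30, 2) / 1e9)   # 5.184 gigabytes, two detectors, 30 days
print(days_that_fit(20e9, 2))       # about 116 days of two-detector data
```

So a 20 GB download would hold a bit under four months of data from both detectors at this resolution.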

The timing could be scrambled in the dataset for both detectors. The time could be divided into a few thousand intervals that are between 1 hour and 2 hours long. Your submission would contain up to 2 x 50 = 100 candidate discoveries labeled by the moment of the peak, the frequency, and something like the rate of the frequency increase. At most, you could make 50 discoveries if your 100 candidates were equally divided into 2 groups (Washington, Louisiana) and their "real time stamps" after the unscrambling matched with an accuracy comparable to 1 period of the waves. In reality, your success rate would be lower and you could be punished for inaccurate double hits etc. If there are some vibrations caused by something other than gravitational waves that are nevertheless expected to agree between the two detectors, it would be nice to subtract them from the score in some controllable way (because one doesn't want to encourage the users to discover things that are not gravitational waves). I don't know what the previous sentence would mean in any detail.
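One possible shape of the scoring rule is sketched below. It is only an illustration: the `Candidate` fields, the detector labels `"H"` and `"L"`, the greedy matching and the penalty of 0.5 per unmatched candidate are all assumptions, not an agreed formula. Peak times are taken to be already unscrambled by the organizers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    detector: str     # "H" for Washington, "L" for Louisiana (made-up labels)
    peak_time: float  # seconds in real time, after unscrambling
    frequency: float  # Hz at the peak
    rate: float       # Hz/s, rate of the frequency increase

def score(candidates, penalty=0.5, max_per_detector=50):
    """Count coincident pairs and subtract a penalty for unmatched candidates."""
    h = [c for c in candidates if c.detector == "H"]
    l = [c for c in candidates if c.detector == "L"]
    if len(h) > max_per_detector or len(l) > max_per_detector:
        raise ValueError("too many candidates for one detector")
    unused = set(range(len(l)))
    matches = 0
    for c in sorted(h, key=lambda c: c.peak_time):
        tolerance = 1.0 / c.frequency  # one period of the wave
        close = [i for i in unused
                 if abs(l[i].peak_time - c.peak_time) <= tolerance]
        if close:
            # Pair with the nearest still-unused candidate from the other site.
            nearest = min(close, key=lambda i: abs(l[i].peak_time - c.peak_time))
            unused.remove(nearest)
            matches += 1
    unmatched = len(h) + len(l) - 2 * matches
    return matches - penalty * unmatched
```

With this rule, a submission containing one good coincidence and two stray candidates scores 1 - 0.5 x 2 = 0, so guessing blindly does not pay. Whether and how to compare the frequency and rate labels of a matched pair is left open here.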

Do you think it's possible? Do you have any ideas to improve the contest?

LIGO has 991 members right now but I feel that the organization looks a bit secretive and it seems rather likely to me – this is no official accusation, just a guess – that their computational methods are suboptimal and that Kaggle-style analysts could make a meaningful contribution.

If they want to release the data publicly, maybe they should first think about a contest like that – because in that case, it could be a good idea to release the data only in some permuted or otherwise "incomplete" form.