Tuesday, October 25, 2011 ... Français/Deutsch/Español/Česky/Japanese/Related posts from blogosphere

Scientific explanation as a compression of information

Experimental physicists are collecting new information all the time. However, for physics not to degenerate into botany, physicists must do something beyond the mindless memorization of a constantly increasing body of empirical information.

They must also work hard to make sense out of it. They must explain the patterns they have seen. When they do so correctly, their explanations may not only account for the observations and patterns that have been seen in the past but also for those that will be seen in the future: successful explanations of the data from the past are also able to predict things about the future. This often allows to test theories; whenever a test succeed, our belief in a theory goes up. When a sharp test fails, the theory is excluded. A physicist whose memory is finite, if not small, may nevertheless become capable of explaining gigantic amounts of data.

Explanations in physics (and maybe all of science), from the straightforward phenomenological interpolation of points in a graph we have just measured to the most abstract examples of unification of physical theories, may be viewed as a special example of compression of information which is the main goal of explanations in science.




Let me mention that Moshe Rozali's comments from a thread about naturalness of inflation were the driver that inspired me to write this text. At Cosmic Variance, he was explaining to other readers that cosmic inflation makes the evolution of the early Universe much more natural but strictly speaking, it is not necessary if you don't care about naturalness:

Igor, I am not sure I understand. We have an initial value problem, so today’s observations are determined once you specify an initial state at some time in the distant past. If you specify the time to be the beginning of the big bang evolution, with the correct but very contrived initial state (nearly homogeneous with just the right kind of fluctuations) then you get no conflict with observation. By contruction, same applies to inflation, because it reproduces that initial state and all subsequent evolution. The only point of inflation is to make that initial state the outcome of prior evolution. By construction all current observations will then be identical, but the initial state will be more natural and less contrived. As I understand Sean’s statement, quantifying this intuitive notion of naturalness is tricky, and it is not always clear inflation indeed comes ahead. I hope I am not mangling things…

And, for the record, in my mind the notion of “naturalness” is one instance of “algorithmic compression”, which is the whole point of seeking a scientific explanation. Without invoking such criteria, by definition (for example) the particle data group review book would be always the best “theory” of particle physics, and you’d never need to learn about gauge theories and spontaneous symmetry breaking and all that stuff.

[...]

Sean, my comment about algorithmic compression is that since the initial state for the hot big bang has small entropy, it suggests that it can be “compressed”. That is, a language can be found in which it is a generic state; whether or not inflation does that is a different story. This seems to me what is meant by “naturalness” in this context. I don’t see how this is any different from trying to express regularities such as the Hubble law in terms of prior evolution such as the hot big bang.
These are deep and important thoughts. Just to be sure, Sean Carroll didn't appreciate the comments about compression because the only initial state that would make Carroll happy would be a very-high-entropy initial state. Needless to say, he will never get one because the second law of thermodynamics guarantees that the initial state had a lower entropy than any state that followed and nothing will change about this law in the future. ;-)

But let's return to the notion of compression of information. One bit is a small "package" of information – the knowledge whether a binary digit is equal to 0 or 1. Because of the computer bias, people like to think that one bit is "fundamental" or even that every information has to be a multiple of one bit but this is not true at all. We could equally well use e.g. trinary computers with 3-state elementary digits, trits, that would carry the information of \(\ln(3)/\ln(2) = 1.585 \) bits: that's not an integer. (Trinary computers have actually been built in Russia.) In physics and "natural mathematics", the natural base is neither 2 nor 3 nor 10: it is \(e=2.71828\). The exponentiation with this base has special properties; for example, the derivative of \(e^x\) is \(e^x\). So the most natural unit of information is one \(e\)-base digit, or one "nat" (derived from the word "natural"), and one bit is equivalent to \(\ln(2)=0.693\) nats.

Let's look at the compression in the computer context a little bit: the methods we will use will be much more special, limited, and narrow-minded than the ideas needed to compress information in physics but they may be viewed as a toy model for physics. Consider a 1-megabyte file. It contains about 8 million bits (yes, I know the kilo- and mega- prefixes are reinterpreted as powers of 1,024 in the computer applications but I will conveniently use powers of ten and the difference plays no role in our discussion). Is the information carried by the file really equal to 8 million bits? Well, it depends. If the file contains random digits 0 or 1 that were produced by a random generator but for some reason, you need to remember all of the bits, the file really has 8 million bits.

However, if the file contains 1 million characters with the ASCII-code zero, you know that the information is much lower. It's close to zero. You may replace the file by a shorter file saying "#For a human user: this file should have 1 million characters with code zero." The computer user will try to use the files, it won't work, so he will notice the message in this file and correct it. Of course, this process may be automatized: the files may contain sequences of characters that are understood by compression algorithms and not just humans. They will be equally able to replace those 60 bytes by 1 million bytes.

Let's try something slightly more complicated. Imagine that the file is composed out of 1 million characters and each of them is either "0" (code 48) or "1" (code 49): both possible characters are represented (approximately) equally frequently. What's the information of the file? It's no longer 8 million bits; it's really just 1 million bits. At the beginning, we must add a message – comprehensible to either a human user or a comparably intelligent algorithm – that says "#For the dear user: the file that follows is composed out of bits. Replace bit 0 by the character 0 and bit 1 by the character 1. The broadcasting will begin after the colon and a space: %#$%..."

Except for a few dozens of bytes that present the coding table and conventions, we have reduced 1 million bytes to 125,000 bytes: a great result.

That was still too easy. Imagine that those 1 million characters contain roughly 50% of characters "0", 25% of characters "1", and 25% of characters "2", and nothing else. What's the information now? How much can we compress it? We have 3 possible values of a byte which is smaller than or equal to 4, a power of two. So each byte may be replaced by 2 bits, reducing 1 million bytes to 250,000 bytes (to one quarter). I hope that you know how to write the message at the beginning of the file that will be comprehensible to the human user; we will leave the precise formatting of the corresponding header in a compressed file – that should be comprehensible to a computer – to ZIP and RAR specialists.

Can we do better? You bet. In the code above, we had the pairs of bits that had 4 possible values but we have only used 3 possible values of the bit pairs. Can we fix it? In this artificially engineered example, there is a nice solution. In the sequence of bits, the ASCII characters "0", "1", "2" may simply be represented by sequences of bits "0", "10", and "11". That's great because we may use just one bit for the character "0" instead of two bits! How much do we compress the file if we use this convention?

Well, there were about 500,000 characters "0", 250,000 characters "1", and 250,000 characters "2". The first character was translated to individual bits while the others to bit pairs. So in total, we need 500,000 + 4× 250,000 = 1,500,000 bits: that's just 75 percent of the length of the file obtained by the "uniform 2-bit compression" of the three characters. Can we still improve this result?

Let's ask Shannon, our fellow reader, and she may calculate the Shannon entropy. The characters "0", "1", "2" have probabilities \(p_{0,1,2}=(1/2,1/4,1/4)\). Each of these three characters carries the information
\[ -p_i \ln p_i = (\ln(2)/2, \ln(4)/4, \ln(4)/4). \] All these three numbers are actually equal to \(\ln(2)/2\) nats because \(2\times 2 = 2^2\): that's equal to one-half of a bit. To compute the total information carried by a single character, we must sum this expression over \(i=0,1,2\) and because we're summing three terms equal to \(1/2\) of a bit, we get \(1.5\) bits per character. So one million characters gives us 1.5 million bits after the optimal compression. We really can't do better!

Note that it's completely normal for one character to carry a fractional number of bits such as 1.5 bits: in more general situations, the amount of information per character or per degree of freedom can be an arbitrary (continuous) real number, an obvious fact that the discrete "physicists" must surely have a psychological problem with. By the way, you shouldn't be mystified by the Shannon formula
\[ {\rm Information} = \sum_{i=1}^n p_i \ln(1/p_i). \] The reason why it's true is actually very intuitive. Imagine you're measuring how much information a new character brings. For a while, you pretend that all characters have the same probability \(p_i\) but you don't know which character it is, anyway. If you know this probability, you know that there would have to be \(1/p_i\) different equally likely "digits" or "characters" if you wanted the probability of each to be \(p_i\). To distinguish between them, you need \(\ln(1/p_i)\) nats or \(\log_2(1/p_i)\) bits. That's why \(\ln(1/p_i)\) appears in the formula. The rest, namely the summing over \(i\) with the extra \(p_i\) coefficients, is just the weighted average of the information over all possible values of the "character"; the probabilities are the weights. You can see that the average character contributes the information given by the formula above.

In this particular "0,1,2" case, I could have designed a very simple algorithm to compress 1 million 8-bit characters to 1.5 million bits: each character was simply transformed into one or two bits, isolated from all others. If the relative frequencies of different ASCII-characters were different, I could still compress the file to the value dictated by the Shannon entropy but one couldn't say that a particular bit of the compressed file encodes just one character of the original file: they would mix and spread all over the file. For the most general relative abundances, one may get the compressed file by creating a "highly precise" real number which may be subsequently written in the binary basis. The Shannon entropy is not just some abstract limit: you may genuinely write down algorithms that compress the file into this small size.

So far, I have only talked about files in which the value of every new character is uncorrelated with the position or with the preceding characters. In general, that's not the case. Of course, if you know that some characters are correlated, you may use this fact for a better compression. If the word "superstring" occurs 50,000 times in the one-megabyte file, occupying 550,000 bytes out of 1 million, you may first choose a special character such as "&" for "superstring", reducing the size of your original file by 500,000 characters, and then you may use other tricks. But even less obvious correlations may be used to compress the file.

Now take another 1-megabyte file which contains these characters: "3.14159265..." You surely remember the remaining 999,990 characters. This 1-megabyte file may be replaced by a shorter file, saying: "#For James Bond 007: this file with 1,000,000 characters should just contain the first million characters from pi, starting from 3.14." James Bond would surely be able to fix the file; the corresponding computer algorithm capable of this form of decompression (easier) or compression (much harder: it takes intuition and guesswork and checks to realize that the file is "pi") could also be produced. At any rate, we could squeeze 1 million bytes to 100 bytes or so. We could even describe a formula for pi. Physicists need to learn lots of compression and decompression algorithms based on maths which is often more abstract and profound than the numerical computation of \(\pi\).

All the comments above were about lossless compression. JPG files used for photographs or MPEG files used for videos represent lossy compression algorithms where the original file can't be "exactly" extracted from the compressed one: only some approximation of it, in a suitable sense of the word (a photograph that looks almost identical to a human eye), may be extracted from the compressed JPG file. In some sense, this concept (based on wavelets, morally similar to the Fourier transform) is even closer to physics because approximate compression of the observed data – descriptions of the data in terms of formulae whose numerical values are allowed to deviate by a certain error – is something we frequently need in physics. After all, we don't believe that our measurements are 100% precise and that all of their digits are critical, anyway.

Relatively to the computer scientists who optimize algorithms to compress pictures, videos, and other files, physicists are using a much wider spectrum of methods to "compress" the information obtained from the experiments and observations. A computer compression algorithm just learns a limited collection of methods to compress and decompress (which may, however, include complicated things such as recipes for "digits of pi"); a physicist must constantly learn new methods as physics makes new advances. Most of them require some "human intellect" and classical computers would have a hard time to learn them; that's why the computers haven't yet beaten the best physicists even though the best chess players already belong to the dustbin of the human history. ;-)

Compression in physics

The application of the compression philosophy to physics starts from a trivial compression. Use the same apparatus to repeat the same experiment (and measurement) many times. You get the same results, at least within the error margin you are ready to describe as "statistical errors" that follow some statistical distribution. So of course, an arbitrarily large number of such measurements may be compressed to a single copy. The individual measurements differ from the "right answer" by random numbers, given by the "statistical errors"; however, your apparatus (or methodology) may be incorrectly adjusted so the right answer may also differ from the right answer by a "systematic error" that stays constant if you repeat the measurement using the same device and same method but that may change (or get reduced) if you use different, independent devices or methods. Nevertheless, in science, repeatability offers you the first step in a compression. The Sun rose in the morning; it has done the same thing in the last 6,000 years, to say the least. Unless you want to study long-term cosmology or nuclear physics, you may compress the information about the sunrise by saying that "the Sun rises every morning" and use this rule to predict what the Sun will do tomorrow. This is a primitive and inaccurate methodology but it may be viewed as a prototype of science.

Of course, people eventually found better theories that imply that the Sun was born less than 5 billion years ago and it will die in a comparably long time in the distant future (it will become a red giant). But even theories describing the birth of the Sun suffer from various inaccuracies etc. so the difference from the theory that "the Sun rises every day" is not really qualitative.

Once you understand that repeated experiments lead to the same results – or random results from a collection governed by a universal statistical distribution that may be extracted and quantified – you may start to unify inequivalent experiments in various ways. Your experiment may depend on a quantity \(x\) and the result of the measurement \(y\) may seemingly depend on \(x\). You may try to guess a function \(y=y(x)\) that would interpolate your measured datapoints and that would work for all values of \(x\) that you have tested, and perhaps for many others. You should have some intuition about the mathematical functions (Max Planck was a guy who knew really well how to describe a measured black-body curve by a mathematical function! It was really because he already "saw" a prototype of a derivation that he could offer just a few months later) but you may succeed. This method is clearly more sophisticated than the "Sun must rise every day because it always did" science but it is still less sophisticated than some real science that actually allows you to derive the black-body curve from a well-defined fundamental theory.

As we continue to compress the information, we are representing ever larger classes of objects and processes as elements of a broad and diverse enough "set of objects or events" that may be associated with mathematical objects and the mathematical objects may be manipulated in a clever way to calculate what happened (explanations of known facts) or what will happen in the future (predictions). In quantum mechanics, the predictions are inevitably probabilistic in nature: most of the outcomes may be predicted to occur with a probability strictly between 0 and 100 percent (only if the result is 0 or 100 percent, the prediction is unequivocal).

At the beginning, I mentioned that several inequivalent measurements that differ by the value of the parameter \(x\) may be "unified" and described by a single law. However, physics is capable of a much more dramatic compression or unification than that. Two situations don't have to differ by one real parameter or several real parameters only: they may contain objects that seemingly differ qualitatively. However, physics allows us to see that many of these qualitative differences are just quantitative in character. Also, complicated objects and processes may be divided and reduced to smaller elementary building blocks and smaller processes whose behavior has already been understood and described. In some sense, this reductionist approach is not too much harder than the way how the file was constructed out of the bit sequences for "0,1,2".

One of the points I want to make is that we must be open-minded about the methods how the information may be compressed and about the possible relationships (and seemingly hidden relationships) between all the patterns we observe. As Moshe correctly says, unnatural values of parameters that must take very specific values, otherwise the theory produces completely wrong predictions, may be viewed as some information that is waiting to be compressed. It's because very small numbers such as \(\exp(-10^{26}) \) normally require a long sequence of bytes, essentially \(10^{26}\) bytes. Well, if the number is this tiny, we may also compress it and write it as \(\exp(-10^{26}) \) which only occupies a dozen of bytes or so. However, if this is the value of a parameter that should play the role of an ordinary real parameter in your hypothesis, there is no good reason why its value should be the exponential (or the exponential of an exponential) of a number. You may say that the decompression algorithm normally isn't able to understand sequences such as \(\exp\). For symbols such as \(\exp\) to be natural in the description of the value of a parameter, you must actually find an explanation why the parameter is exponentially small (or large). Inflation offers such explanations for the flatness, size, and low density of exotics in the early Universe.

So physicists should be very creative and open-minded when it comes to the relationships between objects. Indeed, physics has unified many things that were thought to be different. Well, people could also think that things are "essentially equal" as long as they were sloppy. They could think that gravity and electricity and magnetism were the same force because they did no detailed research that would reveal how very different these forces are. Obviously, when we talk about unification in physics, we don't mean this kind of sloppiness that neglects most of the details. We only talk about a unified explanation that takes all the detailed properties into account.

In the discussion between Moshe Rozali and Sean Carroll, you may see that Sean Carroll plays the role of an uncreative, dogmatic person who isn't getting the point about the cleverness and creativity of physics at all. He is obsessed with the (preposterous) assumption that all states in the Universe, including the initial ones, should maximize the entropy. It's like focusing on one simple type of a compression algorithm, one naive way how to think about the measured data. This approach is relevant when you want to describe closed physical systems that have waited for an extremely long time (relatively to the thermalization time) and approached an equilibrium state. However, this is just a tiny portion of physics. Most of physics doesn't study closed systems at equilibrium and physics about the very early (or initial) conditions of the Universe (or any other system) doesn't study these things at all! You must be much more open-minded and ready to try much more dramatic ideas when you want to comprehend questions that are as ambitious as the beginning of the Universe. Physics is able to relate the information about the world in numerous way; everyone who just learns a very small insight and dogmatically insists that this insight should apply everywhere is just a narrow-minded, intellectually limited bigot.

Needless to say, Moshe Rozali is super polite when he talks to Sean Carroll. There are many physicists who are comparably polite, modest, and peaceful. On one hand, I think it's great, I consider myself as a member of the same group and as an advocate of these physicists. On the other hand, the peaceful reaction of those people to the screaming of aggressive peanut minds such as Sean Carroll, Peter Woit, Lee Smolin, and thousands of others just drives me up the wall! If one is a little bit realistic about the dynamics of the world and where it goes, such modest responses to similar aggressive simpletons are pretty much unforgivable because they help the civilization to dumb down. What is really needed are people who have an appreciation for the deep wisdom manifested in Nature and for others who actually can and want to understand this wisdom as intimately as Nature and our determination allow us.

And that's the memo.



Bonus: compression down to Shannon entropy

Let's write some details about the compression algorithm that realizes the Shannon entropy limit. A sequence of \(M\) characters – in our example, \(M\) was equal to 1,000,000 – will be encoded to
\[ B = M \sum_{i=1}^N p_i \log_2(1/p_i) \] bits. This will be the result; so far, I haven't demonstrated it.

I have used a base-two logarithm to write down the formula in order to show you how unnatural it is to measure the information in bits: if you replace \(\log_2\) by \(\ln\), the result will be in nats. I will continue to use the base-2 logarithms not because they're natural – they're surely unnatural – and not because I think in terms of bits (I think in terms of nats) but because I surrender to the majority that is more familiar with bits as the units of information. So how do we encode the sequence of characters, each of which takes values in a set of \(N\) elements – imagine \(N=256\) – to a sequence of \(B\) bits that may be much shorter?

We will represent the original sequence by a real number \(R\) between 0 and 1 which may be written in a base-two or any other system. When this real number is written down with a sufficient accuracy, we may extract the whole original sequence of \(M\) characters, assuming that we also know the value of \(M\) which may be stored in a few bytes or less. Fine, so how do we construct the number \(R\) and how many bits or nats do we need for the required precision?

Let's take the first character. The real number \(R\) must remember the first character; the value of \(R\) must unambiguously tell us what the character was. So clearly, we need to divide the interval \((0,1)\) to \(N\) pieces; recall that \(N=256\) in the example that is easy to remember. Once we know in which of these \(N\) intervals the number \(R\) is located, we may take this interval and divide it according to the same (but scaled) algorithm that depends on the second character, and so on and so on, up to the last, \(M\)-th character of the sequence.

Returning to the first step, we must divide the interval \((0,1)\) in the optimal way. Let's use the symbols \(L_i\), \(i=1,2,\dots N\), for the lengths of the subintervals. These numbers obey
\[\sum_{i=1}^N L_i = 1. \] What are the optimum values of \(L_i\) to minimize the required accuracy of the number \(R\) that is needed for the reconstruction of all \(M\) characters in the original file? Well, we must look at the resulting interval after the \(M\)-th division to the subintervals. We're getting shorter and shorter intervals and the width \(W\) of the narrowest one at the end tells us the required accuracy with which \(R\) has to be defined in the compressed file. What is \(W\)?

Well, during the process of division of intervals, we encounter roughly or exactly \(p_1 M\) times the symbol with code 1, \(p_2 M\) times the symbol with code 2, and so on, and \(p_N M\) times the symbol with code \(N\). The decrease of the width is commutative much like any multiplication of real numbers so the original width 1 decreases to
\[ L_1^{p_1 M} L_2^{p_2 M} \dots L_N^{p_N M} = \prod_{i=1}^N L_i^{p_i M} \] which is an exponentially tiny number. How many bits do we need to express \(R\) with this or better accuracy (plus minus one or few bits)? Well, it's simply the (negative of the) base-two logarithm of the tiny width above, i.e.
\[ {\rm Bits} = - \log_2 \prod_{i=1}^N L_i^{p_i M} = - M \sum_{i=1}^N p_i \log_2 L_i. \] The identities for the logarithms of products and powers are something you must know well when you do similar calculations. At any rate, we want to find the lengths \(L_i\) that minimize the expression above, given the condition that they add to one. The multiplication by \(M\) clearly plays no role in the minimization process so we will minimize \(BPC={\rm Bits}/M\) i.e. the number of "bits per character" where the factor of \(M\) was dropped. We require
\[ \delta BPC = \lambda \sum_{i=1}^N \delta L_i. \] Rather than being zero, the right-hand side is the Lagrange multiplier used for minimization problems with a constraint. When we substitute our formula for \(BPC\), we have
\[ 0 = -\sum_{i=1}^N \delta L_i \left[ \frac{p_i}{(\ln 2)L_i} + \lambda \right] \] where the \(\ln(2)\) in the denominator came from differentiating the base-two logarithm that I had chosen in order to surrender to the majority. This must hold for arbitrary variations \(\delta L_i\) which implies
\[ L_i = -\lambda \ln(2) p_i. \] Happily enough, \(\ln(2)\) may simply be absorbed to \(\Lambda = \lambda\cdot \ln(2)\) and we still see that the lengths of the intervals \(L_i\) must be proportional to the probabilities \(p_i\) themselves. In fact, the probabilities add to one which is the same normalization we expect from \(L_i\) which implies that \(L_i=p_i\) or \(\Lambda=-1\).

Now, how many bits do we need for this choice of \(L_i\)? We just copy the formula for \({\rm Bits}\) above and replace \(L_i\) by \(p_i\): I can do it with copy-and-paste and the editing of one character which is what I'm going to do right now:
\[ {\rm Bits} = - \log_2 \prod_{i=1}^N p_i^{p_i M} = - M \sum_{i=1}^N p_i \log_2 p_i. \] You see that the average number of bits per character is simply
\[ BPC = - \sum_{i=1}^N p_i \log_2 p_i. \] It's nothing else than the Shannon entropy. Replace \(\log_2\) by \(\ln\) to obtain the results in the natural units, nats. (The entropy in physics – which is a measure of the invisible or lost information – is almost always written down in nats.) And now you really know how to compress the data with an arbitrary probabilistic distribution of different values of the characters. When you compress [some/the/a/whatever] English text in which some characters are nearly absent while a few characters are very frequent, you may get a compression to 1/4 or something like that.

Add to del.icio.us Digg this Add to reddit

snail feedback (3) :


reader Plato said...

Lubos,

I completely agree with the idea of a compression factor algorithmically expressed too:)

I thought for sure you would have mentioned E8.:)

Best,


reader Luke Lea said...

As a rule don't theoretical physicists have more talent for abstract, logical analysis than for mastering tons of unrelated, descriptive data as, for instance, encountered on the frontiers of molecular biology nowadays? Will it more likely be physicists, mathematicians, statisticians, or theoretical biologists who tame the Niagra of genomic data pouring in every year? Or will it never be tamed?


reader Luboš Motl said...

Dear Luke, I agree with your rule for the "primary talent" of theoretical physicists – even though many of them surely have talents in other things as well.

However, you may perhaps be underestimating the hidden order that exists in molecular biology. Molecular biology is most mastered by those who mastered it – it's clearly an independent discipline from physics as we know it, not necessarily correlated in either direction. However, the search for order, while smaller in theoretical physics, doesn't have to be quite irrelevant in molecular biology.