If you don't know, Benford's law tells us that the probability that the first digit of any real random quantity - such as the price of a stock - is N equals
P(N) = log[(N+1)/N] / log(10).

In particular, the different digits have the following probabilities:
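Plugging the digits into the formula, here is a minimal Python sketch (standard library only) that prints the table and checks that the nine probabilities sum to one:

```python
import math

# Benford's law: P(N) = log10((N+1)/N) for the leading digit N = 1..9.
for n in range(1, 10):
    print(f"P({n}) = {math.log10((n + 1) / n):.4f}")

# Sanity check: the nine probabilities cover all possible leading digits,
# so they must sum to one.
total = sum(math.log10((n + 1) / n) for n in range(1, 10))
assert abs(total - 1.0) < 1e-12
```

In particular, P(1) ≈ 30.1% while P(9) ≈ 4.6%.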
Note that the captions of the English Wikipedia are written in a Slavic language, suggesting that it's more common for the Slavs to understand Benford's law. At any rate, the probability that the first digit equals one exceeds 30% while the probability that the first digit is nine is below 5%.
The Chinese authors repeat a statement that is very widespread:
One may simply presume that occurrence of the first digit of any randomly chosen data set is approximately uniformly distributed, but that is not the very case in real world.

This sentence is deeply misleading but not "fully" untrue - because of the vague word "may" and the undefined word "one" at the beginning. At any rate, a more accurate version of the sentence would say
A very stupid person may simply presume that occurrence of the first digit of any randomly chosen data set is approximately uniformly distributed, but that is not the very case in real world.

Why? Simply because there exists no rational reason to think why all digits should be equally likely. Note that we would have to mean "all digits except for 0" because leading zeros can't be significant figures, by definition. This exception we have to give to the number "0" is the first hint why the "naive" uniform distribution is wrong for the other digits, too.
But why would you think that each digit has the probability of 1/9 to be the first digit of a random real number? Well, you could consider "X" which is between 1 and 10 and has a uniform distribution on this linear scale. Clearly, each figure 1...9 is equally likely: the probability is 1/9.
However, unless you're stupid, you must realize that you have cherry-picked the endpoints. You could also consider "X" to be between 1 and 20. If you do so, the whole intervals "(1,2)" and "(10,20)" begin with "1", so the probability of a leading "1" ends up being 11/19, i.e. above 50 percent.
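The endpoint sensitivity is easy to verify by brute force - a quick Monte Carlo sketch (the sample size and the seed are arbitrary choices):

```python
import random

def first_digit(x):
    # Leading significant digit of a positive number.
    while x >= 10: x /= 10
    while x < 1: x *= 10
    return int(x)

random.seed(0)
N = 200_000

# Uniform on (1, 10): every leading digit is equally likely, P = 1/9.
p_a = sum(first_digit(random.uniform(1, 10)) == 1 for _ in range(N)) / N

# Uniform on (1, 20): the intervals (1,2) and (10,20) both start
# with "1", so P = 11/19.
p_b = sum(first_digit(random.uniform(1, 20)) == 1 for _ in range(N)) / N

print(p_a, p_b)  # roughly 0.111 and 0.579
```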
Once you understand the sensitivity, you may guess that the actual probability that the first digit is one is somewhere in between 11% and 58%: for example, it may be slightly above 30%. And you would be right.
Why is the log formula right?
Well, it's because "random" real numbers in the real world may a priori sit anywhere. What I mean is that even the order of magnitude of the numerical values is completely undetermined a priori.
For example, the price of a stock is equally likely to be between 1 and 10 as it is to be between 10 and 100. The two situations only differ by a rescaling of the price by a factor of ten. Very high stock prices may become unlikely - because the stockholders may want to split the stocks, so that one can also sell or buy smaller units. And very tiny prices may become inconvenient, so the people may merge the tiny stocks into bigger ones.
But that only happens for pretty high or very low prices - and there is no preferred point where the stockholders "act". So it's natural to make the approximation that the real numbers such as stock prices are distributed in very long intervals, spanning many orders of magnitude. Adjacent orders of magnitude are equally likely. The distribution looks something like this:
Note that while the quantity is unlikely to be well below 1 or well above 10,000, it is pretty much equally likely to sit in the intervals "(10,100)" and "(100,1000)". That's why it's natural to use the "log(price)" as the x-axis.
Recall my promotion of exponential percentages for similar attitudes.
But once you use "log(price)" as the x-axis, you may see that the probability distribution for "log(price)" is slowly changing within each order of magnitude - so it's nearly constant. If you want to determine the probability that the first digit is e.g. 8, you look at all the blue strips above where the first digit is 8.
Effectively, you compactify the graph - so that the interval "(1,10)" is just reused for all other real numbers. It's not hard to see that the portion

P = ln(9/8) / ln(10/1)

of the interval describes numbers - prices - that start with "8". Such "blue" numbers are much less likely than the "red" numbers that begin with the digit 1.
The ratio of the probabilities that the first digit is 1 or 8 can be seen by a simple argument: because the function on the graph above - the distribution - is slowly changing with "x" on the x-axis, the blue and red areas may be approximately calculated as the product of the height and the width. But the height of a blue area is pretty much equal to the height of a nearby red area. So the areas only differ by the widths, and the ratio of the widths equals "ln(2/1) / ln(9/8)" - which is the ratio of probabilities that the first digit is 1 vs. that it is 8.
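The width argument can be checked numerically: sample "log(price)" uniformly over many decades (the range, the seed, and the sample size below are arbitrary choices) and compare the ratio of leading-1 to leading-8 counts with ln(2)/ln(9/8) ≈ 5.88:

```python
import math
import random

def first_digit(x):
    # Leading significant digit of a positive number.
    while x >= 10: x /= 10
    while x < 1: x *= 10
    return int(x)

random.seed(1)
# Sample log10(price) uniformly over twelve decades: price in 1e-6..1e6.
samples = [10 ** random.uniform(-6, 6) for _ in range(300_000)]

counts = [0] * 10
for x in samples:
    counts[first_digit(x)] += 1

ratio = counts[1] / counts[8]
print(ratio, math.log(2) / math.log(9 / 8))  # both close to ~5.88
```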
As a function of the price, the first digit is quasi-periodic. More precisely, it is a periodic function of "log(price)". So the first digit is an "angular variable": prices that differ by a multiplication by 10 or its power are identified.
There is exactly one distribution of an angular variable

log(price) mod log(10)

that is invariant under the multiplication of "price" by any positive constant (i.e. under the choice of "units"), namely the uniform distribution for "log price":

P[log price ∈ (y,y+dy)] = C dy, C = 1/log(10).

The constant "C" is chosen so that the integral of the probability distribution over one "fundamental region", e.g. over "(1,10)", is normalized to unity.
You can always use any base of the logarithms in my formulae above - but you must use it consistently.
With this uniform distribution, you can easily see that the distribution is invariant under the change of the units,

price → price / newunit, i.e. log(price) → log(price) - log(newunit),

simply because the multiplicative rescaling is just an additive shift of the logarithm, and the additive shift doesn't change the uniform distribution of the angular variable.
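A short sketch of the unit invariance: rescale every price by an arbitrary constant and check that the leading-digit frequencies don't move (the divisor 3.7, the seed, and the sampled range are arbitrary choices):

```python
import random

def first_digit(x):
    # Leading significant digit of a positive number.
    while x >= 10: x /= 10
    while x < 1: x *= 10
    return int(x)

def digit_freqs(xs):
    # Relative frequencies of the leading digits 1..9.
    counts = [0] * 10
    for x in xs:
        counts[first_digit(x)] += 1
    return [c / len(xs) for c in counts[1:]]

random.seed(2)
prices = [10 ** random.uniform(-8, 8) for _ in range(200_000)]

# "Change of units": divide every price by an arbitrary constant, e.g. 3.7.
rescaled = [p / 3.7 for p in prices]

for a, b in zip(digit_freqs(prices), digit_freqs(rescaled)):
    print(f"{a:.3f}  {b:.3f}")  # the two columns agree within noise
```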
You can also see why "0" had to be treated differently. The "egalitarian" people would expect all digits between 1 and 9 to be fairly represented, but 0 has to be completely removed because the digit is a far-right denier who doesn't enjoy the rights of the working or middle class (or whatever is the class that the "egalitarian" people want to treat democratically, while sharply suppressing everyone else).
While it's correct that "0" as a price or another real quantity has to be removed from the considerations (and be given a vanishing probability), the reason is actually different. The real reason is that "0" is a far-left digit because "log(0)" equals minus infinity, where all the probability distributions already have to drop to zero. ;-)
At any rate, there's nothing mysterious about Benford's law. It's linked to scale invariance i.e. independence of multiplicative rescaling (or the choice of units). And this "symmetry" is easily and fully understood and analyzed if you consider "log(price)" because the multiplicative changes become additive shifts which are simpler.
The Chinese authors test several well-defined distributions - such as the Boltzmann-Gibbs classical distribution and its quantum counterparts (Bose-Einstein and Fermi-Dirac). Not surprisingly, they find out that the distribution of first digits "fluctuates around" the universal Benford values.
What's more interesting is that the Bose-Einstein distribution, regardless of the only important parameter, the temperature (and its "first digit"), always reproduces Benford's law exactly. It's pretty interesting that one type of the quantum particles - bosons - has this property while the other (and the classical result) doesn't.
I haven't even checked the statement.
But because writing of real numbers using digits doesn't seem terribly fundamental to me and the Bose-Einstein accident is just one particular property (constancy) of a function summed over the "decades", I don't expect the finding to be more fundamental than that, either. :-)
Update: no coincidence
I see, there's no interesting identity behind the Bose-Einstein agreement. It works exactly simply because the Bose-Einstein distribution isn't normalizable. The relevant integral over "E" of "1/(exp(b.E)-1)" logarithmically diverges near "E=0", so one must use a non-normalizable distribution and uniformly cover infinitely many decades, just like in the idealized derivation of Benford's law.
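The logarithmic divergence is easy to check numerically: deep in the infrared, 1/(exp(b.E)-1) ≈ 1/(b.E), so the measure is uniform in "log(E)" and each decade reproduces Benford's fractions. A sketch (b = 1 is an arbitrary choice; the small-E limit doesn't depend on it):

```python
import math

b = 1.0  # assumed inverse temperature; the small-E behavior is b-independent

def weight(lo, hi, steps=20_000):
    # Midpoint-rule integral of the Bose-Einstein weight 1/(exp(b*E) - 1)
    # over the interval (lo, hi).
    h = (hi - lo) / steps
    return sum(h / (math.exp(b * (lo + (i + 0.5) * h)) - 1) for i in range(steps))

# Take one decade deep in the infrared, E in (1e-6, 1e-5), where the
# integrand is essentially 1/(b*E), i.e. uniform on the log scale.
decade = weight(1e-6, 1e-5)
for d in range(1, 10):
    w = weight(d * 1e-6, (d + 1) * 1e-6)
    benford = math.log10((d + 1) / d)
    print(d, w / decade, benford)  # the two columns agree to ~6 decimals
```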
The triviality of the exact agreement - because of the log divergence - makes it even more surprising why they think that they have found something deep.