OT: A test of Benford's Law

dp
12-25-2010, 12:24 PM
Benford's Law pops up all over the place and is quite an amazing feature of lists of numbers. The law states that in many real-world lists of numbers the leading digits are not evenly spread. The distribution is logarithmic: digit d leads with probability log10(1 + 1/d), so 1 comes first about 30% of the time and 9 less than 5% of the time.

These guys explain it better:
http://en.wikipedia.org/wiki/Benford%27s_law
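
For reference, here is a small sketch that prints the share Benford's Law predicts for each leading digit, using the log10(1 + 1/d) formula. The function name benford_expected is just made up for illustration.

```shell
# Print the expected Benford share of each leading digit d = 1..9,
# computed as log10(1 + 1/d) and shown as a percentage.
benford_expected() {
  awk 'BEGIN {
    for (d = 1; d <= 9; d++)
      printf "%d %.1f%%\n", d, 100 * log(1 + 1/d) / log(10)
  }'
}

benford_expected
```

The first line of output is "1 30.1%" and the last is "9 4.6%".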

So this morning I decided to test the file systems of some of my computers to see if they obey Benford's Law. I crafted this simple script:

find / -type f -ls |awk '{print substr($7,1,1)}'|sort |uniq -c |sort -rn

This is a Unix command that finds every file on the system, takes the size in bytes (the seventh field of the -ls output), extracts the first digit of that value, sorts the digits, counts the occurrences of each digit, and finally sorts the counts in descending order.

On my Linux server:

# find / -type f -ls |awk '{print substr($7,1,1)}'|sort |uniq -c |sort -rn
453222 1
336319 2
225457 3
185557 4
136550 5
123395 6
85225 7
85181 8
69318 9
26451 0

and on my Mac laptop:

find / -type f -ls |awk '{print substr($7,1,1)}'|sort |uniq -c |sort -r
51958 1
24019 2
15398 3
13755 4
13549 6
13281 8
12086 5
7154 7
5586 9
1077 0
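
To make counts like these easier to judge, here is a rough sketch that turns "count digit" lines into percentages and prints the Benford expectation beside them. The function name benford_compare is hypothetical, and leaving the 0 row out of the total is my own assumption, since Benford's Law only covers digits 1 to 9.

```shell
# Read "count digit" lines (the uniq -c format) on stdin.
# For each digit 1-9 print: digit, observed %, expected Benford %.
# Rows for digit 0 (empty files) are ignored and not counted in the total.
benford_compare() {
  awk '$2 >= 1 && $2 <= 9 { count[$2] = $1; total += $1 }
       END {
         if (total == 0) exit 1
         for (d = 1; d <= 9; d++)
           printf "%d %.1f %.1f\n", d, 100 * count[d] / total, 100 * log(1 + 1/d) / log(10)
       }'
}
```

It can be fed straight from the pipeline by replacing the final sort, for example: find / -type f -ls |awk '{print substr($7,1,1)}'|sort |uniq -c |benford_compare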

Edit: Still bored, I wondered what the distribution of the first digit would be if a person were to reverse the numbers in the list so that 12345 would be 54321. Since only the leading digit matters, this is exactly the same as examining the last digit rather than the first, and a quick rewrite of the command line looks like this:

find /usr -type f -ls |awk '{print substr($7,length($7),1)}'|sort |uniq -c |sort -rn

This produces an undesirable problem - numbers can now start with 0 and the analysis produces this:

find /usr -type f -ls |awk '{print substr($7,length($7),1)}'|sort |uniq -c |sort -rn
20305 0
14841 4
14651 2
14499 8
14459 6
13529 9
13327 7
13100 1
13088 5
12845 3

That will never do, so I'll need to exclude all numbers from the list that are multiples of 10.

And the result of not including multiples of 10 is:
find / -type f -ls |awk '$7 % 10 > 0 {print substr($7,length($7),1)}'|sort |uniq -c |sort -rn
186294 2
183166 8
183042 4
182758 6
165157 9
155068 1
154910 5
153542 7
152251 3

So the file sizes follow Benford's Law, but the same numbers reversed do not: their leading digits come out nearly uniform.

I would presume then that if a person had a list of every lottery number ever cast (winners or losers), the first digit would show up in this order.

This is just one of the ways a Unix admin can amuse himself when he is on call on Christmas day :)

Paul Alciatore
12-25-2010, 12:59 PM
The Wikipedia article says it is counter-intuitive, but I would dispute this statement. If you are counting things, you start from one. Now for any given decade (100s, 1000s, 10,000s, etc.), you will arrive at the one digit in the first place before you get to the twos, and at the twos before the threes, etc. This is only a rough way of looking at it.
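
A quick way to see the counting argument in action is to tally the leading digits of 1 through N. This is only a sketch, the function name count_leading is made up, and the result depends heavily on where the count stops.

```shell
# Tally the leading digits of the counting numbers 1..N (N is the first argument).
# Prints one "digit count" line for each digit 1-9.
count_leading() {
  seq 1 "$1" | awk '{ c[substr($1, 1, 1)]++ }
                    END { for (d = 1; d <= 9; d++) print d, c[d] + 0 }'
}

count_leading 1999
```

Stopping at 1999, the digit 1 leads 1111 of the numbers and every other digit leads 111. Stopping at 9999 instead gives each digit exactly 1111, which is why this is only a rough picture.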

A better way would be to look at a bell-shaped distribution curve, which would also tend to explain the low occurrence of the 7, 8, and 9 in your example. But it does not explain the higher occurrence of the 7s as opposed to the 8s and 9s, at least not at first glance. A deeper look may indeed find the explanation here.

Zero is a special case, as most lists do not use a leading zero. I would suspect it would have the lowest occurrence in most lists. In many of them it would not occur at all. Or, if it is used, it could be the most common: consider a list where the average value is 10 but there are four places allocated and the first two are almost always "00".

My comments are not an exact analysis of this law, just an attempt to say that it is not really counter-intuitive.

dp
12-25-2010, 01:25 PM
In inventory counts zero is a valid quantity. The length of the number need only be at least 1, so 0 is a good number provided the length of the number is 1.

Evan
12-25-2010, 03:33 PM
Back around 1980 or so, not long after I had acquired my first computer, a Commodore PET, a very good friend of ours stayed with us at Christmas. He had what I considered one major character flaw: he believed in numerology. I had tried for years to convince him that it was silly superstition but he would not budge. That Christmas he had found a new toy to play with, a special numerology tool for predicting features of a person's personality based on their name. It was a rather laborious calculation that had to be performed on each letter based on its alphabet position, using a method called Fadic Addition (http://www.google.com/search?num=100&hl=en&safe=off&biw=1437&bih=821&&sa=X&ei=dUQWTZasLZK6sQP81qyVCg&ved=0CBkQvwUoAQ&q=fadic+addition&spell=1).

This method follows strict logical rules and so is easy to implement on a computer. It resembles a checksum count. Once the derived values are obtained for each of a person's names the result is indexed to a table of predetermined personal characteristics as given in the book he had.
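
A minimal sketch of that kind of reduction, not the original BASIC program: it assumes A=1 through Z=26, skips anything that is not a letter, and adds digits until one digit is left. The function name fadic is hypothetical, the exact scheme in the book may well have differed, and the lookup table of "predictions" is left out.

```shell
# Hypothetical helper: reduce a name to a single digit by fadic addition.
# Assumes A=1 ... Z=26; the real book's letter values are not known here.
fadic() {
  printf '%s\n' "$1" | awk '{
    s = toupper($0); sum = 0
    for (i = 1; i <= length(s); i++) {
      # index() gives the alphabet position, or 0 for non-letters
      sum += index("ABCDEFGHIJKLMNOPQRSTUVWXYZ", substr(s, i, 1))
    }
    while (sum > 9) {                 # keep adding the digits together
      t = 0
      for (n = sum; n > 0; n = int(n / 10)) t += n % 10
      sum = t
    }
    print sum
  }'
}

fadic "Evan"
```

Under these assumptions "Evan" comes to 5 + 22 + 1 + 14 = 42, and 4 + 2 = 6.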

I spent a few minutes developing the algorithm in BASIC and then a couple of hours keying in the table of "predictions". We double-checked it and it agreed exactly with his lengthy hand calculations.

At first he was very pleased as it saved a very long series of hand calculations but it didn't take long for it to become obvious that it was only a stupid little numerical stunt that had no real meaning. The computer did the math and spit out the answer in less than a second. What really convinced him it was worthless was the result it produced when I put in the names of our pets and then just random letter sequences.

The only thing that had made it seem mysterious and wonderful was the amount of work that he had to do to obtain a result. Without that work it became nothing but a cheap trick. He abandoned it but unfortunately not the idea that numerology and astrology had some sort of validity. Sigh.

Astronowanabe
12-25-2010, 05:18 PM
Benford's Law... +1

tmc_31
12-25-2010, 06:14 PM
I thought "Benfords Law" was: One must buy many tools!:D

thanks Tim Allen

Tim