dp

12-25-2010, 12:24 PM

Benford's Law pops up all over the place and is quite an amazing feature of lists of numbers. The law states that in many real-world lists of numbers the first digits are not evenly distributed: 1 is the most common leading digit (about 30% of the time), and each higher digit is less common, down to 9 (under 5%). The distribution is logarithmic.

These guys explain it better:

http://en.wikipedia.org/wiki/Benford%27s_law
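For reference, the law predicts that a leading digit d turns up with probability log10(1 + 1/d). Here is a minimal sketch that prints those expected percentages with awk. The function name benford_expected is made up for this example, and awk's log() is the natural log, hence the division by log(10):

```shell
# Hypothetical helper: print "digit expected_percent" for digits 1..9
# using Benford's formula P(d) = log10(1 + 1/d).
benford_expected() {
  awk 'BEGIN {
    for (d = 1; d <= 9; d++)
      printf "%d %.1f\n", d, 100 * log(1 + 1/d) / log(10)
  }'
}

benford_expected
```

The first line comes out as "1 30.1" and the last as "9 4.6", so a 1 should lead roughly six or seven times as often as a 9.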

So this morning I decided to test the file systems of some of my computers to see if they obey Benford's Law. I crafted this simple script:

find / -type f -ls |awk '{print substr($7,1,1)}'|sort |uniq -c |sort -rn

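This is a Unix command that finds every file on the system, takes each file's size in bytes (field 7 of the -ls output), extracts the first digit of that value, sorts the digits, counts the occurrences of each one with uniq -c, and finally sorts the counts numerically in descending order. A first digit of 0 can only come from an empty (zero-byte) file.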
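The same idea can be sketched as a small shell function that does the counting inside awk. This is a variant for illustration only, not the command used for the results in this post. The name first_digit_counts is made up, it assumes the size is field 7 of the `find -ls` output, and unlike the one-liner it skips empty files so there is no 0 row:

```shell
# Hypothetical helper: reads `find -ls` output on stdin and prints
# "count digit" lines, most frequent digit first.
first_digit_counts() {
  awk '
    $7 > 0 { count[substr($7, 1, 1)]++ }            # tally the leading digit of the size
    END    { for (d in count) print count[d], d }   # one "count digit" line per digit seen
  ' | sort -rn
}

# Example usage (can take a while on a full filesystem):
#   find / -type f -ls 2>/dev/null | first_digit_counts
```

Reading from stdin keeps the function easy to try on a small directory first, or on a saved listing.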

On my Linux server:

# find / -type f -ls |awk '{print substr($7,1,1)}'|sort |uniq -c |sort -rn

453222 1

336319 2

225457 3

185557 4

136550 5

123395 6

85225 7

85181 8

69318 9

26451 0

and on my Mac laptop:

find / -type f -ls |awk '{print substr($7,1,1)}'|sort |uniq -c |sort -r

51958 1

24019 2

15398 3

13755 4

13549 6

13281 8

12086 5

7154 7

5586 9

1077 0
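To see how close counts like these are to the prediction, one option is to turn them into percentages and print them next to the log10(1 + 1/d) values. A rough sketch, assuming input in the "count digit" format that uniq -c prints. The name benford_compare is made up, and rows for digit 0 (empty files) are ignored:

```shell
# Hypothetical helper: reads "count digit" lines on stdin and prints
# "digit observed_percent expected_percent" for digits 1..9.
benford_compare() {
  awk '
    $2 >= 1 && $2 <= 9 { count[$2] = $1; total += $1 }   # ignore the 0 row
    END {
      if (total == 0) exit 1                              # nothing to compare
      for (d = 1; d <= 9; d++)
        printf "%d %.1f %.1f\n", d, 100 * count[d] / total, 100 * log(1 + 1/d) / log(10)
    }'
}

# Example usage:
#   find / -type f -ls | awk '{print substr($7,1,1)}' | sort | uniq -c | benford_compare
```

The second column is what was measured and the third is what Benford's Law predicts, so the two can be compared digit by digit.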

Edit: Still bored, I wondered what the distribution of the first digit would be if a person were to reverse the numbers in the list so that 12345 would be 54321. Since only the leading digit matters, this is exactly the same as examining the last digit of the original number rather than the first, and a quick re-write of the command line looks like this:

find /usr -type f -ls |awk '{print substr($7,length($7),1)}'|sort |uniq -c |sort -rn

This creates a problem: the reversed numbers can now start with 0, and the analysis produces this:

find /usr -type f -ls |awk '{print substr($7,length($7),1)}'|sort |uniq -c |sort -rn

20305 0

14841 4

14651 2

14499 8

14459 6

13529 9

13327 7

13100 1

13088 5

12845 3

That will never do, so I'll need to exclude all numbers from the list that are multiples of 10.
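The exclusion can be done with an awk pattern in front of the action: `$7 % 10 > 0` is true only when the size does not end in 0, which also drops empty files. A minimal sketch of the filter as a function. The name last_digit_counts is made up, and it assumes the size is field 7 of the `find -ls` output:

```shell
# Hypothetical helper: reads `find -ls` output on stdin, skips sizes that
# are multiples of 10, and prints "count digit" lines for the last digit.
last_digit_counts() {
  awk '
    $7 % 10 > 0 { count[substr($7, length($7), 1)]++ }   # last digit of sizes not ending in 0
    END         { for (d in count) print count[d], d }
  ' | sort -rn
}

# Example usage:
#   find / -type f -ls 2>/dev/null | last_digit_counts
```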

And the result of not including multiples of 10 is:

find / -type f -ls |awk '$7 % 10 > 0 {print substr($7,length($7),1)}'|sort |uniq -c |sort -rn

186294 2

183166 8

183042 4

182758 6

165157 9

155068 1

154910 5

153542 7

152251 3

So the file sizes roughly follow Benford's Law, but the same numbers reversed do not: their leading digits (the last digits of the original sizes) are spread almost evenly.

I would presume then that if a person had a list of every lottery number ever cast (winners or losers), the first digit would show up in this order.

This is just one of the ways a Unix admin can amuse himself when he is on call on Christmas day :)
