
testing the first digit of a number


sv0f

unread,
Jun 12, 2003, 10:51:45 AM6/12/03
to
I need an efficient method to test that the
first digit of a number is 1. The number
will be positive, and can even be assumed
to be an integer if this enables a particularly
sweet solution.

I have not been able to make the numerical
functions work for me in this regard, but
perhaps I am missing something. Instead,
I have been doing horrible things like:

? (defun first-digit-one-p (n)
    (eq (char (princ-to-string n) 0) #\1))
FIRST-DIGIT-ONE-P
? (first-digit-one-p 20)
NIL
? (first-digit-one-p 101)
T
? (first-digit-one-p 1.01)
T
?

[Aside #1: Is PRINC-TO-STRING guaranteed to
return a simple-string, and therefore can
I use (the possibly more efficient) SCHAR
in FIRST-DIGIT-ONE-P? This holds under MCL 4.1
but I could find no general guarantee in
CLtL2.]

[Aside #2: I can use EQ to compare characters,
right? Or is this implementation dependent?]

More generally, I'd like to be able to
test whether the first digit of a positive
number is any arbitrary digit. But the
special case of 1 is most pressing.

Pascal Costanza

unread,
Jun 12, 2003, 11:07:57 AM6/12/03
to
sv0f wrote:
> I need an efficient method to test that the
> first digit of a number is 1. The number
> will be positive, and can even be assumed
> to be an integer if this enables a particularly
> sweet solution.
>
> I have not been able to make the numerical
> functions work for me in this regard, but
> perhaps I am missing something. Instead,
> I have been doing horrible things like:
>
> ? (defun first-digit-one-p (n)
> (eq (char (princ-to-string n) 0) #\1))

Why do you think this is horrible? From your problem description it
seems to me that you want to reason about the textual representation of
a number, and that's exactly what your code does. Sounds ok to me.

If you need to change your function later on, for example for handling
hexadecimal representations as well ("test if the first digit is an F"),
then it's relatively easy to adapt your code.
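
For instance, a sketch of such an adaptation (FIRST-DIGIT-P, DIGIT and
RADIX are just names made up here; it assumes integer input for bases
other than 10):

(defun first-digit-p (n digit &optional (radix 10))
  ;; Print N in the requested radix (no #x/#o prefix) and compare the
  ;; first character, ignoring case so #\f matches "FA00".
  (char-equal (char (write-to-string n :base radix :radix nil) 0)
              digit))

;; (first-digit-p 123 #\1)       => T
;; (first-digit-p #xFA00 #\f 16) => T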

Pascal

--
Pascal Costanza University of Bonn
mailto:cost...@web.de Institute of Computer Science III
http://www.pascalcostanza.de Römerstr. 164, D-53117 Bonn (Germany)

Kenny Tilton

unread,
Jun 12, 2003, 11:06:40 AM6/12/03
to

sv0f wrote:
> I need an efficient method to test that the
> first digit of a number is 1.

(defun first-digit-one-p (n)
  (= 1 (floor (/ n (expt 10 (floor (log n 10)))))))

? add checks for negative input, etc. better lispniks than I will have
better answers, but i say do this with math, not strings. my 2.


--

kenny tilton
clinisys, inc
http://www.tilton-technology.com/
---------------------------------------------------------------
"Everything is a cell." -- Alan Kay

Jeff Massung

unread,
Jun 12, 2003, 11:16:13 AM6/12/03
to
sv0f <no...@vanderbilt.edu> wrote in news:none-86CB24.09514512062003
@news.vanderbilt.edu:

> I need an efficient method to test that the
> first digit of a number is 1. The number
> will be positive, and can even be assumed
> to be an integer if this enables a particularly
> sweet solution.

This is the first thing I thought of, but I'm sure there is something even
easier...

(defun is-1? (n)
  (if (< n 10)
      (= n 1)
      (is-1? (floor (/ n 10)))))

--
Best regards,
Jeff mailto:j...@mfire.com
http://www.simforth.com

Matthieu Villeneuve

unread,
Jun 12, 2003, 11:20:21 AM6/12/03
to
"sv0f" <no...@vanderbilt.edu> wrote in message
news:none-86CB24.0...@news.vanderbilt.edu...

> I need an efficient method to test that the
> first digit of a number is 1. The number
> will be positive, and can even be assumed
> to be an integer if this enables a particularly
> sweet solution.
>
> I have not been able to make the numerical
> functions work for me in this regard, but
> perhaps I am missing something. Instead,
> I have been doing horrible things like:
>
> ? (defun first-digit-one-p (n)
> (eq (char (princ-to-string n) 0) #\1))
> FIRST-DIGIT-ONE-P
> ? (first-digit-one-p 20)
> NIL
> ? (first-digit-one-p 101)
> T
> ? (first-digit-one-p 1.01)
> T
> ?

Another approach (a bit "brute force"):

(defun first-digit (n base)
  (truncate n (expt base (floor (log n base)))))

(= (first-digit 123 10) 1) => T

:-)


--
Matthieu Villeneuve

Raymond Toy

unread,
Jun 12, 2003, 11:30:33 AM6/12/03
to
>>>>> "Kenny" == Kenny Tilton <kti...@nyc.rr.com> writes:

Kenny> sv0f wrote:
>> I need an efficient method to test that the
>> first digit of a number is 1.

Kenny> (defun first-digit-one-p (n)
Kenny> (= 1 (floor (/ n (expt 10 (floor (log n 10)))))))

Kenny> ? add checks for negative input, etc. better lispniks than I will have
Kenny> better answers, but i say do this with math, not strings. my 2.

How about

(= 1 (floor (expt 10 (nth-value 1 (floor (log n 10))))))

But this, like yours, suffers from round-off. But presumably if n
isn't so large that the log has no significant fraction bits, this
will be right.

Ray

Joe Marshall

unread,
Jun 12, 2003, 11:56:37 AM6/12/03
to

"sv0f" <no...@vanderbilt.edu> wrote in message news:none-86CB24.0...@news.vanderbilt.edu...
> I need an efficient method to test that the
> first digit of a number is 1.


I thought I'd try to avoid so much division by
multiplying instead.

(defun leftmost-digit (n)
  (do* ((i 1 next)
        (next 10 (* i 10)))
       ((> next n) (floor n i))))


Edi Weitz

unread,
Jun 12, 2003, 12:10:24 PM6/12/03
to
"Joe Marshall" <prunes...@attbi.com> writes:

> I thought I'd try to avoid so much division by
> multiplying instead.
>
> (defun leftmost-digit (n)
> (do* ((i 1 next)
> (next 10 (* i 10)))
> ((> next n) (floor n i))))

Or, if you're concerned about micro-efficiency and have an upper bound
for N you might want to avoid the multiplication at all:

(defun leftmost-digit (n)
  (loop for (i next . rest) on '(1 10 100 1000 10000 100000 1000000 10000000)
        unless next do (error "Too big...")
        if (> next n) do (return (floor n i))))

Something like that...

Edi.

sv0f

unread,
Jun 12, 2003, 12:22:55 PM6/12/03
to
In article <874r2vb...@bird.agharta.de>, Edi Weitz <e...@agharta.de>
wrote:

>Or, if you're concerned about micro-efficiency and have an upper bound
>for N you might want to avoid the multiplication at all:
>
> (defun leftmost-digit (n)
> (loop for (i next . rest) on '(1 10 100 1000 10000 100000 1000000 10000000)
> unless next do (error "Too big...")
> if (> next n) do (return (floor n i))))

I actually assume an upper bound at the moment:

(when (or (= i 1)
          (<= 10 i 19)
          (<= 100 i 199)
          (<= 1000 i 1999)
          (<= 10000 i 19999)
          (<= 100000 i 199999)
          (<= 1000000 i 1999999)
          (<= 10000000 i 19999999)
          (<= 100000000 i 199999999))
  ...)

However, I won't be able to in the future.

Kent M Pitman

unread,
Jun 12, 2003, 12:38:18 PM6/12/03
to
"Joe Marshall" <prunes...@attbi.com> writes:

Yep. That totally whomps

(defun leftmost-digit (n)
  (values (floor n (expt 10 (floor (log n 10))))))

by a good factor of 10 in LispWorks ... for fixnum args. I noticed, though,
that my version does start to creep ahead slowly when you get into bignums
and algorithmic complexity starts to dominate because you're not getting
any more pickup from fixed-cost hardware arithmetic, and the overhead of
a few floats in my version isn't terrible at that point. At some point, I
assume that floats would have their precision exceeded, though, and only
bignums would work again... ;) So a lot depends on your intended data.

[I didn't do any heavy testing on mine to assure correctness, btw. I
was focused on speed. And neither of us spent any time worrying about
n<1.]

My ad hoc mechanism did show a funny bug in LispWorks at
(leftmost-digit 1000), returning 10 instead of 1. Apparently
(LOG 1000 10) returned 2.9999999999999996 instead of 3.
But that's not really fatal. It's not likely to have to iterate
more than twice, so I did:

(defun leftmost-digit (n)
  (loop (setq n (floor n (expt 10 (floor (log n 10)))))
        (when (< n 10) (return n))))

I didn't try, so it's just a guess, but I bet the answer is different
in clisp, though, where byte compiling likely makes my version faster,
since it would probably make log and * be closer to the same speed
than they are in natively compiled systems, where * might compile open.
I mention this only to underscore that 'efficiency' is often also dependent
on the platform. Not that everyone didn't know that already. ;)

Geoffrey Summerhayes

unread,
Jun 12, 2003, 1:11:10 PM6/12/03
to

"sv0f" <no...@vanderbilt.edu> wrote in message news:none-8F2610.1...@news.vanderbilt.edu...

>
> I actually assume an upper bound at the moment:
>
> (when (or (= i 1)
> (<= 10 i 19)
> (<= 100 i 199)
> (<= 1000 i 1999)
> (<= 10000 i 19999)
> (<= 100000 i 199999)
> (<= 1000000 i 1999999)
> (<= 10000000 i 19999999)
> (<= 100000000 i 199999999))
> ...)
>
> However, I won't be able to in the future.

Yet another version of the divide by 10 loop:

(defun first-digit-one-p (n)
  (= 1 (do ((x n (floor x 10)))
          ((< x 10) x))))

Just a thought: are you sure the input has to be an integer?
All the times I've required a function like this, I've found
using a different representation for the data eliminated the
problem; it's a pretty odd test.

--
Geoff


Joe Marshall

unread,
Jun 12, 2003, 2:12:24 PM6/12/03
to
"Joe Marshall" <prunes...@attbi.com> writes:

Ok, I thought about this and decided I could do *a lot* better.

(defun leftmost-digit (n)
  (if (> 10 n)
      n
      (do* ((i 10 next)
            (next 100 (* i i)))
           ((> next n) (leftmost-digit (floor n i))))))

Thomas A. Russ

unread,
Jun 12, 2003, 1:26:44 PM6/12/03
to

sv0f <no...@vanderbilt.edu> writes:

> [Aside #1: Is PRINC-TO-STRING guaranteed to
> return a simple-string, and therefore can
> I use (the possibly more efficient) SCHAR
> in FIRST-ONE-P? This holds under MCL 4.1
> but I could find no general guarantee in
> CLtL2.]

No, it is not guaranteed. For example, read-line in MCL
does not return a simple-string. It returns a string with
a fill pointer.

> [Aside #2: I can use EQ to compare characters,
> right? Or is this implementation dependent?]

This is technically implementation dependent. You should use
EQL or CHAR= to compare characters. CHAR= is probably the
preferred choice if you are sure you have characters.

> More generally, I'd like to be able to
> test whether the first digit of a positive
> number is any arbitrary digit. But the
> special case of 1 is most pressing.

I agree with Pascal Costanza that the string approach is
probably the correct answer. Mainly because it will avoid
any mismatch between the printed representation of the
number and the answer of your test function. The mathematical
approach using logarithms could fail to have this property
due to floating point round-off in the computations.

--
Thomas A. Russ, USC/Information Sciences Institute

Geoffrey Summerhayes

unread,
Jun 12, 2003, 2:47:38 PM6/12/03
to

"Joe Marshall" <j...@ccs.neu.edu> wrote in message news:smqfma...@ccs.neu.edu...

>
> Ok, I thought about this and decided I could do *a lot* better.
>
> (defun leftmost-digit (n)
> (if (> 10 n)
> n
> (do* ((i 10 next)
> (next 100 (* i i)))
> ((> next n) (leftmost-digit (floor n i))))))

*stunned pause* Nice application of the idea. Way cool! :-)

--
Geoff

sv0f

unread,
Jun 12, 2003, 3:00:15 PM6/12/03
to
In article <9v2Ga.12269$Gm4.1...@news20.bellglobal.com>,
"Geoffrey Summerhayes" <sumNOS...@hotmail.com> wrote:

>Just a thought, are you sure the input has to be an integer?
>All the times I've required a function like this, I've found
>using a different representation for the data eliminated the
>problem, it's a pretty odd test.

It is an odd test, I know. Let me come clean.

I've recently run into a compelling mathematical problem
called the "first digit phenomenon" or "first digit law".
In 1881, Simon Newcomb noticed that books of logarithms in
libraries were dirtier at the beginning than the end. It
seemed as though his colleagues were looking up the
logarithms of numbers beginning with "1" more frequently
than numbers beginning with other digits. He conjectured
that the probability that the first digit of a number is
d is (log (+ 1 (/ d)) 10). This expression predicts that
approximately 30% of all numbers begin with 1; at the
other end of the spectrum, only 4.6% begin with 9.

In 1938, Frank Benford, an American physicist,
independently made the same observation. He looked
at the distribution of first digits culled from a number
of sources, including tables of river areas, populations,
constants, newspapers, specific heats, pressures, atomic
weights, Reader's Digest, addresses, death rates, etc.
The conjecture was consistent with the data.

Since then, a trickle of statisticians and mathematicians
have tried to explain the first digit law. I ran into
it in an interview with the statistician Persi Diaconis
in the edited volume by Albers and Alexanderson
"Mathematical People".

(I highly recommend reading this interview and the one
with Benoit Mandelbrot in the same book. The sequel
"More Mathematical People" contains an interview with
Bill Gosper, to (briefly) bring the subject back to Lisp!)

A Google search will quickly put you in contact with
expository work and ongoing research by Boyle, Diaconis,
Hill, and Raimi, as well as an article that appeared in
The New York Times in 1998. Let me know if anyone wants
specific references; I even have a few PDFs.

The explanations that have been offered strike me as
wrong. I think the reason for the first digit law is
not mathematical, but social/psychological.

Zipf's law is an empirical generality that was noticed
more than a century ago. It states that the relative
word frequency of a word w in a text is inversely
proportional to its rank r: f(w) ~ (/ r).

Actually, Zipf's law is f(w) ~ (/ (* r (log (* 1.78 R))))
where R is the total number of words.

Two definitions are in order. The "relative word
frequency" of a word is its frequency divided by the
frequency of the most frequent word. In other words,
the most frequent word has a relative word frequency
of 1. Also, the rank of a word is its position in a
list of words ordered from most frequent to least
frequent. In English, for example, "the" has rank 1,
"and" rank 2, etc.

My intuition is that numbers in the real world behave
less like Platonic ideals and more like words. In
other words, they follow Zipf's law. The questions,
then, are:
(1) Can Zipf's law be applied to numbers?
(2) If so, does the first digit law result?
(3) If so, can I get more empirical evidence
for the first digit law than that provided
by Newcomb and Benford?

With respect to (1), the only real problem is that
the original formulation of Zipf's law requires
knowing the total number of numbers, R. Of course,
R is infinite for the case of numbers -- this is a
problem. However, I can think of two ways around
this problem.

First, although the number of numbers may be infinite,
the numbers that appear in the world -- in printed
matter, on the web, etc. -- are not. As an approximation,
MOST-POSITIVE-FIXNUM for MCL 4.1, the number of electrons
in the universe, or the number of possible chess games
might do.

Second, and more importantly, if I compute f(w) for each
number in some range of interest according to the simple
(reciprocal) formulation of Zipf's law, then I can divide
the sum of the relative frequency of all numbers beginning
with 1 by the sum of the relative frequency of all numbers
to get the proportion of all numbers beginning with 1.
This is because in the division, all the (log (* 1.78 R))
terms will cancel out. In other words, as long as I
compute in this laborious fashion, I'm safe.

With respect to (2), the first digit law does in
fact result if one applies Zipf's law to numbers. First,
a critical assumption must be made:
(A) The rank of a number w is simply w.
In other words, 1 is the most frequent number, 2 the second
most frequent number, etc. (This assumption can be tested
empirically as described below; it appears to hold.)

Under (A), the code:

(defun prob-one (n &key epoch)
  (when (or (not (integerp n))
            (< n 1)
            (>= n 1000000000))
    (error "N must be an integer in [1,999999999]."))
  (unless (or (null epoch)
              (and (integerp epoch)
                   (<= 1 epoch n)))
    (error "EPOCH must be either NIL or an integer in [1,~A]." n))
  (let ((num 0)
        (den 0))
    (do* ((i 1 (1+ i))
          (term (/ 1 i) (/ 1 i)))
         ((> i n))
      (when (or (= i 1)
                (<= 10 i 19)
                (<= 100 i 199)
                (<= 1000 i 1999)
                (<= 10000 i 19999)
                (<= 100000 i 199999)
                (<= 1000000 i 1999999)
                (<= 10000000 i 19999999)
                (<= 100000000 i 199999999))
        (incf num term))
      (incf den term)
      (when (and epoch
                 (zerop (mod i epoch)))
        (format t "~&~A:~A~A" i #\tab (float (/ num den)))))
    (float (/ num den))))

produces:

? (prob-one 999999999 :epoch 10000)
10000: 0.31753649107573223
20000: 0.3626650915013386
30000: 0.3491576030219523
40000: 0.34016832787531365
50000: 0.3335081987176785
60000: 0.32825701315681066
70000: 0.3239445091319402
80000: 0.32029940436686627
90000: 0.3171516123911374
100000: 0.3143886117731376
110000: 0.319751187767352
120000: 0.32457410795821995
130000: 0.3289507653399085
140000: 0.3329526520875796
150000: 0.33663565393452183
160000: 0.3400442456218855
170000: 0.34321437227408086
180000: 0.3461754799391808
190000: 0.3489519789920579
200000: 0.35156392865612124
210000: 0.3502272148265025
220000: 0.34896212882018485
230000: 0.3477617972961802
240000: 0.3466202718086957
250000: 0.3455323716123922
260000: 0.3444935583083216
270000: 0.3434998349218677
280000: 0.3425476639347623
290000: 0.3416339001707058
300000: 0.3407557354302229
310000: 0.33991065249977476
320000: 0.3390963867004261
330000: 0.33831089354589533
340000: 0.3375523213857245
350000: 0.3368189881428105
360000: 0.336109361434329
370000: 0.33542204150464894
380000: 0.3347557465080133
390000: 0.3341092997647813
400000: 0.33348161868326376
410000: 0.33287170509366826
420000: 0.3322786367844269
430000: 0.3317015600665375
440000: 0.33113968322026466
[...]

This output is highly suggestive, isn't it? It looks
like the proportion of numbers beginning with 1
oscillates between 0.36 and 0.31, at least in the
range of numbers examined. The oscillations make
sense if you think about it. Interestingly (and
again obviously, if you think about it), they occur
at intervals orders of magnitude smaller or larger.
In other words, a similar pattern occurs between,
say, 1 and 44000 when the proportion is printed every
1000 numbers.

I suspect 0.33 is the expected proportion of numbers
beginning with 1 if Zipf's law applies to numbers.
(Can you guess why?)

The function has its good and bad points. The bignums
and rational numbers of Common Lisp mean no round off
error is accumulating along the way. This is good.
However, it isn't terribly efficient. In fact, it is
a memory pig.

Finally, with respect to (3), I have only done some
preliminary investigations. Does the first digit
law hold for web pages? A little messing around
with Google was encouraging, but more work remains
to be done. For example, one problem with Google
is that the count of the total number of pages
containing a numerical search term is rounded quite
severely. Google can also be used to test the
empirical merit of assumption (A) above. Once again,
preliminary investigations indicate that it is
warranted.

Alas, I have to put this project away for a few
months and finish a dissertation. However, any ideas,
feedback, etc., are welcome.

For example, it would be nice to have a closed form
expression for the sum of 1/n where n ranges between
the integers i and j, instead of having to iterate a
la PROB-ONE above. (Of course, the sum for i=1 and
j=infinity is infinity.) A quick perusal of tables of
formulas in my mathematics references was not helpful
in this regard, but perhaps someone here can be of some
help. I'll even offer a Knuthian $2.56 for the first
claimant!

Sashank Varma

Kaz Kylheku

unread,
Jun 12, 2003, 3:11:54 PM6/12/03
to
> I need an efficient method to test that the
> first digit of a number is 1. The number
> will be positive, and can even be assumed
> to be an integer if this enables a particularly
> sweet solution.

How about this idea:

0. Choose some power of ten, designate as value of a
constant named P.
1. If the number N is between 0 and P-1, then carry out a ladder of
hard coded tests:
For instance if P is 10000, then test whether N is in the
ranges [1000, 2000), [100, 200), [10, 20) or [1, 2).
Return T or NIL accordingly.
2. Divide N by P and go to 1.

If P is 10, then this is just repeated division by 10 until you get 1
or bust, but larger P's give you fewer divisions which could be
faster.

Or this idea, which uses multiplication to grow the range:

Loop the values (A B) over (1 2) (10 20) (100 200) and so forth.
When A grows larger than N, return NIL. When N is found to be
in the range [A, B) return T.
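
A direct transcription of that second idea (assuming a positive
integer N and base 10):

(defun first-digit-one-p (n)
  (do ((a 1 (* a 10))      ; lower end of the [A, B) range
       (b 2 (* b 10)))     ; upper end, exclusive
      ((> a n) nil)        ; A grew past N without a hit
    (when (< n b)          ; here A <= N already, so N is in [A, B)
      (return t))))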



> [Aside #2: I can use EQ to compare characters,
> right? Or is this implementation dependent?]

(char= ch #\1)

sv0f

unread,
Jun 12, 2003, 3:23:51 PM6/12/03
to
In article <none-86CB24.0...@news.vanderbilt.edu>,
sv0f <no...@vanderbilt.edu> wrote:

>I need an efficient method to test that the
>first digit of a number is 1.

Thanks for the suggestions. To summarize:

Joe offered two iterative solutions that work for any
number. Geoffrey offered one too. Jeff offered a
(tail-)recursive version.

Matthieu, Kenny, and Raymond offered solutions involving
exponents and logs.

Raymond and Thomas noted that round off errors can
sabotage the exponent/log approach for sufficiently
large numbers. The iterative approach does not
suffer this flaw.

Kent did my dirty work, benchmarked the two competing
solutions, and found the iterative/tail-recursive
solution the best.

Pascal pointed out the niceties of my original
solution, which Thomas seconded.

Prompted by Geoffrey, I gave more background on why
I need a function to test the first digit of a number.
In doing so, I exposed a function that creates rational
numbers so large as to probably swamp all other concerns.

I've been thinking about Pascal's and Thomas's advocacy
for the string-based approach. Mathematically speaking,
a number is a number. It's only when rendered in some
notation that the concept "first digit" makes sense.
In other words, "first digit" has less to do with numbers
than their notations. And when working with notations,
why not be happy with strings? I need to think about
this more, and perhaps do some benchmarking...

butter71

unread,
Jun 12, 2003, 3:34:02 PM6/12/03
to
> I have not been able to make the numerical
> functions work for me in this regard, but
> perhaps I am missing something.

i hope this isn't a homework problem..

the problem is doable with math only. you don't need to (and shouldn't)
resort to string manipulation. read up on (log n &optional base).

--
butter71

sv0f

unread,
Jun 12, 2003, 3:41:16 PM6/12/03
to
In article <AV3Ga.12688$Gm4.1...@news20.bellglobal.com>,
"Geoffrey Summerhayes" <sumNOS...@hotmail.com> wrote:

When I first saw it, I thought it was a casual rewrite of
Joe's first iterative effort. Then I sensed the binary
search, but couldn't "see" it. I added some FORMAT
statements:

(defun leftmost-digit (n)
  (format t "~&N=~A" n)
  (if (> 10 n)
      n
      (do* ((i 10 next)
            (next 100 (* i i))
            (dummy (format t "~&~A ~A" i next)
                   (format t "~&~A ~A" i next)))
           ((> next n) (leftmost-digit (floor n i))))))

and ran some test cases:

? (leftmost-digit 4)
N=4
4
? (leftmost-digit 40)
N=40
10 100
N=4
4
? (leftmost-digit 400)
N=400
10 100
100 10000
N=4
4
? (leftmost-digit 4000)
N=4000
10 100
100 10000
N=40
10 100
N=4
4
? (leftmost-digit 40000)
N=40000
10 100
100 10000
10000 100000000
N=4
4
? (leftmost-digit 400000)
N=400000
10 100
100 10000
10000 100000000
N=40
10 100
N=4
4
? (leftmost-digit 4000000)
N=4000000
10 100
100 10000
10000 100000000
N=400
10 100
100 10000
N=4
4
? (leftmost-digit 40000000)
N=40000000
10 100
100 10000
10000 100000000
N=4000
10 100
100 10000
N=40
10 100
N=4
4
?

What tripped me up was, embarrassingly enough: (> 10 n).
I kept reading it as (> n 10)!!!

sv0f

unread,
Jun 12, 2003, 3:43:15 PM6/12/03
to
In article <32bc1cd3.03061...@posting.google.com>,
butte...@lunch.org (butter71) wrote:

No, not hw. Not sure why I blanked on log...

Joe Marshall

unread,
Jun 12, 2003, 4:01:10 PM6/12/03
to
sv0f <no...@vanderbilt.edu> writes:

> I've been thinking about Pascal's and Thomas's advocacy
> for the string-based approach. Mathematically speaking,
> a number is a number. It's only when rendered in some
> notation that the concept "first digit" makes sense.
> In other words, "first digit" has less to do with numbers
> than their notations. And when working with notations,
> why not be happy with strings? I need to think about
> this more, and perhaps do some benchmarking...

One reason you want to avoid strings is that the algorithm that
renders numbers to strings assumes that you are interested in *all*
the digits. Since you only want the leftmost one, you can save a
lot of time by just not generating the others.

Jeff Massung

unread,
Jun 12, 2003, 4:09:46 PM6/12/03
to
sv0f <no...@vanderbilt.edu> wrote in news:none-9C4DA5.14235112062003
@news.vanderbilt.edu:

[great ideas clipped]

> I've been thinking about Pascal's and Thomas's advocacy
> for the string-based approach. Mathematically speaking,
> a number is a number. It's only when rendered in some
> notation that the concept "first digit" makes sense.
> In other words, "first digit" has less to do with numbers
> than their notations. And when working with notations,
> why not be happy with strings? I need to think about
> this more, and perhaps do some benchmarking...
>

There sure were lots of good ideas. Keep in mind that if performance is an
issue for you, doing the iterative division or exponent/log will be
less expensive than string manipulation. This is because to convert from
a number to a string you need to do the divisions anyway...

sv0f

unread,
Jun 12, 2003, 4:18:48 PM6/12/03
to
Since Kent took the first step in benchmarking, here's what
I found.

I used a simple benchmark that counts how many numbers start
with 1 in a set of 100,000 random numbers in {0, 1, ..., 999}.

Welcome to Macintosh Common Lisp Version 4.1!
? (defun test-first-digit (fn &optional (times 100000))
    (let ((count 0))
      (dotimes (i times)
        (when (funcall fn (random 1000))
          (incf count)))
      count))
TEST-FIRST-DIGIT

My string-based version was the worst.

? (defun first-digit-one-p (n)
    (char= (char (princ-to-string n) 0) #\1))
FIRST-DIGIT-ONE-P
? (time (test-first-digit #'first-digit-one-p))
(TEST-FIRST-DIGIT #'FIRST-DIGIT-ONE-P) took 6,633 milliseconds (6.633
seconds) to run.
Of that, 372 milliseconds (0.372 seconds) were spent in The Cooperative
Multitasking Experience.
7,200,000 bytes of memory allocated.
11145

Joe's iterative solution was 26x faster than my string-based
version and allocates no memory. (Geoffrey's iterative
solution performs similarly to Joe's.)

? (defun leftmost-digit-one-p (n)
    (do* ((i 1 next)
          (next 10 (* i 10)))
         ((> next n) (= (floor n i) 1))))
LEFTMOST-DIGIT-ONE-P
? (time (test-first-digit #'leftmost-digit-one-p))
(TEST-FIRST-DIGIT #'LEFTMOST-DIGIT-ONE-P) took 246 milliseconds (0.246
seconds) to run.
Of that, 1 milliseconds (0.001 seconds) were spent in The Cooperative
Multitasking Experience.
11139

Jeff's recursive solution fell between the string-based
and iterative solutions in time and space. MCL has a
sizeable overhead for the recursive calls.

? (defun is-1? (n)
    (if (< n 10)
        (= n 1)
        (is-1? (floor (/ n 10)))))

IS-1?
? (time (test-first-digit #'is-1?))
(TEST-FIRST-DIGIT #'IS-1?) took 1,189 milliseconds (1.189 seconds) to
run.
Of that, 89 milliseconds (0.089 seconds) were spent in The Cooperative
Multitasking Experience.
5,445,568 bytes of memory allocated.
11182

The log/expt solution due to Matthieu, Kenny, and others was about
equal to the recursive solution timewise, but is even worse than
the string-based solution with respect to space usage. Perhaps
MCL is worse than other implementations in this regard?

? (defun first-digit-one-p (n)
    (if (zerop n)
        nil
        (= 1 (floor (/ n (expt 10 (floor (log n 10))))))))
FIRST-DIGIT-ONE-P
? (time (test-first-digit #'first-digit-one-p))
(TEST-FIRST-DIGIT #'FIRST-DIGIT-ONE-P) took 1,839 milliseconds (1.839
seconds) to run.
Of that, 176 milliseconds (0.176 seconds) were spent in The Cooperative
Multitasking Experience.
9,502,496 bytes of memory allocated.
11294

Finally, Joe's binary search solution was about as good as
his iterative solution in my initial test:

? (defun leftmost-digit-one-p (n)
    (if (> 10 n)
        (= n 1)
        (do* ((i 10 next)
              (next 100 (* i i)))
             ((> next n) (leftmost-digit-one-p (floor n i))))))

LEFTMOST-DIGIT-ONE-P
? (time (test-first-digit #'leftmost-digit-one-p))
(TEST-FIRST-DIGIT #'LEFTMOST-DIGIT-ONE-P) took 277 milliseconds (0.277
seconds) to run.
11026

I changed my test routine so that the random numbers were
drawn from {0, 1, ..., 99999}. The binary search version
was then *worse* than the iterative version.

I increased the range by two orders of magnitude, retested,
and the two routines were about even again.

It appears that the savings of binary versus linear search
is offset by the overhead of the recursive call in MCL.

sv0f

unread,
Jun 12, 2003, 4:28:17 PM6/12/03
to
In article <3cifm5...@ccs.neu.edu>, Joe Marshall <j...@ccs.neu.edu>
wrote:

>One reason you want to avoid strings is that the algorithm that
>renders numbers to strings assumes that you are interested in *all*
>the digits. Since you only want the leftmost one, you can save a
>lot of time by just not generating the others.

Good point.

On a related note, I searched for a Common Lisp command
like PRINC-TO-STRING that takes not just an object to
be written, but also a string to receive the printed
representation of the object. My thinking was that I
could create a single, large string with a fill pointer
and reuse this instead of having PRINC-TO-STRING generate
a new string each time. But, (1) I could find no such
function and (2) it was unclear to me whether manually
managing string resources would be any faster than
generational GC for such short-lived strings.
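
For what it's worth, WITH-OUTPUT-TO-STRING does accept an existing
string with a fill pointer and appends to it, so the buffer could be
reused along these lines (just a sketch; whether it actually beats the
generational GC is exactly the question, and *DIGIT-BUFFER* assumes
single-threaded use):

(defparameter *digit-buffer*
  (make-array 32 :element-type 'character
                 :fill-pointer 0 :adjustable t))

(defun first-digit-one-p (n)
  (setf (fill-pointer *digit-buffer*) 0)
  (with-output-to-string (s *digit-buffer*)   ; appends into the buffer
    (princ n s))
  (char= (char *digit-buffer* 0) #\1))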

Joe Marshall

unread,
Jun 12, 2003, 4:36:37 PM6/12/03
to
sv0f <no...@vanderbilt.edu> writes:

>
> What tripped me up was, embarassingly enough: (> 10 n).
> I kept reading it as (> n 10)!!!


I wrote it backwards so that the `base' always appears
to the left of the number. I was thinking of replacing the literal 10
with an arbitrary base.

I should have written it differently.

Erann Gat

unread,
Jun 12, 2003, 4:03:34 PM6/12/03
to
In article <none-3A8EBB.1...@news.vanderbilt.edu>, sv0f
<no...@vanderbilt.edu> wrote:

> The explanations that have been offered strike me as
> wrong. I think the reason for the first digit law is
> not mathematical, but social/psychological.

[snip]

> Alas, I have to put this project away for a few
> months and finish a dissertation. However, any ideas,
> feedback, etc., are welcome.

Since you ask...

That numbers used to identify real-world quantities should obey something
like Zipf's law has always seemed intuitively obvious to me, because such
numbers identify quantities in a (usually implicit) finite range. These
ranges are not uniformly distributed, and the prevalence of particular
leading digits is not uniformly distributed within a range. For example,
numbers that identify quantities in the range, say, 1..3 are much more
common than numbers that identify quantities in the range
0..9999999999999999999. So ranges like 1..3 are disproportionately
represented among all the possible ranges within which numbers are used to
identify quantities, and the number 1 is disproportionately represented as
a leading digit in such ranges. Sociologically, other disproportionately
common ranges that have a disproportionately large prevalence of leading 1's
include ranges such as 1..10 (probability 0.2) and 1..20 (probability
0.5).

Even if one assumes a uniform distribution of ranges one gets a very close
approximation to Benford's law. The expected probability of a leading 1
for a uniformly distributed random number drawn from a uniformly
distributed random range is about 0.29. (See the central limit results on
http://mathworld.wolfram.com/BenfordsLaw.html.)

> For example, it would be nice to have a closed form
> expression for the sum of 1/n where n ranges between
> the integers i and j, instead of having to iterate a
> la PROB-ONE above. (Of course, the sum for i=1 and
> j=infinity is infinity.) A quick perusal of tables of
> formulas in my mathematics references was not helpful
> in this regard, but perhaps someone here can be of some
> help. I'll even offer a Knuthian $2.56 for the first
> claimant!

This is just the difference between the two harmonic numbers H-sub-i and
H-sub-j. See http://mathworld.wolfram.com/HarmonicNumber.html.
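
In Lisp terms (exact with rationals, just slow; the names are mine):

(defun harmonic (k)
  ;; H(k) = 1 + 1/2 + ... + 1/k, with H(0) = 0.
  (loop for n from 1 to k sum (/ 1 n)))

(defun harmonic-range (i j)
  ;; Sum of 1/n for n from i to j inclusive.
  (- (harmonic j) (harmonic (1- i))))

;; (float (harmonic-range 10 19)) => about 0.7188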

E.

Barry Margolin

unread,
Jun 12, 2003, 5:47:11 PM6/12/03
to
>The explanations that have been offered strike me as
>wrong. I think the reason for the first digit law is
>not mathematical, but social/psychological.

In the particular case of the digit 1, I suspect its frequency is highly
skewed because exact powers of 10 (1, 10, 100, 1000, etc.) come up very
frequently.

--
Barry Margolin, barry.m...@level3.com
Level(3), Woburn, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.

Kalle Olavi Niemitalo

unread,
Jun 12, 2003, 5:52:59 PM6/12/03
to
Jeff Massung <j...@NOSPAM.mfire.com> writes:

> There sure were lots of good ideas. Keep in mind that if performance is an
> issue for you, that doing the iterative division or exponent/log will be
> less expensive than string manipulation. This is because to convert from
> number to string you need to do the division anyways...

Not if bignums are in BCD! ;-)

LDB and other bit-manipulation functions would become rather
expensive with such a representation, I suppose.

Jeff Massung

unread,
Jun 12, 2003, 6:52:42 PM6/12/03
to
Barry Margolin <barry.m...@level3.com> wrote in
news:zx6Ga.21$9D3...@paloalto-snr1.gtei.net:

> In article <none-3A8EBB.1...@news.vanderbilt.edu>,
> sv0f <no...@vanderbilt.edu> wrote:
>>The explanations that have been offered strike me as
>>wrong. I think the reason for the first digit law is
>>not mathematical, but social/psychological.
>
> In the particular case of the digit 1, I suspect its frequency is
> highly skewed because exact powers of 10 (1, 10, 100, 1000, etc.) come
> up very frequently.
>

How can you think that 1000-1999 is more common than 2000-2999? The
number '1' as a first digit is just as common as any other digit.
Perhaps in actual usage it comes up more often (more items cost in the
$10's of dollars than $20's) but just pulling random #'s will in no way
be skewed towards the digit '1'.

Or perhaps I'm misunderstanding what is being discussed?

Barry Margolin

unread,
Jun 12, 2003, 7:02:55 PM6/12/03
to
In article <vei11q4...@corp.supernews.com>,

Jeff Massung <j...@NOSPAM.mfire.com> wrote:
>Barry Margolin <barry.m...@level3.com> wrote in
>news:zx6Ga.21$9D3...@paloalto-snr1.gtei.net:
>
>> In article <none-3A8EBB.1...@news.vanderbilt.edu>,
>> sv0f <no...@vanderbilt.edu> wrote:
>>>The explanations that have been offered strike me as
>>>wrong. I think the reason for the first digit law is
>>>not mathematical, but social/psychological.
>>
>> In the particular case of the digit 1, I suspect its frequency is
>> highly skewed because exact powers of 10 (1, 10, 100, 1000, etc.) come
>> up very frequently.
>>
>
>How can you think that 1000-1999 is more common than 2000-2999?

I don't. I think that the specific numbers 1, 10, 100, 1,000, and
1,000,000 are more common than everything else. And these are skewing the
statistics in favor of numbers beginning with 1.

Jeff Massung

unread,
Jun 12, 2003, 7:16:35 PM6/12/03
to
Barry Margolin <barry.m...@level3.com> wrote in
news:zE7Ga.29$9D3...@paloalto-snr1.gtei.net:

> In article <vei11q4...@corp.supernews.com>,
> Jeff Massung <j...@NOSPAM.mfire.com> wrote:

>>How can you think that 1000-1999 is more common than 2000-2999?
>
> I don't. I think that the specific numbers 1, 10, 100, 1,000, and
> 1,000,000 are more common than everything else. And these are skewing
> the statistics in favor of numbers beginning with 1.
>

More common than 2, 20, 200 and 2000?

Paul Wallich

unread,
Jun 12, 2003, 7:44:17 PM6/12/03
to
Jeff Massung wrote:

> Barry Margolin <barry.m...@level3.com> wrote in
> news:zE7Ga.29$9D3...@paloalto-snr1.gtei.net:
>
>
>>In article <vei11q4...@corp.supernews.com>,
>>Jeff Massung <j...@NOSPAM.mfire.com> wrote:
>
>
>>>How can you think that 1000-1999 is more common than 2000-2999?
>>
>>I don't. I think that the specific numbers 1, 10, 100, 1,000, and
>>1,000,000 are more common than everything else. And these are skewing
>>the statistics in favor of numbers beginning with 1.
>>
> More common than 2, 20, 200 and 2000?

One of the arguments made in these kinds of discussions is that it's not
numbers we're talking about, but rather things or physical quantities.
It's their representation in numeric form that gives rise to the skew,
if skew there is in a particular sample.

One obvious possibility is that humans choose their scaling factors in a
way that increases the relative proportion of 1's (think of all the
"X-and-half-digit" displays out there where the top end is a leading 1).
But the lognormal-distribution thing makes much more sense to me, at
least as a first approximation -- or perhaps the two effects work together.

One interesting extension would be to use a bunch of different scale
factors on the same data to see how the digit distribution changed.

paul

Barry Margolin

unread,
Jun 12, 2003, 7:57:32 PM6/12/03
to
In article <vei2ejd...@corp.supernews.com>,

Jeff Massung <j...@NOSPAM.mfire.com> wrote:
>Barry Margolin <barry.m...@level3.com> wrote in
>news:zE7Ga.29$9D3...@paloalto-snr1.gtei.net:
>
>> In article <vei11q4...@corp.supernews.com>,
>> Jeff Massung <j...@NOSPAM.mfire.com> wrote:
>
>>>How can you think that 1000-1999 is more common than 2000-2999?
>>
>> I don't. I think that the specific numbers 1, 10, 100, 1,000, and
>> 1,000,000 are more common than everything else. And these are skewing
>> the statistics in favor of numbers beginning with 1.
>
>More common than 2, 20, 200 and 2000?

I expect so. For instance, there are plenty of texts that contain tables
of the metric system prefixes, e.g.

deca = 10
kilo = 1000
mega = 1,000,000
etc.

I imagine there are lots of other times when a nice, round number like
1,000 or 1,000,000 gets used by people. For instance, a kidnapper is
probably more likely to ask for a million dollars in ransom than 2 million
dollars -- unless he has a specific need for some other amount, it's
probably the first number in that order of magnitude that comes to mind.

3 probably gets a boost because it's the first digit of pi.

Pascal Costanza

unread,
Jun 12, 2003, 8:15:26 PM6/12/03
to
In article <none-9C4DA5.1...@news.vanderbilt.edu>,
sv0f <no...@vanderbilt.edu> wrote:

Here is an additional thought: I was thinking about this with an
"engineer's mindset" when I posted my comment. I saw the following
possible forces:

- You don't know whether the performance of this piece of code will turn
out critical in a larger system.

- On the other hand, your straightforward string-based approach directly
states what you want: take the first digit from the printed
representation of a number. I have only skimmed the other proposed, more
"mathematical" implementations, but it seems to me that they are less
clear, at least at first sight.

So without further knowledge about your actual requirements wrt
performance, the readability and maintainability aspects seemed to be
more important to me.

(I am obviously strongly influenced by the extreme programming mindset -
"do the simplest thing that could possibly work".)


Pascal

Duane Rettig

unread,
Jun 12, 2003, 9:22:06 PM6/12/03
to
Paul Wallich <p...@panix.com> writes:

> Jeff Massung wrote:
>
> > Barry Margolin <barry.m...@level3.com> wrote in
> > news:zE7Ga.29$9D3...@paloalto-snr1.gtei.net:
>
> >>In article <vei11q4...@corp.supernews.com>,
> >>Jeff Massung <j...@NOSPAM.mfire.com> wrote:
> >
>
> >>>How can you think that 1000-1999 is more common than 2000-2999?
> >>
> >>I don't. I think that the specific numbers 1, 10, 100, 1,000, and
> >>1,000,000 are more common than everything else. And these are skewing
> >>the statistics in favor of numbers beginning with 1.
> >>
>
> > More common than 2, 20, 200 and 2000?
>
> One of the arguments made in these kinds of discussions is that it's
> not numbers we're taking about, but rather things or physical
> quantities. It's their representation in numeric form that gives rise
> to the skew, if skew there is in a particular sample.

I disagree - I think it's a simple combination of counting and getting
tired. Very intuitive to me:

One tends to count starting at 0 or 1, but stopping arbitrarily. So
given any range, numbers starting with 1 will usually be in that range.

What about a range up _to_ a number starting with 1? Well, that would
have been a range of numbers starting with 0, but since we don't by
convention start numbers with 0, it isn't as common to see. So instead,
such a range that stops just short of 1.... is really a part of that
range with one fewer digits, a range which goes _through_ 9...
So it is really a part of a distribution of ranges which is at a
different level, and which also has a probability of stopping at or
after 1...

An obvious source of such stopping is the year - we've only just begun
y2k, so many of our years have started with 1 (yet another counting
issue).

--
Duane Rettig du...@franz.com Franz Inc. http://www.franz.com/
555 12th St., Suite 1450 http://www.555citycenter.com/
Oakland, Ca. 94607 Phone: (510) 452-2000; Fax: (510) 452-0182

sv0f

unread,
Jun 12, 2003, 10:26:57 PM6/12/03
to
In article <zx6Ga.21$9D3...@paloalto-snr1.gtei.net>,
Barry Margolin <barry.m...@level3.com> wrote:

>In article <none-3A8EBB.1...@news.vanderbilt.edu>,
>sv0f <no...@vanderbilt.edu> wrote:
>>The explanations that have been offered strike me as
>>wrong. I think the reason for the first digit law is
>>not mathematical, but social/psychological.
>
>In the particular case of the digit 1, I suspect its frequency is highly
>skewed because exact powers of 10 (1, 10, 100, 1000, etc.) come up very
>frequently.

Along these lines, deeper in this thread, Duane wrote:

In article <4vfvaw...@beta.franz.com>,
Duane Rettig <du...@franz.com> wrote:

>An obvious source of such stopping is the year - we've only just begun
>y2k, so many of our years have started with 1 (yet another counting
>issue).

I actually don't think this explains the first digit phenomenon.
Keep in mind it crops up in a variety of sources: tables of river
areas, populations, constants, newspapers, specific heats,
pressures, atomic weights, Reader's Digest, addresses, death
rates, etc. I see no reason why powers of ten or Gregorian
calendar dates would skew the distributions of first digits in
all of these sources.

When exploring this problem with Google, I thought there
would be anomalous spikes for numbers encoding heavily
populated area codes (e.g., Chicago's 312), heavily
populated zip codes, etc. But the spikes were quite
trivial in size.

I like my Zipf's law approach (needless to say). The
question is then more general: Why does Zipf's law hold
for words AND numbers?

Pascal Bourguignon

unread,
Jun 12, 2003, 10:31:20 PM6/12/03
to
sv0f <no...@vanderbilt.edu> writes about digit frequencies.

And what's stranger is that this law does not depend on the base used
to write the numbers!

--
__Pascal_Bourguignon__ http://www.informatimago.com/
----------------------------------------------------------------------
Do not adjust your mind, there is a fault in reality.

Pascal Bourguignon

unread,
Jun 12, 2003, 10:33:35 PM6/12/03
to
Jeff Massung <j...@NOSPAM.mfire.com> writes:

As a matter of fact, not in the case of natural sequences of numbers.

That's a way to detect doctored sequences: they don't follow the same
proportions as natural sequences.

There's no random numbers in the real world.

Pascal Bourguignon

unread,
Jun 12, 2003, 10:38:02 PM6/12/03
to
Jeff Massung <j...@NOSPAM.mfire.com> writes:

> Barry Margolin <barry.m...@level3.com> wrote in
> news:zE7Ga.29$9D3...@paloalto-snr1.gtei.net:
>
> > In article <vei11q4...@corp.supernews.com>,
> > Jeff Massung <j...@NOSPAM.mfire.com> wrote:
>
> >>How can you think that 1000-1999 is more common than 2000-2999?
> >
> > I don't. I think that the specific numbers 1, 10, 100, 1,000, and
> > 1,000,000 are more common than everything else. And these are skewing
> > the statistics in favor of numbers beginning with 1.
> >
>
> More common than 2, 20, 200 and 2000?

That's not the point. You can rewrite these number in a different
base, and get the same kind of statistics on their digits:

(sort
 (mapcar (lambda (n) (format nil "~7R" n))
         '(1 10 100 1000 10000 100000 100000
           2 20 200 2000 20000 200000 200000))
 (function string<=))

("1" "112211" "13" "1462043" "1462043"
"2" "202" "26" "2626"
"404" "41104"
"5555" "564355" "564355")

(try it with longer sequences of *natural* numbers, that is, not
invented numbers, but numbers you find in the real world, and with
whatever base).

sv0f

unread,
Jun 12, 2003, 10:34:23 PM6/12/03
to
In article <gat-120603...@k-137-79-50-101.jpl.nasa.gov>,
g...@jpl.nasa.gov (Erann Gat) wrote:

>> For example, it would be nice to have a closed form
>> expression for the sum of 1/n where n ranges between
>> the integers i and j, instead of having to iterate a
>> la PROB-ONE above. (Of course, the sum for i=1 and
>> j=infinity is infinity.) A quick perusal of tables of
>> formulas in my mathematics references was not helpful
>> in this regard, but perhaps someone here can be of some
>> help. I'll even offer a Knuthian $2.56 for the first
>> claimant!
>
>This is just the difference between the two harmonic numbers H-sub-i and
>H-sub-j. See http://mathworld.wolfram.com/HarmonicNumber.html.

Thanks for the pointer. Computing H-sub-i involves a
"digamma" function that looks tricky to evaluate at
first glance, but I will definitely look further.

sv0f

unread,
Jun 12, 2003, 10:50:29 PM6/12/03
to
In article <none-1A5C46.1...@news.vanderbilt.edu>,
sv0f <no...@vanderbilt.edu> wrote:

>Joe's iterative solution was 26x faster than my string-based
>version and allocates no memory. (Geoffrey's iterative
>solution performs similarly to Joe's.)
>
>? (defun leftmost-digit-one-p (n)
> (do* ((i 1 next)
> (next 10 (* i 10)))
> ((> next n) (= (floor n i) 1))))
>LEFTMOST-DIGIT-ONE-P
>? (time (test-first-digit #'leftmost-digit-one-p))
>(TEST-FIRST-DIGIT #'LEFTMOST-DIGIT-ONE-P) took 246 milliseconds (0.246
>seconds) to run.
>Of that, 1 milliseconds (0.001 seconds) were spent in The Cooperative
>Multitasking Experience.
>11139

[...]

>Finally, Joe's binary search solution was about as good as
>his iterative solution in my initial test:
>
>? (defun leftmost-digit-one-p (n)
> (if (> 10 n)
> (= n 1)
> (do* ((i 10 next)
> (next 100 (* i i)))
> ((> next n) (leftmost-digit (floor n i))))))
>LEFTMOST-DIGIT-ONE-P
>? (time (test-first-digit #'leftmost-digit-one-p))
>(TEST-FIRST-DIGIT #'LEFTMOST-DIGIT-ONE-P) took 277 milliseconds (0.277
>seconds) to run.
>11026
>
>I changed my test routine so that the random numbers were
>drawn from {0, 1, ..., 99999}. The binary search version
>was then *worse* than the iterative version.
>
>I increased the range by two orders of magnitude, retested,
>and the two routines were about even again.
>
>It appears that the savings of binary versus linear search
>is offset by the overhead of the recursive call in MCL.

I'm still puzzled by this. I wrote a non-recursive version
of Joe's second, optimized routine:

(defun unrolled-leftmost-digit-one-p (n)
  (prog (i next)
   LOOP-INIT
     (when (> 10 n)
       (return (= 1 n)))
     (setq i 10)
     (setq next 100)
   ITERATION
     (when (> next n)
       (setq n (floor n i))
       (go LOOP-INIT))
     (setq i next)
     (setq next (* i i))
     (go ITERATION)))

However, it appears no faster than the original, recursive
version. My current guess is that the "binary search"
version outperforms the "linear search" version for
extremely large inputs, i.e., >> 10^9.

Joe Marshall

unread,
Jun 12, 2003, 11:22:22 PM6/12/03
to
The binary search routine will give
something like O(log log n) growth, so it
should have reasonable performance for
numbers that are thousands of digits long.

I suspect that for more modest numbers that
the call to FLOOR is the problem. Also, the
routine calculates powers of 10 over and over
when these could just as easily be looked up.

If you are restricting yourself to fixnums,
then you can just open-code the binary search
and the digit computation in one big COND statement
and get the answer with a few subtractions.
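
Something along these lines, say for non-negative fixnums below 10^8
(only a sketch; written as a simple ladder here, where a true binary
search would nest the comparisons):

(defun first-digit-one-p (n)
  (cond ((< n 10)       (= n 1))
        ((< n 100)      (< n 20))
        ((< n 1000)     (< n 200))
        ((< n 10000)    (< n 2000))
        ((< n 100000)   (< n 20000))
        ((< n 1000000)  (< n 200000))
        ((< n 10000000) (< n 2000000))
        (t              (< n 20000000))))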

"sv0f" <no...@vanderbilt.edu> wrote in message news:none-B01907.2...@news.vanderbilt.edu...

Adam Warner

unread,
Jun 12, 2003, 11:45:23 PM6/12/03
to
Hi Duane Rettig,

>> One of the arguments made in these kinds of discussions is that it's
>> not numbers we're taking about, but rather things or physical
>> quantities. It's their representation in numeric form that gives rise
>> to the skew, if skew there is in a particular sample.
>
> I disagree - I think it's a simple combination of counting and getting
> tired. Very intuitive to me:
>
> One tends to count starting at 0 or 1, but stopping arbitrarily. So
> given any range, numbers starting with 1 will usually be in that range.
>
> What about a range up _to_ a number starting with 1? Well, that would
> have been a range of numbers starting with 0, but since we don't by
> convention start numbers with 0, it isn't as common to see. So instead,
> such a range that stops just short of 1.... is really a part of that
> range with one fewer digits, a range which goes _through_ 9...
> So it is really a part of a distribution of ranges which is at a
> different level, and which also has a probability of stopping at or
> after 1...
>
> An obvious source of such stopping is the year - we've only just begun
> y2k, so many of our years have started with 1 (yet another counting
> issue).

In addition:

Days: one begins at least 11/31 of the days of each month.

Months: one begins 1/3 of the months of each year: {1, 10, 11 and 12}.

If months were binary then the numerical months would be:
{0001, 0010, 0011, 0100, 0101, 0110, 0111, 1000, 1001, 1010, 1011, 1100}

And because we discard leading zeros the proportion of ones would be 100%.
And in hexadecimal the proportion of months starting with one would be 1/12.

Also when counting in sets of {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} the leading
one is overrepresented compared to the set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.

Hours follow the same pattern as months where the leading 1 is represented
in 1/3 of the hours. Minutes and seconds underrepresent 6, 7, 8 and 9
(since minutes and seconds stop at 59.999...).

Regards,
Adam

Erann Gat

unread,
Jun 12, 2003, 10:28:26 PM6/12/03
to
In article <zE7Ga.29$9D3...@paloalto-snr1.gtei.net>, Barry Margolin
<barry.m...@level3.com> wrote:

> In article <vei11q4...@corp.supernews.com>,
> Jeff Massung <j...@NOSPAM.mfire.com> wrote:
> >Barry Margolin <barry.m...@level3.com> wrote in
> >news:zx6Ga.21$9D3...@paloalto-snr1.gtei.net:
> >
> >> In article <none-3A8EBB.1...@news.vanderbilt.edu>,
> >> sv0f <no...@vanderbilt.edu> wrote:
> >>>The explanations that have been offered strike me as
> >>>wrong. I think the reason for the first digit law is
> >>>not mathematical, but social/psychological.
> >>
> >> In the particular case of the digit 1, I suspect its frequency is
> >> highly skewed because exact powers of 10 (1, 10, 100, 1000, etc.) come
> >> up very frequently.
> >>
> >
> >How can you think that 1000-1999 is more common than 2000-2999?
>
> I don't. I think that the specific numbers 1, 10, 100, 1,000, and
> 1,000,000 are more common than everything else. And these are skewing the
> statistics in favor of numbers beginning with 1.

Nope. This phenomenon occurs in distributions of numbers that refer to
"natural" quantities, and it is independent of units. It's a
straightforward result of the fact that most physical quantities are
essentially random numbers drawn from finite distributions. Here's a
quick demo that illustrates this:

(defun benford (&optional (n 10000) (base 10) (limit 100000000))
  (let ((histogram (make-array base :initial-element 0)))
    (dotimes (i n)
      (let ((digit (char (princ-to-string (random (random limit))) 0)))
        (incf (svref histogram (parse-integer (string digit) :radix base)))))
    histogram))

? (benford)
#(0 2444 1815 1435 1170 943 726 612 517 338)
?

The common usage of 1,10,100 etc. may contribute to this, but most of the
effect is the result of simple statistics and probability theory.

E.

Simon Katz

unread,
Jun 13, 2003, 5:19:30 AM6/13/03
to
"Erann Gat" <g...@jpl.nasa.gov> wrote in message
news:gat-120603...@192.168.1.51...

> (defun benford (&optional (n 10000) (base 10) (limit 100000000))
> (let ( (histogram (make-array base :initial-element 0)) )
> (dotimes (i n)
> (let ( (digit (char (princ-to-string (random (random limit)))
> 0)))
> (incf (svref histogram (parse-integer (string digit)
> :radix base)))))
> histogram))
>

A nitpick:

In the middle of that, it should be
(random (1+ (random limit)))
otherwise you can call RANDOM with an argument of zero. This tends to
happen when LIMIT is small or N is large.

Or, if you don't want zeros in the histogram:
(1+ (random (1+ (random limit))))

Jacek Generowicz

unread,
Jun 13, 2003, 5:51:23 AM6/13/03
to
sv0f <no...@vanderbilt.edu> writes:

[Musings about Benford's law ... others contribute opinions further
down thread]


> I've recently run into a compelling mathematical problem
> called the "first digit phenomenon" or "first digit law".

Heh, of course if you look at the ratios of frequencies of _second_
digits, there is still a bias towards the smaller ones, though not as
pronounced. And, of course, it also holds for all subsequent digits
too, getting less significant as you go farther into the number's
representation.

> I think the reason for the first digit law is not mathematical, but
> social/psychological.

I strongly disagree.

The sociological issues are a red herring. Benford's Law still holds
if you change the base in which you represent your numbers, or if you
multiply all your numbers by any constant factor.

For example, take your sample of numbers and multiply each of them by
3. All numbers starting with a 1 are mapped to numbers starting with a
3, 4 or 5:

[1,2) * 3 --> [3,6)

On the other hand, numbers which NOW (after the scaling by
3) start with a 1, previously must have started with 3,4,5 or 6:

[.333333,.66666) * 3 --> [1,2)

So, the numbers which used to "belong" to 1 are shared out among three
different leading digits; however, numbers which now "belong" to 1 were
aggregated from 4, 5, and subsets of 3 and 6. There is a strong bias
towards starting with a 1!

[Note: Benford's Law breaks down in the presence of "sociological"
numbers; if you start off with lots of 1s, 10s, 100s etc. you will end
up with lots of 3s, 30s, 300s, etc ... there will be more numbers
starting with 3 than there should be. This is one way to detect
whether your sample is "natural" or not - I've heard it rumoured that
Benford's Law is used in checking tax returns.]

Another (thought) experiment you might like to try is to distribute
points uniformly on logarithmic graph paper. Then write down all the
coordinates of the points, and you'll see that Benford's Law is
beautifully obeyed. Why? Because the distance between successive pairs
of lines on logarithmic graph paper gets progressively smaller, until
you reach a line labelled with a 1, at which point the distance gets
larger again. 1 gets more that its fair share of points.

Another way to look at it: The largest number starting with 1 (for a
given order of magnitude), ie 1.999999...x10^N, is twice as large as
the smallest number starting with 1 (for the same order of magnitude)
ie 1x10^N; but if you do the same comparison for numbers starting with
2, (2.999... / 2), the factor is smaller: only 1.5. For numbers
starting with 3 the factor is only 1.3333... and so on.

The derivation of Newcomb's formula

> In 1881, Simon Newcomb [...] conjectured that the probability that
> the first digit of a number is d is (log (+ 1 (/ d)) 10).

is left as a simple exercise to the reader :-)
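
Evaluating the quoted formula directly gives the usual table
(values rounded):

(loop for d from 1 to 9
      collect (float (log (+ 1 (/ d)) 10)))
;; => approximately (0.301 0.176 0.125 0.097 0.079 0.067 0.058 0.051 0.046)
;; and these nine probabilities sum to 1.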

Benford's Law will hold for numbers which describe anything that is
scale free. Some examples:

The size of humans has a clear scale: We are about 2m long; people's
heights will not obey Benford's law. (If expressed in feet, 5 will
dominate.)

The size of living organisms has no scale: anything from the
unicellular to Blue Whales is possible (and that's just on _this_
planet); the size of living organisms will follow Benford's Law.

Benford's law is a reflection of the fact that small things are common,
while large things are uncommon: the larger, the less common.

> Zipf's law is an empirical generality that was noticed
> more than a century ago. It states that the relative
> word frequency of a word w in a text is inversely
> proportional to its rank r: f(w) ~ (/ r).
>
> Actually, Zipf's law is f(w) ~ (/ (* r (log (* 1.78 R))))
> where R is the total number of words.

> My intuition is that numbers in the real world behave
> less like Platonic ideals and more like words. In
> other words, they follow Zipf's law. The questions,
> then, are:
> (1) Can Zipf's law be applied to numbers?

I think that you are asking the question the wrong way around. I'd ask
whether Benford's Law can be applied to word frequencies. Take a text
(a biiiig one, please) count the frequencies of all the words ... and
you should find that there are relatively few big frequencies, and
lots and lots of tiny frequencies ... and I wouldn't be in the least
surprised if about 30% of the frequencies started with 1 (even when
you scale all numbers by some constant.)

Do let me know the result if you actually do this.
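In case anyone wants to try it, a rough sketch of the experiment (the function
name and the crude split-on-spaces tokenizing are my own assumptions):

(defun leading-digits-of-word-frequencies (pathname)
  (let ((counts (make-hash-table :test #'equal))
        (histogram (make-array 10 :initial-element 0)))
    ;; Count word occurrences, splitting each line on spaces only.
    (with-open-file (in pathname)
      (loop for line = (read-line in nil)
            while line
            do (loop for start = 0 then (1+ end)
                     for end = (position #\Space line :start start)
                     for word = (string-downcase (subseq line start end))
                     unless (string= word "")
                       do (incf (gethash word counts 0))
                     while end)))
    ;; Tally the leading digit of each word's frequency.
    (loop for frequency being the hash-values of counts
          do (incf (aref histogram
                         (digit-char-p (char (princ-to-string frequency) 0)))))
    histogram))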

At this point I briefly (about 5 seconds) toyed with the idea of
deriving Zipf's Law from Benford's, but my immediate reaction is that
the cyclic nature of the foundation of Benford's Law (as you increase
your number, the first digit cycles through 1,2,3,...,1,2,3,...) is
... well, let's say it just looks unlikely to be possible.

Joe Marshall

unread,
Jun 13, 2003, 9:36:42 AM6/13/03
to
Pascal Bourguignon <sp...@thalassa.informatimago.com> writes:

> There's no random numbers in the real world.

Sure there are, but some are more random than others.

Ed Symanzik

unread,
Jun 13, 2003, 9:35:45 AM6/13/03
to
Rather than calling leftmost-digit for every n.

(defun ones (n)
  (if (= 1 (leftmost-digit n))
      (+ (floor (/ (expt 10 (floor (log n 10))) 9))
         (+ 1 (- n (expt 10 (floor (log n 10))))))
      (floor (/ (expt 10 (+ 1 (floor (log n 10)))) 9))))
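ONES counts how many integers in [1, n] have 1 as their leading decimal
digit. LEFTMOST-DIGIT is the helper from earlier in the thread; a minimal
iterative sketch of it, in case anyone wants to run the above as-is:

(defun leftmost-digit (n)
  ;; Drop trailing digits until only the most significant one is left.
  (loop while (>= n 10)
        do (setf n (floor n 10)))
  n)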

Raymond Toy

unread,
Jun 13, 2003, 10:45:50 AM6/13/03
to
>>>>> "sv0f" == sv0f <no...@vanderbilt.edu> writes:

sv0f> Raymond and Thomas noted that round off errors can
sv0f> sabotage the exponent/log approach for sufficiently
sv0f> large numbers. The iterative approach does not
sv0f> suffer this flaw.

(log n 10d0) will lose significance if the result is greater than 2^50
or so, assuming IEEE double-floats. That means n is 10^(2^50). That
probably won't fit in your computer, so the log approach is probably
good and should be quite fast.

(Assuming I didn't make a mistake and publicly embarrass myself here.)

Ray

sv0f

unread,
Jun 13, 2003, 10:57:32 AM6/13/03
to
In article <tyf8ys6...@pcepsft001.cern.ch>,
Jacek Generowicz <jacek.ge...@cern.ch> wrote:

>[Musings about Benford's law ... others contribute opinions further
>down the thread]

Interesting thoughts. I'll make just a few comments because
you have provided much for me to think about.

>Heh, of course if you look at the ratios of frequencies of _second_
>digits, there is still a bias towards the smaller ones, though not as
>pronounced. And, of course, it also holds for all subsequent digits
>too, getting less significant as you go farther into the number's
>representation.

>Benford's Law still holds


>if you change the base in which you represent your numbers, or if you
>multiply all your numbers by any constant factor.

>Benford's Law will hold for numbers which describe anything that is
>scale free. Some examples:

Any account which I come up with will definitely have to
be consistent with these fascinating observations.

>I think that you are asking the question the wrong way around. I'd ask
>whether Benford's Law can be applied to word frequencies. Take a text
>(a biiiig one, please) count the frequencies of all the words ... and
>you should find that there are relatively few big frequencies, and
>lots and lots of tiny frequencies ... and I wouldn't be in the least
>surprised if about 30% of the frequencies started with 1 (even when
>you scale all numbers by some constant.)
>
>Do let me know the result if you actually do this.

I will do this (in the future, post dissertation) and report
back to comp.lang.lisp and to you.

>At this point I briefly (about 5 seconds) toyed with the idea of
>deriving Zipf's Law from Benford's, but my immediate reaction is that
>the cyclic nature of the foundation of Benford's Law (as you increase
>your number, the first digit cycles through 1,2,3,...,1,2,3,...) is
>... well, let's say it just looks unlikely to be possible.

What do you make of the cycling? This is the invariant, it
seems to me, even as bases and multiplicative constants and
the like can be changed. I suspect that when I dig deeper,
the cycling will prove crucial for explaining the robustness
and ubiquity of the first digit law.

Jeff Massung

unread,
Jun 13, 2003, 11:04:32 AM6/13/03
to
Adam Warner <use...@consulting.net.nz> wrote in
news:pan.2003.06.13...@consulting.net.nz:

> Hi Duane Rettig,
>
>>> One of the arguments made in these kinds of discussions is that it's
>>> not numbers we're taking about, but rather things or physical
>>> quantities. It's their representation in numeric form that gives
>>> rise to the skew, if skew there is in a particular sample.
>>
>> I disagree - I think it's a simple combination of counting and
>> getting tired. Very intuitive to me:
>>
>> One tends to count starting at 0 or 1, but stopping arbitrarily. So
>> given any range, numbers starting with 1 will usually be in that
>> range.
>>

> Days: one begins at least 11/31 of the days of each month.


>
> Months: one begins 1/3 of the months of each year: {1, 10, 11 and 12}.
>

> Hours follow the same pattern as months where the leading 1 is
> represented in 1/3 of the hours. Minutes and seconds underrepresent 6,
> 7, 8 and 9 (since minutes and seconds stop at 59.999...).

So again, we're talking about number usage. My comment was simply based
on pulling random numbers. Although, I would be curious as to what
numbers are more common based on culture, race, etc.

Barry Margolin

unread,
Jun 13, 2003, 11:22:09 AM6/13/03
to
In article <none-DA3D1C.0...@news.vanderbilt.edu>,

sv0f <no...@vanderbilt.edu> wrote:
>What do you make of the cycling? This is the invariant, it
>seems to me, even as bases and multiplicative constants and
>the like can be changed.

Thinking about bases suggests that maybe there's an inductive proof.
Consider that in base 2, all numbers except 0 begin with 1.

Pascal Bourguignon

unread,
Jun 13, 2003, 11:52:04 AM6/13/03
to

There are two kinds of "random". Random when the result has no
cause, like in a pure mathematical random function, for example, the
nth digit of Pi (written in any base you want), or Random when we
don't know the cause. But in the latter case, it's not random, there
is a cause. In nature, there are causes and consequences leading to
the numbers we find in nature. We don't know these causes, so we may
approach these data with statistical methods and use pure mathematical
random functions to model them, but there's a fundamental
difference between our models and the reality. That's what Benford's
Law shows.

And at the quantum level, we simply don't know if there's pure random
or if there is an underlying cause.

It would be interesting to analyze quantum mechanics numbers with
respect to Benford's Law.

Pascal Bourguignon

unread,
Jun 13, 2003, 12:06:50 PM6/13/03
to

Jacek Generowicz <jacek.ge...@cern.ch> writes:
> Benford's Law will hold for numbers which describe anything that is
> scale free. Some examples:
>
> The size of humans has a clear scale: We are about 2m long; people's
> heights will not obey Benford's law. (If expressed in feet, 5 will
> dominate.)
>
> The size of living organisms has no scale: anything from the
> unicellular to Blue Whales is possible (and that's just on _this_
> planet); the size of living organisms will follow Benford's Law.
>
> Benfords law is a reflection of the fact that small things are common,
> while large things are uncommon: the larger, the less common.

Not small things. Some things perhaps. Look what happens when you take
the inverse:

[25]> (defun benford (&key (n 10000) (base 10) (limit 100000000)
                           (map (function identity)))
        (let ((histogram (make-array base :initial-element 0)))
          (dotimes (i n)
            (let ((digit (char (princ-to-string
                                (funcall map (random (1+ (random limit)))))
                               0)))
              (incf (svref histogram (parse-integer (string digit) :radix base)))))
          histogram))

BENFORD
[26]> (benford :map (lambda (x) (truncate (* 100000000 (/ 100000000.0 x)))))
#(0 3082 2138 1403 994 673 567 461 354 328)

So you could take the sizes of animals in units of the longest whale
and still get the same kind of result.

Paul Wallich

unread,
Jun 13, 2003, 1:07:20 PM6/13/03
to
Pascal Bourguignon wrote:

> Joe Marshall <j...@ccs.neu.edu> writes:
>
>>Pascal Bourguignon <sp...@thalassa.informatimago.com> writes:
>>
>>
>>>There's no random numbers in the real world.
>>
>>Sure there are, but some are more random than others.
>
>
> There are two kinds of "random". Random when the result has no
> cause, like in a pure mathematical random function, for example, the
> nth digit of Pi (written in any base you want), or Random when we
> don't know the cause. But in the latter case, it's not random, there
> is a cause. In nature, there are causes and consequences leading to
> the numbers we find in nature. We don't know these causes, so we may
> approach these data with statistical methods and use pure mathematical
> random functions to model them, but there's a fundamental
> difference between our models and the reality. That's what Benford's
> Law shows.
>
> And at the quantum level, we simply don't know if there's pure random
> or if there is an underlying cause.
>
> It would be interesting to analyze quantum mechanics numbers with
> respect to Benford's Law.

How big a series/dataset do you have to have before Benford's law
becomes reliable? There just aren't that many interesting numbers in
quantum mechanics. (And in the last version that I learned, most of them
were 1 by definition.)

paul

Joe Marshall

unread,
Jun 13, 2003, 1:32:49 PM6/13/03
to
Raymond Toy <t...@rtp.ericsson.se> writes:

No, the math is correct.

Unfortunately, the implementation might not work. Most lisp
implementations compute the log function by first converting the
number to a float, *then* computing the log. So you could go up to
about most-positive-double-float (10^308 << 10^(2^9) on my machine)

sv0f

unread,
Jun 13, 2003, 2:26:53 PM6/13/03
to
In article <gat-120603...@192.168.1.51>,
g...@jpl.nasa.gov (Erann Gat) wrote:

>This phenomenon occurs in distributions of numbers that refer to
>"natural" quantities, and it is independent of units. It's a
>straightforward result of the fact that most physical quantities are
>essentially random numbers drawn from finite distributions. Here's a
>quick demo that illustrates this:
>
>(defun benford (&optional (n 10000) (base 10) (limit 100000000))
> (let ( (histogram (make-array base :initial-element 0)) )
> (dotimes (i n)
> (let ( (digit (char (princ-to-string (random (random limit))) 0)) )
> (incf (svref histogram (parse-integer (string digit) :radix base)))))
> histogram))
>
>? (benford)
>#(0 2444 1815 1435 1170 943 726 612 517 338)
>?
>
>The common usage of 1,10,100 etc. may contribute to this, but most of the
>effect is the result of simple statistics and probability theory.

Interesting. My first thought was that "simple statistics and
probability theory" might explain the first digit phenomenon,
but would not generalize to other digit positions. So I
tweaked your code slightly to allow the digit position of
interest to be specified:

? (defun benford (&optional (n 10000) (digit-position 0) (base 10)
                            (limit 100000000))
    (let ((histogram (make-array base :initial-element 0)))
      (dotimes (i n)
        (let ((num-string (princ-to-string (1+ (random (1+ (random limit)))))))
          (if (< digit-position (length num-string))
              (let ((digit (char num-string digit-position)))
                (incf (svref histogram (parse-integer (string digit)
                                                      :radix base))))
              (incf (svref histogram 0)))))
      (dotimes (i base)
        (format t "~&~A: ~,4F" i (/ (svref histogram i) n)))))
BENFORD

and ran a few tests:

? (benford 100000 0) ; same as Erann's original (benford) but with a larger n
0: 0.0000
1: 0.2403
2: 0.1832
3: 0.1460
4: 0.1174
5: 0.0942
6: 0.0764
7: 0.0609
8: 0.0474
9: 0.0342
NIL
? (benford 100000 1) ; Count second digits
0: 0.1116
1: 0.1095
2: 0.1056
3: 0.1032
4: 0.1013
5: 0.0990
6: 0.0949
7: 0.0945
8: 0.0920
9: 0.0885
NIL
? (benford 100000 2) ; Count third digits
0: 0.0998
1: 0.1009
2: 0.1011
3: 0.0996
4: 0.0998
5: 0.0990
6: 0.1000
7: 0.0992
8: 0.1002
9: 0.1003
NIL
? (benford 100000 3) ; Count fourth digits
0: 0.1008
1: 0.0999
2: 0.0998
3: 0.1005
4: 0.0994
5: 0.1003
6: 0.0987
7: 0.0995
8: 0.0998
9: 0.1014
NIL
?

It looks like your reasoning generalizes to other digit positions,
contrary to my initial intuition. (Of course, things are a bit
fuzzy, but I assume the pattern is more crisp for n >> 100,000.)

And, on second thought, I understand why the pattern persists
for digit positions other than the first.

HOWEVER, the proportion of 1s according to your analysis -- 0.24
or so -- is quite a bit below the 0.30 to 0.33 predicted by
Benford's analysis and Zipf's law, and observed in the world.
Unless this discrepancy is an artifact of MCL's pseudorandom
number generator, I take it as an indicator that there is at
least one other factor at play. It looks like this factor (or
these factors) is captured by the 1/n weighting.

Lars Brinkhoff

unread,
Jun 13, 2003, 2:52:19 PM6/13/03
to
Section 15-3, The Distribution of Leading Digits, in Warren's book
Hacker's Delight may also be of interest.

Erann Gat

unread,
Jun 13, 2003, 2:27:10 PM6/13/03
to
In article <87wufqn...@thalassa.informatimago.com>, Pascal Bourguignon
<sp...@thalassa.informatimago.com> wrote:

> There are two kinds of "random". Random when the result has no
> cause, like in a pure mathematical random function, for example, the
> nth digit of Pi (written in any base you want), or Random when we
> don't know the cause. But in the latter case, it's not random, there
> is a cause. In nature, there are causes and consequences leading to
> the numbers we find in nature. We don't know these causes, so we may
> approach these data with statistical methods and use pure mathematical
> random functions to model them, but there's a fundamental
> difference between our models and the reality.

That is true.

> That's what Benford's Law shows.

That is not true. Benford's law is (mostly) a straightforward (albeit
nonintuitive) consequence of pure probability theory, and the fact that
most physical quantities are drawn from bounded distributions. See the
code that I posted elsewhere in this thread, or the explanation at
mathworld.

> And at the quantum level, we simply don't know if there's pure random
> or if there is an underlying cause.

Actually, Bell's theorem and the Aspect experiment (and others) show that
there can be no "underlying cause." Quantum randomness is really random.

> It would be interesting to analyze quantum mechanics numbers with
> respect to Benford's Law.

You would get exactly the same result.

E.

Kent M Pitman

unread,
Jun 13, 2003, 3:19:29 PM6/13/03
to
I've only been spot-checking this discussion and I don't have time to re-read
it all in detail. Maybe someone can indulge me a simple question:

Is this business of the probability of a number intended to mean "the
probability it will be mentioned in human texts"? I originally read
this to mean "the density of numbers" which seems impossible to
understand. I also pondered whether it could mean "the usefulness of
a number" or "the likelihood that a number would result from a given
computation" but I could attach no serious meaning to these.

If these are based on human texts, then of course the tendency of people
to "round off" results would make sense as an effect, and would explain
why it's independent of radix (as I think I saw someone say).

Also, for people who are merchants, I'd be surprised if trailing digits
were not often 9. e.g., "9.99" rather than "10.00" is a common price.

I saw stats for "nth digit" but I think you should partition those results
into
"1st digit of several" vs "1st and last digit"
since I bet that 1 is more common in the first bucket than the second.
And you want to break up
"nth digit (for n>1) as middle digit" vs "nth digit (n>1) as last digit"
since I bet the last one is more prone to have a 9 in business.

In bit-decoded odd radixes, 7's (octal) and f's (hex) are also
probably favored since the numbers 7777...777 and ffff...fff are
probably favored, as well as because the low-order bits are more
likely to be the "easiest to think of and most commonly occurring
features, with the 1 value being the most common yes value".

The fact of radix=10 and human-fingers=5 probably gives 5 and 0 a
favored value for trailing digits, too.

If this is NOT the kind of stuff you're investigating, then I am lost
and would appreciate some brief summary and/or pointer back to the
right message to read.

Ed Symanzik

unread,
Jun 13, 2003, 3:16:15 PM6/13/03
to
sv0f wrote:
> He conjectured

> that the probability that the first digit of a number is
> d is (log (+ 1 (/ d)) 10). This expression predicts that
> approximately 30% of all numbers begin with 1;

Ok. I am an utter wing nut.

for N = 100000

100000 1
10000 - 19999 10000
1000 - 1999 1000
100 - 199 100
10 - 19 10
1 1
number of 1's = 11112

This is only 11%. Where am I missing 20%?

Christophe Rhodes

unread,
Jun 13, 2003, 3:27:50 PM6/13/03
to
Ed Symanzik <z...@msu.edu> writes:

> Ok. I am an utter wing nut.
>
> for N = 100000
>
> 100000 1
> 10000 - 19999 10000
> 1000 - 1999 1000
> 100 - 199 100
> 10 - 19 10
> 1 1
> number of 1's = 11112
>
> This is only 11%. Where am I missing 20%?

for N = 20000

10000 - 19999: 10000
1000 - 1999: 1000
100 - 199: 100
10 - 19: 10
1 : 1

number of 1s = 11111

This is 56%. You happened to choose a range that minimized the number
of 1s. I've chosen one that maximizes it. Other ranges' mileages
will vary. :-)

(the point being: choose the range randomly; then consider a sample
drawn from that range. You'll get 30% or so 1s, for sufficient
random choices of ranges)
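A tiny sketch of that point (the function name is mine, not from the thread):
the fraction of integers in [1, N] with leading digit 1 swings between roughly
1/9 and 5/9 depending on where N falls, and only averages out near 30% over
random choices of N.

(defun leading-1-fraction (n)
  (float (/ (loop for i from 1 to n
                  count (char= (char (princ-to-string i) 0) #\1))
            n)))

;; (leading-1-fraction 100000) => ~0.11112  (the quoted choice of N)
;; (leading-1-fraction 20000)  => ~0.55555  (the choice of N above)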

Christophe
--
http://www-jcsu.jesus.cam.ac.uk/~csr21/ +44 1223 510 299/+44 7729 383 757
(set-pprint-dispatch 'number (lambda (s o) (declare (special b)) (format s b)))
(defvar b "~&Just another Lisp hacker~%") (pprint #36rJesusCollegeCambridge)

sv0f

unread,
Jun 13, 2003, 4:03:36 PM6/13/03
to
In article <sfwd6hh...@shell01.TheWorld.com>,

Kent M Pitman <pit...@world.std.com> wrote:

>I've only been spot-checking this discussion and I don't have time to re-read
>it all in detail. Maybe someone can indulge me a simple question:
>
>Is this business of the probability of a number intended to mean "the
>probability it will be mentioned in human texts"?

This is the intent, but "text" can be broadly defined to
include, for example, published tables of the areas of
major rivers in addition to more standard examples
such as newspapers and Reader's Digest.

>Also, for people who are merchants, I'd be surprised if trailing digits
>were not often 9. e.g., "9.99" rather than "10.00" is a common price.

As I mentioned in another message buried in this thread,
I played around with Google a bit looking for anomalous
frequency spikes (as measured by their "number of pages"
approximation) for what I thought to be common numbers.
For example, I chose Chicago's area code (312) and looked
at the few numbers immediately before and after it. The
spike was modest at best:

310: 8,180,000
311: 5,830,000
312: 6,290,000 <--
313: 5,290,000
314: 5,270,000

Thinking maybe the 312 spike would be larger for pages
living in the US, I re-did the search, limiting the
results to English language pages. The results were
qualitatively the same:

310: 4,850,000
311: 3,280,000
312: 3,680,000 <--
313: 3,010,000
314: 3,020,000

None of this is scientific, of course. But I was struck
by how my intuition that certain numbers would be special
was not borne out.

(And, in more extensive investigations, that the first
digit law was at least not inconsistent with what Google
was telling me.)

>If this is NOT the kind of stuff you're investigating, then I am lost
>and would appreciate some brief summary and/or pointer back to the
>right message to read.

I made the initial query about testing the first digit of a
number. Geoffrey Summerhayes asked me to back up a bit and
explain why I need to do such a thing. Below is the text of
my response:

>In article <9v2Ga.12269$Gm4.1...@news20.bellglobal.com>,
> "Geoffrey Summerhayes" <sumNOS...@hotmail.com> wrote:
>
>>Just a thought, are you sure the input has to be an integer?
>>All the times I've required a function like this, I've found
>>using a different representation for the data eliminated the
>>problem, it's a pretty odd test.
>
>It is an odd test, I know. Let me come clean.


>
>I've recently run into a compelling mathematical problem
>called the "first digit phenomenon" or "first digit law".

>In 1881, Simon Newcomb noticed that books of logarithms in
>libraries were dirtier at the beginning than the end. It
>seemed as though his colleagues were looking up the
>logarithms of numbers beginning with "1" more frequently
>than numbers beginning with other digits. He conjectured


>that the probability that the first digit of a number is
>d is (log (+ 1 (/ d)) 10). This expression predicts that

>approximately 30% of all numbers begin with 1; at the
>other end of the spectrum, only 4.6% begin with 9.
>
>In 1938, Frank Benford, an American physicist,
>independently made the same observation. He looked
>at the distribution of first digits culled from a number
>of sources, including tables of river areas, populations,
>constants, newspapers, specific heats, pressures, atomic
>weights, Reader's Digest, addresses, death rates, etc.
>The conjecture was consistent with the data.
>
>Since then, a trickle of statisticians and mathematicians
>have tried to explain the first digit law. I ran into
>it in an interview with the statistician Persi Diaconis
>in the edited volume by Albers and Alexanderson
>"Mathematical People".
>
>(I highly recommend reading this interview and the one
>with Benoit Mandelbrot in the same book. The sequel
>"More Mathematical People" contains an interview with
>Bill Gosper, to (briefly) bring the subject back to Lisp!)
>
>A Google search will quickly put you in contact with
>expository work and ongoing research by Boyle, Diaconis,
>Hill, and Raimi, as well as an article that appeared in
>The New York Times in 1998. Let me know if anyone wants
>specific references; I even have a few PDFs.


>
>The explanations that have been offered strike me as
>wrong. I think the reason for the first digit law is
>not mathematical, but social/psychological.
>

>Zipf's law is an empirical generality that was noticed
>more than a century ago. It states that the relative
>word frequency of a word w in a text is inversely
>proportional to its rank r: f(w) ~ (/ r).
>
>Actually, Zipf's law is f(w) ~ (/ (* r (log (* 1.78 R))))
>where R is the total number of words.
>

>Two definitions are in order. The "relative word
>frequency" of a word is its frequency divided by the
>frequency of the most frequent word. In other words,
>the most frequent word has a relative word frequency
>of 1. Also, the rank of a word is its position in a
>list of words ordered from most frequent to least
>frequent. In English, for example, "the" has rank 1,
>"and" rank 2, etc.


>
>My intuition is that numbers in the real world behave
>less like Platonic ideals and more like words. In
>other words, they follow Zipf's law. The questions,
>then, are:
> (1) Can Zipf's law be applied to numbers?

> (2) If so, does the first digit law result?
> (3) If so, can I get more empirical evidence
> for the first digit law than that provided
> by Newcomb and Benford?
>
>With respect to (1), the only real problem is that
>the original formulation of Zipf's law requires
>knowing the total number of numbers, R. Of course,
>R is infinite for the case of numbers -- this is a
>problem. However, I can think of two ways around
>this problem.
>
>First, although the number of numbers may be infinite,
>the numbers that appear in the world -- in printed
>matter, on the web, etc. -- are not. As an approximation,
>MOST-POSITIVE-FIXNUM for MCL 4.1, the number of electrons
>in the universe, or the number of possible chess games
>might do.
>
>Second, and more importantly, if I compute f(w) for each
>number in some range of interest according to the simple
>(reciprocal) formulation of Zipf's law, then I can divide
>the sum of the relative frequency of all numbers beginning
>with 1 by the sum of the relative frequency of all numbers
>to get the proportion of all numbers beginning with 1.
>This is because in the division, all the (log (* 1.78 R))
>terms will cancel out. In other words, as long as I
>compute in this laborious fashion, I'm safe.
>
>With respect to (2), the first digit law does in
>fact result if one applies Zipf's law to numbers. First,
>a critical assumption must be made:
> (A) The rank of a number w is simply w.
>In other words, 1 is the most frequent number, 2 the second
>most frequent number, etc. (This assumption can be tested
>empirically as described below; it appears to hold.)
>
>Under (A), the code:
>
>(defun prob-one (n &key epoch)
> (when (or (not (integerp n))
> (< n 1)
> (>= n 1000000000))
> (error "N must be an integer in [1,999999999]."))
> (unless (or (null epoch)
> (and (integerp epoch)
> (<= 1 epoch n)))
> (error "EPOCH must be either NIL or an integer in [1,~A]." n))
> (let ((num 0)
> (den 0))
> (do* ((i 1 (1+ i))
> (term (/ 1 i) (/ 1 i)))
> ((> i n))
> (when (or (= i 1)
> (<= 10 i 19)
> (<= 100 i 199)
> (<= 1000 i 1999)
> (<= 10000 i 19999)
> (<= 100000 i 199999)
> (<= 1000000 i 1999999)
> (<= 10000000 i 19999999)
> (<= 100000000 i 199999999))
> (incf num term))
> (incf den term)
> (when (and epoch
> (zerop (mod i epoch)))
> (format t "~&~A:~A~A" i #\tab (float (/ num den)))))
> (float (/ num den))))
>
>produces:
>
>? (prob-one 999999999 :epoch 10000)
>10000: 0.31753649107573223
>20000: 0.3626650915013386
>30000: 0.3491576030219523
>40000: 0.34016832787531365
>50000: 0.3335081987176785
>60000: 0.32825701315681066
>70000: 0.3239445091319402
>80000: 0.32029940436686627
>90000: 0.3171516123911374
>100000: 0.3143886117731376
>110000: 0.319751187767352
>120000: 0.32457410795821995
>130000: 0.3289507653399085
>140000: 0.3329526520875796
>150000: 0.33663565393452183
>160000: 0.3400442456218855
>170000: 0.34321437227408086
>180000: 0.3461754799391808
>190000: 0.3489519789920579
>200000: 0.35156392865612124
>210000: 0.3502272148265025
>220000: 0.34896212882018485
>230000: 0.3477617972961802
>240000: 0.3466202718086957
>250000: 0.3455323716123922
>260000: 0.3444935583083216
>270000: 0.3434998349218677
>280000: 0.3425476639347623
>290000: 0.3416339001707058
>300000: 0.3407557354302229
>310000: 0.33991065249977476
>320000: 0.3390963867004261
>330000: 0.33831089354589533
>340000: 0.3375523213857245
>350000: 0.3368189881428105
>360000: 0.336109361434329
>370000: 0.33542204150464894
>380000: 0.3347557465080133
>390000: 0.3341092997647813
>400000: 0.33348161868326376
>410000: 0.33287170509366826
>420000: 0.3322786367844269
>430000: 0.3317015600665375
>440000: 0.33113968322026466
>[...]
>
>This output is highly suggestive, isn't it? It looks
>like the proportion of numbers beginning with 1
>oscillates between 0.36 and 0.31, at least in the
>range of numbers examined. The oscillations make
>sense if you think about it. Interestingly (and
>again obviously, if you think about it), they occur
>in intervals orders of magnitude smaller or larger.
>In other words, a similar pattern occurs between,
>say, 1 and 44000 when the proportion is printed every
>1000 numbers.
>
>I suspect 0.33 is the expected proportion of numbers
>beginning with 1 if Zipf's law applies to numbers.
>(Can you guess why?)
>
>The function has its good and bad points. The bignums
>and rational numbers of Common Lisp mean no round off
>error is accumulating along the way. This is good.
>However, it isn't terribly efficient. In fact, it is
>a memory pig.
>
>Finally, with respect to (3), I have only done some
>preliminary investigations. Does the first digit
>law hold for web pages? A little messing around
>with Google was encouraging, but more work remains
>to be done. For example, one problem with Google
>is that the count of the total number of pages
>containing a numerical search term is rounded quite
>severely. Google can also be used to test the
>empirical merit of assumption (A) above. Once again,
>preliminary investigations indicate that it is
>warranted.
>
>Alas, I have to put this project away for a few
>months and finish a dissertation. However, any ideas,
>feedback, etc., are welcome.


>
>For example, it would be nice to have a closed form
>expression for the sum of 1/n where n ranges between
>the integers i and j, instead of having to iterate a
>la PROB-ONE above. (Of course, the sum for i=1 and
>j=infinity is infinity.) A quick perusal of tables of
>formulas in my mathematics references was not helpful
>in this regard, but perhaps someone here can be of some
>help. I'll even offer a Knuthian $2.56 for the first
>claimant!
>

>Sashank Varma
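On the closed-form question near the end of the quoted message: the partial
sums of 1/n are differences of harmonic numbers, i.e. the sum of 1/n for n
from i to j is H(j) - H(i-1), and H(n) has the classical approximation
ln n + gamma + 1/(2n) - 1/(12n^2), with gamma the Euler-Mascheroni constant.
A minimal double-float sketch (the function names are mine):

(defconstant +euler-gamma+ 0.5772156649015329d0)

(defun harmonic (n)
  ;; H(n) = 1 + 1/2 + ... + 1/n; exact summation for small n,
  ;; asymptotic expansion otherwise.
  (if (< n 10)
      (loop for k from 1 to n sum (/ 1d0 k))
      (+ (log (float n 1d0)) +euler-gamma+
         (/ 1d0 (* 2 n))
         (- (/ 1d0 (* 12 n n))))))

(defun harmonic-sum (i j)
  ;; Approximates the sum of 1/n for n from i to j without iterating.
  (- (harmonic j) (harmonic (1- i))))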

Pascal Bourguignon

unread,
Jun 14, 2003, 4:23:56 AM6/14/03
to
Paul Wallich <p...@panix.com> writes:
> How big a series/dataset do you have to have before Benford's law
> becomes reliable?

It does not seem to need a lot of numbers. With a few tens it's
already visible (while perhaps not reliable: there can easily be spikes
around other digits with so few numbers).


> There just aren't that many interesting numbers in
> quantum mechanics. (And in the last version that I learned, most of
> them were 1 by definition.)
>
> paul

Measures, like, sequences of coordinates of an electron. Or series of
velocities of an electron.

Well, you could even start from series of velocities of Brownian
particles. That's not quantum mechanics, but it should be interesting
to see if it's natural random or mathematical random at this scale too.

Martin Rubey

unread,
Jun 14, 2003, 5:00:48 AM6/14/03
to
Now I'm confused again: sv0f replied to Kent that it's about numbers in human
texts, while your (plausible) explanation below suggests it's about probability. So
what is it about?

Martin

Christophe Rhodes

unread,
Jun 14, 2003, 5:42:52 AM6/14/03
to
Martin Rubey <mrst...@yahoo.de> writes:

> Now I'm confused again, sv0f replied to Kent that it's about numbers in human
> texts, your (plausible) explanation below suggests it's about probability. So
> what is it about?

Both. One subsumes the other.

Let's imagine that we're writing the "river length" section of an
almanac. We'll take a bunch of rivers, measure their lengths (with
some previously agreed-upon yardstick, to avoid fractal effects) and
report them all.

River lengths could be seen as being drawn from some underlying
population; probably not a uniform one, but not biased towards a
particular value either. So when we sample this distribution (we're
not going to be tabulating all rivers, after all, though the results
would be the same; maybe we're doing all the rivers above a certain
length in the good old U S of A) we're going to be drawing values from
a certain range that is a priori randomly chosen.

All of the category editors for the almanac will be doing much the
same thing, sampling from a random range. On average then, the number
of numbers starting with 1 in the whole almanac will be somewhere
between 11% (your case, all distributions maximally biassed against 1)
and 55% (my case, maximally biassed towards 1), but will tend to about
30%.

Just to be clear: it's not a societal artefact; it's not about numbers
in _human_ texts; numbers in klingon, wookie and cheela texts would
obey the same rules. Not all numbers follow this rule, either;
samples taken from a random Gaussian (normal, bell curve) won't follow
this law (so the table of presidents' heights and weights would count
against these general statistics :-).

Pascal Bourguignon

unread,
Jun 14, 2003, 9:25:22 AM6/14/03
to
Christophe Rhodes <cs...@cam.ac.uk> writes:
> Just to be clear: it's not a societal artefact; it's not about numbers
> in _human_ texts; numbers in klingon, wookie and cheela texts would
> obey the same rules. Not all numbers follow this rule, either;
> samples taken from a random Gaussian (normal, bell curve) won't follow
> this law (so the table of presidents' heights and weights would count
> against these general statistics :-).

For examples, you can take:

- the sizes of all the files in your file system,

- the ticket totals of all your shopping of last year,

- the elapsed time spent by each of your processes,

- the number of cars passing by a road each period of 5 min (or
longer periods for lighter traffic),

- etc.


For my shopping of last 18 months:

(list (length values)
(apply (function min) values) (apply (function max) values))

;;(276 0.5 75.64)

(list->histogram (mapcar (lambda (price) (truncate (* 100 price))) values)
:base 6)
;;#(0 118 47 45 34 32)

(list->histogram values)
;;#(3 113 52 22 18 11 11 16 14 16)

(list->histogram (mapcar (lambda (price) (truncate (* 100 price))) values)
:base 20)
;;#(0 44 55 47 32 17 19 7 8 8 10 6 4 3 4 1 0 4 2 5)


Note the effect of various transformations:

(list->histogram (mapcar (lambda (price) (truncate (* 100 (abs (sin price)))))
values))
;;#(1 15 27 26 17 25 20 30 36 79)


(list->histogram (mapcar (lambda (price) (truncate (* 100 (sqrt price))))
values))
;;#(0 39 58 85 49 26 12 6 1 0)


(list->histogram (mapcar (lambda (price) (truncate (* 100 (/ price))))
values))
;;#(0 70 36 30 27 32 28 25 15 13)
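LIST->HISTOGRAM isn't shown above; a guess at a minimal equivalent (an
assumption on my part, not the code actually used) is that it prints each
value in the requested base and tallies the first character:

(defun list->histogram (values &key (base 10))
  (let ((histogram (make-array base :initial-element 0)))
    (dolist (value values histogram)
      ;; Assumes positive values; note that floats always print in decimal,
      ;; so :base only matters for the integer-valued examples above.
      (let ((digit (char (write-to-string value :base base) 0)))
        (incf (aref histogram (digit-char-p digit base)))))))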

Kent M Pitman

unread,
Jun 14, 2003, 1:18:24 PM6/14/03
to
Christophe Rhodes <cs...@cam.ac.uk> writes:

> Let's imagine that we're writing the "river length" section of an

> almanac. [...] River lengths could be seen as being drawn from some


> underlying population; probably not a uniform one, but not biased

> towards a particular value either. All of the category editors for
> the almanac will be doing much the same thing, [...]

> Just to be clear: it's not a societal artefact; it's not about numbers
> in _human_ texts; numbers in klingon, wookie and cheela texts would
> obey the same rules. Not all numbers follow this rule, either;
> samples taken from a random Gaussian (normal, bell curve) won't follow
> this law (so the table of presidents' heights and weights would count
> against these general statistics :-).

I'm not sure I agree that there is no bias. In my youth, when I had
nothing better to do, I used to experiment with the design of
measurement systems. Calendars, rulers, etc.

People name their units so that there will be easy things to count. For
example, they introduce a new unit "week" so they can say "a week" rather
than "7 days". The reason the inch, the foot, the cubit, etc. are useful
measures is that they count real things that come up. While I like the
regularity of the metric system, one thing I don't like about it is that
it doesn't count much of interest in the units area. Most things in the
world are not a single millimeter, a single centimeter, a single decimeter.
Meters are sometimes useful, as are yards. By contrast, Fahrenheit bothers
me a lot because it moves in increments that are irrelevant. The difference
between 33 and 34 degrees cannot, I believe, be usefully sensed by someone
in Fahrenheit. The difference between 20 and 21 degrees C, I think, can
usefully be sensed. This makes centigrade (and Kelvin) a good measure of
temperature not just because it multiplies better for chemistry, but because
watching for something to "go up a degree" is meaningful.

Likewise, in the stock market, the passion for splits and reverse splits is
based around a human desire to deal in numbers that are in a certain range,
independent of their basic value. In essence, a reverse split just says
that the traders would rather deal in 1's and 2's than fractions, or would
rather trade in 1/8's than 1/32's.

(And even then, the decision of whether the "1" in "1/8", used to name a unit,
is going to be counted among the instances of 1's has at least a small effect on your
stats... As in, "a minute is 1/60 of an hour." For the most part, we don't
name units as "3/n" of something; it's usually "1/n".)

My point is only this: when analyzing the frequency of numbers in a unit
context, make sure you understand how the units are chosen because the bias
may be hidden there.

Kalle Olavi Niemitalo

unread,
Jun 14, 2003, 3:06:01 PM6/14/03
to
sv0f <no...@vanderbilt.edu> writes:

> My thinking was that I could create a single, large string with
> a fill pointer and reuse this instead of having PRINC-TO-STRING
> generate a new string each time.

String output streams presumably have something like that under
the cover. In CMUCL, such a stream carries a string that
collects the output, and a separate integer that acts as a fill
pointer. GET-OUTPUT-STREAM-STRING then copies the text to a new
string and resets the fill pointer.(1) If you accessed the buffer
directly, you could avoid the copy... but I don't think there is
a standard way to do that.

Gray streams or simple streams could help, though. Or try a
DYNAMIC-EXTENT declaration.

(1) As the buffer string can be reused, it is more efficient to
generate multiple strings with the same string output stream
than to make a new stream for each. Accordingly, CMUCL's
PRINC-TO-STRING implementation uses CL::STRINGIFY-OBJECT,
which manages a pool of string output streams.

Matthew Danish

unread,
Jun 15, 2003, 3:26:00 AM6/15/03
to
On Sat, Jun 14, 2003 at 01:18:24PM -0400, Kent M Pitman wrote:
> Fahrenheit bothers me a lot because it moves in increments that are
> irrelevant. The difference between 33 and 34 degrees cannot, I
> believe, be usefully sensed by someone in Fahrenheit.

I seem to recall from some long ago read textbook that the Fahrenheit
scale was given degree increments that were supposed to be the
``smallest increase in temperature that a human could detect.'' Also, 0
degrees Fahrenheit was the temperature of some ice/salt mixture.

--
; Matthew Danish <mda...@andrew.cmu.edu>
; OpenPGP public key: C24B6010 on keyring.debian.org
; Signed or encrypted mail welcome.
; "There is no dark side of the moon really; matter of fact, it's all dark."

Kalle Olavi Niemitalo

unread,
Jun 15, 2003, 8:35:59 AM6/15/03
to
Kalle Olavi Niemitalo <k...@iki.fi> writes:

> If you accessed the buffer directly, you could avoid the
> copy... but I don't think there is a standard way to do that.

On second thought, there is a way, via WITH-OUTPUT-TO-STRING:

(defun count-ones (limit)
  (let ((string (make-array 100 :element-type 'character :fill-pointer 0))
        (*print-pretty* nil))
    (with-output-to-string (stream string)
      (loop for number from 0 upto limit
            do (setf (fill-pointer string) 0)
               (princ number stream)
            count (char= (char string 0) #\1)))))

If the numbers are fixnums, then this does not cons per iteration
on CMUCL 18e.

thomas

unread,
Jun 15, 2003, 7:56:24 PM6/15/03
to
Matthew Danish <mda...@andrew.cmu.edu> wrote in message news:<20030615072...@lain.mapcar.org>...

> I seem to recall from some long ago read textbook that the Fahrenheit
> scale was given degree increments that were supposed to be the
> ``smallest increase in temperature that a human could detect.'' Also, 0
> degrees Fahrenheit was the temperature of some ice/salt mixture.

(from similarly dusty memories) 0 deg F was the freezing point of a
saturated salt (NaCl) solution (but you can get lower with other
salts), and 100 deg F was supposed to be the temperature of human
blood, but the person who got picked as the sample was running a
fever...

thomas

Jacek Generowicz

unread,
Jun 16, 2003, 2:40:30 AM6/16/03
to
mot...@yahoo.com (thomas) writes:

> Matthew Danish <mda...@andrew.cmu.edu> wrote:

> > Also, 0 degrees Fahrenheit was the temperature of some ice/salt
> > mixture.
>
> (from similarly dusty memories) 0 deg F was the freezing point of a
> saturated salt (NaCl) solution (but you can get lower with other
> salts), and 100 deg F was supposed to be the temperature of human
> blood, but the person who got picked as the sample was running a
> fever...

Fahrenheit (IIRC) was the first to make a mercury thermometer. When it
becomes cold enough for mercury to freeze, the thermometer doesn't
work all that well any more ... so the bottom of Fahrenheit's scale
was determined by the freezing point of mercury.

Marc Spitzer

unread,
Jun 16, 2003, 5:01:52 AM6/16/03
to
Jacek Generowicz <jacek.ge...@cern.ch> writes:

The coldest temp. in Fahrenheit is the Fahrenheit equiv of 0 kelvin,
I forget the conversion formula but it is a negative number.

The story I was told was that Fahrenheit used the coldest temp he could
create, table salt and ice, to be the 0 point on his scale and then he
had a scale that he used from that 0 point to measure temperature with.
The triple point of water happened to be 32 and the boiling point happened
to be at 212.

marc

Jacek Generowicz

unread,
Jun 16, 2003, 7:24:43 AM6/16/03
to
sv0f <no...@vanderbilt.edu> writes:

> In article <tyf8ys6...@pcepsft001.cern.ch>,
> Jacek Generowicz <jacek.ge...@cern.ch> wrote:
>
> >Do let me know the result if you actually do this.
>
> I will do this (in the future, post dissertation) and report
> back to comp.lang.lisp and to you.

Thanks ... I'm curious.

> >At this point I briefly (about 5 seconds) toyed with the idea of
> >deriving Zipf's Law from Benford's, but my immediate reaction is that
> >the cyclic nature of the foundation of Benford's Law (as you increase
> >your number, the first digit cycles through 1,2,3,...,1,2,3,...) is
> >... well, let's say it just looks unlikely to be possible.

Actually, I suspect that Benford's law gives you Zipf's Law without
specifying what the constant (1.78 as you cited it) is. The 1.78 makes
a sociological statement ... or rather, a statement about (the
English) language, about which Benford's Law has nothing to say.

(I really must find 10 minutes to sit down with pencil and
paper and sort this out.)

> What do you make of the cycling? This is the invariant, it
> seems to me, even as bases and multiplicative constants and
> the like can be changed. I suspect that when I dig deeper,
> the cycling will prove crucial for explaining the robustness
> and ubiquity of the first digit law.

Absolutely no doubt about it. Remember the logarithmic graph paper
analogy; the distance between the lines labelled 1x10^N and 2x10^N is
_always_ greater than the distance between any other pair of major
lines. Cycling ensures that you will find a [1,2) region nearby
_wherever_ you look ... all that is left is the requirement that the
shape of your distribution be scale free (not even completely scale
free, but locally scale-free over a couple of orders of magnitude (in
your chosen base) at least).

By "scale free", I mean that looking at the graph, the shape won't
change if you scale the axes ... which suggests an exponential curve
... which maps to a uniform distribution on a logarithmic scale.

(By locally scale free, I mean that it is an exponential over some
range, and 0 elsewhere.)


In summary: Benford's Law is a simple mathematical statement about
distributions in which the frequency is an exponentially decaying
function of the abscissae; it so happens that such distributions are
rather common in nature.


(OK, 5 minutes with pencil and paper, starting with the assumptions
underlying Benford's Law, I get a distribution whose shape resembles
Zipf's, but doesn't seem to be mathematically equivalent. Hmmm.)

Jacek Generowicz

unread,
Jun 16, 2003, 7:32:50 AM6/16/03
to
Marc Spitzer <mspi...@optonline.net> writes:

> > Fahrenheit (IIRC) was the first to make a mercury thermometer. When it
> > becomes cold enough for mercury to freeze, the thermometer doesn't
> > work all that well any more ... so the bottom of Fahrenheit's scale
> > was determined by the freezing point of mercury.

> The story I was told was that Fahrenheit used the coldest temp he could


> create, table salt and ice, to be the 0 point on his scale

Just looked up the melting point of Hg, and it's about -38 Fahrenheit
... which makes your version sound a lot more plausible than mine.

Pascal Bourguignon

unread,
Jun 16, 2003, 9:18:57 AM6/16/03
to
Marc Spitzer <mspi...@optonline.net> writes:


(defun cefak (k)
  "Insert at point a string with degree Kelvin, Celcius and Fahrenheit."
  (insert (format "\n%10.2f K %10.2f °C %10.2f °F"
                  k (- k 273.15) (+ 32.0 (* 1.8 (- k 273.15))))))

(defun fahrenheit (f)
  "Convert degrees Fahrenheit to Celcius and Kelvin."
  (interactive "XDegrees Fahrenheit: ")
  (cefak (+ 273.15 (/ (- f 32) 1.8))))

(defun celcius (c)
  "Convert degrees Celcius to Fahrenheit and Kelvin."
  (interactive "XDegrees Celcius: ")
  (cefak (+ 273.15 c)))

(defun kelvin (k)
  "Convert degrees Kelvin to Fahrenheit and Celcius."
  (interactive "XDegrees Kelvin: ")
  (cefak k))


(One of my first programs, originally written in L.S.E. Ok, I can't
resist giving it here:

1*BOURGUIGNON PASCAL 4*6 25-05-78
2*CEFAK
3*TRANSFORMATION DE DEGRES C EN DEGRES F ET K ET INVERSEMENT
4*
5*TOURNE
6
8 CHAINE UNIT
10 AFFICHER[/,20X,'TRANSFORMATION DES DEGRES C EN DEGRES F ET K ET INVERSEME']
12 AFFICHER['NT',/,'LORSQUE VOUS VERREZ APPARAITRE LE SIGNE ":",TAPEZ']
14 AFFICHER[' LE NOMBRE',/,'DE DEGRES,UN BLANC,L''UNITE (C,F OU K) ET ']
16 AFFICHER['SUR LES TOUCHES(CTRL X-OFF)',/,'EXEMPLE :65.3 F(CTRL X-OFF)',/]
18 AFFICHER[/,'QUAND VOUS N''AUREZ PUS DE VALEURS A DONNER,TAPEZ "FIN" A LA']
20 AFFICHER['PLACE DE L''UNITE',/,'COMPRIS?'];LIRE UNIT
22 SI UNIT='NON' ALORS ALLER EN 6
24 AFFICHER[/,10X,34'#',/,10X,'#',4X,'C*',4X,'#',4X,'F*',4X,'#,'4X,'K*']
26 AFFICHER[/,4X,'#']
28 AFFICHER[/,10X,34'#',/]
30 AFFICHER[C,79X,C,10X,'#',10X,'#',10X,'#',10X,'#']
32 AFFICHER[X,':';LIRE TEM;LIRE UNIT
34 SI UNIT='C' ALORS &CFK(TEM)
36 SI UNIT='F' ALORS &FCK(TEM)
38 SI UNIT='K' ALORS &KCF(TEM)
40 SI UNIT='FIN' ALORS ALLER EN 44
42 ALLER EN 30
44 AFFICHER[C,79X,C];TERMINER
50 PROCEDURE &CFK(C)
52 F_1.8*C+32;K_C+273.16
54 &AFF(C,F,K);RETOUR EN 28
58 PROCEDURE &FCK(F)
60 C_(F-32)/1.8;K_(F-32)/1.8+273.16
62 &AFF(C,F,K);RETOUR EN 28
66 PROCEDURE &KCF(F)
68 C_K-273.16;F_(K-273.16)*1.8+32
70 &AFF(C,F,K);RETOUR EN 28
74 PROCEDURE &AFF(C,F,K)
76 AFFICHER[C,79X,C,10X,'#',F5.2,2X,'#',F5.2,2X,'#',F52.,2X,'#']C,F,K
78 RETOUR

Pekka P. Pirinen

unread,
Jun 16, 2003, 12:11:58 PM6/16/03
to
Joe Marshall <j...@ccs.neu.edu> writes:
> Unfortunately, the implementation might not work. Most lisp
> implementations compute the log function by first converting the
> number to a float, *then* computing the log. So you could go up to
> about most-positive-double-float (10^308 << 10^(2^9) on my machine)

With that (quite reasonable) implementation, it's even worse than
that. The float has to be able to represent the difference between
the successive integers 9...9 vs. 1...0 and 19...9 vs. 20...0
accurately, so you can only go to about (expt (float-radix f)
(float-digits f)) (< 10^16 on my machine). And if the log function
gets even a one-bit error on either boundary (as Kent reported at
10^3), you're screwed.
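A concrete illustration of the boundary problem (with IEEE double-floats;
exact behaviour is implementation-dependent):

(= (float (1- (expt 10 16)) 1d0)
   (float (expt 10 16) 1d0))
;; => T : 9999999999999999 and 10000000000000000 round to the same double,
;; so once the integer has been converted to a float, no log-based test can
;; tell a leading 9 from a leading 1 at this magnitude.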
--
Pekka P. Pirinen
Don't force it, get a larger hammer.

Jacek Generowicz

unread,
Jun 17, 2003, 2:57:18 AM6/17/03
to
Kent M Pitman <pit...@world.std.com> writes:

> Is this business of the probability of a number intended to mean "the
> probability it will be mentioned in human texts"? I originally read
> this to mean "the density of numbers" which seems impossible to
> understand. I also pondered whether it could mean "the usefulness of
> a number" or "the likelihood that a number would result from a given
> computation" but I could attach no serious meaning to these.

Benford's Law is a statement about the most significant digit of
numbers which represent objects occuring naturally, specifically about
distributions of objects which are scale free.

I think that this is where the OP started his quest, as he suggests in
message

none-3A8EBB.1...@news.vanderbilt.edu

Since then a lot of other "sociological" considerations have been
mixed into the discussion, which IMHO have nothing to do with
Newcomb's and Benford's observations.

Jacek Generowicz

unread,
Jun 17, 2003, 7:57:44 AM6/17/03
to
sv0f <no...@vanderbilt.edu> writes:

> In article <gat-120603...@192.168.1.51>,
> g...@jpl.nasa.gov (Erann Gat) wrote:

> >(defun benford (&optional (n 10000) (base 10) (limit 100000000))
> > (let ( (histogram (make-array base :initial-element 0)) )
> > (dotimes (i n)
> > (let ( (digit (char (princ-to-string (random (random limit))) 0)) )
> > (incf (svref histogram (parse-integer (string digit) :radix base)))))
> > histogram))
> >
> >? (benford)
> >#(0 2444 1815 1435 1170 943 726 612 517 338)

Let's look at it another way. Imagine you have a probability
distribution function in which numbers starting with most significant
digit 1 are more likely than others. By definition, the first digit
law will hold. What might such a distribution function look like? The
obvious guess is one in which there are humps in the regions where
numbers start with 1. But if you add the requirement that the law must
still hold if you rescale or change base, then this type of function
no longer works. You need to look for something more subtle.

Let's work in base 2 for convenience, that way we only have to
consider 1 and 2 as leading digits. We are looking for a function
where for every region where the abscissae start with a 1 (call this a
1-region), there is a corresponding 2-region where the area under the
curve is smaller than it is in the 1-region. Note that _any
monotonically decreasing function_ satisfies this requirement (to the
right of every 1-region, there is a 2-region (of equal width) over
which the value of the function is lower than it is in the 1-region)
as long as:

- the abscissae range over a few orders of magnitude

- edge effects don't spoil it

[What do I mean by edge effects?

Well, if your range starts at the left of a 1-region, and ends at the
right of a 2-region, everything is fine. But if the range starts at
the left of a 2-region and ends at the right of a 1-region, you might
have problems, because each 1-region is now paired with a
higher-valued 2-region (to its left). However, the 1-region is wider
than the 2-region to its left, so the decay must be sufficiently
strong so that the _area_ under the curve in the 1-region be smaller.]

Now, by the same type of argument you can deduce that monotonically
_increasing_ probability distributions will favour numbers starting
with 9 (in base 10, that is) !

> HOWEVER, the proportion of 1s according to your analysis -- 0.24
> or so -- is quite a bit below the 0.30 to 0.33 predicted by
> Benford's analysis

This is because the 0.30 from Benford's Law refers to the specific
case of _exponential_ distribution functions only ...

> Unless this discrepancy is an artifact of MCL's pseudorandom
> number generator,

... while "(random (random limit))" is just a funny way of obtaining a
probability distribution function which _does_ exhibit the necessary
features (it is monotonically decreasing), but is _not_ exponential;
therefore the numbers deduced from exponential distributions do not
apply, even though the generalized first digit law still does.

The nice thing about the exponential function is that it is scale
free, so it looks the same wherever you look, and hence the relative
sizes of adjacent n-regions are always the same, so you can work out
the 0.30 just by looking at a single order of magnitude. The same is
not true for other distributions, so it's not as easy to work out the
bias analytically.

Now, let's hack Erann's function to make it use a monotonically
increasing distribution:

(defun benford (&optional (n 10000) (base 10) (limit 100000000))
  (let ((histogram (make-array base :initial-element 0)))
    (dotimes (i n)
      (let* ((sublimit (random limit))
             (digit (char (princ-to-string (- (+ limit (random sublimit)) sublimit))
                          0)))
        (incf (svref histogram (parse-integer (string digit) :radix base)))))
    histogram))

* (benford)
#(0 157 279 465 641 815 1083 1409 1824 3327)

Wheeeeeee !

Ivan Boldyrev

unread,
Jun 17, 2003, 12:09:15 PM6/17/03
to
On 8410 day of my life Kent M. Pitman wrote:

> My point is only this: when analyzing the frequency of numbers in a
> unit context, make sure you understand how the units are chosen
> because the bias may be hidden there.

Donald E. Knuth mentions the discussed phenomenon in his "The Art of
Computer Programming" (AFAIR, it is vol. 2). And he says that the phenomenon
is unit-invariant. But I do not remember if he proved it in TAOCP.

--
Ivan Boldyrev

"Assembly of Japanese bicycle require great peace of mind."

Raymond Toy

unread,
Jun 23, 2003, 1:22:13 PM6/23/03
to
>>>>> "Joe" == Joe Marshall <j...@ccs.neu.edu> writes:

Joe> Raymond Toy <t...@rtp.ericsson.se> writes:
>> >>>>> "sv0f" == sv0f <no...@vanderbilt.edu> writes:
>>
sv0f> Raymond and Thomas noted that round off errors can
sv0f> sabotage the exponent/log approach for sufficiently
sv0f> large numbers. The iterative approach does not
sv0f> suffer this flaw.
>>
>> (log n 10d0) will lose significance if the result is greater than 2^50
>> or so, assuming IEEE double-floats. That means n is 10^(2^50). That
>> probably won't fit in your computer, so the log approach is probably
>> good and should be quite fast.
>>
>> (Assuming I didn't make a mistake and publicly embarrass myself here.)

Joe> No, the math is correct.

Whew!

Joe> Unfortunately, the implementation might not work. Most lisp
Joe> implementations compute the log function by first converting the
Joe> number to a float, *then* computing the log. So you could go up to
Joe> about most-positive-double-float (10^308 << 10^(2^9) on my machine)

Well, that would be a quality of implementation issue. If this is
important, pick an implementation that does a better job. :-)

If not, it's still pretty easy to do yourself in CL.

Ray
