Re: K random numbers

Richard Heathfield

unread,

Nov 16, 2009, 3:29:49 AM11/16/09

to

In
<0.f2cb42259b4f36be8db5.2009...@bsb.me.uk>,
Ben Bacarisse wrote:

> Tagore <c.lang...@gmail.com> writes:
>
>> I want to generate k different random numbers between 1 to n.
>> (k<n).
>> One of the appraoch is calling random function again and again. and
>> checking wheter new generated number had already been generated.
>> but its complexity is O(n^2).....Is there any better mthod?
>
> This is more suitable for comp.programming and I have set followups.
>
> I am not sure about your Big O. It depends on the what you are
> measuring, but there is no obvious upper bound for the algorithm you
> describe.

Oh, there is, provided you have a probability theorist tucked under
one arm and a Donald Knuth or an Alan Turing under the other. (I
don't.)

> There is an elegant algorithm by Floyd for this.

I couldn't find this (truly elegant) algorithm in TAOCP, but that
doesn't mean it isn't there. (I gave up after searching the whole of
Chapter 1 and about half of Floyd's index entries.)

For those who would like a less elegant solution that still beats not
only O(rnd()) but also O(n*n), the Fisher-Yates Shuffle is easy to
understand and easy to implement.

To extract k different random numbers in the range 1 to n:

Step 0: Allocate an array of n items.
Step 1: Initialise the array:
for(i = 0; i < n; i++)
{
arr[i] = i + 1;
}
Step 2: Fisher-Yates Shuffle
for(i = 0; i < k; i++) /* [1] */
{
r = random_number_in_range(i, n - 1);
t = arr[i];
arr[i] = arr[r];
arr[r] = t;
}
Step 3: Use (ONLY) the first k numbers from the array.

[1] If you want to be able to use the whole deck, you can use i < n if
you prefer. It doesn't actually make any difference, but it's a
little more self-documenting.

<snip>

--
Richard Heathfield <http://www.cpax.org.uk>
Email: -http://www. +rjh@
"Usenet is a strange place" - dmr 29 July 1999
Sig line vacant - apply within

Ben Bacarisse

unread,

Nov 16, 2009, 9:10:04 AM11/16/09

to

Richard Heathfield <r...@see.sig.invalid> writes:

> In
> <0.f2cb42259b4f36be8db5.2009...@bsb.me.uk>,
> Ben Bacarisse wrote:
>
>> Tagore <c.lang...@gmail.com> writes:
>>
>>> I want to generate k different random numbers between 1 to n.
>>> (k<n).
>>> One of the appraoch is calling random function again and again. and
>>> checking wheter new generated number had already been generated.
>>> but its complexity is O(n^2).....Is there any better mthod?

<snip>

>> There is an elegant algorithm by Floyd for this.
>
> I couldn't find this (truly elegant) algorithm in TAOCP, but that
> doesn't mean it isn't there. (I gave up after searching the whole of
> Chapter 1 and about half of Floyd's index entries.)

It is in Chapter 3 -- Random Numbers, specifically 3.4.2 exercise 17.
3.4.2 has Algorithm S which uses no more than n random numbers (rather
than k) but does not require a set data structure:

t := 0; m := 0;
while m < k
r := random_float_in_0_1();
if r*(n - t) < k - m then
select_print_whatever(t + 1)
m := m + 1
t := t + 1

I have always used Floyd's algorithm because I have not needed a
random sample in a situation where a set of selected items was in any
way inefficient or inconvenient. In fact, the set S (from my previous
post) often gets used as the result of the algorithm rather than an
array or list of k integers.

But if you just want pick k from n and you don't want to use the very
fewest random numbers, the above is hard to beat. As always, the
exercises suggest some improvements, but the result is fiddly.

> For those who would like a less elegant solution that still beats not
> only O(rnd()) but also O(n*n), the Fisher-Yates Shuffle is easy to
> understand and easy to implement.

Which Knuth tells me was published in 1938. Yes, 1938!

<snip>
--
Ben.

Richard Heathfield

unread,

Nov 16, 2009, 11:46:00 AM11/16/09

to

In
<0.4a845894bfbdd0bdd9e7.2009...@bsb.me.uk>,
Ben Bacarisse wrote:

> Richard Heathfield <r...@see.sig.invalid> writes:
>
<snip>

>> I couldn't find this (truly elegant) algorithm in TAOCP, but that
>> doesn't mean it isn't there. (I gave up after searching the whole
>> of Chapter 1 and about half of Floyd's index entries.)
>
> It is in Chapter 3 -- Random Numbers, specifically 3.4.2 exercise
> 17.

Thanks - got it. Don't ask me why I was looking in Ch1 instead of Ch3!

<snip>

>> For those who would like a less elegant solution that still beats
>> not only O(rnd()) but also O(n*n), the Fisher-Yates Shuffle is easy
>> to understand and easy to implement.
>
> Which Knuth tells me was published in 1938. Yes, 1938!

I think time machines should be banned - they ain't nothing but
trouble. Are you listening, Mr Fisher-Yates?

stan

unread,

Nov 16, 2009, 6:03:35 PM11/16/09

to

Richard Heathfield wrote:

>>> For those who would like a less elegant solution that still beats
>>> not only O(rnd()) but also O(n*n), the Fisher-Yates Shuffle is easy
>>> to understand and easy to implement.
>>
>> Which Knuth tells me was published in 1938. Yes, 1938!
>
> I think time machines should be banned - they ain't nothing but
> trouble. Are you listening, Mr Fisher-Yates?

Kinda makes one wonder what they were doing when this came up.
(algorithm not the time machine)

Ben Bacarisse

unread,

Nov 16, 2009, 7:14:32 PM11/16/09

to

stan <smo...@exis.net> writes:

Fisher was the Fisher who virtually invented modern statistical
methods. In 1938 he had just left the Rothamsted Experimental Station
to take up a position at UCL (somewhat embarrassingly in eugenics)
thereby allowing his colleague Yates to take over as head of
statistics. They published the method in a handbook called
"Statistical tables for biological, agricultural and medical
research". Random permutations are obviously important in
experimental design.

Durstenfeld was the first to publish the algorithm for an electronic
(rather than a human) computer and I think deserves some credit
because the Fisher-Yates method is not what we know today. It was
Durstenfeld who came up with the swap. The fisher-Yates is based on
crossing off items from a list, repeated counting from the start and
skipping items that are already crossed off.

[I know you didn't *really* want to know what they were doing, but I
like this kind of stuff so I chose to take you literally.]

--
Ben.

Richard Harter

unread,

Nov 17, 2009, 11:29:34 AM11/17/09

to

On Mon, 16 Nov 2009 08:29:49 +0000, Richard Heathfield
<r...@see.sig.invalid> wrote:

>In
><0.f2cb42259b4f36be8db5.2009...@bsb.me.uk>,
>Ben Bacarisse wrote:
>
>> Tagore <c.lang...@gmail.com> writes:
>>
>>> I want to generate k different random numbers between 1 to n.
>>> (k<n).
>>> One of the appraoch is calling random function again and again. and
>>> checking wheter new generated number had already been generated.
>>> but its complexity is O(n^2).....Is there any better mthod?
>>
>> This is more suitable for comp.programming and I have set followups.

Which produced the result that it is running half and half in two
groups.

I posted the following in comp.lang.c and am reproducing it here.

Sigh. Marek Chrobak and I did a note on this in IFIPS in 1988.
Quite maddeningly, I mislaid the my copy of the preprint and
can't seem find the time to reconstruct the details. However the
key idea is that Fisher-Yates only alters at most min(2k,n)
locations so that is all that you need to keep track of. To do
this, keep two arrays, one that has the permutation so far, and
the second that has the contents of the permutation element.

Thus, if the first two exchanges are (1,382),(2,17) the arrays
begin
[382, 17,...]
[ 1, 2,...]

If 382 were to occur at step 3 we see that 1 is now in cell 382
so the two arrays are now

[382, 17, 1,...]
[ 3, 2,382,...]

So, yes, you can get a uniformly distributed permutation with
O(k) storage. The obvious way to get O(k*log(k)) time is to put
everything in a binary search tree. The details should be
obvious.

Richard Harter, c...@tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
Infinity is one of those things that keep philosophers busy when they
could be more profitably spending their time weeding their garden.

Ben Bacarisse

unread,

Nov 17, 2009, 11:55:26 AM11/17/09

to

c...@tiac.net (Richard Harter) writes:

> On Mon, 16 Nov 2009 08:29:49 +0000, Richard Heathfield
> <r...@see.sig.invalid> wrote:
>
>>In
>><0.f2cb42259b4f36be8db5.2009...@bsb.me.uk>,
>>Ben Bacarisse wrote:
>>
>>> Tagore <c.lang...@gmail.com> writes:
>>>
>>>> I want to generate k different random numbers between 1 to n.
>>>> (k<n).
>>>> One of the appraoch is calling random function again and again. and
>>>> checking wheter new generated number had already been generated.
>>>> but its complexity is O(n^2).....Is there any better mthod?
>>>
>>> This is more suitable for comp.programming and I have set followups.
>
> Which produced the result that it is running half and half in two
> groups.

Would you suggest that one should never try to use followups to move a
discussion? I really don't mind one way or the other and am keen to
hear what you and others think. Personally I favour not answering
purely algorithmic question in purely language groups with out setting
followups, but maybe the best solution is the good old-fashioned: this
is not the right group post?

In this case it turns out the OP multi-posted so I could not have
stopped the parallel discussion so I agree that the attempt was
futile.

> I posted the following in comp.lang.c and am reproducing it here.
>
> Sigh. Marek Chrobak and I did a note on this in IFIPS in 1988.
> Quite maddeningly, I mislaid the my copy of the preprint and
> can't seem find the time to reconstruct the details. However the
> key idea is that Fisher-Yates only alters at most min(2k,n)
> locations so that is all that you need to keep track of. To do
> this, keep two arrays, one that has the permutation so far, and
> the second that has the contents of the permutation element.
>
> Thus, if the first two exchanges are (1,382),(2,17) the arrays
> begin
> [382, 17,...]
> [ 1, 2,...]
>
> If 382 were to occur at step 3 we see that 1 is now in cell 382
> so the two arrays are now
>
> [382, 17, 1,...]
> [ 3, 2,382,...]
>
> So, yes, you can get a uniformly distributed permutation with
> O(k) storage. The obvious way to get O(k*log(k)) time is to put
> everything in a binary search tree. The details should be
> obvious.

That's a nice way to do it conserving random numbers, but if you have
RNs to spare you can use Knuth's Algorithm S to put k numbers into an
array and permute that. That gives you O(k) space and time.

--
Ben.

Richard Harter

unread,

Nov 17, 2009, 3:21:33 PM11/17/09

to

This is a good question. Personally I feel that these small
purely algorithmic questions are alright. Granted that they
aren't technically language questions, but they are (or can be)
about programming in C. On my view language groups are not just
about the language itself but also about the issues of
programming in the language. Obviously there are boundaries, but
we needn't go there in this thread.

As a practical matter threads like this don't last very long so
there is no need to fuss.

As you say, the fact that he multi-posted made the whole point
moot.

>
>> I posted the following in comp.lang.c and am reproducing it here.
>>
>> Sigh. Marek Chrobak and I did a note on this in IFIPS in 1988.
>> Quite maddeningly, I mislaid the my copy of the preprint and
>> can't seem find the time to reconstruct the details. However the
>> key idea is that Fisher-Yates only alters at most min(2k,n)
>> locations so that is all that you need to keep track of. To do
>> this, keep two arrays, one that has the permutation so far, and
>> the second that has the contents of the permutation element.
>>
>> Thus, if the first two exchanges are (1,382),(2,17) the arrays
>> begin
>> [382, 17,...]
>> [ 1, 2,...]
>>
>> If 382 were to occur at step 3 we see that 1 is now in cell 382
>> so the two arrays are now
>>
>> [382, 17, 1,...]
>> [ 3, 2,382,...]
>>
>> So, yes, you can get a uniformly distributed permutation with
>> O(k) storage. The obvious way to get O(k*log(k)) time is to put
>> everything in a binary search tree. The details should be
>> obvious.
>
>That's a nice way to do it conserving random numbers, but if you have
>RNs to spare you can use Knuth's Algorithm S to put k numbers into an
>array and permute that. That gives you O(k) space and time.

IIANM Knuth's algorithm S has average time O(n), not O(k). (I am
going by the first edition; I am assuming that he hasn't revised
it since then.) More than that, you need to specify k in
advance. The plus is that it does return an ordered sequence.

An upside of our algorithm is that k does not have to be
specified in advance. At each step all possible sequences are
equi-probable.

If O(k) time is really desirable the paired entries can be put in
a hash table.

Ben Bacarisse

unread,

Nov 17, 2009, 6:45:32 PM11/17/09

to

c...@tiac.net (Richard Harter) writes:

> On Tue, 17 Nov 2009 16:55:26 +0000, Ben Bacarisse
> <ben.u...@bsb.me.uk> wrote:
>>c...@tiac.net (Richard Harter) writes:

<snip>

Ah, yes, I conflated the exercises with the algorithm in the text. In
my edition the exercises include the refinements of Jeffrey Scott
Vitter which gives O(k) time and O(1) space (O(k) is store the result
of course!). This is published in "Faster methods for random
sampling" CACM, vol 27, no 7, 1984. "Refinement" may be too mean a
term: it is a better algorithm -- you calculate the skip length rather
than choosing yes/no once per item.

<snip>
--
Ben.

stan

unread,

Nov 17, 2009, 8:54:56 PM11/17/09

to

Ben Bacarisse wrote:
> stan <smo...@exis.net> writes:

<Fisher-Yates stuff snipped>

>> Kinda makes one wonder what they were doing when this came up.
>> (algorithm not the time machine)
>
> Fisher was the Fisher who virtually invented modern statistical
> methods. In 1938 he had just left the Rothamsted Experimental Station
> to take up a position at UCL (somewhat embarrassingly in eugenics)
> thereby allowing his colleague Yates to take over as head of
> statistics. They published the method in a handbook called
> "Statistical tables for biological, agricultural and medical
> research". Random permutations are obviously important in
> experimental design.
>
> Durstenfeld was the first to publish the algorithm for an electronic
> (rather than a human) computer and I think deserves some credit
> because the Fisher-Yates method is not what we know today. It was
> Durstenfeld who came up with the swap. The fisher-Yates is based on
> crossing off items from a list, repeated counting from the start and
> skipping items that are already crossed off.
>
> [I know you didn't *really* want to know what they were doing, but I
> like this kind of stuff so I chose to take you literally.]

Actually I was curious, but I intended it as rhetorical because I
couldn't imagine anyone had a reasonable answer.

Thanks, you're pretty good. Any word on the time machine?

Richard Harter

unread,

Nov 17, 2009, 10:42:46 PM11/17/09

to

On Tue, 17 Nov 2009 23:45:32 +0000, Ben Bacarisse
<ben.u...@bsb.me.uk> wrote:

Yes, that would be better. What does Vitter use for the
probability distribution of the skip distance?

mohangupta13

unread,

Nov 21, 2009, 10:08:18 AM11/21/09

to

On Nov 17, 9:29 pm, c...@tiac.net (Richard Harter) wrote:
> On Mon, 16 Nov 2009 08:29:49 +0000, Richard Heathfield
>
> <r...@see.sig.invalid> wrote:
> >In

> ><0.f2cb42259b4f36be8db5.20091116031844GMT.87iqdb40ob....@bsb.me.uk>,
> >Ben Bacarisse wrote:

sir can you please explain it more explicitly ..i tried but couldn't
get it .I mean i couldn't understand what the algorithm says .
thanks
Mohan
>
> Richard Harter, c...@tiac.nethttp://home.tiac.net/~cri,http://www.varinoma.com

Richard Harter

unread,

Nov 22, 2009, 1:37:29 PM11/22/09

to

On Sat, 21 Nov 2009 07:08:18 -0800 (PST), mohangupta13
<mohang...@gmail.com> wrote:

>On Nov 17, 9:29=A0pm, c...@tiac.net (Richard Harter) wrote:
>> On Mon, 16 Nov 2009 08:29:49 +0000, Richard Heathfield
>>
>> <r...@see.sig.invalid> wrote:
>> >In
>> ><0.f2cb42259b4f36be8db5.20091116031844GMT.87iqdb40ob....@bsb.me.uk>,
>> >Ben Bacarisse wrote:
>>
>> >> Tagore <c.lang.mys...@gmail.com> writes:
>>

>> >>> =A0 I want to generate k different random numbers between 1 to n.
>> >>> =A0 (k<n).

>> >>> One of the appraoch is calling random function again and again. and
>> >>> checking wheter new generated number had already been generated.
>> >>> but its complexity is O(n^2).....Is there any better mthod?
>>
>> >> This is more suitable for comp.programming and I have set followups.
>>
>> Which produced the result that it is running half and half in two

>> groups. =A0

>>
>> I posted the following in comp.lang.c and am reproducing it here.
>>

>> Sigh. =A0Marek Chrobak and I did a note on this in IFIPS in 1988.

>> Quite maddeningly, I mislaid the my copy of the preprint and

>> can't seem find the time to reconstruct the details. =A0However the

>> key idea is that Fisher-Yates only alters at most min(2k,n)

>> locations so that is all that you need to keep track of. =A0To do

>> this, keep two arrays, one that has the permutation so far, and
>> the second that has the contents of the permutation element.
>>
>> Thus, if the first two exchanges are (1,382),(2,17) the arrays
>> begin
>> [382, 17,...]

>> [ =A01, =A02,...]

>>
>> If 382 were to occur at step 3 we see that 1 is now in cell 382
>> so the two arrays are now
>>

>> [382, 17, =A01,...]
>> [ =A03, =A02,382,...]

>>
>> So, yes, you can get a uniformly distributed permutation with

>> O(k) storage. =A0The obvious way to get O(k*log(k)) time is to put
>> everything in a binary search tree. =A0The details should be

>> obvious.
>sir can you please explain it more explicitly ..i tried but couldn't
>get it .I mean i couldn't understand what the algorithm says .
>thanks
>Mohan

I'm not quite sure whether the point of confusion is how to
implement the algorithm or what they underlying idea is, so I
will try to answer both questions.

First an implementation. The following is an untested code
fragment but has the key idea:

/* a[] and b[] are two arrays of size k; a[] holds the
selection sequence. b[i] holds the contents of the cell
pointed to by a[i].

for (i = 0; i < k ; i++) {
m = random(0,n-1); /* random returns an integer in the
range 0:n-1 */
j = index_of(a,m); /* index_of returns the index of m
if it is present in a, else -1 */
if (j < 0) {a[i] = m; b[i] = i;}
else {a[i] = b[j]; b[j] = i;}
}

Now what is going on here? The idea is to do Fisher-Yates
without initializing a big honking array. In Fisher-Yates we
initialize a big array with the integers 0...n-1. We loop
through first k locations. The ones we have filled so far with
our sequence are fixed. In each cycle we randomly pick one of
the unfixed locations and swap its contents with the current
location being filled. This gives an algorithm that looks like
this:

for (i = 0; i<k; i++){
m = random(i:n-1);
temp = a[i];
a[i] = a[m];
a[m] = temp;
}

Notice that at most 2*k locations in a[] get altered. Where are
those locations. K of them are the initial k locations of a[].
The remainder are in the locations that were swapped with the
initial k locations. Array b[] contains those.

I hope this helps.

Richard Harter, c...@tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com