pick $n random lines from a file

Roderick Schertler

unread,

Sep 17, 2001, 9:48:48 AM9/17/01

to

How would you extend the standard technique for picking a random line
>From a file (

while (<>) {
$line = $_ if rand($.) < 1;
}

) to pick $n random lines instead? In particular, I'm looking for a way
to do this in a single pass over the input without seeking (for use with
pipes), and without saving the whole input in memory.

Thanks.

--
Roderick Schertler
rode...@argon.org

John Porter

unread,

Sep 17, 2001, 2:57:18 PM9/17/01

to

Roderick Schertler wrote:
> How would you extend the standard technique for picking a random line

> while (<>) {
> $line = $_ if rand($.) < 1;
> }

> to pick $n random lines instead?

How about:

my @l;
while (<>) {
push @l, $_ if rand($.) < $n;
last if @l == $n;
}

But there remains the possibility that <$n lines have been
collected by the time the eof has been reached.

--
John Porter

Michael J Assels

unread,

Sep 17, 2001, 3:59:17 PM9/17/01

to

In article <2001091714...@min.net>,
John Porter <jdpo...@min.net> wrote:

>How about:
>
> my @l;
> while (<>) {
> push @l, $_ if rand($.) < $n;
> last if @l == $n;
> }
>
>But there remains the possibility that <$n lines have been
>collected by the time the eof has been reached.

An algorithm that *always* includes line 1 can scarcely
be called random. I'm not sure whether the original
"standard technique" produces the desired randomness,
but I'm pretty sure this is not the best way to extend
it to arbitrary $n.

Michael

Mark-Jason Dominus

unread,

Sep 17, 2001, 4:07:31 PM9/17/01

to

> How would you extend the standard technique for picking a random line
> >From a file (
>
> while (<>) {
> $line = $_ if rand($.) < 1;
> }
>
> ) to pick $n random lines instead?

my $N = 7;
for (1 .. $N) {
push @lines, scalar <>;
}
while (<>) {
$lines[rand $N] = $_ if rand() < $N/$.;
}

Edward Avis

unread,

Sep 17, 2001, 4:02:18 PM9/17/01

to

Roderick Schertler <rode...@argon.org> writes:

>How would you extend the standard technique for picking a random line
>From a file (
>
> while (<>) {
> $line = $_ if rand($.) < 1;
> }
>
>) to pick $n random lines instead?

Well if you don't mind picking the same line twice then

while (<>) {
$line_a = $_ if rand($.) < 1;
$line_b = $_ if rand($.) < 1;
}

would obviously work to pick two lines. If you want each line to be
different (or at least a different line number), then maybe something
like

while (<>) {
if (rand($.) < 1) { $line_a = $_ }
elsif (rand($.) < 1) { $line_b = $_ }
}

would work. That ensures that the same line can never be picked for
$line_a and $line_b. But there is a problem: $line_b is not
guaranteed to be set. Whereas $line_a will always get set on the
first line (because rand(1) < 1), this won't happen for $line_b. You
could solve that by shifting $line_b along by one:

while (<>) {
if ($. >= 2 and rand($. - 1) < 1) { $line_b = $_ }
elsif (rand($.) < 1) { $line_a = $_ }
}

Then $line_b will always be set on the second line, and $line_a always
on the first. The order of the if...elsif is reversed to make sure
that $line_a can't grab the second line and stop $line_b getting it.

Now I think that solution would work, provided you don't care about
the relative ordering of $line_a and $line_b (I think that $line_b
would tend to get later positions in the file). You could always
randomize the ordering of the lines picked afterwards. (The program
generalizes to if ($. >= 3 and rand($. - 2) < 1), etc.)

But I'm not entirely happy with it if you want to be sure you're
getting something truly random. I can't think of anything that's
wrong, I just haven't examined it thoroughly enough. Or even tried to
run it :-).

--
Ed Avis <ep...@doc.ic.ac.uk>
Finger for PGP key

Ed Kademan

unread,

Sep 17, 2001, 5:20:13 PM9/17/01

to

Roderick Schertler <rode...@argon.org> writes:

> How would you extend the standard technique for picking a random line
> >From a file (
>
> while (<>) {
> $line = $_ if rand($.) < 1;
> }
>

> ) to pick $n random lines instead? In particular, I'm looking for a way
> to do this in a single pass over the input without seeking (for use with
> pipes), and without saving the whole input in memory.

To create a uniform random sample without replacement, build a table
of size n. Each entry of the table contains a TAG and a LINE.
Initially the TAGs are all -1 and the LINEs are all empty strings.

Now for each line you read from input:
* Generate a uniform random number between 0 and 1 with rand().
(Call it U.)
* If U is bigger than the biggest TAG in the table replace the table
entry associated with the smallest TAG with a new entry. This new
entry has U for a TAG and the current line as the LINE. (If the
smallest TAG is not unique so that a group of entries shares its
value just replace one of that group arbitrarily.)

Provided your input has at least n lines your table will be full.

You are essentially shuffling your input and then picking the 10 lines
that happen to end up on top.

--
Ed Kademan 508.651.3700
PHZ Capital Partners 508.653.1745 (fax)
321 Commonwealth Road <kad...@phz.com>
Wayland, MA 01778

Bruce Van Allen

unread,

Sep 17, 2001, 5:32:26 PM9/17/01

to

At 4:07 PM -0400 9/17/01, Mark-Jason Dominus wrote:
>> ) to pick $n random lines instead?
>

> my $N = 7;
> for (1 .. $N) {
> push @lines, scalar <>;
> }
> while (<>) {
> $lines[rand $N] = $_ if rand() < $N/$.;
> }

Yow. Very nice!

:) :) :) :) :) :) :) :) :) :) :) :) :) :)

1;

- Bruce

P.S. Russ: is that clear? Sorry to take your time. I collect logic
poems, and that one made me smile. The 1; is just my 'yours truly'
JAPH affectation...

__bruce_van_allen__santa_cruz_ca__

Ed Kademan

unread,

Sep 18, 2001, 8:45:40 AM9/18/01

to

Ed Kademan <kad...@phz.com> writes:

> Roderick Schertler <rode...@argon.org> writes:
>
> > How would you extend the standard technique for picking a random line
> > >From a file (

> > ...

> > to pick $n random lines instead?

> > ...

>
> To create a uniform random sample without replacement, build a table
> of size n. Each entry of the table contains a TAG and a LINE.
> Initially the TAGs are all -1 and the LINEs are all empty strings.
>
> Now for each line you read from input:
> * Generate a uniform random number between 0 and 1 with rand().
> (Call it U.)
> * If U is bigger than the biggest TAG in the table replace the table

^^^^^^^ should be smallest

> entry associated with the smallest TAG with a new entry. This new
> entry has U for a TAG and the current line as the LINE. (If the
> smallest TAG is not unique so that a group of entries shares its
> value just replace one of that group arbitrarily.)
>
> Provided your input has at least n lines your table will be full.
>
> You are essentially shuffling your input and then picking the 10

should be n ^^

> lines that happen to end up on top.

Please suffer me to point out a couple of bugs in what I just wrote.
I have noted them above, and I summarize what---for the moment---I
believe to be the correct algorithm as follows:

To create a uniform random sample without replacement, build a table
of size n. Each entry of the table contains a TAG and a LINE.
Initially the TAGs are all -1 and the LINEs are all empty strings.

Now for each line you read from input:
* Generate a uniform random number between 0 and 1 with rand().
(Call it U.)

* If U is bigger than the smallest TAG in the table replace the
table entry associated with that smallest TAG with a new entry.

This new entry has U for a TAG and the current line as the LINE.
(If the smallest TAG is not unique so that a group of entries
shares its value just replace one of that group arbitrarily.)

Provided your input has at least n lines your table will be full.

Essentially you are shuffling your input and then picking the n lines

Ian Phillipps

unread,

Sep 18, 2001, 5:23:35 AM9/18/01

to

Roderick Schertler wrote:

> In particular, I'm looking for a way
> to do this in a single pass over the input without seeking (for use with
> pipes), and without saving the whole input in memory.

Some of the posters to this thread have touched on this by implication,
but it's worth saying that what you're asking for is theoretically
impossible. You cannot choose "randomly" if you don't know what
probability to use, and you don't know that until you know the line
count.

Unless you know in advance how many lines there are, you can't do it in
one pass. And a pipe doesn't tell you how many lines it's going to
disgorge.

If you know how many there are, just make a list, then use it:

$linecount = <...total line count...>;

for(1..$n) { 1 while $match{ 1 + int rand $linecount }++ }

while(<>) { print if $match{$.} }

__END__

Ian

Edward Avis

unread,

Sep 18, 2001, 4:02:24 PM9/18/01

to

Ian Phillipps <dont....@myself.com> writes:

[picking not one, but N random lines from a file]

>You cannot choose "randomly" if you don't know what probability to
>use, and you don't know that until you know the line count.

If this is true, then why does the solution given in the FAQ work? Is
one line a special case?

Do you have a proof that no solution is possible for N > 1?

(Hmm, this thread never really was about Perl. But it refers to an
algorithm given in the Perl documentation, so I hope that makes it
okay for this ng.)

Mark-Jason Dominus

unread,

Sep 18, 2001, 4:50:36 PM9/18/01

to

Ian.Ph...@iname.com:

> Roderick Schertler wrote:
>
> > In particular, I'm looking for a way
> > to do this in a single pass over the input without seeking (for use with
> > pipes), and without saving the whole input in memory.
>
> Some of the posters to this thread have touched on this by implication,
> but it's worth saying that what you're asking for is theoretically
> impossible.

It's not worth saying, because it's wrong. If it's "theoretically
impossible", then you need a better theory.

The code Roderick posted does in fact do what he said: It selects one
line at random, each line with equal probability, without knowing in
advance how many lines there are, without reading them all into
memory, and without making more than one pass.

Here's the code again:

while (<>) {
$selected = $_ if rand($.) < 1;
}

Clearly, if the input contains only one line, that line will be stored
into $selected before the end of the loop. If there is no input, the
loop exits, having left the single line of input in $selected.

Now let's suppose that the loop has run N times and that $selected
contains one of the first N lines, each with equal probability 1/N.
Control is now at the end of the loop. But there is an (N+1)st line,
so the loop continues. rand() generates a random number less than
(N+1). 1/(N+1) of the time, this number is less than 1, and the value
in $selected is replaced with line N+1. The other N/(N+1) of the
time, $selected retains its old value, which was one of the earlier
lines. By hypothesis, each of these earlier lines was selected with
probability 1/N; they will remain there N/(N+1) of the time, for a
total probability of 1/N * N/(N+1) = 1/(N+1) each. So each of the
first N+1 lines appears in $selected with probability exactly 1/(N+1),
which is just what we want.

This argument shows that if the loop behaves correctly for an N-line
input, it behaves correctly for an N+1-line input. Since it behaves
correctly for a 1-line input, it behaves correctly for all possible
inputs.

********

If you don't follow the argument, work though what happens when the
input is three or four lines long. On the first pass through the
loop, line 1 is in $selected. On the second pass through the loop,
line 1 is discarded and replaced with line 2 exactly half the time;
there is a 1/2 probability of line 1 remaining in $selected and a 1/2
probability of it being replaced by line 2.

On the third pass through the loop, there is a 1/3 probability of the
value of line being discarded and replaced with line 3, which
therefore appears with probability 1/3. The other 2/3 of the time,
whatever was in $selected remains there, either line 1 or line 2, each
exactly half the time. Half of 2/3 is 1/3, so each of line 1 and line
2 remain in $selected a total of 1/3 of the time.

********

If you prefer an argument from authority, see Knuth, _The Art of
Computer Programming_, volume II.

The code I posted is also correct, and will solve Roderick's problem
of selecting more than one line. It selects N lines from the input,
each line with equal probability, etc. it's a traightforward
extension of the code above.

> You cannot choose "randomly" if you don't know what probability to
> use, and you don't know that until you know the line count.

It might be instructive to go back over your argument and try to find
the flaw.

Steve Holland

unread,

Sep 18, 2001, 6:55:10 PM9/18/01

to

Edward Avis <ep...@doc.ic.ac.uk> writes:
> Ian Phillipps <dont....@myself.com> writes:

> [picking not one, but N random lines from a file]

> >You cannot choose "randomly" if you don't know what probability to
> >use, and you don't know that until you know the line count.

You need to know how many lines are in the file. However, you
can make a reasonable guess at that if you know something about the
file. First, use stat to determine the number of bytes in the file.
Then, divide this by an estimate of the length of each line. This
works best if you are reading $n random lines from a file where each
line has the same length. The results will not be truly random, but
they will be random enough for something like a fortune cookie
generator.

=====================================================================
To find out who and where I am look at:
http://www.nd.edu/~sholland/index.html
Spammers: Please send spam to: ab...@aol.com and ab...@yahoo.com
=====================================================================

Ian Phillipps

unread,

Sep 19, 2001, 5:58:19 AM9/19/01

to

Mark-Jason Dominus wrote:
>
> It's not worth saying, because it's wrong. If it's "theoretically
> impossible", then you need a better theory.

I am duly humbled.

> The code Roderick posted does in fact do what he said: It selects one
> line at random, each line with equal probability, without knowing in
> advance how many lines there are, without reading them all into
> memory, and without making more than one pass.
>
> Here's the code again:
>
> while (<>) {
> $selected = $_ if rand($.) < 1;
> }

However, you don't actually address Roderick's original question;
Now I've actually read the original program [ :-) ] I offer this as
a generalisation of the method:

#! /usr/bin/perl
$n = shift @ARGV; # Number of lines required

for(1..$n) {
push @results, scalar <>;
}

warn "Not enough lines" unless defined $results[$n-1];

while(<>) {
$selected[rand $n] = $_ if rand($.) < $n;
}

print @selected;
__END__

Note that the selection is random, but the ordering of the lines is
semi-random - i.e. probably not the order in which the lines occurred,
and not properly random either.

> This argument shows that if the loop behaves correctly for an N-line
> input, it behaves correctly for an N+1-line input. Since it behaves
> correctly for a 1-line input, it behaves correctly for all possible
> inputs.

I believe this applies to my generalisation, too. But then I'm too lazy
and hubristic to prove it :-)

Ian

Roderick Schertler

unread,

Sep 19, 2001, 11:52:53 AM9/19/01

to

Thanks to everybody for the help. Here's the relevant snippet from the
finished product:

sub run_multi {
my @hit;

while (<>) {
push @hit, [$ARGV, $_];
last if @hit == $N;
}

# be sure to notice EOF, else the following <> will read stdin
return \@hit if @hit < $N;

while (<>) {
if (rand() < $N / $.) {
splice @hit, rand @hit, 1;
push @hit, [$ARGV, $_];
}
}

return \@hit;
}

The whole script is at <URL:http://www.argon.org/~roderick/randline>. The
RS::Handy module it uses is at <URL:http://www.argon.org/~roderick/>.

--
Roderick Schertler
rode...@argon.org

Mark-Jason Dominus

unread,

Sep 19, 2001, 3:25:08 PM9/19/01

to

Ian.Ph...@iname.com:

> However, you don't actually address Roderick's original question;

I addressed it two days ago, in <200109172007...@plover.com> .

tree...@eskimo.com

unread,

Sep 22, 2001, 7:10:24 PM9/22/01

to

>>>>>>>>>>>>>>>>>> Original Message <<<<<<<<<<<<<<<<<<

On 9/18/01, 2:23:35 AM, Ian Phillipps <dont....@myself.com> wrote
regarding Re: pick $n random lines from a file:

> If you know how many there are, just make a list, then use it:

> $linecount = <...total line count...>;

how'bout $linecount = `wc -l <filename>`