
How to randomly select 100 rows from a big text file?


Kevin Qin

unread,
May 25, 2011, 4:50:44 PM5/25/11
to
Hi expert,

I want to verify results in one big text file. However, I do not know
how to randomly select rows (e.g. 100 rows).
Is it possible to do it using AWK?
Would someone please give me one sample?

Thank you,
Kevin

pk

unread,
May 25, 2011, 4:52:37 PM5/25/11
to
On Wed, 25 May 2011 13:50:44 -0700 (PDT) Kevin Qin <sas.r...@gmail.com>
wrote:

One way would be to shuffle the file, then print the first 100 lines of
the result. For example:

shuf file | head -n 100

sort -R file | head -n 100

Note that your system may not have shuf, or a sort that supports -R. They
seem to be available at least under Linux.

Kevin Qin

unread,
May 25, 2011, 10:18:27 PM5/25/11
to
On May 25, 4:52 pm, pk <p...@pk.invalid> wrote:
> On Wed, 25 May 2011 13:50:44 -0700 (PDT) Kevin Qin <sas.run...@gmail.com>

My working system is AIX. There is no shuf and sort -R.
Do you have any other suggestion?

-Kevin

Message has been deleted

goarilla

unread,
May 26, 2011, 4:32:30 AM5/26/11
to

I once did something like this a long time ago;
it was kinda masochistic and stupid:

I used the shell's RANDOM variable, which is not available in all shells
anyway, but anything that can produce random unsigned integers would be
fine. Here is something to demonstrate it:

length=$(wc -l < file)   # redirect so wc prints only the count
length=$(($length+1))

nr=$(($RANDOM%length))
sed -e "$nr"p file   # (you need to be sure $nr is not 0 though)

I had an extra file which had all the previously returned numbers
which i checked for collisions ( i needed unique results ).

I hope it helps.

goarilla

unread,
May 26, 2011, 5:26:36 AM5/26/11
to

Oops:

sed -n -e "$nr"p file

pk

unread,
May 26, 2011, 5:22:58 AM5/26/11
to

Then you can, for example, prepend a random number to each line, sort the
resulting file on the first column, take the first 100 lines and remove the
prefixed numbers (the last two steps can be done in any order). Sample code:

awk 'BEGIN{srand()} { printf "%010d %s\n", int(10000000 * rand()), $0 }' file | \
sort -k1,1 | head -n 100 | cut -d ' ' -f2-

(10000000 is just "a big number"; choose it based on how many lines you
have in the input)

if your awk can't do that, perhaps you can use Perl or your shell's $RANDOM
if it has it, e.g.

while IFS= read -r line; do
  printf "%010d %s\n" "$RANDOM" "$line"
done < file | sort ...

Note that the shell approach seems to be slower.

If you know in advance the number of lines of your file (let's call it
"n", which must be >= 100), you can also do something like

awk -v n="$n" 'BEGIN{
    srand()
    while (count < 100) {
        number = int(rand() * n) + 1
        if (!(number in random)) {
            count++
            random[number]
        }
    }
}
NR in random { print }' file

I think, from faint school memories, that the latter approach isn't exactly
as random as the others, but generally it should be good enough.

goarilla

unread,
May 26, 2011, 5:46:56 AM5/26/11
to

why do you assign IFS (to a space?)
to normalize it?

Janis Papanagnou

unread,
May 26, 2011, 5:49:18 AM5/26/11
to

awk -v size=3000 -v amount=100 '
BEGIN { srand()
    if (amount > size) exit (1);
    sel[0]                           # make sel an array
    while (length(sel) <= amount)
        sel[int(rand()*size)+1]
}
NR in sel
'

This approach keeps original line ordering intact. Is that okay
or should the lines be randomly shuffled?

Janis


Janis Papanagnou

unread,
May 26, 2011, 5:52:59 AM5/26/11
to

Not a space; it's cleared (null-string).

> to normalize it ?

To preserve whitespace...

$ read abc ; echo "$abc"
   hello world
hello world

$ IFS= read abc ; echo "$abc"
   hello world
   hello world

The first string is the input, the second the output.

Janis

pk

unread,
May 26, 2011, 5:54:48 AM5/26/11
to
On Thu, 26 May 2011 09:46:56 +0000 (UTC) goarilla
<kevin....@mtm.DOTremove-thisDOT.kuleuven.DOTbe.invalid> wrote:

> > while IFS= read -r line; do
> > printf "%010d %s\n" $RANDOM "$line"
> > done < file | sort ...
> >
>
> why do you assign IFS ( to space ? )
> to normalize it ?

It's generally considered the safe(st) way to read a file line by line
using the shell, see

http://mywiki.wooledge.org/BashFAQ/001

goarilla

unread,
May 26, 2011, 7:39:45 AM5/26/11
to

Damnit, if I had known this before!
I had to use sed to iterate over a file line by line, because
read would truncate the spaces from the end of the line.

Adam Funk

unread,
May 26, 2011, 10:15:11 AM5/26/11
to
On 2011-05-26, Kevin Qin wrote:

> On May 25, 4:52 pm, pk <p...@pk.invalid> wrote:
>> On Wed, 25 May 2011 13:50:44 -0700 (PDT) Kevin Qin <sas.run...@gmail.com>
>> wrote:
>>
>> > Hi expert,
>>
>> > I want to verify results in one big text file. However, I do not know
>> > how to randomly select rows (e.g. 100 rows).
>> > Is it possible to do it using AWK?
>> > Would someone please give me one sample?
>>
>> One way would be to shuffle the file, then print the first 100 lines of
>> the result. For example:
>>
>> shuf file | head -n 100
>>
>> sort -R file | head -n 100

Will shuf work if the file is too big to load into memory at one time?


>> Note that your system may not have shuf, or a sort that supports -R. They
>> seem to be available at least under Linux.
>
> My working system is AIX. There is no shuf and sort -R.
> Do you have any other suggestion?


I have a python program that does this (but it calls the system's wc
command). If you have python & wc & you're interested, I'll be happy
to post it.


--
svn ci -m 'come back make, all is forgiven!' build.xml

John DuBois

unread,
May 26, 2011, 10:55:19 AM5/26/11
to
In article <irl7mt$oqo$1...@speranza.aioe.org>,

Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>Am 25.05.2011 22:50, schrieb Kevin Qin:
>> Hi expert,
>>
>> I want to verify results in one big text file. However, I do not know
>> how to randomly select rows (e.g. 100 rows).
>> Is it possible to do it using AWK?
>> Would someone please give me one sample?
>
>awk -v size=3000 -v amount=100 '
>BEGIN { srand()
> if (amount > size) exit (1);
> sel[0] # make sel an array
> while (length (sel) <= 100)
> sel[int(rand()*size)+1]
>}
>NR in sel
>'

Note that length() on an array is a gawkism.

This (gawk utility) is less illuminating than a short example, but (like the
above) does do what was requested. It requires that the entire file fit in
memory:
ftp://ftp.armory.com/pub/scripts/randst

Help page:
ftp://ftp.armory.com/pub/scripts/help_pages/randst

John
--
John DuBois spc...@armory.com KC6QKZ/AE http://www.armory.com/~spcecdt/

Kenny McCormack

unread,
May 26, 2011, 10:57:59 AM5/26/11
to
In article <IMadnWE0aflK9EPQ...@speakeasy.net>,
John DuBois <spc...@armory.com> wrote:
...

>Note that length() on an array is a gawkism.

False. It is a TAWKism - copied (belatedly) into GAWK.

--
Windows 95 n. (Win-doze): A 32 bit extension to a 16 bit user interface for
an 8 bit operating system based on a 4 bit architecture from a 2 bit company
that can't stand 1 bit of competition.

Modern day upgrade --> Windows XP Professional x64: Windows is now a 64 bit
tweak of a 32 bit extension to a 16 bit user interface for an 8 bit
operating system based on a 4 bit architecture from a 2 bit company that
can't stand 1 bit of competition.

Keith Thompson

unread,
May 26, 2011, 2:34:56 PM5/26/11
to
Michael Vilain <vil...@NOspamcop.net> writes:
> In article
> <2f4b4e31-f784-446b...@24g2000yqk.googlegroups.com>,
> Kevin Qin <sas.r...@gmail.com> wrote:
[...]

>> My working system is AIX. There is no shuf and sort -R.
>> Do you have any other suggestion?
>
These are both in the GNU coreutils package. Install the gcc compiler,
download them, and build them.

Do you need gcc to build coreutils, or will the native AIX compiler
(xlc?) do the job?

--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Keith Thompson

unread,
May 26, 2011, 2:44:26 PM5/26/11
to
Kevin Qin <sas.r...@gmail.com> writes:
> I want to verify results in one big text file. However, I do not know
> how to randomly select rows (e.g. 100 rows).
> Is it possible to do it using AWK?
> Would someone please give me one sample?

Most of the suggestions here involve sorting or shuffling the file,
which can be slow if it's very large.

I'd probably first compute the number of lines in the file using
"wc -l" (which, unlike sorting, just performs a single pass), then
generate 100 distinct random numbers in the range 1..N, then read
the file again, printing each line whose number is in the list I
just generated.
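A minimal sketch of that plan, assuming a POSIX shell and awk; the input file
and the sample size 100 are placeholders (the demo fakes the input with seq):

```shell
f=$(mktemp)                      # stand-in for the real input file
seq 1 1000 > "$f"

n=$(wc -l < "$f")                # pass 1: a single counting pass
awk -v n="$n" -v k=100 'BEGIN {
    srand()
    while (count < k) {          # draw until we have k distinct indices
        i = int(rand() * n) + 1
        if (!(i in pick)) { pick[i]; count++ }
    }
}
NR in pick' "$f"                 # pass 2: print the selected lines

rm -f "$f"
```

With n much larger than 100, the rejection loop almost never repeats a draw.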

So how do you generate the 100 distinct numbers? If N is
sufficiently large, generating random numbers in the range 1..N and
rejecting any you've already seen is reasonable. If N is, say,
101, that could take a long time, and if N is 99 it would take
forever. You could create a list of all the numbers from 1 to N,
and repeatedly pick a random element and then delete it from the
list, but the list itself could take up substantial space in memory.
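One way to avoid both the rejection loop and the delete step is a partial
Fisher-Yates shuffle over the indices 1..N: swap a random remaining index
into each of the first k slots. It still needs the O(N) index array, but it
never loops forever, even when k equals N. A sketch, with made-up n and k:

```shell
awk -v n=101 -v k=100 'BEGIN {
    srand()
    for (i = 1; i <= n; i++) a[i] = i       # the index list 1..n
    for (i = 1; i <= k; i++) {
        j = i + int(rand() * (n - i + 1))   # random position in a[i..n]
        t = a[i]; a[i] = a[j]; a[j] = t     # swap it into slot i
        print a[i]                          # the i-th selected line number
    }
}'
```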

Note that unless your input is *really* big and/or you're doing this
many times, the time you spend implementing the more sophisticated
solution will more than outweigh the time you save with a faster
algorithm. Think about how scalable the solution actually needs
to be.

goarilla

unread,
May 26, 2011, 4:46:24 PM5/26/11
to
On Thu, 26 May 2011 11:44:26 -0700, Keith Thompson wrote:

> Kevin Qin <sas.r...@gmail.com> writes:
>> I want to verify results in one big text file. However, I do not know
>> how to randomly select rows (e.g. 100 rows). Is it possible to do it
>> using AWK?
>> Would someone please give me one sample?
>
> Most of the suggestions here involve sorting or shuffling the file,
> which can be slow if it's very large.
>
> I'd probably first compute the number of lines in the file using "wc -l"
> (which, unlike sorting, just performs a single pass), then generate 100
> distinct random numbers in the range 1..N, then read the file again,
> printing each line whose number is in the list I just generated.
>

That's kinda what I proposed!

And I do think that the other suggestions are superior.

Janis Papanagnou

unread,
May 26, 2011, 8:35:30 PM5/26/11
to
On 26.05.2011 16:55, John DuBois wrote:
> In article <irl7mt$oqo$1...@speranza.aioe.org>,
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> Am 25.05.2011 22:50, schrieb Kevin Qin:
>>> Hi expert,
>>>
>>> I want to verify results in one big text file. However, I do not know
>>> how to randomly select rows (e.g. 100 rows).
>>> Is it possible to do it using AWK?
>>> Would someone please give me one sample?
>>
>> awk -v size=3000 -v amount=100 '
>> BEGIN { srand()
>> if (amount > size) exit (1);
>> sel[0] # make sel an array
>> while (length (sel) <= 100)
>> sel[int(rand()*size)+1]
>> }
>> NR in sel
>> '
>
> Note that length() on an array is a gawkism.

Sure. I initially wrote another version without using that function; a
plain conservative arithmetic for-loop with an if-statement to check
for duplicates, but I preferred to post the version with the slightly
easier to read code.

Janis


Janis Papanagnou

unread,
May 26, 2011, 8:48:56 PM5/26/11
to

I don't see that you proposed it. I found just this comment: "an extra file
which had all the previously returned numbers", and which was fed by *many*
sed invocations as far as I've understood your informal posting upthread.

WRT Keith's statement of pre-generated numbers, see a coded solution in my
posting some hours earlier.

>
> And I do think that the other suggestions are superior.

You don't mean those with sorting? - Those are certainly inferior! - Compare
computational complexity O(n) versus O(n log n). Or you mean those with many
unnecessary processes? Certainly inferior as well.

Janis

goarilla

unread,
May 27, 2011, 5:39:37 AM5/27/11
to
On Fri, 27 May 2011 02:48:56 +0200, Janis Papanagnou wrote:

> On 26.05.2011 22:46, goarilla wrote:
>> On Thu, 26 May 2011 11:44:26 -0700, Keith Thompson wrote:
>>
>>> Kevin Qin <sas.r...@gmail.com> writes:
>>>> I want to verify results in one big text file. However, I do not know
>>>> how to randomly select rows (e.g. 100 rows). Is it possible to do it
>>>> using AWK?
>>>> Would someone please give me one sample?
>>>
>>> Most of the suggestions here involve sorting or shuffling the file,
>>> which can be slow if it's very large.
>>>
>>> I'd probably first compute the number of lines in the file using "wc
>>> -l" (which, unlike sorting, just performs a single pass), then
>>> generate 100 distinct random numbers in the range 1..N, then read the
>>> file again, printing each line whose number is in the list I just
>>> generated.
>>>
>>>
>> That's kinda what i proposed !
>
> I don't see that you proposed it. I found just this comment: "an extra
> file which had all the previously returned numbers", and which was fed
> by *many* sed invocations as far as I've understood your informal
> posting upthread.
>

shameless link:
http://groups.google.com/group/comp.unix.shell/msg/52369943aef5a57d

Yes, I used sed to display the line since I didn't know about
IFS= read -r, and sed -n -e "$nr"p file was the trick I used to preserve
the entire line.

> WRT Keith's statement of pre-generated numbers, see a coded solution in
> my posting some hours earlier.
>
>
>> And I do think that the other suggestions are superior.
>
> You don't mean those with sorting? - Those are certainly inferior! -
> Compare computational complexity O(n) versus O(n log n). Or you mean
> those with many unnecessary processes? Certainly inferior as well.
>

I'm not talking about computational theory here, but about the fact that
I used wc, sed, and the shell to do what you did in one process.

I especially liked the prepend-a-random-number-then-sort method.

It was a clever trick, kudos.
It was a mind opener.


Kai-Uwe Bux

unread,
May 28, 2011, 7:16:15 AM5/28/11
to
Kevin Qin wrote:

The following gawk script selects 100 random lines in a single pass
(reservoir sampling).

BEGIN {
    srand();
    sel[0];
    selected = 0;
    line = 0;
}
// {
    ++line;
    if (selected < 100) {
        sel[selected] = $0;
        ++selected;
    } else if (rand() * line < 100) {
        sel[int(rand() * 100)] = $0;
    }
}
END {
    for (selected = 0; selected < 100; ++selected)
        print sel[selected];
}

If you want to keep the lines in order, I would prefix them with line
numbers, select 100 lines, sort by the line number prefix, and finally cut
off the line numbers.
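That keep-in-order variant can be sketched like this (a sketch only; the
input is faked with seq, and NR serves as the line-number prefix):

```shell
seq 1 1000 |                     # stand-in for the real input
awk 'BEGIN { srand() }
{
    if (NR <= 100) sel[NR - 1] = NR "\t" $0    # fill the reservoir
    else if (rand() * NR < 100)                # keep this line with prob 100/NR
        sel[int(rand() * 100)] = NR "\t" $0
}
END { for (i = 0; i < 100; i++) print sel[i] }' |
sort -n -k1,1 |                  # restore original order by line number
cut -f2-                         # strip the line-number prefix
```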


Best,

Kai-Uwe Bux
