Submitted for your review: random sampling filters in gawk

steven...@gmail.com

unread,

Jan 2, 2008, 11:44:01 PM1/2/08

to

After becoming a little curious about random sampling and how to do it
without knowing the number of records in advance, I found Waterman's
Algorithm R (by way of Knuth's TAOCP Vol. 2), and Vitter's Algorithm Z
at <http://doi.acm.org/10.1145/3147.3165>.

The nice features of these algorithms are that they take space
proportional to the sample size, and that they work without scanning
the entire file to determine how many records are present. I think
they are appropriate to be implemented as Unix stdin/stdout pipeline
filters.

I wrote both versions up in gawk and figured I'd post them somewhere
where people can poke holes in them, use them, or change them if they
like.

I know that there are a lot of useless comparisons because I used
patterns instead of explicit loops and getline for flow control, but
it's just so darn readable for me this way. :-)

On the other hand, the variable names are not readable. I took them
directly from the algorithm pseudocode to make checking the
implementation a bit easier.

My implementation preserves the input order when it outputs the random
sample. I think this behavior is more useful in shell, where you work
with things like uniq or join that require the inputs to have some
sort of order.

Here's a comparison timing on my system.

scully:~/bin steve$ time bzcat words.bz2 | vitter-sample.awk -v
n=10000 > /dev/null

real 0m54.739s
user 0m28.725s
sys 0m10.722s
scully:~/bin steve$ time bzcat words.bz2 | waterman-sample.awk -v
n=10000 > /dev/null

real 1m35.501s
user 0m50.242s
sys 0m19.479s
scully:~/bin steve$ bzcat words.bz2 | wc -l
11746850

Comments are welcome.

-- Steve

# Waterman's Algorithm R for random sampling
# by way of Knuth's The Art of Computer Programming, volume 2

BEGIN {
if (!n) {
print "Usage: sample.awk -v n=[size]"
exit
}
t = n
srand()
}

NR <= n {
pool[NR] = $0
places[NR] = NR
next
}

NR > n {
t++
M = int(rand()*t) + 1
if (M <= n) {
READ_NEXT_RECORD(M)
}
}

END {
if (NR < n) {
print "sample.awk: Not enough records for sample" > "/dev/
stderr"
exit
}
# gawk needs a numeric sort function
# since it doesn't have one, zero-pad and sort alphabetically
pad = length(NR)
for (i in pool) {
new_index = sprintf("%0" pad "d", i)
newpool[new_index] = pool[i]
}
x = asorti(newpool, ordered)
for (i = 1; i <= x; i++)
newpool[ordered[i]]
}

function READ_NEXT_RECORD(idx) {
rec = places[idx]
delete pool[rec]
pool[NR] = $0
places[idx] = NR
}

# Vitter's Algorithm Z for random sampling
# http://doi.acm.org/10.1145/3147.3165 #

BEGIN {
if (!n) {
print "Usage: sample.awk -v n=[size]"
exit
}
T = 10 # tuneable per (Vitter 85)
srand()
nplusone = n + 1
}

NR == nplusone {
pool[NR] = $0
places[NR - 1] = NR
t = n
thresh = T*n
num = 0
}

NR > n && t > thresh {
if (!W) {
W = exp(-log(rand())/n)
term = t - n + 1
}
while (1) {
U = rand()
X = t*(W - 1.0)
S = int(X)
lhs = exp(log(((U*(((t + 1)/term)^2))*(term + S))/(t + X))/n)
rhs = (((t + X)/(term + S))*term)/t
if (lhs <= rhs) {
W = rhs/lhs
break
}
y = (((U*(t + 1))/term)*(t + S + 1))/(t + X)
if (n < S) {
denom = t
numer_lim = term + S
} else {
denom = t - n + S
numer_lim = t + 1
}
for (numer = t + S; numer > numer_lim; numer--) {
y = (y*numer)/denom
denom--
}
W = exp(-log(rand())/n)
if (exp(log(y)/n) <= (t + X)/t)
break
}
if (SKIP_RECORDS(S))
READ_NEXT_RECORD(int(n*rand()))
else
exit
t = t + S + 1
term = term + S + 1
next
}

NR > n && t <= thresh {
V = rand()
S = 0
t++
num++
quot = num/t
while (quot > V) {
S++
t++
num++
quot = (quot*num)/t
}
if (SKIP_RECORDS(S))
READ_NEXT_RECORD(int(n * rand()))
else
exit
next
}

NR <= n {
pool[NR] = $0
places[NR - 1] = NR
next
}

END {
if (NR < n) {
print "sample.awk: Not enough records for sample" > "/dev/
stderr"
exit
}
# gawk needs a numeric sort function
# since it doesn't have one, zero-pad and sort alphabetically
pad = length(NR)
for (i in pool) {
new_index = sprintf("%0" pad "d", i)
newpool[new_index] = pool[i]
}
x = asorti(newpool, ordered)
for (i = 1; i <= x; i++)
newpool[ordered[i]]
}

function SKIP_RECORDS(skip) {
retval = 1
for (i = 0; i < skip; i++)
retval = getline
return retval > 0
}

function READ_NEXT_RECORD(idx) {
rec = places[idx]
delete pool[rec]
pool[NR] = $0
places[idx] = NR
}

steven...@gmail.com

unread,

Jan 3, 2008, 12:00:41 AM1/3/08

to

I didn't use the Google Groups form correctly and the posted code was
not valid. Hopefully the below comes out right.

# Waterman's Algorithm R for random sampling
# by way of Knuth's The Art of Computer Programming, volume 2

BEGIN {
if (!n) {
print "Usage: sample.awk -v n=[size]"
exit
}
t = n
srand()
}

NR <= n {
pool[NR] = $0
places[NR] = NR
next
}

NR > n {
t++
M = int(rand()*t) + 1
if (M <= n) {
READ_NEXT_RECORD(M)
}
}

END {
if (NR < n) {
print "sample.awk: Not enough records for sample" \
> "/dev/stderr"

exit
}
# gawk needs a numeric sort function
# since it doesn't have one, zero-pad and sort alphabetically
pad = length(NR)
for (i in pool) {
new_index = sprintf("%0" pad "d", i)
newpool[new_index] = pool[i]
}
x = asorti(newpool, ordered)
for (i = 1; i <= x; i++)

print newpool[ordered[i]]
}

END {
if (NR < n) {
print "sample.awk: Not enough records for sample" \
> "/dev/stderr"

exit
}
# gawk needs a numeric sort function
# since it doesn't have one, zero-pad and sort alphabetically
pad = length(NR)
for (i in pool) {
new_index = sprintf("%0" pad "d", i)
newpool[new_index] = pool[i]
}
x = asorti(newpool, ordered)
for (i = 1; i <= x; i++)

print newpool[ordered[i]]

William James

unread,

Jan 3, 2008, 3:25:25 AM1/3/08

to

On Jan 2, 11:00 pm, "steven.hu...@gmail.com" <steven.hu...@gmail.com>
wrote:

> I didn't use the Google Groups form correctly and the posted code was
> not valid. Hopefully the below comes out right.
>
> # Waterman's Algorithm R for random sampling
> # by way of Knuth's The Art of Computer Programming, volume 2
>
> BEGIN {
> if (!n) {

Suggestion. Indent 2 spaces instead of 4.

[...]

> if (M <= n) {
> READ_NEXT_RECORD(M)
> }

This could be changed to

if (M <= n)
READ_NEXT_RECORD(M)

[...]

> x = asorti(newpool, ordered)
> for (i = 1; i <= x; i++)
> print newpool[ordered[i]]

This could be changed to

asorti(newpool, ordered)
for (i = 1; i in ordered; i++)
print newpool[ordered[i]]

Luuk

unread,

Jan 3, 2008, 4:28:56 AM1/3/08

to

<steven...@gmail.com> schreef in bericht
news:f31214a5-f659-4d9a...@m34g2000hsf.googlegroups.com...

i tried something

$ wc -l american
212710 american

$ while true; do awk -v n=1 vitter-sample.awk american ; done
unpersonalized
unpersonalized
kerosene
kerosene
kerosene
kerosene
kerosene
Kiyoshi
Kiyoshi
Kiyoshi
Kiyoshi
Kiyoshi
Navratilova's
Navratilova's
Navratilova's
Navratilova's
Navratilova's
passport's

$ while true; do awk -v n=1 waterman-sample.awk american ; done
physio
physio
physio
AA
petulant
AA
petulant
AA
petulant
AA
petulant
AA
petulant
AA
petulant
AA
petulant
AA
petulant
AA
incineration's
AA
incineration's

At 1st i was amazed by the randomness of this script,
but i think i need an expert on randomness to tell why above is happening...
(but still, thanks for sharing the scripts!)

Janis Papanagnou

unread,

Jan 3, 2008, 4:37:38 AM1/3/08

to

What would be the advantage? (Just saving a variable?)

I would assume that the 'in' test operator is of complexity O(log N),
i.e. a balanced tree lookup, so that the for loop would be O(N*log N),
while the simple scalar comparison that the OP used is just plain O(N).

The counted loop seems to be always preferable in case of contiguous
filled arrays.

Janis

steven...@gmail.com

unread,

Jan 3, 2008, 6:25:09 AM1/3/08

to

On Jan 3, 4:28 am, "Luuk" <l...@invalid.lan> wrote:

> At 1st i was amazed by the randomness of this script,
> but i think i need an expert on randomness to tell why above is happening...
> (but still, thanks for sharing the scripts!)

srand() seeds the random number generator with the time in seconds, so
if your computer is fast enough that behavior of repeated selection
could happen. How does it work if you put a sleep 1 in your scripts?

On the other hand there looks to be an edge case with sample size n=1
as well. I would not be surprised at all. I can look into it a bit
more and post a fix if I find the bug in my implementation.

I think your script labels may be switched. I am seeing funny behavior
with the Vitter algorithm and sample size n=1. The simpler one seems
to be fine.

Thanks!

-- Steve

Luuk

unread,

Jan 3, 2008, 8:35:06 AM1/3/08

to

<steven...@gmail.com> schreef in bericht
news:0aac594c-3223-4281...@l6g2000prm.googlegroups.com...

Thanks!

-- Steve

you're right, i switched the 2 versions...

you are probably right about the srand(), adding the 'sleep 1' solves the
problem

but the vitter-sample.awk with option 'n=1' sometimes returns 2 rows
'AA' is the second line in the file american

$ while true; do echo "==="; awk -v n=1 -f vitter-sample.awk american ;
sleep 1; done
===
sectionally
===
AA
tangram
===
Lissie's
===
Halvaard
===
Baber's
===
AA
earaches

Anton Treuenfels

unread,

Jan 4, 2008, 12:07:19 AM1/4/08

to

"Janis Papanagnou" <Janis_Pa...@hotmail.com> wrote in message
news:fliad3$qvg$1...@online.de...

> The counted loop seems to be always preferable in case of contiguous
> filled arrays.

It certainly is in TAWK V4. Loops of any size execute up to three times
faster while traversing such arrays. Though I had generally assumed that
this was due to the "in" operator triggering an array sort before beginning
the iteration.

- Anton Treuenfels

mjc

unread,

Jan 4, 2008, 3:12:40 PM1/4/08

to

On Jan 3, 12:25 am, William James <w_a_x_...@yahoo.com> wrote:

> This could be changed to
>
> asorti(newpool, ordered)
> for (i = 1; i in ordered; i++)
> print newpool[ordered[i]]

I don't know what the for loop as written would do. It might skip some
values, do some more than once, or even get into an infinite loop.

Use: for(i in ordered)

Ed Morton

unread,

Jan 4, 2008, 3:26:25 PM1/4/08

to

No, it wouldn't do any of that. It'd just loop through all the values in "ordered".

> Use: for(i in ordered)

That wouldn't preserve the order.

Ed.

Ed Morton

unread,

Jan 4, 2008, 4:45:48 PM1/4/08

to

On 1/2/2008 11:00 PM, steven...@gmail.com wrote:
> I didn't use the Google Groups form correctly and the posted code was
> not valid. Hopefully the below comes out right.
>
>
> # Waterman's Algorithm R for random sampling
> # by way of Knuth's The Art of Computer Programming, volume 2
>
> BEGIN {
> if (!n) {
> print "Usage: sample.awk -v n=[size]"

Is zero a valid size? What about "seven"? You could tweak the above test to be a
bit more robust, e.g.:

if (!(n+0))

> exit
> }
> t = n
> srand()
> }
>
> NR <= n {
> pool[NR] = $0
> places[NR] = NR
> next

No need for a "next" above or no need for the opposite condition "NR>n" below.

> }
>
> NR > n {
> t++
> M = int(rand()*t) + 1
> if (M <= n) {
> READ_NEXT_RECORD(M)
> }
> }
>
> END {
> if (NR < n) {
> print "sample.awk: Not enough records for sample" \
> > "/dev/stderr"
> exit
> }
> # gawk needs a numeric sort function
> # since it doesn't have one, zero-pad and sort alphabetically
> pad = length(NR)

I may be having a dense moment, but I don't understand why you'd want to pad
each record to the size of the number of records in the file. Wouldn't you want
to instead pad to the length of the longest record?

Ed.

Kenny McCormack

unread,

Jan 4, 2008, 4:54:46 PM1/4/08

to

In article <0fa66663-05e3-430b...@y5g2000hsf.googlegroups.com>,

As Ed notes, the above is perfectly well-defined, on the assumption that
the array indexes are known to be (exclusively) numeric and to start at 1
(and not be infinite).

Still, the above is odd syntax and definitely "looks wrong". But, a
clever workaround for when you are using a defective AWK that doesn't do
array sorting.

steven...@gmail.com

unread,

Jan 5, 2008, 11:49:05 PM1/5/08

to

On Jan 4, 4:45 pm, Ed Morton <mor...@lsupcaemnt.com> wrote:
> Is zero a valid size? What about "seven"? You could tweak the above test to be a
> bit more robust, e.g.:
>
> if (!(n+0))

Good idea -- thanks!

> > # gawk needs a numeric sort function
> > # since it doesn't have one, zero-pad and sort alphabetically
> > pad = length(NR)
>
> I may be having a dense moment, but I don't understand why you'd want to pad
> each record to the size of the number of records in the file. Wouldn't you want
> to instead pad to the length of the longest record?

I'm padding/sorting the indices, not the input data. I want to output
the data in the same order it was received.

pad = length(NR)
for (i in pool) {
new_index = sprintf("%0" pad "d", i)
newpool[new_index] = pool[i]
}
x = asorti(newpool, ordered)
for (i = 1; i <= x; i++)
print newpool[ordered[i]]

The only way I could see to order it within gawk alone was to copy the
original "pool" array to "newpool", which uses the indices from "pool"
converted to a format that alphabetic sort would work on.

Haven't yet had time to look at the strange behavior Luuk saw... maybe
tomorrow.

-- Steve