Performance of fixed-sample for larger samples

Stefan Hübner

unread,

Sep 6, 2012, 10:12:28 AM9/6/12

to cascal...@googlegroups.com

Hi,

I've tried to use fixed-sample to randomly extract 100K out of 1B
tuples. There were some dozen million tuples per mapper. What I found
was, that the sampling was enbearably slow, ie. were the base generator
would take a couple of minutes to get through all data, fixed-sample on
top of that generator took more than 24 hours. It seems, that for larger
sample sizes fixed-sample becomes very slow. I was trying to dig in and
understag it's implementation. I guess, c/limit is the bottleneck, but I
couldn't wrap my had around defparallelbuf to understand what really
goes on.

What makes fixed-sample so slow on large sample sizes?

-Stefan

Marshall T. Vandegrift

unread,

Sep 6, 2012, 12:41:30 PM9/6/12

to cascal...@googlegroups.com

sthu...@googlemail.com (Stefan Hübner)
writes:

> What makes fixed-sample so slow on large sample sizes?

The implementation in effect causes each mapper to generate a random
sample of the desired size, then send that sample to a single reducer.
The reducer selects the final sample from that -- and actually does so
by just selecting the first <n> tuples (sorted by a random number) --
but for Cascading reasons I don't entirely understand still has to step
through every tuple received.

So the impact is that `fixed-sample` performance is O(m*n), where <m> is
the number of map tasks and <n> is your desired sample size.

If you're willing to assume that tuples are already randomly distributed
throughout your map task input blocks, you could write an alternate
version which selected only n/m tuples in each mapper. And then you
wouldn't even need a reduce step.

I believe there's a more efficient global sample method for MapReduce,
but I'm failing to recall the details at the moment -- I'll reply again
if I happen to refresh them from swap.

--
Marshall T. Vandegrift <lla...@damballa.com>
Damballa Staff Software Engineer | 518.859.4559m

Nathan Marz

unread,

Sep 13, 2012, 3:17:49 AM9/13/12

to cascal...@googlegroups.com

That's correct. fixed-sample is uniformly random regardless of the distribution of tuples across your input files. You could make a faster version that only selected m/n from each input file, but that wouldn't have the same statistical properties (a tuple coming from a file with few tuples is far more likely to be chosen than one that comes from a large file, so your sample is no longer uniform).

--
Twitter: @nathanmarz
http://nathanmarz.com

Reply all

Reply to author

Forward