sthu...@googlemail.com (Stefan Hübner) writes:
> What makes fixed-sample so slow on large sample sizes?
In effect, the implementation has each mapper generate a random
sample of the desired size, then send that sample to a single reducer.
The reducer selects the final sample from those -- and actually does so
by just taking the first <n> tuples (sorted by a random number) --
but, for Cascading reasons I don't entirely understand, it still has to
step through every tuple it receives.
So the impact is that `fixed-sample` performance is O(m*n), where <m> is
the number of map tasks and <n> is your desired sample size: the single
reducer receives up to <n> tuples from each of the <m> mappers.
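
To make the cost concrete, here is a rough single-process simulation of the scheme described above, in plain Python. It is only a sketch, not the actual Cascalog/Cascading code: the names `mapper_sample` and `reducer_sample` are made up for illustration, and each "map task" is just an in-memory list.

```python
import heapq
import random

def mapper_sample(block, n, rng):
    """Hypothetical mapper: tag every tuple with a random key and keep
    the n tuples with the smallest keys (a random sample of size n)."""
    tagged = ((rng.random(), t) for t in block)
    return heapq.nsmallest(n, tagged, key=lambda kt: kt[0])

def reducer_sample(mapper_outputs, n):
    """Hypothetical single reducer: gather every mapper's sample, sort by
    the random key, and keep the first n. Also reports how many tuples
    it had to step through."""
    received = [kt for out in mapper_outputs for kt in out]
    received.sort(key=lambda kt: kt[0])
    return [t for _, t in received[:n]], len(received)

rng = random.Random(0)
m, n = 8, 100                      # 8 map tasks, desired sample size 100
blocks = [list(range(i * 1000, (i + 1) * 1000)) for i in range(m)]

sample, seen = reducer_sample([mapper_sample(b, n, rng) for b in blocks], n)
print(len(sample), seen)           # sample size, tuples seen by the reducer
```

With 8 blocks and n = 100 the reducer steps through 800 tuples to emit 100, which is the m*n factor above.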
If you're willing to assume that tuples are already randomly distributed
throughout your map task input blocks, you could write an alternate
version which selects only n/m tuples in each mapper. Then you
wouldn't even need a reduce step.
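
A minimal sketch of that map-only variant, again as a plain Python simulation rather than real Cascalog code. The helper names `mapper_quota` and `map_only_sample` are hypothetical, and I'm assuming each mapper knows <m> and its own index so that any remainder of n/m can be handed out to the first few mappers. It also assumes roughly equal-sized blocks; otherwise tuples in small blocks are over-represented.

```python
import random

def mapper_quota(n, m, mapper_index):
    """Share of the n-tuple sample owed by one mapper: n // m each,
    with the remainder spread over the lowest-numbered mappers."""
    base, extra = divmod(n, m)
    return base + (1 if mapper_index < extra else 0)

def map_only_sample(blocks, n, rng):
    """Each simulated mapper samples its quota from its own block;
    the outputs are simply concatenated, with no reduce step."""
    m = len(blocks)
    sample = []
    for i, block in enumerate(blocks):
        k = min(mapper_quota(n, m, i), len(block))  # cannot exceed block size
        sample.extend(rng.sample(block, k))
    return sample

rng = random.Random(0)
blocks = [list(range(i * 1000, (i + 1) * 1000)) for i in range(8)]
result = map_only_sample(blocks, 100, rng)
print(len(result))
```

Each mapper emits only about n/m tuples, so the total output is <n> tuples rather than m*n.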
I believe there's a more efficient global sample method for MapReduce,
but I'm failing to recall the details at the moment -- I'll reply again
if I happen to refresh them from swap.
--
Marshall T. Vandegrift <lla...@damballa.com>
Damballa Staff Software Engineer | 518.859.4559m