Generate Random Numbers for RDD

Wenlei Xie

Jul 30, 2013, 1:43:34 PM
to spark...@googlegroups.com
Hi,

For my data processing job, I would like to attach some random numbers to an existing RDD, say

val rand = new scala.util.Random
val newRDD = oldRDD.map(x => (x, Array.fill(10)(rand.nextDouble _)))

However, this doesn't work since the random generator is not serializable. The current solution I have is to generate these numbers first and then use SparkContext.parallelize to make them an RDD. I am wondering if there is any built-in support in Spark for generating random numbers?

Thank you!

Best,
Wenlei

Reynold Xin

Jul 30, 2013, 1:50:51 PM
to spark...@googlegroups.com
You can do this:

val newRDD = oldRDD.mapPartitions { iter =>
  val rand = new scala.util.Random
  iter.map(x => (x, Array.fill(10)(rand.nextDouble _)))
}

Be aware that Java's random number generator has a lock. 


--
Reynold Xin, AMPLab, UC Berkeley




Ian O'Connell

Jul 30, 2013, 2:05:53 PM
to spark...@googlegroups.com
To be explicit: Reynold's example won't hit lock contention, since a new instance is created for each partition and each partition is processed within a single thread.

Mridul Muralidharan

Jul 30, 2013, 2:33:49 PM
to spark...@googlegroups.com
Just one note though [*]: be careful when using random numbers in an RDD. In case of failures and task re-execution, you can end up with bugs due to a different set of numbers being returned.


Regards,
Mridul

[*] I seem to keep harping on about this in multiple forums! Should probably stop :-)

Mark Hamstra

Jul 30, 2013, 2:48:45 PM
to spark...@googlegroups.com
There are also relevant examples using mapWith, flatMapWith, and filterWith in RDDSuite.
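For reference, a sketch in the spirit of those tests, assuming the mapWith signature from that era (a per-partition constructor taking the partition index, then the mapping function in a second argument list):

import scala.util.Random

// One Random per partition, seeded by the partition index, so the
// numbers are reproducible across runs and task re-executions.
val withRandoms = oldRDD.mapWith(
  (index: Int) => new Random(index + 42))(
  (x, prng) => (x, Array.fill(10)(prng.nextDouble)))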


Ian O'Connell

Jul 30, 2013, 2:54:11 PM
to spark...@googlegroups.com
@Mridul, something like

val myAppSeed = 91234

val newRDD = myRDD.mapPartitionsWithIndex { (indx, iter) =>
  val rand = new scala.util.Random(indx + myAppSeed)
  iter.map(x => (x, Array.fill(10)(rand.nextDouble)))
}

instead seems like a decent choice for getting stable behavior between runs.


Also, it should be noted that the original post wouldn't work (and hence the examples fixing it wouldn't either); there is a second serialization problem, in that

Array.fill(10)(rand.nextDouble _)

results in an array of partially applied functions rather than doubles, so the signature of the RDD ends up being

spark.RDD[(Int, Array[() => Double])]

and since each of those function values closes over the generator, this gives an error serializing Random too.
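A minimal illustration of the difference (the trailing underscore eta-expands nextDouble into a function value instead of calling it):

val rand = new scala.util.Random(42)

// With the underscore, fill stores ten unevaluated () => Double values,
// each closing over (and dragging along) the Random instance:
val fns: Array[() => Double] = Array.fill(10)(rand.nextDouble _)

// Without it, nextDouble is called ten times and only Doubles are stored:
val nums: Array[Double] = Array.fill(10)(rand.nextDouble)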

Mark Hamstra

Jul 30, 2013, 2:58:42 PM
to spark...@googlegroups.com
Or, you have bugs that are exposed by using different random numbers -- which isn't necessarily a bad thing.  There are, after all, testing frameworks built around the generation of randomized inputs.  If catching bugs is your intent, though, it makes sense to make sure that you can reproduce the failing test cases.

Mridul Muralidharan

Jul 30, 2013, 3:02:12 PM
to spark...@googlegroups.com
The issue is not about reproducibility of bugs, but about incorrect usage of the exposed idioms.
What Ian proposes is a more appropriate way to use random numbers in Spark (and in Hadoop, etc., btw!).

Regards,
Mridul

Patrick Wendell

Jul 30, 2013, 3:04:47 PM
to spark...@googlegroups.com
Hey Ian - I thought the issue was that there was actually a static lock that all Random objects use within a JVM. That would mean this still has the locking problem. Or is it just that a single Random() object creates an internal lock (in which case using a per-partition object is fine)? I forget.

Reynold Xin

Jul 30, 2013, 3:04:53 PM
to spark...@googlegroups.com
Actually somebody should write a blog post about using random numbers in Spark :)


--
Reynold Xin, AMPLab, UC Berkeley




Mridul Muralidharan

Jul 30, 2013, 3:07:29 PM
to spark...@googlegroups.com
I think there are enough posts about using random numbers in Hadoop/Pig/etc. - most, if not all, of them should be transferable to Spark too :-)
Well, there used to be a lot earlier - can't seem to find many now!

Regards,
Mridul

Josh Rosen

Jul 30, 2013, 3:10:49 PM
to spark...@googlegroups.com

Ian O'Connell

Jul 30, 2013, 3:11:04 PM
to spark...@googlegroups.com
This has come up on this list a few times; every time I forget it all and need to look it up again...

The relevant section of the java.util.Random javadoc:

"Instances of java.util.Random are threadsafe. However, the concurrent use of the same java.util.Random instance across threads may encounter contention and consequent poor performance. Consider instead using ThreadLocalRandom in multithreaded designs."

suggesting the lock is per-instance, not static.
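For what it's worth, a minimal sketch of the ThreadLocalRandom route the javadoc suggests (assuming Java 7+ on the cluster; note it cannot be seeded, so this trades reproducibility for lock-free generation - see Mridul's caveat above):

import java.util.concurrent.ThreadLocalRandom

// current() returns a per-thread generator, so there is no shared lock
// to contend on even when several tasks run in the same JVM. Results
// will differ between runs and task re-executions, though.
val newRDD = oldRDD.map { x =>
  (x, Array.fill(10)(ThreadLocalRandom.current().nextDouble()))
}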

Mark Hamstra

Jul 30, 2013, 3:13:10 PM
to spark...@googlegroups.com
That's what the issue was about (and I see nothing but agreement on the basic nature of how random numbers should be generated in RDDs) until you raised a warning about "bugs due to different numbers".  That just strikes me as an odd way to view things, because I see it as far more likely that bugs would be due to prior coding errors which could be exposed by different sets of inputs than it is likely that the bugs themselves would be due to the random numbers.  Exposing such bugs isn't something to be warned away from, but to be encouraged.

Anyway, enough quibbling. :)

Wenlei Xie

Jul 30, 2013, 3:30:47 PM
to spark...@googlegroups.com, i...@ianoconnell.com
Thank you! This looks very useful :).

BTW: This should probably be included in some examples or the Spark FAQ, to save someone like me from asking questions about this again and again :P

Best,
Wenlei