Generate Random Numbers for RDD

Wenlei Xie

Jul 30, 2013, 1:43:34 PM
to spark...@googlegroups.com
Hi,

For my data processing job, I would like to attach some random numbers to an existing RDD, say

val rand = new scala.util.Random
val newRDD = oldRDD.map(x => (x, Array.fill(10)(rand.nextDouble _)))

However, this doesn't work since the random generator is not serializable. The current solution I have is to generate these numbers first and then use SparkContext.parallelize to make them an RDD. I am wondering if there is any built-in support in Spark for generating random numbers?

Thank you!

Best,
Wenlei

Reynold Xin

Jul 30, 2013, 1:50:51 PM
to spark...@googlegroups.com
You can do this:

val newRDD = oldRDD.mapPartitions { iter =>
  val rand = new scala.util.Random
  iter.map(x => (x, Array.fill(10)(rand.nextDouble _)))
}

Be aware that Java's random number generator has a lock. 


--
Reynold Xin, AMPLab, UC Berkeley




Ian O'Connell

Jul 30, 2013, 2:05:53 PM
to spark...@googlegroups.com
To be explicit: Reynold's example won't hit lock contention, since a new instance is created for each partition and each partition is processed within a single thread.

Mridul Muralidharan

Jul 30, 2013, 2:33:49 PM
to spark...@googlegroups.com
Just one note though [*]: be careful when using random numbers in an RDD. In case of failures and task re-execution, you can end up with bugs due to a different set of numbers being returned.


Regards,
Mridul

[*] I seem to keep harping on about this in multiple forums! Should probably stop :-)

Mark Hamstra

Jul 30, 2013, 2:48:45 PM
to spark...@googlegroups.com
There are also relevant examples using mapWith, flatMapWith, and filterWith in RDDSuite.
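For reference, a sketch in the spirit of those tests, assuming the mapWith signature from that era (a per-partition constructor taking the partition index, then the mapping function in a second argument list):

import scala.util.Random

// One Random per partition, seeded by the partition index, so the
// numbers are reproducible across runs and task re-executions.
val withRandoms = oldRDD.mapWith(
  (index: Int) => new Random(index + 42))(
  (x, prng) => (x, Array.fill(10)(prng.nextDouble)))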


Ian O'Connell

Jul 30, 2013, 2:54:11 PM
to spark...@googlegroups.com
@Mridul, something like

val myAppSeed = 91234

val newRDD = myRDD.mapPartitionsWithIndex { (indx, iter) =>
  val rand = new scala.util.Random(indx + myAppSeed)
  iter.map(x => (x, Array.fill(10)(rand.nextDouble)))
}

instead seems like a decent choice for getting stable behavior between runs.


Also, it should be noted that the original post wouldn't work (and hence the examples fixing it wouldn't either); there is a second serialization problem, in that

Array.fill(10)(rand.nextDouble _)

results in an array of partially applied functions rather than doubles, so the signature of the RDD ends up being

spark.RDD[(Int, Array[() => Double])]

and since each of those function values closes over the generator, this gives an error serializing Random too.
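A minimal illustration of the difference (the trailing underscore eta-expands nextDouble into a function value instead of calling it):

val rand = new scala.util.Random(42)

// With the underscore, fill stores ten unevaluated () => Double values,
// each closing over (and dragging along) the Random instance:
val fns: Array[() => Double] = Array.fill(10)(rand.nextDouble _)

// Without it, nextDouble is called ten times and only Doubles are stored:
val nums: Array[Double] = Array.fill(10)(rand.nextDouble)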

Mark Hamstra

Jul 30, 2013, 2:58:42 PM
to spark...@googlegroups.com
Or, you have bugs that are exposed by using different random numbers -- which isn't necessarily a bad thing.  There are, after all, testing frameworks built around the generation of randomized inputs.  If catching bugs is your intent, though, it makes sense to make sure that you can reproduce the failing test cases.

Mridul Muralidharan

Jul 30, 2013, 3:02:12 PM
to spark...@googlegroups.com
The issue is not about reproducibility of bugs, but about incorrect usage of the exposed idioms.
What Ian proposes is a more appropriate way to use random numbers in Spark (and in Hadoop, etc., btw!).

Regards,
Mridul

Patrick Wendell

Jul 30, 2013, 3:04:47 PM
to spark...@googlegroups.com
Hey Ian - I thought the issue was that there was actually a static lock that all Random objects use within a JVM. That would mean this still has the locking problem. Or is it just that a single Random() object creates an internal lock (in which case using a per-partition object is fine)? I forget.

Reynold Xin

Jul 30, 2013, 3:04:53 PM
to spark...@googlegroups.com
Actually somebody should write a blog post about using random numbers in Spark :)


--
Reynold Xin, AMPLab, UC Berkeley




Mridul Muralidharan

Jul 30, 2013, 3:07:29 PM
to spark...@googlegroups.com
I think there are enough posts about using random numbers in Hadoop/Pig/etc. - most, if not all, of them should be transferable to Spark too :-)
Well, there used to be a lot earlier - can't seem to find many now!

Regards,
Mridul

Josh Rosen

Jul 30, 2013, 3:10:49 PM
to spark...@googlegroups.com

Ian O'Connell

Jul 30, 2013, 3:11:04 PM
to spark...@googlegroups.com
This has come up on this list a few times; every time I forget it all and need to look it up again...

The relevant section of the java.util.Random javadoc:

"Instances of java.util.Random are threadsafe. However, the concurrent use of the same java.util.Random instance across threads may encounter contention and consequent poor performance. Consider instead using ThreadLocalRandom in multithreaded designs."

suggesting the lock is per-instance, not static.
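For what it's worth, a minimal sketch of the ThreadLocalRandom route the javadoc suggests (assuming Java 7+ on the cluster; note it cannot be seeded, so this trades reproducibility for lock-free generation - see Mridul's caveat above):

import java.util.concurrent.ThreadLocalRandom

// current() returns a per-thread generator, so there is no shared lock
// to contend on even when several tasks run in the same JVM. Results
// will differ between runs and task re-executions, though.
val newRDD = oldRDD.map { x =>
  (x, Array.fill(10)(ThreadLocalRandom.current().nextDouble()))
}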

Mark Hamstra

Jul 30, 2013, 3:13:10 PM
to spark...@googlegroups.com
That's what the issue was about (and I see nothing but agreement on the basic nature of how random numbers should be generated in RDDs) until you raised a warning about "bugs due to different numbers".  That just strikes me as an odd way to view things, because I see it as far more likely that bugs would be due to prior coding errors which could be exposed by different sets of inputs than it is likely that the bugs themselves would be due to the random numbers.  Exposing such bugs isn't something to be warned away from, but to be encouraged.

Anyway, enough quibbling. :)

Wenlei Xie

Jul 30, 2013, 3:30:47 PM
to spark...@googlegroups.com, i...@ianoconnell.com
Thank you! This looks very useful :).

BTW: This should probably be included in some examples or the Spark FAQ, to save someone like me from asking questions about this again and again :P

Best,
Wenlei