Accessing SparkContext within RDD transformation functions


angshu rai

May 20, 2013, 5:46:07 AM
to spark...@googlegroups.com
Hi All,
I am trying to solve a problem using Spark where I need a nested RDD transformation, in the following sense:
RDD1.map(generate a new RDD2 using data in RDD1 and other constructor args -> RDD2.reduce() -> return resultant RDD)
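
Roughly, in Scala, the shape of what I am trying to do is the following (a toy sketch, with Ints standing in for my actual data and sc being the driver's SparkContext):

    val rdd1 = sc.parallelize(1 to 10)
    val results = rdd1.map { x =>
      // this closure runs on a worker, where no SparkContext is available,
      // so both capturing sc and creating a new one fail as described below
      val rdd2 = sc.parallelize(1 to x)   // the step that needs a SparkContext
      rdd2.reduce(_ + _)
    }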

So, in order to create RDD2 I need a SparkContext (to call parallelize). I tried the following two approaches and ended up with the corresponding exceptions:

1. pass the parent SC as a constructor argument to the map function class for RDD1 -> this led to a NotSerializableException (I also tried putting the SC in a wrapper class that implements Serializable, but that didn't help)
2. create a new SC within the map function class -> this led to a BindException saying the address is already in use

Can someone please give some pointers on how this can be resolved? Essentially, I need a SparkContext within a map function.

Thanks
-Angshu

angshu rai

May 20, 2013, 8:39:32 AM
to spark...@googlegroups.com
Just to add: I also tried having a global static final SC, but it led to an ExceptionInInitializerError.

Josh Rosen

May 20, 2013, 10:58:33 AM
to spark...@googlegroups.com

Mark Hamstra

May 20, 2013, 12:55:37 PM
to spark...@googlegroups.com
Why do you need to call parallelize?  That is something that really should be avoided, since it implies moving all of the data from the driver into the cluster -- and that, in turn, implies a very expensive consumption of network resources if you're working with any sizable amount of data.  (And if you don't have a lot of data, then why are you using a data-oriented cluster framework in the first place?)  The parallelize operation should really only be used in tests or small-scale explorations.  Once you intend to use your Spark cluster in production and at scale, your code should rely neither on handling large amounts of data in the driver process nor on moving large amounts of data between nodes.

In other words, if your data are already distributed across the cluster within RDD1, then you need to find a way to transform RDD1 directly into RDD2 without moving the data back to the driver just so that you can call parallelize and move it back to the cluster.  And as Josh pointed out, creating new, nested driver processes/SparkContexts on the worker nodes is not an option.  Ideally, your transformations of RDD1 will not only avoid the very expensive back-and-forth of data between the driver and the workers, but will also retain the existing partitioning of RDD1 in order to avoid moving data between worker nodes.
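
To make that concrete, here is a rough sketch, with a made-up expand function standing in for whatever per-record computation you would otherwise have used to build RDD2 (rdd1 is assumed to be an RDD[Int]):

    // hypothetical per-record derivation; expand stands in for your real logic
    def expand(x: Int): Seq[Int] = Seq(x, x * 2)

    // the round trip to avoid:
    //   val rdd2 = sc.parallelize(rdd1.collect().flatMap(expand))

    // the same result, computed where the data already lives:
    val rdd2 = rdd1.flatMap(expand)
    val total = rdd2.reduce(_ + _)   // only the final result comes back to the driver

    // per-partition variant; preservesPartitioning tells Spark that the
    // partitioner (if rdd1 has one) is unchanged
    val rdd2b = rdd1.mapPartitions(_.flatMap(expand), preservesPartitioning = true)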

If the above doesn't make immediate sense to you, then you're not yet understanding some of the fundamental concepts behind programming and working with Spark.

angshu rai

May 20, 2013, 2:06:06 PM
to spark...@googlegroups.com
Thanks, Josh, for pointing me to the previous threads; I had figured out earlier that nested RDDs are not supported. I was just wondering if and how the SparkContext could be reused; it looks like that is also not possible as things stand now.

Mark: Thanks a lot for elucidating the crux of the motivation for practicing distributed computing. It is just that not all problems can be handled by performing transformations on a single RDD (unless, maybe, you make some simplifying relaxations). The problem I am working on involves co-training of data where, in layman's terms, each data point perturbs multiple parameters, which in turn feed back into how the data point is perceived (processed) -- something slightly more involved than, say, training a logit classifier or Gibbs sampling estimation. Now, I understand that these are not strictly embarrassingly parallel algorithms; however, my attempt is to extract as much parallelism as possible by employing parallel parameter updates, etc. There are also pipelining concepts involved that complement the parallel update process, where updates are discrete but numerous. I am sure some of this will make immediate sense to you. Thanks for pointing out that a nested SparkContext is not possible.

Still, I would be curious to know whether there is any way an RDD can be created within a transformation function.

Thanks
-Angshu

Mark Hamstra

May 20, 2013, 3:15:01 PM
to spark...@googlegroups.com
No, from within a transformation of an RDD you can't create more RDDs with independent lineages, or perform actions on RDDs -- those are fundamentally driver process operations that cannot be performed on workers.  It sounds like your use case involves communication patterns that are not well-suited to Spark.  Sometimes (as Josh already pointed out) such cases can be accommodated with joins or by repeatedly accumulating results in the driver and broadcasting them to the workers, but often that involves a lot of overhead and does not scale well.  Some such problems can be broken down in ways that reduce the communication requirement while still working in a cluster framework broadly similar to Spark, but making use of such a refactoring will likely require extending Spark analogously to the interesting work done on Sparkler.
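
For reference, the accumulate-and-broadcast approach I mentioned looks roughly like this (localUpdate, merge, and applyDelta are placeholders for your per-record, combining, and driver-side update steps; data is your RDD of training records):

    var params = initialParams                         // small parameter object, kept on the driver
    for (i <- 1 to numIterations) {
      val bc = sc.broadcast(params)                    // ship the current parameters to the workers
      val delta = data
        .map(record => localUpdate(record, bc.value))  // per-record contribution using the broadcast value
        .reduce(merge)                                 // combined on the cluster; only the result returns
      params = applyDelta(params, delta)               // driver-side update, then repeat
    }

The overhead I mentioned comes from doing that broadcast/collect round trip once per iteration.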

angshu rai

May 21, 2013, 4:40:37 AM
to spark...@googlegroups.com
Thanks a lot, Mark, for clarifying that, and for the nice pointer to Sparkler. Let me see what extensions/workarounds are possible in my case.

Thanks again for all the responses.

-Angshu