multiple or shared SparkContext for each SparkStep?


Vincent Chan

Apr 9, 2018, 11:58:43 AM
to mrjob

I am new to mrjob/PySpark, and I'm trying to implement a multi-step EMR/Spark job using mrjob. Do I need to create a new SparkContext for each SparkStep, or can I share the same SparkContext across all SparkSteps?

I tried looking this up in the mrjob documentation, but unfortunately it wasn't clear on this point.

Can someone please advise on the correct approach? Thanks a lot!

  1. Create a separate SparkContext for each SparkStep:

     from mrjob.job import MRJob
     from mrjob.step import SparkStep

     class MRSparkJob(MRJob):

         def spark_step1(self, input_path, output_path):
             # create (and stop) a SparkContext just for this step
             from pyspark import SparkContext
             sc = SparkContext(appName='appname')
             ...
             sc.stop()

         def spark_step2(self, input_path, output_path):
             # create (and stop) another SparkContext for this step
             from pyspark import SparkContext
             sc = SparkContext(appName='appname')
             ...
             sc.stop()

         def steps(self):
             return [SparkStep(spark=self.spark_step1),
                     SparkStep(spark=self.spark_step2)]

     if __name__ == '__main__':
         MRSparkJob.run()
    
  2. Create a single SparkContext and share it among different SparkSteps:

     from mrjob.job import MRJob
     from mrjob.step import SparkStep

     class MRSparkJob(MRJob):
         sc = None

         def spark_step1(self, input_path, output_path):
             from pyspark import SparkContext
             # create the SparkContext once and keep it on the job instance
             self.sc = SparkContext(appName='appname')
             ...

         def spark_step2(self, input_path, output_path):
             # reuse the same self.sc created in spark_step1
             ...
             self.sc.stop()

         def steps(self):
             return [SparkStep(spark=self.spark_step1),
                     SparkStep(spark=self.spark_step2)]

     if __name__ == '__main__':
         MRSparkJob.run()

Dave Marin

Apr 9, 2018, 1:51:13 PM
to mr...@googlegroups.com
If you want to have multiple spark steps, you should create a new SparkContext for each step. In essence, that’s what happens anyway, since each step is a completely new invocation of Hadoop and Spark.

However, the way Spark works, there isn't much benefit in breaking it into two steps (other than to see the intermediate results for debugging).
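
For illustration, here's a minimal sketch of what folding both stages into a single step could look like (the class name, textFile/saveAsTextFile calls, and paths are just placeholders, not your actual logic):

    from mrjob.job import MRJob
    from mrjob.step import SparkStep


    class MRSparkCombinedJob(MRJob):

        def spark(self, input_path, output_path):
            from pyspark import SparkContext

            sc = SparkContext(appName='appname')

            # "step 1" of the pipeline
            rdd = sc.textFile(input_path)
            # ... intermediate transformations go here ...

            # "step 2" continues on the same RDD and SparkContext,
            # so no intermediate write/read (or second step) is needed
            rdd.saveAsTextFile(output_path)

            sc.stop()

        def steps(self):
            return [SparkStep(spark=self.spark)]


    if __name__ == '__main__':
        MRSparkCombinedJob.run()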

-Dave

Vincent Chan

Apr 10, 2018, 3:21:46 AM
to mrjob
Thanks for your feedback, Dave. My problem is that I need to wait until step 1 has completed, then run an UPDATE SQL command on a database table using psycopg2 (because Spark/JDBC doesn't support it), before I can start step 2, since step 2 requires that table to be fully updated before reading from it.
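
For concreteness, here's a rough (untested) sketch of the kind of thing I mean, with the UPDATE running on the driver at the start of step 2 -- the connection details, table name, and SQL are just placeholders:

    def spark_step2(self, input_path, output_path):
        import psycopg2
        from pyspark import SparkContext

        # by the time this step's driver runs, step 1 has fully completed,
        # so the table can be updated here before step 2 reads from it
        conn = psycopg2.connect(host='db-host', dbname='mydb',
                                user='user', password='secret')
        with conn, conn.cursor() as cur:
            cur.execute("UPDATE my_table SET status = 'ready'")  # placeholder SQL
        conn.close()

        sc = SparkContext(appName='appname')
        # ... step 2 logic that reads from the now-updated table ...
        sc.stop()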

Would you have any suggestions for my use case, other than breaking it into two steps?