Hey folks,
So our job server is designed both for one-job-per-context scenarios and for scenarios where multiple jobs share cached RDDs within a single context.
Here is a typical scenario (for us anyways):
- Create a shared context in job server (POST /contexts/my-new-context)
- Submit jars
- Run a job that uploads or creates persistent RDDs
- Run another job that computes a result based on the persistent RDDs (this is why I submitted a pull request for the getPersistentRDDs interface)
It's pretty common to need to persist additional metadata beyond the RDDs across jobs. For example, you might want to persist custom hashmaps for quick lookup if you have lots of small persistent RDDs, or variables controlling how many RDDs to persist and when to purge them.
What is the best method for persisting extra custom metadata between jobs? There are a couple of approaches.
First, some background info: all job server jobs currently must implement a trait:
trait SparkJob {
  /**
   * This is the entry point for a Spark Job Server to execute Spark jobs.
   * This function should create or reuse RDDs and return the result at the end, which the
   * Job Server will cache or display.
   * @param sc a SparkContext for the job. May be reused across jobs.
   * @param config the Typesafe Config object passed into the job request
   * @return the job result
   */
  def runJob(sc: SparkContext, config: Config): Any

  /**
   * This method is called by the job server to allow jobs to validate their input and reject
   * invalid job requests. If SparkJobInvalid is returned, then the job server returns 400
   * to the user.
   * NOTE: this method should return very quickly; if it responds slowly, the job server may
   * time out trying to start this job.
   * @return either SparkJobValid or SparkJobInvalid
   */
  def validate(sc: SparkContext, config: Config): SparkJobValidation
}
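For reference, a minimal job implementing this trait might look like the sketch below. The SparkContext, Config, and SparkJobValidation types are stand-in stubs here so the example compiles on its own; a real job would import them from Spark, Typesafe Config, and the job server instead, and WordCountJob is just a made-up example class:

```scala
// Stand-ins so the sketch is self-contained (NOT the real Spark/Typesafe types).
class SparkContext {
  def parallelize[T](xs: Seq[T]): Seq[T] = xs // stand-in for sc.parallelize
}
class Config(entries: Map[String, String]) {
  def hasPath(p: String): Boolean = entries.contains(p)
  def getString(p: String): String = entries(p)
}
sealed trait SparkJobValidation
case object SparkJobValid extends SparkJobValidation
case class SparkJobInvalid(reason: String) extends SparkJobValidation

trait SparkJob {
  def runJob(sc: SparkContext, config: Config): Any
  def validate(sc: SparkContext, config: Config): SparkJobValidation
}

// Example job: counts words in the "input.string" config entry.
object WordCountJob extends SparkJob {
  def validate(sc: SparkContext, config: Config): SparkJobValidation =
    if (config.hasPath("input.string")) SparkJobValid
    else SparkJobInvalid("missing input.string")

  def runJob(sc: SparkContext, config: Config): Any =
    sc.parallelize(config.getString("input.string").split(" ").toSeq)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

The key point is that validate runs cheaply before the job starts, while runJob does the actual RDD work and returns the result for the server to cache or display.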
Here are the ideas we have, with their +'s and -'s:
- Persist the shared metadata in a custom SparkContext class, maybe one that inherits from SparkContext itself. Allow users to use custom contexts with their jobs.
- +: It seems to me that both Shark and Spark Streaming use this idea, as both define custom contexts.
- -: The StreamingContext is not a subclass of SparkContext, so the trait above would not work for streaming jobs
- -: How can you validate that you will have the right type of Context for a given jar?
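One hypothetical way around both minuses of the custom-context idea (none of these names are a real API) is to give the job trait an abstract context type, so a streaming job can declare that it needs a StreamingContext even though StreamingContext does not subclass SparkContext, and the server can check a job's required context type before launching it:

```scala
// Stand-ins for the two context flavours (NOT the real Spark classes).
class SparkContext { val kind = "spark" }
class StreamingContext { val kind = "streaming" } // deliberately NOT a SparkContext subclass

trait SparkJobBase {
  type C // the context type this job needs
  def runJob(ctx: C, config: Map[String, String]): Any
}
trait SparkJob extends SparkJobBase { type C = SparkContext }
trait StreamingJob extends SparkJobBase { type C = StreamingContext }

// An example job only compiles against the context type it declared.
object KindJob extends SparkJob {
  def runJob(ctx: SparkContext, config: Map[String, String]): Any = ctx.kind
}
```

The type-checking happens at compile time inside the jar; the open question of matching a jar's declared context type against a running context at submission time would still need a runtime check (e.g. via reflection on the declared type).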
- Have the job server pass around a HashMap or other shared data structure between jobs.
- +: Relatively easy to manage (concurrency is slightly tricky).
- -: Not as flexible as a custom SparkContext. Would not be able to override any default behaviors, for example.
- -: Does not solve the problem of enabling streaming jobs, but maybe that's a separate discussion
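The shared-data-structure option could be sketched like this (hypothetical names throughout): the job server keeps one concurrent map per context and hands it to each job that runs there. Scala's TrieMap gives thread-safe reads/writes and an atomic getOrElseUpdate, which covers most of the "concurrency is slightly tricky" concern for simple uses:

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical server-side registry: one metadata map per named context.
object ContextMetadata {
  private val perContext = TrieMap.empty[String, TrieMap[String, Any]]

  // Returns the metadata map for a context, creating it atomically on first use.
  def forContext(name: String): TrieMap[String, Any] =
    perContext.getOrElseUpdate(name, TrieMap.empty[String, Any])
}
```

A job that builds a lookup table could then do ContextMetadata.forContext("my-new-context").put("smallRddIndex", index), and a later job in the same context reads it back, much like getPersistentRDDs does for RDDs.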
- Add an extra "metadata" HashMap in SparkContext itself that all jobs can access.
- -: Not sure this is a win, since if you are writing your own job server, there are many more ways to pass around metadata
What do you guys think?
thanks,
Evan