Hey folks,
So our job server is designed both for one-job-per-context scenarios and for scenarios where multiple jobs share cached RDDs within a single context.
Here is a typical scenario (for us anyways):
- Create a shared context in job server (POST /contexts/my-new-context)
- Submit jars
- Run a job that uploads or creates persistent RDDs
- Run another job that computes a result based on the persistent RDDs (this is why I submitted a pull request for the getPersistentRDDs interface)
It's pretty common to need to persist additional metadata beyond the RDDs across jobs. For example, you might want to persist custom hashmaps for quick lookup if you have lots of small persistent RDDs, or variables controlling how many RDDs to persist and when to purge them.
What is the best method for persisting extra custom metadata between jobs? There are a couple of approaches.
First, some background info: all job server jobs currently must implement a trait:
trait SparkJob {
  /**
   * This is the entry point for a Spark Job Server to execute Spark jobs.
   * This function should create or reuse RDDs and return the result at the end, which the
   * Job Server will cache or display.
   * @param sc a SparkContext for the job. May be reused across jobs.
   * @param config the Typesafe Config object passed into the job request
   * @return the job result
   */
  def runJob(sc: SparkContext, config: Config): Any

  /**
   * This method is called by the job server to allow jobs to validate their input and reject
   * invalid job requests. If SparkJobInvalid is returned, then the job server returns 400
   * to the user.
   * NOTE: this method should return very quickly; if it responds slowly, the job server may
   * time out trying to start this job.
   * @return either SparkJobValid or SparkJobInvalid
   */
  def validate(sc: SparkContext, config: Config): SparkJobValidation
}
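For reference, a minimal job implementing this trait might look like the sketch below. The SparkContext, Config, and SparkJobValidation types are stand-in stubs here so the example compiles on its own; a real job would import them from Spark, Typesafe Config, and the job server instead, and WordCountJob is just a made-up example class:

```scala
// Stand-ins so the sketch is self-contained (NOT the real Spark/Typesafe types).
class SparkContext {
  def parallelize[T](xs: Seq[T]): Seq[T] = xs // stand-in for sc.parallelize
}
class Config(entries: Map[String, String]) {
  def hasPath(p: String): Boolean = entries.contains(p)
  def getString(p: String): String = entries(p)
}
sealed trait SparkJobValidation
case object SparkJobValid extends SparkJobValidation
case class SparkJobInvalid(reason: String) extends SparkJobValidation

trait SparkJob {
  def runJob(sc: SparkContext, config: Config): Any
  def validate(sc: SparkContext, config: Config): SparkJobValidation
}

// Example job: counts words in the "input.string" config entry.
object WordCountJob extends SparkJob {
  def validate(sc: SparkContext, config: Config): SparkJobValidation =
    if (config.hasPath("input.string")) SparkJobValid
    else SparkJobInvalid("missing input.string")

  def runJob(sc: SparkContext, config: Config): Any =
    sc.parallelize(config.getString("input.string").split(" ").toSeq)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

The key point is that validate runs cheaply before the job starts, while runJob does the actual RDD work and returns the result for the server to cache or display.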
Here are the ideas we have, with their +'s and -'s:
- Persist the shared metadata in a custom SparkContext class, maybe one that inherits from SparkContext itself. Allow users to use custom contexts with their jobs.
- +: It seems to me that both Shark and Spark Streaming use this idea, as both define custom contexts.
- -: The StreamingContext is not a subclass of SparkContext, so the trait above would not work for streaming jobs
- -: How can you validate that you will have the right type of Context for a given jar?
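One hypothetical way around both minuses of the custom-context idea (none of these names are a real API) is to give the job trait an abstract context type, so a streaming job can declare that it needs a StreamingContext even though StreamingContext does not subclass SparkContext, and the server can check a job's required context type before launching it:

```scala
// Stand-ins for the two context flavours (NOT the real Spark classes).
class SparkContext { val kind = "spark" }
class StreamingContext { val kind = "streaming" } // deliberately NOT a SparkContext subclass

trait SparkJobBase {
  type C // the context type this job needs
  def runJob(ctx: C, config: Map[String, String]): Any
}
trait SparkJob extends SparkJobBase { type C = SparkContext }
trait StreamingJob extends SparkJobBase { type C = StreamingContext }

// An example job only compiles against the context type it declared.
object KindJob extends SparkJob {
  def runJob(ctx: SparkContext, config: Map[String, String]): Any = ctx.kind
}
```

The type-checking happens at compile time inside the jar; the open question of matching a jar's declared context type against a running context at submission time would still need a runtime check (e.g. via reflection on the declared type).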
- Have the job server pass around a HashMap or other shared data structure between jobs.
- +: Relatively easy to manage (concurrency is slightly tricky).
- -: Not as flexible as a custom SparkContext. Would not be able to override any default behaviors, for example.
- -: Does not solve the problem of enabling streaming jobs, but maybe that's a separate discussion
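The shared-data-structure option could be sketched like this (hypothetical names throughout): the job server keeps one concurrent map per context and hands it to each job that runs there. Scala's TrieMap gives thread-safe reads/writes and an atomic getOrElseUpdate, which covers most of the "concurrency is slightly tricky" concern for simple uses:

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical server-side registry: one metadata map per named context.
object ContextMetadata {
  private val perContext = TrieMap.empty[String, TrieMap[String, Any]]

  // Returns the metadata map for a context, creating it atomically on first use.
  def forContext(name: String): TrieMap[String, Any] =
    perContext.getOrElseUpdate(name, TrieMap.empty[String, Any])
}
```

A job that builds a lookup table could then do ContextMetadata.forContext("my-new-context").put("smallRddIndex", index), and a later job in the same context reads it back, much like getPersistentRDDs does for RDDs.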
- Add an extra "metadata" HashMap in SparkContext itself that all jobs can access.
- -: Not sure this is a win, since if you are writing your own job server, there are many more ways to pass around metadata
What do you guys think?
thanks,
Evan