Spark as a Service

Evan Chan

unread,

Apr 23, 2013, 7:08:56 PM4/23/13

to spark...@googlegroups.com

Hey guys,

We would like to roll out Spark as a service internally. What this means is that there should be some web service that someone can post Spark jobs to for submission, progress tracking, etc. I'm writing to gather feedback about ideas.

- Has this been done before? I couldn't find anything on this mailing list, but wanted to check.

- Submitting jobs: you would post a JAR to a URL.

- Each JAR could technically run in a separate process, or in process on the server (which might make tracking progress easier?)

- Right now our jobs are just like standard Scala apps, with command line arguments, but we might need to think about a better API.

- Progress tracking: AFAIK there is no real progress tracking for Spark jobs today. It's not in the UI, and I haven't dug through the code yet, but it seems you could easily grab progress back from Hadoop inputformats, for example....

any work in this area? I'd be happy to contribute a progress tracking framework back to the community if someone can point me as to where to start.

In general I'd like to hear back from others who have deployed Spark as an internal service, and best practices for people running jobs.

thanks,

Evan

Mark Hamstra

unread,

Apr 23, 2013, 7:30:10 PM4/23/13

to spark...@googlegroups.com

Something closely related to job progress tracking...

Matei Zaharia

unread,

Apr 23, 2013, 8:23:40 PM4/23/13

to spark...@googlegroups.com

Hi Evan,

> We would like to roll out Spark as a service internally. What this means is that there should be some web service that someone can post Spark jobs to for submission, progress tracking, etc. I'm writing to gather feedback about ideas.
>
> - Has this been done before? I couldn't find anything on this mailing list, but wanted to check.
> - Submitting jobs: you would post a JAR to a URL.
> - Each JAR could technically run in a separate process, or in process on the server (which might make tracking progress easier?)
> - Right now our jobs are just like standard Scala apps, with command line arguments, but we might need to think about a better API.

I believe Quantifind was doing something like this internally (I'll ask them) but it was going to run jobs in the same set of JVMs, as a sort of test environment for developers, so it's not exactly what you wanted. It would be cool to have a way of submitting them to a standalone Spark cluster so that the masters run on the cluster.

> - Progress tracking: AFAIK there is no real progress tracking for Spark jobs today. It's not in the UI, and I haven't dug through the code yet, but it seems you could easily grab progress back from Hadoop inputformats, for example....
> any work in this area? I'd be happy to contribute a progress tracking framework back to the community if someone can point me as to where to start.

Yes, there's no work on this right now, beyond adding the SparkListener concept to SparkContext which is an API that we ant to expand to give progress events. Adding it to a per-job UI would be great. We actually have such a UI already for the BlockManager (to tell you where data is stored on the cluster), so that could be extended to show running tasks, etc.

I think several people were interested in this one, but I'm not aware of any concrete work yet, so if you want to send a patch or proposed design, go ahead. Otherwise, this is also high-priority on our end at Berkeley.

Matei

Imran Rashid

unread,

Apr 24, 2013, 12:48:12 AM4/24/13

to spark...@googlegroups.com

Hi Evan,

We've built a system similar to this, and have come to depend upon it heavily for internal tools. Our system is a little different:

* Developers submit jars which can each contain a bunch of independent spark jobs, coded to an API where they need to give each job a name, some inputs, and an output
* We have one long-lived spark context, which loads a bunch of our bigger data sets into cached RDDs on startup.
* A simple web service takes incoming requests to run jobs. It then matches the job name to the code from the jar, makes a web-form to get the inputs, runs the spark job, and renders the output in a browser.

A key difference between what we have and what you want is that we just have one long-lived spark context, b/c we want to avoid the overhead of loading data into memory. The downside is, all spark jobs need to be defined when the job starts. On the other hand, we can always simply restart when we need to add a new jar -- the only added overhead is reloading the data, which you've got to do anyway with the other approach. (We've looked into dynamically loading code from user jars, but its turning out to be a real pain.)

We've actually briefly discussed the idea of giving a presentation on what we've built, if there is interest I can bring it up again.

We're also on the verge of open sourcing a library we wrote which helps manage the inputs to the jobs, and makes it easy to define arguments that are common to the command line and to a web ui. I still need to clean some things up before its "officially" released but you can take a look:
https://github.com/quantifind/Sumac

Responses to specific questions inline:

On Tuesday, April 23, 2013 4:08:56 PM UTC-7, Evan Chan wrote:

- Each JAR could technically run in a separate process, or in process on the server (which might make tracking progress easier?)

By running each one in a separate process, you'll definitely simplify some issues with having classloaders know about the code in all the jars. Just be sure to start the java process w/ the user jars on the java classpath AND to add them when you create your SparkContext. If you have just one process, you need to figure out how to get the classloader on all threads to know about the relevant jars. I'd love to trade notes with you if you go down that route.

- Right now our jobs are just like standard Scala apps, with command line arguments, but we might need to think about a better API.

We use Sumac for this. Each spark job has a class which defines its arguments, and each jar has one Factory class that returns all the relevant spark jobs it contains. Because we've got the arguments in a strongly typed object, its pretty easy to convert those arguments into a web-form, and even fill in some values that are supplied by the system.

- Progress tracking: AFAIK there is no real progress tracking for Spark jobs today. It's not in the UI, and I haven't dug through the code yet, but it seems you could easily grab progress back from Hadoop inputformats, for example....
any work in this area? I'd be happy to contribute a progress tracking framework back to the community if someone can point me as to where to start.

There is already a SparkListener trait, though right now it only tracks stage completion events, and there is no UI. You'd probably want to add stageStarted and taskCompleted events, which would be really easy. And then you could build a UI on top of that. I'd be happy to create those events for you if that would get you going, since I'm already familiar with that code. I'd really love it if somebody built a UI for it, that would be a huge improvement.

Evan Chan

unread,

Apr 24, 2013, 4:12:02 AM4/24/13

to spark...@googlegroups.com

Hi Imran,

On Tue, Apr 23, 2013 at 9:48 PM, Imran Rashid <im...@quantifind.com> wrote:

Hi Evan,

We've built a system similar to this, and have come to depend upon it heavily for internal tools. Our system is a little different:

* Developers submit jars which can each contain a bunch of independent spark jobs, coded to an API where they need to give each job a name, some inputs, and an output
* We have one long-lived spark context, which loads a bunch of our bigger data sets into cached RDDs on startup.

* A simple web service takes incoming requests to run jobs. It then matches the job name to the code from the jar, makes a web-form to get the inputs, runs the spark job, and renders the output in a browser.

A key difference between what we have and what you want is that we just have one long-lived spark context, b/c we want to avoid the overhead of loading data into memory. The downside is, all spark jobs need to be defined when the job starts. On the other hand, we can always simply restart when we need to add a new jar -- the only added overhead is reloading the data, which you've got to do anyway with the other approach. (We've looked into dynamically loading code from user jars, but its turning out to be a real pain.)

I think I got it now. You share a SparkContext amongst jobs, which you can manage because they are all predefined.

Hmmm.... so if we want a more generic service, where people can submit different jobs, then it's easiest to make them separate processes, with say standard args and STDOUT.

We've actually briefly discussed the idea of giving a presentation on what we've built, if there is interest I can bring it up again.

We're also on the verge of open sourcing a library we wrote which helps manage the inputs to the jobs, and makes it easy to define arguments that are common to the command line and to a web ui. I still need to clean some things up before its "officially" released but you can take a look:
https://github.com/quantifind/Sumac

Sumac looks pretty neat, by the way. Now if you can incorporate a help system...... :)

Responses to specific questions inline:

On Tuesday, April 23, 2013 4:08:56 PM UTC-7, Evan Chan wrote:

- Each JAR could technically run in a separate process, or in process on the server (which might make tracking progress easier?)

By running each one in a separate process, you'll definitely simplify some issues with having classloaders know about the code in all the jars. Just be sure to start the java process w/ the user jars on the java classpath AND to add them when you create your SparkContext. If you have just one process, you need to figure out how to get the classloader on all threads to know about the relevant jars. I'd love to trade notes with you if you go down that route.

I don't get what you mean, by having a separate java process, and to add user jars to the server process as well. You can't pass the SparkContext to the other process, right?

I see that with a SparkContext you can pass in jars that can be distributed across the cluster, but I assume that SparkContext still needs to be in the same process that starts the job....

Running the Spark job in a separate process also makes it easy to kill. However...

- Right now our jobs are just like standard Scala apps, with command line arguments, but we might need to think about a better API.

We use Sumac for this. Each spark job has a class which defines its arguments, and each jar has one Factory class that returns all the relevant spark jobs it contains. Because we've got the arguments in a strongly typed object, its pretty easy to convert those arguments into a web-form, and even fill in some values that are supplied by the system.

- Progress tracking: AFAIK there is no real progress tracking for Spark jobs today. It's not in the UI, and I haven't dug through the code yet, but it seems you could easily grab progress back from Hadoop inputformats, for example....

any work in this area? I'd be happy to contribute a progress tracking framework back to the community if someone can point me as to where to start.

There is already a SparkListener trait, though right now it only tracks stage completion events, and there is no UI. You'd probably want to add stageStarted and taskCompleted events, which would be really easy. And then you could build a UI on top of that. I'd be happy to create those events for you if that would get you going, since I'm already familiar with that code. I'd really love it if somebody built a UI for it, that would be a huge improvement.

I'd be interested in this, though I need to look at SparkListener.

I could just be very confused, but listening to event updates via SparkListener (added via context) means that the Context and the Job itself must be running in the same process right? Or is there somehow the ability to create SparkContexts on separate processes that track the same job?

thanks,

Evan

--
You received this message because you are subscribed to a topic in the Google Groups "Spark Users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/spark-users/JcrAJlps2T0/unsubscribe?hl=en.
To unsubscribe from this group and all its topics, send an email to spark-users...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
--
Evan Chan
Senior Software Engineer |

e...@ooyala.com | (650) 996-4600
www.ooyala.com | blog | @ooyala

Imran Rashid

unread,

Apr 24, 2013, 5:24:56 PM4/24/13

to spark...@googlegroups.com

On Wednesday, April 24, 2013 1:12:02 AM UTC-7, Evan Chan wrote:

I think I got it now. You share a SparkContext amongst jobs, which you can manage because they are all predefined.

Hmmm.... so if we want a more generic service, where people can submit different jobs, then it's easiest to make them separate processes, with say standard args and STDOUT.

The jobs aren't really predefined, developers can submit whatever they want. However, by keeping the spark context with data loaded around, we can rerun the same job with different arguments, and its super fast. We can always add a new job by restarting. In addition, we can auto-supply some of the arguments, eg. we can automatically pass in our latest set of data if its requested, and we can reformat the output into an html table.

https://github.com/quantifind/Sumac

Sumac looks pretty neat, by the way. Now if you can incorporate a help system...... :)

glad you like it. It already automatically gives a help message if "--help" is one of the arguments. I've got to update the docs a bunch before the official release ...

By running each one in a separate process, you'll definitely simplify some issues with having classloaders know about the code in all the jars. Just be sure to start the java process w/ the user jars on the java classpath AND to add them when you create your SparkContext. If you have just one process, you need to figure out how to get the classloader on all threads to know about the relevant jars. I'd love to trade notes with you if you go down that route.

I don't get what you mean, by having a separate java process, and to add user jars to the server process as well. You can't pass the SparkContext to the other process, right?

I see that with a SparkContext you can pass in jars that can be distributed across the cluster, but I assume that SparkContext still needs to be in the same process that starts the job....

sorry, I did not explain that very well. Really, I'm just saying that you need to pass the jars in the same way as you do when you run a standalone spark job -- the jars need to be in the classpath & registered w/ the spark context. I mentioned it just because I got this wrong at first when writing our framework, I forgot about including user jars everywhere.

I could just be very confused, but listening to event updates via SparkListener (added via context) means that the Context and the Job itself must be running in the same process right? Or is there somehow the ability to create SparkContexts on separate processes that track the same job?

I think the context & the job are always in the same process. But I guess the question is whether your job launcher creates a new process for each spark job. You could always write a SparkListener which forwards events back to job launcher process.

But an easier option might be to just have your job launcher give a link to the UI for each spark job. That way, you can build the job tracker UI into the common spark job UI, that way everybody could use it :)

Reply all

Reply to author

Forward