Hi Evan,
We've built a system similar to this, and have come to depend upon it heavily for internal tools. Our system is a little different:
* Developers submit jars which can each contain a bunch of independent spark jobs, coded to an API where they need to give each job a name, some inputs, and an output
* We have one long-lived spark context, which loads a bunch of our bigger data sets into cached RDDs on startup.
* A simple web service takes incoming requests to run jobs. It then matches the job name to the code from the jar, makes a web-form to get the inputs, runs the spark job, and renders the output in a browser.
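The three bullets above can be sketched roughly as follows. This is only an illustration of the shape of such a system, not the actual code: `NamedJob`, `JobRegistry`, and `WordLengthJob` are hypothetical names, and the job body is a stand-in for work that would really run against the shared Spark context.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical job API: each job declares a name, its input names, and how to produce an output.
interface NamedJob {
    String name();
    List<String> inputNames();
    String run(Map<String, String> inputs);
}

// Registry filled once at startup with every job found in the submitted jars.
class JobRegistry {
    private final Map<String, NamedJob> jobs = new LinkedHashMap<>();

    void register(NamedJob job) {
        if (jobs.putIfAbsent(job.name(), job) != null) {
            throw new IllegalArgumentException("duplicate job name: " + job.name());
        }
    }

    // What the web service would do per request: match the name, check the inputs, run the job.
    String dispatch(String jobName, Map<String, String> inputs) {
        NamedJob job = jobs.get(jobName);
        if (job == null) {
            throw new IllegalArgumentException("unknown job: " + jobName);
        }
        for (String required : job.inputNames()) {
            if (!inputs.containsKey(required)) {
                throw new IllegalArgumentException("missing input: " + required);
            }
        }
        return job.run(inputs);
    }
}

// Stand-in job; a real one would use the long-lived context and its cached RDDs.
class WordLengthJob implements NamedJob {
    public String name() { return "word-length"; }
    public List<String> inputNames() { return Collections.singletonList("word"); }
    public String run(Map<String, String> inputs) {
        return String.valueOf(inputs.get("word").length());
    }
}
```

Because the registry is populated once at startup, adding a job means restarting the process, which is the trade-off discussed next.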
A key difference between what we have and what you want is that we just have one long-lived spark context, b/c we want to avoid the overhead of loading data into memory. The downside is that all spark jobs need to be defined when the long-lived process starts. On the other hand, we can always simply restart when we need to add a new jar -- the only added overhead is reloading the data, which you've got to do anyway with the other approach. (We've looked into dynamically loading code from user jars, but it's turning out to be a real pain.)
We've actually briefly discussed the idea of giving a presentation on what we've built. If there is interest, I can bring it up again.
We're also on the verge of open sourcing a library we wrote which helps manage the inputs to the jobs, and makes it easy to define arguments that are common to the command line and to a web ui. I still need to clean some things up before it's "officially" released, but you can take a look:
https://github.com/quantifind/Sumac
Responses to specific questions inline:
On Tuesday, April 23, 2013 4:08:56 PM UTC-7, Evan Chan wrote:
- Each JAR could technically run in a separate process, or in process on the server (which might make tracking progress easier?)
By running each one in a separate process, you'll definitely simplify some issues with having classloaders know about the code in all the jars. Just be sure to start the java process w/ the user jars on the java classpath AND to add them when you create your SparkContext. If you have just one process, you need to figure out how to get the classloader on all threads to know about the relevant jars. I'd love to trade notes with you if you go down that route.
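For the single-process route, one possible starting point is to build a class loader over the user jars and install it as the context class loader on the threads that run jobs. The sketch below uses only standard JDK calls; `UserJarLoading` and its method names are made up for illustration, and it does not cover the other half mentioned above (handing the same jars to the SparkContext), nor whether every library involved actually consults the context class loader.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;
import java.util.List;

class UserJarLoading {
    // Build a class loader that sees the user jars in addition to whatever the parent sees.
    static URLClassLoader loaderFor(List<Path> jars, ClassLoader parent) {
        URL[] urls = new URL[jars.size()];
        for (int i = 0; i < urls.length; i++) {
            try {
                urls[i] = jars.get(i).toUri().toURL();
            } catch (MalformedURLException e) {
                throw new IllegalArgumentException("bad jar path: " + jars.get(i), e);
            }
        }
        return new URLClassLoader(urls, parent);
    }

    // Run the task on a fresh thread whose context class loader is the given loader,
    // and wait for it to finish.
    static void runWithLoader(ClassLoader loader, Runnable task) {
        Thread worker = new Thread(task, "user-job");
        worker.setContextClassLoader(loader);
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for job thread", e);
        }
    }
}
```

Threads created by the job thread inherit its context class loader by default, but threads that already exist in a pool do not, which is part of why the one-process approach gets painful.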
- Right now our jobs are just like standard Scala apps, with command line arguments, but we might need to think about a better API.
We use Sumac for this. Each spark job has a class which defines its arguments, and each jar has one Factory class that returns all the relevant spark jobs it contains. Because we've got the arguments in a strongly typed object, it's pretty easy to convert those arguments into a web-form, and even fill in some values that are supplied by the system.
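To illustrate why a strongly typed argument object makes the web-form easy, here is a minimal reflection-based sketch. It is not Sumac's API: `TopWordsArgs` and `FormRenderer` are hypothetical, and the mapping from field types to HTML input types is just one plausible choice.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical argument holder for one job; its public fields are the job's arguments,
// and the initial values act as defaults.
class TopWordsArgs {
    public String inputPath = "";
    public int limit = 10;
    public boolean ignoreCase = false;
}

class FormRenderer {
    // Emit one HTML input per public, non-static field, sorted by field name.
    // Values are not HTML-escaped here; a real renderer would need to do that.
    static List<String> formFields(Object args) {
        Field[] fields = args.getClass().getFields();
        Arrays.sort(fields, Comparator.comparing(Field::getName));
        List<String> html = new ArrayList<>();
        for (Field f : fields) {
            if (Modifier.isStatic(f.getModifiers())) {
                continue;
            }
            Object value;
            try {
                value = f.get(args);
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("cannot read field " + f.getName(), e);
            }
            Class<?> type = f.getType();
            String inputType;
            String attr;
            if (type == boolean.class) {
                inputType = "checkbox";
                attr = Boolean.TRUE.equals(value) ? " checked" : "";
            } else {
                boolean numeric = type == int.class || type == long.class || type == double.class;
                inputType = numeric ? "number" : "text";
                attr = " value=\"" + value + "\"";
            }
            html.add("<input type=\"" + inputType + "\" name=\"" + f.getName() + "\"" + attr + ">");
        }
        return html;
    }
}
```

Pre-filling system-supplied values then amounts to setting fields on the argument object before rendering it.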
- Progress tracking: AFAIK there is no real progress tracking for Spark jobs today. It's not in the UI, and I haven't dug through the code yet, but it seems you could easily grab progress back from Hadoop inputformats, for example....
any work in this area? I'd be happy to contribute a progress tracking framework back to the community if someone can point me as to where to start.
There is already a SparkListener trait, though right now it only tracks stage completion events, and there is no UI. You'd probably want to add stageStarted and taskCompleted events, which would be really easy. And then you could build a UI on top of that. I'd be happy to create those events for you if that would get you going, since I'm already familiar with that code. I'd really love it if somebody built a UI for it; that would be a huge improvement.
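As a rough idea of what a UI could sit on top of once those events exist, the sketch below tracks per-stage progress from three callbacks. The `StageEvents` interface is hypothetical and is not Spark's SparkListener API; in particular, it assumes the stageStarted event would carry the number of tasks in the stage.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical event callbacks, named after the events discussed above.
interface StageEvents {
    void stageStarted(int stageId, int numTasks);
    void taskCompleted(int stageId);
    void stageCompleted(int stageId);
}

// Keeps a per-stage count of finished tasks so a UI can poll for progress.
class ProgressTracker implements StageEvents {
    private static final class StageProgress {
        final int total;
        final AtomicInteger done = new AtomicInteger();
        volatile boolean finished;

        StageProgress(int total) { this.total = total; }
    }

    private final Map<Integer, StageProgress> stages = new ConcurrentHashMap<>();

    public void stageStarted(int stageId, int numTasks) {
        stages.put(stageId, new StageProgress(numTasks));
    }

    public void taskCompleted(int stageId) {
        StageProgress p = stages.get(stageId);
        if (p != null) {
            p.done.incrementAndGet();
        }
    }

    public void stageCompleted(int stageId) {
        StageProgress p = stages.get(stageId);
        if (p != null) {
            p.finished = true;
        }
    }

    // Fraction of the stage's tasks that have finished, in [0, 1]; 0 for stages never seen.
    double fractionDone(int stageId) {
        StageProgress p = stages.get(stageId);
        if (p == null) {
            return 0.0;
        }
        if (p.finished || p.total == 0) {
            return 1.0;
        }
        return Math.min(1.0, p.done.get() / (double) p.total);
    }
}
```

The callbacks may arrive on a scheduler thread while the UI reads from another, hence the concurrent map and atomic counter.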