Hi Lou,
It's hard to give an exact answer to your first question because integrating Sparrow with Spark was intimately tied to the development of Sparrow in general. We ended up essentially doing the Spark integration twice: the first integration happened around Spring 2012, and then we had to do a near-complete re-write in August 2013 because the Spark scheduling code had changed significantly. Patrick Wendell did the first integration and I did the second one; we each essentially started from scratch in terms of our Spark knowledge (this would be different now!). I remember it taking a month or so the first time and less (more like a week) the second time, because having the old code was a useful reference. In the end, the current Sparrow branch (https://github.com/kayousterhout/spark/commits/sparrow) reflects 326 added lines of code compared to the base Spark version (plus the ThroughputTester file, which we wrote to compare against the default Spark scheduler). The current Sparrow branch is not feature-complete with the default Spark scheduler (for example, it doesn't currently support the UI -- which didn't exist when we started!), so getting there would require additional work.
You're right that gang scheduling is not supported by Sparrow, and it would be difficult to support given Sparrow's decentralized architecture. Gang scheduling is not needed by Spark (since synchronization happens between stages and not within a stage) and is also not supported by many other cluster schedulers. There's a brief discussion of this on page 14 of our SOSP paper.
For inter-job dependencies, some of these could be supported by Sparrow. For example, one inter-job dependency I've heard of folks wanting is "job X should never be scheduled on the same machine as job Y". Sparrow could support this by adding some logic to the probe responses so that frontends learn what else is running on each machine. Sparrow's late binding approach ends up looking a lot like Mesos, where scheduler frontends get "offers" from workers (in response to probes) describing available resources, and the scheduler frontend can accept or reject those resources. It's possible to implement a fairly broad set of policies in this model, which the Mesos paper discusses. Supporting this would require adding some logic in Sparrow to avoid sending multiple offers from one worker at the same time (which is possible now but would lead to race conditions with inter-job policies) and also adding logic to expose more information about the jobs currently running on a particular machine. We have generally been surprised at how many policies can be implemented using Sparrow! I'm curious to hear if there are other policies you're thinking of that seem difficult or impossible to support.
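
To make that concrete, here's a rough sketch in Scala of what the frontend-side policy check could look like in an offer-style model. All of the names here (Offer, AntiAffinityPolicy, chooseWorker, etc.) are made up for illustration -- this isn't Sparrow's actual API, and it assumes probe responses were extended to include the set of jobs running on each worker:

    // Hypothetical offer a worker sends back in response to a probe,
    // extended to report which jobs it is currently running.
    case class Offer(workerId: String, freeCores: Int, runningJobs: Set[String])

    // Hypothetical policy: jobs listed for a given jobId must never
    // share a machine with that job.
    class AntiAffinityPolicy(conflicts: Map[String, Set[String]]) {
      def conflictsWith(jobId: String): Set[String] =
        conflicts.getOrElse(jobId, Set.empty[String])
    }

    object FrontendPlacement {
      // Accept the first offer whose worker has free resources and runs no
      // conflicting job; otherwise the frontend would re-probe or queue the task.
      def chooseWorker(jobId: String,
                       offers: Seq[Offer],
                       policy: AntiAffinityPolicy): Option[String] = {
        val banned = policy.conflictsWith(jobId)
        offers.find(o => o.freeCores > 0 && o.runningJobs.intersect(banned).isEmpty)
              .map(_.workerId)
      }
    }

    // Example: job "X" must never share a machine with job "Y".
    object Example extends App {
      val policy = new AntiAffinityPolicy(Map("X" -> Set("Y")))
      val offers = Seq(
        Offer("worker-1", freeCores = 2, runningJobs = Set("Y")),  // rejected: runs Y
        Offer("worker-2", freeCores = 4, runningJobs = Set("Z"))   // accepted
      )
      println(FrontendPlacement.chooseWorker("X", offers, policy)) // Some(worker-2)
    }

The tricky part, as mentioned above, is making sure a worker doesn't have multiple outstanding offers at once; otherwise two frontends could each decide a machine is conflict-free and place conflicting jobs on it.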
I hope this is helpful. Let me know if you have further questions!
-Kay