Checkpointable preemption system for Disco

Augusto Souza

Aug 19, 2015, 11:40:16 PM

to Disco-development

Hello,

My name is Augusto Souza (http://github.com/augustorsouza) and I am a computer science master's student at the University of Campinas (Brazil). My area of research is distributed systems, and I have been working with the Hadoop community to make some contributions to it:

- https://issues.apache.org/jira/browse/MAPREDUCE-5269

- https://issues.apache.org/jira/browse/MAPREDUCE-6434

- https://issues.apache.org/jira/browse/MAPREDUCE-6444

I have been helping an Apache developer get the changes related to checkpointable preemption of jobs accepted (as you can see from my activity in Hadoop's Jira).

Since I am having trouble getting changes committed to Hadoop, I have been looking into alternative distributed systems frameworks that could benefit from checkpointable preemption of mappers and reducers when a higher-priority job gets scheduled. I found Disco on GitHub, and that led me to this mailing list.

I am curious whether a checkpointing feature would be useful for this project. If so, I would like to contribute to it in some way and measure some results to help with my master's work at the university.

Does Disco have a scheduler or something like that? If not, I think I could try to write one based on Hadoop's and also add a checkpointable preemption feature based on the patch I have been working on for Hadoop.

I am also able to help with any other need Disco might have, and I could do some research on it as part of completing my master's.

Thank you in advance!

Best regards,

Augusto Souza

Erik Dubbelboer

Aug 24, 2015, 1:31:39 AM

to Disco-development

What exactly would be the use for checkpointable preemption of jobs? Is it so you can have a more important task run right away with the other tasks giving it room to do so?

Augusto Souza

Aug 24, 2015, 6:58:13 AM

to disc...@googlegroups.com

Hello Erik,

Thanks for your answer.

When checking the benefits of checkpointable preemption of jobs in Hadoop, we found that Hadoop clusters commonly have two main categories of jobs: research and production. Research jobs are long-running but have lower priority than production jobs, which are faster on average. The problem with this kind of workload is that production jobs must be executed right away, so they preempt the research jobs. Research jobs that get killed to make room for production ones can lose a lot of work that has already been computed (since they are long-running). In some cases, research jobs can even starve under these conditions, never finishing for lack of resources.

With checkpointable preemption, the main advantage is that research jobs would not throw away their computed work when they get preempted.

Best regards,

Augusto Souza


Augusto Souza

Aug 31, 2015, 3:41:57 PM

to disc...@googlegroups.com

Another great benefit of preemption in the Hadoop world, which we would also get in Disco, is that newly submitted jobs would get the resources they need sooner, since the scheduler would kill tasks from other jobs to make room for the more recent job. At that moment we could checkpoint the tasks before killing them, so that when rescheduled they would reuse the previous computation.
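
To make the idea concrete, here is a toy sketch in Python of checkpoint-then-kill preemption. It is only an illustration under my own assumptions: the Task and ToyScheduler names, the single execution slot, and the in-memory checkpoint store are all hypothetical and are not taken from Disco's or Hadoop's code.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task that sums a list of records one at a time."""
    name: str
    priority: int
    records: list
    position: int = 0   # index of the next record to process
    partial: int = 0    # result accumulated so far

    def step(self):
        """Process one record; return False when no records are left."""
        if self.position >= len(self.records):
            return False
        self.partial += self.records[self.position]
        self.position += 1
        return True

    def checkpoint(self):
        """Capture the minimal state needed to resume later."""
        return {"position": self.position, "partial": self.partial}

    def restore(self, state):
        self.position = state["position"]
        self.partial = state["partial"]


class ToyScheduler:
    """Single-slot scheduler that checkpoints a task before preempting it."""

    def __init__(self):
        self.running = None
        self.waiting = []
        self.checkpoints = {}   # task name -> saved state

    def submit(self, task):
        if self.running is None:
            self._start(task)
        elif task.priority > self.running.priority:
            victim = self.running
            # Save the victim's progress, then "kill" it: a fresh Task
            # instance is queued, so any reuse must come from the checkpoint.
            self.checkpoints[victim.name] = victim.checkpoint()
            self.waiting.append(
                Task(victim.name, victim.priority, victim.records))
            self._start(task)
        else:
            self.waiting.append(task)

    def _start(self, task):
        state = self.checkpoints.pop(task.name, None)
        if state is not None:
            task.restore(state)   # resume instead of recomputing
        self.running = task

    def run(self, max_steps=None):
        """Process up to max_steps records (all if None); return the count."""
        done = 0
        while self.running is not None and (
                max_steps is None or done < max_steps):
            if not self.running.step():
                break
            done += 1
        return done

    def finish(self):
        """Run the current task to completion, then start the
        highest-priority waiting task, if any."""
        steps = self.run()
        finished = self.running
        self.running = None
        if self.waiting:
            self.waiting.sort(key=lambda t: t.priority)
            self._start(self.waiting.pop())
        return finished, steps
```

For example, if a low-priority task over the records 1 to 10 has processed 4 of them when a higher-priority task is submitted, the low-priority task later resumes at record 5 and needs only 6 more steps to reach the same final sum of 55, instead of starting again from zero.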

What do you guys think? Any tips on how to begin this implementation? Resources I could study? I have been reading Disco's source code (especially the fair_scheduler part of it).