Hi,
I am looking into how to restart a Jenkins master while builds are running on Mesos-based slaves, and I have a couple of questions:
1. It looks like the FrameworkID is not persisted and is therefore not provided on startup, which means that as soon as the scheduler is stopped, all tasks are stopped as well (a FailoverTimeout would also have to be set). I have developed a small patch to address this, but contributing it upstream will take some time.
I ran into one issue though: if a FrameworkID is passed to MesosSchedulerDriver after the FailoverTimeout has expired, registration fails with driver status "aborted" because the FrameworkID is no longer valid (I'm running this on OS X with libmesos-0.22). Was that a recent change in 0.22? Is it sufficient to check for the aborted driver status and then register again without a FrameworkID, as a brand-new scheduler?
2. When restarting Jenkins with the FrameworkID set (and a FailoverTimeout set on the previous run), the scheduler seems to be able to pick up the Mesos tasks again, but Jenkins does not see the executors again, nor does it resume the tasks that were running before the master restart. It does look as if Jenkins persisted the active slaves in config.xml correctly. I still need to test this more thoroughly, but maybe you know whether that is supposed to work or not?
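
To make the fallback from question 1 concrete, here is a rough sketch of what I have in mind. It is not the actual patch and does not call the Mesos API: `DriverStatus`, `Result`, `registerWithFallback` and the `startDriver` function are hypothetical stand-ins, and in the plugin the function would presumably wrap building a FrameworkInfo (with or without the stored ID) and starting a MesosSchedulerDriver. The `main` method only simulates a master that rejects a stale FrameworkID.

```java
import java.util.Optional;
import java.util.function.Function;

public class FrameworkRegistration {

    /** Hypothetical stand-in for the driver statuses that matter here (not the Mesos enum). */
    enum DriverStatus { RUNNING, ABORTED }

    /** Outcome of a registration attempt: final status and whether the stored ID was kept. */
    static final class Result {
        final DriverStatus status;
        final boolean reusedStoredId;

        Result(DriverStatus status, boolean reusedStoredId) {
            this.status = status;
            this.reusedStoredId = reusedStoredId;
        }
    }

    /**
     * Tries to register with the stored FrameworkID first. If that attempt ends in
     * ABORTED (for example because the failover timeout has expired), registers once
     * more without an ID, i.e. as a brand-new framework.
     *
     * startDriver is an assumed hook: it receives the FrameworkID to use (or empty)
     * and returns the resulting driver status.
     */
    static Result registerWithFallback(Optional<String> storedFrameworkId,
                                       Function<Optional<String>, DriverStatus> startDriver) {
        if (storedFrameworkId.isPresent()) {
            DriverStatus status = startDriver.apply(storedFrameworkId);
            if (status != DriverStatus.ABORTED) {
                return new Result(status, true);
            }
            // Stored ID was rejected; fall through and register from scratch.
        }
        return new Result(startDriver.apply(Optional.empty()), false);
    }

    public static void main(String[] args) {
        // Simulated master: any previously stored ID is treated as expired.
        Function<Optional<String>, DriverStatus> master =
                id -> id.isPresent() ? DriverStatus.ABORTED : DriverStatus.RUNNING;

        Result result = registerWithFallback(Optional.of("stale-framework-id"), master);
        System.out.println(result.status + " reusedStoredId=" + result.reusedStoredId);
    }
}
```

With this approach the old tasks are of course lost once the fallback kicks in, so the stored FrameworkID would also have to be overwritten with the newly assigned one.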
Thanks!
Regards,
Bjoern