Handling failures in a distributed system


CM

Apr 2, 2018, 5:40:28 PM
to grpc.io
Hi,

I am designing a small distributed job scheduling system with a twist: each job can be re-executed (it is idempotent), but the same job must never be executed by two workers in parallel. This requirement makes everything really difficult in the presence of network/worker failures.

In essence, at a high level it looks like this:
- Worker -- a process (one of many) that connects to the Coordinator, receives jobs, executes them, and submits generated sub-jobs back (if any)
  - think "traversing a filesystem": a "process this directory" job will generate a bunch of sub-jobs (one per directory item)
- Coordinator -- maintains the system state and feeds jobs to workers
- System State -- the list of jobs and their current status (executing on worker X, done, etc.); can be just a list in memory or a table in a database (see the sketch below)
- jobs can take a very long time
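
For illustration, the System State could start as simple as this (a minimal Go sketch; every name here is made up):

// Minimal sketch of the System State in Go; every name is illustrative.
package coordinator

import (
	"sync"
	"time"
)

type JobStatus int

const (
	Pending   JobStatus = iota // not yet handed to a worker
	Executing                  // currently running on some worker
	Paused                     // worker presumed dead; awaiting timeout or user action
	Done
)

type Job struct {
	ID       string
	Status   JobStatus
	WorkerID string    // set while Executing
	Updated  time.Time // last status change
}

// State is the Coordinator's job table; a database table works the same way.
type State struct {
	mu   sync.Mutex
	jobs map[string]*Job
}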

I am having difficulty implementing the "no parallel execution" guarantee -- if a worker (or the connection to it) goes down, I need to recognize this in the Coordinator, "pause" all jobs that worker was running, and (after some timeout or user action) resubmit the jobs to another worker. The timeout (or user action) is required to allow the worker (if it is still alive) to detect the network error, stop its jobs, and start the cycle again (try to register itself with the Coordinator, etc.). It is important that once a connection is deemed broken, it is never reused (or the worker may not notice the problem); the worker is treated as dead until it re-registers itself (after a job purge or restart).
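
To make the cycle concrete, here is a coordinator-side sketch (continuing the same hypothetical Go file as above; the epoch idea is just one way to guarantee that a broken connection is never trusted again):

// Continuing the sketch above: per-worker lease bookkeeping. The epoch
// is bumped on every (re)registration and never reused, so a connection
// that was once declared broken can never be mistaken for a live one.
type Worker struct {
	ID            string
	Epoch         uint64 // fresh on every registration, never reused
	Alive         bool
	LastHeartbeat time.Time
}

// expireLeases pauses the jobs of any worker whose heartbeat timed out.
// The worker must re-register (getting a new epoch) to receive jobs again.
func expireLeases(workers map[string]*Worker, jobs map[string]*Job, timeout time.Duration) {
	now := time.Now()
	for _, w := range workers {
		if w.Alive && now.Sub(w.LastHeartbeat) > timeout {
			w.Alive = false // this epoch is retired for good
			for _, j := range jobs {
				if j.Status == Executing && j.WorkerID == w.ID {
					j.Status = Paused // resubmitted later, after timeout or user action
				}
			}
		}
	}
}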

Can gRPC help me implement this? I feel like I am reinventing a bicycle... This can certainly be done with raw TCP (with manual keep-alives), but I'd like to avoid coding all that logic.
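
From what I can tell, gRPC's built-in HTTP/2 keepalives could at least replace the manual TCP keep-alives for detecting a dead connection -- the lease/re-registration protocol would still be mine to write. A Go client sketch:

// gRPC's HTTP/2 keepalives handle "is this connection dead?"; the
// lease/re-registration protocol above stays application logic.
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	conn, err := grpc.Dial("coordinator.example:50051",
		grpc.WithInsecure(), // TLS omitted for brevity
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                10 * time.Second, // ping after this much idle time
			Timeout:             3 * time.Second,  // declare the link dead if no ack
			PermitWithoutStream: true,             // ping even with no active RPC
		}))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// Note: the server must allow this ping rate via a matching
	// keepalive.EnforcementPolicy, or it will close the connection.
}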

Regards,
Michael.

CM

Apr 9, 2018, 12:45:27 AM
to grpc.io
It seems I will be making my own bicycle...

Christopher Warrington - MSFT

Apr 9, 2018, 6:02:47 PM
to grpc.io
> I am having difficulty implementing the "no parallel execution" guarantee
> -- if a worker (or the connection to it) goes down, I need to recognize
> this in the Coordinator, "pause" all jobs that worker was running, and
> (after some timeout or user action) resubmit the jobs to another worker.
> The timeout (or user action) is required to allow the worker (if it is
> still alive) to detect the network error, stop its jobs, and start the
> cycle again (try to register itself with the Coordinator, etc.). It is
> important that once a connection is deemed broken, it is never reused (or
> the worker may not notice the problem); the worker is treated as dead
> until it re-registers itself (after a job purge or restart).

gRPC doesn't have these sorts of intrinsics.

The interesting part here smells like a variation on distributed locking.
You may want to look at something like ZooKeeper.

You could use gRPC messages to do things like communicate the lock names.
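
For example, a per-job lock held in an ephemeral znode might look like this in Go, using the community github.com/samuel/go-zookeeper/zk client (paths and addresses are made up):

// Per-job ZooKeeper lock; the lock lives in an ephemeral znode, so
// ZooKeeper releases it automatically when the holder's session expires.
package main

import (
	"log"
	"time"

	"github.com/samuel/go-zookeeper/zk"
)

func main() {
	conn, _, err := zk.Connect([]string{"zk1.example:2181"}, 5*time.Second)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// One lock path per job ID; the name itself could travel over gRPC.
	lock := zk.NewLock(conn, "/jobs/job-42/lock", zk.WorldACL(zk.PermAll))
	if err := lock.Lock(); err != nil { // blocks until the lock is acquired
		log.Fatal(err)
	}
	defer lock.Unlock()

	// ... run job-42 here; no other worker can hold this lock while our
	// ZooKeeper session stays alive ...
}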

--
Christopher Warrington

CM

Apr 10, 2018, 3:10:07 AM
to grpc.io
On Monday, April 9, 2018 at 5:02:47 PM UTC-5, Christopher Warrington - MSFT wrote:
> ... distributed locking ...

You can't even imagine how much you helped me by mentioning these keywords. :) Thank you, Christopher.

After reading some of the material Google had to offer in response to a "distributed locking" search, I came to the conclusion that my system is impossible -- simply because there is no way to tell when a crashed worker's job has actually finished. I.e., even if I manually went to the box and checked that the given worker process had terminated, there is no guarantee that the job it was running has finished! There might be a packet still traveling the network towards (let's say) an NFS device that will cause observable changes on arrival (at some unspecified moment in the distant future). I can't declare a given job dead (and reschedule it) until all such connections are shut down in an orderly fashion (which requires feedback from the other side) -- and this is (1) not really possible once the OS takes over a TCP connection owned by a crashed process, and (2) not possible at all if the network is down.

A solution could be to:
- never use timeouts to declare a job dead/complete
- somehow guarantee that a worker never crashes -- hardly a possibility
- provide for a human to make the call (that "the job has definitely finished now") and manually "retire" the given worker; a human (knowing the details of the jobs being executed) can go to the related devices and manually shut down the related connections before declaring the related jobs "complete"
- ... or relax my requirements and/or add some assumptions (like "all TCP activity is over 5 seconds after the process has terminated") -- see the sketch after this list
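
On that last point: the standard mitigation in the distributed-locking literature is a fencing token -- the Coordinator attaches a monotonically increasing number to each job grant, and the storage side rejects writes carrying a stale one. It only helps when every side effect goes through a store smart enough to check the token, which is exactly what a plain NFS device can't do. A toy Go sketch (all names made up):

// Toy fencing-token check. This only helps when every side effect goes
// through a store that can compare tokens -- exactly the part a plain
// NFS device in the scenario above cannot do.
package store

import (
	"fmt"
	"sync"
)

type Store struct {
	mu      sync.Mutex
	highest map[string]uint64 // job ID -> highest token seen so far
	data    map[string][]byte
}

func NewStore() *Store {
	return &Store{highest: make(map[string]uint64), data: make(map[string][]byte)}
}

// Write rejects any write carrying a token older than one already seen,
// fencing off a "zombie" worker whose job was already rescheduled.
func (s *Store) Write(jobID string, token uint64, payload []byte) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if token < s.highest[jobID] {
		return fmt.Errorf("stale fencing token %d for job %s", token, jobID)
	}
	s.highest[jobID] = token
	s.data[jobID] = payload
	return nil
}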

( ... this is a very high-tech bicycle I am looking at :-/ ... )

 
> You could use gRPC messages to do things like communicate the lock names.

Yes, I can certainly use gRPC as the communication layer and build the required primitives on top of it, e.g. build a "session" over gRPC (similar to how TCP is built over IP).
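
Something along these lines -- the pb package, message types, and the Session RPC are all hypothetical, sketched from a .proto like rpc Session(stream WorkerMsg) returns (stream CoordinatorMsg):

// Worker side of a "session" over one long-lived bidirectional stream.
// Hypothetical sketch: pb.CoordinatorClient, pb.WorkerMsg, and the
// Session RPC would come from generated code; stopAllJobs and execute
// stand in for real job management.
func runSession(ctx context.Context, client pb.CoordinatorClient) error {
	stream, err := client.Session(ctx)
	if err != nil {
		return err
	}
	// Register; the Coordinator would answer with a fresh session epoch
	// before sending any jobs.
	if err := stream.Send(&pb.WorkerMsg{WorkerId: "w1"}); err != nil {
		return err
	}
	for {
		msg, err := stream.Recv()
		if err != nil {
			// Stream broken: stop all local jobs, then let the caller
			// re-register on a brand-new stream. This one is never reused.
			stopAllJobs()
			return err
		}
		go execute(msg.GetJob()) // results go back via stream.Send
	}
}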

