I am curious about using Elixir to build a distributed cron system. Our platform runs user-defined “flows” triggered by a variety of IoT devices and services (think Nest, SmartThings, Lifx, Fitbit, etc.) as well as digital services (Twitter, Facebook, etc.). Here is an example: when it’s 9am or I turn my car on at the house, dim my lights, and turn down the thermostat only if the outside temperature is below 60; otherwise leave the thermostat at its current level.
We need to keep track of these time-sensitive “jobs”. We do so by having our data router send the cron job to a "scheduler" node whenever one of our brokers sends a new request in (each integration, e.g. Nest, Facebook, etc., has its own broker or group of brokers). The scheduler node schedules the job using node-crontab; the job is basically a snapshot of the data that the node later sends to the processing engine. We currently have thousands of jobs per node in memory, and unfortunately when a node goes down, so do its jobs. When a cron job needs to run, the payload held in memory is sent for processing and execution. We are trying to think through a way to have another node take over a failed node's jobs (starting with how a node or group of scheduler nodes gets notified that a failure has happened, and on which node) without having to check every defined job in the central k/v store (Redis) to pick out the ones belonging to the failed node. When a job comes in, part of its key is the hash of the scheduler node it was sent to.
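To make the key layout concrete, here is a rough sketch in Python, with a plain dict standing in for Redis. This is not our real code: the key format and the names `node_hash`, `job_key`, and `jobs_for_node` are made up for illustration. It shows why takeover currently means walking every key in the store.

```python
import hashlib


def node_hash(node_name: str) -> str:
    """Short stable hash identifying a scheduler node (hypothetical format)."""
    return hashlib.sha1(node_name.encode()).hexdigest()[:8]


def job_key(node_name: str, job_id: str) -> str:
    """Build a job key that embeds the owning scheduler node's hash."""
    return f"job:{node_hash(node_name)}:{job_id}"


def jobs_for_node(store: dict, node_name: str) -> list:
    """Find a node's jobs by checking every key: the full pass we want to avoid."""
    prefix = f"job:{node_hash(node_name)}:"
    return sorted(key for key in store if key.startswith(prefix))


# A plain dict stands in for the central k/v store.
store = {
    job_key("scheduler-a", "dim-lights"): {"cron": "0 9 * * *"},
    job_key("scheduler-a", "thermostat"): {"cron": "0 9 * * *"},
    job_key("scheduler-b", "tweet"): {"cron": "*/5 * * * *"},
}

failed_jobs = jobs_for_node(store, "scheduler-a")
```

With thousands of jobs per node, that scan touches every job in the store just to recover the ones that belonged to a single failed node.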
A couple of people are leaning towards a ZooKeeper master / slave system to solve notifications, but we are still faced with how to quickly have another node take over a failed node's jobs. Another person and I have deployed Elixir for a few semi-critical services, but nothing like what we might need to build a cron system like this. Erlang/OTP/Elixir seem a perfect fit, perhaps with each node keeping its jobs in a local Agent while also writing them to Mnesia, or perhaps with a cluster of nodes keeping track of jobs in Mnesia, so that when a node goes down another one can take over its jobs by grabbing them from Mnesia and writing them to its own cron. I don’t have much experience with these types of systems, but it seems like a natural fit for Elixir and the OTP model. Any advice or guidance would be very welcome.
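Here is roughly the takeover behaviour I have in mind, sketched in Python only to keep it short. `JobTable` is a made-up in-memory stand-in for a shared, replicated table such as Mnesia, and all the names are hypothetical. It assumes something else has already told the surviving node which node failed.

```python
from collections import defaultdict


class JobTable:
    """In-memory stand-in for a shared job table (hypothetical, not Mnesia's API)."""

    def __init__(self):
        self.jobs = {}                    # job_id -> (owner, payload)
        self.by_owner = defaultdict(set)  # owner -> job_ids, so takeover needs no full scan

    def add(self, job_id, owner, payload):
        """Record a job and index it under the node that scheduled it."""
        self.jobs[job_id] = (owner, payload)
        self.by_owner[owner].add(job_id)

    def take_over(self, failed, survivor):
        """Reassign every job owned by `failed` to `survivor`; return the moved ids."""
        moved = sorted(self.by_owner.pop(failed, set()))
        for job_id in moved:
            _, payload = self.jobs[job_id]
            self.jobs[job_id] = (survivor, payload)
            self.by_owner[survivor].add(job_id)
        return moved


table = JobTable()
table.add("dim-lights", "node-a", {"cron": "0 9 * * *"})
table.add("thermostat", "node-a", {"cron": "0 9 * * *"})
table.add("tweet", "node-b", {"cron": "*/5 * * * *"})

# node-a has gone down; node-b claims its jobs and would re-register them with its own cron.
moved = table.take_over("node-a", "node-b")
```

The per-owner index is the part I care about: the survivor looks up only the failed node's jobs instead of checking every job. What the sketch leaves open is how the table stays consistent across nodes and how exactly one survivor gets picked.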
For example, dkron (http://dkron.io/), written in Go, is kind of like what we want, but as far as I can tell it lacks robust failover mechanisms.
Thanks,
Dan