> the controller is running on a node which is a peregrine node as well. Is this bad practice? is there any way to recover when the controller crashes?
yes... this is the issue.
The controller has to write to logs, allocate memory, use CPU, etc.
If you run the controller on a peregrine node you're just going to starve it of resources.
Generally the controller shouldn't require too much CPU, so you can just run it on a thin/idle node... but if it has to compete with something VERY intensive, like the workload on a regular compute node that uses a TON of CPU / memory / disk, then you will almost certainly starve it.
Also, it makes management harder... the controller needs something like 256MB to run, and it's going to be competing for that memory with regular tasks.
So just put the controller on a dedicated node and you are set.
I have some designs for a distributed controller... I'm going to shard the controllers and then have them write to a write-ahead log. This way the controllers can crash, because their work is sharded and the log is replicated...
Right now if the controller crashes you have to restart the job.
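
To give a feel for the write-ahead log idea, here is a very rough sketch in Python. It is not Peregrine code: `WriteAheadLog`, `recover_state`, and the task/status record layout are all made up for illustration, and it assumes a single local file with no sharding or replication. The point is only that every state change is forced to disk before anyone acts on it, so a restarted controller can replay the file instead of restarting the job.

```python
import json
import os


class WriteAheadLog:
    """Append-only log (hypothetical); each record is fsynced before use."""

    def __init__(self, path):
        self.path = path
        self._fh = open(path, "a", encoding="utf-8")

    def append(self, record):
        # One JSON object per line; flush + fsync so it survives a crash.
        self._fh.write(json.dumps(record) + "\n")
        self._fh.flush()
        os.fsync(self._fh.fileno())

    def close(self):
        self._fh.close()

    @staticmethod
    def replay(path):
        """Yield every complete record; stop at a torn (partial) last line."""
        if not os.path.exists(path):
            return
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                if not line.endswith("\n"):
                    break  # the writer crashed mid-record; ignore the tail
                yield json.loads(line)


def recover_state(path):
    """Rebuild the task -> status map the crashed controller was holding."""
    state = {}
    for rec in WriteAheadLog.replay(path):
        state[rec["task"]] = rec["status"]  # later records win
    return state
```

On startup a controller would call `recover_state()` on its log and carry on from the last recorded status of each task. A real version would also have to ship the log to other nodes, which this sketch leaves out entirely.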