Hello Doug,
I would be happy to provide additional information about the workflow-in-database concept and discuss it in greater detail. Please feel free to contact me at
i...@ikhsoftware.com to arrange a meeting, or we can discuss this further via email, whichever is more convenient for you.
1 - Regarding concurrency, we aim to set up three parallel threads:
a) The first thread sleeps on a Postgres connection, awaiting Postgres notifications via the listen/notify mechanism. Once a notification is received, it retrieves runnable tasks from the database one by one, feeds them to the work queue for as long as the queue is 'hungry', and then goes back to sleep.
b) The second thread sleeps on the work queue and, upon task completion, removes results from the queue, updates the database, checks if the queue is 'hungry', and if so, sends a notification to wake up the first thread before returning to sleep on the queue.
c) The third thread sleeps on a timer, wakes up every 60 seconds, retrieves all work queue stats (bandwidth, capacity, efficiency, task counters, workers, etc.), and writes this data into the database. This is mainly for convenience, as we're implementing a dashboard that will read all information from the database, and it's useful to have queue stats in one place.
We are curious whether these three threads might run into concurrency issues while simultaneously submitting new tasks, retrieving the results of completed tasks, and reading the current queue stats. The problem is that we can’t really protect these calls with a mutex, because we want to use the blocking workqueue.wait to retrieve the task results, and holding a lock across that call would stall the other two threads.
2 - Regarding keep-alive:
We were also thinking that initiating keep-alive checks from the worker side might be beneficial: If the worker does not receive an "alive" response within the keepalive timeout, it can attempt to re-establish the connection or, alternatively, exit. If the manager does not receive any messages from a worker within, for instance, twice the keepalive interval, it could then remove the worker and close the connection.
Alternatively, the initiation of keep-alive checks could remain on the manager side, but the worker could be enhanced to monitor the time since the last message was received from the manager. If this period exceeds, for instance, twice the keepalive interval, the worker could then drop the connection and either attempt to re-establish it or exit.
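To make the second option concrete, here is a minimal sketch of the worker-side check in Python. It is only an illustration of the idea, not a description of how the worker is implemented: the class name, its methods, and the injectable clock are all hypothetical.

```python
import time


class ManagerSilenceMonitor:
    """Tracks how long it has been since the manager last sent anything.

    Hypothetical helper for illustration; not part of the work queue code.
    """

    def __init__(self, keepalive_interval, clock=time.monotonic):
        # Tolerate up to twice the keepalive interval of silence.
        self.max_silence = 2 * keepalive_interval
        self._clock = clock
        self._last_seen = clock()

    def message_received(self):
        """Call whenever any message arrives from the manager."""
        self._last_seen = self._clock()

    def manager_is_silent(self):
        """Return True once the silence exceeds twice the keepalive interval."""
        return self._clock() - self._last_seen > self.max_silence
```

The worker would call message_received() on every message from the manager and, once manager_is_silent() returns True, drop the connection and either reconnect or exit.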
We were puzzled as to why the NLB connection didn't stay active with keepalive checks every two minutes from the manager. The next item seems to provide an answer to this question.
3 - Thanks for investigating the work_queue_status. We believe we have identified the issue, which also seems to relate to the NLB timeout problem.
It appears that when our Python manager program neglects the queue for too long, specifically by not periodically calling workqueue.wait, the background workqueue functions cease to operate: the queue stops sending check messages, stops accepting new workers, stops updating the stats and debug logs, and stops responding to work_queue_status requests.
We suspect this happens because these activities are managed by the main thread. When the Python program is predominantly sleeping or waiting for database notifications (which occurs in our main loop when the workqueue is empty), it suspends the background queue tasks, leading to various issues. To remedy this, we've adjusted our main loop to spend most of the time inside workqueue.wait(1), waking up every second to check the database and returning to workqueue.wait(1) for another second if there was no new work.
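A minimal sketch of the loop shape described above, in Python. The function and its three callables (fetch_next_task, record_result, should_stop) are hypothetical names standing in for our database access and shutdown logic, and the sketch assumes only that the queue object offers hungry(), submit(task), and wait(timeout).

```python
def run_manager_loop(wq, fetch_next_task, record_result, should_stop):
    """Keep the work queue serviced while checking the database every second.

    All names here are hypothetical: `wq` is any object offering hungry(),
    submit(task) and wait(timeout); fetch_next_task() returns the next
    runnable task from the database, or None if there is none.
    """
    while not should_stop():
        # Feed runnable tasks one by one, only while the queue wants more.
        while wq.hungry():
            task = fetch_next_task()
            if task is None:
                break
            wq.submit(task)
        # Spend up to one second inside wait() so the queue can keep
        # servicing its workers, logs, and status requests.
        finished = wq.wait(1)
        if finished is not None:
            record_result(finished)
```

The point of this shape is that the program never sleeps outside wait() for more than the time a single database check takes.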
We would appreciate it if you could confirm whether our findings and suspicions are accurate. If you have any better suggestions for remedying this situation, please let us know.
We also observed that running the manager with a password leads to the rejection of work_queue_status, as it does not support password authentication (auth: peer is not using password authentication). This isn't a significant issue, as we are operating on an internal network and can run without the password. However, we wanted to highlight this in case we overlooked a way to supply the password to work_queue_status.
Thank you again for dedicating your time and attention to our questions.
Best regards,
Igor