Hello,
I have a deployment with two kinds of containers that use Thespian actor systems. One runs a "central" actor system, which performs most of the processing and is configured as the convention leader. The others run as convention followers and are used by WSGI processes running Flask to handle HTTP requests: they inject messages into the central actor system and receive the results.
The problem I am having is that each uwsgi worker process wants to start its own local ActorSystem to connect to the convention leader. What I expected to happen was that one of them would start the pod-local ActorSystem, and the rest would detect the running ActorSystem and just use it. However, they appear to be conflicting, and I see socket bind failures with "address already in use" in the error logs.
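For illustration, the per-worker startup described above might look roughly like the sketch below. It assumes the `multiprocTCPBase` system base. The hostname `central-actors`, the port number, and the helper names `follower_capabilities` and `get_actor_system` are hypothetical placeholders, not the actual configuration.

```python
# Hypothetical sketch of the per-worker setup; every value below is a
# placeholder, not the real deployment's configuration.
LEADER_HOST = "central-actors"  # assumed hostname of the convention leader
ADMIN_PORT = 1900               # assumed admin port for both systems


def follower_capabilities(leader_host=LEADER_HOST, port=ADMIN_PORT):
    """Build capabilities that point a local ActorSystem at the leader."""
    return {
        "Admin Port": port,
        "Convention Address.IPv4": (leader_host, port),
    }


def get_actor_system():
    """Run in every uwsgi worker.

    The expectation is that the first call starts the pod-local
    ActorSystem and later calls from other workers attach to it.
    """
    # Imported lazily so the import happens inside each worker process.
    from thespian.actors import ActorSystem

    return ActorSystem("multiprocTCPBase", follower_capabilities())
```

When several workers call `get_actor_system()` at nearly the same moment, each may try to bind the same admin port, which would be consistent with the "address already in use" errors.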
I also intermittently see the ActorSystems in the uwsgi pods failing to see the convention leader, or not receiving responses from its actors. After a period of high load, when uwsgi kills extra workers, I see "No response received to PendingActor request" errors, followed by the app crashing and reloading itself. This sometimes leads to 502 errors on requests.
How can I diagnose or solve this issue? I would like to be able to start the ActorSystem once per pod, definitively, and have other calls to ActorSystem() reference the existing one, as the docs state. But I keep hitting these errors.
Thanks very much,
Avi Blackmore