Intermittent ActorSystem() failures under uwsgi in Kubernetes

flew...@gmail.com

Aug 30, 2022, 1:34:17 PM

to thespian.py

Hello,

I have a deployment with two kinds of containers that use Thespian actor systems. One is a "central" actor system that performs most of the processing and is configured as the convention leader. The others run as convention followers; they are used by WSGI processes running Flask, which handle HTTP requests by injecting messages into the central actor system and returning the results.

The problem I am having is that each uwsgi worker process wants to start its own local ActorSystem to connect to the convention leader. What I expected to happen was that one of them would start the pod-local ActorSystem, and the rest would detect the running ActorSystem and just use it. However, they appear to be conflicting, and I see socket bind failures with "address already in use" in the error logs.

I also periodically see the ActorSystems running in the uwsgi pods fail to see the convention leader, or fail to receive responses from its actors. After a period of high load, when uwsgi kills its extra workers, I get "No response received to PendingActor request" errors, followed by the app crashing and reloading itself. This sometimes leads to 502 errors on requests.

How can I diagnose or solve this issue? I would like to be able to start the ActorSystem once per pod, definitively, and have other calls to ActorSystem() reference the existing one, as the docs state. But I keep hitting these errors.

Thanks very much,

Avi Blackmore

Kevin Quick

Aug 31, 2022, 4:36:22 PM

to flew...@gmail.com, thespian.py

Hi Avi,

We did previously use Thespian with WSGI runners: the ActorSystem started on a particular machine was global to that machine, and subsequent WSGI threads would simply connect to that same system. That said, I believe we started the ActorSystem before starting the WSGI threads; you may be seeing a race condition in which multiple WSGI threads detect no ActorSystem and then all try to start one simultaneously. I have not used Thespian in Kubernetes pods, so I don't have much experience with any interactions or considerations that apply when running under Kubernetes.
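If starting the ActorSystem before the workers is not practical, one way to rule out the suspected race is to serialize startup across the worker processes with a pod-local file lock. The following is only a minimal sketch under several assumptions: the helper names, the lock path, the "multiprocTCPBase" system base, the port, and the leader host name "central-actor-system" are all placeholders and are not taken from the deployment described above. It also assumes that ActorSystem() returns only once the local admin is ready to accept connections.

```python
import fcntl
import os
import tempfile

# Hypothetical lock location; any path shared by all workers in the pod works.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "actorsystem-startup.lock")


def call_with_startup_lock(start_fn, lock_path=LOCK_PATH):
    """Run start_fn() while holding an exclusive, pod-local file lock.

    Only one process at a time can be inside start_fn, so the first caller
    can finish starting the ActorSystem before the next caller looks for it.
    """
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            return start_fn()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def start_follower_system():
    """Assumed follower setup; the host name and port are placeholders."""
    from thespian.actors import ActorSystem  # imported lazily on purpose

    capabilities = {
        "Admin Port": 1900,
        "Convention Address.IPv4": ("central-actor-system", 1900),
    }
    return ActorSystem("multiprocTCPBase", capabilities)
```

Each worker would then call call_with_startup_lock(start_follower_system) instead of calling ActorSystem() directly. This only removes the simultaneous-start window; it does not address the lost responses after uwsgi kills workers.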

There is a thespian.log file that can be created with various Thespian internal information (see "Thespian Internals Logging" at https://thespianpy.com/doc/using.html#hH-ce55494c-dd7a-4258-a1e8-b090c3bbb1e6). This log is not normally intended for general user consumption and the contents can be pretty cryptic, but it may provide some additional details to help diagnose your connection issues following high loads or the failures to see the convention leader. I assume the WSGI threads are using the ActorSystem interface and not acting as Actors themselves; these WSGI threads can be very ephemeral and can create requests to Thespian that they don't then hang around to resolve.
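As a rough illustration of turning that log on: the internal logging is controlled through environment variables, which need to be set before the ActorSystem is first started so that the admin process inherits them. The variable names below (THESPLOG_FILE, THESPLOG_THRESHOLD, THESPLOG_FILE_MAXSIZE) are as recalled from the linked documentation section and should be checked against it; the helper function and the example path are hypothetical.

```python
import os


def enable_thespian_internal_log(path, threshold="DEBUG",
                                 max_size_bytes=None, environ=os.environ):
    """Set the environment variables that control Thespian's internal log.

    Must run before the first ActorSystem() call in the pod, because the
    admin process reads these values from the environment it inherits.
    """
    environ["THESPLOG_FILE"] = path            # where the log is written
    environ["THESPLOG_THRESHOLD"] = threshold  # minimum level to record
    if max_size_bytes is not None:
        environ["THESPLOG_FILE_MAXSIZE"] = str(max_size_bytes)
    return environ


# Example call; the path is a placeholder for a writable location in the pod:
# enable_thespian_internal_log("/tmp/thespian.log", threshold="DEBUG")
```

In a Kubernetes deployment the same variables could instead be set in the container spec, which avoids any ordering concerns inside the Python code.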

Let me know if the above helps or what you find from the internal logs.

Regards,

Kevin
