Could this problem in production be related to web2py?


Lisandro

unread,
May 9, 2019, 11:12:24 AM
to web2py-users
I'm running web2py in a production environment that handles about 30 requests per second, using Nginx + uWSGI.

The server is a Linode VPS with 16 cores. 
uWSGI runs an average of 30 workers (it spawns or recycles workers depending on traffic load).
The server has plenty of resources available (CPU usage is always around 25%, used memory is around 50%).

My web2py application consists of a bunch of news websites (articles and blog posts). Most requests are simple HTML pages, and they are generated in a few milliseconds. However, a couple of months ago I noticed that a few of those requests take much longer than they should (some can take up to 60 seconds, which is nonsense). It isn't always the same request that hangs; requests hang randomly: the same request that takes 300 milliseconds to complete can later take up to 50 seconds.

I checked the health of some related services (PostgreSQL for the databases, pgbouncer for pooling database connections, Redis for cache) and there is nothing weird going on there. None of those services is reporting any warnings or errors; they are all running smoothly.

Here is the only interesting thing I noticed: the requests that hang always correspond to uWSGI workers that have been assigned to core 0.
uWSGI has a config option called "harakiri" which is basically a timeout: if a worker has been working on a request for more than X seconds, the worker is killed (recycled). When this happens, uWSGI logs information about the event: the worker number, the request that hung, and the core number that the worker was assigned to. And the problem is **always** with core 0. Here are some examples of what I see in the uWSGI log:

Thu May  9 07:37:29 2019 - *** HARAKIRI ON WORKER 2 (pid: 27125, try: 1) ***
Thu May  9 07:37:29 2019 - HARAKIRI !!! worker 2 status !!!
Thu May  9 07:37:29 2019 - HARAKIRI [core 0] 186.111.149.160 - GET /necrologicas since 1557398188
Thu May  9 07:37:29 2019 - HARAKIRI !!! end of worker 2 status !!!
Thu May  9 07:55:44 2019 - *** HARAKIRI ON WORKER 9 (pid: 18405, try: 1) ***
Thu May  9 07:55:44 2019 - HARAKIRI !!! worker 9 status !!!
Thu May  9 07:55:44 2019 - HARAKIRI [core 0] 186.109.121.239 - GET / since 1557399283
Thu May  9 07:55:44 2019 - HARAKIRI !!! end of worker 9 status !!!
Thu May  9 09:02:48 2019 - *** HARAKIRI ON WORKER 8 (pid: 3287, try: 1) ***
Thu May  9 09:02:48 2019 - HARAKIRI !!! worker 8 status !!!
Thu May  9 09:02:48 2019 - HARAKIRI [core 0] 66.249.79.48 - GET /noticia/2557/secuestran-plantas-de-marihuana-y-una-moto since 1557403307
Thu May  9 09:02:48 2019 - HARAKIRI !!! end of worker 8 status !!!
Thu May  9 09:15:00 2019 - *** HARAKIRI ON WORKER 10 (pid: 9931, try: 1) ***
Thu May  9 09:15:00 2019 - HARAKIRI !!! worker 10 status !!!
Thu May  9 09:15:00 2019 - HARAKIRI [core 0] 66.249.65.142 - GET /amp/156013/prevencion-salud-show-cantando-con-adriana since 1557404039
Thu May  9 09:15:00 2019 - HARAKIRI !!! end of worker 10 status !!!
Thu May  9 09:29:15 2019 - *** HARAKIRI ON WORKER 22 (pid: 14688, try: 1) ***
Thu May  9 09:29:15 2019 - HARAKIRI !!! worker 22 status !!!
Thu May  9 09:29:15 2019 - HARAKIRI [core 0] 181.95.11.146 - GET /noticia/73359/santa-fe-un-changarin-ayudo-a-una-mujer-y-ella-con-una-colecta-le-compro-una-bic since 1557404894
Thu May  9 09:29:15 2019 - HARAKIRI !!! end of worker 22 status !!!
Thu May  9 11:05:20 2019 - *** HARAKIRI ON WORKER 38 (pid: 521, try: 1) ***


Notice that the problem is always on core 0. I've searched the entire log and I haven't found a single occurrence on a different core; it's always core 0.
I've already asked in the uWSGI forum, but it's unlikely that uWSGI itself is the problem (the problem started happening without my having updated or changed anything in uWSGI).

Could this be an issue within web2py? I personally don't think so, but I would like to hear some opinions from the experts :)
IMHO the problem could be related to virtualization (remember this is running on a Linode VPS).

What do you think?
Any comment or suggestion will be much appreciated.
Thanks!

Best regards,
Lisandro.

Leonel Câmara

unread,
May 9, 2019, 2:25:36 PM
to web2py-users
Seems like a locking problem due to threads. Do you create any threads in your application? If so, you need to remove master=true from your uwsgi .ini config.

Lisandro

unread,
May 9, 2019, 4:06:26 PM
to web2py-users
Hi Leonel, thank you very much for your time.

uWSGI docs confirm what you suggest:
"The emperor should generally not be run with --master, unless master features like advanced logging are specifically needed."

Allow me to ask one last question: what do you mean by "create any threads in your application"? Do you mean using subprocess.call() or something like that?
If that's the case, I think I've taken care of that: I only use subprocess within the scheduler environment, not in my controller functions.
Is that what you meant?

Leonel Câmara

unread,
May 9, 2019, 4:10:19 PM
to web2py-users
Yes, I meant exactly that kind of thing.

Lisandro

unread,
May 9, 2019, 8:10:57 PM
to web2py-users
I've checked my app's code once again and I can confirm that it doesn't create threads. It only uses subprocess.call() within functions that are called in the scheduler environment; I understand that's the proper way to do it because those calls don't run in the uWSGI environment.
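To illustrate the pattern with a minimal sketch (the task name and the command are hypothetical; this would live in a model file where db is the usual DAL instance):

import subprocess
from gluon.scheduler import Scheduler

def rebuild_sitemap():
    # runs inside a web2py scheduler worker process, not inside a uWSGI worker
    return subprocess.call(['/usr/local/bin/rebuild_sitemap.sh'])

scheduler = Scheduler(db)

# a controller would only queue the task, never call subprocess directly:
# scheduler.queue_task(rebuild_sitemap)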

On the other hand, I can't disable the master process: I use the "lazy-apps" and "touch-chain-reload" options of uWSGI in order to achieve graceful reloading, and according to the documentation on graceful reloading:
"All of the described techniques assume a modern (>= 1.4) uWSGI release with the master process enabled."

Graceful reloading allows me to update my app's code and reload the uWSGI workers smoothly, without downtime or errors. What can I do if I can't disable the master process?

You mentioned the original problem seems to be a locking problem due to threads. If my app doesn't open threads, where else could the cause of the issue be?

The weirdest thing for me is that the timeouts are always on core 0. I mean, uWSGI runs between 30 and 45 workers over 16 cores; isn't it too much of a coincidence that the requests that hang always correspond to workers assigned to core 0?

Lisandro

unread,
May 13, 2019, 12:49:26 PM
to web2py-users
After doing a lot of reading about uWSGI, I've discovered that "uWSGI cores are not CPU cores" (this was confirmed by the unbit developers, the ones who wrote and maintain uWSGI). This makes me think that the issue I'm experiencing is due to a misconfiguration of uWSGI. But as I'm a developer and not a sysadmin, it's hard for me to figure out exactly which uWSGI options I should tweak.

I know this is out of the scope of this group, but I'll post my uWSGI app configuration anyway, in case someone still wants to help:

[uwsgi]
pythonpath = /var/www/medios/
mount = /=wsgihandler:application
master = true
workers = 40
cpu-affinity = 3
lazy-apps = true
harakiri = 60
reload-mercy = 8
max-requests = 4000
no-orphans = true
vacuum = true
buffer-size = 32768
disable-logging = true
ignore-sigpipe = true
ignore-write-errors = true
listen = 65535
disable-write-exception = true


Just as a reminder, this is running on a machine with 16 CPUs.
Maybe I should enable threads, adjust the workers/processes options, and tweak cpu-affinity (see the sketch below).
My application uses Redis for caching, so I think I can enable threads safely. 
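Something like this is what I have in mind, just a sketch; these values are guesses on my part, not a tested configuration:

# hypothetical tweaks: one worker per core, a couple of threads each
processes = 16
threads = 2
enable-threads = true
cpu-affinity = 1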
What do you think?

Lisandro

unread,
May 24, 2019, 11:26:50 AM
to web2py-users
I've found the root cause of the issue: the culprit was Redis.

This is what was happening: Redis has a persistence option which saves the DB to disk at certain intervals. I had the configuration that comes by default with Redis, which saves the DB every 15 minutes if at least 1 key changed, every 5 minutes if at least 10 keys changed, and every 60 seconds if at least 10,000 keys changed. My Redis instance was saving the DB to disk every minute, and the saving process was taking about 70 seconds. Apparently, during that time, many of the requests were hanging. What I did was simply disable the saving process (I can do that in my case because I don't need persistence).
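
For reference, these are the default snapshot rules from redis.conf, and the change that disables them (this is what I mean by disabling the saving process):

# default snapshot rules (the ones that were in effect):
save 900 1
save 300 10
save 60 10000

# the change that disables RDB snapshots entirely (what I did):
save ""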

I'm not sure why this happens. I know that Redis is single-threaded, but the documentation states that the background save is handled by a child process that Redis forks, so I'm not sure how saving the DB to disk makes the other Redis operations hang. But this is what was happening, and I can confirm that, after disabling the DB saving process, my application's response times have dropped back to expected values, no more timeouts :)
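
In case it helps someone else, the cost of the background save can be observed with standard redis-cli commands, something like:

redis-cli INFO persistence | grep rdb_last_bgsave_time_sec   # duration of the last save, in seconds
redis-cli INFO stats | grep latest_fork_usec                 # cost of the fork itself, in microseconds
redis-cli --latency                                          # overall latency, e.g. while a save is running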

I will continue to investigate this issue with Redis in the proper forum. 
I hope this helps anyone facing the same issue.

Thanks for the help!

Leonel Câmara

unread,
May 24, 2019, 11:31:27 AM
to web2py-users
Thanks a lot for explaining what was happening.

Tim Nyborg

unread,
Jun 6, 2019, 11:15:38 AM
to web2py-users
This is really good to know.  I have a similar architecture to yours, and I'm planning to add Redis to the stack soon.  Knowing about issues to be on the lookout for is very helpful.

Lisandro

unread,
Jun 6, 2019, 3:33:28 PM
to web2py-users
If you're going to add Redis, let me add a couple of comments about my own experience:

 - Using Redis to store sessions (not only for caching) was a huge improvement in my case. I have public websites, some of them with a lot of traffic, so my app handles many sessions. I was using the database to handle sessions, but when I switched to Redis the performance improvement was considerable.

 - Do some tests with the "with_lock" argument available in RedisCache and RedisSession (from gluon.contrib); there's a sketch after this list. In my specific case, using with_lock=False is better, but of course this depends on each specific scenario.

 - A piece of advice: choose proper values for the "maxmemory" and "maxmemory-policy" options in the Redis configuration. The first one sets the maximum amount of memory Redis is allowed to use, and "maxmemory-policy" lets you choose how Redis should evict keys when it hits that limit: https://redis.io/topics/lru-cache
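
For reference, this is more or less how I wire it up in a model file (the host, port and expiry values are from my setup; adjust them to yours):

from gluon.contrib.redis_utils import RConn
from gluon.contrib.redis_cache import RedisCache
from gluon.contrib.redis_session import RedisSession

rconn = RConn(host='localhost', port=6379)

# cache in Redis; in my case with_lock=False performs better
cache.redis = RedisCache(redis_conn=rconn, with_lock=False)

# sessions in Redis, expiring after 10 hours
sessiondb = RedisSession(redis_conn=rconn, with_lock=False, session_expiry=36000)
session.connect(request, response, db=sessiondb)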

Tim Nyborg

unread,
Jun 7, 2019, 6:09:26 AM
to web2py-users
Thanks for this.  Let me know if you find a resolution to the 'saving to disk' latency issue.  Redis sessions would be an improvement, but I'd want to ensure disk-persistence for them (but not for cached things like search results).  How many sessions are you storing, and how much RAM does it consume?

Lisandro

unread,
Jun 7, 2019, 9:15:35 AM
to web2py-users
I'm not exactly sure how many sessions my app is handling, but these numbers will give you an idea:

 - My websites receive about 500k visits (sessions) in an average day.
 - The server handles about 2.5 million requests in an average day.
 - I use RedisSession(session_expiry=36000), that is, sessions handled by Redis expire after 10 hours.
 - I also use Redis to cache the final HTML of public pages for 5 minutes.
 - My Redis instance uses about 12 GB of RAM.
 - My Redis instance consumes only about 8% of CPU (that is, 8% of a single CPU; remember Redis is single-threaded).


When you say "I'd want to ensure disk-persistence for them (but not for cached things like search results)", how do you plan to achieve that? I'm no expert, but I think the disk-persistence option in Redis is global. If you want persistence for sessions but not for other cached things, I think you will need two different instances of Redis, something like the sketch below.
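
Roughly sketched for a model file (the ports are just examples; persistence would be enabled only on the sessions instance):

from gluon.contrib.redis_utils import RConn
from gluon.contrib.redis_cache import RedisCache
from gluon.contrib.redis_session import RedisSession

# instance with "save" rules enabled, used only for sessions
sessions_conn = RConn(host='localhost', port=6379)
# RAM-only instance (save ""), used only for cache
cache_conn = RConn(host='localhost', port=6380)

cache.redis = RedisCache(redis_conn=cache_conn)
sessiondb = RedisSession(redis_conn=sessions_conn, session_expiry=36000)
session.connect(request, response, db=sessiondb)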

Tim Nyborg

unread,
Jun 7, 2019, 10:07:33 AM
to web2py-users
You're exactly right - I'll probably wind up with two instances, one RAM-only for caching and a persistent one for sessions.