I'm trying to find the root cause of an issue that makes my web2py applications respond with a huge delay in some cases.
I'm running about 500 copies of the same application, each with its own PostgreSQL database.
Nginx sits in front and forwards requests to uWSGI.
I use Redis for caching and storing sessions.
There are always plenty of resources available, including uWSGI workers and database connections.
RAM usage is around 50% of the 64 GB total, and CPU usage is around 25% across the 16 cores.
The delayed requests appear to be random, and they even include some that return a 404. So I was wondering: why are my applications hanging, even on a simple 404, in some cases? I focused my investigation on the models, because that code is executed even when web2py has to return a 404. So what are my models doing that could cause a lock or a delay?
First of all, each application connects to its own database, and that is done inside models/db.py. But the database is not the problem here. I use pgBouncer to pool connections to PostgreSQL, and I've checked both logs (PostgreSQL and pgBouncer); everything is normal, and there is plenty of room for connections. Just in case, I also enabled PostgreSQL's log_min_duration_statement feature to log queries that take more than 1 second to complete, and nothing shows up there. My applications run very simple SQL queries.
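For reference, this is the relevant postgresql.conf line (the value is in milliseconds):
log_min_duration_statement = 1000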
On the other hand, each application connects to Redis to store sessions and to cache, which is also done from models/db.py.
A couple of questions/comments about Redis:
with_lock
I use RedisSession(with_lock=False) and RedisCache(with_lock=True).
I've also tried RedisCache(with_lock=False), but that didn't solve the problem (the sketch after the next point shows exactly how these are wired up).
RConn
In my model, I create a single Redis connection with RConn, and then use that same connection for both RedisSession() and RedisCache().
I don't think that should be a problem, but what do you think? Should I use two separate connections?
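To make that concrete, the pattern boils down to this (a condensed sketch of the full models/db.py shown further down; the bare RConn() call is an assumption, since my real get_redis_conn() helper reads host/port from config):
from gluon.contrib.redis_utils import RConn
from gluon.contrib.redis_cache import RedisCache
from gluon.contrib.redis_session import RedisSession
# one shared connection, reused by both the cache and the session store
redis_conn = RConn()
cache.redis = RedisCache(redis_conn=redis_conn, with_lock=True,
                         application=request.application)
sessiondb = RedisSession(redis_conn=redis_conn, session_expiry=36000,
                         with_lock=False)
session.connect(request, response, db=sessiondb)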
Redis SLOWLOG
I'm using this Redis feature to look for queries that are taking too long, and I've found something weird.
First, I connected with redis-cli and configured the slow log to record queries that take more than 1000000 microseconds (that is, 1 second):
redis-cli
127.0.0.1:6379> SLOWLOG RESET
127.0.0.1:6379> CONFIG SET slowlog-log-slower-than 1000000
After a few seconds running, I check the last three queries that took too long and I see this:
127.0.0.1:6379> SLOWLOG GET 3
1) 1) (integer) 21928
2) (integer) 1558545081
3) (integer) 100109
4) 1) "SUNION"
2) "w2p:fmcomunicar:___cache_set:25822188"
3) "w2p:fmcomunicar:___cache_set:25822189"
4) "w2p:fmcomunicar:___cache_set:25844541"
5) "w2p:fmcomunicar:___cache_set:25847383"
6) "w2p:fmcomunicar:___cache_set:25822182"
7) "w2p:fmcomunicar:___cache_set:25822183"
8) "w2p:fmcomunicar:___cache_set:25822180"
9) "w2p:fmcomunicar:___cache_set:25822181"
10) "w2p:fmcomunicar:___cache_set:25822186"
11) "w2p:fmcomunicar:___cache_set:25822187"
12) "w2p:fmcomunicar:___cache_set:25822184"
13) "w2p:fmcomunicar:___cache_set:25822185"
14) "w2p:fmcomunicar:___cache_set:25841469"
15) "w2p:fmcomunicar:___cache_set:25958993"
16) "w2p:fmcomunicar:___cache_set:25958992"
17) "w2p:fmcomunicar:___cache_set:25958991"
18) "w2p:fmcomunicar:___cache_set:25958990"
19) "w2p:fmcomunicar:___cache_set:25958997"
20) "w2p:fmcomunicar:___cache_set:25958996"
21) "w2p:fmcomunicar:___cache_set:25958995"
22) "w2p:fmcomunicar:___cache_set:25855749"
23) "w2p:fmcomunicar:___cache_set:25855746"
24) "w2p:fmcomunicar:___cache_set:25855747"
25) "w2p:fmcomunicar:___cache_set:25855744"
26) "w2p:fmcomunicar:___cache_set:25958998"
27) "w2p:fmcomunicar:___cache_set:25855742"
28) "w2p:fmcomunicar:___cache_set:25855743"
29) "w2p:fmcomunicar:___cache_set:25855740"
30) "w2p:fmcomunicar:___cache_set:25855741"
31) "w2p:fmcomunicar:___cache_set:25931985"
32) "... (170178 more arguments)"
2) 1) (integer) 21927
2) (integer) 1558545078
3) (integer) 116807
4) 1) "SMEMBERS"
2) "w2p:fmcomunicar:___cache_set"
3) 1) (integer) 21926
2) (integer) 1558545047
3) (integer) 85990
4) 1) "SUNION"
2) "w2p:elexpresodesalta:___cache_set:25823558"
3) "w2p:elexpresodesalta:___cache_set:25823559"
4) "w2p:elexpresodesalta:___cache_set:25930251"
5) "w2p:elexpresodesalta:___cache_set:25914961"
6) "w2p:elexpresodesalta:___cache_set:25823550"
7) "w2p:elexpresodesalta:___cache_set:25823551"
8) "w2p:elexpresodesalta:___cache_set:25823552"
9) "w2p:elexpresodesalta:___cache_set:25823553"
10) "w2p:elexpresodesalta:___cache_set:25823554"
11) "w2p:elexpresodesalta:___cache_set:25823555"
12) "w2p:elexpresodesalta:___cache_set:25823556"
13) "w2p:elexpresodesalta:___cache_set:25823557"
14) "w2p:elexpresodesalta:___cache_set:25914962"
15) "w2p:elexpresodesalta:___cache_set:25954748"
16) "w2p:elexpresodesalta:___cache_set:25954749"
17) "w2p:elexpresodesalta:___cache_set:25954744"
18) "w2p:elexpresodesalta:___cache_set:25954745"
19) "w2p:elexpresodesalta:___cache_set:25954746"
20) "w2p:elexpresodesalta:___cache_set:25954747"
21) "w2p:elexpresodesalta:___cache_set:25954740"
22) "w2p:elexpresodesalta:___cache_set:25954742"
23) "w2p:elexpresodesalta:___cache_set:25954743"
24) "w2p:elexpresodesalta:___cache_set:25929344"
25) "w2p:elexpresodesalta:___cache_set:25931796"
26) "w2p:elexpresodesalta:___cache_set:25888189"
27) "w2p:elexpresodesalta:___cache_set:25913734"
28) "w2p:elexpresodesalta:___cache_set:25913735"
29) "w2p:elexpresodesalta:___cache_set:25839999"
30) "w2p:elexpresodesalta:___cache_set:25839998"
31) "w2p:elexpresodesalta:___cache_set:25934839"
32) "... (147209 more arguments)
If I understand the Redis SLOWLOG documentation correctly, the third integer of every entry is "the amount of time needed for its execution, in microseconds". So I get the feeling that Redis is ignoring the slowlog-log-slower-than value I've set, because the log is showing queries that took much less than 1000000 microseconds (100109, 116807 and 85990 above). I've tried running SLOWLOG RESET both before and after setting slowlog-log-slower-than, but it always behaves the same way.
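To spell out how I'm reading the entries, take the second one above:
1) (integer) 21927        <- unique, progressive id of the entry
2) (integer) 1558545078   <- unix timestamp when the command was executed
3) (integer) 116807       <- execution time in microseconds (~0.12 s, well below my 1 s threshold)
4) ...                    <- the command and its arguments
And here is a quick sanity check to confirm the server really holds the value I set (a minimal sketch; it assumes the connection returned by get_redis_conn() behaves like a standard redis-py client, which exposes config_get):
print(redis_conn.config_get('slowlog-log-slower-than'))
# expected: {'slowlog-log-slower-than': '1000000'}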
Anyway, I'm pretty lost with this issue. Even though there are about 500 websites running, the total load is only about 30 requests per second. And even though most requests take less than a second to complete, a few of them sometimes take up to 60 seconds (or they just time out). Even a simple 404 page! :/
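One more data point worth gathering: the slow SUNION calls above carry 170000+ arguments, which suggests the per-application ___cache_set sets have grown huge. A hypothetical inspection snippet to measure them (SCAN and SCARD are standard Redis commands; again this assumes a redis-py style client):
# print how many members each application's cache-tracking set holds
for key in redis_conn.scan_iter(match='w2p:*:___cache_set'):
    print(key, redis_conn.scard(key))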
Here is my models/db.py in case you notice something odd:
# -*- coding: utf-8 -*-
import os
import time
import locale
from datetime import datetime as dt, date, timedelta, time as t
from gluon import current
from gluon.contrib.redis_cache import RedisCache
from gluon.contrib.redis_session import RedisSession
from gluon.dal import SQLCustomType
from gluon.tools import Auth, Storage
from templates import init_template
from globales import \
    init_config, \
    init_config_template, \
    inicializar_db, \
    MyAuth, \
    constantes_generales, \
    get_redis_conn
current.request = request
current.response = response
locale.setlocale(locale.LC_ALL, 'es_ES.utf8')
redis_conn = get_redis_conn()  # single shared Redis connection (wraps RConn)
cache.redis = RedisCache(
    redis_conn=redis_conn,
    with_lock=True,
    application=request.application.replace('_panel', ''))
current.cache = cache
# initialize cache expiry constants (in seconds)
TIME_CACHE_LG = 86400
TIME_CACHE_MD = 3600
TIME_CACHE_SM = 300
TIME_CACHE_XS = 30
# initialize the configuration
CONFIG = cache.redis('CONFIG', lambda: init_config(), time_expire=TIME_CACHE_LG)
current.CONFIG = CONFIG
# connect to the site's database
db = DAL(
    'postgres://%s:%s@%s/%s' % (
        CONFIG.db_user,
        CONFIG.db_user_password,
        CONFIG.db_host,
        CONFIG.db_name),
    attempts=1,
    migrate=CONFIG.migrate or CONFIG.fake_migrate,
    lazy_tables=not (CONFIG.primera_ejecucion
                     or CONFIG.migrate
                     or CONFIG.fake_migrate),
    fake_migrate=CONFIG.fake_migrate,
    folder=os.path.join(CONFIG.path_datos, 'databases'))
current.db = db
sessiondb = RedisSession(redis_conn=redis_conn, session_expiry=36000)  # 10 hours
session.connect(request, response, db=sessiondb, masterapp=CONFIG.cliente)
current.session = session
# make the session cookie valid for both maindomain.com and panel.maindomain.com
if response.session_id_name in response.cookies:
response.cookies[response.session_id_name]['domain'] = CONFIG.dominio
auth = MyAuth(
    globals(),
    db,
    hmac_key=Auth.get_or_create_key(
        filename=os.path.join(CONFIG.path_datos, 'auth.key')))
auth.define_tables()
current.auth = auth
# -----------------------------------------
# here it defines around 15 tables
# -----------------------------------------
TEMPLATE = cache.redis(
    'TEMPLATE',
    lambda: init_template(template=CONFIG.template, current_config=CONFIG),
    time_expire=TIME_CACHE_LG)
C = cache.redis('C', lambda: constantes_generales(), time_expire=TIME_CACHE_LG)
T.force('es-es')
CT = cache.redis('CT', lambda: init_config_template(), time_expire=TIME_CACHE_LG)
What do you think? Where could the issue be?