I'm trying to find the root cause of an issue that makes my web2py applications respond with a huge delay in some cases.
I'm running about 500 copies of the same application, each with its own PostgreSQL database.
Nginx sits in front and forwards requests to uWSGI.
I use Redis for caching and storing sessions.
There are always plenty of resources available, including uWSGI workers and database connections.
RAM usage is around 50% of the 64 GB total, and CPU usage is around 25% across the 16 cores.
The delayed requests appear to be random, and they even include some that return a 404. So I was wondering: why are my applications hanging, even on a simple 404, in some cases? I focused my investigation on the models, because that code is executed even when web2py has to return a 404. So what are my models doing that could cause a lock or a delay?
First of all, each application connects to its own database, and that is done inside models/db.py. But the database is not the problem here. I use pgBouncer to pool connections to PostgreSQL, and I've checked both logs (PostgreSQL and pgBouncer); everything is normal, and there is plenty of room for connections. Just in case, I also enabled PostgreSQL's log_min_duration_statement feature to log queries that take more than 1 second to complete, and nothing shows up there. My applications run very simple SQL queries.
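For reference, this is the relevant postgresql.conf line (the value is in milliseconds):
log_min_duration_statement = 1000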
On the other hand, each application connects to Redis to store sessions and to cache, which is also done from models/db.py.
A couple of questions/comments about Redis:
with_lock
I use RedisSession(with_lock=False) and RedisCache(with_lock=True).
I've also tried RedisCache(with_lock=False), but that didn't solve the problem (the sketch after the next point shows exactly how these are wired up).
RConn
In my model, I create a single Redis connection with RConn, and then use that same connection for both RedisSession() and RedisCache().
I don't think that should be a problem, but what do you think? Should I use two separate connections?
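To make that concrete, the pattern boils down to this (a condensed sketch of the full models/db.py shown further down; the bare RConn() call is an assumption, since my real get_redis_conn() helper reads host/port from config):
from gluon.contrib.redis_utils import RConn
from gluon.contrib.redis_cache import RedisCache
from gluon.contrib.redis_session import RedisSession
# one shared connection, reused by both the cache and the session store
redis_conn = RConn()
cache.redis = RedisCache(redis_conn=redis_conn, with_lock=True,
                         application=request.application)
sessiondb = RedisSession(redis_conn=redis_conn, session_expiry=36000,
                         with_lock=False)
session.connect(request, response, db=sessiondb)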
Redis SLOWLOG
I'm using this Redis feature to look for queries that are taking too long, and I've found something weird.
First, I connected with redis-cli and configured the slow log to record queries that take more than 1000000 microseconds (that is, 1 second):
redis-cli
127.0.0.1:6379> SLOWLOG RESET
127.0.0.1:6379> CONFIG SET slowlog-log-slower-than 1000000
After a few seconds running, I check the last three queries that took too long and I see this:
127.0.0.1:6379> SLOWLOG GET 3
1) 1) (integer) 21928
2) (integer) 1558545081
3) (integer) 100109
4) 1) "SUNION"
2) "w2p:fmcomunicar:___cache_set:25822188"
3) "w2p:fmcomunicar:___cache_set:25822189"
4) "w2p:fmcomunicar:___cache_set:25844541"
5) "w2p:fmcomunicar:___cache_set:25847383"
6) "w2p:fmcomunicar:___cache_set:25822182"
7) "w2p:fmcomunicar:___cache_set:25822183"
8) "w2p:fmcomunicar:___cache_set:25822180"
9) "w2p:fmcomunicar:___cache_set:25822181"
10) "w2p:fmcomunicar:___cache_set:25822186"
11) "w2p:fmcomunicar:___cache_set:25822187"
12) "w2p:fmcomunicar:___cache_set:25822184"
13) "w2p:fmcomunicar:___cache_set:25822185"
14) "w2p:fmcomunicar:___cache_set:25841469"
15) "w2p:fmcomunicar:___cache_set:25958993"
16) "w2p:fmcomunicar:___cache_set:25958992"
17) "w2p:fmcomunicar:___cache_set:25958991"
18) "w2p:fmcomunicar:___cache_set:25958990"
19) "w2p:fmcomunicar:___cache_set:25958997"
20) "w2p:fmcomunicar:___cache_set:25958996"
21) "w2p:fmcomunicar:___cache_set:25958995"
22) "w2p:fmcomunicar:___cache_set:25855749"
23) "w2p:fmcomunicar:___cache_set:25855746"
24) "w2p:fmcomunicar:___cache_set:25855747"
25) "w2p:fmcomunicar:___cache_set:25855744"
26) "w2p:fmcomunicar:___cache_set:25958998"
27) "w2p:fmcomunicar:___cache_set:25855742"
28) "w2p:fmcomunicar:___cache_set:25855743"
29) "w2p:fmcomunicar:___cache_set:25855740"
30) "w2p:fmcomunicar:___cache_set:25855741"
31) "w2p:fmcomunicar:___cache_set:25931985"
32) "... (170178 more arguments)"
2) 1) (integer) 21927
2) (integer) 1558545078
3) (integer) 116807
4) 1) "SMEMBERS"
2) "w2p:fmcomunicar:___cache_set"
3) 1) (integer) 21926
2) (integer) 1558545047
3) (integer) 85990
4) 1) "SUNION"
2) "w2p:elexpresodesalta:___cache_set:25823558"
3) "w2p:elexpresodesalta:___cache_set:25823559"
4) "w2p:elexpresodesalta:___cache_set:25930251"
5) "w2p:elexpresodesalta:___cache_set:25914961"
6) "w2p:elexpresodesalta:___cache_set:25823550"
7) "w2p:elexpresodesalta:___cache_set:25823551"
8) "w2p:elexpresodesalta:___cache_set:25823552"
9) "w2p:elexpresodesalta:___cache_set:25823553"
10) "w2p:elexpresodesalta:___cache_set:25823554"
11) "w2p:elexpresodesalta:___cache_set:25823555"
12) "w2p:elexpresodesalta:___cache_set:25823556"
13) "w2p:elexpresodesalta:___cache_set:25823557"
14) "w2p:elexpresodesalta:___cache_set:25914962"
15) "w2p:elexpresodesalta:___cache_set:25954748"
16) "w2p:elexpresodesalta:___cache_set:25954749"
17) "w2p:elexpresodesalta:___cache_set:25954744"
18) "w2p:elexpresodesalta:___cache_set:25954745"
19) "w2p:elexpresodesalta:___cache_set:25954746"
20) "w2p:elexpresodesalta:___cache_set:25954747"
21) "w2p:elexpresodesalta:___cache_set:25954740"
22) "w2p:elexpresodesalta:___cache_set:25954742"
23) "w2p:elexpresodesalta:___cache_set:25954743"
24) "w2p:elexpresodesalta:___cache_set:25929344"
25) "w2p:elexpresodesalta:___cache_set:25931796"
26) "w2p:elexpresodesalta:___cache_set:25888189"
27) "w2p:elexpresodesalta:___cache_set:25913734"
28) "w2p:elexpresodesalta:___cache_set:25913735"
29) "w2p:elexpresodesalta:___cache_set:25839999"
30) "w2p:elexpresodesalta:___cache_set:25839998"
31) "w2p:elexpresodesalta:___cache_set:25934839"
32) "... (147209 more arguments)
If I understand the Redis SLOWLOG documentation correctly, the third integer of every entry is "the amount of time needed for its execution, in microseconds". So I get the feeling that Redis is ignoring the slowlog-log-slower-than value I've set, because the log is showing queries that took much less than 1000000 microseconds (100109, 116807 and 85990 above). I've tried running SLOWLOG RESET both before and after setting slowlog-log-slower-than, but it always behaves the same way.
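To spell out how I'm reading the entries, take the second one above:
1) (integer) 21927        <- unique, progressive id of the entry
2) (integer) 1558545078   <- unix timestamp when the command was executed
3) (integer) 116807       <- execution time in microseconds (~0.12 s, well below my 1 s threshold)
4) ...                    <- the command and its arguments
And here is a quick sanity check to confirm the server really holds the value I set (a minimal sketch; it assumes the connection returned by get_redis_conn() behaves like a standard redis-py client, which exposes config_get):
print(redis_conn.config_get('slowlog-log-slower-than'))
# expected: {'slowlog-log-slower-than': '1000000'}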
Anyway, I'm pretty lost with this issue. Even though there are about 500 websites running, the total load is only about 30 requests per second. And even though most requests take less than a second to complete, a few of them sometimes take up to 60 seconds (or they just time out). Even a simple 404 page! :/
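One more data point worth gathering: the slow SUNION calls above carry 170000+ arguments, which suggests the per-application ___cache_set sets have grown huge. A hypothetical inspection snippet to measure them (SCAN and SCARD are standard Redis commands; again this assumes a redis-py style client):
# print how many members each application's cache-tracking set holds
for key in redis_conn.scan_iter(match='w2p:*:___cache_set'):
    print(key, redis_conn.scard(key))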
Here is my models/db.py in case you notice something odd:
# -*- coding: utf-8 -*-
import os
import time
import locale
from datetime import datetime as dt, date, timedelta, time as t
from gluon import current
from gluon.contrib.redis_cache import RedisCache
from gluon.contrib.redis_session import RedisSession
from gluon.dal import SQLCustomType
from gluon.tools import Auth, Storage
from templates import init_template
from globales import \
    init_config, \
    init_config_template, \
    inicializar_db, \
    MyAuth, \
    constantes_generales, \
    get_redis_conn
current.request = request
current.response = response
locale.setlocale(locale.LC_ALL, 'es_ES.utf8')
redis_conn = get_redis_conn()  # single shared Redis connection (wraps RConn)
cache.redis = RedisCache(
    redis_conn=redis_conn,
    with_lock=True,
    application=request.application.replace('_panel', ''))
current.cache = cache
# initialize cache expiry constants (in seconds)
TIME_CACHE_LG = 86400
TIME_CACHE_MD = 3600
TIME_CACHE_SM = 300
TIME_CACHE_XS = 30
# initialize the configuration
CONFIG = cache.redis('CONFIG', lambda: init_config(), time_expire=TIME_CACHE_LG)
current.CONFIG = CONFIG
# connect to the site's database
db = DAL(
    'postgres://%s:%s@%s/%s' % (
        CONFIG.db_user,
        CONFIG.db_user_password,
        CONFIG.db_host,
        CONFIG.db_name),
    attempts=1,
    migrate=CONFIG.migrate or CONFIG.fake_migrate,
    lazy_tables=not (CONFIG.primera_ejecucion
                     or CONFIG.migrate
                     or CONFIG.fake_migrate),
    fake_migrate=CONFIG.fake_migrate,
    folder=os.path.join(CONFIG.path_datos, 'databases'))
current.db = db
sessiondb = RedisSession(redis_conn=redis_conn, session_expiry=36000)  # 10 hours
session.connect(request, response, db=sessiondb, masterapp=CONFIG.cliente)
current.session = session
# make the session cookie valid for both maindomain.com and panel.maindomain.com
if response.session_id_name in response.cookies:
response.cookies[response.session_id_name]['domain'] = CONFIG.dominio
auth = MyAuth(
    globals(),
    db,
    hmac_key=Auth.get_or_create_key(
        filename=os.path.join(CONFIG.path_datos, 'auth.key')))
auth.define_tables()
current.auth = auth
# -----------------------------------------
# here it defines around 15 tables
# -----------------------------------------
TEMPLATE = cache.redis(
    'TEMPLATE',
    lambda: init_template(template=CONFIG.template, current_config=CONFIG),
    time_expire=TIME_CACHE_LG)
C = cache.redis('C', lambda: constantes_generales(), time_expire=TIME_CACHE_LG)
T.force('es-es')
CT = cache.redis('CT', lambda: init_config_template(), time_expire=TIME_CACHE_LG)
What do you think? Where could the issue be?