Query hangs, probably after connection timeout


Tilmann Singer

Jul 30, 2013, 5:41:47 AM
to rub...@googlegroups.com
I am trying to debug an error that occurs infrequently on a setup
consisting of a web server with Unicorn workers running Rails, connected
via TCP on an internal network to a PostgreSQL server. Every couple of
hours a web request fails because the Unicorn worker process hangs while
processing the request. The worker process is killed by the Unicorn
timeout mechanism after some time, and subsequent requests work fine. I
haven't found a way to reproduce the error other than waiting, but I
suspect it has to do with connections timing out - the application often
has longer idle periods overnight, and this morning we experienced the
error several times, whereas during the day it appears to occur less
frequently.

We also see occasional connection timeouts that immediately raise the
exception "PG::UnableToSend: could not send data to server: Connection
timed out", which is OK because Rails can deal with that by reconnecting
immediately (whether it actually does so properly is another topic).

To find out where the Ruby process hangs, I've added a middleware that
prints the Ruby backtrace and then sends the hanging process a SIGSEGV,
which makes it log the C level backtrace shown further below. I don't
know C, but it looks to me like this is related to the pg gem, so I'm
posting it here.
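
For illustration, a minimal sketch of what such a middleware could look
like (class name, timeout value and logger are placeholders, not our
actual code):

class HangReporter
  def initialize(app, timeout = 25)
    @app = app
    @timeout = timeout
  end

  def call(env)
    worker = Thread.current
    # Watchdog thread: if the request runs longer than the timeout, log
    # the worker's Ruby backtrace and send SIGSEGV to the process so
    # that MRI prints its C level backtrace before the worker dies.
    watchdog = Thread.new do
      sleep @timeout
      Rails.logger.info("Thread backtrace:\n#{worker.backtrace.join("\n")}")
      Process.kill("SEGV", Process.pid)
    end
    @app.call(env)
  ensure
    watchdog.kill if watchdog
  end
end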

I think slow queries can be ruled out: it always hangs while executing a
simple standard query, the PostgreSQL server has a low
statement_timeout, and there is nothing to see in its log when this
happens.


[BUG] Segmentation fault
ruby 2.0.0p195 (2013-05-14 revision 40734) [x86_64-linux]

-- Control frame information -----------------------------------------------
c:0001 p:---- s:0002 e:000001 TOP [FINISH]


-- C level backtrace information -------------------------------------------
unicorn worker[4] -E staging -D -c /srv/projectname/current/config/unicorn.rb(+0x15798a) [0x7f6e8ecf798a]
unicorn worker[4] -E staging -D -c /srv/projectname/current/config/unicorn.rb(+0x1ae38c) [0x7f6e8ed4e38c]
unicorn worker[4] -E staging -D -c /srv/projectname/current/config/unicorn.rb(rb_bug+0xb8) [0x7f6e8ed4e668]
unicorn worker[4] -E staging -D -c /srv/projectname/current/config/unicorn.rb(+0xda0cd) [0x7f6e8ec7a0cd]
/lib64/libpthread.so.0(+0xf500) [0x7f6e8e770500] pg_connection.c:1052
/lib64/libc.so.6(__poll+0x53) [0x7f6e8dbe6253]
/usr/pgsql-9.2/lib/libpq.so.5(+0x141df) [0x7f6e828e71df]
/usr/pgsql-9.2/lib/libpq.so.5(+0x14260) [0x7f6e828e7260]
/usr/pgsql-9.2/lib/libpq.so.5(PQgetResult+0x99) [0x7f6e828e6889]
/usr/pgsql-9.2/lib/libpq.so.5(+0x13b68) [0x7f6e828e6b68]
/srv/projectname/shared/bundle/ruby/2.0.0/gems/pg-0.16.0/lib/pg_ext.so(+0x6320) [0x7f6e82b06320] gvl_wrappers.c:9
unicorn worker[4] -E staging -D -c /srv/projectname/current/config/unicorn.rb(+0x165322) [0x7f6e8ed05322]
/srv/projectname/shared/bundle/ruby/2.0.0/gems/pg-0.16.0/lib/pg_ext.so(gvl_PQexec+0x2f) [0x7f6e82b05fff] gvl_wrappers.c:10
/srv/projectname/shared/bundle/ruby/2.0.0/gems/pg-0.16.0/lib/pg_ext.so(+0x11139) [0x7f6e82b11139] pg_connection.c:853
unicorn worker[4] -E staging -D -c /srv/projectname/current/config/unicorn.rb(+0x140169) [0x7f6e8ece0169]
unicorn worker[4] -E staging -D -c /srv/projectname/current/config/unicorn.rb(+0x1562d3) [0x7f6e8ecf62d3]
unicorn worker[4] -E staging -D -c /srv/projectname/current/config/unicorn.rb(+0x14a7bd) [0x7f6e8ecea7bd]
unicorn worker[4] -E staging -D -c /srv/projectname/current/config/unicorn.rb(+0x14fddb) [0x7f6e8ecefddb]
unicorn worker[4] -E staging -D -c /srv/projectname/current/config/unicorn.rb(rb_iseq_eval+0x196) [0x7f6e8ecf0526]
unicorn worker[4] -E staging -D -c /srv/projectname/current/config/unicorn.rb(+0x1b2efd) [0x7f6e8ed52efd]
unicorn worker[4] -E staging -D -c /srv/projectname/current/config/unicorn.rb(+0x1b306d) [0x7f6e8ed5306d]
unicorn worker[4] -E staging -D -c /srv/projectname/current/config/unicorn.rb(+0x140169) [0x7f6e8ece0169]
unicorn worker[4] -E staging -D -c /srv/projectname/current/config/unicorn.rb(+0x1562d3) [0x7f6e8ecf62d3]
unicorn worker[4] -E staging -D -c /srv/projectname/current/config/unicorn.rb(+0x14a7bd) [0x7f6e8ecea7bd]
unicorn worker[4] -E staging -D -c /srv/projectname/current/config/unicorn.rb(+0x14fddb) [0x7f6e8ecefddb]
unicorn worker[4] -E staging -D -c /srv/projectname/current/config/unicorn.rb(rb_iseq_eval_main+0x1c8) [0x7f6e8ecf01f8]
unicorn worker[4] -E staging -D -c /srv/projectname/current/config/unicorn.rb(+0x22012) [0x7f6e8ebc2012]
unicorn worker[4] -E staging -D -c /srv/projectname/current/config/unicorn.rb(ruby_run_node+0x36) [0x7f6e8ebc4be6]
unicorn worker[4] -E staging -D -c /srv/projectname/current/config/unicorn.rb(+0x2078b) [0x7f6e8ebc078b]
/lib64/libc.so.6(__libc_start_main+0xfd) [0x7f6e8db25cdd]
unicorn worker[4] -E staging -D -c /srv/projectname/current/config/unicorn.rb(+0x20669) [0x7f6e8ebc0669]


INFO 2013-07-30 06:34:45 UTC - Thread backtrace:
INFO 2013-07-30 06:34:45 UTC - /srv/projectname/shared/bundle/ruby/2.0.0/gems/activerecord-4.0.0/lib/active_record/connection_adapters/postgresql_adapter.rb:768:in `exec'
/srv/projectname/shared/bundle/ruby/2.0.0/gems/activerecord-4.0.0/lib/active_record/connection_adapters/postgresql_adapter.rb:768:in `exec_no_cache'
/srv/projectname/shared/bundle/ruby/2.0.0/gems/activerecord-4.0.0/lib/active_record/connection_adapters/postgresql/database_statements.rb:138:in `block in exec_query'
/srv/projectname/shared/bundle/ruby/2.0.0/gems/activerecord-4.0.0/lib/active_record/connection_adapters/abstract_adapter.rb:425:in `block in log'
/srv/projectname/shared/bundle/ruby/2.0.0/gems/activesupport-4.0.0/lib/active_support/notifications/instrumenter.rb:20:in `instrument'
/srv/projectname/shared/bundle/ruby/2.0.0/gems/activerecord-4.0.0/lib/active_record/connection_adapters/abstract_adapter.rb:420:in `log'
/srv/projectname/shared/bundle/ruby/2.0.0/gems/activerecord-4.0.0/lib/active_record/connection_adapters/postgresql/database_statements.rb:137:in `exec_query'
/srv/projectname/shared/bundle/ruby/2.0.0/gems/activerecord-4.0.0/lib/active_record/connection_adapters/postgresql_adapter.rb:885:in `select'
/srv/projectname/shared/bundle/ruby/2.0.0/gems/activerecord-4.0.0/lib/active_record/connection_adapters/abstract/database_statements.rb:24:in `select_all'
/srv/projectname/shared/bundle/ruby/2.0.0/gems/activerecord-4.0.0/lib/active_record/connection_adapters/abstract/query_cache.rb:61:in `block in select_all'
/srv/projectname/shared/bundle/ruby/2.0.0/gems/activerecord-4.0.0/lib/active_record/connection_adapters/abstract/query_cache.rb:76:in `cache_sql'
/srv/projectname/shared/bundle/ruby/2.0.0/gems/activerecord-4.0.0/lib/active_record/connection_adapters/abstract/query_cache.rb:61:in `select_all'
/srv/projectname/shared/bundle/ruby/2.0.0/gems/activerecord-4.0.0/lib/active_record/querying.rb:36:in `find_by_sql'
/srv/projectname/shared/bundle/ruby/2.0.0/gems/activerecord-4.0.0/lib/active_record/relation.rb:585:in `exec_queries'
/srv/projectname/shared/bundle/ruby/2.0.0/gems/activerecord-4.0.0/lib/active_record/relation.rb:471:in `load'
/srv/projectname/shared/bundle/ruby/2.0.0/gems/activerecord-4.0.0/lib/active_record/relation.rb:220:in `to_a'
/srv/projectname/shared/bundle/ruby/2.0.0/gems/activerecord-4.0.0/lib/active_record/relation/finder_methods.rb:315:in `find_take'
/srv/projectname/shared/bundle/ruby/2.0.0/gems/activerecord-4.0.0/lib/active_record/relation/finder_methods.rb:65:in `take'
/srv/projectname/shared/bundle/ruby/2.0.0/gems/activerecord-4.0.0/lib/active_record/relation/finder_methods.rb:48:in `find_by'
/srv/projectname/shared/bundle/ruby/2.0.0/gems/activerecord-4.0.0/lib/active_record/querying.rb:6:in `find_by'
[.. application backtrace removed ...]


I'd be interested to hear whether you think this could be a bug in pg,
but also in strategies for a workaround, e.g. how to avoid the timeouts
in the first place. Would it make sense to change the TCP keepalive
settings on the PostgreSQL server? The settings in postgresql.conf are
currently at their default values, and the kernel values (on both the
app and the db server) are:

# sysctl -a|grep keep
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_intvl = 75
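
For reference, the corresponding server-side parameters in
postgresql.conf would be the following; the values are only
illustrative, not settings we actually run:

# postgresql.conf: send a keepalive after 60s of idle, probe every 60s,
# give up after 10 failed probes (illustrative values only)
tcp_keepalives_idle = 60
tcp_keepalives_interval = 60
tcp_keepalives_count = 10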

The error occurs with pg 0.16.0. It also occurred with 0.15.1 (I cannot
confirm this 100% because there is no logged backtrace from before the
upgrade, but the app showed the same behaviour).


Kernel version:
2.6.32-358.el6.x86_64


Ruby version: MRI 2.0.0-p195

OS:
$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.4 (Santiago)


thank you, Tilmann

Tilmann Singer

Aug 1, 2013, 8:23:41 AM
to rub...@googlegroups.com
> Would it make sense to change the TCP keepalive settings on the
> PostgreSQL server?

Changing the PostgreSQL keepalive settings prevents the timeouts from
occurring, so it seems to be a useful workaround.

Lars Kanis

Aug 3, 2013, 5:30:25 PM
to ti...@tils.net, rub...@googlegroups.com
Hi Tilmann!
Thank you for following up on the issue - and sorry for my late
response. There is a similar issue on Bitbucket:
https://bitbucket.org/ged/ruby-pg/issue/170 . It was my assumption, too,
that the keepalive parameters could help.

Is there a firewall between the web and the database server? Firewalls
tend to consider a connection dead after less than 2 hours of idle time
(the Linux keepalive default) and drop the connection state information.
Which keepalive values do you use now?

Regards,
Lars

Tilmann Singer

Aug 3, 2013, 6:07:07 PM
to la...@greiz-reinsdorf.de, rub...@googlegroups.com
Lars Kanis <la...@greiz-reinsdorf.de> writes:
> Is there a firewall between the web and the database server? Firewalls
> tend to consider a connection dead after less than 2 hours of idle time
> (the Linux keepalive default) and drop the connection state information.
> Which keepalive values do you use now?

Yes, there is a firewall between the servers. I don't know its
specifics, but it was suspected that it might drop idle connections
earlier than usual.

I'm currently setting the following values in the client connection.
Since this is Rails, they conveniently go in database.yml:

production:
  ...
  variables:
    tcp_keepalives_idle: 60
    tcp_keepalives_interval: 60
    tcp_keepalives_count: 100

This might result in a lot more keepalives being sent than necessary,
but I doubt that this is problematic.
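
An alternative I have not tried would be to enable keepalives on the
client socket instead, via libpq's keepalive connection parameters, e.g.
when connecting with the pg gem directly (host and dbname below are just
placeholders):

conn = PG.connect(
  host: 'db.internal',             # placeholder
  dbname: 'projectname_production',
  keepalives: 1,                   # enable TCP keepalives on the client socket
  keepalives_idle: 60,             # seconds of idle before the first probe
  keepalives_interval: 60,         # seconds between probes
  keepalives_count: 100            # failed probes before the connection is dropped
)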