Ruby threads block when cancelling a query through haproxy


zsom...@gmail.com

Sep 20, 2013, 11:01:39 AM
to rub...@googlegroups.com
Hi!

I have a strange problem which unfortunately I haven't managed to simplify enough to make a sensible bug report, so I'm just trying to explain what's happening; maybe you have some ideas.

The main thing we want to achieve is to execute several queries in parallel and, if any of them succeeds, cancel the others. However, when I call the connection's cancel method in any thread, it somehow blocks the whole Ruby VM (?) and only continues when the server side closes the connection.
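
Roughly, the pattern we're after is something like this (just a sketch with naive polling and placeholder queries, using the same DBLINK placeholder as the test script below, not the failing case itself):

require 'pg'
conns = 3.times.map do |i|
  PG.connect(DBLINK).tap { |c| c.send_query("SELECT pg_sleep(#{i})") }
end
winner = nil
until winner
  conns.each(&:consume_input)             # pump any pending network input
  winner = conns.find { |c| !c.is_busy }  # first connection whose result is ready
  sleep 0.01 unless winner
end
conns.each { |c| c.cancel unless c.equal?(winner) }  # cancel the others
conns.each(&:close)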

To make it more complicated, this does not happen when I connect directly to the postgres server; it only happens when I go through a haproxy server first, which has a limit of 5 connections per postgres server. So when I open 6 parallel connections to the haproxy, 5 of them are delivered to the postgres server and one is kept in the queue, waiting until any of the others finishes.

I'm using ruby-2.0.0-p247 with pg-0.17.0 on a Mac with OS X 10.8.5.

I tried to make the smallest possible test script that fails for me, but since it only fails in the haproxy case I'm not sure how useful it is:

require 'pg'
DBLINK = 'host=something port=something user=something etc.'
number = 6
puts "sending #{number} parallel queries without cancelling them"
threads = []
number.times do |i|
  threads << Thread.new do
    print "<#{i} "            # thread started
    conn = PG.connect(DBLINK)
    print "-#{i}= "           # connected
    conn.send_query 'SELECT 42'
    print "=#{i}- "           # query sent
    conn.close
    print "#{i}> "            # connection closed
  end
end
print "==joining threads== "
threads.each {|t| t.join}
puts "done"

puts "sending #{number} parallel queries which will be cancelled immediately"
threads = []
number.times do |i|
  threads << Thread.new do
    print "<#{i} "            # thread started
    conn = PG.connect(DBLINK)
    print "-#{i}= "           # connected
    conn.send_query 'SELECT 42'
    print "=#{i}= "           # query sent
    conn.cancel
    print "=#{i}- "           # query cancelled
    conn.close
    print "#{i}> "            # connection closed
  end
end
print "==joining threads== "
threads.each {|t| t.join}
puts "done"

The output is:

sending 6 parallel queries without cancelling them
<3 <1 <0 <2 ==joining threads== <4 <5 -0= -2= =0- =2- 0> 2> -4= -1= =4- =1- 4> 1> -5= =5- 5> -3= =3- 3> done
sending 6 parallel queries which will be cancelled immediately
<3 <1 <2 <0 ==joining threads== <4 <5 -0= =0= 

Then, a couple of minutes later, the haproxy times out the connections and closes them, and the output finishes with:

-3= -1= -4= =3= =0- -2= =1= =4= =3- =1- =4- =2= 0> 3> 1> 4> =2- 2>

/Users/zsombor/.rvm/gems/ruby-2.0.0-p247/gems/pg-0.17.0/lib/pg.rb:40:in `initialize': PG::ConnectionBad
from /Users/zsombor/.rvm/gems/ruby-2.0.0-p247/gems/pg-0.17.0/lib/pg.rb:40:in `new'
from /Users/zsombor/.rvm/gems/ruby-2.0.0-p247/gems/pg-0.17.0/lib/pg.rb:40:in `connect'
from cancel_test.rb:29:in `block (2 levels) in <main>'


Additionally, sometimes the whole thing segfaults, which I'm not sure is related, but it prints this:

/Users/zsombor/.rvm/gems/ruby-2.0.0-p247/gems/pg-0.17.0/lib/pg.rb:40: [BUG] Segmentation fault
ruby 2.0.0p247 (2013-06-27 revision 41674) [x86_64-darwin12.4.0]

-- Control frame information -----------------------------------------------
c:0005 p:---- s:0015 e:000014 CFUNC  :initialize
c:0004 p:---- s:0013 e:000012 CFUNC  :new
c:0003 p:0017 s:0009 e:000008 METHOD /Users/zsombor/.rvm/gems/ruby-2.0.0-p247/gems/pg-0.17.0/lib/pg.rb:40
c:0002 p:0035 s:0005 e:000004 BLOCK  cancel_test.rb:12 [FINISH]
c:0001 p:---- s:0002 e:000001 TOP    [FINISH]

cancel_test.rb:12:in `block (2 levels) in <main>'
/Users/zsombor/.rvm/gems/ruby-2.0.0-p247/gems/pg-0.17.0/lib/pg.rb:40:in `connect'
/Users/zsombor/.rvm/gems/ruby-2.0.0-p247/gems/pg-0.17.0/lib/pg.rb:40:in `new'
/Users/zsombor/.rvm/gems/ruby-2.0.0-p247/gems/pg-0.17.0/lib/pg.rb:40:in `initialize'


I would really appreciate any ideas about this, whether there is a bug here or I am doing something wrong.

Thanks,
Zsombor



Lars Kanis

Sep 20, 2013, 4:35:20 PM
to rub...@googlegroups.com, zsom...@gmail.com
Hi Zsombor,

thank you for the detailed report!


On 20.09.2013 17:01, zsom...@gmail.com wrote:
The main thing we want to achieve is to execute several queries in parallel and, if any of them succeeds, cancel the others. However, when I call the connection's cancel method in any thread, it somehow blocks the whole Ruby VM (?) and only continues when the server side closes the connection.

PQcancel is a blocking function, but conn.cancel did not release the GVL. I've fixed this in the Bitbucket repository, so you might retry your issue with that version.
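
For illustration, a quick way to see the blocking from the Ruby side (just a sketch; pg_sleep and your DBLINK are placeholders) is to run a ticker thread next to the cancel; if cancel holds the GVL, the dots stop printing while it blocks:

require 'pg'
ticker = Thread.new { loop { print '.'; sleep 0.1 } }  # should keep ticking
conn = PG.connect(DBLINK)
conn.send_query 'SELECT pg_sleep(60)'
conn.cancel   # with the GVL held, the dots pause until this returns
ticker.kill
conn.close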

To be sure - does PG.threadsafe? return true in your environment? If yes, your code is OK.


Additionally, sometimes the whole thing segfaults, which I'm not sure is related, but it prints this:

/Users/zsombor/.rvm/gems/ruby-2.0.0-p247/gems/pg-0.17.0/lib/pg.rb:40: [BUG] Segmentation fault
ruby 2.0.0p247 (2013-06-27 revision 41674) [x86_64-darwin12.4.0]

I worry more about this one. Can you run your test with valgrind and post the relevant output of the memory access violation here?

Unfortunately I'm currently (and for the next week) too busy to set up a haproxy environment to reproduce the issue. But if it's not solved by then, I'll try it afterwards.

Regards,
Lars

Zsombor Nagy

Sep 21, 2013, 8:03:09 AM
to la...@greiz-reinsdorf.de, rub...@googlegroups.com
Hi!

Thanks for the quick answer :)

On 20.09.2013 17:01, zsom...@gmail.com wrote:
The main thing we want to achieve is to execute several queries in parallel and, if any of them succeeds, cancel the others. However, when I call the connection's cancel method in any thread, it somehow blocks the whole Ruby VM (?) and only continues when the server side closes the connection.

PQcancel is a blocking function, but conn.cancel did not release the GVL. I've fixed this in the Bitbucket repository, so you might retry your issue with that version.

To be sure - does PG.threadsafe? return true in your environment? If yes, your code is OK.

Yes, PG.threadsafe? returns true, so that's OK.
I think I managed to install the latest version from bitbucket, and the problem still exists, but it's a little bit different.
My test code has this print right before the cancel call:

    print "=#{i}= "
    conn.cancel

And in the gem version this happens:

sending 6 parallel queries which will be cancelled immediately
<2 <1 <0 ==joining threads== <4 <5 <3 -0= =0=

In the latest bitbucket version this happens:

sending 6 parallel queries which will be cancelled immediately
<3 <0 <2 <1 ==joining threads== <4 <5 -2= =2= -5= =5= -3= =3= -4= =4= -1= =1= 

So this tells me that your fix is working properly: all threads are able to proceed to the cancel call, but then the cancel itself blocks. This seems to be a different issue, probably related to the haproxy; I'll investigate it more.


Additionally, sometimes the whole thing segfaults, which I'm not sure is related, but it prints this:

/Users/zsombor/.rvm/gems/ruby-2.0.0-p247/gems/pg-0.17.0/lib/pg.rb:40: [BUG] Segmentation fault
ruby 2.0.0p247 (2013-06-27 revision 41674) [x86_64-darwin12.4.0]

I worry more about this one. Can you run your test with valgrind and post the relevant output of the memory access violation here?

I didn't manage to use valgrind, because it never crashes when running under valgrind… And valgrind itself complains that "Support on MacOS 10.8 is experimental and mostly broken."

I only have this backtrace for you:

Thread 4 Crashed:
0   libsystem_kernel.dylib         0x00007fff841d2212 __pthread_kill + 10
1   libsystem_c.dylib             0x00007fff87494b24 pthread_kill + 90
2   libsystem_c.dylib             0x00007fff874d8f61 abort + 143
3   libruby.2.0.0.dylib           0x000000010c641768 rb_bug + 184
4   libruby.2.0.0.dylib           0x000000010c6f8470 sigsegv + 144 (signal.c:616)
5   libsystem_c.dylib             0x00007fff8748190a _sigtramp + 26
6   libcrypto.0.9.8.dylib         0x00007fff883d6b9e ASN1_STRING_free + 14
7   libcrypto.0.9.8.dylib         0x00007fff883d92a5 ASN1_primitive_free + 85
8   libcrypto.0.9.8.dylib         0x00007fff883d923d ASN1_template_free + 141
9   libcrypto.0.9.8.dylib         0x00007fff883d911b asn1_item_combine_free + 491
10  libcrypto.0.9.8.dylib         0x00007fff883d8f27 ASN1_item_free + 23
11  libssl.0.9.8.dylib             0x00007fff81714bf9 SSL_CTX_use_certificate_chain_file + 457
12  libpq.dylib                   0x000000010ca9bc12 pqsecure_open_client + 578
13  libpq.dylib                   0x000000010ca8d7e4 PQconnectPoll + 1060
14  libpq.dylib                   0x000000010ca8e867 connectDBComplete + 135
15  libpq.dylib                   0x000000010ca8f377 PQconnectdb + 39
16  pg_ext.bundle                 0x000000010ca68811 gvl_PQconnectdb_skeleton + 17 (gvl_wrappers.c:9)
17  libruby.2.0.0.dylib           0x000000010c773677 rb_thread_call_without_gvl + 87 (thread.c:1217)
18  pg_ext.bundle                 0x000000010ca687ed gvl_PQconnectdb + 45 (gvl_wrappers.c:10)
19  pg_ext.bundle                 0x000000010ca6b989 pgconn_init + 105 (pg_connection.c:189)
20  libruby.2.0.0.dylib           0x000000010c7689b1 vm_call0_body + 2033 (.vm_eval.c:117)
21  libruby.2.0.0.dylib           0x000000010c767eb0 rb_call0 + 416 (.vm_eval.c:325)
22  libruby.2.0.0.dylib           0x000000010c693819 rb_class_new_instance + 41 (object.c:1763)
23  libruby.2.0.0.dylib           0x000000010c76b7e1 vm_call_cfunc + 1057 (.vm_insnhelper.c:1469)
24  libruby.2.0.0.dylib           0x000000010c7559f5 vm_exec_core + 10885 (insns.def:1002)
25  libruby.2.0.0.dylib           0x000000010c7622b5 vm_exec + 133 (vm.c:1201)
26  libruby.2.0.0.dylib           0x000000010c766423 invoke_block_from_c + 1171 (vm.c:653)
27  libruby.2.0.0.dylib           0x000000010c7615bb vm_invoke_proc + 171 (vm.c:699)
28  libruby.2.0.0.dylib           0x000000010c761500 rb_vm_invoke_proc + 32 (vm.c:715)
29  libruby.2.0.0.dylib           0x000000010c779eca thread_start_func_2 + 522 (thread.c:512)
30  libruby.2.0.0.dylib           0x000000010c779ca5 thread_start_func_1 + 149 (.thread_pthread.c:765)
31  libsystem_c.dylib             0x00007fff87493772 _pthread_start + 327
32  libsystem_c.dylib             0x00007fff874801a1 thread_start + 13

What else could I try?
This crash actually never happens when I run the test code within "rails console" (and it never happens in the actual Rails application either); it only happens when I run it separately with something like "ruby cancel_test.rb". I was assuming it might be some kind of premature garbage collection, but I cannot see how that could be the case here.
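
One thing I might still try for the GC theory (just a debugging idea, and it makes everything very slow): forcing the collector to run as often as possible at the top of the test script, so a premature collection should reproduce much more reliably:

GC.stress = true  # force GC at every opportunity; crashes caused by premature collection usually show up quickly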


Thanks,
Zsombor

Zsombor Nagy

Sep 21, 2013, 8:18:23 AM
to la...@greiz-reinsdorf.de, rub...@googlegroups.com
Hi!

One more quick question!

On 2013.09.21., at 14:03, Zsombor Nagy <zsom...@gmail.com> wrote:

So this tells me that your fix is working properly: all threads are able to proceed to the cancel call, but then the cancel itself blocks. This seems to be a different issue, probably related to the haproxy; I'll investigate it more.

Is it possible that connection.cancel actually opens a new connection to the postgres server?

In my case we have a haproxy with a limit of 5 connections, and what I do with all of them is:
- open connection
- send query
- cancel query
- close connection

I can see the 5 connections going through the haproxy, reaching the postgres server.
Then all 5 of them should be cancelled; at this moment I can see 5 new connections arriving at the haproxy and getting queued, not getting through to the postgres server, because the limit is already reached.
At this point the cancel commands are just waiting, and the 5 original connections are still open, because we never get to the "close connection" part.
This would explain the situation, if a new connection is really needed to cancel a query.

Thanks,
Zsombor

Zsombor Nagy

Sep 21, 2013, 8:49:01 AM
to la...@greiz-reinsdorf.de, rub...@googlegroups.com
Hi again! :)

On 2013.09.20., at 22:35, Lars Kanis <la...@greiz-reinsdorf.de> wrote:

I worry more about this one. Can you run your test with valgrind and post the relevant output of the memory access violation here?

I managed to reproduce the segfault by connecting directly to the postgres server (without the haproxy).
I am connecting with sslmode=require and certificates. The test script that crashes sometimes (not always) is this:

require 'pg'
DBLINK = 'host=something sslmode=require etc.'  # placeholder, as before

3.times.map do |i|
  Thread.new do
    print "<#{i} "
    conn = PG.connect(DBLINK)
    print "-#{i} "
    conn.send_query 'SELECT 42'
    print "=#{i} "
    conn.close
    print ">#{i} "
  end
end.each { |t| t.join }


I didn't manage to reproduce the segfault with a non-SSL server, so I guess it's SSL related.
I saw the part in the Mac OS README about segfaults, but I don't think it applies if I just installed the pg gem with Bundler, or does it?


Cheers,
Zsombor





Lars Kanis

Sep 22, 2013, 3:17:00 PM
to Zsombor Nagy, rub...@googlegroups.com
On 21.09.2013 14:18, Zsombor Nagy wrote:
Is it possible that connection.cancel actually opens a new connection to the postgres server?

Yes, that's true. #cancel opens a second connection to the server and requests the stop by sending the PID and the backend key: [1]
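
If that second connection is the problem behind your proxy, one possible workaround (just a sketch, not tested through haproxy; it needs a connection that reaches the same server, and before PostgreSQL 9.2 pg_cancel_backend requires superuser rights) is to issue the cancel server-side instead:

pid = conn.backend_pid        # PID of the backend running the query
admin = PG.connect(DBLINK)    # must reach the same postgres server
admin.exec('SELECT pg_cancel_backend($1)', [pid])
admin.close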

Regards,
Lars

[1] http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/interfaces/libpq/fe-connect.c;h=18fcb0c237;hb=HEAD#l3135

Lars Kanis

Sep 22, 2013, 4:07:40 PM
to rub...@googlegroups.com, Zsombor Nagy
On 21.09.2013 14:49, Zsombor Nagy wrote:
I managed to reproduce the segfault by connecting directly to the postgres server (without the haproxy).
I am connecting with sslmode=require and certificates. The test script that crashes sometimes (not always) is this:

Shit - I can reproduce it!

It only happens when connecting with client-side certificates. I get the same stack trace too, but also some other ways to crash. Started under valgrind I always get this one, even with only one thread:

==9983== Conditional jump or move depends on uninitialised value(s)
==9983==    at 0x7DF664B: ASN1_STRING_set (in /lib/x86_64-linux-gnu/libcrypto.so.1.0.0)
==9983==    by 0x7DE434C: ASN1_mbstring_ncopy (in /lib/x86_64-linux-gnu/libcrypto.so.1.0.0)
==9983==    by 0x7DE4573: ASN1_mbstring_copy (in /lib/x86_64-linux-gnu/libcrypto.so.1.0.0)
==9983==    by 0x7DE5424: ASN1_STRING_to_UTF8 (in /lib/x86_64-linux-gnu/libcrypto.so.1.0.0)
==9983==    by 0x7DE6872: ??? (in /lib/x86_64-linux-gnu/libcrypto.so.1.0.0)
==9983==    by 0x7DE6DD6: ??? (in /lib/x86_64-linux-gnu/libcrypto.so.1.0.0)
==9983==    by 0x7DED304: ASN1_item_ex_d2i (in /lib/x86_64-linux-gnu/libcrypto.so.1.0.0)
==9983==    by 0x7DEDEAF: ??? (in /lib/x86_64-linux-gnu/libcrypto.so.1.0.0)
==9983==    by 0x7DEE0F7: ??? (in /lib/x86_64-linux-gnu/libcrypto.so.1.0.0)
==9983==    by 0x7DED970: ASN1_item_ex_d2i (in /lib/x86_64-linux-gnu/libcrypto.so.1.0.0)
==9983==    by 0x7DEDEAF: ??? (in /lib/x86_64-linux-gnu/libcrypto.so.1.0.0)
==9983==    by 0x7DEE0F7: ??? (in /lib/x86_64-linux-gnu/libcrypto.so.1.0.0)

OpenSSL is a reasonably complex beast, but I'll try to have a look at it next week.

--
Regards,
Lars

Zsombor Nagy

Sep 22, 2013, 4:35:23 PM
to la...@greiz-reinsdorf.de, rub...@googlegroups.com


> On 2013.09.22., at 21:17, Lars Kanis <la...@greiz-reinsdorf.de> wrote:
>
> Yes, that's true. #cancel opens a second connection to the server and requests the stop by sending the PID and the backend key

Oh, all right, kind of surprising! Thanks for the info!
I guess that also means the haproxy will somehow have to be configured so that the cancelling connection goes to the same server the query is running on...

Thanks!

Zs

Zsombor Nagy

Sep 22, 2013, 4:36:25 PM
to la...@greiz-reinsdorf.de, rub...@googlegroups.com


> On 2013.09.22., at 22:07, Lars Kanis <la...@greiz-reinsdorf.de> wrote:
>
> OpenSSL is a reasonably complex beast, but I'll try to have a look at it next week.

Great, thanks very much!

Cheers,
Zsombor