[rabbitmq-discuss] Rabbitmq 2.5-2.6.1 java hanging on close of connection

Benjamin Bennett

unread,

Nov 2, 2011, 12:07:30 PM11/2/11

to Rabbit-Mq Discuss-Mailing List

I have am trying to figure out what is going on and I have tried many
different combinations.
The stack trace is the same as mentioned in this discussion
http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2011-August/014790.html
.
I am running unit level test cases for the project I am working on and
the test cases just hang unless I kill the build.
For the windows 7 32 bit fdcc 3.1 (US Gov standard load) load
running in vsphere 4.1.X , it hangs 90% of the time .
For my Ubuntu 64-bit 11.04 box.(on bare metal no vmware) it rarely
hangs but I just kicked of the build and it hung all night and I had
to kill it.
I have tried switching the jre(1.6.u26 , 1.6u29) , rabbitmq java
client (2.3.1,2.5.0,2.6.1), rabbitmq(2.5.0,2.5.1,2.6.0,2.6.1), erlang
R14B03 , R14B04.

I mixed and matched all the versions pretty much and the same thing
happens almost every single time on the windows 7 32 box

I used the Tracer tool and the tests cases don't hang but the
following exception is sometimes thrown.

1320191848535: <Tracer-3> ch#2 <- {#method<channel.close-ok>(),null,""}
1320191848535: <Tracer-3> ch#0 ->
{#method<connection.close>(reply-code=200, reply-text=OK, class-id=0,
method-id=0),null,""}
1320191848535: <Tracer-3> ch#0 <- {#method<connection.close-ok>(),null,""}
1320191848535: <Tracer-3> uncaught java.io.EOFException
at java.io.DataInputStream.readUnsignedByte(Unknown Source)
at com.rabbitmq.client.impl.Frame.readFrom(Frame.java:99)
at com.rabbitmq.tools.Tracer$DirectionHandler.readFrame(Tracer.java:271)
at com.rabbitmq.tools.Tracer$DirectionHandler.doFrame(Tracer.java:305)
at com.rabbitmq.tools.Tracer$DirectionHandler.run(Tracer.java:345)
at java.lang.Thread.run(Unknown Source)

The entire tracer log is here http://pastebin.com/ecp9ymsH .

I am pretty sure this is causing the hang but how can catch this
exception on the connection close method and not deadlock.

Thanks,
Benjamin Bennett
_______________________________________________
rabbitmq-discuss mailing list
rabbitmq...@lists.rabbitmq.com
https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss

Steve Powell

unread,

Nov 3, 2011, 11:47:34 AM11/3/11

to Rabbit-Mq Discuss-Mailing List

Benjamin,

Thanks for your detailed trace log. It does look as though we are

getting EOF from the socket input stream and not handling it well.

Of course, Tracer may be just not coping with it, but it is

interesting that the mainline (doFrame()) code is the one that

fails internally. Certainly, an EOFException ought to be caught

there and treated as though there is nothing to see yet.

Your log will help in trying to reproduce this in our testing,

I'm going to raise a bug for this problem, and get back to you

when we find the solution.

Steve Powell (a happy bunny)

----------some definitions from the SPD----------

Rigatoni n. A prime-ministerial grin.

Nigella n. The next bottle-size up from a Nebuchadnezzar.

Homily adv. Rather like a frenchman.

On 2 Nov 2011, at 16:07, Benjamin Bennett wrote:

http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2011-November/015968.html

Benjamin Bennett

unread,

Nov 3, 2011, 5:40:46 PM11/3/11

to Rabbit-Mq Discuss-Mailing List

Thanks for the help. My gut reaction there is some sort of latency issue.
I could only reproduced the problem 1/30 times on my linux box , and
on my windows 7 64-bit it happened maybe 1/20 times.
When I run it on our vm build farm the test cases hang 9/10 times.
The hang is always happening when trying to close a channel.
The tracer log was produced when logging into the vm and running the
build manually.

Do you think a good work around would be to put a connection timeout?
I was thinking about putting it in but I didn't exactly what that did,
don't want to risk breaking other code.

Thanks,
Benjamin Bennett

Steve Powell

unread,

Nov 11, 2011, 8:05:55 AM11/11/11

to Rabbit-Mq Discuss-Mailing List

Hi Benjamin,

After investigation of our code, I'm now no longer convinced that
your hangs occur because we haven't trapped the EOFException
correctly. We ought to be handling it properly.

I cannot reproduce this problem here, yet.

Could you please do some more investigation for us? What would
be helpful is for you to take a java stack dump (use jstack) when
the hang occurs.

jstack -l <pid>

ought to be sufficient, with <pid> the process number of the client JVM.

That will help us to find out what is hung where.
(It would be most helpful if you could do this on the
latest release -- RabbitMQ 2.7.0, but an older one is fine.)

Steve Powell
st...@rabbitmq.com
[wrk: +44-2380-111-528] [mob: +44-7815-838-558]

On 3 Nov 2011, at 21:40, Benjamin Bennett wrote:
> http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2011-November/015980.html

Benjamin Bennett

unread,

Nov 13, 2011, 3:50:20 PM11/13/11

to Steve Powell, Rabbit-Mq Discuss-Mailing List

Here is the stacktrace from the running test cases. I just ran the
test cases 4 times without a hang on the same vm , then last 3 times
it hangs.

The tests hang for 20 minutes then the junit run kills the jvm for
timeout exception.

In the rabbitmq log it is just says the connection abruptly close at
the same time of the junit runner killed the unit tests.

I was going to attempt to put a timeout on the connection close method
but that really would be a hack.

The stacktrace is exactly the same as this one
http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2011-August/014789.html
.
http://pastebin.com/u1Z0Qyrr

Also before the test cases are run I reinstall rabbitmq , delete all
vhosts and recreate them . So there should not be any memory issues .

The only thread that is in the RUNNABLE state is at the beginning of
the the stacktrace. I realize this is causing the hang but I don't
understand why no reply is getting through from rabbit.

at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)

Searching the interwebs points to the remote server not responding.
http://javaeesupportpatterns.blogspot.com/2011/04/javanetsocketinputstreamsocketread0.html

I can hook up wireshark next to try to diagnose the problem , but I
thought tracer class would be sufficient.

Benjamin Bennett

unread,

Nov 13, 2011, 3:54:11 PM11/13/11

to Steve Powell, Rabbit-Mq Discuss-Mailing List

Is there any setting to put rabbitmq into debug logging state?

Matthias Radestock

unread,

Nov 13, 2011, 4:04:35 PM11/13/11

to Benjamin Bennett, Rabbit-Mq Discuss-Mailing List

Benjamin,

On 13/11/11 20:50, Benjamin Bennett wrote:
> Here is the stacktrace from the running test cases. I just ran the
> test cases 4 times without a hang on the same vm , then last 3 times
> it hangs.
>
> The tests hang for 20 minutes

When that happens, please run 'rabbitmqctl report' and send us the output.

> I can hook up wireshark next to try to diagnose the problem , but I
> thought tracer class would be sufficient.

Since you were unable to reproduce the problem with the tracer,
wireshark is the next logical step.

Regards,

Matthias.

Benjamin Bennett

unread,

Nov 13, 2011, 7:24:55 PM11/13/11

to Matthias Radestock, Rabbit-Mq Discuss-Mailing List

I was able to with tracer see beginning of thread, tracer threw an exception

Matthias Radestock

unread,

Nov 14, 2011, 3:10:06 AM11/14/11

to Benjamin Bennett, Rabbit-Mq Discuss-Mailing List

Benjamin,

On 14/11/11 00:24, Benjamin Bennett wrote:
> I was able to with tracer see beginning of thread, tracer threw an exception

That's not unusual. Neither the client or the tracer got stuck though,
which is the problem we are trying to investigate here.

Benjamin Bennett

unread,

Nov 14, 2011, 11:15:51 AM11/14/11

to Matthias Radestock, Rabbit-Mq Discuss-Mailing List

I apologize if I am sounding rude, not trying. Just trying to find
out what is going on.
Here is report from rabbitmqctrl report
http://pastebin.com/MSwv82C3

The same stacktrace is happening onIy difference is different test
case is hanging.

I ran the same test cases for the exact same git revision on all are
build agents and the ran without issues , everything passed.

It is crap shoot on whether it is happens.

It is an SDC machine so it has all kinds of extra security GPO and
services . I am going to attempt to shut down all other services that
I can. And see if it has an benefit.

Simon MacMullen

unread,

Nov 14, 2011, 1:32:02 PM11/14/11

to rabbitmq...@lists.rabbitmq.com

On 14/11/11 16:15, Benjamin Bennett wrote:
> Here is report from rabbitmqctrl report
> http://pastebin.com/MSwv82C3

Ah, thank you. After some poking, that genuinely looks like a server
bug. Damn.

In order for it to happen you need the last channel close / close_ok to
overlap with the connection close / close_ok. With the Java client you
have to invoke Channel.close() and Connection.close() from different
threads to get this to happen, and still be unlucky.

You should be allowed to do this, but right now it's racy.

> I was going to attempt to put a timeout on the connection close method
> but that really would be a hack.

Indeed! Other slightly less hacky workarounds until we get this fixed:

* Invoke Channel.close() and Connection.close() from the same thread, or
otherwise ensure they don't overlap.

* Don't invoke Channel.close() if you know you're going to invoke
Connection.close() anyway.

Cheers, Simon

--
Simon MacMullen
RabbitMQ, VMware

Benjamin Bennett

unread,

Nov 14, 2011, 6:21:01 PM11/14/11

to Simon MacMullen, rabbitmq...@lists.rabbitmq.com

I am using the spring amqp lib and it is doing the connection closing when the spring context is closed. I do not think it has a property to inject the hack. Also if you know any of the spring amqp devs. Having you telling them to check to make sure it is doing the way you have described will have much more authority than me.

I will probably hack the spring amqp lib for now

rabbitmq-discuss@lists.rabbitmq.com
https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss

Benjamin Bennett

unread,

Nov 17, 2011, 1:12:00 PM11/17/11

to Simon MacMullen, rabbitmq...@lists.rabbitmq.com

Had a question if no one is working on patch for the java client do
accept external patches?

I was going to either change it to the following .

1) Remove the infinite wait on a connection close.

Or

Place a BlockingQueue on the channel closes during the close call.

The sychronize on the BlockingQueue for the connection close , of
which it cannot close the connection if a channel is currently be
closed.

The second is more code but it would keep the infinity timeout in
place will working around the deadlock issue.

It would save a lot of pain if people are using 2.7.1 and below.

>> rabbitmq...@lists.rabbitmq.com
>> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss

Simon MacMullen

unread,

Nov 21, 2011, 8:07:10 AM11/21/11

to Benjamin Bennett, rabbitmq...@lists.rabbitmq.com, David Syer

Hi Benjamin.

In general we do accept patches, although due to our Corporate Overlords
you would have to sign a contributor agreement.

I would be reluctant to merge such a patch though, since:

* It's to work around a server bug which will be fixed in the next release

* This bug has been around for most of a year without attracting much
attention

* Option 1) is ugly; option 2) is (somewhat) complicated.

BTW, I got a reply from the spring-amqp maintainer, Dave Syer:

> Spring AMQP doesn't explicitly invoke close() in different threads.
> There's nothing to stop it happening (as is the case with the Java
> client itself I suppose), but we actually hardly ever call
> Channel.close() so it is pretty unlikely. I would be interested to
> hear of a way to tickle a normal app into this behaviour.

So I still wonder if the threading thing is something you are doing in
your app.

Cheers, Simon

Benjamin Bennett

unread,

Nov 21, 2011, 11:10:56 AM11/21/11

to Simon MacMullen, rabbitmq...@lists.rabbitmq.com, David Syer

Our application is using the CachingConnectionFactory, Spring
Integration AMQP .
The way this is happening is just by applying the @DirtiesContext in
our unit test , that we need to reload the context, at that time the
channel close and connection close are closed in different threads.
The BlockingQueueConsumer is the one that end ups doing the channel
closes but the CachingConnectionFactory actually does the closes .

The stacktraces have no other threads running except spring-amqp
threads BlockingQueueConsumer and CachingConnectionFactory.

My only comment about attracting attention it is a race condition ,
and race conditions are the most painful bugs to have to deal with.

On my dev box I have only seen the issue one time in 200 runs that ran
over night.
On our test windows 7 vms it happens 50% of the time, sample size was 40 runs.
On our windows xp test vms it has never happened.

In our production we get a "The service failed to close in timely
manner" and doesn't shut down.

Finally if you are fixing it in the server , what about the previous
versions? I don't even think spring-amqp has migrated to 2.7 series
java client because it is backwards incompatible.

Thanks,
Benjamin Bennett

Simon MacMullen

unread,

Nov 28, 2011, 11:05:59 AM11/28/11

to Benjamin Bennett, rabbitmq...@lists.rabbitmq.com, David Syer

On 21/11/11 16:10, Benjamin Bennett wrote:
> Our application is using the CachingConnectionFactory, Spring
> Integration AMQP .
> The way this is happening is just by applying the @DirtiesContext in
> our unit test , that we need to reload the context, at that time the
> channel close and connection close are closed in different threads.
> The BlockingQueueConsumer is the one that end ups doing the channel
> closes but the CachingConnectionFactory actually does the closes .
>
>
> The stacktraces have no other threads running except spring-amqp
> threads BlockingQueueConsumer and CachingConnectionFactory.

I'm afraid I can't really comment on spring-amqp - I don't know if Dave can?

But for the purpose of your tests maybe you could work around this by
not closing the channels at all?

> My only comment about attracting attention it is a race condition ,
> and race conditions are the most painful bugs to have to deal with.

Oh, absolutely.

> On my dev box I have only seen the issue one time in 200 runs that ran
> over night.
> On our test windows 7 vms it happens 50% of the time, sample size was 40 runs.
> On our windows xp test vms it has never happened.
>
> In our production we get a "The service failed to close in timely
> manner" and doesn't shut down.
>
> Finally if you are fixing it in the server , what about the previous
> versions? I don't even think spring-amqp has migrated to 2.7 series
> java client because it is backwards incompatible.

You don't need to use the same versions of client and server - just
upgrading the server should be fine.

Benjamin Bennett

unread,

Nov 28, 2011, 11:18:04 AM11/28/11

to Simon MacMullen, David Syer, Rabbit-Mq Discuss-Mailing List

" But for the purpose of your tests maybe you could work around this by not
closing the channels at all?"

Our code never actually closes any channels the spring-amqp lib closes
the channels when the spring context is destroyed.

I have a hack for now and just added a timeout to the spring-amqp lib
when it closes the connection.

Thanks for all your help. I will be looking for the release that fixes
this issue.

Benjamin Bennett

unread,

Mar 7, 2012, 9:41:51 AM3/7/12

to Simon MacMullen, David Syer, Rabbit-Mq Discuss-Mailing List

Just wondering if this has been fixed in the latest releases?

--
Sincerely,

Benjamin Bennett
314.246.0645
benbe...@gmail.com

"For a successful technology, reality must take precedence over public
relations, for Nature cannot be fooled."
Richard Feynman

Simon MacMullen

unread,

Mar 7, 2012, 10:11:45 AM3/7/12

to Benjamin Bennett, Rabbit-Mq Discuss-Mailing List, David Syer

On 07/03/12 14:41, Benjamin Bennett wrote:
> Just wondering if this has been fixed in the latest releases?

Yes, it was fixed in 2.7.1.

Cheers, Simon

--
Simon MacMullen
RabbitMQ, VMware

Sankalp Acharya

unread,

Jan 11, 2013, 4:58:31 PM1/11/13

to rabbitmq...@googlegroups.com, Benjamin Bennett, Rabbit-Mq Discuss-Mailing List, David Syer, si...@rabbitmq.com

Hi,

We are seeing this same exact behavior, but on a newer version of RabbitMQ (2.8.2).

What's strange is that the problem had not been occurring until recently. We are also using Spring AMQP (1.1.3), and we have virtually the same use case (running from unit tests, the test tries to wind down, and on disposing the Spring context, it tries to close the CachingConnectionFactory, upon which we are trying to close the connection to RabbitMQ, etc.)

Is there something else we are missing, that we need to do (either in the configuration of our CachingConnectionFactory, or in the rest of our Spring wireup) in order to not run into this issue?

Regards,

Sankalp Acharya

WhaleShark Media

Sankalp Acharya

unread,

Jan 15, 2013, 3:55:06 PM1/15/13

to rabbitmq...@googlegroups.com, Benjamin Bennett, Rabbit-Mq Discuss-Mailing List, David Syer, si...@rabbitmq.com

Incidentally, we figured out what the problem is.

It seems that when the watermark checking was added in RabbitMQ 2.8.1, there was a feature that was added that basically blocks all currently-open connections until the watermark thresholds are cleared. What was happening on our build servers is that the test instance of RabbitMQ was just barely crossing the disk watermark threshold to where the MQ server started blocking connections. This would happen intermittently, depending on which side of the watermark the free disk space was at.

It seems that the connection blocking, in turn, also blocks close requests from being ack'd! This certainly seems like a bug to me – I'm not sure why the server should deny already-connected clients the ability to close their connections.

At any rate, I wanted to post to this thread in case anyone else ran into this error.