Transient (but very often) eaddrinuse on startup after network config changes (Solaris)

Michael Verrilli

Aug 5, 2015, 10:20:47 PM
to rabbitmq-users
I was hoping an upgrade would fix this issue, but it hasn't. 

I made the following changes to the network listener settings for RabbitMQ. I don't know if this has anything to do with it. You'll see the changes in the error message below.

   {could_not_start,rabbit,
       {{case_clause,
            {error,
                {{shutdown,
                     {failed_to_start_child,tcp_listener,
                         {cannot_listen,{0,0,0,0},5672,eaddrinuse}}},
                 {child,undefined,'rabbit_tcp_listener_sup_0.0.0.0:5672',
                     {tcp_listener_sup,start_link,
                         [{0,0,0,0},
                          5672,
                          [inet,binary,
                           {backlog,10240},
                           {keepalive,true},
                           {nodelay,true},
                           {sndbuf,122880},
                           {recbuf,122880}],
                          {rabbit_networking,tcp_listener_started,[amqp]},
                          {rabbit_networking,tcp_listener_stopped,[amqp]},
                          {rabbit_networking,start_client,[]},
                          "TCP Listener"]},
                     transient,infinity,supervisor,
                     [tcp_listener_sup]}}}},
        [{rabbit_networking,start_listener0,4,[]},
         {rabbit_networking,'-start_listener/4-lc$^0/1-0-',4,[]},
         {rabbit_networking,start_listener,4,[]},
         {rabbit_networking,'-boot_tcp/0-lc$^0/1-0-',1,[]},
         {rabbit_networking,boot_tcp,0,[]},
         {rabbit_networking,boot,0,[]},
         {rabbit,'-run_step/2-lc$^1/1-1-',1,[]},
         {rabbit,run_step,2,[]}]}}


I have also verified that there are definitely no listeners on the port before startup. I've run lsof repeatedly during the startup process, and I do see a listener show up before RabbitMQ dies. I also see some CLOSE_WAIT sockets from clients trying to connect.

Could any of these settings cause this problem?  The only other change I've made is raising the RABBITMQ_IO_THREAD_POOL_SIZE to 64.

Thanks,

Mike


Michael Klishin

Aug 6, 2015, 6:01:55 AM
to rabbitm...@googlegroups.com, Michael Verrilli
On 6 August 2015 at 05:20:51, Michael Verrilli (mver...@gmail.com) wrote:
> {cannot_listen,{0,0,0,0},5672,eaddrinuse}}},

Another process is already listening on port 5672. The old instance, perhaps?
--
MK

Staff Software Engineer, Pivotal/RabbitMQ


Michael Verrilli

Aug 6, 2015, 9:00:05 AM
to rabbitmq-users, mver...@gmail.com
It isn't though. I've checked.  That's what I would have expected as well.  

lsof shows only CLOSE_WAIT states from workers trying to connect, but no LISTEN.  If I repeatedly run lsof, I see the rabbitmq process come up and listen on that port, then die.  (As in, there are no listeners before I start it, during the start of the launch, then it pops up to show the listener during the launch, then fails).  

This happens even when I run the service using the rabbitmq-server by hand (without service wrapper scripts).  It's very unusual.

This is also happening in every environment. Started happening after I made those network setting changes. Also seems to happen much more often if I have a lot of workers trying to hit that port. So prod ends up having this problem every time it needs to restart until I manually turn off all workers.

Michael Klishin

Aug 6, 2015, 9:06:10 AM
to rabbitm...@googlegroups.com, Michael Verrilli
On 6 August 2015 at 16:00:07, Michael Verrilli (mver...@gmail.com) wrote:
> It isn't though. I've checked. That's what I would have expected
> as well.
>
> lsof shows only CLOSE_WAIT states from workers trying to connect,
> but no LISTEN. If I repeatedly run lsof, I see the rabbitmq process
> come up and listen on that port, then die. (As in, there are no listeners
> before I start it, during the start of the launch, then it pops
> up to show the listener during the launch, then fails).
>
> This happens even when I run the service using the rabbitmq-server
> by hand (without service wrapper scripts). It's very unusual.
>
> This is also happening in every environment. Started happening
> after I made those network setting changes. Also seems to happen
> much more often if I have a lot of workers trying to hit that port.
> So prod ends up having this problem every time it needs to restart
> until I manually turn off all workers.

eaddrinuse is fairly unambiguous.

If you observe nodes binding and then dying, there should
be something else in the logs.

There is a socket setting that can disable socket reuse:
http://www.rabbitmq.com/networking.html

but nothing has changed in that area in years.

Michael Verrilli

Aug 6, 2015, 10:44:04 AM
to Michael Klishin, rabbitm...@googlegroups.com
I agree, which is why I'm here. :-)  It really doesn't make any sense. 

Here is the log.  I only removed the queue warnings and identifying info (the cookie hash I end up changing every so often, this is my dev env). 


Here is my lsof taken just before running the service: 


If I turn off most of the workers trying to connect to RabbitMQ, then it starts up fine.  Here is an lsof list once it does load successfully: 



As I mentioned, this started happening after I tuned the listener settings. I'll try disabling socket reuse, and if that doesn't work, I'll try undoing each of the changes I made, one at a time, to see which one might cause this issue. (I still don't know that this is the trigger.) The other thing that is different is that there are more connections now, plus a lot more queues (3000).

Jean-Sébastien Pédron

Aug 6, 2015, 11:25:41 AM
to rabbitm...@googlegroups.com
On 06.08.2015 04:20, Michael Verrilli wrote:
> I was hoping an upgrade would fix this issue, but it hasn't.
> [inet,binary,
> {backlog,10240},
> {keepalive,true},
> {nodelay,true},
> {sndbuf,122880},
> {recbuf,122880}],

The option {reuseaddr, true} is missing from those options. It's enabled
by default so I assume you override tcp_listen_options.

In TCP, a port is still considered in use for some time *after* the
application closed the socket. This is a safety measure so that
in-flight packets are not accidentally delivered to a new application
instance which reopened this port.

The reuseaddr option is there to ignore this mechanism: it tells the
kernel "please let me open this port, I don't care about in-flight
packets". And I believe it's safe to use this option because those
packets will probably be rejected by the kernel anyway because they
won't match the state of the new socket.
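
For illustration, here is a minimal sketch of what such an override might look like in the classic Erlang-term rabbitmq.config once reuseaddr is restated. The numeric values are copied from the options in the error report; the file layout and the exact list being overridden are assumptions, since the original config was not posted. inet is left out on the assumption that the broker adds the address family itself.

```erlang
%% rabbitmq.config (classic Erlang-term format). Illustrative sketch only;
%% the poster's actual file was not shown in the thread.
[
  {rabbit, [
    {tcp_listen_options, [
      binary,
      %% Absent from the options in the error report. A value set here
      %% replaces the whole default list, so reuseaddr has to be listed again.
      {reuseaddr, true},
      {backlog,   10240},
      {keepalive, true},
      {nodelay,   true},
      {sndbuf,    122880},
      {recbuf,    122880}
    ]}
  ]}
].
```

The node has to be restarted for a change to this file to take effect.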

--
Jean-Sébastien Pédron
Pivotal / RabbitMQ
Message has been deleted

Michael Verrilli

Aug 6, 2015, 11:55:03 AM
to rabbitmq-users, jean-se...@rabbitmq.com
I deleted my previous post because I had confused which setting exactly worked.

Setting reuseaddr explicitly to true fixed this apparently. I was able to restart rabbitmq multiple times while all the workers were running without issue. 

Setting it to false exhibited the error. 

So it sounds like if tcp_listen_options is set, but reuseaddr is not set, then reuseaddr = false. 
Is it possible that if tcp_listen_options is not set at all, then reuseaddr = true? 

I ask because I never had this problem until I added the tcp_listen_options. 

Thanks guys for all the help! Very happy to get this resolved. The issue is transient, but I'll post back later to confirm this is fixed.

Jean-Sébastien Pédron

Aug 6, 2015, 12:03:01 PM
to rabbitm...@googlegroups.com
On 06.08.2015 17:55, Michael Verrilli wrote:
> Setting reuseaddr explicitly to true fixed this apparently. I was able
> to restart rabbitmq multiple times while all the workers were running
> without issue.

Cool!

> So it sounds like if tcp_listen_options is set, but reuseaddr is not
> set, then reuseaddr = false.

Yes, it is disabled by default in Erlang. It's probably the default
setting in the OS as well.

> Is it possible that if tcp_listen_options is not set at all, then
> reuseaddr = true?

Yes, RabbitMQ sets it explicitly to true in the default value of
"tcp_listen_options". See rabbit.app (this is a data file, not a
configuration file; keep it unmodified).
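
One way to see which options a node actually ended up with is to read the application environment. This is a small sketch, assuming the node is running and the expression is evaluated on it (for example through rabbitmqctl eval):

```erlang
%% Evaluate on the running broker node.
%% Returns {ok, Options} when the key is set, where Options is the effective list
%% (the default from rabbit.app, or the override from the config file).
application:get_env(rabbit, tcp_listen_options).
```

If reuseaddr is not in the returned list, the listener is being opened without it, which matches the behaviour described in this thread.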

> Thanks guys for all the help! Very happy to get this resolved. The issue
> is transient, but I'll post back later to confirm this is fixed.

You're welcome!

Michael Verrilli

Aug 6, 2015, 4:51:41 PM
to rabbitmq-users, jean-se...@rabbitmq.com
This issue is definitely resolved. Thanks again for all the help. 

Also, I kind of think that tcp_listen_options should assume the os setting if nothing is provided just as if you didn't set it. It seems more intuitive that way.  I don't think I get a vote, though. :-)

Michael Klishin

Aug 6, 2015, 6:04:03 PM
to rabbitm...@googlegroups.com, Michael Verrilli, jean-se...@rabbitmq.com
On 6 Aug 2015 at 23:51:45, Michael Verrilli (mver...@gmail.com) wrote:
> This issue is definitely resolved. Thanks again for all the
> help.
>
> Also, I kind of think that tcp_listen_options should assume
> the os setting if nothing is provided just as if you didn't set
> it. It seems more intuitive that way. I don't think I get a vote,
> though. :-)

Michael,

I generally agree but note it can be really tricky to detect what OS settings are.
We already are in that hell with a few other things, e.g. free disk space monitoring.

What we should do is cover the option better in our Networking guide. And since our
repo is open source [1], you can even help us write and edit it, if you feel like it :)

1. http://github.com/rabbitmq/rabbitmq-website 