Trying to debug a TimeoutError: [Errno 110] Connection timed out


jens.t...@gmail.com

May 16, 2018, 9:27:01 PM
to rabbitmq-users
Hi,

I'm running RabbitMQ 3.5.7 on Erlang/OTP 18 on a standard Ubuntu 16 instance in AWS. According to this comment, that's a somewhat outdated setup, but I have not yet attempted to upgrade. I'd like to understand the following problem.

Every so often I see exceptions coming from py-amqp:

  […]
  File "/…/lib/python3.6/site-packages/celery/result.py", line 589, in revoke
    terminate=terminate, signal=signal, reply=wait)
  File "/…/lib/python3.6/site-packages/celery/app/control.py", line 210, in revoke
    }, **kwargs)
  File "/…/lib/python3.6/site-packages/celery/app/control.py", line 436, in broadcast
    limit, callback, channel=channel,
  File "/…/lib/python3.6/site-packages/kombu/pidbox.py", line 315, in _broadcast
    serializer=serializer)
  File "/…/lib/python3.6/site-packages/kombu/pidbox.py", line 290, in _publish
    serializer=serializer,
  File "/…/lib/python3.6/site-packages/kombu/messaging.py", line 181, in publish
    exchange_name, declare,
  File "/…/lib/python3.6/site-packages/kombu/messaging.py", line 203, in _publish
    mandatory=mandatory, immediate=immediate,
  File "/…/lib/python3.6/site-packages/amqp/channel.py", line 1734, in _basic_publish
    (0, exchange, routing_key, mandatory, immediate), msg
  File "/…/lib/python3.6/site-packages/amqp/abstract_channel.py", line 50, in send_method
    conn.frame_writer(1, self.channel_id, sig, args, content)
  File "/…/lib/python3.6/site-packages/amqp/method_framing.py", line 166, in write_frame
    write(view[:offset])
  File "/…/lib/python3.6/site-packages/amqp/transport.py", line 258, in write
    self._write(s)
TimeoutError: [Errno 110] Connection timed out

According to this issue for py-amqp, setting the TCP timeout didn't fix the problem. The logs show me a warning:

=WARNING REPORT==== 15-May-2018::12:24:45 ===
closing AMQP connection <0.20999.19> (10.0.1.155:47144 -> 10.0.1.98:5672):
connection_closed_abruptly

which was logged around the time the exception was raised. Host 10.0.1.155 runs the web server that attempts to schedule the job, and host 10.0.1.98 runs the message queue and job broker (Celery).
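
One way to line up broker-side events with client-side exceptions is to pull every connection_closed_abruptly report out of the RabbitMQ log, together with its timestamp and peer addresses. A minimal Python sketch, assuming the log keeps the three-line layout quoted above; the function name abrupt_closes is made up for illustration and is not part of RabbitMQ or any library:

```python
import re
from datetime import datetime

# Report header in the layout quoted above, e.g.
# "=WARNING REPORT==== 15-May-2018::12:24:45 ==="
REPORT_RE = re.compile(r"^=(?P<level>[A-Z]+) REPORT==== (?P<ts>\S+) ===$")
# Connection line, e.g.
# "closing AMQP connection <0.20999.19> (10.0.1.155:47144 -> 10.0.1.98:5672):"
CONN_RE = re.compile(
    r"^closing AMQP connection \S+ \((?P<client>\S+) -> (?P<server>\S+)\):$"
)


def abrupt_closes(lines):
    """Yield (timestamp, client, server) per connection_closed_abruptly report."""
    ts = client = server = None
    for raw in lines:
        line = raw.strip()
        header = REPORT_RE.match(line)
        if header:
            # A new report starts: remember its time, forget the old peers.
            ts = datetime.strptime(header.group("ts"), "%d-%b-%Y::%H:%M:%S")
            client = server = None
            continue
        conn = CONN_RE.match(line)
        if conn:
            client, server = conn.group("client"), conn.group("server")
            continue
        if line == "connection_closed_abruptly" and ts and client:
            yield ts, client, server
```

Feeding it the lines of the broker log and comparing the yielded timestamps with the times of the TimeoutError tracebacks on the web server would show whether each exception has a matching abrupt close from 10.0.1.155.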

How do I go about debugging this further to find the cause for this timeout?

Thank you!

Michael Klishin

May 16, 2018, 9:41:19 PM
to rabbitm...@googlegroups.com

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

jens.t...@gmail.com

May 16, 2018, 10:22:44 PM
to rabbitmq-users
Thanks for the links, Michael. I had read portions of the latter but hadn't seen the former. I'm trying to gather information, but I'm not sure I'm looking in all the right places.

Looking at the syslogs, I see only regular (i.e. twice per hour) dhclient notes:

May 15 20:23:35 ip-10-0-1-98 dhclient[939]: DHCPREQUEST of 10.0.1.98 on eth0 to 10.0.1.1 port 67 (xid=0x19cda756)
May 15 20:23:35 ip-10-0-1-98 dhclient[939]: DHCPACK of 10.0.1.98 from 10.0.1.1
May 15 20:23:35 ip-10-0-1-98 dhclient[939]: bound to 10.0.1.98 -- renewal in 1582 seconds.

Nothing else is there.

Michael Klishin

May 16, 2018, 10:27:32 PM
to rabbitm...@googlegroups.com
I meant RabbitMQ's own logs; their default locations are listed at http://www.rabbitmq.com/relocate.html#generic-unix.

A traffic capture would reveal a lot about what's going on on the wire:


jens.t...@gmail.com

May 17, 2018, 12:26:49 AM
to rabbitmq-users
I somehow doubt that this is an issue on the queue side.

The client sends two messages to the job broker over RabbitMQ right before the failure, and those messages arrive just fine and are processed. The broker is currently single-process (for debugging purposes), but that should simply queue the messages.

A few lines later, the client attempts to revoke a Celery group of tasks, and that's where the failure happens. It seems to happen only, and always, when revoking groups. I will switch to revoking the individual tasks manually instead of the group, to see what happens. Knowing how messy that Celery thing is, the fault may lie there.
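
The switch from a group revoke to per-task revokes could look roughly like the sketch below. It assumes the group result exposes `.results`, an iterable of objects with an `.id` attribute and a `.revoke()` method, as Celery's GroupResult and AsyncResult do; the helper name revoke_individually is hypothetical. It records which revokes raised instead of stopping at the first failure, which may help show whether the timeout is tied to the group call or to any publish on that connection.

```python
def revoke_individually(group_result, **revoke_kwargs):
    """Revoke each task of a group separately and report which revokes failed.

    Assumption: `group_result.results` is an iterable of objects with an
    `.id` attribute and a `.revoke(**kwargs)` method. Returns a list of
    (task_id, exception) pairs, one per failed revoke.
    """
    failures = []
    for result in group_result.results:
        try:
            result.revoke(**revoke_kwargs)
        except OSError as exc:  # TimeoutError is a subclass of OSError
            failures.append((result.id, exc))
    return failures
```

An empty return value means every revoke was published without an OS-level error; it does not by itself confirm that the workers acted on the revoke.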

Michael Klishin

May 17, 2018, 5:39:01 AM
to rabbitm...@googlegroups.com
We do not guess on this list: guessing is too time-consuming and our small team cannot afford it.

Logs and Wireshark provide relevant data.


Michael Klishin

May 17, 2018, 5:46:10 AM
to rabbitm...@googlegroups.com
…and https://www.rabbitmq.com/troubleshooting-networking.html provides a methodology that is pretty efficient at narrowing down the problem.
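
As a rough first check in that direction, plain TCP reachability of the broker port can be probed from the client host. This is only a sketch using the Python standard library; the helper name probe_tcp is invented here, and the host and port in the usage note are the ones quoted earlier in the thread. It only tests whether a new connection can be opened, so it says nothing about what happens to a long-lived, idle connection, which is what a traffic capture would show.

```python
import socket
import time


def probe_tcp(host, port, timeout=5.0):
    """Try to open a TCP connection.

    Returns (ok, seconds_elapsed, error), where error is None on success
    and the OSError raised by the connection attempt otherwise.
    """
    start = time.monotonic()
    try:
        # create_connection resolves the host and applies the timeout
        # to the connection attempt.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, None
    except OSError as exc:
        return False, time.monotonic() - start, exc
```

For example, `probe_tcp("10.0.1.98", 5672)` run from 10.0.1.155 would report whether the AMQP port accepts new connections and how long the handshake took.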

--
MK

Staff Software Engineer, Pivotal/RabbitMQ

jens.t...@gmail.com

May 18, 2018, 5:46:36 AM
to rabbitmq-users
I understand.

However, at the moment I can't even reproduce the error, so I’m looking through the layers involved to find a way to trigger the problem reliably. Then I can debug in a more methodical way; I completely agree with you.

Thank you for the links, I’m reading through them…