rabbitmq reported "failed to perform operation on queue due to timeout" error


xue

Dec 25, 2018, 8:40:49 AM
to rabbitmq-users
Hello, I'm using RabbitMQ 3.7.6 with esl-erlang 20.3.8.14.

I hit a problem in my test environment: RabbitMQ reported a "failed to perform operation on queue" error like the one below.
##################################################################

2018-12-25 20:59:52.454 [error] <0.10583.293> Channel error on connection <0.22217.0> (192.28.0.5:40458 -> 192.28.8.120:5672, vhost: '/', user: 'rabbit'), channel 1:
operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'conductor' in vhost '/' due to timeout

##################################################################


I had to run the command below to delete the conductor queue first; only then could I re-declare a queue named "conductor". It seems RabbitMQ was holding on to some stale data.
##################################################################

rabbitmqctl eval '{ok, Q} = rabbit_amqqueue:lookup(rabbit_misc:r(<<"/">>, queue, <<"conductor">>)), rabbit_amqqueue:delete_crashed(Q).'

##################################################################

What's wrong with RabbitMQ? Thank you.

Daniil Fedotov

Dec 25, 2018, 9:27:36 AM
to rabbitmq-users
Hello,

The timeout error usually means that the channel exhausted its retry attempts for the operation. In most cases some other error caused it to retry too many times.
Can you find anything else in the logs related to this queue?

In most cases `delete_crashed` is not the best way to clean up queues, because the queue process may still be alive, and deleting it then can cause data corruption. This function is meant to delete queues whose process is no longer alive. You can list the queues whose process is down with `rabbitmqctl list_queues --offline`; it's safe to call `delete_crashed` only for those queues.
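Daniil's rule of thumb can be sketched as a toy model (plain Python rather than rabbitmqctl; the queue names and the `process_alive` flag are hypothetical stand-ins for what `rabbitmqctl list_queues --offline` would actually report):

```python
# Toy model: `delete_crashed` is only safe for queues whose process is
# already down. Queue names and flags here are hypothetical.
queues = {
    "conductor": {"process_alive": False},            # process gone
    "notifications.sample": {"process_alive": True},  # still running
}

def safe_to_delete_crashed(name):
    """Deleting a queue whose process is still alive risks corruption."""
    return not queues[name]["process_alive"]

# Queues that `list_queues --offline` would surface in this model:
offline = [q for q, info in queues.items() if not info["process_alive"]]
```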

Michael Klishin

Dec 25, 2018, 9:40:42 AM
to rabbitm...@googlegroups.com
One example of such a scenario: a client connected to node A uses a queue whose master is hosted on node B. If A and B cannot communicate with each other, all operations on the queue performed by clients connected to node A will eventually time out.

So watch for “node down” and partition events in the logs, too, and not just on the node where the message originates.
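The scenario above can be sketched as a toy model (an illustrative Python simulation, not RabbitMQ code; the node names and the `links` set are hypothetical):

```python
def queue_operation(client_node, master_node, links):
    """Toy model: a queue operation issued through a client connected to
    `client_node` must reach the queue's master on `master_node`.
    `links` is the set of node pairs that can currently communicate;
    if the master is unreachable, the operation eventually times out."""
    if client_node == master_node:
        return "ok"
    if (client_node, master_node) in links or (master_node, client_node) in links:
        return "ok"
    # Surfaces to the client as "failed to perform operation ... due to timeout"
    return "timeout"

# Healthy cluster: A can reach B.
print(queue_operation("A", "B", {("A", "B")}))  # ok
# Inter-node link down: both nodes are up, but the operation times out.
print(queue_operation("A", "B", set()))         # timeout
```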

xue

Dec 25, 2018, 10:55:54 AM
to rabbitmq-users
Hello, Daniil and Michael
Thanks for your replies. Here is some more information.

There are only two nodes in my environment. I pinged the other node and got a pong response:

##################################################################
80729213-D21D-B211-82D2-000000821800:/home/fsp # erl -sname test@rabbitmqNode1 -setcookie rabbitmq_server_cookie -remsh rabbit@rabbitmqNode1
Erlang/OTP 20 [erts-9.3.3.6] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V9.3.3.6  (abort with ^G)
(rabbit@rabbitmqNode1)1> net_adm:ping(rabbit@rabbitmqNode0).
pong
(rabbit@rabbitmqNode1)2> 
##################################################################

The node health check passed on both nodes:
##################################################################
/usr/local/lib/rabbitmq/sbin/rabbitmqctl node_health_check -t 120
Timeout: 120 seconds ...
Checking health of node rabbit@rabbitmqNode1 ...
Health check passed

/usr/local/lib/rabbitmq/sbin/rabbitmqctl node_health_check -t 120
Timeout: 120 seconds ...
Checking health of node rabbit@rabbitmqNode0 ...
Health check passed
##################################################################

All of the queues are in the running state (the command below filters out running queues and prints nothing):
##################################################################
/usr/local/lib/rabbitmq/sbin/rabbitmqctl list_queues name state |grep -v running
Timeout: 60.0 seconds ...
Listing queues for vhost / ...
##################################################################

There is no network partition in the cluster:
##################################################################
/usr/local/lib/rabbitmq/sbin/rabbitmqctl cluster_status
Cluster status of node rabbit@rabbitmqNode1 ...
[{nodes,[{disc,[rabbit@rabbitmqNode0,rabbit@rabbitmqNode1]}]},
 {running_nodes,[rabbit@rabbitmqNode0,rabbit@rabbitmqNode1]},
 {cluster_name,<<"rabbit@13FEC980-1DD2-11B2-9C13-000000821800">>},
 {partitions,[]},
 {alarms,[{rabbit@rabbitmqNode0,[]},{rabbit@rabbitmqNode1,[]}]}]
##################################################################

Both nodes have been restarted, but there are no notable errors.

The attachment is the RabbitMQ crash log. Is there any other information you need?

Thank you very much.


On Tuesday, December 25, 2018 at 10:40:42 PM UTC+8, Michael Klishin wrote:
crash.log

Michael Klishin

Dec 25, 2018, 11:02:18 AM
to rabbitm...@googlegroups.com
According to the log, a mirror process failed with an exception; that likely led to the timeout.

xue

Dec 25, 2018, 11:21:23 AM
to rabbitmq-users
Hello Michael

Could you please tell me which process failed, and why this mirror process failed? Is it a bug in RabbitMQ?

Thank you.


On Wednesday, December 26, 2018 at 12:02:18 AM UTC+8, Michael Klishin wrote:

Michael Klishin

Dec 25, 2018, 12:23:55 PM
to rabbitm...@googlegroups.com
A queue mirror (replica) process. I cannot immediately tell what’s going on.

xue

Dec 26, 2018, 9:20:15 PM
to rabbitmq-users
I found that some queues aren't displayed by the list_queues command, but I can still fetch information about them with curl via the HTTP API.


This is the status of the queue named notifications.sample
##################################################################
{
    "garbage_collection": {
        "max_heap_size": -1,
        "min_bin_vheap_size": -1,
        "min_heap_size": -1,
        "fullsweep_after": -1,
        "minor_gcs": -1
    },
    "consumer_details": [
        
    ],
    "incoming": [
        
    ],
    "deliveries": [
        
    ],
    "node": "rabbit@rabbitmqNode0",
    "arguments": {
        
    },
    "exclusive": false,
    "auto_delete": false,
    "durable": false,
    "vhost": "/",
    "name": "notifications.sample"
}
##################################################################

I think some information about the notifications.sample queue may have been lost during mirrored queue synchronisation, so I can't declare a queue with the same name.

So I want to modify the code as below to delete such abnormal queues automatically, but I'm not sure whether this change could cause other problems.

rabbitmq-server-3.7.6/deps/rabbit/src/rabbit_amqqueue.erl

##################################################################
with(Name, F, E, RetriesLeft) ->
    case lookup(Name) of
        {ok, Q = #amqqueue{state = live, name = Qname, pid = QPid, slave_pids = SPids}} when RetriesLeft =:= 0 ->
            %% Something bad happened to that queue, we are bailing out
            %% on processing current request.
            internal_delete(Qname, ?INTERNAL_USER),
            Pids = [QPid | SPids],
            delete_immediately(Pids),
            E({absent, Q, timeout});
        {ok, Q = #amqqueue{state = stopped}} when RetriesLeft =:= 0 ->
            %% The queue was stopped and not migrated
            E({absent, Q, stopped});
        %% The queue process has crashed with unknown error
        {ok, Q = #amqqueue{state = crashed}} ->
            E({absent, Q, crashed});
        %% The queue process has been stopped by a supervisor.
        %% In that case a synchronised slave can take over
        %% so we should retry.
        {ok, Q = #amqqueue{state = stopped}} ->
            %% The queue process was stopped by the supervisor
            rabbit_misc:with_exit_handler(
              fun () -> retry_wait(Q, F, E, RetriesLeft) end,
              fun () -> F(Q) end);
        %% The queue is supposed to be active.
        %% The master node can go away or queue can be killed
        %% so we retry, waiting for a slave to take over.
        {ok, Q = #amqqueue{state = live}} ->
            %% We check is_process_alive(QPid) in case we receive a
            %% nodedown (for example) in F() that has nothing to do
            %% with the QPid. F() should be written s.t. that this
            %% cannot happen, so we bail if it does since that
            %% indicates a code bug and we don't want to get stuck in
            %% the retry loop.
            rabbit_misc:with_exit_handler(
              fun () -> retry_wait(Q, F, E, RetriesLeft) end,
              fun () -> F(Q) end);
        {error, not_found} ->
            E(not_found_or_absent_dirty(Name))
    end.
##################################################################
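As I read the thread, the first `live` clause in the Erlang above is the poster's proposed modification: stock 3.7.x reports a timeout error there instead of deleting the queue. The control flow can be summarized in a small model (an illustrative Python sketch, not RabbitMQ's code):

```python
def with_queue(state, retries_left, force_delete_on_timeout=False):
    """Simplified model of the outcomes of rabbit_amqqueue:with/4.

    Returns what the channel would see. `force_delete_on_timeout`
    models the proposed first clause: instead of reporting a timeout
    for a live queue that keeps failing, delete it outright.
    """
    if state == "crashed":
        return "absent_crashed"      # E({absent, Q, crashed})
    if retries_left == 0:
        if state == "stopped":
            return "absent_stopped"  # E({absent, Q, stopped})
        if force_delete_on_timeout:
            return "deleted"         # internal_delete + delete_immediately
        return "absent_timeout"      # E({absent, Q, timeout})
    return "retry"                   # wait for a mirror to take over
```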

Thank you.

Michael Klishin

Dec 27, 2018, 3:32:11 AM
to rabbitm...@googlegroups.com
The HTTP API uses a stats database, whereas `rabbitmqctl list_queues` reads the internal schema tables.
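That difference explains the discrepancy: a queue can linger in the stats database after it is gone from the schema tables. A toy model (plain Python; the set contents are hypothetical):

```python
# Toy model: the management HTTP API reads a stats database that can
# hold stale entries, while `rabbitmqctl list_queues` reads the schema
# tables directly. Set contents here are hypothetical.
schema_tables = {"conductor"}                     # authoritative queue records
stats_db = {"conductor", "notifications.sample"}  # may lag behind / go stale

# Queues visible via the HTTP API but missing from `list_queues`:
stale = stats_db - schema_tables
```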

Is your question how to delete the queue? There's no shortage of threads about that in the list archives, including from the last 7 days or so.

On Thu, Dec 27, 2018 at 5:12 AM xue <gxuew...@gmail.com> wrote:
Hello Michael,

20181227-100854(eSpace).png



Thank you.


On Wednesday, December 26, 2018 at 1:23:55 AM UTC+8, Michael Klishin wrote:


--
MK

Staff Software Engineer, Pivotal/RabbitMQ