RabbitMQ - Seeing virtual host down


yaswanth kumar

Dec 18, 2020, 12:21:59 PM
to rabbitmq-users
RabbitMQ version: 3.8.9
Node name: rabbit@dltlbbrcmsrx04
Erlang configuration: Erlang/OTP 22 [erts-10.7.2.3] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:64] [hipe]

CentOS: CentOS Linux release 7.9.2009 (Core)

We are regularly seeing the error below. Can you help us find the root cause?

Note: We know a remediation for this: removing /var/lib/rabbitmq/mnesia/rabbit@server01/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L and restarting RabbitMQ fixes it, but we want to know the root cause so we can address it directly and avoid having to apply that remediation.

Also, we have excluded both /var/log/rabbitmq and /var/lib/rabbitmq from antivirus scans; we are using McAfee on these boxes.

[error] <0.26412.16> Failed to start message store of type msg_store_persistent for vhost '/': {{{{badmatch,{error,{unable_to_scan_file,"335.rdq",eperm}}},[{rabbit_msg_store,build_index_worker,5,[{file,"src/rabbit_msg_store.erl"},{line,1753}]},{rabbit_msg_store,'-enqueue_build_index_workers/4-fun-0-',5,[{file,"src/rabbit_msg_store.erl"},{line,1798}]},{worker_pool_worker,handle_cast,2,[{file,"src/worker_pool_worker.erl"},{line,119}]},{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1067}]},{proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,259}]}]},{gen_server2,call,[<0.26425.16>,out,infinity]}},{child,undefined,msg_store_persistent,{rabbit_msg_store,start_link,[msg_store_persistent,"/var/lib/rabbitmq/mnesia/rabbit@server01/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L",[],{#Fun<rabbit_queue_index.2.42349471>,{start,[{resource,<<"/">>,queue,<<"ConsoleResultBus_8398a0ec-5d22-44b9-8ce6-359a72c257bf">>},{resource,<<"/">>,queue,<<"LoggerAdminBus_b4c72652-61fe-492c-b894-aa65a130c43d">>},{resource,<<"/">>,queue,<<"ProtocolAdapterResultBus_a088648f-b7b3-4e70-b6e5-3af79c61a2b2">>},{resource,<<"/">>,queue,<<"ResumeTaggerCommandBus_7215ac9f-ab1e-4432-bd26-2f1799b91790">>},{resource,<<"/">>,queue,<<"TaggerAdminBus_b4c72652-61fe-492c-b894-aa65a130c43d_bgxray_zh_tw-rtgr1">>},{resource,<<"/">>,queue,<<"TaggerAdminBus_8398a0ec-5d22-44b9-8ce6-359a72c257bf_bgxray_fr_ca-rtgr1">>},{resource,<<"/">>,queue,<<"ProtocolAdapterResultBus_411b594b-98ef-45ba-8013-3d84e372e2bd_bgxray_pt_br-pa1">>},{resource,<<"/">>,queue,<<"ProtocolAdapterResultBus_0cdefa56-526f-4aa8-9992-3807fbe47371">>},{resource,<<"/">>,queue,<<"ProtocolAdapterResultBus_57bb8351-9a22-4f4b-b347-f9cab4981638_bgxray_eng_deu-pa1">>},{resource,<<"/">>,queue,<<"AuditLogBus_411b594b-98ef-45ba-8013-3d84e372e2bd">>},{resource,<<"/">>,queue,<<"TaggerAdminBus_59f4b408-2131-498b-adfb-5bd7c0d865a0_bgxray_eng_gbr_ire-rtgr1">>},{resource,<<"/">>,queue,<<"ProtocolAdapterAdminBus_a088648f-b7b3-4e70-b6e5-3af79c61a2b2_bgxray_fr_be-pa1">>},{resource,<<"/">>,queue,<<"ProtocolAdapterResultBus_f1bbf110-61a4-4fae-a721-7a271d63d7bf_bgxray_en_tr-pa1">>},{resource,<<"/">>,queue,<<"ConsoleResultBus_62352b64-152f-4e31-87d6-ad0622603cf1">>},{resource,<<"/">>,queue,<<"ProtocolAdapterResultBus_6af31707-2206-49d6-8e5e-79d0c88cde34">>},{resource,<<"/">>,queue,<<"ProtocolAdapterAdminBus_d913498e-4b3d-4666-969b-009e60a003fe_bgxray-pa1">>},{resource,<<"/">>,queue,<<"TaggerAdminBus_180d28e7-149c-4f90-b4d0-a98d22f0318c_bgxray_fr_ch-rtgr1">>},{resource,<<"/">>,queue,<<"TaggerAdminBus_0cdefa56-526f-4aa8-9992-3807fbe47371_bgxray_en_in-rtgr1">>},{resource,<<"/">>,queue,<<"LoggerAdminBus_20c73c95-4baa-498d-bc50-e3cdf097e69c">>},{resource,<<"/">>,queue,<<"ResumeTaggerCommandBus_8398a0ec-5d22-44b9-8ce6-359a72c257bf">>},{resource,<<"/">>,queue,<<"LogBus_26327ae5-49ed-49e3-b454-909973206f69">>},{resource,<<"/">>,queue,<<"TaggerAdminBus_a088648f-b7b3-4e70-b6e5-3af79c61a2b2_bgxray_fr_be-rtgr1">>},{resource,<<"/">>,queue,<<"AuditLogBus_6af31707-2206-49d6-8e5e-79d0c88cde34">>},{resource,<<"/">>,queue,<<"ProtocolAdapterResultBus_cfac3806-e5eb-4e4a-98bb-e9b79bc7ce54">>},{resource,<<"/">>,queue,<<"LoggerAdminBus_50d6046a-10bf-4e0d-9066-81dfa0836705">>},{resource,<<"/">>,queue,<<"ResumeTaggerCommandBus_6418738d-b89c-4e8a-ae02-0c45a87e41d0">>},{resource,<<"/">>,queue,<<"LogBus_18cc1910-46f7-4ba5-a3c2-be39141d852d">>},{resource,<<"/">>,queue,<<"ResumeTaggerCommandBus_26327ae5-49ed-49e3-b454-909973206f69">>},{resource,<<"/">>,queue,<<"ProtocolAdapterResultBus_66b60efa-e40d-4890-8e08-33ef945715cc_bg
xray_cn_cn-pa1">>},{resource,<<"/">>,queue,<<"ProtocolAdapterResultBus_cfac3806-e5eb-4e4a-98bb-e9b79bc7ce54_bgxray_en_fr-pa1">>},{resource,<<"/">>,queue,<<"LogBus_6af31707-2206-49d6-8e5e-79d0c88cde34">>},{resource,<<"/">>,queue,<<"LogBus_e0f1b1bd-c0f5-4a9a-be02-6d57d737754b">>},{resource,<<"/">>,queue,<<"AuditLogBus_26327ae5-49ed-49e3-b454-909973206f69">>},{resource,<<"/">>,queue,<<"AuditLogBus_d913498e-4b3d-4666-969b-009e60a003fe">>},{resource,<<"/">>,queue,<<"ResumeTaggerCommandBus_9bbfdbee-91...">>},...]}}]},...}}

Thanks,

yaswanth kumar

Dec 21, 2020, 8:11:05 AM
to rabbitmq-users
Can I get some help with the above error, please?

Ayanda Dube

Dec 21, 2020, 11:06:00 AM
to rabbitm...@googlegroups.com
Hi,
The eperm error reason means the user is not the owner of that file, 335.rdq (one of the files in which messages are stored). See: https://erlang.org/doc/man/file.html#posix-error-codes
An attempt to rebuild the node's message store index is being carried out following a previous unclean node shutdown, and it is failing because of file ownership/permissions.
Just make sure the active user has full ownership of the data directory, i.e. /var/lib/rabbitmq/mnesia/
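If it helps, a quick way to spot stray ownership is a small script along the lines of the sketch below. It is only a minimal illustration: the 'rabbitmq' service user and the /var/lib/rabbitmq/mnesia path are assumptions based on a default installation, so adjust both for your setup (running ls -lR over the directory gives you the same information).

#!/usr/bin/env python3
"""Minimal sketch: report files under the RabbitMQ data directory that are
not owned by the expected service user. The user name and directory below
assume a default installation -- adjust them for your setup."""
import os
import pwd

EXPECTED_USER = "rabbitmq"             # assumption: default service user
DATA_DIR = "/var/lib/rabbitmq/mnesia"  # assumption: default data directory

expected_uid = pwd.getpwnam(EXPECTED_USER).pw_uid

for root, dirs, files in os.walk(DATA_DIR):
    for name in dirs + files:
        path = os.path.join(root, name)
        try:
            st = os.lstat(path)        # lstat so symlinks are not followed
        except OSError as err:
            print(f"cannot stat {path}: {err}")
            continue
        if st.st_uid != expected_uid:
            try:
                owner = pwd.getpwuid(st.st_uid).pw_name
            except KeyError:
                owner = "<unknown uid>"
            print(f"unexpected owner {owner} (uid {st.st_uid}): {path}")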


error] <0.26412.16> Failed to start message store of type msg_store_persistent for vhost '/': {{{{badmatch,{error,{unable_to_scan_file,"335.rdq",eperm}}},[{rabbit_msg_store,build_index_worker,5,[{file,"src/rabbit_msg_store.erl"},{line,1753}]},{rabbit_msg_store,'-enqueue_build_index_workers/4-fun-0-',5,[{file,"src/rabbit_msg_store.erl"},{line,1798}]},{worker_pool_worker,handle_cast,2,[{file,"src/worker_pool_worker.erl"},{line,119}]},{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1067}]},{proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,259}]}]},{gen_server2,
...


Best regards,
Ayanda
RabbitMQ Engineering

Erlang Solutions Ltd.



Wesley Peng

Dec 21, 2020, 9:24:58 PM
to rabbitm...@googlegroups.com
On 2020/12/21 9:11 PM, yaswanth kumar wrote:
> Also we have excluded both /var/log/rabbitmq and /var/lib/rabbitmq
> from antivirus scans , and we are using McAfee on these boxes.

As far as I know, virus scanner software can interfere with RabbitMQ's storage access.

Please shut down McAfee and then try again.

Thanks.

yaswanth kumar

Dec 22, 2020, 9:43:53 AM
to rabbitmq-users
Hi Ayanda
Thanks a lot for the information you provided. Here is how we have set things up.

The rabbitmq user and group were created by the RabbitMQ installation, and all of its folders are still associated with that same user and group; we haven't changed anything there.

We also have a separate user and group for the main application that needs RabbitMQ.

We generally sudo as this non-rabbitmq user and start our application while RabbitMQ is already running. Initially it works fine for quite some time (we see activity spikes on the RabbitMQ dashboard), but all of a sudden it throws the error I mentioned above.

So I'm not really sure which ownership you described is missing in our workflow.

Thanks

Ayanda Dube

Dec 23, 2020, 7:33:50 AM
to rabbitm...@googlegroups.com
Hi Yaswanth

Can you please attach the log file from the failing node? I'll have a look.
Cheers


Best regards,
Ayanda
RabbitMQ Engineering

Erlang Solutions Ltd.


yaswanth kumar

Dec 30, 2020, 4:11:59 PM
to rabbitmq-users
Hi Ayanda,

Here is the log file that I grabbed from a machine where RabbitMQ was down at the time.

Please review and let me know if you need any more details.

rabbitlogs.txt

yaswanth kumar

Jan 4, 2021, 12:26:36 PM
to rabbitmq-users
Hi Ayanda,

If possible, can you take a look at the logs attached above and let me know how we can troubleshoot and fix the issue?

Ayanda Dube

Jan 5, 2021, 6:00:02 AM
to rabbitm...@googlegroups.com
Hi Yaswanth

The problem is as I mentioned before: RabbitMQ does not have permission to carry out the necessary actions on the
files it uses for persisting messages, hence the eperm error I highlighted in my first response.

The logs show the sequence of events that build up to this problem. Firstly, there was an attempt to delete a
message store file (as part of a garbage collection operation). This file is in:
    /var/lib/rabbitmq/mnesia/rabbit@prodserver06/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent

You should see the notification for the file deletion request in your log file:
    ** Last message in was {'$gen_cast',{delete,312}}

The operation fails for permission (eperm) reasons, which I've already explained in my first response.
   ** Reason for termination ==
   ** {{badmatch,{error,{unable_to_scan_file,"312.rdq",eperm}}} ...

Thereafter, the above failure of the message store components causes the entire virtual host to stop. You can see the following
notifications in your log file:
       2020-12-30 11:23:56.950 [info] <0.478.0> Closing all connections in vhost '/' on node 'rabbit@prodserver06' because the vhost is stopping

Next, all connections that were established on that virtual host are closed. You can see the following notifications:
       2020-12-30 11:23:56.961 [error] <0.31409.115> Error on AMQP connection <0.31409.115> ([::1]:34592 -> [::1]:5672, vhost: '/', user: 'xray', state: running), channel 0:
       operation none caused a connection exception connection_forced: "vhost '/' is down"

and...

       2020-12-30 11:23:57.097 [info] <0.15906.0> closing AMQP connection <0.15906.0> ([::1]:58102 -> [::1]:5672, vhost: '/', user: 'xray')

Following the above failures, the virtual host doesn't recover. You can see from the rest of the log notifications that connection
attempts are made and accepted (the 1st phase of connection establishment), but then immediately fail in the handshaking phase (2nd phase).
You should see the following entries continuously repeated in your logs (due to the failing virtual host).
1st phase of connection establishment:
      2020-12-30 11:23:57.102 [info] <0.15188.119> accepting AMQP connection <0.15188.119> ([::1]:39088 -> [::1]:5672)

2nd phase of connection establishment (connection handshake failures):
    2020-12-30 11:23:57.223 [error] <0.15188.119> Error on AMQP connection <0.15188.119> ([::1]:39088 -> [::1]:5672, vhost: 'none', user: 'xray', state: opening), channel 0:
    {handshake_error,opening,
                     {amqp_error,internal_error,
                                 "access to vhost '/' refused for user 'xray': vhost '/' is down",
                                 'connection.open'}}

The above connection establishment failures repeat continuously because your virtual host never recovers. You
should see these handshake_error notifications (due to "vhost '/' is down") logged throughout your log file, right up to the end.

There are multiple attempts to recover and restore the virtual host, but again, all of these fail for the above-mentioned
permission (eperm) reason. You should see the following notifications in your logs:
      2020-12-30 11:23:57.256 [error] <0.414.0> ** Generic server <0.414.0> terminating
      ** Last message in was {'$gen_cast',{submit_async,#Fun<rabbit_msg_store.48.27584289>,<0.15386.119>}}
      ** When Server state == {from,<0.15386.119>,#Ref<0.3366353068.2695102473.64747>}
      ** Reason for termination ==
      ** {{badmatch,{error,{unable_to_scan_file,"312.rdq",eperm}}

These vhost recovery attempts are repeated and keep failing because the active RabbitMQ user does not have permission (eperm)
to read the message store files. Also, any connections that were momentarily being established before the vhost failure are terminated.
         2020-12-30 11:23:57.378 [info] <0.478.0> Closing connection <0.28329.111> because "vhost '/' is down"

So your virtual host '/' never recovers. The initial permission failure I mentioned above keeps it down:
it repeatedly attempts to recover, but the attempts keep failing due to the unaddressed lack of permissions on the data directory:
/var/lib/rabbitmq/mnesia/rabbit@prodserver06/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent

Please check your workflow and put some guarantees in place that the user running RabbitMQ is the effective user
with full read/write permissions on all data directories in use. There seems to be some dynamic alteration of file
permissions taking place (since you say this starts off fine and then breaks at runtime). It could perhaps be your
antivirus software, but that is unlikely if all data directories really are excluded from the antivirus scans.

Based on your logs, everything consistently points to external permission restrictions causing these failures.
The problem in this case is not with RabbitMQ, but a usage/permissions issue on your end, i.e. host server
permissions.
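Since you say the permissions start out fine and only break at runtime, one way to catch whatever is changing them in the act is to watch the data directory for ownership/mode changes. Below is a minimal polling sketch; the /var/lib/rabbitmq/mnesia path and the interval are assumptions for a default installation, and something like Linux auditd or an inotify-based watcher would be the more robust tool for this.

#!/usr/bin/env python3
"""Minimal sketch: periodically snapshot ownership and permissions of every
file under the RabbitMQ data directory and print any change, to help catch
whatever is altering permissions at runtime. The directory and interval
below are assumptions -- adjust them for your setup."""
import os
import stat
import time

WATCH_DIR = "/var/lib/rabbitmq/mnesia"  # assumption: default data directory
POLL_SECONDS = 30                       # assumption: polling interval

def snapshot(top):
    """Map each path under `top` to its (uid, gid, mode) triple."""
    seen = {}
    for root, dirs, files in os.walk(top):
        for name in dirs + files:
            path = os.path.join(root, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue
            seen[path] = (st.st_uid, st.st_gid, stat.S_IMODE(st.st_mode))
    return seen

previous = snapshot(WATCH_DIR)
while True:
    time.sleep(POLL_SECONDS)
    current = snapshot(WATCH_DIR)
    for path, attrs in current.items():
        old = previous.get(path)
        if old is not None and old != attrs:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(f"{stamp} changed {path}: {old} -> {attrs}")
    previous = current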



Best regards,
Ayanda
RabbitMQ Engineering

Erlang Solutions Ltd.