Upgraded RabbitMQ Windows service, missing erlang cookie file in %SystemRoot%\system32\config\systemprofile


Peter Tirrell

Jul 2, 2018, 10:13:51 PM
to rabbitmq-users
I have (had?) an existing 2-node RabbitMQ cluster running on Windows, version 3.6.1 with OTP 18.2.1.  I am upgrading to RabbitMQ 3.7.5 and OTP 20.3.  On each Windows server I have set the environment variables for ERLANG_HOME, RABBITMQ_BASE, and RABBITMQ_MNESIA_DIR to custom folders. 

For the upgrade process, I shut down the RabbitMQ service on both servers; the last service I shut down was on the first server I upgraded.  I upgraded OTP first, then updated the ERLANG_HOME environment variable to the new version's path.  Then I ran the RabbitMQ installer for the new version, and it appeared to work successfully.  The service installed, and it started (on each server).

However, the current state is this: on each server, I am unable to run rabbitmqctl status (it says the node is unavailable), and the nodes are no longer clustered.  Previously, the .erlang.cookie file matched between c:\windows and c:\users\myusername on each server (all four locations).  Reading further, I found that the new OTP changes the location of the .erlang.cookie file to %SystemRoot%\system32\config\systemprofile.  On each system I checked that path, and there was no .erlang.cookie file.  So I copied my existing cookie file into that location as well and restarted the RabbitMQ Windows service.  However, I am still unable to use rabbitmqctl to check status, with what still seems to be a cookie mismatch.
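
For reference, one way I can double-check that the cookie copies are actually identical is to compare file checksums from an elevated prompt (this is not the same value as the "cookie hash" in the log, which Erlang computes internally; matching checksums just confirm the files are the same):

    certutil -hashfile "C:\Windows\.erlang.cookie" MD5
    certutil -hashfile "C:\Users\myusername\.erlang.cookie" MD5
    certutil -hashfile "C:\Windows\System32\config\systemprofile\.erlang.cookie" MD5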

The log on each node prints out something like this on startup:

2018-07-02 16:56:03.966 [info] <0.263.0>
 node           : rabbit@REGICE03
 home dir       : C:\Windows\system32\config\systemprofile
 config file(s) : c:/RABBIT~1/advanced.config
                : c:/RABBIT~1/rabbitmq.conf
 cookie hash    : +9DoKv1jcnc5mkvz8+h5ow==
 log(s)         : C:/RABBIT~1/log/RABBIT~2.LOG
                : C:/RABBIT~1/log/rabbit@server1_upgrade.log
 database dir   : c:/RABBIT~1/mnesia

The cookie hash is indeed different from what it was before the upgrade.  Each node currently has a different hash, but neither server had an .erlang.cookie file in the listed home dir.

My question is: where is that cookie hash coming from, if not from the systemprofile folder, and shouldn't I be able to copy my common .erlang.cookie file into the systemprofile folder for it to "work"?  Once I get the cookies straightened out, should I expect the nodes to recluster with each other, or will I need to rejoin one node to the other again?  This was my run-through on my test systems, but I feel like I'm missing a step somewhere.

Thanks for any help,

Peter

Karl Nilsson

Jul 3, 2018, 6:42:33 AM
to rabbitm...@googlegroups.com
Hi,

I take it you have read [1] fully? I would probably start by checking that the HOMEDRIVE and HOMEPATH environment variables for your service user are what you expect.
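
If the service runs as LocalSystem, one way to see the environment that account actually gets is to open a shell as that account, for example with Sysinternals PsExec (assuming you have it available), and list the HOME* variables:

    psexec -s cmd /c set HOME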


Cheers
Karl


Peter Tirrell

Jul 3, 2018, 9:43:33 AM
to rabbitmq-users
I think I made some progress.  I do not see any environment variable entries for HOMEDRIVE or HOMEPATH, and for my local user they all seem to be the defaults.  The RabbitMQ service is running under the "Local System" account.

However, I did manage to find the erlang cookie.  Contrary to the docs, which state that the new OTP Windows service puts the cookie in "C:\WINDOWS\system32\config\systemprofile", on my system I found the cookie at "C:\Windows\SysWOW64\config\systemprofile".  This is on Windows Server 2012.  After replacing the cookie file in *that* folder, the services start up and my CLI is able to connect.
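
My guess (and it is only a guess) is WOW64 file-system redirection: a 32-bit process writing under C:\Windows\System32 gets silently redirected to C:\Windows\SysWOW64, so a 32-bit Erlang VM would end up creating its cookie there.  Copying the cookie to both locations from an elevated prompt covers either case:

    copy /Y "%USERPROFILE%\.erlang.cookie" "C:\Windows\System32\config\systemprofile\.erlang.cookie"
    copy /Y "%USERPROFILE%\.erlang.cookie" "C:\Windows\SysWOW64\config\systemprofile\.erlang.cookie"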

That led me to a couple more questions, though.  After starting the first node successfully, I did the same for the second node, but it didn't automatically rejoin the cluster it was part of before the upgrade.  I was able to manually stop node 2, rejoin it to the cluster, then start node 2, and it appears to be working correctly.
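
For reference, the rejoin on node 2 was just the usual sequence (rabbit@NODE1 is a placeholder for my actual first node's name):

    rabbitmqctl stop_app
    rabbitmqctl join_cluster rabbit@NODE1
    rabbitmqctl start_app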

The remaining issues so far are:

On node 2, if I run "rabbitmqctl list_queues", it times out.  The suggested "list_unresponsive_queues" also hangs.  "rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'" errors with an argument validation error (more on this below).

On node 1, list_queues returns *some* queues, but then times out. 
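
Two things I still plan to try, though I'm not sure either will help: a longer timeout on the list commands (if the 3.7 CLI accepts the --timeout option, which I believe it does), and double quotes around the eval expression, since Windows cmd passes single quotes through literally and that may be what triggers the argument validation error:

    rabbitmqctl list_queues --timeout 120
    rabbitmqctl eval "rabbit_diagnostics:maybe_stuck()."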

Queues themselves appear to be properly mirrored and messages appear to be getting processed, but I'm trying to gauge the overall system health and results of the upgrade process.

To that point - what *should* the upgrade process be like?  I thought I followed the directions.  I only have two nodes, so I shut both down by stopping the Windows service on each server, and the last one I shut down was the first one I upgraded and started back up.  Aside from the erlang cookie issue, each node appeared to come back up with the queues and settings I expected from the cluster.  Should I not have expected them to rejoin the cluster?  Or, for that to have happened, would I have needed to make sure the cookies were synced before starting up the 2nd node after upgrading?  Although I'm pretty sure the installer started the service back up immediately when it finished.
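
Concretely, assuming the default Windows service name of "RabbitMQ", the order I followed was roughly:

    rem On node 2, then on node 1 (the last node stopped is the first upgraded):
    net stop RabbitMQ
    rem On node 1: install new OTP, update ERLANG_HOME, run the RabbitMQ installer
    net start RabbitMQ
    rem Then repeat the upgrade and start on node 2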

Thanks,
Peter

Peter Tirrell

Jul 5, 2018, 2:31:03 PM
to rabbitmq-users
Another update with some further test results and a new question...

I set up another test cluster to go through the upgrade process again: a 2-node cluster on Windows, running RabbitMQ 3.6.1 with OTP 18.2.1.  I have system environment variables set for ERLANG_HOME, RABBITMQ_BASE, and RABBITMQ_MNESIA_DIR.  I had the nodes clustered, with an ha-all policy set up for a test queue to mirror messages.  Before the upgrade I published a message to the queue but did not consume it.
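
For reference, the mirroring policy was declared roughly like this (the queue name pattern is illustrative, and the doubled quotes are how the JSON gets escaped on Windows cmd):

    rabbitmqctl set_policy ha-all "^test" "{""ha-mode"":""all""}"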

I first shut down node 2, then node 1.  On node 1 I installed the newer OTP (20.3), then updated the ERLANG_HOME variable to the new path.

Before running the RabbitMQ 3.7.6 installer (originally I had written 3.7.5; I am actually going to 3.7.6), I copied my existing erlang cookie to both C:\Windows\System32\config\systemprofile and C:\Windows\SysWOW64\config\systemprofile.

I ran the 3.7.6 Windows installer, and it installed and started RabbitMQ successfully.  According to the startup log on node 1, it did indeed use my existing cookie (same hash as pre-upgrade), and I was immediately able to use the CLI to check status.  At that point the only thing off was that my test queue no longer showed a message being present.

Next I followed similar steps on the node 2 server: I installed the new OTP, then, prior to upgrading RabbitMQ, copied my existing erlang cookie from the cluster over to *both* systemprofile folders.

I ran the 3.7.6 installer on node 2 and it successfully installed and started the service.  Again, on node 2, the RabbitMQ log showed it started up with my old cookie hash, and I was able to connect using the CLI.  *Added bonus* was that it appeared to redetect the cluster; after logging into the web console (I didn't have to reinstall that either), it showed both nodes and appeared to correctly mark the mirrors, etc. 

All in all, that process matched my expectations best so far, except that the messages on the queue did not persist through the upgrade.  The test queue is marked as durable, with mirroring.  I don't know that mirroring really comes into play here, since my test message was originally published on node 1 anyway, but the queues were synchronized before the upgrade.
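
One thing I could have checked before the upgrade is whether the message was actually stored as persistent; if I understand the CLI docs correctly, messages_persistent is a supported info item:

    rabbitmqctl list_queues name durable messages messages_persistent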

Should I expect messages to persist through the upgrade? Did I miss some other part of the process?

Thanks for any advice,

Peter

Michael Klishin

Jul 5, 2018, 2:37:28 PM
to rabbitm...@googlegroups.com
Messages published as persistent in durable queues will be preserved. See [1][2]. RabbitMQ also takes a backup during node upgrades and removes it after it successfully applies all schema migrations. But that's just a side note.





--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Peter Tirrell

Jul 5, 2018, 2:42:44 PM
to rabbitmq-users
Thank you!  After the fact I looked a little closer, and I think that's it: in my test install I don't think I changed the persistence of the test messages; I just took the default.  Our application does set message persistence, so those should stick, and I think I should be good.  The process seems to behave how I need it to.

Thanks again for your help,

Peter