Rabbit and clustering clearly does not work on ec2 - where are the real docs?


David Montgomery

Aug 26, 2014, 4:03:03 AM
to rabbitm...@googlegroups.com
Hi,

Why oh why will rabbit not cluster on ec2?

Both servers have the same cookie, and both have ports open to each other. Is it me, or do the online docs not reflect what one is supposed to do? Are there hidden docs I have yet to find?

Thanks

Here is how I install on ubuntu

    touch /etc/apt/sources.list.d/rabbitmq.list
    echo 'deb http://www.rabbitmq.com/debian/ testing main' | tee -a /etc/apt/sources.list.d/rabbitmq.list
    curl http://www.rabbitmq.com/rabbitmq-signing-key-public.asc -o /tmp/rabbitmq-signing-key-public.asc
    apt-key add /tmp/rabbitmq-signing-key-public.asc
    rm /tmp/rabbitmq-signing-key-public.asc
    apt-get -qy update


sudo rabbitmqctl stop_app
sudo rabbitmqctl join_cluster rab...@ip-172-31-12-135.us-west-1.compute.internal


sudo rabbitmqctl join_cluster rab...@ip-172-31-12-135.us-west-1.compute.internal
Clustering node 'rabbit@ip-172-31-2-103' with 'rab...@ip-172-31-12-135.us-west-1.compute.internal' ...
Error: unable to connect to nodes ['rab...@ip-172-31-12-135.us-west-1.compute.internal']: nodedown

=ERROR REPORT==== 26-Aug-2014::07:51:12 ===
** System NOT running to use fully qualified hostnames **
** Hostname ip-172-31-12-135.us-west-1.compute.internal is illegal **

DIAGNOSTICS
===========

attempted to contact: ['rab...@ip-172-31-12-135.us-west-1.compute.internal']

rab...@ip-172-31-12-135.us-west-1.compute.internal:
  * connected to epmd (port 4369) on ip-172-31-12-135.us-west-1.compute.internal
  * epmd reports node 'rabbit' running on port 25672
  * TCP connection succeeded but Erlang distribution failed
  * suggestion: hostname mismatch?
  * suggestion: is the cookie set correctly?

current node details:
- node name: 'rabbitmqctl25363@ip-172-31-2-103'
- home dir: /var/lib/rabbitmq
- cookie hash: deaU3MfVotDW9r05xrIWwA==

ubuntu@ip-172-31-2-103:~$ hostname -f
ip-172-31-2-103.us-west-1.compute.internal
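
The `cookie hash` line in the diagnostics above gives a quick way to rule out a cookie mismatch: if both nodes print the same value, they share a cookie. Below is a minimal Python sketch for comparing cookie files offline. It assumes the hash is the Base64-encoded MD5 digest of the cookie string, which fits the 24-character shape of the value above but is not confirmed anywhere in this thread. `cookie_hash` is a hypothetical helper name.

```python
import base64
import hashlib


def cookie_hash(cookie: str) -> str:
    """Return Base64(MD5(cookie)).

    ASSUMPTION: this mirrors the 'cookie hash' value printed by rabbitmqctl;
    treat it as a sketch, not as the tool's documented behaviour.
    """
    # Strip surrounding whitespace so a trailing newline in a copied
    # cookie does not change the result.
    digest = hashlib.md5(cookie.strip().encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")
```

To use it, read /var/lib/rabbitmq/.erlang.cookie on each node (this normally needs root or the rabbitmq user), pass the contents to `cookie_hash`, and compare the results.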

Michael Klishin

Aug 26, 2014, 4:19:22 AM
to David Montgomery, rabbitm...@googlegroups.com
On 26 August 2014 at 12:03:10, David Montgomery (davidmo...@gmail.com) wrote:
> > =ERROR REPORT==== 26-Aug-2014::07:51:12
> ===
> ** System NOT running to use fully qualified hostnames **
> ** Hostname ip-172-31-12-135.us-west-1.compute.internal
> is illegal **

The 3 most common issues are:

 * Host names: see "Issues with hostname" on http://www.rabbitmq.com/ec2.html
 * Firewalls, port access: see "Firewalled nodes" on http://www.rabbitmq.com/clustering.html
 * Different Erlang versions across the cluster: "If using clustered nodes, all nodes should use the same version of Erlang" on http://www.rabbitmq.com/which-erlang.html

Your issue seems to be 1 or 2, although all 3 need to be checked to be sure.

We'll try to cross link the pages above better.

A quick search for the error message above yields:

http://markmail.org/thread/2tgytqbittfvb2jq
http://markmail.org/thread/qfpphcemg73luf4j
http://markmail.org/thread/2f5alpmgwn2xybvj

which may clarify some of the issues in a bit more detail. 
--
MK

Staff Software Engineer, Pivotal/RabbitMQ

David Montgomery

Aug 26, 2014, 5:19:29 AM
to Michael Klishin, rabbitm...@googlegroups.com
Hi,

Well, I use chef to set the hostname and the cookie, and the security group (SG) allows ports 0-65000 between the nodes. Why oh why? The hostname is now illegal?

The hostnames now look like this on route53 and are resolvable to the private IP addresses:

1-rabbit-aws-development-west.test.com
2-rabbit-aws-development-west.test.com

Below is the /etc/hosts

127.0.0.1 localhost

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

172.31.12.135 2-rabbit-aws-development-west.test.com 2-rabbit-aws-development-west


In chef this is how I set the cookie.  All servers have the same cookie and rabbit is restarted.

template "/var/lib/rabbitmq/.erlang.cookie" do
  path "/var/lib/rabbitmq/.erlang.cookie"
  source "erlang.cookie.erb"
  owner "rabbitmq"
  group "rabbitmq"
  #-r-------- 1 rabbitmq rabbitmq 20 Aug 21 00:00 /var/lib/rabbitmq/.erlang.cookie
  mode "0400"
  notifies :restart, resources(:service => "rabbitmq-server")
end







sudo rabbitmqctl join_cluster rab...@1-rabbit-aws-development-west.test.com
Clustering node 'rabbit@2-rabbit-aws-development-west' with 'rab...@1-rabbit-aws-development-westtest.com' ...
Error: unable to connect to nodes ['rab...@1-rabbit-aws-development-west.test.com']: nodedown

=ERROR REPORT==== 26-Aug-2014::09:13:37 ===

** System NOT running to use fully qualified hostnames **

  * epmd reports node 'rabbit' running on port 25672
  * TCP connection succeeded but Erlang distribution failed
  * suggestion: hostname mismatch?
  * suggestion: is the cookie set correctly?

current node details:
- node name: 'rabbitmqctl16865@2-rabbit-aws-development-west'

- home dir: /var/lib/rabbitmq
- cookie hash: deaU3MfVotDW9r05xrIWwA==



Michael Klishin

Aug 26, 2014, 5:32:37 AM
to David Montgomery, rabbitm...@googlegroups.com
On 26 August 2014 at 12:19:22, Michael Klishin (mic...@rabbitmq.com) wrote:
> > We'll try to cross link the pages above better.

Clustering and EC2 guides now cross-link a bit better and share some
of the troubleshooting info (hostname, firewall, and Erlang version sections).

http://www.rabbitmq.com/clustering.html
http://www.rabbitmq.com/ec2.html

Alvaro Videla

Aug 26, 2014, 5:53:55 AM
to Michael Klishin, David Montgomery, rabbitm...@googlegroups.com
Hi,

I'm not familiar with RabbitMQ and EC2, but Erlang is giving you this error:

"** System NOT running to use fully qualified hostnames **", that means you can't use FQDNs to make clustering work.

RabbitMQ uses -sname, i.e. short names. That's the "short name" vs. "long name" distinction described here: http://www.erlang.org/doc/reference_manual/distributed.html

You can either set up your /etc/hosts so that the hostnames resolve to the right FQDN/IP, or fiddle with the RabbitMQ start scripts to use -name instead, which I think is a bit cumbersome.

Does this help?

Regards,

Alvaro


--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send an email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Michael Klishin

Aug 26, 2014, 5:58:32 AM
to Alvaro Videla, David Montgomery, rabbitm...@googlegroups.com
On 26 August 2014 at 13:53:59, Alvaro Videla (videl...@gmail.com) wrote:
> > You can either try to setup your /etc/hosts to point hostnames
> to the whole FQDN/IP or fiddle with the RabbitMQ start scripts
> to use -name instead, which I think is a bit cumbersome.

-sname can be overridden using RABBITMQ_NODENAME (NODENAME in rabbitmq-env.conf):
http://www.rabbitmq.com/configure.html 

Alvaro Videla

Aug 26, 2014, 6:01:19 AM
to Michael Klishin, David Montgomery, rabbitm...@googlegroups.com
Keep in mind that if epmd (the process that manages Erlang distribution) was started using, say, short names, then it has to be restarted in order to use long names, and the other way around as well.

Michael Klishin

Aug 26, 2014, 6:08:30 AM
to David Montgomery, rabbitm...@googlegroups.com


On 26 August 2014 at 13:19:35, David Montgomery (davidmo...@gmail.com) wrote:
> > Clustering node 'rabbit@2-rabbit-aws-development-west'
> with 'rab...@1-rabbit-aws-development-westtest.com'
> ...
> Error: unable to connect to nodes ['rab...@1-rabbit-aws-development-west.test.com(mailto:rab...@1-rabbit-aws-development-west.test.com)']:
> nodedown
>
> =ERROR REPORT==== 26-Aug-2014 
> ===
> ** System NOT running to use fully qualified hostnames **
> ** Hostname 1-rabbit-aws-development-west.augnodev.com
> is illegal **
>
> DIAGNOSTICS
> ===========
>
> attempted to contact: ['rab...@1-rabbit-aws-development-west.test.com(mailto:rab...@1-rabbit-aws-development-west.test.com)']
>
> rab...@1-rabbit-aws-development-west.augnodev.com(mailto:rab...@1-rabbit-aws-development-west.augnodev.com):
> * connected to epmd (port 4369) on 1-rabbit-aws-development-west.test.com
> * epmd reports node 'rabbit' running on port 25672
> * TCP connection succeeded but Erlang distribution failed
> * suggestion: hostname mismatch?

From this we can see that node2 can contact node1 ("TCP connection succeeded").
As you are provisioning the nodes with Chef, the cookie mismatch is also
very unlikely.

So the nodes have a different idea of what the hostnames are.

Can you please post short and full hostname output from both nodes,
plus /etc/hosts contents?

Stopping epmd processes on both nodes may be worth trying, too, although
I can't tell for sure if it may be aggressively caching something relevant
as TCP connection from 2 to 1 succeeds.

What would also be helpful to try is to run 2 nodes on 2 machines unclustered
and use `rabbitmqctl -n [other node] status` to see if the issue goes both
ways.
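
The checks requested above (short hostname, full hostname, and whether each node can resolve the other) can be sketched in Python as follows. This is only an illustration using the standard `socket` module; `local_names` and `resolve_peers` are hypothetical helper names, and the `resolve` parameter exists only so the lookup can be swapped out.

```python
import socket
from typing import Callable, Dict, Iterable, Optional


def local_names() -> Dict[str, str]:
    """Short and fully qualified hostname as this machine sees itself."""
    hostname = socket.gethostname()
    return {"short": hostname.split(".", 1)[0], "fqdn": socket.getfqdn()}


def resolve_peers(
    peers: Iterable[str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, Optional[str]]:
    """Map each peer hostname to the IP it resolves to on this machine.

    A value of None means the name does not resolve here.
    """
    result: Dict[str, Optional[str]] = {}
    for peer in peers:
        try:
            result[peer] = resolve(peer)
        except OSError:  # socket.gaierror is a subclass of OSError
            result[peer] = None
    return result
```

Run on each node with the short hostnames of all cluster members, for example `resolve_peers(["1-rabbit-aws-development-west", "2-rabbit-aws-development-west"])`. A `None` entry means that machine cannot resolve the name.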

David Montgomery

Aug 26, 2014, 7:25:58 AM
to Michael Klishin, rabbitm...@googlegroups.com
Hi,

Below are the hostnames

more /etc/hostname
1-rabbit-aws-development-west
2-rabbit-aws-development-west

hostname
1-rabbit-aws-development-west



I tried stopping the app on both servers.

I don't get from the docs how to cluster. It does not help that the example clusters 3 nodes on the same server, which is unrealistic. A more realistic example would show how to correctly modify the /etc/hostname and /etc/hosts files for 3 separate IP addresses and hostnames, in addition to any config modifications.

I also tried the following..rabbit will not start.


/etc/rabbitmq/rabbitmq-env.conf
root@2-rabbit-aws-development-west:/etc/rabbitmq# more rabbitmq-env.conf
NODENAME=rab...@2-rabbit-aws-development-west.test.com

root@1-rabbit-aws-development-west:/etc/rabbitmq# service rabbitmq-server start
 * Starting message broker rabbitmq-server
 * FAILED - check /var/log/rabbitmq/startup_\{log, _err\}

Alvaro Videla

Aug 26, 2014, 8:11:00 AM
to David Montgomery, Michael Klishin, rabbitm...@googlegroups.com
Hi David,

When you give this node name, NODENAME=rab...@2-rabbit-aws-development-west.test.com, you are effectively trying to start RabbitMQ (and in this case the Erlang node) using an FQDN, i.e. long names (the "-name" option documented on the Erlang distribution page).

The /etc/hosts file should have something like this:

10.0.0.1 rabbit1
10.0.0.2 rabbit2
10.0.0.3 rabbit3
10.0.0.4 rabbit4

Etc.

Then you can just start RabbitMQ providing the following variable NODENAME=rabbit1 and on the other host NODENAME=rabbit2, and so on.

Regards,

Alvaro


Michael Klishin

Aug 26, 2014, 8:18:03 AM
to David Montgomery, rabbitm...@googlegroups.com


On 26 August 2014 at 15:25:57, David Montgomery (davidmo...@gmail.com) wrote:
> > Below are the hostnames
>
> more /etc/hostname
> 1-rabbit-aws-development-west
> 2-rabbit-aws-development-west

Can you please cat the file and label what output is from what machine?

> hostname
> 1-rabbit-aws-development-west
>
> hostname -f
> 1-rabbit-aws-development-west.test.com 

What about 2-rabbit…? We need to inspect the configs on *both* machines.

> I tried stopping the app on both servers.

It may help to reset the node you are trying to join (node2)
after the restart.

> I also tried the following..rabbit will not start.
> /etc/rabbbitmq/rabbitmq-env.confroot@2-rabbit-aws-development-west:/etc/rabbitmq#
> more rabbitmq-env.conf
> NODENAME=rab...@2-rabbit-aws-development-west.test.com
>
> root@1-rabbit-aws-development-west:/etc/rabbitmq# service
> rabbitmq-server start
> * Starting message broker rabbitmq-server
> * FAILED - check /var/log/rabbitmq/startup_\{log, _err\}

and what is in the log files?

Michael Klishin

Aug 26, 2014, 8:19:08 AM
to David Montgomery, Alvaro Videla, rabbitm...@googlegroups.com
On 26 August 2014 at 16:11:04, Alvaro Videla (videl...@gmail.com) wrote:
> > Then you can just start RabbitMQ providing the following variable
> NODENAME=rabbit1 and on the other host NODENAME=rabbit2, and
> so on.

Clarifying again: NODENAME is used in rabbitmq-env.conf. RABBITMQ_NODENAME
is used on the command line.

Michael Klishin

Aug 26, 2014, 8:28:38 AM
to David Montgomery, rabbitm...@googlegroups.com
On 26 August 2014 at 15:26:03, David Montgomery (davidmo...@gmail.com) wrote:
> hostname
> 1-rabbit-aws-development-west

This is the short hostname.

Add 1-rabbit-aws-development-west and 2-rabbit-aws-development-west
to /etc/hosts on both machines, reset the database (with `rabbitmqctl reset`)
on both machines, kill the epmd process on both machines
and try joining the cluster using that short name
(not the FQDN) like so:

# on node2
sudo rabbitmqctl join_cluster rabbit@1-rabbit-aws-development-west

note: there is no .test.com at the end of the hostname, so it is not a FQDN. 
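
The short name vs. FQDN distinction above can be expressed as a small Python sketch. It relies only on the fact that a short-name node has no dot in the host part after the `@`. `uses_long_name` and `short_node_name` are hypothetical helper names, not part of any RabbitMQ tooling.

```python
def uses_long_name(node: str) -> bool:
    """True if the host part of 'name@host' contains a dot (i.e. looks like an FQDN)."""
    _, _, host = node.partition("@")
    return "." in host


def short_node_name(node: str) -> str:
    """Cut the host part of 'name@host' at the first dot."""
    name, sep, host = node.partition("@")
    if not sep or not name or not host:
        raise ValueError(f"expected name@host, got {node!r}")
    return f"{name}@{host.split('.', 1)[0]}"
```

For example, `short_node_name("rabbit@1-rabbit-aws-development-west.test.com")` returns `"rabbit@1-rabbit-aws-development-west"`, which is the form used in the join_cluster command above.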

David Montgomery

Aug 26, 2014, 11:09:08 AM
to Michael Klishin, rabbitm...@googlegroups.com
yay!  Adding both hostnames to /etc/hosts did the trick!

Thanks so much!

See Ya

Carl Hörberg

Aug 26, 2014, 6:59:02 PM
to rabbitm...@googlegroups.com, davidmo...@gmail.com
On CloudAMQP we do roughly this:

1) Take a cluster name, say "happy-rabbit", and a domain name, say "rmq.cloudamqp.com".
2) Create two ec2 instances.
3) Set the hostname in /etc/hostname to happy-rabbit-01 and happy-rabbit-02 respectively.
4) Set "search" in /etc/resolv.conf to "rmq.cloudamqp.com ec2.internal".
5) To keep it from being overwritten by DHCP on Ubuntu, add 'supersede domain-name "rmq.cloudamqp.com ec2.internal";' to /etc/dhcp/dhclient.conf.
6) Create two CNAMEs, for happy-rabbit-01.rmq.cloudamqp.com and for -02, and point them to the ec2 internal hostname. It will resolve to the private ip when resolved from within ec2 and to the public ip if resolved from outside.
7) Start RabbitMQ on both nodes.
8) On -02 run: sudo rabbitmqctl join_cluster rabbit@happy-rabbit-01
9) If an instance's ip/hostname changes, only change the CNAME for happy-rabbit-0x.rmq.cloudamqp.com. Keep a low TTL.

Just using the /etc/hosts file can be easier if you just want to get started, but you have to log in to all servers when an IP changes. By using DNS you have just one place to make changes.
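
To illustrate why the short name happy-rabbit-01 resolves once "search" is set, here is a simplified Python sketch of how a stub resolver expands a name using the search list. It follows the `ndots` rule described in resolv.conf(5) but ignores every other resolver option. `search_candidates` is a hypothetical helper name.

```python
from typing import List


def search_candidates(name: str, search: List[str], ndots: int = 1) -> List[str]:
    """Names a stub resolver would try for `name`, in order (simplified)."""
    if name.endswith("."):
        return [name]  # already absolute: the search list is not applied
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        return [name] + expanded  # names with enough dots are tried as-is first
    return expanded + [name]
```

With the search list from the steps above, `search_candidates("happy-rabbit-01", ["rmq.cloudamqp.com", "ec2.internal"])` starts with `"happy-rabbit-01.rmq.cloudamqp.com"`, which is the CNAME created for the node.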

Alvaro Videla

Aug 28, 2014, 5:32:10 AM
to David Montgomery, Michael Klishin, rabbitm...@googlegroups.com
Hi David,

Do you think you could write a blog post outlining how you solved the issue? I think this will be really beneficial for the community.

Regards,

Alvaro



David Montgomery

Aug 28, 2014, 12:00:30 PM
to Alvaro Videla, Michael Klishin, rabbitm...@googlegroups.com
Hi,

Actually, thinking about it, I need to get a blog up and running anyway. The challenge was to automate with chef, and I assume a node will fail at any time. If I kill a server in the cluster, my python scripts will automatically boot a new server and join it to the cluster within 10 mins.

1) My python scripts monitor the required node count of a cluster by comparing entries in route53 with active servers in ec2, e.g. 1.rabbit.test.com and 2.rabbit.test.com
2) If 2 dies, remove its entry from route53 and adjust the cluster count to only those that are active
3) Boot a new node
4) In a chef script, get the list of current IP addresses from route53 and add the existing node, along with the nodes from route53, to /etc/hosts
5) When chef is done, do the post-processing that joins the cluster. In production there will be 3 nodes.
6) Works like a charm
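
Step 4 can be sketched in Python as a function that renders /etc/hosts lines from a set of DNS records. The input mapping (FQDN to private IP) stands in for whatever the Route 53 lookup returns. The IPs below are placeholders and `hosts_lines` is a hypothetical helper name, so this is an illustration rather than the actual chef recipe.

```python
from typing import Dict, List


def hosts_lines(records: Dict[str, str]) -> List[str]:
    """Render '<ip> <fqdn> <short name>' lines, one per node, sorted by FQDN."""
    lines = []
    for fqdn, ip in sorted(records.items()):
        short = fqdn.split(".", 1)[0]  # short name = everything before the first dot
        lines.append(f"{ip} {fqdn} {short}")
    return lines


# Placeholder records; real values would come from the DNS zone.
example = {
    "1-rabbit-aws-development-west.test.com": "10.0.0.1",
    "2-rabbit-aws-development-west.test.com": "10.0.0.2",
}
```

Each line carries both the FQDN and the short name, so the short names that the cluster nodes use for each other resolve on every machine.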

How to resolve the issue all makes sense now, because it parallels the network setup for my cloudera hbase cluster, where the cloudera docs clearly specify how to set up the hostname and /etc/hosts files.

It will take a week or two to get my blog up and running, but I will provide more details of my cluster automation if anyone is interested.

See Ya





Atul Sharma

May 7, 2016, 5:37:38 AM
to rabbitmq-users
I went through the same issue while configuring an HA cluster on AWS. We were able to resolve it by putting 0.0.0.0 or localhost instead of the FQDN. Try it; I hope it works for you as well.