Bug#418210: Fwd: Bug#418210: heartbeat-2: /etc/ha.d/authkeys should not determine which nodes are in the cluster

Simon Horman

Apr 8, 2007, 3:00:12 AM

This seems to be a bit of an easy trap to fall into.
Are there any fixes floating around? I was thinking
that perhaps a cluster id of some sort would be a good
idea. But I'm not sure.

--
Horms
H: http://www.vergenet.net/~horms/
W: http://www.valinux.co.jp/en/

----- Forwarded message from Russell Coker <rus...@coker.com.au> -----

Subject: Bug#418210: heartbeat-2: /etc/ha.d/authkeys should not determine which nodes are in the cluster
From: Russell Coker <rus...@coker.com.au>
To: Debian Bug Tracking System <sub...@bugs.debian.org>
Date: Sun, 08 Apr 2007 10:53:02 +1000

Package: heartbeat-2
Version: 2.0.8-1
Severity: normal

Currently if you have two clusters using broadcast heartbeats on the same
network and they have the same contents of /etc/ha.d/authkeys then Heartbeat
will get confused as to which nodes are in the cluster.

The "node" config directive determines which nodes are permitted in the
cluster. This should be authoritative, and any node which isn't listed with
a node statement should not be permitted to join.

It's not uncommon to configure multiple clusters on one VLAN. It's also common
to duplicate servers by copying the hard drive and changing the relevant config
file settings. When duplicating a server in such a manner it's common to leave
the passwords unchanged.

http://www.linux-ha.org/authkeys

The above URL says "The authkeys configuration file contains information for
Heartbeat to use when authenticating cluster members". Authentication and
authorisation are separate issues. The current implementation apparently uses
the authkeys file for authorisation as well as authentication; authorisation
should be determined only by the node line in ha.cf.

-- System Information:
Debian Release: 4.0
APT prefers testing
APT policy: (500, 'testing')
Architecture: i386 (i686)
Shell: /bin/sh linked to /bin/bash
Kernel: Linux 2.6.18-3-xen-686
Locale: LANG=en_AU.UTF-8, LC_CTYPE=en_AU.UTF-8 (charmap=UTF-8)

----- End forwarded message -----

Simon Horman

Apr 8, 2007, 5:10:08 AM

[ Reposting as I sent it to linux-ha-devel instead of
linux-ha-devel the first time around ]

Russell Coker

Apr 8, 2007, 5:20:10 AM

On Sunday 08 April 2007 16:46, Simon Horman <ho...@verge.net.au> wrote:
> This seems to be a bit of an easy trap to fall into.
> Are there any fixes floating around? I was thinking
> that perhaps a cluster id of some sort would be a good
> idea. But I'm not sure.

There is a cluster ID stored in the CIB. However that is going to be copied
if you copy both nodes including configuration.

The ha.cf file already lists all nodes that are in the cluster via the "node"
directive. Surely if a node calling itself "foo" asks to join the cluster
then regardless of whether it has a suitable auth key it should not be
accepted if the list of valid nodes includes no "foo".

Even if you have the case of a valid node in the cluster having the wrong name
due to a configuration error you can't keep a valid configuration if you
allow it to join as it makes the process of determining quorum difficult.
It's impossible to know whether it's a backup copy of a node or a mis-named
node. Allowing a machine with the wrong name to join and then rejecting a
machine with the right name because the number of nodes specified in the
config file have already joined (as is currently the case) is just wrong.
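
The behaviour argued for above can be sketched in a few lines of Python.
This is only an illustration of keeping the two checks separate, not
Heartbeat's actual code: the function names, the message layout and the use
of HMAC-SHA1 are assumptions made for the example.

```python
import hashlib
import hmac


def sign(key: bytes, sender: str, payload: bytes) -> str:
    """Hex HMAC-SHA1 over the claimed node name and the payload (hypothetical layout)."""
    return hmac.new(key, sender.encode() + b"\n" + payload, hashlib.sha1).hexdigest()


def may_join(sender: str, payload: bytes, digest: str,
             key: bytes, allowed_nodes: set[str]) -> tuple[bool, str]:
    """Decide whether a join request is accepted, checking authn before authz."""
    # Authentication: does the sender hold the shared secret from authkeys?
    if not hmac.compare_digest(sign(key, sender, payload), digest):
        return False, "failed authentication"
    # Authorisation: is the claimed name listed by a "node" directive in ha.cf?
    if sender not in allowed_nodes:
        return False, "authenticated but not a configured node"
    return True, "accepted"


if __name__ == "__main__":
    key = b"shared-secret"      # same authkeys contents on both clusters
    nodes = {"ha1", "ha2"}      # names from the "node" lines of this cluster
    msg = b"status=up"
    for name in ("ha1", "ha1-unstable"):
        print(name, may_join(name, msg, sign(key, name, msg), key, nodes))
```

With a copied key, "ha1-unstable" passes the first check but is still turned
away by the second, which is the outcome the bug report asks for.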

Russell Coker

Apr 8, 2007, 9:00:13 AM

On Sunday 08 April 2007 19:14, Russell Coker <rus...@coker.com.au> wrote:
> The ha.cf file already lists all nodes that are in the cluster via the
> "node" directive. Surely if a node calling itself "foo" asks to join the
> cluster then regardless of whether it has a suitable auth key it should not
> be accepted if the list of valid nodes includes no "foo".

Apr 8 22:52:35 ha2 heartbeat: [2929]: WARN: string2msg_ll: node [ha2-unstable] failed authentication
Apr 8 22:52:35 ha2 heartbeat: [2929]: WARN: string2msg_ll: node [ha1-unstable] failed authentication
Apr 8 22:52:35 ha2 heartbeat: [2929]: WARN: string2msg_ll: node [ha1-unstable] failed authentication
Apr 8 22:52:36 ha2 heartbeat: [2929]: WARN: string2msg_ll: node [ha2-unstable] failed authentication

For even more annoyance I get the above messages repeatedly in my syslog when
I use different auth values.

Running two clusters on the same VLAN is not going to be viable until after
this bug is fixed. Do you plan to run a back-ports repository for newer
versions of Heartbeat on Etch after the release of Etch? If you are
considering such things then this makes a good reason IMHO.

Lars Marowsky-Bree

Apr 8, 2007, 10:20:09 AM

On 2007-04-08T02:01:45, Simon Horman <ho...@verge.net.au> wrote:

> [ Reposting as I sent it to linux-ha-devel instead of
> linux-ha-devel the first time around ]
>
> This seems to be a bit of an easy trap to fall into.
> Are there any fixes floating around? I was thinking
> that perhaps a cluster id of some sort would be a good
> idea. But I'm not sure.

It's only used as authorisation when "autojoin" is enabled. That is the
whole point of the autojoin method.

Users should not run distinct clusters on the same network media. (ie,
the same subnets + port when using bcast, nor the same mcast address +
port.)

When autojoin is not enabled, this will cause a bunch of errors. If they
have distinct shared secrets, they'll bail at the authentication step.

If they have _both_ the same key _and_ the same media _plus_ autojoin
enabled, they'll merge into one big cluster.

Sincerely,
Lars

--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
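
The cases described in the message above can be laid out as a small decision
function. This is a reading of that description only, not code taken from
Heartbeat; the function name and the result strings are made up for the
sketch.

```python
from itertools import product


def interaction(same_media: bool, same_key: bool, autojoin: bool) -> str:
    """Model how two clusters affect each other, per the description above."""
    if not same_media:
        # Different subnet/port (bcast) or mcast address/port: packets never meet.
        return "no interaction"
    if not same_key:
        # Distinct shared secrets: messages are dropped at the authentication step.
        return "authentication failures"
    if not autojoin:
        # Same key, but the sender is not in the "node" list: errors are logged.
        return "errors for unlisted nodes"
    # Same media, same key and autojoin enabled.
    return "clusters merge"


if __name__ == "__main__":
    for media, key, auto in product((False, True), repeat=3):
        print(f"same_media={media!s:5} same_key={key!s:5} "
              f"autojoin={auto!s:5} -> {interaction(media, key, auto)}")
```

Only the last combination joins the two clusters; every other combination on
shared media produces log noise rather than a merge.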

Simon Horman

Apr 8, 2007, 3:50:09 PM

On Sun, Apr 08, 2007 at 08:14:03PM +1100, Russell Coker wrote:
> On Sunday 08 April 2007 16:46, Simon Horman <ho...@verge.net.au> wrote:
> > This seems to be a bit of an easy trap to fall into.
> > Are there any fixes floating around? I was thinking
> > that perhaps a cluster id of some sort would be a good
> > idea. But I'm not sure.
>
> There is a cluster ID stored in the CIB. However that is going to be
> copied if you copy both nodes including configuration.
>
> The ha.cf file already lists all nodes that are in the cluster via the
> "node" directive. Surely if a node calling itself "foo" asks to join
> the cluster then regardless of whether it has a suitable auth key it
> should not be accepted if the list of valid nodes includes no "foo".
>
> Even if you have the case of a valid node in the cluster having the
> wrong name due to a configuration error you can't keep a valid
> configuration if you allow it to join as it makes the process of
> determining quorum difficult. It's impossible to know whether it's a
> backup copy of a node or a mis-named node. Allowing a machine with
> the wrong name to join and then rejecting a machine with the right
> name because the number of nodes specified in the config file have
> already joined (as is currently the case) is just wrong.

Yes, I agree that sounds a bit silly. I'm actually surprised that
what you describe is going on. Hopefully someone on the linux-ha-dev
list can explain in a little detail what is supposed to occur in
this kind of situation, and we can take things from there.

--
Horms
H: http://www.vergenet.net/~horms/
W: http://www.valinux.co.jp/en/

Simon Horman

Apr 8, 2007, 3:50:08 PM

On Sun, Apr 08, 2007 at 11:54:42PM +1100, Russell Coker wrote:
> On Sunday 08 April 2007 19:14, Russell Coker <rus...@coker.com.au> wrote:
> > The ha.cf file already lists all nodes that are in the cluster via the
> > "node" directive. Surely if a node calling itself "foo" asks to join the
> > cluster then regardless of whether it has a suitable auth key it should not
> > be accepted if the list of valid nodes includes no "foo".
>
> Apr 8 22:52:35 ha2 heartbeat: [2929]: WARN: string2msg_ll: node [ha2-unstable] failed authentication
> Apr 8 22:52:35 ha2 heartbeat: [2929]: WARN: string2msg_ll: node [ha1-unstable] failed authentication
> Apr 8 22:52:35 ha2 heartbeat: [2929]: WARN: string2msg_ll: node [ha1-unstable] failed authentication
> Apr 8 22:52:36 ha2 heartbeat: [2929]: WARN: string2msg_ll: node [ha2-unstable] failed authentication
>
> For even more annoyance I get the above messages repeatedly in my syslog when
> I use different auth values.

I believe that the work-around of choice is to run different clusters
on different ports.

> Running two clusters on the same VLAN is not going to be viable until after
> this bug is fixed. Do you plan to run a back-ports repository for newer
> versions of Heartbeat on Etch after the release of Etch? If you are
> considering such things then this makes a good reason IMHO.

Yes, I usually make newer versions available in backports as they
are released upstream.
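
The port work-around mentioned above lends itself to a quick check before a
cloned cluster is brought up. The sketch below is a hypothetical helper, not
part of Heartbeat. It assumes both clusters sit on the same broadcast domain,
that ha.cf sets the port with a single "udpport" line, and that 694 is used
when the line is absent.

```python
DEFAULT_UDPPORT = 694  # assumed default when ha.cf has no "udpport" line


def udpport_of(ha_cf: str) -> int:
    """Return the port named by the "udpport" directive in an ha.cf text."""
    port = DEFAULT_UDPPORT
    for line in ha_cf.splitlines():
        words = line.split("#", 1)[0].split()   # drop comments, then tokenise
        if len(words) >= 2 and words[0] == "udpport":
            port = int(words[1])
    return port


def ports_collide(cf_a: str, cf_b: str) -> bool:
    """True if two clusters on one broadcast domain would share a port."""
    return udpport_of(cf_a) == udpport_of(cf_b)


if __name__ == "__main__":
    stable = "bcast eth0\nnode ha1\nnode ha2\n"
    unstable = "udpport 695  # cloned cluster\nbcast eth0\nnode ha1-unstable\nnode ha2-unstable\n"
    print(ports_collide(stable, stable))     # same default port on both
    print(ports_collide(stable, unstable))   # cloned cluster moved to 695
```

A collision reported here means the two clusters will see each other's
heartbeats, whatever their authkeys contain.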

Alan Robertson

Apr 10, 2007, 9:50:10 AM

Simon Horman wrote:
> [ Reposting as I sent it to linux-ha-devel instead of
> linux-ha-devel the first time around ]
>
> This seems to be a bit of an easy trap to fall into.
> Are there any fixes floating around? I was thinking
> that perhaps a cluster id of some sort would be a good
> idea. But I'm not sure.

With only one role possible ("cluster member"), the distinction between
authentication and authorization is very small.

With only one role possible, it isn't completely clear what the value is of
having someone be authenticated but not authorized. If they aren't
authorized as "cluster member", then they have NO role they can play in
the cluster.

If you're a cluster member, then you're a full peer. If you're not a
cluster member, then you're nobody.

But, what value there is, is implemented by the "node" directive for
those who don't want to use autojoin. If you're authenticated but not
authorized by this mechanism, it's almost certainly an error, so we
print error messages for such communication.

With autojoin enabled, if you're authenticated as cluster member, then
you are also authorized to take the role of "cluster member".

The two only truly need to be separate when there is more than one role
possible.

We don't have more than one possible role for this communication
mechanism - and we won't have from this authentication source.

If you make the mistake described in the email, and you don't change the
host name either, then you're completely screwed - and adding some kind
of authorization mechanism won't help you. Because cloning it onto a
new machine is indistinguishable from restoring it onto replacement
hardware for something broken.

So, you're probably going to change the system name. While you're at
it, turn off heartbeat or fix the configuration.

The moral of the story is, if you're going to be a system administrator,
you need to know how to do some things properly, and how to recover from
them when you screw up. Security mechanisms are NOT designed to keep
admins from screwing up. They're designed to keep bad guys out. If
admins with root privileges are going to screw up, security mechanisms
are not going to make everything happy.

I'd love to avoid this problem in general - if it were easy.

But if the best we can do is raise the overhead of managing the systems,
rewrite the software, and make everything else harder to avoid one case
where the admin screws up, but leave many other cases uncovered, then
I'm not interested. And, I suspect that's the best we can do.

Anyone who has a concrete proposal for how this can be fixed for all
cases correctly without a complete rewrite of the communications layer
is encouraged to suggest it.

Horms might have made the beginnings of such a proposal, but I didn't
understand what he said.

--
Alan Robertson <al...@unix.sh>

"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce

Simon Horman

Apr 17, 2007, 3:50:11 AM

On Sun, Apr 08, 2007 at 04:10:26PM +0200, Lars Marowsky-Bree wrote:
> On 2007-04-08T02:01:45, Simon Horman <ho...@verge.net.au> wrote:
>
> > [ Reposting as I sent it to linux-ha-devel instead of
> > linux-ha-devel the first time around ]
> >
> > This seems to be a bit of an easy trap to fall into.
> > Are there any fixes floating around? I was thinking
> > that perhaps a cluster id of some sort would be a good
> > idea. But I'm not sure.
>
> It's only used as authorisation when "autojoin" is enabled. That is the
> whole point of the autojoin method.

Russell, does turning off autojoin give you the behaviour that
you were originally expecting?

--
Horms
H: http://www.vergenet.net/~horms/
W: http://www.valinux.co.jp/en/
