[GENERAL] Postgres Clustering Options

David Kerr

unread,

Nov 11, 2009, 12:28:36 PM11/11/09

to

I'm trying to meet a very high uptime requirement in a high performance environment.

to do this we will need to have some form of cluster for our databases

What I plan on doing is:

Postgres installed on a Cluster configured in active/passive (both pointing to the same SAN
(If PG or the OS fails we trigger a failover to the passive node)

Log shipping between that cluster and a single PG Instance off site.

Is this a common/reccomended method of handling clusterin with Postgres? google searches
basically point to using a replication based solution, which i don't think would meet my
performance demands.

Does anyone have expereince with this or a similar setup that they could share with me?

Thanks

Dave

--
Sent via pgsql-general mailing list (pgsql-...@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Ben Chobot

unread,

Nov 11, 2009, 12:35:35 PM11/11/09

to

What are you trying to protect against? Software failure? Hardware
failure? Both?

Depending on your budget, you could theoretically point any number of
failover nodes at a san, so long as you make sure only one of them is
running postgres at a time. Of course, you still have the single point
of failure in the SAN. If you aren't made of money and are running
linux, we've found DRBD is a great way to cluster two machines and it
avoids a few single points of failure. But you limit yourself to two or
three cluster nodes.

What are you trying to achieve with your offsite node? Is it supposed to
pick up the load if the cluster dies?

John R Pierce

unread,

Nov 11, 2009, 12:40:21 PM11/11/09

to

David Kerr wrote:
> I'm trying to meet a very high uptime requirement in a high performance environment.
>
> to do this we will need to have some form of cluster for our databases
>
> What I plan on doing is:
>
> Postgres installed on a Cluster configured in active/passive (both pointing to the same SAN
> (If PG or the OS fails we trigger a failover to the passive node)
>
> Log shipping between that cluster and a single PG Instance off site.
>
>
> Is this a common/reccomended method of handling clusterin with Postgres? google searches
> basically point to using a replication based solution, which i don't think would meet my
> performance demands.
>
> Does anyone have expereince with this or a similar setup that they could share with me?
>

thats your classic database cluster. the reason you don't see
much of that in online writeups is that the high availability SAN
hardware is expensive

presumably you'd manage this with classic cluster managemetn software
(veritas cluster, sun cluster, redhat cluster, heartbeat, or whatever is
appropriate to your environment. commercial cluster vendors
generally recommend doing the cluster 'heartbeat' over at least two
seperate network links so that a network failure doesn't trigger a false
failover. implementing 'fencing' in your storage switch is also a
very good idea, most fencing systems can send commands to common
fiberchannel switches to disable the access port or soft zone of the
current standby server so ti can't accidentally mount the storage.

your applications should be tolerant of database server disconnects, and
know how to reconnect and restart the transaction that was in progress.

John R Pierce

unread,

Nov 11, 2009, 12:51:59 PM11/11/09

to

Ben Chobot wrote:
> Of course, you still have the single point of failure in the SAN.

a proper SAN has two switches, each host connected to it has two HBA
interfaces, there are two redundant storage controllers with mirrored
cache, dual paths from each controller to all the storage, and redundant
power supplies throughout. there is no single point of failure
there. of course, none of this is cheap.

Greg Smith

unread,

Nov 11, 2009, 1:05:58 PM11/11/09

to

John R Pierce wrote:
> presumably you'd manage this with classic cluster managemetn software
> (veritas cluster, sun cluster, redhat cluster, heartbeat, or whatever
> is appropriate to your environment.

I've seen or heard of successful implementations like this done with
Veritas, Sun, RedHat, and Linux HA. This topic is actually addressed
briefly at both http://www.postgresql.org/about/press/faq and
http://www.enterprisedb.com/products/allfaq.do ; Sun even mentions
PostgreSQL support as a specific strength of their product at
http://blogs.sun.com/SC/entry/migrating_from_veritas_cluster_server

I think you don't hear about this much here because the sort of
companies who have the cash for this style of deployment are hiring
commercial vendors and private consultants under NDA rather than relying
on the PostgreSQL community. That's why I can't say more about the one
project like this I did, that featured a shared SAN and commercial
cluster software.

--
Greg Smith gr...@2ndQuadrant.com Baltimore, MD

Greg Smith

unread,

Nov 11, 2009, 1:11:52 PM11/11/09

to

David Kerr wrote:
> Postgres installed on a Cluster configured in active/passive (both pointing to the same SAN
> (If PG or the OS fails we trigger a failover to the passive node)
> Log shipping between that cluster and a single PG Instance off site.
> Is this a common/reccomended method of handling clusterin with Postgres? google searches
> basically point to using a replication based solution, which i don't think would meet my
> performance demands.
>

The part I'm having trouble with here is how it is you expect to keep a
remote node up to date with log-shipping, but then reject log-shipping
based replication as not high enough performance for you? The classic
problem with log-shipping in PostgreSQL is that you've got a single
recovery process trying to replay the work of what many workers did on
the master, and that can turn into a potential lag problem as volume
spikes upwards. If you don't expect a standby is going to be able to
keep up with your volume due to that issue, the remote one is going to
be even worse though.

--
Greg Smith gr...@2ndQuadrant.com Baltimore, MD

David Kerr

unread,

Nov 11, 2009, 1:13:05 PM11/11/09

to

On Wed, Nov 11, 2009 at 09:40:21AM -0800, John R Pierce wrote:
- David Kerr wrote:
- >Does anyone have expereince with this or a similar setup that they could
- >share with me?
- >
-
- thats your classic database cluster. the reason you don't see
- much of that in online writeups is that the high availability SAN
- hardware is expensive
-
- presumably you'd manage this with classic cluster managemetn software
- (veritas cluster, sun cluster, redhat cluster, heartbeat, or whatever is
- appropriate to your environment. commercial cluster vendors
- generally recommend doing the cluster 'heartbeat' over at least two
- seperate network links so that a network failure doesn't trigger a false
- failover. implementing 'fencing' in your storage switch is also a
- very good idea, most fencing systems can send commands to common
- fiberchannel switches to disable the access port or soft zone of the
- current standby server so ti can't accidentally mount the storage.
-
- your applications should be tolerant of database server disconnects, and
- know how to reconnect and restart the transaction that was in progress.

I'll look into Fencing this is the first i've heard of that. But everything
else you mentioned is exactly how I planned on doing it. so that's good
news =)

Thanks!

Dave

David Kerr

unread,

Nov 11, 2009, 1:16:34 PM11/11/09

to

On Wed, Nov 11, 2009 at 01:11:52PM -0500, Greg Smith wrote:
- David Kerr wrote:

- >Postgres installed on a Cluster configured in active/passive (both
- >pointing to the same SAN
- >(If PG or the OS fails we trigger a failover to the passive node)
- >Log shipping between that cluster and a single PG Instance off site.
- >Is this a common/reccomended method of handling clusterin with Postgres?
- >google searches
- >basically point to using a replication based solution, which i don't think
- >would meet my performance demands.
- >
- The part I'm having trouble with here is how it is you expect to keep a
- remote node up to date with log-shipping, but then reject log-shipping
- based replication as not high enough performance for you? The classic
- problem with log-shipping in PostgreSQL is that you've got a single
- recovery process trying to replay the work of what many workers did on
- the master, and that can turn into a potential lag problem as volume
- spikes upwards. If you don't expect a standby is going to be able to
- keep up with your volume due to that issue, the remote one is going to
- be even worse though.

We'd fail over to the standby db (recipient of the log shipping) in the case
that our hosting center was nuked. those are considered "extreme" circumstances
and we have a higher RTO/RPO in those cases.

The apps actually aren't as robust as the DB in this case, so i'll have time to
replay all of the logs that made it before "the big one" while those are being
configured to come up. and if it does take longer that's not a huge issue
i'll have a few hours to get 100% caught up.

Dave

David Kerr

unread,

Nov 11, 2009, 1:19:08 PM11/11/09

to

On Wed, Nov 11, 2009 at 09:35:35AM -0800, Ben Chobot wrote:
- What are you trying to protect against? Software failure? Hardware
- failure? Both?
-
- Depending on your budget, you could theoretically point any number of
- failover nodes at a san, so long as you make sure only one of them is
- running postgres at a time. Of course, you still have the single point
- of failure in the SAN. If you aren't made of money and are running
- linux, we've found DRBD is a great way to cluster two machines and it
- avoids a few single points of failure. But you limit yourself to two or
- three cluster nodes.

Protecting against both hardware and software failure.

SAN failure would be handled by the offsite node, but we've got a pretty robust
SAN, (I don't have all of the details) so it may even not have a single point
of failure.

We tried out DRBD and the performance impact was pretty sigificant. our
app is very sensitive to any performance hitch so I just can't see any
form of replication working for us.

Greg Smith

unread,

Nov 11, 2009, 1:35:58 PM11/11/09

to

David Kerr wrote:
> The apps actually aren't as robust as the DB in this case, so i'll have time to
> replay all of the logs that made it before "the big one" while those are being
> configured to come up. and if it does take longer that's not a huge issue
> i'll have a few hours to get 100% caught up.
>

It sounds like you've got the basics nailed down here and are on a well
trod path, just one not one documented publicly very well. Since you
said that even DRBD was too much overhead for you, I think a dive into
evaluating the commercial clustering approaches (or the free LinuxHA
that RedHat's is based on, which I haven't been real impressed by) would
be appropriate. The hard part is generally getting a heartbeat between
the two servers sharing the SAN that is both sensitive enough to catch
failures while not being so paranoid that it fails over needlessly (say,
when load spikes on the primary and it slows down). Make sure you test
that part out very carefully with any vendor you evaluate.

As far as the PostgreSQL specifics go, you need a solid way to ensure
you've disconnected the now defunct master from the SAN (the classic
"shoot the other node in the head" problem). All you *should* have to
do is start the database again on the backup after doing that. That
will come up as a standard crash, run through WAL replay crash recovery,
and the result should be no different than had you restarted after a
crash on the original node. The thing you cannot let happen is allowing
the original master to continue writing to the shared SAN volume once
that transition has happened.

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
gr...@2ndQuadrant.com www.2ndQuadrant.com

David Kerr

unread,

Nov 11, 2009, 2:01:41 PM11/11/09

to

On Wed, Nov 11, 2009 at 01:35:58PM -0500, Greg Smith wrote:
- David Kerr wrote:

- >The apps actually aren't as robust as the DB in this case, so i'll have
- >time to
- >replay all of the logs that made it before "the big one" while those are
- >being
- >configured to come up. and if it does take longer that's not a huge issue
- >i'll have a few hours to get 100% caught up.
- >
- It sounds like you've got the basics nailed down here and are on a well
- trod path, just one not one documented publicly very well. Since you
- said that even DRBD was too much overhead for you, I think a dive into
- evaluating the commercial clustering approaches (or the free LinuxHA
- that RedHat's is based on, which I haven't been real impressed by) would
- be appropriate. The hard part is generally getting a heartbeat between
- the two servers sharing the SAN that is both sensitive enough to catch
- failures while not being so paranoid that it fails over needlessly (say,
- when load spikes on the primary and it slows down). Make sure you test
- that part out very carefully with any vendor you evaluate.
-
- As far as the PostgreSQL specifics go, you need a solid way to ensure
- you've disconnected the now defunct master from the SAN (the classic
- "shoot the other node in the head" problem). All you *should* have to
- do is start the database again on the backup after doing that. That
- will come up as a standard crash, run through WAL replay crash recovery,
- and the result should be no different than had you restarted after a
- crash on the original node. The thing you cannot let happen is allowing
- the original master to continue writing to the shared SAN volume once
- that transition has happened.

Thanks Greg that sounds good! and puts my (and my management's) concerns at ease!

Dave

John R Pierce

unread,

Nov 11, 2009, 2:05:32 PM11/11/09

to

Greg Smith wrote:
> It sounds like you've got the basics nailed down here and are on a
> well trod path, just one not one documented publicly very well. Since
> you said that even DRBD was too much overhead for you, I think a dive
> into evaluating the commercial clustering approaches (or the free
> LinuxHA that RedHat's is based on, which I haven't been real impressed
> by) would be appropriate. The hard part is generally getting a
> heartbeat between the two servers sharing the SAN that is both
> sensitive enough to catch failures while not being so paranoid that it
> fails over needlessly (say, when load spikes on the primary and it
> slows down). Make sure you test that part out very carefully with any
> vendor you evaluate.

hence the 'multiple dedicated heartbeat networks' previously suggested.

a typical cluster server has a quad ethernet, 2 ports (802.3ad link
aggregation w/ failover) for the LAN, and 2 dedicated for the heartbeat,
then a dual HBA for the SAN. the heartbeats can run over crossover
cables, even 10baseT is plenty as the traffic volume is quite low, it
just needs low latency and no possibility of congestion.

I setup the RHCS aka CentOS Cluster in a test lab environment... it
seemed to work well enough. I was using FC storage via a QLogic SANbox
5600 switch, which was supported by RHCS as a fencing device...

Note that ALL of the storage used by the cluster servers on the SAN
should be under cluster management as the 'standby' server won't see any
of it when its fenced (I implemented fencing via FC port disable).
This is can be an issue when you want to do rolling upgrades (update the
standby server, force a failover, update the previous master). I
built each cluster node with its own direct attached mirrored storage
for the OS and software.

> As far as the PostgreSQL specifics go, you need a solid way to ensure
> you've disconnected the now defunct master from the SAN (the classic
> "shoot the other node in the head" problem). All you *should* have to
> do is start the database again on the backup after doing that. That
> will come up as a standard crash, run through WAL replay crash
> recovery, and the result should be no different than had you restarted
> after a crash on the original node. The thing you cannot let happen
> is allowing the original master to continue writing to the shared SAN
> volume once that transition has happened.
>

which is what 'storage fencing' prevents.

Mikko Partio

unread,

Nov 12, 2009, 12:50:06 AM11/12/09

to

On Wed, Nov 11, 2009 at 7:28 PM, David Kerr <d...@mr-paradox.net> wrote:

What I plan on doing is:

Postgres installed on a Cluster configured in active/passive (both pointing to the same SAN
(If PG or the OS fails we trigger a failover to the passive node)

Is this a common/reccomended method of handling clusterin with Postgres? google searches
basically point to using a replication based solution, which i don't think would meet my
performance demands.

Does anyone have expereince with this or a similar setup that they could share with me?

We have done a setup like this with Red Hat Cluster Suite.

We are quite happy with the setup in general, and it has been working well even in 'unexpected circumstances' (power outages etc). The only thing I'd change in this setup if I could is the cluster software: RHCS is not mature enough and it seems every release contains new critical bugs, and sometimes even mission-critical components such as quorum disk do not work after an upgrade.

Regards

Mikko

David Kerr

unread,

Nov 12, 2009, 6:47:44 PM11/12/09

to

On Thu, Nov 12, 2009 at 07:50:06AM +0200, Mikko Partio wrote:
- On Wed, Nov 11, 2009 at 7:28 PM, David Kerr <d...@mr-paradox.net> wrote:
- > basically point to using a replication based solution, which i don't think
- > would meet my
- > performance demands.
- >
- > Does anyone have expereince with this or a similar setup that they could

- > share with me?
- >
-
-

- We have done a setup like this with Red Hat Cluster Suite.
-
- We are quite happy with the setup in general, and it has been working well
- even in 'unexpected circumstances' (power outages etc). The only thing I'd
- change in this setup if I could is the cluster software: RHCS is not mature
- enough and it seems every release contains new critical bugs, and sometimes
- even mission-critical components such as quorum disk do not work after an
- upgrade.
-
- Regards
-
- Mikko

Hi Mikko,

In your enviornment, are the applications able to recover automatically after
a DB failover?

For exmaple, we're using Java/JDBC connections +Geronimo we're researching whether
or not JDBC/Geronimo would be able to retry in the case of losing a connection to
the DB vs failing and crashing the app.

Mikko Partio

unread,

Nov 13, 2009, 2:32:39 AM11/13/09

to

On Fri, Nov 13, 2009 at 1:47 AM, David Kerr <d...@mr-paradox.net> wrote:

In your enviornment, are the applications able to recover automatically after
a DB failover?

For exmaple, we're using Java/JDBC connections +Geronimo we're researching whether
or not JDBC/Geronimo would be able to retry in the case of losing a connection to
the DB vs failing and crashing the app.

Since the system is active/passive, a failover *will* disconnect all sessions. We have coded our applications so that if they lose connection to the database, they will automatically try to reconnect.

Regards

Mikko

David Kerr

unread,

Nov 17, 2009, 7:48:13 PM11/17/09

to

On Thu, Nov 12, 2009 at 07:32:24PM -0900, Joshua J. Kugler wrote:
- On Wednesday 11 November 2009, David Kerr said something like:
- > I'm trying to meet a very high uptime requirement in a high
- > performance environment.
-
- If you don't mind Xen, have you considered:
-
- http://dsg.cs.ubc.ca/remus/ System mirroring/hot standby, with instant
- failover, complete with any open TCP/IP connections, etc.
-
- With paravirtualization, the performance hit shouldn't be very big.
-
- j

Hehe, that opens up a conversation that got closed on me long ago. =)
The I/O hit though would be tough for me though.

thanks!