WAN replication monitoring and failure behaviors


Jeff T

May 8, 2012, 12:08:33 PM
to Hazelcast
A few questions concerning WAN replication:

1. Is there a class/method or MBean that indicates whether WAN
replication is active and working?
From an operations-monitoring standpoint, folks will want to be able
to detect whether or not a WAN replication connection is running
successfully, and alert if WAN replication is failing for whatever
reason.

2. Is there a way to disable/enable WAN replication in the XML config
or programmatically after the Hazelcast cluster is started?

3. Is there a way to add more WAN replication targets to a Hazelcast
cluster that is already up and running?

4. What is the behavior of WAN replication if there is a temporary
disruption in network connectivity between the two nodes for minutes
or hours, and connectivity is then restored?

5. What is the behavior of WAN replication if the remote cluster is
completely rebooted, such that every node of the remote cluster is
shut down and then comes back up, creating a new Hazelcast cluster?

6. What is the behavior of WAN replication if it is configured to a
non-existent remote cluster (for example, in dev/test environments)?

Apologies in advance for the barrage of questions, but the answers
should be useful for anyone looking to use Hazelcast WAN
replication. Thanks!

-Jeff

Talip Ozturk

May 8, 2012, 3:54:50 PM
to haze...@googlegroups.com
> 1. Is there a class/method or MBean that indicates whether WAN
> replication is active and working?
> From an operations-monitoring standpoint, folks will want to be able
> to detect whether or not a WAN replication connection is running
> successfully, and alert if WAN replication is failing for whatever
> reason.

No MBean support yet.

> 2. Is there a way to disable/enable WAN replication in the XML config
> or programmatically after the Hazelcast cluster is started?

Not yet.

> 3. Is there a way to add more WAN replication targets to a Hazelcast
> cluster that is already up and running?

Not yet.

> 4. What is the behavior of WAN replication if there is a temporary
> disruption in network connectivity between the two nodes for minutes
> or hours, and connectivity is then restored?

WAN replication tolerates short disconnections. The current (default)
implementation queues up the updates while the connection is down.
After the connection is restored, all queued updates are applied.

> 5. What is the behavior of WAN replication if the remote cluster is
> completely rebooted, such that every node of the remote cluster is
> shut down and then comes back up, creating a new Hazelcast cluster?

Again, with the current (default) implementation, the rebooted cluster
will lose its in-memory data, and the other cluster will push the
updates when the rebooted cluster comes back up.

> 6. What is the behavior of WAN replication if it is configured to a
> non-existent remote cluster (for example, in dev/test environments)?

The active cluster will keep trying to connect to the remote cluster.
Each attempt will fail, and all updates are queued up in the meantime.
The queue is bounded, so it doesn't cause an OutOfMemoryError.
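
To illustrate the bounded-queue behavior described here, below is a minimal Java sketch. It is not Hazelcast's actual WAN replication code: the class name, the methods, and the choice to drop the incoming update when the queue is full are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration only; NOT Hazelcast's actual WAN replication code.
final class BoundedUpdateQueueSketch {
    private final BlockingQueue<String> pending;
    private long dropped; // updates rejected because the queue was full

    BoundedUpdateQueueSketch(int capacity) {
        this.pending = new ArrayBlockingQueue<>(capacity);
    }

    // Called while the remote cluster is unreachable.
    // Assumption: when the queue is full, the incoming update is dropped.
    boolean enqueue(String update) {
        boolean accepted = pending.offer(update); // non-blocking; false when full
        if (!accepted) {
            dropped++;
        }
        return accepted;
    }

    // Called once the connection is established: hand over queued updates in order.
    List<String> drain() {
        List<String> batch = new ArrayList<>();
        pending.drainTo(batch);
        return batch;
    }

    int queued() {
        return pending.size();
    }

    long dropped() {
        return dropped;
    }

    public static void main(String[] args) {
        // Tiny capacity for the demo; the thread is about a much larger queue.
        BoundedUpdateQueueSketch queue = new BoundedUpdateQueueSketch(3);
        for (int i = 1; i <= 5; i++) {
            queue.enqueue("update-" + i);
        }
        System.out.println("queued=" + queue.queued() + " dropped=" + queue.dropped());
        System.out.println("replayed=" + queue.drain());
    }
}
```

Because the queue has a fixed capacity, heap usage stays capped no matter how long the remote cluster is unreachable; the cost is that updates arriving after the queue fills are not replicated.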

-talip

Jeff T

May 8, 2012, 5:46:06 PM
to Hazelcast
One follow-up question: what are the out-of-the-box constraints of
the WAN replication queue in terms of maximum size? Is this max size
in any way configurable?

We want to make sure we understand the side effects of a prolonged
WAN replication outage and budget enough JVM heap to keep local
Hazelcast clusters stable if the WAN link or remote data center goes
offline for hours or days after a disaster event. Thanks!

-Jeff

Talip Ozturk

May 9, 2012, 4:18:02 AM
to haze...@googlegroups.com
The BoundedQueue size is 100k. In terms of memory, you will need 100k *
(entry-size). It also means an outage shouldn't last longer than it
takes to accumulate 100k updates; otherwise you lose updates.
Unfortunately, bounded-queue-size is hard-coded; it is not configurable yet.
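
As a rough way to reason about how long an outage a 100k queue can absorb, here is a small Java sketch. The 100k capacity comes from this thread; the class name, the method, and the example update rate are hypothetical, and the calculation assumes a steady update rate feeding a single queue.

```java
// Hypothetical helper; the update rate below is an illustrative assumption.
final class WanOutageWindow {
    // Queue capacity quoted in this thread (hard-coded at the time).
    static final long QUEUE_CAPACITY = 100_000L;

    // Seconds until the queue is full, assuming a steady rate of updates per second.
    static double secondsUntilFull(double updatesPerSecond) {
        if (updatesPerSecond <= 0) {
            throw new IllegalArgumentException("updatesPerSecond must be positive");
        }
        return QUEUE_CAPACITY / updatesPerSecond;
    }

    public static void main(String[] args) {
        double rate = 50.0; // assumed example rate, not a measured figure
        double seconds = secondsUntilFull(rate);
        System.out.printf("At %.0f updates/s the queue fills in about %.0f s (%.1f min)%n",
                rate, seconds, seconds / 60.0);
    }
}
```

With the assumed rate of 50 updates per second, the queue fills after 2000 seconds, so an outage longer than roughly half an hour would start losing updates. Real workloads are bursty, so treat this as an upper bound on the safe window rather than a guarantee.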

Jeff T

May 9, 2012, 10:04:13 PM
to Hazelcast
Does the 100k WAN replication queue have a backup count > 0? The answer
affects:

1. Whether WAN replication users need to budget 2 * 100k * (entry
size).
2. The impact of losing a single cluster node on the WAN replication
queue, i.e., if the WAN replication link goes down for a few hours, and
during those few hours a single Hazelcast node goes down or gets
rebooted, are any messages lost from the WAN replication queue?

I'll submit a feature request for WAN replication queue size
configurability. Some deployments may want more events to go in the
WAN replication queue; others may choose to configure a smaller queue
size where preserving heap memory budgets takes precedence.

Thanks!

Talip Ozturk

May 10, 2012, 7:57:36 AM
to haze...@googlegroups.com
It is actually 100k per node because every node keeps and pushes its
own backups. So if you have 5 nodes in a cluster, total update queue
size will be 5 x 100k. Backup count is not relevant at all.
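
The per-node arithmetic can be sketched in Java as follows. The 100k capacity and the 5-node example come from this thread; the class name, method names, and the 1 KB average entry size are assumptions for illustration, and the estimate ignores JVM per-object overhead.

```java
// Hypothetical helper; the average entry size is an illustrative assumption.
final class WanQueueHeapBudget {
    // Per-node queue capacity quoted in this thread.
    static final long QUEUE_CAPACITY = 100_000L;

    // Worst-case bytes held by one node's queue: 100k * (entry size).
    static long perNodeBytes(long avgEntryBytes) {
        return Math.multiplyExact(QUEUE_CAPACITY, avgEntryBytes);
    }

    // Worst-case bytes across the cluster: each node holds its own 100k queue.
    static long clusterBytes(int nodes, long avgEntryBytes) {
        return Math.multiplyExact((long) nodes, perNodeBytes(avgEntryBytes));
    }

    public static void main(String[] args) {
        int nodes = 5;          // example cluster size from the thread
        long entryBytes = 1024; // assumed average serialized entry size
        System.out.println("per node: " + perNodeBytes(entryBytes) + " bytes");
        System.out.println("cluster:  " + clusterBytes(nodes, entryBytes) + " bytes");
    }
}
```

Under these assumptions each node should reserve about 102.4 MB of heap for a full queue (512 MB across the 5-node cluster), and the per-node figure is the one that matters for sizing each JVM.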

-talip