backups not 100% in 2.1?

Peter

May 29, 2012, 6:15:06 PM
to Hazelcast
Using 2.1, we are apparently seeing Hazelcast fail to perform all backups.
My question is: are our expectations correct, and is our method of
counting the backups valid?

Scenario:
1. Start 39 nodes that will join a single cluster (3s delay between
each node start)
2. Wait for 10 nodes to join; once they do, individual nodes have
permission to start writing to a Hazelcast IMap (gate sketched below)
3. Wait for all nodes to join and complete their individual writes
4. Wait for Hazelcast to complete all migrations
5. Calculate owned keys versus backups... and there is a mismatch.
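
The gate in step 2 is nothing more elaborate than polling the member count; roughly this (a sketch, not our actual code; the map name and key range are made up):

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class WriteGate {
    public static void main(String[] args) throws InterruptedException {
        HazelcastInstance hz = Hazelcast.getDefaultInstance();
        // Step 2: block until at least 10 members have joined the cluster.
        while (hz.getCluster().getMembers().size() < 10) {
            Thread.sleep(1000);
        }
        // Step 3: this node then performs its own writes (illustrative key range only).
        IMap<Long, String> map = hz.getMap("data");
        for (long key = 0; key < 1000; key++) {
            map.put(key, "value-" + key);
        }
    }
}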

The cluster details:
* 39 nodes
* partition count of 128
* 1 backup
* wait to write data into Hazelcast until 10 nodes have joined, but do
not wait for all 39
* each node is reading and writing to Hazelcast's IMap, but they are
only updating existing keys, not putting new ones
* no ttl for the map
* other than these settings, the map configuration is basically the
default (a programmatic equivalent is sketched below)
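
For reference, the relevant parts of that configuration expressed programmatically would look roughly like this (a sketch only; our real config is XML, and the class name here is made up):

import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class NodeBootstrap {
    public static HazelcastInstance start() {
        Config config = new Config();
        // 128 partitions shared across the 39-node cluster
        config.setProperty("hazelcast.partition.count", "128");
        MapConfig mapConfig = config.getMapConfig("default");
        mapConfig.setBackupCount(1);        // one backup copy per entry
        mapConfig.setTimeToLiveSeconds(0);  // no ttl
        return Hazelcast.newHazelcastInstance(config);
    }
}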

We're counting the number of owned keys and backups by executing a
MultiTask on all members of the cluster. Each member does the
following (the oldest member initiates and totals the results):

long owned = 0L;
long backedUp = 0L;
// Iterate over the data structures of the (default) Hazelcast instance on this member
for (Instance instance : Hazelcast.getInstances()) {
    if (instance instanceof IMap) {
        IMap map = (IMap) instance;
        LocalMapStats stats = map.getLocalMapStats();
        owned += stats.getOwnedEntryCount();      // entries this member owns
        backedUp += stats.getBackupEntryCount();  // backup copies this member holds
    }
}
return new EntryCount(owned, backedUp);
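
The oldest member submits that as a MultiTask to every member and totals the results, roughly like this (a sketch, not our exact code; CountEntriesCallable wraps the loop above, and EntryCount is our own Serializable holder with two getters):

// CountEntriesCallable implements Callable<EntryCount> and runs the loop shown above.
Set<Member> members = Hazelcast.getCluster().getMembers();
MultiTask<EntryCount> task = new MultiTask<EntryCount>(new CountEntriesCallable(), members);
Hazelcast.getExecutorService().execute(task);

long totalOwned = 0L;
long totalBackedUp = 0L;
for (EntryCount count : task.get()) {   // blocks until every member has replied
    totalOwned += count.getOwned();
    totalBackedUp += count.getBackedUp();
}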

In the current run we have a key count of 1325000. The backup count
initially ends up somewhere around 1324101; after that, the backup
count slowly climbs:

[05.29.12 13:54:04.707 INFO Thread-1 com.code42.hz.task.LogStats ] TOTAL KEYS IN CLUSTER: owned=1325000, backup=1324117
[05.29.12 13:54:35.545 INFO Thread-1 com.code42.hz.task.LogStats ] TOTAL KEYS IN CLUSTER: owned=1325000, backup=1324129
[05.29.12 13:55:06.692 INFO Thread-1 com.code42.hz.task.LogStats ] TOTAL KEYS IN CLUSTER: owned=1325000, backup=1324134
[05.29.12 13:55:37.518 INFO Thread-1 com.code42.hz.task.LogStats ] TOTAL KEYS IN CLUSTER: owned=1325000, backup=1324144
[05.29.12 13:56:07.778 INFO Thread-1 com.code42.hz.task.LogStats ] TOTAL KEYS IN CLUSTER: owned=1325000, backup=1324151
[05.29.12 13:56:38.288 INFO Thread-1 com.code42.hz.task.LogStats ] TOTAL KEYS IN CLUSTER: owned=1325000, backup=1324158
[05.29.12 13:57:09.016 INFO Thread-1 com.code42.hz.task.LogStats ] TOTAL KEYS IN CLUSTER: owned=1325000, backup=1324164
[05.29.12 13:58:41.911 INFO Thread-1 com.code42.hz.task.LogStats ] TOTAL KEYS IN CLUSTER: owned=1325000, backup=1324182

Given a stable Hazelcast cluster that isn't doing writes or
migrations, shouldn't those values always match? Our expectation is
that the backups should have been written synchronously with the
initial writes.

Two additional notes:
1. We did not notice a similar discrepancy using Hazelcast 1.9.
2. If I change the scenario and do not write to the cluster until all
39 nodes have joined, then the backup count is usually 100% of the
owned key count. I'm not sure there has ever been a discrepancy in
that case. The issue has always been tied to writing to the cluster
while nodes are still joining.


Some more information. After reading this discussion,
https://groups.google.com/d/topic/hazelcast/2RT4gibv_3E/discussion, I
set this property:

<property name="hazelcast.partition.migration.interval">0</property>

While that apparently increased the rate of these "catch-up" backups,
the map was still not 100% backed up after doing the initial writes.
Log excerpts:

[05.29.12 14:03:26.164 INFO Thread-1 com.code42.hz.task.LogStats ] TOTAL KEYS IN CLUSTER: owned=1325000, backup=1321043
[05.29.12 14:03:56.614 INFO Thread-1 com.code42.hz.task.LogStats ] TOTAL KEYS IN CLUSTER: owned=1325000, backup=1321295
[05.29.12 14:04:27.389 INFO Thread-1 com.code42.hz.task.LogStats ] TOTAL KEYS IN CLUSTER: owned=1325000, backup=1321535
[05.29.12 14:04:57.587 INFO Thread-1 com.code42.hz.task.LogStats ] TOTAL KEYS IN CLUSTER: owned=1325000, backup=1321760

Enes Akar

May 30, 2012, 4:03:12 AM
to haze...@googlegroups.com
Hi Peter,

There is a problem in your Callable.
Hazelcast.getInstances() gives you the instances of the default Hazelcast instance.

You should replace it with a Callable like this:
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.HazelcastInstanceAware;
import com.hazelcast.core.IMap;
import com.hazelcast.core.Instance;
import com.hazelcast.monitor.LocalMapStats;

import java.io.Serializable;
import java.util.concurrent.Callable;

public class CallablePeter implements Callable<String>, Serializable, HazelcastInstanceAware {

    // Injected by Hazelcast on the executing member before call() is invoked
    HazelcastInstance hazelcastInstance;

    @Override
    public String call() throws Exception {
        long owned = 0L;
        long backedUp = 0L;
        for (Instance instance : hazelcastInstance.getInstances()) {
            if (instance instanceof IMap) {
                IMap map = (IMap) instance;
                LocalMapStats stats = map.getLocalMapStats();
                owned += stats.getOwnedEntryCount();
                backedUp += stats.getBackupEntryCount();
            }
        }
        return "owned:" + owned + " backup:" + backedUp;
    }

    @Override
    public void setHazelcastInstance(HazelcastInstance hazelcastInstance) {
        this.hazelcastInstance = hazelcastInstance;
    }
}
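
You can submit it the same way (via a MultiTask, for example); because it implements HazelcastInstanceAware, Hazelcast injects the member-local instance before call() runs. A minimal sketch of the submission side:

MultiTask<String> task = new MultiTask<String>(new CallablePeter(),
        Hazelcast.getCluster().getMembers());
Hazelcast.getExecutorService().execute(task);
for (String result : task.get()) {
    System.out.println(result);   // one "owned:... backup:..." line per member
}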

I tried a similar test and saw no problem.
Can you re-test with your scenario?


Enes Akar

May 30, 2012, 4:06:39 AM
to haze...@googlegroups.com
By the way, have you tried Management Center?

It also gives detailed numbers about your cluster.

Peter

May 30, 2012, 11:57:47 AM
to Hazelcast
Hi Enes, thank you for the reply.

A few follow-up notes:

1. In our test application we are only using the default Hazelcast
instance. As far as I'm aware, that means our existing Callable
functions the same as the one you outlined. That said, I plugged in
your version and repeated the tests just in case. The results were the
same: when keys are written to the cluster while many join operations
are being performed, backups are apparently not guaranteed. In fact, I
can confirm data loss by turning off a few nodes (with a delay between
disabling them); see the sketch after these notes.

2. I also confirmed that the new Callable returns the expected 100%
backup count _IF_ no data is written to the cluster until all of the
nodes have joined. This is an argument, I believe, for trusting the
counting mechanism: it does show a 100% backup rate in some scenarios.

3. Something I did not mention in the original outline is that there
are no apparent exceptions in the logs during the run. There isn't any
visible evidence of write failures, etc.
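
For what it's worth, the data-loss check mentioned in note 1 boils down to something like this (a sketch; the map name is made up, the expected count is from the run above):

// After stopping a few members one at a time (with a pause for repartitioning in between),
// the surviving cluster should still report every key if the backups were complete.
long expected = 1325000L;
long actual = Hazelcast.getDefaultInstance().getMap("data").size();
if (actual < expected) {
    System.out.println("data loss: expected=" + expected + " actual=" + actual);
}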

Regards,

Peter

May 30, 2012, 12:02:07 PM
to Hazelcast
I haven't. At the moment I'm dealing with a many-node standalone test
application that's running on spare cycles within a production
environment. It's simulating production node behavior regarding
network connections, data size and activity levels, but it's extremely
stripped down.

I could, theoretically, replicate the results on a 2-node cluster (the
limit of the free license) if the Management Center is likely to
produce additional information regarding the backups. I defer to your
experience in the matter. If I did pursue that effort, what
information from the Center would you like me to report back?

Thank you,

On May 30, 3:06 am, Enes Akar <e...@hazelcast.com> wrote:
> By the way, have you tried Management Center
> <http://www.hazelcast.com/mancenter.jsp>?

Enes Akar

May 31, 2012, 3:03:53 AM
to haze...@googlegroups.com
I have reproduced your problem.
We will work on it.


Peter

May 31, 2012, 9:57:00 AM
to Hazelcast
Good! Thank you for the update.