Using Hazelcast 2.1, we appear to be seeing Hazelcast not perform all backups.
My question is: are our expectations correct, and is our method of
counting the backups valid?
Scenario:
1. Start 39 nodes that will join a single cluster (3s delay between
each node start)
2. Wait for 10 nodes to join; once they have, individual nodes have
permission to start writing to a Hazelcast IMap
3. Wait for all nodes to join and complete their individual writes
4. Wait for Hazelcast to complete all migrations
5. Calculate owned keys versus backups: there is a mismatch
The cluster details:
* 39 nodes
* partition count of 128
* 1 backup
* wait to write data into Hazelcast until 10 nodes have joined, but do
not wait for all 39
* each node is reading and writing to Hazelcast's IMap, but they are
only updating existing keys, not putting new ones
* no ttl for the map
* other than these settings, the map configuration is basically
defaulted
We're counting the number of owned keys and backups by executing a
MultiTask on all members of the cluster. Each member does the
following (the oldest member initiates and totals the results):
    long owned = 0L;
    long backedUp = 0L;
    for (Instance instance : Hazelcast.getInstances()) {
        if (instance instanceof IMap) {
            IMap map = (IMap) instance;
            LocalMapStats stats = map.getLocalMapStats();
            owned += stats.getOwnedEntryCount();
            backedUp += stats.getBackupEntryCount();
        }
    }
    return new EntryCount(owned, backedUp);
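For clarity, the totalling step on the initiating member amounts to something like the sketch below. This is not our actual class: the `EntryCount` fields, the `total` helper, and `missingBackups` are hypothetical names used only to illustrate the arithmetic, and the sample values in `main` are made up.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the EntryCount result class; field and method
// names are assumptions for illustration only.
final class EntryCount {
    final long owned;
    final long backedUp;

    EntryCount(long owned, long backedUp) {
        this.owned = owned;
        this.backedUp = backedUp;
    }

    // Sum the per-member results returned by the distributed task.
    static EntryCount total(Iterable<EntryCount> perMember) {
        long owned = 0L;
        long backedUp = 0L;
        for (EntryCount c : perMember) {
            owned += c.owned;
            backedUp += c.backedUp;
        }
        return new EntryCount(owned, backedUp);
    }

    // With one backup configured, we expect one backup copy per owned
    // entry, so the shortfall is simply owned minus backedUp.
    long missingBackups() {
        return owned - backedUp;
    }
}

class EntryCountDemo {
    public static void main(String[] args) {
        // Made-up per-member results, not real cluster data.
        List<EntryCount> results = Arrays.asList(
                new EntryCount(3, 2), new EntryCount(4, 4));
        EntryCount sum = EntryCount.total(results);
        System.out.println("owned=" + sum.owned + ", backup=" + sum.backedUp
                + ", missing=" + sum.missingBackups());
    }
}
```

The point is that we compare cluster-wide sums, so with a backup count of 1 we expect the two totals to be equal once the cluster is quiet.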
In the current run we have a key count of 1325000. The backup count
initially ends up somewhere around 1324101; after that, it slowly
climbs:
[05.29.12 13:54:04.707 INFO Thread-1 com.code42.hz.task.LogStats ] TOTAL KEYS IN CLUSTER: owned=1325000, backup=1324117
[05.29.12 13:54:35.545 INFO Thread-1 com.code42.hz.task.LogStats ] TOTAL KEYS IN CLUSTER: owned=1325000, backup=1324129
[05.29.12 13:55:06.692 INFO Thread-1 com.code42.hz.task.LogStats ] TOTAL KEYS IN CLUSTER: owned=1325000, backup=1324134
[05.29.12 13:55:37.518 INFO Thread-1 com.code42.hz.task.LogStats ] TOTAL KEYS IN CLUSTER: owned=1325000, backup=1324144
[05.29.12 13:56:07.778 INFO Thread-1 com.code42.hz.task.LogStats ] TOTAL KEYS IN CLUSTER: owned=1325000, backup=1324151
[05.29.12 13:56:38.288 INFO Thread-1 com.code42.hz.task.LogStats ] TOTAL KEYS IN CLUSTER: owned=1325000, backup=1324158
[05.29.12 13:57:09.016 INFO Thread-1 com.code42.hz.task.LogStats ] TOTAL KEYS IN CLUSTER: owned=1325000, backup=1324164
[05.29.12 13:58:41.911 INFO Thread-1 com.code42.hz.task.LogStats ] TOTAL KEYS IN CLUSTER: owned=1325000, backup=1324182
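To put a number on "slowly", here is a rough calculation from the first and last samples above. It assumes the catch-up rate stays constant, which may well not be true, and the class and method names are made up for illustration. By this estimate the backups gain about 0.23 entries per second, with 818 still missing at the last sample, so catching up fully would take on the order of an hour.

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical helper, not part of Hazelcast or our code base.
class BackupCatchUp {
    // Backup entries gained per second between two log samples.
    static double ratePerSecond(LocalTime t0, long backup0,
                                LocalTime t1, long backup1) {
        double seconds = Duration.between(t0, t1).toMillis() / 1000.0;
        return (backup1 - backup0) / seconds;
    }

    // Seconds until backup == owned, assuming a constant rate.
    static double secondsToCatchUp(long owned, long backup, double rate) {
        return (owned - backup) / rate;
    }

    public static void main(String[] args) {
        // First and last samples from the log excerpt.
        LocalTime first = LocalTime.parse("13:54:04.707");
        LocalTime last = LocalTime.parse("13:58:41.911");
        long owned = 1325000L;
        long backupFirst = 1324117L;
        long backupLast = 1324182L;

        double rate = ratePerSecond(first, backupFirst, last, backupLast);
        double eta = secondsToCatchUp(owned, backupLast, rate);
        System.out.printf("rate=%.3f entries/s, remaining=%d, eta=%.0f s%n",
                rate, owned - backupLast, eta);
    }
}
```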
Given a stable Hazelcast cluster that isn't doing writes or
migrations, shouldn't those values always match? We expected the
backups to have been written synchronously with the initial writes.
Two additional notes:
1. We did not notice a similar discrepancy using Hazelcast 1.9.
2. If I change the scenario and do not write to the cluster until all
39 nodes have joined, then the backup count is usually 100% of the
owned key count. I'm not sure there has ever been a discrepancy in
that case. The issue has always been tied to writing to the cluster
while nodes are still joining.
Some more information. After reading this discussion,
https://groups.google.com/d/topic/hazelcast/2RT4gibv_3E/discussion, I
set this property:
<property name="hazelcast.partition.migration.interval">0</property>
While that apparently increased the rate of these "catch-up" backups,
the map was still not 100% backed up after doing the initial writes.
Log excerpts:
[05.29.12 14:03:26.164 INFO Thread-1 com.code42.hz.task.LogStats ] TOTAL KEYS IN CLUSTER: owned=1325000, backup=1321043
[05.29.12 14:03:56.614 INFO Thread-1 com.code42.hz.task.LogStats ] TOTAL KEYS IN CLUSTER: owned=1325000, backup=1321295
[05.29.12 14:04:27.389 INFO Thread-1 com.code42.hz.task.LogStats ] TOTAL KEYS IN CLUSTER: owned=1325000, backup=1321535
[05.29.12 14:04:57.587 INFO Thread-1 com.code42.hz.task.LogStats ] TOTAL KEYS IN CLUSTER: owned=1325000, backup=1321760