Hazelcast-2.0: Cluster is losing data on simultaneous node shutdown


Md Kamaruzzaman

Mar 5, 2012, 11:32:29 AM
to Hazelcast
In Hazelcast 2.0, the distributed backup feature was added. Here is the
relevant release note:

Distributed Backups: Data owned by a member will be evenly backed up
by all the other members

I have a three-node cluster with backup-count 2. At first, nodes A and B
are connected, and 100 elements are put into a distributed map on A. Then
node C joins, so the cluster is A-B-C. If nodes A and B are stopped
almost simultaneously, I get the following warning:

Warning: /127.0.0.1:8551 [dev] Owner of partition is being removed!
Possible data loss for partition[0].
PartitionReplicaChangeEvent{partitionId=0, replicaIndex=0,
oldAddress=Address[127.0.0.1:8550], newAddress=null}

There are lots of these warnings, for almost all partitions. In the end,
node C holds only a few of the elements. According to the documentation,
Hazelcast 2.0 is supposed to solve this problem.
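
For reference, here is a minimal sketch of how the map and the data are set up (the class name is mine and the map is configured programmatically here; the same backup count can equally be set in hazelcast.xml):

import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class BackupCountDemo {
    public static void main(String[] args) {
        // backup-count 2: each entry should be held by its owner plus two other members
        MapConfig mapConfig = new MapConfig();
        mapConfig.setName("customers");
        mapConfig.setBackupCount(2);

        Config config = new Config();
        config.addMapConfig(mapConfig);

        HazelcastInstance node = Hazelcast.newHazelcastInstance(config);
        IMap<Integer, String> customers = node.getMap("customers");

        // node A puts 100 elements into the distributed map
        for (int i = 0; i < 100; i++) {
            customers.put(i, "customer-" + i);
        }
    }
}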

- Kamaruzzaman

Mehmet Dogan

Mar 5, 2012, 4:53:07 PM
to haze...@googlegroups.com

You should either wait until all 2nd backup operations are completed before terminating A and B, or shut down nodes A and B gracefully.
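
For example, a graceful shutdown through the lifecycle service (a minimal sketch; hz stands for the HazelcastInstance of the member being stopped) gives the cluster a chance to take over that member's data before it leaves:

// shut this member down gracefully instead of killing the process
hz.getLifecycleService().shutdown();

// or, to stop every Hazelcast instance running in this JVM:
// Hazelcast.shutdownAll();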


Md Kamaruzzaman

Mar 6, 2012, 5:45:26 AM
to Hazelcast
The problem is that for the 3rd node (node C: 127.0.0.1:8560), the backup
is painfully slow. Here is the output for 100 elements:

Information: /127.0.0.1:8560 [dev] Address[127.0.0.1:8560][customers]
loaded 0 in total.
Map is initialized
Number of Customers: 100
MigrationEvent{partitionId=128, oldOwner=Member [127.0.0.1:8540],
newOwner=Member [127.0.0.1:8560] this}
MigrationEvent{partitionId=128, oldOwner=Member [127.0.0.1:8540],
newOwner=Member [127.0.0.1:8560] this}
Current time: 2012.03.06.11.28.58
Own Entry: 0, Backup Entry: 0
Number of Customers: 100
Own Entry: 0, Backup Entry: 0
Current time: 2012.03.06.11.29.03
Number of Customers: 100
MigrationEvent{partitionId=154, oldOwner=Member [127.0.0.1:8540],
newOwner=Member [127.0.0.1:8560] this}
MigrationEvent{partitionId=154, oldOwner=Member [127.0.0.1:8540],
newOwner=Member [127.0.0.1:8560] this}
Current time: 2012.03.06.11.29.08
Own Entry: 0, Backup Entry: 0
Number of Customers: 100
Current time: 2012.03.06.11.29.13
Own Entry: 0, Backup Entry: 0
Number of Customers: 100
MigrationEvent{partitionId=38, oldOwner=Member [127.0.0.1:8550],
newOwner=Member [127.0.0.1:8560] this}
MigrationEvent{partitionId=38, oldOwner=Member [127.0.0.1:8550],
newOwner=Member [127.0.0.1:8560] this}
Current time: 2012.03.06.11.29.18
Own Entry: 0, Backup Entry: 0
Number of Customers: 100
Current time: 2012.03.06.11.29.23
Own Entry: 0, Backup Entry: 0
Number of Customers: 100
MigrationEvent{partitionId=263, oldOwner=Member [127.0.0.1:8540],
newOwner=Member [127.0.0.1:8560] this}
MigrationEvent{partitionId=263, oldOwner=Member [127.0.0.1:8540],
newOwner=Member [127.0.0.1:8560] this}
Current time: 2012.03.06.11.29.28
Own Entry: 1, Backup Entry: 0
Number of Customers: 100
Current time: 2012.03.06.11.29.33
Own Entry: 1, Backup Entry: 0
Number of Customers: 100
Current time: 2012.03.06.11.29.38
Own Entry: 1, Backup Entry: 0
Number of Customers: 100
Current time: 2012.03.06.11.29.43
Own Entry: 1, Backup Entry: 0
Number of Customers: 100
Current time: 2012.03.06.11.29.48
Own Entry: 1, Backup Entry: 1
Number of Customers: 100
Current time: 2012.03.06.11.29.53
Own Entry: 1, Backup Entry: 1
Number of Customers: 100
Current time: 2012.03.06.11.29.58
Own Entry: 1, Backup Entry: 1
Number of Customers: 100
Own Entry: 1, Backup Entry: 1
Current time: 2012.03.06.11.30.03
Number of Customers: 100
Own Entry: 1, Backup Entry: 1
Current time: 2012.03.06.11.30.08
Number of Customers: 100
Own Entry: 1, Backup Entry: 1
Current time: 2012.03.06.11.30.13
Number of Customers: 100
Own Entry: 1, Backup Entry: 2
Current time: 2012.03.06.11.30.18
Number of Customers: 100
Own Entry: 1, Backup Entry: 2
Current time: 2012.03.06.11.30.23
Number of Customers: 100
Current time: 2012.03.06.11.30.28
Own Entry: 1, Backup Entry: 2
Number of Customers: 100
Own Entry: 1, Backup Entry: 2
Current time: 2012.03.06.11.30.33
Number of Customers: 100
Current time: 2012.03.06.11.30.38
Own Entry: 1, Backup Entry: 2
Number of Customers: 100
Current time: 2012.03.06.11.30.43
Own Entry: 1, Backup Entry: 2
Number of Customers: 100

It shows that it takes almost 2 minutes to get only two backup entries for
100 elements.
Is there any listener that fires when a backup is made?
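
For context, the MigrationEvent and "Own Entry / Backup Entry" lines above come from roughly this kind of monitoring code (a sketch rather than my exact test class; it assumes the 2.x API, where MigrationListener and MigrationEvent live in com.hazelcast.partition and LocalMapStats exposes getOwnedEntryCount()/getBackupEntryCount()):

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.monitor.LocalMapStats;
import com.hazelcast.partition.MigrationEvent;
import com.hazelcast.partition.MigrationListener;

public class BackupMonitor {
    public static void main(String[] args) throws InterruptedException {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(null);
        IMap<Integer, String> customers = hz.getMap("customers");

        // prints the MigrationEvent lines seen in the output above
        hz.getPartitionService().addMigrationListener(new MigrationListener() {
            public void migrationStarted(MigrationEvent event) {
                System.out.println(event);
            }
            public void migrationCompleted(MigrationEvent event) {
                System.out.println(event);
            }
        });

        // polls local statistics every 5 seconds
        while (true) {
            LocalMapStats stats = customers.getLocalMapStats();
            System.out.println("Own Entry: " + stats.getOwnedEntryCount()
                    + ", Backup Entry: " + stats.getBackupEntryCount());
            System.out.println("Number of Customers: " + customers.size());
            Thread.sleep(5000);
        }
    }
}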

Thanks,
Md Kamaruzzaman



On Mar 5, 10:53 pm, Mehmet Dogan <meh...@hazelcast.com> wrote:
> You should either wait until all 2nd backup operations are completed before
> terminating A and B or shutdown nodes A and B gracefully.

Mehmet Dogan

Mar 6, 2012, 6:50:36 AM
to haze...@googlegroups.com
First backups are taken immediately, as fast as possible. The others are processed slowly, one partition at a time, at an interval (10 seconds by default). This avoids heavy migration and copy load on the cluster when a member joins or leaves. First backups should be available immediately to ensure data safety.

You can change the migration/backup interval using the property 'hazelcast.partition.migration.interval'. If it is set to 0, all backups will be immediate.
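
For example, programmatically (just a sketch; the same property can be set in hazelcast.xml, as in the next message, or as a JVM system property -Dhazelcast.partition.migration.interval=0):

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class ImmediateBackups {
    public static void main(String[] args) {
        Config config = new Config();
        // 0 removes the interval, so all backup copies are created immediately
        config.setProperty("hazelcast.partition.migration.interval", "0");
        // start the member with this configuration
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
    }
}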

@mmdogan

Md Kamaruzzaman

Mar 6, 2012, 7:57:52 AM
to Hazelcast
Thanks for the explanation. I have set the following in the xml file:

<property name="hazelcast.partition.migration.interval">0</property>

Now, even if 2 nodes are shut down simultaneously, no data is lost.

Stock

Feb 9, 2013, 6:30:58 PM
to haze...@googlegroups.com, tush...@gmail.com
Hi all,

I'm replying late, but I want to point out that I get this message in another situation as well.

It happens when I have only a lite member active and no other nodes.

I've checked the PartitionManager code, and here is the "if" statement that generates the message:

if (event.getReplicaIndex() == 0 && event.getNewAddress() == null
        && node.isActive() && node.joined()) {
    final String warning = "Owner of partition is being removed! " +
            "Possible data loss for partition[" + event.getPartitionId() + "]. "
            + event;
    logger.log(Level.WARNING, warning);
    systemLogService.logPartition(warning);
}
I suggest adding another check: whether the node is a lite member. If it is, this warning is a false alarm, because the cluster is effectively down anyway and no data is available. Does that make sense?
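
Something along these lines (just a sketch of the idea, assuming the Node class exposes an isLiteMember() accessor; the actual method name may differ):

if (event.getReplicaIndex() == 0 && event.getNewAddress() == null
        && node.isActive() && node.joined()
        && !node.isLiteMember()) { // assumed accessor: skip the warning on a lite member
    final String warning = "Owner of partition is being removed! " +
            "Possible data loss for partition[" + event.getPartitionId() + "]. "
            + event;
    logger.log(Level.WARNING, warning);
    systemLogService.logPartition(warning);
}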

I'm using HC 2.4.

Thanks a lot.

Mehmet Dogan

Feb 11, 2013, 3:57:44 AM
to haze...@googlegroups.com
Sure, makes sense.

@mmdogan


