Migration hangs after data node restart


Volkov Sergey

Dec 19, 2021, 8:00:45 AM12/19/21
to Hazelcast
Hi!
I have a cluster with 3 data nodes and around 40 lite members.
When one of the data nodes restarts, the cluster hangs during migration almost every time.

I get 

> [2a02:6b8:c0c:8fa6:0:1688:89ac:0]:5701 [dev] [4.2.2] Remaining migration tasks: 206. (repartitionTime=Thu Dec 16 23:02:10 MSK 2021, plannedMigrations=215, completedMigrations=2, remainingMigrations=213, totalCompletedMigrations=591)

repeated in the data node log over a long period (a couple of hours).

Lite member operations fail with the message
> com.hazelcast.spi.exception.PartitionMigratingException: Partition is migrating! this: [2a02:6b8:c0c:5094:0:1688:bf9d:2]:5701, partitionId: 32, operation: com.hazelcast.map.impl.operation.PutOperation, service: hz:impl:mapService

And after some time the migration fails:
> [2a02:6b8:c0c:8fa6:0:1688:89ac:0]:5701 [dev] [4.2.2] Migration failed: MigrationInfo{uuid=ad52e830-9070-48b6-bc25-02e520f26b36, partitionId=115, source=[2a02:6b8:c0c:8fa6:0:1688:89ac:0]:5701 - dfda5d36-1865-40ad-84fb-52f7c04c872a, sourceCurrentReplicaIndex=0, sourceNewReplicaIndex=2, destination=[2a02:6b8:c0b:6085:0:1688:b160:1]:5701 - c366133e-5e34-4f1e-a4a2-1dfe55b2f309, destinationCurrentReplicaIndex=-1, destinationNewReplicaIndex=0, master=[2a02:6b8:c0c:8fa6:0:1688:89ac:0]:5701, initialPartitionVersion=101, partitionVersionIncrement=2, status=ACTIVE}

I am also getting
> [2a02:6b8:c0b:6085:0:1688:b160:1]:5701 [dev] [4.2.2] Member [2a02:6b8:c0c:8fa6:0:1688:89ac:0]:5701 - dfda5d36-1865-40ad-84fb-52f7c04c872a is suspected to be dead for reason: Suspecting Member [2a02:6b8:c0c:8fa6:0:1688:89ac:0]:5701 - dfda5d36-1865-40ad-84fb-52f7c04c872a because it has not sent any heartbeats since 2021-12-16 20:52:05.711. Now: 2021-12-16 20:53:07.613, heartbeat timeout: 60000 ms, suspicion level: 1.00
from one of the data nodes, but not the other.

Then it reconnects and heartbeats resume, but the migration fails again after some time.


I think there is some problem with the connection between the two nodes while migrating, but I don't know how to debug it. There are no high CPU or network issues on the nodes.
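
In case it matters for debugging: the "suspected to be dead" warning above corresponds to Hazelcast's heartbeat failure detector. As a sketch (assuming the standard system properties, not anything specific to this cluster), the relevant knobs can be set in hazelcast.xml to rule out a too-aggressive timeout:

```xml
<!-- Sketch only: hazelcast.max.no.heartbeat.seconds is the property behind
     the "heartbeat timeout: 60000 ms" seen in the log. Raising it while
     debugging can show whether the suspicion itself, rather than a real
     outage, is what breaks the migration. -->
<hazelcast xmlns="http://www.hazelcast.com/schema/config">
    <properties>
        <!-- how often members send heartbeats, in seconds -->
        <property name="hazelcast.heartbeat.interval.seconds">5</property>
        <!-- how long a member may stay silent before being suspected -->
        <property name="hazelcast.max.no.heartbeat.seconds">120</property>
    </properties>
</hazelcast>
```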

Any ideas what should I check next?
hz.xml

Volkov Sergey

Dec 19, 2021, 8:51:36 AM12/19/21
to Hazelcast
I also enabled trace logging for com.hazelcast.internal.partition.
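
For anyone wanting to reproduce the same output: a minimal sketch of that logger configuration, assuming Log4j 2 is the configured logging backend (hazelcast.logging.type=log4j2); the appender name "Console" is a placeholder:

```xml
<!-- Sketch, assuming Log4j 2: only the package name comes from the post. -->
<Loggers>
    <!-- TRACE for the partition/migration internals only -->
    <Logger name="com.hazelcast.internal.partition" level="TRACE"/>
    <!-- keep everything else at the usual level -->
    <Root level="INFO">
        <AppenderRef ref="Console"/>
    </Root>
</Loggers>
```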

It looks like there are 4 successful migration fragments, like

> 2021-12-18 18:02:59 TRACE com.hazelcast.internal.partition.operation.MigrationRequestOperation [2a02:6b8:c0b:3e1c:0:1688:f829:2]:5701 [dev] [4.2.2] Invoking MigrationOperation for namespaces [DistributedObjectNamespace{service='hz:impl:mapService', objectName='serpSetLock'}] and MigrationInfo{uuid=906bb551-ce85-48f7-816e-c81e6edde40f, partitionId=4, source=[2a02:6b8:c0b:3e1c:0:1688:f829:2]:5701 - 47d87e75-eb4f-4b47-9a9b-7106ab1f4b94, sourceCurrentReplicaIndex=0, sourceNewReplicaIndex=3, destination=[2a02:6b8:c0c:4ca0:0:1688:f8f6:0]:5701 - fac9cfd8-0c35-496f-a2e6-fb95ebc85c01, destinationCurrentReplicaIndex=-1, destinationNewReplicaIndex=0, master=[2a02:6b8:c0b:3e1c:0:1688:f829:2]:5701, initialPartitionVersion=318, partitionVersionIncrement=2, status=ACTIVE}, lastFragment: false

> 2021-12-18 18:02:59 TRACE com.hazelcast.internal.partition.operation.MigrationOperation [2a02:6b8:c0c:4ca0:0:1688:f8f6:0]:5701 [dev] [4.2.2] ReplicaVersions are set after migration. MigrationInfo{uuid=906bb551-ce85-48f7-816e-c81e6edde40f, partitionId=4, source=[2a02:6b8:c0b:3e1c:0:1688:f829:2]:5701 - 47d87e75-eb4f-4b47-9a9b-7106ab1f4b94, sourceCurrentReplicaIndex=0, sourceNewReplicaIndex=3, destination=[2a02:6b8:c0c:4ca0:0:1688:f8f6:0]:5701 - fac9cfd8-0c35-496f-a2e6-fb95ebc85c01, destinationCurrentReplicaIndex=-1, destinationNewReplicaIndex=0, master=[2a02:6b8:c0b:3e1c:0:1688:f829:2]:5701, initialPartitionVersion=318, partitionVersionIncrement=2, status=ACTIVE}, namespace=DistributedObjectNamespace{service='hz:impl:mapService', objectName='serpSetLock'}, replicaVersions=[10896, 0, 0, 0, 0, 0]

And one that fails:

>2021-12-18 18:02:59 TRACE com.hazelcast.internal.partition.operation.MigrationRequestOperation [2a02:6b8:c0b:3e1c:0:1688:f829:2]:5701 [dev] [4.2.2] Invoking MigrationOperation for namespaces [DistributedObjectNamespace{service='hz:impl:mapService', objectName='observationResultCache'}] and MigrationInfo{uuid=906bb551-ce85-48f7-816e-c81e6edde40f, partitionId=4, source=[2a02:6b8:c0b:3e1c:0:1688:f829:2]:5701 - 47d87e75-eb4f-4b47-9a9b-7106ab1f4b94, sourceCurrentReplicaIndex=0, sourceNewReplicaIndex=3, destination=[2a02:6b8:c0c:4ca0:0:1688:f8f6:0]:5701 - fac9cfd8-0c35-496f-a2e6-fb95ebc85c01, destinationCurrentReplicaIndex=-1, destinationNewReplicaIndex=0, master=[2a02:6b8:c0b:3e1c:0:1688:f829:2]:5701, initialPartitionVersion=318, partitionVersionIncrement=2, status=ACTIVE}, lastFragment: false

>2021-12-18 18:27:59 WARN com.hazelcast.internal.partition.operation.MigrationRequestOperation [2a02:6b8:c0b:3e1c:0:1688:f829:2]:5701 [dev] [4.2.2] Failure while executing MigrationInfo{uuid=906bb551-ce85-48f7-816e-c81e6edde40f, partitionId=4, source=[2a02:6b8:c0b:3e1c:0:1688:f829:2]:5701 - 47d87e75-eb4f-4b47-9a9b-7106ab1f4b94, sourceCurrentReplicaIndex=0, sourceNewReplicaIndex=3, destination=[2a02:6b8:c0c:4ca0:0:1688:f8f6:0]:5701 - fac9cfd8-0c35-496f-a2e6-fb95ebc85c01, destinationCurrentReplicaIndex=-1, destinationNewReplicaIndex=0, master=[2a02:6b8:c0b:3e1c:0:1688:f829:2]:5701, initialPartitionVersion=318, partitionVersionIncrement=2, status=ACTIVE} com.hazelcast.core.OperationTimeoutException: MigrationOperation invocation failed to complete due to operation-heartbeat-timeout. Current time: 2021-12-18 18:27:59.477. Start time: 2021-12-18 18:02:59.385. Total elapsed time: 1500092 ms. Last operation heartbeat: never. Last operation heartbeat from member: 2021-12-18 18:27:48.967

Is there any way to find where the last migration hangs? I can collect stack traces, but I have no idea what to look for.
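
For reference, one way to collect those stack traces on both the migration source and destination (a sketch assuming a JDK with jps/jstack on the members; `<pid>` and the thread-name patterns are illustrative, not taken from this cluster):

```shell
# Find the Hazelcast member JVM (jps ships with the JDK)
jps -l

# Dump all threads of that JVM; repeat a few times ~10 s apart,
# so threads stuck in the same frames across dumps stand out
jstack -l <pid> > member-threads-$(date +%s).txt

# Hazelcast names its threads after the instance, e.g.
# "hz.<instance>.partition-operation.thread-N" and a migration thread;
# grep for the partition-related ones in every dump
grep -n -A 15 "migration\|partition-operation" member-threads-*.txt
```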

On Sunday, December 19, 2021 at 4:00:45 PM UTC+3, Volkov Sergey wrote:

Volkov Sergey

Dec 19, 2021, 10:21:11 AM12/19/21
to Hazelcast
I got stack traces from the migration source and destination while the migration was hanging.

But there are no traces of MigrationOperation or MigrationRequestOperation in them.

On Sunday, December 19, 2021 at 4:51:36 PM UTC+3, Volkov Sergey wrote:

Volkov Sergey

Dec 23, 2021, 3:31:50 AM12/23/21
to Hazelcast
It looks like setting hazelcast.initial.min.cluster.size=2 solves the problem, but I have no idea how it prevents the cluster from hanging in endless migration.
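
For anyone trying the same workaround: a sketch of how that property can be set, either as a JVM flag (-Dhazelcast.initial.min.cluster.size=2) or declaratively in hazelcast.xml:

```xml
<hazelcast xmlns="http://www.hazelcast.com/schema/config">
    <properties>
        <!-- Block HazelcastInstance startup until at least 2 members
             have joined the cluster -->
        <property name="hazelcast.initial.min.cluster.size">2</property>
    </properties>
</hazelcast>
```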

On Sunday, December 19, 2021 at 6:21:11 PM UTC+3, Volkov Sergey wrote:

Volkov Sergey

Dec 28, 2021, 5:21:08 AM12/28/21
to Hazelcast
Actually, it doesn't. Sometimes the cluster still hangs in migration.
Any ideas on how to debug this problem?

On Thursday, December 23, 2021 at 11:31:50 AM UTC+3, Volkov Sergey wrote:

Neil Stevenson

Jan 10, 2022, 6:08:31 AM1/10/22
to Hazelcast
Hi
Is this problem still occurring?

If so, are you able to try with the latest Hazelcast (5.0.2) in case it's a fixed issue?
If it still occurs, are there any logs you can share, please? It would also be useful to know whether having no (or fewer) lite members connected changes the behaviour.


Neil

Volkov Sergey

Jan 11, 2022, 7:51:06 AM1/11/22
to Hazelcast
> Is this problem still occurring?
Yes. I can now reproduce the problem in a dev environment with 3 data nodes and 2 lite members.
I think the problem reproduces when I have around 1k entries in a map with values of around 1 MB each.

> If so, are you able to try with the latest Hazelcast (5.0.2) in case it's a fixed issue?
I reproduced it with 5.0.2.

> If it still occurs, are there any logs you can share, please?

These are logs from 4.2.2:

https://gist.github.com/volkov/01d643bf8e42cf941367e5dd9f721773 - full com.hazelcast logs with trace enabled for com.hazelcast.internal.partition and without PartitionBackupReplicaAntiEntropy*

https://gist.github.com/volkov/086cae323efe3d1795f31057aa931d8c#file-hz_single_migration-log-L9 -  filtered by single migration uuid=f6efba55-c3a4-4cf2-9587-a3ea37bb74d1

https://gist.github.com/volkov/608cc6c6fe7fbe15d6e58c306664e961#file-hazelcast_warns-log-L6 - an interesting warning: right after the migration started, the migration destination suspected the migration source to be dead because of a missing heartbeat for one minute.


On Monday, January 10, 2022 at 2:08:31 PM UTC+3, ne...@hazelcast.com wrote:

Volkov Sergey

Jan 14, 2022, 5:31:19 AM1/14/22
to Hazelcast
I think the problem was with hazelcast.io.thread.count.
It was set to 20; after restoring the default, the problem disappeared.

Probably 20 threads cause heavy load on the source for 1 minute.
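
For completeness, a sketch of the change (assuming the standard hazelcast.io.thread.count property; the value 20 was our override, and simply removing the override restores the built-in default):

```xml
<hazelcast xmlns="http://www.hazelcast.com/schema/config">
    <properties>
        <!-- Was 20 in our setup; this property sets the number of
             networking (IO) threads per member. Omitting it entirely
             leaves Hazelcast on its default. -->
        <property name="hazelcast.io.thread.count">3</property>
    </properties>
</hazelcast>
```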

On Tuesday, January 11, 2022 at 3:51:06 PM UTC+3, Volkov Sergey wrote:

Jiri Holusa

Jan 14, 2022, 5:56:36 AM1/14/22
to Hazelcast
Hi,

Quick question: why did you change the default IO thread count? Did it give better performance or something?

Jiri

Volkov Sergey

Jan 18, 2022, 10:03:15 AM1/18/22
to Hazelcast
Quick answer: I don't know; we have had a different IO thread count for years.

On Friday, January 14, 2022 at 1:56:36 PM UTC+3, ji...@hazelcast.com wrote:

Jiri Holusa

Jan 19, 2022, 4:02:35 AM1/19/22
to Hazelcast
Hehe, got it :) Let's see if you observe any bad behavior with the thread count back to the default.

Jiri