Weird cluster behavior


Mike Leone

Jul 23, 2025, 3:04:21 PM
to NTSysAdmin
So I am making a cluster on my Nutanix environment (so all VMs). I've done this a dozen and more times, but always with Win 2016/2019/2022. Now I'm trying to do it with Win 2025, and it's responding differently.

I create the shared storage using iSCSI. I present to both hosts. Both see it. I initialize on 1 host, format them. All fine.

I go to create the cluster. I run the validation tests, all tests pass. I have it create the cluster, all good.

2 hosts - host #27 and host #28. 5 disks, 1 is quorum.

So now, there's only 1 role (User Manager - apparently that's something new in Win 2025?). That role and all disks are on host #28. So now I go to reboot host #28, figuring things should Just Move, as in the past (yes, I know it's always best to move everything gracefully before rebooting a node, but I gotta test ...).

So now I'm watching in Failover Manager on host #27. I reboot host #28.
1 disk moved over to host #27, the other 4 say "Online Pending". And they don't come back online until host #28 does. At which point, all disks and roles are on host #28 again, and I have cluster events with ID 1795:

Cluster physical disk resource encountered an error while attempting to terminate.

Physical Disk Resource Name: Cluster Disk 5
Device Number: 3
Device Guid: {0985144c-ad45-ae83-b1bd-dddc71656c98}
Error Code: 1168
Reason: ReleaseDiskPRFailure

Cluster disk 1 thru 5 (#1 is quorum), all show that error.
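
When the same event shows up for several disks, it can help to pull the fields out of the event text and compare them side by side. The following is a minimal Python sketch, assuming the event body has been pasted as plain text. The `parse_event` helper and the one-entry `KNOWN_CODES` table are illustrative names, not part of any cluster tooling. Win32 error 1168 is ERROR_NOT_FOUND ("Element not found"); the reason name suggests a failure while releasing the disk's persistent reservation (PR), but that reading is an assumption.

```python
import re

# Win32 error 1168 is ERROR_NOT_FOUND ("Element not found").
# Illustrative table only; extend it by hand as needed.
KNOWN_CODES = {1168: "ERROR_NOT_FOUND (Element not found)"}

# Event 1795 body as posted in this thread.
EVENT_TEXT = """\
Physical Disk Resource Name: Cluster Disk 5
Device Number: 3
Device Guid: {0985144c-ad45-ae83-b1bd-dddc71656c98}
Error Code: 1168
Reason: ReleaseDiskPRFailure
"""

def parse_event(text):
    """Split 'Key: Value' lines from an event body into a dict."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"\s*([^:]+):\s*(.*)$", line)
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
    return fields

event = parse_event(EVENT_TEXT)
code = int(event["Error Code"])
meaning = KNOWN_CODES.get(code, "unknown code")
summary = f'{event["Physical Disk Resource Name"]}: {event["Reason"]}, {meaning}'
print(summary)
```

Running `parse_event` over each disk's event would show quickly whether all five report the same code and reason.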

The weird part? If I go to Disks, right click and Move Available Storage to host #27, they all move perfectly. No errors, no online pending, nothing. Textbook perfect.

If I reboot host #27? Disks 2, 3, 5 move immediately to #28. The other 2 are Online Pending. When #27 comes back online, disks 2, 3, 4, 5 go back to #27, and #1 stays on #28 (mind you, this was one of the ones that said Online Pending and didn't move).

Role went back to #27, too.

What gives? All the validation tests passed. I even ran the validate  again, after the tests above, and it passed. Again. (although it says the storage tests are not applicable).

Any idea what gives? I have another cluster (a SQL cluster, like this is eventually going to be) that doesn't exhibit this behavior. And I made them the same way - same iSCSI, same procedure to create the cluster ...





--

Mike. Leone, <mailto:tur...@mike-leone.com>

PGP Fingerprint: 0AA8 DC47 CB63 AE3F C739 6BF9 9AB4 1EF6 5AA5 BCDF
Photo Gallery: <http://www.flickr.com/photos/mikeleonephotos>

Philip Elder

Jul 23, 2025, 4:02:49 PM
to ntsys...@googlegroups.com

Check your storage subnet(s) and VLAN Tag paths to make sure everything lines up.

 

Node 1 is set up differently from Node 2 if it only works on one of them. Something in the setup on, or relative to, Node 1 is whacked.

 

Philip Elder MCTS

Senior Technical Architect

Microsoft High Availability MVP

MPECS Inc.

E-mail: Phili...@mpecsinc.ca

Phone: +1 (780) 458-2028

Web: www.mpecsinc.com

Blog: blog.mpecsinc.com

Twitter: Twitter.com/MPECSInc

Teams: Phili...@MPECSInc.Cloud

 

Please note: Although we may sometimes respond to email, text and phone calls instantly at all hours of the day, our regular business hours are 8:00 AM - 5:00 PM, Monday thru Friday.


Lieckfeldt.Sven

Jul 24, 2025, 9:46:28 AM
to ntsys...@googlegroups.com

Hi Mike,

 

Maybe you've run into a bug that seems to be in the Failover Cluster Service in Windows Server 2025. According to MS, a fix is coming in the next couple of months; however, it’s not officially documented, except in the comments on the Exchange Team Blog post "Released: May 2025 Exchange Server Hotfix Updates" (Microsoft Community Hub).

Here is my example: Exchange DAG Cluster is using the Failover Cluster Service from the OS. When the cluster group is not moved to the other node before restarting, all DBs might blow up, if the rebooted server holds the cluster group. When the cluster group is moved before reboot, everything is fine.

This leads basically to a DOA feature, because an outage doesn’t ask kindly to move this role before taking the node away 😃

 

Cheers,

Sven 

 


Philip Elder

Jul 24, 2025, 3:47:59 PM
to ntsys...@googlegroups.com

I’ve reached out to the team. I’ll let all y’all know once I hear back and share what I can.

 


 

Philip Elder

Jul 24, 2025, 4:09:55 PM
to ntsys...@googlegroups.com

 


Mike Leone

Jul 24, 2025, 5:17:22 PM
to NTSysAdmin
No, no BitLocker on these drives.

     

Philip Elder

Jul 24, 2025, 5:34:44 PM
to ntsys...@googlegroups.com

I didn’t think so.

 

I’m waiting on feedback. Will follow-up one way or the other.

 


 

Philip Elder

Jul 24, 2025, 7:08:19 PM
to ntsys...@googlegroups.com

Mike,

 

The reply back from the team:

[QUOTE]

If there isn’t much running on this cluster, I would suggest trying to get the storage validation running. I think that can be accomplished by putting the disks into maintenance mode, if my memory serves me correctly – then storage validation should run on the disks, and PR issues should be surfaced.

 

I suspect that the issue could be LUN masking – maybe only one of the nodes can see the LUNs?

[/QUOTE]

 

Please post the results if you can.

 

Thanks,

 


Mike Leone

Jul 25, 2025, 10:56:41 AM
to ntsys...@googlegroups.com

We use disk quorum, and you can't turn on maintenance mode for that disk. But I did put the others into maintenance mode, and ran the validation again:

* The disks are already clustered and currently Online in the cluster. When testing a working cluster, ensure that the disks that you want to test are Offline in the cluster.

Maintenance mode leaves the disk Online, and none of the storage tests ran, because the disks were Online.
So I turned off maintenance mode, took the disks offline. That way, all 5 disks went offline.
I then ran the validation again. 

It seems to think my quorum disk (which is only 1G in size) has no free space ...

image.png

Bringing the disks online shows the actual use and capacity:

image.png


All other disk tests passed ...


image.png

I ran these tests from host #27. So I went to host #28 (owner of the role and all disks, as shown above), and just rebooted it, to see if it would gracefully fail over to host #27. (I have another cluster using this same configuration, cluster storage via iSCSI from the same Nutanix cluster as above, and it works perfectly.)

This is while host #28 is rebooting:

image.png

And after host #28 comes back up, everything went back to it.

image.png

So same problem as before. I see cluster errors for each disk ...

image.png



I had disconnected all drives in iSCSI. I even deleted the Discovery Portal, and re-entered it. I had even deleted all the disks in the Nutanix Volume Group, and created new ones, which were then presented.

I am at a loss ..... The way Nutanix works, storage is (or can be) presented as iSCSI.

image.png

That is the same discovery target I am using on the cluster that works ...
Access to this Volume Group is presented to the 2 IPs of the nodes:

image.png

Z:\>nslookup 10.64.126.224
Server:  DC1WRK014.wrk.ads.pha.phila.gov
Address:  10.64.7.95

Name:    DC1DBS027.wrk.ads.pha.phila.gov
Address:  10.64.126.224


Z:\>nslookup 10.64.126.225
Server:  DC1WRK014.wrk.ads.pha.phila.gov
Address:  10.64.7.95

Name:    DC1DBS028.wrk.ads.pha.phila.gov
Address:  10.64.126.225

(kinda obviously, otherwise I wouldn't be seeing the disks in iSCSI on both hosts. All disks are CONNECTED in iSCSI, but only brought online in Disk Manager on host #28 - yes, I tried having them online in disk manager on both nodes, same failed results ...)
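
The LUN masking question raised earlier comes down to whether every initiator address that needs the disks is on the Volume Group's allow list. Below is a minimal Python sketch of that comparison. The two node addresses are taken from the nslookup output above; the `allowed` set is a stand-in that would have to be copied by hand from the Volume Group settings, and nothing here queries Nutanix. `missing_initiators` is a hypothetical helper name.

```python
import ipaddress

# Node addresses as resolved by nslookup in this thread.
nodes = {
    "DC1DBS027": ipaddress.ip_address("10.64.126.224"),
    "DC1DBS028": ipaddress.ip_address("10.64.126.225"),
}

# Assumption: copied by hand from the Volume Group's client allow list.
allowed = {
    ipaddress.ip_address("10.64.126.224"),
    ipaddress.ip_address("10.64.126.225"),
}

def missing_initiators(nodes, allowed):
    """Return the names of nodes whose address is not on the allow list."""
    return sorted(name for name, addr in nodes.items() if addr not in allowed)

missing = missing_initiators(nodes, allowed)
print("missing from allow list:", missing or "none")
```

With both node addresses on the list, as described in this post, the check reports nothing missing, which matches the observation that both hosts see the disks in iSCSI.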

I am stumped, at this point. Especially since I have an earlier cluster with these same settings that is working just fine ...



Mike Leone

Jul 25, 2025, 11:30:36 AM
to ntsys...@googlegroups.com
SO ... 

if Quorum disk is on Host #27, and all other disks are on host #28, and I reboot host #27 ... all is fine. Disks move to host #28, role (User Manager) goes over to host #28.
If Quorum disk is on host #28, and all other disks are on host #27, and I reboot host #28 ... the disks stay on host #27. The QUORUM does NOT go to host #27, it comes back to host #28 (as does the role) ...

Role User Manager has no preferred owner. So I checked off both nodes as preferred owners (hey, I'm clutching at straws here ...). 

Failback is set to "Prevent Failback", so I left it at that.
I checked all disks; "Possible Owners" is both nodes.

So I reboot host #28 ... The quorum disk and role went to host #27 ... and stayed there, as it should! 

So now, with role, quorum, and all disks on host #27, I decide to try rebooting host #27 ...

Role, quorum, and all disks go over to host #28, exactly as they should.

In other words ... all working?!?!

I tried again. 
Role, quorum, disks all on host #28. Reboot host #28.  All moved back to host #27, exactly as it should ...

Role, quorum, disks all on host #27. Reboot host #27. Problem is back: quorum went to host #28, role stayed on host #27, as did the disks. They TRIED to come online on host #28, and 2 did. But as soon as host #27 came back, all the disks went back to host #27 ...
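
The reboot tests in this post can be written down as data to check whether any single starting state explains the failures. The following is a minimal Python sketch; the encoding of each run (which node was rebooted, what it owned, whether both nodes had been ticked as preferred owners yet, and the outcome) is one reading of the description above, not output from any tool.

```python
from collections import defaultdict

# Each run: (node rebooted, what that node owned, preferred owners set?, outcome)
runs = [
    ("27", "quorum+role", False, "ok"),
    ("28", "quorum+role", False, "failed"),
    ("28", "quorum+role", True, "ok"),
    ("27", "everything", True, "ok"),
    ("28", "everything", True, "ok"),
    ("27", "everything", True, "failed"),
]

# Collect every outcome seen for each distinct starting state.
outcomes = defaultdict(set)
for node, owned, preferred, result in runs:
    outcomes[(node, owned, preferred)].add(result)

# A starting state that produced both results cannot be explained by that state alone.
inconsistent = sorted(key for key, seen in outcomes.items() if len(seen) > 1)
for node, owned, preferred in inconsistent:
    print(f"host #{node} owning {owned} (preferred owners set: {preferred}): both ok and failed")
```

On this reading, rebooting host #27 while it owned everything both worked and failed with no configuration change in between, which points toward something intermittent rather than a fixed setting.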

Don't ask me, I just work here ....

At this point, I may just destroy everything. Cluster, delete the nodes, delete the Volume Group, start all over. (which I think I've already tried ....)

It's practically Friday lunchtime, maybe I can just let it sit until Monday ... I have other tasks to occupy my afternoon ...










Philip Elder

Jul 25, 2025, 11:51:25 AM
to ntsys...@googlegroups.com

Mike,

 

Does the Cluster Name Object IP require access to the Nutanix iSCSI Target?

 

Is that how the other cluster is set up? Three IPs with access: one for each node and one for the CNO?

 


Mike Leone

Jul 25, 2025, 2:44:32 PM
to ntsys...@googlegroups.com
On Fri, Jul 25, 2025 at 11:51 AM Philip Elder <Phili...@mpecsinc.ca> wrote:

Mike,

 

Does the Cluster Name Object IP require access to the Nutanix iSCSI Target?


It shouldn't. I mean, I can try adding that IP address of the cluster object to the list ...

Role and quorum disk are on host #28, all other disks are on host #27, and I reboot host #28 ... Role and quorum moved to host #27, and the disks stayed where they were. After host #28 came back up, quorum moved to #28. Role and disks stayed on #27.


Role and disks on #27, quorum disk on #28 (I don't see a way to tell just the quorum disk to move). Reboot #27 - All good - role, quorum and all disks go to #28.

Role, quorum disk, and all other disks on #28. Reboot #28 ... all good again - role, quorum, all disks go to #27.
Role, quorum disk, and all other disks on #27. Reboot #27 ... Same problem. "Online Pending" for all disks. Quorum goes to #28, role and disks go back to #27, they never make it to #28. Yet they did above ...

