If I shut down node1, then node2 will take over as the active node. If I then shut down node2, the cluster will be stopped (obviously); however, if I then start only node1, the cluster never recovers. Not only does it not recover without node2, but I don't see an easy way to bring the cluster into service with the cluster manager. The only way I can recover the cluster in this scenario is to start node2, but that does not seem (to me) like real high availability. IMO I should be able to set a policy, or have a reasonably easy way to bring the cluster back online (perhaps after a waiting period), even if node2 never recovers.
However, the witness was available at that time, which makes me suspect that this is a permission issue, that is, the witness share is available to the cluster but not the cluster service accounts on each node. Is that possible?
Two-node clusters will always involve a compromise in terms of HA. The witness file share establishes quorum, but it cannot cover every scenario. A cluster with three nodes (or another odd number) would provide better fault tolerance.
If the quorum witness share is accessible to the online node, it should definitely be able to bring the cluster online. This is standard WSFC behavior. If your cluster is not starting and the witness share is online, something else must be preventing it from starting. Check the System and FailoverClustering event logs for errors.
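When the event logs are not enough, the cluster debug log usually is. A minimal sketch using the built-in Get-ClusterLog cmdlet; the destination path and time span are just examples:

```powershell
# Generate the cluster debug log from each node and copy the files
# to a local folder (the path below is only an example).
Get-ClusterLog -Destination C:\Temp\ClusterLogs -UseLocalTime
```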
As a general rule when you configure a quorum, the voting elements in the cluster should be an odd number. Therefore, if the cluster contains an even number of voting nodes, you should configure a disk witness or a file share witness. The cluster will be able to sustain one additional node down. In addition, adding a witness vote enables the cluster to continue running if half the cluster nodes simultaneously go down or are disconnected.
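You can check how the votes described above are currently assigned with the FailoverClusters PowerShell module; a quick sketch:

```powershell
# NodeWeight is the configured vote for each node; DynamicWeight is the
# vote the cluster is currently counting when dynamic quorum is enabled.
Get-ClusterNode | Format-Table Name, State, NodeWeight, DynamicWeight
```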
A disk witness is usually recommended if all nodes can see the disk. A file share witness is recommended when you need to consider multisite disaster recovery with replicated storage. Configuring a disk witness with replicated storage is possible only if the storage vendor supports read-write access from all sites to the replicated storage. A disk witness isn't supported with Storage Spaces Direct.
If you configure a file share witness or a cloud witness and then shut down all nodes for maintenance or some other reason, make sure you start the cluster service on the last node that went down (the last man standing), because the latest copy of the cluster database is not stored in those witnesses.
You might want to remove votes from nodes in certain disaster recovery configurations. For example, in a multisite cluster, you could remove votes from the nodes in a backup site so that those nodes do not affect quorum calculations. This configuration is recommended only for manual failover across sites. For more information, see Quorum considerations for disaster recovery configurations later in this topic.
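Removing a vote from a backup-site node can be sketched like this; the node name is a placeholder for your own:

```powershell
# Remove the vote from a backup-site node so it no longer
# affects quorum calculations (0 = no vote, 1 = vote).
(Get-ClusterNode -Name "SiteB-Node1").NodeWeight = 0
```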
The cluster software automatically configures the quorum for a new cluster, based on the number of nodes configured and the availability of shared storage. This is usually the most appropriate quorum configuration for that cluster. However, it is a good idea to review the quorum configuration after the cluster is created, before placing the cluster into production. To view the detailed cluster quorum configuration, you can use the Validate a Configuration Wizard, or the Test-Cluster Windows PowerShell cmdlet, to run the Validate Quorum Configuration test. In Failover Cluster Manager, the basic quorum configuration is displayed in the summary information for the selected cluster, or you can review the information about quorum resources that returns when you run the Get-ClusterQuorum Windows PowerShell cmdlet.
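The two cmdlets named above can be combined into a quick review, sketched here:

```powershell
# Run only the quorum validation test rather than the full validation suite.
Test-Cluster -Include "Validate Quorum Configuration"

# Show the current quorum type and witness resource for the cluster.
Get-ClusterQuorum
```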
On the Select Quorum Witness page, select an option to configure a disk witness or a file share witness. The wizard indicates the witness selection options that are recommended for your cluster.
You can also select Do not configure a quorum witness and then complete the wizard. If you have an even number of voting nodes in your cluster, this may not be a recommended configuration.
You can also select No Nodes. This is generally not recommended, because it does not allow nodes to participate in quorum voting, and it requires configuring a disk witness. This disk witness becomes the single point of failure for the cluster.
On the Configure Quorum Management page, you can enable or disable the Allow cluster to dynamically manage the assignment of node votes option. Selecting this option generally increases the availability of the cluster. By default the option is enabled, and it is strongly recommended to not disable this option. This option allows the cluster to continue running in failure scenarios that are not possible when this option is disabled.
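The same option is exposed outside the wizard as the DynamicQuorum cluster common property; a sketch of checking it:

```powershell
# 1 = dynamic vote management enabled (the default);
# 0 = disabled (strongly discouraged, as noted above).
(Get-Cluster).DynamicQuorum
```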
On the Select Quorum Witness page, select an option to configure a disk witness, file share witness or a cloud witness. The wizard indicates the witness selection options that are recommended for your cluster.
You can also select Do not configure a quorum witness, and then complete the wizard. If you have an even number of voting nodes in your cluster, this may not be a recommended configuration.
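Configuring the cloud witness option mentioned above can also be done from PowerShell. A sketch, assuming an Azure storage account; the account name and key are placeholders:

```powershell
# Configure a cloud witness backed by an Azure storage account.
Set-ClusterQuorum -CloudWitness -AccountName "mystorageaccount" -AccessKey "<storage-access-key>"
```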
The following example changes the quorum configuration on the local cluster to a node majority with witness configuration. The file share resource named \\CONTOSO-FS\fsw is configured as a file share witness.
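The example command itself appears to have been dropped from the excerpt; a sketch of it, using the Set-ClusterQuorum cmdlet and the share named above:

```powershell
# Configure node majority with a file share witness on \\CONTOSO-FS\fsw.
Set-ClusterQuorum -NodeAndFileShareMajority "\\CONTOSO-FS\fsw"
```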
After you determine that you cannot recover your cluster by bringing the nodes or quorum witness to a healthy state, forcing your cluster to start becomes necessary. Forcing the cluster to start overrides your cluster quorum configuration settings and starts the cluster in ForceQuorum mode.
Forcing a cluster to start when it does not have quorum may be especially useful in a multisite cluster. Consider a disaster recovery scenario with a cluster that contains separately located primary and backup sites, SiteA and SiteB. If there is a genuine disaster at SiteA, it could take a significant amount of time for the site to come back online. You would likely want to force SiteB to come online, even though it does not have quorum.
When a cluster is started in ForceQuorum mode, and after it regains sufficient quorum votes, the cluster automatically leaves the forced state, and it behaves normally. Hence, it is not necessary to start the cluster again normally. If the cluster loses a node and it loses quorum, it goes offline again because it is no longer in the forced state. To bring it back online when it does not have quorum requires forcing the cluster to start without quorum.
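Force-starting the cluster on a surviving node can be sketched as follows; the node name is a placeholder:

```powershell
# Start the cluster service in ForceQuorum mode on this node,
# overriding the quorum configuration.
Start-ClusterNode -Name "Node1" -FixQuorum

# Equivalent legacy syntax:
# net start clussvc /forcequorum
```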
After you have force started the cluster on a node, it is necessary to start any remaining nodes in your cluster with a setting to prevent quorum. A node started with a setting that prevents quorum indicates to the Cluster service to join an existing running cluster instead of forming a new cluster instance. This prevents the remaining nodes from forming a split cluster that contains two competing instances.
This becomes necessary when you need to recover your cluster in some multisite disaster recovery scenarios after you have force started the cluster on your backup site, SiteB. To join the force started cluster in SiteB, the nodes in your primary site, SiteA, need to be started with the quorum prevented.
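Starting the remaining nodes with quorum prevented can be sketched like this; the node name is a placeholder:

```powershell
# Start this node so it joins the existing force-started cluster
# instead of forming a competing cluster instance.
Start-ClusterNode -Name "SiteA-Node1" -PreventQuorum

# Equivalent legacy syntax:
# net start clussvc /preventquorum
```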
In this configuration, the cluster consists of a primary site, SiteA, and a backup (recovery) site, SiteB. Clustered roles are hosted on SiteA. Because of the cluster quorum configuration, if a failure occurs at all nodes in SiteA, the cluster stops functioning. In this scenario the administrator must manually fail over the cluster services to SiteB and perform additional steps to recover the cluster.
So a client had patched their 2-node S2D cluster this Sunday. When the 2nd node came up, it would not join the cluster. The client started troubleshooting without realizing that the quorum File Share Witness was not working: someone had managed to delete the folder that the File Share Witness was on. At first, the first node that was patched and rebooted was working, but at some point both nodes were rebooted over and over again.
This caused the cluster not to come online again. Several fixes were tried: forcing the cluster online, setting the quorum node vote to only one node, and so on. At one point the 2nd node even ended up evicted from the cluster. The logs kept screaming that there was not a majority vote to bring the cluster online.
First published on MSDN on Apr 13, 2018
One of the quorum models for Failover Clustering is the ability to use a file share as a witness resource. As a recap, the File Share Witness is designated a vote in the Cluster when needed and can act as a tie breaker in case there is ever a split between nodes (mainly seen in multi-site scenarios).
However, over the years, we have seen this share put on a DFS share. This is an awfully bad idea and one not supported by Microsoft. Please do not take this as a stance against DFS. DFS is a great feature with numerous deployments out there. I am specifically talking about putting a cluster File Share Witness on a DFS share.
Let me give you an example of what can happen on a Windows Server 2016 Cluster. Let's take a 4-node multisite cluster with two nodes at each site running SQL FCI. Each side has shared drives using some sort of storage replication (Storage Replica, for the Ned fans). The cluster connects to a file share witness that is part of a DFS share. So, it would look something like this.
All is fine, dandy and working smoothly. But this is what can happen if there is some sort of break in communications between the two sites.
What has happened is a loss of connectivity between the two sites. Site A already has the file share witness and places a lock on it so no one else can come along and take it. Because it is already running SQL, it stays status quo. Over on Site B is where the problem occurs. Since it cannot communicate with Site A, it has no idea what is going on. The Site B nodes do what they are supposed to do, which is to arbitrate for the Cluster Group and the witness resource. A node goes to connect, the DFS referral sends it to one of the other machines, and it connects. The Site B nodes see they have the witness, so they start bringing everything online, which includes SQL and its databases. For those not so familiar with Failover Clustering and all its jargon, this is known as a split brain.
So as far as each side's view of membership goes, they have quorum, and SQL clients are connecting and writing/updating the databases. When connectivity is restored between the sites and we get back to our normal cluster view again, we think everything is all roses.
However, remember, each side had the SQL databases being written to. Once the storage replication begins again, a very possible outcome is that everything that was written on one of the sides is now gone.
So as pointed out earlier:
This is an awfully bad idea.
Microsoft does not support running a File Share Witness on DFS shares.
For Windows Server 2019, additional safeguards have been added to help protect against misconfigurations. We have added logic to check whether the share is going to DFS.
In Failover Cluster Manager, if you go through the quorum configuration wizard and try to use a DFS share, it will fail on the Summary Page with this dialog:
If you attempt to set it through PowerShell, it will fail with this error: