DPM 2010 RC Backups Seem to Crash Cluster Service

Rich McCue

Mar 22, 2010, 5:35:01 PM

Hi,

We are protecting a number of Hyper-V (2008 R2 Enterprise) hosts using DPM
2010 RC. We have two clusters of 4 nodes, with each node running 4 virtual
machines. Our storage is on a NetApp 3140 connected via Emulex HBAs, and all
the servers run the NetApp DSM 3.3.1, NetApp Host Utilities 5.2 and SnapDrive
6.2. Everything seems fine under normal conditions and when we create
recovery points manually, but when our scheduled DPM backups run at 8 PM we
find that the affected server appears to reboot and all of its guests move to
another node. The reboot doesn't appear to generate a blue screen, however.

There appears to be no pattern to this. I've tried running different
guests on different hosts but nothing seems to help. The only common factor
seems to be that multiple backups are occurring at the same time on the
nodes that fail.

Here is one of the events logged when the problem occurs:

Time: 20:00:44
EventID: 1135
Cluster node 'SERVER' was removed from the active failover cluster
membership. The Cluster service on this node may have stopped. This could
also be due to the node having lost communication with other active nodes in
the failover cluster. Run the Validate a Configuration wizard to check your
network configuration. If the condition persists, check for hardware or
software errors related to the network adapters on this node. Also check for
failures in any other network components to which the node is connected such
as hubs, switches, or bridges.

Time: 20:03:00
EventID: 5121
Cluster Shared Volume 'Volume1' ('HyperV-CSV') is no longer directly
accessible from this cluster node. I/O access will be redirected to the
storage device over the network through the node that owns the volume. This
may result in degraded performance. If redirected access is turned on for
this volume, please turn it off. If redirected access is turned off, please
troubleshoot this node's connectivity to the storage device and I/O will
resume to a healthy state once connectivity to the storage device is
reestablished.

Time: 20:04:50
EventID: 41
The system has rebooted without cleanly shutting down first. This error
could be caused if the system stopped responding, crashed, or lost power
unexpectedly.
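
To make the spacing of these three events easier to see, here is a minimal sketch that computes the time elapsed since the first one. The times, event IDs and summaries are taken from the events above; the helper name `elapsed_seconds` is purely illustrative, and the sketch assumes all three stamps belong to the same evening.

```python
from datetime import datetime

# Event times and IDs copied from the post above (assumed to be the same evening).
events = [
    ("20:00:44", 1135, "node removed from active cluster membership"),
    ("20:03:00", 5121, "CSV no longer directly accessible, I/O redirected"),
    ("20:04:50", 41, "system rebooted without clean shutdown"),
]


def elapsed_seconds(start: str, end: str) -> int:
    """Seconds between two HH:MM:SS stamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


first_time = events[0][0]
for stamp, event_id, summary in events:
    offset = elapsed_seconds(first_time, stamp)
    print(f"+{offset:>3}s  Event {event_id:<4}  {summary}")
```

On these values the membership loss (1135) is logged first, the CSV redirection (5121) follows 136 seconds later, and the unclean reboot (41) 246 seconds after the first event.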

I've installed the 2008 R2 hotfixes that should resolve the CSV problems on
both the host and guest operating systems, but we still get this strange
restart problem.

Can anyone shed any light on this problem?

Thanks,
Richard

Shyama Hembram[MSFT]

Mar 23, 2010, 1:48:22 AM

Are the host and cluster communications happening on the same network?
Is the cluster communication happening over a Gbps LAN? Can you increase the
cluster heartbeat timeout by running the following on one of the nodes of the
cluster:

cluster.exe /prop ClusSvcHangTimeout=120

--
Thanks
Shyama
[This posting is provided "AS IS" with no warranties, and confers no rights]

Rich McCue

Mar 23, 2010, 4:13:02 AM

Hi,

Cluster comms and management traffic share the same 1 Gbps NIC, and the
Hyper-V guests have their own dedicated 1 Gbps NIC. Unfortunately we are
limited to two NICs per blade, so we can't separate this any further.

I've now updated the timeout as per your recommendation. I'll update this
post later once I know whether the system is still going down.

Thanks,
Richard

Rich McCue

Mar 25, 2010, 5:08:01 AM

Hi again,

I've come in today to check the setup and we have experienced the cluster
crash once again. Below is the event chain up to the failure:

20:00 DPM backup of VMs begins

20:04 Navssprv starts the VSS hardware provider
20:04 SnapDrive states that a snapshot has been successfully created
20:04 Navssprv - 'Data ONTAP VSS hardware provider has successfully
completed CommitSnapshots for SnapshotSetId
{da581d58-74d3-45f6-b5df-2e86e5e0ae3c} in 234 milliseconds'
20:05 Navssprv and SnapDrive map a LUN.

Time: 20:05:21
EventID: 1000
Data: Faulting application name: clussvc.exe, version: 6.1.7600.16385, time
stamp: 0x4a5bc614
Faulting module name: KERNELBASE.dll, version: 6.1.7600.16385, time stamp:
0x4a5bdfe0
Exception code: 0x80000003
Fault offset: 0x0000000000032442
Faulting process id: 0x3fbc
Faulting application start time: 0x01cac9317842c77c
Faulting application path: C:\Windows\Cluster\clussvc.exe
Faulting module path: C:\Windows\system32\KERNELBASE.dll
Report Id: 92cbeed2-3780-11df-88f4-001e0b61d0a2

Time: 20:05:21
EventID: 1001
Data: Fault bucket , type 0
Event Name: APPCRASH
Response: Not available
Cab Id: 0

Problem signature:
P1: clussvc.exe
P2: 6.1.7600.16385
P3: 4a5bc614
P4: KERNELBASE.dll
P5: 6.1.7600.16385
P6: 4a5bdfe0
P7: 80000003
P8: 0000000000032442
P9:
P10:

Attached files:

These files may be available here:
C:\ProgramData\Microsoft\Windows\WER\ReportQueue\AppCrash_clussvc.exe_7351f8239c1232a3331892369f231b119a972bd_8b2f5b28

Analysis symbol:
Rechecking for solution: 0
Report Id: 92cbeed2-3780-11df-88f4-001e0b61d0a2
Report Status: 4

The backup seems to succeed, but the cluster resources fail over to other
nodes. I then get various events from SnapDrive and Navssprv saying that the
LUNs were successfully unmapped and deleted around 1 minute later.
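
For anyone comparing crash reports across nodes, the problem signature above can be pulled apart with a few lines of Python. This is only a sketch: the `P1` to `P8` values are copied from the Event 1001 record in this post, the function name `parse_signature` is illustrative, and the lookup table is deliberately tiny. The one fact it relies on is that 0x80000003 is the NTSTATUS value for a breakpoint exception (STATUS_BREAKPOINT), which differs from an access violation (0xC0000005); that is a reading of the code, not a diagnosis of why clussvc.exe stopped.

```python
import re

# Problem-signature lines copied from the Event 1001 record above.
report = """\
P1: clussvc.exe
P2: 6.1.7600.16385
P3: 4a5bc614
P4: KERNELBASE.dll
P5: 6.1.7600.16385
P6: 4a5bdfe0
P7: 80000003
P8: 0000000000032442
"""

# Small lookup of well-known NTSTATUS values; deliberately incomplete.
KNOWN_STATUS = {
    0x80000003: "STATUS_BREAKPOINT",
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def parse_signature(text):
    """Return {'P1': 'clussvc.exe', ...} from 'Pn: value' lines."""
    return dict(re.findall(r"^(P\d+):\s*(\S+)\s*$", text, flags=re.MULTILINE))


sig = parse_signature(report)
code = int(sig["P7"], 16)  # P7 holds the exception code in hex, without "0x"
print(sig["P1"], "faulted in", sig["P4"], "with", KNOWN_STATUS.get(code, hex(code)))
```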

Thanks,
Rich

Rich McCue

Mar 29, 2010, 4:06:01 AM

Hi,

Has anyone got any further ideas for this problem?

Thanks,
Richard

Rich McCue

Apr 7, 2010, 11:41:02 AM

Hi Perry,

Thanks for the update. I'll try it without the NetApp DSM on a test platform
and let you know. Our DiskSignature is 3032226888 (0xb4bc1c48).
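
As a quick sanity check on the value reported here, the sketch below formats a DiskSignature as decimal plus hex, the way it is written in this post, and tests for the 0 (0x0) case that Perry asks about. The function names are illustrative only and are not part of any cluster tooling.

```python
def describe_signature(value: int) -> str:
    """Format a DiskSignature as 'decimal (0xhex)', as written in this post."""
    return f"{value} (0x{value:x})"


def is_zero_signature(value: int) -> bool:
    """True for the 0 (0x0) case asked about in the quoted message."""
    return value == 0


signature = 3032226888  # value reported in this post
print(describe_signature(signature))
print("zero signature:", is_zero_signature(signature))
```

For the value in this post the decimal and hex forms agree, and the signature is not zero.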

Please could you let me know if you find anything out regarding this problem.

Thanks,
Rich

On Friday, April 02, 2010 4:46 PM, "Perry van Erning" wrote:

> Hi,
>
> We have the same problem, only we use NetApp SnapManager for Hyper-V and
> get the same error 5121 on the hosts. Microsoft and NetApp are asking me to
> install DPM 2010 RC and try to back up with it, so they can see whether the
> problem is caused by the NetApp DSM or not. You could try uninstalling the
> NetApp ONTAP (MPIO) DSM and using the Microsoft MPIO.
>
> Also, you could run the following command at a command prompt on the cluster:
>
> cluster.exe res "CSV1" /priv
>
> You get a list of private properties.
> What is the DiskSignature property? Is it 0 (0x0)?
>
> Regards,
>
> Perry

Perry van Erning

Apr 15, 2010, 9:56:01 AM

Hi Richard,

We must accept the 5121 errors in the cluster because they are normal
behaviour when you back up the CSV.

DPM or SnapManager for Hyper-V backs up the VMs per host and puts the
CSV in a backup state. The communications to the CSV from the other cluster
nodes (hosts) will be redirected. So the other hosts will connect through
another network (card) to the owner of the CSV (the one that is being backed
up) and then go to the Cluster Shared Volume.

When this happens you get the 5121 error. NetApp says there will be a
KB article soon on the Microsoft site.

Regards,

Perry

JBritto

May 13, 2010, 10:23:01 AM

Hello, take a look at
http://fawzi.wordpress.com/2008/09/15/cluster-disk-0-does-not-support-persistent-reservation/

Chris Meehan

Jun 7, 2010, 9:39:49 AM

Rich,

I'm having the same problem as you mentioned here with a different storage
solution and VSS provider. Were you able to resolve your issue?

Thanks in advance.

Chris

n3ur0si

Jun 30, 2010, 5:04:41 AM

Hi,

I am having the same problem with an EqualLogic PS6000 and the EqualLogic VSS provider installed on Hyper-V. This provider is compatible with DPM 2010 and Hyper-V R2.

Does anyone have a solution?

Thanks for any reply.


Jako

Oct 20, 2011, 4:20:28 AM

I don't think that error 5121 is normal or that we have to accept it.

If I don't have room for hardware snapshots, then I don't get errors, just information: Event ID 5140, "Cluster Shared Volume 'Volume1' ('Cluster Disk 2') backup was turned on. Once the backup application completes the backup process, the Cluster Shared Volume backup mode will be turned off. If the backup application has not initiated a snapshot using the Volume Shadow Copy Service within 30 minutes, Cluster Shared Volume backup will be turned off."

And when the backup ends, I get Event 5122: "Cluster Shared Volume 'Volume1' ('Cluster Disk 2') has now resumed normal operation."

But if I have room for a hardware snapshot, then I get error 5121 and, after 2-4 minutes, Event 5122: "Cluster Shared Volume 'Volume1' ('Cluster Disk 2') has now resumed normal operation."

So it's related to the hardware VSS provider; maybe it works badly.
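
The two patterns described above can be told apart mechanically when reviewing an exported event list. The following is a rough sketch only: the function name `classify_backup` and its labels are made up for illustration, and it assumes you have already collected the event IDs logged on one node during a single backup window.

```python
def classify_backup(event_ids):
    """Label one CSV backup window from the event IDs seen on a node.

    5140 = backup mode turned on (information)
    5121 = volume no longer directly accessible, I/O redirected (error)
    5122 = volume resumed normal operation
    """
    resumed = 5122 in event_ids
    if 5121 in event_ids:
        return "redirected, then resumed" if resumed else "redirected, not resumed"
    if 5140 in event_ids:
        return "backup mode only, resumed" if resumed else "backup mode only, not resumed"
    return "no CSV backup events"


# Without room for hardware snapshots: information events only.
print(classify_backup([5140, 5122]))
# With room for a hardware snapshot: error 5121, then recovery.
print(classify_backup([5121, 5122]))
```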

I use IBM DS3512 and LSI VSS hardware provider software.
