I have a Windows 2000 SP3 Cluster. We have clustered a number of
services, including MSDTC and MSMQ.
Everything was working fine until CA eTrust Antivirus v7 (supposedly
compatible with Clusters) was installed on the Cluster.
The Quorum drives have been excluded from the virus scan, although the
software still scans reads and writes to the disks.
What happens now is that, occasionally, the Cluster service crashes on
failover. Sometimes this takes down the whole node.
I've been through the Cluster log. When the Cluster fails over to a
node, everything is OK until it tries to write an MSMQ checkpoint
file to drive F: (the Quorum). This seems to fail, after which the Cluster
runs a Chkdsk on F: and the Cluster service restarts. Here's what I
make of the log (lines starting with // are my comments):
// Cluster node starting
00000fb4.00000b84::2003/11/27-14:39:01.359 Physical Disk <Disk F:>:
MountieVerify: DriveLetters mask is now 00000020.
000007d8.00000d98::2003/11/27-14:39:01.859 [FM] FmpRmOnlineResource:
release quolock/group lock and wait on ghQuoOnlineEvent
000007d8.00000d98::2003/11/27-14:39:02.359 [FM] FmpRmOnlineResource:
release quolock/group lock and wait on ghQuoOnlineEvent
000007d8.000010cc::2003/11/27-14:39:02.359 [CP] CppRegNotifyThread
checkpointing key SOFTWARE\Microsoft\MSMQ\Clustered
QMs\MSMQ$MSMQ\Parameters to id 1 due to timer
000007d8.000010cc::2003/11/27-14:39:02.406 [CP] CpSaveData:
checkpointing data id 1 to quorum node 1
// MSMQ Writing checkpoint file
000007d8.000010cc::2003/11/27-14:39:02.421 [CP] CppWriteCheckpoint
checkpointing file C:\DOCUME~1\SERVER~1\LOCALS~1\Temp\CLS99.tmp to
file F:\MSCS\\23fceb64-56eb-4ff7-87fe-7060f5c7daf2\00000001.CPT
// Here the cluster is waiting for something - this line is repeated a
number of times.
000007d8.00000d98::2003/11/27-14:39:02.859 [FM] FmpRmOnlineResource:
release quolock/group lock and wait on ghQuoOnlineEvent
// Disk F: is doing something here (not sure what)
000007d8.00000d98::2003/11/27-14:39:03.359 [FM] FmpRmOnlineResource:
release
00000fb4.00000b90::2003/11/27-14:39:17.531 Physical Disk <Disk F:>:
[DiskArb] CompletionRoutine, status 0.
00000fb4.00000b90::2003/11/27-14:39:17.531 Physical Disk <Disk F:>:
[DiskArb] posting AsyncCheckReserve request.
00000fb4.00000b90::2003/11/27-14:39:17.531 Physical Disk <Disk F:>:
[DiskArb] error checking disk reservation thread, error 995.
// Here things seem to start to go wrong with the Quorum F:
00000fb4.00000b90::2003/11/27-14:39:17.531 Physical Disk <Disk F:>:
[DiskArb] CompletionRoutine: reservation lost!
000007d8.000007e8::2003/11/27-14:39:17.562 [EVT] s_ApiEvPropEvents:
Calling into EvpPropPendingEvents, size=648...
000007d8.000007e8::2003/11/27-14:39:17.562 [EVT] s_ApiEvPropEvents:
Called EvpPropPendingEvents...
// Here things go badly wrong
00000fb4.00000b90::2003/11/27-14:39:17.562 [RM] RmpLostQuorumResource,
cluster service terminated...
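
In case it helps anyone reading the excerpt, here is a minimal Python sketch for spotting the long pauses between entries, such as the roughly 14 seconds between the last FmpRmOnlineResource line and the first [DiskArb] line. It assumes every entry starts with the `pid.tid::YYYY/MM/DD-HH:MM:SS.mmm` prefix seen above and that unprefixed lines are wrapped continuations. The names `parse_entries` and `find_gaps` and the 5-second threshold are my own choices for illustration, not part of any cluster tool.

```python
import re
from datetime import datetime

# Entry prefix as seen in the excerpt: <pid>.<tid>::<date>-<time> <message>
ENTRY = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+(?P<msg>.*)$"
)


def parse_entries(lines):
    """Return a list of (timestamp, pid, tid, message) tuples.

    Lines without the prefix are treated as the wrapped tail of the
    previous entry; lines starting with // (my annotations) are skipped.
    """
    entries = []
    for raw in lines:
        line = raw.strip()
        match = ENTRY.match(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y/%m/%d-%H:%M:%S.%f")
            entries.append([ts, match.group("pid"), match.group("tid"),
                            match.group("msg")])
        elif entries and line and not line.startswith("//"):
            entries[-1][3] += " " + line
    return [tuple(e) for e in entries]


def find_gaps(entries, threshold_seconds=5.0):
    """Return (previous, current, seconds) for each pause >= threshold."""
    gaps = []
    for prev, cur in zip(entries, entries[1:]):
        delta = (cur[0] - prev[0]).total_seconds()
        if delta >= threshold_seconds:
            gaps.append((prev, cur, delta))
    return gaps


# Four lines copied from the excerpt above (two wrapped entries).
SAMPLE = [
    "000007d8.00000d98::2003/11/27-14:39:03.359 [FM] FmpRmOnlineResource:",
    "release",
    "00000fb4.00000b90::2003/11/27-14:39:17.531 Physical Disk <Disk F:>:",
    "[DiskArb] CompletionRoutine, status 0.",
]

for prev, cur, seconds in find_gaps(parse_entries(SAMPLE)):
    print(f"{seconds:.3f}s pause before: {cur[3]}")
```

To run it against a full log, pass the open file object to `parse_entries` in place of `SAMPLE`. Lowering the threshold shows the shorter half-second waits as well.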
These problems have now happened with two different versions of
antivirus software. Can anyone confirm why the antivirus software is
doing this, and what the recommended method of installing antivirus
software on a Win2K Cluster would be?
I am also aware of a Knowledge Base article which recommends removing
the antivirus filter drivers. Unfortunately this is not an option,
as the company is adamant that antivirus software be installed on the
Cluster.
Thanks,
Neil