I've got an ES40 running VMS 7.3-1 which is connected to an ESA12000
(running ACS 8.7S-2).
On July 2nd, we saw a few of the DGAnn: devices go into
mntverifytimeout. I was forced to reboot the system in an attempt to
get things running again. The OS came up okay, but when we attempted to
start our IDX software (running on Cache), we found that the Cache
database files were badly corrupted. After spending a few hours on
the phone with a support person from IDX and a support person from
InterSystems, the decision was made to restore from tape. That process
took pretty much all day (restoring from the last full backup and then
re-playing the journal files).
We lost an entire day's worth of production on July 2nd. Our clinics
that use the IDX system for scheduling and other "front desk"
patient-related activities were dead-in-the-water. We also had
hundreds and hundreds of "back office" billing people who could do
nothing that day because our IDX system was unavailable.
I have an open IPMT case with "Storage Engineering". I have sent them
tons and tons of logs and console output files.
We haven't received any encouraging news, patches or tips on how to
keep the ESA12000 from going "incommunicado".
Last night, we got hit with the same problem that whacked us on July
2nd. Two DGAnn: devices went to "mntverifytimeout". Had to reboot
the ES40. We got lucky and did not have to restore the IDX
environment from tape. The odd thing that I noticed is that when I
tried a "restart this" on one of the HSG80s, the CLI hung on both
controllers. I had to have an operator hit the buttons on the OCP on
the front of the HSG80s to restart them.
AND THIS MORNING, we got hit again. Same symptoms. Two DGAnn:
devices went mntverifytimeout. Had to reboot without shutting down
Cache. Got lucky once more and the Cache data files were not corrupt
after reboot. Tried a "restart other" on the HSG80s. The CLI on both
controllers hung. Had to walk an operator through hitting the buttons
on the front of the HSG80s to restart them. Does anyone see a pattern
here?
Is anyone else experiencing problems like this with HSG80s connected
to their VMS systems? I heard that another big shop here in Milwaukee
was having similar problems....
This is just killing us. The IDX system is one of our most important
systems. I can't have the storage going "bye-bye" in the middle of
the night.
The only hunch we have about this bug in the ESA12000 is that it seems
to be related to periods of high I/O activity. On all three
occasions, we were hit during the time that our backups run. We are
also using controller-based snapshots...
Is anyone suffering this problem? Does anyone have advice on how I
escalate this problem higher-up in the HP food chain? There is
clearly something wrong with the HSG80s under our VMS system.
Thanks,
-Sleepless Scott in Milwaukee.
svi...@wi.rr.nospam.com
The HSG80 software was a bit outdated, and we wanted to upgrade to the latest
version, V8.7-1. However we were advised to update to V8.6-13, and not to go to
V8.7-1. Don't know why, maybe someone at HP was a bit suspicious about the
V8.7-1 release....
Yes I see a pattern.
Why July 2nd? Did you upgrade firmware July 1st? What changed?
Tips?
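Here is a DCL snippet I found by Googling. So you can immediately catch
when a drive goes into MntVerify, stick something like this in a loop:
http://groups.google.com/groups?selm=8NOV00.19075364%40feda34.fed.ornl.gov&oe=UTF-8&output=gplain
$ vol = p1                 ! device to watch, passed as the first parameter
$ mntvfy = %x4000          ! status bit: mount verification in progress
$ valid = %x0800           ! status bit: volume is valid
$ loop:
$ sts = f$getdvi(vol,"sts")
$ ! Disk is in MntVerify: sound an alarm, and reboot the hung controller.
$ ! (The WRITE is only a placeholder for whatever alarm you use.)
$ if (sts .and. mntvfy) .eq. mntvfy then write sys$output vol, " is in MntVerify"
$ ! Disk is in MntVerifyTimeout: run in circles, scream and shout.
$ if (sts .and. valid) .ne. valid then write sys$output vol, " is in MntVerifyTimeout"
$ wait 00:01:00
$ goto loop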
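For anyone who wants to see the bit-mask logic outside of DCL, here is a minimal Python sketch of the same two tests. The mask values are copied from the snippet above; the function name and the labels are made up for illustration, and the reading of each bit is the snippet's own, not something checked against the VMS sources.

```python
# Mask values copied from the DCL snippet; the names are only labels for this sketch.
MNTVFY = 0x4000  # set while mount verification is in progress
VALID = 0x0800   # clear once the volume is no longer considered valid


def check(sts: int) -> list:
    """Apply the same two independent tests as the DCL loop to a status longword."""
    alerts = []
    if (sts & MNTVFY) == MNTVFY:
        alerts.append("MntVerify")
    if (sts & VALID) != VALID:
        alerts.append("MntVerifyTimeout")
    return alerts


if __name__ == "__main__":
    # Hypothetical status values, chosen only to exercise each branch.
    for sts in (0x0800, 0x4800, 0x0000):
        print(f"{sts:#06x} -> {check(sts)}")
```

Like the DCL version, the two tests are independent, so a status word can trip both at once.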
---
Second, raise your MVTIMEOUT to give yourself more time.
Otherwise you are timing out - HANGING THE DRIVE(s) - and then forced
to reboot the node. (Here is what I have, you may want to bounce
this and any advice off HP by the way):
$ mcr sysgen show mv
Parameter Name            Current    Default     Min.       Max.    Unit     Dynamic
--------------            -------    -------    -------    -------  ----     -------
MVTIMEOUT                   36000       3600         1       64000  Seconds     D
3600 seconds is far too short in my opinion. You may want to
adjust the shadowing timeouts as well if shadowing is in use:
$ mcr sysgen show shadow_mbr
Parameter Name            Current    Default     Min.       Max.    Unit     Dynamic
--------------            -------    -------    -------    -------  ----     -------
SHADOW_MBR_TMO              18000        120         1       65535  Seconds     D
This way, things will hang on shadowsets also until you intervene.
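For a sense of scale, the timeout values quoted above convert to hours and minutes as follows. This is nothing more than arithmetic on the numbers in the SYSGEN output; the helper name is invented for this sketch.

```python
def as_hms(seconds: int) -> str:
    """Render a timeout given in seconds as hours, minutes and seconds."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


# Values taken from the SYSGEN output quoted in this post.
timeouts = {
    "MVTIMEOUT (default)": 3600,
    "MVTIMEOUT (current)": 36000,
    "SHADOW_MBR_TMO (default)": 120,
    "SHADOW_MBR_TMO (current)": 18000,
}

if __name__ == "__main__":
    for name, value in timeouts.items():
        print(f"{name:26s} {value:6d} s = {as_hms(value)}")
```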
But why the corruption? That's an interesting question.
> Is anyone else experiencing problems like this with HSG80s connected
> to their VMS systems? I heard that another big shop here in Milwaukee
> was having similar problems....
>
> This is just killing us. The IDX system is one of our most important
> systems. I can't have the storage going "bye-bye" in the middle of
> the night.
>
> The only hunch we have about this bug in the ESA12000 is that it seems
> to be related to periods of high I/O activity. On all three
> occasions, we were hit during the time that our backups run. We are
> also using controller-based snapshots...
>
Oh - skip the shadowing advice above. But at your version
of VMS, and *if* you had shadowing, and only ONE of the two shadow
members was in MntVerify (maybe one member on one controller pair and
the other on a controller pair that isn't flaking out at the time), you
could kick out the naughty member (maybe you want to bring down
both controller pairs, or are forced to, and therefore need to kick
out the members on the flaky controller(s)):
$ dismount/force_removal badboy:
that is one advantage of having a long MVTIMEOUT , SHADOW_MBR_TMO,
and using shadowing.
> Is anyone suffering this problem?
Not here. Different kit.
Seems like the heavy IO from concurrent operations, controller copies,
backups and night jobs is stretching the HSG80s to a breaking point.
Rob
We have never lost data.
Our ESA12000 has been rock steady!!!
Are you using XFC or VIOC?
We decided to stay with VIOC 'cuz we suspect it's more stable.
WWWebb
svi...@wi.rr.com (Scott Vieth) wrote in message news:<5a85bce2.03091...@posting.google.com>...
Before drives enter Mount Verify Timeout state, they sit in Mount
Verify state for a length of time determined by the MVTIMEOUT
parameter. Default value for this parameter is 3600 seconds, or 1
hour. You could raise this parameter, which would delay the drives
going into Timeout state. But that doesn't solve the underlying
problem, of course.
> I was forced to reboot the system in an attempt to
> get things running again.
Disks in this state can also be dismounted with DISMOUNT/ABORT. Any
outstanding I/Os will be returned with an error status.
> I have an open IPMT case with "Storage Engineering". I have sent them
> tons and tons of logs and console output files.
>
> We haven't received any encouraging news, patches or tips on how to
> keep the ESA12000 from going "incommunicado".
Problems, particularly intermittent ones, can take a lot of time and
effort to diagnose. Not having heard anything doesn't necessarily
mean there isn't a lot of effort going on in the background.
Call back and request an update on the status of the IPMT case, and
Engineering is obligated to provide you with one.
> This is just killing us. The IDX system is one of our most important
> systems. I can't have the storage going "bye-bye" in the middle of
> the night.
Do you have host-based Volume Shadowing set up to shadow across two
different controller pairs? Seems like such a mission-critical system
might warrant that.
> The only hunch we have about this bug in the ESA12000 is that it seems
> to be related to periods of high I/O activity. On all three
> occasions, we were hit during the time that our backups run.
You might try lowering the quotas for the Backup process so it can't
drive the I/O subsystem quite as heavily, as a temporary mitigating
measure.
> Does anyone have advice on how I
> escalate this problem higher-up in the HP food chain?
Start by calling the CSC back, asking for the Manager On Duty, and
expressing your concerns about the situation.
<Snip>
> Is anyone suffering this problem? Does anyone have advice on how I
> escalate this problem higher-up in the HP food chain? There is
> clearly something wrong with the HSG80s under our VMS system.
>
Not this problem, but we have had others. In regard to yours, it
should be obvious to HP by now it is serious, and if you are paying
for support, Scott, the only variable should be how fast they escalate
it.
With the problems that we have had to deal with, you just learn to
lean on them to get the problem addressed to your satisfaction. Feel
free to email me at Gary _at_ McCready -dot- com, and perhaps I can
give you some hints based upon what you have not tried yet.
Good luck,
Off topic here, Keith.
Do you have a website on VMS buffer overflows and Denial of
Service info?
I need to look into this further. Sure wish I could have
attended some of your seminars.
I have 36 HSG's all with the F version of firmware code. Some are at 7-x,
others still at 8.6-8. I have not seen anything like this issue in the F
version. Back in VMS7.2-1H1, there was a problem with SYS$CLUSTER.EXE that
would cause the drives to go immediately into MVTIMEOUT when a path was
lost. That was killing us. That was fixed back in 2001 I believe. It took
several months, and running a debug version of some VMS OS code before
engineering could identify the problem.
> On July 2nd, we saw a few of the DGAnn: devices go into
> mntverifytimeout. I was forced to reboot the system in an attempt to
> get things running again. The OS came up okay, but when we attempted to
> start our IDX software (running on Cache), we found that the Cache
> database files were badly corrupted. After spending a few hours on
> the phone with a support person from IDX and a support person from
> InterSystems, the decision was made to restore from tape. That process
> took pretty much all day (restoring from the last full backup and then
> re-playing the journal files).
>
I guess I do not understand the IDX software and what you mean about it
"running on Cache".
Next time force a crash. This will give engineering something to look at.
{On your system console}
CTRL-P
>>> CRASH
{system will bugcheck and write a crash dump file. This is what engineering
will need to look at.}
> We lost an entire day's worth of production on July 2nd. Our clinics
> that use the IDX system for scheduling and other "front desk"
> patient-related activities were dead-in-the-water. We also had
> hundreds and hundreds of "back office" billing people who could do
> nothing that day because our IDX system was unavailable.
>
> I have an open IPMT case with "Storage Engineering". I have sent them
> tons and tons of logs and console output files.
It took us several months and several outages before the SYS$CLUSTER.EXE
problem was identified in VMS 7.2. Drives were not really in MVTIMEOUT,
but the state was reported as such.
Are you monitoring all of your HSG80 controllers? We have DECserver 700s
and Consoleworks monitoring and recording our HSG console port output. This
may be key to giving engineering the data that they need to fix the problem.
Manufactured by:
It's the leading database in HealthCare (according to Intersystems
web page and other accounts) and a number of vendors use it as a
backend. IDX, Epic as another example.
http://www.idx.com/
http://www.epicsystems.com/
IDX is great stuff. Epic is great stuff. Epic got a plum
recently:
http://www.wistechnology.com/kaiserepic.php
Kaiser On Schedule With Epic's Epic Project
By Jeff Moad, IDG
Kaiser Permanente's closely watched, six-month-old electronic medical record
project has already expanded significantly beyond its original scope, officials
said this week. Still the $1.8-billion, three-year project is on schedule, with
the initial software configuration phase due to wrap up in three weeks.
Kaiser surprised many when, in February, the country's largest HMO announced it
would discontinue a multi-year effort with IBM to develop its own electronic
medical record system -- called the Clinical Information System -- and instead
purchase packaged EMR software from Epic Systems, a relatively small private
software developer (click here for more information)
This is the referenced URL in the click here:
http://www.computerworld.com/databasetopics/data/story/0,10801,78384,00.html
"Carl Dvorak, Epic's chief operating officer, said the company's software can
store 45,000 data elements that cover all aspects of patient care. Dvorak added
that Kaiser's system will manage all end-user interactions through Cache, a
multidimensional database developed by InterSystems Corp. in Cambridge, Mass.,
for use in transaction-processing applications."
So while Cerner is great stuff running on Oracle, a number of
the other major players use Cache.
Rob
>From: SMTP%"svi...@wi.rr.com" 11-SEP-2003 14:51:40.72
>To: Info...@Mvb.Saic.Com
>CC:
>Subj: Anyone else having problems with VMS and HSG80-based arrays?
>
>Hi:
>
>I've got an ES40 running VMS 7.3-1 which is connected to an ESA12000
>(running ACS 8.7S-2).
I'm running 8.7F.
<snip>
.
.
.
>
>environment from tape. The odd thing that I noticed is that when I
>tried a "restart this" on one of the HSG80s, the CLI hung on both
>controllers. I had to have an operator hit the buttons on the OCP on
>the front of the HSG80s to restart them.
I've seen this problem when the cache module is flaky. When the cache module
(not the DIMMs) finally failed, the redundant controller handled things
fine. Until then, it behaved like you say or very close to it. Volume
shadowing between ESA12000 cabinets saved us during this flaky hardware problem.
<snip>
.
.
.
>The only hunch we have about this bug in the ESA12000 is that it seems
>to be related to periods of high I/O activity. On all three
>occasions, we were hit during the time that our backups run. We are
>also using controller-based snapshots...
>
If you aren't already, I would make sure you are using the mirrored cache option
between dual redundant HSG-80 controllers. Also, have you changed the default
settings of the HSG-80s to retain cache for longer periods? If so, you might want
to return to the defaults to see if the problem persists.
:) jck
Kos...@hatespam.bender.com
> > AND THIS MORNING, we got hit again. Same symptoms. Two DGAnn:
> > devices went mntverifytimeout. Had to reboot without shutting down
> > Cache. Got lucky once more and the Cache data files were not corrupt
> > after reboot. Tried a "restart other" on the HSG80s. The CLI on both
> > controllers hung. Had to walk an operator through hitting the buttons
> > on the front of the HSG80s to restart them. Does anyone see a pattern
> > here?
> >
>
> Yes I see a pattern.
>
> Why July 2nd? Did you upgrade firmware July 1st? What changed?
No firmware upgrade or any other change on July 1st. We make it a
policy to NOT make systems changes within five days of the end of the
month. July 1st was when we started the month-end batch processing
for June. On the night of July 1st, we had three Backup streams
running as well as our month-end processing. The HSG80s locked-up
during the month-end and made quite a mess for us.
>
> Seems like the heavy IO from concurrent operations, controller copies,
> backups and night jobs is stretching the HSG80s to a breaking point.
>
> Rob
That is my suspicion as well: we are overwhelming the array with I/O
at night while backup is running.
-Scott
Hoff told me that it was okay to go ahead and use the XFC. It kicks
ass when compared to the VIOC. If you are running 7.3-1 and have all
your patches in place, then turn on the XFC. I have the cache limit
set to 12GB. I am going to increase it this weekend. We are getting
pretty respectable cache hit numbers with 12 gigs but I think we can
do better.
The fastest disk I/O is the one that doesn't have to go to disk. :^)
-Scott
> > Does anyone have advice on how I
> > escalate this problem higher-up in the HP food chain?
>
> Start by calling the CSC back, asking for the Manager On Duty, and
> expressing your concerns about the situation.
I did that yesterday. "Pulled the fire alarm" at HP (in a virtual
manner). Quite a few HP people are talking today to find out why I
have an open IPMT case that is 8 weeks old and I haven't heard
anything from Storage Engineering. I suspect that someone's butt is
going to be in a sling shortly... :^)
> With the problems that we have had to deal with, you just learn to
> lean on them to get the problem addressed to your satisfaction. Feel
> free to email me at Gary _at_ McCready -dot- com, and perhaps I can
> give you some hints based upon what you have not tried yet.
>
> Good luck,
Gary:
I think that my emails yesterday lit a fire under quite a few
people. I received a couple of phone calls from "higher-up" people at
HP who promised to bring the resources together to address this issue.
One person I spoke to said he was talking with another HP person and
said "This is bad. Scott asked to speak to a V.P. on August 4th and
no one got back to him." The other person replied "You mean
*September* 4th, right?" and the first person said "No, I mean AUGUST
4th"
I have had enough of working with "storage engineering" and playing
the game where the engineer working on the IPMT emails a senior
support person and then that support person acts as an interpreter and
asks me for data. I mail the log files back to the middleman who
forwards the data back to the engineer. That is a load of crap. From
now on, I only talk with "VP" or higher on this problem. No more
storage techies.
-Scott "Takin names, kickin ass"
Okay. So why didn't it occur June 1st, May 1st, April 1st, etc.?
Were backup schedules modified such that 3 are running at one
time, prior to that 1 ran? In other words, for a short term fix go
back to prior processes/processing and cut down on concurrent
IO and see if it mysteriously disappears.
Rob
This is easy to check and more doesn't always mean better.
Firstly, Cache will be caching hot global regions and as the
installation guide says you can run GLOSTAT to determine if
you need to bump up buffers:
http://platinum.intersystems.com/GCI/GCI_vmsparms.html
"You can use the statistics produced by the GLOSTAT utility to determine if
adding more global buffers will reduce disk access and thereby improve
performance."
Give it as many as it needs, etc. But XFC comes into play
as only so much will be utilized and a utility like GLOSTAT is
conservative (as is MEMREQ ;-). I'd be interested in seeing
your cache hit rates and Free MBytes:
$ show memory/cache
During high IO times during the day, is all your memory dedicated to
cache (not to be confused with Cache) in use? If not, why dedicate
more?
Rob
...which falls down periodically in the middle of the night. :^)
I am here this morning to post a fix (or two) for the storage issue I
described yesterday.
After "lighting a fire" under HP yesterday via email and phone call, I
received an email from one of the gods in VMS Engineering.
He said that the problem with our VMS system and the ESA12000 sounded
very similar to a problem experienced at another site. VMS
Engineering had also reproduced this problem in their test lab (prolly
where that funny white elephant lives).
Anywho, the root problem that we are seeing is that our mighty ES40
running OpenVMS 7.3-1 is simply OVERWHELMING the ESA12000 with I/O
requests. The VMS Engineering person told me that some of the
performance tweaks in 7.3-1 really make VMS fly when it comes to I/O.
Now our ES40 is demanding data so fast from the HSG80s that eventually
the HSG80s tip over.
This makes perfect sense. The three times that we have been bitten by
this problem, we had *extremely* heavy I/O on the ES40. The first
time, we were running three concurrent backup streams from snapshots
AND running our month-end batch processing. The poor HSG80s could not
handle the load and gave up (which corrupted our Cache data files and
made us restore from tape).
The VMS Engineering person told me that the HSG80s have a total queue
depth (if that is the correct term) of *240* outstanding I/O requests.
After that, the controllers try to tell the host system to slow down
a little. But the ES40 and VMS 7.3-1 are hungry for more data and
finally the HSG80 faints.
First Fix: The guru from VMS Engineering asked me to check the DIOLM
setting on the account that we use to run our backup jobs. Knowing
that the HSG80s have a maximum queue depth of 240, we don't want to
bury the HSG80s any more. In Authorize, I found that DIOLM for our
backup account was set to 32767. Three backup jobs running at the
same time under that account were issuing TONS and TONS of I/O
requests and burying the HSG80s.
So, per his advice, I set the DIOLM for our backup account to "32".
This will give very good backup performance and still not bury the
HSG80s.
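A back-of-the-envelope sketch of why the old DIOLM was a problem. It assumes each backup stream runs as one process, that DIOLM is the only thing capping that process's outstanding direct I/Os, and that nothing else on the system is doing I/O to the controllers, so it is a worst-case bound and not a measurement. The 240 figure is the queue depth quoted to me; the function names are made up for this example.

```python
HSG80_QUEUE_DEPTH = 240  # total outstanding I/O requests, as quoted by VMS Engineering


def worst_case_outstanding(streams: int, diolm: int) -> int:
    """Upper bound on direct I/Os in flight if every stream hits its DIOLM."""
    return streams * diolm


def fits(streams: int, diolm: int, depth: int = HSG80_QUEUE_DEPTH) -> bool:
    """True if the worst case stays within the controller queue depth."""
    return worst_case_outstanding(streams, diolm) <= depth


if __name__ == "__main__":
    for diolm in (32767, 32):
        total = worst_case_outstanding(3, diolm)
        print(f"3 streams x DIOLM {diolm}: up to {total} outstanding, "
              f"fits in {HSG80_QUEUE_DEPTH}: {fits(3, diolm)}")
```

With three streams, a DIOLM of 32 leaves a worst case of 96, well under 240, whereas 32767 allows a worst case several hundred times the queue depth.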
Second fix: The VMS Engineering guru told me to install
DEC-AXPVMS-VMS731_MSA1000-V0100 as soon as I can. It fixes a timeout
value for fibre channel read/writes. The value got set to "4 seconds"
in VMS 7.3-1 and this patch changes the timeout value back to "24
seconds". This will help the OS be more tolerant when the HSG80s are
being pokey with returning requested data.
How do you know if you are burying your HSG80s? You can run
"anal/sys"
and then do a "fc stdt/all"
Look at the far right-hand column. It is labelled "Seq Tmo". If that
number is huge, that is a bad sign. Also look at the column labelled
"QF Seen". We don't want that value too large either.
So, that appears to be the problem I ran into and those are the ways
to fix it. The Mighty VMS Operating system (7.3-1) is simply too
powerful for the HSG80 controllers. By throttling the I/Os a little,
we can get the OS to "behave" until we get rid of the HSG80s and move
to our....
[E]xtra-high-performance
[V]MS-ready
[A]rray
:^) :^) :^)
You may have noticed that I did not reveal the name of the person in
VMS Engineering who helped me. He asked that I not post his name,
phone number or email address (or yearbook photo).
You will need to open *your own* case and get *your own* VMS
Engineering god to help you out. Those are the breaks. :^)
Now that I know what is causing our 'mntverifytimeouts' in the middle
of the night and how to fix it, I feel like a large weight has been
lifted from my shoulders. I can finally get some sleep. :^)
Many, many thanks to the person from VMS Engineering who took the time
to email me last night and talk with me on the phone this morning. I
learned more in 30 minutes talking with this person than I did working
with the storage group for eight weeks. VMS Engineering rocks!!
Storage Engineering...well, you kinda suck.
Hmmm... I have a GS160 Model 16 with 7.3-1 and all but last couple of weeks
of patches installed with 6 dual redundant pairs of HSG-80's that are 256 meg
mirrored caches, 12 fibre HBAs, and 3 pair redundant switches, and 14 gigs of
XFC. To me, my GS160 might be able to present similar I/O load like your
ES40 to HSG-80's, if my application drives it to.
Granted, your I/O load may be higher than mine. But my MONITOR stats
indicate I have what I consider fairly high I/O load over extended time.
My I/O request queue length are in the thousands (2K to 5K) on DGA devices,
as are I/O operation rates (15K to 36K) on all my DSA devices, and (1K - 2K)
on my DGA devices. My ECP stats also seem to match up. In general, I/O
bound system with spread I/O across the disk farm we could afford (about
140 of 15Krpm drives). I'm not doing cloning with HSG-80's like you, so load
on HSG-80 may not be as high. But I do think I have high sustained (hours)
I/O load to my HSG-80's, with about 50/50 read/write mix.
I guess I will have to measure doing ANAL/SYS sometime and see how big the
numbers are. "Huge" and "large" are quite relative terms. Can you be more
quantitative? If not, I understand. I don't have the problem you see/had,
but I am curious to compare, and to take action to avoid the problem if need
be.
--
David McKenzie
David.M...@paradigm-shift.dot.biz remove the "dot"
OpenVMS IT Privacy and Law
"Scott Vieth" <svi...@wi.rr.com> wrote in message
news:5a85bce2.03091...@posting.google.com...
Do you mean firmware that is out-of-date? I am very careful to keep
my ES40 up-to-date whenever new AlphaServer firmware is announced. I
let the firmware update upgrade *all* of the pieces in the system that
it can upgrade.
I'll double-check the firmware version this afternoon when the ES40 is
rebooted to install the 731-MSA1000 patch.
Thanks,
-Scott
In particular, lp7000's displayed bad behaviour and lp8000's did not.
--
David McKenzie
David.M...@paradigm-shift.dot.biz remove the "dot"
OpenVMS IT Privacy and Law
"Scott Vieth" <svi...@wi.rr.com> wrote in message
news:5a85bce2.03091...@posting.google.com...