Tape drive hung

FrankS

May 16, 2009, 8:02:02 PM

There have been issues in the past with OpenVMS not handling
situations where a tape drive goes offline while operations are in
progress. I thought those were all resolved, but apparently that's
not the case.

So here's the situation:

OpenVMS v7.3-2, last patch set was Update V15 and anything released
thru June 2008
Directly connected SCSI tape library (TL892)

The tape croaked during a backup. The batch job died, and as far as
anyone can see the process is no longer attached to the device. The
tape drive was repaired and appears to be working.

Attempts to do SYSMAN IO SCSI, IO AUTO result in a hang. The SHOW
DEV /FULL shows no owner process, but "operations being canceled".
The error count on PKB0 increments during this IO AUTO and the error
log shows SCSI bus resets. Any attempt to $MOUNT the drive or control
the library returns with "device timeout".

SYSGEN TAPE_MVTIMEOUT is set to 600 seconds, but it's been hours and
still no joy.

Other than rebooting the system, are there any "undocumented" tools to
resolve this problem?

Various command outputs follow:

$ sh dev/fu pkb0

Device PKB0:, device type Qlogic ISP1020 SCSI port, is online, error logging is
    enabled, operations being canceled.

    Error count                  287    Operations completed                760
    Owner process                 ""    Owner UIC                      [SYSTEM]
    Owner process ID        00000000    Dev Prot              S:RWPL,O:RWPL,G,W
    Reference count                0    Default buffer size               65535
    Current preferred CPU Id       3    Fastpath                              1
    Current Interrupt CPU Id       3

$ anal/sys

OpenVMS (TM) system analyzer

SDA> show device pkb0

PKB0                                    Unknown                 UCB: 83E0CA80

Device status:   50000018 cancel,online,fast_path,fp_hwint
Characteristics: 0C440000 avl,elg,idv,odv
                 00000200 nnm
SUD Status       00000000

Owner UIC [000001,000004]   Operation count      760   ORB address     83E0CCC0
PID              00000000   Error count          287   DDB address     83E0C800
Class/Type          80/30   Reference count        0   DDT address     822112E0
Def. buf. size      65535   BOFF            000010F8   SUD address     83E0CBC0
DEVDEPEND        00000000   Byte count      000000FF   CRB address     83E0C880
DEVDEPND2        00000000   SVAPTE          84EB4C88   I/O wait queue  83E0CAEC
DEVDEPND3        00000000                              DEVSTS          00000004
                                                       FLCK index            3A
                                                       DLCK address    83E0C900
                                                       Preferred CPUDB 8F53A000
                                                       Preferred CPUID       03

*** I/O request queue is empty ***

$ sh dev/fu gkb0

Device GKB0:, device type Generic SCSI device, is online, shareable.

    Error count                    0    Operations completed             253256
    Owner process                 ""    Owner UIC                      [SYSTEM]
    Owner process ID        00000000    Dev Prot    S:RWPL,O:RWPL,G:RWPL,W:RWPL
    Reference count                0    Default buffer size                   0

$
$ sh dev/fu mkb400

Magtape PORTIA$MKB400:, device type TZ89, is online, record-oriented device,
    file-oriented device, available to cluster, error logging is enabled,
    controller supports compaction (compaction enabled), device supports
    fastskip (always).

    Error count                  100    Operations completed          721106267
    Owner process                 ""    Owner UIC                      [SYSTEM]
    Owner process ID        00000000    Dev Prot            S:RWPL,O:RWPL,G:R,W
    Reference count                0    Default buffer size               65024
    Density                     TK89    Format                        Normal-11

  Volume status:  no-unload on dismount, beginning-of-tape, odd parity.

$ sh dev/fu mkb500

Magtape PORTIA$MKB500:, device type TZ89, is online, file-oriented device,
    available to cluster, error logging is enabled, controller supports
    compaction (compaction disabled), device supports fastskip (per_io).

    Error count                    2    Operations completed                  4
    Owner process                 ""    Owner UIC                      [SYSTEM]
    Owner process ID        00000000    Dev Prot            S:RWPL,O:RWPL,G:R,W
    Reference count                0    Default buffer size                2048
    Density                  default    Format                        Normal-11

  Volume status:  no-unload on dismount, beginning-of-tape, write-locked, odd
    parity.

H Vlems

May 17, 2009, 6:28:41 AM

The system obviously is a production system that cannot reboot at
will, otherwise you'd have done that, right?
The process that locks the device runs on cpu #3. If it runs
exclusively on #3, can you stop that cpu and see whether that releases
the tapedrive?
Hans

Richard B. Gilbert

May 17, 2009, 8:56:41 AM

AFAIK, no!

This sort of thing has been a problem for at least the last thirty
years! It's not as bad as it used to be but. . . .

If I were in a position to ask; e.g. if I were a paying customer, I
would ask for a tool to reset the device, reinitialize the driver, etc,
etc. I don't think I would get it but it would be nice to have. . . .

sapienzaf

May 17, 2009, 9:55:25 AM

On May 17, 6:28 am, H Vlems <hvl...@freenet.de> wrote:
> The system obviously is a production system that cannot reboot at
> will, otherwise you'd have done that, right?

Precisely, though if there's no other resolution then a reboot will be
done (and is scheduled for later today). The tape drive is a
necessary part of the backup plan.

> The process that locks the device runs on cpu #3. If it runs
> exclusively on #3, can you stop that cpu and see whether that releases
> the tapedrive?

I guess it's worth trying. I suspect the fastpath service will just
get passed off to another CPU and not do any resetting.

Michael Moroney

May 17, 2009, 9:58:32 AM

H Vlems <hvl...@freenet.de> writes:

>The system obviously is a production system that cannot reboot at
>will, otherwise you'd have done that, right?
>The process that locks the device runs on cpu #3. If it runs
>exclusively on #3, can you stop that cpu and see whether that releases
>the tapedrive?

That wouldn't release the drive.

If the problem is the wayward process ties up a necessary drive, a
potentially better "fix" might be to reconfigure the physical drive to
have a different SCSI ID or something, reconnect it and use IO AUTO so
that VMS finds the "new" drive, which you can use from this point on.

As always, be very careful with production systems. The catch with the
above is that you must remember to put the drive back to the "old" ID
upon the next reboot, and running programs must find the drive at the
new ID, not the old one (logical names go far here).

sapienzaf

May 17, 2009, 9:59:47 AM

On May 17, 8:56 am, "Richard B. Gilbert" <rgilber...@comcast.net>
wrote:

> This sort of thing has been a problem for at least the last thirty
> years! It's not as bad as it used to be but. . . .

Previous issues (if I remember correctly) would have the MK device
hung up, with a process still attached to it in the $SHOW DEVICE
output. Here the PK device is hung, and there's no (apparent) process
attached to it.

I'd say it's just as bad and perhaps worse than before. It's clearly
a persistent issue, though, going back to at least v5.5-2 that I can
recall. I would have hoped someone in OpenVMS engineering would have
found the bug by now.

> If I were in a position to ask; e.g. if I were a paying customer, I
> would ask for a tool to reset the device, reinitialize the driver, etc,
> etc. I don't think I would get it but it would be nice to have. . . .

Maybe they'll get to it after the IPSEC implementation.

sapienzaf

May 17, 2009, 10:13:30 AM

On May 17, 9:58 am, moro...@world.std.spaamtrap.com (Michael Moroney)
wrote:

> If the problem is the wayward process ties up a necessary drive, a
> potentially better "fix" might be to reconfigure the physical drive to
> have a different SCSI ID or something, reconnect it and use IO AUTO so
> that VMS finds the "new" drive, which you can use from this point on,

Ah, but you missed the details. There is no process attached to the
drive, and it's the PK device that's getting blocked.

In fact, I wrote in the original post that an IO AUTO gets hung.

H Vlems

May 17, 2009, 2:14:07 PM

I know, it's why I stopped using 8mm Exabyte drives at home. I
invested in a couple of large SCSI drives and used them for backup.
And yes, I realise that's a hobbyist option and useless to production
shops with lots of data.
On a VAXstation 4000-90A (yes, SCSI-2 only) I managed to go on
replacing the Exabyte with a TK50. Might just as well have been dumb
luck, again not something you'd even consider doing on a production
system.
If the stop cpu doesn't work, then all you can do is a reboot. Which
makes VMS behave like a Windows system....

Richard B. Gilbert

May 17, 2009, 2:58:34 PM

The "excuse" I recall is that there might be a read, write, or other
operation pending so you can't just zap the process that owns the drive.

As of 2004 and VMS V7.2 the problem seemed to be pretty well under
control. I haven't used a DLT drive since then. The TK-50 was a good
drive as far as it went, which wasn't very. The speed and capacity
were totally inadequate for doing full backups. More modern DLT drives
were a great improvement; a TZ88 was almost adequate; with a large disk
farm you needed either some sort of stacker or someone to change tapes.
AIRC the TZ87 and TZ88 drives were "high maintenance"!

sapienzaf

May 17, 2009, 4:04:25 PM

To follow up on my own post, after rebooting the system the same
problem occurred. So it's back with the hardware folks to get the
tape library replaced.

winston...@yahoo.com

May 18, 2009, 1:00:27 AM

On May 17, 8:56 am, "Richard B. Gilbert" <rgilber...@comcast.net>
wrote:

>

> This sort of thing has been a problem for at least the last thirty
> years! It's not as bad as it used to be but. . . .
>
> If I were in a position to ask; e.g. if I were a paying customer, I
> would ask for a tool to reset the device, reinitialize the driver, etc,
> etc. I don't think I would get it but it would be nice to have. . . .

Gee, ISTR that we *had* just such a program from Compaq that was
supposed to do just that.
But I don't remember that it ever worked. Maybe just not the *right*
situation.

Anyway, this would've been around '01-'02...

Richard B. Gilbert

May 18, 2009, 8:09:38 AM

I don't recall it. It may just be my fading memory or it may have been
well worth forgetting!

ajmurtha

May 18, 2009, 2:44:02 PM

Hi,

I have run into this problem where a batch job is running a backup and
someone stops the batch job and the tape drive is stuck, allocated to
a non-existent process. I was able to free up the tape drive using
DELTA to set the device reference count to 0 and reset the device
status bits. It has saved us from having to reboot a production
system, but read the warnings -- undocumented, unsupported, use at
your own risk. Here it is:

Procedure for Un-Sticking a Tape Drive

There is a bug which occasionally results in a tape device being
allocated to a non-existent process. It seems to happen when a batch
job was running a backup and someone did a STOP/ID on the process.

Note the VMS 7.3-2 version of DELTA has a bug and won't work. The
DELTA.EXE from VMS 7.2 works, as may others.

In this example, MKA200 was allocated to a non-existent process and
marked for dismount.

MKA100 was functional and not in use. It is used as the example for the
correct device status bits that need to be set to fix the stuck drive.

You might also want to save the typical configuration of your tape
drive when it is working correctly, unallocated and not in use, for
reference.

This is easier if you get several windows going.

Use one window to display data in the MKA200 UCB.

$ ANAL/SYS
SDA> SHOW DEVICE MKA200:
SDA> READ SYSDEF
SDA> FORMAT UCB

Use another window to do the same for the "good" device, MKA100.
(or your previously saved device status).

$ ANAL/SYS
SDA> SHOW DEVICE MKA100:
SDA> FORMAT UCB

In the third window, you will use DELTA to modify data in the MKA200
UCB (in this example I have saved a copy of the VMS 7.2 DELTA.EXE in
SYS$MAINTENANCE).

$ ANAL/SYS
SDA> SPAWN RUN SYS$SPECIFIC:[SYSMAINT]DELTA

You will get an initial display that is an instruction. For example:
OpenVMS Alpha Debugger

Exit 00000001

80058F80! LDQ R28, #X0008(SP)

At this point, enter
1;M and a carriage return.

(this sets all processes writable).

You will not get any prompts.

Now there are several fields in the UCB you want to check.

UCB$L_REFC, the device reference count, will probably be non-zero; you
want to set it to zero. Here is how to view and change a memory
location:

On your "bad tape" (MKA200) window, find the UCB$L_REFC field in the
UCB formatted output. It will have a location on the left, something
like
FFFFFFFF.8xxxxxxx

The number on the right is the value at that address. Example:

FFFFFFFF.81DC54CC UCB$L_REFC 00000002

Type the following:
00010001:gxxxxxxx/

Where xxxxxxx is the 7 digits following the FFFFFFFF.8 in the address
of the location.

The 00010001 is the internal PID of the swapper, which owns all the
device UCBs. After the ":" you put the hex address (64 bit) of the
location to view; the "/" tells DELTA to output the data at that
address. After you type the "/" you will see:

00010001:gxxxxxxx/ nnnnnnnn

Where nnnnnnnn is the value, in this case 00000002.

After the output value, enter the new value, followed by a carriage
return:

00010001:g1dc54cc/ 00000002 0 <CR>

This will update the UCB refcount field to 0.
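
The address rewriting in the steps above is mechanical, so it can be sketched in a few lines of Python. This is only an illustration of the rule as this post describes it (drop the leading FFFFFFFF.8 and prefix the remaining seven digits with g). The function name `sda_to_delta_address` and its `pid` default are hypothetical, and nothing here talks to SDA or DELTA; it just builds the string you would type.

```python
def sda_to_delta_address(sda_address: str, pid: str = "00010001") -> str:
    """Turn an SDA-style address such as FFFFFFFF.81DC54CC into the
    string typed at DELTA, e.g. 00010001:g1dc54cc/ (hypothetical helper)."""
    prefix = "FFFFFFFF.8"
    text = sda_address.strip().upper()
    if not text.startswith(prefix):
        raise ValueError(f"expected an address starting with {prefix}")
    digits = text[len(prefix):]
    # The post says exactly 7 hex digits follow the FFFFFFFF.8 prefix.
    if len(digits) != 7 or any(c not in "0123456789ABCDEF" for c in digits):
        raise ValueError("expected exactly 7 hex digits after the prefix")
    return f"{pid}:g{digits.lower()}/"


if __name__ == "__main__":
    # Address taken from the UCB$L_REFC example in the post.
    print(sda_to_delta_address("FFFFFFFF.81DC54CC"))
```

Checking the generated string against the SDA window before typing it into DELTA is still worthwhile, since a wrong address here means writing to the wrong place in system space.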

You will also need to compare the device status field to that of the
good tape device and set your bad tape device to have the same values.
(I have not made the effort to figure out which bits are which.) You
should check the following fields.

UCB$L_PID (pid of device owner, set to 0)
UCB$L_STS (status)
UCB$L_DEVSTS (device status)
UCB$L_DEVCHAR (device characteristics)
UCB$L_AMOD (not sure if you really need this one)

Of course, this is COMPLETELY unsupported, and if you screw something
up you could crash the system. Remember, you are mucking around system
space in kernel mode. Be careful!
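
The good-versus-bad field comparison described above can be done by eye, but a small sketch may make it less error-prone. The Python below assumes you have copied the longword values for the listed fields out of the two SDA windows by hand; it XORs each pair and reports which bit positions differ. The function names and the sample values are hypothetical and not taken from any real system, and the script does not read or modify anything on the VMS side.

```python
def differing_bits(good: int, bad: int) -> list[int]:
    """Return the bit positions (0 = least significant) where two
    32-bit longwords differ."""
    diff = (good ^ bad) & 0xFFFFFFFF
    return [bit for bit in range(32) if diff & (1 << bit)]


def compare_ucb_fields(good: dict[str, int], bad: dict[str, int]) -> dict[str, list[int]]:
    """Map each field name present in both dumps to its differing bits,
    skipping fields whose values already match."""
    report = {}
    for name in good:
        if name in bad:
            bits = differing_bits(good[name], bad[name])
            if bits:
                report[name] = bits
    return report


if __name__ == "__main__":
    # Hypothetical values for illustration only, NOT from a real system.
    good = {"UCB$L_STS": 0x00000010, "UCB$L_REFC": 0, "UCB$L_PID": 0}
    bad = {"UCB$L_STS": 0x00000018, "UCB$L_REFC": 2, "UCB$L_PID": 0x0001021F}
    for field, bits in compare_ucb_fields(good, bad).items():
        print(f"{field}: good={good[field]:08X} bad={bad[field]:08X} "
              f"differing bits={bits}")
```

A report like this only tells you where the two devices differ; deciding which of those bits are safe to change is still the risky, unsupported part.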

sapienzaf

May 18, 2009, 3:02:48 PM

On May 18, 2:44 pm, ajmurtha <ajmur...@gmail.com> wrote:
> I have run into this problem where a batch job is running a backup and
> someone stops the batch job and the tape drive is stuck, allocated to
> a non-existent process. I was able to free up the tape drive using
> DELTA to set the device reference count to 0 and reset the device
> status bits. It has saved us from having to reboot a production
> system, but read the warnings -- undocumented, unsupported, use at
> your own risk. Here it is:
>

Actually, I had already written a little program to update UCB$L_STS
to no effect. The problem continues to be hardware related.

Dale Dellutri

May 18, 2009, 3:41:30 PM

On Mon, 18 May 2009 12:02:48 -0700 (PDT), sapienzaf <sapi...@noesys.com> wrote:
> On May 18, 2:44 pm, ajmurtha <ajmur...@gmail.com> wrote:
> > I have run into this problem where a batch job is running a backup and
> > someone stops the batch job and the tape drive is stuck, allocated to
> > a non-existent process. I was able to free up the tape drive using
> > DELTA to set the device reference count to 0 and reset the device
> > status bits. It has saved us from having to reboot a production
> > system, but read the warnings -- undocumented, unsupported, use at
> > your own risk. Here it is:
> >

> Actually, I had already written a little program to update UCB$L_STS
> to no effect. The problem continues to be hardware related.

If it's hardware, then, of course, you'll have to replace it.

However, I've had good luck with recalcitrant drives by simply
running LTT (available on the HP website, I have LTT V4.7-0).
First doing a "hardware scan", then "test" with a writeable
tape in the drive. This typically clears up mysterious
intermittent parity errors.

I haven't had your problem, but LTT is easy enough to install,
although its operation is a bit difficult to figure out.
(It's a port from some other OS, obviously.) You might want
to try it.

--
Dale Dellutri <ddelQ...@panQQQix.com> (lose the Q's)