Using MSA$UTIL to maintain logical disk units online?

Rod Regier

Jan 17, 2018, 7:29:03 AM

I'm trying to use MSA$UTIL to perform a maintenance action on a failed JBOD unit.
Config is V/I64 (remastered) 8.4 w/patches, RX2600, SA6402 controller and front-access disks.
Unit 0 is a RAID mirror set
Unit 1 is a JBOD

The 36 GB disk of the JBOD failed, and a replacement was hot-swapped into the slot.

n.b. Next time around I'll try SCAN ALL and ACCEPT UNIT 1,
but that ship has already sailed.

It appears I'll be forced to recreate the unit offline.
Any ideas for other online options?

$ mcr msa$util
MSA> set controller
MSA> sho controller

Adapter: _PKC0: (DEFAULT)
SA6400 (c) HP P57820PXQRG2OE Software 2.84
SCSI_VERSION = X3.131:1994 (SCSI-2)
Supported Redundancy Mode:
Not currently Redundant
Current Role: Active
Cache:
64 megabyte read cache 64 megabyte write cache
Cache is enabled and Cache is GOOD.
No unflushed data in cache.
Battery:
Battery is fully charged.
Controller Mode:
Controller is in HBA Mode.

MSA> sho unit

Unit 0:
In PDLA mode, Unit 0 is Lun 0.
Cache status : enabled
Max Boot Partition: Unknown
Volume status : VOLUME OK
2 Data Disk(s) used by lun 0:
Disk 0: Partition 0; (SCSI bus 0, SCSI id 0)
Disk 102: Partition 0; (SCSI bus 1, SCSI id 2)
Spare physical drives:
No spare drives are designated.
Logical Volume Raid Level: RAID 1. Mirroring
stripe_size=128.0KB
Logical Volume Capacity : 68.36 [73.40] GB

Unit 1:
In PDLA mode, Unit 1 is Lun 1.
Cache status : enabled
Max Boot Partition: Unknown
Volume status : VOLUME failed
1 Data Disk(s) used by lun 1:
Disk 1: Partition 0; (SCSI bus 0, SCSI id 1)
Spare physical drives:
No spare drives are designated.
Logical Volume Raid Level: RAID 0 or JBOD. No fault tolerance
stripe_size=128.0KB
Logical Volume Capacity : 33.91 [36.41] GB

MSA> sho disk

Parallel SCSI device [Disk]
Disk 0: SCSI bus 0 id 0 size 68.36 [73.40] GB
Disk 000, # 0, size 143363040 blocks, (68.36 [73.40] GB), Unit 0.
Disk 000, # 1, size 2418 blocks, (1.18 [1.24] MB), Unused.

Parallel SCSI device [Disk]
Disk 102: SCSI bus 1 id 2 size 68.36 [73.40] GB
Disk 102, # 0, size 143363040 blocks, (68.36 [73.40] GB), Unit 0.
Disk 102, # 1, size 2418 blocks, (1.18 [1.24] MB), Unused.

MSA> scan all

MSA> sho disk

Parallel SCSI device [Disk]
Disk 0: SCSI bus 0 id 0 size 68.36 [73.40] GB
Disk 000, # 0, size 143363040 blocks, (68.36 [73.40] GB), Unit 0.
Disk 000, # 1, size 2418 blocks, (1.18 [1.24] MB), Unused.

Parallel SCSI device [Disk]
Disk 1: SCSI bus 0 id 1 size 34.18 [36.70] GB
Disk 001, # 0, size 71122560 blocks, (33.91 [36.41] GB), Unit 1.
Disk 001, # 1, size 563724 blocks, (275.26 [288.63] MB), Unused.

Parallel SCSI device [Disk]
Disk 102: SCSI bus 1 id 2 size 68.36 [73.40] GB
Disk 102, # 0, size 143363040 blocks, (68.36 [73.40] GB), Unit 0.
Disk 102, # 1, size 2418 blocks, (1.18 [1.24] MB), Unused.

MSA> delete unit 1

All data on this Device will be lost.
Do you really want to delete it? (Y/N)[N]:y
Unit deleted!

MSA> add unit 1/jbod/disk=1
set_configuration SCSI status = ff, ASC = 0, ASCQ = 0

SCSI status was: Reserved..
Dump of CDB used:
27, Opcode 39 : BMIC write or no data transfer operation.
1, Logical Unit Number 1.
0, Command specific: 0.
0, Command specific: 0.
0, Command specific: 0.
0, Command specific: 0.
51, BMIC: Set Configuration.
2 0, Bytes to transfer = 512.
0, Reserved, must be 0.
Adding or modification of Raid Unit failed.
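
A side note on reading the listings above: each size line appears to give the same capacity twice, first in binary units and then, in brackets, in decimal units. A small Python sketch of that arithmetic, assuming 512-byte blocks (the function name is made up for illustration):

```python
BLOCK_BYTES = 512  # assumption: standard 512-byte disk blocks

def format_capacity(blocks):
    """Render a block count as binary GB followed by [decimal GB]."""
    size = blocks * BLOCK_BYTES
    return f"{size / 2**30:.2f} [{size / 10**9:.2f}] GB"

print(format_capacity(143363040))  # 68.36 [73.40] GB  (members of Unit 0)
print(format_capacity(71122560))   # 33.91 [36.41] GB  (the Unit 1 disk)
```

Both results match the figures MSA$UTIL printed for those partitions.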

Neil Rieck

Jan 18, 2018, 8:06:16 PM

While I don't have anything more to add to your post, I would be glad to see someone else respond with additional data. Something similar happened to me in November but there were so many things going on at the time that I didn't have time to play with the downed drive other than this:
http://neilrieck.net/docs/openvms_notes_itanium_diary.html#msa

Don't forget that the SAS utility goes hand-in-hand with the MSA utility. So until I have something more to rely on, I will continue my Friday walk-through of our computer room looking at the visual indicators :-)

Neil Rieck
Waterloo, Ontario, Canada.
http://neilrieck.net

Stephen Hoffman

Jan 19, 2018, 10:22:21 AM

On 2018-01-17 12:29:00 +0000, Rod Regier said:

> MSA> add unit 1/jbod/disk=1
> set_configuration SCSI status = ff, ASC = 0, ASCQ = 0
>
> SCSI status was: Reserved..

> ...

> Adding or modification of Raid Unit failed.

See if the MSA$UTIL command "ACCEPT UNIT" helps drag the newly-replaced
unit back to usefulness.

--
Pure Personal Opinion | HoffmanLabs LLC

Rod Regier

Jan 24, 2018, 2:36:26 PM

Rod Regier

Jan 24, 2018, 2:43:20 PM

Looks like I'll have to create an artificial failure test case to see if ACCEPT UNIT will do the trick next time {sigh}

Something like:

RX2600, SA6402, RAID mirror for OS and blank drive in 3rd slot.

Using MSA$UTIL
- Try online JBOD create as Unit 1
- Use offline create if necessary
- Load content into drive UNIT 1

Simulate failure:
- pull drive
- front zero
- insert drive

Recovery test sequence:
- check status
- SCAN ALL
- check status
- ACCEPT Unit 1
- check status
- reload content into UNIT 1
- check status

Any suggested embellishments?

Rod Regier

Jan 24, 2018, 2:45:42 PM

Concur, Neil. MSA$UTIL seems to talk to older controllers;
SAS$UTIL seems to talk to newer controllers.

Rod Regier

Jan 24, 2018, 2:55:26 PM

Neil: In response to your diary.

I wrote a DCL procedure that dumps MSA$UTIL and SAS$UTIL status
to a file and searches it for key phrases indicating that things are
going south on RAID logical units.
If it detects issues, it sends out an email alert.

Since we supply the procedure as part of our operations package I can't
quote the whole thing here, but here are the key search strings:

"failed","recovery","Degraded"
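
The scan itself is simple enough to sketch. This is not the DCL procedure described above, just a minimal Python illustration of the same idea, assuming the utility output has already been captured as text. The function name is hypothetical, the match is case-insensitive by assumption, and the email step is left out:

```python
KEY_PHRASES = ("failed", "recovery", "Degraded")  # the search strings quoted above

def find_alerts(status_text, phrases=KEY_PHRASES):
    """Return the lines of status_text that contain any key phrase.

    Matching ignores case (an assumption about how the scan should behave).
    """
    lowered = [p.lower() for p in phrases]
    return [
        line.strip()
        for line in status_text.splitlines()
        if any(p in line.lower() for p in lowered)
    ]

# Two status lines taken from the SHOW UNIT listing earlier in the thread
sample = "Volume status : VOLUME OK\nVolume status : VOLUME failed\n"
print(find_alerts(sample))  # ['Volume status : VOLUME failed']
```

An empty result means nothing needs attention; anything else would be the body of the alert.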

I wrote another one to monitor the status of the RAID controller batteries,
so I can detect a failed battery and replace it promptly.
Battery failure is especially pronounced on the SA6402 units.

A failed RAID cache battery that is not replaced disables battery-backed write caching.

I think the RX2800 RAID controller uses super capacitors so batteries
are a thing of the past on that product model.
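
A battery check can work the other way round from the keyword scan: rather than looking for bad phrases, flag anything that is not the known-good line. A minimal Python sketch under that assumption, again not the actual procedure. The only battery wording known from this thread is "Battery is fully charged.", so the text of a failing report is not assumed here, and the function name is hypothetical:

```python
GOOD_BATTERY_LINE = "Battery is fully charged."  # from the SHOW CONTROLLER output in this thread

def battery_needs_attention(show_controller_text):
    """Return True unless every 'Battery is ...' line is the known-good one.

    A report with no battery line at all is also treated as suspicious.
    """
    lines = [line.strip() for line in show_controller_text.splitlines()]
    battery_lines = [line for line in lines if line.startswith("Battery is")]
    return not battery_lines or any(line != GOOD_BATTERY_LINE for line in battery_lines)

healthy = "Battery:\nBattery is fully charged.\nController Mode:\n"
print(battery_needs_attention(healthy))  # False
print(battery_needs_attention(""))       # True: no battery line found
```

The allow-list approach errs toward false alarms, which seems the safer direction when the exact failure wording is unknown.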

Rod Regier

Jan 25, 2018, 3:27:10 PM

Test sequence produced a different outcome than expected.

This is in part due to my artificially "failing" the JBOD volume.

The Unit was functional after inserting the front-erased disk for a second time
and performing the SCAN ALL without any other commands.

Of course, the contents were trash, but that was expected.

Now my problem is that there is no way for me to perform a valid test
unless the JBOD disk fails spontaneously and the RAID controller recognizes that and degrades the Unit status.

So all I can do for now is try the Accept Unit # the *next* time I have a real failure.

Rod Regier

Jan 25, 2018, 3:44:21 PM

I did learn how to create a populated JBOD volume via hot plug with no reboots.

* Blank drive or re-purposed drive that has been front or full-surface zeroed.

* Hot insert the drive into a slot linked to the RAID controller

Something like this:

$SET PROC/PRIV=ALL
$MCR MSA$UTIL ! or SAS$UTIL depending on HW
MSA> SET CONTROLLER ! selects first valid RAID controller
MSA> SCAN ALL
MSA> SHOW UNIT ! confirm existing highwater UNIT in use and ref'd DISKs
MSA> SHO DISK ! You should see your new DISK not ref'd by UNITs
MSA> ADD UNIT # /JBOD/DISK=### ! new UNIT
MSA> SCAN ALL
MSA> SHO UNIT ! You should see your new UNIT
MSA> EXIT
$MCR SYSMAN
SYSMAN> IO AUTOCONFIGURE ! Get OS to re-examine RAID UNITs and assign any missing to DK drives.
SYSMAN> EXIT
$SHO DEV DK ! Your new JBOD should be present but unmounted

Create content on your JBOD. I like to create a JBOD as a separate FTP
recovery volume.

$MOU/FOR DKC1:
$BACKUP/IMAGE/INIT/FAST/IGN=(INTE) SYS$SYSDEVICE: DKC1:

If you want to make it bootable:

$DISMOUNT DKC1:
$MOU/OVER=IDENT DKC1:
$SET VOL/LABEL=newvalue DKC1:
$@SYS$MANAGER:BOOT_OPTIONS

Rod Regier

Jan 29, 2018, 9:44:27 AM

I have since learned that before exiting MSA$UTIL,
the following command should be run to complete the controller-level setup for the new unit:

MSA> ACCEPT UNIT #