
Cluster node fails to boot from shadow-set member


Richard B

Apr 17, 2019, 8:22:57 AM
Hi all,

I'm having a serious issue trying to boot a node of a three-node cluster. The cluster consists of two rx2600 i2 machines and one rx2660. Each node runs VMS V8.4 and has internal-only storage. Each node has its own SYSTEM disk, but the system disk of the two rx2600 i2 nodes is shadowed between them:

Node 1: DSA0: $1$DKA0:
Node 2: DSA0: $2$DKA0:

Node 2 boots fine and mounts the virtual device DSA0: with its physical unit at $2$DKA0:

It then forms a cluster just fine with the third, rx2660, node.

When I try to boot Node 1 I get a bugcheck during boot. The output is:

%CNXMAN, Now a VMScluster member -- system RYEIP1
%SHADOW-F-/-CONFIGSCAN, enabled automatic disk serving

**** OpenVMS I64 Operating System V8.4 - BUGCHECK ****

** Bugcheck code = 000008CC: SHADBOOTFAIL, SHADOWING failed to boot from system disk shadow set
** Crash CPU: 00000001 Primary CPU: 00000000 Node Name: RYEIP1
** Highest CPU number: 00000003
** Active CPUs: 00000000.0000000F
** Current Process: NULL


As you can see, Node 1 (the one with the bugcheck) sends the cluster request to the third node and the three-node cluster is formed. (Kinda/sorta.) Then, immediately, the bugcheck.

Obviously there is something Node 1 doesn't like about the SYSTEM shadow-set. For the life of me I can't figure out (a) why and (b) how to get around this.

All SYSGEN parameters are set appropriately as far as shadowing is concerned. (SHADOWING, SHADOW_SYS_UNIT, etc.)
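
(For anyone checking the same thing, the shadowing-related parameters can be inspected with SYSGEN roughly as below. This is a sketch only; the values in the comments are what a shadowed system disk DSA0: would typically want, not confirmed settings from this cluster.)

$ MCR SYSGEN
SYSGEN> SHOW SHADOWING        ! 2 = host-based volume shadowing enabled
SYSGEN> SHOW SHADOW_SYS_DISK  ! 1 = the system disk is a shadow set member
SYSGEN> SHOW SHADOW_SYS_UNIT  ! unit number of the system disk shadow set (0 for DSA0:)
SYSGEN> SHOW ALLOCLASS        ! node allocation class (1 and 2 on these nodes)
SYSGEN> EXIT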

Anyone have any ideas what is going on here and what I might do to fix this?

Thanks in advance.

Richard

Jan-Erik Söderholm

Apr 17, 2019, 8:49:21 AM
Has it worked with the same setup before?
Is this something that just appeared suddenly?


Richard B

Apr 17, 2019, 8:51:34 AM
Jan-Erik,

This is a newly formed cluster using an imaged save-set from a previously functioning cluster.

Volker Halle

Apr 17, 2019, 10:18:21 AM
Richard,

the reason for the SHADBOOTFAIL crash seems to be hidden/overwritten by the CONFIGSCAN message on the console.

%SHADOW-F-...

Starting with OpenVMS Alpha V6.2, there is an additional message printed on the
console terminal together with the bugcheck printout, which contains additional
information about the reason for this bugcheck.

Node 2 seems to boot from the physical disk $2$DKA0: and forms a DSA0: system disk shadowset with a single shadowset member: $2$DKA0:.

Node 1 then boots from its local physical disk $1$DKA0: and tries to also form a shadowed system disk DSA0: with a member of $1$DKA0:. After joining the cluster, it detects that there already is a DSA0: shadowset with a single VALID member of $2$DKA0:.

Node 1 then crashes with SHADBOOTFAIL and the most likely reason:

%SHADOW-F-BDNOMBREX, boot device is not a member of existing shadow set

If you want the two rx2600 i2 systems to boot from the same shadowed system disk DSA0:, you need a shared SCSI bus via which both systems can directly access $1$DKA0: and $2$DKA0:.

Volker.

Richard B

Apr 17, 2019, 10:29:21 AM
On Wednesday, April 17, 2019 at 8:22:57 AM UTC-4, Richard B wrote:
Thanks for the info Volker. Actually, this is the entire output from the console:
%SHADOW-F-/-CONFIGSCAN, enabled automatic disk serving

**** OpenVMS I64 Operating System V8.4 - BUGCHECK ****

** Bugcheck code = 000008CC: SHADBOOTFAIL, SHADOWING failed to boot from system disk shadow set
** Crash CPU: 00000001 Primary CPU: 00000000 Node Name: RYEIP2
** Highest CPU number: 00000003
** Active CPUs: 00000000.0000000F
** Current Process: NULL
** Current PSB ID: 00000001
** Image Name:

So, in the absence of a shared SCSI bus, which I do not have, what are the possible solutions? Non-shadowed system disks?

Volker Halle

Apr 17, 2019, 11:30:39 AM
>> %SHADOW-F-/-CONFIGSCAN, enabled automatic disk serving

Seems like the %SHADOW-F-... message has been overwritten by the ...-CONFIGSCAN message.

You could still have single-member system disk shadowsets: DSA0: on node 1 with $1$DKA0: as the only member, and DSA1: on node 2 with $2$DKA0: as the only member.
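
That would only need the standard shadowing parameters on each node; a minimal MODPARAMS.DAT sketch, assuming the unit numbers above (run AUTOGEN afterwards and reboot):

! Node 1 - SYS$SYSTEM:MODPARAMS.DAT
SHADOWING = 2           ! enable host-based volume shadowing
SHADOW_SYS_DISK = 1     ! the system disk is a shadow set
SHADOW_SYS_UNIT = 0     ! system disk shadow set is DSA0:

! Node 2 - SYS$SYSTEM:MODPARAMS.DAT
SHADOWING = 2
SHADOW_SYS_DISK = 1
SHADOW_SYS_UNIT = 1     ! system disk shadow set is DSA1: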

Volker.

Richard B

Apr 17, 2019, 11:54:31 AM
On Wednesday, April 17, 2019 at 8:22:57 AM UTC-4, Richard B wrote:
That's what I'm working on right now Volker, after giving it some further thought. Thank you very much for your input, sir, it's much appreciated. Vielen Dank.

Stephen Hoffman

Apr 17, 2019, 12:19:12 PM
On 2019-04-17 14:29:18 +0000, Richard B said:

> So, in the absence of a shared SCSI bus, which I do not have, what are
> the possible solutions? Non-shadowed system disks?

For a cluster member to boot from a common system disk... you need a direct and multi-host path to the system device storage. Here that'd usually be either a multi-host-supported and Tagged Command Queuing (TCQ)-capable SCSI controller with supported shared SCSI storage such as an MSA30-MI HDD or SSD shelf, or a Fibre Channel Host Bus Adapter (HBA) and Fibre Channel storage. Or you could configure this Integrity box as a satellite with a served satellite path and network boot, where the OpenVMS host serving the disks can shadow or RAID them. Or you can configure multiple parallel system disks. Here, you're likely going to end up with parallel system disks, and with those you then mount a storage device (usually in SYLOGICALS.COM) holding a common set of shared files; see SYLOGICALS.TEMPLATE for a partial list of the roughly two dozen files that necessarily have to be shared for a stable and consistent cluster. The shared storage for SYSUAF, RIGHTSLIST and the rest of the necessarily-shared files can be configured with host-based or controller-based RAID. The local system boot device storage will probably be controller-based RAID.




--
Pure Personal Opinion | HoffmanLabs LLC

Richard B

Apr 17, 2019, 12:27:28 PM
On Wednesday, April 17, 2019 at 8:22:57 AM UTC-4, Richard B wrote:
Thanks for the input Steve, much appreciated sir.

Richard B

Apr 17, 2019, 12:54:09 PM
On Wednesday, April 17, 2019 at 8:22:57 AM UTC-4, Richard B wrote:
Steve,

I think I'm going to go along with Volker's suggestion: one which I had contemplated anyway.
On Node 1 I'm going to have an HBVS two-disk system disk - DSA0:/shadow=($1$dka1:,$1$dka2:)
On Node 2 I'm going to have an HBVS two-disk system disk - DSA1:/shadow=)$2$dka1:,$2$dka2:)

I think this should work. However, it will require more maintenance on my part as far as the UAFs and RIGHTSLIST are concerned.

What do you think?

PS: Yes, the P410i controllers in each of the rx2800 i2's are in RAID mode, but I think I'm going to go with HBVS for the system disks in lieu of, say, RAID 1. Just a personal preference, I suppose.
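
For illustration only, and with the parenthesis slip in the DSA1: line corrected, the startup commands for that plan would look roughly like the following in each node's SYSTARTUP_VMS.COM. This is a sketch: the volume labels NODE1SYS and NODE2SYS are made up, and the device names are simply the ones from the plan above.

$ ! Node 1 - bring the second member into the system disk shadow set
$ MOUNT/SYSTEM/NOASSIST DSA0: /SHADOW=($1$DKA1:,$1$DKA2:) NODE1SYS
$
$ ! Node 2
$ MOUNT/SYSTEM/NOASSIST DSA1: /SHADOW=($2$DKA1:,$2$DKA2:) NODE2SYS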

Stephen Hoffman

Apr 17, 2019, 2:02:28 PM
On 2019-04-17 16:54:07 +0000, Richard B said:

> I think I'm going to go along with Volker's suggestion: one which I had
> contemplated anyway.
> On Node 1 I'm going to have an HBVS two-disk system disk -
> DSA0:/shadow=($1$dka1:,$1$dka2:)
> On Node 2 I'm going to have an HBVS two-disk system disk -
> DSA1:/shadow=)$2$dka1:,$2$dka2:)

Parenthetical typo aside, I'd avoid using allocation classes 1 and 2
here and in general, as those two allocation classes are (also) used
(and are required) by FC disk and tape storage, and there's no way to
change that allocation class usage. That's not a problem now. But
that's a problem if (when?) FC storage arrives here. Best to avoid that
usage.

> I think this should work. However it will require more maintenance on
> my part as far as the UAFs and RIGHTSLIST are concerned.

Once you get the files shared, there's no difference.

Get one OpenVMS system installed, with nothing else configured or
added. Use the files from that installation as your start for the
common area.

Getting the logical names configured to reference the shared file is an
utter and unmitigated and hilarious disaster of a user interface and
the printed documentation here has been problematic for decades,
but—following SYLOGICALS.TEMPLATE—it does work.

> What do you think?

No sé. No quiero saber. No necesito saber.

> PS: Yes, the P410i controllers in each of the rx2800 i2's are in RAID
> mode but I think I'm going to go with the HBVS for the system disks in
> lieu of, say, RAID 1. Just a personal preference I suppose.

I'd usually use the hardware RAID. Just personal preference. One less
thing to mess with "upstairs" in OpenVMS, and I don't need to contend
with host-based RAID-1 and a system disk.

Richard B

Apr 17, 2019, 4:23:52 PM
On Wednesday, April 17, 2019 at 8:22:57 AM UTC-4, Richard B wrote:
Steve,

I defined both the SYSUAF and RIGHTSLIST logicals, pointing to SYS$SYSTEM:SYSUAF.DAT and SYS$SYSTEM:RIGHTSLIST.DAT respectively, in LNM$SYSCLUSTER_TABLE, yet when I made a test mod to an account on one node it did not show on the other. (I deassigned both logicals from the SYSTEM_TABLE, which is where they were before.) Is there something I'm not catching here?

Stephen Hoffman

Apr 17, 2019, 4:35:09 PM
On 2019-04-17 20:23:49 +0000, Richard B said:

> I defined both the SYSUAF and RIGHTLIST logicals to point to
> SYS$SYSTEM:SYSUAF.DAT and SYS$SYSTEM:RIGHTSLIST.DAT respectively, in
> the LNM$SYSCLUSTER_TABLE yet when I made a test mod to an account on
> one node it did not show on the other. (I deassigned both logicals
> from the SYSTEM_TABLE which is where they were before.). Is there
> something I'm not catching here?

Please read the comments in the previously-cited SYLOGICALS.TEMPLATE file.

There's a whole lot more than two files involved here.

I would define these in the system table and not the cluster table, but
that's your choice.

You'll need to reboot to use the new definitions.
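
As a rough sketch of the kind of thing SYLOGICALS.TEMPLATE walks through, and only as a sketch: the device and directory below (DSA100:[VMS$COMMON]) are made up, the definitions belong on every node, and the real list of shared files is much longer than these two.

$ ! In SYS$MANAGER:SYLOGICALS.COM on each node
$ DEFINE/SYSTEM/EXEC SYSUAF      DSA100:[VMS$COMMON]SYSUAF.DAT
$ DEFINE/SYSTEM/EXEC RIGHTSLIST  DSA100:[VMS$COMMON]RIGHTSLIST.DAT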

And you're repeatedly quoting your own message here, and not the replies.

Not everybody runs with threaded news readers, or with threading enabled.

And even if somebody is running with threading enabled, quoting your own message
provides no useful new context.

Richard B

Apr 17, 2019, 4:42:04 PM
Steve,

I had no intention of quoting myself which, as you rightfully point out, serves no useful purpose. But the browser-interface to this forum provides no other method of replying directly to a specific message other than to cut the original text and then do the reply, as I just did. Leave it to Giggle.

I'll refer to SYLOGICALS.TEMPLATE as you suggested, yes.

Thanks again for your input. And I promise not to reply while quoting my original text. Very similar to talking to oneself on a park bench, in a trench coat whilst feeding the pigeons. Mea culpa!

Stephen Hoffman

Apr 17, 2019, 4:56:53 PM
On 2019-04-17 20:42:03 +0000, Richard B said:

> I had no intention of quoting myself which, as you rightfully point
> out, serves no useful purpose. But the browser-interface to this forum
> provides no other method of replying directly to a specific message
> other than to cut the original text and then do the reply, as I just
> did. Leave it to Giggle.

The Google Groups interface and gateway into usenet isn't the only path
into the newsgroups.

As an alternative, a news client for your preferred platform, and access via...

http://www.eternal-september.org
http://albasani.net/index.html.en
http://news.aioe.org
etc.

If you want support, there's a not-free plan at:
http://easynews.com/usenet-plans.html
etc.

Michael Moroney

Apr 17, 2019, 5:07:41 PM
Richard B <richard...@gmail.com> writes:

>HI all,

>I'm having a serious issue trying to boot a node of a three-node cluster.
>The cluster is comprised of 2x rx2600 i2 machines and 1x rx2660. Each node
>is running VMS v8.4 and has internal-only storage. Each node has its own
>SYSTEM disk but the 2 rx2600 i2's system disk is shadowed between the two
>nodes:

>Node 1: DSA0: $1$DKA0:
>Node 2: DSA0: $2$DKA0:

>Node 2 boots fine and mounts the virtual device DSA0: with its physical
>unit at $2$DKA0:

>It then forms a cluster just fine with the third, rx2660, node.

>When I try to boot Node 1 I get a bugcheck during boot. The output is:

>%CNXMAN, Now a VMScluster member -- system RYEIP1
>%SHADOW-F-/-CONFIGSCAN, enabled automatic disk serving

>**** OpenVMS I64 Operating System V8.4 - BUGCHECK ****

A setup like that with two nodes with a system disk shadowset with non-shared
members will not work. When a node is down, its disk gets removed from the shadowset
(as seen by the other node) and when it tries to return, its system disk is out of
date, and shadowing won't accept it.

You really need shared storage of some sort.

I have heard of a *hack* that sort of allows this. Sort of. Each system MSCP
serves its drive to the other. Each system is set up to boot over the network from
the other node's disk. As long as the other node is up (and the quorum issue has
been dealt with), it will boot, add its own disk as a copy target and once the
shadow copy completes (can take a while!), all is well.

Much better to add some sort of shared storage (fibrechannel).

Obviously, this will not work if both nodes are down. Each will be waiting for the
other, and nothing will happen. One of the nodes will have to be manually booted
from its own disk.

Stephen Hoffman

Apr 17, 2019, 6:13:23 PM
On 2019-04-17 21:07:40 +0000, Michael Moroney said:

> I have heard of a *hack* that sort of allows this. Sort of. Each
> system MSCP serves its drive to the other. Each system is set up to
> boot over the network from the other node's disk. As long as the other
> node is up (and the quorum issue has been dealt with), it will boot,
> add its own disk as a copy target and once the shadow copy completes
> (can take a while!), all is well.

Wouldn't touch that configuration with the proverbial barge pole.

And a shadow copy across gigabit Ethernet isn't speedy with what's
large for OpenVMS but now considered a dinky terabyte or two of hard
disk or SSD storage. Yeah, maybe minicopy...

And neither system can boot from the secondary if the local primary
storage fails, absent reconfiguration as a cluster satellite.

Whether scrounging used HBAs and a switch and some low-end FC SAN
storage, or scrounging multi-host SCSI and storage, would be cheaper?

More supportable to use the existing disks and the existing RAID
controllers, and to then configure the shared files...

Phillip Helbig (undress to reply)

Apr 17, 2019, 7:26:18 PM
In article <fa564223-9342-4e11...@googlegroups.com>,
Richard B <richard...@gmail.com> writes:

> I had no intention of quoting myself which, as you rightfully point out,
> serves no useful purpose. But the browser-interface to this forum provides
> no other method of replying directly to a specific message other than to
> cut the original text and then do the reply, as I just did. Leave it to
> Giggle.

Get NEWSRDR. It runs on VMS. Use EDT, customizable with macros. Don't
waste your time with a web browser in usenet.

Dave Froble

Apr 17, 2019, 11:16:46 PM
What people don't seem to want to face is, VMS cluster and Star coupler
were made for each other.

Yeah, there have been newer storage options, but the bottom line is, if
your plan is to throw CPUs at the system, shared storage is what's
called for.

--
David Froble Tel: 724-529-0450
Dave Froble Enterprises, Inc. E-Mail: da...@tsoft-inc.com
DFE Ultralights, Inc.
170 Grimplin Road
Vanderbilt, PA 15486

Michael Moroney

Apr 17, 2019, 11:21:15 PM
Stephen Hoffman <seao...@hoffmanlabs.invalid> writes:

>On 2019-04-17 21:07:40 +0000, Michael Moroney said:

>> I have heard of a *hack* that sort of allows this. Sort of. Each
>> system MSCP serves its drive to the other. Each system is set up to
>> boot over the network from the other node's disk. As long as the other
>> node is up (and the quorum issue has been dealt with), it will boot,
>> add its own disk as a copy target and once the shadow copy completes
>> (can take a while!), all is well.

>Wouldn't touch that configuration with the proverbial barge pole.

Not endorsing it. After all, it is a configuration that requires a shadow
copy upon every reboot.

>And a shadow copy across gigabit Ethernet isn't speedy with what's
>large for OpenVMS but now considered a dinky terabyte or two of hard
>disk or SSD storage. Yeah, maybe minicopy...

Yes. I don't recall if minicopy is an option for system disks like this.

>And neither system can boot from the secondary if the local primary
>storage fails, absent reconfiguration as a cluster satellite.

Maybe you didn't understand what I wrote. In this bizarre configuration, each
system is essentially a satellite of the other.

>Whether scrounging used HBAs and a switch and some low-end FC SAN
>storage, or scrounging multi-host SCSI and storage, would be cheaper?

At least better. I haven't looked to see what islandco may have or if one can
get a couple of interface cards, switch and FC chassis on Ebay cheap, or
something.

Hans Vlems

Apr 18, 2019, 1:55:51 AM
@Richard B: it’s a newly formed cluster, right?
Are you sure that the cluster authentication is correct?
Node 1’s shadowing parameters all set correctly (also shadow_sys_disk)?
Does the node boot standalone, from a DVD, and if so will it boot from a local shadow set (which may have just one physical volume for this purpose)?
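
For the cluster authorization question, a quick check from a privileged account looks roughly like this (a sketch only; it displays the cluster group number, not the password):

$ MCR SYSMAN
SYSMAN> CONFIGURATION SHOW CLUSTER_AUTHORIZATION
SYSMAN> EXIT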