
Heads up: directory corruption on V8.4


Simon Clubley

Jun 6, 2012, 4:57:45 AM
A new patch kit has been released by HP to help them diagnose reported
problems with directory corruption on V8.4. The full announcement requires
an HP portal login, so I am not sure how much I can post, but this is the
relevant part of the announcement:

|5 NEW FUNCTIONALITY AND/OR PROBLEMS ADDRESSED IN THE VMS84A_F11X-V0100
| KIT
|
| 5.1 New functionality addressed in this kit
|
| 5.2 Problems addressed in this kit
|
| 5.2.1 Additional consistency checks have been added to
| the File System's (F11BXQP) directory operations.
|
| 5.2.1.1 Problem Description:
|
| Reports have been received about inconsistencies in
| directory files. In some cases, the directory
| records were not in alphabetical sequence. In some
| cases, there were stale records. In some cases,
| records that should have been present were missing.
| In all these cases, the affected directory files
| were several hundreds to several thousands of blocks
| long, and contained long file names that were
| identical or very similar in their first 12-17
| characters. The directories also contained many
| files with hundreds of versions. The problem has
| not been seen on engineering test systems as yet.
| Additional consistency checks have been added to the
| File System's (F11BXQP) directory operations. Where
| appropriate, the image has been enhanced to also
| report a warning to the system operator via OPCOM
| upon discovering directory inconsistencies.
|
| These OPCOM messages look like the ones below:
|
| EVENT : DIRECTORY CORRUPTION DETECTED
|
| VOLUME LABEL = FOO
|
| DIR_FID = (123,0,0)
|
| DIR FILE NAME = FOO.DIR;1
|
| MISSING FILE NAME = FOO1.TXT
|
| EVENT : DIRECTORY CORRUPTION DETECTED
|
| VOLUME LABEL = FOO
|
| DIR_FID = (123,0,0)
|
| DIR FILE NAME = FOO.DIR;1
|
| UNEXPECTED EFBLK UPDATE
|
| The SYSGEN parameter XQPCTLD7 is used to control the
| new behavior through setting bits 0, 1 and 2.
|
| The default value of XQPCTLD7 is ZERO, in which case
| the XQP uses a new internal counter to validate the
| number of Directory entries before and after a file
| create or delete operation.
|
| This default value should be appropriate for most
| systems. The only difference from previous XQP
| behavior is the additional validation described
| above. There is a negligible performance hit (under
| 1%) with this validation.
|
| If you have observed consistency issues with a
| directory file, and believe the issue is likely to
| recur in the near future, you may modify XQPCTLD7 as
| described below to help HP troubleshoot the issue.
|
| Setting Bit 0 causes the XQP to fill a known pattern
| in the unused blocks of a directory file. This
| enables the XQP to detect a directory inconsistency
| nearer the problem origin than otherwise. The
| performance impact of enabling this check, as
| observed in tests at HP, is around 3% for file
| create and delete operations, and negligible for
| other operations.
|
| Setting Bit 1 causes the XQP to request an inline
| bugcheck if it detects an inconsistency. This
| generates a system dump to aid HP engineering in
| root causing this issue. Bit 1 is clear (zero) by
| default and the XQP only reports an OPCOM message
| when it detects an inconsistency; the XQP does not
| request a bugcheck by default.
|
| Bit 2 is set to disable the new default behavior
| (XQPCTLD7=0) described above. XQPCTLD7 is a Dynamic
| SYSGEN parameter and any changes to it take effect
| without requiring a re-boot.
|
| This parameter applies to both ODS-2 and ODS-5
| volumes.
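
As a reading aid, here is a minimal sketch of how the three XQPCTLD7 bits
described above could be decoded. Only the bit meanings come from the
announcement; the function name and dictionary keys are hypothetical, and the
sketch assumes each bit acts independently, which the announcement does not
spell out.

```python
# Illustrative decoder for the XQPCTLD7 bit meanings quoted above.
# Names are hypothetical; only the bit meanings come from the announcement.

BIT_FILL_PATTERN = 1 << 0      # bit 0: fill unused directory blocks with a known pattern
BIT_BUGCHECK = 1 << 1          # bit 1: request a bugcheck on inconsistency
BIT_DISABLE_DEFAULT = 1 << 2   # bit 2: disable the default entry-count validation


def describe_xqpctld7(value: int) -> dict:
    """Return which checks a given XQPCTLD7 value would select.

    Assumption: the bits are treated as independent flags.
    """
    return {
        "entry_count_validation": not (value & BIT_DISABLE_DEFAULT),
        "fill_unused_blocks": bool(value & BIT_FILL_PATTERN),
        "bugcheck_on_inconsistency": bool(value & BIT_BUGCHECK),
    }


if __name__ == "__main__":
    for value in (0, 1, 3, 4):
        print(value, describe_xqpctld7(value))
```

With the default value of 0, only the entry-count validation is active and
inconsistencies are reported via OPCOM without a bugcheck.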

Has anyone seen this on V8.3 ?

(I am running V8.3 Alpha and have not seen this yet.)

Simon.

--
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP
Microsoft: Bringing you 1980s technology to a 21st century world

VAXman-

Jun 6, 2012, 8:34:28 AM
I've been working with a client for the past 2 weeks WRT this directory
corruption on OpenVMS Alpha V8.3. So, while you may not have seen this,
it does occur on V8.3 and is not isolated to Itanium.

FWIW, this client had HUGE directory files on the order of many thousand
blocks. They have installed this patch (part of UPDATE V17) and are now
busy cleaning up their directories. Drudge work but it has to be done.


--
VAXman- A Bored Certified VMS Kernel Mode Hacker VAXman(at)TMESIS(dot)ORG

Well I speak to machines with the voice of humanity.

Simon Clubley

Jun 6, 2012, 8:57:26 AM
Thanks for the confirmation that it is not restricted to V8.4.

> FWIW, this client had HUGE directory files on the order of many thousand
> blocks. They have installed this patch (part of UPDATE V17) and are now
> busy cleaning up their directories. Drudge work but it has to be done.
>

Since this patch does not actually fix anything, but just detects the
problem, we still don't know what the root cause is, which is rather
worrying.

I wonder if this problem has come about as part of a previous patch or
if it's present in the base V8.3 code. It seems to have been dormant
for a long time if it's the latter.

abrsvc

Jun 6, 2012, 9:54:58 AM
It is not unusual for bugs to surface even in seemingly stable software. Based upon what I have read thus far, the combination of large directory files with many versions seems to trigger the problem or at least make it more likely. Without a true understanding of the actual trigger, finding the problem is not easy. I recall a microcode bug for the 11/780 that was not found until VMS V3 running Backup. This meant that the bug had existed for 3+ years without being seen or found. Still a bug (although never "fixed").

Dan

Paul Sture

Jun 6, 2012, 10:11:13 AM
On Wed, 06 Jun 2012 08:57:45 +0000, Simon Clubley wrote:

> A new patch kit has been released by HP to help them diagnose reported
> problems with directory corruption on V8.4. The full announcement
> requires an HP portal login, so I am not sure how much I can post, but
> this is the relevant part of the announcement:
>
> In some cases, the directory records were not in alphabetical sequence.

>
> Has anyone seen this on V8.3 ?
>
> (I am running V8.3 Alpha and have not seen this yet.)

I came across this problem back in 1997 on a cluster, and put it down to
a mixture of ancient hardware and a cluster running a mixture of versions
that wasn't on the officially supported list.

I forget the exact details but a dump of the directory showed that an
entry for file ZZZ.DAT had landed up somewhere in the middle of the "C*"
entries. The result was that I couldn't access files whose entries were
further up the directory. IIRC I ended up doing something like deleting the
directory and running an ANALYZE/REPAIR on the disk, though I'm sure it
wasn't as simple as that (I dropped a shadowset member, did a
BACKUP/PHYSICAL and worked on that copy for safety, then repeated the
actions on the live disk). At no point did I believe I was going to lose the data
irretrievably, but I was lucky enough to have the problem reported by a
user as soon as it happened.

FWIW this involved files I *had* to get back, for the payroll team had
spent all day bashing in data and the deadline for sending the results to
the bank was looming.

--
Paul Sture

Keith Parris

Jun 6, 2012, 2:55:37 PM
On 6/6/2012 2:57 AM, Simon Clubley wrote:
> A new patch kit has been released by HP to help them diagnose reported
> problems with directory corruption on V8.4.

With the help of diagnostic images provided to customers who experienced
the directory corruption problem, OpenVMS Engineering has a potential
fix in the works. After feedback from these customers on the
effectiveness of the fix, an official patch kit containing the fix will
be released.

Simon Clubley

Jun 7, 2012, 7:28:28 AM
Thanks for this update Keith.

Do we know when the problem was introduced ?

If it was a recent patch kit which I have not installed, then I don't
need to install the new patch kit when it arrives. If it's a
long-standing latent problem in the base V8.3 code, then that's a
different matter. :-)

Thanks,

VAXman-

Jun 7, 2012, 7:46:46 AM
In article <jqq38r$h1k$1...@dont-email.me>, Simon Clubley <clubley@remove_me.eisner.decus.org-Earth.UFP> writes:
>On 2012-06-06, Keith Parris <keithparris...@yahoo.com> wrote:
>> On 6/6/2012 2:57 AM, Simon Clubley wrote:
>>> A new patch kit has been released by HP to help them diagnose reported
>>> problems with directory corruption on V8.4.
>>
>> With the help of diagnostic images provided to customers who experienced
>> the directory corruption problem, OpenVMS Engineering has a potential
>> fix in the works. After feedback from these customers on the
>> effectiveness of the fix, an official patch kit containing the fix will
>> be released.
>
>Thanks for this update Keith.
>
>Do we know when the problem was introduced ?
>
>If it was a recent patch kit which I have not installed, then I don't
>need to install the new patch kit when it arrives. If it's a long
>standing latent problem in the base V8.3 code, then that's a different
>matter. :-)

Simon,

The site I'm working with that has been getting the directory corruption had
a plain vanilla V8.3 Alpha installation with just a few compilers. Nothing
had been patched until the UPDATE V17 to install the F11X diagnostic patch.
So, it looks like you'll be patching when/if the directory corruption bug is
fixed.

Paul Sture

Jun 7, 2012, 9:14:00 AM
On Wed, 06 Jun 2012 12:57:26 +0000, Simon Clubley wrote:

>>
> Since this patch does not actually fix anything, but just detects the
> problem, we still don't know what the root cause is which is rather
> worrying.
>
> I wonder if this problem has come about as part of a previous patch or
> if it's present in the base V8.3 code. It seems to have been dormant for
> a long time if it's the latter.
>

This reminds me of something I came across in the early V6.2 era.

The problem was demonstrated by a procedure which attempted to address
the 32767 version number limit for a directory which contained tens of
thousands of spool files. It used the way BACKUP processes files in
reverse order of version number. E.g.

$ dir/da login.com

Directory SYS$SYSDEVICE:[PAUL]

LOGIN.COM;17 2-APR-2012 17:02:50.39
LOGIN.COM;16 29-JUL-2010 20:08:29.94

$ backup login.com [.temp]*.*.0/log
%BACKUP-S-CREATED, created SYS$SYSDEVICE:[PAUL.TEMP]LOGIN.COM;1
%BACKUP-S-CREATED, created SYS$SYSDEVICE:[PAUL.TEMP]LOGIN.COM;2
$ dir /da [.temp]

Directory SYS$SYSDEVICE:[PAUL.TEMP]

LOGIN.COM;2 29-JUL-2010 20:08:29.94
LOGIN.COM;1 2-APR-2012 17:02:50.39

Note that the highest version created is the oldest file.

This procedure then did the reverse to put the files back, but starting
with version 1.

What I found was that one day it got gaps in the version numbers. I
can't remember now, but BACKUP might have gone into some kind of
endless loop.

I did submit an SPR with a reproducer, and the last I heard of it was
that OVMS Engineering had managed to reproduce the problem. I changed
jobs shortly afterwards so have no idea if the SPR got fixed.

To fix our problem I wrote a bit of DCL which used ;-0 to find the lowest
version of a given file, rename it to ;1, and so on, so as far as the SPR
went, it dropped to a low priority for us.
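
For anyone wanting to see the renumbering idea spelled out, here is a rough
sketch in Python rather than DCL. It is not the original procedure: the
function name is hypothetical, and it assumes the goal is simply to pack the
existing versions down to ;1, ;2, ... while keeping their relative order.

```python
# Sketch of packing version numbers down to 1..n, lowest version first.
# Hypothetical helper; it only plans the renames and does not touch any files.

def plan_renumber(versions):
    """Return (old_version, new_version) pairs in the order to apply them.

    Working from the lowest version upwards means each target number is
    never higher than its source, so a rename cannot collide with a version
    that has not been moved yet.
    """
    plan = []
    for new, old in enumerate(sorted(set(versions)), start=1):
        if old != new:
            plan.append((old, new))
    return plan


if __name__ == "__main__":
    # Versions close to the 32767 limit, with gaps.
    print(plan_renumber([32760, 32765, 32767]))
```

Each pair would correspond to one RENAME of the form NAME.EXT;old to
NAME.EXT;new, applied in the listed order.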

--
Paul Sture

Simon Clubley

Jun 7, 2012, 1:00:03 PM
Yes, it certainly looks that way. :-)

Thanks, Brian (and everyone else who answered).