I've had essentially the same MAIL.MAI file since 1992. It has moved
across several physical disks, VAX and ALPHA machines and versions of
VMS. I've occasionally done a COMPRESS (the last time on 6-APR-2009) or
PURGE/RECLAIM.
A long time ago, to decrease the number of MAIL$* files per directory, I
used the FILE command to move messages to several new MAIL files in
several directories (but on the same disk).
So far, so good.
Yesterday, I did
MAIL> DIR MAIL/START=99999
and, instead of the recent messages I expected, the last one listed was
from 13-SEP-2009. After
MAIL> DIR MAIL
I got
%MAIL-E-READERR, error reading DISK$USER:[HELBIG.MAIL.MAIL]MAIL.MAI
-RMS-F-CHK, bucket format check failed for VBN = 27201
and the same error with another command:
MAIL> LAST/EDIT
%MAIL-E-READERR, error reading DISK$USER:[HELBIG.MAIL.MAIL]MAIL.MAI
-RMS-F-CHK, bucket format check failed for VBN = 27201
I then made a BACKUP of the file. I haven't touched the backup copy
(MAIL.BCK). I then RENAMEd the MAIL.MAI to MAIL.SAV. I then sent a mail
to myself so that a new MAIL.MAI file was created. I haven't noticed
any problems with this new mail file.
The disk in question shows no errors.
A few minutes ago, I did
MAIL> SET FILE DISK$USER:[HELBIG.MAIL.MAIL]MAIL.SAV
(i.e. I'm now using the RENAMEd MAIL.MAI which was having problems).
Now,
MAIL> DIR MAIL/START=99999
works as expected, as does
MAIL> LAST/EDIT
Questions:
What could have caused this problem?
Why did the problem go away after I RENAMEd the MAIL.MAI file?
Should I try out MAIL.BCK (the BACKUP copy of the corrupt file) with
some read-only commands, or should I not touch it at all for safety's
sake until I have more information?
I don't think I have done anything unsupported. My configuration might
not be optimal in terms of performance, but could the problems have
arisen by overstepping some (unofficial) boundaries such as the size of
MAIL.MAI (it was 28985 blocks), the number of files in the directory
where it resides, the number of messages etc.?
I emptied my MAIL.MAI file (created yesterday) by deleting or moving all
the files in all of the folders. I then RENAMEd the MAIL.SAV file back
to MAIL.MAI. It now works as expected, i.e. no errors.
Possibly relevant information: The disk in question is a 3-member shadow
set. For a bit over a month, it has been reduced to a 2-member shadow
set because a node in the cluster hosting the 3rd member died (problem
with the power supply) and I haven't gotten around to replacing it. Both
the remaining members are on the same node. I now see the following
entry in the operator log from that node:
***************
%%%%%%%%%%% OPCOM 23-NOV-2009 16:13:51.76 %%%%%%%%%%%
$22$DKA100: (DANEEL) has been removed from shadow set.
%%%%%%%%%%% OPCOM 23-NOV-2009 16:13:51.83 %%%%%%%%%%%
In other words, a few hours after I noticed the problem, one of the
members of the shadow set where the MAIL.MAI file resides was removed
(so that I am now down to one member). Obviously, my first priority now
will be getting this shadow set back to three members. (Fortunately, I
have enough identical disks so that I can replace some should they be
physically defective. They are SX910800N. I started out with a pair of
these (used when I got them) back in 2002 and for several years they ran
continuously with no problem. Back in the spring one disk then failed
(whether permanently damaged I don't know) and I replaced it with a new
(to me) disk and shortly after added a third disk to the shadow set.
This was a few months ago. This ran without problems (as far as I know)
until the node hosting one member got the problem with the power supply.
That was more than a month ago, so I've had the two-member shadow set
(both members are those which I put in back in the spring) since then
(until yesterday).)
I'm sending this message now and will then investigate more. Any help
is appreciated.
> ***************
> %%%%%%%%%%% OPCOM 23-NOV-2009 16:13:51.76 %%%%%%%%%%%
> $22$DKA100: (DANEEL) has been removed from shadow set.
>
> %%%%%%%%%%% OPCOM 23-NOV-2009 16:13:51.83 %%%%%%%%%%%
I privately mounted this disk /NOWRITE. No problem.
Unfortunately, MAIL can't show me anything since it apparently wants
WRITE access even if I just want to READ something. :-(
Since the disk mounted privately OK, I'll try to bring it back into the
shadow set and see what happens.
> ***************
> %%%%%%%%%%% OPCOM 23-NOV-2009 16:13:51.76 %%%%%%%%%%%
> $22$DKA100: (DANEEL) has been removed from shadow set.
>
> %%%%%%%%%%% OPCOM 23-NOV-2009 16:13:51.83 %%%%%%%%%%%
What are the possible reasons why such a disk is removed from the shadow
set and is there any way for me to find out more? Keep in mind that the
surviving member of the shadow set is on the same node (and on the same
controller and in the same expansion box).
Too late, I started to reply before reading through to the end! Oh no!
> %MAIL-E-READERR, error reading DISK$USER:[HELBIG.MAIL.MAIL]MAIL.MAI
> -RMS-F-CHK, bucket format check failed for VBN = 27201
This is an indexed file problem (RMS). It has nothing to do with your
disks (shadowed or not).
It is far more likely to have been caused by poor software such as the
POP and IMAP servers that come with TCPIP Services.
ANA/RMS MAIL.MAI may give you more information on the extent of the
damage. There is also ANA/RMS/INTERACTIVE which lets you "walk" through
the indexed file structure.
If you:
TYPE MAIL.MAI/OUTPUT=MAIL.NEW
does it also generate the error message? You might be able to recreate
it this way, and then CONVERT MAIL.NEW MAIL.MAI/FDL=MAIL.FDL
Another possibility would be to use DCL to read (sequentially) your
MAIL.MAI file using an alternate key, write to a sequential output file
and use CONVERT to get it back to indexed format.
In my case, MAIL.MAI didn't get corrupted, but the IMAP server screwed up
message listings which the client kept trying to reload. The only solution
was to move the whole mailbox to another location and start from
scratch, and use DECW$MAIL to access older mail messages from the older
location. That is one of the main reasons I am moving away from VMS.
Lastly, you could ana/RMS/FDL/OUTPUT=MAIL.FDL MAIL.MAI
then CONVERT/FDL=MAIL.FDL MAIL.MAI MAIL.NEW
If this works, you could just put MAIL.NEW as MAIL.MAI and your problem
would be solved.
OK, I mounted the disk back into the shadow set, and the shadow copy got
up to a few percent before I saw:
%%%%%%%%%% OPCOM 24-NOV-2009 09:44:23.06 %%%%%%%%%%%
Message from user SYSTEM on DANEEL
%SHADOW_SERVER-E-SSRVTERMCPY, terminating copy operation on device _DSA510:
at LBN: 1942592, ID number: 75000511.
%%%%%%%%%%% OPCOM 24-NOV-2009 09:44:23.07 %%%%%%%%%%%
Message from user SYSTEM on DANEEL
%SHADOW_SERVER-E-SSRVTRMSTS, reason for termination of operation on device
_DSA510: ABORT, abort
Unfortunately, I don't know WHY it aborted.
OK, time to swap out the disk.
Let this be a warning to everyone who thought I was paranoid because I
extended the shadow set to 3 members. :-|
> > %MAIL-E-READERR, error reading DISK$USER:[HELBIG.MAIL.MAIL]MAIL.MAI
> > -RMS-F-CHK, bucket format check failed for VBN = 27201
>
> This is an indexed file problem (RMS). It has nothing to do with your
> disks (shadowed or not).
>
> It is far more likely to have been caused by poor software such as the
> POP and IMAP servers that come with TCPIP Services.
Never in my entire life have I used POP or IMAP. I am justifiably proud
of this achievement. The mail files are accessed exclusively via VMS
MAIL.
So the first opening line is BS. When you compress, you get a brand
spanking new mail file with a brand spanking new (odd expression that
is!) RMS indexed structure.
Your messages are the same, but the file is not.
> MAIL> DIR MAIL/START=99999
:
> MAIL> DIR MAIL
>
> I got
>
> %MAIL-E-READERR, error reading DISK$USER:[HELBIG.MAIL.MAIL]MAIL.MAI
> -RMS-F-CHK, bucket format check failed for VBN = 27201
Those two commands both equally make RMS read the file by alternate
key 1, the folder key. They would touch the very same buckets, as RMS
(unfortunately) does not have a function to just read the index. It
always touches the data buckets also.
So one time reading bucket 27201 it received different results than
another time.
That means either a hardware error, or a shadow set out of sync.
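One way to test the out-of-sync theory is to compare the same run of blocks as read from each member. The sketch below is Python rather than DCL and assumes the raw blocks of each member have already been captured as byte strings (how you capture them is not shown); `differing_blocks` and the LBN 507200 are made up for illustration.

```python
BLOCK_SIZE = 512  # bytes per disk block


def differing_blocks(dump_a, dump_b, first_lbn=0):
    """Return the LBNs at which two equal-length block dumps disagree.

    dump_a and dump_b hold the same LBN range as read from two different
    shadow set members; first_lbn is the LBN of the first block in each dump.
    """
    if len(dump_a) != len(dump_b):
        raise ValueError("dumps must cover the same number of blocks")
    if len(dump_a) % BLOCK_SIZE:
        raise ValueError("dump length is not a whole number of blocks")
    diffs = []
    for offset in range(0, len(dump_a), BLOCK_SIZE):
        if dump_a[offset:offset + BLOCK_SIZE] != dump_b[offset:offset + BLOCK_SIZE]:
            diffs.append(first_lbn + offset // BLOCK_SIZE)
    return diffs


# Made-up data: five blocks, with one byte changed in the third block of member B.
member_a = bytes(5 * BLOCK_SIZE)
member_b = bytearray(member_a)
member_b[2 * BLOCK_SIZE + 7] = 0xFF

print(differing_blocks(member_a, bytes(member_b), first_lbn=507200))
```

An empty list would mean the members agree on those blocks, which points away from shadowing and towards something else (cache, hardware).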
> I then made a BACKUP of the file. I haven't touched the backup copy
> (MAIL.BCK).
Fine, but you may have backed up somewhat random data.
What you should have done is DUMP/HEADER/BLOC=COUNT=0 to figure out
the mapping of VBN 27201 and the 4 blocks following (assuming a
default 5 block bucket size).
Then dump the LOGICAL blocks behind those LBNs, FOR EACH SHADOW
MEMBER.
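The VBN-to-LBN arithmetic can be sketched as follows. This is Python, not DCL, and the extent map is invented: in practice the count/LBN pairs would come from the retrieval pointers shown in the file header dump. The function names are hypothetical, and the 5-block bucket size is the assumed default mentioned above.

```python
def vbn_to_lbn(extents, vbn):
    """Map a virtual block number (1-based) to a logical block number.

    extents is a list of (block_count, start_lbn) pairs in file order,
    one pair per retrieval pointer.
    """
    if vbn < 1:
        raise ValueError("VBNs start at 1")
    first_vbn = 1  # VBN of the first block in the current extent
    for count, start_lbn in extents:
        if vbn < first_vbn + count:
            return start_lbn + (vbn - first_vbn)
        first_vbn += count
    raise ValueError("VBN lies beyond the mapped blocks")


def bucket_lbns(extents, start_vbn, bucket_size=5):
    """LBNs of every block in the bucket that starts at start_vbn."""
    return [vbn_to_lbn(extents, v) for v in range(start_vbn, start_vbn + bucket_size)]


# Invented map for a 28985-block file stored in two extents.
extents = [(20000, 100000), (8985, 500000)]

print(bucket_lbns(extents, 27201))
```

Going block by block matters because a bucket may straddle two extents, in which case its LBNs are not contiguous on disk.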
Of course you'd also want to DUMP/BLO=(COUN=5,STAR=27201), but
>> I then RENAMEd the MAIL.MAI to MAIL.SAV.
Good. Better than copy.
>> MAIL> SET FILE DISK$USER:[HELBIG.MAIL.MAIL]MAIL.SAV
> MAIL> DIR MAIL/START=99999
> works as expected, as does
> MAIL> LAST/EDIT
That's much like the first command working while the second did not.
There are apparently 2 sources, and you happened to pick the right one
again.
> What could have caused this problem?
Hardware, or shadowing, or a VBN cache going awry, or
multiply-allocated-block style disk corruption?
RMS indexed files are one of the few (only) things that perform a handful
of checks before using any data just read. As such they are often
'canaries in the coal mine' for underlying problems. Other files may
just silently return random data.
> Why did the problem go away after I RENAMEd the MAIL.MAI file?
It didn't. It only seemed that way.
> Should I try out MAIL.BCK (the BACKUP copy of the corrupt file) with
> some read-only commands, or should I not touch it at all for safety's
> sake until I have more information?
You can fart around with a copy as much as you like. Why go easy?
You have the backup, right?
> could the problems have
> arisen by overstepping some (unofficial) boundaries such as the size of
> MAIL.MAI (it was 28985 blocks),
There are no boundaries. 30,000 blocks may seem big for an email file
to some, but it is not, and for an RMS indexed file holding that
structure it is very small.
I work daily with RMS indexed files multiple thousands of times bigger
than that.
>> I then RENAMEd the MAIL.SAV file back to MAIL.MAI. It now works as expected, i.e. no errors.
Now _that_ is scary, about 1000 times more scary than starting to work
on the backup. You are now working on the very blocks that have been
know
> Possibly relevant information: The disk in question is a 3-member shadow
> set.
... a yeah!
A simple CONVERT or Email COMPRESS may be all that is needed IF the
VBN corruption is not in a data bucket.
If that fails, you can either try to assemble the file from records
read around the bad spot, or try to fix up the broken bucket(s). There
is likely more than 1 broken bucket.
If you want help with that, then you need to provide a DUMP for the
(whole) bucket and maybe a block before and after.
Good luck!
Hein.
> > %MAIL-E-READERR, error reading DISK$USER:[HELBIG.MAIL.MAIL]MAIL.MAI
> > -RMS-F-CHK, bucket format check failed for VBN = 27201
>
> This is an indexed file problem (RMS). It has nothing to do with your
> disks (shadowed or not).
It certainly could, because in a shadow set there are multiple sources
to read from.
>
> It is far more likely to have been caused by poor software such as the
> POP and IMAP servers that come with TCPIP Services.
B*&& S^%~.
All of such software uses RMS to manipulate the Email files.
> ANA/RMS MAIL.MAI may give you more information on the extent of the
> damage. There is also ANA/RMS/INTERACTIVE which lets you "walk" through
> the indexed file structure.
Right!
> TYPE MAIL.MAI/OUTPUT=MAIL.NEW
> does it also generate the error message?
Right, but a simple $SEARC/STAT will do that work better, providing
counts.
> You might be able to recreate
> it this way, and then CONVERT MAIL.NEW MAIL.MAI/FDL=MAIL.FDL
Right
> Another possibility would be to use DCL to read (sequentially) your
> MAIL.MAI file using an alternate key, write to a sequential output file
> and use CONVERT to get it back to indexed format.
Why not reduce the number of moving parts and read by primary key?
When you get the -CHK error, take the key from the last successful
read and binary search between it and 'now' to find a next starting
place.
The fact that the key is binary makes that tricky, but not
impossible.
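The search itself can be sketched as below. This is Python with plain integers standing in for the binary keys, and `can_read` is a hypothetical stand-in for "a keyed read starting at this key value succeeds". It assumes the unreadable keys form one contiguous run directly after the last good key, which may not hold if several separate buckets are damaged.

```python
def first_readable_key(can_read, last_good, upper):
    """Smallest key in (last_good, upper] at which a read succeeds.

    Assumes every unreadable key lies in one contiguous run that starts
    right after last_good. Returns None if even upper cannot be read.
    """
    lo, hi = last_good + 1, upper
    if not can_read(hi):
        return None
    # Invariant: can_read(hi) is True and every key below lo is unreadable.
    while lo < hi:
        mid = (lo + hi) // 2
        if can_read(mid):
            hi = mid        # mid works, so the answer is mid or earlier
        else:
            lo = mid + 1    # mid fails, so the answer is after mid
    return lo


# Made-up example: keys 1000..1499 sit in the damaged bucket(s).
def can_read(key):
    return not (1000 <= key < 1500)


print(first_readable_key(can_read, last_good=999, upper=10_000))
```

With roughly N possible key values between the last good key and 'now', this needs about log2(N) read attempts instead of N, which is the point of searching rather than stepping.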
> Lastly, you could ana/RMS/FDL/OUTPUT=MAIL.FDL MAIL.MAI
> then CONVERT/FDL=MAIL.FDL MAIL.MAI MAIL.NEW
Right, except that the ANALYZE will likely fail.
So just ANAL/RMS/FDL any other MAIL.MAI and adjust the allocations (or
record counts if you also use EDIT/FDL/NOINTER).
Or use a tool like my FDL$GENERATE. Unlike ANAL/RMS, it does not read
the data blocks; it just uses the prologue information.
Cheers,
Hein.
Yeah, it needs to keep track of which emails are new and which
have been read. It does this by moving it out of the NEWMAIL
folder, which requires write access. And even if you're not
reading any new mail, mail doesn't know that when it opens
the file.
> So one time reading bucket 27201 it received different results than
> another time.
> That means either a hardware error, or a shadow set out of sync.
>
[...]
>
> >> I then RENAMEd the MAIL.MAI to MAIL.SAV.
>
> Good. Better than copy.
>
> >> MAIL> SET FILE DISK$USER:[HELBIG.MAIL.MAIL]MAIL.SAV
> > MAIL> DIR MAIL/START=99999
> > works as expected, as does
> > MAIL> LAST/EDIT
>
> That's much like the first command working and the second did not.
> It apparently has 2 sources, and you happen to pick the right one
> again.
>
> > What could have caused this problem?
>
> Hardware, or shadowing, or a VBN cache going awry, or
> multiply-allocated-block style disk corruption?
>
> RMS indexed files is one of the few (only) tools to perform a handful
> of checks before using any data just read. As such they are often
> 'canaries in the coal mine' for underlying problems. Other files may
> just silently return random data.
>
> > Why did the problem go away after I RENAMEd the MAIL.MAI file?
>
> It didn't. It only seemed that way.
>
[...]
> > Possibly relevant information: The disk in question is a 3-member shadow
> > set.
>
> ... a yeah!
>
[...]
> Hein.
Just as an aside, this reminds me of when I did a standalone backup of
a system disk that was a member of a two-member shadow set. I used
/RECORD. Then when bringing the system up with both disks, a DIRECTORY
command would show the backup date oscillating between two values with
each file. So I did a copy operation and all was well. Is there some
better way to have done this?
AEF
There are always issues when shadowing the system disk. I think the
method that is known to work is to do the standalone backup on the
first volume, then boot the system and let shadowing bring the second
volume up to date. (Which is a lot like what you did.)
As I admittedly opined (or pronounced [!]) before:
It is amazing how involved things get when you try to do something
that at first seems very simple.
AEF
> On Nov 24, 3:02 am, hel...@astro.multiCLOTHESvax.de (Phillip Helbig---
> undress to reply) wrote:
> > I've had essentially the same MAIL.MAI file since 1992. It has moved
> > across several physical disks, VAX and ALPHA machines and versions of
> > VMS. I've occasionally done a COMPRESS (the last time on 6-APR-2009) or
> > PURGE/RECLAIM.
>
> So the first opening line is BS. When you compress, you get a brand
> spanking new mail file with a brand spanking new (odd expression that
> is!) RMS indexed structure.
> You messages are the same, but the file is not.
Right; that's what I meant.
> So one time reading bucket 27201 it received different results than
> another time.
> That means either a hardware error, or a shadow set out of sync.
I suspect the latter since I later noticed that a member had been
expelled. The mail file seems fine on the remaining member.
Fortunately I have enough spare disks and will replace it tomorrow. (I
could have done it today, but the disk in the expansion box was the
wrong size (slightly fewer total blocks) so I'll have to get out the
screwdriver.)
> >> I then RENAMEd the MAIL.SAV file back to MAIL.MAI. It now works as
> >> expected, i.e. no errors.
>
> Now _that_ is scary, about 1000 times more scary than starting to work
> on the backup. You are now working on the very blocks that have been
> know
By the time I renamed it back, I was down to just one member in the
shadow set.