Help with solving an error message on a cpio backup

David Font

May 12, 2004, 8:51:48 PM
I am trying to solve an error message at an OSR5 site performing a 'relative'
cpio backup of the operating system "/" filesystem.

This is an OSR505 system that was upgraded nearly 2 years ago to OSR506:
- It is patched with rs506a, oss635a, oss636a, oss639a.
- There are 2 hard drives which are IDE IBM 30GB.
- The tape drive is a Tandberg 4GB/8GB attached to SMDS-type SCSI adaptor
installed with the 'bhba' driver.

When performing the / filesystem backup, the commands are as follows:

cd /
find . -mount | cpio -ocv | compress -H | dd of=/dev/rct0 conv=sync

This has worked before at this site and at many other sites over the years,
on OSR502 - OSR507.

This site, however, reports a failure about 5 minutes into the backup. The
error noted below, on descriptor 2, is intercepted by the script running the
cpio:

dd: write error: No such device or address (error 6)

This error is occurring in directory:
/usr/lib/custom/custom/DBCache/SCO:OpenServerCD

As the user had just decided to purchase their own media, I initially
suspected a media error or a media-type incompatibility. The reason I
originally thought this was that I had sent down one of my tapes and the
cpio to that tape had worked.

However, there are only so many times the user can go back to their
retailer and request replacement media. The matter is dragging on,
exasperating the user and worrying me in case there is a future
catastrophic failure.

This site successfully performs 'cpio' backups of other directories and
filesystems using the same commands, just the "/" filesystem backup fails.

Does anybody know what (error 6) actually means?
Am I looking in the wrong direction by blaming the quality of the media?
Is it possibly a disk error?

Thanks in advance
Dave


Bela Lubkin

May 13, 2004, 3:30:40 AM
to sco...@xenitec.ca
David Font wrote:

> I am trying to solve an error message at an OSR5 site performing a 'relative'
> cpio backup of the operating system "/" filesystem.
>
> This is an OSR505 system that was upgraded nearly 2 years ago to OSR506:
> - It is patched with rs506a, oss635a, oss636a, oss639a.
> - There are 2 hard drives which are IDE IBM 30GB.
> - The tape drive is a Tandberg 4GB/8GB attached to SMDS-type SCSI adaptor
> installed with the 'bhba' driver.
>
> When performing the / filesystem backup, the commands are as follows:
>
> cd /
> find . -mount | cpio -ocv | compress -H | dd of=/dev/rct0 conv=sync
>
> This has worked before at this site and at many other sites over the years,
> on OSR502 - OSR507.
>
> This site, however, reports a failure about 5 minutes into the backup. The
> error noted below, on descriptor 2, is intercepted by the script running the
> cpio:
>
> dd: write error: No such device or address (error 6)

A cursory look at the "Stp" source makes me think that this must be due
to a tape "end of medium" condition. That would be where the drive
reports to the driver that it's seen a media-specific "the end of this
tape is coming up soon" marker on the tape.

You shouldn't see that 5 minutes into the backup; at least, it seems
unlikely.

I don't have any definitive answers on this. But let's look at your
backup command for a moment:

> find . -mount | cpio -ocv | compress -H | dd of=/dev/rct0 conv=sync

The actual writing to the tape is being done by `dd`. You haven't
specified a block size, so dd will use its default, 512 bytes. That is
almost certainly a horribly inefficient block size for that drive (or
almost any tape drive ever made). As a result, the drive probably
spends a lot of time "shoeshining" -- seeking back and forth on the tape
because it's lost track of where it was.

Shoeshining shortens the life of both the drive and the media. It also
reduces the effective size of the media, because after each write-seek
cycle, the write head is positioned somewhere after the previous write
-- you'd like it to be _immediately_ after, but in practice, most drives
waste a _lot_ of space between writes.

You're also using "conv=sync". `dd` loops, reading a chunk of its block
size (here, 512 bytes), then writing that chunk out. Without
"conv=sync", if one of those reads comes up short (say it only got 120
bytes), it writes a short block to match. With "conv=sync", it writes
the full block size; in my example, the 120 bytes it successfully read
plus 392 bytes of 0. You're compressing your data stream. If dd ever
actually did this, the inserted 0 bytes would corrupt the compressed
data stream; you would be unable to uncompress from the tape.
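
This corruption mode is easy to demonstrate on a modern box (a sketch only: gzip stands in for compress, GNU head/tail/dd are assumed, and the /tmp file names are illustrative). We splice one 512-byte block of zeros into the middle of a compressed stream -- exactly what conv=sync would insert on a short read -- and decompression breaks:

```shell
# Build a small compressed file, then a copy with 512 zero bytes spliced
# in right after the 10-byte gzip header.
printf 'hello hello hello\n' | gzip -c > /tmp/good.gz
head -c 10 /tmp/good.gz > /tmp/bad.gz                       # header survives
dd if=/dev/zero bs=512 count=1 2>/dev/null >> /tmp/bad.gz   # injected padding
tail -c +11 /tmp/good.gz >> /tmp/bad.gz                     # rest of the stream

gzip -dc /tmp/good.gz                                  # intact copy is fine
gzip -dc /tmp/bad.gz 2>/dev/null || echo "padded copy is unreadable"
```

The zeros land in the middle of the compressed stream, so the decompressor rejects everything from that point on -- which on a real tape would mean every file after the padding is lost.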

What you really want to be doing is reblocking the `compress` output
into chunks large enough to make the tape drive happy, and absolutely
_not_ using "conv=sync". Something more like:

find . -mount | cpio -ocv | compress -H | dd of=/dev/rct0 obs=16k

This uses the implied input block size (ibs) of 512 bytes. When ibs !=
obs, dd reblocks. You wouldn't want to use "bs=16k", which sets both
ibs and obs; when they're the same, dd doesn't reblock, it just copies
the result of each read directly out as a same-size write. Since you're
reading from a pipe, you'll never get more than 5K in a single chunk;
not enough to make the drive happy, and an odd size that might make it
especially unhappy.
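
The reblocking rule is easy to check against a scratch file (GNU dd assumed; /tmp/reblocked is just an illustrative name): 32 default-size 512-byte input records, with obs=16k and no matching ibs, come out as a single 16K output record.

```shell
# ibs defaults to 512; since ibs != obs, dd accumulates input and emits
# full 16K output blocks.  32 x 512 bytes = exactly one 16K record.
dd if=/dev/zero count=32 2>/dev/null | dd obs=16k of=/tmp/reblocked 2>/dev/null
wc -c < /tmp/reblocked
```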

When dd is reblocking, nothing prevents the last block from being short.
If the drive is especially picky, that might be a problem. You may
originally have added "conv=sync" because the very last block was being
written short, aggravating some particular drive. Padding the end of
the file wouldn't hurt your compressed stream, and would mollify the
drive. To handle that, you could use dd _twice_ in the pipeline, once
to create large blocks and a second time to pad the final short block;
something like:

... | dd obs=4k | dd ibs=4k obs=16k conv=sync

"conv=sync" pads _input_ blocks, so the final write is guaranteed to be
an exact multiple of 4K.
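
Putting both suggestions together, the whole pipeline might look like the sketch below. To keep it runnable without a tape drive, a scratch directory stands in for /, gzip for compress -H, and /tmp/tape.img for /dev/rct0 (all stand-ins; on the real system you would cd / and write to the tape device):

```shell
# Hedged sketch: reblock the compressed stream to 16K and pad the final
# short block via a second dd with conv=sync.
SRC=$(mktemp -d)
echo "payload" > "$SRC/file1"
( cd "$SRC" && find . -mount | cpio -oc 2>/dev/null ) \
  | gzip -c \
  | dd obs=4k 2>/dev/null \
  | dd ibs=4k obs=16k conv=sync of=/tmp/tape.img 2>/dev/null
# conv=sync on the second dd pads its input blocks, so the image length
# is a whole number of 4K records.
wc -c < /tmp/tape.img
```

The padding is appended after the end of the compressed stream, so unlike conv=sync on short mid-stream reads, it does not corrupt the data.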

Next, you could set the block size being used by the tape _driver_ by
using `tape -a 16384 setblk`. Sometimes this matters, sometimes it
doesn't.

I suspect that the overall solution to your problem lies somewhere in
the general area of block sizes. As you can see, this may be a bit
tricky to figure out.

Automatically figuring out this sort of stuff is one of the many
benefits of the various "super-tar" commercial backup programs. They
also have extended diagnostic capabilities and very capable tech support
staff. Aside from whatever the technical issue turns out to be, you are
creating this problem for yourself by trying to build a "sophisticated"
backup procedure out of the crude tools at hand. There are other problems with
the way you're compressing (for instance, a single bad bit on the
tape guarantees that every file past that point will be completely
unreadable).

> This error is occurring in directory:
> /usr/lib/custom/custom/DBCache/SCO:OpenServerCD
>
> As the user had just decided to purchase their own media, I initially
> suspected a media error or a media-type incompatibility. The reason I
> originally thought this was that I had sent down one of my tapes and the
> cpio to that tape had worked.
>
> However, there are only so many times the user can go back to their
> retailer and request replacement media. The matter is dragging on,
> exasperating the user and worrying me in case there is a future
> catastrophic failure.
>
> This site successfully performs 'cpio' backups of other directories and
> filesystems using the same commands, just the "/" filesystem backup fails.
>
> Does anybody know what (error 6) actually means?
> Am I looking in the wrong direction by blaming the quality of the media?
> Is it possibly a disk error?

It sounds like it's in the tape subsystem (drive, driver, or bad
orders).

Who's paying for all this time? Grab a free test install from
www.backupedge.com, www.lonetar.com, www.ctar.com, www.bru.com.
Whichever one you try, it will probably do the job much faster and take
up less tape; and if it fails, the failure will give you a lot more
information about what went wrong.

>Bela<

Mike Brown

May 13, 2004, 10:22:34 PM

You might try running a backup into /dev/null just to check that everything
is reading from the hard drive correctly.

The real answer is to buy one of the supertars. You get bit-level verify,
much faster restore of a single file, the ability to make bootable
floppies/CDs for a bare-metal recovery, AND a utility like edge.sizer that
will write out and report the maximum data a particular tape will handle.
Great for testing suspect SCSI controller, tape drive and media
combinations.

Mike

--
Michael Brown

The Kingsway Group

David Font

May 13, 2004, 11:53:12 PM
The user has accepted that they bought from a batch of dud tapes. I sent one
of my tapes (same brand) and it works.

However, I am experimenting with suggestions to improve throughput
efficiency using dd/cpio.

I currently use similar commands when doing cpio backups on UW711. For
instance I use obs=64k piped into the dd of=/dev/rct0 command with bs=64k.

Dave
"Bela Lubkin" <be...@sco.com> wrote in message
news:20040513073...@sco.com...

Bela Lubkin

May 14, 2004, 3:48:22 AM
to sco...@xenitec.ca
David Font wrote:

> The user has accepted that they bought from a batch of dud tapes. I sent one
> of my tapes (same brand) and it works.
>
> However, I am experimenting with suggestions to improve throughput
> efficiency using dd/cpio.
>
> I currently use similar commands when doing cpio backups on UW711. For
> instance I use obs=64k piped into the dd of=/dev/rct0 command with bs=64k.

I was going to say that this probably didn't work like you thought, but
some tests show that actually it does -- on UW7. Observe:

UW7$ dd if=/dev/zero obs=4k count=1024 | dd bs=4k of=junk
1024+0 records in
128+0 records out
128+0 records in
128+0 records out
UW7$ dd if=/dev/zero obs=64k count=1024 | dd bs=64k of=junk
1024+0 records in
8+0 records out
8+0 records in
8+0 records out
UW7$ dd if=/dev/zero obs=128k count=1024 | dd bs=128k of=junk
1024+0 records in
4+0 records out
0+8 records in
0+8 records out
OSR5$ dd if=/dev/zero obs=4k count=1024 | dd bs=4k of=junk
1024+0 records in
128+0 records out
128+0 records in
128+0 records out
OSR5$ dd if=/dev/zero obs=64k count=1024 | dd bs=64k of=junk
1024+0 records in
8+0 records out
0+64 records in
0+64 records out
OSR5$ dd if=/dev/zero obs=128k count=1024 | dd bs=128k of=junk
1024+0 records in
4+0 records out
0+64 records in
0+64 records out

Each final file was 524288 bytes long. Notice the status report from
the 2nd `dd` in each example. Notice that both 4K tests, and the 64K
test on UW7, show "n+0" records in and out. The UW7 128K test and the
OSR5 64K & 128K tests show "0+n" records in and out.

What's happening is that the size of the kernel pipe buffer on UW7
appears to be 64K, and on OSR5 it is 8K (I mistakenly implied it was 5K
in my previous post). In both 128K tests, the first `dd` writes 128K at
a time, but those writes get blocked in the middle (when the pipeline
fills). The reader can read a maximum of one pipe buffer at a time. On
UW7, that's 64K at a time -- an exact match for the 64K test, but only
half-size reads in the 128K test. Because ibs and obs are the same, dd
doesn't reblock; it ends up making 8 64K writes. Those are reported as
"8+0" in the 64K test, but "0+8" in the 128K test: 0 full-sized writes
and 8 partials.

OSR5's 8K pipe buffer is sufficient for the 4K test. At 64K and 128K we
see "0+64" records. Dividing the total 512K by 64, we can see that the
pipe buffer is 8K.
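
The same probe can be reproduced with GNU dd (an assumption here: a Linux pipe's default capacity is 64K, so a 128K reader shows the same "0+n" signature as the UW7 128K case):

```shell
# 256 x 512-byte records = exactly 128K, written to the pipe as one 128K
# block.  The reader asks for 128K at a time, but a pipe can never hand
# back more than its capacity in one read, so every record comes up
# partial and dd reports "0+n records in".
dd if=/dev/zero obs=128k count=256 2>/dev/null \
  | dd bs=128k of=/tmp/pipeprobe 2>/tmp/pipeprobe.log
cat /tmp/pipeprobe.log
wc -c < /tmp/pipeprobe
```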

All of which goes to show that if you're trying to reblock for efficient
tape access, you can use `dd` this way for sizes up to 64K on UW7, but
only 8K on OSR5. You can reblock to a larger size by deliberately using
different ibs and obs:

OSR5$ dd if=/dev/zero count=1024 | dd obs=128k of=junk
1024+0 records in
1024+0 records out
1024+0 records in
4+0 records out
OSR5$ dd if=/dev/zero count=1024 | dd ibs=8k obs=128k of=junk
1024+0 records in
1024+0 records out
10+149 records in
4+0 records out

In the first example I allowed ibs to be the default 512 bytes. In the
second example I used 8k. Notice that the 2nd `dd` says there were 10
complete 8K records, as well as a bunch of partials. That's due to the
vagaries of process scheduling. If the writer process got to run
unimpeded for a while, it might have filled the entire 8K pipe buffer.
Then the reader process's 8K read would return a full 8K. Most of the
time the processes alternate faster than that.

A flaw with this scheme is that if the input isn't an exact multiple of
the obs, the last block will be short:

OSR5$ dd if=/dev/zero count=1023 | dd obs=128k of=junk
1023+0 records in
1023+0 records out
1023+0 records in
3+1 records out

If you were only doing this for performance, that doesn't matter. If
you were doing it because the tape drive insists on a specific fixed
block size (some do), it's a problem. Which makes me think that `dd`
should have a flag like "conv=opad" or "conv=osync" that means to pad
that last partial block to the full obs. But it doesn't.
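
The double-dd workaround from earlier in the thread covers this case: applied to the 1023-record example above, the second dd's conv=sync pads each input block to 4K, so the output comes out as whole records instead of "3+1" (a sketch with illustrative file names, GNU dd assumed):

```shell
# 1023 x 512 bytes is not a multiple of 4K; the middle dd reblocks to 4K
# and the final dd pads its input blocks, so the file length ends up an
# exact multiple of 4096 bytes.
dd if=/dev/zero count=1023 2>/dev/null \
  | dd obs=4k 2>/dev/null \
  | dd ibs=4k obs=128k conv=sync of=/tmp/padded.img 2>/dev/null
wc -c < /tmp/padded.img
```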

I'm pretty sure that on OSR5, you'll get more mileage from `tape -a
setblk`. And even more from a super-tar.

>Bela<

David Font

May 16, 2004, 10:43:25 PM
Yes, I initially tried 16K and was going to drop down to 8K, then 4K, to see
what difference it made. Thanks for the insight.

Dave
"Bela Lubkin" <be...@sco.com> wrote in message

news:20040514074...@sco.com...
