
Using Linux for data archival


Cyber Punk

Jan 23, 2009, 7:45:38 PM
I currently have lots of data that I'd like to archive with varying
degrees of reliability. The data I have consists of:

1) Important documents - needs to be encrypted and redundantly stored.
I can use Truecrypt for encryption.

2) Large files of non-essential data I'd just like to have easily
accessible. DVD rips, isos, music.

3) Many small files of non-essential data, such as website wgets or
saved webpages.

Can anyone recommend:
1) The best Linux filesystem to use for such data; doesn't corrupt
easily, remains quick for a few large files/many small files, less
prone to data fragmentation.

2) What open source data archival software I should use that has on
average a high compression ratio and a recovery record to help recover
most/all of the archive in the event of data corruption.

3) Whether it is better to store data as tarfiles & compressed with a
recovery record, or uncompressed without being tarred and no recovery
record.

4) One insidious way hard drives fail is that files start
disappearing. Is there a way of getting Linux to report missing
files?

5) My file bookkeeping was less than perfect and sometimes I have
multiple copies of files of the same name. Is there a way of copying
everything into a large hard drive but getting Linux to only overwrite
clashing file names if they are newer?

6) Whether statistically speaking with some thought of cost, one is
better off with RAID arrays or just backing up key data to DVDs.

Thanks.

terryc

Jan 23, 2009, 8:48:01 PM
On Fri, 23 Jan 2009 16:45:38 -0800, Cyber Punk wrote:

> I currently have lots of data that I'd like to archive with varying
> degrees of reliability.

What do you mean by archive?

Are multiple copies on CD/DVD not sufficient?
Use different brands if you are really worried.

Robert Heller

Jan 23, 2009, 9:30:42 PM
At Fri, 23 Jan 2009 16:45:38 -0800 (PST) Cyber Punk <cyberpu...@googlemail.com> wrote:

>
> I currently have lots of data that I'd like to archive with varying
> degrees of reliability. The data I have consists of:
>
> 1) Important documents - needs to be encrypted and redundantly stored.
> I can use Truecrypt for encryption.

Sure.

>
> 2) Large files of non-essential data I'd just like to have easily
> accessible. DVD rips, isos, music.
>
> 3) Many small files of non-essential data, such as website wgets or
> saved webpages.
>
> Can anyone recommend:
> 1) The best Linux filesystem to use for such data; doesn't corrupt
> easily, remains quick for a few large files/many small files, less
> prone to data fragmentation.

Data fragmentation is a non-issue for all Linux file systems.
Generally all Linux filesystems perform reasonably well until they
become close to being full.

>
> 2) What open source data archival software I should use that has on
> average a high compression ratio and a recovery record to help recover
> most/all of the archive in the event of data corruption.
>
> 3) Whether it is better to store data as tarfiles & compressed with a
> recovery record, or uncompressed without being tarred and no recovery
> record.

Depends on the data.

>
> 4) One insidious way hard drives fail is that files start
> disappearing. Is there a way of getting Linux to report missing
> files?

I've never known this to happen ever on a Linux system. And with a
proper RAID-5 system it is a non-issue.

>
> 5) My file bookkeeping was less than perfect and sometimes I have
> multiple copies of files of the same name. Is there a way of copying
> everything into a large hard drive but getting Linux to only overwrite
> clashing file names if they are newer?
>
> 6) Whether statistically speaking with some thought of cost, one is
> better off with RAID arrays or just backing up key data to DVDs.

If the data is 'dead', i.e. not something you would be using regularly,
back it up to CD-Rs/DVD-Rs and store the media someplace safe. Print
proper labels and get a supply of jewel cases (to better physically
protect the media). If it is 'live', i.e. something you will be accessing
regularly, put it on a RAID array. Whether as a tarball or untarred
depends on what sort of immediacy is involved. Less immediate data can
live 'tarred and feathered', more immediate data laid out normally.

>
> Thanks.
>

--
Robert Heller -- 978-544-6933
Deepwoods Software -- Download the Model Railroad System
http://www.deepsoft.com/ -- Binaries for Linux and MS-Windows
hel...@deepsoft.com -- http://www.deepsoft.com/ModelRailroadSystem/

The Natural Philosopher

Jan 24, 2009, 4:07:42 AM
What I have here may work for you.

I have a Debian Linux server, which contains ALL my data that I don't
want to lose, and serves three desktop machines over SMB. It's a
six-year-old chassis with very little RAM and no screen at all.

Even my mail clients store all the mail on it, and I did toy with web
stuff, but decided a list of bookmarks and browsing history wasn't that
important.

It has a second hard drive, and every night a cron job rdiff updates the
second drive to be a copy of the first.
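
A minimal sketch of what such a nightly job could look like. This is a
simpler stand-in, not the script actually used here: it assumes GNU cp
rather than rdiff, and the script path in the crontab line is
hypothetical. It only adds and updates files; files deleted from the
source stay on the backup drive.

```shell
#!/bin/sh
# mirror_tree SRC DST: copy the contents of SRC into DST.
#   -a  preserve permissions, ownership and timestamps, recurse
#   -u  copy a file only if it is missing in DST or newer in SRC
# Assumption: GNU cp. Files deleted from SRC are NOT removed from DST.
mirror_tree() {
    src=$1
    dst=$2
    mkdir -p "$dst" && cp -au "$src"/. "$dst"/
}

# Hypothetical crontab entry running the script at 02:30 every night:
# 30 2 * * * /usr/local/bin/mirror.sh
```

The script would end with a call such as `mirror_tree /srv/data
/mnt/backup` (both paths are placeholders). The same `-u` behaviour also
covers the "only overwrite if newer" case from the original question.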

I looked at burning DVDs, but the cost of doing that was, after a short
while, higher than that of the second hard drive. And I had filename issues.

The second drive is twice as big as the first one. When the first one
fills up, I will make the second one the main data disk, get one twice
as big again, and use that as backup.

The advantages of doing this are:-

- Having a file server integrates well with the desktop machines: if you
get in the habit of using the server for everything, there is no extra
action required to ensure the data is on there.

- Using the second hard drive to back up the first is, if automated,
a huge boon. Unlike RAID, you actually have a *copy* of everything, so
if you screw up a file, as I did yesterday, the original is in last
night's backup. No need to find a backup and do anything untarrish.

- If either of the disks on the server goes pear-shaped, you have the
other. RAID can itself go pear-shaped. I always prefer mirroring to RAID
if the data is slow-moving enough. Use RAID to keep a fast-moving
data-handling machine up 24x7; don't use it to preserve archival
material.

- the data is instantly accessible. No need to store DVDs, find them,
insert and fiddle.


If you are truly paranoid, get a friend who is likewise, and back up
each other's data over the Internet. That works even if your machines get
stolen.


Douglas Mayne

Jan 26, 2009, 4:42:07 PM

I have been using multiple backup strategies. Optical media are
cheap and have a long shelf life (once burned and verified). I have
switched to using DVDs now for a few reasons; read more below. Magnetic
media, such as portable USB hard drives, have great capacity and fast
read/write times. The potential downside is that they are more fragile
than optical media and (reportedly) there are "drive stick" issues if the
drive goes too long without being used. Both optical and magnetic have
their place, IMO.

Optical...
I have switched to DVD-R optical media now because I am finally convinced
that it is as reliable as CD-R. The "sandwich" design of a DVD offers
more built-in protection. A CD-R's "film" is a layer applied directly to
the surface and can be damaged. DVDs can be written in 4G chunks, which is
much better than CD-R's 700 MB. (Use UDF structures to gain the ability
to write files larger than 2G.) The media are priced about the same per
disc, whether CD-R or DVD-R. I stick with brand name media, which
currently goes for about $0.30/disc (on sale). CD-Rs may be a bit
cheaper, but they are in that same ballpark.

Encryption...
As for encryption, Linux's built-in device mapper encryption works for
me; YMMV. It is very flexible and adapts nicely to entire volumes or
loopback files. One thing which can be done fairly easily is to "span"
across multiple 4G chunks using device mapper tables. This allows the
backup to be written to an encrypted container of a larger size than the
media allows, and without requiring a tedious "split" command.

Caution: The remainder of this post is long, and may be tending towards
being "off-topic."

Example...
A user would like to back up 20G of data. That will fit on 5 x 4G blocks.
The following is an example for creating a 20G encrypted container which
is to "span" across 5 DVD-Rs.

Disclaimer: Use extreme caution whenever working directly with partitions
or as the root user. Be sure you are familiar with all commands, and
that they are appropriate for your system. Surprises when backing up
are never fun!

Now, to begin with the example, the steps are as follows:

1. Allocate the five blocks on a local magnetic disk with adequate
free space.

# dd if=/dev/zero of=b1 bs=1024 count=0 seek=4000000
# dd if=/dev/zero of=b2 bs=1024 count=0 seek=4000000
# dd if=/dev/zero of=b3 bs=1024 count=0 seek=4000000
# dd if=/dev/zero of=b4 bs=1024 count=0 seek=4000000
# dd if=/dev/zero of=b5 bs=1024 count=0 seek=4000000

2. Set up loopbacks pointing to the new 4G sub-blocks.

# losetup /dev/loop1 b1
# losetup /dev/loop2 b2
# losetup /dev/loop3 b3
# losetup /dev/loop4 b4
# losetup /dev/loop5 b5

3. Set up the outer container using a device mapper table.
3.a. Create a file describing the container:
# vi tab
0 8000000 linear /dev/loop1 0
8000000 8000000 linear /dev/loop2 0
16000000 8000000 linear /dev/loop3 0
24000000 8000000 linear /dev/loop4 0
32000000 8000000 linear /dev/loop5 0

3.b. Create the container.

# cat tab | dmsetup create container
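
The table in 3.a can also be generated rather than typed by hand. Each
block was allocated as 4000000 units of 1024 bytes, which is 8000000
sectors of 512 bytes, and every entry starts where the previous one
ends. A small sketch; the function name make_linear_table is made up for
this example, and it assumes the loop devices are numbered from
/dev/loop1 upward as in step 2:

```shell
#!/bin/sh
# make_linear_table COUNT SECTORS
# Print a device-mapper table that concatenates /dev/loop1 .. /dev/loopCOUNT,
# each SECTORS long (one sector = 512 bytes).
# Line format: <start sector> <length> linear <device> <offset in device>
make_linear_table() {
    count=$1
    sectors=$2
    i=1
    start=0
    while [ "$i" -le "$count" ]; do
        echo "$start $sectors linear /dev/loop$i 0"
        start=$((start + sectors))
        i=$((i + 1))
    done
}
```

Running `make_linear_table 5 8000000 > tab` should reproduce the five
lines shown in 3.a.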

4. Generate encryption parameters and secret keys to be used for
your encrypted container.

Caution: whenever using encryption, take care not to lock
yourself out of your own data!

Familiarize yourself with the cryptsetup command, and its
parameters. man cryptsetup. It requires several parameters;
I'll skip ahead by assuming that the following environment
variables have been set according to your desires:
HASH
CIPHER
KL
and that the file "key" is initialized with the key

# cat key | cryptsetup -h $HASH -c $CIPHER -s $KL create econtainer /dev/mapper/container

5. Format your encrypted container. I usually just use ext2.

# mke2fs /dev/mapper/econtainer

6. Mount the encrypted container.

# mount /dev/mapper/econtainer /mnt/efs

7. Fill the encrypted container with the data to be encrypted/backed
up. Not shown. Mostly, I just use tar to back up. Choose the backup tool
that meets your needs.

Note that there are a couple of advantages to using a loopback filesystem,
instead of iso9660, etc. First, there will be no limitation with
directory depth, filenaming convention, or with respect to unix file
permissions. Since I chose the ext2 filesystem, all of that information will be
embedded and stored efficiently (even for large counts of small files, etc.)
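
For step 7, one possible way to fill the container with tar. This is
only a sketch of one option, not necessarily how the backup above was
made; the function name backup_tar and the source path in the usage line
are placeholders.

```shell
#!/bin/sh
# backup_tar SRC_DIR ARCHIVE
# Archive the contents of SRC_DIR into ARCHIVE, keeping permissions (-p),
# then list the archive once to check that it can be read back.
backup_tar() {
    src=$1
    archive=$2
    tar -C "$src" -cpf "$archive" . && tar -tf "$archive" > /dev/null
}
```

With the container mounted as in step 6, a call might look like
`backup_tar /home/user/documents /mnt/efs/documents.tar`.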

8. When your backup is complete, then begin "tearing down" the encrypted
container. This requires reversing the previous steps.
a. unmount the encrypted container, /mnt/efs
b. release the device mapper object, econtainer
c. release the device mapper object, container
d. release the loopback devices, (/dev/loop1 to /dev/loop5)
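
The teardown in step 8 could be scripted roughly as follows. It is a
sketch that uses the names from the steps above (/mnt/efs, econtainer,
container, /dev/loop1 to /dev/loop5). The RUN variable and the function
name are my own additions: by default the commands are only printed, so
nothing is touched until you set RUN to an empty string as root.

```shell
#!/bin/sh
# RUN is prefixed to every command. The default "echo" only prints the
# commands (dry run); set RUN="" to really execute them as root.
RUN=${RUN-echo}

# Undo steps 6, 4, 3 and 2, in that order. Stops at the first failure.
teardown_container() {
    $RUN umount /mnt/efs &&
    $RUN dmsetup remove econtainer &&
    $RUN dmsetup remove container &&
    for n in 1 2 3 4 5; do
        $RUN losetup -d "/dev/loop$n" || return 1
    done
}
```

The order matters: each object has to be released before the one it sits
on top of.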

9. (optional) Compute md5sums of every block (b1.md5 to b5.md5) and
the overall block (20G). If these values are available, then
the media can be easily checked in the future to verify that
the backup is still readable, etc. The sums enable this verification
without requiring that the block be assembled or be decrypted.
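
A sketch of step 9, assuming GNU md5sum. The function names are made up
for this example. Each sum file records only the bare file name, so it
still checks out after the block and its .md5 file are moved into a DVD
directory together.

```shell
#!/bin/sh
# write_sums FILE...
# Write FILE.md5 next to each FILE (for example b1 -> b1.md5).
write_sums() {
    for f in "$@"; do
        ( cd "$(dirname "$f")" &&
          md5sum "$(basename "$f")" > "$(basename "$f").md5" ) || return 1
    done
}

# verify_sums FILE...
# Re-read each FILE and compare it with FILE.md5; fails on any mismatch.
verify_sums() {
    for f in "$@"; do
        ( cd "$(dirname "$f")" &&
          md5sum -c --quiet "$(basename "$f").md5" ) || return 1
    done
}
```

Usage would be `write_sums b1 b2 b3 b4 b5` after the teardown, and later
`verify_sums` against the files read back from the discs. Since the
container is a plain concatenation of the blocks, the overall sum can be
taken with `cat b1 b2 b3 b4 b5 | md5sum`.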


10. Assemble your dvd images, in prep for mkisofs.

# mkdir 2009-dvd-037
# mv b1 2009-dvd-037
# mv b1.md5 2009-dvd-037
# mkdir 2009-dvd-038
# mv b2 2009-dvd-038
# mv b2.md5 2009-dvd-038
# mkdir 2009-dvd-039
# mv b3 2009-dvd-039
# mv b3.md5 2009-dvd-039
# mkdir 2009-dvd-040
# mv b4 2009-dvd-040
# mv b4.md5 2009-dvd-040
# mkdir 2009-dvd-042
# mv b5 2009-dvd-042
# mv b5.md5 2009-dvd-042

Note: along with the block data itself (b1,b2,b3,b4,b5) I include
a gpg message which holds the encryption parameters. The
message is encrypted for "multiple recipients." The recipient
list should include the users who are authorized to decrypt the
backup. In addition to the encryption parameters, the
encrypted message contains more information about what was
backed up, plus some tables and scripts which are useful
to anyone restoring the backup. For example, I include
the table with all of the "linear" statements, etc.

11. Write (and verify) your optical images. Not shown. I use
mkisofs and growisofs.
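
For step 11, the commands might look something like this. Treat it as a
sketch: the burner device /dev/dvd, the RUN variable and the function
name are assumptions, and the options should be checked against the man
pages for your versions of mkisofs and growisofs. By default the
commands are only printed.

```shell
#!/bin/sh
# RUN is prefixed to every command. The default "echo" only prints the
# commands (dry run); set RUN="" to really build and burn the image.
RUN=${RUN-echo}

# burn_dir DIR [DEVICE]
# Build DIR.iso from DIR with UDF structures and Rock Ridge names, using
# DIR as the volume label, then write the image to the DVD burner.
burn_dir() {
    dir=$1
    dev=${2-/dev/dvd}
    $RUN mkisofs -udf -r -V "$dir" -o "$dir.iso" "$dir" &&
    $RUN growisofs -dvd-compat -Z "$dev=$dir.iso"
}
```

For example, `burn_dir 2009-dvd-037` prints the two commands for the
first disc. Verification can then be done by mounting the burned disc
and checking the block against its md5 sum.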

Warning: hopefully, I didn't make too many mistakes describing
this process. It isn't really too bad, once you get used to doing it
and have some scripts to help in the process.

--
Douglas Mayne
