Here's an outline of how I would do this on Linux. Can any Windows
advocates provide a way to do something like this on Windows?

1. I'd mount each CD, and do this:

    find /mnt/cdrom -type f -print0 | xargs -0 md5sum > disc-N.sums
    cut -b1-32 disc-N.sums | sort | uniq > disc-N.sums2

2. I'd get sums for all the files on my hard disk:

    find / -type f -print0 | xargs -0 md5sum > disk.sums

3. Find all the files that exist on at least 3 of the CDs:

    cat disc-*.sums2 | sort | uniq -c | sed -e '/ [12] /d' | cut -b9- > on3
    grep -f on3 disk.sums | cut -b35-

Or let us change this a little. I want the list of all files on my disk
that aren't on at least 3 CDs. For that, we just have to change step 3.

3. Find all files that aren't on at least 3 CDs:

    cat disc-*.sums2 | sort | uniq -c | sed -n -e '/ [12] /p' | cut -b9- > not3
    grep -f not3 disk.sums | cut -b35-

Note that the above procedure is not making use of the file names or sizes.
It is checking the file content, so that if my file foo on the hard disk is
on CD1 under the name bar, CD2 under the name spam, and CD3 under the name
bletch, it will be found as being on 3 CDs. And if my file foo2 on the disk
is NOT on the CDs, but I do have files named foo2 on CD1, CD2, and CD3 that
are all the same size as my foo2 on the disk, the procedure will not be
fooled.
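
As a rough illustration of that point, here is a small sketch that compares files by checksum only. The scratch directories and file names are made up for the example and stand in for the hard disk and one CD; it assumes GNU md5sum.

```shell
# Hypothetical scratch directories standing in for a hard disk and a CD.
work=$(mktemp -d)
mkdir "$work/disk" "$work/cd1"
printf 'same bytes\n'  > "$work/disk/foo"
printf 'same bytes\n'  > "$work/cd1/bar"     # same content, different name
printf 'other bytes\n' > "$work/disk/foo2"
printf 'OTHER BYTES\n' > "$work/cd1/foo2"    # same name and size, different content

# Compare only the 32-character checksum, never the name or the size.
sum_of() { md5sum "$1" | cut -b1-32; }

[ "$(sum_of "$work/disk/foo")" = "$(sum_of "$work/cd1/bar")" ] \
    && foo_status=match || foo_status=differ
[ "$(sum_of "$work/disk/foo2")" = "$(sum_of "$work/cd1/foo2")" ] \
    && foo2_status=match || foo2_status=differ

echo "disk/foo  vs cd1/bar:  $foo_status"
echo "disk/foo2 vs cd1/foo2: $foo2_status"
```

The renamed copy is reported as a match, while the same-named, same-sized file with different content is not.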
(Something similar would work on OS X. It uses the Berkeley md5 command
instead of md5sum, which has a different output format, so some of the cut
commands would have to be replaced, but other than that, it would be the
same, as it would be on pretty much any Unix).
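
One possible way around the format difference, as a sketch: rather than changing the cut commands, convert the BSD-style "MD5 (name) = hash" lines into the md5sum layout first. The helper name bsd_to_md5sum and the sample path are invented for the example.

```shell
# Convert BSD-style "MD5 (name) = hash" lines to md5sum-style "hash  name",
# so the cut -b1-32 and cut -b35- offsets used above still line up.
bsd_to_md5sum() {
    sed -E 's/^MD5 \((.*)\) = ([0-9a-f]{32})$/\2  \1/'
}

# Sample line in the BSD format (the path is invented; the hash is the MD5
# of empty input).
printf 'MD5 (/mnt/cdrom/notes.txt) = d41d8cd98f00b204e9800998ecf8427e\n' | bsd_to_md5sum
```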
Is this something for a non-techie? Probably not. But on Linux, any techie
with basic shell scripting skills could turn this into a simple script that a
non-techie could use easily.
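
A minimal sketch of what such a script might look like, wrapped as a shell function. The name on_n_cds and its argument order are my own invention, it assumes the disc-N.sums2 files from step 1 already exist, and it swaps the sed filter for awk so that the threshold can be a parameter. GNU find, xargs and md5sum are assumed.

```shell
# on_n_cds MIN ROOT SUMS2-FILE...
# Print the files under ROOT whose checksum appears in at least MIN of the
# given per-disc checksum lists (one unique checksum per line, as in step 1).
on_n_cds() {
    min=$1
    root=$2
    shift 2
    tmp=$(mktemp -d) || return 1

    # Step 2: checksum every regular file under ROOT.
    find "$root" -type f -print0 | xargs -0 md5sum > "$tmp/disk.sums"

    # Step 3: count how many disc lists each checksum is in; keep >= MIN.
    sort "$@" | uniq -c | awk -v min="$min" '$1 >= min { print $2 }' > "$tmp/wanted"

    # -F treats the checksums as fixed strings; cut drops the 32-character
    # checksum and the two-character separator, leaving the file name.
    grep -F -f "$tmp/wanted" "$tmp/disk.sums" | cut -b35-

    rm -rf "$tmp"
}

# Example call (paths are placeholders):
#   on_n_cds 3 /home/me disc-*.sums2
```

Keeping the temporary files in a mktemp directory means the script does not end up checksumming its own output if it is run from inside ROOT.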
PS: there is a bug in the above procedure: a case it won't handle well.
Anyone spot it?
--
--Tim Smith
On Thu, 22 Jun 2006 03:26:11 -0000,
Tim Smith <reply_i...@mouse-potato.com> wrote:
> Suppose you've got several CDs that you've burned over the years, containing
> backups or archives of your files. You would like to find a list of all the
> files currently on your hard disc that are archived on at least 3 different
> CDs.
>
> Here's an outline of how I would do this on Linux. Can any Windows
> advocates provide a way to do something like this on Windows?
>
> 1. I'd mount each CD, and do this:
>
> find /mnt/cdrom -type f -print0 | xargs -0 md5sum > disc-N.sums
> cut -b1-32 disc-N.sums | sort | uniq > disc-N.sums2
>
> 2. I'd get sums for all the files on my hard disk:
>
> find / -type f -print0 | xargs -0 md5sum > disk.sums
>
Just a note: better to exclude /proc and probably /sys from this. It's
probably a good idea to do the same for /tmp and most of /var.
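
Something along these lines might do it, as a sketch. The function name sum_disk is made up, and for simplicity it prunes all of /var rather than "most of" it; adjust the list to taste. It assumes a find that supports -path and -print0 (GNU find does).

```shell
# sum_disk [ROOT]: md5sum every regular file under ROOT (default /),
# skipping proc, sys, tmp and var directly beneath it.
sum_disk() {
    root=${1:-/}
    top=${root%/}      # "/" becomes "", so the patterns read /proc, /sys, ...
    find "$root" \
        \( -path "$top/proc" -o -path "$top/sys" \
           -o -path "$top/tmp" -o -path "$top/var" \) -prune \
        -o -type f -print0 | xargs -0 md5sum
}

# Example call for step 2:
#   sum_disk / > disk.sums
```

The -prune stops find from descending into those directories at all, so nothing under /proc is ever opened.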
> 3. Find all the files that exist on at least 3 of the CDs:
>
> cat disc-*.sums2 | sort | uniq -c | sed -e '/ [12] /d' | cut -b9- > on3
> grep -f on3 disk.sums | cut -b35-
>
> Or let us change this a little. I want the list of all files on my disk
> that aren't on at least 3 CDs. For that, we just have to change step 3.
>
> 3. Find all files that aren't on at least 3 CDs:
>
> cat disc-*.sums2 | sort | uniq -c | sed -n -e '/ [12] /p' | cut -b9- > not3
> grep -f not3 disk.sums | cut -b35-
>
> Note that the above procedure is not making use of the file names or sizes.
> It is checking the file content, so that if my file foo on the hard disk is
> on CD1 under the name bar, CD2 under the name spam, and CD3 under the name
> bletch, it will be found as being on 3 CDs. And if my file foo2 on the disk
> is NOT on the CDs, but I do have files named foo2 on CD1, CD2, and CD3 that
> are all the same size as my foo2 on the disk, the procedure will not be
> fooled.
>
> (Something similar would work on OS X. It uses the Berkeley md5 command
> instead of md5sum, which has a different output format, so some of the cut
> commands would have to be replaced, but other than that, it would be the
> same, as it would be on pretty much any Unix).
>
> Is this something for a non-techie? Probably not. But on Linux, any techie
> with basic shell scripting skill could turn this into a simple script that a
> non-techie could use easily.
>
> PS: there is a bug in the above procedure: a case it won't handle well.
> Anyone spot it?
>
A couple, in fact. There are the /proc issues (wait until you try doing an
md5sum on /proc/kmem... although it might be fine now, it used to be a
kerblooiey kind of problem). Also, -type f will only find "real" files; soft
links and fifos/pipes will be ignored (which is good in the case of the fifo,
as otherwise md5sum would hang there, waiting for input).
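
A quick sketch of that -type f behaviour, using a throwaway directory (all the names here are invented for the example):

```shell
demo=$(mktemp -d)
echo data > "$demo/regular"
ln -s regular "$demo/link"     # soft link to the regular file
mkfifo "$demo/pipe"            # named pipe; md5sum on this would block

# Only the regular file is listed; the link and the fifo are skipped.
found=$(find "$demo" -type f | sort)
echo "$found"
```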
--
Jim Richardson http://www.eskimo.com/~warlock
You know you're in trouble when the Russians are adding safety features
to your design.
Maciej Cegłowski on Buran, the Space Shuttle clone
This is more what I had in mind. The golden tidbits of actual use.
--
Where are we going?
And why am I in this handbasket?