The script is fairly sparse, but it does some minimal error checking and reporting.
There is plenty of room for improvement!
<begin script>
#!/bin/bash
#
# Usage: dedup.scr "source_dir" "dest_dir"
#
# Purpose: Deduplicate files in directories A and B (with hard-links).
# Directory A is the potential source directory,
# Directory B is the potential dest directory.
#
# Return Value:
# 0 : no errors
# 1 : improper usage
# 2 : bad source or bad dest directory
# 3 : check some problem creating file in dest directory (un-writable?)
# 4 : check some problem creating hardlink in dest directory
# 5 : check some problem removing existing file on dest directory
# 6 : check some problem making hardlink at dest directory
#
# Method:
# Name matches are candidates for md5sum computation checks.
# md5sum matches are deemed "complete matches." They are removed
# from the target, and hardlinked back to the match in the source.
#
# Copyright (C) 2010 Douglas D. Mayne
#
# This program is licensed under the General Public License. See this file
# for terms and conditions: http://www.gnu.org/licenses/gpl-2.0.html
#
[ ! $# -eq 2 ] && exit 1
[ ! -d "$1" ] && exit 2
[ ! -d "$2" ] && exit 2
T1=$(tempfile -pdedup.)
T2=$(tempfile -pdedup.)
T3=$(tempfile -pdedup.)
T4=$(tempfile -pdedup.)
T5=$(tempfile -pdedup.)
T6=$(tempfile -pdedup.)
T7=$(tempfile -pdedup.)
T8=$(tempfile -pdedup.)
A1=$(cd "$1" && pwd)
A2=$(cd "$2" && pwd)
(cd "$A1" && find . -type f) >$T1
(cd "$A2" && find . -type f) >$T2
cat $T1 $T2 | sort | uniq -d >$T3
C_NAME_INT=$(wc -l <$T3)
if [ $C_NAME_INT -eq 0 ];then
echo No name matches. No Deduplication.
exit 0
fi
#
# Verify we can make a hardlink on target.
#
j=$(basename $T8)
k=$(head -1 <$T3)
(cd "$A2";
echo Begin testing...try to create a hardlink at target.
touch $j;
[ ! -e $j ] && exit 3
rm $j
mkdir $j
cp -al "${A1}/${k}" "$j"
k=$(basename $k)
[ ! -e ${j}/${k} ] && exit 4
rm -fr $j
echo Successfully tested creating hardlink at target.
)
RV=$?
[ $RV -ne 0 ] && exit $RV
(IFS=$'\n';cd "$A1" && for i in $(cat $T3);do
md5sum $i
done) >$T4
(IFS=$'\n';cd "$A2" && for i in $(cat $T3);do
md5sum $i
done) >$T5
cat $T4 $T5 | sort | uniq -d | sed 's/^.\{34\}//' | sort >$T6
[ ${DEBUG:=0} -ne 0 ] && cat $T6 && exit 0
MC=$(wc -l <$T6)
if [ $MC -eq 0 ];then
echo No complete matches. No deduplication.
else
echo $MC complete matches. Beginning deduplication.
(IFS=$'\n';cd "$A2" && for i in $(cat $T6);do
rm "$i" && echo "$i"
[ -e "$i" ] && echo "$i" >>$T7
done)
EC1=$(wc -l <$T7)
if [ $EC1 -ne 0 ];then
echo Error Count: $EC1
echo Existing files could not be deleted as listed in this file, $T7
exit 5
fi
(IFS=$'\n';cd "$A1" && for i in $(cat $T6);do
ln "$i" "${A2}/${i}"
[ ! -e "$i" ] && echo "$i" >>$T8
done)
EC2=$(wc -l <$T8)
if [ $EC2 -ne 0 ];then
echo Error Count: $EC2
echo Hard links could not be created for the files as listed here, $T8
exit 6
else
echo Ended deduplication. No errors.
fi
fi
rm $T1 $T2 $T3 $T4 $T5 $T6 $T7 $T8
exit 0
<end script>
--
Douglas Mayne
> # Purpose: Deduplicate files in directories A and B (with hard-links).
> # Directory A is the potential source directory,
> # Directory B is the potential dest directory.
I see the three-line "Purpose" comment in the code, but I'm still not
clear what the script is intended to do. If you want useful critique, I
think this needs to be better explained.
* Is it to scan directory A and hardlink files? If so, what's B for?
* Is it to scan A and copy/link files into B, hardlinking those targets
when they're duplicates? If so, does that mean that A remains pristine
(untouched)?
* Why are these "potential" source/dest directories?
* What's your definition of "duplicate" files? Same contents? Or also
same timestamp, permissions, and owner/group?
Chris
> I am probably asking for it, but here is a script submitted for your
> comments/improvements. BTW, I have only done minimal testing with it;
> USE AT YOUR OWN RISK!
>
> The script is fairly sparse, but it does some minimal error checking and
> reporting. There is plenty of room for improvement!
>
I had to laugh at today's xkcd: http://xkcd.com/844/
I fell into the trap of thinking I could write "quick" good code. So, here is an
updated version with a few more sanity checks included. Hopefully, this
version is closer to working, and I won't have to "throw it all out
and start over."
<begin script>
#!/bin/bash
#
# Usage: dedup.scr "source_dir" "dest_dir"
#
# Purpose: Deduplicate files in directories A and B (with hard-links).
# Directory A is the potential source directory,
# Directory B is the potential dest directory.
#
# Return Value:
# 0 : no errors
# 1 : improper usage
# 2 : bad source and/or bad dest directory
# 3 : check some problem creating file in dest directory (un-writable?)
# 4 : check some problem creating hardlink in dest directory
# 5 : check some problem removing existing file on dest directory
# 6 : check some problem making hardlink at dest directory
#
# Method:
# Name matches are candidates for md5sum computation checks.
# md5sum matches are deemed "complete matches." They are removed
# from the target, and hardlinked back to the match in the source.
#
# Copyright (C) 2010, 2011 Douglas D. Mayne
# Last modified: 2011-01-07 11:07 MST
#
# This program is licensed under the General Public License. See this file
# for terms and conditions: http://www.gnu.org/licenses/gpl-2.0.html
#
[ $# -ne 2 ] && exit 1
[ ! -d "$1" ] && exit 2
[ ! -d "$2" ] && exit 2
A1=$(cd "$1" && pwd)
[ $? -ne 0 ] && exit 2
A2=$(cd "$2" && pwd)
[ $? -ne 0 ] && exit 2
[ "$A1" == "$A2" ] && exit 2
#
# Temp files
#
T1=$(tempfile -pdedup.)
T2=$(tempfile -pdedup.)
T3=$(tempfile -pdedup.)
T4=$(tempfile -pdedup.)
T5=$(tempfile -pdedup.)
T6=$(tempfile -pdedup.)
T7=$(tempfile -pdedup.)
T8=$(tempfile -pdedup.)
T9=$(tempfile -pdedup.)
#
# begin subshell, cleanup temp files after exit
#
(echo Beginning deduplication.
(cd "$A1" && find . -type f) >$T1
(cd "$A2" && find . -type f) >$T2
cat $T1 $T2 | sort | uniq -d >$T3
C_NAME_INT=$(wc -l <$T3)
if [ $C_NAME_INT -eq 0 ];then
echo No name matches. No Deduplication.
exit 0
fi
#
# Verify we can make a hardlink on target.
#
j=$(basename $T8)
k="$(head -1 <$T3)"
(IFS=$'\n';cd "$A2";RV=3
echo Begin testing...try to create a hardlink at target.
touch "$j";
[ ! -e "$j" ] && exit $RV
rm "$j"
mkdir "$j"
cp -al "${A1}/${k}" "$j"
k="$(basename $k)"
if [ -e "${j}/${k}" ];then
echo Successfully tested creating hardlink at target.
RV=0
else
echo Error: Unsuccessful tests for creating hardlink at target.
RV=4
fi
rm -fr "$j"
exit $RV
)
RV=$?
[ $RV -ne 0 ] && exit $RV
(IFS=$'\n';cd "$A1" && for i in $(cat $T3);do
md5sum "$i"
done) >$T4
(IFS=$'\n';cd "$A2" && for i in $(cat $T3);do
md5sum "$i"
done) >$T5
cat $T4 $T5 | sort | uniq -d | sed 's/^.\{34\}//' | sort >$T6
[ ${DEBUG:=0} -ne 0 ] && cat $T6 && exit 0
MC=$(wc -l <$T6)
if [ $MC -eq 0 ];then
echo No complete matches. No deduplication.
exit 0
fi
echo Found $MC complete matches.
(IFS=$'\n';cd "$A2" && for i in $(cat $T6);do
rm "$i" && echo "$i"
[ -e "$i" ] && echo "$i" >>$T7
done)
EC1=$(wc -l <$T7)
if [ $EC1 -ne 0 ];then
echo Error: Some existing files could not be deleted at target dir.
echo Check $EC1 files as listed in this file, $T7
#
# Don't just bail out if EC1 is non-zero.
# To avoid introducing damage to dest directory, go ahead and create
# hardlinks for files that were just deleted.
#
cat $T6 $T7 | sort | uniq -u >$T9
else
cat $T6 >$T9
fi
(IFS=$'\n';cd "$A1" && for i in $(cat $T9);do
ln "$i" "${A2}/${i}"
[ ! -e "${A2}/${i}" ] && echo "$i" >>$T8
done)
EC2=$(wc -l <$T8)
if [ $EC2 -ne 0 ];then
echo Error: Some hardlinks could not be created at target dir.
echo Check $EC2 files as listed in this file, $T8
fi
if [ $EC1 -eq 0 ] && [ $EC2 -eq 0 ];then
exit 0
else
[ $EC2 -ne 0 ] && exit 6
exit 5
fi
)
RV=$?
#
# We left subshell, now clean up. Save T7, T8 if necessary.
#
rm $T1 $T2 $T3 $T4 $T5 $T6 $T9
if [ $RV -eq 0 ];then
rm $T7 $T8
echo Ended deduplication and cleaned up. No Errors.
else
echo Exiting with error $RV
fi
exit $RV
> Douglas Mayne <no...@invalid.com> wrote:
>> I am probably asking for it, but here is a script submitted for your
>> comments/improvements.
>
>> # Purpose: Deduplicate files in directories A and B (with hard-links).
>> # Directory A is the potential source directory,
>> # Directory B is the potential dest directory.
>
> I see the three-line "Purpose" comment in the code, but I'm still not
> clear what the script is intended to do. If you want useful critique, I
> think this needs to be better explained.
>
The program does "post-process" file-level deduplication. The general idea
is discussed here:
<quoting wikipedia:>
Post-process deduplication
With post-process deduplication, new data is first stored on the storage
device and then a process at a later time will analyze the data looking
for duplication. The benefit is that there is no need to wait for the
hash calculations and lookup to be completed before storing the data
thereby ensuring that store performance is not degraded. Implementations
offering policy-based operation can give users the ability to defer
optimization on "active" files, or to process files based on type and
location. One potential drawback is that you may unnecessarily store
duplicate data for a short time which is an issue if the storage system
is near full capacity.
<end quote>
This program works by identifying potential matches and then verifying
whether each is an exact match. If so, it "deduplicates" by deleting the
file in directory B and hard-linking back to the identical file in directory A.
>
> * Is it to scan directory A and hardlink files? If so, what's B for?
> * Is it to scan A and copy/link files into B, hardlinking those targets
> when they're duplicates? If so, does that mean that A remains pristine
> (untouched)?
> * Why are these "potential" source/dest directories?
> * What's your definition of "duplicate" files? Same contents? Or also
> same timestamp, permissions, and owner/group?
>
Files are deemed equivalent if they have the same name (defined by relative
path name) and the same md5sum, as a check for binary equivalence.
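In script terms, the core per-file step is essentially this (a minimal
sketch; SRC and DST stand in for directories A and B, and both must be on
the same filesystem for the hard link to work):
# $rel is a relative path that exists in both trees
s=$(md5sum < "$SRC/$rel")
d=$(md5sum < "$DST/$rel")
if [ "$s" = "$d" ]; then
    # same name, same content: replace B's copy with a hard link to A's
    rm "$DST/$rel" && ln "$SRC/$rel" "$DST/$rel"
fi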
>
> Chris
>
Note: Comments inline.
Thanks for your reply. I have just posted a revised version. I am finding
that there are no simple, quick and dirty programs anymore ;)
The genesis of this program was trying to reduce the amount of storage
used when tracking the "current" branch of Slackware. The Slackware
mirrors only keep the "latest" version of the distribution, and earlier
versions are lost. That makes it difficult to roll back to a known
working state, etc. The main part of the distribution (the binaries) is
about 1G, and incremental changes are usually very small.
BTW, there are other techniques which may work better than the post
analysis method used in this program. But those other techniques
generally require doing things the right way from the outset. The technique in this
program allows for "after the fact" deduplication, as explained in the
wikipedia article.
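For example (host and paths are made up here), rsync's --link-dest can
build that kind of hardlinked point-in-time tree up front, so each sync
only stores files that actually changed:
# unchanged files become hard links into yesterday's snapshot
rsync -a --link-dest=/mirror/current-yesterday \
    rsync://mirror.example.com/slackware/slackware-current/ \
    /mirror/current-today/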
--
Douglas Mayne
Curiosity begs me to ask why the names have to match if the content is the same.
> Thanks for your reply. I have just posted a revised version. I am finding
> that there are no simple, quick and dirty programs anymore ;)
:-)
Chris
> Douglas Mayne <no...@invalid.com> wrote:
>> Files are deemed equivalent if they have the same name (defined by relative
>> path name) and the same md5sum, as a check for binary equivalence.
>
> Curiosity begs me to ask why the names have to match if the content is the
> same.
>
I assume there are implementations of deduplication which look for
matches based on the file hash value and its size. I assume they work by
sorting through a table with the (size,hash) parameters looking for
matches. It certainly could work that way, but my method opted to use the
relative pathname as the starting point, also requiring it to be the
same in both trees being compared. For me and my data sets, that was
a good simplifying assumption; it also kept the solution within reach of
a simple bash script.
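For what it is worth, a name-independent pass keyed on (size, md5sum) is
not much longer; here is a rough, untested sketch (bash 4 for the
associative array, GNU stat, and hard links again require a single
filesystem):
#!/bin/bash
# Keep the first file seen for each (size, md5sum) key; replace later
# files with the same key by hard links to that first one.
declare -A seen
while IFS= read -r -d '' f; do
    key="$(stat -c %s "$f")_$(md5sum < "$f")"
    if [ -n "${seen[$key]}" ]; then
        rm "$f" && ln "${seen[$key]}" "$f"
    else
        seen[$key]=$f
    fi
done < <(find "$1" -type f -print0)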
>
>
>> Thanks for your reply. I have just posted a revised version. I am
>> finding that there are no simple, quick and dirty programs anymore ;)
>
> :-)
>
> Chris
>
Note: Comment inline.
There is one basic fact that should be remembered about hardlinked files
vs. multiple independent copies. When deduplication is employed the user
should be careful because a change to /* any */ copy propagates to /* all */
of them. That may not be what you want! This deduplication technique works
best for data that will be treated as a "constant." The area where it has
the most advantage is tracking differential snapshots, such as the file
repository "point in time" snapshots that I mentioned.
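A quick way to see the effect on a freshly deduplicated pair:
ls -i a/file b/file     # same inode number, i.e. the same underlying data
echo change >> a/file   # the appended line shows up in b/file as well
# to edit one side independently, break the link first:
cp b/file b/file.tmp && mv b/file.tmp b/file   # b/file is its own inode again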
Also AIUI, there are deduplication methods which work at the "block level"
which do not have the above caveat/limitation. Those systems deduplicate
blocks on the fly, and will create independent copies using a copy-on-
write (COW) mechanism for any changed copy. For example, I think that is
a feature of the ZFS filesystem. For sure, it sounds like a complicated
mechanism to get right!
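If I have it right, on ZFS it boils down to a single pool/dataset property
("tank" being the usual placeholder pool name):
zfs set dedup=on tank   # enable block-level dedup for the pool/dataset
zpool list tank         # the DEDUP column reports the ratio achieved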
--
Douglas Mayne
> I am probably asking for it, but here is a script
> submitted for your comments/improvements.
> BTW, I have only done minimal testing with it;
> USE AT YOUR OWN RISK!
>
> The script is fairly sparse, but it does some minimal
> error checking and reporting.
> There is plenty of room for improvement!
My experience with deduplication.
The background. I made around 15,000 pictures/videos of my
daughter in five years. Yeah, digital camera (doesn't)
help. The problem is that while I had the original
(more or less intact) tree, I had multiple copies of it,
some with different directory layout and/or filenames,
some of them were not exact copies (missing dirs/files),
some were hardlinked, some had extra "processed" junk in
it (which I already forgot what it was), and even worse,
my computer had a (then) undetected memory error, and
some of the copies were not exact copies. All in all,
I had something like 10-15 trees and over 200 thousand
files. The whole mess was a couple of hundred gigabytes
(maybe half a TB). Unfortunately, I also didn't completely
trust the original tree as the full tree, so somehow I
had to dedup the whole mess.
Due to the size of this mess, a simple visual comparison or
something simple was simply out of the question. I could have
built a "database" of extracted data, and as I modified the
original trees, I would have updated the "database". But
this would have been extremely error-prone and slow.
Finally I solved the problem in this way: First I wrote
a script which extracted all the available info from the
files: dimensions, orientation, md5sum, picture/video
creation date, whether the file was damaged, etc. I put all this
extracted info into extended attributes on each file. My
script was smart enough to extract only the info which
was not already extracted, so if the script was killed,
I could restart it and it would only process the rest. This
script ran for around a day (really niced).
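The skeleton of that first script was roughly the following (user.md5 is
just an example attribute; the real script stored several more fields,
and it needs the attr tools plus a filesystem mounted with user_xattr):
#!/bin/bash
# Cache an md5 checksum in an extended attribute, skipping files that
# already have one, so an interrupted run can simply be restarted.
find "$1" -type f -print0 | while IFS= read -r -d '' f; do
    if getfattr -n user.md5 --only-values "$f" >/dev/null 2>&1; then
        continue                       # already processed in an earlier run
    fi
    sum=$(md5sum < "$f") || continue
    setfattr -n user.md5 -v "${sum%% *}" "$f"
done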
Once done, I had another script which either exported all
the data from the extended attributes to filenames or
removed them. So a file normally called hospital.jpeg became
hospital.jpeg__-date-2005_10_07_10_03_27-geometry-1600x1200-\
md5-c8f0e3a09f0069a8cfcf5594e469c299-orientation-portrait-\
error-no-..jpeg.
I ensured that each attribute name/value pair would
match this regexp: "-$name-[^-]*" as can be seen in
this example. If a file was removed, then all the
associated data disappeared. If a file was renamed/moved,
then all the associated data stayed with it. If I didn't
need an attribute anymore (errors for example), then I
just simply modified the script to ignore that attribute
from then on.
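The export step was basically a rename loop along these lines (the
attribute list is shortened and illustrative, not the exact one I used):
#!/bin/bash
# Fold stored extended attributes into the filename so that each
# name/value pair matches the "-$name-[^-]*" pattern described above.
f=$1
ext=${f##*.}
tags=""
for name in date geometry md5 orientation error; do
    val=$(getfattr -n "user.$name" --only-values "$f" 2>/dev/null) || continue
    tags="${tags}-${name}-${val}"
done
[ -n "$tags" ] && mv "$f" "${f}__${tags}-..${ext}"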
Then yet another script (all in all, I had something like
20 helper scripts) summarized each tree. This process
looked like "find $x -name '*_date_*' > $x.list. Then all
those other helper scripts simply did text comparisons on
the list files and spitted out the results, for example if
one tree was wholly included in another tree, or how much
was the difference among the trees, etc. I could even find
all the bad copies by relying on the same creation-date
and different md5sum. Yet another script told me how many
differences (and where) were among this miscopied files.
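The comparisons themselves were plain text-tool work on those list files;
for example, checking whether one tree is wholly contained in another is
roughly (a sketch, with $x and $y as above):
sed "s|^$x/||" "$x.list" | sort > "$x.sorted"
sed "s|^$y/||" "$y.list" | sort > "$y.sorted"
comm -23 "$x.sorted" "$y.sorted"   # no output: everything in $x is also in $y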
When I was satisfied with a result, I removed/moved/merged
some trees. Then I ran the summarizing script again and
did another testing.
I spent a few sparse weeks deduping the whole mess, and
now I am the proud owner of three identical trees on three
different disks.
Vilmos
Messy. I once wrote a script which computed the md5 hash
for all files in the given directories, and then, for
all hashes which occurred more than a single time, looked
at the files and hardlinked all of them to the one with
the oldest timestamp, which was presumed to be the original.
Since then I have implemented this algorithm a few times when needed
and the old script was not available, each time putting new
scripting skills to use.
This was my solution to preserve space across trees without changing
them.
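In its latest incarnation the core of that algorithm boiled down to
roughly this (an untested sketch from memory; GNU find, bash 4, one
filesystem, and it breaks on filenames containing newlines):
#!/bin/bash
# Hardlink every file whose content (md5) was already seen to the oldest
# copy of that content; the leading mtime only drives the sort order.
declare -A oldest
while read -r mtime hash path; do
    if [ -n "${oldest[$hash]}" ]; then
        ln -f "${oldest[$hash]}" "$path"   # replace newer copy with a link
    else
        oldest[$hash]=$path                # first hit is the oldest file
    fi
done < <(find "$@" -type f -printf '%T@ ' -exec md5sum {} \; | sort -n)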
To just remove files from trees B, C, D.. if they are in A, I today
have this simple helper script:
## START ###############################################################
#!/bin/sh
##
## return 0 if a given file $2 has its md5 hash listed in precomputed hashlist
## file $1, otherwise return 1
##
##
## this script's purpose is to aid in deduping stuff:
##
## to remove files from a hierarchy if their hashes are listed in another
## hierarchy's md5 hashlist file:
##
## find $DUPEDIRS -type f -exec $SCRIPT $HASHFILE \{\} \; -exec rm -v \{\} \;
##
## Author: Thomas Keusch <bsd...@gedankenverbrechen.org>
[ $# -ge 2 ] || { echo "${0##*/} MD5FILE FILE"; exit 1; }
MD5FILE=$1; shift;
FILE="$*"
[ -f "$MD5FILE" -a -f "$FILE" ] || exit 1
md5sum "$FILE" | cut -b-32 | grep -f- "$MD5FILE"
## END #################################################################
Regards
Thomas
> Messy. I once wrote a script which computed the md5 hash
> for all files in the given directories, and then, for
> all hashes which occurred more than a single time, look
> at the files and hardlink all these files to the one with
> the oldest timestamp, which was presumed to be the original.
>
Does the dedup functionality in ZFS save you some trouble? ....if, of
course, ZFS is an option.
Also, maybe useful: lessfs, sdfs, opendedup
[No personal experience with those....]
--
Rahul
Re-inventing the wheel. I remember when schweikh3 was a new IOCCC
winner. Current code is here:
http://www.schweikhardt.net/samefile/index.html
> My experience with deduplication.
>
> The background. I made around 15,000 pictures/videos of my
> daughter in five years. Yeah, digital camera (doesn't)
> help.
Piker. I have about 38k photos on flickr, and many more not uploaded.
> The problem is that while I had the original
> (more or less intact) tree, I had multiple copies of it,
> some with different directory layout and/or filenames,
> some of them were not exact copies (missing dirs/files),
> some were hardlinked, some had extra "processed" junk in
> it (which I already forgot what it was), and even worse,
> my computer had a (then) undetected memory error, and
> some of the copies were not exact copies. All in all,
> I had something like 10-15 trees and over 200 thousand
> files. The whole mess was a couple of hundred gigabytes
> (maybe half a TB). Unfortunately, I also didn't completely
> trust the original tree as the full tree, so somehow I
> had to dedup the whole mess.
I have been experimenting with the phash library (www.phash.org)
for "perception" hashes of photos. It seems very promising, showing
that resized and recompressed images are in fact very similar, but
it totally fails on rotate 90 degrees. It is giving me much better
results than say, pdiff (pdiff.sourceforge.net) or compare from
imagemagick. (To be fair, pdiff is intended for a different purpose,
namely automated testing of rendering software.)
The phash code is not polished, and comes with an outdated hash
tree that has now been split into a separate project, requiring
some fiddling to update.
Elijah
------
hates dealing with C++ but phash is in C++
> I have been experimenting with the phash library (www.phash.org)
> for "perception" hashes of photos. It seems very promising, showing
> that resized and recompressed images are in fact very similar, but
> it totally fails on rotate 90 degrees.
A good decade ago I wrote such a program in shell with ImageMagick.
The whole thing worked something like this:
* Remove around 15% frame from the image to remove any
possibly introduced titles, frames, etc.
* Resize the remaining image to something like 4x4.
* Reduce the colors to 2 (or 4?).
* Convert the remaining "image" to something data only (no headers).
IIRC, I converted whatever was left to xpm and removed anything not
representing a pixel.
* Do a (binary) compare on whatever was left.
Despite its primitiveness, I managed to find many duplicate
images. Of course, there were some spectacular false positives and
negatives, but I was surprised how well this solution still worked.
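Reconstructed from memory, the fingerprint step was something along these
lines (the exact numbers are guesses, and the ImageMagick options are only
a sketch):
#!/bin/bash
# Crude perceptual fingerprint: keep the central ~70% of the image,
# shrink to 4x4, flatten to 2 gray levels, hash the raw pixel data.
fingerprint() {
    convert "$1" -gravity center -crop 70%x70%+0+0 +repage \
        -colorspace Gray -resize '4x4!' -colors 2 -depth 8 gray:- 2>/dev/null |
        md5sum | cut -d' ' -f1
}
for img in "$@"; do
    printf '%s  %s\n' "$(fingerprint "$img")" "$img"
done | sort | uniq -w32 -D   # list groups of files sharing a fingerprint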
Vilmos