calculate md5sums?

Janek Kozicki

Aug 2, 2007, 6:17:00 AM

to hardlink-py

Hi,

I'm using rsnapshot (rsnapshot.org) and I wanted to save a bit more
space with hardlinks, however what I already have is heavily
hardlinked. It occupies 270 GB on HDD (without the hardlinks it would
occupy about 2500 GB, because I have 10 copies of the whole directory
tree with slight changes between daily and weekly snapshots).

hardlink.py has a hard time with this because of the huge number of
files. It ran overnight comparing files and used lots of memory, and
finally I interrupted it.

I had the idea to speed up the comparison process, by first
calculating md5sums of ALL used inodes (not files!). Later the usual
loop starts and md5sums are used as a part of the hash, just like file
size, date or name.
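
A minimal sketch of the idea above, not code taken from hardlink.py: walk the tree, group paths by (st_dev, st_ino), and read each inode only once to get its MD5. The function names `md5_of_file` and `md5_per_inode` are made up for illustration, and MD5 is used only because the post suggests it.

```python
import hashlib
import os
import stat
from collections import defaultdict


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_per_inode(root):
    """Walk `root` and hash every distinct inode exactly once.

    Returns (digests, names): `digests` maps (st_dev, st_ino) to an MD5 hex
    string, `names` maps the same key to every path found for that inode.
    """
    digests = {}
    names = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            info = os.lstat(path)
            if not stat.S_ISREG(info.st_mode):
                continue  # skip symlinks, sockets, devices
            key = (info.st_dev, info.st_ino)
            names[key].append(path)
            if key not in digests:
                digests[key] = md5_of_file(path)  # first name seen for this inode
    return digests, names
```

The digest could then be added to the bucket key next to size, time and name, so that files with different digests are never compared. The cost is that every inode is read in full once, even when no other file has the same size.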

Janek Kozicki

Aug 2, 2007, 8:23:27 AM

to hardlink-py

David Cantrell said: (by the date of Thu, 2 Aug 2007 12:12:09 +0100)

> On Wed, Aug 01, 2007 at 06:56:37PM +0200, Janek Kozicki wrote:
>
> > this one looks promising, I'm testing it now:
> > http://hardlinkpy.googlecode.com/svn/trunk/hardlink.py
>
> Cool. Please let us know how well it works!

It worked fine on a small sample: about 5 GB (a few selected
directories).

Next I left it overnight to work on the whole /.snapshots/ (280 GB,
would be 2500 GB if not for the hardlinks ;) and in the morning it
hadn't even got past a single hourly.0 (it started with this one). It
filled about 600 MB of RAM with accumulated data about files, and I
interrupted it with ^C.

This program works in the following way (if I understand correctly,
I'm not good at python):

1 run a single recursive loop on all dirs/files
1.1 for each file store its size (and name, and time, and permissions,
depending on the options that you pass to it) in RAM
1.2 check in RAM if any other file scanned before has the same size
(and name and time and permissions)
1.3 if yes - compare the two files to check if they are identical,
and if so store them in RAM as candidates for hardlinking.

2 when the first loop is over: hardlink all files found in the first
loop. It's done in such a way that the *file* deleted is always the
one whose inode has the smaller reference count.
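
The first loop described above can be sketched roughly as follows. This is an illustration of the steps as listed in this post, not the actual hardlink.py code: it buckets by size only, and the name `find_link_candidates` is hypothetical. The linking pass (step 2) is left out.

```python
import filecmp
import os
import stat
from collections import defaultdict


def find_link_candidates(root):
    """First pass: return (original, duplicate) path pairs with equal content."""
    seen_by_size = defaultdict(list)  # size -> paths with distinct content seen so far
    candidates = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for filename in sorted(filenames):
            path = os.path.join(dirpath, filename)
            info = os.lstat(path)
            if not stat.S_ISREG(info.st_mode):
                continue  # only regular files can be hardlinked safely
            for earlier in seen_by_size[info.st_size]:
                if os.path.samestat(info, os.lstat(earlier)):
                    break  # already the same inode, nothing to gain
                if filecmp.cmp(earlier, path, shallow=False):
                    candidates.append((earlier, path))  # step 1.3
                    break
            else:
                # No earlier file of this size matched: remember this one (step 1.1)
                seen_by_size[info.st_size].append(path)
    return candidates
```

With many same-sized but different files, the inner loop does one full byte comparison per earlier file, which is where a stored checksum might help.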

It bothers me that this program does not calculate md5sums. In point
1.3 while performing binary comparison of two files it is possible to
simultaneously calculate the md5sum (because the files are being
read). Those md5sums could be later stored in memory to eliminate
unnecessary comparisons. Of course when we approach a new file we
must compare it with at least one candidate (based on size and name
and time), but afterwards we have its md5sum, because a single
comparison was done, so some of the later comparisons can be eliminated.
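
As a rough sketch of what computing the md5sum during the binary comparison could look like (the function name `compare_and_hash` is hypothetical, and this is not how hardlink.py is known to work):

```python
import hashlib


def compare_and_hash(path_a, path_b, chunk_size=1024 * 1024):
    """Compare two files byte for byte and hash both in the same read pass.

    Returns (identical, md5_a, md5_b). Both files are always read to the end,
    so the digests are complete even when the contents differ early.
    """
    hash_a, hash_b = hashlib.md5(), hashlib.md5()
    identical = True
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if not chunk_a and not chunk_b:
                break  # both files exhausted
            if chunk_a != chunk_b:
                identical = False  # keep reading so the digests stay valid
            hash_a.update(chunk_a)
            hash_b.update(chunk_b)
    return identical, hash_a.hexdigest(), hash_b.hexdigest()
```

Note the trade-off: a plain comparison can stop at the first differing block, while this version has to read both files completely to get usable digests.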

Also I'm not sure but I suspect that this script does not use the
fact that if some file is found identical to another file (both having
different inodes), then in fact ALL the files that use those two
inodes are identical. And hardlinking them should affect ALL files
using one of the inodes, so that the inode is effectively deleted
(instead of just decreasing the number of references to this inode
by 1).
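
To illustrate relinking ALL names of one inode at once, here is a small sketch. It assumes the caller already knows every path that uses the inode to be dropped (for example from a scan that groups paths by (st_dev, st_ino)); `merge_inodes` and the `.hardlink-tmp` suffix are invented for this example.

```python
import os


def merge_inodes(keep_path, drop_paths):
    """Point every path in `drop_paths` at the inode of `keep_path`.

    Each path is replaced via a temporary link plus rename, so a path never
    disappears if the link step fails. Once the last name of the old inode
    is replaced, the filesystem frees that inode.
    """
    for path in drop_paths:
        if os.path.samefile(keep_path, path):
            continue  # already a name of the inode we keep
        temp_path = path + ".hardlink-tmp"  # hypothetical temporary name
        os.link(keep_path, temp_path)
        os.replace(temp_path, path)  # replaces the old name atomically
```

All paths must be on the same filesystem, since hardlinks cannot cross filesystems.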

I'm not a python programmer, I planned to learn python, but not now.
So I'll not start hacking this script right now. Either I'll do it
later or one of you will :)

I googled a bit more, and couldn't find anything as good as this one.

--
Janek Kozicki

John Villalovos

Aug 2, 2007, 2:22:50 PM

to hardl...@googlegroups.com

On 8/2/07, Janek Kozicki <cos...@gmail.com> wrote:
> Next I left it overnight to work on a whole /.snapshots/ (280 GB -
> would be 2500 GB if not the hardlinks ;) and in the morning it didn't
> even went past a single hourly.0 (it started with this one). It filled
> about 600 MB of RAM with accumulated data about files, and I
> interrupted it with ^C.

The only thing I can think of is maybe you have circular references?
I don't think my script handles that at the moment.

I scan over 500GB without a problem. Though I probably have larger files.

> It bothers me that this program does not calculate md5sums. In point
> 1.3 while performing binary comparison of two files it is possible to
> simultaneously calculate the md5sum (because the files are being
> read). Those md5sums could be later stored in memory to eliminate
> unnecessary comparisons. Of course when we approach a new file we
> must compare it with at least one candidate (based on size and name
> and time), but later we have its md5sum, because single comparison
> was done, thus some of later comparisons can be eliminated.

I tried saving MD5SUMS but during my testing it did not improve performance.

> Also I'm not sure but I suspect that this script does not use the
> fact that if some file is found similar with other file (both having
> different inodes), the in fact ALL the files that use those two
> inodes are identical. And hardlinking them should affect ALL files
> using one of the inodes, so that the inode is effectively deleted
> (instead of just decreasing the number of references to this inode
> by 1).

That could be true. Basically it works on a first come, first link
basis, which is probably not the best solution.

John

Janek Kozicki

Aug 2, 2007, 3:25:51 PM

to hardlink-py

On Aug 2, 8:22 pm, "John Villalovos" <sodar...@gmail.com> wrote:
[..]

> > It filled about 600 MB of RAM with accumulated data about files, and I
> > interrupted it with ^C.
>
> The only thing I can think of is maybe you have circular references?
> I don't think my script handles that at the moment.
>
> I scan over 500GB without a problem. Though I have probably larger files.

This is a backup of three debian etch boxes: their /home /usr /etc /var
directories. Currently it occupies 282GB:

# rsnapshot du
221G /backup/.sync
460M /backup/hourly.0/
4.9G /backup/hourly.1/
9.1G /backup/hourly.2/
577M /backup/hourly.3/
4.2G /backup/hourly.4/
611M /backup/hourly.5/
25G /backup/daily.0/
13G /backup/daily.1/
3.3G /backup/daily.2/
625M /backup/monthly.0/
282G total

This count is calculated taking into account that there are hardlinks
inside. If it didn't - each directory would be 282 GB. The file count
in each snapshot easily reaches 500 000 files. So in fact I have about
5 000 000 files here in all directories. 282GB/500 000 gives on
average 600kB per file.
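
For illustration, counting each hardlinked file only once (the way du does within a single run) can be sketched like this. The function name is hypothetical, and it sums apparent sizes (st_size) rather than allocated blocks, so the numbers will not match du exactly.

```python
import os


def disk_usage_once_per_inode(roots):
    """Sum file sizes under `roots`, counting each inode a single time.

    Returns (total_bytes, inode_count, path_count).
    """
    seen = set()
    total_bytes = 0
    path_count = 0
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for filename in filenames:
                info = os.lstat(os.path.join(dirpath, filename))
                path_count += 1
                key = (info.st_dev, info.st_ino)
                if key not in seen:  # later names of the same inode add nothing
                    seen.add(key)
                    total_bytes += info.st_size
    return total_bytes, len(seen), path_count
```

Comparing `inode_count` with `path_count` gives a quick idea of how heavily a snapshot tree is hardlinked.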

Now. I'm explaining this, because I don't understand what you
mean by circular references. Rsnapshot takes backup copies every
four hours, the copies are rotated down - hourly.0 is moved to
hourly.1 and hourly.6 is deleted. Once per day hourly.6 is moved to
daily.0 instead of being deleted, and so on. So, the "real" files are
in /.sync/ all the others (hourly.N, daily.N, ..) have mainly files
hardlinked to /.sync/ - except for some files that were deleted
from /.sync/ during an rsync operation.

I don't know if this qualifies as circular references.

> I tried saving MD5SUMS but during my testing it did not improve performance.

Do you have a version of your script somewhere that uses
md5sums? In SVN history maybe? It would be nice if this was optional
(like the options you currently have to take file time and name
into account).

> > Also I'm not sure but I suspect that this script does not use the
> > fact that if some file is found similar with other file (both having
> > different inodes), the in fact ALL the files that use those two
> > inodes are identical. And hardlinking them should affect ALL files
> > using one of the inodes, so that the inode is effectively deleted
> > (instead of just decreasing the number of references to this inode
> > by 1).
>
> That could be true. Basically it works on a first come, first link
> basis, which is probably not the best solution.

Yes... especially in my scenario, where everything is so heavily
hardlinked together. Imagine that those files are the same:

/.sync/linuxbox1/etc/services
/.sync/linuxbox2/etc/services
/.sync/linuxbox3/etc/services

and your script will find this out. But then.. it will need to
rediscover this again for all the other snapshots:

/hourly.0/linuxbox1/etc/samba/services
/hourly.0/linuxbox2/etc/samba/services
/hourly.0/linuxbox3/etc/samba/services

/hourly.1/linuxbox1/etc/samba/services
/hourly.1/linuxbox2/etc/samba/services
/hourly.1/linuxbox3/etc/samba/services

...

/monthly.0/linuxbox1/etc/samba/services
/monthly.0/linuxbox2/etc/samba/services
/monthly.0/linuxbox3/etc/samba/services

While in fact ALL those files are using the same inode:

/.sync/linuxbox1/etc/services
/hourly.0/linuxbox1/etc/samba/services
/hourly.1/linuxbox1/etc/samba/services
...
/monthly.0/linuxbox1/etc/samba/services

So in fact I'm using here just three inodes (one inode for
each linuxboxN), and I save space only after ALL files are
rehardlinked, and two unnecessary inodes are deleted.

Well in this example the file "/etc/services" is small, but that's not
the case with the whole /usr directory or some other data, where it
grows to GBs of possibly saved space ;-)

Thanks for your script, I'll try to not interrupt it this time
and let it run a bit longer :)

Janek Kozicki

John Villalovos

Aug 2, 2007, 4:10:09 PM

to hardl...@googlegroups.com

On 8/2/07, Janek Kozicki <cos...@gmail.com> wrote:

> Thanks for your script, I'll try to not interrupt it this time
> and let it run a bit longer :)

Well I really don't think it should take as long as you're seeing.
Usually for me it finishes in a few minutes.

I will try to respond to the rest of your email later, when I have some time.

John

Janek Kozicki

Aug 3, 2007, 3:42:58 AM

to hardlink-py

Ah I understand what you mean by recursive symlinks - it could
be done with directories, like /usr/usr -> /usr, which allows an
infinitely deep path.

I'm not sure if I have any such links, I should check this...

However while hardlink.py is running I didn't notice a pathname
growing and growing. All the files being compared have
reasonable names and paths.
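
One possible way to check a backup tree for such links is sketched below: it lists directory symlinks that resolve to one of their own ancestors (like /usr/usr -> /usr). The name `find_symlink_loops` is made up for this example, and whether hardlink.py follows directory symlinks at all is not established in this thread.

```python
import os


def find_symlink_loops(root):
    """Return directory symlinks under `root` that point at one of their own ancestors."""
    loops = []
    # os.walk does not follow symlinks by default, so this walk itself cannot loop
    for dirpath, dirnames, _filenames in os.walk(root):
        real_parent = os.path.realpath(dirpath)
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            target = os.path.realpath(path)
            # target is an ancestor of (or equal to) the directory holding the link
            if os.path.commonpath([target, real_parent]) == target:
                loops.append(path)
    return loops
```

An empty result only rules out this one kind of loop; it says nothing about other reasons for a long run time.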

> Well I really don't think it should take as long as you're seeing.
> Usually for me it finishes in a few minutes.

First I ran it on all /backup/*/*/usr/ directories (the 1st * is a
snapshot name, 2nd * is a Nth linuxbox name). It finished
after 6 hours, and saved me 11 GB.

Then I ran it on all /backup/*/*/home and /backup/*/*/usr dirs.
It's still running (about 3 hours) and now it consumes 700 MB of ram.
I'm not interrupting though, we will see how it performs...

> I will try to respond to the rest of your email later, when I have some time.

don't worry I have a job to do also ;)

Janek Kozicki

Aug 4, 2007, 5:41:55 AM

to hardlink-py

(I've also sent this post to the rsnapshot-discuss mailing list)

Before running hardlink.py I had 282 GB used:

# rsnapshot du
221G /backup/.sync
460M /backup/hourly.0/
4.9G /backup/hourly.1/
9.1G /backup/hourly.2/
577M /backup/hourly.3/
4.2G /backup/hourly.4/
611M /backup/hourly.5/
25G /backup/daily.0/
13G /backup/daily.1/
3.3G /backup/daily.2/
625M /backup/monthly.0/
282G total

Then I ran hardlink.py twice. On the first run I gave all the
directories /backup/*/*/usr as arguments (I wanted hardlink to find out
that the boxes being backed up use debian etch - the same binaries).
The run took 6 hours and saved me 11 GB.

Next I ran hardlink.py on the whole /backup dir. The run took about one
day. It consumed 1 GB of RAM and 2 GB of swap. At the end it was going
really slow due to heavy HDD thrashing in the swap. It saved me about
6 GB. At the end of this run hardlink.py printed these statistics:

Directories : 1174891
Regular files : 14382099
Comparisons : 1956909
Hardlinked this run : 1055814
Total hardlinks : 13245974
Bytes saved this run : 112725367154 (104.984 gibibytes)
Total bytes saved : 2307874638453 (2149.376 gibibytes)
Total run time : 82048.8372171 seconds

And rsnapshot now says the following:

216G /backup/.sync
460M /backup/hourly.0/
549M /backup/hourly.1/
4.5G /backup/hourly.2/
5.8G /backup/hourly.3/
561M /backup/hourly.4/
1.1G /backup/hourly.5/
25G /backup/daily.0/
7.6G /backup/daily.1/
3.2G /backup/daily.2/
625M /backup/monthly.0/
264G total

So I saved about 18 GB in total.

Concluding: if you have about 3 GB of ram you can expect hardlink.py
to finish the job in about 10 hours (when you run it on whole
snapshot directory). This HDD here achieved about 70 MB/sec in hdparm.

Now I will try to run it only on /backup/.sync/ and /backup/hourly.0/
(before rotating the backups, after 'rsnapshot_sync' - calling
from "cmd_postexec"). I expect that it will be faster on just two dirs