I tried to write a duplicate file finder. To make this work, I created
an extended File::Stat, like this:
    require 'digest/sha1'

    class File::StatWithSha < File::Stat
      attr_reader :filename, :read

      def initialize fn
        @filename = File.expand_path fn
        @read = 0
        super fn
      end

      # Returns the SHA1 digest of the file, computed once and cached.
      def sha1sum
        return @sha1sum if @sha1sum ||= nil
        warn "Calculating sha1sum for #@filename"
        chunk = nil
        fs = 0
        d = Digest::SHA1.new
        File.open(filename) {|f|
          begin
            # sysread raises EOFError at end of file, which ends the loop.
            while chunk = f.sysread(1048576)
              fs += chunk.length
              d.update(chunk)
            end
          rescue EOFError
            warn "\nResult is #{d} #{fs} <=> #{self.size}"
            return @sha1sum = d
          rescue => e
            warn "Holy shit! #{e}"
          end
        }
        warn "Oh my god!"
        exit
      end

      def inspect; @filename; end
    end
Under Windows, it fails with both ruby 1.8.2 and ruby 1.8.4: the number
of bytes read does not match the file size.
irb(main):006:0> fws.sha1sum
Calculating sha1sum for F:/private/prg/ruby/g2.rb
Chunk is 2113
Result is c75de1a39ce389e7e198c97345ffad52b074e5e9 2113 <=> 2210
=> c75de1a39ce389e7e198c97345ffad52b074e5e9
Under Linux it works fine.
Anyway, how should I calculate the sha1sum of a BIG file, just using
ruby?
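You should probably open the files with "rb" instead of letting the
mode default to "r". On Windows, text mode translates line endings
(CRLF becomes LF), which would explain reading 2113 bytes from a
2210-byte file.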
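A minimal sketch of that suggestion, assuming the mismatch really comes
from text-mode translation. The helper name sha1_of_file and its
chunk_size parameter are made up for illustration and are not part of
the original class. It uses read rather than sysread, so the loop ends
on nil instead of on an exception.

```ruby
require 'digest/sha1'

# Hypothetical helper: SHA1 of a file, read in binary mode in chunks
# (1 MB by default), so a big file never has to fit in memory.
def sha1_of_file(path, chunk_size = 1048576)
  digest = Digest::SHA1.new
  # "rb" disables newline translation, so every byte on disk is hashed.
  File.open(path, "rb") do |f|
    # read(n) returns nil at end of file, which ends the loop.
    while chunk = f.read(chunk_size)
      digest.update(chunk)
    end
  end
  digest.hexdigest
end
```

Calling sha1_of_file("g2.rb") should then hash the same number of bytes
that File.size reports, on Windows as well as on Linux.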
> Anyway, how should I calculate the sha1sum of a BIG file, just using
> ruby?
>
For finding dups, I wonder if it's useful to compare checksums unless
you've already computed them in advance. I notice that Ruby's own
FileUtils.install checks filea == fileb by simply comparing the files
until it finds a difference or gets to EOF.
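For illustration, a direct comparison along those lines might look like
the sketch below. same_content? is a hypothetical helper, not the
FileUtils code itself; the standard library offers
FileUtils.compare_file(a, b) for the same job.

```ruby
# Hypothetical helper: true if both files have identical content.
# Stops at the first differing chunk instead of reading everything.
def same_content?(path_a, path_b, chunk_size = 65536)
  # Files of different size can never be identical.
  return false unless File.size(path_a) == File.size(path_b)
  File.open(path_a, "rb") do |a|
    File.open(path_b, "rb") do |b|
      loop do
        chunk_a = a.read(chunk_size)
        chunk_b = b.read(chunk_size)
        return false unless chunk_a == chunk_b
        return true if chunk_a.nil?   # both files reached EOF together
      end
    end
  end
end
```

For a single pair of files this reads each file at most once and can
stop early, which no checksum can do.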
It depends. If you want to find duplicates in a set of files, then using
the digest as a hash key can make finding duplicates much faster. OTOH,
if you can detect candidates by looking at other attributes (size,
mtime...), then the additional overhead of the checksum calculation
might slow things down. It depends, as always. :-)
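A rough sketch of the digest-as-hash-key idea. group_by_digest is a
hypothetical name, and Digest::SHA1.file is assumed to be available,
which holds for current Ruby versions but probably not for the 1.8.x
releases mentioned above.

```ruby
require 'digest/sha1'

# Groups paths by the SHA1 of their content. Every group that is
# returned holds two or more paths with the same digest.
def group_by_digest(paths)
  groups = Hash.new { |hash, key| hash[key] = [] }
  paths.each do |path|
    groups[Digest::SHA1.file(path).hexdigest] << path
  end
  # Keep only digests shared by at least two files.
  groups.reject { |_digest, group| group.size < 2 }
end
```

Each file is read exactly once, however many other files it has to be
checked against.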
Btw, I don't see a reason to use sysread in this scenario. read will do,
and it returns nil at end of file instead of raising EOFError.
Kind regards
robert
> For finding dups, I wonder if it's useful to compare checksums unless
> you've already computed them in advance. I notice that Ruby's own
> FileUtils.install checks filea == fileb by simply comparing the files
> until it finds a difference or gets to EOF.
Well, first I'd like to partition the files by size, and only then
compare them.
If more than two files have the same size, it's better to calculate the
sha1sum once for each file involved than to compare every pair. And, if
you'd like to be on the safe side, you can compare by content the files
that have the same sha1sum.
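The staged approach could be sketched roughly as follows.
find_duplicates is a hypothetical name, and the sketch assumes a modern
Ruby (Enumerable#group_by, Digest::SHA1.file), not the 1.8.x versions
discussed earlier in the thread.

```ruby
require 'digest/sha1'
require 'fileutils'

# Sketch: partition by size, then by SHA1, then confirm by content.
# Returns an array of arrays, each holding paths with identical content.
def find_duplicates(paths)
  duplicates = []
  paths.group_by { |path| File.size(path) }.each_value do |same_size|
    next if same_size.size < 2            # unique size, cannot be a dupe
    if same_size.size == 2
      # Only one pair: a direct comparison is cheaper than two digests.
      duplicates << same_size if FileUtils.compare_file(*same_size)
      next
    end
    by_sha = same_size.group_by { |path| Digest::SHA1.file(path).hexdigest }
    by_sha.each_value do |same_sha|
      next if same_sha.size < 2
      # Safe side: confirm each candidate against the first file.
      first, *rest = same_sha
      confirmed = rest.select { |path| FileUtils.compare_file(first, path) }
      duplicates << [first, *confirmed] unless confirmed.empty?
    end
  end
  duplicates
end
```

Confirming only against the first file of each group keeps the sketch
short. A stricter version would regroup any files that fail that check.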
You can also improve on this by caching the sha1sums (say, in a file in
every directory).
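One way such a cache might look, purely as a sketch. The helper
cached_sha1, the file name .sha1cache.json and the JSON layout are all
assumptions made for illustration. An entry is trusted only while the
file's size and mtime still match what was recorded.

```ruby
require 'digest/sha1'
require 'json'

# Returns the SHA1 of path, reusing the value stored in a per-directory
# cache file when the file's size and mtime are unchanged.
# The cache file name is a made-up default.
def cached_sha1(path, cache_name = ".sha1cache.json")
  dir, name = File.split(File.expand_path(path))
  cache_path = File.join(dir, cache_name)
  cache = File.exist?(cache_path) ? JSON.parse(File.read(cache_path)) : {}

  stat = File.stat(path)
  entry = cache[name]
  if entry && entry["size"] == stat.size && entry["mtime"] == stat.mtime.to_i
    return entry["sha1"]                  # cache hit, file not read
  end

  # Cache miss or stale entry: recompute and store the new value.
  sha1 = Digest::SHA1.file(path).hexdigest
  cache[name] = { "size" => stat.size, "mtime" => stat.mtime.to_i, "sha1" => sha1 }
  File.open(cache_path, "w") { |f| f.write(JSON.generate(cache)) }
  sha1
end
```

Since mtime is stored with one-second resolution, a same-size change
made within the same second would go unnoticed, so this trades a little
safety for speed.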