A CRC is a 4-byte mathematical checksum of a file's contents. If a pair of files have different CRCs, you can be sure their contents differ. If they have matching CRCs, it is likely (but not certain) that their contents match. If you happen to have CRCs already calculated for files, comparing them is quite fast. But here's the catch: Beyond Compare has to read every byte of each file to calculate the CRCs, so why not do the (more accurate) binary comparison instead? Binary comparisons can also be faster, because they stop reading each file as soon as a difference is found.
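To illustrate why a CRC offers no speed advantage over a binary comparison for local files, here is a minimal Python sketch (not Beyond Compare's code) of a streaming CRC: it must consume every byte before it can say anything.

```python
import zlib

def file_crc32(path, chunk_size=1 << 16):
    """Streaming CRC-32 of a file: every byte must be read, so this can
    never finish earlier than a byte-by-byte comparison that stops at
    the first difference."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc
```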
CRC comparisons are most useful when used in conjunction with the Snapshot feature. Snapshots can't hold the entire file, but they can hold CRCs along with other folder data. You can compare live data against a snapshot with a CRC check and be confident that if a file has been corrupted you'd find it.
The behavior of CRC comparisons depends on the FTP server you're connecting to. If the FTP server supports it, the server will generate the CRCs and only transfer the CRC values. If the server doesn't support it, Beyond Compare will have to transfer the entire file and calculate the CRC locally. If you see the command XCRC in the log, the CRC values are being generated by the server.
If you have 1,000,000 source files, you suspect they are all the same, and you want to compare them, what is currently the fastest method to compare those files? Assume they are Java files and the platform where the comparison is done is not important. cksum is making me cry. When I say identical I mean ALL identical.
Update: Don't get stuck on the fact they are source files. Pretend for example you took a million runs of a program with very regulated output. You want to prove all 1,000,000 versions of the output are the same.
I'd opt for something like the approach taken by the cmp program: open two files (say file 1 and file 2), read a block from each, and compare them byte-by-byte. If they match, read the next block from each, compare them byte-by-byte, etc. If you get to the end of both files without detecting any differences, seek to the beginning of file 1, close file 2 and open file 3 in its place, and repeat until you've checked all files. I don't think there's any way to avoid reading all bytes of all files if they are in fact all identical, but I think this approach is (or is close to) the fastest way to detect any difference that may exist.
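A minimal Python sketch of that cmp-style strategy, assuming the files are compared against file 1 in block-sized reads (the block size and function names here are illustrative):

```python
BLOCK = 1 << 16  # 64 KiB; any reasonable block size works

def all_identical(paths):
    """Compare file 1 against every other file, block by block,
    seeking back to the start of file 1 between comparisons."""
    first, *rest = paths
    with open(first, "rb") as ref:
        for other in rest:
            ref.seek(0)
            with open(other, "rb") as f:
                while True:
                    a, b = ref.read(BLOCK), f.read(BLOCK)
                    if a != b:
                        return False     # first difference found, stop early
                    if not a:            # both files ended together
                        break
    return True
```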
"another obvious optimization if the files are expected to be mostly identical, and if they're relatively small, is to keep one of the files entirely in memory. That cuts way down on thrashing trying to read two files at once."
Most people in their responses are ignoring the fact that the files must be compared repeatedly. Thus the checksums are faster as the checksum is calculated once and stored in memory (instead of reading the files sequentially n times).
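A hedged sketch of that caching idea (the cache and function names are illustrative): compute each file's digest once, keep it in memory, and later comparisons never re-read the file.

```python
import hashlib

_digest_cache = {}  # path -> digest, computed at most once per file

def cached_digest(path):
    """Hash the file on first use; subsequent comparisons reuse the stored value."""
    if path not in _digest_cache:
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 16), b""):
                h.update(chunk)
        _digest_cache[path] = h.digest()
    return _digest_cache[path]

def probably_identical(path_a, path_b):
    """Equal digests almost certainly mean equal contents."""
    return cached_digest(path_a) == cached_digest(path_b)
```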
Assuming the expectation is that the files will be the same (it sounds like that's the scenario), dealing with checksums/hashes is a waste of time: it's likely that they'll be the same, and you'd have to re-read the files to get the final proof anyway. (I'm also assuming that since you want to "prove ... they are the same", having them hash to the same value is not good enough.)
If that's the case, I think the solution proposed by David is pretty close to what you'd need to do. A couple of things that could be done to optimize the comparison, in increasing order of complexity:
If you have control over the output, have the program that creates the files/output compute an MD5 on the fly and embed it in the file or output stream, or even pipe the output through a program that computes the MD5 along the way and stores it alongside the data somehow. The point is to do the calculation while the bytes are already in memory (see the sketch after this list).
If you can't pull that off then, like others have said, check file sizes and then do a straight byte-by-byte comparison on same-sized files. I don't see how any sort of binary division or MD5 calculation is better than a straight comparison; you will have to touch every byte to prove equality any way you cut it, so you might as well cut the amount of computation needed per byte and gain the ability to stop as soon as you find a mismatch.
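A minimal sketch of the first idea, assuming you control the program that writes the output (the writer class and file names are illustrative, not an existing API):

```python
import hashlib

class HashingWriter:
    """Wraps an output stream and updates an MD5 digest as bytes are written,
    so the hash is computed while the data is already in memory."""
    def __init__(self, stream):
        self.stream = stream
        self.md5 = hashlib.md5()

    def write(self, data):
        self.md5.update(data)
        self.stream.write(data)

    def hexdigest(self):
        return self.md5.hexdigest()

# Usage: write the output through the wrapper, then store the digest
# alongside the data (e.g. in a sidecar .md5 file).
with open("run_output.bin", "wb") as raw:        # hypothetical file name
    out = HashingWriter(raw)
    out.write(b"very regulated output\n")
    digest = out.hexdigest()
with open("run_output.bin.md5", "w") as sidecar:
    sidecar.write(digest + "\n")
```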
First compare the file lengths of all million. If you have a cheap way to do so, start with the largest files. If they all pass that, then compare each file using a binary division pattern; this will fail faster on files that are similar but not the same. For more information on this method of comparison see the Knuth-Morris-Pratt method.
There are a number of programs that compare a set of files in general to find identical ones. FDUPES is a good one: Link. A million files shouldn't be a problem, depending on the exact nature of the input. I think FDUPES requires Linux, but there are other such programs for other platforms.
Anyway, the general idea is to start by checking the sizes of the files. Files that have different sizes can't be equal, so you only need to look at groups of files with the same size. Then it gets more complicated if you want optimal performance: if the files are likely to be different, you should compare small parts of the files, in the hope of finding differences early, so you don't have to read the rest of them. If the files are likely to be identical, though, it can be faster to read through each file to calculate a checksum, because then you can read sequentially from the disk instead of jumping back and forth between two or more files. (This assumes normal disks, so SSDs may be different.)
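A hedged sketch of that first step, grouping by size so that only same-sized files need any content comparison (function names are illustrative):

```python
import os
from collections import defaultdict

def group_by_size(paths):
    """Files with different sizes cannot be equal, so only files that
    share a size need any further comparison."""
    groups = defaultdict(list)
    for path in paths:
        groups[os.path.getsize(path)].append(path)
    # Keep only the size buckets that contain potential duplicates.
    return {size: files for size, files in groups.items() if len(files) > 1}
```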
In my benchmarks when trying to make a faster program, it (somewhat to my surprise) turned out to be faster to first read through each file to calculate a checksum, and then, if the checksums were equal, compare the files directly by reading blocks alternately from each file, than to just read blocks alternately without the previous checksum calculations! It turned out that when calculating the checksums, Linux cached both files in main memory, reading each file sequentially, and the second reads were then very fast. When starting with alternating reads, the files were not (physically) read sequentially.
Some people have expressed surprise and even doubt that it could be faster to read the files twice than to read them just once. Perhaps I didn't manage to explain very clearly what I was doing. I am talking about cache pre-loading, in order to have the files in the disk cache when later accessing them in a way that would be slow to do on the physical disk drive. Here is a web page where I have tried to explain it in more detail, with pictures, C code and measurements.
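A rough Python sketch of that pre-loading pattern, not the author's C code, assuming the files fit in the OS page cache (which is the whole point of the trick):

```python
import hashlib

def checksum_pass(path):
    """Sequential read of the whole file; as a side effect the OS page
    cache now holds its contents."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.digest()

def equal_with_preload(path_a, path_b):
    # Pass 1: sequential reads warm the cache and give a cheap verdict
    # whenever the checksums already differ.
    if checksum_pass(path_a) != checksum_pass(path_b):
        return False
    # Pass 2: alternating block reads, now served mostly from memory.
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a, b = fa.read(1 << 20), fb.read(1 << 20)
            if a != b:
                return False
            if not a:
                return True
```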
I don't think hashing is going to be faster than byte-by-byte comparisons. The byte-by-byte comparison can be optimized a bit by pipelining the reading and comparison of the bytes, and multiple sections of the file could be compared in parallel threads. It would go something like this:
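One possible shape of that idea, sketched in Python (the section size, thread count, and names are all illustrative, and real gains depend heavily on the storage and the interpreter's threading model):

```python
import os
from concurrent.futures import ThreadPoolExecutor

SECTION = 1 << 22  # 4 MiB sections, chosen arbitrarily

def section_equal(path_a, path_b, offset, length):
    """Compare one aligned section of both files."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        fa.seek(offset)
        fb.seek(offset)
        return fa.read(length) == fb.read(length)

def parallel_equal(path_a, path_b, workers=4):
    size = os.path.getsize(path_a)
    if size != os.path.getsize(path_b):
        return False                       # different sizes: cannot be equal
    offsets = range(0, size, SECTION)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda off: section_equal(path_a, path_b, off, SECTION), offsets)
        return all(results)
```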
Why reinvent the wheel? How about a third-party app? Granted, it doesn't have APIs, but I don't imagine you put yourself in this situation often. I like this app, doublekiller; just make a backup before you start. :) It's fast and free!
Use a for loop to check whether any of these sizes are the same. If two files are the same size, compare a byte of one file to a byte of the other file. If the two bytes are the same, move on to the next byte. If a difference is found, return that the files are different.
I have experimented with comparing MD5 hashes of files rather than going through byte for byte, and I have found that identical files are often missed with this method; however, it is significantly faster.
I would first create a database table with columns pathname and sha_1 of file_contents, hash all the files, and store the pathName and sha_1. Then, upon subsequently storing a file, sha_1 it and check whether that sha_1 already exists in the db. If it is in the db, output to a log that that file already existed, with its pathname, and do whatever with it, lol, e.g. create a symlink. If files are uploaded, implement this in your validation upon file upload.
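A minimal sketch of that approach, assuming SQLite and SHA-1 (the database file, table, and column names are illustrative):

```python
import hashlib
import sqlite3

db = sqlite3.connect("file_hashes.db")   # hypothetical database file
db.execute("CREATE TABLE IF NOT EXISTS files (pathname TEXT, sha_1 TEXT)")

def sha1_of(path):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def store_or_report(path):
    """Insert new files; log a hit when the same content was seen before."""
    digest = sha1_of(path)
    row = db.execute("SELECT pathname FROM files WHERE sha_1 = ?",
                     (digest,)).fetchone()
    if row:
        print(f"{path} matches existing file {row[0]}")  # e.g. symlink instead of storing
    else:
        db.execute("INSERT INTO files VALUES (?, ?)", (path, digest))
        db.commit()
```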
This CRC implementation is relatively standard. Depending on the file size as well, the hash function may be different. This implementation does NOT contain a lookup table, which you would want.
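For reference, here is a hedged sketch of a table-driven CRC-32 using the standard reflected polynomial 0xEDB88320; this is a generic illustration, not the implementation being discussed:

```python
# Precompute the 256-entry lookup table once.
def _make_crc32_table(poly=0xEDB88320):
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table

_CRC_TABLE = _make_crc32_table()

def crc32(data, crc=0):
    """Table-driven CRC-32: one table lookup per input byte."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = (crc >> 8) ^ _CRC_TABLE[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFFFFFF
```

For byte strings this produces the same values as zlib.crc32, so it can be cross-checked easily.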
This is the problem exactly. To get an accurate ETA or a full percent-complete figure, the initial startup time needed to calculate it would be immense, and restarting from a cancelled restore would waste that calculation. It would have to scan every file checksum locally, download all the remote chunks, and compare. That is a huge waste of time and resources, which IMHO is why it works the way it does, and the way similar software has worked for decades.
I was using rclone yesterday and the ETA said 2 minutes for nearly an hour, because it calculated the size of the remaining files but, with 60,000 of them, did not take into account the latency of starting all of those transfers.
Not entirely clear, but could you be getting at the fact that Duplicacy does incremental, delta-like backups, which means only the changed parts get uploaded? And that a restore might be doing the reverse and relying on a known state of that file?
I performed some restore tests and noted that the actual content of the folder I am restoring into is not deleted, so the files I restore are added to the files that are already in the destination folder.
A bit off topic, but the ability to mount snapshots is way overdue. I like the way Kopia implemented it (also a golang app, btw): you run kopia mount --browse and bam, Finder opens with a folder full of the snapshot, just like Time Machine presents its backups.
If the -ignore-owner option is specified, the restore procedure will not attempt to restore the original user/group id ownership on restored files (all restored files will be owned by the current user); this can be useful when restoring to a new or different machine.